In this section, we look at a simple method of data preprocessing, naive binarization, and examine its
implications and utility when applied to the Income dataset. Our dataset contains both numerical
and categorical data. For the purpose of this exploration, we'll treat the numerical fields (age and hours-per-week)¹
exactly like the categorical ones: every distinct value becomes its own binary feature. This means that age=37 is
treated just like sector=Private.
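To make the idea concrete, here is a minimal sketch of naive binarization on a tiny hand-made table. The rows are made up for illustration and are not taken from the Income dataset. The sketch uses pandas' get_dummies as a convenient stand-in, which is an assumption on our part and not necessarily the method this section builds toward.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    "age": [37, 45, 37],
    "sector": ["Private", "Federal-gov", "State-gov"],
})

# Naive binarization: every distinct (field, value) pair becomes its own
# 0/1 column. Listing "age" in `columns` forces the numerical field to be
# treated as categorical, just like "sector".
binarized = pd.get_dummies(toy, columns=["age", "sector"], prefix_sep="=", dtype=int)

print(binarized.columns.tolist())
print(binarized.to_numpy().tolist())
```

Each row ends up with exactly one 1 among the age columns and one 1 among the sector columns. Note that age=37 and age=45 become unrelated features: the encoding discards the fact that ages are ordered.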
1. Pandas and Data Loading. Before we proceed with feature extraction, let's see how to load our
dataset using the pandas library. The read_csv function does this; here we load
the toy dataset toy.txt (watch video 2):
import pandas as pd
data = pd.read_csv("toy.txt", sep=", ", names=["age", "sector"])  # load the toy dataset
Here's a breakdown of the parameters:
¹In principle, we could also convert education to a numerical feature, but we choose not to do it to keep it simple.