Data Preprocessing and Feature Extraction I: Naive Binarization (3.5 pts)

In this section, we'll delve into a simple method of data preprocessing, naive binarization, and examine its implications and utility when applied to the Income dataset. Our dataset contains both numerical and categorical data. For the purpose of this exploration, we'll treat the numerical fields (age and hours-per-week) exactly like the categorical ones. This means that age=37 will be treated the same way as sector=Private: as a discrete feature value rather than a number.
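As a rough illustration of what treating age=37 like sector=Private amounts to, the sketch below one-hot encodes both fields with pandas. The rows are made up for illustration, and pd.get_dummies is only one possible way to do this, not necessarily the approach the assignment intends.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative, not taken from the Income dataset.
toy = pd.DataFrame({"age": [37, 45, 37], "sector": ["Private", "Federal", "Private"]})

# Listing "age" in `columns` makes pandas encode it like a categorical field,
# so each distinct age gets its own 0/1 column, just as each sector does.
binarized = pd.get_dummies(toy, columns=["age", "sector"], dtype=int)

print(binarized.columns.tolist())
print(binarized)
```

Note that under this encoding ages 37 and 38 are as unrelated as Private and Federal; the ordering of the numbers is discarded.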

1. Pandas and Data Loading. Before we proceed with feature extraction, let's understand how to load our dataset using the pandas library. The read_csv function facilitates this, and here we showcase loading from the toy dataset toy.txt (watch video 2):

import pandas as pd

data = pd.read_csv("toy.txt", sep=", ", names=["age", "sector"], engine="python")  # load the toy dataset; a multi-character sep needs the python engine
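To try the same call without toy.txt on disk, the sketch below reads from an in-memory string instead. It assumes the file has no header row and separates fields with a comma followed by a space; the two rows shown are hypothetical stand-ins, not the real file contents.

```python
import io

import pandas as pd

# Hypothetical stand-in for toy.txt; the real file contents may differ.
toy_text = "37, Private\n45, Federal\n"

# sep=", " is longer than one character, so pandas treats it as a regular
# expression; engine="python" selects the parser that supports this.
# names=... supplies column labels because the data has no header row.
data = pd.read_csv(
    io.StringIO(toy_text),
    sep=", ",
    names=["age", "sector"],
    engine="python",
)

print(data)
print(data.dtypes)
```

With this separator the sector values come back without a leading space, which matters later when values such as "Private" are compared or turned into features.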

Here's a breakdown of the parameters:

¹In principle, we could also convert education to a numerical feature, but we choose not to do it to keep it simple.
