2. Binarization in Pandas. Let's now introduce the concept of one-hot encoding with a practical example from
our toy dataset. One-hot encoding is a technique for converting categorical data into a format that can be fed
into machine learning algorithms. Essentially, for each unique value of a categorical field, a binary feature is created.
In pandas, one-hot encoding can be done with pandas.get_dummies():
encoded_data = pd.get_dummies(data, columns=["age", "sector"])
print(encoded_data)
This produces:
[Output table: one row per individual, with 0/1 values in the columns age_19, age_30, age_33, age_35, age_37, age_40, age_45, age_47, age_54, sector_Federal-gov, sector_Local-gov, sector_Private. The individual cell values are not recoverable here.]
This output shows the one-hot encoded version of our data. Each unique age and sector has been turned
into a binary feature occupying its own column. In the resulting DataFrame, if an individual's age is 45, then
the age_45 feature will be on (1), while all other age features will be off (0). The same holds for the sector features.
Notice how these binary features are named: the name starts with the original field's name (e.g.,
age or sector), followed by an underscore (_) and then the unique value (e.g., 19 or Federal-gov). This
naming rule is necessary. You might wonder: why not just 19 instead of age_19? Because there is also hours_19, and the prefix keeps the two features distinct.
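The naming convention can be checked on a small example. The sketch below uses a hypothetical three-row DataFrame named toy (its values are made up for illustration and are not the assignment's dataset) with an age, an hours, and a sector field. The dtype=int argument is only there so the output prints as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative, not the assignment's data.
toy = pd.DataFrame({
    "age": [19, 45, 19],
    "hours": [19, 40, 45],
    "sector": ["Private", "Federal-gov", "Private"],
})

# One binary column per unique value, named <field>_<value>.
encoded = pd.get_dummies(toy, columns=["age", "hours", "sector"], dtype=int)

print(encoded.columns.tolist())
# The value 19 occurs in both age and hours; the prefix keeps them apart.
print(encoded[["age_19", "hours_19"]])
```

Without the field-name prefix, the value 19 from age and the value 19 from hours would map to the same column name, and the two features could no longer be told apart.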
Question: Although pandas.get_dummies() is very handy for one-hot encoding, it is not suitable for use
in a machine learning pipeline as is. Why? (0.5 pts) (Hint: It's important to think about the entire pipeline. When
working with training, dev, and test sets, we need to ensure consistent representation across all of them.)
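One way to explore the hint is to encode two splits separately and compare the resulting columns. The sketch below assumes two hypothetical splits named train and test, with made-up values that are not taken from the assignment's data. It only shows what get_dummies does when called on each split independently.

```python
import pandas as pd

# Hypothetical splits; the values are illustrative only.
train = pd.DataFrame({
    "age": [19, 30, 45],
    "sector": ["Private", "Local-gov", "Private"],
})
test = pd.DataFrame({
    "age": [30, 54],
    "sector": ["Federal-gov", "Private"],
})

# Each call only sees the values present in the frame it is given.
train_enc = pd.get_dummies(train, columns=["age", "sector"], dtype=int)
test_enc = pd.get_dummies(test, columns=["age", "sector"], dtype=int)

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())

# Columns that appear in one split but not the other.
mismatch = sorted(set(train_enc.columns) ^ set(test_enc.columns))
print(mismatch)
```

Comparing the two column lists is a useful starting point for thinking about what a model trained on train_enc would receive at test time.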