
3. Binarization in Scikit-Learn. A better tool is the OneHotEncoder class from the sklearn.preprocessing library. It ensures consistent representation across training, dev, and test sets because it can "fit" an encoder to the training set, which is then applied to all datasets. In other words, it's "stateful", as opposed to the "stateless" pandas.get_dummies(). It's slightly more complex than pandas.get_dummies() but is worth it.

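To see the "stateless" problem concretely, here is a minimal sketch; the toy frames and the sector values in them are made up for illustration:

import pandas as pd

# hypothetical toy frames: dev contains a sector value ("State-gov")
# that never appears in training
train = pd.DataFrame({"sector": ["Private", "Federal-gov", "Private"]})
dev = pd.DataFrame({"sector": ["State-gov", "Private"]})

# get_dummies is stateless: each call sees only its own input, so the
# two outputs have different, incompatible column sets
print(pd.get_dummies(train).columns.tolist())  # ['sector_Federal-gov', 'sector_Private']
print(pd.get_dummies(dev).columns.tolist())    # ['sector_Private', 'sector_State-gov']

With that problem in mind, the steps for OneHotEncoder are: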
• Import the necessary libraries:

from sklearn.preprocessing import OneHotEncoder

• Instantiate the encoder:

encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

Here handle_unknown='ignore' is important: otherwise, when the encoder is applied to the unseen dev/test data, it will throw an error on feature values it has not seen in the training set (e.g., a new country); the sketch below demonstrates this. See also Q2.

• Fit the encoder to the training data, and transform the training data:

encoder.fit(data)                      # fit the encoder only once (on the training set)
binary_data = encoder.transform(data)  # but use it to transform training, dev, and test sets

In practice, you can combine these two lines into a single call, encoder.fit_transform(data), but we separate them here for clarity. The output, binary_data, looks very similar to the binarized data from pandas:

[[0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0.]]

You can check the name of each column with encoder.get_feature_names_out(), which gives us:

['age_19' 'age_30' 'age_33' 'age_35' 'age_37' 'age_40' 'age_45' 'age_47' 'age_54'

'sector_Federal-gov' 'sector_Local-gov' 'sector_Private']
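Putting the steps together, here is a minimal end-to-end sketch of the stateful workflow (the toy frames below are made up; also note that scikit-learn 1.2 and later renames the sparse= argument to sparse_output=):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy frames; the dev row (50, "Local-gov") contains
# values never seen in training
train = pd.DataFrame({"age": [19, 33, 45],
                      "sector": ["Private", "Private", "Federal-gov"]})
dev = pd.DataFrame({"age": [33, 50],
                    "sector": ["Private", "Local-gov"]})

encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(train)                   # fit once, on training data only
binary_dev = encoder.transform(dev)  # reuse the training columns on dev

print(encoder.get_feature_names_out())
# ['age_19' 'age_33' 'age_45' 'sector_Federal-gov' 'sector_Private']
print(binary_dev)
# [[0. 1. 0. 0. 1.]
#  [0. 0. 0. 0. 0.]]   <- unseen values are silently ignored (all zeros)

Because the encoder was fit only on train, the dev set is mapped into exactly the same feature space, and unseen values do not crash the transform.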

Question: After applying the naive binarization to the real training set (NOT the toy set), what is the feature dimension? Does it match the result from Part 1 Q5? (0.25 pts) Note: the target is not a feature!

4. Fit k-NN via Scikit-Learn. With the dataset now binarized using the OneHotEncoder, a natural next step is to apply the k-nearest neighbors (k-NN) algorithm to this transformed data. The k-NN algorithm classifies a data point based on how its neighbors are classified. The number of neighbors, denoted k, is a hyperparameter that can greatly affect the performance of the k-NN algorithm.

To begin with, use the KNeighborsClassifier from sklearn.neighbors. Given the binary representation of the dataset, you can follow these general steps (a sketch follows the list):

• Training: Fit a KNeighborsClassifier with a chosen k on the binarized training set.

• Prediction: Predict the labels for both the training set (to get training accuracy) and the dev set.

• Evaluation: Calculate and compare the accuracy scores for the predictions on the training and dev sets.
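Here is a minimal sketch of these steps, assuming the binarized feature matrices and 0/1 label vectors (positive class = 1) are available as train_X, train_y, dev_X, and dev_y; these names are placeholders, not part of the assignment:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate_knn(train_X, train_y, dev_X, dev_y):
    # sweep odd k only (to avoid ties) and report error rates and
    # predicted-positive rates on both train and dev
    for k in range(1, 100, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(train_X, train_y)    # "training" k-NN just stores the data

        train_pred = knn.predict(train_X)
        dev_pred = knn.predict(dev_X)

        train_err = 100 * np.mean(train_pred != train_y)   # error rate, %
        dev_err = 100 * np.mean(dev_pred != dev_y)
        train_pos = 100 * np.mean(train_pred == 1)         # predicted positive rate, %
        dev_pos = 100 * np.mean(dev_pred == 1)

        print(f"k={k:<3d} train_err {train_err:.1f}% (+:{train_pos:.1f}%) "
              f"dev_err {dev_err:.1f}% (+:{dev_pos:.1f}%)")

Calling evaluate_knn with the encoder's output for the training and dev sets produces output in the format requested in question (a) below.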

Questions:

(a) Evaluate k-NN on both the training and dev sets and report the error rates and predicted positive rates

for k from 1 to 100 (odd numbers only, for tie-breaking), e.g., something like:

k=1  train_err xx.x% (+:xx.x%) dev_err xx.x% (+:xx.x%)
k=3  train_err xx.x% (+:xx.x%) dev_err xx.x% (+:xx.x%)
...
k=99 train_err xx.x% (+:xx.x%) dev_err xx.x% (+:xx.x%)

Q: What's your best error rate on dev? Which k achieves this best error rate? (Hint: the 1-NN dev error should be ~23% and the best dev error should be ~16%.) (1 pt)

(b) Q: When k = 1, is training error 0%? Why or why not? Look at the training data to confirm your answer.

(0.5 pts)

(c) Q: What trends (train and dev error rates and positive ratios, and running speed) do you observe with

increasing k? (0.75 pts)

(d) Q: What does k = ∞ actually do? Is it extreme overfitting or underfitting? What about k = 1? (0.5 pts)
