3. Binarization in Scikit-Learn. A better tool is the OneHotEncoder class from the sklearn.preprocessing
library. It ensures a consistent representation across the training, dev, and test sets because you "fit" the encoder
on the training set once and then apply it to all three sets. In other words, it is "stateful", as opposed to the
"stateless" pandas.get_dummies(). It is slightly more complex than pandas.get_dummies() but is worth it:
• Import the necessary class:
from sklearn.preprocessing import OneHotEncoder
• Instantiate the encoder:
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
Here handle_unknown='ignore' is important: without it, the encoder throws an error when the dev/test data
contain a category value that did not appear in the training set (e.g., a new country). See also Q2. (Note: in
scikit-learn >= 1.2 the sparse argument is renamed sparse_output.)
• Fit the encoder to the training data, and transform the training data:
encoder.fit(data)                      # you only fit the encoder once (on training)
binary_data = encoder.transform(data)  # but use it to transform training, dev, and test sets
In practice, you can combine these two lines into one call, encoder.fit_transform(data), but we separate them
here for clarity. The output, binary_data, looks very similar to the binarized data from pandas:
[[0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]
 ...]
You can check the name of each column with encoder.get_feature_names_out(), which gives us:
['age_19' 'age_30' 'age_33' 'age_35' 'age_37' 'age_40' 'age_45' 'age_47' 'age_54'
'sector_Federal-gov' 'sector_Local-gov' 'sector_Private']
Question: After applying the naive binarization to the real training set (NOT the toy set), what is the
feature dimension? Does it match the result from Part 1 Q5? (0.25 pts) Note: the target is not a feature!
4. Fit k-NN via Scikit-Learn. With the dataset now binarized using the OneHotEncoder, we can apply the
k-nearest neighbors (k-NN) algorithm to the transformed data. k-NN classifies a data point by the labels of its
nearest neighbors in the training set. The number of neighbors, denoted as k, is a hyperparameter that can
greatly affect the performance of the k-NN algorithm.
To begin, use the KNeighborsClassifier from sklearn.neighbors. Given the binary representation of the
dataset, you can follow these general steps:
• Training: Instantiate a KNeighborsClassifier with a chosen k and fit it on the binarized training data.
• Prediction: Predict the labels for both the training set (to get training accuracy) and the dev set.
• Evaluation: Calculate and compare the accuracy scores for the predictions on the training and dev sets.
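These steps can be sketched as follows. The random binary matrices below are placeholders for your own OneHotEncoder output, and the labels are synthetic, so the printed accuracies are meaningless here; the shape of the workflow is the point:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# placeholder binary features/labels; substitute your binarized splits
X_train = rng.integers(0, 2, size=(100, 12)).astype(float)
y_train = rng.integers(0, 2, size=100)
X_dev   = rng.integers(0, 2, size=(30, 12)).astype(float)
y_dev   = rng.integers(0, 2, size=30)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)   # "training" for k-NN just stores the data

# Prediction: on both the training set and the dev set
train_acc = accuracy_score(y_train, knn.predict(X_train))
dev_acc   = accuracy_score(y_dev, knn.predict(X_dev))

# Evaluation: compare the two accuracies
print(f"train_acc={train_acc:.3f} dev_acc={dev_acc:.3f}")
```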
(a) Evaluate k-NN on both the training and dev sets, and report the error rates and predicted positive rates
for k from 1 to 100 (odd numbers only, for tie-breaking), e.g., something like:
k=1 train_err xx.x% (+:xx.x%) dev_err xx.x% (+:xx.x%)
Q: What's your best error rate on dev, and which k achieves it? (Hint: the 1-NN dev error should
be ~23% and the best dev error should be ~16%.) (1 pt)
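A loop matching the requested report format might look like the following sketch. The random data is again a stand-in for your binarized splits, so the numbers it prints are not the assignment's answers:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
# placeholder data; replace with your binarized train/dev splits and labels
X_train = rng.integers(0, 2, size=(200, 12)).astype(float)
y_train = rng.integers(0, 2, size=200)
X_dev   = rng.integers(0, 2, size=(50, 12)).astype(float)
y_dev   = rng.integers(0, 2, size=50)

for k in range(1, 100, 2):  # odd k only, to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results = {}
    for name, X, y in [("train", X_train, y_train), ("dev", X_dev, y_dev)]:
        pred = knn.predict(X)
        err = 100 * (pred != y).mean()   # error rate, in percent
        pos = 100 * (pred == 1).mean()   # predicted positive rate, in percent
        results[name] = (err, pos)
    (tr_err, tr_pos), (dv_err, dv_pos) = results["train"], results["dev"]
    print(f"k={k:<3d} train_err {tr_err:4.1f}% (+:{tr_pos:4.1f}%) "
          f"dev_err {dv_err:4.1f}% (+:{dv_pos:4.1f}%)")
```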
(b) Q: When k = 1, is training error 0%? Why or why not? Look at the training data to confirm your answer.
(c) Q: What trends (train and dev error rates and positive ratios, and running speed) do you observe with
increasing k? (0.75 pts)
(d) Q: What does k = ∞ actually do? Is it extreme overfitting or underfitting? What about k = 1? (0.5 pts)