3. Binarization in Scikit-Learn. A better tool is the OneHotEncoder class from sklearn.preprocessing. It ensures a consistent representation across the training, dev, and test sets because you "fit" the encoder on the training set once and then apply it to all three sets. In other words, it is "stateful", as opposed to the "stateless" pandas.get_dummies(). It is slightly more complex than pandas.get_dummies() but worth it:
• Import the necessary class: from sklearn.preprocessing import OneHotEncoder
• Instantiate the encoder: encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
  (In scikit-learn 1.2+, the sparse argument is renamed sparse_output.) Here handle_unknown='ignore' is important; otherwise, when applied to the unseen dev/test data, the encoder will throw an error on feature values it did not see in the training set (e.g., a new country). See also Q2.
• Fit the encoder to the training data, then transform the training data:
  encoder.fit(data)                      # you only fit the encoder once (on training)
  binary_data = encoder.transform(data)  # but use it to transform training, dev, and test sets
In practice you can combine these two lines into one call, encoder.fit_transform(data), but we separate them here for clarity. The output, binary_data, looks very similar to the binarized data from pandas:
[[0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 ...
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]
 ...]
You can check the name of each column with encoder.get_feature_names_out(), which gives us:
['age_19' 'age_30' 'age_33' 'age_35' 'age_37' 'age_40' 'age_45' 'age_47' 'age_54' 'sector_Federal-gov' 'sector_Local-gov' 'sector_Private']
Question: After applying the naive binarization to the real training set (NOT the toy set), what is the feature dimension?
Does it match the result from Part 1 Q5? (0.25 pts) Note: the target is not a feature!
4. Fit k-NN via Scikit-Learn. With the dataset now binarized using the OneHotEncoder, an intriguing exploration is to employ the k-nearest neighbors (k-NN) algorithm on this transformed data. The k-NN algorithm classifies a data point based on how its neighbors are classified. The number of neighbors, denoted k, is a parameter that can greatly affect the performance of the algorithm. To begin, use the KNeighborsClassifier from sklearn.neighbors. Given the binary representation of the dataset, you can follow these general steps:
• Training: Fit a KNeighborsClassifier on the binarized training set.
• Prediction: Predict the labels for both the training set (to get training accuracy) and the dev set.
• Evaluation: Calculate and compare the accuracy scores for the predictions on the training and dev sets.
Questions:
(a) Evaluate k-NN on both the training and dev sets and report the error rates and predicted positive rates for k from 1 to 100 (odd numbers only, for tie-breaking), e.g., something like:
k=1   train_err xx.x% (+:xx.x%)   dev_err xx.x% (+:xx.x%)
k=3   ...
...
k=99  ...
Q: What is your best error rate on dev? Which k achieves this best error rate? (Hint: the 1-NN dev error should be ~23% and the best dev error should be ~16%.) (1 pt)
(b) Q: When k = 1, is the training error 0%? Why or why not? Look at the training data to confirm your answer. (0.5 pts)
(c) Q: What trends (train and dev error rates, positive ratios, and running speed) do you observe with increasing k? (0.75 pts)
(d) Q: What does k = ∞ actually do? Is it extreme overfitting or underfitting? What about k = 1? (0.5 pts)
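The evaluation loop for question (a) can be sketched as below. The random binary matrices are hypothetical stand-ins for the binarized training and dev sets (in the real assignment you would use the OneHotEncoder output and the true labels), and the loop body shows one way to format the requested error and positive-rate report:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data: random 12-dimensional binary feature vectors
# with random binary labels (replace with the binarized real dataset).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 12)).astype(float)
y_train = rng.integers(0, 2, size=200)
X_dev = rng.integers(0, 2, size=(50, 12)).astype(float)
y_dev = rng.integers(0, 2, size=50)

for k in range(1, 100, 2):  # odd k only, for tie-breaking
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_pred = knn.predict(X_train)
    dev_pred = knn.predict(X_dev)
    train_err = 100 * (train_pred != y_train).mean()
    dev_err = 100 * (dev_pred != y_dev).mean()
    print(f"k={k:<3d} train_err {train_err:5.1f}% (+:{100 * train_pred.mean():5.1f}%) "
          f"dev_err {dev_err:5.1f}% (+:{100 * dev_pred.mean():5.1f}%)")
```

Incidentally, even this toy run can illustrate question (b): with only 2^12 possible binary vectors, duplicate training points with conflicting labels can occur, in which case the 1-NN training error need not be 0%.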

Fig: 1

Fig: 2

Fig: 3