#### Machine Learning

1. Give an example of a low-dimensional (approx. 20 dimensions), a medium-dimensional (approx. 1,000 dimensions), and a high-dimensional (approx. 100,000 dimensions) problem that you care about.

Decision Trees:

1. Consider the problem from the previous assignments where we want to predict gender from information about height, weight, and age. We will use decision trees to make this prediction. Because the data attributes are continuous numbers, each node has to test an attribute against a threshold (attribute ≥ threshold), and you have to determine that threshold for each node in the tree. As a result, you need to compute the information gain for each candidate threshold that lies halfway between two data points, so the cost of the computation increases with the number of data items.

   a) Implement a decision tree learner for this particular problem that can derive decision trees with an arbitrary, predetermined depth (up to the maximum depth where all data subsets at the leaves are pure) using the information gain criterion.

   b) Divide the data set from Question 1c) in Project 1 (the large training data set) into a training set comprising the first 50 data points and a test set consisting of the last 70 data points. Use the resulting training set to derive trees of depths 1-5, and evaluate the accuracy of each tree on the 50 training samples and on the 70 test samples. Compare the classification accuracy on the test set with the accuracy on the training set for each tree depth. For which depths does the result indicate overfitting?
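The threshold search described above can be sketched as follows. This is a minimal illustration, not a reference solution: the function names (`entropy`, `best_split`, `build_tree`, `predict`), the tuple-based tree representation, and the four toy samples are all assumptions made for the example, and the real data set from Project 1 is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature_index, threshold) with the highest information gain, or None."""
    base, n, best = entropy(labels), len(rows), None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds lie halfway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def build_tree(rows, labels, max_depth):
    """Leaf = majority label; internal node = (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    # Simplification: stop when no candidate split improves the entropy.
    if split is None or split[0] <= 0:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1))

def predict(tree, row):
    """Walk from the root to a leaf using the attribute >= threshold test."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = right if row[f] >= t else left
    return tree

# Hypothetical toy samples: (height in m, weight in kg, age in years).
toy_rows = [(1.70, 60, 30), (1.85, 85, 40), (1.60, 55, 25), (1.80, 80, 35)]
toy_labels = ["W", "M", "W", "M"]
toy_tree = build_tree(toy_rows, toy_labels, max_depth=1)
print(toy_tree, predict(toy_tree, (1.82, 78, 50)))
```

For part b), the same `build_tree` call could be repeated for `max_depth` from 1 to 5, comparing the fraction of correct `predict` results on the training and test portions.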

(a) In the context of binary classification, define the optimal separating hyperplane (also known as the maximal margin hyperplane) and the maximal margin classifier. [4 marks]

(b) Describe Linear Discriminant Analysis (LDA), giving an explicit formula and making sure to cover the case of one attribute (usually expressed as p = 1) and more than one attribute (usually expressed as p > 1). There is no need to give formulas for parameter estimates. [6 marks]

(c) Why are LDA prediction boundaries linear? When answering this question, you may assume that the number of attributes is greater than 1 (p > 1). [2 marks]

(d) You are given the following training set with two attributes and a binary label: points (-1, 0) and (0, -1) belong to class 1, point (0, 0) belongs to class 2. Do the following tasks:

   i. Draw the optimal separating hyperplane (a line, in this context) for this training set and write an equation for it. [4 marks]

   ii. Use this optimal separating hyperplane to classify point (-0.5, 0.5).

(e) Consider the following training set with one attribute and a binary label: points -2 and -1 belong to class 1, points 1 and 3 belong to class 2. Answer the following questions about this data set, showing all details of your calculations (if any).

   i. What is the optimal separating hyperplane (a point, in this context) for the maximal margin classifier? [2 marks]

   ii. What is the prediction made by the maximal margin classifier for point 0.1? [2 marks]

   iii. What is the LDA prediction for point 0.1?

(f) What is the key difference between the assumptions of linear discriminant analysis and those of quadratic discriminant analysis in the case of more than one attribute? [4 marks]
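For the one-attribute case in part (e), the two decision rules can be compared with a short sketch. It assumes the standard textbook form of the LDA discriminant for p = 1, namely delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 * sigma^2) + log(pi_k), with a pooled variance estimate and priors taken from class frequencies. The function names are hypothetical, and a written answer would still need to show the calculation by hand.

```python
import math

def max_margin_threshold_1d(left_class, right_class):
    """Midpoint between the closest points of two separable 1D classes.

    Assumes every point of left_class lies to the left of right_class.
    """
    left_edge, right_edge = max(left_class), min(right_class)
    if left_edge >= right_edge:
        raise ValueError("classes are not separable in this order")
    return (left_edge + right_edge) / 2

def lda_discriminants_1d(x, classes):
    """Return {label: delta_k(x)} for 1D LDA with a pooled variance."""
    n = sum(len(v) for v in classes.values())
    means = {k: sum(v) / len(v) for k, v in classes.items()}
    # Pooled (shared) variance estimate with n - K degrees of freedom.
    pooled_var = sum(
        (xi - means[k]) ** 2 for k, v in classes.items() for xi in v
    ) / (n - len(classes))
    return {
        k: x * means[k] / pooled_var
        - means[k] ** 2 / (2 * pooled_var)
        + math.log(len(v) / n)
        for k, v in classes.items()
    }

# Training set from part (e).
classes = {1: [-2.0, -1.0], 2: [1.0, 3.0]}

threshold = max_margin_threshold_1d(classes[1], classes[2])
margin_prediction = 1 if 0.1 < threshold else 2

deltas = lda_discriminants_1d(0.1, classes)
lda_prediction = max(deltas, key=deltas.get)

print(threshold, margin_prediction, lda_prediction)
```

The maximal margin rule depends only on the closest points of the two classes, while the LDA rule depends on the class means, the shared variance, and the priors, so the two boundaries need not coincide.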

5. Does the accuracy of a kNN classifier using the Euclidean distance change if you (a) translate the data, (b) scale the data (i.e., multiply all the points by a constant), or (c) rotate the data? Explain. Answer the same for a kNN classifier using Manhattan distance¹.
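One way to explore this question empirically is to check whether the nearest neighbour of a query point changes under each transformation. The sketch below uses a hand-picked two-point configuration and hypothetical helper names. It is a single numerical experiment, not a proof, and the explanation asked for in the question still has to come from the properties of the two distances.

```python
import math

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def nearest(query, points, dist):
    """Index of the point closest to query under the given distance."""
    return min(range(len(points)), key=lambda i: dist(query, points[i]))

def rotate(p, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

# Hand-picked configuration (an assumption made for illustration).
query = (0.0, 0.0)
points = [(1.0, 1.0), (0.0, 1.5)]

transforms = {
    "identity": lambda p: p,
    "translate": lambda p: (p[0] + 3.0, p[1] - 2.0),
    "scale": lambda p: (2.5 * p[0], 2.5 * p[1]),
    "rotate45": lambda p: rotate(p, math.pi / 4),
}

def nearest_after(name, dist):
    """Apply the same transform to the query and all points, then search."""
    t = transforms[name]
    return nearest(t(query), [t(p) for p in points], dist)

results = {
    (name, dist.__name__): nearest_after(name, dist)
    for name in transforms
    for dist in (euclidean, manhattan)
}
for key, index in results.items():
    print(key, "-> nearest point index", index)
```

In this configuration the Euclidean nearest neighbour is the same under all four transforms, whereas the Manhattan nearest neighbour switches after the 45-degree rotation.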

Task 0: Naïve Logistic Regression

Fit a logistic regression model and report the accuracy.

Task 1: Train Data Transformation

Pre-process the original data into a new feature space by feature engineering, so that the features are linear in the new space. Confirm the four assumptions required for a linear classifier.

Task 2: Linear Parametric Classification

Implement a logistic regression model using Scikit-learn. Using GridSearchCV, optimize the model.

1. Make a logistic regression model. Report the weights and the accuracy of the model.
2. Using GridSearchCV with 100 different α values from 10^-5 to 10, build a logistic regression model. Visualize how the model accuracy behaves. Then report the best model. If the accuracy is 100%, the model is likely overfitted; in this case, the model should be regularized.
3. Using the best model, classify the test data set.

Task 3: Transformation using Kernel Method

Kernelize the original data into a kernel space using five different valid kernel functions. Then repeat Task 2.

Task 4: Non-parametric KNN Classification

1. Classify the original data with K values from 1 to 200. Then report the accuracy with visualization.
2. Repeat step 1 with the final train data sets from Tasks 1 and 3.

Report: Write a report summarizing the work. In the report, all steps must be explicitly explained with visualizations.
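The grid search in Task 2 might look roughly like the sketch below. Several things here are assumptions: the data is a synthetic stand-in generated with `make_classification` because the assignment's data set is not shown, and the swept parameter is scikit-learn's `C` (the inverse regularization strength of `LogisticRegression`), since that estimator has no parameter named α. Whether the assignment's regularization value maps to `C` or to `1/C` is not stated, so adjust accordingly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); replace with the assignment's data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 100 log-spaced candidate values from 1e-5 to 10.
param_grid = {"C": np.logspace(-5, 1, 100)}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Mean cross-validated accuracy, one entry per candidate value of C.
mean_cv_accuracy = search.cv_results_["mean_test_score"]
best_C = search.best_params_["C"]

# GridSearchCV refits the best model on the full training set by default.
test_accuracy = search.score(X_test, y_test)
weights = search.best_estimator_.coef_

print(f"best C = {best_C:.3g}")
print(f"CV accuracy = {search.best_score_:.3f}, test accuracy = {test_accuracy:.3f}")
```

Plotting `mean_cv_accuracy` against `param_grid["C"]` on a logarithmic x-axis would give the accuracy curve the task asks for.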

(a) Suppose that, when using grid search with cross-validation to select the parameters C and gamma of the Support Vector Machine (SVM), you have obtained these results for the accuracy of the algorithm: (As usual, the accuracy is defined as 1 minus the error rate.) Is this a suitable grid for selecting the optimal values of the two parameters? Explain why. If it is not suitable, describe at least one way of improving it. [7 marks]

(b) Give an example of a grid that is too crude and thus does not allow an accurate estimate of the optimal values of the parameters C and gamma of the SVM. [7 marks]

(c) Give an example of a grid that clearly does not cover the optimal values of the parameters C and gamma of the SVM. Briefly explain why your example achieves its goal. [7 marks]
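A common heuristic when judging such a grid is to check whether the best cell sits on the boundary of the grid, which suggests the optimum may lie outside the range searched. The sketch below illustrates that check only. The function name `grid_diagnosis` and the accuracy table are hypothetical and are not the results referred to in part (a).

```python
def grid_diagnosis(C_values, gamma_values, accuracy):
    """Locate the best grid cell and report whether it lies on the grid edge.

    accuracy[i][j] is the cross-validated accuracy for
    C_values[i] and gamma_values[j].
    """
    cells = [
        (i, j)
        for i in range(len(C_values))
        for j in range(len(gamma_values))
    ]
    best_i, best_j = max(cells, key=lambda ij: accuracy[ij[0]][ij[1]])
    on_edge = (
        best_i in (0, len(C_values) - 1)
        or best_j in (0, len(gamma_values) - 1)
    )
    return C_values[best_i], gamma_values[best_j], on_edge

# Hypothetical accuracy table, invented purely for illustration.
C_values = [0.01, 0.1, 1, 10]
gamma_values = [0.001, 0.01, 0.1]
accuracy = [
    [0.60, 0.62, 0.61],
    [0.70, 0.75, 0.72],
    [0.80, 0.86, 0.83],
    [0.85, 0.91, 0.88],
]

best_C, best_gamma, on_edge = grid_diagnosis(C_values, gamma_values, accuracy)
print(best_C, best_gamma, on_edge)
```

In this invented table the best accuracy occurs at the largest value of C, so the check flags the grid as possibly not covering the optimum, and extending the range of C upward would be one way to improve it.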

Q2. Using the data from Problem 2, build a Gaussian Naive Bayes classifier for this problem. For this you have to learn Gaussian distribution parameters for each input data feature, i.e. for p(height|W), p(height|M), p(weight|W), p(weight|M), p(age|W), p(age|M).

a) Learn/derive the parameters for the Gaussian Naive Bayes classifier for the data from Question 2 a) and apply them to the same target as in Problem 1 a).

b) Implement the Gaussian Naive Bayes classifier for this problem.

c) Repeat the experiments in parts 1 c) and 1 d) with the Gaussian Naive Bayes classifier. Discuss the results, in particular with respect to the performance difference between using all features and using only height and weight.

d) Same as 1 d), but with Naive Bayes.

e) Compare the results of the two classifiers (i.e., the results from 1 c) and 1 d) with the ones from 2 c) and 2 d)) and discuss reasons why one might perform better than the other.
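A minimal sketch of the parameter learning and prediction steps is given below. It assumes maximum-likelihood estimates (variance divided by n, not n - 1) and priors taken from class frequencies. The function names and the six toy samples are hypothetical and stand in for the data from Problem 2, which is not reproduced here.

```python
import math
from collections import defaultdict

def fit_gnb(rows, labels):
    """Estimate the class prior and per-feature (mean, variance) for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        stats = []
        for column in zip(*members):
            mean = sum(column) / n
            # Maximum-likelihood variance; the floor avoids division by zero.
            var = max(sum((v - mean) ** 2 for v in column) / n, 1e-9)
            stats.append((mean, var))
        model[label] = (n / len(rows), stats)
    return model

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gnb(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def log_posterior(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, (m, v) in zip(row, stats)
        )
    return max(model, key=log_posterior)

# Hypothetical toy samples: (height in m, weight in kg, age in years).
toy_rows = [
    (1.62, 55, 28), (1.68, 62, 35), (1.65, 58, 41),
    (1.80, 82, 30), (1.76, 78, 45), (1.85, 90, 38),
]
toy_labels = ["W", "W", "W", "M", "M", "M"]

gnb = fit_gnb(toy_rows, toy_labels)
print(predict_gnb(gnb, (1.70, 65, 33)), predict_gnb(gnb, (1.82, 85, 36)))
```

To compare all features against height and weight only, as part c) asks, the same functions could be called on rows truncated to their first two entries.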

Question 2 [10 points]: Bayesian Theorem