
Q3. In the previous question we observed some redundancy in the features; in this question we will try a feature-selection heuristic. Consider the same dataset as question 2 (fat.csv), where brozek is the response variable and the other 17 columns are the model features. Follow the steps below.

Form an extended version of the dataset by appending two more columns: one corresponding to siri² and one corresponding to density. Your extended dataset should now have 20 columns, where the first column is brozek and is used as the response variable, the next 17 columns are identical to the original fat.csv dataset, and columns 19 and 20 hold the values siri² and density, respectively. We will refer to this dataset as the extended dataset.

As in question 2, split the extended dataset into two sets: set 1 contains the first 200 rows of the data (not counting the header row with the feature/response names), and set 2 contains the last 52 rows. Name the first set train and the second set test.

(a) Use the training data to fit a model of the form

brozek = β₀ + β₁ siri + ⋯ + β₁₇ wrist + β₁₈ siri² + β₁₉ density   (9)

Report the fitted parameters, the 95% confidence interval for each estimated parameter, and the p-values. What is the R² value?

(b) Use the test data to calculate the test error (similar to the formulation in part (c) of the previous question), and call it e_full.

(c) Let's run a heuristic scheme to perform feature selection (the method is called backward selection and is described on page 79 of your textbook, as well as on the slides). Start with the full model (the model containing all 19 features of the extended dataset) and drop the feature with the highest p-value (or the second highest if the highest p-value belongs to the intercept); then refit the model and drop the next feature with the highest p-value, and continue dropping until all p-values are small and you are left with a set of important features. Implement this approach and stop when all p-values are below 0.03. Which features are selected as the most important ones when your code stops?

(d) Apply the model developed in part (c) to the test data and call the error e_sel.

(e) Compare e_full and e_sel. Does the feature selection scheme seem to reduce overfitting?

(f) Compare e_sel with e_3 from part (h) of question 2. In terms of the test accuracy, does your feature selection scheme seem to find the best model?

Please hand in your code along with a comprehensive response to each part of the question.
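For reference, a minimal Python sketch of parts (a) and (b) is given below. It assumes the column names of the commonly distributed fat.csv (brozek, siri, ..., wrist), that a density column is available to append as column 20, and that the test error in question 2 is the mean squared prediction error; adapt the names and metric to your own files if they differ.

    import pandas as pd
    import statsmodels.api as sm

    fat = pd.read_csv("fat.csv")

    # Extend the dataset: append siri^2 (and density) as columns 19 and 20.
    fat["siri2"] = fat["siri"] ** 2
    # NOTE (assumption): if density is not already a column of your fat.csv,
    # merge it in from the original source before this point.
    ext = fat  # 20 columns: brozek + 17 original features + siri2 + density

    # Split: first 200 rows for training, last 52 for testing.
    train, test = ext.iloc[:200], ext.iloc[200:252]

    # Part (a): fit the full model of equation (9) by ordinary least squares.
    y_train = train["brozek"]
    X_train = sm.add_constant(train.drop(columns=["brozek"]))
    full = sm.OLS(y_train, X_train).fit()

    print(full.params)              # fitted parameters beta_0 ... beta_19
    print(full.conf_int(alpha=0.05))  # 95% confidence intervals
    print(full.pvalues)             # p-values
    print(full.rsquared)            # R^2 on the training data

    # Part (b): test error of the full model (assumed: mean squared error).
    X_test = sm.add_constant(test.drop(columns=["brozek"]), has_constant="add")
    e_full = ((test["brozek"] - full.predict(X_test)) ** 2).mean()
    print("e_full =", e_full)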

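Continuing from that sketch, the backward-selection loop of parts (c) and (d) can be written as below. The 0.03 stopping threshold and the intercept exception come straight from the question; the test-error metric is again the assumed mean squared error.

    # Backward selection: repeatedly refit and drop the non-intercept feature
    # with the largest p-value until all p-values fall below 0.03.
    features = list(ext.columns.drop("brozek"))  # start from the full model
    while True:
        X = sm.add_constant(train[features])
        fit = sm.OLS(train["brozek"], X).fit()
        pvals = fit.pvalues.drop("const")  # never drop the intercept itself
        if pvals.max() < 0.03:             # stopping rule from part (c)
            break
        features.remove(pvals.idxmax())    # drop the least significant feature

    print("selected features:", features)

    # Part (d): test error of the selected model.
    X_test_sel = sm.add_constant(test[features], has_constant="add")
    e_sel = ((test["brozek"] - fit.predict(X_test_sel)) ** 2).mean()
    print("e_sel =", e_sel)

Note that the model is refit from scratch after every drop, since removing one feature changes the p-values of all the others.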