Q3. In the previous question we observed some redundancy in the features. We would like to try a feature selection heuristic in this question. Consider the same dataset as question 2 (fat.csv), where brozek is the response variable and the other 17 columns are the model features. Follow the steps below.
Form an extended version of the dataset by appending two more columns: one corresponding to siri² and one corresponding to 1/density. Your extended dataset should now have 20 columns, where the first column is brozek and is used as the response variable, the next 17 columns are identical to the features in the original fat.csv dataset, and columns 19 and 20 hold the values siri² and 1/density, respectively. We will refer to this dataset as the extended dataset.
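A minimal sketch of this construction in Python with pandas; the column names 'siri' and 'density' are assumptions based on the standard fat dataset, not something stated in the assignment:

    # Sketch: build the extended dataset. Assumes fat.csv has columns named
    # 'brozek', 'siri', and 'density', as in the standard fat dataset.
    import pandas as pd

    fat = pd.read_csv("fat.csv")               # 18 columns: brozek + 17 features
    ext = fat.copy()
    ext["siri2"] = ext["siri"] ** 2            # column 19: siri squared
    ext["inv_density"] = 1.0 / ext["density"]  # column 20: 1/density
    assert ext.shape[1] == 20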
In a similar way to question 2, split the extended dataset into two sets: set 1 contains the first 200 rows of the data (do not count the header row with the feature/response names), and set 2 contains the last 52 rows. Name the first set train and the second set test.
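For concreteness, a sketch of the split continuing from the code above (pandas does not count the header row, so positional slicing matches the description):

    train = ext.iloc[:200]   # first 200 data rows
    test = ext.iloc[200:]    # remaining 52 data rows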
(a) Use the training data to fit a model of the following form
brozek = β₀ + β₁ siri + ⋯ + β₁₇ wrist + β₁₈ siri² + β₁₉/density        (9)
Report the fitted parameters, the 95% confidence interval for each estimated parameter, and the p-values. What is the R² value?
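One possible sketch of this fit using ordinary least squares from statsmodels; the variable names continue from the sketches above and are assumptions, not part of the assignment:

    import statsmodels.api as sm

    X_train = sm.add_constant(train.drop(columns=["brozek"]))  # intercept + 19 features
    y_train = train["brozek"]
    full_model = sm.OLS(y_train, X_train).fit()

    print(full_model.params)                # fitted parameters beta_0, ..., beta_19
    print(full_model.conf_int(alpha=0.05))  # 95% confidence intervals
    print(full_model.pvalues)               # per-coefficient p-values
    print(full_model.rsquared)              # R^2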
(b) Use the test data to calculate the test error (similar to the formulation in part (c) of the previous question), and call it e_full.
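A sketch of e_full, assuming the test error in question 2 was the mean squared prediction error; substitute your own formula if question 2 defined the error differently:

    X_test = sm.add_constant(test.drop(columns=["brozek"]))
    e_full = ((test["brozek"] - full_model.predict(X_test)) ** 2).mean()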
(c) Let's run a heuristic scheme to perform feature selection (the method is called backward selection and is described on page 79 of your textbook, as well as on the slides). Start with the full model (the model containing all 19 features of the extended dataset) and drop the feature with the highest p-value (or the second highest if the highest p-value belongs to the intercept), then refit the model and drop the feature with the next highest p-value, and continue dropping until all p-values are small and you are left with a set of important features. Implement this approach and stop when all p-values are below 0.03. Which features are selected as the most important ones when your code stops?
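A sketch of the backward-selection loop described in part (c), again with statsmodels and the assumed names from the earlier sketches; the intercept's p-value is excluded from the comparison as the question specifies:

    # Repeatedly refit, find the largest p-value among the non-intercept
    # terms, and drop that feature until every p-value is below 0.03.
    features = [c for c in train.columns if c != "brozek"]
    while True:
        X = sm.add_constant(train[features])
        sel_model = sm.OLS(train["brozek"], X).fit()
        pvals = sel_model.pvalues.drop("const")  # ignore the intercept
        if pvals.max() < 0.03:
            break
        features.remove(pvals.idxmax())          # drop least significant feature
    print("Selected features:", features)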
(d) Apply the model developed in part (c) to the test data and call the resulting error e_sel.
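Continuing the sketch, e_sel applies the reduced model from part (c) to the test rows, using the same (assumed) mean-squared-error definition as e_full:

    X_test_sel = sm.add_constant(test[features])
    e_sel = ((test["brozek"] - sel_model.predict(X_test_sel)) ** 2).mean()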
(e) Compare e_full and e_sel. Does the feature selection scheme seem to reduce overfitting?
(f) Compare e_sel with e_3 from part (h) of question 2. In terms of test accuracy, does your feature selection scheme seem to find the best model?
Please hand in your code along with a comprehensive response to each part of the question.