Question

3. Two files, HW5_prob3_X.txt and HW5_prob3_y.txt, on the course Teams site contain the design matrix X and the response observations y, respectively, for artificially generated data. In this data, the number of candidate variables is p = 100 and there are n = 70 observations. You can load both into R by running:

    X_data_frame <- read.table('HW5_prob3_X.txt')  # this is a data frame
    X <- as.matrix(X_data_frame)                   # convert it into a matrix
    y_data_frame <- read.table('HW5_prob3_y.txt')  # data frame
    y <- as.matrix(y_data_frame)                   # convert to a matrix

Your goal is to identify the true covariates that were used to generate the response (y) values via the LASSO. You can run the LASSO using the following commands:

    library(glmnet)
    lassoFit <- glmnet(x = X, y = y)
    plot(lassoFit)

(You might want to check the code "Variable selection by LASSO.ipynb" on CoCalc.)

Based on the generated plot, how many variables do you think are reasonable to include in the model? (If there is a large gap in the L1 norm of the estimated coefficients between two models whose sizes differ by 1, then the two models may be considered well separated, so choosing the smaller of the two might be reasonable.)

Also, using the command

    print(lassoFit)

find a value of lambda (λ) that selects three variables, and a value of lambda that selects four variables (there can be more than one such value of lambda). To check which variables are selected for a given value of lambda, you can run

    coef(lassoFit, s = the value of lambda that you choose)

Report which variables are selected by the LASSO for the two values of lambda. Also report the estimated coefficients for the selected variables in both cases.
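The steps in the question can be strung together roughly as follows. This is only a sketch, not an official solution: it assumes the glmnet package is installed and that both data files sit in the working directory. The helpers `lambda_for_size` and `selected_coefs` are hypothetical names introduced here for illustration; they are not part of glmnet. The `df` and `lambda` components of the fitted object hold the same values that `print(lassoFit)` shows in its Df and Lambda columns.

```r
# Sketch only: assumes glmnet is installed and the two data files
# are in the current working directory.
library(glmnet)

X <- as.matrix(read.table("HW5_prob3_X.txt"))  # 70 x 100 design matrix
y <- as.matrix(read.table("HW5_prob3_y.txt"))  # 70 x 1 response

lassoFit <- glmnet(x = X, y = y)

# Coefficient paths against the L1 norm of the coefficients;
# label = TRUE marks each path with its variable index.
plot(lassoFit, xvar = "norm", label = TRUE)

# Same information as the Df and Lambda columns of print(lassoFit).
path <- data.frame(df = lassoFit$df, lambda = lassoFit$lambda)
print(path)

# Hypothetical helper: first lambda on the fitted path whose model has
# exactly k nonzero coefficients (NA if the path skips that size).
lambda_for_size <- function(fit, k) {
  idx <- which(fit$df == k)
  if (length(idx) == 0) return(NA_real_)
  fit$lambda[idx[1]]
}

# Hypothetical helper: nonzero slope estimates at a given lambda,
# returned as a named vector (the intercept row is dropped).
selected_coefs <- function(fit, lambda) {
  b <- as.matrix(coef(fit, s = lambda))  # (p + 1) x 1, first row is the intercept
  b <- b[-1, 1]
  b[b != 0]
}

for (k in c(3, 4)) {
  lam <- lambda_for_size(lassoFit, k)
  if (is.na(lam)) {
    cat("No lambda on the path gives exactly", k, "variables\n")
  } else {
    cat("lambda =", lam, "selects", k, "variables:\n")
    print(selected_coefs(lassoFit, lam))
  }
}
```

Because `read.table` names the columns V1, V2, and so on, the names on the returned coefficient vector identify which candidate variables were selected. Any lambda in the range where Df stays at 3 (or 4) is an equally valid answer; the helper simply picks the first one on the path.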