Question

Assignment-3 MSCA31010: Linear & Non-Linear Models
(Acknowledgement: Special thanks to Francisco Azeredo & Ming_Long Lam for their content creation support)

For this homework, you can use either R or Python. You may submit Word documents, R Markdown files, Jupyter notebooks, or HTML files.

1. Train a binary logistic regression model on claim_history.csv. Your model will predict the likelihood of filing more than one claim in one unit of exposure. You will first calculate the Frequency variable by dividing CLM_COUNT by EXPOSURE. Next, you will create a binary target variable that indicates whether the Frequency is strictly greater than one (i.e., the Event). You will use MSTATUS, CAR_TYPE, REVOKED, and URBANICITY as the categorical predictors, and CAR_AGE, MVR_PTS, TIF, and TRAVTIME as the interval predictors. Your goal is to train a model that has just the right set of predictors. The standard libraries for R or Python are allowed. You need to drop all missing values (i.e., NaN) of all the predictors and the target variable before training your model.

   A. (15 points) Before you train the model, we want to explore the predictors. For each predictor, generate a line chart that shows the odds of the Event by the predictor’s unique values. The predictor’s unique values are displayed in ascending lexical order.

   B. (20 points) Enter the predictors into your model using Forward Selection. The Entry Threshold is 0.05. Please provide a detailed report of the Forward Selection. However, you do not need to show the steps as in the previous question. The report should include (1) the predictor entered, (2) the log-likelihood value, (3) the Deviance Chi-square statistic, (4) the Deviance Degrees of Freedom, and (5) the Chi-square significance.

   C. (10 points) Which predictors does your final model contain?

   D. (10 points) Please show a table of the complete set of parameters of your final model. Please also include the exponentiated estimates (i.e., apply the exp() function to the parameter estimates).

2. You will visually assess your final model in Question 1. Please color-code the markers according to the Exposure value. Also, please briefly comment on the graphs.

   A. (10 points) Please plot the predicted Event probability versus the observed Frequency.

   B. (10 points) Please plot the Deviance residuals versus the observed Frequency.

3. (15 points) You will calculate the Accuracy metric to assess your final model in Question 1. If the predicted Event probability of an observation is greater than or equal to 0.25, then you will classify that observation as the Event (i.e., filing more than one claim per unit of exposure). An observation is correctly classified if the predicted target value equals the observed target value. The Accuracy metric is the proportion of observations that are correctly classified.

Bonus: (20 points) For Questions 1B, 1C, and 1D, apply recursive feature elimination (RFE) instead of Forward Selection (see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html).
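The Frequency and Event definitions, and the odds of the Event by predictor value, can be sketched as follows. This is only a minimal illustration, assuming pandas is available. The small DataFrame is made up (it reuses the column names CLM_COUNT, EXPOSURE, MSTATUS, and MVR_PTS but none of the real data), and `odds_by_value` is a hypothetical helper name, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Made-up sample rows that reuse column names from claim_history.csv.
# The values are illustrative only and are NOT taken from the real file.
df = pd.DataFrame({
    "CLM_COUNT": [0, 2, 1, 3, 0, 1, 4, 0],
    "EXPOSURE":  [1.0, 0.5, 1.0, 1.0, 0.25, 0.5, 2.0, np.nan],
    "MSTATUS":   ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    "MVR_PTS":   [0, 3, 0, 3, 3, 0, 3, 0],
})

predictors = ["MSTATUS", "MVR_PTS"]

# Drop rows with a missing predictor or a missing input to the target.
df = df.dropna(subset=predictors + ["CLM_COUNT", "EXPOSURE"]).copy()

# Frequency = claims per unit of exposure; Event = Frequency strictly > 1.
df["FREQUENCY"] = df["CLM_COUNT"] / df["EXPOSURE"]
df["EVENT"] = (df["FREQUENCY"] > 1.0).astype(int)


def odds_by_value(data, predictor, target="EVENT"):
    """Odds of the Event (events / non-events) per unique predictor value.

    Hypothetical helper. A level with no non-events yields inf.
    """
    counts = pd.crosstab(data[predictor], data[target])
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    odds = counts[1] / counts[0]
    return odds.sort_index()  # ascending order of the unique values


odds_table = {p: odds_by_value(df, p) for p in predictors}
```

Each entry of `odds_table` is a pandas Series indexed by the predictor's unique values, so, assuming matplotlib is installed, a call such as `odds_table["MSTATUS"].plot(marker="o")` would draw one line chart per predictor.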
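A single Forward Selection step compares the current model with a candidate model that adds one predictor. The Deviance Chi-square statistic is twice the gain in log-likelihood, the degrees of freedom are the number of added free parameters (one for an interval predictor, the number of levels minus one for a categorical predictor), and the predictor may enter if the significance is below the 0.05 threshold. The sketch below shows one such test on synthetic data. The Newton-Raphson fitter `fit_logistic` and the helper `deviance_test` are hypothetical names written for this illustration, and NumPy and SciPy are assumed to be installed; in practice a library routine could replace the hand-written fitter.

```python
import numpy as np
from scipy.stats import chi2


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson.

    Hypothetical helper: returns (parameter estimates, log-likelihood).
    X must already contain the intercept column.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    llk = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return beta, llk


def deviance_test(llk_reduced, df_reduced, llk_full, df_full):
    """Return (Deviance Chi-square, degrees of freedom, significance)."""
    chisq = 2.0 * (llk_full - llk_reduced)
    dof = df_full - df_reduced
    return chisq, dof, chi2.sf(chisq, dof)


# Synthetic data (an assumption for illustration, not claim_history.csv).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
true_prob = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * x)))
y = (rng.random(n) < true_prob).astype(float)

X0 = np.ones((n, 1))                    # intercept-only model
X1 = np.column_stack([np.ones(n), x])   # intercept + one interval predictor

_, llk0 = fit_logistic(X0, y)
beta1, llk1 = fit_logistic(X1, y)

chisq, dof, p_value = deviance_test(llk0, X0.shape[1], llk1, X1.shape[1])
enter = p_value < 0.05                  # Entry Threshold from the assignment
exp_beta1 = np.exp(beta1)               # exponentiated estimates
```

In a full Forward Selection, this test would be repeated for every remaining candidate at each step, and the candidate with the smallest significance below the threshold would enter; the procedure stops when no candidate qualifies.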
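The Accuracy rule (classify as the Event when the predicted probability is at least 0.25) and the Deviance residuals can be written out directly from their definitions. The sketch below uses made-up observed targets and predicted probabilities rather than output from a fitted model, and the function names are hypothetical; it assumes only NumPy.

```python
import numpy as np


def accuracy_at_threshold(y_true, p_event, threshold=0.25):
    """Proportion of observations whose predicted class equals the observed one.

    An observation is classified as the Event when p_event >= threshold.
    """
    y_pred = (p_event >= threshold).astype(int)
    return np.mean(y_pred == y_true)


def deviance_residuals(y_true, p_event):
    """Deviance residuals for a binary target.

    d_i = sign(y_i - p_i) * sqrt(-2 * [y_i*log(p_i) + (1 - y_i)*log(1 - p_i)])
    """
    llk_i = y_true * np.log(p_event) + (1 - y_true) * np.log(1 - p_event)
    return np.sign(y_true - p_event) * np.sqrt(-2.0 * llk_i)


# Made-up values for illustration only.
y_true = np.array([1, 0, 0, 1, 0])
p_event = np.array([0.60, 0.25, 0.10, 0.20, 0.05])

acc = accuracy_at_threshold(y_true, p_event)
resid = deviance_residuals(y_true, p_event)
```

For the two plots, and assuming matplotlib is used, a call of the form `plt.scatter(frequency, p_event, c=exposure)` would color-code the markers by Exposure; `frequency` and `exposure` here stand for the corresponding columns of the cleaned data.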