
1. Complete the following steps and then answer the exercise questions below. (A minimal R sketch of Steps 1 through 6 appears after question 5, below.)

Step 1. Import the training and scoring data sets for this exercise into data frames in RStudio.
Step 2. Load the R library required to create a logistic regression model.
Step 3. Create a logistic regression model to predict RenewedSubscription. Do not include PatronID as an independent variable. Coerce the dependent variable to be treated as a factor. Use the summary() function to inspect your independent variables' p-values. Do not remove any independent variables from the model.
Step 4. Using a subset() command, remove observations, if any, from the scoring data set where one or more attributes exceed the range established in the training data set. For example, the range for DifferentUsers in the training data set is 2 to 12. If any observations in the scoring data set have DifferentUsers values below 2 or above 12, remove them. Check all attributes to ensure all scoring observations are within the ranges established by the training data.
Step 5. Using the predict() function, apply your logistic regression model to the scoring data. Make sure the type of prediction you generate is the model's "response."
Step 6. Combine the predictions with the scoring data into a new data frame. View the data frame and answer the following questions.

Which attribute is the single poorest predictor of season ticket renewal?
- PricePerTicket
- AvgMinutesBeforeCurtain
- ConcessionVouchers
- PerformancesAttended

2. Of the first-year season ticket patrons in the scoring data set, how many are predicted by the logistic regression model to renew their subscriptions?
- 68
- 80
- 65
- 148

3. Considering all properties of the logistic regression model, which attribute is a better predictor of season ticket renewal?
- PricePerTicket
- NumberOfTickets
- PricePerTicket and NumberOfTickets have the exact same predictive strength in this model.
- Neither PricePerTicket nor NumberOfTickets has predictive strength in this model.

4. If you wished to test the accuracy of your logistic regression model in R, to which data set would you apply the predict() function?
- The test data
- The training data
- The scoring data
- The validation data

5. How many "No" predictions have a post-probability confidence percent higher than 95%?
- 9
- 80
- 68
- 71
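For reference, here is a minimal R sketch of Steps 1 through 6 above. It assumes the data have already been imported (Step 1) into data frames named ticketTrain and ticketScore; those names and the result object names are illustrative assumptions, and base R's glm() is used here as one common way to fit a logistic regression, which may differ from the library named in the chapter.

# Step 3: coerce the dependent variable to a factor, then fit the model,
# excluding PatronID as an independent variable
ticketTrain$RenewedSubscription <- as.factor(ticketTrain$RenewedSubscription)
renewModel <- glm(RenewedSubscription ~ . - PatronID, data = ticketTrain, family = binomial())
summary(renewModel)   # inspect each independent variable's p-value; keep them all

# Step 4: drop scoring observations outside the training ranges,
# e.g. DifferentUsers must fall between 2 and 12; repeat for every attribute
ticketScore <- subset(ticketScore, DifferentUsers >= 2 & DifferentUsers <= 12)

# Steps 5-6: score with type = "response", then combine and view
renewPrediction <- predict(renewModel, ticketScore, type = "response")
renewResults <- data.frame(renewPrediction, ticketScore)
View(renewResults)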
6. Complete the following steps and then answer the exercise questions below. (A minimal R sketch of these steps appears after question 10, below.)

Step 1. Import the training and scoring data sets for this exercise into data frames in RStudio.
Step 2. Load the R libraries required to create a decision tree model and to visualize the model graphically.
Step 3. Create a decision tree model to predict InsuranceCategory. Do not include CustomerID as an independent variable.
Step 4. Use the summary() function to inspect your decision tree's properties. Do not remove any independent variables from the model.
Step 5. Use the predict() function to apply your decision tree model to the scoring data frame to predict InsuranceCategory classes for each observation. Store these predictions in a data frame with a relevant name.
Step 6. Use the predict() function to apply your decision tree model to the scoring data frame to predict InsuranceCategory confidence percentages (post-probabilities) for each observation. Store these predictions in a data frame with a relevant name.
Step 7. Create a data frame with a relevant name that contains the Step 5 class predictions, the Step 6 confidence percentages, and the scoring data.
Step 8. Enlarge the Plots pane of RStudio and then create a visual depiction of your decision tree.

Using the visual depiction (plot) of the decision tree model, which attribute is the first, best predictor of insurance category?
- NumberOfClaims
- Age
- LatePayments
- AtFaultAccidents

7. Using the visual depiction (plot) of the decision tree model, what percent of the training observations have one or more at-fault accidents but no comprehensive claims on their insurance policies?
- 55%
- 19%
- 38%
- 7%

8. The summary() description of the decision tree model shows seven of the eight independent variables under Variable Importance. Which of the seven listed is the least important?
- MovingViolations
- Gender
- Age
- CompClaims

9. The summary() description of the decision tree model shows seven of the eight independent variables under Variable Importance. Which of the independent variables in the R model is deemed unimportant?
- Age
- CompClaims
- MaritalStatus
- CustomerID

10. In the visual depiction (plot) of the decision tree model, only one leaf of the tree leads to a prediction of High Risk-Do Not Insure. Not all of the training observations that follow that branch of the tree are classified in that category, however. What percent of those observations are actually classified into the Potentially High Risk category?
- 36%
- 0%
- 62%
- 2%
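For reference, a minimal R sketch of Steps 2 through 8 above, assuming the rpart and rpart.plot packages are the tree and plotting libraries in use; the data frame names insTrain and insScore and the result object names are illustrative assumptions and may differ from the chapter's own choices.

library(rpart)        # decision tree modeling
library(rpart.plot)   # graphical depiction of the tree

# Step 3: predict InsuranceCategory, excluding CustomerID
insTreeModel <- rpart(InsuranceCategory ~ . - CustomerID, data = insTrain, method = "class")
summary(insTreeModel)   # Step 4: splits, node details, Variable Importance

# Steps 5-6: class predictions and confidence percentages (post-probabilities)
insClassPred <- data.frame(predict(insTreeModel, insScore, type = "class"))
insConfPred  <- data.frame(predict(insTreeModel, insScore, type = "prob"))

# Step 7: combine predictions, confidences, and the scoring data
insResults <- data.frame(insClassPred, insConfPred, insScore)

# Step 8: plot the tree (enlarge the Plots pane first)
rpart.plot(insTreeModel, extra = 104)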
11. The data sets used for this end-of-chapter exercise will need to be normalized in order to create a neural network that can produce reliable predictions. This will be accomplished using the scale() function in R, which has not been covered in the text. The modifications to ensure that nnet() produces usable results will therefore be prescribed in the following steps. Complete each step and then answer the exercise questions below. (A minimal R sketch of Steps 7 through 9 appears after question 15, below.)

Step 1. Import the training and scoring data sets for this exercise into data frames in RStudio. Ensure that the training data frame's name is ch11Train and the scoring data frame's name is ch11Score.
Step 2. Load the R library required to create a neural network model.
Step 3. Use the following two commands to create data frames containing the normalized independent variable values for the training and scoring data sets.
ch11TrainNorm <- data.frame(scale(ch11Train[2:8]))
ch11ScoreNorm <- data.frame(scale(ch11Score[2:8]))
Step 4. The independent variable values that were normalized using the scale() function in Step 3 will be used to train the neural network model. Issue the command:
attach(ch11TrainNorm)
Step 5. Set the seed value to 43.
Step 6. You will train a neural network model to predict the Credit Risk dependent variable in the ch11Train data frame, using the independent variables in the normalized ch11TrainNorm data frame. The hidden layer's size attribute is set to 7, which uses the generally accepted formula ((7 independent variables + 5 dependent variable levels) / 2) + 1, which is (12 / 2) + 1 = 7. Issue the command:
ch11NNModel <- nnet(as.factor(ch11Train$Credit_Risk) ~ Credit_Score + Late_Payments + Months_In_Job + Debt_Income_Ratio + Loan_Amt + Liquid_Assets + Num_Credit_Lines, data=ch11TrainNorm, size=7, maxit=1000)
Step 7. Use the predict() function to apply your neural network model (ch11NNModel) to the normalized scoring data frame (ch11ScoreNorm) to predict Credit Risk classes for each observation. Store these predictions in a data frame with a relevant name.
Step 8. Use the predict() function to apply your neural network model (ch11NNModel) to the normalized scoring data frame (ch11ScoreNorm) to predict Credit Risk confidence percentages (post-probabilities) for each observation. Store these predictions in a data frame with a relevant name.
Step 9. Create a data frame with a relevant name that contains the Step 7 class predictions, the Step 8 confidence percentages, and the unnormalized scoring data (ch11Score). View this data frame in RStudio, then answer the following questions.

Using the neural network's predictions as generated using the steps outlined in this exercise, how many loan applicants will have their loans denied?
- 79
- 11
- 23

12. Normalizing the data has resulted in most confidence percentages being 100% in this exercise's predictions. However, not all predictions have 100% confidence. How many predictions of high credit risk in this neural network's results are not 100%?
- 162
- 2
- 12
- 79
- 77

13. Using the neural network's predictions as generated using the steps outlined in this exercise, and assuming that loan officers can automatically approve all loans predicted to be low or very low risk, how many loans will be automatically approved by loan officers?
- 126
- 114
- 142
- 16

14. Assume the bank has agreed to approve the loans for the following applicants: 931184, 937005, 451482, and 597325. Based on the model's predictions and confidence percentages, which applicant do you expect will be offered the least favorable terms and the highest interest rate?
- 597325
- 931184
- 937005
- 451482

15. Based on this model's predictions, the bank is 71% confident that Applicant ID 311882 is high risk. Using the predictions, how confident is the bank that this applicant may actually pose only a moderate risk in lending?
- 28.9%
- 15.6%
- 100%
- 71.1%
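For reference, a minimal R sketch of Steps 7 through 9 above, continuing from the commands given in the exercise; the result object names (nnClassPred, nnConfPred, ch11Results) are illustrative assumptions.

# Step 7: predicted Credit Risk class for each scoring observation
nnClassPred <- data.frame(predict(ch11NNModel, ch11ScoreNorm, type = "class"))

# Step 8: confidence percentages (post-probabilities) for each observation
nnConfPred <- data.frame(predict(ch11NNModel, ch11ScoreNorm, type = "raw"))

# Step 9: combine the classes, confidences, and the unnormalized scoring data
ch11Results <- data.frame(nnClassPred, nnConfPred, ch11Score)
View(ch11Results)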