Question

1. Use the insurance fraud dataset. Consider the data quality issues (e.g., missing data) and preprocess the data. Split the data into a 10% train and 90% test set using random_state = 1. Create a decision tree with a max depth of 3 using the gini measure. Print the accuracy on the test set and print the tree. Is this a good approach? Why or why not?
2. Create a decision tree on the same data with a max depth of 3 and the entropy measure. Does the accuracy change? Does the tree change? Discuss which measure you think is better.
3. Now split the data into 70% train and 30% test using random_state = 1. Redo parts 1 and 2. Have the trees and accuracies changed? Are the trees more or less similar now? Discuss which split you think is better and why. (A code sketch covering parts 1-3 follows this list.)
4. Evaluate how the accuracy changes with the depth of the tree on the 70-30 data. Look at the accuracy for a max depth of 1, 2, 3, ..., 10, 15, 20. Plot the curve of accuracy versus depth. Do you see underfitting? Do you see overfitting? (See the second sketch after this list.)
5. What variable provides the most information gain in the insurance fraud data (for the 70-30 split)?
6. Decision trees are a "white box" method. What do you observe about the insurance fraud data using decision trees?
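Below is a minimal Python sketch of one way to approach parts 1-3 with pandas and scikit-learn. It assumes the data sits in a CSV file named insurance_fraud.csv with a binary target column fraud_reported; both names are placeholders, and the simple imputation and one-hot encoding shown are only one reasonable preprocessing choice for the missing-data issue.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Placeholder file and column names -- adjust to the actual dataset.
df = pd.read_csv("insurance_fraud.csv")

# Simple preprocessing: drop rows missing the target, one-hot encode
# categorical predictors, and median-impute remaining numeric gaps.
df = df.dropna(subset=["fraud_reported"])
y = df["fraud_reported"]
X = pd.get_dummies(df.drop(columns=["fraud_reported"]))
X = X.fillna(X.median(numeric_only=True))

def fit_and_report(criterion, test_size):
    """Fit a depth-3 tree and print its test accuracy and structure."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=1)
    clf = DecisionTreeClassifier(max_depth=3, criterion=criterion, random_state=1)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"criterion={criterion}, test_size={test_size}: accuracy={acc:.3f}")
    print(export_text(clf, feature_names=list(X.columns)))
    return clf

fit_and_report("gini", test_size=0.9)     # part 1: 10% train / 90% test
fit_and_report("entropy", test_size=0.9)  # part 2: same split, entropy
fit_and_report("gini", test_size=0.3)     # part 3: 70-30 split, both measures
fit_and_report("entropy", test_size=0.3)
```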

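For parts 4 and 5, a sketch along the following lines sweeps the tree depth and then ranks the variables. It reuses X and y from the snippet above; note that scikit-learn's feature_importances_ measures total impurity decrease, which with the entropy criterion corresponds to weighted information gain.

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 70-30 split, reusing X and y from the previous snippet.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

depths = list(range(1, 11)) + [15, 20]
train_acc, test_acc = [], []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, criterion="entropy", random_state=1)
    clf.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, clf.predict(X_train)))
    test_acc.append(accuracy_score(y_test, clf.predict(X_test)))

# Part 4: accuracy vs. depth. Low accuracy on both sets at small depths suggests
# underfitting; a widening train/test gap at large depths suggests overfitting.
plt.plot(depths, train_acc, marker="o", label="train")
plt.plot(depths, test_acc, marker="o", label="test")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()

# Part 5: rank variables by importance; with criterion="entropy" the root
# split is the variable chosen for its information gain.
tree = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=1)
tree.fit(X_train, y_train)
ranked = sorted(zip(X.columns, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:5])
```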