Question

1. Use the insurance fraud dataset. Consider the data quality issues (e.g., missing data) and preprocess the data. Split the data into a 10% train and 90% test set using random_state = 1. Create a decision tree with a max depth of 3 using the gini measure. Print the accuracy on the test set and print the tree. Is this a good approach? Why or why not?
2. Create a decision tree on the same data with a max depth of 3 and the entropy measure. Does the accuracy change? Does the tree change? Discuss which measure you think is better.
3. Now split the data into 70% train and 30% test using random_state = 1. Redo parts 1 and 2. Have the trees and accuracies changed? Are the trees more or less similar now? Discuss which split you think is better and why. (A code sketch covering parts 1-3 follows this list.)
4. Evaluate how the accuracy changes with the depth of the tree on the 70-30 data. Look at the accuracy for a max depth of 1, 2, 3, ..., 10, 15, 20. Plot the curve of accuracy versus depth. Do you see underfitting? Do you see overfitting? (See the second sketch after this list.)
5. What variable provides the most information gain in the insurance fraud data (for the 70-30 split)?
6. Decision trees are a "white box" method. What do you observe about the insurance fraud data using decision trees?
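Below is a minimal Python sketch of one way to approach parts 1-3 with pandas and scikit-learn. It assumes the data sits in a CSV file named insurance_fraud.csv with a binary target column fraud_reported; both names are placeholders, and the simple imputation and one-hot encoding shown are only one reasonable preprocessing choice for the missing-data issue.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Placeholder file and column names -- adjust to the actual dataset.
df = pd.read_csv("insurance_fraud.csv")

# Simple preprocessing: drop rows missing the target, one-hot encode
# categorical predictors, and median-impute remaining numeric gaps.
df = df.dropna(subset=["fraud_reported"])
y = df["fraud_reported"]
X = pd.get_dummies(df.drop(columns=["fraud_reported"]))
X = X.fillna(X.median(numeric_only=True))

def fit_and_report(criterion, test_size):
    """Fit a depth-3 tree and print its test accuracy and structure."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=1)
    clf = DecisionTreeClassifier(max_depth=3, criterion=criterion, random_state=1)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"criterion={criterion}, test_size={test_size}: accuracy={acc:.3f}")
    print(export_text(clf, feature_names=list(X.columns)))
    return clf

fit_and_report("gini", test_size=0.9)     # part 1: 10% train / 90% test
fit_and_report("entropy", test_size=0.9)  # part 2: same split, entropy
fit_and_report("gini", test_size=0.3)     # part 3: 70-30 split, both measures
fit_and_report("entropy", test_size=0.3)
```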

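For parts 4 and 5, a sketch along the following lines sweeps the tree depth and then ranks the variables. It reuses X and y from the snippet above; note that scikit-learn's feature_importances_ measures total impurity decrease, which with the entropy criterion corresponds to weighted information gain.

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 70-30 split, reusing X and y from the previous snippet.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

depths = list(range(1, 11)) + [15, 20]
train_acc, test_acc = [], []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, criterion="entropy", random_state=1)
    clf.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, clf.predict(X_train)))
    test_acc.append(accuracy_score(y_test, clf.predict(X_test)))

# Part 4: accuracy vs. depth. Low accuracy on both sets at small depths suggests
# underfitting; a widening train/test gap at large depths suggests overfitting.
plt.plot(depths, train_acc, marker="o", label="train")
plt.plot(depths, test_acc, marker="o", label="test")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()

# Part 5: rank variables by importance; with criterion="entropy" the root
# split is the variable chosen for its information gain.
tree = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=1)
tree.fit(X_train, y_train)
ranked = sorted(zip(X.columns, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:5])
```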