preprocess the data. Split the data into a 10% train and 90% test set using random_state = 1.
Create a decision tree with a max depth of 3 using the Gini measure. Print the accuracy on
the test set and the tree. Is this a good approach? Why or why not?
2. Create a decision tree on the same data with a max depth of 3 and the entropy measure. Does
the accuracy change? Does the tree change? Discuss which measure you think is better.
3. Now split the data into 70% train and 30% test using random_state = 1. Redo the Gini and
entropy trees from the previous two questions. Have the trees and accuracy changed? Are the
two trees more or less similar to each other now? Discuss which split you think is better and why.
4. Evaluate how the accuracy changes with the depth of the tree using the 70-30 split. Look at
the accuracy for a max depth of 1, 2, 3, ..., 10, 15, 20. Plot accuracy against max depth. Do
you see underfitting? Do you see overfitting?
5. What variable provides the most information gain in the insurance fraud data (for the 70-30
split)?
6. Decision trees are a "white box" method. What do you observe about the insurance fraud
data using decision trees?
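
As a reminder of what the Gini and entropy measures and the information gain in the questions above compute, here is a minimal sketch in plain Python. The function names (gini, entropy, impurity_decrease) and the toy labels are assumptions made up for illustration; they are not taken from the insurance fraud data or from any library.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def impurity_decrease(parent, left, right, measure):
    """Parent impurity minus the size-weighted impurity of the two children.

    With measure=entropy this quantity is the information gain of the split.
    """
    n = len(parent)
    weighted = (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
    return measure(parent) - weighted


# Invented toy labels (1 = fraud, 0 = not fraud), split into two child nodes.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [1, 0, 0, 0]

gini_parent = gini(parent)        # impurity of a 50/50 node under Gini
entropy_parent = entropy(parent)  # impurity of a 50/50 node under entropy
gini_gain = impurity_decrease(parent, left, right, gini)
info_gain = impurity_decrease(parent, left, right, entropy)

print("Gini(parent)      =", gini_parent)
print("Entropy(parent)   =", entropy_parent)
print("Gini decrease     =", gini_gain)
print("Information gain  =", info_gain)
```

Both measures score the same split here, but on different scales, which is one reason two trees built with the two criteria can agree on most splits and still differ on a few.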
Fig: 1