binary classification problem with the classes being "Trojan" or "Benign". You will perform all your tasks on the training dataset, which has almost 160k rows, with the output class divided almost equally. Your task is to build classification models to predict the "Class" column. First, become familiar with the data set, and research some background information on trojan horse detection in a network.
1. First, research some background information on applying data mining in related topics by reviewing published research papers (minimum 3!) on similar topics. Make sure each review is at least one paragraph and that you cite your sources (for an example format, see the appendix).
2. Go through the detailed process of data preparation on train_data.csv. Apply all preprocessing and data reduction techniques you consider necessary and explain why. For every preprocessing technique:
   a. Explain what the technique is;
   b. Explain how it is applied;
   c. Show the summary results of the preprocessing.
   d. Do not include raw code, raw output, or raw screen captures.
   Perform dimensionality reduction in at least two steps (once by applying your understanding of the features and once by using feature selection or extraction methods).
3. When the data set is of sufficient quality, apply several predictive techniques (minimum of 3 techniques!) and create appropriate predictive models. For every predictive technique you applied:
   a. Explain what the technique is and how the technique works.
   b. Explain what the parameters of the technique are and how the parameters are chosen and tuned.
   c. Explain and discuss the predictive results and performance of the technique. Analyze different aspects of the results, including but not limited to ROC curve, F score, accuracy, etc.
   d. Do not include raw code, raw output, or raw screen captures.
4. After you have built three predictive models, test your models so that you can compare the three data mining techniques you've chosen.
   You should include the following in the comparison discussion:
   a. The performance comparison among your techniques.
   b. A visualization or table showing the performance differences. Make sure you explain and discuss the visualization or table.
   c. Explain the probable reasons behind the performance differences.
   d. Explain which technique is the best for the dataset.
Comparison of data mining techniques (and the obtained predictive models), with additional discussion and interpretation of results, will be a very important part of your report. Do not include raw code or raw output. A conclusion section will be used to summarize your findings throughout the report. Give examples of data samples, discuss how you would use your model on your samples, and state what result can be expected from your model. The final grade will be 90% report and 10% class competition, explained next.

A separate competition dataset (test_data.csv) will be provided for additional evaluation. We will compare your prediction results against the true labels for additional unbiased evaluation as a component of your project. Up to 2 points in the final grade of the project will depend upon your model's performance with this dataset. The performance is calculated using Area under ROC. The scores of the entire class will be calculated and ranked. If your score is in the top 20% when compared to your peers, you get the full 2 points. The grade distribution is the following:

Score Percentile    Score Range
Top 20%             2
50% - 80%           > 1.5 - 2
20% - 50%           > 1 - 1.5
10% - 20%           > 0.5 - 1
0% - 10%            0 - 0.5

Please read "Required Structure for the Project 2 Documentation", "Required Structure for the Test Data Result" and "Required Materials for Project 2 Submission" below.
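For orientation only, the metrics named in this brief (Area under ROC, accuracy, F score) can be computed as in the minimal sketch below. It uses plain Python with no libraries. The toy labels and scores, the function names, the 0.5 decision threshold, and the choice of "Trojan" as the positive class (encoded as 1) are all assumptions made for illustration; they are not taken from the project data or from the course's grading script.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    y_true holds 0/1 labels (1 = positive class, assumed here to be "Trojan");
    y_score holds the model's score for the positive class.
    """
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores and give each the average 1-based rank.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both classes present")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def accuracy_and_f1(y_true, y_score, threshold=0.5):
    """Accuracy and F1 after thresholding scores (threshold is an assumption)."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


# Toy example: made-up labels and scores, not project data.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]
auc = roc_auc(labels, scores)
acc, f1 = accuracy_and_f1(labels, scores)
print(f"AUC={auc:.3f} accuracy={acc:.3f} F1={f1:.3f}")
```

Note that AUC depends only on how the scores rank the samples, so it needs probability-like scores rather than hard 0/1 predictions, whereas accuracy and F score change with the chosen threshold. In the report itself, such numbers belong in summary tables or plots, not as raw code or raw output.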