Search for question
Question

Data Mining IT 270 1 Question 1 - Linear Algebra (15 Pts) Please show all work. You may use R for part g, do not use cov2cor (). You may use R for part h & i Given: A 4 1 [957] 93 B 62 C 3 3 = 253 D == [ 22 ]] F 2 68 Σ = 34 2 " 36 3 028 (a) BA (b) B'F (c) Find the determinant of F. Show all work. (d) (AA)A' (e) Find the trace of matrix F. (f) Assume Sigma Σ is the covariance matrix of the matrix X; calculate V. (g) Calculate the correlation matrix (p) of Sigma Σ (do not use the cov2cor() function). (h) Using R calculate the determinant of matrix Sigma Σ. (i) What is a singular matrix? What is a simple way to tell if a matrix is singular? Identify if (A, B, F, Sigma) are singular - provide your reasoning for each. Σ. 2 Question 2 - PCA (25 Pts) Using the Air Quality data set (See Data Files Link in Blackboard - column descriptions are in blackboard) of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Devices, examine the following: 1 (a) Provide a summary of this data, the definitions of the data are on blackboard to help you a bit. Your summary should include the types of data, interesting summary measures and appropriate box plots. Consider outliers and missing values. (b) The analysis wishes to reduce the dimensions of the data set for further analysis. Conduct a principal component analysis and provide the following information : • How many dimensions would you choose and why? • Are there any overlapping loadings based on a .6 cutoff? If so, which ones? Why would this be an issue • Conduct a Varimax rotation on the result, how does this differ from a non rotated solution? Are there any overlapping loadings based on a .6 cutoff? Is this solution better? Using your new dimensions create new variables in your data set and compute values for each PCA dimension from the dimensions you choose. Please show the R code and the first observation. Correlation analysis between your new components, what does this correlation tell you? 3 Question 3 - Factor Analysis (25 Pts) It is a well known fact that sports analytics are very popular. The data set fifa.csv contains information about soccer (football) players obtained from FIFA 19 information. It would be interesting if the ratings that are used could be analyzed as a simple set of factors, based on the skills of the player. (a) Identify the columns that are the best candidates for analysis as factors. (b) Conduct a factor analysis using these columns (if you get an error reduce the number of factors in the function, until the error disappears). (c) Compare the results of a Varimax and Promax rotation, which gives you a better solution. (Note: Consider you may or may not need to change the number of factors). (d) Name the factors, as best as you can based on what you see in the factors. - even if you don't know soccer, try and name them (e) Are there any correlations between the factors? If so, explain what (f) Create the diagram (by hand or Powerpoint) of the factors, be sure to label everything. Please do not use the function in R to create the diagram. (g) Split the data set into two parts (Left footed players and right footed players). Conduct a factor analysis on each data set and explain if you see any differences between right and left footed players. Note you do not need to draw this one. 4 Question 4 - Decision Tree (25 Pts) People Analytics is becoming a more popular and demanding area for Data Mining. The Head of Human Resources is looking to identify reasons for attrition (people leaving either voluntary or otherwise). The task is to develop a decision tree to determine this. You will need to merge three datasets employee_survey_data.csv, general_data.csv, manager_survey_data.csv to accomplish this task. (a) Provide a summary of this data. Your summary should include the types of data (that are essential for this analysis), interesting summary measures and appropriate box plots. Consider outliers and missing values. (b) Determine and list any columns that are not needed in a first “full” model. Explain why you removed those columns (c) Using two algorithms C5.0 and Conditional Inference. Develop a full model decision tree to analyze the attrition. (d) From your full model, develop a more appropriate model (lower variables) that will pro- vide an ”attrition” prediction. What was the formula of your final model, in the C5.0 and conditional inference algorithm. (e) Compare the methods based on appropriate measures of accuracy. (f) Provide the charts from each algorithm of your final model. 5 Question 5 - Theoretical Questions (10 Pts) Provide no more than a paragraph for each, be concise. (a) Explain why the covariance matrix is important for analysis. Additionally, explain what the eigenvectors and eigenvalues are and why they are important. (b) In machine learning, explain the importance of splitting up the dataset. What are the different ways to split and how should an analyst split the data. (c) What are the differences and similarities between factor analysis and PCA. Focus on the equations that are produced. (d) What are some of the challenges with k-NN. How would you advise someone who decides to use k-NN? (e) What is the difference between orthogonal rotation and oblique rotations. When would we choose either, or neither.