ECO520 HOMEWORK WEEK3
Clustering Analysis on Mortgage Loan Approval

The Home Mortgage Disclosure Act (HMDA) is a federal law that requires certain financial institutions to provide mortgage data to the public. Congress passed the HMDA in 1975 to promote transparency in the mortgage lending market and protect consumers from discriminatory lending practices. The following variables are from the original data for Illinois in 2020.

● approved: 1 if approved, 0 if not approved
● loan_amount: Mortgage loan amount
● population: Total population size
● median_income: Median family income
● minority: Rate of minority populations
● age_house: Average age of buildings

    /* Read Data from bigblue */
    filename webdat url "https://bigblue.depaul.edu/jlee141/econdata/eco520/hmda_il.csv";
    PROC IMPORT DATAFILE=webdat OUT=hmda DBMS=CSV REPLACE;
    RUN;

    /* Create your own random sample data. Make sure to type your student ID as the seed number.
       Replace your_depaul_id with your student ID (numbers only). */
    proc surveyselect data=hmda method=srs seed=your_depaul_id n=200000 out=myhmda;
    run;

    /* Create the census_tract summary table with a PROC SQL query */
    PROC SQL;
      create table tract_summary as
      select distinct census_tract,
             avg(loan_amount)   as ave_amount,
             avg(median_income) as avg_income,
             avg(population)    as population,
             avg(minority)      as minority,
             avg(approved)      as approval_rate
      from myhmda
      group by census_tract;
    quit;

    proc means data=tract_summary;
    run;

Three Variable Clustering Analysis (Use the "tract_summary" data)

Let's find the best way to classify the census tracts using three variables: avg_income, population, and minority. Use the clustering method to find the most suitable clusters. (Explain how you came up with the number of clusters and describe why you prefer the one you chose.)

Minimum Required work:
● Potential issues with outliers or other problems in the data (remove only extreme outliers if necessary)
● Show the best number of clusters using various settings of clusters
● One hierarchical Model and one K-Means Model, and compare the differences
● Use graphs to illustrate the different clusters
● Name each group utilizing the summary statistics by cluster from the K-Means Model
● Using the ANOVA test, find if the clusters are related to the approval_rate

All questions need to be typed with appropriate graphs and tables from SAS in a PDF file. Submit your SAS code as a separate text file. Do not make a zip file.
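The required steps could be outlined in SAS roughly as follows. This is only a minimal sketch, not the expected solution. It assumes the tract_summary table already exists. The data set names (work_tract, std_tract, tree_out, hc_out, km_out), the z_ variable names, the outlier cutoff of 5 standard deviations, and the choice of 4 clusters are all illustrative assumptions. The number of clusters in particular should come from your own reading of the CCC and pseudo F statistics.

```sas
/* Keep the original variables and make copies to standardize (z_ names are assumed) */
data work_tract;
   set tract_summary;
   z_income   = avg_income;
   z_pop      = population;
   z_minority = minority;
run;

/* Standardize so that population, with its larger scale, does not dominate distances */
proc stdize data=work_tract out=std_tract method=std;
   var z_income z_pop z_minority;
run;

/* Drop only extreme outliers; the cutoff of 5 is an arbitrary assumption */
data std_tract;
   set std_tract;
   if max(abs(z_income), abs(z_pop), abs(z_minority)) <= 5;
run;

/* Hierarchical model: Ward's method; CCC and pseudo F / t-squared help choose k */
proc cluster data=std_tract method=ward ccc pseudo print=15 outtree=tree_out;
   var z_income z_pop z_minority;
   copy avg_income population minority approval_rate;
run;

/* Cut the tree at an assumed 4 clusters */
proc tree data=tree_out nclusters=4 out=hc_out noprint;
   copy avg_income population minority approval_rate;
run;

/* K-Means model with the same assumed number of clusters */
proc fastclus data=std_tract maxclusters=4 maxiter=100 out=km_out;
   var z_income z_pop z_minority;
run;

/* Graph the K-Means clusters in original units */
proc sgplot data=km_out;
   scatter x=avg_income y=minority / group=cluster;
run;

/* Summary statistics by cluster, used to name each group */
proc means data=km_out n mean min max;
   class cluster;
   var avg_income population minority approval_rate;
run;

/* One-way ANOVA: does mean approval_rate differ across clusters? */
proc glm data=km_out;
   class cluster;
   model approval_rate = cluster;
   means cluster / tukey;
run;
quit;
```

Rerunning PROC TREE and PROC FASTCLUS with different values of nclusters= and maxclusters= is one way to compare "various settings of clusters". The same PROC MEANS step can be run on hc_out to set the hierarchical clusters beside the K-Means ones.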