
6. Run the housing price prediction notebook and answer the following questions.

a. Question 1: Frame the Problem
Given a dataset like this, how can it be framed as a machine learning problem (try to frame it in ways other than predicting the housing price)? Is the problem you want to solve supervised learning or unsupervised learning? A classification problem or a regression problem?

b. Question 2: Sign of Coefficients for Linear Regression
In linear regression, the predicted dependent variable (target) equals the weighted sum of the independent variables (features) plus a bias/noise term. The equation to predict the house price from all of the given features is shown below:

median_house_value = β₀ * longitude + β₁ * latitude + β₂ * housing_median_age + β₃ * total_rooms + β₄ * total_bedrooms + β₅ * population + β₆ * households + β₇ * median_income + ε

Each feature weight is known as a coefficient. A regression coefficient represents the mean change in the dependent variable for each 1-unit change in an independent variable when you hold all of the other independent variables constant. Machine learning is about building algorithms to learn these coefficients and using the learned coefficients to make predictions on future unseen data. Given the correlation matrix above, what can you conclude about the sign of each coefficient in this linear regression model?

c. Question 3: Try StratifiedShuffleSplit by Yourself
"median_income" was categorized into 5 groups, and we used StratifiedShuffleSplit to make sure that the ratio of each group is exactly the same in the training and test sets. Apply the same method to the feature "housing_median_age" to preserve the percentage of samples in the training and test sets.
(Hint: You have to choose the number of categories and the split thresholds carefully to avoid generating skewed data. You can base the decision on the output of the describe() method, which shows important statistics for each feature.)

d. Question 4: Pros and Cons of One-Hot Encoding
One-hot encoding is a way to transform a categorical feature into a format that the model can take as input. It has the advantage that the result is binary rather than ordinal and that everything sits in an orthogonal vector space. If there were a new feature 'closest_city', would it make sense to use one-hot encoding? Why or why not?

e. Question 5: Is the Decision Tree Regressor a Great Model?
The decision tree regressor has 0 error on the training set. Is that a good model? Why or why not?

f. Question 6: Try Out the SVR Model
Try a support vector machine regressor with various hyperparameters, such as kernel = "linear" (with various values for the C hyperparameter) or kernel = "rbf" (with various values for the C and gamma hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best SVR predictor perform? You can refer to https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html for more description.

g. Question 7: Try Out a Different Hyperparameter Tuning Strategy
Try replacing GridSearchCV with RandomizedSearchCV.
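For Question 2, a minimal sketch (using synthetic data, not the notebook's dataset) of how the sign of a learned coefficient reflects the direction of the underlying relationship:

```python
# Sketch with synthetic data: a feature that truly pushes the target up
# should get a positive learned coefficient, and one that pushes it down
# a negative coefficient.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
income = rng.uniform(0, 10, n)   # stand-in for median_income
age = rng.uniform(1, 50, n)      # stand-in for housing_median_age
# True relationship: price rises with income, falls slightly with age.
price = 50_000 + 40_000 * income - 500 * age + rng.normal(0, 5_000, n)

X = np.column_stack([income, age])
model = LinearRegression().fit(X, price)
print(model.coef_)  # first coefficient positive, second negative
```

Features that correlate positively with median_house_value in the correlation matrix are candidates for positive coefficients, and vice versa (with the caveat that correlated features can flip signs in a multivariate fit).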
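For Question 3, one possible sketch of the stratification step (the bin edges below are illustrative assumptions, and the DataFrame here is synthetic; pick your own thresholds from describe()):

```python
# Sketch: bin "housing_median_age" into categories, then use
# StratifiedShuffleSplit so each category's proportion is preserved
# in both the training and test sets.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing DataFrame.
rng = np.random.default_rng(0)
housing = pd.DataFrame({"housing_median_age": rng.integers(1, 53, 1000)})

# Illustrative bin edges -- check housing.describe() to avoid skewed strata.
housing["age_cat"] = pd.cut(housing["housing_median_age"],
                            bins=[0, 15, 25, 35, 45, np.inf],
                            labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["age_cat"]):
    strat_train = housing.loc[train_idx]
    strat_test = housing.loc[test_idx]

# Category proportions should now closely match across the two sets.
print(strat_train["age_cat"].value_counts(normalize=True).sort_index())
print(strat_test["age_cat"].value_counts(normalize=True).sort_index())
```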
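For Questions 6 and 7 together, a hedged sketch of tuning an SVR with RandomizedSearchCV in place of GridSearchCV (synthetic data and illustrative hyperparameter ranges, not the notebook's):

```python
# Sketch: RandomizedSearchCV samples hyperparameter values from
# distributions instead of enumerating a fixed grid, which scales better
# when C and gamma span several orders of magnitude.
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=8, noise=10.0,
                       random_state=42)

param_distributions = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e0, 1e3),       # sampled, not enumerated as in a grid
    "gamma": loguniform(1e-4, 1e0),  # only used by the "rbf" kernel
}

search = RandomizedSearchCV(SVR(), param_distributions,
                            n_iter=10, cv=3,
                            scoring="neg_mean_squared_error",
                            random_state=42)
search.fit(X, y)
print(search.best_params_)
```

Swapping back to GridSearchCV only requires replacing the distributions with explicit value lists and dropping n_iter.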

Fig: 1

Fig: 2