Question

# In this problem I'd like you to use the following code to generate a dataset to evaluate various approaches to regression in the presence of outliers.

    import numpy as np
    np.random.seed(2017)
    n = 100
    xtrain = np.random.rand(n)
    ytrain = 0.25 + 0.5*xtrain + np.sqrt(0.1)*np.random.randn(n)
    idx = np.random.randint(0, 100, 10)
    ytrain[idx] = ytrain[idx] + np.random.randn(10)

The code above generates training data by selecting random values for the x_i's, then computing f(x_i) = 0.25 + 0.5 x_i and adding a small amount of Gaussian noise (variance 0.1) to each observation. It then creates some "outliers" in the y's by picking 10 random entries and adding a much larger amount of noise (variance 1) to just those elements. In the problems below, you should find a linear fit to this data. Each of the methods below has one or more parameters to set. You can do this manually using whatever approach you like. (Do not go crazy optimizing these; just tune the parameters until your estimate looks reasonable.)
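As a quick sanity check on the generation step, the sketch below reruns the snippet and counts how many distinct entries actually received the extra noise. Because `np.random.randint` samples with replacement, `idx` may contain repeated indices, so there can be fewer than 10 distinct outliers; with the assignment form `ytrain[idx] = ytrain[idx] + ...`, a repeated index keeps only the last perturbation written to it. The helper name `make_data` and the returned `clean` copy are illustrative additions, not part of the problem statement.

```python
import numpy as np

def make_data(seed=2017, n=100, n_out=10):
    """Regenerate the problem's dataset; also return the pre-outlier targets.

    `make_data` and `clean` are hypothetical names added for illustration.
    The random calls are made in the same order as in the original snippet.
    """
    np.random.seed(seed)
    xtrain = np.random.rand(n)
    ytrain = 0.25 + 0.5 * xtrain + np.sqrt(0.1) * np.random.randn(n)
    clean = ytrain.copy()                      # targets before outliers are added
    idx = np.random.randint(0, n, n_out)       # sampled WITH replacement
    ytrain[idx] = ytrain[idx] + np.random.randn(n_out)
    return xtrain, ytrain, clean, idx

xtrain, ytrain, clean, idx = make_data()
changed = np.flatnonzero(ytrain != clean)      # entries whose target was perturbed

print("distinct indices in idx:  ", np.unique(idx).size)
print("entries actually modified:", changed.size)
```

Comparing `ytrain` against `clean` is also a convenient way to mark the outliers when plotting the fits from the problems below.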

1. To begin, find a linear fit using the code for ridge regression that you produced in the first problem. Report the value of λ that you selected, and report the slope and intercept of your linear fit.

2. Next, I would like you to find a linear fit using the Huber loss. This can be done via

    from sklearn import linear_model
    reg = linear_model.HuberRegressor(epsilon=1.35, alpha=0.001)
    reg.fit(xtrain.reshape(-1, 1), ytrain)

You have two parameters to choose here: ε (which controls the shape of the loss function and needs to be greater than 1.0) and α (the regularization parameter). Report the values of ε and α you selected, and report the slope and intercept of your linear fit (see reg.intercept_ and reg.coef_).
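The two fits requested above could be set up roughly as follows. This is a sketch under stated assumptions, not a reference solution: the ridge code from the first problem is not shown here, so `ridge_fit` is a hypothetical stand-in that uses the closed-form solution and penalizes only the slope, and `lam = 0.1` is an arbitrary placeholder rather than a recommended value. The `huber` helper writes out the piecewise function given in the scikit-learn documentation for `HuberRegressor` (quadratic for |z| < ε, linear beyond), which is how ε shapes the loss.

```python
import numpy as np
from sklearn import linear_model

# Regenerate the dataset exactly as in the problem statement.
np.random.seed(2017)
n = 100
xtrain = np.random.rand(n)
ytrain = 0.25 + 0.5 * xtrain + np.sqrt(0.1) * np.random.randn(n)
idx = np.random.randint(0, 100, 10)
ytrain[idx] = ytrain[idx] + np.random.randn(10)

def ridge_fit(x, y, lam):
    """Closed-form ridge fit of y ~ intercept + slope * x.

    Assumption (not from the problem): only the slope is penalized,
    the intercept is left unregularized.
    """
    A = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
    P = np.diag([0.0, 1.0])                     # penalty on the slope only
    theta = np.linalg.solve(A.T @ A + lam * P, A.T @ y)
    return theta[1], theta[0]                   # slope, intercept

def huber(z, eps):
    """Huber function: z**2 if |z| < eps, else 2*eps*|z| - eps**2."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) < eps, z**2, 2 * eps * np.abs(z) - eps**2)

# Part 1: ridge regression with a placeholder regularization value.
lam = 0.1  # hypothetical choice; tune by eye as the problem suggests
ridge_slope, ridge_intercept = ridge_fit(xtrain, ytrain, lam)

# Part 2: Huber regression with the parameter values from the snippet above.
reg = linear_model.HuberRegressor(epsilon=1.35, alpha=0.001)
reg.fit(xtrain.reshape(-1, 1), ytrain)
huber_slope, huber_intercept = reg.coef_[0], reg.intercept_

print(f"ridge: slope={ridge_slope:.3f}, intercept={ridge_intercept:.3f}")
print(f"huber: slope={huber_slope:.3f}, intercept={huber_intercept:.3f}")
```

Both estimates can be compared against the generating line, slope 0.5 and intercept 0.25, to judge how strongly each method is pulled by the outliers.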