
1. In "Lec 04: Parameter Estimation for Generative Models", Slides #70-72, we showed that

given training data X (i.e., N d-dimensional records), the weight vector w* = (X¹X)`¹X¹r

minimizes the sum of squared errors defined on Slide # 70. The derivation uses scalar

calculus shown on Slide #71 which is very verbose.
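
As a quick sanity check (not part of the assignment), the closed-form w* can be compared against a standard least-squares solver on synthetic data; a minimal sketch, assuming NumPy and hypothetical sizes N and d:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5                      # hypothetical sizes: N records, d features
X = rng.standard_normal((N, d))    # training data X
y = rng.standard_normal(N)         # target vector y

# Closed-form solution from the slides: w* = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference: NumPy's least-squares solver minimizes ||Xw - y||_2^2 directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_closed, w_lstsq))  # expect True: both minimize the same SSE
```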

Explain why w* maximizes the likelihood of observing X (hint: check Slide #34).
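
A sketch of the standard connection, assuming the i.i.d. Gaussian noise model y_n = w^T x_n + eps_n with eps_n ~ N(0, sigma^2) (presumably what Slide #34 sets up):

```latex
\log p(y \mid X, w)
  = \sum_{n=1}^{N} \log \mathcal{N}\!\left(y_n \mid w^\top x_n, \sigma^2\right)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(y_n - w^\top x_n\right)^2
    - \frac{N}{2}\log\!\left(2\pi\sigma^2\right)
```

The second term does not depend on w, so maximizing the log-likelihood is equivalent to minimizing the sum of squared errors; hence the SSE minimizer w* is also the maximum-likelihood estimate.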

Note that the sum of squared errors can be represented as ||Xw - y||_2^2, where ||·||_2 denotes the vector length (i.e., the l2-norm). This observation allows us to use vector calculus to derive the same conclusion in a less verbose manner.

Show how to compute w* using vector calculus.

Hints: (1) ||x||_2^2 = x^T x, (2) Matrix Cookbook Eq. (69), where the vector a generalizes to a matrix A, (3) Matrix Cookbook Eq. (81).
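
A sketch of the vector-calculus route under these hints (an outline, not the official solution): expand the squared norm with hint (1), then differentiate term by term.

```latex
\lVert Xw - y \rVert_2^2
  = (Xw - y)^\top (Xw - y)
  = w^\top X^\top X w \;-\; 2\,(X^\top y)^\top w \;+\; y^\top y

\nabla_w \lVert Xw - y \rVert_2^2
  = 2\,X^\top X w - 2\,X^\top y
  \qquad \text{(Eq.\ (81) with symmetric } B = X^\top X;\ \text{Eq.\ (69) with } a = X^\top y)
```

Setting the gradient to zero yields the normal equations X^T X w = X^T y, and hence w* = (X^T X)^{-1} X^T y, matching the scalar derivation.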

Each question is worth 10 points.
