given training data X (i.e., N d-dimensional records), the weight vector w* = (X^T X)^{-1} X^T r
minimizes the sum of squared errors defined on Slide #70. The derivation on Slide #71 uses
scalar calculus, which is very verbose.
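As a numerical sanity check of the closed form above, the following is a minimal NumPy sketch, not a derivation. The synthetic data, the random seed, and the helper names closed_form_weights and sse are assumptions made for illustration and do not come from the slides. It solves the normal equations (X^T X) w = X^T r and compares the result with NumPy's built-in least-squares solver.

```python
import numpy as np

def closed_form_weights(X, r):
    """Return w solving the normal equations (X^T X) w = X^T r."""
    # Solving the linear system is numerically safer than forming the inverse.
    return np.linalg.solve(X.T @ X, X.T @ r)

# Synthetic data (assumed for illustration): N records, d dimensions.
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0, 0.5])
r = X @ w_true + 0.1 * rng.normal(size=N)

w_star = closed_form_weights(X, r)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, r, rcond=None)

def sse(w):
    """Sum of squared errors of the linear model X w against targets r."""
    return float(np.sum((X @ w - r) ** 2))

print("closed form:", w_star)
print("lstsq      :", w_lstsq)
print("SSE at w*  :", sse(w_star))
```

Perturbing w_star in any direction should only increase sse, which is a quick way to convince yourself that it is a minimizer before working through the derivation.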
Explain why w* maximizes the likelihood of observing X (hint: check Slide #34).
Note that the sum of squared errors can be represented as ||Xw - y||_2^2, where ||·||_2
represents the vector length (i.e., the L2-norm). This observation allows us to use vector
calculus to derive the same conclusion in a less verbose manner.
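The equivalence between the record-by-record sum and the norm form can be checked numerically. The sketch below assumes the error is the plain sum of squared residuals (no 1/2 factor); the random X, y, and w are made up for illustration only.

```python
import numpy as np

# Small random problem (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
w = rng.normal(size=2)

residual = X @ w - y

# Scalar form: add up the squared error of each record.
sse_scalar = sum((X[i] @ w - y[i]) ** 2 for i in range(len(y)))

# Vector forms: squared L2-norm, and the inner product of the residual with itself.
sse_norm = np.linalg.norm(residual, ord=2) ** 2
sse_inner = residual @ residual

print(sse_scalar, sse_norm, sse_inner)
```

All three printed values should agree up to floating-point rounding.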
Show how to compute w* using vector calculus.
Hints: (1) ||x||_2^2 = x^T x, (2) Matrix Cookbook Eq. (69), where a generalizes to a matrix
A, (3) Matrix Cookbook Eq. (81).
Each question is worth 10 points.
