by Suraj Rampure (suraj.rampure@berkeley.edu)

Suppose we're given some set of points $\{(x_i, y_i)\}_{i=1}^n$, and we want to find the line $\hat{y} = \theta_1 + \theta_2 x$ that minimizes the L2 loss

$$L(\theta_1, \theta_2) = \sum_{i=1}^n \left(y_i - (\theta_1 + \theta_2 x_i)\right)^2$$

This is a problem you've likely seen multiple times.

- In previous statistics courses, like Data 8, you've derived expressions for $\theta_1, \theta_2$ (usually called $a, b$) in terms of $r$, known as the "correlation coefficient"
- In this course, you've learned the tools to formulate this as a scalar optimization problem (take the derivative of the loss, set it to 0, and solve for the parameters)
- You've also learned the rules to formulate this as a matrix calculus optimization problem, and can use the normal equation to find a solution

Here, we will:

- Provide a warmup to the idea of optimizing scalar loss functions by finding the $\theta$ that optimizes the L2 loss of $y = \theta x$
- Derive the solution for $\theta_1, \theta_2$ by taking partial derivatives and solving
- Show the connection between the solutions for $\theta_1, \theta_2$ and the formulas for linear regression given in traditional statistics courses
- Create a feature matrix $\phi$ and weights vector $\theta$, and show that the solution $\theta^* = (\phi^T \phi)^{-1} \phi^T y$ yields the same solution as in (2)

Note: In lecture, this was referred to as the "slope-intercept model."

Suppose we're given some set of points $\{(x_i, y_i)\}_{i=1}^n$, and we want to fit the model $\hat{y} = \theta x$, where $\theta$ is a single scalar parameter. Our total loss is then

$$L(\theta) = \sum_{i=1}^n (y_i - \theta x_i)^2$$

Taking the derivative with respect to $\theta$ and setting it equal to 0:

$$\frac{dL}{d\theta} = \sum_{i=1}^n -2x_i(y_i - \theta x_i) = 0 \implies \sum_{i=1}^n x_i y_i = \theta \sum_{i=1}^n x_i^2 \implies \theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$

This result should be familiar.
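As a quick sanity check, the warmup solution can be verified numerically. This is a minimal sketch with made-up data; it compares the closed-form $\theta = \sum x_i y_i / \sum x_i^2$ against NumPy's least-squares solver and confirms the derivative of the loss vanishes there.

```python
import numpy as np

# Made-up illustration data for the no-intercept model y_hat = theta * x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Closed-form solution from the derivation above.
theta = np.sum(x * y) / np.sum(x ** 2)

# Compare against numpy's least-squares solver on the single-column design.
theta_lstsq = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]
assert np.isclose(theta, theta_lstsq)

# The derivative of the loss at theta should be (numerically) zero.
grad = np.sum(-2 * x * (y - theta * x))
assert np.isclose(grad, 0.0)
```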

Now, we can move onto the problem at hand.

Suppose we're given some set of points $\{(x_i, y_i)\}_{i=1}^n$, and we want to fit the model $\hat{y} = \theta_1 + \theta_2 x$. Using the L2 loss, our total loss is

$$L(\theta_1, \theta_2) = \sum_{i=1}^n \left(y_i - (\theta_1 + \theta_2 x_i)\right)^2$$

Since we have two parameters, we need to take the partial derivative of $L$ with respect to each, set both equal to 0, and solve the resulting system of equations.

(Note: We will simplify by using $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ to denote the means of the $x_i$ and $y_i$.)

Taking the partial derivative with respect to $\theta_1$ and setting it equal to 0:

$$\frac{\partial L}{\partial \theta_1} = \sum_{i=1}^n -2\left(y_i - \theta_1 - \theta_2 x_i\right) = 0 \implies n\bar{y} - n\theta_1 - n\theta_2\bar{x} = 0 \implies \theta_1 = \bar{y} - \theta_2\bar{x}$$

Taking the partial derivative with respect to $\theta_2$ and setting it equal to 0:

$$\frac{\partial L}{\partial \theta_2} = \sum_{i=1}^n -2x_i\left(y_i - \theta_1 - \theta_2 x_i\right) = 0 \implies \sum_{i=1}^n x_i y_i - n\theta_1\bar{x} - \theta_2\sum_{i=1}^n x_i^2 = 0$$

where in the last step, we used the fact that $\sum_{i=1}^n x_i = n\bar{x}$.

Now, we have a system of two equations and two unknowns in $\theta_1, \theta_2$. From the first equation, we can write

$$\theta_1 = \bar{y} - \theta_2\bar{x}$$

and substitute this into the second equation:

$$\sum_{i=1}^n x_i y_i - n(\bar{y} - \theta_2\bar{x})\bar{x} - \theta_2\sum_{i=1}^n x_i^2 = 0 \implies \theta_2\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right) = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}$$

This gives us our final solution:

$$\theta_2 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \theta_1 = \bar{y} - \theta_2\bar{x}$$

Technically, to be complete, we'd need to check that this critical point is a minimum rather than a maximum or saddle point. Since the loss $L$ is a convex quadratic in $\theta_1, \theta_2$, the critical point we found is indeed the global minimum.
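The closed-form solution above can be checked numerically. This is a minimal sketch with arbitrary illustration data, comparing the derived formulas for $\theta_1, \theta_2$ against `np.polyfit`, which fits the same least-squares line.

```python
import numpy as np

# Arbitrary illustration data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.1, 3.8, 5.2])

# Closed-form solution from the partial-derivative derivation.
x_bar, y_bar = x.mean(), y.mean()
theta_2 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta_1 = y_bar - theta_2 * x_bar

# np.polyfit with degree 1 fits the same least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)
assert np.isclose(theta_2, slope)
assert np.isclose(theta_1, intercept)
```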

This definition for $\theta_1, \theta_2$ may look different from the one given in traditional statistics courses, like Data 8.

In such courses, we define the correlation coefficient

$$r = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)$$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of the $x_i$ and $y_i$, respectively.

From there, the parameters $b$ (the slope) and $a$ (the intercept) are defined as

$$b = r\frac{\sigma_y}{\sigma_x}, \qquad a = \bar{y} - b\bar{x}$$

It is easy to see that this definition of $b$ is equivalent to our $\theta_2$:

$$b = r\frac{\sigma_y}{\sigma_x} = \frac{1}{n}\sum_{i=1}^n \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x\sigma_y}\cdot\frac{\sigma_y}{\sigma_x} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n\sigma_x^2} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \theta_2$$

as required, where in the last step we used the fact that $n\sigma_x^2 = \sum_{i=1}^n (x_i - \bar{x})^2$. Since $a = \bar{y} - b\bar{x}$ has the same form as $\theta_1 = \bar{y} - \theta_2\bar{x}$, the intercepts match as well.
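The equivalence can also be confirmed numerically. This is a minimal sketch with made-up data, computing the Data 8-style quantities ($r$, $\sigma_x$, $\sigma_y$) and checking that $b = r\,\sigma_y/\sigma_x$ matches the slope from the partial-derivative solution.

```python
import numpy as np

# Made-up illustration data.
x = np.array([1.0, 3.0, 5.0, 7.0])
y = np.array([2.0, 3.5, 7.0, 8.5])

x_bar, y_bar = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()  # population SDs, as in Data 8
r = np.mean(((x - x_bar) / sigma_x) * ((y - y_bar) / sigma_y))

b = r * sigma_y / sigma_x            # slope via correlation coefficient
a = y_bar - b * x_bar                # intercept

# Slope from the partial-derivative derivation.
theta_2 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
assert np.isclose(b, theta_2)
assert np.isclose(a, y_bar - theta_2 * x_bar)
```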

Now, instead of dealing with purely scalar values, we will introduce vectors and matrices.

Given our feature matrix $\phi$ and observation vector $y$, we want to find the $\theta$ that minimizes $\lVert y - \phi\theta \rVert_2^2$. The solution is given by the normal equation: $\theta^* = (\phi^T\phi)^{-1}\phi^T y$.

The matrix formulation is more robust than the ones we've seen previously, in that we can select features that are more complicated than linear: for example, we could find parameters to minimize estimation error for a model like $\hat{y} = \theta_1 + \theta_2 x + \theta_3 x^2$.

(Important note: Now, $\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$ is a vector of parameters, not a scalar.)

Our matrices and vectors are defined as follows:

$$\phi = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$$

Now, let's compute $\phi^T\phi$ and $\phi^T y$:

$$\phi^T\phi = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}, \qquad \phi^T y = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix}$$

Recall, the inverse of a 2x2 matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is $\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

Then:

$$(\phi^T\phi)^{-1} = \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\begin{bmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{bmatrix}$$

$$\theta^* = (\phi^T\phi)^{-1}\phi^T y = \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\begin{bmatrix} \sum_i x_i^2 \sum_i y_i - \sum_i x_i \sum_i x_i y_i \\ n\sum_i x_i y_i - \sum_i x_i \sum_i y_i \end{bmatrix}$$

Here, we see that the second component of $\theta^*$ is

$$\theta_2^* = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2} = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

(dividing the numerator and denominator by $n$), which matches our earlier solution for $\theta_2$.

Also, we have

$$\theta_1^* = \frac{\sum_i x_i^2 \sum_i y_i - \sum_i x_i \sum_i x_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2} = \bar{y} - \theta_2^*\bar{x}$$

as we saw earlier. We've now shown that the solution to the normal equation $\theta^* = (\phi^T\phi)^{-1}\phi^T y$ is the same as the one we found via partial derivatives.
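The matrix derivation can be sketched numerically as well: build the feature matrix with a column of ones, solve the normal equation directly, and check the result against the scalar closed-form solution. Data here is arbitrary illustration values.

```python
import numpy as np

# Arbitrary illustration data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.1, 3.8, 5.2])

# Feature matrix with a column of ones (intercept) and a column of x values.
phi = np.column_stack([np.ones_like(x), x])

# Normal equation: theta* = (phi^T phi)^{-1} phi^T y.
theta_star = np.linalg.inv(phi.T @ phi) @ phi.T @ y

# Scalar closed-form solution from the earlier derivation.
x_bar, y_bar = x.mean(), y.mean()
theta_2 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta_1 = y_bar - theta_2 * x_bar

assert np.allclose(theta_star, [theta_1, theta_2])
```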

This entire time, we've been looking at the same problem: finding the optimal parameters $\theta_1, \theta_2$ that minimize the L2 loss of the model $\hat{y} = \theta_1 + \theta_2 x$.

We looked at three solutions to the same problem, and showed that they're all equivalent. We should expect this to be the case, but it's nice to see these connections laid out explicitly.