DS 100 – Linear Regression Connections

by Suraj Rampure (suraj.rampure@berkeley.edu)


Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, and want to fit the model

$$\hat{y} = \theta_1 x + \theta_2$$

that minimizes L2 loss.

This is a problem you've likely seen multiple times.

Here, we will:

  1. Provide a warmup to the idea of optimizing scalar loss functions by finding the $\theta$ that minimizes the L2 loss of $\hat{y} = \theta x$
  2. Derive the solution for $\theta_1, \theta_2$ by taking partial derivatives and solving
  3. Show the connection between the solutions for $\theta_1, \theta_2$ and the formulas for linear regression given in traditional statistics courses
  4. Create a feature matrix $\phi$ and weight vector $\theta$, and show that the solution $\theta = (\phi^T \phi)^{-1} \phi^T y$ yields the same solution as in (2)

Note: In lecture, this was referred to as the "slope-intercept model."

1. Warmup

Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, and want to fit the model

$$\hat{y} = \theta x$$

where $x, \theta, \hat{y}$ are all scalars.

Our total loss is then

$$L(\theta) = \sum_{i=1}^n (y_i - \theta x_i)^2$$

Taking the partial derivative with respect to $\theta$ and setting equal to 0:

$$\frac{\partial L}{\partial \theta} = -2 \sum_{i=1}^n x_i (y_i - \theta x_i) = 0 \implies \theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$
This result should be familiar.
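As a quick numerical sanity check (on made-up toy data), a minimal NumPy sketch of this closed form, compared against NumPy's own least squares solver with no intercept column:

```python
import numpy as np

# Toy data, made up purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Closed-form minimizer of sum_i (y_i - theta * x_i)^2.
theta = np.sum(x * y) / np.sum(x ** 2)

# Same model fit by np.linalg.lstsq: a single feature column, no intercept.
theta_lstsq, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
assert np.isclose(theta, theta_lstsq[0])
```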

Now, we can move onto the problem at hand.

2. 2D Linear Regression, as an Optimization Problem

Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, and want to fit the model

$$\hat{y} = \theta_1 x + \theta_2$$

Using L2 loss, we have:

$$L(\theta_1, \theta_2) = \sum_{i=1}^n \big(y_i - (\theta_1 x_i + \theta_2)\big)^2$$
Up until now, we've only dealt with single-parameter models (i.e. just one $\theta$), where we only needed to take a single derivative and optimize over one variable. Since we now have two $\theta$s, we will need to take partial derivatives with respect to each. We will then end up with a system of two equations and two unknowns, allowing us to solve for $\theta_1$ and $\theta_2$.

(Note: We will simplify by writing $\mu_x = \frac{1}{n} \sum_{i=1}^n x_i$ and $\mu_y = \frac{1}{n} \sum_{i=1}^n y_i$.)

Taking the partial derivative with respect to $\theta_1$ and setting equal to 0:

$$\frac{\partial L}{\partial \theta_1} = -2 \sum_{i=1}^n x_i (y_i - \theta_1 x_i - \theta_2) = 0 \implies \sum_{i=1}^n x_i y_i = \theta_1 \sum_{i=1}^n x_i^2 + \theta_2 n \mu_x \tag{1}$$

Taking the partial derivative with respect to $\theta_2$ and setting equal to 0:

$$\frac{\partial L}{\partial \theta_2} = -2 \sum_{i=1}^n (y_i - \theta_1 x_i - \theta_2) = 0 \implies n \mu_y = \theta_1 n \mu_x + n \theta_2 \tag{2}$$

where in the last step, we used the fact that $\sum_{i=1}^n \theta_2 = n\theta_2$.

Now, we have a system of two equations and two unknowns, $\theta_1$ and $\theta_2$. To solve, we can isolate $\theta_2$ in the second equation:

$$\theta_2 = \mu_y - \theta_1 \mu_x$$

and substitute this into equation (1):

$$\sum_{i=1}^n x_i y_i = \theta_1 \sum_{i=1}^n x_i^2 + (\mu_y - \theta_1 \mu_x) \, n \mu_x = \theta_1 \left( \sum_{i=1}^n x_i^2 - n \mu_x^2 \right) + n \mu_x \mu_y$$

This gives us our final solution

$$\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n \mu_x \mu_y}{\sum_{i=1}^n x_i^2 - n \mu_x^2}, \qquad \theta_2 = \mu_y - \theta_1 \mu_x$$
Technically, to be complete, we'd need to check the second-order conditions — that $\frac{\partial^2 L}{\partial \theta_1^2}$ and $\frac{\partial^2 L}{\partial \theta_2^2}$ are both positive (and, more carefully, that the Hessian of $L$ is positive semidefinite) — to confirm this critical point is a minimum, but we can take that leap of faith for now.
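To see these closed forms in action, here is a small NumPy sketch (on made-up data) that computes $\theta_1, \theta_2$ from the formulas above and checks them against `np.polyfit`, NumPy's degree-1 least squares fit:

```python
import numpy as np

# Toy data, made up purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Closed-form solution derived above.
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

# np.polyfit with degree 1 solves the same least squares problem.
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(theta_1, slope) and np.isclose(theta_2, intercept)
```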

3. Connection to Previous Statistics Courses

These definitions of $\theta_1$ and $\theta_2$ we just derived should look very similar to the formulas for linear regression you might have learned in previous courses, such as Data 8.

In such courses, we define $r$, the correlation coefficient, as:

$$r = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i - \mu_x}{\sigma_x} \right) \left( \frac{y_i - \mu_y}{\sigma_y} \right)$$

where $\mu$ and $\sigma$ represent the empirical mean and standard deviation, respectively.

From there, the parameters $a, b$ (slope and intercept) are defined as

$$a = r \frac{\sigma_y}{\sigma_x}, \qquad b = \mu_y - a \mu_x$$
It is easy to see that this definition of $b$ matches the $\theta_2$ we solved for in the previous section (we had $b = \mu_y - a \mu_x$ and $\theta_2 = \mu_y - \theta_1 \mu_x$). Let's show that $a$ and $\theta_1$ are also the same:

$$a = r \frac{\sigma_y}{\sigma_x} = \frac{1}{n} \sum_{i=1}^n \frac{(x_i - \mu_x)(y_i - \mu_y)}{\sigma_x^2} = \frac{\sum_{i=1}^n x_i y_i - n \mu_x \mu_y}{\sum_{i=1}^n x_i^2 - n \mu_x^2} = \theta_1$$

where in the last step we expanded the numerator, $\sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y) = \sum_{i=1}^n x_i y_i - n \mu_x \mu_y$, and the denominator, $n \sigma_x^2 = \sum_{i=1}^n (x_i - \mu_x)^2 = \sum_{i=1}^n x_i^2 - n \mu_x^2$, as required.
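The equivalence is easy to verify numerically. A short NumPy sketch (on made-up data), using the Data 8-style $r$, $\sigma$ definitions — note `np.std` defaults to the population standard deviation, which is the $\sigma$ used above:

```python
import numpy as np

# Toy data, made up purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Statistics-course formulas: correlation coefficient, then slope and intercept.
sigma_x, sigma_y = x.std(), y.std()  # population std (ddof=0)
r = np.mean((x - mu_x) / sigma_x * (y - mu_y) / sigma_y)
a = r * sigma_y / sigma_x
b = mu_y - a * mu_x

# Optimization-based solution from the previous section.
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

assert np.isclose(a, theta_1) and np.isclose(b, theta_2)
```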

4. Matrix Formulation

Now, instead of dealing with purely scalar values, we will introduce vectors and matrices.

Given our feature matrix $\phi$ and values vector $y$, we want to find a vector $\theta$ that best approximates $y \approx \phi \theta$; specifically, we want the $\theta$ that minimizes $\lVert y - \phi \theta \rVert_2^2$. This solution is given by $\theta = (\phi^T \phi)^{-1} \phi^T y$.

The matrix formulation is more robust than the ones we've seen previously in that we can select features that are more complicated than linear: for example, we could find parameters to minimize estimation error for $ax + bx^2 + ce^x \tan(x^2)$ if we wanted to. However, we're going to keep the problem the same, and try to model using $\hat{y} = \theta_1 x + \theta_2$.
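To illustrate that flexibility, here is a hedged sketch: the feature names and data below are made up, but the pattern — stacking arbitrary nonlinear functions of $x$ as columns of a feature matrix and solving least squares — is exactly the technique described above.

```python
import numpy as np

# Made-up data generated from a known (mostly polynomial) rule plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.1, 1.0, 20)
y = 2 * x + 0.5 * x ** 2 + rng.normal(scale=0.01, size=x.size)

# Each column of Phi is one (possibly nonlinear) feature of x:
# here x, x^2, and e^x * tan(x^2), matching the example in the text.
Phi = np.column_stack([x, x ** 2, np.exp(x) * np.tan(x ** 2)])

# Least squares over these features; lstsq avoids explicitly inverting Phi^T Phi.
params, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The fitted combination of features should approximate y closely.
residual = np.linalg.norm(Phi @ params - y)
```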

(Important note: now $\theta$ is a vector, but $\theta_1, \theta_2$ are still scalars!)

Our matrices and vectors are defined as follows:

$$\phi = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Now, let's compute $(\phi^T \phi)^{-1} \phi^T y$ to find the matrix least squares solution for $\theta_1$ and $\theta_2$. (Be warned: this will be relatively algebra heavy. Feel free to skim over the results.)

$$\phi^T \phi = \begin{bmatrix} \sum_{i=1}^n x_i^2 & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n x_i^2 & n \mu_x \\ n \mu_x & n \end{bmatrix}, \qquad \phi^T y = \begin{bmatrix} \sum_{i=1}^n x_i y_i \\ n \mu_y \end{bmatrix}$$

Recall, the inverse of a 2x2 matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is given by $\frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

$$(\phi^T \phi)^{-1} = \frac{1}{n \sum_{i=1}^n x_i^2 - n^2 \mu_x^2} \begin{bmatrix} n & -n \mu_x \\ -n \mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}$$

$$\theta = (\phi^T \phi)^{-1} \phi^T y = \frac{1}{n \sum_{i=1}^n x_i^2 - n^2 \mu_x^2} \begin{bmatrix} n \sum_{i=1}^n x_i y_i - n^2 \mu_x \mu_y \\ n \mu_y \sum_{i=1}^n x_i^2 - n \mu_x \sum_{i=1}^n x_i y_i \end{bmatrix} = \begin{bmatrix} \dfrac{\sum_{i=1}^n x_i y_i - n \mu_x \mu_y}{\sum_{i=1}^n x_i^2 - n \mu_x^2} \\[2.5ex] \dfrac{\mu_y \sum_{i=1}^n x_i^2 - \mu_x \sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 - n \mu_x^2} \end{bmatrix}$$
Here, we see that $\theta_1 = \dfrac{\sum_{i=1}^n x_i y_i - n \mu_x \mu_y}{\sum_{i=1}^n x_i^2 - n \mu_x^2}$, as we saw earlier.

Also, we have

$$\theta_2 = \frac{\mu_y \sum_{i=1}^n x_i^2 - \mu_x \sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 - n \mu_x^2} = \frac{\mu_y \left( \sum_{i=1}^n x_i^2 - n \mu_x^2 \right) - \mu_x \left( \sum_{i=1}^n x_i y_i - n \mu_x \mu_y \right)}{\sum_{i=1}^n x_i^2 - n \mu_x^2} = \mu_y - \theta_1 \mu_x$$

as we saw earlier. We've now shown that the normal equation solution $\theta = (\phi^T \phi)^{-1} \phi^T y$ gives the same values for $\theta_1, \theta_2$ as the scalar optimization method did.
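We can confirm this agreement numerically as well. A minimal NumPy sketch (on made-up data) that builds the feature matrix, applies the normal equation directly, and compares against the scalar formulas — explicitly inverting $\phi^T \phi$ is fine for a 2x2 illustration, though `np.linalg.lstsq` is preferred numerically:

```python
import numpy as np

# Toy data, made up purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Feature matrix: a column of x's and a column of ones (for the intercept).
Phi = np.column_stack([x, np.ones(n)])

# Normal equation: theta = (Phi^T Phi)^{-1} Phi^T y.
theta = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y

# Scalar closed-form solution from section 2.
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

assert np.allclose(theta, [theta_1, theta_2])
```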


This entire time, we've been looking at the same problem: finding optimal parameters that minimize the L2 loss of

$$\hat{y} = \theta_1 x + \theta_2$$

We looked at three solutions to the same problem, and showed that they're all equivalent. We should expect this to be the case, but it's nice to see these connections laid out explicitly.