## DS 100 – Linear Regression Connections

by Suraj Rampure (suraj.rampure@berkeley.edu)

## Introduction

Suppose we're given some set of points $\{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \}$, and want to fit the model

$$\hat{y} = \theta_1 x + \theta_2$$

that minimizes $L_2$ loss.

This is a problem you've likely seen multiple times.

- In previous statistics courses, like Data 8, you've derived expressions for $\theta_1, \theta_2$ (usually called $a, b$) in terms of $r$, known as the "correlation coefficient"
- In this course, you've learned the tools to formulate this as a scalar optimization problem (take the derivative of the loss, set it to $0$, and solve for the parameters)
- You've also learned the rules to formulate this as a matrix calculus optimization problem, and can use the normal equation to find a solution

Here, we will:

1. Provide a warmup to the idea of optimizing scalar loss functions by finding the $\theta$ that minimizes the $L_2$ loss of $y = \theta x$
2. Derive the solution for $\theta_1, \theta_2$ by taking partial derivatives and solving
3. Show the connection between the solutions for $\theta_1, \theta_2$ and the formulas for linear regression given in traditional statistics courses
4. Create a feature matrix $\phi$ and weight vector $\theta$ and show that the solution $\theta^* = (\phi^T \phi)^{-1} \phi^T y$ yields the same solution as in (2)

Note: In lecture, this was referred to as the "slope-intercept model."

## 1. Warmup

Suppose we're given some set of points $\{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \}$, and want to fit the model

$$\hat{y} = \theta x$$

where $x, \theta, \hat{y}$ are all scalars.

Our total loss is then

$$L(\theta) = \sum_{i=1}^n (y_i - \theta x_i)^2 = (y_1 - \theta x_1)^2 + (y_2 - \theta x_2)^2 + \ldots + (y_n - \theta x_n)^2$$

Taking the derivative with respect to $\theta$ and setting it equal to $0$:

$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^n 2(y_i - \theta x_i)(-x_i) = 2\sum_{i=1}^n \left(\theta x_i^2 - x_i y_i\right) = 0$$

$$\theta \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i \implies \theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$

This result should be familiar.
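
As a quick sanity check, this closed form is easy to verify numerically. The sketch below uses arbitrary made-up data, and confirms that the formula's $\theta$ does at least as well as nearby values:

```python
import numpy as np

# Arbitrary example data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Closed-form solution: theta = (sum of x_i y_i) / (sum of x_i^2)
theta = np.sum(x * y) / np.sum(x ** 2)

# The L2 loss at theta should be no larger than at nearby values
loss = lambda t: np.sum((y - t * x) ** 2)
assert loss(theta) <= loss(theta + 0.01)
assert loss(theta) <= loss(theta - 0.01)
print(theta)  # ≈ 2.03
```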

Now, we can move on to the problem at hand.

## 2. 2D Linear Regression, as an Optimization Problem

Suppose we're given some set of points $\{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \}$, and want to fit the model

$$\hat{y} = \theta_1 x + \theta_2$$

Using $L_2$ loss, we have:

$$L(\theta_1, \theta_2) = \sum_{i=1}^n (y_i - \theta_1 x_i - \theta_2)^2$$

Since we have two $\theta$s, we will need to take partial derivatives with respect to each. We will then end up with a system of two equations and two unknowns, allowing us to solve for $\theta_1$ and $\theta_2$. Up until now, we've only dealt with single-parameter models (i.e. just one $\theta$), where we only needed to take a single derivative and optimize over one variable.

(Note: We will simplify by using $\mu_x = \frac{1}{n} \sum_{i=1}^n x_i$ and $\mu_y = \frac{1}{n} \sum_{i=1}^n y_i$.)

Taking the partial derivative with respect to $\theta_1$ and setting it equal to $0$:

$$\frac{\partial L}{\partial \theta_1} = \sum_{i=1}^n 2(y_i - \theta_1 x_i - \theta_2)(-x_i) = 2\sum_{i=1}^n \left(\theta_1 x_i^2 + \theta_2 x_i - x_i y_i\right) = 0$$

$$\theta_1 \sum_{i=1}^n x_i^2 + n\mu_x\theta_2 = \sum_{i=1}^n x_i y_i \tag{1}$$

Taking the partial derivative with respect to $\theta_2$ and setting it equal to $0$:

$$\frac{\partial L}{\partial \theta_2} = \sum_{i=1}^n 2(y_i - \theta_1 x_i - \theta_2)(-1) = 2\sum_{i=1}^n \left(\theta_1 x_i + \theta_2 - y_i\right) = 0$$

$$\theta_1 \sum_{i=1}^n x_i + n\theta_2 = \sum_{i=1}^n y_i \implies n\mu_x\theta_1 + n\theta_2 = n\mu_y \implies \mu_x\theta_1 + \theta_2 = \mu_y \tag{2}$$

where in the last step, we used the fact that $\sum_{i=1}^n \theta_2 = n\theta_2$.

Now, we have a system of two equations and two unknowns in $\theta_1, \theta_2$. To solve, we can isolate $\theta_2$:

$$\theta_2 = \mu_y - \theta_1\mu_x$$

and substitute this into equation $(1)$:

$$\theta_1 \sum_{i=1}^n x_i^2 + n\mu_x(\mu_y - \theta_1\mu_x) = \sum_{i=1}^n x_i y_i$$

$$\theta_1 \left(\sum_{i=1}^n x_i^2 - n\mu_x^2\right) + n\mu_x\mu_y = \sum_{i=1}^n x_i y_i$$

$$\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}$$

This gives us our final solution

$$\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}, \qquad \theta_2 = \mu_y - \theta_1\mu_x$$

Technically, to be complete, we'd need to check $\frac{\partial^2 L}{\partial \theta_1^2}$ and $\frac{\partial^2 L}{\partial \theta_2^2}$ and ensure that they're both positive (more precisely, that the Hessian of $L$ is positive semi-definite), but we can take that leap of faith for now.
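
These formulas are straightforward to check in code. Here's a small sketch (with arbitrary made-up data) comparing them against `np.polyfit`, which fits the same least squares line:

```python
import numpy as np

# Arbitrary example data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.1, 3.8, 5.2])

n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Closed-form solutions derived above
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

# np.polyfit(x, y, 1) returns [slope, intercept] of the least squares line
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(theta_1, slope) and np.isclose(theta_2, intercept)
```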

## 3. Connection to Previous Statistics Courses

This solution for $\theta_1$ and $\theta_2$ should look very similar to the definition of linear regression you might have learned in previous courses, such as Data 8.

In such courses, we define r$r$, the correlation coefficient, as:

$$r = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - \mu_x}{\sigma_x}\right)\left(\frac{y_i - \mu_y}{\sigma_y}\right)$$

where $\mu$ and $\sigma$ represent the empirical mean and standard deviation, respectively.

From there, the parameters $a, b$ are defined as

$$a = r\frac{\sigma_y}{\sigma_x}, \qquad b = \mu_y - a\mu_x$$

It is easy to see that this definition of $b$ matches the $\theta_2$ we solved for in the previous section (we had $b = \mu_y - a\mu_x$ and $\theta_2 = \mu_y - \theta_1\mu_x$). Let's show that $a$ and $\theta_1$ are also the same:

$$\begin{aligned}
r\frac{\sigma_y}{\sigma_x} &= \frac{1}{n}\frac{\sigma_y}{\sigma_x}\sum_{i=1}^n \left(\frac{x_i - \mu_x}{\sigma_x}\right)\left(\frac{y_i - \mu_y}{\sigma_y}\right) \\
&= \frac{1}{n\sigma_x^2}\sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y) \\
&= \frac{1}{n\sigma_x^2}\left(\sum_{i=1}^n x_i y_i - \mu_y\sum_{i=1}^n x_i - \mu_x\sum_{i=1}^n y_i + \sum_{i=1}^n \mu_x\mu_y\right) \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y - n\mu_x\mu_y + n\mu_x\mu_y}{n \cdot \frac{1}{n}\sum_{i=1}^n (x_i - \mu_x)^2} \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - 2\mu_x\sum_{i=1}^n x_i + \sum_{i=1}^n \mu_x^2} \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} = \theta_1
\end{aligned}$$

as required.
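
This equivalence can also be confirmed numerically. In the sketch below (arbitrary data once more), `np.corrcoef` computes $r$, and the $\sigma$s use the population ($\frac{1}{n}$) convention to match the definition above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 3.9, 4.1, 5.5])

n = len(x)
mu_x, mu_y = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()  # np.std defaults to the 1/n convention

r = np.corrcoef(x, y)[0, 1]  # correlation coefficient
a = r * sigma_y / sigma_x    # slope, as defined in Data 8

theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
assert np.isclose(a, theta_1)
```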

## 4. Matrix Formulation

Now, instead of dealing with purely scalar values, we will introduce vectors and matrices.

Given our feature matrix $\phi$ and values vector $y$, we want to find a vector $\theta$ that best approximates $y = \phi\theta$; specifically, we want the $\theta$ that minimizes $\| y - \phi\theta \|_2^2$. This solution is given by $\theta^* = (\phi^T\phi)^{-1}\phi^T y$.

The matrix formulation is more robust than the ones we've seen previously in that we can select features that are more complicated than linear ones - for example, we could find parameters to minimize estimation error for $ax + bx^2 + ce^x - \tan(x^2)$ if we wanted to. However, we're going to keep the problem the same, and try to model using $y = \theta_1 x + \theta_2$.

(Important note: Now, $\theta$ is a vector, but $\theta_1, \theta_2$ are still scalars!)

Our matrices and vectors are defined as follows:

$$\phi = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
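
As a concrete illustration, here's one way such a $\phi$ could be built with NumPy (the $x$ values are made up): a column of $x_i$s stacked next to a column of ones.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Feature matrix: one row per data point, columns [x_i, 1]
phi = np.column_stack([x, np.ones_like(x)])
print(phi)
# [[1. 1.]
#  [2. 1.]
#  [3. 1.]]
```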

Now, let's compute $(\phi^T\phi)^{-1}\phi^T y$ to find the matrix least squares solution for $\theta_1$ and $\theta_2$. (Be warned: this will be relatively algebra heavy. Feel free to skim over the results.)

$$\phi^T\phi = \begin{bmatrix} x_1 & x_2 & \ldots & x_n \\ 1 & 1 & \ldots & 1 \end{bmatrix}\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n x_i^2 & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n x_i^2 & n\mu_x \\ n\mu_x & n \end{bmatrix}$$

Recall, the inverse of a $2 \times 2$ matrix $\left[ \begin{matrix} a & b \\ c & d \end{matrix} \right]$ is given by $\frac{1}{ad - bc}\left[ \begin{matrix} d & -b \\ -c & a \end{matrix} \right]$.

Then:

$$(\phi^T\phi)^{-1} = \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}$$

$$(\phi^T\phi)^{-1}\phi^T y = \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}\begin{bmatrix} x_1 & x_2 & \ldots & x_n \\ 1 & 1 & \ldots & 1 \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

$$= \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}\begin{bmatrix} \sum_{i=1}^n x_i y_i \\ n\mu_y \end{bmatrix}$$

$$= \begin{bmatrix} \dfrac{n\sum_{i=1}^n x_i y_i - n^2\mu_x\mu_y}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2} \\[2ex] \dfrac{-n\mu_x\sum_{i=1}^n x_i y_i + n\mu_y\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2} \end{bmatrix}$$

$$\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} \dfrac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \\[2ex] \dfrac{-\mu_x\sum_{i=1}^n x_i y_i + \mu_y\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \end{bmatrix}$$

Here, we see that $\theta_1 = \frac{\sum_{i = 1}^n x_i y_i - n \mu_x \mu_y}{\sum_{i = 1}^n x_i^2 - n\mu_x^2}$, as we saw earlier.

Also, we have

$$\mu_y - \theta_1\mu_x = \mu_y - \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}\mu_x = \frac{\mu_y\sum_{i=1}^n x_i^2 - n\mu_x^2\mu_y - \mu_x\sum_{i=1}^n x_i y_i + n\mu_x^2\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} = \frac{-\mu_x\sum_{i=1}^n x_i y_i + \mu_y\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2 - n\mu_x^2}$$

as we saw earlier. We've now shown that the solution to the normal equation $(\phi^T \phi)^{-1} \phi^T y$ gives the same values for $\theta_1, \theta_2$ as the scalar optimization method did.
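
To close the loop, the agreement is easy to check numerically. The sketch below (again with made-up data) solves the normal equation with `np.linalg.solve`, which is generally preferable to forming the inverse explicitly:

```python
import numpy as np

# Arbitrary example data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Scalar closed-form solution from Section 2
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

# Normal equation: (phi^T phi) theta = phi^T y
phi = np.column_stack([x, np.ones_like(x)])
theta_star = np.linalg.solve(phi.T @ phi, phi.T @ y)

assert np.allclose(theta_star, [theta_1, theta_2])
```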

## Conclusion

This entire time, we've been looking at the same problem: finding optimal parameters that minimize the $L_2$ loss of

$$\hat{y} = \theta_1 x + \theta_2$$

We looked at three solutions to the same problem, and showed that they're all equivalent. We should expect this to be the case, but it's nice to see these connections laid out explicitly.