## Data 100, Discussion 5 – Transformations

by Suraj Rampure (suraj.rampure@berkeley.edu)

This notebook is meant to supplement the problem on data transformations from Discussion 5 of Data 100, Spring 2019.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
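# 20 x-values spread over roughly [1, 11): an evenly spaced grid plus uniform noise in [0, 1)
x = np.array([t + np.random.random() for t in np.linspace(1, 10, 20)])
# y is quadratic in x, plus uniform noise in [0, 15)
y = np.array([xi ** 2 + np.random.random() * 15 for xi in x])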


Let's see what our data looks like without any transformations. Pay attention to the axes throughout this notebook – the first plot looks similar to the one in the worksheet, and the second uses the same range on both axes.

In [3]:
plt.scatter(x, y);

In [4]:
plt.scatter(x, y)
plt.axis([0, 100, 0, 100]);


Notice that the relationship in our data is $y \approx x^2$. To linearize our data, roughly speaking, we want to make $y$ "smaller" or make $x$ "bigger".

#### First, we'll look at the resulting plot when plotting $x$ vs $\log(y)$:

In [5]:
plt.scatter(x, np.log(y))
plt.axis([0, 10, 0, 10]);


This transformation did a decent job of bringing the magnitudes of $x$ and $y$ closer to one another. However, it's not perfect – the points still curve, and $x$ now spans a noticeably wider range than $\log(y)$. This is because taking the log of $y$ linearizes exponential relationships (i.e. those of the form $y \approx e^x$), and our underlying relationship wasn't exponential:

$$\log(y) \approx \log(x^2) = 2\log(x)$$

Our transformed plot effectively plots $x$ vs $2\log(x)$, which isn't linear.
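The identity above also suggests a quick numerical check that is not part of the worksheet: if $\log(y) \approx 2\log(x)$, then fitting a line to $\log(x)$ vs $\log(y)$ should give a slope of about 2. The sketch below assumes noise-free stand-in data (`x_clean` and `y_clean` are hypothetical names, not the notebook's `x` and `y`); with the notebook's noisy data the slope would only be roughly 2, since the added noise matters most at small $x$.

```python
import numpy as np

# Assumption: noise-free stand-in for the notebook's data, so the fit is exact.
x_clean = np.linspace(1, 10, 20)
y_clean = x_clean ** 2

# Fit log(y) = slope * log(x) + intercept with a degree-1 polynomial.
slope, intercept = np.polyfit(np.log(x_clean), np.log(y_clean), 1)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The slope recovers the exponent of the power relationship, which is why taking the log of only one of the two variables leaves a curve here.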

#### Now, let's look at plotting $x^2$ vs $y$:

In [6]:
plt.scatter(x**2, y);


This relationship is almost perfectly linear. This makes sense; our original plot was of $x$ vs $x^2$, and our new plot is of $x^2$ vs $x^2$.
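One way to back up "almost perfectly linear" with numbers is to fit a straight line to the transformed data. This is a minimal sketch, not part of the original notebook: it regenerates data in the same way as cell 2 but with a fixed seed (an assumption, so that the result is reproducible), and the names `x_demo` and `y_demo` are hypothetical.

```python
import numpy as np

# Assumption: seeded generator for reproducibility; the notebook itself is unseeded.
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 10, 20) + rng.random(20)
y_demo = x_demo ** 2 + rng.random(20) * 15

# Least-squares line through (x^2, y): y = fit_slope * x^2 + fit_intercept
fit_slope, fit_intercept = np.polyfit(x_demo ** 2, y_demo, 1)

print(f"slope = {fit_slope:.2f}, intercept = {fit_intercept:.2f}")
```

Since $y$ was built as $x^2$ plus uniform noise on $[0, 15)$, we would expect a slope close to 1 and an intercept somewhere near the average noise value of 7.5, though the exact numbers depend on the random draw.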

#### And $x$ vs $\sqrt{y}$:

In [7]:
plt.scatter(x, np.sqrt(y));


This transformation accomplishes the same job as the previous one. Instead of plotting $x$ vs $x^2$, we plotted $x$ vs $\sqrt{x^2}$, which (since we're only looking at non-negative $x$) is equivalent to plotting $x$ vs $x$. Note: Even though this plot has almost exactly the same shape as the previous one, the axes are very different. Why is this the case?
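To investigate the question above, it can help to print the range of values on each axis for both plots. The sketch below is an illustration under assumptions, not part of the original notebook: the data is regenerated with a fixed seed, and `x_demo`, `y_demo`, and `spans` are hypothetical names.

```python
import numpy as np

# Assumption: seeded generator for reproducibility; the notebook itself is unseeded.
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 10, 20) + rng.random(20)
y_demo = x_demo ** 2 + rng.random(20) * 15

# Quantities that appear on the axes of the two linear-looking plots.
plotted = {
    "x": x_demo,
    "x^2": x_demo ** 2,
    "y": y_demo,
    "sqrt(y)": np.sqrt(y_demo),
}

# Span (max minus min) of each plotted quantity.
spans = {name: np.ptp(values) for name, values in plotted.items()}

for name, values in plotted.items():
    print(f"{name:>8}: min = {values.min():6.1f}, max = {values.max():6.1f}")
```

Compare the two axes of the $x^2$ vs $y$ plot with the two axes of the $x$ vs $\sqrt{y}$ plot.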

#### Now, let's consider the plots of $\log(x)$ vs $y$ and $x$ vs $y^2$:

In [8]:
plt.scatter(np.log(x), y)
plt.axis([0, 100, 0, 100]);

In [9]:
plt.scatter(x, y**2);


The last two transformations had the opposite effect.

With $\log(x)$ vs $y$, let $u = \log(x)$ be the quantity on the horizontal axis, so that $x = e^u$. The relationship we actually plotted was $y \approx (e^u)^2 = e^{2u}$, which is exponential in $u$. In the latter, the relationship we plotted was $y^2 \approx (x^2)^2 = x^4$ (note the scaled axes). Both of these transformations made the gap between the size of our inputs and the size of our outputs greater, and neither of them resulted in a roughly linear plot.
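As a rough summary of the whole notebook, we can compare the Pearson correlation coefficient $r$ for each pair we plotted. This is a sketch under assumptions rather than part of the original discussion: the data is regenerated with a fixed seed, `x_demo`, `y_demo`, and `pearson_r` are hypothetical names, and $r$ is only a crude measure of linearity (a curved relationship can still have a high $r$).

```python
import numpy as np

# Assumption: seeded generator for reproducibility; the notebook itself is unseeded.
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 10, 20) + rng.random(20)
y_demo = x_demo ** 2 + rng.random(20) * 15


def pearson_r(u, v):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]


# One entry per plot in the notebook: (horizontal axis, vertical axis).
r = {
    "x vs y": pearson_r(x_demo, y_demo),
    "x vs log(y)": pearson_r(x_demo, np.log(y_demo)),
    "x^2 vs y": pearson_r(x_demo ** 2, y_demo),
    "x vs sqrt(y)": pearson_r(x_demo, np.sqrt(y_demo)),
    "log(x) vs y": pearson_r(np.log(x_demo), y_demo),
    "x vs y^2": pearson_r(x_demo, y_demo ** 2),
}

for name, value in r.items():
    print(f"{name:>13}: r = {value:.3f}")
```

We would expect the $x^2$ vs $y$ pair to score higher than the $\log(x)$ vs $y$ and $x$ vs $y^2$ pairs, matching what the plots show.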