Ethan P. Marzban
2023-07-27
Let’s start off today’s lecture by taking stock of what we’ve done in this class so far.
We started off by talking about descriptive statistics, which seeks to summarize and describe data.
Next, we talked about probability, which sought to give us a framework with which to discuss randomness.
After that, we spent a few weeks talking about inference.
This refers to the situation in which we have a population governed by a parameter, and we wish to take samples from this population and use these samples to make inferences about the population parameter.
We also discussed how to compare population means across multiple populations, using two-sample t-tests and ANOVA.
Let’s take a moment to revisit some material from Lecture 2.
Specifically, suppose we have data of the form \(\{(y_i, \ x_i)\}_{i=1}^{n}\), where both x and y are numerical. We have previously seen that the best way to visualize the relationship between y and x is with a scatterplot.
The goal of statistical modeling, loosely speaking, is to try and model the relationship between x and y.
Specifically, we assume the relationship takes the form \[ \texttt{y} = f(\texttt{x}) + \texttt{noise} \] where \(f()\) is some function (e.g. linear, nonlinear, etc.)
Hang on, what’s the noise term doing?
Well, take a look at the previous scatterplots. Even though many of these display (what we visually would describe as) relationships between x and y, the datapoints do not fall perfectly along a single curve.
As a concrete example, suppose y represents weight and x represents height. We do believe there would be some sort of a positive association between weight and height (taller people tend to, in general, weigh more), and we may even assume the relationship is linear.
However, just because we know someone’s height, we don’t exactly know what their weight will be.
Alright, so: given variables y and x, we assume the relationship between them can be modeled as \[ \texttt{y} = f(\texttt{x}) + \texttt{noise} \]
Here is some terminology we use:
y is referred to as the response variable, or sometimes the dependent variable. x is referred to as the explanatory variable, or sometimes the independent variable.
We also have some additional terminology about the entire model, based on the type of the response variable. If y is numerical, we call the resulting model a regression model (or we just say we are in a regression setting). If y is categorical, we call the resulting model a classification model (or we just say we are in a classification setting).
So, for example, trying to model the relationship between weight and height (assuming weight is our response variable) is a regression problem, since weight is a numerical variable.
As an illustration of a classic classification problem, suppose we have access to a roster of passengers on the Titanic.
Tragically, not all passengers survived.
One question we may want to ask is: how did various factors (e.g. class, gender, etc.) affect whether or not a given passenger survived?
Notice how here, the response variable (i.e. whether or not someone survived) is a binary variable; that is, it takes only one of two values (survived or did not survive). As such, this is a classification problem.
I should note: perhaps confusingly, a problem in which the response is binary is sometimes called logistic regression.
But, that is a bit of a misnomer- it is technically a classification problem, since the response variable is categorical.
By the way, there is an actual dataset on the Titanic at https://www.kaggle.com/competitions/titanic
Again, we won’t have a chance to talk about logistic regression in this class- for that, I refer you to a Machine Learning class like PSTAT 131/231!
We don’t have to restrict ourselves to modeling the relationship between a response variable and a single explanatory variable.
For example, suppose y measures income. There are several factors that might contribute to someone’s monthly income: things like education_level, gender, industry, etc.
We can easily adapt a linear regression model to allow for multiple explanatory variables, which leads to what is known as a multivariate regression/classification model: \[ \texttt{y} = f(\texttt{x}_1, \ \cdots , \ \texttt{x}_k) + \texttt{noise} \]
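For a rough sense of what this looks like in practice, here is a minimal R sketch. The data frame and variable values below are made up purely for illustration (using the variable names from the income example), and R's built-in lm() function is used even though we have not formally introduced it yet; note that the formula syntax simply lists multiple explanatory variables on the right-hand side.

```r
set.seed(1)

# Hypothetical data: a numerical response (income) plus several
# explanatory variables of mixed types.
salaries <- data.frame(
  income          = rnorm(100, mean = 5000, sd = 800),
  education_level = sample(c("HS", "BA", "MA", "PhD"), 100, replace = TRUE),
  gender          = sample(c("F", "M"), 100, replace = TRUE),
  industry        = sample(c("tech", "health", "retail"), 100, replace = TRUE)
)

# A linear model with several explanatory variables; categorical
# variables are automatically encoded as indicator terms.
fit <- lm(income ~ education_level + gender + industry, data = salaries)
summary(fit)
```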
For now, let’s focus on modeling the relationship between a numerical response variable and a single numerical explanatory variable: \[ \texttt{y} = f(\texttt{x}) + \texttt{noise} \]
Of particular interest to statisticians are the class of linear models, which assume a linear form for the signal function.
Recall that a line is described by two parameters: an intercept and a slope. As such, to say we are assuming a “linear form” for \(f()\) is to say we are assuming \(f(x) = \beta_0 + \beta_1 \cdot x\), so our model becomes \[ \texttt{y} = \beta_0 + \beta_1 \cdot \texttt{x} + \texttt{noise} \]
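To make the role of the noise term concrete, here is a minimal R sketch under made-up parameter values (\(\beta_0 = 2\), \(\beta_1 = 0.5\), and normally distributed noise). The simulated points scatter around the underlying line rather than falling exactly on it, which is exactly the behavior we saw in the height/weight discussion.

```r
set.seed(42)

beta0 <- 2     # "true" intercept (chosen arbitrarily for illustration)
beta1 <- 0.5   # "true" slope
x     <- runif(50, min = 0, max = 10)    # explanatory variable
noise <- rnorm(50, mean = 0, sd = 1)     # the noise term
y     <- beta0 + beta1 * x + noise       # response = line + noise

plot(x, y)                                  # points scatter about the line
abline(a = beta0, b = beta1, col = "blue")  # the underlying (unobserved) line
```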
Now, before we delve into the mathematics and mechanics of model fitting, there is another thing we should be aware of.
As an example, consider the following two scatterplots:
The relationship between Y2 and X2 seems to be “stronger” than the relationship between Y1 and X1, does it not? Ultimately, we would like to develop a mathematical metric to quantify not only the relationship between two variables, but also the strength of the relationship between these two variables.
This quantity is referred to as the correlation coefficient.
Now, it turns out there are actually a few different correlation coefficients out there. The one we will use in this class (and one of the metrics that is very widely used by statisticians) is called Pearson’s Correlation Coefficient, or often just Pearson’s r (as we use the letter r to denote it.)
Given two sets \(X = \{x_i\}_{i=1}^{n}\) and \(Y = \{y_i\}_{i=1}^{n}\) (note that we require the two sets to have the same number of elements!), we compute r using the formula \[ r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \overline{x}}{s_X} \right) \left( \frac{y_i - \overline{y}}{s_Y} \right) \] where \(\overline{x}\) and \(\overline{y}\) denote the sample means, and \(s_X\) and \(s_Y\) the sample standard deviations, of the two sets.
I find it useful to sometimes consider extreme cases, and ensure that the math matches up with our intuition.
For example, consider the sets \(X = \{1, 2, 3\}\) and \(Y = \{1, 2, 3\}\).
From a scatterplot, I think we would all agree that \(X\) and \(Y\) have a positive linear relationship, and that the relationship is very strong!
Indeed, \(\overline{x} = 2 = \overline{y}\) and \(s_X = 1 = s_Y\), meaning \[\begin{align*} r & = \frac{1}{3 - 1} \left[ \left( \frac{1 - 2}{1} \right) \left( \frac{1 - 2}{1} \right) + \left( \frac{2 - 2}{1} \right) \left( \frac{2 - 2}{1} \right) \right. \\ & \hspace{45mm} \left. + \left( \frac{3 - 2}{1} \right) \left( \frac{3 - 2}{1} \right) \right] \\ & = \frac{1}{2} \left[ 1 + 0 + 1 \right] = \boxed{1} \end{align*}\]
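We can verify this computation in R, either by translating the formula for r directly or by calling the built-in cor() function (which computes Pearson's r by default):

```r
x <- c(1, 2, 3)
y <- c(1, 2, 3)
n <- length(x)

# Translating the formula directly: sum of products of standardized values,
# divided by (n - 1).
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# R's built-in Pearson correlation.
r_builtin <- cor(x, y)

c(r_manual, r_builtin)   # both are exactly 1
```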
It turns out, r will always be between \(-1\) and \(1\), inclusive, regardless of what two sets we are comparing!
So, here is how we interpret the value of r.
The sign of r (i.e. whether it is positive or negative) indicates whether the linear association between the two variables is positive or negative.
The magnitude of r indicates how strong the linear relationship between the two variables is, with magnitudes close to \(1\) or \(-1\) indicating very strong linear relationships.
An r value of 0 indicates no linear relationship between the variables.
Now, something that is very important to mention is that r only quantifies linear relationships- it is very bad at quantifying nonlinear relationships.
For example, consider the following scatterplot:
I think we would all agree that Y and X have a fairly strong relationship. However, the correlation between Y and X is actually only 0.1953333!
So, again- r should only be used as a determination of the strength of linear trends, not nonlinear trends.
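The data behind that scatterplot aren't reproduced here, but a quick simulated sketch (a quadratic signal with a little noise; the specific numbers are arbitrary) shows the same phenomenon: a visually obvious, strong relationship whose Pearson correlation is nonetheless close to zero.

```r
set.seed(5)
x <- seq(-3, 3, length.out = 100)
y <- x^2 + rnorm(100, sd = 0.5)   # strong relationship, but not linear

plot(x, y)    # a clear U-shaped pattern
cor(x, y)     # yet r is close to 0
```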
Exercise 1
Compute the correlation between the following two sets of numbers: \[\begin{align*} \boldsymbol{x} & = \{-1, \ 0, \ 1\} \\ \boldsymbol{y} & = \{1, \ 2, \ 0\} \end{align*}\]
There is another thing to note about correlation.
Let’s see this by way of an example: consider the following two scatterplots:
Both cor(X, Y1) and cor(X, Y2) are equal to 1, despite the fact that a one-unit increase in x corresponds to a different unit increase in y1 as opposed to y2. So, don’t be fooled- the magnitude of r says nothing about how a one-unit increase in x translates to a change in y!
To figure out exactly how a change in x translates to a change in y, we need to return to our model.
Letting y denote our response variable (weight) and x denote our explanatory variable (height), our model is \[ \texttt{y} = \beta_0 + \beta_1 \cdot \texttt{x} + \texttt{noise} \]
Ultimately, we would like to know the values of \(\beta_0\) and \(\beta_1\).
However, we will never be able to determine their values exactly!
Why? Well, take a look at the model again- though we are assuming a linear relationship between height and weight, our weight observations contain some noise, the magnitude of which we never get to see!
Our goal, therefore, is to try and determine good estimators \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) of \(\beta_0\) and \(\beta_1\), respectively.
In the more general setting, given a model \[ \texttt{y} = f(\texttt{x}) + \texttt{noise} \] the goal of model fitting is to take data \(\{(y_i, \ x_i)\}_{i=1}^{n}\) and use it to construct a function \(\widehat{f}()\) that we believe best approximates the function \(f()\): \[ \widehat{\texttt{y}} = \widehat{f}(\texttt{x}) \]
The regression problem basically boils down to finding the line that best fits the data.
The specific line we will discuss in a bit is called the ordinary least squares line (or just OLS line).
Here is how simple linear regression works.
We are given observations \(\{(y_i, \ x_i)\}_{i=1}^{n}\) on a response variable y and an explanatory variable x.
In simple linear regression, we adopt the following model: \[ \texttt{y} = \beta_0 + \beta_1 \cdot \texttt{x} + \texttt{noise} \]
Our goal is to use the data \(\{(y_i, \ x_i)\}_{i=1}^{n}\) to determine suitable estimators for \(\beta_0\) and \(\beta_1\).
We believe there is some true (linear) relationship between y and x. But, because of natural variability due to randomness, we cannot figure out exactly what the true relationship is.
Now, if we are to find the line that best fits the data, we first need to quantify what we mean by “best”.
Here is one idea: consider minimizing the average distance from the datapoints to the line.
As a measure of “average distance from the points to the line”, we will use the so-called residual sum of squares (often abbreviated as RSS): the sum of squared vertical distances from the points to the line, \[ \mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl[ y_i - (\beta_0 + \beta_1 x_i) \bigr]^2 \]
It turns out, using a bit of Calculus, the estimators we seek (i.e. the ones that minimize the RSS) are \[\begin{align*} \widehat{\beta_1} & = \frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2} \\ \widehat{\beta_0} & = \overline{y} - \widehat{\beta_1} \overline{x} \end{align*}\]
These are what are known as the ordinary least squares estimators of \(\beta_0\) and \(\beta_1\), and the line \(\widehat{\beta_0} + \widehat{\beta_1} x\) is called the ordinary least-squares regression line (or just OLS regression line, for short).
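As a sketch (on simulated data, with arbitrarily chosen true values \(\beta_0 = 3\) and \(\beta_1 = 2\)), the OLS formulas above translate directly into R, and they agree with R's built-in least-squares routine lm():

```r
set.seed(10)
x <- runif(30, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(30)   # simulated data: true beta0 = 3, beta1 = 2

# OLS estimates computed straight from the formulas.
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)

# R's built-in least-squares fit returns the same estimates.
coef(lm(y ~ x))
```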
Perhaps an example may illustrate what I am talking about.
\(\widehat{\beta_0} =\) -0.2056061; \(\widehat{\beta_1} =\) -2.1049432. I.e. the equation of the line in blue is \(\widehat{y} = -0.2056061 - 2.1049432 \, x\).
The points on the regression line directly above or below each observed x-value are referred to as fitted values; the y-values of the fitted values are denoted as \(\widehat{y}_i\).
In this way, the OLS regression line is commonly written as a relationship between the fitted values and the x-values: \[ \widehat{y} = \widehat{\beta_0} + \widehat{\beta_1} x \]
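In R, once a model has been fit with lm(), the fitted values \(\widehat{y}_i\) can be extracted with fitted(). Here is a short sketch, reusing the simulated data from the previous code block:

```r
set.seed(10)
x   <- runif(30, min = 0, max = 10)
y   <- 3 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

y_hat <- fitted(fit)       # fitted values: beta0_hat + beta1_hat * x_i
head(cbind(x, y, y_hat))   # observed y vs. fitted y-hat at each x
```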
(Scatterplots of the height and weight data.)
A quick note: the fitted OLS regression line here is \[ \widehat{\texttt{weight}} = 3.367 + 0.979 \cdot \texttt{height} \] Though there was no way to know this, the true \(\beta_1\) was actually \(1.0\). Again, this is just to demonstrate that the OLS estimate \(\widehat{\beta}_1\) is just that- an estimate!
Alright, let’s work through a computation by hand once.
Suppose we have the variables \[\begin{align*} \boldsymbol{x} & = \{3, \ 7, \ 8\} \\ \boldsymbol{y} & = \{20, \ 14, \ 17\} \end{align*}\] and suppose we wish to construct the least-squares regression line when regressing \(\boldsymbol{y}\) onto \(\boldsymbol{x}\).
First, we compute \[\begin{align*} \overline{x} & = 6 \\ \overline{y} & = 17 \end{align*}\]
Next, we compute \[\begin{align*} \sum_{i=1}^{n} (x_i - \overline{x})^2 & = (3 - 6)^2 + (7 - 6)^2 + (8 - 6)^2 = 14 \\ \sum_{i=1}^{n} (y_i - \overline{y})^2 & = (20 - 17)^2 + (14 - 17)^2 + (17 - 17)^2 = 18 \end{align*}\]
Additionally, \[\begin{align*} \sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y}) & = (3 - 6)(20 - 17) + (7 - 6)(14 - 17) \\[-7mm] & \hspace{10mm} + (8 - 6)(17 - 17) \\[5mm] & = -12 \end{align*}\]
Therefore, \[ \widehat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n} (x_i - \overline{x})^2} = \frac{-12}{14} = - \frac{6}{7} \]
Additionally, \[ \widehat{\beta_0} = \overline{y} - \widehat{\beta_1} \overline{x} = 17 - \left( - \frac{6}{7} \right) (6) = \frac{155}{7} \]
This means that the ordinary least-squares regression line is \[ \boxed{\widehat{y} = \frac{1}{7} ( 155 - 6 x )} \]
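We can sanity-check this hand computation with lm(); the reported intercept and slope match \(155/7 \approx 22.143\) and \(-6/7 \approx -0.857\):

```r
x <- c(3, 7, 8)
y <- c(20, 14, 17)

coef(lm(y ~ x))   # intercept = 155/7 (about 22.143), slope = -6/7 (about -0.857)
```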
Alright, so how do we interpret the OLS regression line? \[\widehat{y} = \widehat{\beta_0} + \widehat{\beta_1} x\]
We can see that a one-unit increase in x corresponds to a \(\widehat{\beta_1}\)-unit increase in y.
For example, in our height and weight example we found \[ \widehat{\texttt{weight}} = 3.367 + 0.979 \cdot \texttt{height} \]
This means that a one-cm change in height is associated with a (predicted/estimated) 0.979 lbs change in weight.
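As a quick numerical check of this interpretation (the two heights, 170 cm and 171 cm, are arbitrary illustrative values), plugging heights one centimeter apart into the fitted equation gives predicted weights that differ by exactly the estimated slope:

```r
predict_weight <- function(height) 3.367 + 0.979 * height   # the fitted OLS line

predict_weight(171) - predict_weight(170)   # = 0.979, the estimated slope
```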