Ethan P. Marzban
2023-07-31
Last time, we began our foray into statistical modeling.
Given data \(\{(x_i, \ y_i)\}_{i=1}^{n}\) on a response variable \(\texttt{y}\) and an explanatory variable \(\texttt{x}\), we model the relationship between \(\texttt{y}\) and \(\texttt{x}\) as \[ \texttt{y} = f(\texttt{x}) + \texttt{noise} \] for some signal function \(f()\).
When the response variable is numerical, we call the model a regression model; when the variable is categorical, we call the model a classification model.
Ultimately, we wish to fit a signal function \(\widehat{f}()\) to our data.
Simple Linear Regression refers to a situation in which we have a single numerical response variable \(\texttt{y}\), a single explanatory variable \(\texttt{x}\), and a linear signal function. That is, the model in a simple linear regression setting is \[ \texttt{y} = \beta_0 + \beta_1 \cdot x + \texttt{noise} \]
I’d like to stress: writing \(f(x) = \beta_0 + \beta_1 \cdot x\) is exactly the same as our familiar \(mx + b\) form for a line!
The reason we use \(\beta_0\) and \(\beta_1\) in place of \(b\) and \(m\), respectively, is to allow for an extension of the same notation practices to a multivariate setting.
That is, if we have \(k\) explanatory variables \(\texttt{x}_1\) through \(\texttt{x}_k\), it is easier to write a linear model as \[ \texttt{y} = \beta_0 + \beta_1 \cdot x_1 + \cdots + \beta_k \cdot x_k + \texttt{noise} \] instead of having to find new letters for the coefficients of \(\texttt{x}_1\) through \(\texttt{x}_k\).
Back to the model fitting problem: we seek to find “good” estimators \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) for \(\beta_0\) and \(\beta_1\) respectively.
One quantification of “good” is “minimizing the residual sum of squares”:
\[ \mathrm{RSS} = \sum_{i=1}^{n} e_i^2 \] where \(e_i := y_i - (\widehat{\beta}_0 + \widehat{\beta}_1 x_i)\) denotes the \(i\)th residual; i.e. the difference between the \(i\)th observed response and the corresponding fitted value.
The estimators that minimize the RSS are \[\begin{align*} \widehat{\beta}_1 & = \frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2} \\ \widehat{\beta}_0 & = \overline{y} - \widehat{\beta}_1 \overline{x} \end{align*}\] which are called the ordinary least squares (or just OLS) estimators of \(\beta_0\) and \(\beta_1\).
We write the OLS regression line as \[ \widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot x \] which is just the line that “best” fits the data (where, again, “best” is quantified by minimizing the RSS).
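To make these formulas concrete, here is a minimal NumPy sketch of the computation (my own illustration, not code from the lecture; the function name `ols_fit` is hypothetical):

```python
import numpy as np

def ols_fit(x, y):
    """Compute the OLS estimates (beta0_hat, beta1_hat) for simple linear regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    # Intercept: forces the line through the point of averages (xbar, ybar)
    beta0_hat = ybar - beta1_hat * xbar
    return beta0_hat, beta1_hat
```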
By the way: remember that we also talked about Pearson’s correlation coefficient last lecture.
It is defined as \[ r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \overline{x}}{s_X} \right) \left( \frac{y_i - \overline{y}}{s_Y} \right) \] and gives a way of quantifying the strength of the linear association between two lists of numbers \(\{x_i\}_{i=1}^{n}\) and \(\{y_i\}_{i=1}^{n}\).
I mentioned last time that \(r\) does NOT give the slope of the line that best fits the data; that role is given to \(\widehat{\beta}_1\)!
However, there is in fact a connection between \(\widehat{\beta}_1\) and \(r\): it turns out (after a bit of math) that we can equivalently compute \(\widehat{\beta}_1\) as \[ \widehat{\beta}_1 = \frac{s_Y}{s_X} \cdot r \]
In this way, we can perhaps see how the OLS regression line can be used to perform prediction.
To see how this works, let’s return to a toy example from last lecture: \[\begin{align*} \boldsymbol{x} & = \{3, \ 7, \ 8\} \\ \boldsymbol{y} & = \{20, \ 14, \ 17\} \end{align*}\]
Notice that we do not have an \(\texttt{x}\)-observation of 5. As such, we don’t know what the \(\texttt{y}\)-value corresponding to an \(\texttt{x}\)-value of 5 is.
However, we do have a decent guess as to what the \(\texttt{y}\)-value corresponding to an \(\texttt{x}\)-value of 5 is: the corresponding fitted value!
\[ \widehat{y}^{(5)} = \frac{1}{7} (155 - 6 \cdot 5) \approx 17.857 \]
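As a quick sanity check, here is a short sketch (again mine, not from the lecture) that reproduces this fitted value numerically:

```python
import numpy as np

x = np.array([3.0, 7.0, 8.0])
y = np.array([20.0, 14.0, 17.0])

# OLS estimates for the toy data: slope = -6/7, intercept = 155/7
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Fitted value at x = 5
print(intercept + slope * 5)   # 17.857142..., matching (155 - 30) / 7
```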
Say we want to predict the \(\texttt{y}\)-value corresponding to an \(\texttt{x}\)-value of 40. Following our steps from before, we would just find the fitted value corresponding to \(\texttt{x} = 40\): \[ \widehat{y}^{(40)} = \frac{1}{7} (155 - 6 \cdot 40) \approx -12.143 \] This prediction, it turns out, is way off.
Specifically, the true signal function was quadratic. (When you zoom in close enough, parabolas look linear!)
This is why it is a bad idea to try to extrapolate too far.
Extrapolation is the name we give to trying to apply a model estimate to values that are very far outside the realm of the original data.
How far is “very far”? Statisticians disagree on this front. For the purposes of this class, just use your best judgment.
Exercise 1
An airline is interested in determining the relationship between flight duration (in minutes) and the net amount of soda consumed (in oz.). Letting \(\texttt{x}\) denote flight duration (the explanatory variable) and \(\texttt{y}\) denote amount of soda consumed (the response variable), a sample of size 102 yielded the following results: \[ \begin{array}{cc}
\displaystyle \sum_{i=1}^{102} x_i = 20,\!190.55; & \displaystyle \sum_{i=1}^{102} (x_i - \overline{x})^2 = 101,\!865 \\
\displaystyle \sum_{i=1}^{102} y_i = 166,\!907.8; & \displaystyle \sum_{i=1}^{102} (y_i - \overline{y})^2 = 120,\!794.2 \\
\displaystyle \sum_{i=1}^{102} (x_i - \overline{x})(y_i - \overline{y}) = 80,\!184.62 \\
\end{array} \]
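Though not part of the exercise itself, here is a short sketch (the variable names are mine) showing how the OLS estimates can be computed from these summary statistics alone:

```python
n = 102
sum_x, sum_y = 20_190.55, 166_907.8
Sxx = 101_865        # sum of (x_i - xbar)^2
Sxy = 80_184.62      # sum of (x_i - xbar)(y_i - ybar)

xbar, ybar = sum_x / n, sum_y / n
beta1_hat = Sxy / Sxx                 # 0.7871656...
beta0_hat = ybar - beta1_hat * xbar   # ybar minus slope times xbar
print(beta1_hat, beta0_hat)
```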
A one-unit change in \(\texttt{x}\) corresponds to a predicted \(\widehat{\beta}_1\)-unit change in \(\texttt{y}\).
Notice the word “predicted” there: remember that \(\widehat{\beta}_1\) is an estimator, not the true slope!
That’s right: \(\widehat{\beta}_1\) can be thought of as a random variable.
We know that \(\widehat{\beta}_1\) seeks to estimate \(\beta_1\).
It seems plausible, then, that we might be able to perform inference on \(\beta_1\) (the true slope) using \(\widehat{\beta}_1\) (the OLS estimator).
Indeed, we can!
Let’s start off by considering confidence intervals for the true slope \(\beta_1\).
Recall that (at least in the confines of this course), given a parameter \(\theta\) and an estimator \(\widehat{\theta}\) of \(\theta\), we construct a confidence interval for \(\theta\) as \[ \widehat{\theta} \pm c \cdot \mathrm{SD}(\widehat{\theta}) \] where \(c\) is a constant that depends on both the sampling distribution of \(\widehat{\theta}\) along with the confidence level.
This means we can construct a confidence interval for \(\beta_1\) using \[ \widehat{\beta}_1 \pm c \cdot \mathrm{SD}(\widehat{\beta}_1) \] where \(\widehat{\beta}_1\) is the OLS estimator of \(\beta_1\).
It turns out that finding \(\mathrm{SD}(\widehat{\beta}_1)\) is fairly involved. As such, I won’t expect you to compute it; you will be provided with its value for a given problem (see the practice problems for an example of what I mean).
We also need access to the sampling distribution of \(\widehat{\beta}_1\).
Assuming both the \(\texttt{x}\)-observations and \(\texttt{y}\)-observations are roughly normal, then \[ \frac{\widehat{\beta}_1 - \beta_1}{\mathrm{SD}(\widehat{\beta}_1)} \sim t_{n - 2} \]
This means our critical value should be the appropriately-selected quantile of the \(t_{n-2}\) distribution.
Worked-Out Example 1
Consider the same setup as Exercise 1. Suppose it is known that \[ \mathrm{Var}(\widehat{\beta}_1) \approx 0.006135 \] Construct a 95% confidence interval for \(\beta_1\), the true amount of change in \(\texttt{y}\) (amount of soda consumed) associated with a one-unit change in \(\texttt{x}\) (flight duration).
We previously saw that \(\widehat{\beta}_1 \approx 0.7871656\).
We know to use the \(t_{100}\) distribution; since we are using a 95% confidence level, we take \(1.98\) as our confidence coefficient (make sure you know where this came from!)
Hence, our confidence interval is \[(0.7871656) \pm 1.98 \cdot \sqrt{0.006135} = \boxed{[0.6321 \ , \ 0.9423]}\]
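Here is a quick sketch (mine, not from the lecture) reproducing this computation with scipy; the small discrepancy in the endpoints comes from using the exact \(t_{100}\) quantile instead of the rounded value \(1.98\):

```python
import numpy as np
from scipy import stats

beta1_hat = 0.7871656
sd_beta1 = np.sqrt(0.006135)      # SD(beta1_hat) = sqrt of the given variance

c = stats.t.ppf(0.975, df=100)    # 97.5th percentile of t_100; approx 1.984
lower = beta1_hat - c * sd_beta1
upper = beta1_hat + c * sd_beta1
print(lower, upper)               # approx (0.6318, 0.9426)
```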
The interpretation of this interval is similar to the interpretation of our confidence intervals thus far:
We are 95% confident that the interval \([0.6321 \ , \ 0.9423]\) covers the true value of \(\beta_1\).
For the dataset plotted on the previous page, the OLS estimates are \(\widehat{\beta}_0 = -2.5884231\) and \(\widehat{\beta}_1 = 0.266222\).
Do we really believe the slope, though?
Without the OLS regression line, the scatterplot on the previous page would likely be one we classify as exhibiting “no relationship” between \(\texttt{x}\) and \(\texttt{y}\).
However, the OLS regression line has picked up a positive slope.
What’s going on?
In other words, do our data (i.e. the data that gave rise to the computed value of \(\widehat{\beta}_1\)) actually support the claim that there is no relationship?
What do you know: we’ve entered the realm of hypothesis testing!
Specifically, we are trying to use our data to test the hypotheses \[ \left[ \begin{array}{rl} H_0: & \beta_1 = 0 \\ H_A: & \beta_1 \neq 0 \end{array} \right. \]
The null hypothesis asserts that there is no linear relationship between \(\texttt{y}\) and \(\texttt{x}\): “a one-unit change in \(\texttt{x}\) corresponds to no change in \(\texttt{y}\)”; i.e. that \(\beta_1 = 0\). Our test then takes the form \[ \texttt{decision}(\mathrm{TS}) = \begin{cases} \texttt{reject } H_0 & \text{if } |\mathrm{TS}| > c \\ \texttt{fail to reject } H_0 & \text{otherwise}\\ \end{cases} \] where \(c\) is the appropriately-selected quantile of the \(t_{n-2}\) distribution.
Equivalently, we compute p-values using the \(t_{n-2}\) distribution.
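As an illustration (a sketch of mine, not from the lecture), here is how this decision rule and p-value could be computed in Python:

```python
from scipy import stats

def slope_test(beta1_hat, sd_beta1, n, alpha=0.05):
    """Two-sided test of H0: beta1 = 0 against HA: beta1 != 0."""
    ts = beta1_hat / sd_beta1                     # test statistic
    c = stats.t.ppf(1 - alpha / 2, df=n - 2)      # critical value from t_{n-2}
    p_value = 2 * stats.t.sf(abs(ts), df=n - 2)   # two-sided p-value
    decision = "reject H0" if abs(ts) > c else "fail to reject H0"
    return ts, p_value, decision
```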
Statistical software typically reports these quantities in a regression table like the following:

|  | Estimate | Std. Error | t-value | Pr(>\|t\|) |
|---|---|---|---|---|
| Intercept | -2.588 | 2.327 | -1.112 | 0.269 |
| Slope | -1.734 | 0.222 | -7.811 | 6.41e-12 |
The first column is the raw estimated value (i.e. \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\), respectively)
The second column is the standard error (i.e. standard deviation) of the estimator
The third column is the test statistic (i.e. the first column divided by the second)
The fourth column is the p-value of a two-sided test of whether or not the given parameter is zero.
Here’s a task for you: write a function called `regtab()` that takes in two inputs, \(\texttt{x}\) and \(\texttt{y}\), and returns a regression table resulting from regressing \(\texttt{y}\) on \(\texttt{x}\).
You may find the `scipy.stats.linregress()` function helpful.
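One possible sketch of such a function, mirroring the table layout above (this is my own illustration, and it assumes a recent version of SciPy, which exposes the `intercept_stderr` attribute):

```python
import numpy as np
from scipy import stats

def regtab(x, y):
    """Print a regression table (estimates, SEs, t-values, p-values) for y on x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    res = stats.linregress(x, y)
    df = len(x) - 2   # degrees of freedom for the t distribution

    rows = [
        ("Intercept", res.intercept, res.intercept_stderr),  # needs recent SciPy
        ("Slope", res.slope, res.stderr),
    ]
    print(f"{'':<10}{'Estimate':>10}{'Std. Error':>12}{'t-value':>10}{'Pr(>|t|)':>12}")
    for name, est, se in rows:
        t = est / se                          # test statistic for H0: parameter = 0
        p = 2 * stats.t.sf(abs(t), df=df)     # two-sided p-value
        print(f"{name:<10}{est:>10.4f}{se:>12.4f}{t:>10.3f}{p:>12.3g}")
```

Calling `regtab(x, y)` on a dataset would then print a table in the same format as the one above.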
Exercise 2 (modified from StatClass)
Consider the following regression equation, obtained from a sample of size \(50\): \[ \widehat{y} = 3.8 - 0.277 x \] Suppose additionally that the standard deviation of \(\widehat{\beta}_1\) is \(0.39\).
Using a 5% level of significance, perform a test of the hypotheses \[ \left[ \begin{array}{rl} H_0: & \beta_1 = 0 \\ H_A: & \beta_1 \neq 0 \end{array} \right.\]
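If you’d like to check your work afterwards, here is a sketch (mine, not part of the original exercise) that carries out the computation:

```python
from scipy import stats

n = 50
beta1_hat, sd_beta1 = -0.277, 0.39

ts = beta1_hat / sd_beta1              # test statistic, approx -0.710
c = stats.t.ppf(0.975, df=n - 2)       # critical value, approx 2.011 for t_48
print("reject H0" if abs(ts) > c else "fail to reject H0")
```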
Exercise 3 (modified from StatClass)
Ten towns were the subject of a study to determine whether or not an increased number of stores selling liquor in their downtown areas is linked with a higher number of DUI arrests downtown during one month. The data and summary information are provided below.
| x | 0 | 5 | 6 | 5 | 11 | 9 | 10 | 3 | 7 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 40 | 50 | 55 | 64 | 73 | 75 | 88 | 25 | 20 | 10 |
\[ \begin{array}{lll} \overline{x} = 6 & \displaystyle \sum_{i=1}^{10} (x_i - \overline{x})^2 = 102 \\ \overline{y} = 50 & \displaystyle \sum_{i=1}^{10} (y_i - \overline{y})^2 = 6,\!044 & \displaystyle \sum_{i=1}^{10} (x_i - \overline{x})(y_i - \overline{y}) = 513 \end{array} \]
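These summary statistics are easy to verify with a few lines of NumPy (a sketch of mine, not part of the exercise):

```python
import numpy as np

x = np.array([0, 5, 6, 5, 11, 9, 10, 3, 7, 4], dtype=float)
y = np.array([40, 50, 55, 64, 73, 75, 88, 25, 20, 10], dtype=float)

print(x.mean(), y.mean())                       # 6.0, 50.0
print(np.sum((x - x.mean()) ** 2))              # 102.0
print(np.sum((y - y.mean()) ** 2))              # 6044.0
print(np.sum((x - x.mean()) * (y - y.mean())))  # 513.0
```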