Ethan P. Marzban
2023-06-06
Last time, we started to talk about how to actually model a relationship between two variables.
Specifically, given a response variable y and an explanatory variable x, we typically assume x and y are related through the equation \[ \texttt{y} = f(\texttt{x}) + \texttt{noise} \] where \(f\) is some function.
Of particular interest to us in this class is when \(f\) takes the form of a linear equation: i.e. when our model is of the form \[ \texttt{y} = \beta_0 + \beta_1 \cdot \texttt{x} + \texttt{noise} \]
Now, the noise part of our model makes it impossible to know the true values of \(\beta_0\) and \(\beta_1\).
As such, we sought to find point estimators \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) that best estimate \(\beta_0\) and \(\beta_1\), respectively.
To quantify what we mean by “best”, we employed the condition of minimizing the residual sum of squares.
Such estimators (i.e. those that minimize the RSS) are said to be ordinary least squares (OLS) estimates.
It turns out that the OLS estimates of \(\beta_0\) and \(\beta_1\) are: \[\begin{align*} \widehat{\beta_1} & = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n} (x_i - \overline{x})^2} = \frac{s_Y}{s_X} \cdot r \\ \widehat{\beta_0} & = \overline{y} - \widehat{\beta_1} \cdot \overline{x} \end{align*}\] where r denotes Pearson’s Correlation Coefficient \[ r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \overline{x}}{s_X} \right) \left( \frac{y_i - \overline{y}}{s_Y} \right) \]
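To make these formulas concrete, here is a minimal sketch in Python (using numpy; the data below are simulated purely for illustration) that computes \(r\), \(\widehat{\beta_1}\), and \(\widehat{\beta_0}\) directly from the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from the model y = beta_0 + beta_1 * x + noise (illustrative values only)
x = rng.uniform(0, 20, size=100)
y = -2.5 + 0.25 * x + rng.normal(0, 2, size=100)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

# Pearson's correlation coefficient
r = np.sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)) / (n - 1)

# OLS estimates of the slope and intercept
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta1_hat, (s_y / s_x) * r)  # the two expressions for the slope agree
print(beta0_hat)
```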
\(\widehat{\beta_0} =\) -2.5884231; \(\widehat{\beta_1} =\) 0.266222
Do we really believe the slope, though?
Well, remember- \(\widehat{\beta_1}\) is just an estimator of \(\beta_1\), the true slope.
As such, we should be a little wary of using a single observed instance of \(\widehat{\beta_1}\) as an estimate for \(\beta_1\).
What do you think would be a better quantity to report?
That’s right- a confidence interval!
It is often desired to report a confidence interval for \(\beta_1\).
Recall: in general, a confidence interval for a parameter \(\theta\) takes the form \[ \widehat{\theta} \pm c \cdot \mathrm{SD}(\widehat{\theta}) \] where \(\widehat{\theta}\) is a point estimator of \(\theta\), \(c\) is a critical value determined by the desired confidence level, and \(\mathrm{SD}(\widehat{\theta})\) denotes the standard deviation of the estimator.
We’re going to do the same thing for \(\beta_1\): \[ \widehat{\beta_1} \pm c \cdot \mathrm{SD}\left( \widehat{\beta_1} \right) \]
You will not be responsible for computing the standard deviation of \(\widehat{\beta_1}\). On a problem, you will be provided with this information.
We should, however, talk about what distribution \[ \frac{\widehat{\beta_1} - \beta_1}{\mathrm{SD}(\widehat{\beta_1})} \] follows, as this will be the distribution whose quantiles we use when constructing our confidence intervals.
It turns out that, if we assume the noise in our model is normally distributed, then \[ \left( \frac{\widehat{\beta_1} - \beta_1}{\mathrm{SD}(\widehat{\beta_1})} \right) \sim t_{n - 2} \]
This allows us to construct confidence intervals for \(\beta_1\) using the \(t_{n - 2}\) distribution.
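As a quick sketch of the recipe (assuming, as stated above, that \(\mathrm{SD}(\widehat{\beta_1})\) is provided to us), the interval can be assembled in Python using scipy’s t distribution for the critical value; the numbers plugged in below are the slope estimate and standard error that appear in the regression summary table later in these notes.

```python
from scipy import stats


def slope_ci(beta1_hat, sd_beta1_hat, n, conf=0.95):
    """Confidence interval beta1_hat +/- c * SD(beta1_hat), with c a quantile of t_{n-2}."""
    c = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return beta1_hat - c * sd_beta1_hat, beta1_hat + c * sd_beta1_hat


# Slope estimate and standard error taken from the summary table below; n = 100
print(slope_ci(beta1_hat=-1.734, sd_beta1_hat=0.222, n=100))
```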
Worked-Out Example 1
Two sets \(\boldsymbol{x} = \{x_i\}_{i=1}^{100}\) and \(\boldsymbol{y} = \{y_i\}_{i=1}^{100}\) have yielded: \[\begin{align*} \sum_{i=1}^{100} x_i = 1009.491 & \qquad \sum_{i=1}^{100} y_i = -2009.075 \\ \sum_{i=1}^{100} (x_i - \overline{x})^2 = 796.16 & \qquad \sum_{i=1}^{100} (y_i - \overline{y})^2 = 6237.68 \\ \sum_{i=1}^{100} (x_i - \overline{x})(y_i - \overline{y}) = -1380.372 \end{align*}\]
Use this information to: (a) find the equation of the OLS regression line for predicting y from x, and (b) construct a 95% confidence interval for \(\beta_1\), the true slope of the relationship between x and y.

For part (a), we simply plug into our formulas: \[\begin{align*} \widehat{\beta_1} & = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n} (x_i - \overline{x})^2} = \frac{-1380.372}{796.16} \approx \boxed{-1.734} \\ \widehat{\beta_0} & = \overline{y} - \widehat{\beta_1} \cdot \overline{x} = \left( \frac{-2009.075}{100} \right) - (-1.734) \cdot \left( \frac{1009.491}{100} \right) \approx \boxed{-2.588} \end{align*}\]
Hence, the equation of the OLS regression line is \[ \boxed{ \widehat{y} = -2.588 - 1.734 \cdot \texttt{x} } \]
For part (b), we should first find the appropriate quantile of the appropriate t-distribution.
Here, \(n = 100\), meaning we will use the \(t_{98}\) distribution.
Additionally, we require a 95% confidence level, meaning the quantile we seek is the \(97.5\)th percentile or, equivalently, negative one times the \(2.5\)th percentile; numerically, this quantile is approximately \(1.9845\).
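As a sanity check, here is a short Python sketch (using scipy for the t quantile) that reproduces the arithmetic of part (a) and the quantile needed in part (b) from the summary statistics given above.

```python
from scipy import stats

n = 100
sum_x, sum_y = 1009.491, -2009.075
sxx = 796.16        # sum of (x_i - xbar)^2
sxy = -1380.372     # sum of (x_i - xbar)(y_i - ybar)

beta1_hat = sxy / sxx                            # approximately -1.734
beta0_hat = sum_y / n - beta1_hat * (sum_x / n)  # approximately -2.588

c = stats.t.ppf(0.975, df=n - 2)                 # 97.5th percentile of t_98, approximately 1.9845
print(beta1_hat, beta0_hat, c)
```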
What happens if a confidence interval for \(\beta_1\) includes zero?
Well, let’s think about what a confidence interval is actually saying.
A 95% CI \([a \ , b]\) for some parameter \(\theta\) is saying “we are 95% confident that the true value of \(\theta\) lies in the interval \([a \ , b]\).”
As such, if a 95% confidence interval for \(\beta_1\) includes zero, we are effectively saying that the value of \(\beta_1\) could be zero.
Since \(\beta_1\) represents the slope of the relationship between x and y, this means we are saying that there could potentially be no relationship between x and y at all!
Indeed, we could make this a bit more rigorous by performing a hypothesis test for \(\beta_1\).
Specifically, we may wish to test \[ \left[ \begin{array}{rr} H_0: & \beta_1 = 0 \\ H_A: & \beta_1 \neq 0 \end{array} \right.\]
A natural test statistic is \[ \frac{\widehat{\beta_1}}{\mathrm{SD}(\widehat{\beta_1})} \stackrel{H_0}{\sim} t_{n - 2} \]
Since we are considering a two-sided alternative (for now), our test would reject for large values of \(|\mathrm{TS}|\), where the critical value comes from the \(t_{n - 2}\) distribution.
|           | Estimate | Std. Error | t-value | Pr(>\|t\|) |
|-----------|----------|------------|---------|------------|
| Intercept | -2.588   | 2.327      | -1.112  | 0.269      |
| Slope     | -1.734   | 0.222      | -7.811  | 6.41e-12   |
The first column is the raw estimated value (i.e. \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\), respectively)
The second column is the standard error (i.e. standard deviation) of the estimator
The third column is the test statistic (i.e. the first column divided by the second)
The fourth column is the p-value of a two-sided test of whether the given parameter is zero.
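The last two columns can be reproduced from the first two. For example, here is a brief sketch (in Python, using scipy) that recovers the t-value and p-value for the Slope row of the table above.

```python
from scipy import stats

estimate, std_error, n = -1.734, 0.222, 100

t_value = estimate / std_error                     # approximately -7.81
p_value = 2 * stats.t.sf(abs(t_value), df=n - 2)   # two-sided p-value, approximately 6.4e-12
print(t_value, p_value)
```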
Regression truly is one of the workhorses of statistics (the other being Hypothesis Testing).
Indeed, there is so much one can do with regression!
For one, we could consider how a response variable y is related to multiple covariates x1, x2, …, xk.
We could also explore nonlinear relationships between y and a single covariate x (or even multiple covariates!)
Here is an interesting project: suppose we have access to a roster of passengers on the Titanic.
Tragically, not all passengers survived.
One question we may want to ask is: how did various factors (e.g. class, gender, etc.) affect whether or not a given passenger survived?
Notice how here, the response variable (i.e. whether or not someone survived) is a binary variable; that is, it takes only one of two values (survived or did not survive).
Regression involving a binary response (like with this Titanic project) is called logistic regression and is highly applicable to lots of other projects as well.
These are all topics that fall under the category of Machine Learning.
Throughout this course, we have been using various pieces of data.
One thing we should discuss, as good Data Scientists, is where this data actually comes from.
We will now begin to discuss some possible answers to this question, as well as some practical strategies for collecting data of our own!
Indeed, most experiments and studies begin with some sort of question.
The next step is something we’ve actually done several times in this class already: identify the population of interest!
Now, we need the most crucial piece of all: data!
This will entail taking a sample (or possibly many samples) from our population.
There are many ways to take a sample!
One way is to take what is known as a simple random sample (or SRS, for short).
A simple random sample is akin to assigning a unique number to each person in the population, and then picking some subset of these numbers uniformly at random.
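Here is a minimal sketch of that idea in Python (using numpy; the population size of 500 and sample size of 10 are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

population_ids = np.arange(500)  # one unique number assigned to each person in the population

# Simple random sample: 10 of these numbers, chosen uniformly at random without replacement
srs = rng.choice(population_ids, size=10, replace=False)
print(srs)
```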
Now, Simple Random Samples do have some potential downsides.
As an example, let’s consider the following situation: suppose our population is the workplace at some company (which we will call Company X).
Say we are interested in determining whether or not there is systemic racism present in Company X.
To determine whether or not this is the case, we might take a sample and administer a survey (we’ll talk more about surveys in a bit).
Can anyone tell me a potential problem with this setup if we were to take a Simple Random Sample?
That’s right- we run the risk of obtaining a biased sample.
Said differently: if the company’s workforce is predominantly Caucasian, an SRS of employees will, with high probability, contain mostly Caucasian employees and very few (or even zero) non-Caucasian employees.
One way to remedy this would be to perform what is known as stratified sampling, in which we first divide the population into several strata (i.e. groups), and then take an SRS from each stratum.
This has the benefit of ensuring a roughly equal number of participants from each stratum, but has the downsides of being very dependent on the strata that are created.
That is, sometimes it won’t necessarily be obvious how to divide the population!
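Here is a brief sketch of stratified sampling in Python, assuming a hypothetical pandas DataFrame of employees with a column that defines the strata (the data and column names are made up for illustration).

```python
import pandas as pd

# Hypothetical employee roster; the "group" column defines the strata
employees = pd.DataFrame({
    "id": range(12),
    "group": ["A"] * 8 + ["B"] * 4,
})

# Stratified sample: take an SRS of size 2 from each stratum
stratified = employees.groupby("group").sample(n=2, random_state=0)
print(stratified)
```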
Another type of sampling is called cluster sampling. Similar to stratified sampling, the population is first divided into several groups (now called clusters). But, instead of taking an SRS from every cluster, we instead take an SRS of clusters and then take an SRS from each included cluster.
Note, then, that we are not including every cluster in our sample in this way of sampling.
This has the benefit of being (potentially) cheaper, but has the (obvious) downside of potentially skewing results due to the lack of certain clusters.
By the way, this is slightly different from the notion of Cluster Sampling outlined in the textbook; the textbook calls the above scheme “multistage sampling”. However, this term is not widely used, and as such we will simply refer to the above as “Cluster Sampling”, as most statisticians do.
Example (credit to OpenIntro Statistics)
Suppose we are interested in estimating the malaria rate in a densely tropical portion of rural Indonesia. We learn that there are 30 villages in that part of the Indonesian jungle, each more or less similar to the next. Our goal is to test 150 individuals for malaria. What sampling method should be employed?
At surface level, an SRS may seem tempting.
However, an SRS will likely contain individuals from all (or certainly most) of the villages.
This would require someone to actually visit these villages to collect data, which will be incredibly costly.
As such, an SRS is probably not a good idea.
Indeed, cluster sampling seems to be the way to go.
Specifically, we could assign each village to its own cluster, then select some number of villages (say, maybe 15 or so) randomly, and then select some number of people (say, 10 or so) from each of the selected villages.
Because the villages are “more or less similar”, the people included in our sample would likely not have many obvious differences (in the context of the experiment) and we would believe our sample to be relatively representative.
Additionally, the experimenter (or a volunteer) would only need to visit 15 villages instead of all 30.
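To make the scheme concrete, here is a small Python sketch of this two-stage procedure (the village sizes are hypothetical): randomly select 15 of the 30 villages, then take an SRS of 10 residents from each selected village.

```python
import numpy as np

rng = np.random.default_rng(0)

n_villages = 30
village_sizes = rng.integers(80, 200, size=n_villages)  # hypothetical village populations

# Stage 1: an SRS of clusters (villages)
chosen_villages = rng.choice(n_villages, size=15, replace=False)

# Stage 2: an SRS of 10 residents (by index) from each chosen village
sample = {int(v): rng.choice(village_sizes[v], size=10, replace=False) for v in chosen_villages}
print(sample)
```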
Of course, we would need to modify our statistical tools slightly to reflect the fact that our sample was actually collected via a clustering scheme- such modifications (though not too challenging) are outside the scope of this course.
Even if we end up with a fairly representative sample, there are ways bias can creep in.
Suppose we administer a survey to a representative sample of people.
It is not guaranteed that everyone in our sample will actually fill out the survey; even if a particular individual does attempt the survey, there is no guarantee they will finish the entirety of the survey.
This can lead to gaps in our data with respect to certain demographics, which is itself a form of bias.
We call this non-response bias.
Another sampling strategy that is prone to bias is known as convenience sampling.
We can think of convenience sampling as any sort of sampling scheme where individuals who are easily accessed have a higher chance of being included in the sample.
For example, suppose we poll students about their views on the housing crisis in Santa Barbara.
If we conduct this poll at UCSB, we are more likely to get UCSB students and much less likely to get, say, UCLA students.
Similarly, if we conducted our study at UCLA, we would be more likely to include UCLA students than, say, NYU students.
So, why would we ever use a convenience sample?
Well, as the name suggests- it is convenient!
Convenience samples are often both the easiest to obtain, as well as the cheapest.
Having said that, most statisticians agree that convenience samples are bad. This is due to what is known as the garbage in garbage out philosophy.
Loosely speaking, the “garbage in garbage out” philosophy states that our results are only as good as our data- even the most sophisticated statistical models will output nonsensical or skewed results if we are feeding them nonsensical or skewed data!
There is another distinction we should be aware of: the difference between an observational study and an experiment.
In an observational study, no treatment is ever explicitly applied (or withheld).
This is in contrast to an experiment, in which researchers assign treatments to cases.
For example, suppose a researcher is interested in determining the relationship between cancer rates and tanning beds.
In an experiment, the researcher would take some sample of volunteers, split them into groups, and assign one group to tan regularly and another to not use tanning beds at all for the duration of the study.
If instead the researcher were to conduct an observational study, they might take some sample of people who already use tanning beds, analyze cancer rates, and repeat for a sample of people who do not use tanning beds.
Naturally, experiments come with all sorts of ethical considerations.
For instance, in trying to determine the relationship between cancer rates and smoking, nobody would actually force a group of participants to smoke, just for the purposes of an experiment.
This is one “pro” in favor of observational studies- the experimenter is not responsible for forcing a group of people to do (or not do something) they wouldn’t want to.
Of course, observational studies are not perfect either.
Specifically, observational studies cannot be used to identify causal relationships; they can only ever identify associations (or a lack thereof).
Remember that association is not the same thing as causation!
Example (OpenIntro Statistics, 1.19)
A large college class has 160 students. All 160 students attend the lectures together, but the students are divided into 4 groups, each of 40 students, for lab sections administered by different teaching assistants. The professor wants to conduct a survey about how satisfied the students are with the course, and he believes that the lab section a student is in might affect the student’s overall satisfaction with the course. What type of study is this, and what sampling strategy should be used?
Because treatment has neither been administered nor withheld by the researcher, this is an example of an observational study.
In this case, stratified sampling would likely be a good idea, with each lab section assigned to a stratum. This ensures each stratum (lab section) is represented in the sample.
Suppose a particular drug claims to significantly reduce blood sugar levels. Consider the following two scenarios:
An SRS of 100 people is taken from a certain demographic, and divided randomly into two groups: group A and group B. Group A is administered the drug and Group B is not administered the drug. After a week, the researcher collects data on the blood sugar levels of the two groups.
An SRS of 50 people is taken from a certain demographic. These 50 people have their blood sugar levels recorded, and are then all administered the drug. A week later, the participants have their blood sugar levels recorded.
Note that both of these situations seemingly end up with the same kind of data: 50 blood sugar measurements from people not taking the drug, and 50 measurements from people taking the drug.
However, we can see that these two studies are fundamentally different.
In the first study, people were divided into two groups (called the treatment and control groups, respectively).
In the second, the same 50 individuals were tracked over time.
This is an example of the distinction between a cross-sectional study and a longitudinal study.
In a longitudinal study, the same set of individuals is tracked over time.
In a cross-sectional study, different individuals are divided into several groups and compared at a single point in time.
Notice that data in longitudinal studies necessarily possess serial correlation: that is, measurements taken on the same individual at different times are correlated!
Let’s quickly return to our distinction between observational studies and experiments.
Suppose it is decided that we want to conduct an experiment.
We then need to think very carefully about how we want to design our experiment.
This leads us into our final discussion for this course: Experimental Design.
First, let me make clear the distinction between treatment and control groups.
Treatment groups are groups to which one or more treatments are administered.
Typically, we always leave at least one group “alone”; i.e. one group to which no treatment is administered. This group is called the control group.
Experimental Design loosely refers to the principles of and procedures related to setting up an experiment.
I won’t go into this in too much detail.
I do highly encourage you to read Section 1.4 of your textbook, as it provides a very nice summary of some of the main tenets of experimental design.
Remember - as Data Scientists, it is very important we understand our data as much as possible before performing analyses.
Alright, so that was the last bit of new material I wanted to cover in this class.
But…. why did we do all of this?
What was the point of this course?
But, more fundamentally, this course was designed to provide an introduction to Data Science.
The key word here is “introduction”: we only just scratched the surface of the topics we discussed!
Even within our own PSTAT department, there are lots of different courses that dive deeper into the subjects we discussed.
PSTAT 120A provides a deeper look at Probability, and some more sophisticated probability tools.
PSTAT 120B and 120C provide a deeper look at inferential statistics, and how to answer much more interesting and complex problems than those we looked at in this course!
Interested in Experimental Design? PSTAT 122 is devoted entirely to that!
Want to learn more about regression (including logistic regression)? Take PSTAT 126 and 131!
Now, I know many of you are graduating this quarter.
For those of you who are not, and especially for those of you who aren’t quite sure what you want to do, I’d like to leave you with a story.
I encountered a student who had come into undergrad not knowing what they wanted to do at all.
They had a vague inkling that they might want to do math, but after a quarter switched to Econ, then Physics, and finally wound their way into the equivalent of PSTAT 5A at their undergraduate institution.
By the end of the quarter, they were so intrigued that they decided to take the analog of 120A, and then 120B, and then, before they knew it, they had completed a degree in statistics!
That student…. was me!
Years ago, I stumbled into the analog of PSTAT 5A and was so enamored by the field that here I am now, pursuing a PhD in it!
I truly believe statistics and data science are some of the most useful and applicable fields around.
Wherever there is data, there is the need for a data scientist.
Whenever there is uncertainty, there is the need for a statistician.
Statistics and Data Science have far reaching applications in so many fields!
As the statistician John Tukey famously put it: “The best thing about being a statistician is that you get to play in everyone’s backyard.”
I couldn’t agree more!
So, now that you’ve learned the basics…
…go out and play!