Ethan P. Marzban
2023-05-16
As we have previously seen…
… I like cats.
So, let’s consider another cat example!
It is often stated that only 1 in 5 orange tabby cats is female; i.e. that only 20% of orange tabby cats are female.
Let’s say we take a representative sample of 100 orange tabby cats and find that 19 of these cats are female.
Since we observed a proportion of only 19% female cats in our sample, does that mean the claim of 20% of orange tabby cats being female is wrong?
Well, no! We know that this 19% is actually an observed instance of \(\widehat{P}\), which itself is random.
However, if instead our sample of 100 orange tabby cats contained only 1 female in this sample, we might start to question the claim that 20% of orange tabby cats are female.
Okay, what if in our sample of 100 orange tabby cats we actually only observed 15 female cats?
Things are perhaps a bit less clear now… we know that there will be some variability in our point estimator, but how much variability would we really expect? Enough to plausibly observe a sample proportion of 15%?
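To get a rough feel for how much variability to expect, here is a minimal sketch. It assumes the 100 cats are sampled independently, so that the number of females in the sample is a Binomial(100, 0.2) count when the 20% claim is true. The helper name `binom_cdf` is purely illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed directly from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p0 = 100, 0.2  # sample size and claimed proportion from the cat example

# Chance of observing at most 15 (or at most 1) females if the claim is true
prob_15_or_fewer = binom_cdf(15, n, p0)
prob_1_or_fewer = binom_cdf(1, n, p0)

print(f"P(at most 15 females) = {prob_15_or_fewer:.4f}")
print(f"P(at most 1 female)   = {prob_1_or_fewer:.2e}")
```

Under these assumptions the first probability is far from negligible, while the second is vanishingly small, which matches the intuition that 1 female out of 100 is much harder to reconcile with the claim than 15 out of 100.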
This is exactly the framework of hypothesis testing.
In hypothesis testing, we start with a pair of competing claims which we call the null hypothesis and alternative hypothesis, respectively.
For instance, in our cat example above the null hypothesis would be “\(H_0\): the true proportion of orange cats that are female is 20%”.
Oftentimes we will want to phrase our hypotheses in more mathematical terms. This is where the notation we’ve used over the past few lectures comes into play: letting \(p\) denote the true proportion of orange tabby cats that are female, we can write our null hypothesis as \[ H_0: \ p = 0.2 \]
What about the alternative hypothesis?
As the name suggests, the alternative hypothesis provides some sort of alternative to the null.
Let’s look at our cat example again. Here are some potential alternatives to the null:
Each of these alternative hypotheses has a name.
Before we talk about these names, let’s establish a slightly more general framework for conducting hypothesis testing on a proportion.
Our null hypothesis will often take the form \[ H_0 : \ p = p_0 \] for some prespecified value \(p_0\) (e.g. 20%, like in our cat example above).
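This leads to four possible alternative hypotheses: \(H_A: \ p \neq p_0\), \(H_A: \ p > p_0\), \(H_A: \ p < p_0\), and \(H_A: \ p = p_1\) for some value \(p_1 \neq p_0\).
I’d like to stress: in a specific hypothesis testing problem, we need to pick one of these alternative hypotheses.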
When our alternative hypothesis is of the form \(H_A: \ p \neq p_0\), we refer to the situation as a two-sided hypothesis test.
When our alternative hypothesis is of the form \(H_A: \ p > p_0\) or \(H_A: \ p < p_0\), we refer to the situation as a one-sided hypothesis test. Specifically, \(H_A: \ p > p_0\) gives an upper-tailed test and \(H_A: \ p < p_0\) gives a lower-tailed test.
When our alternative hypothesis is of the form \(H_A: \ p = p_1\) (for some value \(p_1\) different from our null value \(p_0\)), we refer to the situation as a simple-vs-simple hypothesis test.
Again, our test will be only one of the above!
Additionally, it is usually up to the tester (i.e. the statistician or data scientist in charge of conducting the hypothesis test) to pick which set of hypotheses to use.
Okay, so which test do we use when?
In practice, there isn’t a one-size-fits-all approach to knowing which set of hypotheses to adopt in a given situation.
Usually, in the absence of any additional information, we adopt a two-sided test as it tends to be the most general.
However, sometimes additional information may be available to us that may influence us to select a different type of test.
How do we set the null hypothesis? Well, typically the null hypothesis is easier to set: I like to think of it as the “status quo”.
Worked-Out Example 1
Forbes magazine has claimed that, as of May 2023, 91.7% of US households own a vehicle.
The two-sided alternative hypothesis would be that the proportion of households that own a vehicle is not equal to 91.7%: \[ H_A: \ p \neq 0.917 \]
The lower-tailed alternative hypothesis would be that the proportion of households that own a vehicle is less than 91.7%: \[ H_A: \ p < 0.917 \]
The upper-tailed alternative hypothesis would be that the proportion of households that own a vehicle is greater than 91.7%: \[ H_A: \ p > 0.917 \]
Alright, here is what we have so far in terms of hypotheses:
So, what is a hypothesis test?
A hypothesis test is a framework/procedure that allows us to determine whether or not the null should be rejected in favor of the alternative.
Naturally, a hypothesis test will depend on data! As such, we can think of a hypothesis test as a function that takes in data and outputs either “reject \(H_0\)” or “fail to reject \(H_0\)”: \[ \texttt{decision}(\texttt{data}) = \begin{cases} \texttt{reject } H_0 & \text{if } \texttt{...} \\ \texttt{fail to reject } H_0 & \text{if } \texttt{...} \\ \end{cases} \]
By the way, the results of a hypothesis test are always framed in terms of the null hypothesis; e.g. “reject \(H_0\)” or “fail to reject \(H_0\)”.
Wait, why are we saying “fail to reject \(H_0\)”? Isn’t that just equivalent to “accept \(H_0\)”?
Well, not quite…
Think of it this way: just because we are saying the particular alternative hypothesis we picked is less plausible than the null, doesn’t mean there isn’t a different alternative hypothesis that is more plausible than the null.
All we are saying when we fail to reject the null is exactly that: we didn’t have enough information to reject \(H_0\) outright. We are not saying that \(H_0\) must be true.
Admittedly, some statisticians have gotten a little lax with this distinction and you may encounter textbooks and/or professors that use terms like “accept the null”.
Okay, so we’ve talked a bit more about what a hypothesis test actually is: it is a procedure that takes in data and outputs a decision on whether or not to reject the null.
Behind the scenes, however, the null will either be true or not.
This leads to the following four situations:
| \(H_0\) is actually | Result of Test: Reject | Result of Test: Fail to Reject |
|---|---|---|
| True | Type I Error (BAD) | GOOD |
| False | GOOD | Type II Error (BAD) |
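Definition: Type I and Type II errors
A Type I error occurs when we reject \(H_0\) even though \(H_0\) is actually true.
A Type II error occurs when we fail to reject \(H_0\) even though \(H_0\) is actually false.
A common way of interpreting Type I and Type II errors is in the context of the judicial system.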
The US judicial system is built upon a motto of “innocent until proven guilty.” As such, the null hypothesis is that a given person is innocent.
A Type I error represents convicting an innocent person.
A Type II error represents letting a guilty person go free.
Viewing the two errors in the context of the judicial system also highlights a tradeoff.
If we want to reduce the number of times we wrongfully convict an innocent person, we may want to make the conditions for convicting someone even stronger.
But this would have the consequence of having fewer people convicted overall, thereby (inadvertently) increasing the chance we let a guilty person go free.
As such, controlling one type of error increases the likelihood of committing the other type.
Worked-Out Example 2
Forbes magazine has claimed that, as of May 2023, 91.7% of US households own a vehicle.
Assuming we are conducting a two-sided test, what would a Type I error be in the context of this experiment? What about a Type II error?
A Type I error would be concluding that the true proportion of US households that own a vehicle is not 91.7%, when in fact 91.7% of US households own a vehicle.
A Type II error would be failing to reject the claim that the true proportion of US households that own a vehicle is 91.7%, when in fact the true proportion is not 91.7%.
Alright, now we know about the basics and background surrounding hypothesis tests.
How do we actually construct one?
Let’s focus on hypothesis testing for population proportions for now; we’ll deal with sample means later.
Recall our setup: our hypothesis test should be some sort of decision-making process of the form \[ \texttt{decision}(\texttt{data}) = \begin{cases} \texttt{reject } H_0 & \text{if } \texttt{...} \\ \texttt{fail to reject } H_0 & \text{if } \texttt{...} \\ \end{cases} \]
For the moment, let’s return to the cat example from the beginning of the lecture.
Letting \(p\) denote the true proportion of orange tabby cats that are female, our null hypothesis takes the form \(H_0: \ p = 0.2\).
Suppose we take a two-sided alternative: \(H_A: \ p \neq 0.2\).
Now, we have a good summary statistic for proportions: \(\widehat{P}\).
As such, our decision process should probably be of the form \[ \texttt{decision}(\widehat{p}) = \begin{cases} \texttt{reject } H_0 & \text{if } \texttt{...} \\ \texttt{fail to reject } H_0 & \text{if } \texttt{...} \\ \end{cases} \]
Said differently: if we observe a value of \(\widehat{p} = 0.82\), or a value of \(\widehat{p} = 0.001\), we would likely be inclined to reject the null.
So, it makes sense to reject \(H_0\) when \(\widehat{p}\) is very far away from \(p_0\) (which, in the cat example, is \(0.2\)). \[ \texttt{decision}(\widehat{p}) = \begin{cases} \texttt{reject } H_0 & \text{if $\widehat{p}$ is far from $p_0$} \\ \texttt{fail to reject } H_0 & \text{otherwise}\\ \end{cases} \]
For reasons that will become clear in a few slides, we typically avoid using \(\widehat{p}\) and instead use a standardized version of \(\widehat{p}\): \[ \mathrm{TS} = \frac{\widehat{P} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} \] where \(\mathrm{TS}\) stands for test statistic.
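As a quick illustration of the formula, here is a minimal sketch that computes the test statistic for two of the sample proportions from the cat example (19% and 15% female, with \(n = 100\) and \(p_0 = 0.2\)). The function name `compute_ts` is an illustrative choice, not something from the lecture.

```python
from math import sqrt

def compute_ts(p_hat, p0, n):
    """Standardized distance between the observed proportion and the null value."""
    standard_error = sqrt(p0 * (1 - p0) / n)  # SD of P-hat assuming H0 is true
    return (p_hat - p0) / standard_error

# Cat example: n = 100 cats, null value p0 = 0.2
ts_19 = compute_ts(0.19, 0.2, 100)  # 19 females observed
ts_15 = compute_ts(0.15, 0.2, 100)  # 15 females observed
print(round(ts_19, 2), round(ts_15, 2))
```

Here the standard error is \(\sqrt{0.2 \cdot 0.8 / 100} = 0.04\), so the two test statistics work out to about \(-0.25\) and \(-1.25\): the sample with 15 females sits five times as many standard errors below \(p_0\) as the sample with 19.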
Let’s try and convert our decision-making process to be in terms of the test statistic.
First, note that saying \(\widehat{p}\) is “far away” from \(p_0\) could mean one of two things: either \(\widehat{p}\) is much larger than \(p_0\), or \(\widehat{p}\) is much smaller than \(p_0\).
These two cases can be combined into a single case if we think in terms of the magnitude of the distance between \(\widehat{p}\) and \(p_0\), which is equivalent to considering \(|\mathrm{TS}|\).
What I’m getting at is this: if \(\widehat{p}\) was far away from \(p_0\), then \(|\mathrm{TS}|\) must be large.
Hence, we can rephrase our decision process as \[ \texttt{decision}(\mathrm{TS}) = \begin{cases} \texttt{reject } H_0 & \text{if $|\mathrm{TS}|$ is large} \\ \texttt{fail to reject } H_0 & \text{otherwise}\\ \end{cases} \]
Okay, but how large is “large”?
That is, for what values of \(|\mathrm{TS}|\) will we reject the null?
The set of values of the test statistic for which we reject \(H_0\) is called the rejection region of a test. In other words, if our test takes the form \[ \texttt{decision}(\mathrm{TS}) = \begin{cases} \texttt{reject } H_0 & \text{if } |\mathrm{TS}| > c \\ \texttt{fail to reject } H_0 & \text{otherwise}\\ \end{cases} \] what value should we take \(c\) to be?
Well, to answer this question, we need to return to our considerations of Type I and Type II errors.
Recall that a Type I error occurs when we reject \(H_0\) when \(H_0\) was actually true, and a Type II error occurs when we fail to reject \(H_0\) when \(H_0\) was false.
Changing the value of \(c\) changes the probability of committing the two types of errors!
Specifically, setting a larger value of \(c\) corresponds to rejecting \(H_0\) for fewer values, thereby decreasing the probability of committing a Type I error but increasing the probability of committing a Type II error.
Conversely, setting a smaller value of \(c\) corresponds to rejecting \(H_0\) for more values, thereby increasing the probability of committing a Type I error but decreasing the probability of committing a Type II error.
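To make this tradeoff concrete, here is a rough numerical sketch. It assumes \(\mathrm{TS}\) is approximately normally distributed (an approximation supported by the Central Limit Theorem for Proportions), and it uses a hypothetical “true” proportion `p1 = 0.15` purely to illustrate the Type II error; neither `p1` nor the function name `error_rates` comes from the lecture.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def error_rates(c, p0=0.2, p1=0.15, n=100):
    """Approximate Type I and Type II error probabilities for the rule |TS| > c.

    p1 is a hypothetical true proportion, used only to illustrate Type II error.
    """
    se0 = sqrt(p0 * (1 - p0) / n)  # standard error used by the test (under H0)
    se1 = sqrt(p1 * (1 - p1) / n)  # actual standard error if p = p1
    type_1 = 2 * (1 - Z.cdf(c))    # P(|TS| > c) when H0 is true
    # If p = p1, TS is approximately normal with this mean and standard deviation
    ts_under_p1 = NormalDist((p1 - p0) / se0, se1 / se0)
    type_2 = ts_under_p1.cdf(c) - ts_under_p1.cdf(-c)  # P(|TS| <= c) when p = p1
    return type_1, type_2

for c in (1.0, 1.5, 2.0, 2.5):
    t1, t2 = error_rates(c)
    print(f"c = {c}: Type I ~ {t1:.3f}, Type II ~ {t2:.3f}")
```

As \(c\) grows, the first column shrinks while the second grows, which is exactly the compromise described above.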
We need to compromise!
In practice, we go into the test knowing how much leeway we are going to allow ourselves to commit a Type I error. That is, we prespecify our tolerance for committing a Type I error.
The probability of committing a Type I error is called the level of significance (or just significance level), and is often denoted \(\alpha\).
Statisticians therefore construct a hypothesis test around a specific value of \(\alpha\).
A common level of significance is \(\alpha = 0.05\), though \(\alpha = 0.01\) and \(\alpha = 0.1\) are sometimes used as well.
Okay, so what does this mean for our test?
We now know that \(\alpha\) denotes the probability of rejecting the null when the null is true; i.e. \[ \mathbb{P}_{H_0}(|\mathrm{TS}| > c) = \alpha \] where the symbol \(\mathbb{P}_{H_0}\) just means “assuming the null, the probability of….”
Again, remember that \(\alpha\) is fixed (e.g. \(0.05\)); it is the value of \(c\) we are after!
So, a natural question arises: what is the distribution of \(\mathrm{TS}\) under the null?
Recall that \[ \mathrm{TS} = \frac{\widehat{P} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} \]
Now, assuming the null is true (i.e. that \(p = p_0\)), the Central Limit Theorem for Proportions tells us \[ \widehat{P} \stackrel{H_0}{\sim} \mathcal{N}\left(p_0, \ \sqrt{\frac{p_0(1 - p_0)}{n}} \right) \] where the symbol \(\stackrel{H_0}{\sim}\) is just a shorthand for “distributed as, under the null”
Therefore, assuming the null is correct, we have \[ \mathrm{TS} \sim \mathcal{N}(0, \ 1)\]
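Since this is an approximation, it can be worth checking how good it is. The sketch below assumes the count of successes is exactly Binomial(\(n\), \(p_0\)) under the null and computes the exact probability that \(|\mathrm{TS}|\) exceeds a cutoff. The cutoff 1.96 is used only as an example value, and `exact_rejection_prob` is an illustrative name.

```python
from math import comb, sqrt

def exact_rejection_prob(n, p0, c):
    """Exact P(|TS| > c) under H0, where the success count X ~ Binomial(n, p0)."""
    se = sqrt(p0 * (1 - p0) / n)
    total = 0.0
    for x in range(n + 1):
        ts = (x / n - p0) / se  # test statistic if we observe x successes
        if abs(ts) > c:
            total += comb(n, x) * p0**x * (1 - p0)**(n - x)
    return total

# Cat example: if TS were exactly N(0, 1), this would equal 0.05 for c = 1.96
approx_check = exact_rejection_prob(100, 0.2, 1.96)
print(f"Exact P(|TS| > 1.96) under H0: {approx_check:.4f}")
```

The result should land near 0.05 but not exactly on it, because \(\widehat{P}\) can only take the discrete values \(0/n, 1/n, \ldots, n/n\).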
So, our condition \[ \mathbb{P}_{H_0}(|\mathrm{TS}| > c) = \alpha \] is, by the symmetry of the standard normal distribution, equivalent to \[ \mathbb{P}_{H_0}(\mathrm{TS} < -c) = \frac{\alpha}{2} \]
Hence, \(-c\) is just the \((\alpha / 2) \times 100\)th percentile of the standard normal distribution!!!
Two-Sided Test for a Proportion:
When testing \(H_0: \ p = p_0\) vs \(H_A: \ p \neq p_0\) at an \(\alpha\) level of significance, where \(p\) denotes a population proportion, the test takes the form \[ \texttt{decision}(\mathrm{TS}) = \begin{cases} \texttt{reject } H_0 & \text{if } |\mathrm{TS}| > z_{1 - \alpha/2} \\ \texttt{fail to reject } H_0 & \text{otherwise}\\ \end{cases} \] where:
\(\displaystyle \mathrm{TS} = \frac{\widehat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}\)
\(z_{1 - \alpha/2}\) denotes the \((\alpha/2) \times 100\)th percentile of the standard normal distribution, scaled by negative 1 (equivalently, by symmetry, the \((1 - \alpha/2) \times 100\)th percentile).
CAUTION!!!
All of this is predicated on our invocation of the Central Limit Theorem for Proportions!
In other words, the test above was derived assuming \[ \widehat{P} \stackrel{H_0}{\sim} \mathcal{N}\left(p_0, \ \sqrt{\frac{p_0(1 - p_0)}{n}} \right) \]
Two-Sided Test for a Proportion:
When testing \(H_0: \ p = p_0\) vs \(H_A: \ p \neq p_0\) at an \(\alpha\) level of significance, where \(p\) denotes a population proportion:
Check that the success-failure conditions hold. Namely, check that \(n p_0 \geq 10\) and \(n (1 - p_0) \geq 10\).
Compute the test statistic \[\displaystyle \mathrm{TS} = \frac{\widehat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}\]
Compute the critical value \(z_{1 - \alpha/2}\), which is the \((\alpha/2) \times 100\)th percentile of the standard normal distribution, scaled by negative 1.
Reject \(H_0\) if \(|\mathrm{TS}| > z_{1 - \alpha/2}\), and fail to reject \(H_0\) otherwise.
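The steps above can be sketched as a single function. This is only one possible way to organize them: the function name `two_sided_prop_test` and the sample data are hypothetical, the success-failure check uses the \(\geq 10\) threshold from the worked examples, and the critical value comes from Python's standard-library `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_prop_test(p_hat, p0, n, alpha=0.05):
    """Two-sided test of H0: p = p0 vs HA: p != p0, following the steps above."""
    # Success-failure conditions, checked with the null value p0
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("success-failure conditions not met")
    # Test statistic
    ts = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # Critical value z_{1 - alpha/2}
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Decision, always phrased in terms of the null
    decision = "reject H0" if abs(ts) > crit else "fail to reject H0"
    return ts, crit, decision

# Hypothetical data: 12 females in a sample of 100 orange tabby cats
ts, crit, decision = two_sided_prop_test(0.12, 0.2, 100)
print(f"TS = {ts:.2f}, critical value = {crit:.2f}: {decision}")
```

With these made-up numbers the test statistic is \(-0.08 / 0.04 = -2\), which is larger in magnitude than the critical value of roughly 1.96, so the function reports a rejection at the 5% level.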
Worked-Out Example 3
Forbes magazine has claimed that, as of May 2023, 91.7% of US households own a vehicle. To test that claim, we take a representative sample of 500 US households and observe that 89.4% of these households own a vehicle.
Conduct a two-sided hypothesis test at a \(5\%\) level of significance on Forbes’s claim that 91.7% of US households own a vehicle. Be sure you phrase your conclusion clearly, and in the context of the problem.
Check Conditions
\(n p_0 = 500 \cdot (0.917) = 458.5 \geq 10 \ \checkmark\)
\(n (1 - p_0) = 500 \cdot (1 - 0.917) = 41.5 \geq 10 \ \checkmark\)
Since both conditions are met, we can proceed.
Compute the Test Statistic \[ \mathrm{TS} = \frac{\widehat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} = \frac{0.894 - 0.917}{\sqrt{\frac{0.917 (1 - 0.917)}{500}}} = -1.86 \]
Compute the critical value: because \(\alpha = 0.05\), the critical value is \(z_{0.975} = 1.96\).
Make the decision: since \(|\mathrm{TS}| = 1.86 < 1.96\), we fail to reject \(H_0\).
Now, the problem told us to phrase our conclusions carefully and in the context of the problem.
It is VERY important to include the level of significance in your final conclusions.
So, here is how we would phrase the final conclusion of our test:
At an \(\alpha = 0.05\) level of significance, there is insufficient evidence to reject Forbes’s claim that 91.7% of US households own a vehicle.
Worked-Out Example 4
Forbes magazine has claimed that, as of May 2023, 91.7% of US households own a vehicle. To test that claim, we take a representative sample of 500 US households and observe that 89.4% of these households own a vehicle.
Conduct a two-sided hypothesis test at a \(1\%\) level of significance on Forbes’s claim that 91.7% of US households own a vehicle. Be sure you phrase your conclusion clearly, and in the context of the problem.
The only thing that will change from before is our critical value.
Since we are using an \(\alpha = 0.01\) level of significance, we find the 0.5th percentile [since \((0.01) / 2 \times 100\% = 0.5\%\)] and scale it by negative 1.
There are several ways we could find this critical value.
The first is to use our normal table: \(2.575\).
The second is to use our \(t-\)table: \(2.58\)
The third is to use Python:
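One way to do this, as a minimal sketch that assumes only Python's standard library (`scipy.stats.norm.ppf` is a common alternative for the percentile lookup):

```python
from statistics import NormalDist

alpha = 0.01
# (alpha/2) x 100th percentile of the standard normal, scaled by negative 1
critical_value = -NormalDist().inv_cdf(alpha / 2)
print(critical_value)
```

This gives a value of roughly 2.5758, in agreement with the two table values above.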
At an \(\alpha = 0.01\) level of significance, there is insufficient evidence to reject Forbes’s claim that 91.7% of US households own a vehicle.
By the way, I’m sure some of you are wondering exactly which cutoff (the one from the \(z-\)table, the one from the \(t-\)table, or the one from Python) to use?
In a real-world setting, definitely use the one Python generated as it is the most accurate.
On a quiz, I will accept either the \(z-\)table or the \(t-\)table value, provided you VERY CLEARLY AND EXPLICITLY state which one you are using.