Understanding Hypothesis Testing
Hypothesis testing is a formal way to use sample data to make decisions or judgments about a population. It turns vague questions like “Did this new teaching method help?” into a clear, step-by-step statistical procedure.
This chapter assumes you already know basic ideas of inferential statistics such as population, sample, and confidence levels.
1. The Basic Idea
Hypothesis testing is about deciding between two competing claims (hypotheses) about a population:
- A “default” claim we assume to be true unless evidence suggests otherwise.
- An “alternative” claim we consider if the evidence strongly contradicts the default.
We then use sample data to see which claim is more consistent with what we observed.
The result of a hypothesis test is not “proof,” but a decision:
- Either we reject the default claim because the data are too unlikely under it,
- Or we fail to reject the default claim because the data are not surprising enough to contradict it.
We never “prove” a hypothesis in an absolute sense; we only see whether the data give us enough reason to doubt the default claim.
2. Null and Alternative Hypotheses
The two competing claims are:
- Null hypothesis $H_0$:
The status quo, no effect, no difference, or a specific claimed value.
- Alternative hypothesis $H_1$ (or $H_a$):
What you suspect might be true instead of $H_0$, often an effect, a difference, or a change.
Examples:
- Checking a manufacturer’s claim that the mean battery life is 10 hours:
- $H_0: \mu = 10$ hours (mean life is 10)
- $H_1: \mu \neq 10$ hours (mean life is not 10)
- Testing whether a new drug has a higher cure rate than the standard 60%:
- $H_0: p = 0.60$
- $H_1: p > 0.60$
- Checking if a diet reduces mean weight (compared to 80 kg):
- $H_0: \mu = 80$
- $H_1: \mu < 80$
The direction of $H_1$ determines the type of test:
- Two-sided test: $H_1: \text{parameter} \neq \text{value}$
- Right-sided test: $H_1: \text{parameter} > \text{value}$
- Left-sided test: $H_1: \text{parameter} < \text{value}$
You always write $H_0$ with equality (or an inequality that includes equality, such as $\le$ or $\ge$); the strict inequality goes into $H_1$.
3. Test Statistic and Sampling Distribution
To decide between $H_0$ and $H_1$, we transform the sample data into a test statistic.
A test statistic:
- Is a single number computed from the sample.
- Has a known probability distribution when $H_0$ is true.
- Measures how far the sample result is from what $H_0$ predicts.
The exact formula for the test statistic depends on:
- The parameter being tested (mean, proportion, difference, etc.).
- The sampling distribution (normal, $t$, etc.).
- Whether population parameters (like standard deviation) are known.
Common examples (you will see specific formulas in other chapters):
- $z$-statistic for a mean (when population standard deviation is known).
- $t$-statistic for a mean (when population standard deviation is unknown).
- $z$-statistic for a proportion.
The key idea:
If $H_0$ is true, the test statistic behaves like a random draw from a known distribution. We compare the statistic we computed to this distribution to judge how unusual our sample is under $H_0$.
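To make this concrete, here is a minimal sketch of a $z$-statistic for a single mean, assuming the population standard deviation is known; the claimed mean, the standard deviation, and the sample values are all made-up illustration inputs (the formula itself is covered in later chapters):

```python
import math

# Hypothetical inputs (illustration only): claimed mean, known population
# standard deviation, and a small sample of measurements.
mu0 = 10.0          # value claimed under H0
sigma = 1.5         # population standard deviation, assumed known
sample = [9.2, 10.1, 8.8, 9.5, 10.4, 9.0, 9.7, 9.3]

n = len(sample)
x_bar = sum(sample) / n

# z measures how many standard errors the sample mean lies from mu0.
z = (x_bar - mu0) / (sigma / math.sqrt(n))
print(f"sample mean = {x_bar:.3f}, z = {z:.3f}")
```

If $H_0$ is true, this $z$ behaves like a draw from the standard normal distribution, which is what lets us judge how unusual the sample is.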
4. Significance Level and Decision Rule
Before looking at the data, we choose a significance level $\alpha$ (alpha). This is a threshold that controls how strong the evidence must be to reject $H_0$.
Typical choices: $\alpha = 0.05$, $0.01$, or $0.10$.
Interpretation of $\alpha$:
- $\alpha$ is the maximum probability we accept of rejecting $H_0$ when it is actually true (a false alarm).
- Smaller $\alpha$ means a stricter standard of evidence.
Using $\alpha$ and the sampling distribution under $H_0$, we set a rejection region (also called a critical region):
- For a right-sided test, we reject $H_0$ if the test statistic is too large (in the right tail).
- For a left-sided test, we reject if the test statistic is too small (in the left tail).
- For a two-sided test, we reject if the test statistic is too extreme in either direction (both tails).
The critical value(s) are the cutoffs separating “not extreme enough” from “extreme enough” given $\alpha$.
Example (conceptual):
- Two-sided test with $\alpha = 0.05$ using a $z$-statistic.
- The critical values are approximately $z = -1.96$ and $z = 1.96$.
- Decision rule:
- If $z < -1.96$ or $z > 1.96$, reject $H_0$.
- Otherwise, fail to reject $H_0$.
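A minimal sketch of this decision rule in Python, using scipy to look up the critical value (the test statistic here is an illustrative value, not a result from real data):

```python
from scipy.stats import norm

alpha = 0.05
# Two-sided test: split alpha across both tails to get the critical value.
z_crit = norm.ppf(1 - alpha / 2)   # about 1.96 for alpha = 0.05

z = -0.943  # test statistic computed from the sample (illustrative value)

if abs(z) > z_crit:
    print(f"|z| = {abs(z):.3f} > {z_crit:.3f}: reject H0")
else:
    print(f"|z| = {abs(z):.3f} <= {z_crit:.3f}: fail to reject H0")
```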
5. The $p$-Value
Instead of (or in addition to) using critical values, we can use the $p$-value.
The $p$-value is:
The probability, assuming $H_0$ is true, of obtaining a test statistic at least as extreme as the one we actually observed, in the direction(s) specified by $H_1$.
Key points:
- A small $p$-value means our sample would be very unlikely if $H_0$ were true. This is evidence against $H_0$.
- A large $p$-value means our sample is not surprising under $H_0$. We do not have strong evidence against $H_0$.
The decision rule using a $p$-value:
- If $p \le \alpha$, reject $H_0$.
- If $p > \alpha$, fail to reject $H_0$.
Relation to tails:
- For a two-sided test, the $p$-value includes extreme results on both ends of the distribution.
- For one-sided tests, the $p$-value only considers one tail.
Interpretation example:
- $p = 0.03$ with $\alpha = 0.05$:
There is only a 3% chance of seeing data this extreme or more extreme if $H_0$ is true. Since $0.03 < 0.05$, we reject $H_0$.
- $p = 0.20$ with $\alpha = 0.05$:
Data this extreme or more extreme would happen 20% of the time if $H_0$ is true. That is not unusual; we fail to reject $H_0$.
Note: A $p$-value is not the probability that $H_0$ is true. It is a probability about the sample data under the assumption that $H_0$ is true.
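As a sketch, assuming the test statistic follows a standard normal distribution under $H_0$, all three kinds of $p$-values can be computed from tail areas (the observed $z$ is an illustrative value):

```python
from scipy.stats import norm

z = 2.17  # observed test statistic (illustrative value)

# Each p-value asks: how likely is a result at least this extreme under H0,
# in the direction(s) that H1 cares about?
p_right = norm.sf(z)            # H1: parameter > value (right tail)
p_left = norm.cdf(z)            # H1: parameter < value (left tail)
p_two = 2 * norm.sf(abs(z))     # H1: parameter != value (both tails)

print(f"right-sided p = {p_right:.4f}")   # small: z lies far in the right tail
print(f"left-sided  p = {p_left:.4f}")    # large: z is not extreme on the left
print(f"two-sided   p = {p_two:.4f}")     # twice the smaller tail area
```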
6. Types of Errors
Because we base our decision on a sample, mistakes are always possible.
Two kinds of errors can occur:
- Type I error:
Rejecting $H_0$ when $H_0$ is actually true.
Think: “False positive” or “false alarm.”
- Type II error:
Failing to reject $H_0$ when $H_1$ is actually true.
Think: “False negative” or “missing a real effect.”
Let:
- $\alpha = P(\text{Type I error})$
- $\beta = P(\text{Type II error})$
Then:
- We choose $\alpha$ in advance (for example $\alpha = 0.05$).
- $\beta$ depends on:
- The true value of the parameter under $H_1$,
- The sample size,
- The variability in the data,
- The test procedure.
The power of a test is:
$$
\text{Power} = 1 - \beta
$$
Power is the probability that the test correctly rejects $H_0$ when $H_1$ is true. Higher power means we are more likely to detect a real effect.
Trade-off:
- Lowering $\alpha$ (making it harder to reject $H_0$) usually increases $\beta$ (and reduces power), unless we increase sample size.
- Increasing the sample size generally reduces $\beta$ (and so raises power) for a fixed $\alpha$, easing the trade-off.
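A small simulation can make $\alpha$, $\beta$, and power concrete. The sketch below, with made-up values for the true means, standard deviation, and sample size, estimates the Type I error rate by sampling under $H_0$ and the power by sampling under one specific alternative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, sigma, n, trials = 0.05, 1.0, 25, 100_000
mu0, mu1 = 0.0, 0.5               # H0 value and one specific H1 value
z_crit = norm.ppf(1 - alpha)      # right-sided critical value

def reject_rate(true_mu):
    # Draw many samples with the given true mean and count how often
    # the right-sided z-test rejects H0: mu = mu0.
    samples = rng.normal(true_mu, sigma, size=(trials, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    return np.mean(z > z_crit)

print(f"Type I error rate (true mu = mu0): {reject_rate(mu0):.3f}")  # ~ alpha
print(f"Power            (true mu = mu1): {reject_rate(mu1):.3f}")   # 1 - beta
```

Rerunning with a larger $n$ shows the power rising while the Type I rate stays pinned near $\alpha$, which is exactly the trade-off described above.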
Practical choices:
- In serious situations where a false positive is very costly (for example, approving a dangerous drug), we use a very small $\alpha$ (like $0.01$).
- When missing a real effect would be worse (for example, screening for a serious but treatable disease), we want high power (low $\beta$).
7. Steps of a Hypothesis Test
In practice, a hypothesis test usually follows these steps:
- State the hypotheses.
- Identify the population parameter (mean, proportion, etc.).
- Write $H_0$ (with equality) and $H_1$ (direction or two-sided).
- Choose the significance level $\alpha$.
- Decide how strong the evidence must be to reject $H_0$.
- Select the appropriate test and test statistic.
- Decide whether to use a $z$-test, $t$-test, or another test, based on:
- Type of data,
- Sample size,
- What is known (e.g., population standard deviation).
- Compute the test statistic from the sample data.
- Calculate the $p$-value (or compare the statistic to critical values).
- Make a decision.
- If $p \le \alpha$, reject $H_0$.
- If $p > \alpha$, fail to reject $H_0$.
- State the conclusion in context.
- Relate the statistical decision back to the real-world question in plain language.
Example of a clear conclusion:
- “At the 5% significance level, there is sufficient evidence to conclude that the new drug has a higher cure rate than 60%.”
or
- “At the 5% significance level, we do not have sufficient evidence to conclude that the new drug has a higher cure rate than 60%.”
Notice the language “sufficient evidence” and “do not have sufficient evidence” instead of “prove” or “disprove.”
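Putting the steps together, here is a sketch of the drug cure-rate test from Section 2 ($H_0: p = 0.60$ vs. $H_1: p > 0.60$), using the large-sample $z$-statistic for a proportion (its formula and conditions appear in later chapters); the counts of cures and patients are invented for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Step 1: state the hypotheses. H0: p = 0.60 vs H1: p > 0.60 (right-sided).
p0 = 0.60
# Step 2: choose the significance level.
alpha = 0.05
# Steps 3-4: compute the large-sample z-statistic for a proportion
# (hypothetical data: 138 cures out of 200 patients).
cures, n = 138, 200
p_hat = cures / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
# Step 5: right-sided p-value.
p_value = norm.sf(z)
# Steps 6-7: decision and conclusion in context.
if p_value <= alpha:
    print(f"p = {p_value:.4f} <= {alpha}: sufficient evidence that the "
          "cure rate exceeds 60%")
else:
    print(f"p = {p_value:.4f} > {alpha}: insufficient evidence that the "
          "cure rate exceeds 60%")
```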
8. One-Sided vs Two-Sided Tests
The choice between a one-sided and a two-sided test is important and must be made before looking at the data.
- Use a two-sided test when:
- You are interested in any difference from the claimed value (higher or lower).
- There is no clear reason to suspect a change only in one direction.
- You want a more conservative test in terms of direction.
- Use a one-sided test when:
- The research question is explicitly directional (e.g., “Is the new method better?”).
- A change in the opposite direction is not of practical interest, or would be treated differently.
Effect on $p$-value:
- For the same test statistic, the one-sided $p$-value is typically half the two-sided $p$-value (for a symmetric null distribution, when the observed result lies in the direction $H_1$ specifies), because only one tail is counted.
- This means it can be easier to reject $H_0$ with a one-sided test, which is why it must be justified and decided in advance.
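A short sketch showing this effect for the same observed statistic (the value of $z$ is illustrative):

```python
from scipy.stats import norm

z = 1.80  # same observed statistic, in the direction H1 predicts

p_one = norm.sf(z)             # one-sided (right tail only)
p_two = 2 * norm.sf(abs(z))    # two-sided (both tails)

print(f"one-sided p = {p_one:.4f}")   # ~0.036: rejects at alpha = 0.05
print(f"two-sided p = {p_two:.4f}")   # ~0.072: fails to reject
```

The same data reject $H_0$ under one framing and not the other, which is precisely why the choice must be made before looking at the data.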
9. Practical Considerations and Common Misinterpretations
Some important points to keep in mind:
- “Fail to reject $H_0$” is not the same as “accept $H_0$ as true.”
It means the data do not provide strong enough evidence against $H_0$, not that we have proved $H_0$.
- A small $p$-value does not measure the size or importance of an effect.
It only measures how inconsistent the data are with $H_0$. A tiny difference can have a very small $p$-value if the sample size is large.
- A large $p$-value does not prove there is no effect.
It could be that the test has low power (for example, due to a small sample).
- The significance level $\alpha$ does not adapt to the data.
It is a threshold chosen in advance, not adjusted afterwards to fit the result.
- Statistical significance is not the same as practical or real-world significance.
Even if $H_0$ is rejected, you should still consider whether the detected difference matters in the real context.
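The point about sample size can be seen directly. The sketch below, with a made-up and practically negligible effect, shows how the same tiny difference produces an ever smaller $p$-value as $n$ grows:

```python
import numpy as np
from scipy.stats import norm

# A fixed, practically tiny effect: true mean 100.2 vs claimed mean 100,
# with standard deviation 10 (all values hypothetical).
mu0, effect, sigma = 100.0, 0.2, 10.0

for n in (100, 10_000, 1_000_000):
    # Expected z-statistic if the sample mean lands exactly on the true mean.
    z = effect / (sigma / np.sqrt(n))
    p = 2 * norm.sf(z)
    print(f"n = {n:>9,}: z = {z:6.2f}, two-sided p = {p:.2e}")
```

At $n = 100$ the effect is statistically invisible ($p \approx 0.84$); at $n = 1{,}000{,}000$ it is overwhelmingly “significant,” even though a 0.2-unit shift may matter to no one.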
10. Overview of Common Tests (Conceptual)
This chapter focuses on the general framework of hypothesis testing, not on specific formulas. However, it is useful to know some common hypothesis tests that fit into this framework:
- Test for a single population mean (using $z$ or $t$).
- Test for a single population proportion.
- Test for the difference between two means.
- Test for the difference between two proportions.
- Tests for goodness-of-fit or independence (such as chi-square tests).
Each of these uses the same core structure:
- Set $H_0$ and $H_1$.
- Choose $\alpha$.
- Compute an appropriate test statistic.
- Find the $p$-value or compare to critical values.
- Make a decision and interpret it.
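As one illustration of this shared structure, scipy exposes several of these tests behind the same statistic-plus-$p$-value interface; the sketch below runs three of them on simulated, hypothetical data:

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, chi2_contingency

rng = np.random.default_rng(1)
a = rng.normal(10.2, 1.0, size=30)   # hypothetical sample
b = rng.normal(9.8, 1.0, size=30)    # hypothetical second sample

# Single mean: H0: mu = 10 (two-sided t-test).
res = ttest_1samp(a, popmean=10.0)
print(f"one-sample t: statistic = {res.statistic:.3f}, p = {res.pvalue:.4f}")

# Difference of two means: H0: mu_a = mu_b.
res = ttest_ind(a, b)
print(f"two-sample t: statistic = {res.statistic:.3f}, p = {res.pvalue:.4f}")

# Independence in a 2x2 table (chi-square test), with hypothetical counts.
table = [[30, 10], [20, 25]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square:   statistic = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
```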
Later chapters will provide the details, formulas, and conditions for each specific test.