Module 3 · The p-value

"Statistically significant, p = 0.02." So the drug works — right?

A manufacturer's submission lands on your desk. The headline result is trumpeted in bold: the new drug beat placebo, p = 0.02 — statistically significant. The implication is clear: this is real, believe it.

That little number, p, is about to become one of the most important — and most misunderstood — things you'll ever read in a study.

p = 0.02. Does that mean there's only a 2% chance the drug doesn't work?

The p-value answers a surprisingly narrow question — and it is NOT "what's the chance the drug doesn't work?" To see what it really asks, we have to start by imagining a world where the drug does nothing at all.

The null hypothesis: the boring explanation

Before you can ask whether a result is real, you need something to compare it against. In statistics, that something is deliberately dull: the assumption that nothing is going on.

This is the null hypothesis — the boring explanation. The drug has no effect. The two groups are really the same. Any difference you saw is just the luck of the draw — our old friend chance from earlier in this block, the wobble that comes from measuring a sample instead of the whole truth.

Here's the move at the heart of all of this: we don't try to prove the drug works. Instead, we assume it doesn't — and then ask whether the data we collected would be surprising in that no-effect world. If the data would be very surprising under "nothing's going on," that's our reason to doubt "nothing's going on."

Everything about p-values flows from that one backwards-feeling idea: start by assuming the boring explanation, then see if the data embarrasses it.

The question p actually answers

So here is the exact question — read it slowly, because every word matters:

If the drug truly did nothing, how often would chance alone produce a difference as big as the one I saw?

That "how often" is the p-value. Nothing more, nothing less.

If chance would rarely fake a gap this big — say, 2 times in 100 — then your result is hard to explain away as luck. Evidence against "nothing's going on."
If chance would routinely fake a gap this big — say, 40 times in 100 — then what you saw is unremarkable. Could easily be noise.

Notice what the question is about: it's about the data, in a hypothetical no-effect world. It is NOT about the probability that the no-effect world is the real one. Hold that distinction — we'll come back and sharpen it until it's unmistakable.

See it: the no-effect world

Let's build that no-effect world and watch chance do its thing. Below, the drug truly does nothing — the real difference between groups is exactly zero. But each study still measures a sample, so each one finds some gap, just from luck.

Run a few studies in this no-effect world, then run a hundred. Watch where the gaps land — and compare them to the result you actually got (+6, the line).

Try this: run ×100 and watch the gaps pile up into a bell centred on zero. Then look at how many of those pure-chance studies reached all the way out to your +6.

See the shaded tail? Only about 5 in 100 no-effect studies produced a gap at least as big as yours. That 5% is your p-value: how often pure chance alone would have matched or beaten your result. Small, but not vanishingly so. Your result is unusual in a no-effect world, which is exactly why it counts as evidence that the world isn't no-effect.
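If you'd like to rerun the no-effect world yourself, here is a minimal sketch. It assumes each study's gap is a normal wobble around zero with a standard error of 3 mmHg (borrowed from the worked example in this lesson), and it counts gaps at least as far from zero as 6 in either direction. The names and the seed are illustrative choices, not part of the lesson's own simulator.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

SE = 3.0             # assumed standard error of the gap, in mmHg
OBSERVED_GAP = 6.0   # the result we actually got
N_STUDIES = 100_000  # how many no-effect studies to simulate

# Each simulated study lives in the no-effect world:
# the true gap is 0, and the observed gap wobbles by chance alone.
gaps = [random.gauss(0.0, SE) for _ in range(N_STUDIES)]

# Count studies whose gap is at least as far from zero as ours (either direction).
as_extreme = sum(1 for g in gaps if abs(g) >= OBSERVED_GAP)
p_simulated = as_extreme / N_STUDIES

print(f"{as_extreme} of {N_STUDIES} no-effect studies reached ±{OBSERVED_GAP:g}")
print(f"simulated p ≈ {p_simulated:.3f}")
```

Under these assumptions the simulated share should land a little under 0.05, which is the "about 5 in 100" from the picture.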

Name it (and the courtroom)

You've just watched a p-value being born. Now the careful definition:

The p-value is the probability of getting a result at least as extreme as the one observed, if the null hypothesis were true.

A courtroom makes the logic click. The null hypothesis is "the defendant is innocent." The evidence is your data. The p-value asks: if this person were truly innocent, how surprising would this evidence be? A tiny p means the evidence would be very strange for an innocent person — so we doubt innocence.

But — and this is the whole lesson — that is NOT the same as "the probability the defendant is innocent." That depends on much more: how plausible guilt was to begin with, what other evidence exists. The p-value only ever speaks about the evidence assuming innocence — never about the probability of innocence given the evidence.

p answers: "how surprising is my data, if nothing's going on?" It does NOT answer: "what's the chance nothing's going on?" Those two questions feel identical. They are completely different.
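A quick count shows how far apart the two questions can sit. The sketch below is purely hypothetical: it imagines 1,000 candidate drugs, assumes only 10% of them truly work, assumes a working drug reaches p < 0.05 in 80% of trials, and assumes a useless one does so in 5%. Every one of those numbers is an assumption chosen for illustration, and the calculation looks at "significant" results as a group rather than at one exact p.

```python
# Hypothetical pipeline of candidate drugs; every number below is an assumption.
n_drugs = 1000
share_that_work = 0.10  # how plausible an effect was to begin with
power = 0.80            # chance a working drug reaches p < 0.05
alpha = 0.05            # chance a useless drug reaches p < 0.05 by luck

working = n_drugs * share_that_work
useless = n_drugs - working

true_alarms = working * power   # working drugs flagged as significant
false_alarms = useless * alpha  # useless drugs flagged as significant

# Among all "significant" results, what share came from drugs that do nothing?
share_null_given_significant = false_alarms / (true_alarms + false_alarms)
print(f"{share_null_given_significant:.0%} of significant results come from useless drugs")
```

With these made-up inputs, roughly a third of the "significant" results belong to drugs that do nothing, even though each useless drug had only a 5% chance of getting there. Change the starting plausibility and the answer moves, which is exactly why p alone cannot tell you the chance that nothing's going on.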

Where the number comes from (the formula)

So where does a specific p — 0.02, 0.05 — actually come from? You already built the machinery two lessons ago.

Take your observed difference and measure it in standard errors — how many SEs it sits away from zero (the no-effect value). That count has a name, the test statistic:

z = observed difference ÷ standard error

Then translate that distance into a p-value using the normal curve's ruler — the 68 / 95 / 99.7 rule from earlier:

About 95% of a normal curve sits within 2 SE of the centre. So a result 2 SE out lands in the outer 5% → p ≈ 0.05.
A result 3 SE out lands in the outer 0.3% → p ≈ 0.003.
A result only 1 SE out is unremarkable — about a third of chance results reach that far → p ≈ 0.32.

That famous "p < 0.05" threshold? It's literally just "about 2 standard errors away from no effect." Let's compute one.

Worked example: difference = 6 mmHg, SE = 3

z = 6 ÷ 3 = 2 → ~2 SE out → outer 5% → p ≈ 0.05 (borderline significant).

Your turn. Same SE, bigger gap: difference = 9 mmHg, SE = 3.

z = 9 ÷ 3 = ?

z = 3 → ~3 SE out → outer 0.3% → p ≈ 0.003 (clearly significant). Same SE, a bigger gap, and the result marches out into the tail with p shrinking fast. (The exact p comes from software or tables. What you just did by hand is the part that matters: turning a difference into "how many SEs from nothing," which is the whole idea.)
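For the "software or tables" step, here is a minimal sketch of what the software does. It assumes a two-sided test against a normal curve, and the function name two_sided_p is just an illustrative choice.

```python
import math


def two_sided_p(difference: float, se: float) -> tuple[float, float]:
    """Return (z, p) for an observed difference and its standard error."""
    z = difference / se
    # Area in both tails of a standard normal curve beyond |z|.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# The same SE of 3 mmHg with gaps of 3, 6 and 9 mmHg (1, 2 and 3 SE out).
for diff in (3.0, 6.0, 9.0):
    z, p = two_sided_p(diff, se=3.0)
    print(f"difference = {diff:g} mmHg, z = {z:g}, p = {p:.4f}")
```

The exact tail areas come out at roughly 0.32, 0.046 and 0.003, so the by-hand ruler (0.32, 0.05, 0.003) is a fair rounding.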

The 0.05 line: a convention, not a law

By tradition, researchers draw a line at p = 0.05 and call anything below it "statistically significant," anything above it "not significant."

It's worth knowing where that line came from: in the 1920s the statistician R. A. Fisher suggested 1-in-20 as a convenient cut-off, and the habit stuck. There is nothing magic about 0.05. p = 0.049 and p = 0.051 describe almost identical evidence, yet one gets a triumphant "significant!" and the other a disappointed "null result." That cliff-edge is a human convention, not a fact of nature.
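To put a number on "almost identical," the sketch below converts each p back into a distance in standard errors. It assumes a two-sided test on a normal curve, and the helper name z_for_two_sided_p is illustrative.

```python
from statistics import NormalDist


def z_for_two_sided_p(p: float) -> float:
    """How many SEs from zero give this two-sided p under a normal curve."""
    return NormalDist().inv_cdf(1 - p / 2)


z_just_under = z_for_two_sided_p(0.049)
z_just_over = z_for_two_sided_p(0.051)

print(f"p = 0.049 -> z = {z_just_under:.3f}")
print(f"p = 0.051 -> z = {z_just_over:.3f}")
print(f"gap between them: {z_just_under - z_just_over:.3f} SE")
```

Both results sit a shade under 2 SE from no effect, separated by only a few hundredths of a standard error. The labels differ far more than the evidence does.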

Treat "significant" as a rough flag — "this would be fairly unusual by chance alone" — not a verdict of truth. We'll see in a later lesson just how much trouble that arbitrary line causes when people treat it as sacred.

What p does NOT mean

The p-value is surrounded by myths — repeated even by people who should know better. Here are five statements about a result reported as p = 0.04. Most are myths. Find the one that's actually accurate, and learn why each myth is wrong.

The two traps that matter most for HTA

Of all those myths, two will cost you real money and real decisions if you fall for them. Burn them in:

Trap 1 — "Not significant" is read as "no effect."

A manufacturer's drug fails to beat the comparator at p = 0.08, and someone concludes it doesn't work. But the trial may simply have been too small to demonstrate a real benefit. The same mistake runs the other way just as often: a harm, or a difference from a competitor, goes unreported because it "wasn't significant." Non-significant means unproven, in either direction. It does not mean absent.
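To see how easily a real effect can come back "not significant," here is a minimal simulation sketch. It assumes a world where the drug truly works (a real gap of 4 mmHg) but the trial is small enough that the standard error is 3 mmHg. Both numbers, the seed and the variable names are assumptions for illustration.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable

TRUE_GAP = 4.0       # assumed real benefit, in mmHg
SE = 3.0             # assumed standard error for a small trial
N_STUDIES = 100_000  # how many small trials to simulate
Z_CUTOFF = 1.96      # |z| needed for a two-sided p below 0.05

significant = 0
for _ in range(N_STUDIES):
    # The drug really works, but each trial still wobbles by chance.
    observed_gap = random.gauss(TRUE_GAP, SE)
    if abs(observed_gap / SE) >= Z_CUTOFF:
        significant += 1

share_significant = significant / N_STUDIES
print(f"share of trials reaching p < 0.05: {share_significant:.2f}")
```

Under these assumptions only about a quarter of the trials reach significance. The other three quarters would report "no significant difference" for a drug that genuinely works.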

Trap 2 — A tiny p is sold as a big benefit.

A trial in 50,000 patients shows the drug lowers a number by a hair, p < 0.001. The p is microscopic because the sample is enormous — but the actual benefit may be far too small to matter to any patient. Significant is not the same as substantial.
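The sketch below shows the mechanism with made-up numbers: the same small difference (0.5 units) tested at growing sample sizes. It assumes two equal arms, a patient-to-patient standard deviation of 10, and a two-sided test on a normal curve. The function name and all the inputs are illustrative assumptions.

```python
import math


def two_sided_p_for_trial(difference: float, sd: float, n_per_arm: int) -> float:
    """Two-sided p for a difference in means between two equal-sized arms."""
    # Standard error of a difference between two means with equal SD and n.
    se = sd * math.sqrt(2 / n_per_arm)
    z = difference / se
    return math.erfc(abs(z) / math.sqrt(2))


DIFFERENCE = 0.5  # assumed effect: identical in every trial below
SD = 10.0         # assumed patient-to-patient standard deviation

results = {n: two_sided_p_for_trial(DIFFERENCE, SD, n)
           for n in (50, 500, 5_000, 25_000)}

for n, p in results.items():
    print(f"{n:>6} per arm ({2 * n:>6} total): p = {p:.2g}")
```

The effect never changes, yet p slides from unremarkable at 50 per arm to far below 0.001 at 25,000 per arm (50,000 patients in total). The sample size did that, not the drug.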

p tells you how hard a result is to explain by chance alone. It tells you nothing about whether the effect is big enough to care about. Those are two separate questions, and we tackle the second one head-on soon.

Why this matters for HTA

This lands on your desk constantly. "Statistically significant" is one of the most powerful — and most abused — phrases in a submission, and your job is to read past it to what it actually supports.

When you see "significant, p < 0.05," ask: significant, yes — but how big is the effect, and does it matter clinically? A real effect can be too small to fund.
When you see "no significant difference," ask: was that a genuine finding of equivalence, or just an underpowered study that couldn't detect a difference — or a harm — that's really there?
When you see a dramatic p from a huge trial, stay calm: enormous samples manufacture tiny p-values out of trivial effects.

A p-value is a smoke alarm, not a fire report. It tells you something may be there. It never tells you how big the fire is, and a silent alarm never proves there is no fire.

The p-value, in one breath

The null hypothesis is the boring explanation: no effect, the gap is just chance.
The p-value = the probability of data at least this extreme, if the null were true — how often chance alone would fake your result.
It comes from distance: z = difference ÷ SE; about 2 SE out → p ≈ 0.05 (straight from the 68 / 95 / 99.7 rule).
p is NOT the probability the null is true, NOT the chance it's "just luck," NOT a measure of effect size, NOT a promise of replication.
p > 0.05 means unproven, not no effect; a tiny p means hard to blame on chance, not large or important.
The 0.05 line is a convention, not a law of nature.

p answers "how surprising is my data if nothing's going on?" — never "what's the chance nothing's going on?"

A p-value gives you a blunt yes/no: is this likely more than chance? But a yes/no throws away almost everything useful — it never tells you how big the effect is, or how uncertain. Next, the tool that fixes that, and that good HTA leans on far more than the p-value: the confidence interval — an honest range for the truth, not a verdict.