Module 3 · Significance & Power
The drug that "didn't work" — until it did.
A promising new drug is tested against placebo. The result comes back not statistically significant — no clear benefit. The programme is shelved; the drug is written off as a failure.
Two years later, a much larger trial of the very same drug finds a clear, real benefit. The drug worked all along. So what went wrong the first time?
The first trial found 'no significant effect.' What's the most likely explanation?
A verdict is not the truth
Here's the mental model that fixes almost everything. There are two separate things:
- The truth: does the drug really have an effect, or not? (We never get to see this directly.)
- The verdict: did our study come back "significant" or "not significant"?
These don't always agree. Cross them and you get four possibilities: two where the verdict matches the truth, and two where the test gets it wrong.
The test can raise a false alarm — shout "effect!" when there's really nothing there. Or it can miss — stay silent when there's a real effect to be found. Same test, two completely different ways to be fooled. Let's map them.
The two errors
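The grid has four squares. Rows are the hidden truth; columns are your study's verdict.
- Real effect, verdict "significant": a correct detection.
- Real effect, verdict "not significant": a miss (Type II error).
- No effect, verdict "significant": a false alarm (Type I error).
- No effect, verdict "not significant": a correct all-clear.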
Two errors, pulling in opposite directions. The first (false alarm) is controlled by your significance threshold. The second (miss) is controlled by something we haven't met yet — power.
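The two errors can be watched directly in a small simulation. The sketch below is illustrative only: the sample size, the standard deviation of 10, the true effect of 5 and the helper names are all assumptions made up for the example, and it uses a simplified z-test with a known standard deviation rather than whatever test a real trial would use.

```python
import random
from statistics import NormalDist

def trial_is_significant(true_effect, n_per_arm, sd, rng, alpha=0.05):
    """Simulate one two-arm trial; return True if a two-sided z-test says 'significant'."""
    placebo = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
    treated = [rng.gauss(true_effect, sd) for _ in range(n_per_arm)]
    diff = sum(treated) / n_per_arm - sum(placebo) / n_per_arm
    se = sd * (2 / n_per_arm) ** 0.5          # known-SD simplification (assumption)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return p_value < alpha

def significant_rate(true_effect, n_per_arm=30, sd=10.0, trials=2000, seed=1):
    """Fraction of simulated trials that come back 'significant'."""
    rng = random.Random(seed)
    hits = sum(trial_is_significant(true_effect, n_per_arm, sd, rng)
               for _ in range(trials))
    return hits / trials

# Truth: no effect at all. Every "significant" verdict is a false alarm.
false_alarm_rate = significant_rate(true_effect=0.0)

# Truth: a real effect. Every "not significant" verdict is a miss.
miss_rate = 1 - significant_rate(true_effect=5.0)

print(f"false alarms when there is no effect: {false_alarm_rate:.1%}")
print(f"misses when the effect is real:       {miss_rate:.1%}")
```

With these made-up settings the false-alarm rate should land close to the 0.05 threshold, while the miss rate is far higher. The threshold caps one error and says nothing about the other.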
Power: the ability to see
Power is a study's ability to detect a real effect when one genuinely exists. If a drug truly works, power is the probability your trial will actually come back "significant" and catch it. (Formally, power = 1 − β, where β is the miss rate, the probability of a Type II error.)
A study with low power is like a blurry camera: even when there's something to photograph, it often comes back with nothing. A "not significant" result from such a study tells you almost nothing, because it would probably have missed the effect even if it were there.
What raises power? Three things:
- Sample size — by far the biggest lever. More patients, more power. (The √n, yet again.)
- The size of the real effect — big effects are easy to catch; tiny ones need huge studies.
- Less noise — cleaner measurements, less variability.
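All three levers show up in a standard back-of-envelope power formula. The sketch below uses the normal approximation for a two-arm trial with equal arms and a known standard deviation; the function name and the example numbers are assumptions for illustration, not values from any real study.

```python
from statistics import NormalDist

def power_two_arm(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two equal-sized arms."""
    norm = NormalDist()
    se = sd * (2 / n_per_arm) ** 0.5        # standard error shrinks with sqrt(n)
    z_crit = norm.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    shift = abs(effect) / se                # true effect measured in standard errors
    # Chance the test statistic lands beyond either critical value.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Lever 1: sample size (effect and noise held fixed).
for n in (25, 100, 400):
    print(f"n per arm = {n:>3}: power = {power_two_arm(6, 20, n):.2f}")

# Lever 2: size of the real effect (n and noise held fixed).
for effect in (2, 6, 12):
    print(f"effect = {effect:>2}: power = {power_two_arm(effect, 20, 100):.2f}")

# Lever 3: noise (n and effect held fixed).
for sd in (30, 20, 10):
    print(f"sd = {sd}: power = {power_two_arm(6, sd, 100):.2f}")
```

Sample size enters through a square root, so halving the standard error takes four times as many patients.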
The crucial, counterintuitive part: a non-significant result from an underpowered study is not evidence the drug doesn't work. It's evidence the study couldn't tell. Let's watch that happen.
Watch power grow
The drug below truly works: it really does lower the outcome by 6 mmHg. That never changes. All you're going to change is the sample size. Watch what happens to the verdict.
Example readout: 95% CI [-2.0, 14.0] mmHg, an interval that crosses zero.
Try this: start small. The interval sprawls across zero — "not significant," even though the drug genuinely works. Now drag N up and watch the verdict flip to "significant" and the power climb.
Nothing about the drug changed — it always worked. Yet at a small sample the verdict was "no significant effect," and at a large one it was "significant." The only thing that moved was power. So when you read "not significant," your first question is never "so it doesn't work?" — it's "was this study even big enough to find out?"
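The same flip can be reproduced on paper. The sketch below holds the true effect at 6 and only changes the sample size. The standard deviation of 20 is an assumption chosen for illustration (the lesson does not state one), and the interval shown is the one a "typical" trial would report, meaning a trial whose observed effect happens to equal the true effect.

```python
from statistics import NormalDist

TRUE_EFFECT = 6.0    # from the text: the drug truly lowers the outcome by 6
ASSUMED_SD = 20.0    # assumption for illustration, not given in the lesson

def expected_ci(n_per_arm, effect=TRUE_EFFECT, sd=ASSUMED_SD, level=0.95):
    """CI for a trial whose observed effect equals the true effect."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd * (2 / n_per_arm) ** 0.5
    return effect - half_width, effect + half_width

def verdict(ci):
    """'Significant' means the interval excludes zero."""
    low, high = ci
    return "not significant" if low <= 0 <= high else "significant"

for n in (48, 100, 200, 400):
    low, high = expected_ci(n)
    print(f"n per arm = {n:>3}: 95% CI [{low:5.1f}, {high:5.1f}] -> {verdict((low, high))}")
```

Under these assumed numbers, 48 patients per arm gives an interval of roughly [-2.0, 14.0], and the larger samples pull the interval clear of zero. The centre of the interval never moves; only its width does.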
Trap 1, closed: "not significant" ≠ "no effect"
Now you can fully dismantle the trap from the p-value lesson. When a study reports "no significant difference," there are two completely different things it could mean — and the confidence interval tells them apart:
- A narrow interval hugging zero: the study had the power to detect even a modest effect, and found essentially nothing. Genuine evidence of little or no effect.
- A wide interval sprawling across zero: the study was underpowered — too small to see anything clearly. No evidence either way. The effect could be large and simply missed.
A wide confidence interval is the visible fingerprint of low power. "Not significant" from a wide interval doesn't close the question — it reopens it, and the honest response is "we need a bigger study," not "the drug doesn't work."
This matters most where it's quietest: a real harm, or a real difference from a competitor, can vanish behind "not significant" simply because nobody powered the study to find it.
Significant vs important: two different questions
Now the opposite trap. Crank the sample size high enough and almost any effect — however trivial — becomes "statistically significant." With 100,000 patients, a drug that lowers blood pressure by 0.3 mmHg will sail past p < 0.05. Real? Yes. Worth anything to a patient? No.
This is the gap between two ideas people constantly merge:
- Statistical significance — the effect is probably real (not just chance).
- Clinical significance — the effect is big enough to matter to a patient.
They are not the same, and one does not imply the other. A huge trial can make a meaningless effect "highly significant." A small trial can leave a genuinely important effect "not significant."
Always ask both questions. First: is it real? (significance, the CI excluding zero). Then — and this is the one that decides funding — is it big enough to care about? And recall from the very first module: with a fixed budget, "real but tiny" is exactly the kind of benefit that isn't worth what it displaces.
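The 100,000-patient example is easy to check with a rough calculation. The sketch below assumes the patients are split equally between two arms, that blood pressure has a standard deviation of 15 mmHg (an assumption, not a figure from the lesson), and that the observed difference is exactly 0.3 mmHg.

```python
from statistics import NormalDist

def two_sided_p(effect, sd, n_per_arm):
    """Two-sided p-value of a z-test for a difference between two equal arms."""
    se = sd * (2 / n_per_arm) ** 0.5
    z = effect / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same trivial 0.3 mmHg difference at two very different trial sizes.
p_huge = two_sided_p(effect=0.3, sd=15.0, n_per_arm=50_000)   # 100,000 patients in total
p_small = two_sided_p(effect=0.3, sd=15.0, n_per_arm=100)     # 200 patients in total

print(f"100,000 patients: p = {p_huge:.4f}")
print(f"    200 patients: p = {p_small:.4f}")
```

The effect is identical in both rows. Only the sample size differs, which is why the p-value alone cannot tell you whether 0.3 mmHg is worth paying for.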
Read the result
Put both traps together. Each result below pairs an effect with its study size and CI. Tap what it really tells you.
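One way to practise is to write the reading rules down explicitly. The sketch below is a simplification, not a formal method: `read_result` and its labels are hypothetical, and `smallest_important_effect` stands for a clinical judgment about the smallest benefit (or harm) that would matter to a patient, which the statistics cannot supply.

```python
def read_result(ci_low, ci_high, smallest_important_effect):
    """Turn a confidence interval into a plain-language reading."""
    m = smallest_important_effect
    if ci_low <= 0 <= ci_high:
        # Not significant: the interval includes zero.
        if -m < ci_low and ci_high < m:
            return "not significant, narrow CI: evidence of little or no effect"
        return "not significant, wide CI: underpowered, question still open"
    # Significant: the whole interval sits on one side of zero.
    nearest = min(abs(ci_low), abs(ci_high))
    farthest = max(abs(ci_low), abs(ci_high))
    if nearest >= m:
        return "significant and big enough to matter"
    if farthest < m:
        return "significant but too small to matter"
    return "significant, but the CI cannot settle whether it is big enough to matter"

# Made-up intervals in mmHg, read against a hypothetical threshold of 5 mmHg.
for ci in [(-2.0, 14.0), (-0.8, 1.1), (0.1, 0.5), (6.0, 12.0), (1.0, 9.0)]:
    print(ci, "->", read_result(*ci, smallest_important_effect=5.0))
```

Note that the verdict ("significant" or not) only picks the branch. The interval's width and position do the rest of the work.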
The winner's curse
One last, subtle trap — and it explains why so many exciting early results fade.
Think about small studies that do reach significance. To clear the bar with few patients, the observed effect has to be large, and the easiest way to get a large observed effect in a small sample is a lucky, exaggerated draw. So the small studies that make the cut are systematically the ones that overstated the effect.
This is the winner's curse: a small study that reaches significance tends to overestimate how big the effect really is. The first dramatic result is usually too good to be true — and a larger, calmer study later brings it back to earth.
It's a major reason real effects "shrink" on replication, and why a single small, spectacular trial should make you cautious, not excited. (We'll meet its cousin — the way extreme results drift back toward average — again later.)
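The winner's curse falls straight out of a simulation. The sketch below runs many small trials of a drug whose true effect is 6, then looks only at the ones that reached significance in the direction of benefit. The standard deviation of 20, the 25 patients per arm and the function name are assumptions for illustration, and each trial's observed effect is drawn directly from its sampling distribution rather than patient by patient.

```python
import random
from statistics import NormalDist, fmean

def simulate_winners_curse(true_effect=6.0, sd=20.0, n_per_arm=25,
                           trials=20_000, seed=7):
    """Return (mean of all estimates, mean of significant estimates, share significant)."""
    rng = random.Random(seed)
    se = sd * (2 / n_per_arm) ** 0.5
    z_crit = NormalDist().inv_cdf(0.975)
    # One observed effect per simulated trial.
    observed = [rng.gauss(true_effect, se) for _ in range(trials)]
    # "Winners": trials that cleared the significance bar, showing a benefit.
    winners = [d for d in observed if d / se > z_crit]
    return fmean(observed), fmean(winners), len(winners) / trials

mean_all, mean_winners, share_significant = simulate_winners_curse()
print(f"true effect:                      6.0")
print(f"average estimate, all trials:     {mean_all:.1f}")
print(f"average estimate, winners only:   {mean_winners:.1f}")
print(f"share of trials that were winners: {share_significant:.1%}")
```

Across all trials the estimates average out near the truth. The inflation appears only once you filter on significance, which is exactly the filter that headlines and early publications apply.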
Why this matters for HTA
This is daily ammunition for reading a submission honestly:
- A drug shown "not superior" (not significant) — was the study powered to detect a difference that matters, or just too small? A wide CI means the question is open, not closed.
- A "highly significant" result from a massive trial — go to the effect size and its CI. Is it clinically meaningful, or a trivial change dressed up by an enormous sample?
- A small, spectacular early result — expect the winner's curse; the true effect is probably smaller. Wait for replication before betting on the headline number.
- A "significant" subgroup finding — treat with deep suspicion until you've asked how many comparisons were run (next lesson).
"Significant" answers only "is it probably real?" The decision needs two more answers it can't give: is it big enough to matter, and was the study even able to find out?
Significance, power & the traps, in one breath
- A test delivers a verdict, not the truth — and can fail two ways: a false alarm (Type I, set by the 0.05 threshold) or a miss (Type II).
- Power is the chance of catching a real effect — driven mainly by sample size. Low power = a study that can't see.
- "Not significant" often means underpowered, not "no effect"; a wide CI is the giveaway. Absence of evidence is not evidence of absence.
- Statistical significance ("probably real") is not clinical significance ("big enough to matter"); huge samples make trivial effects significant.
- The winner's curse: small significant studies systematically overstate the effect.
A verdict of "significant" or "not" is the beginning of reading a result — never the end. Ask: is it real, is it big enough to matter, and was the study able to tell?
Everything so far assumed one test, one question. But real trials test many things at once — multiple outcomes, multiple subgroups, multiple looks at the data. And once you're running many tests, false alarms stop being rare accidents and start becoming almost guaranteed. Next: multiplicity and p-hacking — how testing enough things manufactures "significance" out of pure noise.