Module 2 · Internal & External Validity

A flawless trial whose result you're not allowed to use.

A new drug is tested in a textbook-perfect randomised trial: properly randomised, allocation concealed, everyone blinded, thousands of patients, a rock-solid result. The drug clearly works.

Then you notice the fine print. Every patient in the trial was aged 18–55, with no other illnesses, no other medications. Your patient is 79, with diabetes, kidney trouble, and a fistful of daily pills.

The trial was flawless. Can you trust its result for your 79-year-old patient?

A study can be impeccable and still useless to you — if it was impeccable about the wrong people. "Is this result correct?" and "Is this result about my patients?" are two completely different questions. Today you learn to ask both.

Two questions, not one

Every study faces two separate tests, and passing one says nothing about the other.

Internal validity — is the result true here?

Within the study itself, is the measured effect real — or an artefact of bias, confounding or chance? This is what all of M2 so far has been about: randomisation, blinding, the three enemies. Internal validity is truth inside the study.

External validity — is the result true there?

Does it carry beyond the study, to the patients, settings and conditions of the real world — your world? Also called generalisability. External validity is reach outside the study.

One asks "did they get the right answer?" The other asks "is it the answer to my question?" A trial can ace either one and flunk the other.

Four kinds of study

Because they're independent, every study lands in one of four quadrants — not on a single scale from bad to good.

Grid: internal validity (high to low) on one axis, external validity (low to high) on the other.

High internal / Low external (the trap quadrant): tight RCT in young healthy volunteers. True, but not about your patients.
High internal / High external: large, realistic, rigorous trial. True AND relevant.
Low internal / Low external: small, sloppy, unrepresentative. Tells you little.
Low internal / High external: messy real-world study of the right patients. Relevant, but maybe biased.

Notice the trap quadrant: high internal, low external. It's the most dangerous, because the study looks authoritative — perfect methods, impressive numbers — and quietly answers a question you didn't ask.


The built-in tension

Here's the uncomfortable part — and it explains why high-internal, low-external is so common.

The very things that make a trial internally strong tend to make it externally weak. To shut out the three enemies, you control everything: narrow entry criteria (no confounding comorbidities), ideal conditions, perfect adherence, expert centres. Each of those controls buys you internal validity — and each one pulls the trial further from the messy reality where your patients actually live.

There's even a name for the two ends of this dial:

Explanatory (ideal conditions): ↑ internal validity, ↓ external validity.
Pragmatic (real-world conditions): ↑ external validity, but internal validity is harder to keep.

If this feels familiar, it should: this is exactly the efficacy-versus-effectiveness gap from M1, seen from the methods side. Efficacy is what a high-internal, controlled trial measures. Effectiveness is what external validity asks about. Same tension, named twice.

Five threats to generalisability

When you ask "does this transfer to my patients?", five specific things are worth checking. Each is a way a true result can fail to reach the real world:

1. The patients (inclusion criteria): were trial patients narrower, younger or healthier than yours? The commonest gap of all.
2. The comparator: was the drug tested against placebo or an outdated option, rather than the treatment your patients would actually get instead?
3. The conditions: expert centres, intensive monitoring and engineered adherence, versus ordinary clinics and real life?
4. The outcome: did the trial measure a surrogate (a lab number) instead of what patients care about (living longer, feeling better)? More on this in M6.
5. The time horizon: was follow-up months, when the decision needs years? A benefit that's real at six months may say nothing about five years.

Run a study through these five and you'll know not just whether it generalises, but where it breaks.
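The five checks can be written down as a simple checklist. This is a hypothetical illustration, not a standard appraisal tool: the `TransferCheck` fields and the `where_it_breaks` helper are names invented here, and each field is a yes/no judgement you make yourself after reading the trial report.

```python
from dataclasses import dataclass, fields


@dataclass
class TransferCheck:
    """Hypothetical yes/no answers to the five generalisability checks.

    True means the trial matches the decision you face on that point.
    """
    patients: bool      # inclusion criteria resemble your patients
    comparator: bool    # tested against what your patients would otherwise get
    conditions: bool    # delivered in settings like ordinary clinics
    outcome: bool       # measured what patients care about, not only a surrogate
    time_horizon: bool  # follow-up long enough for the decision


def where_it_breaks(check: TransferCheck) -> list[str]:
    """Return the names of the checks the trial fails, in checklist order."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


# The opening example: a rigorous trial in 18-55-year-olds with no other illnesses,
# judged against a 79-year-old with diabetes and kidney trouble.
# The other four answers are hypothetical.
trial = TransferCheck(patients=False, comparator=True, conditions=True,
                      outcome=True, time_horizon=True)
print(where_it_breaks(trial))  # ['patients']
```

An empty list does not prove the result transfers; it only means none of the five checks flagged a gap you could see.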


Where this sits in M2

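Pull the whole module together. You've been building two different skills without quite naming the split: judging whether a result is true inside the study (design, randomisation, blinding, the three enemies) is internal validity; judging whether it transfers to your patients is external validity.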

And here's the crucial part: randomisation buys you internal validity, not external. A coin toss makes the groups comparable; it does nothing to make the patients resemble yours. That's why even a perfect RCT can leave a generalisability gap wide open — and why a big, representative observational study sometimes tells you more about your patients than a narrow trial does. (Remember: a rung on the ladder is a starting presumption, not a verdict.)
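A small simulation makes the point concrete. It is a sketch with invented numbers, not data from any trial: it assumes a trial that enrols only 18-55-year-olds (ages drawn uniformly), randomises them into two arms, and compares both arms with the 79-year-old patient from the opening.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical trial population: 2,000 patients aged 18-55, mirroring narrow entry criteria.
ages = [rng.uniform(18, 55) for _ in range(2000)]

# Randomise: shuffle, then split in half into treatment and control arms.
rng.shuffle(ages)
treatment, control = ages[:1000], ages[1000:]

mean_t = statistics.mean(treatment)
mean_c = statistics.mean(control)
your_patient_age = 79

# Internal validity: the random split makes the arms comparable (their means land close together).
print(f"treatment arm mean age: {mean_t:.1f}")
print(f"control arm mean age:   {mean_c:.1f}")
# External validity: however well balanced, neither arm looks like your patient.
print(f"gap to your patient:    {your_patient_age - max(mean_t, mean_c):.1f} years or more")
```

Randomisation balanced the arms against each other; nothing in it moved them towards 79. Only the entry criteria decide who the trial is about.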

Internal validity asks: did they get it right? External validity asks: right for whom? You need a yes to both before a result should move a decision.

Why this matters for HTA

This is the backbone of how you read evidence in a submission, in two moves:

First, the internal question: is this result believable at all? Randomised? Concealed? Blinded where it mattered? Free of the three enemies? If no — stop; the number can't be trusted.

Then, the external question: even if true, is it about us? Our patients, our comparator, our conditions, the outcomes and timescales we care about? This is where the "generalisability gap" becomes one of the most powerful, and most common, challenges an assessor raises — because manufacturers naturally test their drug under the conditions that flatter it most.
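The order of the two moves matters: the external question is only worth asking once the internal one is passed. A minimal sketch of that logic, using a hypothetical `appraise` function whose inputs are your own judgements about the trial rather than anything computed from its data:

```python
def appraise(internally_valid: bool, generalisability_gaps: list[str]) -> str:
    """Hypothetical two-move reading of a submission's trial.

    internally_valid: your verdict on randomisation, concealment, blinding
        and the three enemies.
    generalisability_gaps: the transfer checks the trial fails
        (patients, comparator, conditions, outcome, time horizon).
    """
    # Move 1: if the result isn't believable, stop; whether it transfers is irrelevant.
    if not internally_valid:
        return "stop: the number can't be trusted"
    # Move 2: a believable result may still be about the wrong patients or setting.
    if generalisability_gaps:
        return "true, but not shown to be about us: " + ", ".join(generalisability_gaps)
    return "passes both tests"


print(appraise(True, ["comparator", "outcome"]))
# true, but not shown to be about us: comparator, outcome
```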

A submission's trial can be both unimpeachable and beside the point. Your job is to catch the studies that are perfectly true about the wrong patients, the wrong comparator, or the wrong outcome — and to say so.

Internal vs external validity, in one breath

Before a result can change a decision, it must pass both tests: is it true? — and is it true for us?

That completes the foundations of reading evidence. You can now judge a study's design, defend against the three enemies, and ask whether a result both holds and transfers. What you can't yet do is read the numbers a study reports — the effect sizes, the p-values, the confidence intervals. That's exactly where M3 begins: biostatistics, the language the evidence actually speaks.