Module 2 · Internal & External Validity

A flawless trial whose result you're not allowed to use.

A new drug is tested in a textbook-perfect randomised trial: properly randomised, allocation concealed, everyone blinded, thousands of patients, a rock-solid result. The drug clearly works.

Then you notice the fine print. Every patient in the trial was aged 18–55, with no other illnesses, no other medications. Your patient is 79, with diabetes, kidney trouble, and a fistful of daily pills.

The trial was flawless. Can you trust its result for your 79-year-old patient?

A study can be impeccable and still useless to you — if it was impeccable about the wrong people. "Is this result correct?" and "Is this result about my patients?" are two completely different questions. Today you learn to ask both.

Two questions, not one

Every study faces two separate tests, and passing one says nothing about the other.

Internal validity — is the result true here?

Within the study itself, is the measured effect real — or an artefact of bias, confounding or chance? This is what all of M2 so far has been about: randomisation, blinding, the three enemies. Internal validity is truth inside the study.

External validity — is the result true there?

Does it carry beyond the study, to the patients, settings and conditions of the real world — your world? Also called generalisability. External validity is reach outside the study.

One asks: did they get the right answer? The other asks: is it the answer to my question? A trial can ace either and flunk the other.

Four kinds of study

Because the two tests are independent, every study lands in one of four quadrants, not on a single scale from bad to good.

[Figure: two-by-two grid. Vertical axis: internal validity, high (top) to low (bottom). Horizontal axis: external validity, low (left) to high (right). The high-internal, low-external cell is marked as the trap quadrant.]

High internal, high external — a large, rigorous trial in realistic patients. Rare and precious: true and relevant.
High internal, low external — the flawless trial from the hook. Bulletproof, but in volunteers nothing like your patients. True here, useless there.
Low internal, high external — a messy real-world study of exactly the right patients. Relevant, but maybe biased — you can't fully trust the number.
Low internal, low external — small, sloppy, unrepresentative. Tells you almost nothing.

Notice the trap quadrant: high internal, low external. It's the most dangerous, because the study looks authoritative — perfect methods, impressive numbers — and quietly answers a question you didn't ask.
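If it helps to see the two-axis idea as logic rather than a picture, here is a minimal sketch. The `Study` class, its field names and the `quadrant` function are illustrative assumptions, not part of any appraisal tool; the two yes/no fields stand in for judgments an appraiser would make by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    # Hypothetical record: both flags are the appraiser's own judgments.
    name: str
    internally_valid: bool  # is the result true within the study?
    externally_valid: bool  # does it transfer to your patients?


def quadrant(study: Study) -> str:
    """Place a study in one of four quadrants; each axis is judged separately."""
    if study.internally_valid and study.externally_valid:
        return "high internal / high external: true and relevant"
    if study.internally_valid:
        return "high internal / low external: trap quadrant"
    if study.externally_valid:
        return "low internal / high external: relevant, but maybe biased"
    return "low internal / low external: tells you little"


# The flawless trial from the start of this lesson: sound methods, wrong patients.
opening_trial = Study(
    "RCT in healthy 18-55 year olds",
    internally_valid=True,
    externally_valid=False,
)
print(quadrant(opening_trial))
```

Note that the function never adds the two flags into a single score. That is the point of the quadrants: a strong answer on one axis cannot compensate for a weak answer on the other.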


The built-in tension

Here's the uncomfortable part — and it explains why high-internal, low-external is so common.

The very things that make a trial internally strong tend to make it externally weak. To shut out the three enemies, you control everything: narrow entry criteria (no confounding comorbidities), ideal conditions, perfect adherence, expert centres. Each of those controls buys you internal validity — and each one pulls the trial further from the messy reality where your patients actually live.

[Figure: a dial with two ends. Explanatory end (ideal conditions): internal validity up, external validity down. Pragmatic end (real-world conditions): external validity up, internal validity harder to keep.]

There's even a name for the two ends of this dial:

Explanatory trials sit at the controlled end — can this work, under ideal conditions? High internal validity, lower external.
Pragmatic trials sit at the realistic end — does this work, in ordinary practice? Higher external validity, but harder to keep internally tight. (You'll meet these properly in M11.)

If this feels familiar, it should: this is exactly the efficacy-versus-effectiveness gap from M1, seen from the methods side. Efficacy is what a high-internal, controlled trial measures. Effectiveness is what external validity asks about. Same tension, named twice.

Five threats to generalisability

When you ask "does this transfer to my patients?", five specific things are worth checking. Each is a way a true result can fail to reach the real world:

The patients (inclusion criteria) — Were trial patients narrower, younger, healthier than yours? The commonest gap of all.

The comparator — Was the drug tested against placebo or an outdated option — not the treatment your patients would actually get instead?

The conditions — Expert centres, intensive monitoring, engineered adherence — versus ordinary clinics and real life?

The outcome — Did the trial measure a surrogate (a lab number) instead of what patients care about (living longer, feeling better)? (More on this in M6.)

The time horizon — Was follow-up months, when the decision needs years? A benefit that's real at six months may say nothing about five.

Run a study through these five and you'll know not just whether it generalises, but where it breaks.
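The five checks can be written down as a small checklist. This is a sketch under stated assumptions: the function name `generalisability_gaps` and the answer format are hypothetical, and in the example only the "patients" answer comes from the trial at the start of this lesson; the other four answers are invented purely to make the example run.

```python
THREATS = ("patients", "comparator", "conditions", "outcome", "time horizon")


def generalisability_gaps(answers: dict[str, bool]) -> list[str]:
    """Return the threats where the trial does not match your decision context.

    answers maps a threat to True when the trial matches your setting.
    A threat that was never assessed counts as a gap, not as a pass.
    """
    return [threat for threat in THREATS if not answers.get(threat, False)]


example_answers = {
    "patients": False,      # from the lesson: trial enrolled healthy 18-55 year olds
    "comparator": True,     # assumed for illustration only
    "conditions": True,     # assumed for illustration only
    "outcome": True,        # assumed for illustration only
    "time horizon": True,   # assumed for illustration only
}
print(generalisability_gaps(example_answers))
```

Treating an unassessed threat as a gap is a deliberate design choice in this sketch: silence about the comparator or the follow-up period is not evidence that they match.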


Where this sits in M2

Pull the whole module together. You've been building two different skills without quite naming the split:

Everything about the three enemies, randomisation and blinding was about internal validity — making sure the effect is real for the people studied.
This lesson is about external validity — a separate audit of whether that real effect reaches the people you care about.

And here's the crucial part: randomisation buys you internal validity, not external. A coin toss makes the groups comparable; it does nothing to make the patients resemble yours. That's why even a perfect RCT can leave a generalisability gap wide open — and why a big, representative observational study sometimes tells you more about your patients than a narrow trial does. (Remember: a rung on the ladder is a starting presumption, not a verdict.)
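A small simulation can make the coin-toss point concrete. It is a sketch with invented inputs: the enrolment size, the even spread of ages and the fixed seed are all assumptions for illustration, and only the 18 to 55 entry window and the 79-year-old patient come from the opening example.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical enrolment: ages spread evenly across the 18-55 entry window.
ages = [rng.randint(18, 55) for _ in range(20_000)]

# Randomisation: a fair coin toss decides each patient's arm.
treated, control = [], []
for age in ages:
    (treated if rng.random() < 0.5 else control).append(age)

# Comparability between the arms (what randomisation buys).
gap_between_arms = abs(mean(treated) - mean(control))

# Distance from the oldest enrolled patient to your 79-year-old
# (what randomisation cannot change).
gap_to_your_patient = 79 - max(ages)

print(f"mean age, treated arm: {mean(treated):.1f}")
print(f"mean age, control arm: {mean(control):.1f}")
print(f"difference between arms: {gap_between_arms:.2f} years")
print(f"oldest trial patient is {gap_to_your_patient} years younger than yours")
```

The two arms end up with almost the same average age, which is exactly what makes the comparison fair. However many times you rerun the toss, nobody in either arm gets any older than the entry criteria allow.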

Internal validity asks: did they get it right? External validity asks: right for whom? You need a yes to both before a result should move a decision.

Why this matters for HTA

This is the backbone of how you read evidence in a submission, in two moves:

First, the internal question: is this result believable at all? Randomised? Concealed? Blinded where it mattered? Free of the three enemies? If no — stop; the number can't be trusted.

Then, the external question: even if true, is it about us? Our patients, our comparator, our conditions, the outcomes and timescales we care about? This is where the "generalisability gap" becomes one of the most powerful, and most common, challenges an assessor raises — because manufacturers naturally test their drug under the conditions that flatter it most.

A submission's trial can be both unimpeachable and beside the point. Your job is to catch the studies that are perfectly true about the wrong patients, the wrong comparator, or the wrong outcome — and to say so.
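The order of the two moves matters, and a short sketch shows why. The function `read_submission` and its inputs are hypothetical, a way of picturing the reading order rather than a real assessment procedure.

```python
def read_submission(internal_checks: dict[str, bool], external_gaps: list[str]) -> str:
    """Apply the internal question first; only then ask the external one."""
    failed = [name for name, passed in internal_checks.items() if not passed]
    if failed:
        # Move 1 fails: stop. The external question is never reached.
        return "stop, result not believable: " + ", ".join(failed)
    if external_gaps:
        # Move 2: believable, but about the wrong patients, comparator, and so on.
        return "believable, but generalisability gap: " + ", ".join(external_gaps)
    return "believable and about us"


# Hypothetical submission: sound methods, tested against placebo only.
checks = {"randomised": True, "concealed": True, "blinded": True}
print(read_submission(checks, ["comparator"]))
```

A failed internal check returns before the gaps are even looked at, mirroring the "if no, stop" rule: there is no point asking whether an untrustworthy number transfers.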

Internal vs external validity, in one breath

Two separate questions: internal = true here (within the study); external = true there (out in your world, a.k.a. generalisability).
They're independent axes, not one scale — a study can be strong on either and weak on the other.
The trap quadrant is high-internal, low-external: authoritative-looking, quietly answering the wrong question.
The controls that buy internal validity often cost external validity — the same efficacy-vs-effectiveness tension from M1.
Check generalisability across five threats: patients, comparator, conditions, outcome, time horizon.
Randomisation protects internal validity only — it never guarantees the result transfers.

Before a result can change a decision, it must pass both tests: is it true? — and is it true for us?

That completes the foundations of reading evidence. You can now judge a study's design, defend against the three enemies, and ask whether a result both holds and transfers. What you can't yet do is read the numbers a study reports — the effect sizes, the p-values, the confidence intervals. That's exactly where M3 begins: biostatistics, the language the evidence actually speaks.