M11 · REAL-WORLD EVIDENCE

The reason patients got treated.

Pragmatic trials were the clean answer, but a rare and expensive one. Far more often, all HTA has to work with is observational data: records of patients who were treated, and patients who weren't, with no coin toss anywhere in sight. And that raises the question this whole module has been circling toward, the hardest one in it.

In a randomised trial, treatment was assigned by chance. In the real world, it never is. Every patient who received the new drug got it for a reason: a doctor decided, a patient asked, a guideline suggested, a hospital could afford it. And here's the trouble: those reasons are very often tangled up with how well the patient was going to do anyway. The healthier patients, or the sicker ones, or the wealthier ones, or those at better hospitals, end up on the new drug non-randomly. So when you compare treated to untreated and find a difference, you can't tell how much is the drug and how much is who got it. Recovering the drug's true effect from that tangle is the art of causal inference.

Confounding by indication.

The specific villain here has a name: confounding by indication. It's confounding (the Module 2 threat) arriving through the particular door of why a treatment was chosen.

Here's how it deceives. Imagine a new cancer drug given, in routine practice, mostly to fitter patients, those well enough to tolerate it. Compare survival between the drug group and the rest, and the drug group lives longer. Triumph? Not yet. Those patients were fitter to begin with: they'd have lived longer on anything. The survival gap mixes two things: whatever the drug did, and the head start of the people who received it. The "indication" for treatment (being fit enough) is itself a predictor of the outcome (survival), so it confounds the comparison.

This is not a small or occasional problem: it's the default condition of nearly all observational treatment data. Treatment is steered toward certain patients, and the steering is almost always correlated with prognosis. A naïve comparison of treated versus untreated, however large, measures the drug's effect plus the effect of the steering, with no way to separate them by looking at the raw numbers. Something has to be done to pull them apart.
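A small simulation makes the mixing visible. This is a sketch under invented assumptions, not data from any study: the function names, the 75%/25% steering probabilities, and the +8-month fitness advantage are all hypothetical, chosen so that the true drug effect is +3 months.

```python
import random
from statistics import mean

def simulate_patients(n=100_000, seed=0):
    """Simulate (fit, treated, survival_months) with treatment steered toward fit patients."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        fit = rng.random() < 0.5               # half the patients are fit (assumption)
        p_treat = 0.75 if fit else 0.25        # steering: fit patients get the drug more often
        treated = rng.random() < p_treat
        # True drug effect is +3 months; being fit is worth +8 months on any treatment.
        survival = 12 + 3 * treated + 8 * fit + rng.gauss(0, 2)
        patients.append((fit, treated, survival))
    return patients

def naive_difference(patients):
    """Mean survival of the treated minus mean survival of the untreated."""
    treated = [y for _, t, y in patients if t]
    untreated = [y for _, t, y in patients if not t]
    return mean(treated) - mean(untreated)

patients = simulate_patients()
print(f"naive treated-vs-untreated gap: {naive_difference(patients):.2f} months")
```

Under these assumptions fit patients make up 75% of the treated group but only 25% of the untreated group, so the expected gap is 3 + 8 × 0.5 = 7 months: the drug's 3, plus 4 months of head start that the raw comparison cannot tell apart from it.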

The toolkit: adjust, match, weight.

So what can you do? The whole enterprise of causal inference from observational data is a toolkit for making the treated and untreated groups comparable, trying, after the fact, to achieve what randomisation would have handed you for free.

The basic moves are intuitive:

Adjustment (regression). Put the confounders into a statistical model as variables (age, disease severity, comorbidities) and let the model "hold them constant," estimating the treatment effect as if the groups were alike on those factors.
Matching. Pair each treated patient with an untreated patient who looks the same on the confounders (same age, same severity) and compare within those matched pairs, so the differences cancel.
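The shared logic can be sketched with stratification, the simplest relative of both regression adjustment and exact matching: compare treated with untreated inside each level of the confounder, then average the within-level gaps. The summary table and function names below are hypothetical, built so that the drug adds 3 months in every stratum.

```python
# Hypothetical summary data: patient counts and mean survival (months) by fitness stratum.
STRATA = {
    "fit":   {"n_treated": 750, "n_untreated": 250, "mean_treated": 23.0, "mean_untreated": 20.0},
    "frail": {"n_treated": 250, "n_untreated": 750, "mean_treated": 15.0, "mean_untreated": 12.0},
}

def naive_gap(strata):
    """Pooled treated mean minus pooled untreated mean, ignoring the confounder."""
    n_t = sum(s["n_treated"] for s in strata.values())
    n_u = sum(s["n_untreated"] for s in strata.values())
    mean_t = sum(s["n_treated"] * s["mean_treated"] for s in strata.values()) / n_t
    mean_u = sum(s["n_untreated"] * s["mean_untreated"] for s in strata.values()) / n_u
    return mean_t - mean_u

def stratified_gap(strata):
    """Within-stratum gaps, averaged with weights proportional to stratum size."""
    total = sum(s["n_treated"] + s["n_untreated"] for s in strata.values())
    return sum(
        (s["n_treated"] + s["n_untreated"]) / total * (s["mean_treated"] - s["mean_untreated"])
        for s in strata.values()
    )

print(naive_gap(STRATA))       # 7.0
print(stratified_gap(STRATA))  # 3.0
```

Comparing like with like inside each stratum removes the 4 months that came purely from who was treated. It works here only because fitness is the sole confounder and it was recorded.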

Both share one logic: identify what differs between the groups, and neutralise it. And both share one Achilles' heel we'll come to shortly. But first, the most celebrated tool of all, one that handles a practical nightmare the others struggle with: what do you do when there are dozens of confounders to balance at once?

Propensity score: many confounders into one number.

Matching on age is easy. Matching on age and severity and ten comorbidities and prior treatments simultaneously is nearly impossible: you'll never find twins on all of them. The propensity score is the elegant escape.

The idea: instead of balancing thirty confounders one by one, collapse them into a single number for each patient, the probability that this patient would receive the treatment, given all their measured characteristics. A frail, elderly patient with many comorbidities might have a low propensity to get the aggressive new drug; a fit young one, a high propensity. You compute that probability for everyone, then compare treated and untreated patients with similar propensity scores: by matching on the score, or weighting by it.

Why it's beautiful: among patients with the same propensity score, the measured confounders are distributed the same way in the treated and the untreated. (Two individuals with equal scores need not look alike; it is the groups that balance.) So comparing treated with untreated at similar scores is comparing like with like, across all those variables at once, reduced to one dimension. It also makes two things checkable: you can verify the groups are balanced on measured confounders after matching, and you can see the overlap (common support): whether there even exist comparable untreated patients for your treated ones. For handling measured confounding, the propensity score is genuinely powerful and rightly ubiquitous. Which makes its one limitation all the more important to name, because it's fatal to the thing people most want to claim.
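Here is a minimal sketch of the score in use, on simulated data. Two simplifications are assumptions of this example rather than standard practice: the score is estimated as the treated fraction within each pattern of two binary confounders (real analyses usually fit a logistic regression over many variables), and the effect is recovered by inverse-probability weighting rather than matching. All names and numbers are hypothetical; the true effect is set to +3 months.

```python
import random
from collections import defaultdict

def simulate(n=100_000, seed=1):
    """Rows of (elderly, severe, treated, survival_months); treatment avoids old, severe patients."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        elderly = int(rng.random() < 0.5)
        severe = int(rng.random() < 0.5)
        p_treat = 0.8 - 0.3 * elderly - 0.3 * severe      # true propensity: 0.8, 0.5, 0.5 or 0.2
        treated = int(rng.random() < p_treat)
        survival = 20 + 3 * treated - 4 * elderly - 4 * severe + rng.gauss(0, 2)
        rows.append((elderly, severe, treated, survival))
    return rows

def propensity_scores(rows):
    """Estimate P(treated | confounders) as the treated fraction in each confounder pattern."""
    counts = defaultdict(lambda: [0, 0])                  # pattern -> [n_treated, n_total]
    for elderly, severe, treated, _ in rows:
        counts[(elderly, severe)][0] += treated
        counts[(elderly, severe)][1] += 1
    return {pattern: n_t / n for pattern, (n_t, n) in counts.items()}

def ipw_effect(rows, ps):
    """Weighted gap in mean survival; each patient is weighted by 1 / P(the treatment they got)."""
    sum_wy = [0.0, 0.0]                                   # index 0: untreated, 1: treated
    sum_w = [0.0, 0.0]
    for elderly, severe, treated, survival in rows:
        e = ps[(elderly, severe)]
        w = 1 / e if treated else 1 / (1 - e)             # breaks down if e is 0 or 1 (no overlap)
        sum_wy[treated] += w * survival
        sum_w[treated] += w
    return sum_wy[1] / sum_w[1] - sum_wy[0] / sum_w[0]

rows = simulate()
ps = propensity_scores(rows)
print("score range (overlap check):", round(min(ps.values()), 2), "to", round(max(ps.values()), 2))
print(f"weighted estimate: {ipw_effect(rows, ps):.2f} months")
```

The weighted estimate lands close to the true +3 months because both confounders were recorded. The overlap check matters as well: a score of exactly 0 or 1 in some group would mean it has no comparable patients on the other arm, and the weights would be undefined.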

Adjust away the confounding, and watch what's left.

A new drug's true effect is +3 months of survival. But in this observational data, healthier patients got it more often (a measured confounder, severity, worth +4 months of spurious advantage). Slide up the adjustment to correct for what was measured, and watch the estimate move toward the truth. Then flip on a hidden confounder (something real but never recorded) and watch what adjustment can't reach.

[Interactive figure: a slider, "Adjust for measured confounders" (starting at "none"), and a toggle, "Hidden (unmeasured) confounder present". Starting readout: Naïve comparison +7 mo · Adjusted estimate +7 mo · True effect +3 mo · Residual confounding +4 mo (from measured confounding not yet adjusted).]

Watch the two runs side by side. With everything measured, adjustment marches the estimate right down to the truth: this is causal inference working. Switch on one unmeasured confounder and adjustment stalls partway, leaving a residual bias it can neither remove nor even see. Crank the adjustment slider to maximum: it never closes that last gap, because you cannot correct for a variable you never collected. That wall (between what you measured and what you didn't) is the whole lesson. (This is a simplified additive picture of bias, to show the mechanism; real confounding is messier, but the ceiling is real.)
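The additive picture can be written out in a few lines. This is a sketch of the toy model only: the function name and parameters are hypothetical, and the 2-month hidden bias is an invented figure, since the lesson gives no magnitude for the hidden confounder.

```python
def adjusted_estimate(true_effect, measured_bias, adjusted_fraction, hidden_bias=0.0):
    """Toy additive model: adjustment removes a fraction of measured bias and none of hidden bias."""
    if not 0.0 <= adjusted_fraction <= 1.0:
        raise ValueError("adjusted_fraction must be between 0 and 1")
    return true_effect + (1.0 - adjusted_fraction) * measured_bias + hidden_bias

# Everything measured: the slider takes the estimate from +7 down to the true +3.
print(adjusted_estimate(3.0, 4.0, adjusted_fraction=0.0))   # 7.0
print(adjusted_estimate(3.0, 4.0, adjusted_fraction=1.0))   # 3.0

# Hidden confounder switched on (2 months is an assumed size): full adjustment stalls at +5.
print(adjusted_estimate(3.0, 4.0, adjusted_fraction=1.0, hidden_bias=2.0))   # 5.0
```

The hidden term sits outside the part the slider multiplies, which is the whole point: no setting of `adjusted_fraction` can touch it.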

Now you.

For each variable, choose its role, and therefore what to do with it. (Adjusting for the wrong kind of variable is as damaging as missing a confounder.)

1. "Disease severity at baseline, recorded in the registry."

2. "The blood-pressure drop caused by the drug, on the path to fewer strokes."

3. "Patient motivation and health-seeking behaviour, captured nowhere."

4. "Age, recorded for every patient."

5. "The physician's unrecorded gut sense of who'll do well."

6. "Tumour shrinkage caused by the drug, leading to longer survival."

Measured vs unmeasured: the wall randomisation clears.

Now the heart of everything. Every method in the toolkit (adjustment, matching, propensity scores) shares one absolute limit: it can only correct for confounders you measured. And measuring the confounders is precisely the thing randomisation does not need.

Think about what a coin toss actually accomplishes. When you randomise, the two groups come out balanced, apart from chance differences that shrink as the trial grows, on every variable: age, severity, motivation, genetics, the physician's intuition, factors nobody has ever named or thought to record. The coin doesn't know what the confounders are, and doesn't need to: by assigning treatment independently of everything, it balances the measured and the unmeasured alike, automatically, for free. That is randomisation's superpower, and it is unique.

Observational methods can only ever balance the columns that exist in your dataset. Whatever drove treatment but wasn't recorded (and in medicine, something always goes unrecorded) stays unbalanced, and biases the result. This leftover is residual confounding (or unmeasured confounding), and it has a diabolical property: it's invisible inside the study. A propensity-score analysis can show flawless balance tables, identical groups on every measured variable, and that demonstrates precisely nothing about the unmeasured ones. You cannot check the balance of a variable you don't have. So a beautiful balance on what you can see is no evidence at all about what you can't. This is why "they adjusted for confounders" and "they proved causation" are different sentences, and why observational causal inference is not an emulation of randomisation, but a partial, blindfolded imitation of it.
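A simulated example shows the wall from both sides. In this sketch (all names and numbers hypothetical) severity is recorded and motivation is not; both steer treatment and both affect survival, and the true drug effect is +3 months. The simulation can adjust for motivation only because it invented the variable, which is exactly what a real analyst cannot do.

```python
import random
from collections import defaultdict
from statistics import mean

def simulate(n=100_000, seed=2):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = int(rng.random() < 0.5)        # recorded in the dataset
        motivated = int(rng.random() < 0.5)     # real, but never recorded
        p_treat = 0.5 - 0.25 * severe + 0.25 * motivated
        treated = int(rng.random() < p_treat)
        survival = 20 + 3 * treated - 4 * severe + 4 * motivated + rng.gauss(0, 2)
        rows.append({"severe": severe, "motivated": motivated,
                     "treated": treated, "survival": survival})
    return rows

def stratified_effect(rows, key):
    """Average of within-stratum treated-minus-untreated gaps, weighted by stratum size."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for row in rows:
        strata[key(row)][row["treated"]].append(row["survival"])
    effect = 0.0
    for groups in strata.values():
        n_stratum = len(groups[0]) + len(groups[1])
        effect += n_stratum / len(rows) * (mean(groups[1]) - mean(groups[0]))
    return effect

rows = simulate()
measured_only = stratified_effect(rows, key=lambda r: r["severe"])
with_hidden = stratified_effect(rows, key=lambda r: (r["severe"], r["motivated"]))
print(f"adjusting for severity only:       {measured_only:.2f} months")
print(f"adjusting for motivation as well:  {with_hidden:.2f} months")
```

Within each severity stratum the groups are perfectly balanced on severity, yet the estimate settles at about 4.1 months rather than 3, because the treated patients in every stratum are still the more motivated ones. Nothing in the recorded columns reveals that extra month.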

So is observational evidence useless? No, but humble.

If unmeasured confounding can never be removed or seen, is observational evidence worthless? No, that's too despairing, and wrong. But it demands a particular humility, and a particular set of moves.

You can't eliminate unmeasured confounding, but you can interrogate how much it would take to matter. The key question flips from "is there residual confounding?" (there always is) to "how strong would an unmeasured confounder have to be to explain away this result?" If the answer is "impossibly strong, stronger than any known risk factor," the finding is fairly robust. If "a mild unmeasured difference would erase it," the finding is fragile. There's even a standard metric for this (the E-value) that quantifies exactly that tipping point. Other moves help too: negative controls (check the method on an outcome the drug couldn't affect: if it shows a "benefit" there, your analysis is confounded), and formal sensitivity analysis for unmeasured confounding.
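For a concrete sense of the E-value, here is a short sketch using the standard closed-form expression from VanderWeele and Ding, which applies to an estimate on the risk-ratio scale. The function name is hypothetical, and other effect measures (such as months of survival) would first need converting to an approximate risk ratio, which this example does not attempt.

```python
import math

def e_value(risk_ratio):
    """E-value for a point estimate on the risk-ratio scale: RR + sqrt(RR * (RR - 1))."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # For protective effects (RR < 1), the formula is applied to the reciprocal.
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

print(round(e_value(2.0), 2))   # 3.41
print(round(e_value(1.1), 2))   # 1.43
```

Read the first result this way: to fully explain away an observed risk ratio of 2.0, an unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of about 3.4 each. For a risk ratio of 1.1, associations of about 1.4 would do, which is the "mild unmeasured difference" case.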

None of these removes the confounding: they measure your exposure to it. Combined with triangulation (does the observational result agree with trials, with different data, with biological plausibility?), they let RWE contribute real knowledge. The posture is: observational causal inference is not proof, it's an argument, one whose strength depends entirely on how honestly its unmeasured-confounding exposure is confronted. Used that way, RWE is valuable. Mistaken for a randomised trial, it's dangerous.

What's the reviewer's strongest objection?

An observational study of 200,000 patients uses propensity-score matching and reports that, after matching, the treated and untreated groups are almost perfectly balanced on all 40 recorded variables. It concludes the drug causes a large survival benefit. A reviewer is unconvinced. What is the reviewer's strongest objection?

Why this matters for HTA

Observational causal claims arrive in submissions constantly: a real-world study showing a drug "improves survival," a database analysis "demonstrating" benefit. Reading them well is one of the sharpest skills in HTA:

Ask what wasn't measured, not just what was. A submission will proudly list the confounders it adjusted for and show immaculate balance. The assessor's job is the opposite: name the prognostic factors that weren't recorded (motivation, frailty the codes miss, access, physician selection) and reason about which way they'd bias the result. Balance on the measured variables is where a naïve reader stops and a good one starts.
Never let sample size or perfect balance stand in for causation. Both are seductive and neither addresses unmeasured confounding. Demand a sensitivity analysis: how strong would an unrecorded confounder have to be to overturn this? If they haven't provided one, ask for it: an E-value, negative controls, something that quantifies the exposure to bias.
Watch for over-adjustment, too. The error runs both ways. Adjusting for a mediator (a step on the drug's own causal path) quietly removes part of the real effect. A submission that "controls for everything" may have controlled away the benefit, or introduced bias through a collider (a variable influenced by both the treatment and the outcome, or by their causes). The right variables, not the most variables.

Randomisation balances the factors you know and the ones you'll never think of. Statistics can only balance the ones you wrote down. The whole discipline of observational causal inference lives in that gap, doing its careful best inside it, while never quite escaping it.

Causal inference, in one breath.

In observational data, treatment is never random. Patients get it for reasons tied to prognosis, so a naïve treated-vs-untreated comparison suffers confounding by indication: it mixes the drug's effect with the effect of who received it.
The toolkit (adjustment, matching, and especially the propensity score, which collapses many measured confounders into one number, the probability of treatment) makes the groups comparable on measured confounders, and lets you check balance and overlap.
But every observational method shares one wall: it can only correct for what was measured. Randomisation balances measured and unmeasured factors alike; statistics cannot. What's left is residual confounding, invisible inside the study, since balance on measured variables says nothing about unmeasured ones.
So observational evidence isn't proof and isn't useless: treat it as an argument, tested by how strong an unmeasured confounder would need to be to overturn it (E-value, negative controls, sensitivity analysis) and by triangulation with other evidence. And never adjust for a mediator: you'd erase part of the real effect.

You can adjust away the confounding you can see. The confounding you can't see is the reason a database of a million patients still isn't a randomised trial, and the reason honesty about the unmeasured is the whole craft.

We've now confronted real-world evidence at its hardest: what it can prove, and the wall it can't climb. The final question of the module is intensely practical: given all these strengths and limits, how does HTA actually use real-world evidence in real decisions? Where does it genuinely help, where is it (rightly) treated with suspicion, and how does it underpin the conditional deals (coverage with evidence, managed entry) from Module 10? That's the last lesson.