M6 · MEASURING HEALTH OUTCOMES
The drug improved every number on the chart. Then it killed people.
In the late 1980s, doctors had a beautifully logical idea. Patients who'd had a heart attack often developed abnormal heart rhythms, and those abnormal rhythms predicted sudden death. So: give a drug that suppresses the abnormal rhythms, and you should prevent the deaths. The drugs worked exactly as designed — the arrhythmias on the ECG went quiet. The marker improved.
Then the trial (CAST) counted the bodies. The patients on the rhythm-suppressing drugs were dying at more than twice the rate of those on placebo. The drugs smoothed the ECG and killed the patients. The number got better; the person got worse.
This is not a freak story — it's a recurring one, and it exposes the single most important question in measuring health: are you measuring the thing that matters, or a stand-in you're hoping represents it? A cholesterol level, a tumour shrinking on a scan, a smoothed heartbeat — these are markers. What the patient actually cares about is living longer and living better. When the marker and the patient part ways, and you were only watching the marker, you fund a catastrophe with a straight face. This lesson is about the gap between the two, and why closing it is an empirical question almost nobody asks properly.
Before you measure how well a treatment works, you choose what to measure. That choice governs everything after it.
An endpoint is simply the outcome a study measures to judge whether a treatment worked. And endpoints divide into two kinds that are worlds apart.
A hard endpoint (or "final" / "patient-relevant" endpoint) is something that matters directly and intrinsically to the patient: death, a stroke, a hip fracture, years of life, quality of life. Nobody has to argue that these matter — they're the point of medicine. If a treatment reduces deaths, that reduction is the benefit, not a proxy for it.
Everything else is a stand-in. And here's why the choice of endpoint is the first decision, not a detail: every downstream number — the effect size, the QALYs, the ICER, the funding decision — inherits whatever the endpoint measured. Pick the wrong endpoint and you can run flawless statistics, build an immaculate economic model, and reach a confident, wrong answer. Get the arithmetic perfect on the wrong quantity and you've measured nothing. So the endpoint is where scrutiny should start — and, perversely, it's usually where it's weakest.
Waiting for hard endpoints is often slow, expensive, or impossible. Surrogates are the pragmatic answer — and a genuine bargain with risk.
Suppose you want to prove a cholesterol drug prevents heart attacks using the hard endpoint directly. You'd need thousands of patients followed for five to ten years, because heart attacks accumulate slowly. That's enormous cost, enormous delay, and every year of delay is a year patients don't get a drug that might help. For some questions the hard endpoint is ethically out of reach — you can't withhold a promising therapy for a decade to count deaths.
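To get a feel for where "thousands of patients" comes from, here is a rough sketch using Schoenfeld's approximation for the number of events a 1:1 time-to-event trial needs. The inputs (a hazard ratio of 0.8, a 2% annual event rate on control, five years of follow-up, 90% power) are illustrative assumptions, not figures from any real trial, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.90):
    """Schoenfeld approximation: events needed in a 1:1 two-arm trial."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)

def required_patients(hazard_ratio, control_annual_hazard, years, **kwargs):
    """Patients needed, assuming constant hazards and no dropout."""
    p_control = 1 - math.exp(-control_annual_hazard * years)
    p_treated = 1 - math.exp(-control_annual_hazard * hazard_ratio * years)
    p_event = (p_control + p_treated) / 2  # average across equal-sized arms
    return math.ceil(required_events(hazard_ratio, **kwargs) / p_event)

# Assumed inputs: 20% relative reduction, 2% annual event rate, 5 years.
events = required_events(0.8)
patients = required_patients(0.8, control_annual_hazard=0.02, years=5)
print(f"events needed: {events}, patients needed: {patients}")
```

Under these assumptions the trial needs roughly 850 events and, because events are rare, close to 10,000 patients. A smaller effect or a rarer event pushes both numbers up quickly.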
So instead you measure a surrogate endpoint: a marker that changes sooner and stands in for the hard outcome. Cholesterol level instead of heart attacks. Blood pressure instead of strokes. Viral load instead of AIDS progression. Tumour shrinkage, or time-to-tumour-growth, instead of survival. The surrogate is faster, cheaper, and earlier: you get an answer in months, not years.
This is a real bargain, not a cheat. Surrogates are indispensable; modern medicine could not function without them. The danger isn't using surrogates — it's using a surrogate you haven't verified, as though it were the hard endpoint itself. And verifying it turns out to be far subtler than it looks.
Here is the mistake at the heart of the whole subject. It hides in a single word: "correlates."
Everyone knows LDL cholesterol "correlates with" heart attacks. People with high LDL have more heart attacks. That is completely true — and completely insufficient to justify LDL as a surrogate. Because it's the wrong arrow.
There are two different arrows, and they are not the same claim:
- Arrow 1 — prognostic (patient level): across patients, the marker tracks the outcome. High LDL → more heart attacks. This describes who is at risk.
- Arrow 2 — surrogacy (treatment-effect level): when a treatment changes the marker, the outcome changes correspondingly. Lowering LDL with this drug → fewer heart attacks. This describes whether changing the marker helps.
Arrow 1 does not imply Arrow 2. A marker can perfectly predict risk in patients, yet moving it with a drug may do nothing, or even do harm, because the drug affects the marker through a mechanism that doesn't touch the cause of the outcome. Validating a surrogate means establishing Arrow 2, and Arrow 2 needs its own evidence: trials showing that the treatment effect on the surrogate predicts the treatment effect on the hard endpoint. Almost every casual claim that "X is a good surrogate because it correlates with Y" is quietly substituting Arrow 1 for Arrow 2, and that substitution is exactly where drugs like the ones in this lesson slip through.
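The gap between the two arrows can be shown with a toy simulation. This is a minimal sketch under invented assumptions: a latent "cause" drives both the marker and the risk of the hard outcome, and two hypothetical drugs lower the marker by the same amount. Drug A does it by reducing the cause. Drug B moves the marker alone. None of the numbers come from real data.

```python
import math
import random

def risk(cause):
    """Probability of the hard outcome; depends only on the underlying cause."""
    return 1 / (1 + math.exp(-(cause - 2.0)))

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is reproducible
cause = [rng.gauss(0, 1) for _ in range(5000)]    # latent disease driver
marker = [c + rng.gauss(0, 0.5) for c in cause]   # marker reflects the cause

# Arrow 1: across untreated patients, the marker tracks risk.
arrow1_corr = pearson(marker, [risk(c) for c in cause])

# Arrow 2: both drugs lower the marker by 1 unit, by different routes.
risk_placebo = mean([risk(c) for c in cause])
risk_drug_a = mean([risk(c - 1.0) for c in cause])  # lowers the cause itself
risk_drug_b = mean([risk(c) for c in cause])        # cause untouched

print(f"patient-level correlation (Arrow 1): {arrow1_corr:.2f}")
print(f"mean risk: placebo={risk_placebo:.3f}, "
      f"drug A={risk_drug_a:.3f}, drug B={risk_drug_b:.3f}")
```

In this toy world the marker is strongly prognostic, and both drugs "improve" it equally, yet only Drug A reduces risk. The patient-level correlation is identical in both cases and cannot tell them apart.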
When the chain breaks.
Watch a drug move the surrogate in the perfect direction — and kill the patient anyway.
Here's the causal chain a surrogate depends on: Drug → lowers LDL → fewer heart attacks → longer survival. For the surrogate to be valid, every link must hold. Change the drug's mechanism and a link can snap.
With a statin, the surrogate holds, because the mechanism that moved it also drives the outcome.
Sit with the torcetrapib case, because it really happened. The drug did everything the surrogate asked: LDL down, HDL up, textbook. The trial (ILLUMINATE, 2006) was halted early because patients on the drug were dying more. The surrogate wasn't lying about LDL; it was silent about the arrow nobody had tested: whether lowering LDL this particular way helped. It didn't. A surrogate moving in the right direction told you nothing about the direction of the patient.
Surrogacy isn't yes/no — it's a strength, and there's a number for it.
Once you accept that Arrow 2 needs proving, the natural question is: proved how strongly? The key measure is the trial-level association — across many trials, how well does the treatment effect on the surrogate predict the treatment effect on the hard endpoint? It's usually reported as an R², running from 0 to 1.
Read R² as "the fraction of the variation in the hard-endpoint benefit that's explained by the surrogate benefit."
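As a rough illustration of that reading, the sketch below computes an unweighted trial-level R² from one (surrogate effect, hard-endpoint effect) pair per trial. The six trials and their effect sizes are invented for illustration, and `trial_level_r2` is a hypothetical helper. Real validation analyses typically weight trials and account for the estimation error in each effect; this sketch does neither.

```python
def trial_level_r2(surrogate_effects, outcome_effects):
    """Unweighted R^2 from regressing outcome effects on surrogate effects."""
    n = len(surrogate_effects)
    mx = sum(surrogate_effects) / n
    my = sum(outcome_effects) / n
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(surrogate_effects, outcome_effects))
    sxx = sum((x - mx) ** 2 for x in surrogate_effects)
    syy = sum((y - my) ** 2 for y in outcome_effects)
    return sxy ** 2 / (sxx * syy)

# Hypothetical trials: treatment effects on each endpoint, for example
# log hazard ratios. Invented numbers, for illustration only.
surrogate_effects = [-0.50, -0.35, -0.20, -0.45, -0.10, -0.30]
outcome_effects = [-0.30, -0.05, -0.15, -0.10, -0.12, -0.25]

r2 = trial_level_r2(surrogate_effects, outcome_effects)
print(f"trial-level R^2 = {r2:.2f}")
```

An R² near 1 would mean the surrogate effect almost fully accounts for the variation in hard-endpoint benefit across trials. A value near 0 would mean that knowing the surrogate effect tells you very little about it.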
A surrogate reports a trial-level R² of 0.15. How should you read it — and how much should you trust the surrogate?
Not all evidence for a surrogate is equal. There's a hierarchy — and most surrogates never climb past the bottom rung.
Justifications for a surrogate stack up in increasing strength:
- Biological plausibility — a mechanistic story for why the marker should matter. Necessary, but the weakest evidence: CAST and torcetrapib both had beautiful biological stories. Plausibility is where surrogates are born, not where they're validated.
- Patient-level association (Arrow 1) — the marker predicts the outcome across patients. Reassuring, and still not enough — it's the wrong arrow.
- Trial-level association (Arrow 2) — across multiple trials, the treatment effect on the surrogate predicts the treatment effect on the outcome. This is the evidence that actually counts, and it needs many trials to establish.
- Formal surrogacy — trial-level validation strong and consistent enough that regulators/HTA bodies accept the surrogate as standing in for the hard endpoint, for a given drug class and mechanism.
The trap is that a surrogate with only the bottom two rungs (a nice mechanism and a patient-level correlation) gets treated as if it had the top rung. That's the gap every failed surrogate fell through.
A manufacturer argues their surrogate is valid because "it's biologically plausible and strongly correlated with survival in patients." What's missing?
Even a genuinely validated surrogate doesn't transfer automatically to the next drug. Validation is mechanism-specific.
Here's a subtlety that catches people who have absorbed everything so far. Suppose LDL is a properly validated surrogate for statins — trial-level evidence shows that statin-driven LDL reduction reliably predicts fewer cardiac events. Does that validate LDL as a surrogate for any LDL-lowering drug?
No. Because validity depends on how the drug moves the marker. A statin lowers LDL by upregulating LDL receptors, clearing cholesterol in a way tied to the disease process — arrow intact. Torcetrapib lowered LDL too, but through a mechanism that dragged along off-target harm — arrow broken. Same surrogate, same direction of movement, opposite consequence, because the mechanisms differ.
This is why "it lowers LDL, and LDL is a validated surrogate" is not a valid argument for a new drug with a new mechanism. The surrogate was validated for a mechanism, not for a number. When a novel drug moves an old surrogate a new way, the validation resets — you're back to needing Arrow 2 for this drug. (Ezetimibe, another LDL-lowerer with yet another mechanism, took years and a large hard-endpoint trial to confirm its LDL reduction did translate — it happened to, but that had to be shown, not assumed.)
A surrogate isn't just a clinical curiosity. It's an input to the economic model — and an unvalidated one poisons the whole ICER.
Trace how a surrogate flows into a funding decision. The economic model runs a chain: Δsurrogate → Δhard outcome → ΔQALYs → Δcost-effectiveness. The very first link — translating the measured surrogate effect into a hard-outcome effect — rests entirely on the surrogate's validity. If that link is uncertain, everything downstream is uncertain: the QALY gain, the ICER, the recommendation. A pristine economic model built on an unvalidated surrogate is a precise answer to an unanswerable question.
This is also where two worlds diverge. Regulators (EMA, FDA) increasingly grant accelerated approval on surrogate endpoints such as tumour response or progression-free survival (PFS) to get drugs to patients faster, accepting the uncertainty as a deliberate trade. HTA bodies are far more sceptical, because they're deciding whether to pay, and a surrogate-based benefit may evaporate when the hard-endpoint data mature. So an assessor's toolkit is specific: treat surrogate-based effects as a source of major uncertainty (a GRADE indirectness downgrade, because the surrogate answers a neighbouring question, not the patient's); demand the trial-level validation, not the patient-level correlation; run scenarios where the surrogate-to-outcome translation is weaker; and where possible, hold out for mature overall survival data rather than accept PFS at face value. The surrogate is the seam where clinical uncertainty leaks into economic certainty, and sealing it is core HTA work.
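The chain Δsurrogate → Δhard outcome → ΔQALYs → Δcost-effectiveness, and the scenario analysis on its first link, can be sketched in a few lines. Everything here is a deliberately simplified assumption: the linear chain, the function `icer`, its parameter names, and all the input values are hypothetical, chosen only to show how the ICER responds when the surrogate-to-outcome translation is weakened.

```python
def icer(delta_surrogate, events_avoided_per_unit,
         qalys_per_event_avoided, delta_cost):
    """Cost per QALY gained along surrogate -> hard outcome -> QALYs."""
    # The fragile link: translating a surrogate change into avoided events.
    delta_events = delta_surrogate * events_avoided_per_unit
    delta_qalys = delta_events * qalys_per_event_avoided
    if delta_qalys <= 0:
        # Simplification: with no health gain there is no finite cost per QALY.
        return float("inf")
    return delta_cost / delta_qalys

# Hypothetical base case (invented values, per patient).
base = dict(delta_surrogate=1.0, events_avoided_per_unit=0.02,
            qalys_per_event_avoided=5.0, delta_cost=3000.0)

# Scenarios: keep everything fixed except the strength of the translation.
scenarios = {}
for strength in (1.0, 0.5, 0.25, 0.0):
    inputs = dict(
        base,
        events_avoided_per_unit=base["events_avoided_per_unit"] * strength,
    )
    scenarios[strength] = icer(**inputs)
    print(f"translation at {strength:.0%} of base: "
          f"ICER = {scenarios[strength]:,.0f} per QALY")
```

In this linear sketch, halving the assumed translation doubles the ICER, and a translation of zero leaves no benefit to pay for. The costs and QALY weights never changed; only the surrogate link did.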
The other chair
Reading a submission: find the pivotal endpoint first and ask the blunt question. Is it what patients care about, or a stand-in? If it's a surrogate, do not accept "it's well-established" or "it correlates with survival." Demand the trial-level validation for this drug's mechanism, and if it's a patient-level correlation dressed as validation, treat the entire benefit as uncertain. Ask what the ICER becomes if the surrogate-to-outcome relationship is weaker than assumed; if the manufacturer hasn't run that scenario, run it. A surrogate is where an optimistic case is most quietly built.
Building one: lead with hard-endpoint data if you have it; if you must rely on a surrogate, bring the trial-level validation, not a plausibility story and a patient-level graph. Pre-empt the mechanism question by showing why your drug's way of moving the marker preserves the arrow. Present the surrogate-uncertainty scenario yourself rather than waiting to be caught without it; an assessor trusts a submission that discloses its own weakest link far more than one that hides it behind a confident correlation.
Same skill from both chairs — refusing to let "the marker moved" stand in for "the patient benefited" until the arrow between them has actually been tested.
Why this matters for HTA
When it lands on your desk: a large share of submissions — especially in oncology and chronic disease — rest their pivotal benefit on a surrogate, because that's what got the drug approved and what the trial was powered to show. The hard-endpoint data are often immature or absent. So judging surrogates isn't an edge case; it's a routine, central assessment skill.
- You locate the endpoint and interrogate it before anything else. The effect size, the QALYs, the ICER all inherit the endpoint's validity. A dazzling model on a surrogate you can't trust is a dazzling model you can't trust. Endpoint scrutiny is upstream of everything.
- You demand the right arrow. Patient-level correlation is not validation; trial-level association for the relevant mechanism is. Holding that line is often what separates a defensible appraisal from one that funds the next CAST.
- You price surrogate uncertainty into the decision, not out of it. Where a benefit rests on an unvalidated surrogate, that's grounds for caution — a GRADE downgrade, a conservative scenario, a request for mature data, or a managed-access arrangement that pays only if the hard outcome materialises. The uncertainty doesn't disappear because the surrogate is convenient; your job is to make it visible and let it shape the decision.
The surrogate is chosen for speed, accepted for convenience, and questioned last of all. An assessor's value is in questioning it first.
Endpoints, in one breath.
- The endpoint is what a study measures to judge benefit, and choosing it is the first decision — every downstream number inherits it.
- A hard endpoint matters directly to the patient (death, stroke, quality of life). A surrogate is a faster, cheaper stand-in (LDL, blood pressure, tumour shrinkage, PFS).
- Surrogates are necessary, not lazy — hard endpoints can be too slow, costly, or unethical to measure directly. The danger is using an unvalidated surrogate as if it were hard.
- Two arrows, routinely confused: a marker predicting outcomes in patients (prognostic) does not prove that changing it by treatment changes outcomes (surrogacy). Only the second validates a surrogate.
- Validation is a ladder (plausibility → patient-level → trial-level → formal) and mechanism-specific: a surrogate validated for one drug class isn't automatically valid for a new mechanism.
- In HTA, an unvalidated surrogate is injected uncertainty: it sits at the head of the model's chain, so it makes the whole ICER uncertain — grounds for a GRADE downgrade, conservative scenarios, or a demand for mature hard-endpoint data.
The endpoint is the first thing chosen and the last thing questioned — reverse that order.
You've now got the first tool of outcome measurement: knowing whether you're measuring something that matters. But one of those hard endpoints — "quality of life" — has been sitting there unexamined. It's obviously what patients care about, yet the QALY treats it as a number: 0.7 of a year in perfect health. How on earth do you put a single number on how good it is to be alive in imperfect health? That question — measuring quality of life, and where utilities come from — is the heart of Module 6, and it's next.