M4 · EVIDENCE SYNTHESIS
You've learned to find flaws. Now deliver a verdict.
Look back at what M4 taught you to spot. Studies at risk of bias. Effects that scatter more than chance allows. Missing trials that never got published. Comparisons rebuilt across populations that don't quite match. You've become good at finding what's wrong with a body of evidence.
But an appraisal can't be a list of complaints. At some point someone asks the only question that matters for a decision: given everything we've found, how much do we actually trust this estimate? Not "is it flawless" — nothing is — but "is it solid enough to bet a reimbursement decision on?"
That verdict has a name: GRADE. It takes your whole pile of concerns and turns it into one rating of certainty — High, Moderate, Low, or Very Low.
And here's the thing to hold onto before we start: that rating is a judgement, not a calculation. GRADE won't hand you an objective number. What it does instead is more useful — it forces every step of the judgement out into the open, where it can be seen, questioned, and argued. By the end of this lesson you'll see why that's the entire point.
Certainty is confidence in the estimate — and it's rated one outcome at a time.
GRADE rates the certainty of evidence on four levels. Read them as statements about how close the truth probably is to your estimate:
- High — we're very confident the true effect sits close to the estimate.
- Moderate — probably close, but it could be substantially different.
- Low — our confidence is limited; the truth may be substantially different.
- Very low — we have very little confidence; the truth is likely to be substantially different.
Two things about this trip up almost everyone.
First: certainty is not effect size. These are different axes. You can have high certainty that a drug helps only a little — a big, clean trial showing a tiny benefit. You can have low certainty about a drug that might be a blockbuster — a promising signal from one small, flawed study. "How sure are we" and "how much does it do" are separate questions, and GRADE only answers the first.
Second — and this is the one that reorganises everything: GRADE rates certainty per outcome, not per study. You don't grade "the trial." You grade your confidence in the effect estimate for one specific outcome. The very same randomised trial can deliver high-certainty evidence for mortality — a hard, well-measured endpoint — and low-certainty evidence for quality of life in the same patients, because that outcome was measured on a shaky scale, in fewer people, with more dropout. One study, two certainties. Grade the outcome, never the study.
Where the rating begins depends on the study design — but that's only the starting line.
GRADE gives every outcome a provisional certainty before any scrutiny, based on design:
- Evidence from randomised trials starts at High. Randomisation is the strongest defence against confounding, so you begin by trusting it — then look for reasons to trust it less.
- Evidence from observational studies starts at Low. Without randomisation, confounding is the default suspicion, so you begin cautious — then look for reasons to trust it more.
Now the crucial caveat: the starting line is not the finish. "RCT" does not mean "high certainty." An RCT begins at High and then routinely gets marked down — sometimes two or three levels — until it lands at Low or Very Low. Meanwhile a strong observational study can be marked up. A downgraded trial and an upgraded cohort can meet in the middle.
So the design tells you where the pen starts on the ladder. Five forces can push it down. Three can push it up. The rest of this lesson is those forces — and you already know most of them.
Here's the reveal: every reason to downgrade is something this module already taught you.
GRADE marks an outcome down for five reasons. Each is a concept you've already met — GRADE just gives it a slot on the form. Each domain can cost one level (serious) or two (very serious).
- Risk of bias — are the studies themselves internally flawed? This is RoB 2 and ROBINS-I, from earlier in this module. Flawed studies, less trust.
- Inconsistency — do the studies disagree more than chance explains? This is heterogeneity — high, unexplained I². When effects scatter without reason, you're less sure which one is real.
- Indirectness — does the evidence actually answer your question? This reaches back to PICO and to indirect comparisons — we'll unpack it next screen, because it's the slippery one.
- Imprecision — is the estimate too uncertain to act on? This is the confidence interval from M3: wide intervals, too few events, a CI that crosses the line where your decision would flip.
- Publication bias — are you missing the studies that didn't flatter the drug? This is the funnel plot from last lesson. Suspected suppression, less trust in the pooled number.
That's the whole machine. Five dials, and you've already learned to read every one of them. GRADE's contribution isn't new statistics — it's the discipline of checking all five, every time, in the open.
Now operate the machine yourself.
Below is an outcome shaped like a real appraisal: a randomised trial of an anticoagulant versus warfarin, where the outcome is strokes prevented. It starts at High. Set each domain to how serious the concern is, and watch the certainty fall.
[Interactive panel: current certainty High (4 − 0 = 4); one severity selector each for risk of bias, inconsistency, indirectness, imprecision, and publication bias.]
Play with it. Set two domains to "Serious" and the High evidence becomes Low. Set one to "Very serious" and you drop two rungs at once. Notice there's a floor — you can't fall below Very Low, no matter how many concerns pile up. This is the actual mechanic. Nothing hidden. Which is exactly why the next question is: who decides what counts as "serious"? Hold that thought.
Put a number on it.
Use the ladder values: High = 4, Moderate = 3, Low = 2, Very Low = 1.
Take this outcome. It's randomised evidence, so it starts at 4 (High). On review:
- Risk of bias: serious (high dropout, open-label component) → −1
- Inconsistency: no concern → 0
- Indirectness: serious (trial patients were younger and healthier than the population you're funding for) → −1
- Imprecision: no concern → 0
- Publication bias: undetected → 0
Total downgrade = 1 (risk of bias) + 1 (indirectness) = 2. Final level = 4 − 2 = ?
4 − 2 = 2, which on the ladder is Low. Two defensible, one-level downgrades turned "high-quality randomised evidence" into low-certainty evidence for this outcome, in this population. Nothing exotic happened — no domain pushed to "very serious," no publication bias, no wild heterogeneity. Just two ordinary, honestly-argued concerns. That's how routinely a trial's headline pedigree gets spent down to Low.
(There's a floor: penalties can't drive you below Very Low = 1. And a start-High outcome with no serious concerns in any domain simply stays High.)
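The ladder arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not an official GRADE tool: the function name `downgrade`, the judgement labels, and the dictionary layout are assumptions made for this example. Only the ladder values, the one- and two-level penalties, and the floor at Very Low come from the lesson.

```python
# Ladder values from the lesson: High = 4, Moderate = 3, Low = 2, Very Low = 1.
LADDER = {4: "High", 3: "Moderate", 2: "Low", 1: "Very Low"}

# Levels lost per judgement. The label strings are this sketch's own naming.
PENALTY = {"none": 0, "serious": 1, "very serious": 2}

DOMAINS = (
    "risk of bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication bias",
)


def downgrade(start: int, judgements: dict[str, str]) -> tuple[int, str]:
    """Subtract domain penalties from the starting level, never going below 1."""
    unknown = set(judgements) - set(DOMAINS)
    if unknown:
        raise ValueError(f"unknown domain(s): {sorted(unknown)}")
    # Domains that are not mentioned count as "none".
    total = sum(PENALTY[judgements.get(d, "none")] for d in DOMAINS)
    level = max(1, start - total)  # the floor: Very Low = 1
    return level, LADDER[level]


# The worked example: randomised evidence, so it starts at 4 (High).
worked_example = {
    "risk of bias": "serious",
    "indirectness": "serious",
}
print(downgrade(4, worked_example))  # (2, 'Low')
```

The code only does the subtraction. Deciding whether a concern counts as serious is still a judgement, and that is the part no function can supply.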
The real skill isn't spotting a problem. It's filing it under the right domain.
Most GRADE mistakes aren't missed flaws. They're flaws slotted into the wrong dial, which double-counts one concern and ignores another. The domain people most often misuse is indirectness, because it's easy to confuse with indirect comparisons.
They are not the same thing. An indirect comparison (Bucher, NMA) is a method — rebuilding A-vs-B through a shared comparator. Indirectness in GRADE is broader: it's any gap between the evidence you have and the question you're asking. Your PICO versus the trial's PICO. It shows up in several flavours:
- Population — trial patients differ from your target patients.
- Intervention or comparator — the trial used a dose, or a comparator, you don't care about.
- Outcome — the trial measured a surrogate (a lab marker) when your decision needs a hard clinical endpoint.
An indirect comparison is one contributor to indirectness — but a perfectly direct head-to-head trial can still be drowning in indirectness if it was run in the wrong patients on the wrong endpoint.
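One way to picture indirectness is as a field-by-field comparison of your PICO with the trial's. The sketch below assumes a hypothetical `PICO` record and a helper named `indirectness_gaps`, and the patient and endpoint descriptions are invented for illustration. Exact text mismatch is a crude stand-in for what is really a clinical judgement about whether a difference matters.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PICO:
    population: str
    intervention: str
    comparator: str
    outcome: str


def indirectness_gaps(question: PICO, evidence: PICO) -> list[str]:
    """Name the PICO elements where the evidence does not match the question."""
    return [
        f.name
        for f in fields(PICO)
        if getattr(question, f.name) != getattr(evidence, f.name)
    ]


# Hypothetical descriptions, invented for illustration only.
question = PICO("older, multimorbid adults", "drug A", "warfarin", "stroke")
evidence = PICO("younger, healthier adults", "drug A", "warfarin", "lab marker")
print(indirectness_gaps(question, evidence))  # ['population', 'outcome']
```

Here the trial is a direct head-to-head comparison (same intervention, same comparator), yet two gaps remain: the wrong patients and the wrong endpoint.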
Sometimes the pen moves up. Rarely, and only for observational evidence.
Observational studies start at Low because confounding is the default worry. But occasionally the evidence is so striking that confounding can't plausibly explain it away — and GRADE lets you rate up. Three triggers:
- Large magnitude. When the effect is enormous, no realistic amount of confounding could manufacture it. The classic case: smoking and lung cancer. No randomised trial exists, yet the relative risk is roughly ten-to-twenty-fold — far too large for hidden confounders to fake. We're confident anyway.
- Dose-response gradient. More of the exposure, more of the effect, in an orderly ramp — smoking again: more pack-years, more risk. Nature rarely arranges a clean gradient by accident.
- Confounding working against the effect. When every plausible bias would have shrunk the observed effect, the real effect is probably even bigger than what you're seeing. The finding survives despite the biases pulling it down.
This is why nobody demands a randomised trial of parachutes, or of insulin for diabetic coma. The effect is so overwhelming that observational evidence earns high certainty on its own.
You won't compute upgrades here — in practice they're uncommon, and they're the one part of GRADE that leans hardest on judgement. But know the direction exists: Low is a starting point, not a ceiling.
Two competent analysts can GRADE the same evidence to different levels. That's not the flaw. That's the feature.
Everything you've done rests on a soft word: serious. How wide a CI is "serious" imprecision? How different must the trial population be before indirectness is "serious"? GRADE gives you the domains and the discipline — it does not give you an objective cut-off. The judgement is yours.
So watch what happens with one real body of evidence, graded from two chairs — same anticoagulant outcome, same trials:
- Manufacturer's profile: "The trial population is close enough."
- Assessor's profile: "The funded population is older and sicker than the trial's."
Same trials. Same numbers. One is Moderate, the other Low — and the entire disagreement lives in a single domain: indirectness. You can see it. You can point at it. You can argue it on its clinical merits — is the funded population different enough to matter?
Compare that to a black-box "quality score" that just spits out "high" versus "low" with no visible reason. There, the two sides would simply disagree, with nowhere to stand. GRADE's real gift isn't objectivity it doesn't have — it's making the subjectivity local, named, and contestable.
That's also the fair critique of GRADE, and you should carry it: inter-rater agreement is only moderate, the domain thresholds are genuinely fuzzy, and a determined analyst can steer the result. GRADE doesn't remove that. It just guarantees that when someone steers, they leave tyre marks you can follow.
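The two-chair comparison can be made concrete by writing both profiles down and listing where they differ. The profiles below are hypothetical: the lesson states only that the two sides land on Moderate and Low and disagree on indirectness, so the shared "serious" call on risk of bias is an assumption added to make the numbers work. The helper names `certainty` and `disagreements` are also this sketch's own.

```python
PENALTY = {"none": 0, "serious": 1, "very serious": 2}
LADDER = {4: "High", 3: "Moderate", 2: "Low", 1: "Very Low"}


def certainty(start: int, profile: dict[str, str]) -> str:
    """Apply every domain penalty to the starting level, with a floor at 1."""
    level = max(1, start - sum(PENALTY[j] for j in profile.values()))
    return LADDER[level]


def disagreements(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return only the domains where the two profiles make different calls."""
    return {d: (a[d], b[d]) for d in a if a[d] != b[d]}


# Hypothetical profiles for the same randomised evidence (start = 4).
manufacturer = {
    "risk of bias": "serious",  # assumed shared concern, not from the lesson
    "inconsistency": "none",
    "indirectness": "none",  # "The trial population is close enough."
    "imprecision": "none",
    "publication bias": "none",
}
# Same calls everywhere except indirectness.
assessor = {**manufacturer, "indirectness": "serious"}

print(certainty(4, manufacturer), certainty(4, assessor))  # Moderate Low
print(disagreements(manufacturer, assessor))  # {'indirectness': ('none', 'serious')}
```

The second line of output is the useful one: it names the single domain the two sides need to argue about.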
The other chair
Reading a submission: don't argue with the certainty label; argue with the domain behind it. If a dossier claims "High certainty," go straight to the five dials: which did they check, and what did they call each? The soft calls (imprecision, indirectness) are where an optimistic grade hides. Ask whether the certainty is quoted per outcome or smuggled in as one grade for the whole submission; the flattering trick is to grade the well-measured outcome and let it stand in for the shaky one. And check which outcomes are critical to the decision: overall certainty is dragged down by the weakest critical outcome, not lifted by the strongest convenient one.
Building one: grade every critical outcome separately and show the full profile, not just the verdict. State your reason for each domain in a sentence ("not downgraded for indirectness because…") so the assessor is arguing with your reasoning, not guessing at it. Where you made an optimistic call, expect it to be probed, and pre-arm the justification. A transparent Low beats a Moderate you can't defend line by line.
Same skill from both chairs — knowing that in GRADE the label is cheap and the reasoning is everything, and being able to defend, or dismantle, one domain at a time.
Why this matters for HTA
When it lands on your desk: a submission arrives claiming its evidence is high quality. GRADE is how you turn that claim into something you can inspect — and how you decide how much weight the rest of the appraisal can bear.
- You grade the certainty of each critical outcome before you trust any estimate built on it. The effect size feeding a cost-effectiveness model is only as trustworthy as its GRADE certainty. A crisp ICER built on Low-certainty inputs is a crisp number wrapped around a soft core.
- You read certainty as the weakest critical link, not an average. If mortality is High but the quality-of-life input driving most of the QALY gain is Low, the decision inherits the Low. Averaging certainty across outcomes hides exactly the risk you're paid to find.
- You keep GRADE certainty separate from the funding decision. GRADE was built to move from evidence to a clinical recommendation. An HTA reimbursement decision also weighs cost, budget impact, and equity, none of which certainty speaks to. High-certainty evidence of a tiny benefit can still fail on cost; Low-certainty evidence of a huge benefit can still warrant a managed-access deal. Certainty informs the decision; it doesn't make it. (When the evidence is a network meta-analysis, CINeMA, a framework built on GRADE, applies the same logic across the whole network.)
The certainty rating won't decide the appraisal for you. It tells you how hard the estimate can be leaned on before it gives way.
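The "weakest critical link, not an average" rule amounts to taking a minimum over the outcomes flagged as critical. The sketch below is a minimal illustration with a hypothetical helper `overall_certainty`. The mortality and quality-of-life ratings echo the lesson's example, while the non-critical lab marker is invented to show that it is ignored.

```python
# Ladder values from the lesson, keyed by label so ratings can be ordered.
LEVELS = {"Very Low": 1, "Low": 2, "Moderate": 3, "High": 4}


def overall_certainty(outcomes: dict[str, tuple[str, bool]]) -> str:
    """Return the lowest certainty among outcomes flagged as critical."""
    critical = [rating for rating, is_critical in outcomes.values() if is_critical]
    if not critical:
        raise ValueError("no critical outcomes flagged")
    return min(critical, key=LEVELS.__getitem__)


# Each outcome maps to (certainty, is it critical to the decision?).
outcomes = {
    "mortality": ("High", True),
    "quality of life": ("Low", True),
    "lab marker": ("Very Low", False),  # hypothetical; not critical, so ignored
}
print(overall_certainty(outcomes))  # Low
```

An average of the two critical ratings would sit between Moderate and High, which is exactly the comfort the lesson warns against.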
GRADE, in one breath.
- GRADE rates the certainty of evidence — how close the truth probably sits to the estimate — on four levels: High, Moderate, Low, Very Low.
- It rates per outcome, not per study. One trial can be High for mortality and Low for quality of life.
- Certainty is not effect size. You can be highly certain of a small effect, or unsure of a large one.
- Randomised evidence starts High; observational starts Low. The start is not the finish.
- Five domains mark it down — risk of bias, inconsistency, indirectness, imprecision, publication bias — and you learned every one across this module. Three mark observational evidence up, led by a large magnitude of effect.
- The final label is a structured judgement, not a calculation. Its value isn't objectivity — it's that every step is visible, named, and open to challenge, one domain at a time.
GRADE doesn't settle the argument. It makes sure everyone is arguing about the same, visible thing.
That closes Module 4 — and with it, the whole first half of the course. You now have the full toolkit for one question: is the evidence any good? You can read a study, weigh bias against chance, pool results, measure their scatter, hunt the missing ones, compare treatments that never met, and grade how much of it you can trust. But "does it work, and how sure are we?" is only half of an HTA. The other half is the question every payer asks next: is the benefit worth what it costs? That's where Module 5 begins — health economics, opportunity cost, and the machinery that turns a trusted effect into a decision about money.