M4 · EVIDENCE SYNTHESIS

You've learned to find flaws. Now deliver a verdict.

Look back at what M4 taught you to spot. Studies at risk of bias. Effects that scatter more than chance allows. Missing trials that never got published. Comparisons rebuilt across populations that don't quite match. You've become good at finding what's wrong with a body of evidence.

But an appraisal can't be a list of complaints. At some point someone asks the only question that matters for a decision: given everything we've found, how much do we actually trust this estimate? Not "is it flawless" — nothing is — but "is it solid enough to bet a reimbursement decision on?"

That verdict has a name: GRADE. It takes your whole pile of concerns and turns it into one rating of certainty — High, Moderate, Low, or Very Low.

And here's the thing to hold onto before we start: that rating is a judgement, not a calculation. GRADE won't hand you an objective number. What it does instead is more useful — it forces every step of the judgement out into the open, where it can be seen, questioned, and argued. By the end of this lesson you'll see why that's the entire point.

Certainty is confidence in the estimate — and it's rated one outcome at a time.

GRADE rates the certainty of evidence on four levels. Read them as statements about how close the truth probably is to your estimate:

Two things about this trip up almost everyone.

First: certainty is not effect size. These are different axes. You can have high certainty that a drug helps only a little — a big, clean trial showing a tiny benefit. You can have low certainty about a drug that might be a blockbuster — a promising signal from one small, flawed study. "How sure are we" and "how much does it do" are separate questions, and GRADE only answers the first.

Second — and this is the one that reorganises everything: GRADE rates certainty per outcome, not per study. You don't grade "the trial." You grade your confidence in the effect estimate for one specific outcome. The very same randomised trial can deliver high-certainty evidence for mortality — a hard, well-measured endpoint — and low-certainty evidence for quality of life in the same patients, because that outcome was measured on a shaky scale, in fewer people, with more dropout. One study, two certainties. Grade the outcome, never the study.

Where the rating begins depends on the study design — but that's only the starting line.

GRADE gives every outcome a provisional certainty before any scrutiny, based on design:

High: RCT starts here
Moderate
Low: Observational starts here
Very Low

Now the crucial caveat: the starting line is not the finish. "RCT" does not mean "high certainty." An RCT begins at High and then routinely gets marked down — sometimes two or three levels — until it lands at Low or Very Low. Meanwhile a strong observational study can be marked up. A downgraded trial and an upgraded cohort can meet in the middle.

So the design tells you where the pen starts on the ladder. Five forces can push it down. Three can push it up. The rest of this lesson is those forces — and you already know most of them.

Here's the reveal: every reason to downgrade is something this module already taught you.

GRADE marks an outcome down for five reasons. Each is a concept you've already met — GRADE just gives it a slot on the form. Each domain can cost one level (serious) or two (very serious).

That's the whole machine. Five dials, and you've already learned to read every one of them. GRADE's contribution isn't new statistics — it's the discipline of checking all five, every time, in the open.

Now operate the machine yourself.

Below is an outcome shaped like a real appraisal: a randomised trial of an anticoagulant versus warfarin, where the outcome is strokes prevented. It starts at High. Set each domain to how serious the concern is, and watch the certainty fall.

Certainty ladder: High, Moderate, Low, Very Low

Current certainty: High (4 − 0 = 4)

Domains to set: Risk of bias, Inconsistency, Indirectness, Imprecision, Publication bias

Play with it. Set two domains to "Serious" and the High evidence becomes Low. Set one to "Very serious" and you drop two rungs at once. Notice there's a floor — you can't fall below Very Low, no matter how many concerns pile up. This is the actual mechanic. Nothing hidden. Which is exactly why the next question is: who decides what counts as "serious"? Hold that thought.
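
The mechanic can be sketched in a few lines of Python. This is a minimal illustration, not an official GRADE tool: the function name `grade_certainty`, the dictionary layout, and the severity labels are assumptions made for this example. The ladder values, the one-or-two-level penalties, and the floor at Very Low come from the lesson itself.

```python
# Ladder values used in this lesson: High = 4 down to Very Low = 1.
LADDER = {4: "High", 3: "Moderate", 2: "Low", 1: "Very Low"}

# Levels lost per domain rating (label spellings are illustrative).
PENALTY = {"not serious": 0, "serious": 1, "very serious": 2}

# The five downgrade domains.
DOMAINS = (
    "risk of bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication bias",
)


def grade_certainty(start, ratings):
    """Return (level, label) after downgrading from `start`.

    `ratings` maps a domain name to a severity label. Domains left out
    are treated as "not serious". The result never drops below 1.
    """
    unknown = set(ratings) - set(DOMAINS)
    if unknown:
        raise ValueError(f"unknown domain(s): {sorted(unknown)}")
    total = sum(PENALTY[ratings.get(d, "not serious")] for d in DOMAINS)
    level = max(1, start - total)  # floor at Very Low
    return level, LADDER[level]


# Two "serious" concerns take High evidence to Low.
print(grade_certainty(4, {"risk of bias": "serious", "imprecision": "serious"}))
# One "very serious" concern drops two rungs at once.
print(grade_certainty(4, {"indirectness": "very serious"}))
# Every domain "very serious": still Very Low, never lower.
print(grade_certainty(4, {d: "very serious" for d in DOMAINS}))
```

Note what the sketch cannot do: it takes the severity labels as input. Choosing those labels is the judgement, and no function makes that call for you.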

Put a number on it.

Use the ladder values: High = 4, Moderate = 3, Low = 2, Very Low = 1.

Take this outcome. It's randomised evidence, so it starts at 4 (High). On review:

  1. Total downgrade = 1 (risk of bias) + 1 (indirectness) = 2. Final level = 4 − 2 = ?

The real skill isn't spotting a problem. It's filing it under the right domain.

Most GRADE mistakes aren't missed flaws — they're flaws slotted into the wrong dial, which double-counts one and ignores another. And the domain people most often misuse is indirectness, because it's easy to confuse with last lesson's topic.

They are not the same thing. An indirect comparison (Bucher, NMA) is a method — rebuilding A-vs-B through a shared comparator. Indirectness in GRADE is broader: it's any gap between the evidence you have and the question you're asking. Your PICO versus the trial's PICO. It shows up in several flavours:

An indirect comparison is one contributor to indirectness — but a perfectly direct head-to-head trial can still be drowning in indirectness if it was run in the wrong patients on the wrong endpoint.

Sometimes the pen moves up. Rarely, and only for observational evidence.

Observational studies start at Low because confounding is the default worry. But occasionally the evidence is so striking that confounding can't plausibly explain it away — and GRADE lets you rate up. Three triggers:

This is why nobody demands a randomised trial of parachutes, or of insulin for diabetic coma. The effect is so overwhelming that observational evidence earns high certainty on its own.

You won't compute upgrades here — in practice they're uncommon, and they're the one part of GRADE that leans hardest on judgement. But know the direction exists: Low is a starting point, not a ceiling.

Two competent analysts can GRADE the same evidence to different levels. That's not the flaw. That's the feature.

Everything you've done rests on a soft word: serious. How wide a CI is "serious" imprecision? How different must the trial population be before indirectness is "serious"? GRADE gives you the domains and the discipline — it does not give you an objective cut-off. The judgement is yours.

So watch what happens with one real body of evidence, graded from two chairs — same anticoagulant outcome, same trials:

Manufacturer's profile

Start (RCT): 4 (High)
Risk of bias: serious, −1
Inconsistency: 0
Indirectness: not serious, 0
Imprecision: 0
Publication bias: 0
Result: 4 − 1 = 3 (Moderate)

"The trial population is close enough."

Assessor's profile

Start (RCT): 4 (High)
Risk of bias: serious, −1
Inconsistency: 0
Indirectness: serious, −1
Imprecision: 0
Publication bias: 0
Result: 4 − 2 = 2 (Low)

"The funded population is older and sicker than the trial's."

Same trials. Same numbers. One is Moderate, the other Low — and the entire disagreement lives in a single domain: indirectness. You can see it. You can point at it. You can argue it on its clinical merits — is the funded population different enough to matter?
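
One way to make "you can point at it" concrete is to compare the two profiles domain by domain. The Python sketch below is illustrative only: it assumes each profile is stored as a dictionary of levels deducted per domain, and the helper names `final_level` and `disputed_domains` are hypothetical. The numbers are the two profiles from this lesson.

```python
LADDER = {4: "High", 3: "Moderate", 2: "Low", 1: "Very Low"}

# Levels deducted per domain, copied from the two profiles in the lesson.
manufacturer = {
    "risk of bias": 1,
    "inconsistency": 0,
    "indirectness": 0,
    "imprecision": 0,
    "publication bias": 0,
}
assessor = {
    "risk of bias": 1,
    "inconsistency": 0,
    "indirectness": 1,
    "imprecision": 0,
    "publication bias": 0,
}


def final_level(start, deductions):
    """Subtract all deductions from the starting level, with a floor of 1."""
    return max(1, start - sum(deductions.values()))


def disputed_domains(a, b):
    """List the domains where two profiles deduct different amounts."""
    return [domain for domain in a if a[domain] != b.get(domain)]


for name, profile in (("Manufacturer", manufacturer), ("Assessor", assessor)):
    level = final_level(4, profile)
    print(f"{name}: {level} ({LADDER[level]})")

print("Disputed:", disputed_domains(manufacturer, assessor))
# Disputed: ['indirectness']
```

The comparison only locates the disagreement. Whether the funded population really differs enough to matter is still a clinical argument.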

Compare that to a black-box "quality score" that just spits out "high" versus "low" with no visible reason. There, the two sides would simply disagree, with nowhere to stand. GRADE's real gift isn't objectivity it doesn't have — it's making the subjectivity local, named, and contestable.

That's also the fair critique of GRADE, and you should carry it: inter-rater agreement is only moderate, the domain thresholds are genuinely fuzzy, and a determined analyst can steer the result. GRADE doesn't remove that. It just guarantees that when someone steers, they leave tyre marks you can follow.

The other chair

Reading a submission: don't argue with the certainty label; argue with the domain behind it. If a dossier claims "High certainty," go straight to the five dials: which did they check, and what did they call each? The soft calls, imprecision and indirectness, are where an optimistic grade hides. Ask whether the certainty is quoted per outcome or smuggled in as one grade for the whole submission; the flattering trick is to grade the well-measured outcome and let it stand in for the shaky one. And check which outcomes are critical to the decision: overall certainty is dragged down by the weakest critical outcome, not lifted by the strongest convenient one.

Building one: grade every critical outcome separately and show the full profile, not just the verdict. State your reason for each domain in a sentence ("not downgraded for indirectness because…") so the assessor is arguing with your reasoning, not guessing at it. Where you made an optimistic call, expect it to be probed, and prepare the justification in advance. A transparent Low beats a Moderate you can't defend line by line.
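
The "weakest critical outcome" rule is easy to show as a sketch. In the Python below, the outcome list is hypothetical: the mortality and quality-of-life levels echo the earlier example in this lesson, while the laboratory marker and the choice of which outcomes count as critical are assumptions made purely for illustration.

```python
LADDER = {4: "High", 3: "Moderate", 2: "Low", 1: "Very Low"}

# Hypothetical per-outcome ratings for a single submission.
outcomes = [
    {"name": "mortality", "level": 4, "critical": True},
    {"name": "quality of life", "level": 2, "critical": True},
    {"name": "laboratory marker", "level": 3, "critical": False},
]


def overall_certainty(outcomes):
    """Return the lowest level among the outcomes flagged as critical."""
    critical_levels = [o["level"] for o in outcomes if o["critical"]]
    if not critical_levels:
        raise ValueError("no critical outcomes to rate")
    return min(critical_levels)


level = overall_certainty(outcomes)
print(f"Overall certainty: {level} ({LADDER[level]})")
# Overall certainty: 2 (Low)
```

The High rating for mortality does not rescue the overall figure, and the non-critical marker is ignored. The decisive step, deciding which outcomes are critical, happens before any code runs.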

Same skill from both chairs — knowing that in GRADE the label is cheap and the reasoning is everything, and being able to defend, or dismantle, one domain at a time.

Why this matters for HTA

When it lands on your desk: a submission arrives claiming its evidence is high quality. GRADE is how you turn that claim into something you can inspect — and how you decide how much weight the rest of the appraisal can bear.

The certainty rating won't decide the appraisal for you. It tells you how hard the estimate can be leaned on before it gives way.

GRADE, in one breath.

GRADE doesn't settle the argument. It makes sure everyone is arguing about the same, visible thing.

That closes Module 4 — and with it, the whole first half of the course. You now have the full toolkit for one question: is the evidence any good? You can read a study, weigh bias against chance, pool results, measure their scatter, hunt the missing ones, compare treatments that never met, and grade how much of it you can trust. But "does it work, and how sure are we?" is only half of an HTA. The other half is the question every payer asks next: is the benefit worth what it costs? That's where Module 5 begins — health economics, opportunity cost, and the machinery that turns a trusted effect into a decision about money.