M4 · EVIDENCE SYNTHESIS
Can you trust this trial?
You've assembled a complete, unbiased set of studies. Now a harder question: is any single one of them telling the truth? A flawless-looking trial can be broken in one specific place — and your job is to find it.
A trial lands in your review. On paper it's beautiful: hundreds of patients, a respected journal, a clear result favouring the new drug.
You read the methods. The randomisation is immaculate. The blinding is airtight. The follow-up is near-complete. The outcome was measured objectively. Four things done right.
Then you reach the fifth: the trial measured six outcomes, and reported only the one that reached significance. The other five are missing.
So — is this a good trial?
Most people hesitate, because four out of five feels like a pass. It isn't. And the reason why is the whole logic of how evidence quality actually works.
Quality isn't a star rating
Here's the instinct to unlearn: that a study's quality is a single verdict — a number of stars, a gut sense of "solid," a reassurance that "the sample was huge, so it's fine."
It doesn't work like that. A huge sample can't fix a broken randomisation. Perfect blinding can't fix a result that was cherry-picked after the fact. Quality isn't one thing you rate — it's several specific places where bias could have entered, judged one at a time.
That's what a modern risk-of-bias tool does. It doesn't ask "is this good?" It walks you through a fixed set of domains — distinct points in a study's design where things can go wrong — and asks, for each, how much room there was for bias to distort the result.
And the domains combine in a way that surprises people: a study is only as trustworthy as its weakest domain. Bias doesn't average out. Four pristine domains and one broken one don't make a study "80% sound" — they make it a study you can't trust, because that one open door is enough to move the result.
One more thing, because it sets up the next lesson: different study designs have different doors. A randomised trial can go wrong in ways an observational study can't, and vice versa. So there isn't one tool — there's a family of them, each built for a study type. This lesson takes the one for randomised trials. The next takes the others. (There are more still, for meta-analyses and other designs — but the logic you learn here is the whole family's logic.)
Two things risk of bias is not
Before the domains, clear two traps that catch even experienced readers.
Risk of bias is not proof that bias happened.
The tool asks whether the mechanism for bias was open — whether the study's design left room for the result to be distorted. "High risk" doesn't mean the result is false. It means you can't rule out that it's distorted, because the safeguard that would have prevented it wasn't there. You're judging the risk, not the verdict.
Risk of bias is not reporting quality.
How well a trial is written up is a different axis from whether its design let bias in. Just as PRISMA governs how transparently a systematic review is reported, a randomised trial has its own reporting standard, CONSORT. Meeting it means the trial is described completely, not that it's free of bias. A trial can tick every CONSORT box, its methods laid out with perfect clarity, and still be at high risk of bias. Immaculate reporting of a broken study is still a broken study.
Keep both straight: you're assessing whether this trial's design left the door open for bias, not whether bias definitely walked through, and not whether the write-up was tidy.
Domain 1: the randomisation process
The tool for randomised trials — RoB 2 — has five domains. Take them one at a time; each is a door.
Domain 1 — the randomisation process. Randomisation's whole job is to make the groups comparable at baseline, differing only by chance, so any later difference is down to the treatment, not to who ended up where. Two things have to hold: the sequence must be truly random, and it must be concealed, meaning whoever was enrolling patients must not be able to foresee or influence which group a patient would get. If they can, they can steer sicker patients away from the drug (consciously or not), and the groups differ before treatment even starts.
Here's how one trial handled it. Judge the risk of bias in this domain.
"Allocation was computer-generated by an independent statistician. Enrolling clinicians requested each assignment from a central telephone service and could not see upcoming allocations."
Judge the risk of bias in this domain:
Domains 2 & 3: sticking to the plan, and vanishing patients
Domain 2 — deviations from the intended intervention. Once randomised, did things unfold as planned? If patients or staff knew the assignment (no blinding) and that knowledge changed their behaviour — extra care for the drug group, patients dropping the placebo — the comparison bends. This is also where intention-to-treat matters: analysing patients in the group they were randomised to, not the group they ended up in, protects the randomisation you worked so hard for.
Domain 3 — missing outcome data. Patients who drop out often don't leave at random. If the sickest patients on the drug quit because of side effects and vanish from the analysis, the drug looks safer and better than it is. A little missing data is normal; a lot, or even a little that differs between groups, is a threat.
Here's the follow-up on one trial. Judge the risk of bias for missing outcome data.
"Of 400 randomised patients, outcome data were available for 210. The 190 missing were disproportionately from the treatment arm, and reasons for dropout were not reported."
Judge the risk of bias for missing outcome data:
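To see why that excerpt is alarming, run the numbers. This is a minimal sketch in Python. The totals (400 randomised, 210 with outcome data) come from the excerpt, but the per-arm split is hypothetical, because the excerpt only says the losses fell disproportionately on the treatment arm. The function name `missing_fraction` is illustrative and not part of any risk-of-bias tool.

```python
def missing_fraction(randomised: int, with_outcome: int) -> float:
    """Share of randomised patients who have no outcome data."""
    if randomised <= 0 or not 0 <= with_outcome <= randomised:
        raise ValueError("need 0 <= with_outcome <= randomised and randomised > 0")
    return (randomised - with_outcome) / randomised


# Figures taken from the excerpt: 400 randomised, outcome data for 210.
overall = missing_fraction(400, 210)
print(f"overall missing: {overall:.1%}")

# HYPOTHETICAL per-arm split (the excerpt reports no arm-level counts):
# 200 randomised per arm, outcome data for 80 (treatment) and 130 (control).
treatment = missing_fraction(200, 80)
control = missing_fraction(200, 130)
print(f"treatment arm missing: {treatment:.1%}")
print(f"control arm missing:   {control:.1%}")
print(f"difference between arms: {treatment - control:.1%}")
```

Nearly half the randomised patients have no outcome data, and under the assumed split the treatment arm lost far more than the control arm. With no reasons for dropout reported, nothing in the excerpt lets you rule out that the losses were related to the outcome.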
Domain 4: measuring the outcome
Domain 4 — measurement of the outcome. Two trials can measure "the same" outcome with very different exposure to bias, and it hinges on one thing: how subjective the outcome is, and whether the assessor was blinded.
An objective outcome — all-cause death — is hard to bias; someone is alive or not. But a subjective outcome — a clinician rating "improvement," a patient scoring their own pain — is wide open if the person judging knows which treatment the patient got. Belief leaks into the score. The fix is a blinded outcome assessor: whoever measures doesn't know the assignment.
Judge the risk of bias for outcome measurement here.
"The primary outcome was symptom improvement, rated by the treating physician, who was aware of each patient's treatment allocation."
Judge the risk of bias for outcome measurement:
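The Domain 4 logic is a two-question check: how much judgement does the outcome need, and did the person judging know the allocation? The sketch below encodes only that simplified rule from this lesson. It is not the RoB 2 algorithm itself, which works through further signalling questions, and the function name `outcome_measurement_risk` is an illustrative assumption.

```python
def outcome_measurement_risk(subjective: bool, assessor_blinded: bool) -> str:
    """Simplified reading of Domain 4: bias needs both judgement and knowledge."""
    if assessor_blinded:
        # The assessor cannot know the allocation, so belief has no route into the score.
        return "low"
    if subjective:
        # Unblinded assessor rating a judgement-based outcome: belief can leak in.
        return "high"
    # Unblinded but objective (for example all-cause death): little room for judgement.
    return "low"


# The excerpt above: physician-rated symptom improvement, physician aware of allocation.
print(outcome_measurement_risk(subjective=True, assessor_blinded=False))

# Contrast: the same unblinded trial measuring all-cause death.
print(outcome_measurement_risk(subjective=False, assessor_blinded=False))
```

The same lack of blinding gives different answers for different outcomes, which is why the domain is judged per outcome rather than once per trial.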
Domain 5: choosing what to report
Domain 5 — selection of the reported result. This is Lesson 1's sin, moved inside a single trial. A study measures many outcomes, at many timepoints, analysed many ways. If the researchers run all of them and report only the ones that came out favourable — deciding after seeing the results which to show — the published finding is cherry-picked, no different in spirit from a reviewer curating trials.
The defence is a pre-registered protocol: the trial declares its primary outcome and analysis before collecting data, so a reader can check that what was reported is what was planned — not what happened to look best. (You've seen this logic twice now: pre-specification is the same guarantee that made a systematic review trustworthy.)
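Checking a paper against its protocol is, at heart, a comparison of lists: what was planned, what was measured, what was reported. Here is a minimal sketch of that comparison. The outcome names and the function `check_against_protocol` are hypothetical, invented only for illustration, and a real assessment also looks at timepoints and analysis methods, not just outcome names.

```python
from __future__ import annotations


def check_against_protocol(
    registered_primary: str | None,
    measured: set[str],
    reported: set[str],
) -> dict:
    """Compare what a trial reported with what it planned and measured."""
    if registered_primary is None:
        verdict = "cannot verify: no pre-registered primary outcome"
    elif registered_primary in reported:
        verdict = "primary outcome reported as planned"
    else:
        verdict = "registered primary outcome not reported"
    return {
        "verdict": verdict,
        "measured_but_unreported": sorted(measured - reported),
    }


# HYPOTHETICAL trial: six outcomes measured, one reported, no registration.
measured = {"mortality", "hospitalisation", "pain_score",
            "quality_of_life", "relapse", "adverse_events"}
reported = {"pain_score"}

result = check_against_protocol(None, measured, reported)
print(result["verdict"])
print(len(result["measured_but_unreported"]), "outcomes measured but not reported")
```

Without a registered primary outcome the check cannot pass, however the single reported result was chosen. That is the sense in which the protocol is the defence: it gives the reader something to compare against.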
Return to the trial from the very first screen — flawless in four domains, but it measured six outcomes and reported only the one that reached significance, with no pre-registered protocol to show the primary was chosen in advance. That's this domain, at high risk. And now the real question: what does one high-risk domain do to the whole trial?
The weakest link
A trial gets a judgement in each domain. Then those combine into one overall risk of bias — and the rule is not an average.
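Here is the trial from Screen 1, domain by domain:
- Domain 1, randomisation process: low risk (the randomisation was immaculate).
- Domain 2, deviations from the intended intervention: low risk (the blinding was airtight).
- Domain 3, missing outcome data: low risk (follow-up was near-complete).
- Domain 4, measurement of the outcome: low risk (the outcome was measured objectively).
- Domain 5, selection of the reported result: high risk (six outcomes measured, one reported, no pre-registered protocol).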
What is the overall risk of bias for this trial?
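The combining rule is easy to state in code, and doing so makes the contrast with averaging concrete. This is a sketch of the weakest-link rule as this lesson describes it, not a full implementation of RoB 2: the complete guidance also allows several "some concerns" judgements to be escalated to high, which this sketch ignores. The names `LEVELS`, `overall_risk` and `naive_average` are illustrative assumptions.

```python
LEVELS = ["low", "some concerns", "high"]  # ordered from best to worst


def overall_risk(domains: dict) -> str:
    """Weakest-link rule: the overall judgement is the worst domain judgement."""
    return max(domains.values(), key=LEVELS.index)


def naive_average(domains: dict) -> float:
    """The tempting but wrong alternative: score each level 0 to 2 and average."""
    return sum(LEVELS.index(j) for j in domains.values()) / len(domains)


# The opening trial: four domains done right, one left open.
trial = {
    "randomisation process": "low",
    "deviations from the intended intervention": "low",
    "missing outcome data": "low",
    "measurement of the outcome": "low",
    "selection of the reported result": "high",
}

print("weakest link:", overall_risk(trial))
print("naive average score (0 = low, 2 = high):", naive_average(trial))
```

The average lands at 0.4 on a 0 to 2 scale, which reads as "mostly fine". The weakest-link rule returns high, because one open door is enough to move the result.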
Bias, or something else?
Risk of bias is one specific axis. It's easy to blur it with other things that sound like "quality." Sharpen the line. For each statement, is it about a trial's risk of bias — or about something else?
"The trial enrolled only 40 patients, so its confidence intervals are wide."
"Outcome assessors were unblinded and the outcome was subjective."
"The paper reported its funding source and followed the journal's reporting checklist."
Why this matters for HTA
A manufacturer's dossier rests its case on one pivotal trial. It's large, published in a top journal, and the headline result is strongly positive. The submission treats it as settled fact.
- A positive result and a trustworthy result are different claims. Your job isn't to ask whether the trial found an effect — it's to ask whether its design left room for that effect to be an artefact. A large, prestigious trial can still be high risk of bias, and the size only makes a biased result look more convincing.
- Go domain by domain, and watch the subjective outcomes hardest. Open-label trials with clinician-rated or patient-reported endpoints are where bias enters most quietly — Domain 4 territory. If the pivotal outcome is subjective and assessment was unblinded, that's not a footnote, it's a threat to the whole case.
- A single high-risk domain caps the trial. You don't need to dismantle every part of a study. Find the weakest domain and name it — because the overall trustworthiness can't rise above it, and a decision worth millions shouldn't rest on a result with one open door.
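The first point above, that size only makes a biased result look more convincing, can be shown with simple arithmetic. The sketch below assumes a difference in means between two equal arms with a known standard deviation, a true effect of zero, and a fixed bias of 0.2. All of those numbers are hypothetical, chosen only to illustrate the pattern, and `ci95` is an illustrative helper.

```python
import math


def ci95(estimate: float, sd: float, n_per_arm: int) -> tuple:
    """Approximate 95% CI for a difference in means with equal arms and known SD."""
    se = sd * math.sqrt(2 / n_per_arm)
    return estimate - 1.96 * se, estimate + 1.96 * se


TRUE_EFFECT = 0.0  # hypothetical: the drug does nothing
BIAS = 0.2         # hypothetical: distortion introduced by the design
SD = 1.0           # hypothetical outcome standard deviation

for n in (50, 5000):
    # A biased design is centred on TRUE_EFFECT + BIAS, whatever the sample size.
    low, high = ci95(TRUE_EFFECT + BIAS, SD, n)
    covers_truth = low <= TRUE_EFFECT <= high
    print(f"n per arm = {n:>4}: CI = ({low:.3f}, {high:.3f}), covers true effect: {covers_truth}")
```

With 50 patients per arm the interval is wide enough to still include the true effect. With 5,000 per arm it tightens around the biased value and excludes the truth. More patients shrink the random error and leave the bias untouched.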
"A trial doesn't have to be fraudulent to be untrustworthy. It only has to leave one door open — and your job is to find which one."
Risk of bias, in one breath
- Study quality isn't a single rating — it's a structured judgement across specific domains, each a place where bias could enter.
- For randomised trials, RoB 2 uses five: the randomisation process, deviations from the intended intervention, missing outcome data, measurement of the outcome, and selection of the reported result.
- It judges risk, not fact: "high risk" means the door for bias was open, not that the result is proven false. And it's a separate axis from reporting quality — a well-written trial can still be high risk.
- The domains don't average. A trial is only as trustworthy as its weakest domain — one high-risk domain makes it high risk overall.
- Different study designs have different doors, so different tools exist for them — which is exactly where the next lesson goes.
"Bias doesn't average out. It propagates from the weakest link — so quality is judged by the door left open, not the four that were shut."
You can now take a randomised trial apart the way an assessor does: domain by domain, hunting the weakest link. But randomisation is a luxury not every study has. When a treatment was never randomised, when you're comparing groups that chose their paths, a whole new door swings open, the one randomisation existed to close: confounding. The next lesson takes the tools built for those studies: ROBINS-I for non-randomised studies of interventions, and QUADAS-2 for studies of the diagnostic tests you met back in Module 3.