M4 · EVIDENCE SYNTHESIS
When there was no randomisation
The last lesson's logic still holds — quality is judged domain by domain, and a study is only as strong as its weakest one. But when a study isn't a randomised trial, a whole new door swings open. Two new tools exist to watch it.
A study reports that a new drug cuts hospital admissions in half. You run it through everything you learned last lesson. The outcome was measured objectively. Almost no one dropped out. The reported result matches a pre-registered plan. By RoB 2's domains, it looks clean.
But there's a line in the methods you skipped: patients were not randomised. Their doctors chose who received the drug and who didn't.
Suddenly the clean domains don't reassure you at all — because the most important question about this study isn't on RoB 2's list. RoB 2 never had to ask it, for one reason: in a randomised trial, it's already answered.
A different study, a different door
Last lesson ended on a promise: different study designs have different doors, so different tools exist for them. Here's the map.
- RoB 2 — for randomised trials. (Last lesson.)
- ROBINS-I — for non-randomised studies of interventions: the drug-vs-no-drug comparison where nobody randomised, like the hospital study above.
- QUADAS-2 — for diagnostic accuracy studies: does this test correctly identify who has the disease?
Using the wrong tool gives a meaningless answer, so the first skill is matching tool to study. For each study, pick the right tool.
"Patients were randomly assigned to apixaban or warfarin; the outcome was stroke within two years."
"Using hospital records, patients who received the drug were compared with those who didn't. Treatment was decided by their physicians, not randomised."
"A new blood test for coeliac disease was compared against intestinal biopsy to estimate its sensitivity and specificity."
Confounding: the door randomisation closed
Here's what RoB 2 never had to ask, and why.
In the hospital study, doctors chose who got the drug. Doctors don't choose randomly — they might give a new drug to their healthier patients (more likely to tolerate it) or their sickest (nothing left to lose). Either way, the two groups now differ before treatment in ways that also affect the outcome. If healthier patients got the drug, they'd have done better anyway — and the drug takes credit that belongs to their starting health.
That third factor — here, baseline health — is a confounder: something that influences both who gets the treatment and the outcome, manufacturing an association that looks causal but isn't.
This is the door randomisation closes. Randomly assigning treatment breaks the link between patient characteristics and which group they land in, so any difference between the groups at the start is down to chance alone. That holds for everything, even things you didn't measure. Take randomisation away and the door stands open. No amount of careful outcome measurement or complete follow-up can close it, which is why a non-randomised study needs a tool that puts this threat first.
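To see the door open and close, here is a minimal simulation sketch. Every number in it is a hypothetical chosen for illustration: half the patients are healthy, admission risk is 10% for the healthy and 40% for the sick, and the drug does nothing at all. The only thing that changes between the two runs is who decides treatment.

```python
import random

# Hypothetical numbers, chosen only for illustration. The drug has NO effect:
# admission risk depends on baseline health alone.
P_ADMIT = {True: 0.10, False: 0.40}  # keyed by "is the patient healthy?"


def doctors_choice(healthy, rng):
    """Doctors favour healthier patients: 80% of them get the drug, 20% of the sick."""
    return rng.random() < (0.8 if healthy else 0.2)


def coin_flip(healthy, rng):
    """Randomisation: baseline health plays no part in who gets the drug."""
    return rng.random() < 0.5


def admission_rates(assign, n=100_000, seed=0):
    """Return (admission rate in treated, admission rate in untreated)."""
    rng = random.Random(seed)
    admitted = {True: 0, False: 0}  # keyed by "was the patient treated?"
    total = {True: 0, False: 0}
    for _ in range(n):
        healthy = rng.random() < 0.5
        treated = assign(healthy, rng)
        total[treated] += 1
        if rng.random() < P_ADMIT[healthy]:
            admitted[treated] += 1
    return admitted[True] / total[True], admitted[False] / total[False]


obs_treated, obs_untreated = admission_rates(doctors_choice)
rct_treated, rct_untreated = admission_rates(coin_flip)
print(f"doctors choose: treated {obs_treated:.2f}, untreated {obs_untreated:.2f}")
print(f"randomised:     treated {rct_treated:.2f}, untreated {rct_untreated:.2f}")
```

Under these assumptions the expected admission rates are about 16% versus 34% when doctors choose, so a useless drug appears to halve admissions. With the coin flip both groups sit near 25%, because baseline health no longer travels with treatment.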
ROBINS-I and the confounding domain
ROBINS-I keeps the domains you already know — deviations, missing data, measurement, selective reporting — and adds the ones randomisation used to handle for free. The first and most important: confounding.
A good observational study fights back. It measures the likely confounders (age, disease severity, comorbidities) and adjusts for them statistically, with methods you'll meet in Module 11, like propensity scores. But here's the limit you must never forget: you can only adjust for confounders you measured. The ones you didn't measure, or didn't think of, stay uncontrolled. Adjustment lowers the risk; unlike randomisation, it never zeroes it.
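A small worked sketch of what adjustment does, using stratification rather than the propensity scores mentioned above (a simpler stand-in for illustration, not the method any particular study used). The counts are hypothetical: within each severity stratum the drug makes no difference, but mild patients were far more likely to receive it.

```python
# Hypothetical counts per severity stratum:
# (treated_admitted, treated_total, untreated_admitted, untreated_total)
strata = {
    "mild": (80, 800, 20, 200),
    "severe": (80, 200, 320, 800),
}


def crude_risk_ratio(strata):
    """Ignore severity: pool everyone, then compare treated with untreated."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Compare like with like inside each stratum, then combine the strata."""
    num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata.values())
    den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata.values())
    return num / den


crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude risk ratio:    {crude:.2f}")
print(f"adjusted risk ratio: {adjusted:.2f}")
```

The crude risk ratio comes out at about 0.47, and the severity-adjusted one at 1.00. The correction works only because severity was recorded. Had it gone unmeasured, there would be no strata to compare within, and the crude figure is all the study could report.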
One thing to note: ROBINS-I uses its own four-level scale, different from RoB 2's. Instead of Low / Some concerns / High, it runs Low / Moderate / Serious / Critical. The difference is meaningful. "Low" here is a high bar — it means risk comparable to a well-conducted randomised trial, which observational studies rarely reach. And at the far end sits "Critical": a study so compromised it shouldn't be used in a synthesis at all — a level RoB 2 doesn't need, because a randomised design can't fail quite that badly on confounding.
Judge the risk of bias from confounding in this study.
"Patients receiving the drug were compared with those who did not. The groups differed substantially in baseline disease severity, and the analysis made no adjustment for it."
The rest of ROBINS-I, briefly
Confounding is the headline, but ROBINS-I adds one more door that randomisation also guarded: selection into the study. If the way patients entered the study is itself tied to both their treatment and their outcome — say, only patients who survived long enough to reach the clinic got included — the sample is skewed before any analysis begins.
ROBINS-I also asks whether each patient's intervention status was classified correctly, a domain this lesson doesn't dwell on. The remaining domains (deviations from intended interventions, missing data, measurement of the outcome, selection of the reported result) you already know: they're the same doors RoB 2 watches, asked again here. That's the elegance of it. You're not learning a new system; you're extending the one you have onto studies with more exposure.
And the weakest-link rule still governs: an observational study can be spotless on every familiar domain, but if confounding is at serious risk, the study is at serious risk overall. In non-randomised research, confounding is usually exactly where the weakest link sits.
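The weakest-link rule can be sketched in a few lines. This is a simplification that takes the worst domain as the overall judgement, exactly as the lesson states the rule. The `Risk` enum, the `overall_risk` helper and the example judgements are illustrative names and values, not part of any official ROBINS-I software.

```python
from enum import IntEnum


class Risk(IntEnum):
    """The ROBINS-I scale, ordered from least to most severe."""
    LOW = 1
    MODERATE = 2
    SERIOUS = 3
    CRITICAL = 4


def overall_risk(domains):
    """Weakest-link rule: the study is only as strong as its worst domain."""
    return max(domains.values())


# Hypothetical study: spotless on the familiar domains, weak on confounding.
hospital_study = {
    "confounding": Risk.SERIOUS,
    "selection into the study": Risk.LOW,
    "deviations from intended intervention": Risk.LOW,
    "missing data": Risk.LOW,
    "measurement of the outcome": Risk.LOW,
    "selection of the reported result": Risk.LOW,
}

overall = overall_risk(hospital_study)
print(overall.name)
```

Five clean domains do not dilute the sixth: the overall judgement here is SERIOUS.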
A different world: diagnostic accuracy
Now a genuinely different kind of study — and a callback to Module 3.
There, you met sensitivity and specificity: how well a test catches disease, and how well it clears the healthy. A diagnostic accuracy study is how those numbers get produced. It takes a group of patients, runs the index test (the new test being evaluated), and compares its verdict against a reference standard — the best available proof of who truly has the disease (for coeliac disease, an intestinal biopsy).
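As a reminder of the arithmetic, here is a short sketch of how paired index-test and reference-standard results become sensitivity and specificity. The 200 patients and their results are hypothetical, and the function names are illustrative.

```python
from collections import Counter


def two_by_two(results):
    """Cross-tabulate (index_positive, reference_positive) pairs, one per patient."""
    counts = Counter(results)
    return {
        "tp": counts[(True, True)],    # test positive, truly diseased
        "fp": counts[(True, False)],   # test positive, truly healthy
        "fn": counts[(False, True)],   # test negative, truly diseased
        "tn": counts[(False, False)],  # test negative, truly healthy
    }


def sensitivity(table):
    """Share of truly diseased patients the index test catches."""
    return table["tp"] / (table["tp"] + table["fn"])


def specificity(table):
    """Share of truly healthy patients the index test clears."""
    return table["tn"] / (table["tn"] + table["fp"])


# Hypothetical study: 50 diseased and 150 healthy by the reference standard.
patients = (
    [(True, True)] * 45 + [(False, True)] * 5
    + [(True, False)] * 15 + [(False, False)] * 135
)
table = two_by_two(patients)
print(f"sensitivity {sensitivity(table):.2f}, specificity {specificity(table):.2f}")
```

Both figures come out at 0.90 here. Note that every patient contributes a reference-standard result; the calculation quietly assumes that truth was established for all of them.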
But those headline numbers — "98% sensitivity!" — are only as trustworthy as the study that produced them. And diagnostic studies have their own distinctive ways of going wrong, ways that have nothing to do with randomisation or confounding. That's why they get their own tool.
QUADAS-2: four domains
QUADAS-2 assesses a diagnostic study across four domains — the same domain-by-domain logic, translated into the world of tests.
- Patient selection — were the patients representative of those the test is really for, or a convenient, easy-to-classify sample?
- Index test — was the new test interpreted without knowing the reference-standard result? (If the person reading the test already knows the answer, their judgement leaks in.)
- Reference standard — is the "truth" it's measured against actually reliable, and was it interpreted blind to the index test?
- Flow and timing — did every patient get the same reference standard, and at a sensible time — or did who-got-verified depend on their test result?
Two of these, patient selection and flow and timing, hide the most damaging, most diagnostic-specific traps. The next section is where they live.
How the design inflates the numbers
The most treacherous flaw in a diagnostic study is partial verification — when who gets the reference standard depends on the index test result.
Picture it: everyone who tests positive on the new blood test gets the confirmatory biopsy. But most who test negative are sent home without one — nobody biopsies an apparently healthy person. Now the study only "confirms" truth for the test-positives. The test-negatives who were actually diseased — the false negatives — are invisible, never verified. Drop the false negatives and sensitivity looks dazzling, because the very cases the test missed were quietly excluded from the count.
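The arithmetic behind that paragraph, as a sketch with hypothetical counts: 200 patients, 50 of them truly diseased, and a test whose real sensitivity is 0.80 and real specificity 0.90. The second calculation assumes the unverified test-negatives are all recorded as disease-free.

```python
def accuracy(tp, fp, fn, tn):
    """Sensitivity and specificity from the four cells of a 2x2 table."""
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


# What a fully verified study would have found (hypothetical counts).
truth = {"tp": 40, "fp": 15, "fn": 10, "tn": 135}


def assume_negatives_healthy(table):
    """Partial verification: no test-negative is biopsied, so the missed
    cases (false negatives) are silently counted as true negatives."""
    return {
        "tp": table["tp"],
        "fp": table["fp"],
        "fn": 0,
        "tn": table["tn"] + table["fn"],
    }


full = accuracy(**truth)
partial = accuracy(**assume_negatives_healthy(truth))
print(f"fully verified:     {full}")
print(f"partially verified: {partial}")
```

Sensitivity jumps from 0.80 to 1.00 without the test changing at all. The ten missed cases simply moved into the disease-free column, which also nudges specificity up slightly.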
A related trap is spectrum bias: if the diseased patients studied are floridly, obviously ill and the healthy ones are clearly well, the test looks far sharper than it will be in the real clinic, where the hard cases are the ambiguous middle.
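Spectrum bias can be sketched as a weighted average: the sensitivity a study reports depends on the mix of cases it enrolled. The stage-specific sensitivities and the case mixes below are hypothetical, chosen only to show the mechanism.

```python
def expected_sensitivity(case_mix, sensitivity_by_stage):
    """Overall sensitivity is the stage-specific sensitivity weighted by
    the share of diseased patients at each stage."""
    return sum(
        share * sensitivity_by_stage[stage] for stage, share in case_mix.items()
    )


# Hypothetical test: strong on advanced disease, weaker on early disease.
sensitivity_by_stage = {"advanced": 0.95, "early": 0.60}

study_mix = {"advanced": 1.0, "early": 0.0}   # only unmistakable cases enrolled
clinic_mix = {"advanced": 0.3, "early": 0.7}  # mostly ambiguous, early cases

in_study = expected_sensitivity(study_mix, sensitivity_by_stage)
in_clinic = expected_sensitivity(clinic_mix, sensitivity_by_stage)
print(f"study sample: {in_study:.3f}, real clinic: {in_clinic:.3f}")
```

Under these assumptions the same test shows a sensitivity of 0.95 in the study sample but roughly 0.70 in the clinic. Nothing about the test changed, only who was tested.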
Judge the risk of bias for this study.
"The new test was evaluated in 200 patients. All who tested positive received the reference-standard biopsy; those who tested negative did not, and were recorded as disease-free."
Flaw to consequence
Each diagnostic design flaw distorts a specific number. You met these numbers in Module 3 — now connect the flaw to what it fakes. For each flaw, pick the metric it inflates.
"Only test-positive patients got the confirmatory reference standard; negatives were assumed healthy."
"The diseased group were all advanced, unmistakable cases; the healthy group were young and clearly well."
"The person reading the new test already knew each patient's biopsy result."
Why this matters for HTA
A manufacturer's dossier leans on a non-randomised "real-world" study, or a diagnostic accuracy study with striking numbers. Both can be legitimate — and both come with doors that a randomised trial simply doesn't have.
- For a non-randomised comparison, confounding is the first question, not a footnote. Ask what could differ between the treated and untreated groups at baseline, whether the study measured those factors, and whether it adjusted for them. Then ask the harder one: what couldn't they measure? Unmeasured confounding is the permanent asterisk on every observational result — it can be reduced, never eliminated.
- For a diagnostic study, interrogate how truth was assigned. Did every patient get the same reference standard, or only the test-positives? Were the patients the ambiguous real-world cases, or a clean, easy spectrum? A dazzling sensitivity built on partial verification is an artefact of design, not a property of the test.
- The right tool is part of the appraisal. A submission that assesses a non-randomised study with RCT criteria — or ignores verification bias in a diagnostic study — has appraised nothing. Matching tool to design is the first thing you check, and the first thing a weak submission gets wrong.
"Randomisation quietly closed a door most people never noticed. Take it away, and the whole appraisal turns on the door that's now open."
Beyond RoB 2, in one breath
- The domain-by-domain logic and the weakest-link rule from last lesson still hold — but the doors depend on how the study was built, so the tool must match the design.
- RoB 2 → randomised trials; ROBINS-I → non-randomised studies of interventions; QUADAS-2 → diagnostic accuracy studies. Using the wrong one appraises nothing.
- Confounding is the door randomisation closed: a third factor shaping both who gets treated and the outcome. Observational studies can adjust for measured confounders, but never for the ones they didn't measure — so the risk is reduced, never zeroed.
- Diagnostic studies inflate their own numbers in distinctive ways: partial verification (only verifying test-positives) inflates sensitivity; an easy patient spectrum and unblinded test reading flatter accuracy on both axes.
- A reported sensitivity or a real-world effect is only as trustworthy as the study design behind it — which is exactly what these tools are built to expose.
"A number is not evidence. A number produced by a sound design is evidence — and telling the two apart is the whole job."
That closes the loop this module opened. You can find every relevant study, count them honestly, search for them without missing any, and now judge whether each one — randomised or not, treatment or test — can be trusted at all. What remains is the payoff: taking a set of studies you've vetted and combining their results into a single answer. That's meta-analysis, and it's where the next lessons go — pooling, heterogeneity, and the forest plot that puts it all on one page.