M4 · EVIDENCE SYNTHESIS — LESSON 3

From question to query

PRISMA showed you a box that said "1,240 records identified." This is where that number comes from — and why a good searcher wants it to be big.

Last lesson, every flow diagram started with the same kind of box: records identified from databases — 1,240. We treated that number as a given. But someone had to produce it, by writing a search and running it. That search is the most consequential, least glamorous step in the whole review.

Here's the part that surprises people. If you searched the way you search Google — tuning the query until the first page is all relevant — you would build a bad systematic search. A good searcher does almost the opposite. They write a search that drags in hundreds of irrelevant records on purpose, and they consider that a feature, not a failure.

To see why, you need to build one — and then meet the two numbers that judge it.

From PICO to a query

You already have the raw material: the PICO question from Module 1. A search strategy is just PICO, translated into the language a database understands.

Take a concrete question: "In adults with atrial fibrillation, does apixaban versus warfarin reduce stroke?"

Each concept becomes a block. The rule that wires a query together is simple and absolute:

inside a block, list every way a concept might be named, joined so that any one of them counts as a match;
between blocks, join them so that a record must match both to qualify.
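
The rule above can be sketched in a few lines of Python. This is only an illustration: the function names build_block and build_query are invented here, and the output is a generic Boolean string, not the exact syntax of any particular database.

```python
def build_block(terms):
    """Join every name for one concept with OR, wrapped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(blocks):
    """Join concept blocks with AND, so a record must match every block."""
    return " AND ".join(build_block(terms) for terms in blocks)


# Terms taken from the atrial fibrillation example in this lesson.
population = ['"atrial fibrillation"', '"AF"']
intervention = ["apixaban", '"factor Xa inhibitor"']

query = build_query([population, intervention])
print(query)
```

Adding a synonym lengthens one OR list; adding a concept adds one more AND. Those are the only two moves the rule allows.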

You have two operators to place: OR (matches if either side is present) and AND (matches only if both are). Drop the right one into each slot so the logic follows the rule.

("atrial fibrillation" ___ "AF") ___ (apixaban ___ "factor Xa inhibitor")

Inside a block

Two tools do most of the work inside a block.

Controlled vocabulary. Databases tag each record with terms from a fixed thesaurus — MeSH in MEDLINE, Emtree in Embase. Searching the tag catches a record even if its text used different words. You combine the tag with free-text terms (again, any-of) so you catch both the well-indexed records and the brand-new ones not yet tagged.

Truncation. A wildcard captures word variants from one stem: anticoagul* matches anticoagulant, anticoagulants, anticoagulation. One term, many forms.
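
To make truncation concrete, here is a minimal sketch that imitates it with a regular expression. It assumes the wildcard means "this stem followed by any run of letters"; real databases implement truncation in their own ways, and truncation_to_regex is a hypothetical helper, not a database feature.

```python
import re


def truncation_to_regex(term):
    """Turn a term such as 'anticoagul*' into a whole-word regex."""
    stem = term.rstrip("*")
    # A trailing * allows any further word characters after the stem.
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(r"\b" + re.escape(stem) + suffix + r"\b", re.IGNORECASE)


pattern = truncation_to_regex("anticoagul*")
words = ["anticoagulant", "anticoagulants", "anticoagulation", "antiplatelet"]

# Keep only the words the truncated term would match.
matched = [w for w in words if pattern.search(w)]
print(matched)
```

The three anticoagul- forms match and "antiplatelet" does not, which is the "one term, many forms" behaviour described above.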

Now the expert move beginners miss: you usually don't put all four PICO elements into the search. Population and Intervention are searched; Comparator and Outcome often are not. Why? Outcomes and comparators are named inconsistently and indexed poorly — a trial measuring stroke might never say "stroke" in its title or abstract. Require it with an AND block and you silently drop real trials. So you search the concepts that are reliably named, and you check the rest later, by reading. The search casts wide; the screening narrows.
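
The cost of an Outcome block can be shown on a toy example. The three records below are invented for illustration, and the matching is plain substring search, far cruder than a real database; the point is only the logic of adding one more AND.

```python
def matches_block(text, terms):
    """A block matches if any one of its terms appears in the text."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def run_search(records, blocks):
    """Return ids of records that match every block (AND between blocks)."""
    return [rid for rid, text in records.items()
            if all(matches_block(text, terms) for terms in blocks)]


# Hypothetical title/abstract snippets, not real studies.
records = {
    "trial_A": "Apixaban versus warfarin in atrial fibrillation: stroke outcomes",
    "trial_B": "Apixaban versus warfarin in atrial fibrillation: thromboembolic events",
    "other_C": "Statin therapy in atrial fibrillation and stroke",
}

population = ["atrial fibrillation"]
intervention = ["apixaban"]
outcome = ["stroke"]

without_outcome = run_search(records, [population, intervention])
with_outcome = run_search(records, [population, intervention, outcome])

print(without_outcome)  # both apixaban trials
print(with_outcome)     # trial_B has silently dropped out
```

In this sketch trial_B is a relevant trial that never uses the word "stroke", so requiring the Outcome block removes it without any warning.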

The two numbers

Every search is judged on two axes. You've met one of them before.

Sensitivity (also called recall): of all the truly relevant studies that exist, what fraction did your search catch?

sensitivity = relevant retrieved ÷ all relevant that exist = TP ÷ (TP + FN)

That is the exact formula you saw for a diagnostic test's sensitivity in Module 3. Only now the "disease" is "this study is relevant" and the "positive test" is "the search retrieved it". A missed relevant study is a false negative, just as a missed case is.

Precision: of everything your search retrieved, what fraction was actually relevant?

precision = relevant retrieved ÷ all retrieved = TP ÷ (TP + FP)

Why precision, not specificity?

In diagnostics the second axis was specificity. Here it would be useless. Specificity asks about true negatives: irrelevant records you correctly didn't retrieve. In a database of millions, the count of true negatives is astronomical, so specificity is almost always near 100% and tells you nothing. Precision asks the question that actually matters to a human screener: of the pile I now have to read, how much is worth reading?
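
A quick calculation shows how uninformative specificity becomes. Every count below is made up for illustration (a 5,000,000-record database, 50 relevant studies, a search returning 1,000 records with 45 relevant ones among them); none of them comes from this lesson's worked example.

```python
def specificity(tn, fp):
    """Share of irrelevant records that the search correctly left behind."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Share of retrieved records that were actually relevant."""
    return tp / (tp + fp)


# Hypothetical counts, chosen only to show the scale of the effect.
database_size = 5_000_000
relevant_total = 50
retrieved = 1_000

tp = 45                              # relevant and retrieved
fp = retrieved - tp                  # irrelevant but retrieved
fn = relevant_total - tp             # relevant but missed
tn = database_size - tp - fp - fn    # irrelevant and not retrieved

spec = specificity(tn, fp)
prec = precision(tp, fp)
print(f"specificity = {spec:.4%}")
print(f"precision   = {prec:.1%}")
```

Under these assumptions specificity sits above 99.9% even though more than nine in ten retrieved records are noise. Precision, at 4.5%, is the figure that describes the screener's workload.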

Compute the catch

A team registers a protocol for this question. Through exhaustive checking — hand-searching, contacting experts — they establish that 40 relevant trials truly exist. Call it the gold standard.

They run a broad search. It returns 760 records, and among them are 38 of the 40 relevant trials.

Sensitivity asks: of all the relevant trials that exist, what fraction did the search catch? Build the formula.

sensitivity = ___ ÷ ___

Compute the noise

Same search, the other axis. It returned 760 records in total, of which 38 were relevant. So precision = 38 ÷ 760 — and this time the point isn't the exact figure, it's the scale of it.

Don't divide it out. Estimate: of everything this search dragged in, roughly what fraction was actually worth reading?

The temptation, and why you resist it

A precise search looks so much more efficient. Compare the broad search with a narrow one on the same question:

                Retrieved   Relevant caught   Sensitivity   Precision
Broad search    760         38                95%           5%
Narrow search   120         30                75%           25%

The narrow search is five times more precise. Only 120 records to screen, a quarter of them useful — a far pleasanter afternoon. It is also unfit for a systematic review, and the table tells you why: it caught 30 of 40 trials. It missed ten.
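
The comparison can be checked with a short script. It uses only the counts given in this lesson (40 relevant trials; 760 retrieved with 38 relevant; 120 retrieved with 30 relevant); the function and variable names are just for illustration.

```python
def sensitivity(relevant_caught, relevant_total):
    """Fraction of all relevant studies that the search retrieved."""
    return relevant_caught / relevant_total


def precision(relevant_caught, retrieved):
    """Fraction of retrieved records that were relevant."""
    return relevant_caught / retrieved


RELEVANT_TOTAL = 40  # the gold standard from the worked example

# name -> (records retrieved, relevant trials caught)
searches = {"Broad search": (760, 38), "Narrow search": (120, 30)}

summary = {}
for name, (retrieved, caught) in searches.items():
    summary[name] = {
        "sensitivity": sensitivity(caught, RELEVANT_TOTAL),
        "precision": precision(caught, retrieved),
        "missed": RELEVANT_TOTAL - caught,
    }
    row = summary[name]
    print(f"{name}: sensitivity {row['sensitivity']:.0%}, "
          f"precision {row['precision']:.0%}, missed {row['missed']}")
```

The "missed" figure is the one to watch: 2 for the broad search, 10 for the narrow one. It can only be computed here because the gold standard is known, which is never the case in a real review.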

Here's the trap that ties this module together: those ten missing trials are invisible. They don't show up as an error. The review proceeds on 30 trials, looking complete and tidy — exactly as a narrative review looks complete, with no auditable denominator. A search that's too narrow reintroduces selection bias through the back door: not by excluding studies on their results, but by never finding them at all. The reader can't audit what was never retrieved.

That's why the systematic searcher chooses the 5%-precision search and accepts the screening grind. High sensitivity first; precision is a convenience, not a goal.

Steering the search

You steer a search by widening it (more sensitivity, less precision) or narrowing it (more precision, less sensitivity). Knowing which lever does which is the searcher's core skill.

Here are four edits to a search strategy. Tag each one: does it widen the search or narrow it?

Add three more synonyms inside the population block, joined with OR.

Add a new concept block for the outcome, joined with AND.

Replace 'anticoagulant' with the truncated 'anticoagul*'.

Add a filter restricting results to randomised controlled trials.

Why this matters for HTA

A manufacturer's dossier lands on your desk, and the appendix lists the search strategy behind their systematic review. Most people skim past the wall of OR/AND syntax. For an assessor, it's one of the highest-yield things to read.

A search you can't see is a search you can't trust. A reported, dated, reproducible strategy is the baseline. But go further: read the blocks. A search with thin Population and Intervention blocks, few synonyms, and no truncation may have quietly low sensitivity, and a manufacturer with an inconvenient trial has a soft incentive not to fix that.
Over-narrowing is the elegant way to lose studies. Watch for an Outcome block joined with AND, or an aggressive design filter. Each looks rigorous and each silently cuts sensitivity. The trials that fall out never appear in the flow diagram, so the gap is invisible unless you interrogate the strategy itself.
One database is a red flag. A search of MEDLINE alone misses studies indexed only in Embase or CENTRAL. A single-database search is a sensitivity problem wearing the costume of thoroughness.
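
The single-database problem is easy to picture with sets. The record identifiers below are invented, and the sketch assumes every record carries one shared identifier across databases. Real de-duplication is messier, because the same study is often indexed under different identifiers in each source.

```python
# Hypothetical result sets from three databases, keyed by a shared record id.
medline = {"rec01", "rec02", "rec03"}
embase = {"rec02", "rec03", "rec04", "rec05"}
central = {"rec03", "rec06"}

# The union removes duplicates: each record is counted once.
combined = medline | embase | central

# Records a MEDLINE-only search would never have surfaced.
missed_by_medline_alone = combined - medline

print(len(combined))
print(sorted(missed_by_medline_alone))
```

In this toy case half of the combined set appears only outside MEDLINE, and a MEDLINE-only flow diagram would give no hint of it.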

Selection bias doesn't only happen when you exclude a study. It happens, more quietly, when your search was never built to find it.

Search strategy, in one breath

A search strategy is a PICO question translated for a database: each concept becomes a block, synonyms joined with OR within a block, blocks joined with AND between them.
Two numbers judge it. Sensitivity = relevant caught ÷ all relevant that exist (the same formula as a diagnostic test). Precision = relevant caught ÷ everything retrieved.
They trade off. Widening a search (more synonyms, truncation, controlled vocabulary) raises sensitivity and lowers precision; narrowing it (extra AND blocks, design filters) does the reverse.
A systematic review prizes sensitivity first and tolerates brutal precision — often screening twenty records to find one — because a missed study is an invisible, unauditable hole, the same sin as cherry-picking.
Comparator and Outcome are usually left out of the search and checked at screening, because they're named too inconsistently to require without dropping real trials.

A good systematic search feels inefficient on purpose. The wasted reading is the receipt that proves nothing was quietly left out.

You can now read the top box of a PRISMA diagram and know what produced it — and whether it was built to find everything or quietly built to miss. But finding studies and trusting them are different things. The next lessons turn to the studies you've retrieved and ask what the search can't: are these studies any good, how do we grade their risk of bias, and once we trust them, how do we combine their results into a single answer?