M4 · EVIDENCE SYNTHESIS

The experiment that was never run.

A payer has to decide whether to fund Drug A instead of Drug B. Both treat the same condition; B is what clinicians use today. So the question is narrow and concrete: is A better than B, and by enough to justify its price?

Here's the problem. Every trial of A compared it to placebo. Every trial of B compared it to placebo. No trial ever put A against B. The head-to-head experiment that would answer the payer's exact question was never conducted.

Yet the submission answers it anyway. Somewhere in the dossier is a number — "A improves outcomes by X more than B" — and that number was never observed in any single experiment. It was reconstructed from two separate trials that never met.

That reconstruction has a name, a method, and a single load-bearing assumption. Whether you may believe it is one of the most common judgement calls an assessor makes — because in HTA, the head-to-head trial you want usually doesn't exist.

You can't compare A and B directly. But you can walk between them.

Lay out what you actually have:

Trial 1 randomised patients to Drug A or placebo. A came out ahead.
Trial 2 randomised a different set of patients to Drug B or placebo. B came out ahead.

The two trials share one thing: placebo. It appears in both. That shared arm is the common comparator — the anchor — and it's the bridge between two experiments that never overlapped.

Think of it as a little network: A — placebo — B. There's no direct road from A to B, but there's a path through placebo. An indirect treatment comparison walks that path: it uses each drug's result against the shared anchor to work out how the drugs compare to each other.
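To make the picture concrete, here is a minimal Python sketch that treats each trial as an edge in a small graph and searches for a route between two treatments. The function name `find_path` and the edge-list representation are illustrative assumptions, not part of any standard toolkit.

```python
from collections import deque

def find_path(edges, start, goal):
    """Breadth-first search over an undirected evidence network.

    edges: list of (treatment, treatment) pairs, one per direct comparison.
    Returns the shortest list of nodes from start to goal, or None if the
    two treatments are not connected at all.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# The lesson's two trials: A vs placebo, B vs placebo.
trials = [("A", "placebo"), ("B", "placebo")]

has_direct = ("A", "B") in trials or ("B", "A") in trials
path = find_path(trials, "A", "B")

print(has_direct)  # False: no head-to-head trial
print(path)        # ['A', 'placebo', 'B']: the walk through the anchor
```

If `find_path` returns `None`, there is no anchor to walk through, and an anchored indirect comparison is not available.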

The whole lesson turns on one choice: how you walk that bridge. Do it wrong and you get a confident, wrong answer. Do it right and you get an honest, uncertain one.

The obvious move is the wrong one.

Both drugs beat placebo, so why not just compare their results side by side? Drug A's patients scored 10. Drug B's patients scored 11. Line them up: B wins by 1. Done.

Except that's two different trials, two different sets of patients, two different rooms. Comparing the A arm from one against the B arm from the other treats them as if they'd been randomised together — and they never were. Watch what happens when you switch to the honest comparison.

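Naive: A 10 vs B 11 → B better by 1. This arrow crosses between two separate trials.

Anchored: (A 10 − placebo 2) vs (B 11 − placebo 6) → 8 vs 5 → A better by 3. Each subtraction stays inside its own trial.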

Same numbers. Opposite winner. The naive comparison says B. The anchored comparison says A. Only one of them kept the trials' randomisation intact — and the next screen shows why that's the whole game.

The trick isn't the subtraction. It's what the subtraction protects.

Look again at the placebo arms: 2 in Trial 1, 6 in Trial 2. The two trials enrolled genuinely different populations — Trial 2's patients were always going to score higher, drug or no drug. That gap has nothing to do with A or B.

Now here's the mechanism. Within Trial 1, randomisation means the A arm and the placebo arm are the same kind of patient. So whatever made Trial 1's patients different from Trial 2's, it pushed both of Trial 1's arms by the same amount. When you take the difference within the trial — A minus its own placebo — that shared shift cancels. Same in Trial 2. Each relative effect (8 and 5) is already scrubbed clean of the between-trial population gap.

That's why differencing them (8 − 5 = 3) is fair. You're comparing two cancelled quantities.

The naive comparison never cancels anything. Drug A's 10 carries Trial 1's low baseline (2); Drug B's 11 carries Trial 2's high baseline (6). Put 10 next to 11 and you're not comparing the drugs — you're comparing the drugs plus the two populations' baselines, tangled together. The reversal wasn't a quirk. It was the baseline gap leaking straight into the answer.
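A short sketch of that cancellation, using the lesson's four arm scores. The `shift` values are hypothetical: they stand in for any population-wide difference that moves both arms of Trial 2 by the same amount.

```python
def naive_difference(a_arm, b_arm):
    """Compare raw arms across trials (breaks randomisation)."""
    return a_arm - b_arm

def anchored_difference(a_arm, placebo_1, b_arm, placebo_2):
    """Difference each drug against its own trial's placebo, then compare."""
    return (a_arm - placebo_1) - (b_arm - placebo_2)

# Arm scores from the lesson.
a_arm, placebo_1 = 10, 2   # Trial 1
b_arm, placebo_2 = 11, 6   # Trial 2

naive = naive_difference(a_arm, b_arm)                             # -1: "B wins"
anchored = anchored_difference(a_arm, placebo_1, b_arm, placebo_2)  # 3: "A wins"

# Hypothetical check: move Trial 2's whole population up or down.
# A prognostic shift hits the B arm and its placebo arm equally.
naive_results = []
anchored_results = []
for shift in (-4, 0, 5):
    naive_results.append(naive_difference(a_arm, b_arm + shift))
    anchored_results.append(
        anchored_difference(a_arm, placebo_1, b_arm + shift, placebo_2 + shift)
    )

print(naive_results)     # [3, -1, -6]: the "winner" depends on the shift
print(anchored_results)  # [3, 3, 3]: the shift cancels inside Trial 2
```

The naive answer changes sign as the baseline moves; the anchored answer does not move at all. That only holds for shifts that hit both arms equally.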

The anchored indirect comparison, in one line.

This is the Bucher method. Take each drug's effect relative to the shared anchor, and subtract:

effect of A vs B = (effect of A vs placebo) − (effect of B vs placebo)

Plug in the two relative effects:

d(A vs B) = 8 − 5 = 3

Positive favours A, so A beats B by 3.
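The same one-liner as a Python sketch. The difference-scale call uses the lesson's numbers; the ratio version applies the same subtraction on the log scale, and its hazard ratios (0.60 and 0.80) are hypothetical values chosen only for illustration. Both function names are assumptions of this sketch.

```python
import math

def bucher_difference(d_a_vs_anchor, d_b_vs_anchor):
    """Indirect A vs B effect for measures on a difference scale."""
    return d_a_vs_anchor - d_b_vs_anchor

def bucher_ratio(ratio_a_vs_anchor, ratio_b_vs_anchor):
    """Indirect A vs B effect for ratio measures: subtract on the log scale."""
    log_ab = math.log(ratio_a_vs_anchor) - math.log(ratio_b_vs_anchor)
    return math.exp(log_ab)

# Lesson's relative effects: A vs placebo = 8, B vs placebo = 5.
d_ab = bucher_difference(8, 5)
print(d_ab)  # 3

# Hypothetical hazard ratios versus placebo (not from the lesson).
hr_ab = bucher_ratio(0.60, 0.80)
print(round(hr_ab, 2))  # 0.75
```

Subtracting logs is the same as dividing the ratios; working on the log scale matters once you attach variances, because those are usually reported for the log ratio.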

Anchoring buys you honesty. It costs you precision.

You didn't measure A vs B directly — you built it from two estimates, each with its own uncertainty. Uncertainty doesn't cancel; it accumulates. Because the two trials are independent, their variances simply add:

Var(A vs B) = Var(A vs placebo) + Var(B vs placebo)

With Var(A vs placebo) = 4 and Var(B vs placebo) = 2.25:

Var(A vs B) = 4 + 2.25 = 6.25

That is a standard error of 2.5, larger than either trial's own (2 and 1.5).
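To see what the added variance does to the interval, here is a minimal sketch. It assumes a normal approximation with z = 1.96 for a 95% interval, which is a common convention but an assumption here; the lesson gives only the variances.

```python
import math

def bucher_with_uncertainty(d_a, var_a, d_b, var_b, z=1.96):
    """Anchored indirect estimate with its variance, SE and approximate CI.

    Assumes the two trials are independent, so their variances add.
    """
    d_ab = d_a - d_b
    var_ab = var_a + var_b
    se_ab = math.sqrt(var_ab)
    return d_ab, var_ab, se_ab, (d_ab - z * se_ab, d_ab + z * se_ab)

def interval(estimate, variance, z=1.96):
    """Approximate 95% interval for a single estimate."""
    se = math.sqrt(variance)
    return (estimate - z * se, estimate + z * se)

d_ab, var_ab, se_ab, (low, high) = bucher_with_uncertainty(8, 4, 5, 2.25)

print(var_ab, se_ab)                  # 6.25 2.5
print(round(low, 2), round(high, 2))  # -1.9 7.9

# Each drug versus placebo, for contrast.
a_low, a_high = interval(8, 4)      # roughly 4.08 to 11.92
b_low, b_high = interval(5, 2.25)   # roughly 2.06 to 7.94
```

Both drugs clear zero against placebo on their own, yet the indirect A vs B interval runs from about −1.9 to 7.9 and does not.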

Everything so far rests on one assumption. Here's exactly where it breaks.

Anchoring cancelled the placebo-arm gap because that gap shifted both arms of a trial equally. A variable that does that — moves outcomes up or down regardless of which drug you got — is a prognostic factor. Anchoring handles those for free. That's what just saved you.

But some variables don't just shift the level — they change how well the drug works. A variable like that is an effect modifier. If the two trials differ in the distribution of an effect modifier, the relative effects (8 and 5) are no longer measuring the same thing, and subtracting them mixes "A vs B" with "Trial 1's population vs Trial 2's." Anchoring can't cancel it, and — the dangerous part — the numbers won't warn you. A clean-looking d = 3 can be quietly biased.

This is transitivity (or exchangeability): the indirect comparison is valid only if the trials are similar enough in their effect modifiers that each drug's effect against the anchor would carry over to the other trial's population. It's a judgement about clinical and population similarity, not something a p-value can rescue.
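A toy illustration of how an effect modifier slips past the anchor. Everything in it is hypothetical: the modifier (share of severe patients), the two linear effect functions, and the shares in each trial are invented so that the trial-level effects come out at the lesson's 8 and 5.

```python
def effect_of_a(share_severe):
    """Hypothetical: A's benefit over placebo grows with the share of severe patients."""
    return 5 + 4 * share_severe

def effect_of_b(share_severe):
    """Hypothetical: B's benefit over placebo also grows, but more slowly."""
    return 4.5 + 2 * share_severe

# Hypothetical populations: Trial 1 is mostly severe, Trial 2 mostly mild.
trial_1_severe = 0.75
trial_2_severe = 0.25

d_a_observed = effect_of_a(trial_1_severe)  # 8.0, what Trial 1 reports
d_b_observed = effect_of_b(trial_2_severe)  # 5.0, what Trial 2 reports

bucher = d_a_observed - d_b_observed        # 3.0, the clean-looking answer

# What A vs B would be if both drugs were used in the same population.
true_in_trial_1_pop = effect_of_a(trial_1_severe) - effect_of_b(trial_1_severe)
true_in_trial_2_pop = effect_of_a(trial_2_severe) - effect_of_b(trial_2_severe)

print(bucher)               # 3.0
print(true_in_trial_1_pop)  # 2.0
print(true_in_trial_2_pop)  # 1.0
```

The anchored estimate matches neither population. Nothing in the arithmetic flags it; only knowing that severity modifies the effect, and that the trials differ in severity, would.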


Two trials and one anchor is the smallest case.

Real evidence bases are messier, and richer. Suppose you have A vs placebo, B vs placebo, B vs C, and C vs A. Now the treatments form a whole web. A network meta-analysis (NMA) generalises the Bucher move across that entire web at once: it estimates every treatment against every other, combining direct evidence (where a head-to-head trial exists) with indirect evidence (paths through the network), and lets each comparison borrow strength from the rest.

The bonus is a built-in audit. Where the network contains a loop — say A vs B measured directly and reconstructed indirectly through C — you can check whether the two agree. That agreement is called consistency. When direct and indirect estimates clash, the loop is telling you that transitivity is failing somewhere: the trials in the network aren't as exchangeable as the method assumes.

So NMA doesn't dissolve the transitivity assumption — it scales it up, and hands you a partial way to test it. Every warning from the two-trial case still applies, now across a whole graph.

Network diagram: solid = direct trials; dashed = the indirect A–B path, reconstructed through the network.
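The consistency check on a single loop can be sketched as below. It assumes the direct and indirect estimates come from separate trials (so they are independent) and uses a normal approximation. The indirect estimate reuses the lesson's 3 with variance 6.25; the direct estimate and its variance are hypothetical. A full NMA would assess consistency across the whole network with a fitted model, which this sketch does not attempt.

```python
import math

def loop_inconsistency(direct, var_direct, indirect, var_indirect):
    """Compare a direct and an indirect estimate of the same contrast.

    Returns the difference, its standard error, a z statistic and a
    two-sided p-value under a normal approximation.
    """
    diff = direct - indirect
    se = math.sqrt(var_direct + var_indirect)
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return diff, se, z, p

# Indirect A vs B from the lesson; the direct trial result is hypothetical.
diff, se, z, p = loop_inconsistency(
    direct=1.0, var_direct=1.0,
    indirect=3.0, var_indirect=6.25,
)

print(diff)         # -2.0
print(round(z, 2))  # -0.74
print(round(p, 2))
```

A small z here is weak reassurance, not proof: with an indirect estimate this imprecise, the check has little power to detect a real clash.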

What if the trials genuinely differ in an effect modifier — and you know it?

Anchoring alone can't fix that (previous screens). But if you hold individual patient data for your own trial and only published averages for the competitor's, you can do something anchoring can't: adjust the populations to match.

A matching-adjusted indirect comparison (MAIC) reweights your own patients so that their average characteristics — age, severity, whatever the effect modifiers are — line up with the competitor trial's reported averages. Then it runs the indirect comparison on the reweighted, matched population. It's the population-level fix for exactly the effect-modifier imbalance that breaks a plain anchored comparison.
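The reweighting step can be sketched for a single covariate. This follows the method-of-moments idea commonly described for MAIC (weights of the form exp(β·x), with β chosen so the weighted mean hits the target), solved here with a plain Newton iteration. The ages and the competitor's mean age are hypothetical, and a real analysis would match several covariates at once with a proper optimiser.

```python
import math

def maic_weights(values, target_mean, tol=1e-10, max_iter=100):
    """Weights w_i = exp(beta * (x_i - target)) whose weighted mean equals target.

    Minimises sum(exp(beta * (x_i - target))) over beta by Newton's method.
    Assumes target_mean lies strictly inside the range of `values`;
    otherwise no finite solution exists.
    """
    centred = [v - target_mean for v in values]
    beta = 0.0
    for _ in range(max_iter):
        expo = [math.exp(c * beta) for c in centred]
        gradient = sum(c * e for c, e in zip(centred, expo))
        hessian = sum(c * c * e for c, e in zip(centred, expo))
        step = gradient / hessian
        beta -= step
        if abs(step) < tol:
            break
    return [math.exp(c * beta) for c in centred]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical individual patient data: ages in your own trial.
ages = [50, 55, 60, 65, 70, 75]
# Hypothetical published mean age in the competitor's trial.
competitor_mean_age = 68

weights = maic_weights(ages, competitor_mean_age)

print(sum(ages) / len(ages))                   # 62.5 before weighting
print(round(weighted_mean(ages, weights), 4))  # 68.0 after weighting
```

Older patients end up with larger weights, which is exactly how the reweighted sample comes to resemble the competitor's population.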


Two cautions an assessor watches for:

Anchored vs unanchored. With a shared comparator, MAIC only has to balance effect modifiers — demanding but doable. Without one (e.g. single-arm study vs single-arm study), it becomes unanchored and must balance every prognostic factor and effect modifier, all of them, correctly. That's a heroic assumption, and NICE's methods guidance treats unanchored results with real suspicion.
The cost is precision. Reweighting throws weight onto a subset of your patients, shrinking the effective sample size — sometimes from hundreds to a couple of dozen. A matched comparison can be far less certain than its tidy point estimate suggests.
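The effective sample size is usually approximated as the squared sum of the weights divided by the sum of the squared weights. A minimal sketch, with hypothetical weights chosen to show the collapse:

```python
def effective_sample_size(weights):
    """Approximate ESS: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

# 300 patients, all weighted equally: nothing is lost.
equal_weights = [1.0] * 300
ess_equal = effective_sample_size(equal_weights)

# Hypothetical skewed weights: 20 patients resemble the competitor's
# population and carry almost all the weight; the other 280 barely count.
skewed_weights = [10.0] * 20 + [0.1] * 280
ess_skewed = effective_sample_size(skewed_weights)

print(round(ess_equal))      # 300
print(round(ess_skewed, 1))  # 26.0
```

Same 300 patients on paper; about 26 doing the work. The point estimate looks no different, which is why the effective sample size has to be reported alongside it.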

(MAIC's regression-based sibling, STC, does the same job by modelling outcomes rather than reweighting; same assumptions, same scrutiny.)

The other chair

Reading a submission: start with the picture, not the number. Is the network even connected to the comparator you care about? Then interrogate transitivity directly: which effect modifiers matter clinically, and were the linked trials similar in them? Was that argued, or just assumed? For any NMA, ask whether closed loops were consistent. For any MAIC, ask whether it was anchored, which modifiers it matched, and what the effective sample size collapsed to. A confident indirect point estimate with an unexamined transitivity assumption is a number resting on nothing.

Building one: the assessor will ask all of that, so pre-empt it. State your effect modifiers up front and show the trials are comparable on them. Don't make the reviewer go looking. Report the indirect comparison's full uncertainty, wide interval and all; a crossed-zero CI you disclose is more credible than a tidy one you don't. If you used MAIC, report the effective sample size honestly and justify anchored over unanchored. Argue transitivity as a case, not an aside.

Same skill from both chairs — knowing that the indirect number is only ever as good as the assumption underneath it, and being able to say whether that assumption holds.

Why this matters for HTA

When it lands on your desk: most technologies reach an appraisal without a head-to-head trial against the comparator that matters. So the pivotal comparison — the one the recommendation turns on — is almost always reconstructed: an indirect comparison, a network meta-analysis, or a MAIC. This is not a corner case. It is the median case.

You judge the assumption, not just the estimate. The arithmetic is trivial and always produces a number. The transitivity assumption is where the real appraisal happens — clinical and population reasoning, not statistics.
You treat the indirect estimate as wider than it looks. Variances add; population adjustment shrinks effective samples. An indirect comparison that "shows" superiority may have an interval that never cleared zero.
You price the residual uncertainty into the decision. A recommendation built on a fragile indirect comparison is a candidate for a managed-access or risk-sharing arrangement rather than a clean yes — the same reflex as with an incomplete evidence base.

The comparison that decides the appraisal is usually the one no experiment ever performed. Your job is to know when its reconstruction can be trusted.

Indirect comparisons, in one breath.

When no head-to-head trial exists, you compare two treatments through a shared anchor — a common comparator both trials used.
Never compare the arms directly across trials. That breaks randomisation and confounds the drugs with their trials' populations — it can even flip the winner.
The Bucher method subtracts each drug's effect-versus-anchor: d(A vs B) = d(A vs placebo) − d(B vs placebo). For ratios, do the same subtraction on the log scale.
The price is precision: variances add, so the indirect interval is wider — often wide enough to cross zero.
It all rests on transitivity: the trials must match on effect modifiers. Anchoring cancels prognostic differences for free; effect-modifier differences it cannot see or fix.
NMA scales this to a whole network and lets closed loops test consistency. MAIC adjusts populations when you have patient-level data — powerful anchored, perilous unanchored, and always costly in effective sample size.

You can always compute the number. Whether you may believe it is what transitivity decides.

That closes the question of what can distort a synthesis: how much studies scatter (heterogeneity), whether you have all of them (publication bias), and whether the ones you have can even be compared (indirect comparisons). One question remains — how do you turn all of this into a single, defensible verdict on how certain the evidence is? That's GRADE, and it's where the whole module lands.