M4 · EVIDENCE SYNTHESIS

They disagree. Is that a problem?

Five trials tested the same blood-pressure therapy. One found a 2 mmHg drop, another 8. Your pooled estimate sits at 5.

The obvious worry: the trials disagree, so maybe pooling them is nonsense.

But "disagree" is the wrong word. Every trial estimates its effect with error — so even five trials measuring exactly the same truth would scatter. The real question isn't whether the numbers differ. It's whether they differ more than chance alone can explain.

That excess — scatter beyond noise — is called heterogeneity. This lesson is about measuring it, and about the one number everyone quotes for it and almost everyone reads wrong.

Illustrative dataset, built for this lesson.

Scatter is not the signal

Two things push study estimates apart:

Chance (sampling error). Small trials bounce around a lot; that bouncing is noise, not disagreement.
Real differences in the true effect — different populations, doses, follow-up. That is heterogeneity.

A set of studies is heterogeneous only when the spread of their estimates is bigger than their confidence intervals say it should be. Wide CIs forgive a lot of spread. Thin CIs forgive almost none.

[Figure: two sets of study estimates, labelled A and B.]

Which set shows more heterogeneity?

You can't eyeball this

The instinct is to look at a forest plot and judge overlap by eye. It fails — in a specific, dangerous direction.

Overlap depends on precision, which depends on sample size. Run the same trials with more patients and every CI shrinks. Nothing about the real effects changed — but the error bars stop overlapping and the plot "looks heterogeneous."

The reverse is worse: genuinely different effects in small trials hide behind wide CIs and "look consistent." Eyeballing under-calls heterogeneity in exactly the small, underpowered meta-analyses where it matters most.
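A minimal sketch of why overlap misleads, under assumptions that are not part of the lesson: two trials with point estimates fixed at 4 and 6 mmHg, a per-patient SD of 10 mmHg, and a simplified standard error of SD/√n. The helper names `ci95` and `overlaps` are hypothetical.

```python
import math

def ci95(estimate, sd, n):
    """Approximate 95% CI, using the simplified standard error sd / sqrt(n)."""
    half_width = 1.96 * sd / math.sqrt(n)
    return (estimate - half_width, estimate + half_width)

def overlaps(a, b):
    """True when two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

SD = 10.0  # assumed per-patient SD in mmHg (illustrative only)

for n in (50, 5000):
    ci_a = ci95(4.0, SD, n)  # trial A: estimate held at 4 mmHg
    ci_b = ci95(6.0, SD, n)  # trial B: estimate held at 6 mmHg
    print(f"n={n}: A={ci_a[0]:.2f}..{ci_a[1]:.2f}, "
          f"B={ci_b[0]:.2f}..{ci_b[1]:.2f}, overlap={overlaps(ci_a, ci_b)}")
```

The point estimates are identical in both rows. Only the sample size changes, and with it the visual impression of agreement.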

We need a measure that accounts for precision. That's the job of Q, and the two numbers we read off it.

Same effects, opposite impression.

Cochran's Q — isolate the signal

Start by measuring total scatter, weighted by precision: each study's squared distance from the pooled estimate, times its weight (precise studies count more), summed.

Q = Σ w_i (y_i − ȳ)²

Here's the trick that makes Q useful. If every study shared one true effect — pure noise, no real differences — Q would, on average, equal k − 1 (the number of studies minus one). That's its expected value under perfect homogeneity.
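The k − 1 benchmark can be checked by simulation. The sketch below assumes five trials that share one true effect of 5 mmHg, each with a known sampling variance of 2 (so weight 0.5), and normally distributed errors. The function names, seed, and run count are illustrative choices, not part of the lesson.

```python
import random

def cochran_q(effects, weights):
    """Weighted sum of squared distances from the weighted mean."""
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))

def mean_q_under_homogeneity(k=5, true_effect=5.0, variance=2.0,
                             runs=20000, seed=1):
    """Average Q over many simulated meta-analyses with no real heterogeneity."""
    rng = random.Random(seed)
    sd = variance ** 0.5
    weights = [1.0 / variance] * k
    total = 0.0
    for _ in range(runs):
        # Every trial estimates the SAME truth; only sampling error differs.
        effects = [rng.gauss(true_effect, sd) for _ in range(k)]
        total += cochran_q(effects, weights)
    return total / runs

print(mean_q_under_homogeneity())  # should land close to k - 1 = 4
```

Individual simulated Q values scatter widely around 4, which is why a single Q slightly above k − 1 is weak evidence of real differences.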

So the signal we want isn't Q itself. It's Q's excess over k − 1. Everything below is built from that one quantity.

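Our five effects are {2, 4, 5, 6, 8}, pooled at 5, each with weight 0.5. Compute Σ(y_i − 5)²: the squared distances are 9, 1, 0, 1, 9, which sum to 20. So Q = 0.5 × 20 = 10, and the excess over k − 1 = 4 is 6.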
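The arithmetic for the five-trial example can be checked in a few lines. This is a sketch using the lesson's numbers; the function name `cochran_q` is just a label chosen here.

```python
def cochran_q(effects, weights):
    """Q = sum of w_i * (y_i - pooled)^2, with pooled the weighted mean."""
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return pooled, q

effects = [2, 4, 5, 6, 8]   # mmHg, from the lesson's illustrative dataset
weights = [0.5] * 5         # equal weights, as stated in the lesson

pooled, q = cochran_q(effects, weights)
excess = q - (len(effects) - 1)   # Q minus its expected value k - 1

print(pooled, q, excess)  # 5.0 10.0 6.0
```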

Same scatter, two different questions

That excess of 6 answers two completely different questions — and conflating them is the single most common mistake in evidence synthesis. Meet the pair.

τ² — "How big is the spread?"

The variance of the true effects, in the effect's own squared units (mmHg²). A magnitude.

τ² = 3.0 mmHg² (SD ≈ 1.7 mmHg)

This is the honest answer to how much does the therapy's effect actually vary between settings?
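The lesson does not name the estimator behind the 3.0. One common choice, the DerSimonian-Laird method of moments, reproduces it, so the sketch below assumes that method: it divides Q's excess over k − 1 by a scaling constant built from the weights.

```python
import math

def dl_tau_squared(effects, weights):
    """DerSimonian-Laird estimate: (Q - (k - 1)) / C, floored at zero."""
    k = len(effects)
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    # C converts the unitless excess back into the effect's squared units.
    c = sum_w - sum(w * w for w in weights) / sum_w
    return max(0.0, (q - (k - 1)) / c)

tau2 = dl_tau_squared([2, 4, 5, 6, 8], [0.5] * 5)
print(f"tau^2 = {tau2:.1f} mmHg^2, SD = {math.sqrt(tau2):.1f} mmHg")
```

Other estimators (REML, Paule-Mandel) can give different values on the same data, so the figure depends on this assumption.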

I² — "What share of the scatter is real?"

The fraction of the total variation that's real signal rather than chance, as a percentage. A proportion — no units.

I² = max(0, (Q − (k − 1)) / Q) × 100%

With Q = 10 and k − 1 = 4, compute I² = (Q − 4)/Q as a %.

Hold onto the shape of these two. τ² carries units and answers how much. I² is a bare percentage and answers what share. Same Q underneath — but one is a size, the other is a ratio. Everything about how I² gets misread comes from forgetting that I² is a ratio. The next screen makes that impossible to forget.
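As a quick check of the I² arithmetic, here is a small sketch. It applies the usual convention of reporting 0% whenever Q falls below k − 1; the function name and the second example value (Q = 3) are illustrative.

```python
def i_squared(q, k):
    """Share of total variation attributed to real differences, as a percentage."""
    if q <= 0:
        return 0.0
    # Negative values are reported as 0%: no detectable excess scatter.
    return max(0.0, 100.0 * (q - (k - 1)) / q)

print(i_squared(10.0, 5))  # the lesson's example: Q = 10, k = 5
print(i_squared(3.0, 5))   # Q below k - 1, so the result is floored at 0
```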

I² is a ratio, not an amount

Because I² is a ratio, it depends on what's in its denominator — and the denominator is total variation, which includes chance. Shrink the chance part, and the ratio climbs, even if the real spread never moves.

The five effects below are frozen at {2, 4, 5, 6, 8} — the real spread never changes. Drag the slider to make the trials larger (more precise). Watch τ² and I² separately.

[Interactive slider: smaller trials to larger trials. Readout shown: I² = 60%, τ² = 3.0 mmHg².]

Point estimates unchanged. I² is climbing anyway.

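The dots never moved, yet I² marched from 20% to 92%. Remember what I² is: a ratio, the share of scatter that's real. Precise trials shrink the chance part of the denominator, so the same real gap fills a bigger share, and I² climbs toward 100%. Meanwhile the τ² estimate rose only from 1.0 to 4.6 mmHg² and then levels off: it is a magnitude, so it can never exceed the variance of the dots themselves (5 mmHg²). The actual spread was there all along; larger trials simply leave less of it to blame on chance.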
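The slider's behaviour can be reproduced offline. In the sketch below the per-trial weights 0.25, 0.5, and 2.5 are back-calculated assumptions chosen to match the readouts quoted in the lesson, and τ² uses the DerSimonian-Laird formula, which is also an assumption about how the lesson computes it.

```python
EFFECTS = [2, 4, 5, 6, 8]  # frozen point estimates, mmHg

def heterogeneity(effects, weight):
    """Return (I^2 in %, DerSimonian-Laird tau^2) for equal per-trial weights."""
    k = len(effects)
    weights = [weight] * k
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    excess = max(0.0, q - (k - 1))
    i2 = 100.0 * excess / q if q > 0 else 0.0
    c = sum_w - sum(w * w for w in weights) / sum_w
    return i2, excess / c

# Larger weight = smaller sampling variance = larger, more precise trials.
for w in (0.25, 0.5, 2.5):
    i2, tau2 = heterogeneity(EFFECTS, w)
    print(f"weight {w}: I^2 = {i2:.0f}%, tau^2 = {tau2:.1f} mmHg^2")
```

With equal weights the τ² estimate works out to 5 − 1/w, so it approaches 5 mmHg² as trials grow, while I² keeps heading for 100%.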

Carry this into every appraisal: large trials can post I² = 90% with effects all sitting between 0.70 and 0.75 — statistically striking, clinically nothing. And small trials can show I² = 0% while true effects range widely. I² answers "how visible?", never "how large?"

So how big is the spread?

If I² only tells you visibility, and τ² gives the magnitude in awkward squared units, what do you actually report to a decision-maker? The prediction interval.

It answers a concrete question: if you ran the therapy in one more setting, where would its true effect likely land? It stretches the average out by the real between-study spread — so it's always wider than the confidence interval, sometimes dramatically.

A confidence interval asks "where's the average effect?" A prediction interval asks "where could the effect land in my setting?" For decisions, the second question is usually the one that bites.

Shown: random-effects average 5.0 [3.0, 7.0]; 95% prediction interval [−1.4, 11.4] mmHg.
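The intervals shown can be reproduced under a few assumptions the lesson does not spell out: each trial's sampling variance is 2 (the reciprocal of its weight 0.5), τ² is 3.0, the confidence interval uses a normal quantile, and the prediction interval uses the common t-based formula with k − 2 degrees of freedom. The t quantile is hard-coded to keep the sketch dependency-free.

```python
import math

Z_975 = 1.96        # normal quantile for a 95% confidence interval
T_975_DF3 = 3.182   # t quantile (97.5%, 3 df = k - 2 for k = 5), hard-coded

def random_effects_intervals(effects, within_var, tau2):
    """Return (mean, 95% CI, 95% prediction interval) for equal-variance trials."""
    weights = [1.0 / (within_var + tau2)] * len(effects)
    mean = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    ci = (mean - Z_975 * se, mean + Z_975 * se)
    # The prediction interval adds the between-study variance to the
    # uncertainty in the mean, then widens again via the t quantile.
    half = T_975_DF3 * math.sqrt(tau2 + se ** 2)
    return mean, ci, (mean - half, mean + half)

mean, ci, pi = random_effects_intervals([2, 4, 5, 6, 8], within_var=2.0, tau2=3.0)
print(f"mean {mean:.1f}, CI [{ci[0]:.1f}, {ci[1]:.1f}], PI [{pi[0]:.1f}, {pi[1]:.1f}]")
```

The quantile must be looked up again for a different number of studies; with few trials it is large, which is a big part of why prediction intervals are so wide.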

The average clearly shows benefit. What does the prediction interval say?

Read it off a real plot

Every Cochrane/RevMan forest plot carries a heterogeneity footer. You can now read every term in it: Tau², Chi² (with its df and P value), and I².

One term needs a word of warning. Chi² is just Q run as a hypothesis test. Like any test it's underpowered with few studies and trivially "significant" with many — so a significant Chi² tells you heterogeneity is present, never that it's large or that pooling has failed.
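To see Chi² as "Q run as a hypothesis test", the sketch below takes the lesson's Q = 10 with k − 1 = 4 degrees of freedom and computes the upper-tail probability. It uses a closed form that only holds for even degrees of freedom, which keeps the example free of external libraries. The footer string imitates the general shape of a RevMan footer; its exact layout is an assumption.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j      # builds (x/2)^j / j!
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(10.0, 4)
footer = (f"Heterogeneity: Tau^2 = 3.00; Chi^2 = 10.00, df = 4 "
          f"(P = {p_value:.2f}); I^2 = 60%")
print(footer)
```

For general degrees of freedom a statistics library would be the practical choice. The P value here says only that some excess scatter is detectable; the magnitude still has to come from Tau².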

Which term tells you how much the true effect actually varies?

P = 0.04 for Chi². A reviewer writes "significant heterogeneity, so pooling is invalid." Fair?

Why this matters for HTA

A meta-analysis lands on your desk and the applicant's summary leans hard on one number: I². How you read it decides how much of their "consistency" you believe.

A submission reports I² = 12% and calls the evidence "consistent." With three small trials, Q is underpowered — low I² may mean the heterogeneity is invisible, not absent. Check τ² and the prediction interval before accepting "consistent."
Two dossiers both quote I² = 80%. One pools 20 large trials whose effects all sit between HR 0.70 and 0.78 (a clinically irrelevant spread); the other pools 5 trials ranging from HR 0.4 to 1.1 (a decision-changing spread). Identical I², opposite implications. Never take an I² at face value without asking about magnitude.
A pooled estimate looks comfortably cost-effective — but its prediction interval crosses your decision threshold. In some plausible settings the technology may not deliver; that belongs in your uncertainty narrative, and often drives a request for subgroup analysis or a managed-access arrangement.
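The "identical I², opposite implications" scenario can be made concrete with a toy construction that is not taken from any real dossier: take the lesson's five effects, then shrink both the effects and their standard errors tenfold. I² cannot tell the two sets apart, while τ² differs by a factor of 100. The DerSimonian-Laird estimator is assumed.

```python
def i2_and_tau2(effects, variances):
    """Return (I^2 in %, DerSimonian-Laird tau^2) from effects and their variances."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    excess = max(0.0, q - (k - 1))
    c = sum_w - sum(w * w for w in weights) / sum_w
    return (100.0 * excess / q if q > 0 else 0.0), excess / c

# Wide spread, imprecise trials (the lesson's dataset, variance 2 each).
wide = i2_and_tau2([2, 4, 5, 6, 8], [2.0] * 5)
# Hypothetical tight spread, precise trials: everything scaled down tenfold.
narrow = i2_and_tau2([0.2, 0.4, 0.5, 0.6, 0.8], [0.02] * 5)

print(f"wide:   I^2 = {wide[0]:.0f}%, tau^2 = {wide[1]:.2f}")
print(f"narrow: I^2 = {narrow[0]:.0f}%, tau^2 = {narrow[1]:.2f}")
```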

I² tells you how clearly you can see the scatter. It never tells you how far the scatter reaches.

Heterogeneity, in one breath

Heterogeneity is scatter beyond chance — real differences in the true effect, not sampling noise.
Cochran's Q isolates the signal: its excess over k − 1 is what's real.
That one excess splits into two questions: τ² = how big (magnitude, in the effect's squared units) and I² = what share (proportion, a bare %).
Because I² is a ratio, it climbs with precision even when the real spread doesn't. It measures visibility, not size.
The prediction interval is magnitude made usable: where a future study's true effect could land.

Ask I² what share of the scatter is real. Ask τ² and the prediction interval how far it reaches.

But scatter isn't the only way a meta-analysis can mislead you. Sometimes the studies you can see are a biased sample of the ones actually run — the disappointing results quietly never got published. Next, we go hunting for the gaps: publication bias and the funnel plot.