M4 · EVIDENCE SYNTHESIS
Two models, one dataset
Five trials test the same antihypertensive. They all show a benefit. But Study A shows +3 mmHg and Study D shows +11 mmHg. Are they measuring the same thing? Fixed-effect says yes. Random-effects says no. Same data, two answers. This lesson shows how to decide which assumption to make, and what changes when you do.
Two claims about reality
Every meta-analysis carries a hidden assumption about why the studies differ. There are exactly two options.
Story one
One true effect
All studies estimate the same underlying truth. Their results scatter around it because of sampling chance alone — like five different labs measuring the same object. More data narrows the answer.
Story two
A distribution of true effects
The studies estimate genuinely different true effects — different populations, doses, follow-up lengths. Their results scatter because the biology differs, not just the luck. No amount of data collapses that spread to zero.
These two stories are not different analysis techniques. They are different claims about what is happening in the world. Choosing between them is a scientific decision, not a statistical one.
The fixed-effect model
The fixed-effect model commits to Story One. Its weight formula follows directly from that commitment.
Weight for study i
wᵢ = 1 / vᵢ
where vᵢ is the within-study variance (the square of the standard error)
A large, precise trial has a small vᵢ, so 1/vᵢ is large and the trial dominates the pool. A small trial contributes little. That hierarchy is deliberate: if there is one true effect, the biggest trial is closest to it.
Because the only source of uncertainty is within-study chance, aggregating many patients drives the confidence interval gratifyingly narrow. The pooled estimate is a weighted average; its variance is 1/W, where W = Σ(1/vᵢ).
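The fixed-effect arithmetic can be sketched in a few lines of Python. The three effects and variances below are hypothetical numbers chosen for illustration, not the lesson's five trials, and the function name is ours. The 95 % interval assumes a normal approximation (±1.96 standard errors).

```python
import math

def fixed_effect_pool(effects, variances):
    """Inverse-variance pooling under the one-true-effect assumption."""
    weights = [1.0 / v for v in variances]          # w_i = 1 / v_i
    total_weight = sum(weights)                     # W = sum of 1 / v_i
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    pooled_var = 1.0 / total_weight                 # variance of the pooled estimate
    half_width = 1.96 * math.sqrt(pooled_var)       # normal-approximation 95 % CI
    return pooled, pooled_var, (pooled - half_width, pooled + half_width)

# Hypothetical trials: effect in mmHg and within-study variance.
effects = [4.0, 6.0, 9.0]
variances = [0.5, 2.0, 4.0]

pooled, pooled_var, ci = fixed_effect_pool(effects, variances)
print(f"pooled = {pooled:.2f} mmHg, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

The most precise trial (variance 0.5) carries a weight of 2 out of a total of 2.75, so the pooled estimate sits close to its result of 4.0.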
The random-effects model
The random-effects model commits to Story Two. The true effects form a distribution, and the studies sample from it. That adds a second source of variance: τ² (tau-squared), the between-study variance.
Weight for study i
wᵢ = 1 / (vᵢ + τ²)
where τ² is estimated from the data, for example with the DerSimonian and Laird method or restricted maximum likelihood (REML)
Adding τ² to every denominator lowers every raw weight, but it cuts the large studies' weights proportionally more. Their share of the pool shrinks and the small studies' share grows. The hierarchy flattens. Weights become more equal across studies.
Two consequences follow:
- The pooled estimate is an average of a distribution, not a best estimate of a single fixed value. It is still meaningful — but its meaning has shifted.
- The confidence interval is wider, because τ² adds real uncertainty that more patients cannot eliminate. Even an infinite combined sample cannot tell you whether the true effect in elderly Thai patients equals that in young Finnish athletes.
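A small numerical sketch makes the second consequence concrete. It assumes a known τ² = 4 and three hypothetical within-study variances; the function name is ours. With the number of studies k held fixed, enrolling more patients per trial shrinks every vᵢ, and the fixed-effect standard error heads to zero while the random-effects standard error levels off at √(τ²/k).

```python
import math

def pooled_se(variances, tau2=0.0):
    """Standard error of the pooled mean; tau2=0 gives the fixed-effect case."""
    total_weight = sum(1.0 / (v + tau2) for v in variances)
    return math.sqrt(1.0 / total_weight)

tau2 = 4.0                        # assumed between-study variance
base_variances = [0.5, 2.0, 4.0]  # hypothetical within-study variances
k = len(base_variances)
floor = math.sqrt(tau2 / k)       # limit of the random-effects SE as every v_i -> 0

for growth in (1, 10, 100, 1000):
    # Enrolling `growth` times more patients per trial divides each v_i by `growth`.
    variances = [v / growth for v in base_variances]
    print(f"x{growth:<5} fixed SE = {pooled_se(variances):.3f}   "
          f"random SE = {pooled_se(variances, tau2):.3f}")

print(f"floor for the random-effects SE with {k} studies: {floor:.3f}")
```

The floor depends on the number of studies, not on their size: only adding more trials, which samples more of the distribution of true effects, lowers it.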
Weights — the numbers
Take two studies: P with within-study variance v = 1, and Q with v = 4. Assume τ² = 4 (a plausible estimate for a heterogeneous literature).
| Study | v | Fixed weight = 1/v | Random weight = 1/(v + τ²) |
|---|---|---|---|
| Study P | 1 | 1.0 | 0.20 |
| Study Q | 4 | 0.25 | 0.125 |
Fixed weight ratio P : Q = 4 : 1 — the large precise trial dominates. Random weight ratio = 1.6 : 1 — still heavier, but far less so. τ² did the equalising.
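The table can be reproduced with a short Python sketch, using only the values given above (v = 1 and v = 4, τ² = 4). The variable names are ours. It also reports each study's share of the total weight, which is what the flattening changes.

```python
tau2 = 4.0
within_var = {"P": 1.0, "Q": 4.0}

# Raw weights under each model.
fixed = {study: 1.0 / v for study, v in within_var.items()}
random_fx = {study: 1.0 / (v + tau2) for study, v in within_var.items()}

fixed_ratio = fixed["P"] / fixed["Q"]           # 4.0
random_ratio = random_fx["P"] / random_fx["Q"]  # 1.6

# Share of the pooled estimate carried by study P under each model.
fixed_share_p = fixed["P"] / sum(fixed.values())
random_share_p = random_fx["P"] / sum(random_fx.values())

print(f"fixed  P:Q = {fixed_ratio:.1f}:1, P carries {fixed_share_p:.1%} of the weight")
print(f"random P:Q = {random_ratio:.1f}:1, P carries {random_share_p:.1%} of the weight")
```

Study P's share falls from 80 % to roughly 62 %, even though both raw weights got smaller.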
In the random-effects model, a study's weight is one over which quantity?
Where does τ² itself come from? It is estimated from the dispersion of the observed effect sizes around their weighted mean — a computation the software handles. The key insight is what it represents: how much the true effects genuinely differ across studies.
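For readers who want to see inside the software, here is a minimal sketch of the DerSimonian and Laird moment estimator, one of the two methods named earlier. The function name and the example inputs are ours; real packages add refinements, and REML would generally give a somewhat different value.

```python
def dersimonian_laird_tau2(effects, variances):
    """Method-of-moments estimate of the between-study variance tau^2."""
    weights = [1.0 / v for v in variances]       # fixed-effect weights
    w_sum = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / w_sum

    # Cochran's Q: weighted dispersion of the effects around the weighted mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1                        # Q's expected value if tau^2 = 0

    # Scaling constant that converts the excess dispersion into a variance.
    c = w_sum - sum(w * w for w in weights) / w_sum
    return max(0.0, (q - df) / c)                # truncate negative estimates at 0

# Hypothetical check: two equally precise studies whose effects are 2 units apart.
tau2_example = dersimonian_laird_tau2([0.0, 2.0], [1.0, 1.0])
print(tau2_example)
```

In the two-study check, the observed effects have a sample variance of 2, of which 1 is expected from sampling error alone, so the estimator attributes the remaining 1 to true differences between studies.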
See it in the forest plot
The forest plot shows the same five trials, Study A (+3) through Study E (+4), pooled under each model. Compare how the squares and the diamond change between the two models.
- Squares: Study A (the largest trial, N = 1500) has the biggest square under fixed-effect. Under random-effects the squares are more equal — τ² is flattening the hierarchy.
- Diamonds: the fixed-effect diamond is narrow and centred near the large-trial results (5.1). The random-effects diamond is wider and shifted toward the small-trial outliers (6.1). Different assumption, different answer.
Reading the plots
Look at the two diamonds from the forest plot and answer both questions.
In which model is the pooled confidence interval wider?
Why is the random-effects interval wider?
Which model should you choose?
The choice depends on whether the "one true effect" story is scientifically credible. For each scenario below, pick the appropriate model.
Scenario 1
Four RCTs of the same beta-blocker at 25 mg/day in middle-aged adults with stage-1 hypertension. All trials run for 12 weeks, all measure 24-hour systolic BP by ambulatory monitoring, all report effects between +4 and +7 mmHg. Different sites, identical protocol — run under one coordinating centre.
Scenario 2
Five trials testing different vasodilators across three continents — doses ranging from 10 to 80 mg, follow-up from 4 weeks to 18 months, populations mixing young healthy adults, elderly patients and people with type-2 diabetes. Effects range from +2 to +14 mmHg.
One caution that keeps you honest: in practice you don't just eyeball the scenario — you measure the heterogeneity statistically, using I² and τ², and then justify the model choice in your methods. Choosing random-effects by default to look "conservative" is not a strategy; if the true-effect distribution is tight, you've just widened the interval for no scientific reason.
Measuring heterogeneity: I² and τ²
Our five trials give τ² = 7.2 and I² = 79 %. What does that mean?
τ² = 7.2
The estimated variance of the distribution of true effects. Its square root, τ ≈ 2.7 mmHg, is the standard deviation of that distribution — a sense of how much the "true effect" varies from trial population to trial population.
I² = 79 %
The proportion of observed total variance that is due to true heterogeneity rather than chance. I² = τ²/(τ² + v̄), where v̄ is the average within-study variance. (This is a simplified form. The formal definition works through the Q statistic — I² = (Q − df) / Q, which the heterogeneity lesson unpacks in full — but it measures the same thing: the share of the visible disagreement that chance can't explain.) At 79 % most of the spread between trials reflects real differences, not sampling noise.
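Both forms of I² are easy to compute. The sketch below uses the lesson's τ² = 7.2 and I² = 79 %; the "typical" within-study variance is back-calculated from those two numbers rather than reported anywhere, and the Q value in the second example is hypothetical. Function names are ours.

```python
import math

def i_squared_from_q(q, df):
    """Formal definition: I^2 = (Q - df) / Q, truncated at zero."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)

def i_squared_simplified(tau2, typical_within_var):
    """Simplified form: share of total variance that is between-study."""
    return tau2 / (tau2 + typical_within_var)

tau2 = 7.2
tau = math.sqrt(tau2)   # standard deviation of the true effects, about 2.7 mmHg

# Back-calculated (not reported): the within-study variance that makes the
# simplified form return 79 % when tau^2 = 7.2.
implied_within_var = tau2 * (1 - 0.79) / 0.79

print(f"tau = {tau:.2f} mmHg")
print(f"implied typical within-study variance = {implied_within_var:.2f}")
print(f"I^2 (simplified) = {i_squared_simplified(tau2, implied_within_var):.0%}")

# Hypothetical Q for five trials (df = 4): Q = 20 gives I^2 = 80 %.
print(f"I^2 (from Q)     = {i_squared_from_q(20.0, 4):.0%}")
```

When Q falls below its degrees of freedom the formula would go negative, so I² is reported as 0 %: the observed scatter is no larger than chance alone would produce.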
Higgins and colleagues' informal benchmarks (25 % = low, 50 % = moderate, 75 % = high) give a sense of scale, but their mechanical use is discouraged. What matters is whether the heterogeneity is scientifically explicable, and whether a pooled estimate is meaningful at all when I² is this high.
Why this matters for HTA
Model choice is not a footnote. It touches the evidence that reaches the decision table.
- Precision of the estimate. Fixed-effect intervals are narrower — sometimes narrow enough to cross a cost-effectiveness threshold cleanly. Random-effects intervals may straddle the threshold, forcing probabilistic sensitivity analysis rather than a clean verdict. The model choice can determine whether the cost-effectiveness conclusion is "yes," "no," or "it depends."
- Generalisability of the pooled effect. A fixed-effect pooled estimate claims to represent a single true value applicable everywhere. A random-effects estimate is the mean of a distribution — meaningful only if the target population is similar to the average of the evidence base. An English appraisal using trials from US managed-care populations may be looking at the wrong point on that distribution.
- NICE's guidance. NICE expects model choice to be justified on scientific grounds and sensitivity-tested. Switching from random- to fixed-effect (or vice versa) in a sensitivity analysis is a standard expectation — and a common source of disagreement between companies and the Evidence Review Group.
The confidence interval is not just a number. Its width is a statement about what the model believes.
Fixed vs random effects, in one breath
- Fixed-effect assumes one true underlying effect; all spread is chance. Weight = 1/v. Large trials dominate. Interval narrows with more data.
- Random-effects assumes a distribution of true effects; τ² is its variance. Weight = 1/(v + τ²). Weights are more equal. Interval includes τ² and does not narrow to zero.
- The pooled estimates differ: fixed gives 5.1 mmHg (95 % CI 4.0–6.3); random gives 6.1 mmHg (95 % CI 3.4–8.8). Same five trials, different assumptions.
- Choosing the model is a scientific claim about reality, not a statistical preference. Justify it. Test it in sensitivity analysis.
- High I² (79 % here) signals that most spread is real, but does not by itself invalidate pooling. Ask why the trials differ.
A narrow pooled interval is only reassuring if the model behind it was honest.
Everything hinged on one quantity we simply assumed: τ², how much the true effects differ. In our five trials it was large — heterogeneity measured at I² = 79 %, meaning most of the spread between these trials is real, not chance. A later lesson makes that heterogeneity precise — turning the visible scatter into I², a single number for how much of the disagreement is real — and asks what to do when the pooled estimate itself stops being the right answer.