M8 · ECONOMIC MODELLING

A model can run perfectly and still be wrong.

We've built four ways to turn evidence into lifetime costs and QALYs — decision trees, Markov models, partitioned survival, and discrete event simulation. Each one, handed the right inputs, produces a clean, precise answer. And that precision is exactly the danger.

A model can compute flawlessly, with every formula correct and every column adding up, and still be completely wrong about the world. Recall the last two lessons: an extrapolation that sends a cohort's survival off into an implausible tail, or two survival curves that cross and so imply a negative number of patients in a health state. Neither of those is an arithmetic mistake. The spreadsheet is working perfectly; it's the assumption that's broken. So the question that hangs over every model is not "does it run?" but "should we believe it?", and answering that is a discipline of its own, called validation.

"Validation" is really four questions.

The single word "validation" hides four genuinely different questions, and confusing them is one of the most common mistakes in reading a dossier. A model can pass any one of them and fail another. The four:

Face validity — does the model make sense? Would a clinician look at its structure and its outputs and find them plausible?
Internal validity (also called verification) — is the model built correctly? Does it compute what it claims to, free of implementation bugs?
External validity — does the model match a reality it wasn't built from? Does it predict data it hasn't already been fitted to?
Cross-validity — does it agree with other models of the same problem, built independently by other teams?

Each catches a different kind of failure. Internal validity catches broken arithmetic; face validity catches clinical nonsense; external validity catches a model that's untethered from reality; cross-validity catches idiosyncratic choices no one else would make. Treat "the model was validated" as a complete sentence and you've learned almost nothing — the question is always which validation, and how hard.

Internal validity: is it built right?

Start with the most basic, and the most reassuring — which is exactly why it's the most over-trusted. Internal validity, or verification, asks a narrow question: does the model do the calculation it's supposed to, without bugs? It's quality control on the machinery, and the standard techniques are simple:

Zero and extreme inputs. Set a treatment effect to zero — do the two arms give identical results, as they must? Push a probability to 0 or 1 — does the model behave sensibly at the boundary?
Probabilities and shares add up. Every transition-matrix row sums to 1; state occupancies sum to the whole cohort; nothing leaks or duplicates.
Reproduce a hand calculation. Take one cycle, compute it by hand, and check the model agrees.
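
The three checks above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the three-state model (stable, progressed, dead), the function names `step`, `run_markov` and `build_matrix`, and every probability are hypothetical, and scaling a probability directly by a hazard ratio is a simplification made to keep the example short.

```python
import math

def step(dist, P):
    """Advance the cohort one cycle: new[j] = sum over i of dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def run_markov(P, start, n_cycles):
    """Return the state-occupancy trace, one row per cycle (cycle 0 first)."""
    trace = [list(start)]
    for _ in range(n_cycles):
        trace.append(step(trace[-1], P))
    return trace

def build_matrix(p_progress, p_die_stable, p_die_progressed, hazard_ratio=1.0):
    """Hypothetical 3-state model: stable, progressed, dead.

    hazard_ratio scales the progression probability; 1.0 means no
    treatment effect. (Scaling a probability directly is a simplification.)
    """
    p = p_progress * hazard_ratio
    return [
        [1 - p - p_die_stable, p, p_die_stable],
        [0.0, 1 - p_die_progressed, p_die_progressed],
        [0.0, 0.0, 1.0],
    ]

start = [1.0, 0.0, 0.0]
control = build_matrix(0.10, 0.02, 0.15)
null_treatment = build_matrix(0.10, 0.02, 0.15, hazard_ratio=1.0)
trace = run_markov(control, start, 40)

# Check 1: every transition-matrix row sums to 1.
rows_ok = all(math.isclose(sum(row), 1.0) for row in control)

# Check 2: state occupancies sum to the whole cohort in every cycle.
cohort_ok = all(math.isclose(sum(dist), 1.0) for dist in trace)

# Check 3: with zero treatment effect, the two arms are identical.
null_trace = run_markov(null_treatment, start, 40)
null_ok = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(trace, null_trace)
    for a, b in zip(row_a, row_b)
)

# Check 4: reproduce cycle 1 by hand (0.88 stay stable, 0.10 progress, 0.02 die).
hand_ok = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(trace[1], [0.88, 0.10, 0.02])
)
```

All four flags come out true for this toy model. Note what none of them looks at: whether 0.10, 0.02 and 0.15 are the right numbers for the disease.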

These catch real, common errors: a mistyped formula, a mislinked cell, a row of probabilities that doesn't sum to 1. But here is the point that matters most, and the one this whole lesson turns on: internal validity says nothing about whether the assumptions are true. A model can be flawlessly built around a completely wrong idea. Verify an immortal cohort or a pair of crossing curves and it passes, because the flaw was never in the arithmetic. Verification confirms you built the machine correctly. It cannot tell you whether you built the right machine.
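
To make the point concrete, here is a small sketch of a partitioned-survival calculation in which the add-up check passes while the model reports a negative share of patients. The exponential curves, the rates and the function names are assumptions chosen for illustration; with these rates the PFS curve sits above the OS curve, which is the same impossibility that crossing curves produce.

```python
import math

def exp_survival(rate, t):
    """Exponential survival S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)

def partitioned_states(pfs_rate, os_rate, t):
    """Partitioned-survival occupancy at time t (years).

    progression-free = PFS(t), progressed = OS(t) - PFS(t), dead = 1 - OS(t)
    """
    pfs = exp_survival(pfs_rate, t)
    os_ = exp_survival(os_rate, t)
    return {
        "progression_free": pfs,
        "progressed": os_ - pfs,
        "dead": 1.0 - os_,
    }

# Hypothetical rates: the PFS hazard is LOWER than the OS hazard, so more
# patients are "progression-free" than are alive. Clinically impossible.
states = partitioned_states(pfs_rate=0.20, os_rate=0.30, t=5.0)

# The verification check: shares add up to the whole cohort.
adds_up = math.isclose(sum(states.values()), 1.0)

# What that check does not see: a negative share in the progressed state.
negative_patients = states["progressed"] < 0
```

Both `adds_up` and `negative_patients` are true here. The shares sum to 1 by construction, whatever the two curves look like, so this particular check can never flag the problem.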

Face validity: does it make clinical sense?

The test that can catch a wrong assumption is the one that sounds softest — and is routinely underrated. Face validity asks whether the model's structure and results are plausible to someone who knows the disease. It's expert judgement applied to the model, and it catches precisely the failures arithmetic is blind to:

Do the health states capture the disease as clinicians understand it — or is a costly, important state missing?
Are the modelled trajectories clinically possible? Does anyone in the cohort effectively live forever? Do progressed patients survive longer than the treatment could plausibly allow?
Do the results pass a sniff test — is the incremental survival gain in the right ballpark for this drug class, or suspiciously large?

Face validity is where a clinician spots that the model implies patients surviving 40 years after progression, or a survival curve that never reaches zero. No amount of internal verification would flag those, because the sums are all correct — the problem is that the correctly-summed numbers describe an impossible world. Never dismiss face validity as the soft option: it is often the only check that catches a confidently-computed false assumption.
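
Face validity is expert judgement and cannot be automated, but a model can be made to state its implied trajectories in plain terms so that a clinician has something concrete to challenge. The sketch below is one hedged way to do that; the 2.5% annual post-progression mortality, the function names and the 1 / p approximation for mean time in state are all assumptions for illustration.

```python
def mean_years_in_state(annual_exit_prob):
    """Expected years in a state with a constant annual exit probability.

    Uses the geometric mean sojourn 1 / p, a rough approximation that
    ignores half-cycle correction.
    """
    return 1.0 / annual_exit_prob

def share_still_alive(annual_death_prob, years):
    """Share of a cohort surviving a constant annual death probability."""
    return (1.0 - annual_death_prob) ** years

# Hypothetical input: 2.5% annual mortality after progression.
p_die_progressed = 0.025

# Numbers to put in front of a clinician, stated in clinical terms.
summary = {
    "mean_years_alive_after_progression": mean_years_in_state(p_die_progressed),
    "share_alive_60_years_after_progression": share_still_alive(p_die_progressed, 60),
}
```

With this input the model implies about 40 years of life after progression, and more than a fifth of progressed patients still alive 60 years on. The arithmetic is fine; whether those figures are believable is a question only someone who knows the disease can answer.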

External validity: does it match a reality it hasn't seen?

The most demanding test asks the model to predict something it wasn't built from. External validity compares the model's outputs against data that were not used to construct it — a different trial, a real-world registry, a national dataset, or the same trial's own longer follow-up reported years later.

The emphasis on "not used to construct it" is everything. A model reproducing the trial data it was built on has demonstrated almost nothing: it was fitted to those data, so agreement is nearly circular. Real external validation needs an independent yardstick. And the single most powerful version connects straight back to the extrapolation problem from the start of this module: a model extrapolates a survival curve past two years of data; five years later, mature data arrive; you overlay the model's prediction on what actually happened. That confrontation, prediction versus reality in the region that was pure assumption, is the strongest evidence a model can offer. Its catch is time: often you must wait years for the data that would settle it.
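
The difference between the circular check and the real one can be shown with a toy extrapolation. Everything here is hypothetical: a single exponential curve, a trial reporting 60% survival at two years, and a made-up mature follow-up figure of 18% at five years.

```python
import math

def fit_exponential_rate(survival_at_t, t):
    """Constant hazard implied by one observed survival proportion."""
    return -math.log(survival_at_t) / t

def predict_survival(rate, t):
    """Model prediction S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)

# Hypothetical trial data the model is built on: 60% alive at 2 years.
trial_survival_2y = 0.60
rate = fit_exponential_rate(trial_survival_2y, 2.0)

# Circular check: compare the model with the data it was fitted to.
in_sample_error = abs(predict_survival(rate, 2.0) - trial_survival_2y)

# External check: compare the extrapolation with data the model never saw.
observed_5y = 0.18  # hypothetical mature follow-up, not used in fitting
predicted_5y = predict_survival(rate, 5.0)
out_of_sample_error = abs(predicted_5y - observed_5y)
```

The in-sample error is zero up to rounding, which says nothing beyond "the fit worked". The five-year comparison is the informative one: under these made-up numbers the model predicts roughly 28% alive against 18% observed, a gap that only independent data could reveal.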

One warning. Calibration is not validation. Calibration tunes a model's parameters until its output hits a target. If you then "validate" against that same target, you've drawn a circle — the model matches because you forced it to. External validity requires a target the model was never tuned toward.
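
The circle is easy to demonstrate. In the sketch below, a per-cycle death probability is calibrated by bisection until modelled 10-cycle survival hits a target, and the model is then "validated" against that same target. The model, the function names and the targets are hypothetical.

```python
def survival_after(p_die, n_cycles):
    """Share of the cohort alive after n_cycles at a constant death probability."""
    return (1.0 - p_die) ** n_cycles

def calibrate(target, n_cycles, tol=1e-9):
    """Bisection on the per-cycle death probability until survival hits target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if survival_after(mid, n_cycles) > target:
            lo = mid  # survival too high, so the death probability must rise
        else:
            hi = mid
    return (lo + hi) / 2

# "Validate" each calibrated model against the target it was tuned to.
targets = [0.10, 0.50, 0.90]
matches = []
for target in targets:
    p_die = calibrate(target, n_cycles=10)
    matches.append(abs(survival_after(p_die, 10) - target) < 1e-6)
```

Every entry in `matches` is true, whichever target is chosen. A check that passes for 10%, 50% and 90% survival alike carries no information about which of those worlds is the real one.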

Run the validation suite.

Below is a model with a set of flaws you can switch on and off. For each flaw, watch which validation test catches it — and, crucially, which ones don't. Pay attention to the flaws that sail through internal validity.

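[Interactive exercise: a set of flaws to toggle on and off, with one result panel per validation test. As captured here, every panel reads PASS.]

INTERNAL (verification): PASS. Computes correctly.
FACE (clinical sense): PASS. Clinically plausible.
EXTERNAL (unseen data): PASS. Consistent with independent data.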

Look hard at the immortal tail: internal validity gives it a clean pass. The model is built perfectly — it just describes an impossible world, and only face and external validity notice. That's the whole lesson in one switch: a model that's verified is not a model that's true. The dangerous submissions are the ones that ace the easy test and are never made to sit the hard ones.

Now you.

Match each problem to the validation test most likely to catch it.

A coding error makes one health state's QALYs get counted twice.

The model implies patients live 40 years after their cancer progresses.

A national registry shows real-world survival far below what the model predicts.

The trap: passing the easy test.

Now the reason this lesson exists. Of the four tests, internal validity is the easiest to pass and the most impressive-sounding — and it tells you the least about whether the model is right. "Fully verified, extensively tested, all checks passed" reads like rigour. But it only ever meant the machine runs without bugs, not that the machine models reality.

This is the failure mode to watch for in submissions. A model can be immaculately verified, beautifully documented, and built on an extrapolation no clinician believes. Everything downstream of that assumption computes perfectly — and the perfection becomes camouflage. The harder tests, the ones that could expose the assumption, are the ones quietly skipped: no external data, or none the model wasn't fitted to; a face-validity section that's a paragraph of reassurance rather than a clinician genuinely trying to break it.

And beware its subtle cousin, calibration masquerading as validation: a model tuned to hit a target, then reported as "validated against" that same target. It's a circle dressed as a check. The assessor's instinct should be the reverse of the reader's: the more a validation section emphasises verification and internal consistency while going quiet on external and face validity, the harder you look at the assumptions underneath.

What has this actually demonstrated?

A manufacturer's submission states: "The model was extensively validated — it reproduces the pivotal trial's overall survival almost exactly." What has this actually demonstrated?

Why this matters for HTA

The validation section of a dossier is where confidence is manufactured — and where a sharp assessor does some of their most valuable work, because the rhetoric of "validated" is designed to close down scrutiny, not invite it.

Ask "which validation, and against what?" Never accept "the model was validated" as a complete claim. Agreement with the source trial the model was fitted to is near-worthless; agreement with an independent dataset is worth a great deal. The words that follow "validated against" carry all the information.
Give face validity real teeth. Read the model's implied trajectories the way a clinician would and ask what's impossible: a cohort that never dies, post-progression survival no drug could deliver, a missing state that matters. These are the failures verification will never surface.
Treat a heavy verification section and a thin external one as a signal. When a submission dwells on internal consistency and goes quiet on external and face validity, the assumptions are usually where the weakness lives. And watch for calibration reported as validation — tuning to a target, then "validating" against it, is a circle.

Verification asks whether you built the model right. Validation, properly understood, asks whether you built the right model — and the two are so easily confused precisely because a wrong model, built flawlessly, passes every test that only checks the building.

Model validation, in one breath.

"Validation" is four distinct questions: face (is it clinically plausible?), internal/verification (is the arithmetic correct?), external (does it match data it wasn't built from?), and cross (does it agree with independent models?). A model can pass one and fail another.
Internal validity is the easiest to pass and says the least about truth: a flawlessly-built model can describe an impossible world — an immortal cohort or crossing curves pass verification untouched.
External validity is the strongest test but needs independent data the model wasn't fitted to; reproducing the source trial is nearly circular, and calibration is not validation.
Face validity is the underrated one — expert judgement is often the only thing that catches a confidently-computed false assumption.

A model earns belief not by running cleanly, but by surviving the tests it could have failed. Ask which tests those were — and which ones no one dared to run.

That completes Module 8. You can now read a model's structure, follow how it builds lifetime costs and QALYs, and interrogate whether to believe it. But even a perfectly-structured, well-validated model rests on inputs that are uncertain — every probability, cost, and utility is an estimate with a range. What that uncertainty does to the answer, and how to measure it, is the whole of Module 9: handling uncertainty.