M6 · MEASURING HEALTH OUTCOMES

Last lesson, we wrote "0.5" as if it fell from the sky. Someone had to decide it.

Remember the QALY arithmetic: a year in a health state at utility 0.5 counts as half a year of full health. We used numbers like 0.5 and 0.6 freely — but where does a number like that actually come from?

It isn't measured with an instrument. There's no blood test for "this illness is worth 0.6 of a healthy year." The number comes from a far stranger source: someone was asked to make a trade. They were shown a health state and asked what they'd give up to avoid it — years of their life, or a risk of dying. What they were willing to sacrifice became the number.

And here's the part that should unsettle you, because it runs through the whole rest of this module: the answer depends on what you ask, and whom you ask. Trade away years, and you get one number. Accept a gamble with death, and you get another. Ask a patient living with the illness, and you get a different number than if you ask a healthy member of the public imagining it. Same illness. Different question. Different utility — and therefore a different QALY, a different ICER, a different funding decision. This lesson is about where that number really comes from, and why it's softer than it looks.

The deepest misunderstanding: a utility doesn't measure how bad a state feels. It measures what you'd sacrifice to escape it.

It's tempting to think a utility is a measurement of suffering — more pain, lower number. But that's not what it is, and the difference is the whole subject. A utility measures the value of a health state, expressed as a preference: it's revealed by forcing a choice, a trade-off between the imperfect state and something you'd have to give up to improve it.

Why must there be a trade? Because value only shows itself when something is at stake. If I ask "how bad is your arthritis, 0 to 100?" you can answer without giving anything up, and the answer floats free, uncalibrated. But if I ask "how many years of life would you trade to be rid of it?", you have to weigh it against something precious, and your answer reveals how much you actually value being free of it. No sacrifice, no valuation. This is why the credible methods for measuring utility are all preference-based (they make you trade), and why simply rating a state on a line, a method we'll meet later in this lesson as the Visual Analogue Scale, measures something weaker.

So keep this fixed: a utility is a preference-based value, extracted through sacrifice — not a reading of how the state feels from the inside.

The most intuitive way to force the trade: how many years of life would you give up to be healthy?

The Time Trade-Off (TTO) sets up a stark choice. Imagine you have 10 years left to live, but in a poor health state — chronic, painful, limiting. Now I offer you a deal: fewer years, but in full health. Would you take 9 healthy years instead of 10 poor ones? Probably. Eight? Seven?

At some point you hesitate — a number of healthy years where you genuinely can't decide between "10 years in the poor state" and "that many years in full health." That's your indifference point, and it is the valuation. If you'd trade 10 poor years for 6 healthy years (and no fewer), then the poor state is worth 6/10 = 0.6 of full health, to you.

The logic is clean: the more years you're willing to give up to escape a state, the worse you consider it, and the lower its utility. Trade away almost nothing (indifferent only at 9.5 years) → the state is mild, utility 0.95. Trade away most of your life (indifferent at 2 years) → the state is grim, utility 0.2. Your willingness to sacrifice time is the measuring stick.
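The rule above is a single division, and a short sketch makes that concrete. The function name `tto_utility` and its input checks are illustrative assumptions, not part of any standard tool; the three cases are the ones just described, each with a 10-year horizon.

```python
def tto_utility(healthy_years_at_indifference: float, years_in_state: float) -> float:
    """Utility implied by a Time Trade-Off indifference point.

    Hypothetical helper for illustration: utility is the number of healthy
    years accepted, divided by the years offered in the poor health state.
    """
    if years_in_state <= 0:
        raise ValueError("years_in_state must be positive")
    if not 0 <= healthy_years_at_indifference <= years_in_state:
        raise ValueError("indifference point must lie between 0 and years_in_state")
    return healthy_years_at_indifference / years_in_state


# The cases from the text: a mild state, the running example, and a grim state.
for healthy_years in (9.5, 6, 2):
    utility = tto_utility(healthy_years, 10)
    print(f"indifferent at {healthy_years} healthy years -> utility {utility:.2f}")
```

The fewer healthy years someone will accept, the smaller the numerator and the lower the utility, which is the whole method in one line.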

Try the arithmetic yourself. Someone faces 10 years in a health state, and is indifferent once 3 of those years are given up — accepting 7 healthy years instead:

Indifferent giving up 3 of 10 years (accepting 7 healthy years instead): utility = 7 ÷ 10 = ?

Measure your own utility.

Here's a serious illness and 10 years. How many healthy years would make you indifferent?

You're facing 10 years in health state X, a serious chronic condition with persistent pain, limited mobility, and dependence on others for some daily tasks. The deal on the table: give up some of those years in exchange for living the rest in full health. Find the number of healthy years at which you genuinely can't choose between the two options. That is your indifference point.

Example, with the indifference point set at N = 6 on a scale of 0 to 10 healthy years: 10 years in state X versus 6 years in full health.

utility = 6 ÷ 10 = 0.6

At this point, state X is worth 0.6 of full health. That's its utility.

Whatever number you landed on, look at what just happened: you didn't describe the illness, rate it, or diagnose it. You priced it — in years of your own life you'd surrender. That price, divided by your starting 10 years, is the utility. It's not how state X feels. It's what you revealed you'd trade to escape it. Every 0.6 in every economic model is, somewhere underneath, exactly this: a real person deciding what they'd give up.

A second method forces a different trade — not years, but risk of death.

The Standard Gamble (SG) offers not fewer years, but a gamble. You're in health state X. I offer a treatment that will either restore you to full health (probability p) or kill you instantly (probability 1−p). Would you take it?

If the treatment were guaranteed to work (p = 100%), of course. If it killed you nine times in ten (p = 10%), of course not. Somewhere between sits a probability where you can't decide — your indifference point — and that probability is the utility. Willing to accept the gamble only if success is near-certain (indifferent at p = 0.95) → the state isn't so bad, utility 0.95. Willing to gamble even at p = 0.5 → the state is bad enough that a coin-flip with death is worth it, utility 0.5.
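Why the indifference probability can be read directly as the utility follows from expected-utility arithmetic, sketched below. The sketch assumes the usual anchors (full health = 1, immediate death = 0); the function names are hypothetical and chosen only for this illustration.

```python
def gamble_expected_utility(p_success: float,
                            u_full_health: float = 1.0,
                            u_dead: float = 0.0) -> float:
    """Expected utility of the treatment gamble.

    Assumes full health is anchored at 1 and death at 0, so the
    expression reduces to p_success.
    """
    if not 0 <= p_success <= 1:
        raise ValueError("p_success must be a probability between 0 and 1")
    return p_success * u_full_health + (1 - p_success) * u_dead


def sg_utility(p_indifference: float) -> float:
    """At indifference, staying in the state is worth exactly as much as the
    gamble, so the state's utility equals the gamble's expected utility."""
    return gamble_expected_utility(p_indifference)


def accepts_gamble(state_utility: float, p_success: float) -> bool:
    """An expected-utility maximiser gambles only if the gamble is worth more
    than staying in the state."""
    return gamble_expected_utility(p_success) > state_utility


print(sg_utility(0.95))           # the near-certain-success case from the text
print(accepts_gamble(0.5, 0.6))   # state worth 0.5, 60% chance of success
print(accepts_gamble(0.5, 0.4))   # state worth 0.5, 40% chance of success
```

This is the idealised model only. It has no way of representing the aversion to mortality risk that pushes real Standard Gamble answers upward.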

SG has the strongest theoretical pedigree — it's grounded in expected utility theory, the formal economics of decision under risk, which is why some regard it as the "gold standard." But it comes with a real problem: humans reason badly about probability, and most people are strongly risk-averse about death — they refuse even tiny mortality risks — which pushes SG utilities upward, often higher than TTO for the same state. Theoretically cleanest is not the same as empirically truest.

Give the identical health state to three methods and you get three different utilities. That's not error — it's the point.

Here's health state X, valued three ways:

Time Trade-Off: 0.60 — you traded away years.
Standard Gamble: 0.70 — you traded against risk; risk-aversion to death pushed the number up.
Visual Analogue Scale: 0.45 — you did something different entirely.

The Visual Analogue Scale (VAS) just asks you to mark the state on a line from "worst imaginable health" (0) to "best imaginable" (100); the mark is divided by 100 to put it on the same 0-to-1 scale, so a mark at 45 becomes 0.45. Notice what's missing: no trade. You give up nothing to place the mark. And because value only reveals itself through sacrifice, VAS measures something weaker, a rating rather than a revealed preference, and it typically produces lower numbers than the choice-based methods. Most guidelines don't accept raw VAS as a source of utilities for exactly this reason.
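To see how much the choice of method matters downstream, here is a rough sketch that carries the three utilities for state X into QALYs. It assumes a 10-year horizon (borrowed from the Time Trade-Off example) and no discounting; both are simplifying assumptions for illustration.

```python
# Utilities for health state X from the three methods above.
UTILITIES = {
    "Time Trade-Off": 0.60,
    "Standard Gamble": 0.70,
    "Visual Analogue Scale": 0.45,
}

YEARS_IN_STATE = 10  # assumed horizon; no discounting applied


def qalys(utility: float, years: float) -> float:
    """QALYs = utility of the state multiplied by the years spent in it."""
    return utility * years


results = {method: qalys(u, YEARS_IN_STATE) for method, u in UTILITIES.items()}

for method, value in results.items():
    print(f"{method}: {value:.1f} QALYs over {YEARS_IN_STATE} years")

# Spread between the highest and lowest method, for the same state.
spread = max(results.values()) - min(results.values())
print(f"Difference attributable to method alone: {spread:.1f} QALYs")
```

Nothing about the patient or the illness changes between the three rows. Only the question asked does.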

So which number is "right"? Wrong question.

Why do TTO, SG, and VAS give different utilities for the same health state?

One more thing changes the number, and it's the most ethically loaded: not how you ask, but whom.

Ask a patient who lives with a health state to value it, and ask a healthy member of the public to imagine and value the same state — you'll usually get different answers, in a consistent direction. Patients typically rate their own states higher than the public imagines them, because of adaptation: people adjust to chronic conditions, find ways to cope and reframe, and discover life is more bearable than an outsider would guess. The public, imagining the state cold, tends to rate it as worse.

So whose valuation should fill the QALY? There's a genuine argument each way. Patients know the state — surely their view is the valid one. But most HTA agencies, including NICE, use societal (public) preferences for the reference case, and the reasoning is deliberate: the public is the one paying and the one whose budget is being allocated across everyone, so the values used to ration should be the community's, chosen as if from behind a veil of ignorance about which illness will strike you. It's a defensible choice — but a choice, with a real cost: using public values can under-weight the lived reality of chronic and disabling conditions, precisely the adaptation that patients report.

Why do most HTA agencies use the public's valuations rather than patients' own?

In practice you won't watch patients play Time Trade-Off. You'll see an off-the-shelf questionnaire — and it works in two separate steps most people never distinguish.

The EQ-5D is the utility instrument you'll actually meet in submissions. Here's what it really does — and the crucial thing is that it splits into description and valuation, done by different people at different times.

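Step 1 — describe (the patient). The patient answers 5 questions about their state today: mobility, self-care, usual activities, pain/discomfort, anxiety/depression, each at 3 levels (EQ-5D-3L) or 5 levels (EQ-5D-5L). Their answers form a code with one digit per dimension, in that order, where 1 always means no problems. For example, 21223 means level 2 on mobility, level 1 on self-care, level 2 on usual activities, level 2 on pain/discomfort, and level 3 on anxiety/depression. What a level means in words depends on the version: level 3 is the worst level on the 3L but only the middle level on the 5L. This step is pure description, with no utility yet.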

Step 2 — value (the public, earlier). That code is looked up in a value set: a pre-built table that assigns a utility to every possible EQ-5D state. And where did the value set come from? From exactly the methods you just learned — a large sample of the public valued a selection of states using TTO (or similar), and those valuations were modelled to cover all states. So the patient describes; a population's earlier choices, frozen into the value set, do the valuing.

Patient answers 5 dimensions (code 21223) → value-set lookup (country-specific) → utility (≈ 0.66)

Two consequences fall straight out. First, EQ-5D doesn't "measure" utility — it maps a description onto someone else's pre-recorded preferences. Second, value sets are country-specific: the same patient with the same 21223 gets a different utility under the English, Polish, or Japanese value set, because different populations traded differently. The instrument looks objective and standardised; underneath, it's still someone's trade-off — just done in advance, at national scale.
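The describe-then-value split can be sketched as a parse followed by a table lookup. Everything numeric below is invented for illustration: `country_A` and `country_B` are hypothetical value sets with made-up decrements, and a simple additive model stands in for the more elaborate models behind real value sets. The results will not reproduce any published utility.

```python
DIMENSIONS = (
    "mobility", "self_care", "usual_activities",
    "pain_discomfort", "anxiety_depression",
)

# HYPOTHETICAL value sets: invented decrements, NOT real national tariffs.
# Each tuple holds the utility lost at level 1, 2 and 3 of a 3-level profile.
VALUE_SETS = {
    "country_A": {
        "mobility": (0.0, 0.05, 0.30),
        "self_care": (0.0, 0.04, 0.20),
        "usual_activities": (0.0, 0.04, 0.15),
        "pain_discomfort": (0.0, 0.08, 0.35),
        "anxiety_depression": (0.0, 0.07, 0.25),
    },
    "country_B": {
        "mobility": (0.0, 0.03, 0.20),
        "self_care": (0.0, 0.03, 0.15),
        "usual_activities": (0.0, 0.02, 0.10),
        "pain_discomfort": (0.0, 0.06, 0.25),
        "anxiety_depression": (0.0, 0.05, 0.18),
    },
}


def parse_profile(code: str, levels: int = 3) -> dict:
    """Step 1 (describe): split a code such as '21223' into one level per dimension."""
    if len(code) != len(DIMENSIONS) or not code.isdigit():
        raise ValueError("profile code must be exactly five digits")
    profile = {dim: int(digit) for dim, digit in zip(DIMENSIONS, code)}
    for dim, level in profile.items():
        if not 1 <= level <= levels:
            raise ValueError(f"{dim}: level {level} is outside 1..{levels}")
    return profile


def lookup_utility(code: str, country: str) -> float:
    """Step 2 (value): subtract the chosen value set's decrements from full health (1.0)."""
    decrements = VALUE_SETS[country]
    profile = parse_profile(code)
    return 1.0 - sum(decrements[dim][level - 1] for dim, level in profile.items())


# Same patient, same description, different population doing the valuing.
for country in VALUE_SETS:
    print(country, round(lookup_utility("21223", country), 2))
```

The patient's answers are used only in `parse_profile`. Every number that turns the description into a utility comes from the value set, which is the point of the two-step design.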

A utility is a soft, method-dependent number sitting at the head of a hard-looking calculation.

Trace the chain: utility → QALY → ICER → decision. The utility is the first link, and you've now seen how many choices it hides — which method, which respondent, which country's value set. Change any of them and the utility moves, and everything downstream moves with it. A precise ICER can rest on a utility that would have been meaningfully different under an equally valid alternative method.

So an assessor interrogates utilities as carefully as any clinical input. Where did they come from — a preference-based instrument (EQ-5D) or a rating scale that shouldn't be used raw? Which value set, and does it match the reference-case country? Were patient or public preferences used, and is that consistent with the guideline? Were utilities mapped from a different scale (a common, error-prone move when a trial only collected a non-preference measure)? And crucially: how sensitive is the ICER to the utilities? If a plausible alternative utility flips the recommendation, the utility isn't a footnote — it's the load-bearing wall. This is the same lesson as productivity costs back in Module 5: the "soft" inputs are exactly where a decision can quietly be made or unmade.
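A minimal sketch of that sensitivity check follows. Every figure in it is a hypothetical placeholder: the incremental cost, the 10-year horizon, the threshold, and the alternative on-treatment utilities are invented so the mechanics are visible, and discounting and survival differences are ignored.

```python
# All values below are HYPOTHETICAL placeholders for illustration.
INCREMENTAL_COST = 40_000   # extra cost of the new treatment per patient
YEARS = 10                  # time spent in the health state, undiscounted
UTILITY_WITHOUT = 0.60      # utility on the comparator
THRESHOLD = 30_000          # willingness to pay per QALY gained


def icer(utility_with_treatment: float) -> float:
    """Incremental cost per QALY gained, given the on-treatment utility."""
    qaly_gain = (utility_with_treatment - UTILITY_WITHOUT) * YEARS
    if qaly_gain <= 0:
        raise ValueError("treatment must produce a positive QALY gain")
    return INCREMENTAL_COST / qaly_gain


def within_threshold(utility_with_treatment: float) -> bool:
    """True if the ICER falls at or below the assumed threshold."""
    return icer(utility_with_treatment) <= THRESHOLD


# Three plausible-looking on-treatment utilities, e.g. from different
# methods or value sets. Does the recommendation survive all of them?
for utility in (0.70, 0.75, 0.80):
    verdict = "within threshold" if within_threshold(utility) else "above threshold"
    print(f"utility {utility:.2f}: ICER {icer(utility):,.0f} per QALY ({verdict})")
```

In this toy case a shift of 0.05 in one utility is enough to move the result across the threshold, which is what "load-bearing" means in practice.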

The other chair

Reading a submission: utilities are where a favourable model is most quietly built. Pin down the source of every utility (instrument, method, respondents, and value set) and check each against the reference case. Be especially wary of utilities mapped from a non-preference measure, of value sets borrowed from the wrong country, and of a small utility difference between states doing enormous work once multiplied across many years. If the utilities came from a source that systematically runs high, the whole QALY gain inflates with them.

Building one: use the reference-case instrument and value set, and say plainly where each utility came from; don't bury the provenance. If your best evidence needed mapping, present the mapping and its uncertainty rather than hoping it passes unnoticed. Show the ICER's sensitivity to the utilities yourself; a utility you've stress-tested is more credible than one you've asserted. If your treatment's value lives in a quality-of-life gain, the utility source is your case: build it on solid, reference-case-compliant ground.

Same skill from both chairs — treating a utility not as a measured fact but as a recorded trade-off, and knowing that which trade, and whose, was chosen before the number ever reached the model.

Why this matters for HTA

When it lands on your desk: almost every QALY in every submission is built on utilities, and utilities are the softest quantitative input in the whole appraisal — preference-based, method-dependent, respondent-dependent. Knowing how they're produced is knowing where the QALY is most vulnerable, and where an optimistic case is most easily assembled.

You trace every utility to its source. Method, respondents, value set, and any mapping — each is a choice that moves the number. A utility whose provenance you can't reconstruct is a QALY input you can't trust.
You match the utilities to the reference case, not just to convenience. Public preferences where the guideline demands them, the correct country's value set, a preference-based instrument rather than a raw rating. Deviations aren't automatically wrong, but each one is a question the manufacturer must answer.
You test whether the utilities are load-bearing. Run the ICER under plausible alternative utilities. If the recommendation holds, the softness is tolerable; if it flips, the entire decision rests on a contested valuation — and that fragility belongs in the open, not hidden behind a single tidy figure.

The utility is the point where a real person's sacrifice — or a stranger's imagined one — enters an equation that will ration care for everyone. Treat it with the seriousness that deserves.

Utilities, in one breath.

A utility is not how a health state feels — it's its value, revealed through a trade-off: what you'd sacrifice to escape it.
Time Trade-Off trades away years (utility = healthy years you'd accept ÷ years in the state). Standard Gamble trades against risk of death (utility = the success probability you'd require). VAS trades nothing — a rating, not a preference, and systematically lower.
The same state gets different utilities from different methods, because each asks about a different sacrifice. There's no single "true" utility free of method.
Whose preferences matters too: patients rate their states higher (adaptation) than the public imagines them. HTA usually uses societal preferences — a principled but consequential choice.
EQ-5D splits into describe (the patient codes their state across 5 dimensions) and value (a country-specific value set, built earlier from the public's trade-offs, converts the code to a utility). The instrument doesn't measure utility — it maps a description onto pre-recorded preferences.
So the utility is a soft, method- and respondent-dependent input at the head of utility → QALY → ICER. An assessor traces its source and tests whether the decision leans on it.

Every utility is a frozen trade-off. Behind the tidy 0.6 is a real person deciding what they'd sacrifice — and someone else deciding whose sacrifice counts.

You've now taken the QALY apart completely: what it is, and where its every input comes from. But the QALY has a rival — a mirror-image measure built on the opposite question. Instead of "how much health does this add?", it asks "how much healthy life does disease take away?" That's the DALY, the currency of global health and burden-of-disease work, and it makes different assumptions with different consequences. Seeing how it differs — and when each is the right tool — is where the module heads next.