Module 3 · Diagnostic Tests
The test came back positive. What are the odds it's right?
A company submits a new biomarker test for rare-disease screening. It reports 99% sensitivity and 99% specificity. An appraisal committee member asks: "If the test is positive, how likely is the patient to actually have the disease?"
The answer isn't on the label. It never is.
Four cells that contain all of diagnostics
Every diagnostic statistic you will ever read comes from four numbers arranged in a 2×2 table — test result by disease status:
- TP / (TP + FN) — what fraction of the sick were caught → sensitivity
- TN / (TN + FP) — what fraction of the healthy were cleared → specificity
- TP / (TP + FP) — of those who tested positive, how many are truly sick → PPV
- TN / (TN + FN) — of those who tested negative, how many are truly well → NPV
The first two are calculated within each disease-status group and describe the test. The last two are calculated within each test-result group and describe what the result means for a patient. They're related but not the same, and only the second pair depends on how common the disease is.
Sensitivity and specificity — calculate them
Sensitivity asks: of all the sick people, how many did the test catch? Specificity asks: of all the healthy people, how many did the test correctly clear?
Here is a worked dataset. Fill in both values:
Test results in 200 patients (100 diseased, 100 healthy):
Of 100 truly diseased patients: 90 test positive (TP = 90), 10 test negative (FN = 10). Sensitivity = TP ÷ (TP + FN), expressed as a percentage.
Of 100 truly healthy patients: 5 test positive (FP = 5), 95 test negative (TN = 95). Specificity = TN ÷ (TN + FP), expressed as a percentage.
Sensitivity 90%, specificity 95%. Two mnemonics worth remembering. SnOut: a negative result on a highly Sensitive test rules disease Out (few missed cases). SpIn: a positive result on a highly Specific test rules disease In (few false alarms).
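As a minimal sketch, the four ratios from the 2×2 table can be computed directly from the worked dataset above. The function name `diagnostic_metrics` and the dictionary layout are illustrative choices, not part of the module.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Return the four standard ratios from a 2x2 table (as fractions)."""
    return {
        "sensitivity": tp / (tp + fn),  # share of diseased patients who test positive
        "specificity": tn / (tn + fp),  # share of healthy patients who test negative
        "ppv": tp / (tp + fp),          # share of positive results that are true cases
        "npv": tn / (tn + fn),          # share of negative results that are truly well
    }


# Worked dataset from the text: 100 diseased, 100 healthy.
metrics = diagnostic_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

The PPV and NPV printed here reflect the 50% share of diseased patients in this sample, not the prevalence in any real screening population.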
Moving the threshold changes both
Sensitivity and specificity aren't fixed properties of the test — they depend on where you draw the line between "positive" and "negative."
Below, two overlapping distributions: lower values for healthy people, higher values for diseased. The vertical line is the threshold. Move it left (more inclusive) or right (more exclusive) and watch the trade-off:
[Interactive figure: test-value distributions with an adjustable threshold, alongside the resulting ROC curve]
This trade-off — sensitivity versus specificity — is exactly what the ROC curve visualises. Each point on the ROC is a different threshold choice. The area under the curve (AUC) summarises the test's overall ability to separate diseased from healthy, independent of any particular threshold.
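The threshold sweep can be sketched in a few lines. The two normal distributions below (means 0 and 2, standard deviation 1) are assumptions made purely for illustration, not values from the module, and the function names are hypothetical. Each threshold yields one (1 − specificity, sensitivity) point, and the AUC is approximated with the trapezoidal rule.

```python
from statistics import NormalDist

# Hypothetical test-value distributions (assumed for illustration only).
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.0, sigma=1.0)


def sens_spec(threshold: float) -> tuple[float, float]:
    """Values at or above the threshold count as a positive test."""
    sensitivity = 1.0 - diseased.cdf(threshold)  # diseased above the line
    specificity = healthy.cdf(threshold)         # healthy below the line
    return sensitivity, specificity


def roc_auc(lo: float = -6.0, hi: float = 8.0, steps: int = 2000) -> float:
    """Trapezoidal area under the ROC curve traced by sweeping the threshold."""
    points = []
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        sens, spec = sens_spec(t)
        points.append((1.0 - spec, sens))  # (false-positive rate, true-positive rate)
    points.sort()
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


for t in (0.5, 1.0, 1.5):
    sens, spec = sens_spec(t)
    print(f"threshold {t:.1f}: sensitivity {sens:.1%}, specificity {spec:.1%}")
print(f"AUC approx {roc_auc():.3f}")
```

Moving the threshold right lowers sensitivity and raises specificity, while the AUC stays the same because it summarises the whole curve rather than any single point on it.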
When does "positive" mean positive?
Sensitivity and specificity describe the test. But a clinician — and an HTA analyst — needs to answer a different question: given a positive result, how likely is the patient to be sick?
That's PPV (positive predictive value): the fraction of positives that are true positives.
Same test. Two very different settings.
Sens = 99%, spec = 99%. A population of 1 000 people.
Setting A — disease present in 50% (500 diseased):
TP = 495, FP = 5 → PPV = 495 / 500 = 99%
Setting B — disease present in 0.1% (1 diseased):
TP ≈ 1, FP ≈ 10 → PPV ≈ 1 / 11 ≈ 9%
The test didn't change. The population did. When the disease is rare, almost every positive result is a false alarm — even with a near-perfect test.
Sensitivity and specificity don't depend on prevalence. PPV does. This is the most important insight in diagnostic testing.
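The two settings can be reproduced with Bayes' theorem. This is a minimal sketch; the function name `ppv` and its argument order are illustrative assumptions.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence                   # P(test+ and diseased)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(test+ and healthy)
    return true_pos / (true_pos + false_pos)


setting_a = ppv(0.99, 0.99, prevalence=0.50)
setting_b = ppv(0.99, 0.99, prevalence=0.001)
print(f"Setting A (50% prevalence):  PPV = {setting_a:.1%}")
print(f"Setting B (0.1% prevalence): PPV = {setting_b:.1%}")
```

Only the `prevalence` argument differs between the two calls, which mirrors the point that the test is unchanged while the population is not.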
The base-rate problem
Imagine a test for a rare genetic condition affecting 1 in 1 000 people (prevalence = 0.1%). The test is excellent: 99% sensitive, 99% specific.
- In 100 000 people: 100 truly have the disease.
- The test finds 99 of them (TP), misses 1 (FN).
- But of 99 900 healthy people, 1% test positive: 999 false alarms.
- Total positive results: 99 + 999 = 1 098. True positives: 99.
- PPV = 99 / 1 098 ≈ 9%. About ten false alarms for every true case.
This isn't a test failure; it's arithmetic. The rarer the disease, the more the false-positive pool swamps the true-positive signal. Raising sensitivity above 99% barely helps, and any specificity short of 100% will be overwhelmed if the disease is rare enough.
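The counts in the example can be reproduced as expected cell counts. The sketch below is illustrative: `expected_counts` is a hypothetical helper, and the higher specificity values in the loop are a what-if that does not appear in the module.

```python
def expected_counts(population: int, prevalence: float,
                    sensitivity: float, specificity: float) -> dict[str, float]:
    """Expected 2x2 cell counts for a screened population."""
    diseased = population * prevalence
    healthy = population - diseased
    tp = diseased * sensitivity
    fp = healthy * (1.0 - specificity)
    return {"TP": tp, "FN": diseased - tp, "FP": fp, "TN": healthy - fp}


# The example from the text: 100 000 people, prevalence 1 in 1 000.
cells = expected_counts(100_000, 0.001, 0.99, 0.99)
ppv = cells["TP"] / (cells["TP"] + cells["FP"])
print(f"TP = {cells['TP']:.0f}, FP = {cells['FP']:.0f}, PPV = {ppv:.1%}")

# Hypothetical what-if: hold sensitivity at 99% and raise specificity only.
for spec in (0.99, 0.999, 0.9999):
    c = expected_counts(100_000, 0.001, 0.99, spec)
    print(f"specificity {spec:.2%}: PPV = {c['TP'] / (c['TP'] + c['FP']):.1%}")
```

In this sketch, higher specificity does lift PPV at a fixed prevalence. The warning in the text applies in the limit: for any specificity short of 100%, a sufficiently rare disease brings PPV back down.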
Committees reviewing screening programmes ask about this explicitly. A cost-effective test in high-risk subgroups can become a wasteful cascade of anxiety and follow-up in general populations.
See the base-rate problem for yourself
Below, 1 000 people. The test is fixed at 99% sensitivity and 99% specificity throughout. Only the prevalence changes. Move the slider all the way down to 0.1% — and then all the way up to 50%:
At 0.1% prevalence: 1 teal square (TP) in a sea of 10 amber squares (FP). PPV ≈ 9%. At 50% prevalence: the teal nearly fills the left half. PPV ≈ 99%. The test is identical. This is why the context of use — the population being tested — is inseparable from what a positive result means.
Likelihood ratios: the prevalence-free update
PPV varies with prevalence — inconvenient when you need to compare a test across settings. Likelihood ratios (LR) solve this. They express how much a result shifts the pre-test probability — and they don't depend on prevalence.
LR+ = sensitivity ÷ (1 − specificity)
LR+ answers: "How many times more likely is this positive result in someone with the disease than in someone without?"
LR− = (1 − sensitivity) ÷ specificity
LR− answers: "How many times more likely is this negative result in someone with the disease than in someone without?" (You want this small.)
Rules of thumb: LR+ > 10 substantially raises the probability; LR− < 0.1 substantially lowers it.
Now compute LR+ from the values you calculated earlier (90% sensitivity, 95% specificity), using LR+ = sensitivity ÷ (1 − specificity). The answer is a whole number.
LR+ = 18 — a positive result is 18× more likely in someone with the disease than without. That is a genuinely informative test. An LR+ of 2 or 3 (barely informative) looks very different — and the threshold-vs-ROC plot you explored earlier shows why: as you move toward either extreme, LR changes dramatically.
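A likelihood ratio updates a probability by multiplying the pre-test odds. The module does not spell out this odds form, so treat the sketch below as the standard Bayesian update rather than the module's own method. The function names and the pre-test probabilities in the loop are hypothetical.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) as defined in the text."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test: float, lr: float) -> float:
    """Update a probability with a likelihood ratio by working in odds."""
    pre_odds = pre_test / (1.0 - pre_test)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)


lr_pos, lr_neg = likelihood_ratios(0.90, 0.95)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")

# Hypothetical pre-test probabilities, chosen only to show the update.
for pre in (0.01, 0.10, 0.50):
    post = post_test_probability(pre, lr_pos)
    print(f"pre-test {pre:.0%} -> post-test after a positive result {post:.1%}")
```

The same LR+ produces very different post-test probabilities depending on where the patient starts, which is why the ratio itself can stay constant across settings while PPV cannot.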
Four claims — what's right?
For each scenario, choose the best response. All four questions unlock the next screen.
Why this matters for HTA
Diagnostic tests rarely arrive on a committee desk alone. They arrive embedded in a test–treat pathway: test, then decide, then treat (or not). HTA evaluates the full path, not the test in isolation.
- Diagnostic accuracy submissions must report sensitivity and specificity in the intended setting, not just in a cherry-picked study population. A specificity of 97% measured in a specialist referral clinic may not carry over to primary-care screening, where the patient mix differs.
- PPV, not sensitivity, drives the cascade. False positives trigger follow-up tests, biopsies, anxiety, and costs. An HTA model that only reviews sensitivity and specificity without modelling PPV at the target population's prevalence misses the dominant cost driver.
- LR ties to clinical decision thresholds. Committees increasingly ask: "At what pre-test probability does this test change management?" A test with LR+ = 1.5 rarely crosses any decision threshold; one with LR+ = 20 routinely does. The LR is the natural input to that analysis.
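The committee question about decision thresholds can be turned around: for a given LR+, how high must the pre-test probability be before a positive result reaches the threshold? The sketch below assumes a treatment threshold of 50%, which is a hypothetical value chosen for illustration, and `min_pre_test_probability` is an illustrative name.

```python
def min_pre_test_probability(lr_pos: float, decision_threshold: float) -> float:
    """Lowest pre-test probability at which a positive result reaches the threshold."""
    threshold_odds = decision_threshold / (1.0 - decision_threshold)
    required_pre_odds = threshold_odds / lr_pos
    return required_pre_odds / (1.0 + required_pre_odds)


# Hypothetical treatment threshold of 50%; LR+ values are the ones quoted in the text.
for lr_pos in (1.5, 20.0):
    needed = min_pre_test_probability(lr_pos, decision_threshold=0.50)
    print(f"LR+ = {lr_pos:>4}: pre-test probability must be at least {needed:.1%}")
```

Under this assumed threshold, a test with LR+ = 1.5 only changes management for patients who were already close to the threshold, while LR+ = 20 can move patients who started at a far lower probability.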
"A 99% sensitive test in a 0.1% prevalence population does not have a 99% accuracy problem. It has a population problem. HTA committees are paid to know the difference."
Diagnostic tests, in one breath
- The 2×2 table — TP, FP, FN, TN — is the source of every diagnostic statistic.
- Sensitivity (catch rate) and specificity (clear rate) describe the test at a given threshold; they don't shift with prevalence.
- PPV (what a positive result means) depends on prevalence as well as on the test. The rarer the disease, the more false alarms swamp true positives.
- The ROC curve shows the full sensitivity–specificity trade-off as the threshold moves. AUC summarises discrimination ability in a single number.
- Likelihood ratios express how much a result updates the probability of disease — and don't depend on prevalence, making them portable across settings.
- In HTA, the question isn't just "does the test perform?" — it's "what does the test add, in this population, embedded in this pathway?"
"The test is the same; the population is the question. Specificity is for ruling in; sensitivity is for ruling out. And PPV is the number the patient actually lives with."
Module 3 is now complete. You've moved from raw variation (Module 3 opener) through p-values, confidence intervals, effect measures, survival analysis, hazard ratios, and now diagnostic accuracy. Module 4 will introduce systematic reviews: how to combine evidence across studies and detect when the literature itself misleads.