Reading Clinical Validation in Orthopaedic AI
How to read the clinical validation of an orthopaedic AI tool: which metrics matter (Dice, ICC, AUC, calibration), internal vs external validation, and the red flags.
Key takeaways
A validation section is where an AI tool proves it works, or reveals that it only works on the data it was built on. Reading it well means matching the metric to the task (Dice for segmentation, ICC for measurement agreement, AUC plus sensitivity and specificity for classification, calibration for probabilities), then asking the one question that decides everything: was it tested on external data, from scanners and patients the model never saw? Internal-only numbers routinely overstate real-world performance. This is a practical guide to reading validation the way a surgeon or buyer should: metric by metric, with the red flags that tell you to walk away.
Match the metric to the task
The first mistake is judging every tool by "accuracy." It answers almost nothing on its own and is the wrong metric for most orthopaedic AI tasks. What matters is whether the reported metric actually measures what the tool is supposed to do.
| Task | Right metric | What it tells you |
|---|---|---|
| Segmentation (bone, cartilage) | Dice / IoU | Spatial overlap with an expert reference mask |
| Measurement (angles, distances) | ICC, mean absolute error, Bland-Altman | Agreement with the manual gold standard |
| Classification (fracture yes/no, grade) | AUC, sensitivity, specificity | Discrimination, and the miss vs false-alarm trade-off |
| Probability output (risk, likelihood) | Calibration (curve, Brier score) | Whether "80%" actually means 80% |
A large systematic review of orthopaedic imaging AI captures the landscape: across modalities, Dice scores ranged from 0.67 to 0.98, accuracy reached as high as 0.99, and only a handful of studies reported AUC at all. High headline numbers, a wide spread, and inconsistent reporting: that is exactly why you read the metric, not the marketing.
Segmentation: Dice, and what it hides
For bone and cartilage segmentation, the Dice Similarity Coefficient measures overlap between the AI mask and an expert reference. A score of 1.0 is perfect, and values in the high 0.90s support clinical planning on large bones. But Dice is a volume-overlap score, so it is forgiving of small surface errors that matter surgically. A model can post an excellent Dice and still bridge a narrow joint space or blur a cortical edge. Read where the errors are, not just the mean, and confirm the output is editable so a human can catch them. That is the whole argument for an AI-first, human-verify workflow.
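To make the point concrete, here is a minimal sketch in plain Python. The geometry is invented for illustration (two 20×20 "bones" separated by a 2-pixel joint space, with masks stored as sets of pixel coordinates), and `dice` is a hypothetical helper, not any vendor's implementation. It shows how a prediction that wrongly bridges the joint space can still score very well.

```python
def dice(pred, ref):
    """Dice similarity coefficient between two binary masks,
    each given as a collection of pixel coordinates."""
    pred, ref = set(pred), set(ref)
    if not pred and not ref:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & ref) / (len(pred) + len(ref))

# Hypothetical reference: two bones with a 2-pixel gap (rows 20 and 21).
femur = {(r, c) for r in range(0, 20) for c in range(20)}
tibia = {(r, c) for r in range(22, 42) for c in range(20)}
reference = femur | tibia

# Prediction: perfect everywhere, except a thin bridge across the joint space.
bridge = {(r, c) for r in (20, 21) for c in range(8, 12)}
prediction = reference | bridge

print(round(dice(prediction, reference), 3))  # 0.995
```

Eight stray pixels out of 808 barely move the score, yet they fuse two bones into one object. A surface-distance metric or a visual check of the error map would catch what the mean Dice does not.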
Measurement: ICC, not accuracy
When the tool outputs a number (an alignment angle such as the lateral distal femoral angle (LDFA) or medial proximal tibial angle (MPTA), a joint space width, a leg length), the right question is how well it agrees with the manual gold standard. The Intraclass Correlation Coefficient (ICC) quantifies that agreement, and pairing it with mean absolute error and a Bland-Altman plot tells you the average error and whether the tool drifts at the extremes. A tool that reports "0.4 mm mean error, ICC 0.95 against two blinded readers" has told you something real. A tool that reports "97% accurate" on a measurement task has told you almost nothing.
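The three agreement statistics can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not a validated statistics package. The ICC variant is an assumption, since the article does not name one: ICC(2,1), meaning two-way random effects, absolute agreement, single measurement. The angle values are made up.

```python
from statistics import mean, stdev

def mean_absolute_error(a, b):
    return mean(abs(x - y) for x, y in zip(a, b))

def bland_altman(a, b):
    """Bias and 95% limits of agreement for the differences a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread

def icc_2_1(ratings):
    """ICC(2,1) from a table with one row per subject and
    one column per rater or method."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical angles in degrees: manual read vs AI output.
manual = [85.0, 87.5, 88.0, 90.5, 92.0, 86.5]
ai = [85.5, 87.0, 88.5, 91.5, 93.5, 86.5]

print(mean_absolute_error(ai, manual))
print(bland_altman(ai, manual))
print(icc_2_1([[m, a] for m, a in zip(manual, ai)]))
```

Because this ICC variant measures absolute agreement, a tool that reads every angle a constant 2° high is penalised even though it correlates perfectly with the manual read. The Bland-Altman bias makes the same offset visible directly.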
Classification: never trust AUC alone
For yes/no or grading tasks, AUC summarises discrimination across all thresholds, but it hides the trade-off you actually care about. A fracture-screening tool with AUC 0.95 could still miss subtle fractures if its operating point favours specificity. Always read sensitivity and specificity together, at the chosen threshold, and know which way the tool is tuned for its intended use. This is also where inter-observer agreement sets a ceiling: if human graders only agree moderately on cartilage grade, no AI can be validated against a perfect reference, because there isn't one.
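A small worked example shows the gap between AUC and the operating point. The scores and labels below are invented, and `auc` and `sens_spec` are hypothetical helpers written for this sketch. AUC is computed as the probability that a random positive case outscores a random negative one.

```python
def auc(scores, labels):
    """AUC as P(positive score > negative score); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical fracture scores: five fractures (1) and five normals (0).
scores = [0.95, 0.90, 0.80, 0.55, 0.45, 0.50, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print(auc(scores, labels))             # 0.96
print(sens_spec(scores, labels, 0.6))  # (0.6, 1.0): two of five fractures missed
print(sens_spec(scores, labels, 0.4))  # (1.0, 0.8): no misses, one false alarm
```

The same model, with the same AUC of 0.96, either misses 40% of fractures or none, depending on where the threshold sits. That is why the threshold and both rates belong in the report.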
Calibration: the metric everyone skips
If a tool outputs probabilities, discrimination is not enough. Calibration asks whether predicted probabilities match observed frequencies: when the tool says 80%, does the event happen 80% of the time? A model can rank cases perfectly (high AUC) and still be badly miscalibrated, which quietly corrupts any decision that uses the probability as a number. Reporting guidelines now treat calibration as a required companion to discrimination; its absence is a reporting gap.
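The sketch below illustrates that separation with invented numbers: two sets of predictions that rank ten cases identically (same AUC), one calibrated and one overconfident. The helpers `brier`, `reliability`, and `auc` are written for this example and are not taken from any particular library.

```python
def brier(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def reliability(probs, labels, n_bins=10):
    """Per-bin (mean predicted probability, observed event rate, count)."""
    bins = {}
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(idx, []).append((p, y))
    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_pred, observed, len(pairs)))
    return table

def auc(scores, labels):
    """AUC as P(positive score > negative score); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes: 4 of the 5 high-risk cases and 1 of the 5 low-risk cases have the event.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
calibrated = [0.8] * 5 + [0.2] * 5
overconfident = [0.99] * 5 + [0.01] * 5

print(auc(calibrated, labels), auc(overconfident, labels))  # 0.8 0.8
print(round(brier(calibrated, labels), 4))                  # 0.16
print(round(brier(overconfident, labels), 4))               # 0.1961
print(reliability(overconfident, labels))
```

Discrimination is identical, but the overconfident model's "99%" group has an observed event rate of 80%. The reliability table exposes that; the AUC cannot.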
Internal vs external: the question that decides everything
This is the single most important distinction in a validation section, and the one most often blurred.
- Internal validation tests on held-out data from the same source as training, using a train-test split, cross-validation, or bootstrapping. It measures whether the model learned its own dataset. It systematically overstates real-world performance.
- External validation (increasingly called external testing) uses a completely separate dataset: another institution, other scanners, a different population. It measures whether the model generalises to data it has never seen. This is the number that predicts how the tool behaves in your clinic.
The gap is not theoretical. A 2025 systematic review of AI generalisability in radiology found that models with strong internal performance consistently degraded on external data, showing a median drop in AUC and specificity falling by up to roughly 24 percentage points across institutions and scanners. Reporting strong internal numbers with no external test is the most common way an AI tool looks better than it is.
If a tool reports only internal validation, treat its performance claims as a ceiling, not an estimate.
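In data terms, the internal/external distinction comes down to how the test set is carved out. The sketch below is illustrative only: the record fields (`site`, `scanner`, `patient`) and both helper functions are hypothetical. It holds out an entire institution instead of a random slice, and checks that no patient appears on both sides of the split.

```python
def split_by_site(cases, external_site):
    """Hold out every case from one institution as the external test set."""
    development = [c for c in cases if c["site"] != external_site]
    external = [c for c in cases if c["site"] == external_site]
    return development, external

def leaked_patients(development, test):
    """Patients that appear in both sets; should be empty for a clean test."""
    return {c["patient"] for c in development} & {c["patient"] for c in test}

# Hypothetical case records.
cases = [
    {"patient": "P01", "site": "A", "scanner": "vendor_1"},
    {"patient": "P01", "site": "A", "scanner": "vendor_1"},  # follow-up scan
    {"patient": "P02", "site": "A", "scanner": "vendor_1"},
    {"patient": "P03", "site": "B", "scanner": "vendor_2"},
    {"patient": "P04", "site": "B", "scanner": "vendor_2"},
]

development, external = split_by_site(cases, external_site="B")
print(len(development), len(external))         # 3 2
print(leaked_patients(development, external))  # set()
```

A random case-level split of the same records could put one of P01's scans in training and the other in testing, which is one way internal numbers end up inflated.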
Dataset representativeness
External validation only means something if the external data resembles your patients. When you read it, check:
- Scanner and vendor mix. A model validated on one CT vendor may drift on another. Domain shift is real.
- Demographics and pathology. Adult-knee validation says little about paediatric anatomy; mild-OA validation says little about end-stage deformity. Coverage of the hard cases matters more than the easy ones.
- Study design. Most orthopaedic AI evidence is retrospective; prospective and multicentre data are scarcer and far more convincing.
- Sample size and case difficulty. A high score on a small, curated, artefact-free test set is the easiest number to produce and the least informative.
The honest reading: does the validation cohort look like your Tuesday list, or a cleaned-up subset of it?
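The sample-size point can be made concrete with a confidence interval. The sketch below uses the Wilson score interval for a proportion such as sensitivity. Choosing Wilson is an assumption made for this example, not a method the cited studies are known to use, and the counts are invented.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Same headline sensitivity of 95%, very different evidence.
print(wilson_interval(19, 20))     # small test set: wide interval
print(wilson_interval(950, 1000))  # large test set: narrow interval
```

With 19 of 20 fractures detected, the interval runs from roughly 0.76 to 0.99, so "95% sensitivity" is compatible with missing nearly one fracture in four. The same percentage on 1,000 cases is pinned down far more tightly.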
Red flags checklist
Read any validation section against these. Any one of them should slow you down; several together mean the evidence is weaker than it looks.
- Internal validation only. No external test set. The most common and most serious red flag.
- A single "X% accuracy" number. No metric fits every task; a lone accuracy figure usually hides sensitivity, specificity, and the threshold.
- Metric-task mismatch. Accuracy quoted for a measurement task, or AUC with no sensitivity/specificity, or Dice with no error localisation.
- Cherry-picked metrics. The favourable metric is prominent; the others are missing. Missing metrics are a choice.
- No calibration for probability outputs. Discrimination reported, calibration silent.
- Vague or curated test set. "Validated on our dataset" with no scanner, demographic, or pathology detail, or a suspiciously clean cohort.
- Uneditable output. Even a well-validated model errs; if you cannot correct the output, you cannot use the validation safely.
None of these mean the tool is bad. They mean the claim is unproven, and the burden is on the vendor to close the gap. For the wider evaluation beyond metrics (regulatory status, data governance, explainability), pair this with our checklist for evaluating an AI imaging tool.
Where Salnus sits
Salnus is software-only, implant-agnostic, and designated Research Use Only; it is not a cleared or approved device. That framing is deliberate: RUO is the honest label for a tool whose external, multicentre validation is ongoing, and it is why the workflow keeps the surgeon in the loop with editable segmentation and measurements you verify against your own read. We would rather tell you what a metric was measured on than hand you a single number. That is the standard we ask you to hold every tool to, including ours.
FAQ
What is the single most important thing to check in a validation section? Whether there is external validation: testing on data from institutions, scanners, and patients the model never saw in training. Internal-only results overstate real-world performance.
Why isn't "95% accuracy" a good enough number? Accuracy is the wrong metric for most orthopaedic tasks. Segmentation needs Dice; measurement needs ICC and error; classification needs sensitivity and specificity at the working threshold. A lone accuracy figure usually conceals the weak spots.
What is calibration and why does it matter? Calibration is whether predicted probabilities match reality: when a tool says 80%, the event should happen about 80% of the time. A model can discriminate well (high AUC) yet be miscalibrated, corrupting any decision that treats the probability as a real number.
Is a Research Use Only tool with only internal validation useless? No. It is a legitimate starting point for research and evaluation, but its performance claims should be read as a ceiling, and it should not drive clinical decisions without external and local validation.
The Takeaway
A validation section is evidence, and evidence is read, not accepted. Match the metric to the task, insist on external testing, check that the test data looks like your patients, and treat a single accuracy number as a warning rather than a reassurance. The tools worth trusting make this easy: they tell you what they measured, on whose data, and where they fail.
Disclaimer: This article is for educational and research purposes only. Salnus tools are designated for Research Use Only (RUO) and are not cleared medical devices. Clinical decisions should be made by qualified physicians.
References:
- Artificial intelligence demonstrates potential to enhance orthopaedic imaging across multiple modalities: A systematic review. Journal of Experimental Orthopaedics, 2025.
- Assessing the generalizability of artificial intelligence in radiology: a systematic review of performance across different clinical settings. Annals of Medicine and Surgery, 2025.
- Reporting Guidelines for Artificial Intelligence Studies in Healthcare (for Both Conventional and Large Language Models): What's New in 2024. Korean Journal of Radiology, 2024.
Reviewed by the Salnus biomedical engineering team.