Measurement Reproducibility in Ortho AI

Why measurement reproducibility matters for tibial slope and coronal alignment, ICC and Bland-Altman interpretation, and how AI improves reliability.

Burak Serteser
Measurement Reproducibility, Tibial Slope, Coronal Alignment, ICC, Bland-Altman, Orthopaedic AI

Key takeaways

A measurement you cannot repeat is not a measurement, it is an opinion with a number attached. In orthopaedic planning, the reproducibility of an angle matters as much as its accuracy: the whole point of a preoperative number is that a second observer, or the same observer next week, would land in the same place. Tibial slope, coronal alignment, and the mechanical-axis angles all carry real inter-observer and intra-observer variability. This article explains how that variability is quantified, using the intraclass correlation coefficient (ICC) and Bland-Altman analysis, and where automated measurement genuinely tightens reproducibility.

Accuracy is not the same as reproducibility

Two ideas get conflated constantly. Accuracy is how close a measurement is to the truth. Reproducibility, also called precision, is how close repeated measurements are to each other, regardless of whether they are right. A method can be reproducible and wrong, or right on average and unrepeatable, and for surgical planning the second failure, the precision gap, is often the more dangerous one. If your tibial slope reads 6 degrees today and 10 degrees when a colleague measures the same CT, that 4-degree spread propagates directly into a resection plan or a graft-tunnel decision. Metrology, the science of measurement, gives us the tools to separate these two properties and put a number on the spread, so you can decide whether a measurement is trustworthy enough to plan on.
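To make the distinction concrete, here is a minimal sketch that separates the two properties for repeated reads of a single image. The readings and the reference value of 8.0 degrees are hypothetical, chosen only to illustrate the idea, and the function name is ours.

```python
from statistics import fmean, stdev

def accuracy_and_precision(readings, reference):
    """Return (bias, spread) in degrees for repeated readings of one image."""
    bias = fmean(readings) - reference   # accuracy: mean distance from the reference
    spread = stdev(readings)             # precision: scatter of the repeated reads
    return bias, spread

# Hypothetical slope readings (degrees) against an assumed reference of 8.0
reproducible_but_wrong = [11.9, 12.0, 12.1, 12.0]
right_on_average_but_noisy = [6.0, 10.0, 7.0, 9.0]

bias_a, spread_a = accuracy_and_precision(reproducible_but_wrong, 8.0)
bias_b, spread_b = accuracy_and_precision(right_on_average_but_noisy, 8.0)

print(f"reproducible but wrong: bias={bias_a:.2f}, spread={spread_a:.2f}")
print(f"right on average, noisy: bias={bias_b:.2f}, spread={spread_b:.2f}")
```

The first set is tightly clustered around the wrong value. The second averages out to the reference, but no single read can be trusted.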

The two flavours of variability

Intra-observer variability is the disagreement between repeated measurements by the same person. Give a surgeon the same lateral knee radiograph twice, a fortnight apart, and the slope readings rarely match. It captures the noise of landmark selection and line placement on a single reader.

Inter-observer variability is the disagreement between different people measuring the same image. It is usually larger, because it also captures differences in technique and training, and how each reader reads an ambiguous landmark, for instance the femoral-head centre in a dysplastic hip or the posterior tibial cortex on a rotated lateral.

Tiny intra-observer but large inter-observer variability means the measurement is stable per-reader yet not transferable, a problem the moment a case moves between a surgeon and a planning engineer, or between two hospitals. This is the same inter-rater reliability question that dominates cartilage grading agreement and joint-space-width measurement.
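As a rough sketch of how the two flavours can be pulled apart, assume each reader measures the same radiograph in several sessions. The reader names and values below are hypothetical. Note that the spread of reader means is only an approximation of inter-observer variability, since it still contains a small share of each reader's own noise.

```python
from statistics import fmean, stdev, variance

# Hypothetical slope readings (degrees) of ONE radiograph, three sessions per reader
readings = {
    "reader_a": [6.0, 6.2, 5.8],
    "reader_b": [9.0, 9.1, 8.9],
    "reader_c": [7.5, 7.4, 7.6],
}

# Intra-observer: pooled scatter of each reader around their own mean
within_variances = [variance(sessions) for sessions in readings.values()]
intra_sd = fmean(within_variances) ** 0.5

# Inter-observer (approximate): scatter of the reader means around each other
reader_means = [fmean(sessions) for sessions in readings.values()]
inter_sd = stdev(reader_means)

print(f"intra-observer SD: {intra_sd:.2f} degrees")
print(f"inter-observer SD: {inter_sd:.2f} degrees")
```

This is the "stable per-reader yet not transferable" pattern: each reader repeats themselves closely, but the readers do not agree with one another.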

Reading the ICC: the numbers that matter

The intraclass correlation coefficient is the standard metric for reproducibility of continuous measurements like angles and distances. It ranges from 0 to 1 and, informally, answers: what fraction of the total variance is real between-patient variance rather than measurement noise? A high ICC means readers agree and the measurement separates patients cleanly. The widely cited interpretation from Koo and Li's 2016 guideline is worth memorising:

ICC value       Reliability
Below 0.50      Poor
0.50 to 0.75    Moderate
0.75 to 0.90    Good
Above 0.90      Excellent

Two practical warnings. First, ICC is not one number: several forms exist (one-way vs two-way, single vs average measures, agreement vs consistency) that can differ substantially on the same data, so a paper reporting an ICC without its model, type, and definition has not fully reported it. Second, judge reliability by the lower bound of the 95% confidence interval, not the point estimate. An ICC of 0.88 with an interval from 0.62 to 0.96 is not reliably "good", it is compatible with "moderate".
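Both warnings are easier to see in code. The sketch below computes one specific form, ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single measures), from a patients-by-readers table. The choice of form is our assumption for illustration, the slope values are hypothetical, and the confidence interval is taken as given rather than computed. The band function encodes the thresholds in the table above.

```python
from statistics import fmean

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.

    ratings: list of rows, one row per patient, one column per reader.
    """
    n = len(ratings)       # number of patients
    k = len(ratings[0])    # number of readers
    grand = fmean([v for row in ratings for v in row])
    row_means = [fmean(row) for row in ratings]
    col_means = [fmean([row[j] for row in ratings]) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-patient mean square
    msc = ss_cols / (k - 1)                # between-reader mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def reliability_band(icc):
    """Map an ICC value to the reliability bands in the table above."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"

# Hypothetical posterior tibial slope (degrees): 5 patients x 3 readers
slopes = [
    [6.0, 6.5, 5.5],
    [9.0, 10.0, 9.5],
    [12.0, 11.0, 12.5],
    [4.0, 5.0, 4.5],
    [8.0, 8.5, 7.0],
]
icc = icc_2_1(slopes)
print(f"ICC(2,1) = {icc:.3f} ({reliability_band(icc)})")

# The example from the text: judge by the lower bound, not the point estimate
print(reliability_band(0.88), "by point estimate,", reliability_band(0.62), "by lower bound")
```

A consistency form such as ICC(3,1) drops the between-reader term from the denominator, which is why the same data can yield a noticeably different number when readers differ systematically.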

Why Bland-Altman is the companion you need

ICC has a blind spot: it is sensitive to the range of values in your sample. Measure slope across a heterogeneous cohort with lots of between-patient spread and the ICC looks flattering even if each individual read is noisy. That is why reproducibility work pairs it with Bland-Altman analysis, a plot of the difference between two measurements (say, automated minus manual) against their mean. It answers three questions a single correlation cannot:

  • Bias. A systematic offset. If the automated method reads consistently 2 degrees higher, the mean difference is not zero, and that is a calibration problem, not random noise.
  • Limits of agreement. The 95% limits (roughly the mean difference plus or minus 1.96 standard deviations of the differences) give the range within which two methods disagree in practice. For a decision hinging on a couple of degrees, limits of plus or minus 5 degrees are disqualifying regardless of a pretty ICC.
  • Proportional error. A funnel-shaped plot means disagreement grows at extreme values, so the method degrades exactly where cases are most deformed and planning matters most.

Report ICC to summarise agreement, and Bland-Altman to expose where it breaks. A method that claims excellent reliability but will not show a Bland-Altman plot is one to be sceptical of, and this is one of the questions worth asking before trusting any AI imaging tool.
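The three Bland-Altman quantities can be sketched in a few lines. The paired readings below are hypothetical. The funnel check, a least-squares slope of the absolute residual against the pair mean, is a crude stand-in of our own choosing for inspecting the plot, not a formal test.

```python
from statistics import fmean, stdev

def bland_altman(method_a, method_b):
    """Return (bias, (lower_loa, upper_loa), funnel_slope) for paired readings."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]

    bias = fmean(diffs)    # systematic offset between the methods
    sd = stdev(diffs)      # scatter of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)

    # Crude funnel check: does |difference - bias| grow with the size of the angle?
    abs_resid = [abs(d - bias) for d in diffs]
    mean_x = fmean(means)
    mean_y = fmean(abs_resid)
    sxx = sum((x - mean_x) ** 2 for x in means)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(means, abs_resid))
    funnel_slope = sxy / sxx
    return bias, limits, funnel_slope

# Hypothetical slope readings (degrees) on the same five images
manual = [5.0, 7.0, 9.0, 11.0, 13.0]
automated = [7.1, 8.9, 11.2, 12.8, 15.0]

bias, (lower, upper), funnel_slope = bland_altman(automated, manual)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
print(f"funnel slope = {funnel_slope:.3f}")
```

In this made-up data the automated method reads about 2 degrees high with narrow limits, which is the "calibration problem, not random noise" case: a constant offset that could be corrected, rather than scatter that cannot.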

Tibial slope and coronal alignment: where the spread bites

Posterior tibial slope is a hard case because its landmarks are ambiguous. The proximal tibial axis, the medial versus lateral plateau reference, and the choice of short versus long axis method each shift the number, and the medial-versus-lateral compartment distinction is a common source of disagreement rather than a rounding issue.

Automation attacks this at its root. In one 2025 study, a deep-learning landmark method for posterior tibial slope on lateral radiographs reached an inter-observer ICC of 0.91 to 0.92 against manual measurement and a perfect intra-observer ICC of 1.00, with Bland-Altman outlier rates of only 3.5 to 5.7 percent and no systematic bias, in about 2.5 seconds versus 26 seconds for a manual read. The intra-observer point is the underappreciated one: a deterministic algorithm removes intra-observer variability entirely, because it cannot disagree with itself.

Coronal alignment tells the same story at limb scale. The mechanical-axis angles (the lateral distal femoral angle, LDFA; the medial proximal tibial angle, MPTA; and the hip-knee-ankle angle, HKA) depend on locating the femoral-head and ankle centres on a long-leg film, both classic sources of inter-observer spread. A 2024 enhanced deep-learning model measured lower-limb alignment with inter-observer ICCs from 0.936 to 0.997 against a specialist and an intra-observer ICC of 1.000, with no absolute error above 1.5 degrees for the HKA. That is the level of reproducibility that makes a preoperative angle worth planning on.

What to actually check before you trust a number

Reproducibility is a property you verify, not one you assume. Whether the measurement is manual or automated, the checklist is the same:

  • Is the ICC reported with its model, type, and definition, and a 95% confidence interval? Judge by the lower bound.
  • Is there a Bland-Altman plot showing bias and limits of agreement, not just a correlation coefficient?
  • Was reliability tested on realistic cases, including deformed, arthritic, and implant-bearing knees, not only clean normals?
  • Is the reference standard stated, and is the tool compared against multiple human readers, not one?
  • Is the output deterministic? A method returning the same value on the same input has zero intra-observer variability, half the problem solved for free.

This is where a software-only, implant-agnostic pipeline earns its place. Salnus computes automated landmarks and angles on CT-derived bone models, so the same input yields the same slope and coronal angles every time, and the measurement moves between surgeon and engineer without drifting. Consistency here is not cosmetic, it is the difference between a plan you can defend and a number you have to re-check.

FAQ

Is a higher ICC always better? Higher is better, but the number can be inflated by a wide spread of values in the test sample, so read it alongside a Bland-Altman plot and the confidence interval. An ICC above 0.90 with tight limits of agreement is far stronger than the same ICC on a cohort deliberately spanning extreme deformities.
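A simplified worked example shows the inflation. Treat the ICC as a plain variance ratio, with between-patient standard deviation \sigma_b and measurement-noise standard deviation \sigma_e. This is an approximation that ignores reader effects, and the numbers are illustrative rather than taken from any study. Holding the noise fixed at 2 degrees and only widening the cohort:

```latex
\mathrm{ICC} \approx \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}

% Narrow cohort: \sigma_b = 3^\circ, \sigma_e = 2^\circ
\mathrm{ICC} \approx \frac{9}{9 + 4} \approx 0.69

% Wide cohort: \sigma_b = 6^\circ, \sigma_e = 2^\circ
\mathrm{ICC} \approx \frac{36}{36 + 4} = 0.90
```

The individual reads are exactly as noisy in both cohorts, yet the coefficient moves from moderate to the edge of excellent.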

Does automation guarantee a correct measurement? No. Automation removes intra-observer variability and usually tightens inter-observer agreement, which is reproducibility, but it can still be systematically wrong if its landmark rules are off. That is why bias on a Bland-Altman plot matters, and why an automated method still needs validation against human readers on representative cases.

Why is tibial slope harder to measure reliably than coronal alignment? Slope depends on the reference axis and the medial-versus-lateral plateau choice, and small landmark shifts on a lateral view move the number several degrees. That plateau-side ambiguity makes slope especially sensitive, which is why deterministic automated measurement helps most there.

What ICC should I look for in a planning tool? For a measurement that drives a resection or correction decision, treat the lower bound of the confidence interval as your threshold: above 0.90 is excellent, 0.75 to 0.90 is acceptable with caution, and moderate or below should not be trusted for sub-degree planning. Pair that with limits of agreement narrow enough for the decision at hand.

The Takeaway

A preoperative angle is only useful if it is repeatable. ICC tells you how much of the variance is signal versus noise, Bland-Altman tells you whether the disagreement is bias, scatter, or worse at the extremes, and together they are the honest way to report reproducibility. Tibial slope and coronal alignment are exactly where human variability is real and where deterministic, automated computation earns its keep. When you evaluate any planning tool, do not ask only "is it accurate?" Ask "is it reproducible, and can you show me the ICC and the Bland-Altman plot?"


Disclaimer: This article is for educational and research purposes only. Salnus tools are designated for Research Use Only (RUO) and are not cleared medical devices. Clinical decisions should be made by qualified physicians.

Reviewed by the Salnus biomedical engineering team.
