ARCO Staging in Femoral Head AVN: The Inter-Observer Problem AI Could Solve
ARCO staging guides treatment decisions in osteonecrosis of the femoral head, but Stage I-II disagreement among radiologists remains a structural problem. Deep learning shows promise; here is what the literature actually demonstrates.
In osteonecrosis of the femoral head (ONFH), the line between ARCO Stage I and Stage II changes everything: it separates a patient who may benefit from joint-preserving core decompression from one who may not. Yet inter-observer agreement at this exact transition, between bone marrow oedema with viable subchondral bone (Stage I) and structural bone necrosis without articular collapse (Stage II), has consistently been the weakest part of the entire staging system.
This is not a marginal academic concern. It directly determines which patients are referred for hip-preserving surgery and which are placed on a path toward arthroplasty. AI-based ARCO staging, if validated correctly, could remove the variability that currently shapes clinical pathways, but the published literature has not yet cleared the validation bar that clinical deployment requires.
This post examines where ARCO staging breaks down, what AI approaches have actually shown, and the multi-centre validation problem that separates published models from clinically deployable tools.
ARCO Staging: A Brief Refresher
The Association Research Circulation Osseous (ARCO) classification has gone through several revisions. The current 2019 ARCO consensus simplified the original staging into four primary stages, anchored to MRI and radiographic findings:
- Stage I: Normal radiograph; positive MRI showing bone marrow oedema or early necrosis. Subchondral bone is structurally intact.
- Stage II: Radiographic changes appear (sclerosis, lytic lesions); MRI shows established necrosis with intact articular surface.
- Stage III: Subchondral fracture or articular collapse; "crescent sign" on radiograph or MRI.
- Stage IV: Established osteoarthritis with joint-space loss.
Within this staging framework, the clinically critical transitions are between Stages I-II (where joint-preserving intervention is most viable) and Stages III-IV (where arthroplasty becomes the primary option). The transition between Stages I and II carries particular weight because it determines candidacy for core decompression, a procedure with markedly better outcomes when performed at Stage I-II than at later stages.
Where Disagreement Happens: The Stage I-II Transition
Inter-observer agreement studies for ARCO staging consistently show the same pattern: substantial agreement on advanced disease (Stage III-IV), and meaningfully weaker agreement on early disease (Stage I-II). The reasons are structural to how MRI features evolve in early ONFH.
Bone marrow oedema versus established necrosis is the primary diagnostic axis at Stage I. On T2-weighted and STIR sequences, the high-signal pattern of bone marrow oedema can closely resemble the ill-defined high signal of early necrosis. The defining feature, a low-signal demarcation line on T1, appears at variable times in the disease evolution, and its presence or absence at imaging may reflect when the scan was performed rather than the underlying biology.
Subchondral bone status assessment at Stage II requires evaluating whether the necrotic segment has begun to lose structural integrity. This is fundamentally a question of subtle MRI and CT findings, with substantial reader-dependent interpretation.
Reported weighted kappa values for inter-observer agreement on early-stage ARCO classification typically fall in the 0.55-0.70 range across studies, below the 0.80 threshold commonly associated with almost-perfect agreement. Even fellowship-trained musculoskeletal radiologists show this pattern. This is not a problem of training; it is a problem of how the staging boundaries map onto MRI's signal characteristics in early disease.
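To make the agreement statistic concrete, here is a minimal sketch of how a weighted kappa between two readers' ARCO stage assignments could be computed with scikit-learn. The ratings below are hypothetical, invented purely for illustration; they are not drawn from any study.

```python
# Hypothetical ARCO stage assignments (1-4) from two readers on 12 hips.
# Linear weights penalise a Stage I vs II disagreement less than Stage I vs IV,
# which is why weighted kappa is the usual choice for ordinal staging systems.
from sklearn.metrics import cohen_kappa_score

reader_a = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 1, 2]
reader_b = [1, 2, 2, 1, 2, 3, 3, 4, 4, 4, 2, 2]

kappa = cohen_kappa_score(reader_a, reader_b, weights="linear")
print(f"Linearly weighted kappa: {kappa:.2f}")
```

Note that every disagreement in this toy example is a one-stage disagreement, the pattern the post describes at the Stage I-II boundary; unweighted kappa would penalise these the same as a Stage I vs IV miss.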
AI Approaches: What Has Been Published
Deep learning approaches to ARCO staging, and to ONFH detection more broadly, fall into distinct technical categories. Reviewing the published literature reveals patterns about what works and what remains unsolved.
Detection-only models address the simpler question: does this MRI show ONFH, yes or no? Convolutional networks trained on T1, T2, and STIR sequences achieve AUCs in the 0.90-0.97 range on internal validation cohorts. These models are clinically useful as screening tools but do not address the staging question that matters for treatment decisions.
Multi-class staging networks attempt the harder problem: predicting ARCO Stage directly from MRI volumes. Reported macro-averaged accuracy across published studies typically ranges from 0.70 to 0.85, with the lower bound corresponding to models that handle Stage I-II transitions explicitly and the upper bound corresponding to models trained primarily on advanced disease. The performance gap reflects exactly the inter-observer problem: AI models trained on labels generated by readers with imperfect agreement inherit the noise in those labels.
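The label-noise point can be made quantitative with a toy simulation (the error rate below is invented for illustration): if training labels disagree with the true stage some fraction of the time, even a model that perfectly reproduces the labelling process is capped below perfect accuracy against the truth.

```python
# Toy illustration of label-noise inheritance (all numbers are invented).
# If each training label is wrong with probability p, a model that exactly
# reproduces the noisy labels can agree with the true stage at most
# about (1 - p) of the time on the same distribution.
import random

random.seed(0)
p_label_error = 0.20          # hypothetical reader error rate at the I-II boundary
n = 100_000

true_stages = [random.choice([1, 2]) for _ in range(n)]
# Noisy labels: flip Stage I <-> II with probability p (3 - s swaps 1 and 2).
noisy_labels = [s if random.random() > p_label_error else 3 - s
                for s in true_stages]

# A "perfect" model that exactly reproduces the noisy labelling:
agreement_with_truth = sum(m == t for m, t in zip(noisy_labels, true_stages)) / n
print(f"Accuracy ceiling vs truth: {agreement_with_truth:.3f}")
```

With a 20% hypothetical label error, the ceiling lands near 0.80, inside the accuracy band reported for published staging models.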
Segmentation-based pipelines segment the necrotic region on MRI and quantify its volume, location, and depth as derived measurements. These approaches enable longitudinal disease tracking but add computational complexity that may exceed the value provided in routine clinical workflow.
Attention-based hybrid models combine MRI sequence features with explicit anatomical priors. Recent work in this area has reported the strongest performance on Stage I-II discrimination, though the cohorts remain small and single-centre.
A recurring pattern: most published ONFH AI models report on cohorts under 300 patients, derived from a single institution, with no testing on alternative MRI vendors or sequence parameters. This is not unique to ONFH AI; it characterises most clinical imaging AI literature. But it is particularly consequential for a condition where treatment decisions hinge on early-stage discrimination.
The Pitfalls of Single-Centre Validation
The most common failure mode in clinical imaging AI development is overfitting to scanner-specific signal characteristics. A model trained on 250 hip MRIs from a single institution learns not just disease features, but also the specific noise patterns, sequence parameters, contrast handling, and acquisition geometry of that institution's scanners. Performance on held-out data from the same scanner may exceed 0.90 AUC; performance on data from a different vendor may drop to 0.65.
This problem is particularly acute for ARCO staging because:
MRI sequence variation is high in ONFH workup. Different institutions use different combinations of T1, T2, STIR, fat-suppressed sequences, and sometimes contrast-enhanced studies. A model trained on a specific sequence combination may fail on others.
Patient demographics differ across centres. ONFH aetiology varies, corticosteroid-induced versus traumatic versus alcohol-related versus idiopathic. The MRI appearance of necrosis can subtly differ across these aetiologies, and a model trained primarily on one population may not generalise to others.
Coil configurations and field strengths differ. 1.5T and 3T MRI produce different signal-to-noise profiles. A model trained on 3T data may perform worse on the 1.5T scanners that remain in widespread clinical use.
The clinical consequence: a model with strong published performance metrics may not be clinically usable until it has been validated across the scanner diversity of routine practice.
The Path Forward: Multi-Centre, Multi-Reader Validation
For AI-based ARCO staging to move from publication to clinical tool, three validation properties matter:
Multi-centre cohorts spanning at least 3-5 institutions with diverse scanner vendors, field strengths, and sequence protocols. Performance on the held-out test centres, never seen during training, is the relevant generalisation metric.
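One way to operationalise the held-out-centre requirement is leave-one-centre-out cross-validation, so every institution serves as an unseen test set exactly once. A minimal scikit-learn sketch; the feature arrays, stage labels, and centre assignments below are synthetic placeholders, not real data:

```python
# Leave-one-centre-out cross-validation sketch (synthetic placeholder data).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(42)
n_patients = 40
X = rng.normal(size=(n_patients, 8))           # placeholder MRI-derived features
y = rng.integers(1, 5, size=n_patients)        # placeholder ARCO stages 1-4
centres = rng.integers(0, 4, size=n_patients)  # 4 hypothetical institutions

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=centres):
    held_out = set(centres[test_idx])
    # Train on the remaining centres; evaluate on the one never seen in training.
    assert held_out.isdisjoint(centres[train_idx])
    print(f"held-out centre: {held_out}")
```

The same grouping discipline must also govern any hyperparameter tuning: tuning on a pooled split and testing on a held-out centre afterwards reintroduces the leakage this structure is meant to prevent.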
Multi-reader ground truth with at least three expert readers per case, using the consensus or adjudicated label as the ground truth. Reporting performance against the consensus, with confidence intervals, provides a defensible validation standard.
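A simple way to derive the adjudicated label from three readers is a majority vote, falling back to explicit adjudication when all three disagree. A sketch with hypothetical reader stages:

```python
# Majority-vote consensus over three hypothetical readers per case.
from collections import Counter

def consensus_stage(reader_stages, adjudicator=None):
    """Return the majority stage; defer to an adjudicator on full disagreement."""
    stage, count = Counter(reader_stages).most_common(1)[0]
    if count >= 2:
        return stage
    # Three-way disagreement: the case needs a senior reader's adjudication.
    return adjudicator

print(consensus_stage([2, 2, 1]))                 # two readers agree -> 2
print(consensus_stage([1, 2, 3], adjudicator=2))  # full disagreement -> adjudicated 2
```

With an odd reader count and four ordinal stages, three-way splits are exactly the cases worth flagging: they tend to cluster at the Stage I-II boundary this post is about.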
Stratified performance reporting. The clinically relevant performance is at the Stage I-II boundary. Reporting overall accuracy obscures the fact that most of the accuracy comes from easy cases (advanced disease). Stratified analysis by stage and by case difficulty exposes whether the model solves the actual clinical problem.
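Stratified reporting can be as simple as per-stage recall alongside an accuracy computed only on the Stage I-II subset. The predictions and labels below are fabricated for illustration:

```python
# Per-stage recall plus a Stage I-II-only accuracy (fabricated labels).
from sklearn.metrics import recall_score, accuracy_score

y_true = [1, 1, 2, 2, 2, 3, 3, 4, 4, 1, 2, 3]
y_pred = [2, 1, 2, 1, 2, 3, 3, 4, 4, 1, 2, 3]

per_stage_recall = recall_score(y_true, y_pred, labels=[1, 2, 3, 4], average=None)
early = [(t, p) for t, p in zip(y_true, y_pred) if t in (1, 2)]
early_acc = accuracy_score([t for t, _ in early], [p for _, p in early])

print(dict(zip([1, 2, 3, 4], per_stage_recall.round(2))))
print(f"Stage I-II accuracy: {early_acc:.2f}")
```

In this fabricated example every error sits at the Stage I-II boundary, so the early-stage accuracy falls well below the overall accuracy, which is precisely the gap that pooled reporting hides.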
Our team's systematic review and meta-analysis on AI in ONFH MRI detection, currently under peer review, synthesises the published literature on exactly these validation properties. The pattern is consistent: most models do not yet meet these standards. The question is not whether AI can match expert performance on ARCO staging; it is whether AI can do so reliably across the scanner and patient diversity of routine clinical practice.
Where Salnus Fits
Salnus is building orthopaedic AI tools that take seriously the multi-centre validation problem. Our systematic review on AI in osteonecrosis MRI, currently under editorial consideration, provides the evidence base for which detection and staging approaches show generalisation potential and which do not.
For collaboration on multi-centre ARCO staging validation cohorts, including contribution of de-identified hip MRI series and reader adjudication, please contact us through our pilot program.
Key Takeaways
- ARCO Stage I-II inter-observer disagreement is the structural weakness in osteonecrosis classification, and the boundary that determines core decompression candidacy.
- Detection-only AI models perform well (AUC 0.90+) but do not address the clinically relevant staging question.
- Multi-class ARCO staging accuracy in published models ranges from 0.70 to 0.85, with the weaker end corresponding to Stage I-II discrimination.
- Single-centre validation does not predict performance on alternative MRI vendors, field strengths, or patient populations.
- The clinical bar is multi-centre, multi-reader validation with stratified performance at boundary cases, not overall accuracy on a single cohort.
For broader context on inter-observer variability in orthopaedic imaging classification, see our companion post on knee cartilage grading AI.
Salnus is an orthopaedic AI startup based in Istanbul, building clinical decision support tools for knee, hip, and shoulder surgery. Our platform is currently in invite-only pilot for selected orthopaedic surgeons. Request access.
Reviewed by the Salnus biomedical engineering team.