AI-Powered OA Screening: X-Ray to Clinical Insight
How deep learning screens for knee OA from radiographs — model architecture, training on OAI data, and clinical integration at Salnus.
The Clinical Problem
A patient presents with knee pain. A weight-bearing AP radiograph is taken. The surgeon needs to determine: is this osteoarthritis, and if so, how severe? This assessment — repeated thousands of times daily across orthopaedic clinics worldwide — relies on visual interpretation of radiographic features against the Kellgren-Lawrence grading system.
The challenge is not that this assessment is difficult for experienced surgeons — it is that it is subjective, time-consuming when performed carefully, and inconsistent across observers. Two radiologists examining the same film agree on the exact KL grade only 50–65% of the time.
AI-powered screening addresses this by providing a consistent, quantitative second opinion that takes less than a second.
How the Model Works
Salnus's OA screening pipeline processes a knee radiograph through three stages:
Preprocessing takes the raw image and normalises it to the format the model expects. The radiograph is resized to 224×224 pixels, converted to three-channel input (the grayscale image replicated across RGB channels), and normalised using ImageNet statistics (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). This normalisation aligns the pixel distribution with the data the model was pretrained on.
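The preprocessing step can be sketched in a few lines. This is an illustrative NumPy-only version: the `preprocess` function name and its nearest-neighbour resize are simplifications for readability, and a production pipeline would presumably use proper bilinear resampling via an imaging library.

```python
import numpy as np

# ImageNet normalisation statistics, matching the pretraining data
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(gray: np.ndarray) -> np.ndarray:
    """Turn a single-channel radiograph (H, W), values 0-255,
    into the (3, 224, 224) float tensor the model expects."""
    # Nearest-neighbour resize to 224x224 (a real pipeline would use
    # bilinear interpolation; this keeps the sketch dependency-free)
    h, w = gray.shape
    rows = np.arange(224) * h // 224
    cols = np.arange(224) * w // 224
    resized = gray[np.ix_(rows, cols)].astype(np.float32) / 255.0
    # Replicate the grayscale image across three RGB channels
    rgb = np.stack([resized] * 3, axis=0)  # (3, 224, 224)
    # Channel-wise normalisation with the ImageNet statistics
    return (rgb - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]
```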
Inference passes the preprocessed image through a DenseNet-121 convolutional neural network. DenseNet's dense connectivity pattern — where each layer receives feature maps from all preceding layers — enables efficient feature reuse. For OA detection, this means the network can simultaneously leverage low-level features (edges, textures of subchondral bone) and high-level features (overall joint geometry, osteophyte presence) in making its prediction.
The model outputs either a binary classification (OA vs. Normal) or a 5-class KL grade (0–4), depending on the clinical context. Binary detection runs at 84.1% accuracy; 5-class grading at 70.3%.
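Converting the 5-class head's raw logits into a KL grade is a standard softmax-and-argmax step. A minimal sketch follows; the `KL_GRADES` labels and function names are illustrative, not the platform's actual API.

```python
import numpy as np

KL_GRADES = ["KL-0", "KL-1", "KL-2", "KL-3", "KL-4"]

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grade_from_logits(logits) -> tuple:
    """Map raw 5-class logits to a KL grade plus a confidence score."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    i = int(probs.argmax())
    return KL_GRADES[i], float(probs[i])
```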
GradCAM visualisation generates a heatmap showing which regions of the radiograph most influenced the model's decision. In our validated models, the heatmap consistently highlights the medial and lateral joint space — exactly the anatomical regions where cartilage loss manifests as joint space narrowing. This alignment between the model's attention and known OA pathology provides clinical confidence that the model is making decisions for the right reasons, not exploiting incidental image artefacts.
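The Grad-CAM computation itself is compact: channel weights come from globally averaged gradients, followed by a weighted sum of activation maps and a ReLU. A framework-agnostic sketch, assuming the last conv layer's activations and gradients have already been captured (in practice via framework hooks, e.g. in PyTorch):

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from a conv layer's activations (C, H, W) and the
    gradient of the target class score w.r.t. those activations."""
    # Channel importance: global-average-pool the gradients
    weights = gradients.mean(axis=(1, 2))             # (C,)
    # Weighted sum of activation maps, then ReLU to keep
    # only features that increase the class score
    cam = np.maximum(0.0, np.einsum('c,chw->hw', weights, activations))
    # Normalise to [0, 1] for overlay on the radiograph
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```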
The GradCAM output is overlaid directly on the radiograph in the Salnus viewer, allowing the surgeon to see at a glance where the AI found evidence of OA.
Training Journey: 21 Experiments to Find the Ceiling
Building a reliable OA screening model required systematic experimentation. Over three development sprints, we conducted 21 experiments across 11 architectures:
In Sprint 1, we evaluated 26 pretrained checkpoints to establish a baseline. VGG-19 achieved the highest 5-class accuracy (68.9%) but at 533MB was impractical for deployment. DenseNet-121 (27MB, 66.5%) offered the best accuracy-to-size ratio and became our base architecture.
In Sprint 2, we fine-tuned DenseNet-121 across 9 configurations. A curriculum learning strategy — first training on binary OA detection, then fine-tuning on the 5-class task — improved 5-class accuracy to 70.3%. Data augmentation (random rotation ±15°, horizontal flip, brightness/contrast jitter) provided a 2.1 percentage point improvement over the baseline.
In Sprint 3, we explored EfficientNet-B0, ResNet-50, and Vision Transformer (ViT) architectures. None surpassed DenseNet-121 on our dataset, confirming it as the optimal architecture for our current data scale. The key insight: with ~36,000 training images, model architecture matters less than data quality and augmentation strategy.
Client-Side Inference: Privacy by Architecture
A critical architectural decision in the Salnus platform is that AI inference runs entirely in the surgeon's browser using ONNX Runtime Web. The radiograph is never transmitted to an external server.
This architecture provides KVKK and HIPAA compliance by design — there is no patient data to protect on the server side because no patient data ever reaches the server. The 27MB ONNX model is downloaded once and cached in the browser for subsequent uses. Inference takes approximately 110ms per image on a standard laptop.
For surgeons evaluating AI tools, this is an important distinction. Server-side AI requires data processing agreements, encryption in transit, and trust that the vendor handles PHI appropriately. Client-side AI eliminates these concerns entirely.
From Screening to Decision Support
The binary screening model is the first layer. When it detects OA, the 5-class KL grading model provides a severity estimate, currently labelled as "beta" given its 70.3% accuracy. The surgeon sees both results alongside the GradCAM heatmap and quantitative measurements (joint space width, osteophyte dimensions), all computed from the image and converted to real-world units via the pixel-spacing calibration embedded in the original image's DICOM metadata.
This layered approach — fast binary screen, then detailed grading with visual explanation — mirrors the clinical reasoning process: first determine if pathology is present, then characterise its severity.
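That layered logic amounts to a few lines of control flow. A hypothetical sketch, in which the 0.5 threshold, the function name, and the return structure are assumptions rather than the shipped configuration:

```python
def screen(binary_prob_oa: float, kl_probs, threshold: float = 0.5) -> dict:
    """Layered screening: a fast binary screen first; only when OA is
    detected does the beta 5-class grader report a severity estimate."""
    if binary_prob_oa < threshold:
        # Screen negative: no grading step is run
        return {"oa_detected": False, "kl_grade": None}
    # Screen positive: pick the most probable KL grade (0-4)
    grade = max(range(5), key=lambda k: kl_probs[k])
    return {"oa_detected": True, "kl_grade": grade, "beta": True}
```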
Honest Performance Reporting
We report our model's performance on a held-out test set of 3,600 radiographs that the model has never seen during training or validation. Our current numbers:
- Binary OA detection: 84.1% accuracy, 87.2% sensitivity, 81.0% specificity
- 5-class KL grading: 70.3% accuracy (comparable to published inter-radiologist agreement)
- GradCAM clinical alignment: In 91% of true-positive cases, the highest-activation region overlaps with the medial or lateral joint space
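The accuracy, sensitivity, and specificity figures all come straight from the binary confusion matrix, and a small helper makes the relationship explicit. The counts in the usage comment are hypothetical: a balanced 1,000/1,000 split chosen so the arithmetic reproduces the reported percentages, not the composition of the actual 3,600-image test set.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity (recall on OA cases) and specificity
    (recall on Normal cases) from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy":    (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # OA cases correctly flagged
        "specificity": tn / (tn + fp),   # Normals correctly cleared
    }

# Hypothetical balanced split (illustration only):
m = classification_metrics(tp=872, fn=128, tn=810, fp=190)
# m["accuracy"] = 0.841, m["sensitivity"] = 0.872, m["specificity"] = 0.810
```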
These numbers are honest — they reflect real performance on unseen data, not cherry-picked results. The 5-class accuracy of 70.3% is not yet at the level we consider sufficient for standalone clinical use, which is why it carries a "beta" designation. Our target is 75%+ before removing that label, achievable with expanded and more diverse training data.
We maintain a fixed test set that is never modified, enabling fair comparison with our earlier experiments. When we improve the model (through additional training data or architecture changes), it will be evaluated on this same test set.
What Comes Next
Our immediate development priorities are expanding the training dataset through clinical partnerships to push 5-class accuracy past the 75% gate, and integrating the ONNX model into the Salnus Surgeon Portal for pilot deployment. Longer term, we are developing models for hip OA classification and ACL injury assessment from MRI.
If you are interested in pilot testing our OA screening tool or contributing anonymised radiograph data to improve the model, contact our team.
Disclaimer: Salnus OA Screening is designated for research use only (RUO) and is not a cleared medical device. All clinical decisions must be made by qualified physicians based on comprehensive patient assessment.
Performance metrics reported are from internal research experiments conducted on publicly available datasets (OAI). These results have not been externally validated for clinical use.
Reviewed by the Salnus biomedical engineering team.