How to Choose an AI Tool for Knee OA Assessment
A practical evaluation framework for orthopaedic surgeons comparing AI-powered knee osteoarthritis grading tools — what to ask vendors, which metrics matter, and red flags to watch for.
The Problem: Too Many Claims, Not Enough Clarity
The market for AI-powered knee osteoarthritis tools is growing rapidly. A surgeon searching for "knee OA AI" will find dozens of vendors, each claiming high accuracy, seamless integration, and clinical validation. Separating genuine capability from marketing requires a structured evaluation framework.
This guide provides the questions every orthopaedic surgeon should ask before adopting an AI tool for knee OA assessment — whether for clinical use, research evaluation, or pilot testing.
Question 1: What Is the Regulatory Status?
This is the single most important question, and the answer determines everything else.
CE-marked / FDA-cleared means the tool has undergone conformity assessment by a Notified Body (EU) or premarket review (FDA). This includes review of clinical evidence, risk management, software lifecycle documentation, and quality management systems. Examples include Radiobotics' RBknee (CE-marked) and ImageBiopsy Lab's KOALA (CE-marked).
Research Use Only (RUO) means the tool has not been cleared for clinical use. It may still be valuable for research, education, and evaluation — but its outputs must not be the sole basis for clinical decisions. The surgeon retains full responsibility for independent clinical assessment.
No designation stated is a red flag. Any legitimate medical AI tool should clearly state its regulatory status. If a vendor demonstrates an OA grading tool without mentioning regulatory clearance, ask directly.
The practical implication: a CE-marked tool can be integrated into clinical workflow as a decision aid. An RUO tool should be treated as a second opinion that requires independent verification — useful for learning and evaluation, not for standalone clinical reliance.
Question 2: What Was the Training Data?
The performance of any AI model is bounded by the data it was trained on. Three factors matter.
Dataset source. The Osteoarthritis Initiative (OAI) is the most commonly used public dataset — approximately 36,000 knee radiographs with expert-consensus KL grades. Models trained on OAI benefit from large volume and standardised labelling but may not generalise well to radiographs acquired with different equipment or positioning, or from different patient populations.
Dataset diversity. A model trained exclusively on images from one institution, one scanner manufacturer, or one demographic group will underperform on data that differs from its training distribution. Ask whether the training data includes multiple centres, scanner vendors, and demographic groups.
Label quality. KL grading is subjective — inter-observer agreement between radiologists is only 50–65%. Models trained on single-reader labels inherit that reader's biases; models trained on consensus panels produce more robust classifications.
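If a vendor shares reader-level labels, agreement is straightforward to check yourself. Below is a minimal sketch using scikit-learn's cohen_kappa_score with hypothetical grades from two readers; quadratically weighted kappa is a common choice for an ordinal scale like KL, since it penalises disagreements more the further apart the grades are.

```python
# Measuring inter-observer agreement on KL grades with Cohen's kappa.
# The two readers' grades below are hypothetical illustrations,
# not real study data.
from sklearn.metrics import cohen_kappa_score

reader_a = [0, 1, 2, 2, 3, 4, 1, 2, 0, 3]   # KL grades from reader A
reader_b = [0, 2, 2, 1, 3, 4, 2, 2, 1, 3]   # KL grades from reader B

# Unweighted kappa treats every disagreement equally; quadratic
# weighting suits ordinal scales like KL (0-4).
print(cohen_kappa_score(reader_a, reader_b))
print(cohen_kappa_score(reader_a, reader_b, weights="quadratic"))
```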
What to ask the vendor: "What dataset was your model trained on? How many images? How many institutions? How were labels obtained — single reader, consensus panel, or automated extraction from radiology reports?"
Question 3: How Was It Validated?
There are three levels of validation, in increasing order of clinical relevance.
Internal validation tests the model on a held-out portion of the same dataset used for training. This measures whether the model learned the patterns in its training data — necessary but insufficient. An internal accuracy of 90% tells you the model works on data similar to what it trained on.
External validation tests the model on data from a completely independent source — different institution, different equipment, different patient population. This measures generalisability — whether the model works on your patients, with your equipment, in your clinical setting. External validation is the minimum standard for clinical credibility.
Prospective validation tests the model in real-time clinical workflow, comparing AI assessment against expert clinical judgment on new, unseen cases. This is the gold standard but relatively rare for orthopaedic AI tools in 2026.
What to ask the vendor: "Has your model been externally validated? On what dataset? Were the results published in a peer-reviewed journal? What was the accuracy on the external dataset versus the internal test set?"
A significant drop between internal and external accuracy (e.g., 92% internal → 74% external) indicates the model has learned patterns specific to its training data rather than generalisable clinical features.
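If a vendor grants evaluation access, the generalisation gap can be checked directly. The sketch below assumes a generic classifier with a predict() method and placeholder dataset pairs; it illustrates the comparison itself, not any vendor's evaluation protocol.

```python
# A minimal sketch of checking the generalisation gap: evaluate the
# same model on the internal held-out test set and on an external set.
from sklearn.metrics import accuracy_score

def generalisation_gap(model, internal, external):
    """Compare accuracy on internal vs external test data.

    `internal` and `external` are (images, kl_grades) pairs and
    `model` is any classifier with a .predict() method; all three
    names are placeholders for your own model and data.
    """
    acc_int = accuracy_score(internal[1], model.predict(internal[0]))
    acc_ext = accuracy_score(external[1], model.predict(external[0]))
    print(f"internal {acc_int:.1%} | external {acc_ext:.1%} | "
          f"gap {acc_int - acc_ext:.1%}")
    # A large gap (e.g. >10 percentage points) suggests the model
    # learned dataset-specific patterns, not clinical features.
    return acc_int - acc_ext
```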
Question 4: Which Metrics Are Reported?
Not all accuracy metrics are equally informative.
Binary accuracy (OA present vs. absent) is the simplest metric. Most tools achieve 80–95% here. Clinically useful for screening but does not address severity grading.
Multi-class accuracy (five KL grades: 0–4) is substantially harder. Published results range from 65% to 85%, with the majority of errors occurring between adjacent grades (e.g., KL-1 vs. KL-2).
Sensitivity and specificity tell you different things. High sensitivity (few missed cases) matters for screening; for surgical decision-making, high specificity (few false positives) may matter more.
AUC (Area Under the ROC Curve) summarises discrimination across all classification thresholds. Values above 0.90 indicate strong discrimination; above 0.95 is excellent.
Per-grade performance reveals where the model succeeds and fails. A model with 95% accuracy on KL grade 4 but 40% accuracy on KL grade 1 has a very different clinical profile from one with uniform 75% accuracy across all grades.
What to ask the vendor: "What is the five-class KL grading accuracy? What are the per-grade sensitivity and specificity? What is the confusion matrix? Where does the model make the most errors?"
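For context, this is roughly what the requested analysis looks like. The sketch below uses scikit-learn with hypothetical grades; a vendor's real confusion matrix would be built from a full external test set.

```python
# A minimal sketch of the per-grade analysis worth requesting from a
# vendor. `y_true` and `y_pred` are hypothetical KL grades (0-4),
# for illustration only.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 4])
y_pred = np.array([0, 1, 1, 2, 2, 2, 1, 3, 4, 4])

# Rows = true grade, columns = predicted grade. In published models,
# most errors cluster just off the diagonal (adjacent grades).
print(confusion_matrix(y_true, y_pred, labels=list(range(5))))

# Per-grade precision, recall (sensitivity), and F1.
print(classification_report(y_true, y_pred, labels=list(range(5)),
                            target_names=[f"KL-{g}" for g in range(5)]))

# Multi-class AUC requires predicted probabilities, e.g.:
# roc_auc_score(y_true, y_proba, multi_class="ovr", labels=list(range(5)))
```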
Question 5: Does It Show Its Reasoning?
A classification without explanation has limited clinical value. The surgeon needs to understand why the AI reached its conclusion to decide whether to accept or override it.
GradCAM heatmaps show which regions of the radiograph most influenced the model's decision. For a well-trained OA model, the heatmap should highlight the medial and lateral joint space — where cartilage loss manifests as joint space narrowing — and osteophyte margins. If the heatmap highlights the image border, patient label, or soft tissue, the model may be making decisions for the wrong reasons.
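For readers curious about what is under the hood, the sketch below shows the core of Grad-CAM in PyTorch. The model and target layer are assumptions (any torchvision-style CNN would do, with the target layer being its last convolutional block), and commercial tools may use refined variants of this technique.

```python
# A minimal Grad-CAM sketch in PyTorch (illustrative; assumes a CNN
# such as a torchvision ResNet, with `target_layer` set to its last
# convolutional block, e.g. model.layer4).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Return a (h, w) heatmap of regions that drove `class_idx`."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(
        lambda mod, inp, out: acts.append(out))
    h2 = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: grads.append(gout[0]))

    model.eval()
    logits = model(image.unsqueeze(0))      # image: (C, H, W) tensor
    logits[0, class_idx].backward()         # gradients of chosen class
    h1.remove(); h2.remove()

    # Weight each feature map by its average gradient, sum, then ReLU.
    weights = grads[0].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts[0]).sum(dim=1)).squeeze(0)
    cam = cam / (cam.max() + 1e-8)          # normalise to [0, 1]
    # Upsample to image resolution before overlaying on the radiograph.
    return cam.detach()
```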
Confidence scores indicate how certain the model is about its classification. A model that reports "KL-2 with 92% confidence" provides more useful information than one that simply outputs "KL-2". Low confidence cases can be flagged for expert review.
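A simple illustration of how confidence-based triage might work, with hypothetical logits and an illustrative 0.70 threshold rather than any validated cut-off:

```python
# Softmax confidence and low-confidence flagging (illustrative).
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.2, 1.0, 2.2, 1.5, 0.1]])  # hypothetical output
probs = F.softmax(logits, dim=1)
confidence, grade = probs.max(dim=1)

# Route low-confidence cases to expert review (threshold illustrative).
if confidence.item() < 0.70:
    print(f"KL-{grade.item()}: low confidence, flag for expert review")
else:
    print(f"KL-{grade.item()} ({confidence.item():.0%} confidence)")
```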
Quantitative measurements — such as joint space width in millimetres, osteophyte area, or subchondral bone density — provide objective data that goes beyond categorical classification and supports longitudinal monitoring.
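As a small illustration of where millimetre measurements come from: projection radiographs carry a pixel-size calibration in their DICOM headers. The file name and pixel count below are hypothetical, and real tools must handle magnification and calibration markers that this sketch ignores.

```python
# Converting a pixel-space measurement to millimetres (illustrative).
# Projection radiographs store calibration in PixelSpacing or, for
# uncalibrated detector-plane spacing, ImagerPixelSpacing.
import pydicom

ds = pydicom.dcmread("knee_ap.dcm")          # hypothetical file
spacing = ds.get("PixelSpacing") or ds.get("ImagerPixelSpacing")
row_mm, col_mm = (float(v) for v in spacing)  # mm per pixel

jsw_pixels = 38                     # hypothetical medial JSW in pixels
print(f"Joint space width: {jsw_pixels * row_mm:.1f} mm")
```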
What to ask the vendor: "Does the tool provide visual explanation of its reasoning? Can I see what part of the image the model is looking at? Is a confidence score reported alongside the classification?"
Question 6: How Does It Handle Patient Data?
Data privacy architecture is not a technical detail — it directly impacts your compliance obligations and your patients' privacy.
Client-side processing runs AI inference entirely in your browser. No patient imaging data is transmitted to external servers. This supports KVKK/HIPAA/GDPR compliance by architecture — there is no server-side patient data to protect, breach, or misuse. The trade-off is that inference is limited by the computational power of your device.
Server-side processing uploads patient images to a cloud server for AI analysis. This enables more computationally intensive models but creates data protection obligations: data processing agreements, encryption requirements, and vendor trust. For hospital-based surgeons, this may require institutional IT approval and data protection impact assessments.
Hybrid approaches process images locally but send extracted features (not raw images) to a server for more sophisticated analysis. This reduces privacy risk but still involves some data transmission.
What to ask the vendor: "Where does AI inference run — on my device or your server? Is patient imaging data transmitted externally? Is the tool KVKK/HIPAA/GDPR compliant? Where is patient data stored, and for how long?"
Question 7: What Does Integration Look Like?
A tool that requires exporting DICOM files, navigating to a separate website, uploading images, and manually transcribing results back into the clinical record will not be used in a busy practice.
DICOM-native integration means the tool works directly with standard medical imaging files — not screenshots, JPEGs, or manual image capture.
Workflow integration means the AI assessment appears within the surgeon's existing image viewing environment — not as a separate application. The surgeon opens a study, and AI results appear automatically or on demand.
Report generation means AI findings can be exported as structured clinical documents (PDF reports, structured data) that integrate with the clinical record — not just on-screen overlays that require manual transcription.
What to ask the vendor: "Does your tool accept DICOM files directly? Does it integrate into existing PACS/viewer workflows? Can AI findings be exported as structured reports?"
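Taken together, the workflow you are asking about looks roughly like the sketch below: read the DICOM directly, run inference, and emit a structured report. The grade_knee_oa function is a stub standing in for a vendor's model, and the report fields are illustrative, not a standard schema.

```python
# A minimal sketch of a DICOM-native pipeline: read the study, run
# inference, export a structured report (all names illustrative).
import json
import pydicom

def grade_knee_oa(pixels):
    """Hypothetical stand-in for a vendor's grading model."""
    return {"kl_grade": 2, "confidence": 0.91}

ds = pydicom.dcmread("knee_ap.dcm")        # no JPEG export, no screenshots
result = grade_knee_oa(ds.pixel_array)     # inference on raw pixel data

report = {
    "study_uid": str(ds.StudyInstanceUID),
    "body_part": ds.get("BodyPartExamined", "KNEE"),
    "kl_grade": result["kl_grade"],
    "confidence": result["confidence"],
    "tool": {"name": "example-oa-tool", "status": "RUO"},
}
with open("oa_report.json", "w") as f:
    json.dump(report, f, indent=2)         # structured, transcription-free
```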
Red Flags
When evaluating orthopaedic AI vendors, these signals warrant caution:
Accuracy claims without methodology. A vendor claiming "99% accuracy" without specifying the dataset, validation method, or metric definition is presenting marketing, not science.
No regulatory status disclosed. Legitimate medical AI vendors are transparent about whether their tool is CE-marked, FDA-cleared, or RUO. Evasive answers suggest the tool lacks any formal regulatory assessment.
No external validation. Internal accuracy alone does not predict real-world performance. If the vendor cannot point to external validation — ideally published in a peer-reviewed journal — the claimed performance is unverified.
Black-box decisions. AI that provides a grade without any visual or quantitative explanation does not give the surgeon a mechanism to verify the reasoning.
Opaque data handling. If the vendor cannot clearly explain where patient data goes, the tool should not be used with real patient data.
Making the Decision
No AI tool replaces clinical judgment. The best tools augment it — providing consistent, quantitative, reproducible assessments that anchor the surgeon's evaluation. The questions above help distinguish tools that genuinely serve this purpose from those that merely appear to.
For surgeons beginning their evaluation, start with regulatory status and external validation. These two factors alone eliminate the majority of unreliable tools. Then assess explainability, privacy architecture, and integration fit for your specific practice environment.
The field is maturing rapidly. Tools that are research prototypes today may be clinically validated within 12–18 months. Engaging early — even with RUO tools for evaluation and feedback — positions your practice to adopt proven AI tools faster when they reach clinical readiness.
Evaluate Salnus OA Screening →
Disclaimer: This article is for educational purposes only. Salnus tools are designated for Research Use Only (RUO) and are not cleared medical devices. Clinical decisions should be made by qualified physicians based on comprehensive patient assessment. Mention of third-party products is for educational context and does not constitute endorsement.
Reviewed by the Salnus biomedical engineering team.