Abstract
Background
Magnetic resonance imaging (MRI) of the knee is the preferred method for diagnosing knee injuries. However, interpretation of knee MRI is time consuming and subject to diagnostic error and variability. An automated system for interpreting knee MRI could prioritize high-risk patients and assist clinicians in making diagnoses. Deep learning methods, in being able to automatically learn layers of features, are well suited for modeling the complex relationships between medical images and their interpretations. In this study we developed a deep learning model for detecting general abnormalities and specific diagnoses (anterior cruciate ligament [ACL] tears and meniscal tears) on knee MRI exams. We then measured the effect of providing the model's predictions to clinical experts during interpretation.
Methods and findings
Our dataset consisted of 1,370 knee MRI exams performed at Stanford University Medical Center between January 1, 2001, and December 31, 2012 (mean age 38.0 years; 569 [41.5%] female patients). The majority vote of 3 musculoskeletal radiologists established reference standard labels on an internal validation set of 120 exams. We developed MRNet, a convolutional neural network for classifying MRI series, and combined predictions from 3 series per exam using logistic regression. In detecting abnormalities, ACL tears, and meniscal tears, this model achieved areas under the receiver operating characteristic curve (AUC) of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively, on the internal validation set. We also obtained a public dataset of 917 exams with sagittal T1-weighted series and ACL injury labels from Clinical Hospital Centre Rijeka, Croatia. On the external validation set of 183 exams, the MRNet trained on Stanford sagittal T2-weighted series achieved an AUC of 0.824 (95% CI 0.757, 0.892) in detecting ACL injuries with no additional training, while an MRNet trained on the rest of the external data achieved an AUC of 0.911 (95% CI 0.864, 0.958). We additionally measured the specificity, sensitivity, and accuracy of 9 clinical experts (7 board-certified general radiologists and 2 orthopedic surgeons) on the internal validation set, both with and without model assistance. Using a 2-sided Pearson's chi-squared test with adjustment for multiple comparisons, we found no significant differences between the performance of the model and that of unassisted general radiologists in detecting abnormalities. General radiologists achieved significantly higher sensitivity in detecting ACL tears (p-value = 0.002; q-value = 0.019) and significantly higher specificity in detecting meniscal tears (p-value = 0.003; q-value = 0.019). Using a 1-sided t test on the change in performance metrics, we found that providing model predictions significantly increased clinical experts' specificity in identifying ACL tears (p-value < 0.001; q-value = 0.006). The primary limitations of our study include the lack of surgical ground truth and the small size of the panel of clinical experts.
Conclusions
Our deep learning model can rapidly generate accurate clinical pathology classifications of knee MRI exams from both internal and external datasets. Moreover, our results support the assertion that deep learning models can improve the performance of clinical experts during medical imaging interpretation. Further research is needed to validate the model prospectively and to determine its utility in the clinical setting.
Author summary
Why was this study done?
- We wanted to see if a deep learning model could perform the clinically important task of detecting disorders on knee magnetic resonance imaging (MRI) exams.
- We wanted to determine whether a deep learning model could improve the diagnostic accuracy, specificity, or sensitivity of clinical experts, including radiologists and orthopedic surgeons.
What did the researchers do and find?
- Our deep learning model predicts 3 outcomes on knee MRI (anterior cruciate ligament [ACL] tears, meniscal tears, and general abnormalities) in a matter of seconds, with performance similar to that of general radiologists.
- We experimented with providing the model's results to general radiologists and orthopedic surgeons during interpretation and observed a statistically significant improvement in the diagnosis of ACL tears with model assistance.
- When validated externally on a dataset from a different institution, the model detected ACL tears with high discriminative ability.
What do these results mean?
- Deep learning may enable rapid preliminary results following MRI scans and improve access to high-quality MRI diagnoses in the absence of specialist radiologists.
- Providing clinical experts with predictions from a deep learning model could improve the quality and consistency of MRI interpretation.
Citation: Bien N, Rajpurkar P, Ball RL, Irvin J, Park A, Jones E, et al. (2018) Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS Med 15(11): e1002699.
https://doi.org/10.1371/journal.pmed.1002699
Academic Editor: Suchi Saria, Johns Hopkins University, USA
Received: June 2, 2018; Accepted: October 23, 2018; Published: November 27, 2018
Copyright: © 2018 Bien et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data availability: The Stanford University Medical Center data used in this study are available at https://stanfordmlgroup.github.io/projects/MRNet to users who agree to a research use agreement for the dataset. The code to reproduce these results is provided as supporting information (S1 Code and S2 Code).
Funding: The authors have not received any specific funding for this work.
Competing interests: We have read the journal's policy and the authors of this manuscript have the following competing interests: CL is a shareholder of whiterabbit.ai and nines.ai. Since the submission of this manuscript, RLB has joined and received stock options from Roam Analytics, whose mission is to use AI methodologies to improve human health.
Abbreviations: ACL, anterior cruciate ligament; AUC, area under the receiver operating characteristic curve; CAM, class activation mapping; CNN, convolutional neural network; FDR, false discovery rate; MRI, magnetic resonance imaging; MSK, musculoskeletal; PD, proton density
Introduction
Magnetic resonance imaging (MRI) of the knee is the standard imaging modality for evaluating knee disorders, and more musculoskeletal (MSK) MRI examinations are performed on the knee than on any other region of the body [1–3]. MRI has repeatedly been shown to have high accuracy for the diagnosis of meniscal and cruciate ligament pathology [4–7] and is commonly used to identify those who might benefit from surgery [8–10]. Moreover, the negative predictive value of knee MRI is nearly 100%, making MRI an effective non-invasive method for ruling out surgical disorders such as anterior cruciate ligament (ACL) tears [11]. Because of the quantity and detail of the images in each knee MRI exam, accurate interpretation of knee MRI is a time-intensive task subject to inter- and intra-reviewer variability, even when performed by board-certified MSK radiologists [12]. An automated system for interpreting knee MRI images has many potential applications, such as rapid prioritization of high-risk patients in the radiologist's workflow and assistance in making diagnoses [13]. However, the multidimensional, multi-planar nature of MRI has so far limited the applicability of traditional image analysis techniques to knee MRI [13,14].
Deep learning approaches, which automatically learn layers of features, are well suited for modeling the complex relationships between medical images and their interpretations [15,16]. Recently, these approaches have surpassed traditional image analysis methods and have led to significant advances in medical imaging tasks, including skin cancer classification [17], diabetic retinopathy detection [18], and lung nodule detection [19]. Previous applications of deep learning to knee MRI have been limited to cartilage segmentation and cartilage lesion detection [20–22].
In this study, we present MRNet, a fully automated deep learning model for interpreting knee MRI, and compare the model's performance to that of general radiologists. In addition, we evaluate the change in diagnostic performance of clinical experts when the automated deep learning model's predictions are provided during interpretation. Finally, we evaluate our model's performance on a publicly available external knee MRI dataset labeled for ACL injury.
Methods
Dataset
Knee MRI reports from Stanford University Medical Center between January 1, 2001, and December 31, 2012, were manually reviewed to curate a dataset of 1,370 knee MRI exams. The dataset contained 1,104 (80.6%) abnormal exams, with 319 (23.3%) ACL tears and 508 (37.1%) meniscal tears; ACL tears and meniscal tears occurred concurrently in 194 (38.2%) exams. The most common indications for the knee MRI exams in this study included acute and chronic pain, follow-up or preoperative assessment, injury/trauma, and other/not provided. Exams were performed on GE scanners (GE Discovery, GE Healthcare, Waukesha, WI) with a standard knee MRI coil and a routine non-contrast knee MRI protocol that included the following sequences: coronal T1 weighted, coronal T2 with fat saturation, sagittal proton density (PD) weighted, sagittal T2 with fat saturation, and axial PD weighted with fat saturation. A total of 775 (56.6%) exams used a 3.0-T magnetic field; the remainder used a 1.5-T magnetic field. See S1 Table for detailed MRI sequence parameters. For this study, the sagittal plane T2-weighted series, coronal plane T1-weighted series, and axial plane PD-weighted series were extracted from each exam for use in the model. The number of images in these series ranged from 17 to 61 (mean 31.48, SD 7.97).
Exams were split into a training set (1,130 exams, 1,088 patients), a tuning set (120 exams, 111 patients), and a validation set (120 exams, 113 patients) (Fig 1). To form the validation and tuning sets, stratified random sampling was used to ensure that at least 50 positive examples of each label (abnormal, ACL tear, and meniscal tear) were present in each set. All exams from each patient were put in the same split, as illustrated in the sketch below. Table 1 contains patient pathology and demographic statistics for each dataset.
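As an illustration of patient-level splitting with a minimum-positives check, the sketch below uses scikit-learn's GroupShuffleSplit; the column names and the redraw strategy are assumptions for illustration, not the authors' code.

```python
# Illustrative patient-level split with a stratification check; column names
# ("patient_id", "abnormal", "acl", "meniscus") are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(exams: pd.DataFrame, held_out_fraction: float, seed: int):
    """Hold out a fraction of patients so all exams from one patient share a split."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=held_out_fraction, random_state=seed)
    train_idx, held_idx = next(splitter.split(exams, groups=exams["patient_id"]))
    return exams.iloc[train_idx], exams.iloc[held_idx]

def has_enough_positives(split: pd.DataFrame, minimum: int = 50) -> bool:
    """Check the constraint of at least 50 positive examples per label."""
    return bool((split[["abnormal", "acl", "meniscus"]].sum() >= minimum).all())

# One simple strategy: redraw splits with different seeds until the held-out
# set satisfies the constraint, approximating the stratified sampling above.
```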
Model
Preprocessing.
Images were extracted from Digital Imaging and Communications in Medicine (DICOM) files, scaled to 256 × 256 pixels, and converted to PNG (Portable Network Graphics) format using the Python programming language (version 2.7) [24] and the pydicom library (version 0.9.9) [25].
To account for variable pixel intensity scales within the MRI series, a histogram-based intensity standardization algorithm was applied to the images [26]. For each series, a representative intensity distribution was learned from a pool of training exams. The parameters of this distribution were then used to adjust the pixel intensities of exams in all datasets (training, tuning, and validation). Under this transformation, pixels with similar values correspond to similar tissue types. After intensity standardization, pixel values were clipped to between 0 and 255, the standard range for PNG images.
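A minimal preprocessing sketch is shown below; it assumes the modern pydicom API (`dcmread`) and a simplified percentile-landmark normalization in the spirit of Nyúl and Udupa [26]. The landmark percentiles and helper names are illustrative, not taken from the released code.

```python
# Illustrative preprocessing: read a DICOM slice, resize to 256 x 256, and
# apply a simplified histogram-landmark normalization, then clip to 0-255.
import numpy as np
import pydicom
from PIL import Image

LANDMARK_PCTS = np.array([1, 10, 25, 50, 75, 90, 99])  # assumed percentile landmarks

def load_slice(path: str) -> np.ndarray:
    """Read one DICOM file and resize its pixel array to 256 x 256."""
    arr = pydicom.dcmread(path).pixel_array.astype(np.float32)
    return np.array(Image.fromarray(arr).resize((256, 256)))

def fit_reference_landmarks(training_volumes) -> np.ndarray:
    """Average the percentile landmarks over a sample of training exams."""
    return np.mean([np.percentile(v, LANDMARK_PCTS) for v in training_volumes], axis=0)

def normalize(volume: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Map the volume's landmark intensities onto the reference landmarks
    (piecewise linear), then clip to the standard 8-bit PNG range."""
    own = np.percentile(volume, LANDMARK_PCTS)  # assumed strictly increasing
    mapped = np.interp(volume, own, reference)
    return np.clip(mapped, 0, 255).astype(np.uint8)
```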
MRNet.
The primary building block of our prediction system is MRNet, a convolutional neural network (CNN) that maps a 3-dimensional MRI series to a probability [15] (Fig 2). The input to MRNet has dimensions s × 3 × 256 × 256, where s is the number of images in the MRI series (3 is the number of color channels). First, each 2-dimensional MRI image slice was passed through an AlexNet-based feature extractor to obtain an s × 256 × 7 × 7 tensor containing features for each slice. A global average pooling layer was then applied to reduce these features to s × 256. We then applied max pooling across slices to obtain a 256-dimensional vector, which was passed to a fully connected layer and sigmoid activation function to obtain a prediction in the 0 to 1 range. We optimized the model using binary cross-entropy loss. To account for imbalanced class sizes across the tasks, the loss for an example was scaled inversely proportionally to the prevalence of that example's class in the dataset.
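Below is a minimal PyTorch sketch of this architecture (ImageNet-pretrained AlexNet features, global average pooling, cross-slice max pooling, and a fully connected layer with sigmoid output). It follows the description above but is not the authors' released implementation, and layer details may differ.

```python
# Minimal sketch of the MRNet architecture described above; not the authors'
# released code, and implementation details may differ.
import torch
import torch.nn as nn
from torchvision import models

class MRNet(nn.Module):
    def __init__(self):
        super().__init__()
        # AlexNet feature extractor initialized with ImageNet weights
        # (newer torchvision versions use the `weights=` argument instead).
        self.features = models.alexnet(pretrained=True).features
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling per slice
        self.classifier = nn.Linear(256, 1)  # fully connected layer -> 1 logit

    def forward(self, x):
        # x: one series of s slices, shape (s, 3, 256, 256), treated as a batch
        feats = self.features(x)              # (s, 256, 7, 7)
        pooled = self.gap(feats).flatten(1)   # (s, 256)
        exam_vec = pooled.max(dim=0).values   # (256,) max pooling across slices
        return torch.sigmoid(self.classifier(exam_vec))  # probability in [0, 1]
```

A weighted binary cross-entropy loss (for example, `nn.BCELoss` with per-example weights inversely proportional to class prevalence) would correspond to the loss scaling described above.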
During training, the gradient of the loss was computed on each training example using the backpropagation algorithm, and the MRNet parameters were adjusted in the direction opposite to the gradient [15]. Each training example underwent a random rotation between -25 and 25 degrees, a random shift between -25 and 25 pixels, and a horizontal flip with 50% probability each time it appeared during training. Model parameters were saved after each full pass through the training set, and the model with the lowest average loss on the tuning set was chosen for evaluation on the validation set. Fig 2 describes the MRNet architecture in more detail. Training each MRNet for 50 iterations over the training set took about 6 hours on average with an NVIDIA GeForce GTX 1070 8GB GPU. MRNet was implemented with Python 3.6.3 [27] and PyTorch 0.3.0 [28].
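The augmentation and checkpoint-selection scheme could be sketched as follows; the use of `torchvision.transforms.functional.affine`, the data-loading structure, and the helper names are assumptions for illustration rather than the released training code.

```python
# Sketch of per-example augmentation and tuning-set checkpoint selection,
# following the description above (illustrative, not the released code).
import copy
import random
import torch
from torchvision.transforms import functional as TF

def augment(series: torch.Tensor) -> torch.Tensor:
    """Random +/-25 degree rotation, +/-25 pixel shift, 50% horizontal flip,
    applied identically to all slices of one (s, 3, 256, 256) series."""
    angle = random.uniform(-25, 25)
    shift = [random.randint(-25, 25), random.randint(-25, 25)]
    out = TF.affine(series, angle=angle, translate=shift, scale=1.0, shear=[0.0])
    return TF.hflip(out) if random.random() < 0.5 else out

def train(model, optimizer, loss_fn, train_data, tune_data, epochs=50):
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for series, label in train_data:          # one exam at a time
            optimizer.zero_grad()
            loss = loss_fn(model(augment(series)), label)
            loss.backward()                       # backpropagation
            optimizer.step()
        # keep the checkpoint with the lowest mean loss on the tuning set
        model.eval()
        with torch.no_grad():
            tune_loss = sum(loss_fn(model(s), y).item() for s, y in tune_data) / len(tune_data)
        if tune_loss < best_loss:
            best_loss, best_state = tune_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```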
Training a CNN for image classification from scratch generally requires a dataset larger than 1,130 examples. For this reason, we initialized the weights of the AlexNet portion of MRNet to values optimized on the ImageNet database [29] of 1.2 million images in 1,000 classes, and then fine-tuned these weights to fit our MRI dataset. This allowed the earlier layers of the network, which are harder to optimize than later layers, to immediately recognize generic features such as lines and edges. This "transfer learning" approach has also been applied to skin cancer [17] and diabetic retinopathy [18] image datasets.
MRNet Interpretation.
To ensure that the MRNet models learned relevant features, we generated class activation mappings (CAMs) [30] (Fig 3). To generate a CAM for an image, we computed a weighted average over the 256 CNN feature maps using the weights of the classification layer to obtain a 7 × 7 image. The CAM was then mapped to a color scheme, upsampled to 256 × 256 pixels, and overlaid on the original input image. Because the parameters of the final layer of the network are used to weight the feature maps, more predictive feature maps appear brighter. Thus, the brightest areas of the CAMs are the regions that most influence the model's prediction.
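A minimal sketch of the CAM computation for a single slice is shown below, assuming the MRNet sketch given earlier; the bilinear upsampling and min-max scaling are illustrative choices for producing an overlay.

```python
# Illustrative CAM computation for one slice, following Zhou et al. [30];
# assumes the MRNet sketch above (model.features, model.classifier).
import torch
import torch.nn.functional as F

def compute_cam(model, image: torch.Tensor) -> torch.Tensor:
    """Return a 256 x 256 heat map for a single (3, 256, 256) slice."""
    with torch.no_grad():
        fmaps = model.features(image.unsqueeze(0))[0]       # (256, 7, 7)
        weights = model.classifier.weight.squeeze(0)        # (256,)
        cam = (weights[:, None, None] * fmaps).sum(dim=0)   # weighted sum -> (7, 7)
        cam = F.interpolate(cam[None, None], size=(256, 256),
                            mode="bilinear", align_corners=False)[0, 0]
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    return cam

# The scaled map can then be colorized (e.g., with a matplotlib colormap)
# and alpha-blended onto the original input image.
```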
Combining MRNet predictions.
Using the predictions from the sagittal T2, coronal T1, and axial PD MRNets on the training set, along with the corresponding original labels, we trained a logistic regression to weight the predictions from the 3 series and generate a single output for each exam (Fig 4). The most important series for each task, as determined by the fitted logistic regression coefficients, were axial PD for abnormality and meniscal tear detection and coronal T1 for ACL tear detection. After training, the logistic regression was applied to the predictions of the 3 MRNets on the internal validation set to obtain the final predictions. We trained 3 logistic regression models in total, 1 for each task (abnormality, ACL tear, and meniscal tear detection). These models were implemented in Python [24] using the scikit-learn package [31]. For external validation, since there was only 1 series in the dataset, we used the prediction of a single MRNet directly as the final output.
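Combining the 3 series-level probabilities with a logistic regression might look like the sketch below (scikit-learn [31]); the function names and array layout are assumptions for illustration.

```python
# Illustrative combination of per-series MRNet probabilities with a logistic
# regression, one model per task; not the authors' exact code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_combiner(p_sagittal_t2, p_coronal_t1, p_axial_pd, labels):
    """Fit a logistic regression on stacked per-series predicted probabilities."""
    X = np.column_stack([p_sagittal_t2, p_coronal_t1, p_axial_pd])
    return LogisticRegression().fit(X, labels)

def combine(combiner, p_sagittal_t2, p_coronal_t1, p_axial_pd):
    """Produce one probability per exam from the 3 series-level probabilities."""
    X = np.column_stack([p_sagittal_t2, p_coronal_t1, p_axial_pd])
    return combiner.predict_proba(X)[:, 1]

# combiner.coef_ shows which series carries the most weight for a given task,
# mirroring the coefficient-based observation described above.
```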
Evaluation
Reference standard labels for the internal validation set were obtained by the majority vote of 3 board-certified MSK radiologists in a large academic practice (6 to 19 years of practice, average 12 years). The MSK radiologists had access to all DICOM series, the original report, clinical history, and follow-up exams during interpretation. All readers participating in the study used a clinical picture archiving and communication system (PACS) environment (GE Centricity) in a diagnostic reading room, and evaluation was performed on the clinical DICOM images presented on a medical-grade display of at least 3 megapixels, with a minimum luminance of 1 cd/m², maximum luminance of 400 cd/m², pixel size of 0.2 mm, and native resolution of 1,500 × 2,000 pixels. Exams were sorted in reverse chronological order. Three binary labels were assigned to each exam, indicating the presence or absence of (1) any abnormality, (2) an ACL tear, and (3) a meniscal tear. The definitions for the labels were as follows:
- Abnormality: normal (all reviewed images free of abnormalities) or abnormal (abnormal findings in the internal validation set that were not an ACL tear or a meniscal tear included osteoarthritis, effusion, iliotibial band syndrome, posterior cruciate ligament tear, fracture, contusion, plica, and medial collateral ligament sprain);
- ACL: intact (normal, mucoid degeneration, ganglion cyst, sprain) or torn (low-grade partial tear with <50% of fibers torn, high-grade partial tear with >50% of fibers torn, complete tear) [32];
- Meniscus: intact (normal, degenerative changes without tear, post-surgical changes without tear) or torn (increased signal reaching the articular surface on at least 2 slices, or morphologic deformity) [33,34].
Independently of the MSK radiologists, 7 general radiologists and 2 orthopedic surgeons from Stanford University Medical Center (3 to 29 years of practice, average 12 years) labeled the internal validation set while blinded to the original reports and labels. The labels of these clinical experts were measured against the consensus reference standard labels of the MSK radiologists. The general radiologists were randomly split into 2 groups, with 4 radiologists in group 1 and 3 radiologists in group 2; the 2 orthopedic surgeons were also placed in group 1. Group 1 first reviewed the full validation set without model assistance, and group 2 reviewed the validation set with model assistance. For model-assisted exams, model predictions were provided as predicted probabilities of a positive diagnosis (e.g., 0.98 for ACL tear). After a 10-day washout period, group 1 then reviewed the validation set in a different order with model assistance, and group 2 reviewed the validation set without model assistance. The Stanford Institutional Review Board approved this study.
Statistical methods
Performance metrics for the model, general radiologists, and orthopedic surgeons included sensitivity, specificity, and accuracy. We also computed the micro-averages of these statistics over the general radiologists only and over all clinical experts (general radiologists and surgeons). We evaluated model performance with the area under the receiver operating characteristic curve (AUC). To assess the variability of the estimates, we provide 95% Wilson score confidence intervals [35] for sensitivity, specificity, and accuracy, and 95% DeLong confidence intervals for AUC [36,37]. A threshold of 0.5 was used to dichotomize the model predictions. Model performance on the external validation set was evaluated with AUC and 95% DeLong confidence intervals.
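As a Python stand-in for the reported R analysis, a 95% Wilson score interval for a dichotomized metric and the AUC could be computed as follows; the helper function is illustrative, and DeLong intervals for the AUC would require a separate implementation.

```python
# Illustrative evaluation helpers (Python analogue of the R analysis):
# Wilson score interval for sensitivity at a 0.5 threshold, plus AUC.
import numpy as np
from sklearn.metrics import roc_auc_score
from statsmodels.stats.proportion import proportion_confint

def sensitivity_with_wilson_ci(y_true, y_prob, threshold=0.5):
    """Sensitivity of dichotomized predictions with a 95% Wilson interval."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    positives = y_true == 1
    hits = int((y_pred[positives] == 1).sum())
    low, high = proportion_confint(hits, positives.sum(), alpha=0.05, method="wilson")
    return hits / positives.sum(), (low, high)

# auc = roc_auc_score(y_true, y_prob)  # DeLong CIs need a dedicated implementation
```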
Because we performed multiple comparisons in this study, both to evaluate the model's performance relative to that of practicing general radiologists and to assess the clinical utility of providing model assistance, we controlled the overall false discovery rate (FDR) at 0.05 [38] and report both unadjusted p-values and adjusted q-values. Roughly, an FDR < 0.05 can be interpreted as an expected proportion (0.05) of false claims of significance among all significant findings. Thus, instead of using the unadjusted p-value to assess statistical significance, a q-value < 0.05 properly accounts for these multiple comparisons. To evaluate the model's performance relative to that of general radiologists, we used a 2-sided Pearson's chi-squared test to determine whether there were significant differences in specificity, sensitivity, and accuracy between the model and the micro-average of the general radiologists. The orthopedic surgeons were not included in this comparison.
We evaluated the clinical utility of providing model predictions to clinical experts by testing whether the performance metrics of the 7 general radiologists and 2 orthopedic surgeons increased when they received model assistance. There is natural variability when a clinical expert evaluates the same knee MRI study at different times, so it would not be surprising for a clinical expert's performance metrics to be slightly better or worse when tested twice, regardless of model assistance. We therefore performed robust hypothesis tests to determine whether the clinical experts, as a group, showed a statistically significant improvement with model assistance. We used a 1-sided t test on the change (difference) in performance metrics for the 9 clinical experts for the 3 labels. To determine whether these findings depended specifically on improvement among the orthopedic surgeons, we performed a sensitivity analysis: we repeated the 1-sided t test on the change in performance metrics among the general radiologists only, excluding the orthopedic surgeons, to determine whether there was still a significant improvement.
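A hedged Python analogue of this testing procedure is sketched below (the paper's analysis was carried out in R with the qvalue package); Benjamini-Hochberg adjustment is used here as a stand-in for q-values.

```python
# Illustrative Python analogue of the hypothesis tests described above:
# a 1-sided t test on per-expert performance changes, then FDR adjustment.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def one_sided_improvement_pvalue(assisted, unassisted):
    """p-value for the hypothesis that the mean change (assisted - unassisted) > 0."""
    diffs = np.asarray(assisted, dtype=float) - np.asarray(unassisted, dtype=float)
    return stats.ttest_1samp(diffs, popmean=0.0, alternative="greater").pvalue

# Collect the unadjusted p-values from all comparisons, then control the FDR;
# Benjamini-Hochberg is used here in place of the q-value procedure.
# _, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```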
All statistical analyses were performed in the R environment for statistical computing [41], using the irr, pROC, binom, and qvalue packages [38,42–44], and the R code has been provided with the submission.
Results
Inter-rater agreement on the internal validation set among the 3 MSK radiologists, as measured by exact Fleiss kappa, was 0.508 for abnormality detection, 0.800 for ACL tear detection, and 0.745 for meniscal tear detection.
Model performance
For abnormality detection, ACL tear detection, and meniscal tear detection, the model achieved AUCs of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively (Fig 5). In detecting abnormalities, there were no significant differences in the performance metrics of the model and the general radiologists (Table 2). The model's specificity for abnormality detection was lower than the general radiologists' micro-average, at 0.714 (95% CI 0.500, 0.862) versus 0.844 (95% CI 0.776, 0.893). The model achieved a sensitivity of 0.879 (95% CI 0.800, 0.929) and accuracy of 0.850 (95% CI 0.775, 0.903), while the general radiologists achieved a sensitivity of 0.905 (95% CI 0.881, 0.924) and accuracy of 0.894 (95% CI 0.871, 0.913) (Table 2).
The model was highly specific in detecting ACL tears, achieving a specificity of 0.968 (95% CI 0.890, 0.991), which was higher than the general radiologists' micro-average of 0.933 (95% CI 0.906, 0.953), although this difference was not statistically significant (Table 2). The general radiologists achieved significantly higher sensitivity than the model in detecting ACL tears (p-value = 0.002, q-value = 0.019); the general radiologists' micro-averaged sensitivity was 0.906 (95% CI 0.874, 0.931), while the model achieved a sensitivity of 0.759 (95% CI 0.635, 0.850). The general radiologists also achieved significantly higher specificity in detecting meniscal tears (p-value = 0.003, q-value = 0.019), with a specificity of 0.892 (95% CI 0.858, 0.918) compared with the model's specificity of 0.741 (95% CI 0.616, 0.837). There were no other significant differences in performance metrics (Table 2). Table 2 presents summary performance metric estimates and confidence intervals, as well as individual performance metrics for the 7 board-certified general radiologists and 2 orthopedic surgeons in this study.
Clinical utility of model assistance
The clinical utility of providing model predictions to clinical experts during the labeling process is illustrated in Fig 6, with numerical values provided in Table 3. When the clinical experts received model assistance, there was a statistically significant increase in the clinical experts' specificity in identifying ACL tears (p-value < 0.001, q-value = 0.006). The mean increase in ACL specificity was 0.048 (4.8%), and since the validation set contained 62 exams that were negative for ACL tear, this increase in specificity in the optimal clinical setting would mean potentially 3 fewer patients sent to surgery for suspected ACL tear unnecessarily. Though it appeared that model assistance also significantly increased the clinical experts' accuracy in detecting ACL tears (p-value = 0.020) and sensitivity in detecting meniscus tears (p-value = 0.028), these findings were no longer significant after adjusting for multiple comparisons by controlling the FDR (q-values = 0.092 and 0.110, respectively). There were no other statistically significant improvements to clinical experts' performance with model assistance. Individual results, unadjusted p-values, and adjusted q-values are provided in S3 Table.
To determine whether the statistically significant improvement in specificity in identifying ACL tears with model assistance was dependent on the orthopedic surgeons’ performance metrics, we assessed the improvement of general radiologists only, excluding orthopedic surgeons. This sensitivity analysis confirmed that even among only general radiologists, there was a significant increase in specificity in identifying ACL tears (p-value = 0.003, q-value = 0.019; see S4 Table). Additionally, we computed Fleiss kappa for the 9 clinical experts with and without model assistance, and while we did not assess statistical significance, we observed that model assistance increased the Fleiss kappa measure of inter-rater reliability for all 3 tasks. With model assistance, the Fleiss kappa measure for abnormality detection increased from 0.571 to 0.640, for ACL tear detection it increased from 0.754 to 0.840, and for meniscal tear detection it increased from 0.526 to 0.621.
External validation
The MRNet trained on Stanford sagittal T2-weighted series and Stanford ACL tear labels achieved an AUC of 0.824 (95% CI 0.757, 0.892) on the Štajduhar et al. validation set with no additional training. Additionally, we trained 3 MRNets starting from ImageNet weights on the Štajduhar et al. training set with different random seeds. We selected the MRNet with the lowest average loss on the tuning set and then evaluated this model on the validation set. This model achieved an AUC of 0.911 (95% CI 0.864, 0.958) on the Štajduhar et al. validation set. Štajduhar et al. recorded an AUC of 0.894 for their best model, a semi-automated approach using support vector machines, although it was evaluated using a 10-fold cross-validation scheme [23]. MRNet took less than 30 minutes to train on and less than 2 minutes to evaluate the Štajduhar et al. dataset with an NVIDIA GeForce GTX 12GB GPU.
Discussion
The purpose of this study was to design and evaluate a deep learning model for classifying pathologies on knee MRI and to compare performance to human clinical experts both with and without model assistance during interpretation in a crossover design. Our results demonstrate that a deep learning approach can achieve high performance in clinical classification tasks on knee MR, with AUCs for abnormality detection, ACL tear detection, and meniscus tear detection of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively. Notably, the model achieved high specificity in detecting ACL tears on the internal validation set, which suggests that such a model, if used in the clinical workflow, may have the potential to effectively rule out ACL tears. On an external dataset using T1-weighted instead of T2-weighted series and a different labeling convention for ACL injury, the same ACL tear model achieved an AUC of 0.824 (95% CI 0.757, 0.892). Retraining on the external dataset improved the AUC to 0.911 (95% CI 0.864, 0.958). Our deep learning model achieved state-of-the-art results on the external dataset, but only after retraining. It remains to be seen if the model would better generalize to an external dataset with more MRI series and a more similar MRI protocol. We also found that providing the deep learning model predictions to human clinical experts as a diagnostic aid resulted in significantly higher specificities in identifying ACL tears. Finally, in contrast to the human experts, who required more than 3 hours on average to completely review 120 exams, the deep learning model provided all classifications in under 2 minutes. Our results suggest that deep learning can be successfully applied to advanced MSK MRI to generate rapid automated pathology classifications and that the output of the model may improve clinical interpretations.
There are many exciting potential applications of an automated deep learning model for knee MRI diagnosis in clinical practice. For example, the model described could be immediately applied for diagnostic worklist prioritization, wherein exams detected as abnormal could be moved ahead in the image interpretation workflow, and those identified as normal could be automatically assigned a preliminary reading of “normal.” With its high negative predictive value for abnormalities, the model could lead to quick preliminary feedback for patients whose exams come back as “normal.” Additionally, providing rapid results to the ordering clinician could improve disposition in other areas of the healthcare system. In this work we noticed that specificity for detecting ACL tears improved for both general radiologists and orthopedic surgeons, which implies that this model could help reduce unnecessary additional testing and surgery. Automated abnormality prediction and localization could help general radiologists or even non-radiologist clinicians (orthopedic surgeons) interpret medical imaging for patients at the point of care rather than waiting for specialized radiologist interpretation, which could aid in efficient interpretation, reduce errors, and help standardize quality of diagnoses when MSK specialist radiologists are not readily available. Ultimately, more studies are necessary to evaluate the optimal integration of this model and other deep learning models in the clinical setting. However, our results provide early support for a future where deep learning models may play a significant role in assisting clinicians and healthcare systems.
To examine the effect that a deep learning model may have on the interpretation performance of clinicians, our study deliberately recruited general radiologists to interpret knee MRI exams with and without model predictions. We found a statistically significant improvement in specificity for the ACL tear detection task with model assistance and, though not statistically significant, increased accuracy for ACL tear detection and increased sensitivity for meniscal tear detection. For both general radiologists and non-radiologist clinicians (orthopedic surgeons), we found improved sensitivity and/or specificity across all 3 tasks with model assistance (Fig 5; Table 3), although the group of surgeons was too small for formal analysis. Importantly, model assistance also resulted in higher inter-rater reliability among clinical experts for all 3 tasks, with higher Fleiss kappa measures with model assistance than without. To our knowledge, this is the first study to explore providing outputs of deep learning models to assist radiologists and non-radiologist clinicians in the task of image interpretation. More work will be needed to understand whether and how deep learning models could optimize the interpretation performance of practicing radiologists and non-radiologist clinicians.
A difficulty in deep learning for medical imaging is curating large datasets containing examples of the wide variety of abnormalities that can occur on a given imaging examination to train an accurate classifier, which is a strategy we employed for detecting ACL and meniscal tears. However, our other classification task was to distinguish “normal” from “abnormal” with the intention that if the model could learn the range of normal for a given population of knee MRI exams, then theoretically any abnormality, no matter how rare, could be detected by the model. An example is shown in Fig 3A of a relatively uncommon but serious complete rupture of the gastrocnemius tendon, which was correctly classified and localized as “abnormal” by the model, despite the fact that there were no other examples of this specific abnormality in the abnormal training data. It is possible that with a binary approach and enough “normal” training data, a model could detect any abnormality, no matter how uncommon. However, more work is needed to explore whether subtler abnormalities would require specific training data.
This study has limitations. Our validation set ground truth was not governed strictly by surgical confirmation in all cases. The deep learning model described was developed and trained on MRI data from 1 large academic institution. While MRNet performed well on the external validation set without additional training (AUC 0.824), we saw a substantial improvement (AUC 0.911) after training on the external dataset. This finding suggests that achieving optimal model performance may require additional model development using data more similar to what the model is likely to see in practice. More research is needed to determine if models trained on larger and multi-institutional datasets can achieve high performance without retraining. Power to detect statistically significant gains in clinical experts’ performance with model assistance was limited by the size of the panel, and a larger study that includes more clinical experts as well as more MRI exams may detect smaller gains in utility. Nevertheless, we have shown that even in this small set of clinical experts, providing model predictions significantly increased ACL tear detection specificity, even after correcting for multiple comparisons.
In conclusion, we developed a deep learning model that achieves high performance in clinical classification tasks on knee MRI and demonstrated the benefit, in a retrospective experiment, of providing model predictions to clinicians during the diagnostic imaging task. Future studies are needed to improve the performance and generalizability of deep learning models for MRI and to determine the effect of model assistance in the clinical setting.
References
1. Nacey NC, Geeslin MG, Miller GW, Pierce JL. Magnetic resonance imaging of the knee: an overview and update of conventional and state of the art imaging. J Magn Reson Imaging. 2017;45:1257–75. pmid:28211591
2. Naraghi AM, White LM. Imaging of athletic injuries of knee ligaments and menisci: sports imaging series. Radiology. 2016;281:23–40. pmid:27643766
3. Helms CA. Magnetic resonance imaging of the knee. In: Brant WE, Helms CA, editors. Fundamentals of diagnostic radiology. Philadelphia: Lippincott Williams & Wilkins; 2007. pp. 1193–204.
4. Oei EH, Nikken JJ, Verstijnen AC, Ginai AZ, Myriam Hunink MG. MR imaging of the menisci and cruciate ligaments: a systematic review. Radiology. 2003;226:837–48. pmid:12601211
5. Rangger C, Klestil T, Kathrein A, Inderster A, Hamid L. Influence of magnetic resonance imaging on indications for arthroscopy of the knee. Clin Orthop Relat Res. 1996;330:133–42.
6. Cheung LP, Li KC, Hollett MD, Bergman AG, Herfkens RJ. Meniscal tears of the knee: accuracy of detection with fast spin-echo MR imaging and arthroscopic correlation in 293 patients. Radiology. 1997;203:508–12. pmid:9114113
7. Mackenzie R, Palmer CR, Lomas DJ, Dixon AK. Magnetic resonance imaging of the knee: diagnostic performance studies. Clin Radiol. 1996;51:251–7. pmid:8617036
8. McNally EG, Nasser KN, Dawson S, Goh LA. Role of magnetic resonance imaging in the clinical management of the acutely locked knee. Skeletal Radiol. 2002;31:570–3. pmid:12324825
9. Feller JA, Webster KE. Clinical value of magnetic resonance imaging of the knee. ANZ J Surg. 2001;71:534–7. pmid:11527263
10. Elvenes J, Jerome CP, Reikerås O, Johansen O. Magnetic resonance imaging as a screening procedure to avoid arthroscopy for meniscal tears. Arch Orthop Trauma Surg. 2000;120:14–6. pmid:10653097
11. Crawford R, Walley G, Bridgman S, Maffulli N. Magnetic resonance imaging versus arthroscopy in the diagnosis of knee pathology, concentrating on meniscal lesions and ACL tears: a systematic review. Br Med Bull. 2007;84:5–23. pmid:17785279
12. Kim A, Khoury L, Schweitzer M, Jazrawi L, Ishak C, Meislin R, et al. Effect of specialty and experience on the interpretation of knee MRI scans. Bull NYU Hosp Jt Dis. 2008;66:272–5. pmid:19093902
13. Doi K. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput Med Imaging Graph. 2007;31:198–211. pmid:17349778
14. Oakden-Rayner L, Carneiro G, Bessen T, Nascimento JC, Bradley AP, Palmer LJ. Precision radiology: predicting longevity using feature engineering and deep learning methods in a radiomics framework. Sci Rep. 2017;7:1648. pmid:28490744
15. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. pmid:26017442
16. Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer vision—ECCV 2014. Berlin: Springer; 2014. pp. 818–33.
17. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115–8. pmid:28117445
18. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–10. pmid:27898976
19. Golan R, Jacob C, Denzinger J. Lung nodule detection in CT images using deep convolutional neural networks. 2016 International Joint Conference on Neural Networks; 2016 Jul 24–29; Vancouver, BC, Canada.
20. Litjens G, Kooi T, Bejnordi BE, Setio AA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. pmid:28778026
21. Prasoon A, Petersen K, Igel C, Lauze F, Dam E, Nielsen M. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In: Mori K, Sakuma I, Sato Y, Barillot C, Navab N, editors. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013. Berlin: Springer; 2013. pp. 246–253.
22. Liu F, Zhou Z, Samsonov A, Blankenbaker D, Larison W, Kanarek A, et al. Deep learning approach for evaluating knee MR images: achieving high diagnostic performance for cartilage lesion detection. Radiology. 2018;289:160–9. pmid:30063195
23. Štajduhar I, Mamula M, Miletić D, Ünal G. Semi-automated detection of anterior cruciate ligament injury from MRI. Comput Methods Programs Biomed. 2017;140:151–64. pmid:28254071
24. van Rossum G. Python 2.7.10 language reference. Wickford (UK): Samurai Media; 2015.
25. Mason D. SU-E-T-33: Pydicom: an open source DICOM library. Med Phys. 2011;38:3493.
26. Nyúl LG, Udupa JK. On standardizing the MR image intensity scale. Magn Reson Med. 1999;42:1072–81. pmid:10571928
27. van Rossum G, Drake FL. Python 3 reference manual. Paramount (CA): CreateSpace; 2009.
28. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, et al. Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA, US.
29. Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009 Jun 20–25; Miami, FL, US.
30. Zhou B, Khosla A, Lapedriza À, et al. Learning deep features for discriminative localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 26–Jul 1; Las Vegas, NV, US.
31. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
32. Hong SH, Choi JY, Lee GK, Choi JA, Chung HW, Kang HS. Grading of anterior cruciate ligament injury. Diagnostic efficacy of oblique coronal magnetic resonance imaging of the knee. J Comput Assist Tomogr. 2003;27:814–9. pmid:14501376
33. De Smet AA, Tuite MJ. Use of the 'two-slice-touch' rule for the MRI diagnosis of meniscal tears. AJR Am J Roentgenol. 2006;187:911–4. pmid:16985134
34. Nguyen JC, De Smet AA, Graf BK, Rosas HG. MR imaging-based diagnosis and classification of meniscal tears. Radiographics. 2014;34:981–99. pmid:25019436
35. Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22:209–12.
36. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–45. pmid:3203132
37. Sun X, Xu W. Fast implementation of DeLong's algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Process Lett. 2014;21:1389–93.
38. Storey JD, Dabney A, Robinson D. qvalue: Q-value estimation for false discovery rate control. GitHub. 2018 Mar 9 [cited 2018 Oct 26]. Available from: http://github.com/jdstorey/qvalue.
39. Conger AJ. Integration and generalization of kappas for multiple raters. Psychol Bull. 1980;88:322–8.
40. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378–82.
41. R Development Core Team. R: a language and environment for statistical computing. Version 3.4.2. Vienna: R Foundation for Statistical Computing; 2017 [cited 2018 Oct 26]. Available from: http://www.R-project.org/.
42. Gamer M, Lemon J, Singh IP. irr: various coefficients of interrater reliability and agreement. Version 0.84. Vienna: R Foundation for Statistical Computing; 2012 [cited 2018 Oct 26]. Available from: https://CRAN.R-project.org/package=irr.
43. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. pmid:21414208
44. Dorai-Raj S. binom: binomial confidence intervals for several parameterizations. Version 1.1–1. Vienna: R Foundation for Statistical Computing; 2014 [cited 2018 Oct 26]. Available from: https://CRAN.R-project.org/package=binom.