Abstract
Background
Magnetic resonance imaging (MRI) of the knee is the preferred method for diagnosing knee injuries. However, interpretation of knee MRI is time consuming and subject to diagnostic error and variability. An automated system for interpreting knee MRI could prioritize high-risk patients and assist clinicians in making diagnoses. Deep learning methods, in being able to automatically learn layers of features, are well suited for modeling the complex relationships between medical images and their interpretations. In this study we developed a deep learning model for detecting general abnormalities and specific diagnoses (anterior cruciate ligament [ACL] tears and meniscal tears) on knee MRI exams. We then measured the effect of providing the model's predictions to clinical experts during interpretation.
Methods and findings
Our dataset consisted of 1,370 knee MRI exams performed at Stanford University Medical Center between January 1, 2001, and December 31, 2012 (mean age 38.0 years; 569 [41.5%] female patients). The majority vote of 3 musculoskeletal radiologists established reference standard labels on an internal validation set of 120 exams. We developed MRNet, a convolutional neural network for classifying MRI series, and combined predictions from 3 series per exam using logistic regression. In detecting abnormalities, ACL tears, and meniscal tears, this model achieved areas under the receiver operating characteristic curve (AUC) of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively, on the internal validation set. We also obtained a public dataset of 917 exams with sagittal T1-weighted series and ACL injury labels from Clinical Hospital Centre Rijeka, Croatia. On the external validation set of 183 exams, the MRNet trained on Stanford sagittal T2-weighted series achieved an AUC of 0.824 (95% CI 0.757, 0.892) in detecting ACL injuries with no additional training, while an MRNet trained on the rest of the external data achieved an AUC of 0.911 (95% CI 0.864, 0.958). We additionally measured the specificity, sensitivity, and accuracy of 9 clinical experts (7 board-certified general radiologists and 2 orthopedic surgeons) on the internal validation set, both with and without model assistance. Using a 2-sided Pearson's chi-squared test with adjustment for multiple comparisons, we found no significant differences between the performance of the model and that of unassisted general radiologists in detecting abnormalities. General radiologists achieved significantly higher sensitivity in detecting ACL tears (p-value = 0.002; q-value = 0.019) and significantly higher specificity in detecting meniscal tears (p-value = 0.003; q-value = 0.019). Using a 1-sided t test on the change in performance metrics, we found that providing model predictions significantly increased clinical experts' specificity in identifying ACL tears (p-value < 0.001; q-value = 0.006). The primary limitations of our study include the lack of surgical ground truth and the small size of the panel of clinical experts.
Conclusions
Our deep learning model can rapidly generate accurate clinical pathology classifications of knee MRI exams from both internal and external datasets. Moreover, our results support the assertion that deep learning models can improve the performance of clinical experts during medical imaging interpretation. Further research is needed to validate the model prospectively and to determine its utility in the clinical setting.
Author summary
Why was this study done?
- We wanted to see if a deep learning model could perform the clinically important task of detecting disorders on knee magnetic resonance imaging (MRI) exams.
- We wanted to determine whether a deep learning model could improve the diagnostic accuracy, specificity, or sensitivity of clinical experts, including radiologists and orthopedic surgeons.
What did the researchers do and find?
- Our deep learning model predicts 3 outcomes on knee MRI (anterior cruciate ligament [ACL] tears, meniscal tears, and general abnormalities) in a matter of seconds, with performance similar to that of general radiologists.
- We experimented with providing the model's results to general radiologists and orthopedic surgeons during interpretation and observed a statistically significant improvement in the diagnosis of ACL tears with model assistance.
- When validated externally on a dataset from a different institution, the model detected ACL tears with high discriminative ability.
What do these results mean?
- Deep learning may enable rapid preliminary results following MRI scans and improve access to high-quality MRI diagnoses in the absence of specialist radiologists.
- Providing clinical experts with predictions from a deep learning model could improve the quality and consistency of MRI interpretation.
Citation: Bien N, Rajpurkar P, Ball RL, Irvin J, Park A, Jones E, et al. (2018) Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS Med 15(11): e1002699.
https://doi.org/10.1371/journal.pmed.1002699
Academic Editor: Suchi Saria, Johns Hopkins University, USA
Received: June 2, 2018; Accepted: October 23, 2018; Published: November 27, 2018
Copyright: © 2018 Bien et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data availability: The Stanford University Medical Center data used in this study are available at https://stanfordmlgroup.github.io/projects/MRNet to users who agree to a research use agreement for the dataset. The code to reproduce these results is provided as supporting information (S1 Code and S2 Code).
Funding: The authors have not received any specific funding for this work.
Competing interests: We have read the journal's policy and the authors of this manuscript have the following competing interests: CL is a shareholder of whiterabbit.ai and nines.ai. Since the submission of this manuscript, RLB has joined and received stock options from Roam Analytics, whose mission is to use AI methodologies to improve human health.
Abbreviations: ACL, anterior cruciate ligament; AUC, area under the receiver operating characteristic curve; CAM, class activation mapping; CNN, convolutional neural network; FDR, false discovery rate; MRI, magnetic resonance imaging; MSK, musculoskeletal; PD, proton density
Introduction
Magnetic resonance imaging (MRI) of the knee is the standard imaging modality for evaluating knee disorders, and more musculoskeletal (MSK) MRI examinations are performed on the knee than on any other region of the body [1–3]. MRI has repeatedly been shown to have high accuracy for the diagnosis of meniscal and cruciate ligament pathology [4–7] and is commonly used to identify those who might benefit from surgery [8–10]. Moreover, the negative predictive value of knee MRI is nearly 100%, making MRI an effective non-invasive method for ruling out surgical disorders such as anterior cruciate ligament (ACL) tears [11]. Because of the quantity and detail of the images in each knee MRI exam, accurate interpretation of knee MRI is a time-intensive task subject to inter- and intra-reviewer variability, even when performed by board-certified MSK radiologists [12]. An automated system for interpreting knee MRI images has many potential applications, such as rapid prioritization of high-risk patients in the radiologist's workflow and assistance in making diagnoses [13]. However, the multidimensional, multi-planar nature of MRI has so far limited the applicability of traditional image analysis techniques to knee MRI [13,14].
Deep learning approaches, which automatically learn layers of features, are well suited for modeling the complex relationships between medical images and their interpretations [15,16]. Recently, these approaches have surpassed traditional image analysis methods and have led to significant advances in medical imaging tasks, including skin cancer classification [17], diabetic retinopathy detection [18], and lung nodule detection [19]. Previous applications of deep learning to knee MRI have been limited to cartilage segmentation and cartilage lesion detection [20–22].
In this study, we present MRNet, a fully automated deep learning model for interpreting knee MRI, and compare the model's performance to that of general radiologists. In addition, we evaluate the change in diagnostic performance of clinical experts when the automated deep learning model's predictions are provided during interpretation. Finally, we evaluate our model's performance on a publicly available external knee MRI dataset labeled for ACL injury.
Methods
Dataset
Knee MRI reports from Stanford University Medical Center between January 1, 2001, and December 31, 2012, were manually reviewed to curate a dataset of 1,370 knee MRI exams. The dataset contained 1,104 (80.6%) abnormal exams, with 319 (23.3%) ACL tears and 508 (37.1%) meniscal tears; ACL tears and meniscal tears occurred concurrently in 194 (38.2%) exams. The most common indications for the knee MRI exams in this study included acute and chronic pain, follow-up or preoperative assessment, injury/trauma, and other/not provided. Exams were performed on GE scanners (GE Discovery, GE Healthcare, Waukesha, WI) with a standard knee MRI coil and a routine non-contrast knee MRI protocol that included the following sequences: coronal T1 weighted, coronal T2 with fat saturation, sagittal proton density (PD) weighted, sagittal T2 with fat saturation, and axial PD weighted with fat saturation. A total of 775 (56.6%) exams used a 3.0-T magnetic field; the remainder used a 1.5-T magnetic field. See S1 Table for detailed MRI sequence parameters. For this study, the sagittal plane T2-weighted series, coronal plane T1-weighted series, and axial plane PD-weighted series were extracted from each exam for use in the model. The number of images in these series ranged from 17 to 61 (mean 31.48, SD 7.97).
Exams were split into a training set (1,130 exams, 1,088 patients), a tuning set (120 exams, 111 patients), and a validation set (120 exams, 113 patients) (Fig 1). To form the validation and tuning sets, stratified random sampling was used to ensure that at least 50 positive examples of each label (abnormal, ACL tear, and meniscal tear) were present in each set. All exams from each patient were put in the same split, as illustrated in the sketch below. Table 1 contains patient pathology and demographic statistics for each dataset.
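As an illustration of patient-level splitting with a minimum-positives check, the sketch below uses scikit-learn's GroupShuffleSplit; the column names and the redraw strategy are assumptions for illustration, not the authors' code.

```python
# Illustrative patient-level split with a stratification check; column names
# ("patient_id", "abnormal", "acl", "meniscus") are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(exams: pd.DataFrame, held_out_fraction: float, seed: int):
    """Hold out a fraction of patients so all exams from one patient share a split."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=held_out_fraction, random_state=seed)
    train_idx, held_idx = next(splitter.split(exams, groups=exams["patient_id"]))
    return exams.iloc[train_idx], exams.iloc[held_idx]

def has_enough_positives(split: pd.DataFrame, minimum: int = 50) -> bool:
    """Check the constraint of at least 50 positive examples per label."""
    return bool((split[["abnormal", "acl", "meniscus"]].sum() >= minimum).all())

# One simple strategy: redraw splits with different seeds until the held-out
# set satisfies the constraint, approximating the stratified sampling above.
```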
Model
Preprocessing.
Images were extracted from Digital Imaging and Communications in Medicine (DICOM) files, scaled to 256 × 256 pixels, and converted to PNG (Portable Network Graphics) format using the Python programming language (version 2.7) [24] and the pydicom library (version 0.9.9) [25].
To account for variable pixel intensity scales within the MRI series, a histogram-based intensity standardization algorithm was applied to the images [26]. For each series, a representative intensity distribution was learned from a pool of training exams. The parameters of this distribution were then used to adjust the pixel intensities of exams in all datasets (training, tuning, and validation). Under this transformation, pixels with similar values correspond to similar tissue types. After intensity standardization, pixel values were clipped to between 0 and 255, the standard range for PNG images.
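A minimal preprocessing sketch is shown below; it assumes the modern pydicom API (`dcmread`) and a simplified percentile-landmark normalization in the spirit of Nyúl and Udupa [26]. The landmark percentiles and helper names are illustrative, not taken from the released code.

```python
# Illustrative preprocessing: read a DICOM slice, resize to 256 x 256, and
# apply a simplified histogram-landmark normalization, then clip to 0-255.
import numpy as np
import pydicom
from PIL import Image

LANDMARK_PCTS = np.array([1, 10, 25, 50, 75, 90, 99])  # assumed percentile landmarks

def load_slice(path: str) -> np.ndarray:
    """Read one DICOM file and resize its pixel array to 256 x 256."""
    arr = pydicom.dcmread(path).pixel_array.astype(np.float32)
    return np.array(Image.fromarray(arr).resize((256, 256)))

def fit_reference_landmarks(training_volumes) -> np.ndarray:
    """Average the percentile landmarks over a sample of training exams."""
    return np.mean([np.percentile(v, LANDMARK_PCTS) for v in training_volumes], axis=0)

def normalize(volume: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Map the volume's landmark intensities onto the reference landmarks
    (piecewise linear), then clip to the standard 8-bit PNG range."""
    own = np.percentile(volume, LANDMARK_PCTS)  # assumed strictly increasing
    mapped = np.interp(volume, own, reference)
    return np.clip(mapped, 0, 255).astype(np.uint8)
```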
MRNet.
The primary building block of our prediction system is MRNet, a convolutional neural network (CNN) that maps a 3-dimensional MRI series to a probability [15] (Fig 2). The input to MRNet has dimensions s × 3 × 256 × 256, where s is the number of images in the MRI series (3 is the number of color channels). First, each 2-dimensional MRI image slice was passed through an AlexNet-based feature extractor to obtain an s × 256 × 7 × 7 tensor containing features for each slice. A global average pooling layer was then applied to reduce these features to s × 256. We then applied max pooling across slices to obtain a 256-dimensional vector, which was passed to a fully connected layer and sigmoid activation function to obtain a prediction in the 0 to 1 range. We optimized the model using binary cross-entropy loss. To account for imbalanced class sizes across the tasks, the loss for an example was scaled inversely proportionally to the prevalence of that example's class in the dataset.
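Below is a minimal PyTorch sketch of this architecture (ImageNet-pretrained AlexNet features, global average pooling, cross-slice max pooling, and a fully connected layer with sigmoid output). It follows the description above but is not the authors' released implementation, and layer details may differ.

```python
# Minimal sketch of the MRNet architecture described above; not the authors'
# released code, and implementation details may differ.
import torch
import torch.nn as nn
from torchvision import models

class MRNet(nn.Module):
    def __init__(self):
        super().__init__()
        # AlexNet feature extractor initialized with ImageNet weights
        # (newer torchvision versions use the `weights=` argument instead).
        self.features = models.alexnet(pretrained=True).features
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling per slice
        self.classifier = nn.Linear(256, 1)  # fully connected layer -> 1 logit

    def forward(self, x):
        # x: one series of s slices, shape (s, 3, 256, 256), treated as a batch
        feats = self.features(x)              # (s, 256, 7, 7)
        pooled = self.gap(feats).flatten(1)   # (s, 256)
        exam_vec = pooled.max(dim=0).values   # (256,) max pooling across slices
        return torch.sigmoid(self.classifier(exam_vec))  # probability in [0, 1]
```

A weighted binary cross-entropy loss (for example, `nn.BCELoss` with per-example weights inversely proportional to class prevalence) would correspond to the loss scaling described above.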
During training, the gradient of the loss was computed on each training example using the backpropagation algorithm, and the MRNet parameters were adjusted in the direction opposite to the gradient [15]. Each training example underwent a random rotation between -25 and 25 degrees, a random shift between -25 and 25 pixels, and a horizontal flip with 50% probability each time it appeared during training. Model parameters were saved after each full pass through the training set, and the model with the lowest average loss on the tuning set was chosen for evaluation on the validation set. Fig 2 describes the MRNet architecture in more detail. Training each MRNet for 50 iterations over the training set took about 6 hours on average with an NVIDIA GeForce GTX 1070 8GB GPU. MRNet was implemented with Python 3.6.3 [27] and PyTorch 0.3.0 [28].
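The augmentation and checkpoint-selection scheme could be sketched as follows; the use of `torchvision.transforms.functional.affine`, the data-loading structure, and the helper names are assumptions for illustration rather than the released training code.

```python
# Sketch of per-example augmentation and tuning-set checkpoint selection,
# following the description above (illustrative, not the released code).
import copy
import random
import torch
from torchvision.transforms import functional as TF

def augment(series: torch.Tensor) -> torch.Tensor:
    """Random +/-25 degree rotation, +/-25 pixel shift, 50% horizontal flip,
    applied identically to all slices of one (s, 3, 256, 256) series."""
    angle = random.uniform(-25, 25)
    shift = [random.randint(-25, 25), random.randint(-25, 25)]
    out = TF.affine(series, angle=angle, translate=shift, scale=1.0, shear=[0.0])
    return TF.hflip(out) if random.random() < 0.5 else out

def train(model, optimizer, loss_fn, train_data, tune_data, epochs=50):
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for series, label in train_data:          # one exam at a time
            optimizer.zero_grad()
            loss = loss_fn(model(augment(series)), label)
            loss.backward()                       # backpropagation
            optimizer.step()
        # keep the checkpoint with the lowest mean loss on the tuning set
        model.eval()
        with torch.no_grad():
            tune_loss = sum(loss_fn(model(s), y).item() for s, y in tune_data) / len(tune_data)
        if tune_loss < best_loss:
            best_loss, best_state = tune_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```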
Training a CNN for image classification from scratch generally requires a dataset larger than 1,130 examples. For this reason, we initialized the weights of the AlexNet portion of MRNet to values optimized on the ImageNet database [29] of 1.2 million images in 1,000 classes, and then fine-tuned these weights to fit our MRI dataset. This allowed the earlier layers of the network, which are harder to optimize than later layers, to immediately recognize generic features such as lines and edges. This "transfer learning" approach has also been applied to skin cancer [17] and diabetic retinopathy [18] image datasets.
MRNet Interpretation.
To ensure that the MRNet models learned relevant features, we generated class activation mappings (CAMs) [30] (Fig 3). To generate a CAM for an image, we computed a weighted average over the 256 CNN feature maps using the weights of the classification layer to obtain a 7 × 7 image. The CAM was then mapped to a color scheme, upsampled to 256 × 256 pixels, and overlaid on the original input image. Because the parameters of the final layer of the network are used to weight the feature maps, more predictive feature maps appear brighter. Thus, the brightest areas of the CAMs are the regions that most influence the model's prediction.
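A minimal sketch of the CAM computation for a single slice is shown below, assuming the MRNet sketch given earlier; the bilinear upsampling and min-max scaling are illustrative choices for producing an overlay.

```python
# Illustrative CAM computation for one slice, following Zhou et al. [30];
# assumes the MRNet sketch above (model.features, model.classifier).
import torch
import torch.nn.functional as F

def compute_cam(model, image: torch.Tensor) -> torch.Tensor:
    """Return a 256 x 256 heat map for a single (3, 256, 256) slice."""
    with torch.no_grad():
        fmaps = model.features(image.unsqueeze(0))[0]       # (256, 7, 7)
        weights = model.classifier.weight.squeeze(0)        # (256,)
        cam = (weights[:, None, None] * fmaps).sum(dim=0)   # weighted sum -> (7, 7)
        cam = F.interpolate(cam[None, None], size=(256, 256),
                            mode="bilinear", align_corners=False)[0, 0]
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    return cam

# The scaled map can then be colorized (e.g., with a matplotlib colormap)
# and alpha-blended onto the original input image.
```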
Combining MRNet predictions.
Using the predictions from the sagittal T2, coronal T1, and axial PD MRNets on the training set, along with the corresponding original labels, we trained a logistic regression to weight the predictions from the 3 series and generate a single output for each exam (Fig 4). The most important series for each task, as determined by the fitted logistic regression coefficients, were axial PD for abnormality and meniscal tear detection and coronal T1 for ACL tear detection. After training, the logistic regression was applied to the predictions of the 3 MRNets on the internal validation set to obtain the final predictions. We trained 3 logistic regression models in total, 1 for each task (abnormality, ACL tear, and meniscal tear detection). These models were implemented in Python [24] using the scikit-learn package [31]. For external validation, since there was only 1 series in the dataset, we used the prediction of a single MRNet directly as the final output.
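Combining the 3 series-level probabilities with a logistic regression might look like the sketch below (scikit-learn [31]); the function names and array layout are assumptions for illustration.

```python
# Illustrative combination of per-series MRNet probabilities with a logistic
# regression, one model per task; not the authors' exact code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_combiner(p_sagittal_t2, p_coronal_t1, p_axial_pd, labels):
    """Fit a logistic regression on stacked per-series predicted probabilities."""
    X = np.column_stack([p_sagittal_t2, p_coronal_t1, p_axial_pd])
    return LogisticRegression().fit(X, labels)

def combine(combiner, p_sagittal_t2, p_coronal_t1, p_axial_pd):
    """Produce one probability per exam from the 3 series-level probabilities."""
    X = np.column_stack([p_sagittal_t2, p_coronal_t1, p_axial_pd])
    return combiner.predict_proba(X)[:, 1]

# combiner.coef_ shows which series carries the most weight for a given task,
# mirroring the coefficient-based observation described above.
```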
Evaluation
Reference standard labels for the internal validation set were obtained by the majority vote of 3 board-certified MSK radiologists in a large academic practice (6 to 19 years of practice, average 12 years). The MSK radiologists had access to all DICOM series, the original report, clinical history, and follow-up exams during interpretation. All readers participating in the study used a clinical picture archiving and communication system (PACS) environment (GE Centricity) in a diagnostic reading room, and evaluation was performed on the clinical DICOM images presented on a medical-grade display of at least 3 megapixels, with a minimum luminance of 1 cd/m², maximum luminance of 400 cd/m², pixel size of 0.2 mm, and native resolution of 1,500 × 2,000 pixels. Exams were sorted in reverse chronological order. Three binary labels were assigned to each exam, indicating the presence or absence of (1) any abnormality, (2) an ACL tear, and (3) a meniscal tear. The definitions for the labels were as follows:
- Abnormality: normal (all reviewed images free of abnormalities) or abnormal (abnormal findings in the internal validation set that were not an ACL tear or a meniscal tear included osteoarthritis, effusion, iliotibial band syndrome, posterior cruciate ligament tear, fracture, contusion, plica, and medial collateral ligament sprain);
- ACL: intact (normal, mucoid degeneration, ganglion cyst, sprain) or torn (low-grade partial tear with <50% of fibers torn, high-grade partial tear with >50% of fibers torn, complete tear) [32];
- Meniscus: intact (normal, degenerative changes without tear, post-surgical changes without tear) or torn (increased signal reaching the articular surface on at least 2 slices, or morphologic deformity) [33,34].
Independently of the MSK radiologists, 7 general radiologists and 2 orthopedic surgeons from Stanford University Medical Center (3 to 29 years of practice, average 12 years) labeled the internal validation set while blinded to the original reports and labels. The labels of these clinical experts were measured against the consensus reference standard labels of the MSK radiologists. The general radiologists were randomly split into 2 groups, with 4 radiologists in group 1 and 3 radiologists in group 2; the 2 orthopedic surgeons were also placed in group 1. Group 1 first reviewed the full validation set without model assistance, and group 2 reviewed the validation set with model assistance. For model-assisted exams, model predictions were provided as predicted probabilities of a positive diagnosis (e.g., 0.98 for ACL tear). After a 10-day washout period, group 1 then reviewed the validation set in a different order with model assistance, and group 2 reviewed the validation set without model assistance. The Stanford Institutional Review Board approved this study.
Statistical methods
Performance metrics for the model, general radiologists, and orthopedic surgeons included sensitivity, specificity, and accuracy. We also computed the micro-averages of these statistics over the general radiologists only and over all clinical experts (general radiologists and surgeons). We evaluated model performance with the area under the receiver operating characteristic curve (AUC). To assess the variability of the estimates, we provide 95% Wilson score confidence intervals [35] for sensitivity, specificity, and accuracy, and 95% DeLong confidence intervals for AUC [36,37]. A threshold of 0.5 was used to dichotomize the model predictions. Model performance on the external validation set was evaluated with AUC and 95% DeLong confidence intervals.
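As a Python stand-in for the reported R analysis, a 95% Wilson score interval for a dichotomized metric and the AUC could be computed as follows; the helper function is illustrative, and DeLong intervals for the AUC would require a separate implementation.

```python
# Illustrative evaluation helpers (Python analogue of the R analysis):
# Wilson score interval for sensitivity at a 0.5 threshold, plus AUC.
import numpy as np
from sklearn.metrics import roc_auc_score
from statsmodels.stats.proportion import proportion_confint

def sensitivity_with_wilson_ci(y_true, y_prob, threshold=0.5):
    """Sensitivity of dichotomized predictions with a 95% Wilson interval."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    positives = y_true == 1
    hits = int((y_pred[positives] == 1).sum())
    low, high = proportion_confint(hits, positives.sum(), alpha=0.05, method="wilson")
    return hits / positives.sum(), (low, high)

# auc = roc_auc_score(y_true, y_prob)  # DeLong CIs need a dedicated implementation
```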
Because we performed multiple comparisons in this study, both to evaluate the model's performance relative to that of practicing general radiologists and to assess the clinical utility of providing model assistance, we controlled the overall false discovery rate (FDR) at 0.05 [38] and report both unadjusted p-values and adjusted q-values. Roughly, an FDR < 0.05 can be interpreted as an expected proportion (0.05) of false claims of significance among all significant findings. Thus, instead of using the unadjusted p-value to assess statistical significance, a q-value < 0.05 properly accounts for these multiple comparisons. To evaluate the model's performance relative to that of general radiologists, we used a 2-sided Pearson's chi-squared test to determine whether there were significant differences in specificity, sensitivity, and accuracy between the model and the micro-average of the general radiologists. The orthopedic surgeons were not included in this comparison.
We evaluated the clinical utility of providing model predictions to clinical experts by testing whether the performance metrics of the 7 general radiologists and 2 orthopedic surgeons increased when they received model assistance. There is natural variability when a clinical expert evaluates the same knee MRI study at different times, so it would not be surprising for a clinical expert's performance metrics to be slightly better or worse when tested twice, regardless of model assistance. We therefore performed robust hypothesis tests to determine whether the clinical experts, as a group, showed a statistically significant improvement with model assistance. We used a 1-sided t test on the change (difference) in performance metrics for the 9 clinical experts for the 3 labels. To determine whether these findings depended specifically on improvement among the orthopedic surgeons, we performed a sensitivity analysis: we repeated the 1-sided t test on the change in performance metrics among the general radiologists only, excluding the orthopedic surgeons, to determine whether there was still a significant improvement.
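A hedged Python analogue of this testing procedure is sketched below (the paper's analysis was carried out in R with the qvalue package); Benjamini-Hochberg adjustment is used here as a stand-in for q-values.

```python
# Illustrative Python analogue of the hypothesis tests described above:
# a 1-sided t test on per-expert performance changes, then FDR adjustment.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def one_sided_improvement_pvalue(assisted, unassisted):
    """p-value for the hypothesis that the mean change (assisted - unassisted) > 0."""
    diffs = np.asarray(assisted, dtype=float) - np.asarray(unassisted, dtype=float)
    return stats.ttest_1samp(diffs, popmean=0.0, alternative="greater").pvalue

# Collect the unadjusted p-values from all comparisons, then control the FDR;
# Benjamini-Hochberg is used here in place of the q-value procedure.
# _, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```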
All statistical analyses were performed in the R environment for statistical computing [41], using the irr, pROC, binom, and qvalue packages [38,42–44], and the R code has been provided with the submission.
Results
Inter-rater agreement on the internal validation set among the 3 MSK radiologists, as measured by exact Fleiss kappa, was 0.508 for abnormality detection, 0.800 for ACL tear detection, and 0.745 for meniscal tear detection.
Model performance
For abnormality detection, ACL tear detection, and meniscal tear detection, the model achieved AUCs of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively (Fig 5). In detecting abnormalities, there were no significant differences in the performance metrics of the model and the general radiologists (Table 2). The model's specificity for abnormality detection was lower than the general radiologists' micro-average, at 0.714 (95% CI 0.500, 0.862) versus 0.844 (95% CI 0.776, 0.893). The model achieved a sensitivity of 0.879 (95% CI 0.800, 0.929) and accuracy of 0.850 (95% CI 0.775, 0.903), while the general radiologists achieved a sensitivity of 0.905 (95% CI 0.881, 0.924) and accuracy of 0.894 (95% CI 0.871, 0.913) (Table 2).
The model was highly specific in detecting ACL tears, achieving a specificity of 0.968 (95% CI 0.890, 0.991), which was higher than the general radiologists' micro-average of 0.933 (95% CI 0.906, 0.953), although this difference was not statistically significant (Table 2). The general radiologists achieved significantly higher sensitivity than the model in detecting ACL tears (p-value = 0.002, q-value = 0.019); the general radiologists' micro-averaged sensitivity was 0.906 (95% CI 0.874, 0.931), while the model achieved a sensitivity of 0.759 (95% CI 0.635, 0.850). The general radiologists also achieved significantly higher specificity in detecting meniscal tears (p-value = 0.003, q-value = 0.019), with a specificity of 0.892 (95% CI 0.858, 0.918) compared with the model's specificity of 0.741 (95% CI 0.616, 0.837). There were no other significant differences in performance metrics (Table 2). Table 2 presents summary performance metric estimates and confidence intervals, as well as individual performance metrics for the 7 board-certified general radiologists and 2 orthopedic surgeons in this study.
Clinical utility of model assistance
The clinical utility of providing model predictions to clinical experts during the labeling process is illustrated in Fig 6, with numerical values provided in Table 3. When the clinical experts received model assistance, there was a statistically significant increase in the clinical experts' specificity in identifying ACL tears (p-value < 0.001, q-value = 0.006). The mean increase in ACL specificity was 0.048 (4.8%), and since the validation set contained 62 exams that were negative for ACL tear, this increase in specificity in the optimal clinical setting would mean potentially 3 fewer patients sent to surgery for suspected ACL tear unnecessarily. Though it appeared that model assistance also significantly increased the clinical experts' accuracy in detecting ACL tears (p-value = 0.020) and sensitivity in detecting meniscus tears (p-value = 0.028), these findings were no longer significant after adjusting for multiple comparisons by controlling the FDR (q-values = 0.092 and 0.110, respectively). There were no other statistically significant improvements to clinical experts' performance with model assistance. Individual results, unadjusted p-values, and adjusted q-values are provided in S3 Table.
To determine whether the statistically significant improvement in specificity in identifying ACL tears with model assistance was dependent on the orthopedic surgeons’ performance metrics, we assessed the improvement of general radiologists only, excluding orthopedic surgeons. This sensitivity analysis confirmed that even among only general radiologists, there was a significant increase in specificity in identifying ACL tears (p-value = 0.003, q-value = 0.019; see S4 Table). Additionally, we computed Fleiss kappa for the 9 clinical experts with and without model assistance, and while we did not assess statistical significance, we observed that model assistance increased the Fleiss kappa measure of inter-rater reliability for all 3 tasks. With model assistance, the Fleiss kappa measure for abnormality detection increased from 0.571 to 0.640, for ACL tear detection it increased from 0.754 to 0.840, and for meniscal tear detection it increased from 0.526 to 0.621.
External validation
The MRNet trained on Stanford sagittal T2-weighted series and Stanford ACL tear labels achieved an AUC of 0.824 (95% CI 0.757, 0.892) on the Štajduhar et al. validation set with no additional training. Additionally, we trained 3 MRNets starting from ImageNet weights on the Štajduhar et al. training set with different random seeds. We selected the MRNet with the lowest average loss on the tuning set and then evaluated this model on the validation set. This model achieved an AUC of 0.911 (95% CI 0.864, 0.958) on the Štajduhar et al. validation set. Štajduhar et al. recorded an AUC of 0.894 for their best model, a semi-automated approach using support vector machines, although it was evaluated using a 10-fold cross-validation scheme [23]. MRNet took less than 30 minutes to train on and less than 2 minutes to evaluate the Štajduhar et al. dataset with an NVIDIA GeForce GTX 12GB GPU.
Discussion
The purpose of this study was to design and evaluate a deep learning model for classifying pathologies on knee MRI and to compare performance to human clinical experts both with and without model assistance during interpretation in a crossover design. Our results demonstrate that a deep learning approach can achieve high performance in clinical classification tasks on knee MR, with AUCs for abnormality detection, ACL tear detection, and meniscus tear detection of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively. Notably, the model achieved high specificity in detecting ACL tears on the internal validation set, which suggests that such a model, if used in the clinical workflow, may have the potential to effectively rule out ACL tears. On an external dataset using T1-weighted instead of T2-weighted series and a different labeling convention for ACL injury, the same ACL tear model achieved an AUC of 0.824 (95% CI 0.757, 0.892). Retraining on the external dataset improved the AUC to 0.911 (95% CI 0.864, 0.958). Our deep learning model achieved state-of-the-art results on the external dataset, but only after retraining. It remains to be seen if the model would better generalize to an external dataset with more MRI series and a more similar MRI protocol. We also found that providing the deep learning model predictions to human clinical experts as a diagnostic aid resulted in significantly higher specificities in identifying ACL tears. Finally, in contrast to the human experts, who required more than 3 hours on average to completely review 120 exams, the deep learning model provided all classifications in under 2 minutes. Our results suggest that deep learning can be successfully applied to advanced MSK MRI to generate rapid automated pathology classifications and that the output of the model may improve clinical interpretations.
There are many exciting potential applications of an automated deep learning model for knee MRI diagnosis in clinical practice. For example, the model described could be immediately applied for diagnostic worklist prioritization, wherein exams detected as abnormal could be moved ahead in the image interpretation workflow, and those identified as normal could be automatically assigned a preliminary reading of “normal.” With its high negative predictive value for abnormalities, the model could lead to quick preliminary feedback for patients whose exams come back as “normal.” Additionally, providing rapid results to the ordering clinician could improve disposition in other areas of the healthcare system. In this work we noticed that specificity for detecting ACL tears improved for both general radiologists and orthopedic surgeons, which implies that this model could help reduce unnecessary additional testing and surgery. Automated abnormality prediction and localization could help general radiologists or even non-radiologist clinicians (orthopedic surgeons) interpret medical imaging for patients at the point of care rather than waiting for specialized radiologist interpretation, which could aid in efficient interpretation, reduce errors, and help standardize quality of diagnoses when MSK specialist radiologists are not readily available. Ultimately, more studies are necessary to evaluate the optimal integration of this model and other deep learning models in the clinical setting. However, our results provide early support for a future where deep learning models may play a significant role in assisting clinicians and healthcare systems.
To examine the effect that a deep learning model may have on the interpretation performance of clinicians, our study deliberately recruited general radiologists to interpret knee MRI exams with and without model predictions. We found a statistically significant improvement in specificity for the ACL tear detection task with model assistance and, though not statistically significant, increased accuracy for ACL tear detection and increased sensitivity for meniscal tear detection. For both general radiologists and non-radiologist clinicians (orthopedic surgeons), we found improved sensitivity and/or specificity across all 3 tasks with model assistance (Fig 5; Table 3), although the group of surgeons was too small for formal analysis. Importantly, model assistance also resulted in higher inter-rater reliability among clinical experts for all 3 tasks, with higher Fleiss kappa measures with model assistance than without. To our knowledge, this is the first study to explore providing outputs of deep learning models to assist radiologists and non-radiologist clinicians in the task of image interpretation. More work will be needed to understand whether and how deep learning models could optimize the interpretation performance of practicing radiologists and non-radiologist clinicians.
A difficulty in deep learning for medical imaging is curating large datasets containing examples of the wide variety of abnormalities that can occur on a given imaging examination to train an accurate classifier, which is a strategy we employed for detecting ACL and meniscal tears. However, our other classification task was to distinguish “normal” from “abnormal” with the intention that if the model could learn the range of normal for a given population of knee MRI exams, then theoretically any abnormality, no matter how rare, could be detected by the model. An example is shown in Fig 3A of a relatively uncommon but serious complete rupture of the gastrocnemius tendon, which was correctly classified and localized as “abnormal” by the model, despite the fact that there were no other examples of this specific abnormality in the abnormal training data. It is possible that with a binary approach and enough “normal” training data, a model could detect any abnormality, no matter how uncommon. However, more work is needed to explore whether subtler abnormalities would require specific training data.
This study has limitations. Our validation set ground truth was not governed strictly by surgical confirmation in all cases. The deep learning model described was developed and trained on MRI data from 1 large academic institution. While MRNet performed well on the external validation set without additional training (AUC 0.824), we saw a substantial improvement (AUC 0.911) after training on the external dataset. This finding suggests that achieving optimal model performance may require additional model development using data more similar to what the model is likely to see in practice. More research is needed to determine if models trained on larger and multi-institutional datasets can achieve high performance without retraining. Power to detect statistically significant gains in clinical experts’ performance with model assistance was limited by the size of the panel, and a larger study that includes more clinical experts as well as more MRI exams may detect smaller gains in utility. Nevertheless, we have shown that even in this small set of clinical experts, providing model predictions significantly increased ACL tear detection specificity, even after correcting for multiple comparisons.
In conclusion, we developed a deep learning model that achieves high performance in clinical classification tasks on knee MRI and demonstrated the benefit, in a retrospective experiment, of providing model predictions to clinicians during the diagnostic imaging task. Future studies are needed to improve the performance and generalizability of deep learning models for MRI and to determine the effect of model assistance in the clinical setting.
References
1. Nacey NC, Geeslin MG, Miller GW, Pierce JL. Magnetic resonance imaging of the knee: an overview and update of conventional and state of the art imaging. J Magn Reson Imaging. 2017;45:1257–75. pmid:28211591
2. Naraghi AM, White LM. Imaging of athletic injuries of knee ligaments and menisci: sports imaging series. Radiology. 2016;281:23–40. pmid:27643766
3. Helms CA. Magnetic resonance imaging of the knee. In: Brant WE, Helms CA, editors. Fundamentals of diagnostic radiology. Philadelphia: Lippincott Williams & Wilkins; 2007. pp. 1193–204.
4. Oei EH, Nikken JJ, Verstijnen AC, Ginai AZ, Myriam Hunink MG. MR imaging of the menisci and cruciate ligaments: a systematic review. Radiology. 2003;226:837–48. pmid:12601211
5. Rangger C, Klestil T, Kathrein A, Inderster A, Hamid L. Influence of magnetic resonance imaging on indications for arthroscopy of the knee. Clin Orthop Relat Res. 1996;330:133–42.
6. Cheung LP, Li KC, Hollett MD, Bergman AG, Herfkens RJ. Meniscal tears of the knee: accuracy of detection with fast spin-echo MR imaging and arthroscopic correlation in 293 patients. Radiology. 1997;203:508–12. pmid:9114113
7. Mackenzie R, Palmer CR, Lomas DJ, Dixon AK. Magnetic resonance imaging of the knee: diagnostic performance studies. Clin Radiol. 1996;51:251–7. pmid:8617036
8. McNally EG, Nasser KN, Dawson S, Goh LA. Role of magnetic resonance imaging in the clinical management of the acutely locked knee. Skeletal Radiol. 2002;31:570–3. pmid:12324825
9. Feller JA, Webster KE. Clinical value of magnetic resonance imaging of the knee. ANZ J Surg. 2001;71:534–7. pmid:11527263
10. Elvenes J, Jerome CP, Reikerås O, Johansen O. Magnetic resonance imaging as a screening procedure to avoid arthroscopy for meniscal tears. Arch Orthop Trauma Surg. 2000;120:14–6. pmid:10653097
11. Crawford R, Walley G, Bridgman S, Maffulli N. Magnetic resonance imaging versus arthroscopy in the diagnosis of knee pathology, concentrating on meniscal lesions and ACL tears: a systematic review. Br Med Bull. 2007;84:5–23. pmid:17785279
12. Kim A, Khoury L, Schweitzer M, Jazrawi L, Ishak C, Meislin R, et al. Effect of specialty and experience on the interpretation of knee MRI scans. Bull NYU Hosp Jt Dis. 2008;66:272–5. pmid:19093902
13. Doi K. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput Med Imaging Graph. 2007;31:198–211. pmid:17349778
14. Oakden-Rayner L, Carneiro G, Bessen T, Nascimento JC, Bradley AP, Palmer LJ. Precision radiology: predicting longevity using feature engineering and deep learning methods in a radiomics framework. Sci Rep. 2017;7:1648. pmid:28490744
15. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. pmid:26017442
16. Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer vision—ECCV 2014. Berlin: Springer; 2014. pp. 818–33.
17. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115–8. pmid:28117445
18. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–10. pmid:27898976
19. Golan R, Jacob C, Denzinger J. Lung nodule detection in CT images using deep convolutional neural networks. 2016 International Joint Conference on Neural Networks; 2016 Jul 24–29; Vancouver, BC, Canada.
20. Litjens G, Kooi T, Bejnordi BE, Setio AA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. pmid:28778026
21. Prasoon A, Petersen K, Igel C, Lauze F, Dam E, Nielsen M. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In: Mori K, Sakuma I, Sato Y, Barillot C, Navab N, editors. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013. Berlin: Springer; 2013. pp. 246–253.
22. Liu F, Zhou Z, Samsonov A, Blankenbaker D, Larison W, Kanarek A, et al. Deep learning approach for evaluating knee MR images: achieving high diagnostic performance for cartilage lesion detection. Radiology. 2018;289:160–9. pmid:30063195
23. Štajduhar I, Mamula M, Miletić D, Ünal G. Semi-automated detection of anterior cruciate ligament injury from MRI. Comput Methods Programs Biomed. 2017;140:151–64. pmid:28254071
24. van Rossum G. Python 2.7.10 language reference. Wickford (UK): Samurai Media; 2015.
25. Mason D. SU-E-T-33: Pydicom: an open source DICOM library. Med Phys. 2011;38:3493.
26. Nyúl LG, Udupa JK. On standardizing the MR image intensity scale. Magn Reson Med. 1999;42:1072–81. pmid:10571928
27. van Rossum G, Drake FL. Python 3 reference manual. Paramount (CA): CreateSpace; 2009.
28. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, et al. Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA, US.
29. Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009 Jun 20–25; Miami, FL, US.
30. Zhou B, Khosla A, Lapedriza À, et al. Learning deep features for discriminative localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 26–Jul 1; Las Vegas, NV, US.
31. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
32. Hong SH, Choi JY, Lee GK, Choi JA, Chung HW, Kang HS. Grading of anterior cruciate ligament injury. Diagnostic efficacy of oblique coronal magnetic resonance imaging of the knee. J Comput Assist Tomogr. 2003;27:814–9. pmid:14501376
33. De Smet AA, Tuite MJ. Use of the 'two-slice-touch' rule for the MRI diagnosis of meniscal tears. AJR Am J Roentgenol. 2006;187:911–4. pmid:16985134
34. Nguyen JC, De Smet AA, Graf BK, Rosas HG. MR imaging-based diagnosis and classification of meniscal tears. Radiographics. 2014;34:981–99. pmid:25019436
35. Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22:209–12.
36. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–45. pmid:3203132
37. Sun X, Xu W. Fast implementation of DeLong's algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Process Lett. 2014;21:1389–93.
38. Storey JD, Dabney A, Robinson D. qvalue: Q-value estimation for false discovery rate control. GitHub. 2018 Mar 9 [cited 2018 Oct 26]. Available from: http://github.com/jdstorey/qvalue.
39. Conger AJ. Integration and generalization of kappas for multiple raters. Psychol Bull. 1980;88:322–8.
40. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378–82.
41. R Development Core Team. R: a language and environment for statistical computing. Version 3.4.2. Vienna: R Foundation for Statistical Computing; 2017 [cited 2018 Oct 26]. Available from: http://www.R-project.org/.
42. Gamer M, Lemon J, Singh IP. irr: various coefficients of interrater reliability and agreement. Version 0.84. Vienna: R Foundation for Statistical Computing; 2012 [cited 2018 Oct 26]. Available from: https://CRAN.R-project.org/package=irr.
43. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. pmid:21414208
44. Dorai-Raj S. binom: binomial confidence intervals for several parameterizations. Version 1.1–1. Vienna: R Foundation for Statistical Computing; 2014 [cited 2018 Oct 26]. Available from: https://CRAN.R-project.org/package=binom.