Development and retrospective validation of MRNet




Abstract

Methods and results

Our dataset consisted of 1,370 knee MRI exams performed at Stanford University Medical Center between January 1, 2001, and December 31, 2012 (mean age 38.0 years; 569 [41.5%] female patients). The majority vote of 3 musculoskeletal radiologists established reference standard labels on an internal validation set of 120 exams. We developed MRNet, a convolutional neural network that classifies MRI series, and combined the predictions from the 3 series of each exam using logistic regression. In detecting abnormalities, ACL tears, and meniscal tears, this model achieved area under the receiver operating characteristic curve (AUC) values of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively, on the internal validation set. We also obtained a public dataset of 917 exams with sagittal T1-weighted series and labels for ACL injury from Clinical Hospital Centre Rijeka, Croatia. On the external validation set of 183 exams, the MRNet trained on Stanford sagittal T2-weighted series achieved an AUC of 0.824 (95% CI 0.757, 0.892) in detecting ACL injuries with no additional training, while an MRNet trained on the rest of the external data achieved an AUC of 0.911 (95% CI 0.864, 0.958). We also measured the specificity, sensitivity, and accuracy of 9 clinical experts (7 board-certified general radiologists and 2 orthopedic surgeons) on the internal validation set, with and without model assistance. Using a 2-sided Pearson chi-squared test with adjustment for multiple comparisons, we found no significant differences between the performance of the model and that of unassisted general radiologists in detecting abnormalities. General radiologists achieved significantly higher sensitivity in detecting ACL tears (p-value = 0.002; q-value = 0.019) and significantly higher specificity in detecting meniscal tears (p-value = 0.003; q-value = 0.019). Using a 1-sided t test on the change in performance metrics, we found that providing model predictions significantly increased the clinical experts' specificity in identifying ACL tears (p-value < 0.001; q-value = 0.006). The primary limitations of our study are the lack of surgical ground truth and the small size of the panel of clinical experts.

Introduction

Magnetic resonance imaging (MRI) of the knee is the standard imaging modality for evaluating knee disorders, and more musculoskeletal (MSK) MRI examinations are performed on the knee than on any other body region [1–3]. MRI has repeatedly demonstrated high accuracy for the diagnosis of meniscal and cruciate ligament pathology [4–7] and is commonly used to identify patients who might benefit from surgery [8–10]. Moreover, the negative predictive value of knee MRI is nearly 100%, making MRI a non-invasive way to rule out surgical disorders such as anterior cruciate ligament (ACL) tears [11]. Because of the number and detail of the images in each knee MRI exam, accurate interpretation of knee MRI is a time-intensive task subject to inter- and intra-reviewer variability, even when performed by board-certified MSK radiologists [12]. An automated system for interpreting knee MRI has many potential applications, such as quickly prioritizing high-risk patients in the radiologist's workflow and assisting radiologists in making diagnoses [13]. However, the multidimensional, multi-planar nature of MRI has thus far limited the applicability of traditional image analysis methods to knee MRI [13,14].

Deep learning approaches, which automatically learn layers of features, are well suited to modeling the complex relationships between medical images and their interpretations [15,16]. Recently, these approaches have surpassed traditional image analysis methods and led to significant advances in medical imaging tasks, including skin cancer classification [17], diabetic retinopathy detection [18], and pulmonary nodule detection [19]. Previous applications of deep learning to knee MRI have been limited to cartilage segmentation and cartilage lesion detection [20–22].

In this study, we present MRNet, a fully automated deep learning model for interpreting knee MRI, and compare the performance of this model to that of general radiologists. In addition, we evaluate the change in diagnostic performance of clinical experts when predictions from the deep learning model are provided during interpretation. Finally, we evaluate the performance of our model on a publicly available external knee MRI dataset labeled for ACL injury.

Methods

Dataset

Knee MRI reports from Stanford University Medical Center between January 1, 2001, and December 31, 2012 were manually reviewed to curate a dataset of 1,370 knee MRI exams. The dataset contained 1,104 (80.6%) abnormal exams, with 319 (23.3%) ACL tears and 508 (37.1%) meniscal tears; ACL tears and meniscal tears occurred concurrently in 194 exams (38.2%). The most common indications for the knee MRI exams in this study included acute and chronic pain, follow-up or preoperative evaluation, injury/trauma, and other/not provided. Exams were performed with GE scanners (GE Discovery, GE Healthcare, Waukesha, WI) with a standard knee MRI coil and a routine non-contrast knee MRI protocol that included the following sequences: coronal T1-weighted, coronal T2-weighted with fat saturation, sagittal proton density (PD)-weighted, sagittal T2-weighted with fat saturation, and axial PD-weighted with fat saturation. A total of 775 (56.6%) exams used a 3.0-T magnetic field; the remainder used a 1.5-T magnetic field. See S1 Table for detailed MRI sequence parameters. For this study, the sagittal T2-weighted series, the coronal T1-weighted series, and the axial PD-weighted series were extracted from each exam for use in the model. The number of images in these series ranged from 17 to 61 (mean 31.48, SD 7.97).

Exams were split into a training set (1,130 exams, 1,088 patients), a tuning set (120 exams, 111 patients), and a validation set (120 exams, 113 patients) (Fig 1). To form the validation and tuning sets, stratified random sampling was used to ensure that at least 50 positive examples of each label (abnormality, ACL tear, and meniscal tear) were present in each set. All exams from each patient were placed in the same split. Table 1 contains pathology and patient demographic statistics for each dataset split.
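As a rough illustration of the patient-level, stratified splitting described above, the following Python sketch (a hypothetical helper, not the authors' code) assigns whole patients to splits and retries until the tuning and validation sets each contain at least 50 positive exams per label; split sizes are approximate because all of a patient's exams stay together.

```python
import random
from collections import defaultdict

def patient_level_split(exams, n_tune=120, n_valid=120, min_pos=50,
                        labels=("abnormal", "acl", "meniscus"), seed=0):
    """exams: list of dicts with a 'patient_id' key and a 0/1 entry per label."""
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for exam in exams:
        by_patient[exam["patient_id"]].append(exam)
    patients = list(by_patient)
    while True:
        rng.shuffle(patients)
        tune, valid, train = [], [], []
        for pid in patients:
            # Keep all of a patient's exams in the same split; sizes are approximate.
            bucket = tune if len(tune) < n_tune else valid if len(valid) < n_valid else train
            bucket.extend(by_patient[pid])
        # Retry until every label has enough positive exams in both held-out sets.
        if all(sum(e[lab] for e in split) >= min_pos
               for split in (tune, valid) for lab in labels):
            return train, tune, valid
```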

Model

MRNet.

The primary building block of our prediction system is MRNet, a convolutional neural network (CNN) that maps a 3-dimensional MRI series to a probability [15] (Fig 2). The input to MRNet has dimensions s × 3 × 256 × 256, where s is the number of images in the MRI series (3 is the number of color channels). First, each 2-dimensional MRI image slice was passed through an AlexNet-based feature extractor to obtain an s × 256 × 7 × 7 tensor containing features for each slice. A global average pooling layer was then applied to reduce these features to s × 256. We then applied max pooling across slices to obtain a 256-dimensional vector, which was passed to a fully connected layer and a sigmoid activation function to obtain a prediction in the 0 to 1 range. We optimized the model using a binary cross-entropy loss. To account for imbalanced class sizes across tasks, the loss for an example was weighted inversely proportionally to the prevalence of that example's class in the dataset.
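A minimal PyTorch sketch of the series-level architecture described above is shown below; it is illustrative rather than the authors' released implementation, and the module and attribute names are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class MRNet(nn.Module):
    """Maps one MRI series of shape (s, 3, 256, 256) to a single probability."""
    def __init__(self):
        super().__init__()
        self.backbone = models.alexnet().features  # per-slice AlexNet feature extractor
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling over the 7x7 maps
        self.classifier = nn.Linear(256, 1)         # fully connected layer

    def forward(self, x):                # x: (s, 3, 256, 256), one series of s slices
        feats = self.backbone(x)         # (s, 256, 7, 7) slice features
        pooled = self.gap(feats).view(x.size(0), 256)      # (s, 256)
        series_vec = pooled.max(dim=0)[0]                   # max pooling across slices -> (256,)
        return torch.sigmoid(self.classifier(series_vec))   # probability in [0, 1]
```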

During training, the gradient of the loss was computed on each training example using the backpropagation algorithm, and the MRNet parameters were adjusted in the direction opposite to the gradient [15]. Each training example underwent a random rotation between −25 and 25 degrees, a random shift between −25 and 25 pixels, and a horizontal flip with 50% probability each time it appeared during training. Model parameters were saved after each complete pass through the training set, and the model with the lowest average loss on the tuning set was chosen for evaluation on the validation set. Fig 2 describes the MRNet architecture in more detail. Training each MRNet for 50 epochs over the full training set took an average of 6 hours using an NVIDIA GeForce GTX 1070 8 GB GPU. MRNet was implemented with Python 3.6.3 [27] and PyTorch 0.3.0 [28].
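The per-example augmentation and weighted loss could be sketched as follows, assuming a modern torchvision (the functional `affine`/`hflip` transforms) rather than the PyTorch 0.3 code actually used; the prevalence-based weighting shown is one simple way to realize the inverse-prevalence scaling described above.

```python
import random
import torch
import torch.nn.functional as F
from torchvision.transforms import functional as TF

def augment_series(series):
    """Apply one random rotation (+/-25 deg), shift (+/-25 px), and 50% horizontal flip
    to every slice of a series of shape (s, 3, 256, 256)."""
    out = TF.affine(series, angle=random.uniform(-25, 25),
                    translate=[random.randint(-25, 25), random.randint(-25, 25)],
                    scale=1.0, shear=0.0)
    return TF.hflip(out) if random.random() < 0.5 else out

def training_step(model, optimizer, series, label, pos_prevalence):
    """One weighted binary cross-entropy update on a single exam series.
    label: scalar float tensor (0.0 or 1.0); pos_prevalence: fraction of positives."""
    model.train()
    optimizer.zero_grad()
    prob = model(augment_series(series))
    # Weight each example inversely to its class prevalence (up to a constant factor).
    w = (1.0 - pos_prevalence) if label.item() == 1 else pos_prevalence
    loss = F.binary_cross_entropy(prob, label.view(1), weight=torch.tensor([w]))
    loss.backward()
    optimizer.step()
    return loss.item()
```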

Training a CNN for image classification from scratch generally requires a dataset larger than 1,130 examples. For this reason, we initialized the weights of the AlexNet portion of MRNet to values optimized on the ImageNet database [29] (1.2 million images in 1,000 classes) and then fine-tuned these weights to fit our MRI dataset. This allowed the earlier layers of the network, which are harder to optimize than later layers, to immediately recognize generic features such as lines and edges. This "transfer learning" approach has also been applied to skin cancer [17] and diabetic retinopathy [18] image datasets.
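The ImageNet initialization could look like the following sketch (torchvision ≥ 0.13 weights API assumed, and `MRNet` refers to the illustrative class above); all layers are then fine-tuned on the MRI data rather than frozen.

```python
from torchvision import models

model = MRNet()  # from the sketch above, with a randomly initialized backbone
pretrained = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).features
model.backbone.load_state_dict(pretrained.state_dict())  # copy ImageNet-trained filters
# No layers are frozen: the whole network is subsequently fine-tuned on the knee MRI exams.
```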

MRNet Interpretation.

To verify that the MRNet models learned relevant features, we generated class activation maps (CAMs) [30] (Fig 3). To generate a CAM for an image, we computed a weighted average over the 256 CNN feature maps using the weights of the classification layer to obtain a 7 × 7 image. The CAM was then mapped to a color scheme, upsampled to 256 × 256 pixels, and overlaid on the original input image. Because the parameters of the final layer of the network are used to weight the feature maps, more predictive feature maps appear brighter. Thus, the brightest areas of the CAMs are the regions that most influence the model's prediction.
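A sketch of the CAM computation described above, using the illustrative MRNet attribute names introduced earlier (assumptions, not the authors' code); the heatmap is normalized to [0, 1] before being color-mapped and overlaid.

```python
import torch
import torch.nn.functional as F

def class_activation_map(model, image):
    """image: one slice of shape (3, 256, 256); returns a (256, 256) heatmap in [0, 1]."""
    with torch.no_grad():
        fmaps = model.backbone(image.unsqueeze(0))[0]       # (256, 7, 7) feature maps
        weights = model.classifier.weight.view(256, 1, 1)   # classification-layer weights
        cam = (weights * fmaps).sum(dim=0)                  # weighted average -> (7, 7)
        cam = F.interpolate(cam[None, None], size=(256, 256),
                            mode="bilinear", align_corners=False)[0, 0]
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam  # brighter regions indicate greater influence on the model's prediction
```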

Evaluation

Reference standard labels on the internal validation set were obtained by the majority vote of 3 board-certified MSK radiologists in a large academic practice (6 to 19 years of practice, average 12 years). The MSK radiologists had access to all DICOM series, the original report, clinical history, and follow-up exams during interpretation. All readers participating in the study used a clinical picture archiving and communication system (PACS) environment (GE Centricity) in a diagnostic reading room, and evaluation was performed on the clinical DICOM images presented on a medical-grade display with at least 3 megapixels, a minimum luminance of 1 cd/m², a maximum luminance of 400 cd/m², a pixel size of 0.2 mm, and a native resolution of 1,500 × 2,000 pixels. Exams were sorted in reverse chronological order. Three binary labels were assigned to each exam to indicate the presence or absence of (1) any abnormality, (2) an ACL tear, and (3) a meniscal tear. The label definitions were as follows:

  • Abnormality: normal (all images examined are free of abnormalities) or abnormal (abnormal findings in the internal validation set that were not an ACL tear or a meniscal tear included osteoarthritis, effusion, iliotibial band syndrome, posterior cruciate ligament tear, fracture, contusion, plica, and medial collateral ligament sprain);
  • ACL: intact (normal, mucoid degeneration, ganglion cyst, sprain) or torn (low-grade partial tear with <50% of fibers torn, high-grade partial tear with >50% of fibers torn, complete tear) [32];
  • Meniscus: intact (normal, degenerative changes without tear, post-surgical changes without tear) or torn (increased signal reaching the articular surface on at least 2 slices, or morphologic deformity) [33,34].

Independently of the MSK radiologists, 7 general radiologists and 2 orthopedic surgeons at Stanford University Medical Center (3 to 29 years of practice, average 12 years) labeled the internal validation set, blinded to the original reports and labels. These clinical experts' labels were measured against the MSK radiologists' consensus reference standard labels. The general radiologists were randomly divided into 2 groups, with 4 radiologists in group 1 and 3 radiologists in group 2; the 2 orthopedic surgeons were also placed in group 1. Group 1 first reviewed the entire validation set without model assistance, and group 2 reviewed the validation set with model assistance. For model-assisted reviews, model predictions were provided as predicted probabilities of a positive diagnosis (e.g., 0.98 ACL tear). After a 10-day washout period, group 1 then reviewed the validation set, in a different order, with model assistance, and group 2 reviewed the validation set without model assistance. The Stanford Institutional Review Board approved this study.

Statistical methods

Performance metrics for the model, the general radiologists, and the orthopedic surgeons included sensitivity, specificity, and accuracy. We also computed the micro-average of these statistics over the general radiologists only and over all clinical experts (general radiologists and surgeons). We evaluated the performance of the model with the area under the receiver operating characteristic curve (AUC). To assess the variability of the estimates, we provide 95% Wilson score confidence intervals [35] for sensitivity, specificity, and accuracy, and 95% DeLong confidence intervals for AUC [36,37]. A threshold of 0.5 was used to dichotomize model predictions. Model performance on the external validation set was evaluated with AUC and 95% DeLong confidence intervals.
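For reference, the 95% Wilson score interval used for sensitivity, specificity, and accuracy can be computed as in this Python stand-in for the R `binom` package; the counts in the example are illustrative, not study data.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative example: 56 correct calls out of 66 positive exams.
print(wilson_ci(56, 66))
```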

Because we performed multiple comparisons in this study, both to evaluate the performance of the model relative to that of practicing general radiologists and to assess the clinical utility of providing model assistance, we controlled the overall false discovery rate (FDR) at 0.05 [38] and report both unadjusted p-values and adjusted q-values. Roughly, an FDR < 0.05 can be interpreted as the expected proportion (0.05) of false claims of significance among all significant findings. Thus, instead of using the unadjusted p-value to assess statistical significance, a q-value < 0.05 properly accounts for these multiple comparisons. To evaluate the performance of the model relative to that of the general radiologists, we used a 2-sided Pearson chi-squared test to determine whether there were significant differences in specificity, sensitivity, and accuracy between the model and the micro-average of the general radiologists. The orthopedic surgeons were not included in this comparison.
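The comparison and adjustment could be sketched as follows in Python; SciPy's Pearson chi-squared test and a Benjamini-Hochberg adjustment are used here as stand-ins for the study's R workflow (which used Storey's q-value method), and all counts and p-values are illustrative placeholders.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# 2x2 table of correct/incorrect calls on the same exams: model vs. micro-averaged
# general radiologists (illustrative counts, not study data).
table = np.array([[40, 16],
                  [330, 61]])
_, p_value, _, _ = chi2_contingency(table, correction=False)  # 2-sided Pearson chi-squared

# Adjust the family of comparisons for multiple testing; Benjamini-Hochberg is shown
# here as a stand-in for the Storey q-value procedure used in the study.
p_values = [p_value, 0.002, 0.003, 0.04, 0.20]
reject, q_values, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```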

We evaluated the clinical utility of providing model predictions to the clinical experts by testing whether the performance metrics of the 7 general radiologists and 2 orthopedic surgeons increased when they received model assistance. There is natural variability when a clinical expert evaluates the same knee MRI study at different times, so it is not surprising for a clinical expert's performance metrics to be slightly better or worse when tested twice, regardless of model assistance. We therefore performed robust hypothesis tests to determine whether the clinical experts (as a group) showed a statistically significant improvement with model assistance. We used a 1-sided t test on the change (difference) in performance metrics for the 9 clinical experts for the 3 labels. To determine whether these results depended specifically on improvement among the orthopedic surgeons, we performed a sensitivity analysis: we repeated the 1-sided t test on the change in performance metrics among the general radiologists only, excluding the orthopedic surgeons, to determine whether there was still a significant improvement.
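The test on the change in performance metrics might look like the following (SciPy ≥ 1.6 for the `alternative` argument; the nine differences shown are placeholders, not the study's measurements).

```python
import numpy as np
from scipy.stats import ttest_1samp

# Assisted-minus-unassisted change in a metric (e.g., ACL specificity) for the 9 experts;
# placeholder values for illustration only.
deltas = np.array([0.05, 0.03, 0.08, 0.00, 0.06, 0.02, 0.07, 0.04, 0.05])

# 1-sided test of whether the mean change is greater than zero.
result = ttest_1samp(deltas, popmean=0.0, alternative="greater")
print(result.pvalue)
```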

The exact Fleiss kappa [39,40] was used to assess the degree of agreement among the 3 MSK radiologists whose majority vote was used for the reference standard labels. In addition, to assess whether model assistance can improve inter-rater reliability, we report the exact Fleiss kappa across all 9 clinical experts, with and without model assistance, for each of the 3 tasks.
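For illustration, inter-rater agreement can be computed as below with `statsmodels`; note that this is the standard Fleiss kappa rather than the exact (Conger) variant reported in the study via the R `irr` package, and the ratings shown are made up.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = exams, columns = raters; 1 = finding present (made-up binary labels).
ratings = np.array([[1, 1, 1],
                    [0, 0, 1],
                    [1, 1, 0],
                    [0, 0, 0],
                    [1, 1, 1]])

counts, _ = aggregate_raters(ratings)        # exam-by-category count table
print(fleiss_kappa(counts, method="fleiss"))
```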

All statistical analyses were performed in the R environment for statistical computing [41], using the irr, pROC, binom, and qvalue packages [38,42–44], and the R code was provided with the submission.

Results

Inter-rater agreement on the internal validation set among the 3 MSK radiologists, as measured by the exact Fleiss kappa score, was 0.508 for abnormality detection, 0.800 for ACL tear detection, and 0.745 for meniscal tear detection.

Model performance

For abnormality detection, ACL tear detection, and meniscal tear detection, the model achieved AUCs of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively (Fig 5). In detecting abnormalities, there were no significant differences between the performance metrics of the model and those of the general radiologists (Table 2). The model's specificity for abnormality detection was lower than the micro-average of the general radiologists, at 0.714 (95% CI 0.500, 0.862) versus 0.844 (95% CI 0.776, 0.893). The model achieved a sensitivity of 0.879 (95% CI 0.800, 0.929) and an accuracy of 0.850 (95% CI 0.775, 0.903), while the general radiologists achieved a sensitivity of 0.905 (95% CI 0.881, 0.924) and an accuracy of 0.894 (95% CI 0.871, 0.913) (Table 2).

Clinical utility of model assistance

The clinical utility of providing model predictions to the clinical experts during the labeling process is illustrated in Fig 6, with numerical values provided in Table 3. When the clinical experts received model assistance, there was a statistically significant increase in the clinical experts' specificity in identifying ACL tears (p-value < 0.001; q-value = 0.006). The mean increase in ACL specificity was 0.048 (4.8%), and since the validation set contained 62 exams that were negative for ACL tear, this increase in specificity in the optimal clinical setting would mean potentially 3 fewer patients sent to surgery for suspected ACL tear unnecessarily. Though it appeared that model assistance also significantly increased the clinical experts' accuracy in detecting ACL tears (p-value = 0.020) and sensitivity in detecting meniscus tears (p-value = 0.028), these findings were no longer significant after adjusting for multiple comparisons by controlling the FDR (q-values = 0.092 and 0.110, respectively). There were no other statistically significant improvements to clinical experts' performance with model assistance. Individual results, unadjusted p-values, and adjusted q-values are provided in S3 Table.

Discussion

The purpose of this study was to design and evaluate a deep learning model for classifying pathologies on knee MRI and to compare performance to human clinical experts both with and without model assistance during interpretation in a crossover design. Our results demonstrate that a deep learning approach can achieve high performance in clinical classification tasks on knee MR, with AUCs for abnormality detection, ACL tear detection, and meniscus tear detection of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively. Notably, the model achieved high specificity in detecting ACL tears on the internal validation set, which suggests that such a model, if used in the clinical workflow, may have the potential to effectively rule out ACL tears. On an external dataset using T1-weighted instead of T2-weighted series and a different labeling convention for ACL injury, the same ACL tear model achieved an AUC of 0.824 (95% CI 0.757, 0.892). Retraining on the external dataset improved the AUC to 0.911 (95% CI 0.864, 0.958). Our deep learning model achieved state-of-the-art results on the external dataset, but only after retraining. It remains to be seen if the model would better generalize to an external dataset with more MRI series and a more similar MRI protocol. We also found that providing the deep learning model predictions to human clinical experts as a diagnostic aid resulted in significantly higher specificities in identifying ACL tears. Finally, in contrast to the human experts, who required more than 3 hours on average to completely review 120 exams, the deep learning model provided all classifications in under 2 minutes. Our results suggest that deep learning can be successfully applied to advanced MSK MRI to generate rapid automated pathology classifications and that the output of the model may improve clinical interpretations.

There are many exciting potential applications of an automated deep learning model for knee MRI diagnosis in clinical practice. For example, the model described could be immediately applied for diagnostic worklist prioritization, wherein exams detected as abnormal could be moved ahead in the image interpretation workflow, and those identified as normal could be automatically assigned a preliminary reading of “normal.” With its high negative predictive value for abnormalities, the model could lead to quick preliminary feedback for patients whose exams come back as “normal.” Additionally, providing rapid results to the ordering clinician could improve disposition in other areas of the healthcare system. In this work we noticed that specificity for detecting ACL tears improved for both general radiologists and orthopedic surgeons, which implies that this model could help reduce unnecessary additional testing and surgery. Automated abnormality prediction and localization could help general radiologists or even non-radiologist clinicians (orthopedic surgeons) interpret medical imaging for patients at the point of care rather than waiting for specialized radiologist interpretation, which could aid in efficient interpretation, reduce errors, and help standardize quality of diagnoses when MSK specialist radiologists are not readily available. Ultimately, more studies are necessary to evaluate the optimal integration of this model and other deep learning models in the clinical setting. However, our results provide early support for a future where deep learning models may play a significant role in assisting clinicians and healthcare systems.

To examine the effect that a deep learning model may have on the interpretation performance of clinicians, our study deliberately recruited general radiologists to interpret knee MRI exams with and without model predictions. We found a statistically significant improvement in specificity for the ACL tear detection task with model assistance and, though not statistically significant, increased accuracy for ACL tear detection and increased sensitivity for meniscal tear detection. For both general radiologists and non-radiologist clinicians (orthopedic surgeons), we found improved sensitivity and/or specificity across all 3 tasks with model assistance (Fig 5; Table 3), although the group of surgeons was too small for formal analysis. Importantly, model assistance also resulted in higher inter-rater reliability among clinical experts for all 3 tasks, with higher Fleiss kappa measures with model assistance than without. To our knowledge, this is the first study to explore providing outputs of deep learning models to assist radiologists and non-radiologist clinicians in the task of image interpretation. More work will be needed to understand whether and how deep learning models could optimize the interpretation performance of practicing radiologists and non-radiologist clinicians.

A difficulty in deep learning for medical imaging is curating large datasets containing examples of the wide variety of abnormalities that can occur on a given imaging examination to train an accurate classifier, which is a strategy we employed for detecting ACL and meniscal tears. However, our other classification task was to distinguish “normal” from “abnormal” with the intention that if the model could learn the range of normal for a given population of knee MRI exams, then theoretically any abnormality, no matter how rare, could be detected by the model. An example is shown in Fig 3A of a relatively uncommon but serious complete rupture of the gastrocnemius tendon, which was correctly classified and localized as “abnormal” by the model, despite the fact that there were no other examples of this specific abnormality in the abnormal training data. It is possible that with a binary approach and enough “normal” training data, a model could detect any abnormality, no matter how uncommon. However, more work is needed to explore whether subtler abnormalities would require specific training data.

This study has limitations. Our validation set ground truth was not governed strictly by surgical confirmation in all cases. The deep learning model described was developed and trained on MRI data from 1 large academic institution. While MRNet performed well on the external validation set without additional training (AUC 0.824), we saw a substantial improvement (AUC 0.911) after training on the external dataset. This finding suggests that achieving optimal model performance may require additional model development using data more similar to what the model is likely to see in practice. More research is needed to determine if models trained on larger and multi-institutional datasets can achieve high performance without retraining. Power to detect statistically significant gains in clinical experts’ performance with model assistance was limited by the size of the panel, and a larger study that includes more clinical experts as well as more MRI exams may detect smaller gains in utility. Nevertheless, we have shown that even in this small set of clinical experts, providing model predictions significantly increased ACL tear detection specificity, even after correcting for multiple comparisons.

In conclusion, we developed a deep learning model that achieves high performance in clinical classification tasks on knee MRI and demonstrated the benefit, in a retrospective experiment, of providing model predictions to clinicians during the diagnostic imaging task. Future studies are needed to improve the performance and generalizability of deep learning models for MRI and to determine the effect of model assistance in the clinical setting.

References

  1. Nacey NC, Geeslin MG, Miller GW, Pierce JL. Magnetic resonance imaging of the knee: an overview and update of conventional and state of the art imaging. J Magn Reson Imaging. 2017;45:1257–75. pmid:28211591
  2. Naraghi AM, White LM. Imaging of athletic injuries of knee ligaments and menisci: sports imaging series. Radiology. 2016;281:23–40. pmid:27643766
  3. Helms CA. Magnetic resonance imaging of the knee. In: Brant WE, Helms CA, editors. Fundamentals of diagnostic radiology. Philadelphia: Lippincott Williams & Wilkins; 2007. pp. 1193–204.
  4. Oei EH, Nikken JJ, Verstijnen AC, Ginai AZ, Myriam Hunink MG. MR Imaging of the menisci and cruciate ligaments: a systematic review. Radiology. 2003;226:837–48. pmid:12601211
  5. Rangger C, Klestil T, Kathrein A, Inderster A, Hamid L. Influence of magnetic resonance imaging on indications for arthroscopy of the knee. Clin Orthop Relat Res. 1996;330:133–42.
  6. Cheung LP, Li KC, Hollett MD, Bergman AG, Herfkens RJ. Meniscal tears of the knee: accuracy of detection with fast spin-echo MR imaging and arthroscopic correlation in 293 patients. Radiology. 1997;203:508–12. pmid:9114113
  7. Mackenzie R, Palmer CR, Lomas DJ, Dixon AK. Magnetic resonance imaging of the knee: diagnostic performance studies. Clin Radiol. 1996;51:251–7. pmid:8617036
  8. McNally EG, Nasser KN, Dawson S, Goh LA. Role of magnetic resonance imaging in the clinical management of the acutely locked knee. Skeletal Radiol. 2002;31:570–3. pmid:12324825
  9. Feller JA, Webster KE. Clinical value of magnetic resonance imaging of the knee. ANZ J Surg. 2001;71:534–7. pmid:11527263
  10. Elvenes J, Jerome CP, Reikerås O, Johansen O. Magnetic resonance imaging as a screening procedure to avoid arthroscopy for meniscal tears. Arch Orthop Trauma Surg. 2000;120:14–6. pmid:10653097
  11. Crawford R, Walley G, Bridgman S, Maffulli N. Magnetic resonance imaging versus arthroscopy in the diagnosis of knee pathology, concentrating on meniscal lesions and ACL tears: a systematic review. Br Med Bull. 2007;84:5–23. pmid:17785279
  12. Kim A, Khoury L, Schweitzer M, Jazrawi L, Ishak C, Meislin R, et al. Effect of specialty and experience on the interpretation of knee MRI scans. Bull NYU Hosp Jt Dis. 2008;66:272–5. pmid:19093902
  13. Doi K. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput Med Imaging Graph. 2007;31:198–211. pmid:17349778
  14. Oakden-Rayner L, Carneiro G, Bessen T, Nascimento JC, Bradley AP, Palmer LJ. Precision radiology: predicting longevity using feature engineering and deep learning methods in a radiomics framework. Sci Rep. 2017;7:1648. pmid:28490744
  15. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. pmid:26017442
  16. Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer vision—ECCV 2014. Berlin: Springer; 2014. pp. 818–33.
  17. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115–8. pmid:28117445
  18. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–10. pmid:27898976
  19. Golan R, Jacob C, Denzinger J. Lung nodule detection in CT images using deep convolutional neural networks. 2016 International Joint Conference on Neural Networks; 2016 Jul 24–29; Vancouver, BC, Canada.
  20. Litjens G, Kooi T, Bejnordi BE, Setio AA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. pmid:28778026
  21. Prasoon A, Petersen K, Igel C, Lauze F, Dam E, Nielsen M. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In: Mori K, Sakuma I, Sato Y, Barillot C, Navab N, editors. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013. Berlin: Springer; 2013. pp. 246–53.
  22. Liu F, Zhou Z, Samsonov A, Blankenbaker D, Larison W, Kanarek A, et al. Deep learning approach for evaluating knee MR images: achieving high diagnostic performance for cartilage lesion detection. Radiology. 2018;289:160–9. pmid:30063195
  23. Štajduhar I, Mamula M, Miletić D, Ünal G. Semi-automated detection of anterior cruciate ligament injury from MRI. Comput Methods Programs Biomed. 2017;140:151–64. pmid:28254071
  24. van Rossum G. Python 2.7.10 language reference. Wickford (UK): Samurai Media; 2015.
  25. Mason D. SU-E-T-33: Pydicom: an open source DICOM library. Med Phys. 2011;38:3493.
  26. Nyúl LG, Udupa JK. On standardizing the MR image intensity scale. Magn Reson Med. 1999;42:1072–81. pmid:10571928
  27. van Rossum G, Drake FL. Python 3 reference manual. Paramount (CA): CreateSpace; 2009.
  28. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, et al. Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA, US.
  29. Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009 Jun 20–25; Miami, FL, US.
  30. Zhou B, Khosla A, Lapedriza À, et al. Learning deep features for discriminative localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 26–Jul 1; Las Vegas, NV, US.
  31. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
  32. Hong SH, Choi JY, Lee GK, Choi JA, Chung HW, Kang HS. Grading of anterior cruciate ligament injury. Diagnostic efficacy of oblique coronal magnetic resonance imaging of the knee. J Comput Assist Tomogr. 2003;27:814–9. pmid:14501376
  33. De Smet AA, Tuite MJ. Use of the ‘two-slice-touch’ rule for the MRI diagnosis of meniscal tears. AJR Am J Roentgenol. 2006;187:911–4. pmid:16985134
  34. Nguyen JC, De Smet AA, Graf BK, Rosas HG. MR imaging-based diagnosis and classification of meniscal tears. Radiographics. 2014;34:981–99. pmid:25019436
  35. Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22:209–12.
  36. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–45. pmid:3203132
  37. Sun X, Xu W. Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Process Lett. 2014;21:1389–93.
  38. Storey JD, Dabney A, Robinson D. qvalue: Q-value estimation for false discovery rate control. GitHub. 2018 Mar 9 [cited 2018 Oct 26]. Available from: http://github.com/jdstorey/qvalue.
  39. Conger AJ. Integration and generalization of kappas for multiple raters. Psychol Bull. 1980;88:322–8.
  40. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378–82.
  41. R Development Core Team. R: a language and environment for statistical computing. Version 3.4.2. Vienna: R Foundation for Statistical Computing; 2017 [cited 2018 Oct 26]. Available from: http://www.R-project.org/.
  42. Gamer M, Lemon J, Singh IP. irr: various coefficients of interrater reliability and agreement. Version 0.84. Vienna: R Foundation for Statistical Computing; 2012 [cited 2018 Oct 26]. Available from: https://CRAN.R-project.org/package=irr.
  43. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. pmid:21414208
  44. Dorai-Raj S. binom: binomial confidence intervals for several parameterizations. Version 1.1–1. Vienna: R Foundation for Statistical Computing; 2014 [cited 2018 Oct 26]. Available from: https://CRAN.R-project.org/package=binom.