The standard approaches to diagnosing autism spectrum disorder (ASD) evaluate between 20 and 100 behaviors and take several hours to complete. This has contributed in part to long wait times for a diagnosis and to subsequent delays in access to treatment. We hypothesize that applying machine learning analysis to home video can speed up diagnosis without compromising accuracy. We have analyzed item-level records from 2 standard diagnostic instruments to construct machine learning classifiers optimized for sparsity, interpretability, and accuracy. In the present study, we prospectively test whether the features of these optimized models can be extracted by blinded nonexpert raters from 3-minute home videos of children with and without ASD to arrive at a rapid and accurate machine learning classification of autism.
Methods and results
We created a mobile web portal for video raters to assess 30 behavioral features (e.g., eye contact, social smile) that are used by 8 independent machine learning models for identifying ASD, each with >94% accuracy in cross-validation testing and in subsequent independent validation from previous work. We then collected 116 short home videos of children with autism (mean age = 4 years 10 months, SD = 2 years 3 months) and 46 videos of typically developing children (mean age = 2 years 11 months, SD = 1 year 2 months). Three raters blind to the diagnosis independently measured each of the 30 features from the 8 models, with a median time to completion of 4 minutes. Although several models (consisting of alternating decision trees, support vector machine [SVM], logistic regression [LR], radial kernel, and linear SVM) performed well, a 5-feature LR classifier (LR5) yielded the highest accuracy (area under the curve [AUC]: 92% [95% CI 88%–97%]) across all ages tested. We used a prospectively collected independent validation set of 66 videos (33 ASD and 33 non-ASD) and 3 independent rater measurements to validate the outcome, achieving lower but comparable accuracy (AUC: 89% [95% CI 81%–95%]). Finally, we applied LR to the 162-video feature matrix to construct an 8-feature model, which achieved 0.93 AUC (95% CI 0.90–0.97) on the held-out test set and 0.86 on the validation set of 66 videos. Validation on children with an existing diagnosis limited the ability to generalize the performance to undiagnosed populations.
These results support the hypothesis that feature tagging of home videos for machine learning classification of autism can yield accurate outcomes in short time frames, using mobile devices. Further work will be needed to confirm that this approach can accelerate autism diagnosis at scale.
Why was this study done?
- The incidence of autism has increased by about 700% since 1996 and affects at least one in every 59 children in the United States.
- The current standard for diagnosis requires direct clinician-to-child observation and takes hours to administer.
- The sharp rise in the incidence of autism, coupled with the unscalable nature of the standard of care (SOC), has created strain on the healthcare system, and the average age of diagnosis remains around 4.5 years, roughly 2 years past the time at which autism can be reliably diagnosed.
- Mobile measures that scale could help alleviate this strain on the healthcare system, reduce waiting times for access to therapy and treatment, and reach underserved populations.
What did the researchers do and find?
- We applied 8 machine learning models to 162 two-minute home videos of children with and without an autism diagnosis to test the ability to detect autism reliably on mobile platforms.
- Three nonexpert raters measured the 30 behavioral features needed for machine learning classification by the 8 models in approximately 4 minutes.
- Leveraging the video ratings, a machine learning model with only 5 features achieved 86% unweighted average recall (UAR) on the 162 videos and UAR = 80% on a different, independently evaluated set of 66 videos, with UAR = 83% on children under 24 years old. 4
- The above process of rendering a mobile video diagnosis quickly created a new collection of tagged video features and a new video feature-based model with >90% accuracy.
What do these results mean?
- Short home videos can provide enough information to run machine learning classifiers trained to detect children with autism among those with typical or atypical development. The features needed by machine learning models designed to detect autism can be identified and measured in home videos on mobile devices by nonexperts in time frames close to the total video length and under 6 minutes.
- The machine learning models provide a quantitative indication of autism risk that offers more granularity than a binary outcome and can flag inconclusive cases, potentially adding value in a clinical setting, for example for triage.
- The process of mobile video analysis for autism detection generates a growing matrix of video features that can be used to build new machine learning models with potentially higher accuracy for detecting autism in home video.
- Prospective clinical trials in general pediatric settings on previously undiagnosed populations will be required. However, these results support the possibility that mobile video analysis with machine learning can enable rapid detection of autism outside of clinics, reducing waiting periods for access to care and reaching underserved populations in regions with limited healthcare infrastructure.
Citation: Tariq Q, Daniels J, Schwartz JN, Washington P, Kalantarian H, Wall DP (2018) Mobile detection of autism through machine learning on home video: A development and prospective validation study. PLoS Med 15(11):
Academic Editor: Suchi Saria, Johns Hopkins University, USA
Received: June 8, 2018; Accepted: October 25, 2018; Published: November 27, 2018
Copyright: © 2018 Tariq et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Availability of data: Anonymized data is available in the following github repository and includes the primary dataset and the validation dataset: https://github.com/qandeelt/video_phenotyping_autism_plos/tree/master/datasets. The code has been made available in the following github repository and instructions on how to run each classifier have been provided: https://github.com/qandeelt/video_phenotyping_autism_plos
Funding: The work was funded in part by the NIH (1R01EB025025-01 & 1R21HD091500-01), the Hartwell Foundation, the Bill and Melinda Gates Foundation, the Coulter Foundation, the Lucile Packard Foundation, and program grants from Stanford University's Human-Centered Artificial Intelligence Program, Precision Health and Integrated Diagnostics Center (PHIND), Beckman Center, Bio-X Center, Predictives and Diagnostics Accelerator, and Child Health Research Institute. We also received philanthropic support from Bobby Dekesyer and Peter Sullivan. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: DPW is the scientific founder of Cognoa, a company focused on digital pediatric healthcare; the approach and findings presented in this paper are independent from and unrelated to Cognoa. All other authors have declared that no competing interests exist.
Abbreviations: ADI-R, Autism Diagnostic Interview-Revised; ADOS, Autism Diagnostic Observation Schedule; ADTree, alternating decision tree; ADTree7, 7-feature alternating decision tree; ADTree8, 8-feature alternating decision tree; ASD, autism spectrum disorder; AUC, area under the curve; AUC-ROC, area under the receiver operating characteristic curve; BID, balanced independent dataset; IRR, interrater reliability; LR, logistic regression; LR10, 10-feature logistic regression classifier; LR5, 5-feature logistic regression classifier; LR9, 9-feature logistic regression classifier; LR-EN-VF, logistic regression with an elastic net penalty; ROC, receiver operating characteristic; SOC, standard of care; SVM, support vector machine; SVM10, 10-feature support vector machine; SVM12, 12-feature support vector machine; SVM5, 5-feature support vector machine; UAR, unweighted average recall
Neuropsychiatric disorders are the leading cause of disability due to noncommunicable disease worldwide, accounting for 14% of the global burden of disease. Autism spectrum disorder (ASD), whose incidence has increased by about 700% since 1996, is a significant contributor to this burden [2,3] and now affects 1 in 59 children in the United States [4,5]. ASD is arguably one of the largest pediatric health challenges, as supporting an individual with the condition costs up to $2.4 million over the course of their life in the United States and more than $5 billion annually in US healthcare costs.
Like most mental health conditions, autism has a complex array of symptoms that are diagnosed through behavioral exams. The standard of care (SOC) for autism diagnosis uses behavioral instruments such as the Autism Diagnostic Observation Schedule (ADOS) and the Autism Diagnostic Interview-Revised (ADI-R). These standard exams resemble others in developmental pediatrics in that they require direct clinician-to-child observation and take hours to administer [11–14]. The sharp rise in the incidence of autism, coupled with the unscalable nature of the SOC, has created strain on the healthcare system. Wait times for a diagnostic evaluation can reach or exceed 12 months in the United States, and the average age of diagnosis in the United States remains near 5 years [2,13], with the average age at ASD diagnosis in underserved populations as high as 8 years [16–18]. The high variability in the availability of diagnostic and therapeutic services is common to most psychiatric and mental health conditions in the United States, with a severe shortage of mental health services in 77% of US counties. Behavioral interventions for ASD have the most impact when administered at or before 5 years of age [12,20–23]; however, the diagnostic bottleneck that families face severely limits the impact of therapeutic interventions. Scalable measures are needed to alleviate these bottlenecks, reduce waiting times for access to therapy, and reach underserved populations in need.
To enable faster and more accurate access to care for ASD, we have used supervised machine learning approaches to identify minimal sets of behaviors that align with clinical diagnoses of ASD [24–30]. We assembled and analyzed item-level outcomes from administrations of the ADOS and ADI-R to train and test the accuracy of a range of classifiers. For the ADOS, we focused our analysis on ordinal outcome data from Modules 1, 2, and 3, which assess children with limited or no vocabulary, with phrase speech, and with fluent speech, respectively. Each of the 3 ADOS modules uses approximately 10 activities for a clinical observation of the child at risk and 28 to 30 additional behavioral measurements used to score the child following the observation. Our machine learning analyses focused on archived records of the categorical and ordinal data generated from the scoring component of these ADOS examinations. Similarly, the ADI-R involves 93 multiple-choice questions asked by a clinician of the child's primary caregiver during a clinical interview; as with the ADOS, we focused our classification task on the ordinal outcome data that resulted from administration of the instrument.
These preliminary studies focused on building models optimized for accuracy, sparsity, and interpretability that differentiate autism from non-autism while managing class imbalance. We chose models with small numbers of features, with performance at or within 1 standard error of the best test performance, and with interpretable outcomes, such as scores generated by an optimized decision tree or a logistic regression (LR) approach. In all, these studies used score data from 11,298 individuals with autism (a mix of low-, medium-, and high-severity autism) and 1,356 controls (including some children for whom autism may have been suspected but was ruled out) and identified the following 8 classifiers: a 7-feature alternating decision tree (ADTree7), an 8-feature alternating decision tree (ADTree8), a 12-feature support vector machine (SVM12), a 9-feature LR classifier (LR9), a 5-feature support vector machine (SVM5), a 5-feature LR classifier (LR5), a 10-feature LR classifier (LR10), and a 10-feature support vector machine (SVM10).
Two of these 8 classifiers have been independently tested in 4 separate analyses. In a prospective comparison between the clinical outcome and ADTree7 (measured before the clinical evaluation and official diagnosis) on 222 children (n_ASD = 69; n_controls = 153; median age = 5.8 years), the performance, measured as the unweighted average recall (UAR; the mean of the sensitivity and specificity), was 84.8%. Separately, Bone and colleagues tested ADTree7 on a "Balanced Independent Dataset" (BID) consisting of ADI-R outcome data from 680 participants (462 ASD, mean age = 9.2 years, SD = 3.1 years; and 218 non-ASD, mean age = 9.4 years, SD = 2.9 years) and found the performance to be similarly high, at 80%. Duda and colleagues tested ADTree8 with 2,333 individuals with autism (mean age = 5.8 years) and 283 "non-autism" control individuals (mean age = 6.4 years) and found the performance to be 90.2%. Bone and colleagues also tested ADTree8 on 1,033 participants from the BID, comprising 858 autism (mean age = 5.2 years, SD = 3.6 years), 73 autism spectrum (mean age = 3.9 years, SD = 2.4 years), and 102 non-spectrum (mean age = 3.4 years, SD = 2.0 years), and found the performance to be slightly higher, at 94%. These independent validation studies report classifier performance in the range of the published test accuracy and support the hypothesis that models using a minimal number of features are reliable and accurate for the detection of autism.
Others have run similar training and testing experiments to identify top-ranked features from standard instrument data, including Bone and colleagues and Bussu and colleagues. These approaches have arrived at similar conclusions, namely that machine learning is an effective way to build objective, quantitative models with few features to distinguish mild-, medium-, and high-severity autism from children outside of the autism spectrum, including those with other developmental disorders. However, translating such models into clinical practice requires additional steps that have not yet been adequately addressed. Although some of our earlier work has shown that untrained video annotators can measure autism behaviors on home videos with high interrater reliability and accuracy, the question of how to move from minimal behavioral models into clinical practice remains.
The present study builds on this earlier work to address this question and the hypothesis that the features represented in our minimally viable classifiers can be tagged quickly, accurately, and reliably from short home videos by video raters with no formal training in autism diagnosis or child development. We deployed crowdsourcing and real-time video analysis for feature tagging to run and evaluate the accuracy of the 8 machine learning models trained to detect autism on 2 independent home video repositories. This procedure enabled us to test the ability to reduce to practice the process of rapid mobile video analysis as a viable method for identifying autism symptoms and for screening. In addition, because mobile feature tagging of videos automatically generates a rich feature matrix, it offers the opportunity to train a new artificial intelligence model that may generalize to automatic detection of autism in short video clips. We test this hypothesis by constructing a new video feature classifier and comparing its results to those of the other models on a held-out subset of the original video feature matrix and on an independent external validation set. The results of this work support the hypothesis that autism detection can be performed from mobile devices outside of clinical settings with high efficiency and accuracy.
Source classifiers for reduction-to-practice testing
We assembled 8 published machine learning classifiers to test their viability for use in the rapid mobile detection of autism in short home videos. For all 8 models, the training and validation data came from medical records generated through the administration of one of the two reference instruments for the diagnosis of autism, the ADOS or the ADI-R. The ADOS has several modules containing approximately 30 features that correspond to the developmental level of the individual under assessment. Module 1 is used for individuals with limited or no vocabulary. Module 2 is used for individuals who use phrase speech but do not speak fluently. Module 3 is used for individuals who speak fluently. The ADI-R is a parent-directed interview that includes more than 90 items asked of the parent, with multiple choices for answers. Each model was trained on item-level outcomes from the ADOS or the ADI-R and optimized for accuracy, sparsity of features, and interpretability.
For the sake of brevity without omission of detail, we chose to create an abbreviation for each model using a basic naming convention. This abbreviation took the form "model type"-"number of features." For example, we used ADTree8 to refer to an alternating decision tree (ADTree) with 8 features developed from medical data from administration of the ADOS Module 1, LR5 to refer to the LR with 5 behavioral features developed from analysis of ADOS Module 2 medical record data, and so on.
ADTree7.
We (Wall and colleagues) applied machine learning to electronic medical record data recorded through administration of the ADI-R in the diagnostic assessment of children at risk for autism. We used an 80%:20% training and testing split and 10-fold cross-validation on a sample of 891 children with autism and 75 non-autism control participants with an ADTree model containing 7 features. The ADTree uses boosting to manage class imbalance [36,37]. We also performed upsampling through 1,000 bootstrap permutations to manage class imbalance. The model has been validated in a clinical trial on 222 participants and in a BID consisting of 680 individuals (462 with autism). The lowest sensitivity and specificity reported were 89.9% and 79.7%, respectively (UAR = 84.8%).
ADTree8.
We used a data set of ADOS Module 1 score sheets from 612 children with ASD and 15 non-autism control participants, with a 90%:10% training and testing split and 10-fold cross-validation, to train and test an ADTree model with 8 of the 29 Module 1 features. The ADTree uses boosting and is inherently robust to class imbalance [36,37]. We also performed upsampling through 1,000 bootstrap permutations to test the sensitivity of the model's performance to class imbalance. This 8-feature ADTree model has been independently tested on 446 individuals with autism by Wall and colleagues, on 2,333 individuals with autism and 238 without autism by Duda and colleagues, and on 1,033 individuals (858 autism, 73 autism spectrum, 102 non-spectrum) by Bone and colleagues. The lowest sensitivity and specificity reported were 97.1% and 83.3%, respectively (UAR = 90.2%).
LR9.
We performed training with ADOS Module 2 records on 362 individuals with autism and 282 individuals without autism, using backward feature selection and iterative removal of the single lowest-ranked feature across 10 folds, each with a 90%:10% split. Classes were weighted inversely proportional to class size to manage imbalance. The model with the highest sensitivity and specificity and the lowest number of features, LR with L1 regularization and 9 features, was selected for testing. We tested the model on independent data from 1,089 individuals with autism and 66 individuals with no autism diagnosis. The lowest sensitivity and specificity identified were 98.8% and 89.4%, respectively (UAR = 94.1%).
SVM12.
We used ADOS Module 3 score sheets generated by the evaluation of 510 children with ASD and 93 non-ASD control participants. These data were split into a 90% training set and a 10% test set. Training and parameter tuning were performed with stepwise feature selection and iterative removal of the single lowest-ranked feature across 10 folds. Classes were weighted inversely proportional to class size to manage imbalance. Several models were fit to each of the feature cross-validation folds. The model with the highest sensitivity and specificity and the lowest number of features, a support vector machine (SVM) with a radial basis function kernel, was then applied to the test set to measure the generalization error. We tested the model on 1,924 individuals with autism and 214 individuals without an autism diagnosis. The lowest sensitivity and specificity identified on the test set were 97.7% and 97.2%, respectively (UAR = 97.5%).
LR5 and SVM5.
In this experiment, we used medical records generated through the administration of ADOS Module 2 for 1,319 children with autism and 70 non-autism control participants. The data set was split 80%:20% into training and test sets, with the same proportion of participants with and without ASD in each set. Class imbalance was managed by setting class weights inversely proportional to the class sizes. A 10-fold cross-validation was used to select features, and a separate 10-fold cross-validation was run for hyperparameter tuning before testing the performance. An SVM model and an LR model with L1 regularization showed the highest test performance with 5 features. The lowest sensitivity and specificity reported on the test set were 98% and 58%, respectively, for SVM5 (UAR = 78%) and 93% and 67%, respectively, for LR5 (UAR = 80%).
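As an illustration of the class weighting described above, the following minimal sketch computes weights inversely proportional to class size for the reported class counts (1,319 and 70). The exact formula used in the original work is not stated; the sketch assumes the common convention n_samples / (n_classes × class count), and the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Assumed convention: n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Class sizes reported in the text for the ADOS Module 2 data set.
labels = ["ASD"] * 1319 + ["non-ASD"] * 70
weights = balanced_class_weights(labels)

for cls, w in weights.items():
    print(f"{cls}: weight = {w:.3f}")
```

Under this convention each class contributes the same total weight to the loss, so the 70 control participants are not swamped by the 1,319 participants with autism.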
LR10 and SVM10.
In this experiment, we used medical records generated through the administration of ADOS Module 3 for 2,870 children with autism and 273 non-autism control participants. The data set was split 80%:20% into training and test sets, with the same proportion of participants with and without ASD in each set. Class imbalance was managed by setting class weights inversely proportional to the class sizes. A 10-fold cross-validation was used to select features, and a separate 10-fold cross-validation was run for hyperparameter tuning before testing the performance. An SVM model and an LR model with L1 regularization showed the highest test performance with 10 features. The lowest sensitivity and specificity on the independent test set were 95% and 87%, respectively, for SVM10 (UAR = 91%) and 90% and 89%, respectively, for LR10 (UAR = 89.5%).
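Because UAR for a two-class problem is the mean of sensitivity and specificity, the figures quoted for the 8 source classifiers can be checked directly. The short sketch below does this; the model labels follow the abbreviations defined earlier, the assignment of the Module 3 radial-kernel SVM figures to SVM12 is an assumption, and the tolerance only allows for rounding in the reported values.

```python
def uar(sensitivity, specificity):
    """Unweighted average recall for a binary task: mean of the two class recalls."""
    return (sensitivity + specificity) / 2


# (sensitivity %, specificity %, reported UAR %) as quoted in the text.
reported = {
    "ADTree7": (89.9, 79.7, 84.8),
    "ADTree8": (97.1, 83.3, 90.2),
    "LR9": (98.8, 89.4, 94.1),
    "SVM12": (97.7, 97.2, 97.5),
    "SVM5": (98.0, 58.0, 78.0),
    "LR5": (93.0, 67.0, 80.0),
    "SVM10": (95.0, 87.0, 91.0),
    "LR10": (90.0, 89.0, 89.5),
}

for name, (sens, spec, stated) in reported.items():
    value = uar(sens, spec)
    # The reported figures are rounded to at most one decimal place.
    print(f"{name}: computed UAR = {value:.2f}% (reported {stated}%)")
```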
Given the overlap in the features selected, these 8 models measure 23 unique features in total. The test accuracy for each model was >90%. All models contain about 90% fewer questions than the ADI-R and 70% to 84% fewer questions than the total features measured within the ADOS. An additional 7 features were chosen for their potential diagnostic value and were scored by video raters to assess their suitability for scoring home videos, creating a total of 30 features for the mobile video rating process described below (Fig 1).
Recruitment and video collection
Under an approved Stanford University IRB protocol, we developed a mobile portal to facilitate the collection of videos of children with ASD, through which participants electronically consented to participate and uploaded their videos. Participants were recruited via crowdsourcing methods [38–41] targeted at social media platforms and listservs for families of children with autism. Interested participants were directed to a secure, encrypted video portal website to provide consent. We required participants to be at least 18 years old and the primary caregiver of a child with autism between 12 months and 17 years of age. Participants provided videos either through direct upload to the portal or by reference to a video already uploaded to YouTube, together with the age, diagnosis, and other salient characteristics. We considered videos eligible if they (1) were between 1 and 5 minutes in length, (2) showed the face and hands of the child, (3) showed clear opportunities for or direct social engagement, and (4) involved opportunities for the use of an object such as a utensil, crayon, or toy.
We relied on self-reported information provided by the parent concerning the child's official diagnosis of autism or non-autism, the age of the child when the video was submitted, and additional demographic information for videos that were submitted directly to the web portal. For videos provided via YouTube URLs, we used YouTube metatags to confirm the age and diagnosis of the child in the video. If a video did not include a metatag for the age of the child in the video, the age was assigned following full agreement among the estimates made by 3 clinical practitioners in pediatrics. To evaluate the accuracy of the parents' self-report and to guard against bias, we commissioned a practicing pediatric specialist certified to administer the ADOS to review a random selection of 20 videos. We also commissioned a developmental pediatrician to review a nonoverlapping random selection of 10 additional videos. These clinical experts classified each video as "ASD" or "non-ASD."
Feature tagging of videos to run machine learning models
We employed a total of 9 video raters who were either students (high school, undergraduate, or graduate level) or working professionals. None had training or certification in the detection or diagnosis of autism. All were given instructions on how to tag the 30 questions and were asked to score 10 example videos before performing independent feature tagging of new videos. After training, we provided the raters with unique usernames and passwords that allowed them to log in to the secure online portal, watch videos, and answer, for each video, the 30 questions required by the feature vectors to run the 8 machine learning classifiers (Table 1). The features were presented to the video raters as multiple-choice questions written at an approximately seventh-grade reading level. The raters, who remained blind to diagnosis throughout the study, were tasked with choosing, for each feature, the tag that best described the child's behavior in the video. Each response to a feature was then mapped to a score between 0 and 3, with higher scores indicating more severe autism features in the measured behavior, or to 8 to indicate that the feature could not be scored. The behavioral features and the overlap across the models are provided in Fig 1.
To test the viability of feature tagging videos for rapid machine learning detection and diagnosis of autism, we empirically identified a minimum number of video raters needed to score parent-provided home videos. We selected a random subset of videos from the full set of videos collected through our crowdsourced portal and ran the ADTree8 model on the feature vectors tagged by all 9 raters. We chose to use only ADTree8 for reasons of efficiency and because this model had already been validated in 2 independent studies [25,32]. We used a sample-with-replacement permutation procedure to measure accuracy as a function of the agreement of the majority of raters with the true diagnostic classification. We incrementally increased the number of video raters per trial by 1 rater, starting with 1 and ending with 9, drawing with replacement 1,000 times per trial. When only 2 raters were considered, we required perfect class agreement between the raters. With an odd number of raters, we required a strict majority consensus. When an even number of raters disagreed on the classification, we used the score of an independent, randomly chosen rater to break the tie.
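The resampling procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the per-rater model outputs are simulated (each hypothetical rater is right about 85% of the time), a disagreement between 2 raters is counted as incorrect, and the tie-breaking rater is drawn from raters not already sampled when one is available. All function and variable names are hypothetical.

```python
import random


def consensus(votes, tiebreak_vote):
    """Combine binary votes (1 = ASD, 0 = non-ASD); None means no consensus."""
    k = len(votes)
    positives = sum(votes)
    if k == 2:
        # With 2 raters, perfect agreement is required.
        return votes[0] if votes[0] == votes[1] else None
    if 2 * positives == k:
        # Even split: defer to an independent, randomly chosen rater.
        return tiebreak_vote
    return 1 if 2 * positives > k else 0


def accuracy_by_rater_count(predictions, truth, k, n_trials=1000, seed=0):
    """predictions[r][v] is the model output for rater r on video v.

    Returns the mean accuracy over n_trials draws of k raters with replacement.
    """
    rng = random.Random(seed)
    n_raters = len(predictions)
    total = 0.0
    for _ in range(n_trials):
        drawn = [rng.randrange(n_raters) for _ in range(k)]  # with replacement
        others = [r for r in range(n_raters) if r not in drawn]
        tiebreaker = rng.choice(others or list(range(n_raters)))
        correct = 0
        for v, true_label in enumerate(truth):
            votes = [predictions[r][v] for r in drawn]
            label = consensus(votes, predictions[tiebreaker][v])
            correct += int(label == true_label)  # no consensus counts as wrong
        total += correct / len(truth)
    return total / n_trials


# Simulated data: 50 videos (25 ASD, 25 non-ASD) and 9 raters.
rng = random.Random(42)
truth = [1] * 25 + [0] * 25
predictions = [[t if rng.random() < 0.85 else 1 - t for t in truth] for _ in range(9)]

for k in range(1, 10):
    acc = accuracy_by_rater_count(predictions, truth, k)
    print(f"{k} rater(s): mean accuracy = {acc:.3f}")
```

Plotting the mean accuracy against the number of raters gives the kind of curve from which a minimum rater count can be read off.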
After determining the minimum number of video raters, we used that minimum to generate the full set of 30-feature vectors on all videos. Seven of the models were written in Python 3 using the scikit-learn package, and one was written in R. We ran these 8 models on our feature matrices after feature tagging on videos. We measured the model accuracy through comparison of the raters’ majority classification result with the true diagnosis. We evaluated model performance further by age categories: ≤2 years, >2 to ≤4 years, >4 years to ≤6 years, and >6 years. For each category, we calculated accuracy, sensitivity, and specificity.
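A minimal sketch of the age-stratified evaluation described above is given below. The age bands follow the text; the records are hypothetical and exist only to make the example runnable, and the function names are not taken from the published code.

```python
def age_band(age_years):
    """Assign an age in years to one of the four bands used in the analysis."""
    if age_years <= 2:
        return "<=2"
    if age_years <= 4:
        return ">2 to <=4"
    if age_years <= 6:
        return ">4 to <=6"
    return ">6"


def binary_metrics(pairs):
    """pairs: list of (truth, prediction), with 1 = ASD and 0 = non-ASD."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else None  # undefined when a class is absent

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


def metrics_by_age(records):
    """records: list of (age_years, truth, majority_prediction)."""
    bands = {}
    for age, truth, pred in records:
        bands.setdefault(age_band(age), []).append((truth, pred))
    return {band: binary_metrics(pairs) for band, pairs in bands.items()}


# Hypothetical records, for illustration only.
records = [
    (1.8, 1, 1), (2.0, 0, 0), (3.5, 1, 1), (3.9, 0, 1),
    (5.2, 1, 0), (5.5, 0, 0), (7.0, 1, 1), (8.3, 0, 0),
]
for band, m in metrics_by_age(records).items():
    print(band, m)
```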
We collected timed data from each rater for each video, which began when a video rater pressed “play” on the video and concluded when a video rater finished scoring by clicking “submit” on the video portal. We used these time stamps to calculate the time spent annotating each video. We approximated the time taken to answer the questions by excluding the length of the video from the total time spent to score a video.
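The timing approximation above amounts to a single subtraction, sketched here with hypothetical time stamps and a hypothetical function name.

```python
from datetime import datetime, timedelta


def annotation_times(play_pressed, submitted, video_length):
    """Return (total time on the video, approximate time spent answering)."""
    total = submitted - play_pressed
    # Approximation from the text: answering time = total time - video length.
    answering = total - video_length
    return total, answering


# Hypothetical time stamps for one rater and one video.
play_pressed = datetime(2018, 1, 15, 10, 0, 0)
submitted = datetime(2018, 1, 15, 10, 6, 13)
video_length = timedelta(minutes=2, seconds=13)

total, answering = annotation_times(play_pressed, submitted, video_length)
print(f"total: {total}, answering: {answering}")
```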
Building a video feature classifier
The process of video feature tagging provides an opportunity to generate a crowdsourced collection of independent feature measurements that are specific to the video of the child as well as independent rater impressions of that child’s behaviors. This in turn has the ability to generate a valuable feature matrix to develop models that include video-specific features rather than features identified through analysis on archived data generated through administration of the SOC (as is the case for all classifiers contained in Table 1). To this end, and following the completion of the annotation on all videos by the minimum number of raters, we performed machine learning on our video feature set. We used LR with an elastic net penalty  (LR-EN-VF) to predict the autism class from the non-autism class. We randomly split the dataset into training and testing, reserving 20% for the latter while using cross-validation on the training set to tune for hyperparameters. We used cross-validation for model hyperparameter tuning by performing a grid search with different values of alpha (varying penalty weights) and L1 ratio (the mixing parameter determining how much weight to apply to L1 versus L2 penalties). Based on the resulting area under the curve (AUC) and accuracy from each combination, we selected the top-performing pair of hyperparameters. Using this pair, we trained the model using LR and balanced class weights to adjust weights inversely proportional to class frequencies in the input data. After determining the top-ranked features based on the trained model and the resulting coefficients, we validated the model on the reserved test set.
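For reference, one common way to write a class-weighted logistic regression objective with an elastic net penalty, using the alpha and L1-ratio parameterization mentioned above, is shown below. The notation and scaling are assumptions made for illustration and may differ from the exact objective optimized in the study.

```latex
\[
\min_{w,\,b}\;\; \frac{1}{n}\sum_{i=1}^{n} c_{y_i}\,
\log\!\bigl(1 + e^{-y_i\,(w^{\top}x_i + b)}\bigr)
\;+\; \alpha\Bigl(\rho\,\lVert w\rVert_1 + \tfrac{1-\rho}{2}\,\lVert w\rVert_2^2\Bigr)
\]
```

Here $x_i$ is the feature vector for video $i$, $y_i \in \{-1, +1\}$ is its class label, $c_{y_i}$ is the class weight (inversely proportional to class frequency), $\alpha$ is the overall penalty weight, and $\rho \in [0, 1]$ is the L1 ratio: $\rho = 1$ gives a pure L1 penalty and $\rho = 0$ a pure L2 penalty. The grid search described above corresponds to evaluating a set of $(\alpha, \rho)$ pairs by cross-validation and keeping the best-performing pair.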
Independent test set for validation of video phenotyping processes
We used our video portal and crowdsourcing approaches to generate an independent collection of videos for evaluation and feature tagging by 3 raters different from those used in the primary analysis. These raters had similar characteristics to the original group (age, education, no clinical certifications in developmental pediatrics) and were trained for video tagging through the same procedures.
All classifiers used for testing the time and accuracy of mobile video rating had accuracies above 90% (Table 1). The union of features across these 8 classifiers (Table 1) was 23 (Fig 1). These features plus an additional 7 chosen for clinical validity testing were loaded into a mobile video rating portal to enable remote feature tagging by nonclinical video raters.
We collected a total of 193 videos (Table 2) with an average video length of 2 minutes 13 seconds (SD = 1 minute 40 seconds). Of the 119 ASD videos, 72 were direct submissions made by the primary caregiver of the child, and 47 were links to an existing video on YouTube. Of the 74 non-ASD videos, 46 were links to existing YouTube videos, and 28 were direct submissions from the primary caregiver. We excluded 31 videos because of insufficient evidence for the diagnosis (n = 25) or inadequate video quality (n = 6), leaving 162 videos (116 with ASD and 46 non-ASD), which were loaded into our mobile video rating portal for the primary analysis. To validate self-reporting of the presence or absence of an ASD diagnosis, 2 clinical staff trained and certified in autism diagnosis evaluated a random selection of 30 videos (15 with ASD and 15 non-ASD) from the 162 videos. Their classifications had perfect correspondence with the diagnoses provided through self-report by the primary caregiver.
We randomly selected 50 videos (25 ASD and 25 non-ASD) from the total 162 collected videos and had 9 raters feature tag all of them to determine an optimal number of raters, with optimal being defined through a balance of scalability and information content. The average video length of this random subset was 1 minute 54 seconds (SD = 46 seconds) for the ASD class and 2 minutes 36 seconds (SD = 1 minute 15 seconds) for the non-ASD class. We then ran the ADTree8 (Table 1) model on the feature vectors generated by the 9 raters. We found no statistically significant difference in accuracy between 3 raters (the minimum number to have a majority consensus on the classification with no ties) and 9 raters (Fig 2). We therefore elected to use a random selection of 3 raters from the 9 to feature tag all 162 crowdsourced home videos.
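The majority-consensus rule mentioned above can be illustrated with a short sketch. The function names and class labels are hypothetical; the only assumption carried over from the text is that an odd number of raters voting on a binary class cannot tie.

```python
from collections import Counter

def majority_class(rater_predictions):
    """Return the class chosen by most raters.

    With two classes, an odd number of raters guarantees there is no tie.
    """
    if len(rater_predictions) % 2 == 0:
        raise ValueError("use an odd number of raters to avoid ties")
    return Counter(rater_predictions).most_common(1)[0][0]

def unanimous(rater_predictions):
    """True when every rater's feature vector led to the same class."""
    return len(set(rater_predictions)) == 1

# Hypothetical per-rater classifier outputs for one video.
votes = ["ASD", "non-ASD", "ASD"]
print(majority_class(votes), unanimous(votes))
```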
Three raters performed video screening and feature tagging to generate vectors for each of the 8 machine learning models for comparative evaluation of performance (Fig 3). All classifiers had sensitivity of at least 94.5%. However, only 3 of the 8 models exhibited specificity above 50%. The top-performing classifier was LR5, which showed an accuracy of 88.9%, sensitivity of 94.5%, and specificity of 77.4%. The next-best-performing models were SVM5 with 85.4% accuracy (54.9% specificity) and LR10 with 84.8% accuracy (51% specificity).
Participants were recruited via crowdsourcing methods and provided video by direct upload or via a preexisting YouTube link. Three video raters, the minimum needed for a majority rule, tagged all features, generating feature vectors to run each of the 8 classifiers automatically. The sensitivity and specificity based on the majority outcome generated by the 3 raters on 162 (116 with autism) videos are provided. Highlighted in yellow is the best-performing model, LR5. ADTree7, 7-feature alternating decision tree; ADTree8, 8-feature alternating decision tree; ASD, autism spectrum disorder; LR5, 5-feature logistic regression classifier; LR9, 9-feature logistic regression classifier; LR10, 10-feature logistic regression classifier; SVM5, 5-feature support vector machine; SVM10, 10-feature support vector machine; SVM12, 12-feature support vector machine.
LR5 exhibited high accuracy on all age ranges with the exception of children over 6 years old (although note that we had limited examples of the non-ASD class [n = 1] in this range). This model performed best on children between the ages of 4 and 6 years, with sensitivity and specificity both above 90% (Fig 4, Table 3). SVM5 and LR10 showed an increase in performance on children ages 2–4 years, both with 100% sensitivity and the former with 66.7% and the latter with 58.8% specificity. The 3 raters agreed unanimously on 116 out of 162 videos (72%) when using the top-performing classifier, LR5. The interrater agreement (IRA) for this model was above 75% in all age ranges with the exception of the youngest age group of children, those under 2 years, for which there was a greater frequency of disagreement. The numbers of non-ASD representatives were small for the older age ranges evaluated (Table 3).
The median time for the 3 raters to watch and score a video was 4 minutes (Table 4). Excluding the time spent watching the video, raters required a median of 2 minutes 16 seconds to tag all 30 features in the analyst portal. We found a significant difference (p = 0.0009) between the average time spent to score the videos of children with ASD and the average time spent to score the non-ASD videos (6 minutes 36 seconds compared with 5 minutes 8 seconds).
To validate the feasibility and accuracy of rapid feature tagging and machine learning on short home videos, we launched a second effort for crowdsourcing videos of children with and without autism to generate an independent replication dataset. We collected 66 videos, 33 of children with autism and 33 non-ASD. This set of videos was comparable to the initial set of 162 videos in terms of gender, age, and video length. The average age for children with ASD was 4 years 5 months (SD = 1 year 9 months), and the average age for non-ASD children was 3 years 11 months (SD = 1 year 7 months). Forty-two percent (n = 14) of the children with ASD were male and 45% (n = 15) of the non-ASD children were male. The average video length was 3 minutes 24 seconds, with an SD of 45 seconds. For this independent replication, we used 3 different raters, each with no official training or experience with developmental pediatrics. The raters required a median time of 6 minutes 48 seconds for complete feature tagging. LR5 again yielded the highest accuracy, with a sensitivity of 87.8% and a specificity of 72.7%. A total of 13 of the 66 videos were misclassified, with 4 false negatives.
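As a worked check of how the reported rates relate to the counts, the sketch below derives a confusion matrix from the totals given above (33 ASD, 33 non-ASD, 13 misclassified, 4 false negatives). The function name is hypothetical, and the false-positive count is inferred by subtraction rather than quoted.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of ASD videos classified as ASD
    specificity = tn / (tn + fp)   # share of non-ASD videos classified as non-ASD
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

asd_videos, non_asd_videos = 33, 33
misclassified, false_negatives = 13, 4
false_positives = misclassified - false_negatives  # inferred, not quoted

sens, spec, acc = diagnostic_metrics(
    tp=asd_videos - false_negatives,
    fn=false_negatives,
    tn=non_asd_videos - false_positives,
    fp=false_positives,
)
print(f"{sens:.3f} {spec:.3f} {acc:.3f}")
```

The resulting sensitivity (29/33) and specificity (24/33) are in line with the reported 87.8% and 72.7%, up to rounding.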
Given the higher average time for video evaluation, we hypothesized that the videos contained challenging displays of autism symptoms. Therefore, we examined the probabilities generated by the LR5 model for the 13 misclassified videos. Two of the 4 false negatives and 4 of the 9 false positives had borderline probability scores between 0.4 and 0.6. We elected to treat probabilities between 0.4 and 0.6 as a band that flags videos as inconclusive cases. Twenty-six of the 66 videos fell within this inconclusive group when applying this band. When we excluded these 26 from our accuracy analysis, the sensitivity and specificity increased to 91.3% and 88.2%, respectively.
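The inconclusive-case rule described above can be sketched as a three-way triage on the model's probability score. The function name, the example probabilities, and the choice to treat both 0.4 and 0.6 as inside the band are assumptions; the text does not state whether the bounds are inclusive.

```python
def triage(prob_asd, low=0.4, high=0.6):
    """Map a predicted ASD probability to a class or an inconclusive flag."""
    if low <= prob_asd <= high:    # assumption: bounds are inclusive
        return "inconclusive"
    return "ASD" if prob_asd > high else "non-ASD"

# Hypothetical model outputs, not study data.
probabilities = [0.92, 0.55, 0.41, 0.12, 0.73]
labels = [triage(p) for p in probabilities]
conclusive = [lab for lab in labels if lab != "inconclusive"]
print(labels, len(conclusive))
```

Sensitivity and specificity would then be computed on the conclusive cases only, which is why the rates reported after exclusion describe a smaller set of videos.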
Training a video feature–specific classifier
To build a video feature–specific classifier, we trained an LR-EN-VF model on 528 (3 raters × 176 videos) novel measures of the 30 video features used to distinguish the autism class from the neurotypical cohort. Out of these 176 videos (ASD = 121, non-ASD = 58), 162 (ASD = 116, non-ASD = 46) were from the analysis set, and 14 videos (ASD = 5, non-ASD = 12) were from the set of 66 validation videos. Model hyperparameters (alpha and L1 ratio) identified through 10-fold cross-validation were 0.01 and 0.6, respectively. We used a high L1 ratio to enforce sparsity and to decrease model complexity and the number of features. We had similar proportions (0.60) for non-ASD and ASD measures in the training set and held-out test set, which allowed us to create a model that generalizes well without a significant change in sensitivity or specificity on novel data. The model had an area under the receiver operating characteristic curve (AUC-ROC) of 93.3% and accuracy of 87.7% on the held-out test set. A comparison of LR-EN-VF with LR L2 penalty (no feature reduction) revealed similar results (AUC-ROC: 93.8%, test accuracy: 90.7%) (Fig 5). The top-8 features selected by the model consisted of the following, in order of highest to lowest rank: speech patterns, communicative engagement, understands language, emotion expression, sensory seeking, responsive social smile, stereotyped speech. One of these 8 features—sensory seeking—was not part of the full sets of items on the standard instrument data used in the development and testing of the 8 models depicted in Table 1. We then validated this classifier on the remaining 52 videos (ASD = 28, non-ASD = 21) from the validation set, and the results showed an accuracy of 75.5% and an AUC-ROC of 86.0%.
Previous work [26–29] has shown that machine learning models built on records from standard autism diagnoses can achieve high classification accuracy with a small number of features. Although promising in terms of their minimal feature requirements and ability to generate an accurate risk score, their potential for improving autism diagnosis in practice has remained an open question. The present study tested the ability to reduce these models to the practice of home video evaluation by nonexperts using mobile platforms (e.g., tablets, smartphones). Independent tagging of 30 features by 3 raters blind to diagnosis enabled majority rules machine learning classification of 162 two-minute (average) home videos in a median of 4 minutes at 90% AUC on children ages 20 months to 6 years. This performance was maintained at 89% AUC (95% CI 81%–95%) in a prospectively collected and independent external set of 66 videos each with 3 independent rater measurement vectors. Taking advantage of the probability scores generated by the best-performing model (L1-regularized LR model with 5 features) to flag low-confidence cases, we were able to achieve a 91% AUC, suggesting that the approach could benefit from the use of the scores on a more quantitative scale rather than just as a binary classification outcome.
By using a mobile format that can be accessed online, we showed that it is possible to get multiple independent feature vectors for classification. This has the potential to elevate confidence in classification outcome at the time of diagnosis (i.e., when 3 or more agree on class) while fostering the growth of a novel matrix of features from short home videos. In the second part of our study, we tested the ability for this video feature matrix to enable development of a new model that can generalize to the task of video-based classification of autism. We found that an 8-feature LR model could achieve an AUC of 0.93 on the held-out subset and 0.86 on the prospective independent validation set. One of the features used by this model, sensory seeking, was not used by the instruments on which the original models were trained, suggesting the possibility that alternative features may provide added power for video classification.
These results support the hypothesis that the detection of autism can be done effectively at scale through mobile video analysis and machine learning classification to produce a quantified indicator of autism risk quickly. Such a process could streamline autism diagnosis to enable earlier detection and earlier access to therapy that has the highest impact during earlier windows of social development. Further, this approach could help to reduce the geographic and financial burdens associated with access to diagnostic resources and provide more equal opportunity to underserved populations, including those in developing countries. Further testing and refinement should be conducted to identify the most viable method(s) of crowdsourcing video acquisition and feature tagging. In addition, prospective trials in undiagnosed and in larger, more-balanced cohorts including examples of children with non-autism developmental delays will be needed to better understand the approach’s potential for use in autism diagnosis.
LR10, LR5, and ADTree7 are the top-3 best-performing classifiers on the validation set, which falls in line with the results observed on the test dataset of 162 videos used earlier. LR5 still performs with the highest specificity out of the 8 models. ADTree7, 7-feature alternating decision tree; LR5, 5-feature logistic regression classifier; LR10, 10-feature logistic regression classifier.
We would like to thank Kaitlyn Dunlap, the participating families, and each of our video raters for their important contributions to this study.
Prince M, Patel V, Saxena S, Maj M, Maselko J, Phillips MR, et al. Global mental health 1 – No health without mental health. Lancet. 2007;370(9590):859–77. https://doi.org/10.1016/S0140-6736(07)61238-0 pmid:17804063
Baio J, Wiggins L, Christensen DL, Maenner MJ, Daniels J, Warren Z, et al. Prevalence of Autism Spectrum Disorder Among Children Aged 8 Years—Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2014. MMWR Surveillance Summaries. 2018;67(6):1. https://doi.org/10.15585/mmwr.ss6706a1 pmid:29701730. PMCID: PMC5919599.
Hertz-Picciotto I, Delwiche L. The Rise in Autism and the Role of Age at Diagnosis. Epidemiology. 2009;20(1):84–90. https://doi.org/10.1097/EDE.0b013e3181902d15 pmid:19234401. PMCID: PMC4113600.
Christensen DL, Baio J, Van Naarden Braun K, Bilder D, Charles J, Constantino JN, et al. Prevalence and Characteristics of Autism Spectrum Disorder Among Children Aged 8 Years–Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2012. MMWR Surveill Summ. 2016;65(3):1–23. https://doi.org/10.15585/mmwr.ss6503a1 pmid:27031587.
Christensen DL, Bilder DA, Zahorodny W, Pettygrove S, Durkin MS, Fitzgerald RT, et al. Prevalence and characteristics of autism spectrum disorder among 4-year-old children in the autism and developmental disabilities monitoring network. Journal of Developmental & Behavioral Pediatrics. 2016;37(1):1–8. https://doi.org/10.1097/DBP.0000000000000235 pmid:26651088.
Buescher AV, Cidav Z, Knapp M, Mandell DS. Costs of autism spectrum disorders in the United Kingdom and the United States. JAMA Pediatr. 2014;168(8):721–8. https://doi.org/10.1001/jamapediatrics.2014.210 pmid:24911948.
McPartland JC, Reichow B, Volkmar FR. Sensitivity and specificity of proposed DSM-5 diagnostic criteria for autism spectrum disorder. J Am Acad Child Adolesc Psychiatry. 2012;51(4):368–83. https://doi.org/10.1016/j.jaac.2012.01.007 pmid:22449643. PMCID: PMC3424065.
Lord C, Rutter M, Goode S, Heemsbergen J, Jordan H, Mawhood L, et al. Autism diagnostic observation schedule: A standardized observation of communicative and social behavior. Journal of autism and developmental disorders. 1989;19(2):185–212. pmid:2745388.
Lord C, Rutter M, Le Couteur A. Autism Diagnostic Interview-Revised: a revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of autism and developmental disorders. 1994;24(5):659–85. pmid:7814313.
Association AP. Diagnostic and statistical manual of mental disorders (DSM-5®). Arlington, VA: American Psychiatric Pub; 2013.
Bernier R, Mao A, Yen J. Psychopathology, families, and culture: autism. Child Adolesc Psychiatr Clin N Am. 2010;19(4):855–67. https://doi.org/10.1016/j.chc.2010.07.005 pmid:21056350.
Dawson G. Early behavioral intervention, brain plasticity, and the prevention of autism spectrum disorder. Dev Psychopathol. 2008;20(3):775–803. https://doi.org/10.1017/S0954579408000370 pmid:18606031.
Mazurek MO, Handen BL, Wodka EL, Nowinski L, Butter E, Engelhardt CR. Age at first autism spectrum disorder diagnosis: the role of birth cohort, demographic factors, and clinical features. J Dev Behav Pediatr. 2014;35(9):561–9. https://doi.org/10.1097/DBP.0000000000000097 pmid:25211371.
Wiggins LD, Baio J, Rice C. Examination of the time between first evaluation and first autism spectrum diagnosis in a population-based sample. Journal of Developmental and Behavioral Pediatrics. 2006;27(2):S79–S87. pmid:16685189.
Gordon-Lipkin E, Foster J, Peacock G. Whittling Down the Wait Time: Exploring Models to Minimize the Delay from Initial Concern to Diagnosis and Treatment of Autism Spectrum Disorder. Pediatr Clin North Am. 2016;63(5):851–9. https://doi.org/10.1016/j.pcl.2016.06.007 pmid:27565363. PMCID:PMC5583718.
Howlin P, Moore A. Diagnosis in autism: A survey of over 1200 patients in the UK. Autism. 1997;1(2):135–62.
Kogan MD, Strickland BB, Blumberg SJ, Singh GK, Perrin JM, van Dyck PC. A National Profile of the Health Care Experiences and Family Impact of Autism Spectrum Disorder Among Children in the United States, 2005-2006. Pediatrics. 2008;122(6):E1149–E58. https://doi.org/10.1542/peds.2008-1057 pmid:19047216.
Siklos S, Kerns KA. Assessing the diagnostic experiences of a small sample of parents of children with autism spectrum disorders. Res Dev Disabil. 2007;28(1):9–22. https://doi.org/10.1016/j.ridd.2005.09.003 pmid:16442261.
Thomas KC, Ellis AR, Konrad TR, Holzer CE, Morrissey JP. County-level estimates of mental health professional shortage in the United States. Psychiatr Serv. 2009;60(10):1323–8. https://doi.org/10.1176/ps.2009.60.10.1323 pmid:19797371.
Dawson G, Jones EJH, Merkle K, Venema K, Lowy R, Faja S, et al. Early Behavioral Intervention Is Associated With Normalized Brain Activity in Young Children With Autism. Journal of the American Academy of Child and Adolescent Psychiatry. 2012;51(11):1150–9. https://doi.org/10.1016/j.jaac.2012.08.018 pmid:23101741. PMCID: PMC3607427.
Dawson G, Rogers S, Munson J, Smith M, Winter J, Greenson J, et al. Randomized, controlled trial of an intervention for toddlers with autism: the Early Start Denver Model. Pediatrics. 2010;125(1):e17–23. https://doi.org/10.1542/peds.2009-0958 pmid:19948568. PMCID: PMC4951085.
Landa RJ. Efficacy of early interventions for infants and young children with, and at risk for, autism spectrum disorders. International Review of Psychiatry. 2018;30(1):25–39. https://doi.org/10.1080/09540261.2018.1432574 pmid:29537331. PMCID: PMC6034700.
Phillips DA, Shonkoff JP. From neurons to neighborhoods: The science of early childhood development. Washington, D.C.: National Academies Press; 2000. https://doi.org/10.17226/9824 pmid:25077268.
Duda M, Daniels J, Wall DP. Clinical Evaluation of a Novel and Mobile Autism Risk Assessment. J Autism Dev Disord. 2016;46(6):1953–61. https://doi.org/10.1007/s10803-016-2718-4 pmid:26873142. PMCID: PMC4860199.
Duda M, Kosmicki JA, Wall DP. Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Transl Psychiatry. 2014;4(8):e424. https://doi.org/10.1038/tp.2014.65 pmid:25116834.
Kosmicki JA, Sochat V, Duda M, Wall DP. Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning. Translational Psychiatry. 2015;5(2):e514. https://doi.org/10.1038/tp.2015.7 pmid:25710120. PMCID: PMC4445756.
Levy S, Duda M, Haber N, Wall DP. Sparsifying machine learning models identify stable subsets of predictive features for behavioral detection of autism. Mol Autism. 2017;8(1):65. https://doi.org/10.1186/s13229-017-0180-6 pmid:29270283. PMCID: PMC5735531.
Wall DP, Kosmicki J, DeLuca TF, Harstad E, Fusaro VA. Use of machine learning to shorten observation-based screening and diagnosis of autism. Translational Psychiatry. 2012;2(4):e100. https://doi.org/10.1038/tp.2012.10 pmid:22832900. PMCID: PMC3337074.
Wall DP, Dally R, Luyster R, Jung JY, Deluca TF. Use of artificial intelligence to shorten the behavioral diagnosis of autism. PLoS One. 2012;7(8):e43855. https://doi.org/10.1371/journal.pone.0043855 pmid:22952789.
Wall DP, Kosmiscki J, Deluca TF, Harstad L, Fusaro VA. Use of machine learning to shorten observation-based screening and diagnosis of autism. Translational Psychiatry. 2012;2(e100). https://doi.org/10.1038/tp.2012.10 pmid:22832900. PMCID: PMC3337074.
Schuller B, Vlasenko B, Eyben F, Wollmer M, Stuhlsatz A, Wendemuth A, et al. Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies. IEEE Transactions on Affective Computing. 2010;1(2):119–31. https://doi.org/10.1109/T-Affc.2010.8
Bone D, Goodwin MS, Black MP, Lee CC, Audhkhasi K, Narayanan S. Applying machine learning to facilitate autism diagnostics: pitfalls and promises. J Autism Dev Disord. 2015;45(5):1121–36. https://doi.org/10.1007/s10803-014-2268-6 pmid:25294649. PMCID: PMC4390409.
Bone D, Bishop SL, Black MP, Goodwin MS, Lord C, Narayanan SS. Use of machine learning to improve autism screening and diagnostic instruments: effectiveness, efficiency, and multi‐instrument fusion. Journal of Child Psychology and Psychiatry. 2016;57(8):927–37. https://doi.org/10.1111/jcpp.12559 pmid:27090613. PMCID: PMC4958551.
Bussu G, Jones EJH, Charman T, Johnson MH, Buitelaar JK, Team B. Prediction of Autism at 3 Years from Behavioural and Developmental Measures in High-Risk Infants: A Longitudinal Cross-Domain Classifier Analysis. Journal of Autism and Developmental Disorders. 2018;48(7):2418–33. https://doi.org/10.1007/s10803-018-3509-x pmid:29453709. PMCID: PMC5996007.
Fusaro VA, Daniels J, Duda M, DeLuca TF, D'Angelo O, Tamburello J, et al. The Potential of Accelerating Early Detection of Autism through Content Analysis of YouTube Videos. PLoS One. 2014;9(4):e93533. https://doi.org/10.1371/journal.pone.0093533 pmid:24740236. PMCID: PMC3989176.
Freund Y, Schapire RE, editors. Experiments with a new boosting algorithm. ICML; 1996 July 3, 1996; Bari, Italy. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1996.
Freund Y, Mason L, editors. The alternating decision tree learning algorithm. ICML; 1999 June 27, 1999; Bled, Slovenia. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Behrend TS, Sharek DJ, Meade AW, Wiebe EN. The viability of crowdsourcing for survey research. Behav Res Methods. 2011;43(3):800–13. https://doi.org/10.3758/s13428-011-0081-0 pmid:21437749.
David MM, Babineau BA, Wall DP. Can we accelerate autism discoveries through crowdsourcing? Research in Autism Spectrum Disorders. 2016;32:80–3.
Ogunseye S, Parsons J, editors. What Makes a Good Crowd? Rethinking the Relationship between Recruitment Strategies and Data Quality in Crowdsourcing. Proceedings of the 16th AIS SIGSAND Symposium; 2017 May 19-20, 2017; Cincinnati, OH.
Swan M. Crowdsourced health research studies: an important emerging complement to clinical trials in the public health research ecosystem. J Med Internet Res. 2012;14(2):e46. https://doi.org/10.2196/jmir.1988 pmid:22397809. PMCID: PMC3376509.
Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–20. https://doi.org/10.1111/j.1467-9868.2005.00503.x