We validated our generalized approach of crowd amplification and deep learning by comparing its classification results against those of MRIQC \cite{Esteban2017}, an existing, specialized algorithm for QC of T1-weighted images. The features extracted by MRIQC are guided by the physics of MR image acquisition and by the statistical properties of images. An XGBoost model was trained on MRIQC features from a training subset of gold-standard images and evaluated on a previously unseen test subset. Its AUC was also 0.99, matching the performance of our crowd-trained deep learning model.
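For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen passing image receives a higher classifier score than a randomly chosen failing one. The following is a minimal sketch of that computation in plain Python. The labels and scores are made-up illustrative values, and the function name is hypothetical; this is not the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical gold-standard labels (1 = pass, 0 = fail) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.20, 0.10]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of images and is chosen here only for clarity; a rank-based implementation gives the same value more efficiently.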
The secondary goal of this study was to investigate how scaling expertise through citizen science amplification affects scientific inferences from these data. For this proof of concept, we studied brain development, which is the primary focus of the HBN dataset. Lebel and colleagues
\cite{Lebel2011} found that increases in white matter volume and decreases in gray matter volume are roughly equal in magnitude, resulting in no overall brain volume change over development in late childhood. Based on Figure 2 in the Lebel manuscript
\cite{Lebel2011}, we estimate an effect of approximately $-4.3$ cm$^3$ per year, a decrease in gray matter volume over the ages measured (we estimate the high point to be 710 cm$^3$ and the low point to be 580 cm$^3$, with ages ranging from approximately 5 to 35 years, and hence $(710 - 580)/(5 - 35) \approx -4.3$ cm$^3$/year). To reproduce their analysis and assess the effect of using the CNN-derived quality control estimates, we estimated gray and white matter volume in the subjects that had been scored for quality using our algorithm. Figure
\ref{182176} shows gray matter volume as a function of age. Two conditions are compared: in one (Figure
\ref{182176}A) all of the subjects are included, while in the other only subjects that were passed by the CNN are included (Figure
\ref{182176}B, blue points). Depending on the threshold chosen, the estimated change in gray matter volume with age varies from $-2.6$ cm$^3$/year (with no threshold) to $-5.3$ cm$^3$/year (with Braindr rating $> 0.9$). A threshold of 0.7 on either the Braindr or the MRIQC rating results in an effect size of around $-4.3$ cm$^3$ per year, replicating the results of
\cite{Lebel2011}. A supplemental interactive version of this figure, which allows readers to threshold data points based on QC scores from the predicted labels of the CNN (called “Braindr ratings”) or on MRIQC XGBoost probabilities (called “MRIQC ratings”), is available at
http://results.braindr.us. Thus, quality control has a substantial impact on estimates of brain development, and allowing poor-quality data into the statistical model can almost entirely obscure developmental changes in gray matter volume.
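The thresholding analysis described above can be illustrated with a short sketch: keep only subjects whose QC rating exceeds a threshold, then fit an ordinary least-squares line of gray matter volume against age. The example below is a minimal sketch in plain Python under stated assumptions. All function names are hypothetical, the ages, volumes, and ratings are invented for illustration (they are not HBN data), and a simple linear fit is assumed; only the two-point estimate uses numbers from the text.

```python
def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("all x values are identical")
    return sxy / sxx


def slope_above_threshold(ages, volumes, ratings, threshold=None):
    """Fit volume against age using only subjects with rating > threshold.

    With threshold=None every subject is included (the no-threshold case).
    """
    kept = [
        (a, v)
        for a, v, r in zip(ages, volumes, ratings)
        if threshold is None or r > threshold
    ]
    return ols_slope([a for a, _ in kept], [v for _, v in kept])


# Two-point estimate read off the published figure, as described in the text.
two_point = (710 - 580) / (5 - 35)  # cm^3 per year

# Invented example: four good scans on a declining line, two poor scans off it.
ages = [6, 10, 14, 18, 7, 17]                 # years
volumes = [700, 684, 668, 652, 640, 700]      # cm^3 (hypothetical)
ratings = [0.95, 0.90, 0.85, 0.92, 0.20, 0.10]  # hypothetical QC ratings

slope_all = slope_above_threshold(ages, volumes, ratings)
slope_qc = slope_above_threshold(ages, volumes, ratings, threshold=0.7)
print(f"two-point estimate: {two_point:.1f} cm^3/year")
print(f"no threshold: {slope_all:.2f}, rating > 0.7: {slope_qc:.2f} cm^3/year")
```

In this toy example the two low-rated scans pull the fitted slope toward zero, which mirrors qualitatively how poor-quality data can mask the developmental decline.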