Discussion
We have developed a system to scale expertise in neuroimaging to meet the demands of Big Data. The system uses citizen scientists to amplify an initially small, expert-labeled dataset. Combined with deep learning (via CNNs), the system can then accurately perform image analysis tasks that require expertise, such as quality control (QC). We have validated our method against MRIQC, a specialized tool designed for this use case based on knowledge of the physics underlying signal generation in T1-weighted images \cite{Esteban2017}. Unlike MRIQC, our method is able to generalize beyond quality control of T1-weighted images; any image-based binary classification task can be loaded onto the braindr platform and crowdsourced via the web. For this use case, we demonstrated the importance of scaling QC expertise by showing that replication of a previously established result depends on a researcher's decisions about data quality: Lebel and colleagues \cite{Lebel2011} reported changes in gray matter volume over development, and we find that we only replicate these findings when a stringent quality control threshold is applied to the input data.
The Internet and Web Applications for Collaboration
The internet and web browser technologies are crucial not only for scientific communication, but also for collaboration and the distribution of work. This is particularly true in the age of large consortium efforts aimed at generating large, high-quality datasets. Recent citizen science projects for neuroscience research have proven extremely useful and popular, in part due to the ubiquity of the web browser. Large-scale citizen science projects, like EyeWire \cite{kim2014space,marx2013neuroscience} and Mozak \cite{roskams2016power}, have enabled scientists working with high-resolution microscopy data to map neuronal connections at the microscale, with help from over 100,000 citizen scientists. In MR imaging, web-based tools such as BrainBox \cite{heuer2016open} and Mindcontrol \cite{Keshavan2017} were built to facilitate collaboration among neuroimaging experts in image segmentation and quality control. However, inspecting each slice of a 3D image in BrainBox or Mindcontrol takes a long time, and this complex task tends to deter potential citizen scientists who find it too difficult or time-consuming. In general, crowdsourcing is most effective when a project is broken down into short, simple, well-defined “micro-tasks” that can be completed in short bursts of work and are resilient to interruption \cite{cheng2015break}. To simplify the task for citizen scientists, we developed a web application called braindr, which reduces the time-consuming task of slice-by-slice 3D inspection to a quick binary choice made on a 2D slice. One might worry that distilling a complex decision into a simple swipe on a smartphone adds noise; however, we demonstrated that a model can be constructed that combines ratings from many citizen scientists to almost perfectly emulate ratings obtained from expert inspection. Using braindr, citizen scientists amplified the initial expert-labelled dataset (200 3D images) to the entire dataset (> 700 3D images, > 3000 2D slices) in a few weeks. Because braindr is a lightweight web application, users could play it at any time and on any device, which allowed us to attract many users. On braindr, each slice received on average 20 ratings, and therefore each 3D brain (consisting of 5 slices) received on average 100 ratings. In short, by redesigning the way we interact with our data and presenting it in the web browser, we were able to get many more eyes on our data than would have been possible in a single research lab.
Scaling expertise through interactions between experts, citizen scientists and machine learning
We found that an interaction between experts, citizen scientists, and machine learning results in scalable decision-making on brain MRI images. Recent advances in machine learning have vastly improved image classification \cite{krizhevsky2012imagenet}, object detection \cite{girshick2014rich}, and segmentation \cite{long2015fully} through the use of deep convolutional neural networks. In the biomedical domain, these networks have been trained to accurately diagnose eye disease \cite{lee2017deep}, skin cancer \cite{esteva2017dermatologist}, and breast cancer \cite{sahiner1996classification}, to name a few applications. However, these applications require large, accurately labeled datasets. This presents an impediment for many scientific disciplines, where labeled data may be scarce or hard to come by because labeling requires labor-intensive procedures. The approach presented here addresses this fundamental bottleneck in the application of modern machine learning, and enables scientists to automate complex tasks that require substantial expertise.
A surprising finding that emerges from this work is that a deep learning algorithm can learn to match or even exceed the quality of the aggregated ratings used for training. This finding likely reflects the fact that algorithms are more reliable than humans: when an algorithm is trained to match human accuracy, it gains the added benefit of perfect reliability. For example, even an expert might not provide the exact same rating each time they see the same image, while an algorithm will. This is in line with findings from \cite{lee2017deepa}, showing that the agreement between an algorithm and any one expert can be equivalent to the agreement between any pair of experts. We have demonstrated that, while an individual citizen scientist may not provide reliable results, decisions can be accurately scaled to meet the demands of Big Data by intelligently combining a crowd with machine learning and keeping an expert in the loop to monitor results.
MRI Quality Control and Morphometrics over Development
The specific use case that we focused on pertains to the importance of quality control in large-scale MRI data acquisitions. Recently, Ducharme and colleagues \cite{Ducharme2016} stressed the importance of quality control for studies of brain development in a large cohort of 954 subjects. They estimated cortical thickness at each point of a cortical surface and fit linear, quadratic, and cubic models of thickness versus age at each vertex. Quality control was performed by visually inspecting the reconstructed cortical surfaces and removing data that failed QC from the analysis. Without stringent quality control, the best-fitting models were more complex (quadratic or cubic); with quality control, the best-fitting models were linear. They found sex differences only in occipital regions, which thinned faster in males. In the supplemental figure accompanying Figure \ref{182176}, we present an interactive chart where users can similarly explore different ordinary least squares models (linear or quadratic), optionally split by sex, for the relationship of total gray matter volume, white matter volume, CSF volume, and total brain volume with age.
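As an illustration of the kind of model comparison described above, the sketch below fits linear and quadratic ordinary least squares models of gray matter volume against age, split by sex, using statsmodels. The file and column names (hbn_morphometrics.csv, age, sex, gmv) are hypothetical placeholders, not our exact analysis code.

```python
# Minimal sketch: compare linear vs. quadratic OLS fits of gray matter volume
# against age, separately for each sex. Assumes a hypothetical per-subject table.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hbn_morphometrics.csv")  # hypothetical columns: age, sex, gmv

for sex, group in df.groupby("sex"):
    linear = smf.ols("gmv ~ age", data=group).fit()
    quadratic = smf.ols("gmv ~ age + I(age ** 2)", data=group).fit()
    # The model with the lower BIC is the preferred level of complexity
    print(sex, round(linear.bic, 1), round(quadratic.bic, 1))
```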
We chose to QC the raw MRI data in this study, rather than the processed data, because the quality of the raw data affects downstream cortical mesh generation and many other computed metrics. A large body of research in automated QC of T1-weighted images exists, in part because of large open data sharing initiatives. In 2009, Mortamet and colleagues \cite{mortamet2009automatic} developed a QC algorithm based on the background of magnitude images in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and reported a sensitivity and specificity of > 85%. In 2015, Shehzad and colleagues \cite{shehzadpreprocessed} developed the Preprocessed Connectomes Project Quality Assessment Protocol (PCP-QAP) on the Autism Brain Imaging Data Exchange (ABIDE) and Consortium for Reliability and Reproducibility (CoRR) datasets. The PCP-QAP also includes a Python library to easily compute metrics such as signal-to-noise ratio, contrast-to-noise ratio, entropy focus criterion, foreground-to-background energy ratio, voxel smoothness, and percentage of artifact voxels. Building on this work, the MRIQC package from Esteban and colleagues \cite{Esteban2017} includes a comprehensive set of 64 image quality metrics, from which a classifier was trained to predict data quality of the ABIDE dataset for new, unseen sites with 76% accuracy.
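To make the flavor of these metrics concrete, the following is a minimal sketch, not the PCP-QAP or MRIQC implementation, of two of the quantities listed above (signal-to-noise and contrast-to-noise ratio) for a T1-weighted volume. The file names and pre-computed tissue/background masks are hypothetical; in practice the masks come from a segmentation step.

```python
# Minimal sketch of SNR and CNR for a T1-weighted image, using hypothetical
# pre-computed masks for white matter, gray matter, and background (air).
import numpy as np
import nibabel as nib

img = nib.load("sub-01_T1w.nii.gz").get_fdata()           # hypothetical file names
wm = nib.load("sub-01_wm_mask.nii.gz").get_fdata() > 0    # white matter mask
gm = nib.load("sub-01_gm_mask.nii.gz").get_fdata() > 0    # gray matter mask
bg = nib.load("sub-01_bg_mask.nii.gz").get_fdata() > 0    # background (air) mask

noise = img[bg].std()                                      # noise estimated from the background
snr = img[wm].mean() / noise                               # tissue signal relative to noise
cnr = abs(img[wm].mean() - img[gm].mean()) / noise         # tissue contrast relative to noise
print(f"SNR: {snr:.1f}, CNR: {cnr:.1f}")
```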
Our strategy differed from that of the MRIQC classification study. In the Esteban 2017 study \cite{Esteban2017}, the authors labelled images that were “doubtful” in quality as a “pass” when training and evaluating their classifier. Our MRIQC classifier was trained and evaluated only on images that our raters very confidently passed or failed. Because quality control is subjective, we felt that it was acceptable for a “doubtful” image to be failed by the classifier. Since our classifier was trained on data acquired within a single study (rather than generalizing to unseen sites), and only on images that we were confident about, our MRIQC classifier achieved near-perfect accuracy with an AUC of 0.99. On the other hand, our braindr CNN was trained as a regression (rather than a classification) on the full dataset, including the “doubtful” images (i.e., those with ratings closer to 0.5), but was still evaluated as a classifier against data we were confident about. It also achieved near-perfect accuracy with an AUC of 0.99. Because both the MRIQC and braindr classifiers perform so well on data we are confident about, we contend that it is acceptable to let the classifier act as a “tie-breaker” for images that lie in the middle of the spectrum in future acquisitions of the HBN dataset.
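The evaluation strategy described above can be summarized in a few lines: the model produces a continuous quality score for every image, but AUC is computed only against the subset of images that received confident expert labels. This is a minimal sketch with hypothetical file and column names, not our exact analysis code.

```python
# Minimal sketch: evaluate continuous model outputs as a binary classifier,
# restricted to images with confident expert labels.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("predictions.csv")  # hypothetical columns: expert_score in [-5, 5], model_score in [0, 1]

confident = df[df["expert_score"].abs() >= 4]          # keep only confident passes (4, 5) and fails (-4, -5)
labels = (confident["expert_score"] > 0).astype(int)   # 1 = pass, 0 = fail
print("AUC:", roc_auc_score(labels, confident["model_score"]))
```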
Quality control of large consortium datasets, and more generally the scaling of expertise in neuroimaging, will become increasingly important as neuroscience moves towards data-driven discovery. Interdisciplinary collaboration between domain experts and computer scientists, together with public outreach to and engagement of citizen scientists, can help realize the full potential of Big Data.
Limitations
One limitation of this method is that there is an interpretability-to-speed tradeoff. Specialized QC tools were developed over many years, while this study was performed in a fraction of that time; however, specialized QC tools are far more interpretable. For example, the coefficient of joint variation (CJV) metric from MRIQC is sensitive to the presence of head motion. CJV was one of the most important features of our MRIQC classifier, implying that our citizen scientists were primarily sensitive to motion artifacts. Such a conclusion is difficult to reach by interpreting the braindr CNN. Because we employed transfer learning, the extracted features were based on the ImageNet classification task, and it is unclear how these features relate to MRI-specific artifacts. However, the interpretability of deep learning is an active field of research \cite{chakraborty2017interpretability}, and we may be able to fit more interpretable models in the future.
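For readers unfamiliar with this setup, the sketch below shows the general shape of such a transfer-learning model: an ImageNet-pretrained network acts as a frozen feature extractor, and a small head is trained as a regression on aggregated slice ratings. The base network, head, and hyperparameters shown here are illustrative assumptions and not necessarily those of the braindr CNN.

```python
# Minimal transfer-learning sketch: frozen ImageNet features + regression head
# that predicts a continuous quality score in [0, 1] for each 2D slice.
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # keep the ImageNet features fixed; train only the head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # quality score in [0, 1]
])
model.compile(optimizer="adam", loss="mse")           # regression on aggregated ratings
# model.fit(slice_images, aggregated_ratings, epochs=10)  # hypothetical training arrays
```

Because the convolutional base is frozen, only the small head needs to be learned, which is what makes training feasible on a few thousand labeled slices.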
Compared to previous efforts to train models to predict quality ratings, such as MRIQC \cite{Esteban2017}, our AUC scores are very high. There are two main reasons for this. First, in the Esteban 2017 study \cite{Esteban2017}, the authors tried to predict the quality of scans from unseen sites, whereas in our study, we combined data across the two sites from which data had been made publicly available at the time we conducted this study. Second, even though our quality ratings on the 3D dataset were continuous scores (ranging from -5 to 5), we only evaluated the performance of our models on data that received an extremely high (4, 5) or extremely low (-4, -5) score from the experts. This was because quality control is very subjective, and there is therefore more variability in ratings of images that people are unsure about: an image that was failed with low confidence (-3 to -1) by one researcher could conceivably be passed with low confidence (1 to 3) by another. Most importantly, our study had enough data that we could exclude the images within this range of relative ambiguity and still train our XGBoost model on both the braindr ratings and the MRIQC features. In studies with less data, such an approach might not be feasible.
Another limitation of this method was that our citizen scientists were primarily neuroscientists. The braindr application was advertised on Twitter (https://www.twitter.com) by the authors, whose social networks on this platform primarily consist of neuroscientists. As the original tweet travelled outside our social network, we saw more citizen scientists without experience looking at brain images on the platform, but they contributed fewer ratings than users with neuroscience experience. We also saw an overall tendency for users to incorrectly pass images. Future iterations of braindr will include a more informative tutorial and random checks with known images to make sure our players are well informed and performing well throughout the task. In this study, we were able to overcome this limitation because we had enough ratings to train the XGBoost algorithm to preferentially weight some users' ratings over others (see the sketch below).
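A minimal sketch of how such weighting can be set up is shown here: each image becomes a row whose columns hold the ratings it received from each user, a gradient-boosted model is fit against expert labels, and the per-user feature importances indicate whose ratings carry the most weight. The file and column names are hypothetical, and this illustrates the general approach rather than our exact pipeline.

```python
# Minimal sketch: weight individual citizen scientists' ratings with XGBoost.
import pandas as pd
import xgboost as xgb

ratings = pd.read_csv("braindr_ratings.csv")   # hypothetical columns: image_id, user_id, rating (0/1)
labels = pd.read_csv("expert_labels.csv")      # hypothetical columns: image_id, expert_pass (0/1)

# One row per image, one column per user; missing ratings become NaN,
# which XGBoost handles natively.
X = ratings.pivot_table(index="image_id", columns="user_id", values="rating")
y = labels.set_index("image_id").loc[X.index, "expert_pass"]

model = xgb.XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X, y)

# Users whose ratings the model relies on most heavily
weights = pd.Series(model.feature_importances_, index=X.columns).sort_values()
print(weights.tail())
```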
Future Directions
Citizen science platforms like the Zooniverse \cite{simpson2014zooniverse} enable researchers to upload tasks and engage over 1 million citizen scientists. We plan to integrate braindr into a citizen science platform like the Zooniverse, which would enable researchers to upload their own data to braindr and give them access to a diverse group of citizen scientists, rather than only neuroscientists within their social network. We also plan to reuse the braindr interface for more complicated classification tasks in brain imaging, such as the classification of ICA components as signal or noise \cite{griffanti2017hand}, or the evaluation of segmentation algorithms. Finally, integrating braindr with existing open data initiatives like OpenNeuro \cite{gorgolewski2017openneuro}, or with existing neuroimaging platforms like LORIS \cite{das2012loris}, would enable scientists to launch braindr tasks directly from these platforms, seamlessly incorporating human-in-the-loop data analysis into neuroimaging research. More generally, the principles described here motivate platforms that integrate citizen science with deep learning for Big Data applications across the sciences.