Author Summary
How do we scale procedures that currently depend on human expertise to large-scale datasets? This is a fundamental challenge in the era of Big Data, not unique to any one discipline, but particularly pertinent to computational neuroimaging. For example, when studying pediatric mental health using brain MRI scans, researchers would need to visually check the quality of hundreds of brain images. Instead, we developed a web application (https://braindr.us) that lets citizen scientists perform quality control on this large dataset by swiping right (to pass) or left (to fail) on each image. We aggregated the ratings with a machine learning model, and then trained a deep neural network to automatically predict image quality, such that it matched expert ratings. In other words, combining citizen science with deep learning through an intuitive web application enabled us to amplify and automate expertise. This procedure will be broadly applicable to the growing demands of Big Data across the sciences. An interactive version of this article is available at http://results.braindr.us.
Introduction
Many research fields, ranging from astronomy to genomics to neuroscience, are entering an era of Big Data. Large and complex datasets promise to address many scientific questions, but they also present a new set of challenges. For example, over the last few years human neuroscience has evolved into a Big Data field. In the past, individual groups would each collect their own samples of data from a relatively small group of individuals. More recently, large datasets collected from many thousands of individuals have become increasingly common. This transition has been facilitated by the assembly of large aggregated datasets, containing measurements from many individuals, collected through consortium efforts such as the Human Connectome Project \cite{glasser2016human}. These efforts, and the large datasets they are assembling, promise to enhance our understanding of the relationships between brain anatomy, brain activity, and cognition. The field is experiencing a paradigm shift \cite{Fan_2014}, in which once-established scientific procedures are being reshaped by the challenges posed by large datasets. We have seen a shift from desktop computers to cyberinfrastructure \cite{Van_Horn_2013}, from small studies siloed in individual labs to an explosion of data-sharing initiatives \cite{Ferguson_2014,Poldrack_2014}, from idiosyncratic data organization and analysis scripts to standardized file structures and workflows \cite{gorgolewski2016brain,gorgolewski2017bids}, and an overall shift toward statistical thinking and computational methods that can accommodate large datasets \cite{Fan_2014}. But one often overlooked aspect of neuroimaging protocols has not yet evolved to meet the needs of Big Data: expert decision making.
Specifically, decisions made by scientists with expertise in neuroanatomy and MRI methods (i.e., neuroimaging experts) through visual inspection of imaging data cannot be accurately scaled to large datasets. For example, an MRI image of the brain reflects extensive variation in neuroanatomy across individuals, as well as variation in image acquisition and imaging artifacts; knowing which of these variations is acceptable and which is abnormal comes with years of training and experience. Specific research questions require even more training and domain expertise in a particular method, such as tracing anatomical regions of interest (ROIs), editing fascicle models from streamline tractography \cite{Jordan_2017}, evaluating cross-modality image alignment, and quality control of images at each stage of image processing. In large datasets, especially longitudinal multisite consortium studies, these expert decisions cannot be reliably replicated: the timeframe of these studies is long, individual experts become fatigued, and training teams of experts is time-consuming, difficult, and costly. As datasets grow to hundreds of thousands of brains, it is no longer feasible to depend on manual interventions.
One solution to this problem is to train machines to emulate expert decisions. However, there are many cases in which automated algorithms exist, but expert decision-making is still required for optimal results. For example, a variety of image segmentation algorithms have been developed to replace manual ROI editing, with FreeSurfer \cite{fischl2012freesurfer}, FSL \cite{Patenaude_2011}, ANTs \cite{Avants_2011}, and SPM \cite{Ashburner_2005} all offering automated segmentation tools for standard brain structures. But these algorithms were developed for a specific type of image (T1-weighted) and a specific type of brain (that of healthy controls). Pathological brains, or those of children or the elderly, may violate the assumptions of these algorithms, and their outputs often still require manual expert editing. Similarly, in tractography, a set of anatomical ROIs can be used to target or constrain streamlines to automatically extract fascicles of interest \cite{CATANI_2008,yeatman2012tract}; but again, abnormal brain morphology resulting from pathology still requires expert editing \cite{Jordan_2017a}. The delineation of retinotopic maps in visual cortex is another task that has recently been automated \cite{Benson2014,Benson2012}, but these procedures cover only a few of the known retinotopic maps, and substantial expertise is still required to delineate the others \cite{Winawer2017,Wandell2011}. Another fundamental step in brain image processing that still requires expert examination is quality control. Several automated methods quantify image quality based on MRI physics and the statistical properties of images, and these methods have been collected under one umbrella in an algorithm called MRIQC \cite{Esteban2017}. However, these methods are specific to T1-weighted images and do not generalize to other image acquisition methods. To address all of these cases, and to scale to new, unforeseen challenges, we need a general-purpose framework that can train machines to emulate experts for any purpose, allowing scientists to fully realize the potential of Big Data.
One general solution that is rapidly gaining traction is deep learning. Specifically, convolutional neural networks (CNNs) have shown promise in a variety of biomedical image processing tasks. Modeled loosely on the human visual system, CNNs can be trained for a variety of image classification and segmentation tasks using the same architecture. For example, the U-Net \cite{ronneberger2015u}, originally built to segment neurons in electron microscopy images, has been adapted to segment macular edema in optical coherence tomography images \cite{Lee_2017} and to segment breast and fibroglandular tissue \cite{Dalm__2017}, and a 3D adaptation was developed to segment the Xenopus kidney \cite{cciccek20163d}. Transfer learning is another broadly applicable deep learning technique, in which a number of layers from a pretrained network are retrained for a different use case. This can drastically reduce the training time and the size of the labelled dataset needed \cite{ahmed2008training,pan2010survey}. For example, the same transfer learning approach was used for brain MRI tissue segmentation (gray matter, white matter, and CSF) and for multiple sclerosis lesion segmentation \cite{van2015transfer}. Yet despite these advances, one major constraint limits the generalization of these methods to new imaging problems: a large amount of labelled data is still required to train CNNs. Thus, even with cutting-edge machine learning methods available, researchers seeking to automate these processes are still confronted with the original problem: how does a single expert create an annotated dataset large enough to train an algorithm to automate their expertise through machine learning?
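To make the transfer learning idea concrete, the sketch below shows the general pattern in Keras: freeze a network pretrained on ImageNet and train only a small new head for a binary classification task. The choice of VGG16, the input shape, and the head architecture are illustrative assumptions, not a description of the network trained in this study.

```python
# Illustrative sketch of transfer learning (hypothetical choices throughout):
# reuse a network pretrained on ImageNet and train only a small new head
# for a binary image-classification task.
import tensorflow as tf
from tensorflow.keras import layers, models

# Pretrained convolutional feature extractor, classification head removed.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained layers

# Small trainable head for the new task (e.g., pass/fail quality control).
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of "pass"
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# model.fit(train_images, train_labels, ...)  # requires labelled training data
```

Because only the small head is trained, far fewer labelled images are needed than when training the full network from scratch.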
We propose that citizen scientists are a solution. Specifically, we hypothesize that citizen scientists can learn from, and amplify, expert decisions to the extent that deep learning approaches become feasible. Rather than labelling hundreds or thousands of training images, an expert can employ citizen scientists to help with this task, and machine learning can identify which citizen scientists provide expert-quality data. As a proof of concept, we apply this approach to brain MRI quality control (QC): a binary classification task in which images are labelled “pass” or “fail” based on image quality. QC is a paradigmatic example of the problem of scaling expertise, because a large degree of subjectivity remains in the process. Each researcher has their own standards as to which images pass or fail on inspection, and this variability can have problematic effects on downstream analyses, especially statistical inference. Effect size estimates depend on the input data to a statistical model; varying QC criteria adds uncertainty to these estimates and may contribute to replication failures. For example, in \cite{ducharme2016trajectories}, the authors found that QC had a significant impact on their estimates of the trajectory of cortical thickness during development. They concluded that post-processing QC (in the form of expert visual inspection) is crucial for such studies, especially because of motion artifacts in younger children. While this was feasible in their study of 398 subjects, it would not be possible for larger studies like the ABCD study, which aims to collect data on 10,000 subjects longitudinally \cite{casey2018adolescent}. It is therefore essential that we develop systems that accurately emulate expert decisions, and that these systems are made openly available to the scientific community.
To demonstrate how citizen science and deep learning can be combined to amplify expertise in neuroimaging, we developed a citizen-science amplification and CNN procedure for the openly available Healthy Brain Network dataset (HBN; \cite{alexander2017open}). The HBN initiative aims to collect and publicly release data on 10,000 children over the next 6 years to facilitate the study of brain development and mental health through transdiagnostic research. This rich dataset includes MRI brain scans, EEG and eye-tracking recordings, extensive behavioral testing, genetic sampling, and voice and actigraphy recordings. To understand the relationship between brain structure (based on MRI) and behavior (EEG, eye tracking, voice, actigraphy, behavioral data), or the association between genetics and brain structure, researchers require high-quality MRI data.
In this study, we crowd-amplify image quality ratings and train a CNN on the first and second data releases of the HBN (n=722); the trained network can then be used to infer data quality on future data releases. We also demonstrate how the choice of QC threshold affects the estimated effect size of the established association between age and brain tissue volumes during development \cite{Lebel2011}. Finally, we show that our approach of deep learning trained on a crowd-amplified dataset matches state-of-the-art software built specifically for image QC \cite{Esteban2017}. We conclude that this novel method of crowd-amplification is broadly applicable across scientific domains in which manual inspection by experts is still the gold standard.
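As a minimal sketch of how such a threshold analysis can be run (the file and column names below are hypothetical, not those of the HBN release), the effect size can simply be re-estimated at each candidate QC threshold:

```python
# Minimal sketch (hypothetical file and column names): re-estimate the
# association between age and gray matter volume at increasingly strict
# QC thresholds, to see how the threshold shifts the effect size.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("subjects.csv")  # assumed columns: age, gm_volume, qc_score

for threshold in np.linspace(0.0, 0.9, 10):
    passing = df[df["qc_score"] >= threshold]  # keep only passing scans
    fit = sm.OLS(passing["gm_volume"],
                 sm.add_constant(passing["age"])).fit()
    # The slope and its uncertainty may shift as low-quality scans are excluded.
    print(f"threshold={threshold:.1f}  n={len(passing)}  "
          f"slope={fit.params['age']:.2f}  p={fit.pvalues['age']:.3g}")
```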
Results
Overview
Our primary goals were to 1) amplify a small, expertly labelled dataset through citizen science, 2) train a model that optimally combines citizen scientist ratings to emulate an expert, 3) train a CNN on the amplified labels, and 4) evaluate its performance on a validation dataset. Figure \ref{567433} shows an overview of the procedure and summarizes our results. At the outset, a group of neuroimaging experts created a gold-standard quality control dataset on a small subset of the data (n=200) through extensive visual examination of the full 3D volumes. In parallel, citizen scientists were asked to “pass” or “fail” two-dimensional axial slices from the full dataset (n=722) through a web application called braindr (https://braindr.us), accessible from a desktop, tablet, or mobile phone. Amplified labels, ranging from 0 (fail) to 1 (pass), were generated from the citizen scientists' ratings. Receiver operating characteristic (ROC) curves were computed against the gold standard for two aggregation schemes: ratings averaged across citizen scientists, and labels generated by a classifier that weights each citizen scientist's ratings according to how closely they matched the experts on the subset rated by both. Finally, a neural network was trained to predict the weighted labels; the area under the ROC curve (AUC) for its predictions on a left-out dataset was 0.99.
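The aggregation step can be sketched as follows, using synthetic stand-ins for the real ratings; the logistic regression shown here is a simple stand-in for whichever classifier is used to learn per-rater weights:

```python
# Minimal sketch (synthetic data): compare unweighted averaging of
# citizen-scientist ratings with a classifier that learns a weight per
# rater from agreement with the expert gold standard.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(200, 20)).astype(float)  # images x raters
gold = rng.integers(0, 2, size=200)  # expert pass/fail labels

# Baseline: unweighted mean rating per image.
print("AUC (mean rating):", roc_auc_score(gold, ratings.mean(axis=1)))

# Weighted: each rater gets a coefficient, so raters who agree with the
# experts contribute more to the aggregated label.
clf = LogisticRegression().fit(ratings, gold)
print("AUC (weighted):", roc_auc_score(gold, clf.predict_proba(ratings)[:, 1]))
```

In practice, the weighted labels would be evaluated on held-out data (e.g., via cross-validation) rather than on the same subset used to fit the weights.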