Transfer learning is another broadly applicable deep learning technique, in which a number of layers from a pretrained network are retrained for a different use case. This can drastically reduce both the training time and the size of the labelled dataset needed \cite{ahmed2008training,pan2010survey}. For example, the same transfer learning approach was used for brain MRI tissue segmentation (gray matter, white matter, and CSF) and for multiple sclerosis lesion segmentation \cite{van2015transfer}. Yet despite these advances in deep learning, one major constraint limits the generalization of these methods to new imaging problems: a large amount of labelled data is still required to train CNNs. Thus, even with cutting-edge machine learning methods available, many researchers are still confronted with the original problem: how does a single expert create an annotated dataset that is large enough to automate their expertise through machine learning?
Here, we hypothesize that expert decision making can be scaled up by citizen scientists who can learn from and amplify expert decisions, to the extent that deep learning approaches become feasible. As a proof of concept, we apply this approach to brain MRI quality control (QC): a binary classification task in which images are labelled “pass” or “fail” based on image quality. QC is a paradigmatic example of the problem of scaling expertise. It is subjective, and each researcher has their own standards as to which images pass or fail on inspection. This variability between experts has problematic effects on downstream analyses, especially statistical inference, because effect size estimates may depend on which data are entered into a statistical model. Varying QC criteria add uncertainty to these estimates and might result in replication failures. For example, the authors of \cite{ducharme2016trajectories} found that QC had a significant impact on their estimates of the trajectory of cortical thickness during development. They concluded that post-processing QC (in the form of visual inspection) is crucial for such studies, especially due to motion artifacts in younger children. It is therefore essential that we develop systems that can accurately emulate expert decisions, and that these systems are made openly available to the scientific community.
To demonstrate how citizen science and deep learning can be combined to amplify expertise in neuroimaging, we developed a citizen-science amplification and CNN procedure for the openly available Healthy Brain Network dataset (HBN; \cite{alexander2017open}). This initiative aims to collect and publicly release data on 10,000 children over the next 6 years to facilitate the study of brain development and mental health through transdiagnostic research. The rich dataset includes MRI brain scans, EEG and eye tracking recordings, extensive behavioral testing, genetic sampling, and voice and actigraphy recordings. To understand the relationship between brain structure (based on MRI) and behavior (EEG, eye tracking, voice, actigraphy, behavioral data), or the association between genetics and brain structure, researchers require high-quality MRI data.
In this study, we crowd-amplify image quality ratings and train a CNN on the first and second data releases of the HBN (n=722); the trained network can then be used to infer data quality on future data releases. We also demonstrate how the choice of QC threshold is related to the effect size estimate for the established association between age and brain tissue volumes during development \cite{Lebel2011}. Finally, we show that our approach of deep learning trained on a crowd-amplified dataset matches state-of-the-art software built specifically for image QC \cite{Esteban2017}. We therefore recommend employing our crowd-amplification method for any binary image classification task, particularly in cases where specialized, fully automated software does not exist.
Results
Overview
Our primary goals were to 1) amplify a small, expertly labelled dataset through citizen science, 2) train a model that optimally combines citizen scientist ratings to emulate an expert, 3) train a CNN on the amplified labels, and 4) evaluate its performance on a validation dataset. Figure \ref{567433} shows an overview of the procedure and provides a summary of our results. At the outset, a group of neuroimaging experts created a gold-standard quality control dataset on a small subset of the data (n=200) through extensive visual examination of the full volumes. In parallel, citizen scientists were asked to “pass” or “fail” two-dimensional axial slices from the full dataset (n=722) through a web application called braindr, which could be accessed on a desktop, tablet, or mobile phone (https://braindr.us). Amplified labels, which range from 0 (fail) to 1 (pass), were generated from citizen scientist ratings. A receiver operating characteristic (ROC) curve was generated both for the ratings averaged across citizen scientists and for labels generated by fitting a classifier that weights ratings more heavily for citizen scientists who more closely matched the experts on the subset rated by both. Next, a neural network was trained to predict the weighted labels. The area under the ROC curve (AUC) for the predicted labels on a left-out dataset was 0.99.
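To make the difference between averaged and weighted labels concrete, the following is a minimal sketch on toy data, not the procedure used in this study. It assumes a simple accuracy-based weight per rater as a stand-in for the fitted classifier, whose form is not specified in this overview. All function names, rater identifiers, and votes are hypothetical.

```python
from statistics import mean


def mean_rating_labels(ratings):
    """Average the 0/1 votes per image. ratings: (image_id, rater_id, vote)."""
    by_image = {}
    for image_id, _rater, vote in ratings:
        by_image.setdefault(image_id, []).append(vote)
    return {image_id: mean(votes) for image_id, votes in by_image.items()}


def rater_weights(ratings, gold):
    """Weight each rater by smoothed agreement with the expert (gold) labels.

    Assumption: this accuracy-based weight only stands in for a fitted
    classifier; it is not the model used in the study.
    """
    hits, totals = {}, {}
    for image_id, rater, vote in ratings:
        if image_id in gold:
            totals[rater] = totals.get(rater, 0) + 1
            hits[rater] = hits.get(rater, 0) + (vote == gold[image_id])
    # Laplace smoothing keeps weights away from exactly 0 or 1.
    return {r: (hits[r] + 1) / (totals[r] + 2) for r in totals}


def weighted_labels(ratings, weights, default_weight=0.5):
    """Weighted average of votes per image; the result lies in [0, 1]."""
    num, den = {}, {}
    for image_id, rater, vote in ratings:
        w = weights.get(rater, default_weight)
        num[image_id] = num.get(image_id, 0.0) + w * vote
        den[image_id] = den.get(image_id, 0.0) + w
    return {image_id: num[image_id] / den[image_id] for image_id in num}


def roc_auc(scores, truth):
    """AUC as the probability that a 'pass' image outscores a 'fail' image."""
    pos = [scores[i] for i in truth if truth[i] == 1]
    neg = [scores[i] for i in truth if truth[i] == 0]
    if not pos or not neg:
        raise ValueError("need at least one pass and one fail image")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: four expert-labelled images and three hypothetical raters.
gold = {"a": 1, "b": 0, "c": 1, "d": 0}
ratings = [
    ("a", "r1", 1), ("b", "r1", 0), ("c", "r1", 1), ("d", "r1", 0),
    ("a", "r2", 1), ("b", "r2", 1), ("c", "r2", 0), ("d", "r2", 0),
    ("a", "r3", 0), ("b", "r3", 1), ("c", "r3", 1), ("d", "r3", 1),
]

auc_mean = roc_auc(mean_rating_labels(ratings), gold)
auc_weighted = roc_auc(weighted_labels(ratings, rater_weights(ratings, gold)), gold)
print(auc_mean, auc_weighted)
```

In this toy example the weights are estimated and evaluated on the same four images, which is only acceptable for illustration. A realistic evaluation would estimate the weights on one part of the gold-standard subset and compute the AUC on a held-out part.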