To aggregate citizen scientist ratings, we weighted each citizen scientist according to how consistent their ratings were with the gold standard. We trained an XGBoost classifier
\cite{chen2016xgboost} implemented in Python (http://xgboost.readthedocs.io/en/latest/python/python_intro.html), using the cross-validation functions from the scikit-learn Python library
\cite{pedregosa2011scikit}. We used 600 estimators and performed a grid search with stratified 10-fold cross-validation within the training set to select the optimal maximum depth (2 or 6) and learning rate (0.01 or 0.1). Each observation (row) of the design matrix was a slice and each feature (column) a citizen scientist, with each entry set to the average rating of that citizen scientist on that slice. We trained the classifier on training splits of various sizes to assess how performance depends on training-set size (see Figure
\ref{468392}A). We used the model trained with n=670 to extract the probability scores of the classifier on all 3609 slices in braindr (see Figure
\ref{468392}B). Whereas equally weighting each citizen scientist’s ratings produces a bimodal distribution with a lower peak that is shifted up from zero (Figure
\ref{358654}A), the distribution of probability scores in Figure
\ref{468392}B better matches our expectations of the data: a bimodal distribution with peaks at 0 and 1. Feature importances extracted from the model are plotted in Figure
\ref{468392}C, and plotted against the total number of gold-standard image ratings in Figure
\ref{468392}D.