Aggregating Citizen Scientist Ratings to Emulate Expert Labels

Citizen scientists who rated images through the braindr web application differed substantially in how well their ratings matched the experts’ ratings on the gold-standard subset: some provided high-quality ratings that agreed with the experts most of the time, while others were variable and unreliable. To capitalize on citizen scientists and amplify expert ratings to new data, we learned a weight for each citizen scientist based on how reliably their ratings matched the expert ratings on slices from the gold-standard set. We used the XGBoost algorithm \cite{Chen2016}, an ensemble method that combines a set of weak learners (decision trees) to fit the gold-standard labels based on a set of features. In our case, the features were each citizen scientist’s average rating of the slice image (some images were viewed and rated more than once, so image ratings could vary between 1, always “pass”, and 0, always “fail”). We then used the learned weights to combine the citizen scientists’ ratings and predict the labels of the held-out test set (a minimal sketch of this procedure is given at the end of this section).

Figure \ref{468392}A shows ROC curves for classification on the held-out test set at different training set sizes, compared to the ROC curve of a baseline model in which every citizen scientist was assigned an equal weight. The AUC of the XGBoost-aggregated labels (0.97) improved on the AUC of the equally weighted labels (0.95). Using the model trained on two-thirds of the gold-standard data (n=670 slices), we extracted the classifier’s probability scores on all slices (Figure \ref{468392}B). The distribution of these scores matches our expectations of the data: it is bimodal, with peaks at 0 and 1, reflecting that images are mostly perceived as either “passing” or “failing”.

The XGBoost model also calculates a feature importance score, F: the number of times that a feature (in our case, an individual citizen scientist) is used to split a tree node, summed over all boosted trees. Figure \ref{468392}C shows the feature importance of each citizen scientist, and Figure \ref{468392}D shows the relationship between a citizen scientist’s importance and the number of images they rated. In general, the more images a citizen scientist rated, the more important they were to the model. There are exceptions, however: some citizen scientists rated many images incorrectly or unreliably, so the model assigned them little weight during aggregation.
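The sketch below illustrates the aggregation procedure under stated assumptions: a hypothetical ratings table with columns rater_id, slice_id, and rating (0=fail, 1=pass), a hypothetical file of binary gold-standard labels, and illustrative hyperparameters. The file names, column names, and parameter values are placeholders, not the ones used in the study.

\begin{verbatim}
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical inputs: one row per individual rating, and expert (0/1)
# labels for the gold-standard slices.
ratings = pd.read_csv("ratings.csv")  # columns: rater_id, slice_id, rating
gold = pd.read_csv("gold_labels.csv", index_col="slice_id")["label"]

# Feature matrix: one row per slice, one column per citizen scientist,
# holding that rater's mean rating of the slice (repeat views averaged).
# Slices a rater never saw stay NaN; XGBoost handles missing values
# natively by learning a default branch direction at each split.
X = ratings.pivot_table(index="slice_id", columns="rater_id",
                        values="rating", aggfunc="mean")
X.columns = X.columns.astype(str)  # xgboost expects string feature names
X_gold = X.loc[gold.index]

# Train on two-thirds of the gold-standard slices, test on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X_gold, gold, train_size=2 / 3, random_state=0)

model = xgb.XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)

# Baseline: equal weight per rater, i.e. the mean of the available
# ratings for each slice (0.5 if, hypothetically, no one rated it).
baseline = X_test.mean(axis=1, skipna=True).fillna(0.5)

print("XGBoost AUC:     ",
      roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print("Equal-weight AUC:", roc_auc_score(y_test, baseline))

# Probability scores for all slices, not just the gold-standard subset.
all_scores = model.predict_proba(X)[:, 1]
\end{verbatim}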
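The bimodal score distribution described above can be checked directly by plotting a histogram of the classifier’s probability scores. This continues the sketch above (reusing its all_scores array) and assumes matplotlib is available.

\begin{verbatim}
import matplotlib.pyplot as plt

# Histogram of P(pass) across all slices; peaks near 0 and 1 indicate
# that most images are confidently perceived as "fail" or "pass".
plt.hist(all_scores, bins=50)
plt.xlabel("classifier probability score, P(pass)")
plt.ylabel("number of slices")
plt.show()
\end{verbatim}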
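The F score described above corresponds to xgboost’s “weight” importance type, which counts how many times each feature is used to split a node, summed over all trees. Continuing the sketch, the per-rater F scores can be read off the fitted model and compared with the number of images each rater saw; model and X are assumed to come from the training sketch.

\begin{verbatim}
# "weight" importance = number of splits on each rater's column,
# summed over all boosted trees (the F score in the text).
f_scores = model.get_booster().get_score(importance_type="weight")

# Number of distinct slices each citizen scientist rated
# (non-missing entries in their feature column).
n_rated = X.notna().sum(axis=0)

for rater, f in sorted(f_scores.items(), key=lambda kv: -kv[1]):
    print(f"rater {rater}: F = {f}, images rated = {n_rated[rater]}")
\end{verbatim}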