1. Model performance
The performance of all models is summarized in Figures 2-4 and Tables
5-6. At the macro-averaged level, the ensemble model performed better
than either submodel individually within each classification pass (Table
5). The addition of random artificial noise in submodel 2 improved both
precision and recall in pass 1, but only precision in pass 2 (Table 5).
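For reference, macro-averaged precision and recall weight each class
equally rather than each detection, so rare species influence the summary
as much as common ones. The sketch below (hypothetical labels and
predictions, not our evaluation code) shows one way such macro-averaged
metrics can be computed.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical per-clip labels: true class vs. class predicted by a model.
y_true = ["T_major", "background", "T_major", "species_B", "species_B", "background"]
y_pred = ["T_major", "background", "species_B", "species_B", "background", "background"]

# Macro-averaging computes precision and recall per class and then takes the
# unweighted mean, so every class contributes equally regardless of its size.
macro_precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)

print(f"macro precision = {macro_precision:.3f}, macro recall = {macro_recall:.3f}")
```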
The pass 2 ensemble model performed substantially better than the
corresponding pass 1 model (Figure 2), likely due both to the larger
training dataset used in this pass and to the fact that the pass 2
training and validation datasets both consisted of audio we collected in
the field, and were therefore more similar to one another than in pass 1.
This increased similarity between the training and
validation datasets in pass 2 is also a potential explanation for the
observed decrease in recall score with added artificial noise during
this pass, though we did not perform further analysis of this specific
result. Per-class performance was generally good, with visible
improvements from pass 1 to pass 2 for most classes, though species with
subjectively more variable vocalizations (e.g. T. major) performed less
well (Figure 3, Table 6). Intriguingly, the increase in
classification accuracy we observed at the macro-averaged level did not
hold uniformly at the class level, with submodel 1 or submodel 2 often
yielding better results than the ensemble (Table 6). An analysis of classifier score
distributions for positive detections showed increased score separation
between true positive and false positive detections in pass 2 relative
to pass 1 (Figure 4), indicating better overall predictive power for the
pass 2 model (Knight et al. 2017). We also observed that our
chosen score threshold yielded precision and recall values that were
close to the inflection point of the precision-recall curve, indicating
that this threshold was an appropriate choice for balancing the two
metrics.
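Both score-based checks described above can be illustrated with a brief
sketch (hypothetical scores and manually verified labels, not the analysis
behind Figure 4): comparing classifier score distributions for true and
false positive detections, and locating the threshold at which precision
and recall are approximately balanced along the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical per-detection classifier scores and verification labels
# (1 = true positive detection, 0 = false positive detection).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.beta(8, 2, 200),   # true positives: mostly high scores
                         rng.beta(2, 6, 200)])  # false positives: mostly low scores
labels = np.concatenate([np.ones(200), np.zeros(200)])

# Score separation: a larger gap between the two distributions means the
# classifier ranks true detections above false ones more reliably.
tp_scores = scores[labels == 1]
fp_scores = scores[labels == 0]
print(f"median TP score = {np.median(tp_scores):.2f}, "
      f"median FP score = {np.median(fp_scores):.2f}")

# Precision-recall curve: as a simple proxy for a balanced operating point,
# find the threshold where precision and recall are closest to each other.
precision, recall, thresholds = precision_recall_curve(labels, scores)
balance_idx = np.argmin(np.abs(precision[:-1] - recall[:-1]))
print(f"balanced threshold ~ {thresholds[balance_idx]:.2f} "
      f"(precision = {precision[balance_idx]:.2f}, recall = {recall[balance_idx]:.2f})")
```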