Discussion

Morphometric characters proved reproducible in terms of inter-gauger agreement. The eleven gaugers arrived at the same two-species conclusion despite widely varying morphometric skills and microscopic equipment of differing quality. The PERMANOVA test revealed no significant gauger effect on species identity (R2 = 0.69, p = 0.58). Across all gaugers, the rate of misidentification at the specimen level was only 1.0% of a total of 198 determinations. The non-parametric Spearman’s rank correlation revealed that gauger ICC scores and morphometric skills were significantly correlated, whereas repeatability parameters and maximum magnification used by the gauger were not significantly correlated. These results indicate that both observer experience (Fig. 4) and better optical resolution in microscopes reduce measurement error and increase repeatability (Table 3, Fig. 4).
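The rank correlation mentioned above can be illustrated with a minimal sketch. It computes only the Spearman coefficient (with average ranks for ties), not its significance test, and it is not the analysis script used in this study. The function names and the skill and ICC values are hypothetical and serve only to show the calculation.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical example data (NOT taken from the study):
# a skill score and a mean ICC score for each of six gaugers.
skill_score = [1, 2, 2, 3, 4, 5]
mean_icc = [0.52, 0.60, 0.58, 0.71, 0.69, 0.85]

print(round(spearman_rho(skill_score, mean_icc), 3))
```

A coefficient close to 1 would correspond to ICC scores rising monotonically with skill. Whether an observed coefficient is significant depends on the number of gaugers and requires a separate test.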
In the character-wise analysis of mean intra-gauger agreement, the mean ICC scores (R) ranged from 0.471 for the least reproducible character to 0.872 for the most reproducible character. This rather low average reproducibility may have several causes. One of these may be the absolute physical size of a trait. Smaller traits tended to have lower ICC scores, but a generalized linear model (GLM) analysis showed no significant relationship between trait size and ICC score. This non-significance may be explained by the rather large minimum trait size (155 µm) in the Nesomyrmex test organisms, at which the given differences in resolution and magnification of the optical systems did not play a major role. The situation might change dramatically if, for instance, 25-µm-long antennal segments of tiny Plagiolepis ants were to be measured. Such a task requires measurement conditions like those available to the gaugers MYRM_60000_360x and MYRM_5000_288x.
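To make the ICC scores discussed above more concrete, the following minimal sketch computes a one-way random-effects intraclass correlation from a table of repeated measurements. This particular estimator is an assumption chosen for illustration and may differ from the one used in this study. The function name and the measurement values are hypothetical.

```python
def icc_oneway(table):
    """One-way random-effects ICC for a complete table of measurements.

    Each row holds the repeated measurements of one specimen (for one
    character); each column is one measurement occasion or gauger.
    """
    n = len(table)       # number of specimens
    k = len(table[0])    # measurements per specimen
    grand_mean = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]

    # Between-specimen and within-specimen sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum(
        (value - m) ** 2 for row, m in zip(table, row_means) for value in row
    )

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical measurements in µm (NOT taken from the study):
# four specimens, each measured three times.
measurements = [
    [155.0, 158.0, 153.0],
    [171.0, 169.0, 174.0],
    [162.0, 166.0, 160.0],
    [180.0, 177.0, 181.0],
]

print(round(icc_oneway(measurements), 3))
```

With this estimator, a value near 1 means that almost all variance lies between specimens rather than between repeated measurements of the same specimen. The same absolute measurement error therefore lowers the score more strongly when the specimens differ little in the measured trait.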
If mean trait size does not contribute much to the rather low ICC scores in the present study, these data are probably better explained by a combination of the ten error sources specified for stereomicroscopy by Seifert (2002). It is impossible to determine which of these caused major disturbances in this study. All observers received verbal and picture-assisted character definitions (see Fig. 2 and Table 1) but were given no further advice or protocols on how to minimize stereomicroscopic measuring errors. Firstly, whether all observers avoided the parallax error is unknown. Secondly, whether all observers used an X-Y-Z stage for spatial positioning of specimens (see Fig. 1 in Seifert, 2002), and how stable the positioning of this stage was, is also unknown. In spatial positioning, it is important to place the two endpoints of a measurement in the same visual plane, which can be done more accurately the lower the depth of focus or the higher the magnification of the optical system. Thirdly, the performance and reliability (e.g., ratchet-step error) of the zoom microscopes used by gaugers in this study are unknown. Fourthly, it is unknown how the observers made their readings (to one tenth of a graduation mark, to entire graduation marks, with digital read-out systems, etc.). A fifth important error source is the observer-specific, ambiguous interpretation of character definitions. These factors highlight the importance of presenting unambiguous character definitions and proposing accurate measurement procedures (see supplementary file SI4, the measuring protocol of the most advanced observer).
To conclude, despite the above-mentioned uncertainties, which are common in regular practice in insect taxonomic research, morphometry has proven reproducible in our test setting. We believe that the best morphological work may be done through multi-modal means, such as combining multiple microscopic and morphometric methods (e.g., Richter et al., 2018; Sarnat et al., 2019; Hita-Garcia et al., 2019; Boudinot, 2019; Keklikoglou et al., 2019; Braga et al., 2019). Given the same size range of measured traits, the same range of observers’ skill, and the same range of equipment, we expect the same reproducibility for other groups of arthropods, provided these have a similar exoskeleton stability and that specimens belong to a comparable developmental stage. Apart from this, we encourage research teams to replicate this study with taxa of different size classes, such as tiny parasitic wasps and larger grasshoppers or crickets. The requirements for equipment will change, but we are keen to know whether the basic conclusions prove comparable to our results with Nesomyrmex ants.