Discussion
Morphometric characters proved reproducible in terms of inter-gauger
agreement. The eleven gaugers successfully arrived at the same
two-species conclusion despite a great variety of morphometric skills
and microscopic equipment of differing quality. The PERMANOVA test
revealed no significant gauger effect on the species identity
(R² = 0.69, p = 0.58). Across all gaugers, the misidentification rate at
the specimen level was only 1.0% of a total of 198
determinations. The non-parametric Spearman’s rank correlation revealed
that gauger ICC scores and morphometric skills were significantly
correlated, whereas repeatability parameters and maximum magnification
used by the gauger were not significantly correlated. These results
indicate that both observer experience (Fig. 4) and better optical
resolution in microscopes reduce measurement error and increase
repeatability (Table 3, Fig. 4).
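The rank correlation mentioned above can be illustrated with a minimal sketch. It assumes that Spearman's rho is computed as the Pearson correlation of tie-averaged ranks; the function names and the input values are hypothetical and are not data from this study. No p-value is computed here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for illustration only (assumed, not from the study):
# a self-assessed skill score and a per-gauger ICC score for six gaugers.
skill_score = [1, 2, 2, 3, 4, 5]
gauger_icc = [0.52, 0.60, 0.58, 0.71, 0.80, 0.86]
rho = spearman_rho(skill_score, gauger_icc)
```

Because only ranks enter the calculation, the result does not depend on the scale on which skill or ICC is expressed, which is why a rank-based test suits an ordinal skill score.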
In analyzing mean intra-gauger agreement character-wise, the mean ICC
scores (R) varied between 0.471 in the least reproducible character and
0.872 in the most reproducible character. This rather low average
reproducibility may have several causes. One of these may be the
absolute physical size of a trait. Traits with smaller sizes tended to
have lower ICC scores, but when we tested this with a generalized linear
model (GLM) analysis there was no significant correlation between trait
size and ICC score. This non-significance may be explained by the rather
large minimum trait size (155 µm) in the Nesomyrmex test
organisms where the given differences in resolution and magnification of
the optical systems did not play a major role. The situation might
change dramatically if, for instance, 25-µm long antennal segments of
tiny Plagiolepis ants were to be measured. Such a task requires
measurement conditions like those available to the gaugers
MYRM_60000_360x and MYRM_5000_288x.
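For readers unfamiliar with the ICC, the following is a minimal sketch of how such a score can be obtained for a single character. It assumes a one-way random-effects model, ICC(1,1); the text above does not state which ICC form was used, so this choice, the function name, and the example measurements are all assumptions made for illustration.

```python
from statistics import mean


def icc_oneway(table):
    """One-way random-effects ICC(1,1) for a specimens x gaugers table.

    table[i][j] is the measurement of specimen i taken by gauger j.
    (Assumed ICC form; the study may have used a different one.)
    """
    n = len(table)      # number of specimens
    k = len(table[0])   # number of gaugers measuring each specimen
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    # Mean square between specimens.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Mean square within specimens (disagreement among gaugers).
    msw = sum(
        (v - m) ** 2 for row, m in zip(table, row_means) for v in row
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical measurements in µm (assumed, not from the study):
# four specimens, each measured by three gaugers.
measurements_um = [
    [155, 157, 154],
    [210, 208, 212],
    [182, 185, 181],
    [240, 238, 243],
]
icc = icc_oneway(measurements_um)
```

The score approaches 1 when disagreement among gaugers is small relative to the true differences between specimens, so a character with little variation between specimens can score low even when the absolute measurement error is modest.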
If mean trait size does not contribute much to the rather low ICC scores
in the present study, these scores are probably better explained by a
combination of the ten error sources specified for
stereomicroscopy by Seifert (2002). It is impossible to determine which of
these caused major disturbances in this study. All observers received
verbal and picture-assisted character definitions (see Fig. 2 and Table
1) but were given no further advice or protocols on how to minimize
stereomicroscopic measuring errors. Firstly, whether all observers
avoided the parallax error is unknown. Secondly, whether all observers
used an X-Y-Z-stage for spatial positioning of specimens (see Fig. 1 in
Seifert, 2002) and which position stability this stage had are also
unknown. In spatial positioning, it is important to place the two
endpoints of a measurement in the same visual plane; this can be done
more accurately the shallower the depth of focus or the higher the
magnification of the optical system. Thirdly, the performance and
reliability (e.g.,
ratchet-step error) of the zoom microscopes used by gaugers in this
study are unknown. Fourthly, it is unknown how the observers made their
readings (by one tenth of a graduation mark, by entire graduation marks,
by digital read-out systems, etc.). A fifth important error source is
the observer-specific, ambiguous interpretation of character definitions. These
factors highlight the importance of presenting unambiguous character
definitions and proposing accurate measurement procedures (see
supplementary file SI4, the measuring protocol of the most advanced
observer).
To conclude, despite the above-mentioned uncertainties, which are common
in regular practice in insect taxonomic research, morphometry has proven
reproducible in our test setting. We believe that the best morphological
work may be achieved through multi-modal means, such as combining multiple
microscopic and morphometric methods (e.g., Richter et al., 2018; Sarnat
et al., 2019; Hita-Garcia et al., 2019; Boudinot, 2019; Keklikoglou et
al., 2019; Braga et al., 2019). Given the same size range of measured
traits, the same range of observers’ skill, and the same range of
equipment, we expect the same reproducibility for other groups of
arthropods, provided these have a similar exoskeleton stability and that
specimens belong to a comparable developmental stage. Apart from this,
we encourage research teams to replicate this study with taxa of
different size classes, such as tiny parasitic wasps and larger
grasshoppers or crickets. The requirements for equipment will change,
but we are keen to know if the basic conclusions prove comparable to our
results with Nesomyrmex ants.