Introduction

The phenotype of organisms varies continuously during development and through evolutionary time. Continuous morphological variation is captured for numerous purposes in the life sciences via the practice of morphometry: the measurement of the size and shape of anatomical forms. Morphometry has yielded novel findings in evolution (Esquerré et al., 2020) and has been applied to the study of fluctuating asymmetry (Palmer, 1993; Klingenberg, 2015), ontogeny (Csősz & Majoros, 2009; Shingleton et al., 2007), and ecomorphism (Mahendiran et al., 2018; Tomiya & Meachen, 2018; Anderson et al., 2019), as well as in human clinical practice (Bartlett & Frost, 2008). Among other applications, morphometric data are also key for alpha taxonomy, the discipline of formally differentiating and describing species and higher taxa. This is exemplified by the development of phenetics in the twentieth century (Michener & Sokal, 1957; Sokal & Sneath, 1963) and by numerous modern studies in other frameworks, such as for plants (Savriama, 2018; Chuanromanee, Cohen & Ryan, 2019), animals (Villemant, Simbolotti & Kenis, 2007; Inäbnit, 2019), and other organisms (Fodor et al., 2015; McMullin et al., 2018). Continuous data are also valuable for modeling evolutionary histories (e.g., Parins-Fukuchi, 2017, 2020). Thus, the morphometric approach constitutes a fundamental practice for the study of phenotypes in biodiversity research.
Morphology is traditionally considered to comprise both continuous and discrete traits (Aristotle, 350; Thompson, 1917; Rensch, 1947; Remane, 1952). Discrete states were established as the basic comparative units in animal alpha taxonomy from its formalization (Linnaeus, 1758), and have become a key means of scoring data for phylogenetic analysis, particularly after Hennig (1950, 1966). The reproducibility of scoring discrete states is an issue, however, as qualitative perception of phenotype not only requires specific training and considerable experience but can also be plagued by arbitrariness (Bond & Beamer, 2006), meaning that variation may simply come from individual (mis-)interpretation. The qualitative approach commonly uses verbal species descriptions that are often subjective or difficult to articulate. Therefore, information transfer, if at all reliable, is based on one-to-one knowledge-sharing mechanisms and requires logically structured linguistic hierarchies such as the Hymenoptera Anatomy Ontology (Yoder, Mikó, Seltmann, Bertone & Deans, 2010).
In contrast to this relatively idiosyncratic approach, morphometry is considered transferable. It converts variation in the shape and size of anatomical traits, and in the number and arrangement of anatomical elements, into numerical values, allowing for the dissemination of reproducible, phenotype-based knowledge. Today, an increasing number of morphology-based insect alpha-taxonomists use morphometric data and provide numeric keys to species (Steiner, Schlick-Steiner & Moder, 2006; Csősz, Heinze & Mikó, 2015; Seifert, 2018). If observers arrive at the same conclusion by measuring traits according to the same protocol, findings are believed to be reliable and transferable. If one observer can measure a trait, anyone else should be able to reproduce that measurement.
However, measurements come with error. Agreement among different observers and within a single observer’s measurements is affected by a number of factors, such as the skills of the observer (if human input is required), the precision and accuracy of the equipment, and how clearly the character recording protocol is written and how well it is understood. All of these sources of uncertainty are common in practice, and the fact that it is impossible to control every source of measurement variation challenges morphometry-based research (Wolak, Fairbairn & Paulsen, 2012). An understanding of the degree to which measurement errors may affect the transferability of findings is urgently needed. During the last few decades, reproducibility issues have been studied in vertebrate systematics (e.g., Oxnard, 1983; Corruccini, 1988; Yezerinac, Lougheed & Handford, 1992; Helm & Albrecht, 2000; Takacs, Vital, Ferincz & Staszny, 2016; Fox, Veneracion & Blois, 2020), clinical research (e.g., Bland & Altman, 1986; Ridgway et al., 2008; Phexell et al., 2019), social science (e.g., Salganik et al., 2020), molecular phylogeny and genetic clustering (e.g., Huelsenbeck, 1998; Jones et al., 1998; DeBiasse & Ryan, 2019), and morphometric data generally (Andrew et al., 2015). However, to date, reproducibility assessments of morphometric data in entomology are extremely limited (Mutanen & Pretorius, 2007; Johnson et al., 2013).
To address the question “to what extent is insect morphometry reproducible?”, we compiled a broad database of morphometric data and performed robust statistical analyses. We used ants, a group in which the application of morphometric data has a long tradition (e.g., Brown, 1943; Brian & Brian, 1949), as a model organism. Morphometry has been employed widely in recent myrmecological studies (e.g., Ward, 1999; Baroni Urbani, 1998; Seifert, 1992, 2003, 2019; Csősz, Heinze & Mikó, 2015; Wagner et al., 2017) as the primary method of interpreting anatomical forms and their variation. Eleven participants of diverse levels of skill and expertise, working with different taxonomic routines across three continents and six countries, were asked to perform repeated measurements on the same set of ant specimens, according to the same measurement protocol, with their own equipment. The wide range of morphometric skills and of microscope quality provided us with an overview of the level of reproducibility of morphometric interpretation as it works in daily practice. Our findings are a first step in exploring the reproducibility of morphometric data across entomology.
Terminology [Textbox 1.]
A number of terms (e.g., “accuracy”, “precision”, “reliability”, “repeatability”, and “reproducibility”) commonly used in association with repeatability studies are defined differently in the literature. To make scientific discourse less ambiguous, we propose to adopt in biological systematics the standard terminology of the National Institute of Standards and Technology of the USA (NIST; Taylor & Kuyatt, 2001) and the terms proposed by Bartlett & Frost (2008):
● Accuracy describes the average closeness of the measurement(s) to the value of the measurand (= subject or quantity to be measured) (Fig. 1). Accuracy is affected by systematic and random error. We follow the terminology proposed by the NIST in using the phrase “the value of the measurand” instead of the often-applied “true value of the measurand” (or “a true value”) (Taylor & Kuyatt, 2001).
● Precision refers to the closeness of agreement between repeated measurements made on the same measurand and applying the same protocol. Precise measurements are tightly clustered, but are not necessarily accurate, i.e. close to the value of the measurand (Fig. 1). Precision is affected by random error.
● Reliability refers to the amount of measurement error that occurs between observed measurements compared to the inherent amount of variability that occurs between measurands (Bartlett & Frost, 2008).
● Repeatability refers to the degree of agreement between repeat measurements made on the same measurand under the same conditions, i.e. made by the same observer, using the same microscope, following the same measurement protocol (Taylor & Kuyatt, 2001). Repeatability can be assessed via intra-class correlation (ICC, see Lessells & Boag, 1987).
● Reproducibility refers to the degree of agreement between measurements made on the same measurand under changing conditions, such as changing principle, method of measurement, observer, instrument, etc. (Taylor & Kuyatt, 2001).
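To make the definitions above concrete, the following Python sketch shows one way these quantities could be computed. It is an illustration under stated assumptions and not the analysis used in this study: the function names and the example values are hypothetical, accuracy and precision are summarized simply as the bias and the sample standard deviation, and repeatability is estimated with the one-way ANOVA intra-class correlation described by Lessells & Boag (1987).

```python
from statistics import mean, stdev


def bias_and_spread(measurements, reference):
    """Summarize accuracy and precision of repeated measurements.

    Returns (bias, spread): bias is the mean deviation from the reference
    value of the measurand (systematic error); spread is the sample standard
    deviation (random error). Both names are illustrative.
    """
    return mean(measurements) - reference, stdev(measurements)


def repeatability_icc(groups):
    """One-way ANOVA intra-class correlation (after Lessells & Boag, 1987).

    groups: list of lists; each inner list holds the repeated measurements
    made on one specimen under the same conditions.
    """
    a = len(groups)                      # number of specimens
    sizes = [len(g) for g in groups]     # repeats per specimen
    total_n = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / total_n

    # Sums of squares among specimens and within specimens.
    ss_among = sum(n * (mean(g) - grand_mean) ** 2
                   for g, n in zip(groups, sizes))
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_among = ss_among / (a - 1)
    ms_within = ss_within / (total_n - a)

    # n0 corrects for unequal numbers of repeats per specimen.
    n0 = (total_n - sum(n * n for n in sizes) / total_n) / (a - 1)

    var_among = (ms_among - ms_within) / n0
    return var_among / (var_among + ms_within)


if __name__ == "__main__":
    # Hypothetical repeated measurements (mm) of one trait on three specimens.
    repeats = [[1.00, 1.02], [1.20, 1.18], [0.90, 0.92]]
    print("repeatability (ICC):", repeatability_icc(repeats))

    # Precise but inaccurate: tightly clustered, yet offset from the reference.
    print("bias, spread:", bias_and_spread([1.10, 1.11, 1.09], reference=1.00))
```

In the hypothetical data, the differences among specimens are large relative to the differences between repeats, so the estimated repeatability is close to 1. The second call illustrates measurements that are precise but not accurate.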