The future

Objects in natural history collections are among the most important tools for understanding life on our planet. Mobilising the capacity to analyse billions of objects with the help of machine learning is essential to meet the challenge of conserving and sustainably using biodiversity within the alarmingly short time frames available. This paper emphasises both the huge potential and the challenges. The main limitation to achieving our vision is not the software for machine learning, nor the ideas for using it, but the accessibility of data and images of specimens in a computational environment where they can be processed efficiently. The potential applications of machine learning to specimen images can be divided into those concerning the specimen object itself and those concerning the associated labels. Future expansion of the approach will see further extraction of traits from specimen images, including 1) quantitative and qualitative analysis of organisms' structure (e.g., relative proportions, topological arrangement of organs, symmetry); 2) actual type-based taxonomy, i.e., clustering morphological groups of specimens, assessing the consistency of current taxonomic hierarchies, and building automatic identification tools through a direct link to the traits retrievable from specimens; and 3) solving metadata limitations through the analytical and comparative study of preparation and mounting “styles”, even when their identity is not explicitly linked to the specimen itself.
Many additional uses can be imagined for the analysis of non-specimen data, that is, the additional information linked to the physical object, whether written directly on attached labels or recorded in linked inventories, catalogues, or spreadsheets \citep{hardisty_digital_2022}. These include 1) extracting information on the interaction between species and the abiotic elements of their environment; 2) interpreting collection data expressed by cryptic textual elements, such as idem and ibidem, that imply a link to other text; and 3) tracing nomenclatural type material by linking data elements on type specimens. There is also enormous potential for biological collections that have so far not been the main focus of digitisation, including microscope slides of thin sections, histological preparations, or other extractions (Fig. 2B).
The uses of machine learning on collection images are numerous, but, as we have shown, the real benefits come from scaling up the approach and combining images from many collections. One can imagine research in fields ranging from morphometry, evolution, and environmental change to biomimicry and subjects in the humanities. Although imagination is the ultimate limit, we are currently constrained by the availability of infrastructure to conduct such research.

Acknowledgements

This work was supported by European Cooperation in Science and Technology (COST) as part of the Mobilise Action CA17106 on Mobilising Data, Experts and Policies in Scientific Collections. Heliana Teixeira thanks FCT/MCTES for the financial support to the host institution CESAM (UIDB/50017/2020+UIDP/50017/2020). Renato Panda was supported by Ci2 - FCT/MCTES UIDP/05567/2020. This work was also facilitated by the Research Foundation – Flanders research infrastructure under grant number FWO I001721N, and by the BiCIKL (grant agreement No 101007492) and SYNTHESYS+ (grant agreement No 823827) projects of the European Union’s Horizon 2020 Research and Innovation action.