The future
Objects in natural history collections represent one of the most
important tools to understand life on our planet. Mobilising the
capacity to analyse billions of objects with the help of machine
learning is essential to meet the challenge of conserving and
sustainably using biodiversity in the alarmingly short time-frames
available. This paper emphasises both the huge potential of this
approach and its challenges. The main limitation to achieving our vision
is not the software for machine learning, nor the ideas for using it,
but the accessibility of
data and images of specimens in a computational environment where they
can be processed efficiently. The potential applications of machine
learning to specimen images can be divided into those concerning the
specimen object itself and those concerning the associated labels.
Future expansion of the approach will see further extraction of traits
from specimen images, including 1) the quantitative and qualitative
analysis of organism structure (e.g., relative proportions, topological
arrangement of organs, symmetry); 2) actual type-based taxonomy, i.e.,
clustering morphological groups of specimens, assessing the consistency
of current hierarchies in taxonomy and building automatic identification
tools through a direct link to the traits retrievable from specimens;
and 3) solving metadata limitations through the comparative analysis of
preparation and mounting “styles”, even when their identity is not
explicitly linked to the specimen itself.
Many additional uses can be imagined for the analysis of non-specimen
data, that is, the additional information linked to the physical
object, whether written directly on attached labels or held in linked
inventories, catalogues, or spreadsheets \citep{hardisty_digital_2022}. These
include 1) extracting information on the interaction between species
and the abiotic elements of their environment; 2) interpreting collection
data expressed by cryptic textual elements, such as idem and ibidem, that
imply a link to other text; and 3) tracing nomenclatural type material by
linking data elements on type specimens.
There is also enormous potential for biological collections that have so
far not been the main focus of digitization, including microscope slides
of thin sections, histological or other extractions (Fig. 2B).
The uses of machine learning on collection images are numerous but, as
we have shown, the real benefits come from scaling up the approach and
being able to combine images from many collections. One can imagine
research in fields ranging from morphometry, evolution and environmental
change to biomimicry and subjects in the humanities. Although
imagination is the ultimate limit, we are currently limited by the
availability of infrastructure to conduct such research.
Acknowledgements
This work was supported by European Cooperation in Science and
Technology (COST) as part of the Mobilise Action CA17106 on Mobilising
Data, Experts and Policies in Scientific Collections. Heliana Teixeira
thanks FCT/MCTES for the financial support to the host institution CESAM
(UIDB/50017/2020+UIDP/50017/2020). Renato Panda was supported by Ci2 -
FCT/MCTES UIDP/05567/2020. This work was also facilitated by the
Research Foundation – Flanders research infrastructure under grant
number FWO I001721N, and by the BiCIKL (grant agreement No 101007492) and
SYNTHESYS+ (grant agreement No 823827) projects of the European Union’s
Horizon 2020 Research and Innovation action.