Many specimens are signed by their collector, their determiner, or both (Fig. \ref{478509}). Expert curators within an institution learn to recognise the signatures of prolific collectors, but without that knowledge these signatures are often illegible. Yet it is common practice to use the name of a collector, together with their collecting number, to uniquely identify a gathering (collection event). Furthermore, owing to exchanges, loans and gifts, a collector's specimens may be spread across a number of institutions. If the name is not distinct enough to be transcribed accurately, finding the specimens of a specific collector across the whole corpus of global collections would be impossible without some automated process.

Unsupervised learning

The stacked layers of deep neural networks can be regarded as a set of transformations that learn useful representations of the input data. Using representations of specimen images learned by neural networks, rather than extracted metadata, allows content-based interaction with, and comparison between, images. Such interaction is useful for tasks where a high-quality labelled dataset does not yet exist, or where the characteristics of a specimen that matter for a task are not well defined. For instance, \citet{white_evaluating_2019} used representations of specimen images learned by a neural network trained to classify fern genera to compare specimen morphology directly and test biogeographic hypotheses. Similarly, \citet{hoyal_cuthill_deep_2019} trained a network to estimate the similarity of two sets of butterfly specimen images and used the learned representations to test mimicry hypotheses.
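To make this concrete, the Python sketch below extracts penultimate-layer representations from a convolutional network and compares two specimens by cosine similarity. It is a minimal illustration only: the ImageNet-pretrained ResNet-50 is a stand-in for a network that would, in studies such as those above, be trained on specimen images, and the image file names are hypothetical.

\begin{verbatim}
# Minimal sketch: specimen representations from a trained network.
# The ImageNet weights and file names below are illustrative stand-ins.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a classifier and replace its final classification layer with an
# identity, so the model outputs 2048-d penultimate-layer representations.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Map one specimen image to its learned representation."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(image).squeeze(0)

# Morphological similarity between two specimens, measured as the
# cosine similarity between their representations.
a, b = embed("specimen_a.jpg"), embed("specimen_b.jpg")
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)
\end{verbatim}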
Furthermore, some tasks require researchers to inspect and compare specimen images individually. The reduced dimensionality of deep representations, combined with scalable nearest-neighbour search (e.g. \citet{johnson_billion-scale_2021}), makes direct comparison of images very efficient, opening up opportunities to explore collections through image content rather than through metadata. Tasks such as searching a collection for similar specimens during identification, or flagging misidentified and poor-quality specimens, become much more efficient in a digital setting.
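The sketch below illustrates this kind of content-based search with the FAISS library described by \citet{johnson_billion-scale_2021}. Representations are L2-normalised so that inner-product search corresponds to cosine similarity, and the ten most similar specimens are retrieved for a query image; the corpus here is random stand-in data, where in practice it would be the embedding matrix of a digitised collection.

\begin{verbatim}
# Minimal sketch: nearest-neighbour search over specimen representations.
# The random embedding matrix is a stand-in for a real collection.
import numpy as np
import faiss

d = 2048                              # dimensionality of the representations
embeddings = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(embeddings)        # cosine similarity via inner product

index = faiss.IndexFlatIP(d)          # exact inner-product index
index.add(embeddings)

query = embeddings[:1]                # e.g. a specimen being identified
scores, neighbours = index.search(query, 10)
print(neighbours[0])                  # indices of the 10 most similar specimens
\end{verbatim}

An exact index is used here for clarity; FAISS also offers approximate indexes that trade a little accuracy for search over billions of vectors.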
Recently, interest in learning useful representations from unlabelled data, the field of unsupervised (or self-supervised) representation learning, has surged \citep{rives_biological_2021}. These studies have shown that large numbers of unlabelled images (millions to billions) can be used to learn representations that work well as a starting point for supervised classification tasks, such as species identification. There is exciting potential to apply these methods to herbarium specimens \citep{walker_harnessing_2022}. A large, centralised repository of specimen images would further enable this research by allowing the development and curation of the two types of dataset necessary for self-supervised representation learning: large training corpora and smaller, task-specific benchmarking datasets \citep{van_horn_benchmarking_2021}.
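A minimal sketch of this transfer setting is given below, assuming a backbone pretrained without labels (here the publicly released DINO ViT-S/16, loadable via torch.hub) that is kept frozen while a linear classifier is trained on a smaller labelled benchmarking dataset. The number of classes is a hypothetical placeholder.

\begin{verbatim}
# Minimal sketch: self-supervised representations as a starting point
# for supervised identification (frozen backbone + linear probe).
import torch

backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
backbone.eval()                       # freeze: only the probe is trained
for p in backbone.parameters():
    p.requires_grad = False

n_species = 500                       # placeholder size of the label set
probe = torch.nn.Linear(384, n_species)  # dino_vits16 yields 384-d features
optimiser = torch.optim.Adam(probe.parameters(), lr=1e-3)

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One linear-probe update on a labelled batch of specimen images."""
    with torch.no_grad():
        features = backbone(images)   # representations learned without labels
    loss = torch.nn.functional.cross_entropy(probe(features), labels)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
\end{verbatim}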

Conceptual framework of the infrastructure

Unlocking the potential for machine learning in natural history collections is contingent on technical infrastructure that is easy to use, interoperable with regional and global biodiversity data platforms, and accessible to the global scientific community. Building the infrastructure will require extensive consultation with the scientific community and funding agencies, and it is imperative that investments in it are scientifically, economically and socially justified, as well as sustainable.

Here, we present a conceptual framework conceived as a roadmap for building the envisioned infrastructure. Although the infrastructure could be implemented in different ways (e.g. distributed or centralised), with advanced components added depending on the scope and requirements of the research community, certain essential components form its foundation. In the following section, we describe the three core technical components, coordinated by the orchestration logic: (1) a repository to index data and metadata; (2) storage for images, models and data; and (3) processing of images to generate new data and annotations, as well as to train new models (Fig. \ref{205447}). The orchestration logic will consist of components such as technical workflows, security protocols and application integrations that implement the business logic and provide access to services. In addition to the technical components, the infrastructure will require a governance structure and a set of protocols, as well as training and outreach to reach the intended audience.
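Purely to make this division of responsibilities concrete, the sketch below models the three components and the orchestration logic as minimal interfaces; all names and method signatures are hypothetical illustrations rather than a prescribed API.

\begin{verbatim}
# Illustrative sketch of the three core components and the orchestration
# logic that coordinates them; every interface here is hypothetical.
from typing import Protocol

class Repository(Protocol):
    """Indexes data and metadata so specimens are findable."""
    def register(self, specimen_id: str, metadata: dict) -> None: ...
    def find(self, query: dict) -> list[str]: ...

class Storage(Protocol):
    """Holds images, trained models and derived data."""
    def put(self, key: str, payload: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class Processing(Protocol):
    """Generates new data and annotations, and trains new models."""
    def annotate(self, image: bytes) -> dict: ...

class Orchestrator:
    """Workflow, security and integration logic tying the parts together."""
    def __init__(self, repo: Repository, store: Storage, proc: Processing):
        self.repo, self.store, self.proc = repo, store, proc

    def ingest(self, specimen_id: str, image: bytes) -> None:
        # Store the image, derive annotations, then index the result.
        self.store.put(specimen_id, image)
        annotations = self.proc.annotate(image)
        self.repo.register(specimen_id, annotations)
\end{verbatim}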