Many specimens are signed by their collector, their determiner, or both
(Fig. \ref{478509}). Expert curators within an institution learn to recognise
the signatures of prolific collectors, but without that knowledge the
signatures are often illegible. Yet it is common practice to identify a
gathering (collection event) uniquely by the name of a collector together
with their collecting number. Furthermore, owing to exchanges, loans and
gifts, a collector's specimens may be spread across many institutions. If
the name is not distinct enough to be transcribed accurately, finding the
specimens of a specific collector across the whole corpus of global
collections would be impossible without some automated process.
Unsupervised learning
The stacked layers of deep neural networks can be regarded as a set of
transformations that learn useful representations of the input data.
Using representations of specimen images learned by neural networks,
rather than extracted metadata, would allow content-based interaction
with, and comparison between, images. Such interaction is useful for
tasks for which no high-quality labelled dataset yet exists, or where
the specimen characteristics relevant to the task are not well defined.
For instance, \citet{white_evaluating_2019} used representations
of specimen images learned by a neural network trained to classify fern
genera to directly compare specimen morphology and test biogeographic
hypotheses. Similarly, \citet{hoyal_cuthill_deep_2019} trained a network to
estimate the similarity of two sets of butterfly specimen images and
used the learned representations to test mimicry hypotheses.
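As an illustration of how such representations can be obtained, the sketch below strips the classification head from a torchvision ResNet-50 and uses the penultimate-layer activations as an image embedding. The weights, layer choice and preprocessing are assumptions for illustration only, not the pipelines used in the cited studies.

\begin{verbatim}
# Minimal sketch: extract learned representations from a trained
# image classifier (torchvision ResNet-50); details are illustrative.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Identity()   # drop the classifier head, keep features
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = preprocess(Image.open("specimen.jpg").convert("RGB")).unsqueeze(0)
    embedding = model(img)       # (1, 2048) representation of the specimen
\end{verbatim}

Two such embeddings can then be compared directly, for example with a cosine similarity, to quantify morphological similarity between specimens.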
Furthermore, some tasks require researchers to inspect and compare
specimen images individually. The reduced dimensionality of deep
representations, combined with scalable nearest-neighbour search
\citep[e.g.][]{johnson_billion-scale_2021}, makes direct comparison of
images very efficient, opening up opportunities to explore collections
through image content rather than through metadata. Tasks such as
searching a collection for similar specimens during identification, or
flagging misidentified or poor-quality specimens, become much more
efficient in a digital setting.
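For example, an index over precomputed specimen embeddings might be built with the FAISS library of \citet{johnson_billion-scale_2021}; the file name and embedding dimensionality below are assumptions for illustration.

\begin{verbatim}
# Sketch: nearest-neighbour search over specimen embeddings with FAISS.
import numpy as np
import faiss

d = 2048                            # embedding dimensionality (assumed)
embeddings = np.load("specimen_embeddings.npy").astype("float32")  # (N, d)

index = faiss.IndexFlatL2(d)        # exact search; IVF/HNSW indexes scale further
index.add(embeddings)

query = embeddings[:1]              # e.g. a specimen being identified
distances, neighbours = index.search(query, 5)
print(neighbours)                   # indices of the 5 most similar specimens
\end{verbatim}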
Interest in learning useful representations from unlabelled data has
recently surged in the field of unsupervised (or self-supervised)
representation learning \citep{rives_biological_2021}. Work in this
area has shown that
large numbers of unlabelled images (millions to billions) can be used to
learn representations that work well as a starting point for supervised
classification tasks, such as species identification. There is exciting
potential to apply these methods to herbarium specimens \citep{walker_harnessing_2022}. A large, centralised repository of specimen images would further
enable this research by allowing the development and curation of the two
types of dataset necessary for self-supervised representation learning:
large training corpora and smaller, task-specific benchmarking datasets
\citep{van_horn_benchmarking_2021}.
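To make the idea concrete, the following sketch shows a SimCLR-style contrastive (NT-Xent) loss, one common objective for self-supervised representation learning; it is a minimal illustration rather than the method of any cited work.

\begin{verbatim}
# Minimal sketch of an NT-Xent contrastive loss for self-supervised
# pretraining; z1 and z2 are projections of two augmented views of
# the same batch of (unlabelled) specimen images.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / temperature                       # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    # The positive for each view is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
\end{verbatim}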
Conceptual framework of the infrastructure
Unlocking the potential for machine learning in natural history
collections is contingent on technical infrastructure that is easy to
use, interoperable with regional and global biodiversity data
platforms, and accessible to the global scientific community. Building
this infrastructure will require extensive consultation with the
scientific community and funding agencies. It is imperative that
investments in the infrastructure are scientifically, economically and
socially justified, as well as sustainable. Here, we present a
conceptual framework conceived as a roadmap for building the envisioned
infrastructure. Although the infrastructure could be implemented in
different ways (e.g. distributed or centralised), with advanced
components depending on the scope and requirements of the research
community, a set of essential components forms the foundation of this
proposed infrastructure. In the following section, we
describe the three core technical components of the infrastructure,
coordinated by orchestration logic: (1) a repository to index data and
metadata, (2) storage for images, models and data, and (3) processing
of images to generate new data and annotations, as well as to train new
models (Fig. \ref{205447}). The orchestration logic will consist of
components such as technical workflows, security protocols and
application integrations that enable implementation of business logic
and access to services. In addition to these technical components, the
infrastructure will require a governance structure and a set of
protocols, as well as training and outreach to engage the intended
audience.
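As a purely hypothetical sketch, the orchestration logic might wire the three components together along the following lines; all class and method names are illustrative, not a specification of the proposed infrastructure.

\begin{verbatim}
# Hypothetical skeleton of the orchestration logic; names are illustrative.
from dataclasses import dataclass

@dataclass
class SpecimenImage:
    identifier: str   # e.g. a stable specimen URI
    uri: str          # location of the image in storage

class Repository:
    """Component 1: indexes data and metadata."""
    def register(self, image, annotations): ...

class Storage:
    """Component 2: holds images, models and derived data."""
    def fetch(self, uri): ...

class Processor:
    """Component 3: generates new data and annotations from images."""
    def annotate(self, image_bytes): ...

def orchestrate(repo, store, proc, image):
    """One workflow pass: fetch an image, process it, index the result."""
    raw = store.fetch(image.uri)
    annotations = proc.annotate(raw)
    repo.register(image, annotations)
\end{verbatim}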