Improving online access to collections is important because collections
are physically dispersed, yet still interconnected through their origins
and exchanges \citep{nicolson_specimens_2018}. Researchers are rarely able to
obtain a full set of specimens for a single taxon or single collector
from a single institution. Most such sets are scattered across tens or even
hundreds of different collections. Digitisation and digital access can
break down physical barriers between collections and make them
accessible as a unified research tool \citep{hardisty_conceptual_2020}. Online
access is also fundamental to addressing the historic imbalance whereby
specimens from high-biodiversity regions were amassed in collections in
the northern hemisphere \citep{grace_botanical_2021}.

Unified access to specimen images is particularly important because
image files are comparatively large and image analysis pipelines are
therefore demanding on processor time. Furthermore, current internet
bandwidth makes transferring large numbers of files a bottleneck,
particularly if those files need to be moved multiple times. Therefore,
it makes sense to store large numbers of images close to where the
processing is going to occur. While such infrastructures exist for other
data types (e.g., Copernicus for remote sensing and WLCG for
the Large Hadron Collider), no such support exists for biological
collections-based image processing. All image analysis pipelines have so
far built their own corpus of images and processed them independently.
This approach is not scalable: it wastes human time and effort, not to
mention the internet bandwidth that would be required to do this at the
scale of global collections. It is also unsuitable for dynamic
image corpora and workflows that are intended to be run multiple times.

\subsection{The Vision}

We envisage a data space for biological collections with a centrally
accessible image corpus with built-in processing. This would allow anyone
to access digitised images of specimens, without having to concentrate
on the logistics of corpus creation and maintenance. By building
accessible interfaces, it would also make it possible to remove
technological barriers that prevent taxonomists and ecologists, among
other users, from using advanced image analysis tools. Through
supervised expert contributions, the system could be further advanced
by integrating knowledge from many scientific disciplines. Such
a corpus would be constantly furnished with new images from publishing
collections and would support both the citation and reproducibility of
the workflows, and their underlying collections, in alignment with FAIR
Data Principles \citep{wilkinson_fair_2016}. It would also make it easier to
curate image datasets and use them for research (e.g. for benchmarking
and challenges for machine learning) and for activities like teaching
species identification from digitised specimens.

\subsection{Scope}

With such an infrastructure, we aim to increase the use and improve the
usability of biological collections for research. The initial focus
would be to support two-dimensional images from preserved, fossilised or
geological specimens. This could later be extended to other types of
specimen image, such as three-dimensional images and X-ray computed
tomography (CT scanning).

Images from living organisms are not considered here, nor other media,
such as sounds, though they are undoubtedly useful and deserve
attention. However, pictures of living organisms pose different
challenges: their numbers are at least two orders of magnitude larger and
increasing more rapidly than those of digitised preserved specimens, and
dedicated infrastructures already exist to process them, such as Pl@ntNet
and iNaturalist. The creators of such images are also more varied, as are
the licensing requirements placed upon images. An exception might be
pictures taken of living organisms in situ before they were
preserved. Such pictures give additional context to the specimen and can
potentially be used together with the preserved specimen both for human
and computational comparison \citep*{goeau_overview_2021}.
In this paper, we answer the questions: what research could be done with
such an infrastructure, who would use it, what functionality would be
needed and what are the architectural requirements? First, we present
the purposes for such a unified corpus of specimen images; second,
we envisage what such an infrastructure might look like. Overall, we
imagine a future where we can search across global collections for such
things as the pattern of a butterfly’s wing, the shape of a leaf, the
logo of a specific collection, or for examples of someone’s handwriting.