Improving online access to collections is important because collections are physically dispersed, yet still interconnected through their origins and exchanges \citep{nicolson_specimens_2018}. Researchers are rarely able to obtain a full set of specimens for a single taxon or a single collector from a single institution. Most are scattered across tens or even hundreds of different collections. Digitisation and digital access can break down physical barriers between collections and make them accessible as a unified research tool \citep{hardisty_conceptual_2020}. Online access is also fundamental to addressing the historic imbalance whereby collections gathered from high-biodiversity regions have been amassed in the northern hemisphere \citep{grace_botanical_2021}.
Unified access to specimen images is particularly important because image files are comparatively large and image analysis pipelines are therefore demanding on processor time. Furthermore, current internet bandwidth makes transferring large numbers of files a bottleneck, particularly if those files need to be moved multiple times. It therefore makes sense to store large numbers of images close to where the processing will occur. While such infrastructures exist for other data types (e.g., Copernicus for remote sensing and WLCG for the Large Hadron Collider), no such support exists for biological collections-based image processing. All image analysis pipelines have so far built their own corpus of images and processed them independently. This approach is not scalable: it wastes human time and effort, as well as the internet bandwidth that would be required to repeat it at the scale of global collections. It is also unsuitable for dynamic image corpora and for workflows that are intended to be run multiple times.

The Vision

We envisage a data space for biological collections with a centrally accessible image corpus and built-in processing. This would allow anyone to access digitised images of specimens without having to concentrate on the logistics of corpus creation and maintenance. Accessible interfaces would also remove the technological barriers that prevent taxonomists and ecologists, among other users, from using advanced image analysis tools. Through supervised expert contributions, the system could be further advanced by integrating knowledge from many scientific disciplines. Such a corpus would be continually supplied with new images from publishing collections and would support both the citation and the reproducibility of workflows and their underlying collections, in alignment with the FAIR Data Principles \citep{wilkinson_fair_2016}. It would also make it easier to curate image datasets and use them for research (e.g., for benchmarking and machine learning challenges) and for activities such as teaching species identification from digitised specimens.

Scope

With such an infrastructure, we aim to increase the use and improve the usability of biological collections for research. The initial focus would be to support two-dimensional images of preserved, fossilised or geological specimens. This could later be extended to other types of specimen image, such as three-dimensional images and X-ray computed tomography (CT) scans.
Images of living organisms are not considered here, nor are other media, such as sounds, though they are undoubtedly useful and deserve attention. The challenges posed by pictures of living organisms are different: their numbers are at least two orders of magnitude larger than those of digitised preserved specimens and are increasing more rapidly, and dedicated infrastructures, such as Pl@ntnet and iNaturalist, already exist to process them. The creators of such images are also more varied, as are the licensing requirements placed upon the images. An exception might be pictures of living organisms taken in situ before they were preserved. Such pictures give additional context to the specimen and can potentially be used together with the preserved specimen for both human and computational comparison \citep*{goeau_overview_2021}.
In this paper, we address four questions: what research could be done with such an infrastructure, who would use it, what functionality would be needed, and what are the architectural requirements? First, we present the purposes of such a unified corpus of specimen images; second, we envisage what such an infrastructure might look like. Overall, we imagine a future where we can search across global collections for such things as the pattern of a butterfly’s wing, the shape of a leaf, the logo of a specific collection, or examples of someone’s handwriting.