Purposes

Infrastructure costs money, and it needs to justify that cost through the benefits it brings, not just to science but to society in general. We also need to understand who its users and beneficiaries will be. Below we outline some uses and users for an imaging infrastructure for collections, though there are undoubtedly uses we have yet to imagine.

Species identification

Machine learning applications for the identification of organisms mostly use digitised photographs of living organisms (e.g. \citealp{waldchen_machine_2018,bonnet_how_2020}).
Most experiments with species identification from digitised specimen images have focused on herbarium specimens (e.g. \citealp{Carranza-Rojas2017,kho_automated_2017,pryer_using_2020,hussein_applications_2022}). This is because they are two-dimensional, follow a fairly standardised format and are widely available. Herbarium digitisation has preceded that of animal collections, which tend to be more three-dimensional and, in the case of insects, much larger (Fig. 1). Nevertheless, because insect specimens are in general much more abundant, there is clear demand for automated identification of these specimens too \citep{earl_discovering_2019,valan_automated_2019,hoye_deep_2021}. A clear advantage of insects is that their colour and morphology are well preserved in specimens. This means that automatic identification trained on specimens may work on living individuals and vice versa, opening the possibility of creating training datasets for rarely seen organisms \citep{goeau_overview_2021}. Specimens from natural history collections have been used successfully to train models that assist in sorting images from camera traps, thus greatly facilitating the monitoring process \citep{hoye_deep_2021}. Camera traps are routinely deployed in ecological monitoring and have been advocated as a method of global biodiversity monitoring \citep*{wearn_snap_2019}.
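To illustrate the general approach, the sketch below fine-tunes a pretrained convolutional network on specimen images using PyTorch; the directory layout, hyperparameters and class structure are illustrative assumptions rather than details from any of the studies cited above.
\begin{verbatim}
# Minimal sketch: fine-tune a pretrained CNN to identify species from
# herbarium sheet images. Directory layout and hyperparameters are
# illustrative assumptions, not taken from the studies cited above.
import torch
from torch import nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),          # pretrained backbones expect ~224 px inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical folder of specimen images: one sub-folder per species name.
train_set = datasets.ImageFolder("herbarium_images/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # one output per species

optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimiser.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimiser.step()
\end{verbatim}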
As with insects, the state of preservation, uniformity and distinctiveness of pollen grains make them good targets for automated identification, whether they come from preserved collections or fresh material. Indeed, pollen is well preserved as fossils and sub-fossils, making it a useful target for analysing evolutionary and ecological change \citep{romero_improving_2020,hornick_integrative_2022}. Manual identification of pollen grains by experts is slow and laborious, a process that machine learning could make far more routine \citep{bourel_automated_2020}, with potential applications in environmental monitoring, archaeology and forensics.
Work on ichthyological collections \citep{elhamod_hierarchyguided_2022} has demonstrated that the addition of phylogenetic information can strengthen neural network models and improve the identification of specimen images. This Hierarchy-Guided Neural Network copes with imperfect yet realistic scenarios, such as damaged specimens or limited training data, while incorporating and potentially improving knowledge of taxonomic relationships.
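As a hedged illustration of how taxonomic structure can be injected into such a model (and not a reimplementation of the published Hierarchy-Guided Neural Network), a shared image backbone can be given separate genus- and species-level prediction heads trained with a joint loss:
\begin{verbatim}
# Sketch of a shared backbone with genus- and species-level heads; the
# joint loss encourages predictions consistent with the taxonomy. This is
# an illustration of the general idea, not the published architecture.
import torch
from torch import nn
from torchvision import models

class HierarchicalClassifier(nn.Module):
    def __init__(self, n_genera: int, n_species: int):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop final fc
        self.genus_head = nn.Linear(512, n_genera)
        self.species_head = nn.Linear(512, n_species)

    def forward(self, x):
        feats = self.features(x).flatten(1)
        return self.genus_head(feats), self.species_head(feats)

model = HierarchicalClassifier(n_genera=40, n_species=300)
criterion = nn.CrossEntropyLoss()
images = torch.randn(8, 3, 224, 224)          # dummy batch for illustration
genus_labels = torch.randint(0, 40, (8,))
species_labels = torch.randint(0, 300, (8,))
genus_logits, species_logits = model(images)
loss = criterion(genus_logits, genus_labels) + criterion(species_logits, species_labels)
loss.backward()
\end{verbatim}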
The main advantage of automated identification of digital images of preserved specimens is not accuracy but the potential for high throughput. Access to large numbers of images in a suitable computational environment remains a critical factor in mainstreaming automatic specimen identification across collections.

Extracting trait data

Morphological, phenological and colourimetric traits are often clearly visible on images of specimens (e.g. Fig. \ref{312487}A). Such traits might be diagnostic for the identification of the organism, but they are also used to understand how traits evolve and what the traits tell us about the evolutionary history of a taxon.

Functional traits

Traits are interesting from the perspective of the functions they have evolved to perform. Morphological functional traits have been used to predict impacts of climate change on ecosystem functioning \citep{pigot_macroevolutionary_2020}, species distributions (e.g. \citealt{pollock_role_2012,regos_effects_2019}) and community structure \citep{li_are_2015}, and to inform the land surface component of climate models \citep{kala_impact_2016}. Functional traits recorded from preserved specimens supplement field-recorded data, filling geographic and temporal gaps and providing legacy data \citep{heberling_herbarium_2017,bauters_centurylong_2020,kommineni_comprehensive_2021}, as well as potentially enabling the discovery of newly relevant morphological traits. Examining such traits on preserved specimens in collections is also considerably cheaper than fieldwork.
Leaf morphological traits are particularly amenable to extraction from herbarium sheets, both because the leaves are laid flat and because they do not necessarily require magnification \citep{heberling_herbaria_2022}. Their size, dimensions, arrangement, dentation and venation are all possible targets for machine learning, and experiments with extracting these parameters have shown this to be feasible and reliable \citep{heberling_herbarium_2017,triki_objects_2020,weaver_leafmachine_2020}. The extraction of traits from insect collections also has great potential, particularly as the state of preservation of insects in collections is high \citep{hoye_deep_2021}.
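As a minimal sketch of the measurement step, assuming a leaf has already been segmented into a binary mask and that the pixel-to-millimetre scale is known from the ruler on the sheet, simple size traits can be read off with scikit-image (the file name and scale below are placeholders):
\begin{verbatim}
# Sketch: measure simple leaf traits (area, length, width) from a binary
# leaf mask. The input mask and the mm-per-pixel scale are illustrative
# assumptions; in practice the scale comes from the ruler in the image.
from skimage import io, measure

MM_PER_PIXEL = 0.1                       # assumed scale from the imaging setup

mask = io.imread("leaf_mask.png", as_gray=True) > 0   # hypothetical binary mask
labelled = measure.label(mask)

for region in measure.regionprops(labelled):
    if region.area < 500:                # skip small fragments and noise
        continue
    length_mm = region.major_axis_length * MM_PER_PIXEL
    width_mm = region.minor_axis_length * MM_PER_PIXEL
    area_mm2 = region.area * MM_PER_PIXEL ** 2
    print(f"leaf: length={length_mm:.1f} mm, "
          f"width={width_mm:.1f} mm, area={area_mm2:.1f} mm^2")
\end{verbatim}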
In the case of fish, given the large number of species globally, the enormous number of morphological traits and the large amount of variation, we can only hope to fill the gaps in our knowledge of traits if preserved specimens are used \citep{hay_why_2020,kattge_try_2020}. Furthermore, specimens have the advantage of being vouchers against which measurements can be verified and from which new measurements can be taken.
Using well-documented machine learning algorithms to extract traits from specimens would be far more efficient if a single large corpus were available for analysis, and the measurements would be less prone to error and more reproducible if the source code and training data were open and shared \citep{meeus_leaf_2020}. Digital specimens share similar pitfalls with their physical counterparts, such as metadata missing from specimen labels.
Further, collection practices have changed considerably over the more than four centuries during which collections have been amassed \citep{kozlov_changes_2021}. Characters of specimens can also change upon preservation, for instance through the shrinkage associated with drying \citep*{tomaszewski_is_2016}. Yet, with suitable awareness and controls, there is much to be learned from trait data gathered from digital specimens.

Phenology

A trait of particular interest for climate change impact studies is phenology. Changes in seasonal temperatures and rainfall patterns affect the hatching or emergence of dormant animals and the maturation of leaves, flowers and fruits. Such changes may lead to a mismatch in seasonality among organisms \citep*{renner_climate_2018}. Detecting the phenological state of an organism is possible through machine learning \citep{lorieul_toward_2019,davis_new_2020,triki_deep_2021,goeau_can_2022} though not to the level of accuracy achieved manually. Nevertheless, the obvious advantage of machine learning is the potential for high throughput processing of images to track phenological shifts \citep{pearson_machine_2020}.

Colour analysis

Imaging of specimens is almost always done with colour cameras, even though some organisms, such as plants, change colour when they are preserved \citep{davis_temporal_2013}. Nevertheless, there are animals, such as insects and birds, that maintain their colour well and may be interesting targets for research \citep{hoyal_cuthill_deep_2019}. Among other avenues of research, studies have shown that colour is an important factor in climate change adaptation of insects \citep{zeuss_global_2014}.
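As a simple, hedged example of such a measurement, the mean colour of a segmented specimen can be summarised in the perceptually more uniform CIELAB space (the file names below are placeholders):
\begin{verbatim}
# Sketch: summarise the colour of a segmented specimen in CIELAB space,
# which separates lightness (L*) from the colour axes (a*, b*). File
# names are placeholders for a specimen image and its segmentation mask.
from skimage import io, color

image = io.imread("specimen.png")[:, :, :3]               # drop alpha channel if present
mask = io.imread("specimen_mask.png", as_gray=True) > 0   # True where the specimen is

lab = color.rgb2lab(image)
specimen_pixels = lab[mask]
mean_L, mean_a, mean_b = specimen_pixels.mean(axis=0)
print(f"mean L*={mean_L:.1f}, a*={mean_a:.1f}, b*={mean_b:.1f}")
\end{verbatim}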

Species interactions

Organisms are in constant conflict with their predators, parasites and pathogens. Specimens provide a record of this and have been shown to reveal long-term changes related to environmental change, such as the introduction of non-native species \citep{vega_elucidation_2019}, pollution and climate change \citep{lang_using_2019}. For example, manually extracted changes in leaf herbivory of herbarium specimens were correlated with climate change and urbanisation in the north-east of the United States of America \citep{meineke_herbarium_2019}. Indeed, \citet{meineke_applying_2020} took this further and investigated the potential for extracting leaf damage data from herbarium specimens of two species through a process of detection and classification of images split into grid cells. Although in this instance image analysis was less accurate than human classification, the possibility remains of applying such an analysis to many more species over a much larger geographic area, something that would be possible only with automation using images from multiple collections.
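A hedged sketch of the tiling step in such a grid-based pipeline is given below; the cell size and file name are assumptions, and the per-cell damage classifier itself is left as a placeholder:
\begin{verbatim}
# Sketch: split a specimen image into grid cells for per-cell
# classification (e.g. herbivory damage present/absent). The grid size
# and the classifier itself are placeholders.
from PIL import Image

CELL = 256                                   # assumed cell size in pixels

sheet = Image.open("herbarium_sheet.jpg")
width, height = sheet.size

cells = []
for top in range(0, height - CELL + 1, CELL):
    for left in range(0, width - CELL + 1, CELL):
        cells.append(sheet.crop((left, top, left + CELL, top + CELL)))

# Each cell would then be passed to a trained classifier; damage rate is
# the fraction of specimen-bearing cells scored as damaged.
print(f"{len(cells)} cells of {CELL}x{CELL} px extracted")
\end{verbatim}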

Collections care, curation and management

The preceding use cases extract data from specimens for research, but information is also needed for the curation, organisation, storage and management of collections. A pertinent example is the need to identify specimens treated with toxic substances, such as the mercuric chloride used historically on herbarium specimens to prevent insect damage. Over time, mercuric chloride leaves stains on the specimen mounting paper that image classification can be used to distinguish. \citet{schuettpelz_applications_2017} trained a convolutional neural network to detect such stained sheets. It had a false-negative rate of 8%, a comparatively high error rate where toxicity is concerned, yet one that could likely be improved, particularly if combined with provenance information. A similar approach might be used to detect pests in collections, such as Lasioderma serricorne (J.Fabr., 1792).
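Because a false negative (a contaminated sheet passed as clean) is the costly error here, evaluation of such a classifier should report class-specific error rates rather than overall accuracy; a minimal sketch with scikit-learn, using placeholder labels and predictions, is shown below:
\begin{verbatim}
# Sketch: evaluate a stain/contamination classifier where the
# false-negative rate (contaminated sheets predicted clean) matters most.
# y_true and y_pred are placeholders for real labels and model output.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]   # 1 = mercuric chloride stain present
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_negative_rate = fn / (fn + tp)
print(f"false-negative rate: {false_negative_rate:.0%}")
\end{verbatim}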
One could even imagine image analysis workflows that detect the type of mounting strategy and the preservation state of the specimen. This would help curators triage specimens that need remounting or some other form of curatorial care.
Numerous other checks and controls can be performed on images of specimens, for example quality control of the images themselves, covering lighting, colour, cropping, orientation and focus. Additionally, the presence and accuracy of image components, such as the barcode, ruler and colour chart, can be verified, providing a useful check on the integrity of a corpus.
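Two such checks, estimating focus from the variance of the image Laplacian and confirming that a barcode can be decoded, are sketched below; the blur threshold is an assumed value that would need tuning for each imaging station:
\begin{verbatim}
# Sketch: two automated quality-control checks on a specimen image,
# a focus estimate (variance of the Laplacian) and barcode presence.
# The blur threshold is an assumed value that would need tuning.
import cv2
from pyzbar.pyzbar import decode

BLUR_THRESHOLD = 100.0                        # assumed; tune per imaging station

image = cv2.imread("specimen.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

focus_score = cv2.Laplacian(gray, cv2.CV_64F).var()
barcodes = decode(gray)

print(f"focus score: {focus_score:.1f} "
      f"({'ok' if focus_score > BLUR_THRESHOLD else 'possibly blurred'})")
print(f"barcodes found: {[b.data.decode() for b in barcodes]}")
\end{verbatim}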

Visual features of the specimen

Image segmentation and object separation

Image segmentation is a fundamental low-level image processing task that facilitates higher-level tasks such as object detection and recognition \citep{de_la_hidalga_cross-validation_2022}. In preparation for image analysis, such as searching for signatures, or to support further digitisation with a human in the loop, it is often more efficient to recognise the individual objects in an image, classify them and separate them into multiple images. For example, the image may contain multiple specimens, or a label may need to be extracted from the image to present to humans for transcription or for further image analysis. Biological specimens from different collections representing the same species often show a large variety of backgrounds, caused by different mounting techniques and different digitisation processes. Separating the object and repositioning it in preparation for further image analysis may help in establishing training sets that ignore differences in background and positioning.
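A minimal sketch of this kind of object separation, using simple thresholding and contour detection in OpenCV, is given below; it is an illustration only, as production workflows would more likely rely on trained detection or segmentation models:
\begin{verbatim}
# Sketch: separate the main objects on a digitised sheet (specimen,
# labels, colour chart) into individual crops using thresholding and
# contours. Real pipelines would usually use a trained detector instead.
import cv2

image = cv2.imread("sheet.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Otsu threshold; THRESH_BINARY_INV assumes objects darker than background.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

for i, contour in enumerate(contours):
    x, y, w, h = cv2.boundingRect(contour)
    if w * h < 10_000:                      # ignore small fragments
        continue
    cv2.imwrite(f"segment_{i}.png", image[y:y + h, x:x + w])
\end{verbatim}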
In an infrastructure built for image analysis, standard segmentation workflows could be run and optimised so that individual researchers do not have to repeat the segmentation step, and users of the infrastructure could choose whether to analyse the whole image, all the segments, or specific classes of segment.

Labels

Specimens are usually annotated with information on their labels. In the case of plants, these labels are on the mounting paper; for insects they are on the mounting pin; while for larger zoological and plant specimens labels might be tied to the specimen, or placed on, or in, specimen jars. These labels document characteristic data from the collecting event (Figs \ref{312487} & \ref{478509}). As images of mounted specimens therefore often contain text, it is useful to provide printed and handwritten text recognition output as part of an image processing pipeline. If this text can be recognised, the additional metadata can be used to enrich the items of the collection and automatically perform cross-collection linking. Furthermore, the recognised text can aid the digitisation process and the validation of the extracted metadata, reducing the amount of manual input required and improving the quality of the data being transcribed \citep{drinkwater_use_2014}.
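As a hedged example of the printed-text part of such a pipeline, a cropped label image can be passed to an off-the-shelf OCR engine such as Tesseract; handwritten labels generally require dedicated handwritten text recognition models, and the file name below is a placeholder:
\begin{verbatim}
# Sketch: optical character recognition on a cropped label image using
# Tesseract. Works reasonably for printed labels; handwritten labels
# usually require dedicated handwritten text recognition models.
from PIL import Image
import pytesseract

label = Image.open("label_crop.png").convert("L")   # greyscale often helps OCR
text = pytesseract.image_to_string(label)
print(text)
\end{verbatim}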