Purposes
Infrastructure costs money, and it needs to justify that cost through its
benefits, not just to science but to society in general. We also need to
understand who its users and beneficiaries will be. Below we outline
some uses and users of an imaging infrastructure for
collections, though there are undoubtedly uses we have yet to imagine.
Species identification
Machine learning applications for the identification of organisms mostly
use digitised photographs of living organisms (e.g. \citealp{waldchen_machine_2018,bonnet_how_2020}).
Most experiments with species identification from digitised specimen
images have focused on herbarium specimens (e.g. \citealp{Carranza-Rojas2017,kho_automated_2017,pryer_using_2020,hussein_applications_2022}). This is
because they are two-dimensional, they follow a fairly standardised
format and are widely available. The digitization of herbaria has preceded
that of animal collections, which tend to be more three-dimensional and, in the
case of insects, are much larger (Fig. 1). Nevertheless, because insect
specimens are, in general, much more abundant there is clear demand for
automated identification of these specimens too \citep{earl_discovering_2019,valan_automated_2019,hoye_deep_2021}. A clear advantage of insects is that
their colour and morphology are well preserved in specimens. This means
that automatic identification trained on preserved specimens may work on living
organisms and vice versa, making it possible to create training
datasets for rarely seen organisms \citep{goeau_overview_2021}. Specimens from
natural history collections have been used successfully to train models
that assist in sorting images from camera traps, thus greatly
facilitating the monitoring process \citep{hoye_deep_2021}. Camera traps are
routinely deployed in ecological monitoring and have been advocated as a
method of global biodiversity monitoring \citep*{wearn_snap_2019}.
As with insects, the state of preservation, uniformity and
distinctiveness of pollen grains make them good targets for
automated identification, whether they come from preserved collections or
fresh material. Indeed, pollen is well preserved as fossils and sub-fossils,
making it a useful target for analysing evolutionary and ecological change
\citep{romero_improving_2020,hornick_integrative_2022}. Manual identification of
pollen grains by experts is slow and laborious; machine learning
could turn it into a much more routine process \citep{bourel_automated_2020},
with potential applications in environmental monitoring, archaeology and
forensics.
Work on ichthyological collections \citep{elhamod_hierarchyguided_2022} has
demonstrated that the addition of phylogenetic information can
strengthen neural network models and improve the identification of
specimen images. This Hierarchy-Guided Neural Network allows for
imperfect, yet realistic scenarios such as damaged specimens or limited
training data while incorporating and potentially improving knowledge of
taxonomic relationships.
The main advantage of automated identification of digital images of
preserved specimens is not the accuracy, but the potential for high
throughput. Access to large numbers of images in a suitable
computational environment remains a critical factor in mainstreaming
automatic specimen identification across collections.
Extracting trait data
Morphological, phenological and colourimetric traits are often clearly
visible on images of specimens (e.g. Fig. \ref{312487}A). Such traits might be
diagnostic for the identification of the organism, but they are also
used to understand how traits evolve and what the traits tell us about
the evolutionary history of a taxon.
Functional traits
Traits are interesting from the perspective of the functions they have
evolved to perform. Morphological functional traits have been used to
predict impacts of climate change on ecosystem functioning \citep{pigot_macroevolutionary_2020}, species distributions (e.g. \citealt{pollock_role_2012,regos_effects_2019}), community structure \citep{li_are_2015}, and how these traits fit
into the land surface component of climate models \citep{kala_impact_2016}.
Functional traits recorded from preserved specimens supplement
field-recorded data, filling geographic and temporal gaps and providing legacy
data \citep{heberling_herbarium_2017,bauters_centurylong_2020,kommineni_comprehensive_2021}, as well as potentially enabling discovery of newly-relevant
morphological traits. Examining such traits in preserved specimens in
collections is also considerably cheaper than fieldwork.
Leaf morphological traits are particularly amenable to extraction from
herbarium sheets, both because they are laid flat and because they do
not necessarily require magnification \citep{heberling_herbaria_2022}. Their size,
dimensions, arrangement, dentation and venation are all possible targets
for machine learning and experiments with extracting these parameters
have shown it to be feasible and reliable \citep{heberling_herbarium_2017,triki_objects_2020,weaver_leafmachine_2020}. The extraction of traits from
collections of insects has great potential, particularly as the state of
preservation of insects in collections is high \citep{hoye_deep_2021}.
In the case of fish, due to the large number of species globally, the
enormous number of morphological traits and large amount of variation,
we can only hope to fill the gaps in our knowledge of traits if
preserved specimens are used \citep{hay_why_2020,kattge_try_2020}.
Furthermore, specimens have the advantage that there is a voucher where
the measurements can be verified and new measurements can be taken.
Applying well-documented machine learning algorithms to trait extraction
from specimens would be much more efficient if a single large
corpus were available for analysis, and measurements could be less
prone to error and more reproducible if the source code and training
data are open and shared \citep{meeus_leaf_2020}. Digital specimens share
similar pitfalls with their physical counterparts, such as missing
metadata from specimen labels.
Further, collection practices have changed considerably over the more
than four centuries during which specimens have been amassed \citep{kozlov_changes_2021}. Also,
characters of specimens can change upon preservation, for instance,
shrinkage associated with drying \citep*{tomaszewski_is_2016}. Yet,
with suitable awareness and controls, there is much to be learned from
trait data gathered from digital specimens.
Phenology
A trait of particular interest for climate change impact studies is
phenology. Changes in seasonal temperatures and rainfall patterns affect
the hatching or emergence of dormant animals and the maturation of
leaves, flowers and fruits. Such changes may lead to a mismatch in
seasonality among organisms \citep*{renner_climate_2018}. Detecting the
phenological state of an organism is possible through machine learning
\citep{lorieul_toward_2019,davis_new_2020,triki_deep_2021,goeau_can_2022} though not to the level of accuracy achieved manually.
Nevertheless, the obvious advantage of machine learning is the potential
for high throughput processing of images to track phenological shifts
\citep{pearson_machine_2020}.
Colour analysis
Imaging of specimens is almost always done with colour cameras, even
though some organisms, such as plants, change colour when they are
preserved \citep{davis_temporal_2013}. Nevertheless, there are animals, such as
insects and birds, that maintain their colour well and may be
interesting targets for research \citep{hoyal_cuthill_deep_2019}. Among
other avenues of research, studies have shown that colour is an
important factor in climate change adaptation of insects \citep{zeuss_global_2014}.
Species interactions
Organisms are in constant conflict with their predators, parasites and
pathogens. Specimens provide a record of this and have been shown to
reveal long-term changes related to environmental change, such as the
introduction of non-native species \citep{vega_elucidation_2019}, pollution and
climate change \citep{lang_using_2019}. For example, manually extracted
changes in leaf herbivory of herbarium specimens were correlated with
climate change and urbanisation in the north-east of the United States
of America \citep{meineke_herbarium_2019}. Indeed, \citet{meineke_applying_2020} took
this further and investigated the potential for extracting leaf damage
data from herbarium specimens of two species through a process of
detection and classification of images split into grid cells. Although
in this instance image analysis was less accurate than human
classification, such an analysis could be applied to
many more species over a much larger geographic area, which would be
possible only with automation using images from multiple collections.
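The idea of splitting an image into grid cells before classifying each cell can be sketched as follows. This is a minimal illustration and not the procedure of \citet{meineke_applying_2020}: the function name \texttt{split\_into\_grid\_cells}, the cell size and the choice to discard partial cells at the image edges are all assumptions made here, and the per-cell classifier itself is left out.

```python
import numpy as np


def split_into_grid_cells(image, cell_size):
    """Cut an image array into square, non-overlapping cells.

    Returns a list of (grid_row, grid_col, cell) tuples. Partial cells at
    the right and bottom edges are discarded (an assumption of this sketch).
    """
    height, width = image.shape[:2]
    cells = []
    for top in range(0, height - cell_size + 1, cell_size):
        for left in range(0, width - cell_size + 1, cell_size):
            cell = image[top:top + cell_size, left:left + cell_size]
            cells.append((top // cell_size, left // cell_size, cell))
    return cells


# Hypothetical stand-in for a digitised herbarium sheet (height, width, RGB).
image = np.zeros((100, 150, 3), dtype=np.uint8)
cells = split_into_grid_cells(image, cell_size=50)
print(len(cells), "cells of shape", cells[0][2].shape)
```

Each cell could then be passed to a classifier, and the per-cell labels mapped back to the sheet through the stored grid coordinates.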
Collections care, curation and management
The preceding use cases extract data from specimens for research, but
information is also needed for curation, organisation, storage and
management of collections. A pertinent example is the need to identify
specimens treated with toxic substances, such as mercuric chloride used
historically on herbarium specimens to prevent insect damage. Over time,
mercuric chloride leaves stains on the specimen mounting paper that
image classification can be used to detect. \citet{schuettpelz_applications_2017} trained a convolutional neural network to detect such
stained sheets. It had a false-negative rate of 8\%, which is a
comparatively high error for a situation related to toxicity, yet could
likely be improved, particularly if provenance information is incorporated.
A similar approach might be used to detect pests in
collections, such as \textit{Lasioderma serricorne} (J.Fabr., 1792).
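For reference, the false-negative rate quoted above for the stained-sheet classifier can be written out under the standard definition. The symbols $\mathit{FN}$ and $\mathit{TP}$ are introduced here for illustration and are not taken from the cited study, which may have computed the rate differently:

\[
\mathrm{FNR} = \frac{\mathit{FN}}{\mathit{FN} + \mathit{TP}},
\]

where $\mathit{FN}$ is the number of stained sheets the classifier fails to flag and $\mathit{TP}$ is the number it flags correctly. Under this reading, a rate of 8\% would mean that roughly 8 of every 100 treated sheets go undetected.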
One could even imagine image analysis workflows that detect the type of
mounting strategy and preservation state of the specimen. This would
help curators triage specimens that need remounting or some other form
of curatorial care.
Numerous other checks and controls can be performed on images of
specimens. For example, the quality of the images themselves can be
checked, including lighting, colour, cropping, orientation and focus. Additionally, the
presence and accuracy of image components, such as the barcode, ruler
and colour chart, can be verified, which is a useful check on the integrity
of a corpus.
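Two of the image quality checks mentioned above, focus and lighting, can be approximated with simple statistics. The sketch below is one possible approach and not one drawn from the cited literature: it uses the variance of a discrete Laplacian as a sharpness score and the fraction of near-black and near-white pixels as an exposure indicator. The function names and the cut-off values \texttt{low} and \texttt{high} are hypothetical, and any pass/fail threshold would need to be calibrated on real images.

```python
import numpy as np


def laplacian_variance(gray):
    """Sharpness score: variance of a 4-neighbour discrete Laplacian.

    Blurred images have weaker edges, so the score is lower.
    """
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def exposure_fractions(gray, low=5, high=250):
    """Fractions of pixels that are nearly black or nearly white.

    The cut-offs `low` and `high` are illustrative assumptions.
    """
    total = gray.size
    dark = float((gray <= low).sum()) / total
    bright = float((gray >= high).sum()) / total
    return dark, bright


# Synthetic greyscale test data (not real specimen images).
rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)

# Simulate defocus with a 3x3 box blur.
padded = np.pad(sharp.astype(np.float64), 1, mode="edge")
blurred = sum(padded[i:i + 64, j:j + 64]
              for i in range(3) for j in range(3)) / 9.0

print("sharper image scores higher:",
      laplacian_variance(sharp) > laplacian_variance(blurred))
```

Scores like these are only comparable between images captured with similar equipment and resolution, so in practice they would probably be evaluated per collection or per imaging station.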
Visual features of the specimen
Image segmentation and object separation
Image segmentation is a fundamental low-level image processing task to
facilitate higher-level tasks such as object detection and recognition
\citep{de_la_hidalga_cross-validation_2022}. In preparation for image analysis, such as
searching for signatures, or to support further digitisation with a
human-in-the-loop, it is often more efficient to recognise the
individual objects in an image, classify them and separate them into
multiple images. This is the case, for example, when the image contains multiple specimens,
or when a label needs to be extracted from the image to present to humans
for transcription or for further image analysis. Biological
specimens from different collections representing the same species often
show a large variety in backgrounds, caused by different mounting
techniques and different digitisation processes. Separating the object
and repositioning it in preparation for further image analysis may help
in establishing training sets that ignore the differences in background
and positioning.
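As a toy illustration of separating an object from its background, the sketch below thresholds a greyscale image with Otsu's method and crops to the bounding box of the foreground. It assumes a dark object on a light, fairly uniform background, which will not hold for many specimen images, and it is not the segmentation method of the work cited above. The function names are hypothetical.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the grey level that maximises between-class variance (Otsu).

    `gray` must be a uint8 array; pixels <= threshold form one class.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    weight_low = np.cumsum(hist)
    weight_high = hist.sum() - weight_low
    cum_mean = np.cumsum(hist * levels)
    mean_low = cum_mean / np.maximum(weight_low, 1)
    mean_high = (cum_mean[-1] - cum_mean) / np.maximum(weight_high, 1)
    between = weight_low * weight_high * (mean_low - mean_high) ** 2
    return int(np.argmax(between))


def bounding_box(mask):
    """Return (top, bottom, left, right) of the True region, or None."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1


# Synthetic image: light background (230) with one dark rectangle (40).
image = np.full((60, 80), 230, dtype=np.uint8)
image[10:30, 20:50] = 40

t = otsu_threshold(image)
mask = image <= t            # dark pixels are treated as the object
box = bounding_box(mask)
top, bottom, left, right = box
crop = image[top:bottom, left:right]
print("bounding box:", box, "crop shape:", crop.shape)
```

Cropping to the object in this way removes most of the background, which is the property that would help when building training sets from differently mounted specimens.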
In an infrastructure built for image analysis, standard segmentation
workflows could be run and optimised to avoid every researcher repeating
the segmentation step and users of the infrastructure could choose
themselves whether they want to analyse the whole image, all the
segments, or specific classes of segment.
Labels
Specimens are usually annotated with information on their labels. In the
case of plants, these labels are on the mounting paper, for insects they
are on the mounting pin, while for larger zoological and plant specimens
labels might be tied to the specimen, or on, or in, specimen jars. These
labels document characteristic data from the collecting event (Figs. \ref{312487} and \ref{478509}).
Therefore, as images of mounted specimens often contain text, it is
useful to provide printed and handwritten text recognition output as
part of an image processing pipeline. If this text can be recognized,
this additional metadata can be used to enrich the items of the
collection and automatically perform cross-collection linking.
Furthermore, the recognized text can aid in the digitization process and
validation of the extracted metadata, reducing the amount of manual
input required and improving the quality of the data being transcribed
\citep{drinkwater_use_2014}.