Data integration
The data integration sub-component will take the data extracted or generated by
the sub-components mentioned above and push it to the respective parts of the
system, that is, the repository (e.g., the metadata registry of trained
models and images, datasets, the annotated data, and so on) and the
storage (e.g., image files and their derivatives, pre-trained models,
metadata packages, among others).
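To illustrate, the sketch below shows how this sub-component might route a
single extraction result, assuming a generic REST-based metadata registry and
an S3-compatible object store; the endpoint, bucket and field names are
hypothetical and serve only as an example.
\begin{verbatim}
import boto3      # S3-compatible object storage client
import requests   # HTTP client for the (hypothetical) metadata registry API

# Hypothetical endpoint and bucket, for illustration only.
REGISTRY_URL = "https://registry.example.org/api/records"
STORAGE_BUCKET = "specimen-derivatives"

s3 = boto3.client("s3")  # assumes credentials for an S3-compatible store

def integrate(record: dict, derivative_path: str) -> None:
    """Push one extraction result: files to storage, metadata to the registry."""
    # 1. Upload the binary artefact (e.g. a resized image or a trained model).
    key = f"{record['specimen_id']}/{derivative_path.rsplit('/', 1)[-1]}"
    s3.upload_file(derivative_path, STORAGE_BUCKET, key)

    # 2. Register the metadata package, pointing at the stored artefact.
    record["storage_key"] = key
    response = requests.post(REGISTRY_URL, json=record, timeout=30)
    response.raise_for_status()
\end{verbatim}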
Data export
The system will catalogue millions of specimens, each with variable
amounts of metadata. These data can be filtered with complex queries,
based on several parameters and fields. As an example, a user might want
to search for specimens of a specific species that have images and annotations
regarding the presence of signatures within a specific timespan. Requesting the
generation of an image dataset based on the result of such a query requires
several processing tasks that need to be scheduled, from the extraction and
merging of the relevant metadata into the desired format, to resizing images if
needed, assigning a persistent identifier so the dataset can be uniquely
identified, generating a dataset page and notifying the user. Moreover, if new
images matching the same search criteria are annotated in the following months,
the user might request the dataset to be updated, generating a second version
and assigning a new or versioned persistent identifier such as a DOI.
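As a rough illustration of how such an export job could be orchestrated, the
sketch below chains the steps described above and shows how a later re-run
yields a second, versioned release; all names are hypothetical placeholders
rather than parts of an existing system.
\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class ExportJob:
    query: dict               # the user's filter: species, annotations, timespan, ...
    version: int = 1          # bumped when the dataset is regenerated later
    log: list = field(default_factory=list)

# Each step would normally run as a scheduled background task; here they are
# simple stubs that record what they would do.
def merge_metadata(job):
    job.log.append("relevant metadata extracted and merged into the target format")

def resize_images(job):
    job.log.append("images resized where needed")

def mint_identifier(job):
    job.log.append(f"persistent identifier (e.g. a DOI) assigned for version {job.version}")

def build_landing_page(job):
    job.log.append("dataset landing page generated")

def notify_user(job):
    job.log.append("user notified that the dataset is ready")

PIPELINE = [merge_metadata, resize_images, mint_identifier, build_landing_page, notify_user]

def run_export(job: ExportJob) -> ExportJob:
    for step in PIPELINE:
        step(job)
    return job

job = run_export(ExportJob({"species": "Quercus robur",
                            "annotation": "signature",
                            "collected": ("1900-01-01", "1950-12-31")}))
print(*job.log, sep="\n")

# If newly annotated images match the same query months later, the job can be
# re-run with an incremented version, yielding a new or versioned identifier.
job.version += 1
run_export(job)
\end{verbatim}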
Demonstrating the feasibility of this approach, GBIF already provides part of
this functionality, using background jobs to export datasets on user request
(excluding images and DOIs, but allowing the export of metadata based on
queries). Moreover, this sub-component may
also be responsible for exporting machine learning datasets to public
platforms such as the Registry of Open Data on
AWS
or
Google Datasets,
allowing users to easily mount them on external cloud solutions.
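For instance, a dataset published on the Registry of Open Data on AWS can be
read anonymously from any cloud environment; the sketch below uses boto3 with
unsigned requests, and the bucket and prefix names are hypothetical.
\begin{verbatim}
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access to a public bucket; the bucket and prefix below
# are hypothetical examples of how an exported dataset might be laid out.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
page = s3.list_objects_v2(Bucket="specimen-image-datasets", Prefix="v1/", MaxKeys=10)
for obj in page.get("Contents", []):
    print(obj["Key"], obj["Size"])
\end{verbatim}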
Discussion
As the decades of the 21st century proceed, we
anticipate important changes in global biodiversity. The resources needed
to adequately monitor this change are far greater than the cadre of
professional ecologists and taxonomists can provide. Machine learning
offers one of the most promising technologies to dramatically increase our
collective capacity to provide the data and identify the taxa, and, in a
complementary fashion, to prioritise the attention of human taxonomists
where they are most needed. In Table 1 we have listed the direct
benefits to biodiversity and research into artificial intelligence, but
there are also positive impacts for society, the economy, the
environment and for the collection-holding institutions (also see \citealt{Popov_2021}). Making images
accessible in a common infrastructure is an opportunity for small
collections to gain access to tools that would otherwise be unavailable
to them given their limited resources. Indeed, Open Access for all
researchers, including those from the Global South, is critical to ensure
that collections fulfil their obligations regarding access and benefit-sharing.
Making an infrastructure accessible will require a commitment to ease of
use, good tutorials, a user-focused design and capacity building.
Such an infrastructure aligns with the European Strategy for Data
\citep{european_commission_european_2020}, which aims to overcome challenges related to
fragmentation, data availability and reuse, data quality and
interoperability, and dissolve barriers across sectors. Having a global
infrastructure in place will incentivise natural history collections and
their funders to digitise their specimens, and attract funding to do so.
Infrastructures that aggregate images of biological specimens from
different sources do exist. For example,
Europeana bridges the
historical divide between natural history and cultural history
collections and it can be searched as a whole
\citep{petras_europeana_2017}, but
all images have to be downloaded before they can be used in image
analysis. GBIF aggregates specimen data from many different providers,
which can also supply links to images of these specimens. GBIF maintains
a temporary
cache of these images, using
Thumbor,
but they are not readily available for processing.
In the USA,
iDigBio (Integrated
Digitized Biocollections) was established to coordinate a nationwide
digitization effort, developing the infrastructure to standardise and
preserve specimen data. Currently it holds almost 43 million media
records. An experimental pipeline, now phased out, had been set up for users
to apply image-processing algorithms to subsets of the hosted media
\citep{collins_pipeline_2018}.
In Europe,
EUDAT provides data
access, services and storage to support the scientific community. Its
development is driven by a network of more than 20 European research
organisations and data centres in 14 different European
countries. It forms the backbone of the European Open Science Cloud
(EOSC), which aims to offer open and seamless services for the storage,
management, analysis and re-use of research data, across borders and
scientific disciplines. In the context of the
Herbadrop project, EUDAT has managed and processed more than 2 million specimen
images, equalling more than 15 TB of data and 180,000 hours
of computation.
The recently (2021) funded US National Science Foundation
Imageomics
Institute will establish infrastructure for biologists to use machine
learning algorithms to analyse existing image data from publicly funded
digital collections, including natural history collections. Also,
Google’s AutoML provides a commercial cloud platform for training
custom models. Such developments may mean that the
whole infrastructure does not need to be built from scratch; indeed, if
it can be built upon existing systems it will be less expensive and more
reliable.
Opportunity, obstacles and risks to realising a shared
infrastructure for natural history
collections
Given the many use cases, the large number and diversity of
stakeholders, and the potential for innovative services and research,
what is holding us back from creating the proposed infrastructure? One
clear issue is that the experts in machine learning are not always aware
of the needs of biological collections. More needs to be done to bring
these communities together, but perhaps also to find the areas of more
general interest where collections can benefit from generalised
approaches. A lack of standardisation and a consequent lack of
interoperability further impede progress \citep{lannom_fair_2020}.
We suggest that the most intractable obstacles to a shared, global
infrastructure are socio-political. We envisage an infrastructure
without institutional and national borders, in which people,
organisations and nations are co-beneficiaries of a system in which
knowledge, skills, financing and other resourcing are acknowledged
\citep{pearce_international_2020}. Furthermore, tracking the provenance of resources
is also needed to ensure reproducibility and replicability of the system
\citep{goodman_ten_2014}.
Experiments so far lack scalability and often have manual bottlenecks in
their workflows; there is also a significant time lag in the production of
results due to limited access to computational and physical resources, as
well as to the human resources needed to create and curate training datasets
\citep*{waldchen_machine_2018}.
Data access is especially important for researchers in places
where biodiversity is especially rich and threatened with extinction,
including tropical countries in the Global South \citep{fazey_who_2005}. A
large percentage of the world’s natural history specimens are housed in
collections at institutions in the Global North \citep*{thiers_worlds_2020}.
This undoubtedly contributes to the exclusion of local scientists from
research on the biodiversity of their own countries \citep{dahdouh-guebas_neo-colonial_2003}.
The establishment of a new paradigm in research on collections impacts
the frameworks and workflows currently used in collection curation and in
the research based on them, and can therefore be disruptive. One of
the largest risks is introducing inherent errors and biases that are
derived from the algorithms and from prejudices that may be embedded
unknowingly in training data \citep{boakes_distorted_2010,osoba_intelligence_2017}.
The institutions that hold collections have safeguarded this rich
resource of information about biodiversity and natural history. They are
the main stakeholders for these materials to be preserved and associated
data to become available for researchers and society. Paradoxically,
making the data accessible digitally might create the illusion that
there is no need to maintain the collections physically any more. We
must resist any impression that physical specimens have become less
valuable, because in fact the more information we can extract and link
to them, the more valuable they become for any future technology that
can be applied to them. It is therefore critical to guarantee the link
between the digital specimen and the physical specimen to ensure neither
becomes obsolete, risking the real value attached to both.
Finally, the long-term sustainability of an infrastructure should be
considered. Infrastructures consume resources, need maintenance and
replacement. Software should be updated periodically to keep up with the
latest technology releases. Infrastructures need to justify their
maintenance costs and must have the means to monitor and quantify
their impact on science, technology and society. Unstable funding and
short-term prioritisation might undermine the potential of such a
resource for the future.