Data integration

Data integration will take the data extracted or generated by the above-mentioned sub-components and push it to the respective parts of the system, that is, the repository (e.g., the metadata registry of trained models and images, datasets, annotated data, and so on) and the storage (e.g., image files and their derivatives, pre-trained models, metadata packages, among others).
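
A minimal sketch of how such an integration step might look, given here in Python, assumes hypothetical registry and object-store interfaces; none of the names below are defined elsewhere in this paper:

\begin{verbatim}
# Hypothetical sketch of the data integration step. The 'registry' and
# 'store' clients are illustrative placeholders, not real APIs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedItem:
    identifier: str               # identifier of the digital specimen
    metadata: dict                # annotations, model outputs, provenance
    payload_path: Optional[str]   # local path to an image, model or package

def integrate(item: ExtractedItem, registry, store) -> None:
    """Push one item produced by an upstream sub-component into the system."""
    # Binary artefacts (image derivatives, pre-trained models, metadata
    # packages) are uploaded to storage and replaced by a storage reference.
    if item.payload_path is not None:
        item.metadata["storage_ref"] = store.upload(item.payload_path)
    # The structured metadata is then registered in the repository, linked
    # to the digital specimen identifier.
    registry.register(item.identifier, item.metadata)
\end{verbatim}

The split mirrors the description above: large binary artefacts go to storage, and the metadata registered in the repository carries a reference to them.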

Data export

The system will catalogue millions of specimens, each with a variable amount of metadata. These data can be filtered with complex queries based on several parameters and fields. As an example, a user might want to search for specimens of a specific species that have images and annotations regarding the presence of signatures within a specific timespan. Requesting the generation of an image dataset based on the result of such a query requires several processing tasks that need to be scheduled: extracting and merging the relevant metadata into the desired format, resizing images if needed, assigning a persistent identifier so the dataset can be uniquely identified, generating a dataset page, and notifying the user. Moreover, if new images matching the same search criteria are annotated in the following months, the user might request an update of the dataset, generating a second version and assigning a new or versioned persistent identifier such as a DOI. As an example of the feasibility of this approach, part of this functionality is already demonstrated by GBIF, which uses background jobs to export datasets on user request (excluding images and DOIs, but allowing the export of metadata based on queries). Moreover, this sub-component may also be responsible for exporting machine learning datasets to public platforms such as the Registry of Open Data on AWS or Google Datasets, allowing users to easily mount them on external cloud solutions.
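
As a rough illustration only, the export workflow could be organised as a queued background job along the following lines; every service interface used below (search index, image store, DOI minting, landing pages, notification) is a hypothetical placeholder rather than part of any existing GBIF or cloud API:

\begin{verbatim}
# Illustrative sketch of a queued dataset-export job. All service
# interfaces on 'services' are hypothetical placeholders.
import hashlib
import json
from typing import Optional

def export_dataset(query: dict, user_email: str, services,
                   previous_version: Optional[int] = None) -> str:
    """Build (or update) an image dataset from a stored query."""
    # 1. Resolve the query against the metadata index.
    records = services.search_index.find(query)
    # 2. Extract and merge the relevant metadata into the desired format.
    manifest = [record.to_export_record() for record in records]
    # 3. Fetch the associated images, resizing them if needed.
    images = [services.image_store.fetch(r.image_ref, max_edge=2048)
              for r in records]
    # 4. Assign a persistent identifier; an updated dataset gets a new
    #    version and a new or versioned identifier (e.g. a DOI).
    version = 1 if previous_version is None else previous_version + 1
    checksum = hashlib.sha256(
        json.dumps(manifest, sort_keys=True, default=str).encode()
    ).hexdigest()
    doi = services.doi_service.mint(query=query, version=version,
                                    checksum=checksum)
    # 5. Publish a dataset landing page and notify the requesting user.
    page_url = services.pages.publish(doi, manifest, images)
    services.notifier.email(
        user_email, f"Dataset {doi} (v{version}) is ready: {page_url}")
    return doi
\end{verbatim}

In practice each numbered step would likely run as a separate task on a job queue, so that long-running exports and later dataset versions can be scheduled and retried independently.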

Discussion

As the decades of the 21st century proceed, we anticipate important changes in global biodiversity. The resources needed to adequately monitor this change are far greater than the cadre of professional ecologists and taxonomists can provide. Machine learning offers one of the most promising technologies to dramatically increase our collective capacity to provide the data and identify the taxa, and, in a complementary fashion, to prioritise the attention of human taxonomists where it is most needed. In Table 1 we list the direct benefits to biodiversity and to research into artificial intelligence, but there are also positive impacts for society, the economy, the environment and the collection-holding institutions (see also \citealt{Popov_2021}). Making images accessible in a common infrastructure is an opportunity for small collections to gain access to tools that would otherwise be unavailable to them given their limited resources. Indeed, Open Access for all researchers, including those from the Global South, is critical to ensure that collections fulfil their obligations to access and benefit sharing. Making an infrastructure accessible will require a commitment to ease of use, good tutorials, a user-focused design and capacity building.
Such an infrastructure aligns with the European Strategy for Data \citep{european_commission_european_2020}, which aims to overcome challenges related to fragmentation, data availability and reuse, data quality and interoperability, and dissolve barriers across sectors. Having a global infrastructure in place will incentivize natural history collections and their funders to digitise their specimens, and attract funding to do so.
Infrastructures that aggregate images of biological specimens from different sources do exist. For example, Europeana bridges the historical divide between natural history and cultural history collections and can be searched as a whole \citep{petras_europeana_2017}, but all images have to be downloaded before they can be used in image analysis. GBIF aggregates specimen data from many different providers, which can also supply links to images of these specimens. GBIF maintains a temporary cache of these images, using Thumbor, but the images are not readily available for processing.
In the USA, iDigBio (Integrated Digitized Biocollections) was established to coordinate a nationwide digitisation effort, developing the infrastructure to standardise and preserve specimen data. It currently holds almost 43 million media records. An experimental pipeline had been set up for users to apply image processing algorithms to subsets of the hosted media \citep{collins_pipeline_2018}, although it has since been phased out.
In Europe, EUDAT provides data access, services and storage to support the scientific community. Its development is based on a network of more than 20 European research organisations and data centres in 14 different European countries. It forms the backbone of the European Open Science Cloud (EOSC), which aims to offer open and seamless services for the storage, management, analysis and re-use of research data across borders and scientific disciplines. In the context of the Herbadrop project, EUDAT has managed and processed more than 2 million specimen images, equalling more than 15 TB of data and 180,000 hours of computation.
The US National Science Foundation Imageomics Institute, funded in 2021, will establish infrastructure for biologists to use machine learning algorithms to analyse existing image data from publicly funded digital collections, including natural history collections. Google's AutoML likewise provides a commercial cloud platform for training custom models. Such developments may mean that the whole infrastructure does not need to be built from scratch; indeed, if it can be built upon existing systems it will be less expensive and more reliable.
Opportunity, obstacles and risks to realising a shared infrastructure for natural history collections
Given the many use cases, the large number and diversity of stakeholders, and the potential for innovative services and research, what is holding us back from creating the proposed infrastructure? One clear issue is that the experts in machine learning are not always aware of the needs of biological collections. More needs to be done to bring these communities together, but perhaps also to find the areas of more general interest where collections can benefit from generalised approaches. A lack of standardisation and consequent lack of interoperability further impedes progress \citep{lannom_fair_2020}.
We suggest that the most intractable obstacles to a shared, global infrastructure are socio-political. We envisage an infrastructure without institutional and national borders, in which people, organisations and nations are co-beneficiaries of the system and in which knowledge, skills, financing and other resourcing are acknowledged \citep{pearce_international_2020}. Furthermore, tracking the provenance of resources is also needed to ensure the reproducibility and replicability of the system \citep{goodman_ten_2014}.
Experiments so far lack scalability, often have manual bottlenecks in the workflows, and there is a significant time lag in the production of results due to limited access to computational and physical resources, but also to human resources to create and curate training datasets \citep*{waldchen_machine_2018}.
Data access is especially important for researchers in places where biodiversity is especially rich and threatened with extinction, including tropical countries in the Global South \citep{fazey_who_2005}. A large percentage of the world’s natural history specimens are housed in collections in institutions in the Global North \citep*{thiers_worlds_2020}. This undoubtedly contributes to the exclusion of local scientists from research on their own countries \citep{dahdouh-guebas_neo-colonial_2003}.
The establishment of a new paradigm in collections-based research affects the frameworks and workflows currently used in collection curation, and in the research based on them, and can therefore be disruptive. One of the largest risks is introducing errors and biases that derive from the algorithms themselves and from prejudices that may be embedded unknowingly in training data \citep{boakes_distorted_2010,osoba_intelligence_2017}.
The institutions that hold collections have safeguarded this rich resource of information about biodiversity and natural history. They are the main stakeholders in preserving these materials and in making the associated data available to researchers and society. Paradoxically, making the data accessible digitally might create the illusion that there is no longer a need to maintain the collections physically. We must resist any impression that physical specimens are less valuable: in fact, the more information we can extract and link to them, the more valuable they become for any future technology that can be applied to them. It is therefore critical to guarantee the link between the digital specimen and the physical specimen, to ensure neither becomes obsolete, risking the real value attached to both.
Finally, the long-term sustainability of an infrastructure should be considered. Infrastructures consume resources and need maintenance and replacement. Software should be updated periodically to keep up with the latest technology releases. Infrastructures need to justify their maintenance costs and must have the means to monitor and quantify their impact on science, technology and society. Unstable funding and short-term prioritisation might undermine the potential of such a resource for the future.