The image metadata in the repository will include a reference to the image object located in the storage layer (Component 2), along with annotated training image data. Different kinds of image annotations will be supported, including geometry-based regions of interest (ROIs), taxonomic or ecological traits, and textual representations of label data. For interoperability, data standards that make these annotations machine readable are required. As different standards exist for these annotations and not all are equally suitable for every model, the platform should support multiple standards, such as COCO (JSON), Pascal VOC (XML) and image masks (rasterised or vectorised images). Multiple annotations can be made on a single specimen record, making persistent identifiers for these specimen records vital. The metadata indexed in the repository will facilitate the findability of suitable annotations, for instance, to serve as training data. A feedback mechanism may be implemented to correct and/or update annotations.
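As a concrete illustration of one such standard, the sketch below shows what a minimal COCO-style (JSON) annotation for a single region of interest could look like, written as a Python dictionary; the file name, category and persistent identifier are illustrative assumptions, not values from the platform.

```python
# A minimal sketch of a COCO-style annotation record, assuming a single
# specimen image with one region of interest (a herbarium label).
# Field names follow the public COCO object-detection schema; the
# identifiers and the persistent-ID field are illustrative only.
import json

coco_record = {
    "images": [
        {
            "id": 1,
            "file_name": "specimen_000123.jpg",
            "width": 4000,
            "height": 6000,
            # hypothetical link back to the persistent specimen identifier
            "specimen_pid": "https://doi.org/10.xxxx/specimen.000123",
        }
    ],
    "categories": [{"id": 1, "name": "herbarium_label"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [250.0, 5100.0, 1200.0, 700.0],  # [x, y, width, height]
            "area": 1200.0 * 700.0,
            "iscrowd": 0,
        }
    ],
}

print(json.dumps(coco_record, indent=2))
```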
The pre-trained machine learning models will be stored in the repository and made available for reuse, along with accuracy metrics and the model output, such as segmented features or species metadata. To ensure findability, models should be classified by use case using keywords, since they are often trained for very specific use cases but could later be reused in other contexts. As part of the metadata, suitability scores will facilitate comparison of models in terms of their efficacy, possibly through community feedback or through analytics scores that take standardised model performance metrics into account. These results should be linked to the original images used in the training of the model (on the platform) and also to the images that were analysed in the use case.
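The descriptive metadata involved could, for example, be represented by a small record along the lines of the sketch below; the field names, the handle and the scores are assumptions for illustration rather than a prescribed schema.

```python
# A minimal sketch of model metadata that could be indexed in the
# repository. Field names (use_case_keywords, suitability_score,
# training_image_pids) are illustrative, not a fixed platform schema.
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    model_pid: str              # persistent identifier of the model
    use_case_keywords: list     # e.g. ["object detection", "herbarium labels"]
    suitability_score: float    # aggregated from community feedback / analytics
    performance: dict           # standardised metrics, e.g. {"precision": 0.91}
    training_image_pids: list = field(default_factory=list)  # images used for training
    analysed_image_pids: list = field(default_factory=list)  # images analysed in the use case


record = ModelRecord(
    model_pid="hdl:20.5000.1234/model-0001",   # hypothetical handle
    use_case_keywords=["segmentation", "colour bar"],
    suitability_score=0.8,
    performance={"precision": 0.91, "recall": 0.87},
)
```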
Persistent identifiers such as Digital Object Identifiers (DOIs) or hash-based content identification (e.g., Software Heritage PIDs for code or simple SHA-256 hashes for images) will be assigned to the digital objects produced during the use of the infrastructure to make them citable. It will also be possible to assign persistent identifiers to different versions, reflecting any subsequent updates the submitter makes to the digital objects. The repository will display citations of the persistent identifiers, including links to publications in which they are included, as well as any instances of their reuse in other projects within the repository. It is important not only to make the digital objects or outcomes openly available, but also to release them under appropriate licences (e.g., Creative Commons), as indicated by the FAIR for research software (FAIR4RS) working group and by \citet*{labastida_licensing_2020}.
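For the hash-based content identification mentioned above, a minimal sketch of deriving a SHA-256 identifier for an image file could look as follows (the file name in the usage comment is hypothetical):

```python
# A minimal sketch of hash-based content identification, assuming the
# image file is available locally. The SHA-256 digest can serve as a
# stable, content-derived identifier for an image object.
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (hypothetical file name):
# print(sha256_of_file("specimen_000123.jpg"))
```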
Managed through the orchestration logic, the repository is connected to a storage system and the processing unit, while offering features such as a content-based search engine to browse the content not only by the traditional human-annotated metadata (e.g., date and place of observation, taxonomy, and others), but also by information extracted from the images themselves. Advanced features can be built into the system, such as the ability for users to upload an image and search the catalogue by similarity (e.g., similar handwritten signatures), or to query and filter the collections of data using the indexed metadata extracted from the observations, whether annotated by humans or automatically. In general terms, such functionality can be summarised as the ability to aggregate, for each specimen media record, all the information that is extracted from it, either manually or automatically, and to index it so that it is available to query.
Some good examples of similar content-based systems exist in production today. Pl@ntNet and iNaturalist provide species identification of organisms from photographs. Results can be refined by providing the user’s location, thus limiting the possible results to the most likely matches and boosting accuracy. A more general example is Google Image Search, where anyone can search images using either a keyword (e.g., dog) or an image as the search term. This function is also available on Google Photos (web or mobile), where a user can search their personal photos for specific people, different types of objects, places, ceremonies, events, and so on. Although different, all these systems share a similar logic: (1) they include machine learning models trained for specific tasks (e.g., object detection) that have been created offline using massive datasets on large GPU clusters (e.g., the TensorFlow Model Zoo and the COCO dataset); (2) when a new image is added to the collection (or possibly all images, when new models are deployed), in addition to the submitted user tags, the images are processed with these models (inference/prediction pipeline) and tags are extracted; (3) the extracted information is saved, indexed and made available as searchable data. The envisioned system should provide similar functionality, although with the added complexity of supporting a myriad of different models and images, as illustrated by the use cases listed in the previous section, such as searching for colour bars, rulers, institutional stamps or a specific trait.
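The three-step logic shared by these systems could be sketched roughly as follows; the function names and signatures (`run_model`, `index_document`) are placeholders rather than actual platform APIs.

```python
# A minimal sketch of the ingest-infer-index logic described above,
# assuming two placeholder callables provided by the processing and
# repository components respectively.
from typing import Callable, Dict, List


def ingest_image(image_path: str,
                 user_tags: List[str],
                 models: Dict[str, Callable[[str], List[str]]],
                 index_document: Callable[[dict], None]) -> dict:
    """Run every registered model on a new image and index the combined tags."""
    extracted = {}
    for model_name, run_model in models.items():
        # inference/prediction pipeline: each model returns a list of tags
        extracted[model_name] = run_model(image_path)

    document = {
        "image": image_path,
        "user_tags": user_tags,          # manually supplied metadata
        "extracted_tags": extracted,     # automatically inferred metadata
    }
    index_document(document)             # make it searchable in the repository
    return document
```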

Component 2: The Storage

The storage component (Fig. \ref{205447}) encompasses all physical storage that is a local part of the platform, and on which images, models, metadata and results are stored. It also includes the functions, managed via the orchestration logic, that are required to manage that data in terms of access control (i.e., governance) and low-level file management (such as back-ups). Higher-level management, such as handling uploads, selection of specific images and the moving of images to processing, is the responsibility of other components. The storage component is divided into two areas, archive and regular (active) storage. This distinction is primarily a technical one, separating the high-performance storage required for accessing images while training models from less demanding storage for other purposes.
Whether images are mirrored from their original source onto the platform, or only downloaded temporarily when needed for training, is a technical design question that should be answered during implementation. While this choice has no functional impact, it does have profound technical implications, as well as budgetary consequences. Locally mirroring all images referenced in the repository guarantees availability and predictable speed of access, but will also require extensive management to accurately reflect changes made to the source material, and will take up an increasingly large storage volume. On the other hand, while downloading images on-the-fly greatly diminishes the required storage volume, it implies less control over availability and carries the risk of images becoming unavailable over time.
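As a minimal sketch of the on-the-fly alternative, assuming images are fetched from their source into a temporary local cache only when needed (the URL handling and directory layout are illustrative):

```python
# A minimal sketch of on-demand image retrieval into temporary local
# storage. The cache directory and the assumption that the file name can
# be derived from the URL are illustrative simplifications.
import os
import requests


def fetch_image(url: str, cache_dir: str = "/tmp/training_cache") -> str:
    """Download an image to local storage if not already present; return its local path."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):            # reuse a previously fetched copy
        response = requests.get(url, timeout=30)
        response.raise_for_status()               # the source image may have become unavailable
        with open(local_path, "wb") as fh:
            fh.write(response.content)
    return local_path
```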

Storage of training images

Images to use in training are discovered through the repository component, which functions as a central index of images, metadata, models and results. Actual image files might be hosted on the platform, or remotely, on servers of associated parties. In the case of the latter, because of the technical requirements (i.e., high throughput, guaranteed availability, low latency), these images must be downloaded to the platform and made available locally to be used in the training of models. Selection of these images is done in the repository, and the orchestration logic functions as a broker between the repository and remote hosting facilities, taking care of downloading the images. The storage component is responsible for the local storage of these files. This includes facilitating access control (i.e., keeping track of which images belong to which training jobs), and making images available to the processing component, where the actual training takes place. In the scenario where the local storage of training images is temporary, the images will be deleted once the training cycle of a model has been completed, while only the references in the repository to those images are retained with the resulting model. The handling of images while stored in the system, including their accessibility and deletion policies, is subordinate to the platform’s governance policies.
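A minimal sketch of the bookkeeping this implies, assuming a simple in-memory mapping from training jobs to locally staged images; the function names and paths are illustrative, not part of the platform’s API.

```python
# A minimal sketch of tracking which locally staged images belong to
# which training job, and removing the temporary copies once training
# has completed while retaining only their repository references.
import os
from typing import Dict, List

# training_job_id -> list of (repository_pid, local_path) pairs
staged_images: Dict[str, List[tuple]] = {}


def stage_image(job_id: str, repository_pid: str, local_path: str) -> None:
    """Associate a locally downloaded image with a training job."""
    staged_images.setdefault(job_id, []).append((repository_pid, local_path))


def finish_training(job_id: str) -> List[str]:
    """Delete temporary local copies and return the PIDs kept with the model."""
    retained_pids = []
    for repository_pid, local_path in staged_images.pop(job_id, []):
        if os.path.exists(local_path):
            os.remove(local_path)                 # temporary copy removed after training
        retained_pids.append(repository_pid)      # reference kept in the repository
    return retained_pids
```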

Storage of models

Once a model is deemed finished or suitable for use, it may be published as such in the repository, and thus become available for researchers to use. Again, the repository functions as a central index that allows researchers to find suitable models, while the actual code that makes up a model will be stored in the storage component. Once a model has been selected by a researcher for use (see also the next section), it is retrieved from storage and copied to the processing component for use. A similar scenario applies when a stored model is used as the basis from which to further train a new model, or a new version of the same model (transfer learning). Since there are no specific performance requirements for storing a model, models will be stored in the archive section of the media storage component. Besides models that have been trained locally, the platform can also host and publish models that were trained elsewhere. From the point of view of storage, these models are treated identically to models trained locally. As with images, availability of and access to models stored on the platform is subject to governance policies.

Storage of images for analysis

Another function of the processing component is using ‘finished’ machine learning models for the analysis of images, resulting in the annotation of newly uploaded images (with or without existing metadata) with new metadata such as classifications or identified regions of interest. For this purpose, images will be uploaded by researchers, after having selected a model or models from the repository to run on the images. Uploaded images will be stored in the storage component, and kept there for the duration of the experiment running the selected models. Responsibility for running these experiments, including the loading and execution of the selected models, lies with the processing component. Making the images available to the models is facilitated by the orchestration logic.
Once experiments have been completed, these images will be moved to a low-performance part of the media storage component (archive storage), where they are stored with the newly acquired metadata, in line with relevant governance policies. These archived images and their annotations are registered in the repository component, so as to make them findable by other researchers. If, at a later stage, someone wants to perform further image analysis on images that were analysed previously, these images can be moved back to the active storage area for further analysis.
The technical requirements for analysis processes are far less demanding than those of training processes, especially with regard to the need for constant high throughput. It is therefore conceivable that the platform will allow researchers to access stored models remotely through an API, in which case no images are stored locally.
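Such remote access could, for instance, take the form sketched below, assuming a hypothetical HTTP endpoint that accepts an image upload and a model identifier; the URL, parameter names and response format are assumptions for illustration.

```python
# A minimal sketch of remote analysis through an API. The endpoint,
# parameters and returned structure are hypothetical, not an actual
# platform interface.
import requests

API_URL = "https://platform.example.org/api/v1/infer"   # hypothetical endpoint


def analyse_remotely(image_path: str, model_id: str) -> dict:
    """Send an image to a remotely hosted model and return its predicted annotations."""
    with open(image_path, "rb") as fh:
        response = requests.post(
            API_URL,
            params={"model": model_id},
            files={"image": fh},
            timeout=60,
        )
    response.raise_for_status()
    return response.json()     # e.g. predicted classes or regions of interest
```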

Storage of model results

Value for researchers is to be gained from access to results derived from the models on the platform. These results might be produced by analysis processes as described above, or by use of a model remotely, either via API access or even by running a model entirely remotely. These results can take many forms; besides previously mentioned examples such as classification or the identification of regions of interest, they can also include more generalised performance characteristics of a model, such as the average recall and precision for a given set of images in the case of a classification experiment. Uploading such results, in whatever format they might take, and associating them with the models that generated them is the responsibility of the repository component, while the physical storage of the data is taken care of by the storage component. Negotiation between the two components, both when storing and when retrieving, is performed by the orchestration logic. Again, all handling of these results follows the platform’s governance policies.
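As an example of such generalised performance characteristics, precision and recall for one class of a classification experiment can be computed as in the sketch below (the labels in the usage comment are illustrative data only).

```python
# A minimal sketch of computing precision and recall for a single class
# from paired lists of true and predicted labels.
def precision_recall(true_labels, predicted_labels, positive_class):
    """Compute precision and recall for one class from paired label lists."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if p == positive_class and t == positive_class)
    fp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if p == positive_class and t != positive_class)
    fn = sum(1 for t, p in zip(true_labels, predicted_labels)
             if p != positive_class and t == positive_class)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Example with illustrative labels:
# precision_recall(["oak", "oak", "elm"], ["oak", "elm", "elm"], "oak")  -> (1.0, 0.5)
```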

Component 3: The Processing

The processing component encompasses all the services and pipelines whose focus is to run computational tasks on batches of data, either newly arriving in the infrastructure or already existing in the system, such as data stored in the repository and storage components (Fig. \ref{205447}). In other words, it supports a myriad of computationally intensive tasks, from ingesting new data, to the automated extraction of information from media, to exporting new datasets or scheduling the training of new models or the retraining of old ones.
This component requires a considerable amount of computing power to handle all the scheduled tasks in the system, and this power can even be provisioned elastically (i.e., following cloud principles) given the fluctuating demand. These tasks are delegated by the orchestration logic component, a set of services that are responsible for handling external requests, such as those from users through frontend applications, or from other external services using one of the public APIs, serving as both gateway and manager to the main internal components – repository, storage, and processing (Fig. \ref{205447}). The biggest computational demand comes from tasks related to the periodic creation of machine learning and deep learning models, updating the existing services or adding new ones. For these, specific hardware capabilities such as several GPU/TPU instances may be required from time to time.
The processing component, and the tasks and services supporting it, should be able to scale vertically, that is, handle more tasks by adding more RAM, more CPU cores or a better GPU to a cluster node, but preferably also be able to scale horizontally, namely by adding more nodes, and hence process multiple independent tasks in parallel.
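At the task level, this kind of parallelism could be sketched as follows, with worker processes standing in for cluster nodes; in a real deployment this role would be played by a cluster scheduler, and the task function is a placeholder.

```python
# A minimal sketch of horizontal-style parallelism at the task level:
# independent tasks are distributed across worker processes.
from concurrent.futures import ProcessPoolExecutor


def process_task(task_id: int) -> str:
    """Placeholder for one independent, computationally intensive task."""
    return f"task {task_id} done"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:   # more workers ~ more nodes
        results = list(pool.map(process_task, range(10)))
    print(results)
```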
The processing component can be organised into sub-components, among which are: (1) Data ingestion, (2) Machine learning models and analytics services (such as image segmentation, object detection and image classification), (3) Analytics pipelines (processes or programming scripts built to provide analytical services), (4) Data integration and (5) Data export; these help to deal with any given use case, such as depositing new images and metadata, annotating the images, and depositing trained deep learning models.

Data ingestion

Data ingestion is the process of adding new data to the system, encompassing tasks such as crawling, parsing, validating and transforming information to be indexed. This process covers several types of data, including metadata, images, annotations, and analytics pipelines (which include services and models), among others. To this end, specific tools should handle the data coming into the infrastructure, following different paths depending on the data’s source and type. Some brief examples are given below.

Specimen datasets

When a new dataset, such as a metadata set, is submitted, each entry of the set undergoes a series of tasks to parse, validate and transform the information into a standardised entry. This may include crawling additional data from external services like GBIF and Wikidata, computing metrics, or validating geographic coordinates and mapping them to locations. Additionally, this process will check for duplicate entries based on the existing data in the infrastructure.
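Two of these validation steps, coordinate checking and a naive duplicate check, could be sketched as follows; the record fields and the duplicate criterion are illustrative assumptions.

```python
# A minimal sketch of two validation steps during dataset ingestion:
# checking geographic coordinates and flagging duplicate entries.
from typing import Iterable


def valid_coordinates(lat: float, lon: float) -> bool:
    """Coordinates must fall within the valid latitude/longitude ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def is_duplicate(record: dict, existing: Iterable[dict]) -> bool:
    """Naive duplicate check on catalogue number and collection code (illustrative criterion)."""
    key = (record.get("catalogNumber"), record.get("collectionCode"))
    return any((r.get("catalogNumber"), r.get("collectionCode")) == key for r in existing)
```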

Specimen images

Following the above-mentioned logic, if a dataset contains an associated image, the file needs to be processed before being added to the system. This might include validating the metadata of the file, transformations and even triggering the analytics pipelines to automatically infer new data about the image content itself.

Image annotations

One of the key features of the system will be the ability to provide annotations for the existing images. When a set of annotations is supplied, these need to be validated and ingested into the system in a series of steps, depending on the data type. This includes ingesting, validating and transforming them into standard data types and structures, depending on the problem (e.g., classification, object detection, natural language processing and optical character recognition). After preprocessing, the set of annotations will be further validated, for example to check whether they duplicate existing annotations, whether the attached labels make sense, whether the tagged region falls inside the image, and so on. This information will then be indexed and provided by the repository component and can be included in datasets, which will serve to improve existing inference tools and to develop new ones.
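One of these checks, verifying that a tagged region falls inside the image, could be sketched as follows; the [x, y, width, height] box convention mirrors the COCO format and the argument names are illustrative.

```python
# A minimal sketch of an annotation validation step: verifying that a
# tagged bounding box lies entirely within the image bounds.
def bbox_inside_image(bbox, image_width: int, image_height: int) -> bool:
    """Return True if the [x, y, width, height] box lies within the image bounds."""
    x, y, w, h = bbox
    return (x >= 0 and y >= 0 and w > 0 and h > 0
            and x + w <= image_width and y + h <= image_height)


# Example: a 1200x700 label region at (250, 5100) inside a 4000x6000 image
# bbox_inside_image([250, 5100, 1200, 700], 4000, 6000)  -> True
```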

Machine learning models and analytics services

The same applies to other tasks such as submitting a new image analysis pipeline. As explained below, analytics pipelines provide services to extract information from images and other sources. These sets of services can be added over time and should be open, well documented and reproducible. This means that the new pipeline, such as one to identify the species of a specimen, would include data and metadata; machine learning models; source code; service containers; automated workflow and service provisioning information as code; results and others. Each of these must be verified and tested, before being included as part of the analytics toolset.
Eventually, the ingestion of data of one type will trigger other sub-components of the system, such as the analytics pipelines (to infer data from images) or the data integration components, which in turn invoke other parts of the system, such as the repository (to index the parsed information) or the storage component (to store the image and its derivatives if needed).

Analytics pipelines

This sub-component encompasses the set of services and functionalities responsible for processing images or other media, to automatically infer information that would otherwise be manually tagged, e.g., identifying a specific trait. To this end, each service provides specific functionality, and encompasses a sequence of instructions, from using multiple pre-trained machine learning and deep learning models, to image transformations or other solutions, depending on the problem at hand. For instance, when ingesting a new dataset, for each given specimen image, various analytics pipelines will be scheduled to run, each made of different steps and deep learning models trained for specific tasks (e.g., detect mercuric chloride stains, identify specific botanical traits, extract collector/label information). As a result, the output might include the predicted species, the presence of stains, particular features of the images, text from handwritten and computerised labels, specific traits and their values, and so on (Figs. \ref{240550}, \ref{478509}).
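A minimal sketch of how such a pipeline could be composed from a sequence of steps is given below; the step functions are placeholders standing in for real models (e.g., a stain detector or a label text extractor) and do not represent actual platform code.

```python
# A minimal sketch of composing an analytics pipeline from a sequence of
# steps, each wrapping a (pre-trained) model or an image transformation.
from typing import Callable, Dict, List


def run_pipeline(image_path: str,
                 steps: List[Callable[[str, Dict], Dict]]) -> Dict:
    """Run each step in order, accumulating the inferred information."""
    results: Dict = {}
    for step in steps:
        results.update(step(image_path, results))   # later steps may use earlier output
    return results


# Illustrative placeholder steps:
def detect_stains(image_path, results):
    return {"mercuric_chloride_stains": False}      # placeholder output


def extract_label_text(image_path, results):
    return {"label_text": "..."}                    # placeholder output


# run_pipeline("specimen_000123.jpg", [detect_stains, extract_label_text])
```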