The image metadata in the repository will include a reference to the
image object located in the storage layer (Component 2), along with
annotated training image data. Different kinds of image annotations will
be supported, including geometric-based regions of interest (ROI),
taxonomic or ecological traits and textual representations of label
data. For interoperability, data standards supporting the machine
readability of these annotations are required. As different standards
exist for these annotations, and not all are equally suitable for every
model, the platform should support multiple standards, such
as COCO (JSON), Pascal VOC (XML)
and image masks (rasterized or vectorized images). Multiple annotations
can be made on a single specimen record, making persistent identifiers
for these specimen records vital. The metadata indexed in the repository
will facilitate the findability of suitable annotations, for instance,
to serve as training data. A feedback mechanism may be implemented to
correct and/or update annotations.
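As a minimal illustration of what multi-standard support could mean in practice, the sketch below converts a single COCO-style bounding-box annotation into a Pascal VOC-style XML fragment; the file name, identifiers and category are placeholder values, and the snippet is a sketch rather than a prescribed implementation.

\begin{verbatim}
import json
import xml.etree.ElementTree as ET

# Hypothetical COCO-style annotation for one specimen image (placeholder values).
coco = json.loads("""
{
  "images": [{"id": 1, "file_name": "specimen_0001.jpg", "width": 4000, "height": 6000}],
  "annotations": [{"id": 10, "image_id": 1, "category_id": 2,
                   "bbox": [120.0, 340.0, 800.0, 1200.0]}],
  "categories": [{"id": 2, "name": "label"}]
}
""")

def coco_to_voc(coco_doc):
    """Convert COCO [x, y, width, height] boxes to a Pascal VOC-style XML tree."""
    image = coco_doc["images"][0]
    names = {c["id"]: c["name"] for c in coco_doc["categories"]}
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = image["file_name"]
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(image["width"])
    ET.SubElement(size, "height").text = str(image["height"])
    for ann in coco_doc["annotations"]:
        x, y, w, h = ann["bbox"]
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = names[ann["category_id"]]
        box = ET.SubElement(obj, "bndbox")
        ET.SubElement(box, "xmin").text = str(int(x))
        ET.SubElement(box, "ymin").text = str(int(y))
        ET.SubElement(box, "xmax").text = str(int(x + w))
        ET.SubElement(box, "ymax").text = str(int(y + h))
    return ET.tostring(root, encoding="unicode")

print(coco_to_voc(coco))
\end{verbatim}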
The pre-trained machine learning models will be stored in the repository
and made available for reuse, along with accuracy metrics and the model
output, such as the segmented features or species metadata. To ensure
findability, models should be classified by use-case through the use of
keywords, since they are often trained for very specific use-cases but
could later be reused in other contexts. As part of the metadata,
suitability scores will facilitate comparison of models in terms of
their efficacy, possibly through community feedback or by analytics
scores that take standardised model performance metrics into account.
These results should be linked to the original images used in the
training of the model (on the platform) and also to the images that were
analysed in the use case.
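A model metadata record of the kind described here might, for instance, look like the following sketch; all field names, identifiers and scores are illustrative assumptions rather than a fixed schema.

\begin{verbatim}
import json

# Hypothetical metadata record for a published model; field names,
# identifiers and scores are placeholders, not a prescribed schema.
model_record = {
    "model_id": "doi:10.xxxx/placeholder",              # persistent identifier
    "use_case_keywords": ["herbarium sheets", "label segmentation"],
    "suitability_score": 4.2,                            # e.g. from community feedback
    "training_images": ["hash:sha256:...", "hash:sha256:..."],
    "analysed_images": ["hash:sha256:..."],
}

print(json.dumps(model_record, indent=2))
\end{verbatim}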
Persistent identifiers such as Digital Object Identifiers (DOIs) or
hash-based content identification (e.g., Software
Heritage PIDs for code or simple SHA-256 hashes for images) will be
assigned to the digital objects produced during the use of the
infrastructure to make them citable. It will also be possible to assign
persistent identifiers to different versions, reflecting any subsequent
updates the submitter makes to the digital objects. The repository will
display citations of the persistent identifiers, including links to
publications in which they are included, as well as any instances of
their reuse in other projects within the repository. It is not only
important to make the digital objects or outcomes openly available, but
also under appropriate licences
(e.g., Creative Commons) as indicated by the FAIR for Research
Software (FAIR4RS) working group and by \citet*{labastida_licensing_2020}.
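For the hash-based identifiers mentioned above, a content hash can be derived directly from the image bytes using standard tooling; the sketch below uses Python's hashlib and a placeholder file name.

\begin{verbatim}
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Compute a SHA-256 content hash, usable as a simple content-based identifier."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder file name; the same bytes always yield the same identifier.
print(sha256_of_file("specimen_0001.jpg"))
\end{verbatim}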
Managed through the orchestration logic, the repository is connected to
a storage system and the processing unit, and offers features such as a
content-based search engine for browsing the content not only by the
traditional human-annotated metadata (e.g., date and place of
observation, taxonomy, and others), but also by information extracted
from the images themselves. Advanced features can be built into the
system, such as the ability for users to upload an image and search the
catalogue by similarity (e.g. similar handwritten signatures), or to
query and filter the collections using the indexed metadata extracted
from the observations, whether annotated manually or automatically. In
general terms, such functionality can be summarised as the ability to
aggregate, for each specimen media record, all the information extracted
from it manually or automatically, and to index that information so it
is available for querying. Some good examples of similar
content-based systems exist in production today.
Pl@ntNet and
iNaturalist provide species
identification of organisms from photographs. Results can be refined by
providing the user’s location, thus limiting the possible results to the
most likely matches, boosting accuracy. A more general example is Google
Image Search, where anyone can search images using either a keyword
(e.g., dog), or using an image as the search term. This function is
also available on Google Photos (web or mobile), where a user can search
their personal photos for specific people, different types of objects,
places, ceremonies, events, and so on. Although different, all those
systems share similar logic: (1) they include machine learning models
trained for specific tasks (e.g. object detection) that have been
created offline using massive datasets in large GPU clusters (e.g., the
TensorFlow Model Zoo and the COCO dataset);
(2) when a new image is added to the collection (or possibly all
existing images, when new models are deployed), in addition to the
submitted user tags, the
images are processed with these models (inference/prediction pipeline)
and tags are extracted; (3) the extracted information is saved and
indexed, and made available as searchable data. The envisioned system
should provide similar functionality, although with the added complexity
of supporting a myriad of different models and images, as illustrated by
the use cases listed in the previous section, such as searching for
colour bars, rulers, the institutional stamp or a specific trait.
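The shared three-step logic outlined above can be summarised in a schematic sketch; the model objects, their predict method and the in-memory index are hypothetical placeholders standing in for the platform's actual inference and indexing services.

\begin{verbatim}
# Schematic sketch of the inference-and-index logic described above; the
# model objects and their predict method are hypothetical placeholders.
searchable_index = {}   # image identifier -> set of searchable tags

def ingest_image(image_id, image_bytes, user_tags, models):
    """Steps 2 and 3: run pre-trained models on a new image and index the result."""
    tags = set(user_tags)
    for model in models:
        # inference/prediction pipeline: each model contributes its own tags,
        # e.g. "colour bar present", "ruler", "institutional stamp"
        tags.update(model.predict(image_bytes))
    searchable_index[image_id] = tags

def search(query_tag):
    """Return all images whose extracted or user-supplied tags match the query."""
    return [image_id for image_id, tags in searchable_index.items()
            if query_tag in tags]
\end{verbatim}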
Component 2: The Storage
The storage component (Fig. \ref{205447}) encompasses all physical storage that is
a local part of the platform, and on which images, models, metadata and
results are stored. It also includes the functions, managed via the
orchestration logic, that are required to manage that data in terms of
access control (e.g. governance) and low-level file management (such as
back-ups). Higher-level management, such as handling uploads,
selection of specific images and the moving of images to processing, is
the responsibility of other components. The storage component is divided
into two areas, archive and regular (active) storage. This distinction
is primarily a technical one, separating high-performance storage
required for accessing images while training models, from less advanced
storage for other purposes.
Whether images are mirrored from their original source onto the
platform, or if they are only downloaded temporarily onto the platform
when needed for training, is a technical design question that should be
answered during implementation. While this choice has no functional
impact, it does have profound technical implications, as well as
budgetary consequences. Locally mirroring all images referenced in the
repository guarantees availability and predictable speed of access, but
will also require extensive management to accurately reflect changes
made to the source material, and will take up an increasingly large
storage volume. On the other hand, while downloading images on-the-fly
greatly diminishes the required storage volume, it implies less control
over availability and carries the risk of images becoming unavailable
over time.
Storage of training images
Images to use in training are discovered through the repository
component, which functions as a central index of images, metadata,
models and results. Actual image files might be hosted on the platform,
or remotely, on servers of associated parties. In the case of the
latter, because of the technical requirements (i.e., high throughput,
guaranteed availability, low latency), these images must be downloaded
to the platform and made available locally to be used in the training of
models. Selection of these images is done in the repository, and the
orchestration logic functions as a broker between the repository and
remote hosting facilities, taking care of downloading of images. The
storage component is responsible for the local storage of these files.
This includes facilitating access control (i.e. keeping track of what
images belong with which training jobs), and making images available to
the processing component, where the actual training takes place. In the
scenario where the local storage of training images is temporary, the
images will be deleted once the training cycle of a model has been
completed, while only the references in the repository to those images
are retained with the resulting model. The handling of images while
stored in the system, including their accessibility and deletion
policies, is subordinate to the platform’s governance policies.
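A rough sketch of this staging behaviour is given below, in which downloading, job-scoped bookkeeping and deletion stand in for the orchestration and governance logic; the directory layout and function names are assumptions.

\begin{verbatim}
import shutil
import tempfile
import urllib.request
from pathlib import Path

def stage_training_images(job_id, image_urls):
    """Download remotely hosted images into job-scoped local storage (placeholder logic)."""
    job_dir = Path(tempfile.gettempdir()) / f"training_job_{job_id}"
    job_dir.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for i, url in enumerate(image_urls):
        target = job_dir / f"image_{i:06d}.jpg"
        urllib.request.urlretrieve(url, target)   # broker-style download
        local_paths.append(target)
    return job_dir, local_paths

def cleanup_training_images(job_dir):
    """Delete temporarily staged images once the training cycle has completed."""
    shutil.rmtree(job_dir, ignore_errors=True)
\end{verbatim}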
Storage of models
Once a model is deemed finished or suitable for use, it may be published
as such in the repository, and thus become available for researchers to
use. Again, the repository functions as a central index that allows
researchers to find suitable models, while the actual code that makes up
a model will be stored in the storage component. Once a model has been
selected by a researcher for use (see also next section), it is
retrieved from storage and copied to the processing component for use. A
similar scenario applies when a stored model is used as the basis from
which to further train a new model, or a new version of the same model
(transfer learning). Since there are no specific performance
requirements for storing a model, they will be stored in the archive
section of the media storage component. Besides models that have been
trained locally, the platform can also host and publish models that were
trained elsewhere. From the point of view of storage, these models are
treated identically to models trained locally. As with images,
availability of and access to models stored on the platform is subject
to governance policies.
Storage of images for analysis
Another function of the processing component is using ‘finished’ machine
learning models for the analysis of images, resulting in the annotation
of newly uploaded images, whether or not they already carry metadata,
with new metadata (such as a classification or identified regions of
interest). For this purpose,
images will be uploaded by researchers, after having selected a model or
models from the repository to run on the images. Uploaded images will be
stored in the storage component, and kept there for the duration of the
experiment running the selected models. Responsibility for running these
experiments, including the loading and execution of the selected models,
lies with the processing component. Actively making available the images
to the models is facilitated by orchestration logic.
Once experiments have been completed, these images will be moved to a
low-performance part of the media storage component (archive storage),
where they are stored with the newly acquired metadata, in line with
relevant governance policies. These archived images and their
annotations are registered in the repository component, so as to make
them findable by other researchers. If, at a later stage, someone wants
to perform further image analysis on images that were analysed
previously, these images can be moved back to the active storage area
for further analysis.
The technical requirements for analysis processes are far less demanding
than those of training processes, especially with regard to the need
for constant high throughput. It is therefore conceivable that the
platform will allow researchers to access stored models remotely through
an API, in which case no images are stored locally.
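If such remote access were offered, a call against the platform could look like the hypothetical sketch below; the endpoint URL, payload format and response handling are assumptions, as no such API has been defined yet.

\begin{verbatim}
import urllib.request

# Hypothetical endpoint and payload layout; no such API is defined by the platform yet.
API_URL = "https://platform.example.org/api/v1/models/{model_id}/predict"

def analyse_remotely(model_id, image_path):
    """Send an image to a remotely hosted model and return its raw response."""
    with open(image_path, "rb") as handle:
        request = urllib.request.Request(
            API_URL.format(model_id=model_id),
            data=handle.read(),
            headers={"Content-Type": "application/octet-stream"},
            method="POST",
        )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
\end{verbatim}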
Storage of model results
Value for researchers is to be gained from access to results derived
from the models on the platform. These results might be produced by
analysis processes as described above, or by remote use of a model,
either via API access or even by running the model entirely elsewhere. The
form of these results can be manifold; besides previously mentioned
examples such as classification or the identification of regions of
interest, they can also include more generalised performance
characteristics of a model, such as the average recall and precision for
a given set of images in case of a classification experiment. Uploading
such results, in whatever format they might take, and associating them
with the models that generated them is the responsibility of the
repository component, while the physical storage of data is taken care
of by the storage component. Negotiation between the two components,
both when storing and when retrieving, is performed by the orchestration
logic. Again, all handling of these results follows the platform’s
governance policies.
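As an illustration of the kind of generalised performance characteristics mentioned above, the sketch below computes per-class precision and recall for a classification experiment and associates them with a placeholder model identifier; the record layout is an assumption.

\begin{verbatim}
from collections import Counter

def per_class_precision_recall(true_labels, predicted_labels):
    """Compute per-class precision and recall for a classification experiment (sketch)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == prediction:
            tp[truth] += 1
        else:
            fp[prediction] += 1
            fn[truth] += 1
    classes = set(true_labels) | set(predicted_labels)
    return {
        label: {
            "precision": tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0,
            "recall": tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0,
        }
        for label in classes
    }

# Hypothetical result record associating the metrics with the model that produced them.
result_record = {
    "model_id": "doi:10.xxxx/placeholder",
    "metrics": per_class_precision_recall(
        ["rosa", "quercus", "rosa"], ["rosa", "rosa", "rosa"]
    ),
}
print(result_record)
\end{verbatim}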
Component 3: The Processing
The processing component encompasses all the services and pipelines
whose focus is to run computational tasks on batches of data, either
newly arriving in the infrastructure or already existing in the system,
such as data stored in the repository and storage components (Fig. \ref{205447}).
In other words, it supports a myriad of computationally intensive tasks,
from ingesting new data to the automated extraction of information from
media, as well as
exporting new datasets or scheduling the training of new models or the
retraining of old ones.
This component requires a considerable amount of computing power to
handle all the scheduled tasks in the system, and this capacity may even
need to be elastic (i.e., following cloud principles) given the
fluctuating demand. These tasks are delegated by the orchestration logic
component, a set of services that
are responsible for handling the external requests, such as those from
users through frontend applications, or other external services using
one of the public APIs, serving as both gateway and manager to the main
internal components – repository, storage, and processing (Fig. \ref{205447}). The
biggest computational demand comes from tasks related to the periodic
creation of machine learning and deep learning models, updating existing
services or adding new ones. For these, specific hardware
capabilities such as several GPU/TPU instances may be required from time
to time.
The processing component, and the tasks and services supporting it,
should be able to scale vertically, that is, handle more tasks by adding
more RAM, more CPU cores or a better GPU to a cluster node, but
preferably also able to scale horizontally, namely by adding more nodes,
and hence able to process multiple independent tasks in parallel.
The processing component can be organised into sub-components, among
which are: (1) Data ingestion, (2) Machine learning models and analytics
services (such as image segmentation, object detection, and image
classification), (3) Analytics pipelines (processes or programming
scripts built to provide analytical services), (4) Data integration and
(5) Data export. Together, these sub-components help to deal with any
given use case, such as depositing new images and metadata, annotating
the images, and depositing trained deep learning models.
Data ingestion
Data ingestion is the process of adding new data to the system,
encompassing tasks such as crawling, parsing, validating and
transforming information to be indexed. This process covers several
types of data, including metadata, images, annotations and analytics
pipelines (which include services and models). To this end, specific
tools should handle the data coming into the infrastructure, following
different paths depending on the data’s source and type. Some
brief examples are given below.
Specimen datasets
When a new dataset, such as a metadata set, is submitted, each entry of
the set undergoes a series of tasks to parse, validate and transform the
information to facilitate a standardised entry. This may include
crawling additional data from external services like GBIF and Wikidata,
or computing metrics, validating geographic coordinates and mapping them
to locations. Additionally, this process will check for duplicate entries
based on the existing data in the infrastructure.
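As a small example of such a validation step, the sketch below checks that decimal latitude and longitude fall within the standard WGS84 bounds; the Darwin Core-style field names are an assumption about the incoming record layout.

\begin{verbatim}
def validate_coordinates(record):
    """Check that decimal latitude/longitude fall within WGS84 bounds (sketch)."""
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

# Placeholder entries from a submitted specimen dataset.
entries = [
    {"decimalLatitude": "52.37", "decimalLongitude": "4.89"},
    {"decimalLatitude": "999", "decimalLongitude": "4.89"},   # rejected
]
print([validate_coordinates(entry) for entry in entries])    # [True, False]
\end{verbatim}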
Specimen images
Following the above-mentioned logic, if a dataset contains an associated
image, the file needs to be processed before being added to the system.
This might include validating the metadata of the file, applying transformations,
and even triggering the analytics pipelines to automatically infer new
data about the image content itself.
Image annotations
One of the key features of the system will be the ability to provide
annotations for the existing images. When a set of annotations is
supplied, these need to be validated and ingested into the system in a
series of steps, depending on the data type. This includes ingesting,
validating and transforming them into standard data types and
structures, depending on the problem (e.g., classification, object
detection, natural language processing and optical character
recognition). After preprocessing, the set of annotations will be
further validated to check whether they duplicate existing annotations,
whether the attached labels are plausible, whether the tagged region
falls inside the image, and so on. This information will then be indexed and
provided by the repository component and can be included in datasets,
which will serve to improve existing inference tools and develop new
ones.
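The checks described above could, for a bounding-box annotation, be sketched as follows; the COCO-style [x, y, width, height] box format and the duplicate test are illustrative assumptions.

\begin{verbatim}
def validate_bbox_annotation(annotation, image_width, image_height, existing):
    """Check a COCO-style [x, y, width, height] box: inside the image and not a duplicate."""
    x, y, w, h = annotation["bbox"]
    inside = (0 <= x and 0 <= y and w > 0 and h > 0
              and x + w <= image_width and y + h <= image_height)
    duplicate = any(other["bbox"] == annotation["bbox"]
                    and other["category_id"] == annotation["category_id"]
                    for other in existing)
    return inside and not duplicate

existing_annotations = [{"bbox": [10, 10, 100, 50], "category_id": 2}]
new_annotation = {"bbox": [10, 10, 100, 50], "category_id": 2}   # duplicate -> rejected
print(validate_bbox_annotation(new_annotation, 4000, 6000, existing_annotations))  # False
\end{verbatim}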
Machine learning models and analytics services
The same applies to other tasks such as submitting a new image analysis
pipeline. As explained below, analytics pipelines provide services to
extract information from images and other sources. These sets of
services can be added over time and should be open, well documented and
reproducible. This means that the new pipeline, such as one to identify
the species of a specimen, would include data and metadata; machine
learning models; source code; service containers; automated workflow and
service provisioning information as code; results; and other artefacts.
Each of these must be verified and tested before being included as part
of the analytics toolset.
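What such a submitted pipeline package might declare is sketched below as a manifest; every field name and value is a placeholder rather than a required schema.

\begin{verbatim}
# Hypothetical manifest for a newly submitted analytics pipeline; the field
# names are illustrative placeholders, not a prescribed schema.
pipeline_manifest = {
    "name": "species-identification",
    "description": "Identify the species of a specimen from a sheet image",
    "models": [{"id": "doi:10.xxxx/placeholder", "framework": "tensorflow"}],
    "source_code": "https://example.org/repo/species-identification",
    "container_image": "registry.example.org/species-identification:1.0.0",
    "provisioning": "workflow.cwl",        # workflow/provisioning as code
    "test_data": ["hash:sha256:..."],      # used to verify the pipeline before inclusion
    "documentation": "README.md",
}
\end{verbatim}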
In some cases, the ingestion of one type of data will trigger other
sub-components of the system, such as the analytics pipelines, to infer
data from images, or the data integration components, to invoke other
parts of the system, such as the repository, to index the parsed
information, or the storage component, to store the image and its
derivatives if needed.
Analytics pipelines
This sub-component encompasses the set of services and functionalities
responsible for processing images or other media, to automatically infer
information that would otherwise be manually tagged, e.g., identifying a
specific trait. To this end, each service provides specific
functionality, and encompasses a sequence of instructions, from using
multiple pre-trained machine learning and deep learning models, to image
transformations or other solutions, depending on the problem at hand.
For instance, when ingesting a new dataset, for each given specimen
image, various analytics pipelines will be scheduled to run, each made
of different steps and deep learning models trained for specific tasks
(e.g., detect mercuric chloride stains, identify specific botany traits,
extract collector/label information). As a result, the output might
include the predicted species of the image, the presence of stains,
particular marks on the images, text from handwritten and printed
labels, specific traits and their values, and so on.
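Scheduling several such pipelines for a newly ingested image could be sketched as follows; the pipeline registry, its step names and the task format are hypothetical placeholders.

\begin{verbatim}
# Schematic sketch of scheduling several analytics pipelines for one specimen
# image; the pipeline registry and its steps are hypothetical placeholders.
PIPELINES = {
    "stain-detection": ["detect_mercuric_chloride_stains"],
    "trait-extraction": ["segment_leaves", "measure_traits"],
    "label-transcription": ["detect_labels", "ocr_handwriting"],
}

def schedule_pipelines(image_id, requested=None):
    """Queue the requested (or all) pipelines for a newly ingested image."""
    queued = []
    for name, steps in PIPELINES.items():
        if requested is None or name in requested:
            queued.append({"image_id": image_id, "pipeline": name, "steps": steps})
    return queued

for task in schedule_pipelines("specimen_0001"):
    print(task)
\end{verbatim}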