Build machine learning models and services
The analytics pipelines are composed of pre-trained models, as well as containerized applications and services that have been built beforehand. The most computationally intensive part of the envisioned system will be training and building these, either to add novel analytics pipelines or to update existing ones. Hence, it should be possible to schedule the execution of these heavy tasks, which include preparing the data (e.g., resizing, augmentation), configuring the environment and parameters, training the models and assessing their performance, and building, testing, and packaging the services. Moreover, if selected, the resulting services should be deployed to production as part of the existing analytics pipelines.
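As an illustration only, such a schedulable task could be described declaratively along the lines of the sketch below; the format, stage names and values are hypothetical and not tied to any particular tool.

```yaml
# training-job.yml -- hypothetical sketch of a schedulable model-building task
# (all names and values are illustrative)
schedule: "0 2 * * 0"              # e.g., run weekly, outside peak hours
stages:
  prepare_data:                    # resize and augment the annotated data
    resize: [224, 224]
    augmentation: [flip, rotate]
  train:                           # configure the environment and train the model
    environment: environment.yml
    parameters: params.yml
  evaluate:                        # assess performance on a held-out split
    metric: accuracy
    minimum: 0.80                  # only candidates above this score are kept
  package:                         # build, test and package the service
    image: analytics-service:candidate
  deploy:                          # optional promotion into the production pipelines
    when: manual-approval
```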
The system must allow the definition of the service workflow as code, from the infrastructure to model training and application packaging. This requires two parts. First, the modelling experiments must be fully documented to guarantee their reproducibility, which includes providing the data (i.e., a link to the exact dataset) and the code with the exact environment (e.g., by using conda or venv under Python, or renv in R), the pre-trained models, and all the required parameters, hyperparameters and the like, as well as controlling the randomness of such models (e.g., initialising the seed state). Such data should be indexed by the system, allowing anyone to rerun the experiment and obtain the exact same model and results.
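For instance, the indexed record of one experiment could look like the minimal sketch below, where the file name, the fields and all the values are merely illustrative assumptions:

```yaml
# experiment.yml -- illustrative record of a reproducible modelling experiment
dataset: https://data.example.org/annotated-images/v3   # link to the exact dataset version
environment: environment.yml                            # pinned conda environment (or venv/renv lockfile)
pretrained_model: models/resnet50-2023-10.pt            # exact starting weights
seed: 42                                                # fixed random seed / initial state
hyperparameters:
  learning_rate: 0.001
  batch_size: 32
  epochs: 50
```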
Secondly, the entire analytics pipeline should be documented as code, from the infrastructure to the application level. This allows for the exact replication of the build, test, packaging and deployment steps. Over the last decade, several technologies and sets of practices have emerged to attain such goals, normally linked to software development concepts such as DevOps, MLOps and GitOps. GitHub provides Actions for continuous integration and deployment, allowing the entire workflow of a software service to be automated, from building to testing and deploying, based on simple text files (YAML). In turn, Docker images and similar solutions allow services to be containerized and shared using similarly simple definitions. Going a step further, it is nowadays possible to define both the infrastructure and how services interact as code as well (e.g., with Docker Compose, or with Terraform and Kubernetes).
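As a minimal sketch, a GitHub Actions workflow covering the build, test and packaging of one such service could look as follows; the repository layout, image name and schedule are assumptions made for illustration:

```yaml
# .github/workflows/analytics-service.yml -- minimal illustrative workflow
name: build-test-package
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 3 * * 0"            # e.g., weekly rebuild/retraining trigger
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the test suite
        run: |
          pip install -r requirements.txt
          pytest tests/
      - name: Build the container image
        run: docker build -t ghcr.io/example/analytics-service:${{ github.sha }} .
```

A Docker Compose file, or Terraform and Kubernetes manifests, can complement such a workflow by describing, also as plain text under version control, how the resulting containers are wired together and deployed.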
Thus, such concepts must be exploited by the processing component, allowing the submission of novel analytics pipelines fully documented as code. As the number of annotated datasets grows over time, the system may schedule the retraining of models and associated pipelines, reporting the results and, if desired, replacing the existing analytics pipelines. Moreover, all the details, code and pre-trained models can be provided, so that anyone can reuse the models and code anywhere. Given the computational power needed, possibly requiring several GPUs for bursts of work, hybrid solutions that offload part of this work to cloud providers could be implemented as an alternative to hosting and managing GPU clusters.