Build machine learning models and services

The analytics pipelines are composed of pre-trained models and of previously built containerized applications and services. The most computationally intensive part of the envisioned system will be training and building these components, either to add novel analytics pipelines or to update existing ones. Hence, it should be possible to schedule the execution of these heavy tasks, which include preparing the data to be used (e.g., resizing, augmentation), configuring the environment and parameters, training the models and assessing their performance, and building, testing, and packaging the services. Moreover, if selected, the resulting services should be deployed to production as part of the existing analytics pipelines.
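As an illustration only, the sketch below outlines these heavy tasks as an ordered sequence of schedulable steps; the function names and the simple sequential runner are hypothetical placeholders rather than part of the envisioned system.

```python
# Illustrative sketch: the heavy tasks to be scheduled, expressed as ordered
# steps. All function and step names are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineStep:
    name: str
    run: Callable[[], None]


def prepare_data() -> None:
    """Prepare the data to be used (e.g., resizing, augmentation)."""


def train_and_evaluate() -> None:
    """Train the models with the configured parameters and assess their performance."""


def build_and_package() -> None:
    """Build, test, and package the resulting service."""


STEPS = [
    PipelineStep("prepare_data", prepare_data),
    PipelineStep("train_and_evaluate", train_and_evaluate),
    PipelineStep("build_and_package", build_and_package),
]


def run_pipeline() -> None:
    # A real scheduler would queue these steps on suitable (GPU) workers;
    # here they simply run in order for illustration.
    for step in STEPS:
        print(f"Running step: {step.name}")
        step.run()


if __name__ == "__main__":
    run_pipeline()
```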
The system must allow the definition of the service workflow as code, from the infrastructure to model training and application packaging. This comprises two parts. First, fully documenting the modelling experiments to guarantee their reproducibility, which includes providing the data (i.e., a link to the exact dataset) and the code with the exact environment (e.g., by using conda or venv under Python, or renv in R), the pre-trained models, and all the required parameters and hyperparameters, as well as controlling the randomness of such models (e.g., initialising the seed state). Such data should be indexed by the system, allowing anyone to rerun the experiment and obtain the exact same model and results.
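A minimal sketch of this first part is given below, assuming a Python experiment: the dataset link, file names, hyperparameter values, and seed are illustrative placeholders, and only the standard-library and NumPy generators are seeded (a deep learning framework would additionally require seeding its own generator).

```python
# Minimal sketch of recording an experiment so it can be rerun exactly.
# Dataset URI, file names, hyperparameters, and seed are illustrative only.
import json
import random

import numpy as np

EXPERIMENT = {
    "dataset": "https://example.org/datasets/annotated-images-v3",  # link to the exact dataset
    "environment": "environment.yml",        # conda environment pinning all dependencies
    "pretrained_model": "models/backbone-v1.pt",
    "hyperparameters": {"learning_rate": 1e-4, "batch_size": 32, "epochs": 50},
    "seed": 42,                               # controls the randomness of the run
}


def set_seed(seed: int) -> None:
    """Initialise the random number generators so the run is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    # A deep learning framework would need its own seeding too,
    # e.g. torch.manual_seed(seed) under PyTorch.


def main() -> None:
    set_seed(EXPERIMENT["seed"])
    # ... prepare data, train, and evaluate using EXPERIMENT["hyperparameters"] ...
    with open("experiment.json", "w") as fh:
        json.dump(EXPERIMENT, fh, indent=2)   # record to be indexed by the system


if __name__ == "__main__":
    main()
```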
Second, the entire analytics pipeline should be documented as code, from the infrastructure to the application level. This allows for the exact replication of the build, test, package, and deployment steps. Over the last decade, several technologies and sets of practices have emerged to attain such goals, normally linked to software development concepts such as DevOps, MLOps, and GitOps. GitHub provides Actions to support continuous integration and deployment, allowing the automation of the entire workflow of a software service, from building to testing and deploying, based on simple text files (YAML). On the other hand, Docker images and similar solutions allow services to be containerized and shared using similarly simple definitions. Going a step further, it is nowadays possible to define both the infrastructure and how services interact as code as well (e.g., with Docker Compose, or with Terraform and Kubernetes).
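Although these definitions are typically plain YAML files (an Actions workflow, a Compose file), the same build-and-run step can also be driven programmatically. The sketch below assumes the Docker SDK for Python (docker-py), a running Docker daemon, and a Dockerfile in the current directory; the image name and port are placeholders.

```python
# Sketch only: building and running a containerized service programmatically.
# Assumes the Docker SDK for Python (docker-py), a running Docker daemon, and a
# Dockerfile in the current directory; image name and port are placeholders.
import docker


def build_and_run_service() -> None:
    client = docker.from_env()

    # Build the image from the Dockerfile, equivalent to `docker build -t analytics-service:latest .`
    client.images.build(path=".", tag="analytics-service:latest")

    # Start the service, exposing the container port on the host.
    container = client.containers.run(
        "analytics-service:latest",
        detach=True,
        ports={"8080/tcp": 8080},
    )
    print(f"Service running in container {container.short_id}")


if __name__ == "__main__":
    build_and_run_service()
```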
Thus, such concepts must be exploited by the processing component, allowing the submission of novel analytics pipelines fully documented as code. As the number of annotated datasets grows over time, the system might schedule the retraining of models and associated pipelines, reporting the results and, if desired, replacing the existing analytics pipelines. Moreover, all the details, code, and pre-trained models can be provided, so anyone can reuse them anywhere. Given the computational power needed, possibly requiring several GPUs for bursts of work, hybrid solutions that offload part of this work to cloud providers could be implemented as an alternative to hosting and managing GPU clusters.
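As a hypothetical sketch of such a retraining trigger, the snippet below checks whether enough new annotations have accumulated before scheduling a retraining run; the threshold and the two helper functions are placeholders for the system's data catalogue and job queue.

```python
# Hypothetical sketch of deciding when to schedule retraining as the pool of
# annotated data grows. The threshold, the annotation count, and the
# job-submission call are all placeholders.

RETRAIN_THRESHOLD = 1_000  # new annotated samples required to trigger retraining


def count_new_annotations() -> int:
    """Placeholder: query the data catalogue for annotations added since the last training run."""
    return 0


def submit_retraining_job(pipeline: str) -> None:
    """Placeholder: enqueue the pipeline's training workflow (e.g., trigger its CI workflow)."""
    print(f"Scheduling retraining for pipeline '{pipeline}'")


def maybe_schedule_retraining(pipeline: str) -> None:
    if count_new_annotations() >= RETRAIN_THRESHOLD:
        submit_retraining_job(pipeline)


if __name__ == "__main__":
    maybe_schedule_retraining("example-analytics-pipeline")
```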