The process of developing ML solutions consists of testing many hypotheses about the data and searching across various algorithms to improve the metrics of our models. Once model development is finished, we prepare the model for release, wrap it in code that will work in production, and integrate it into the business processes of the ecosystem where it will run.
During the whole process we work with several components:
- code (training, preprocessing, and inference)
- data (datasets for training and validation)
- artifacts (model weights and configs, metrics)
The industry standard for code storage is git. The remaining components, however, are often stored haphazardly across companies, even though there are already handy tools for data and model management, so let's talk about them in more detail.
Ecosystems around ML processes
1. DVC
DVC is a git-like version control system for data, which helps with versioning and convenient management of large volumes of data. The service allows you to store your datasets and post-training artifacts in any place that suits you (a cloud bucket, Google Drive, HDFS, etc.). The tool also provides a bunch of additional features:
- metrics tracking
- pipeline execution
- building a model deployment process with cml.dev
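To make the pipeline point concrete, DVC describes pipeline stages declaratively in a dvc.yaml file; here is a minimal sketch (the script, file names, parameter, and outputs below are hypothetical):

```yaml
stages:
  train:
    cmd: python train.py        # hypothetical training script
    deps:
      - data/train.csv          # tracked input data
      - train.py
    params:
      - train.learning_rate     # read from params.yaml
    outs:
      - models/model.pkl        # artifact stored in the DVC cache/remote
    metrics:
      - metrics.json:
          cache: false          # keep metrics in git, not the cache
```

Running `dvc repro` would then re-execute only the stages whose dependencies changed.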
CML (Continuous Machine Learning) is a tool from the DVC team for building a complete cycle of ML processes. It can already be applied to GitHub or GitLab repositories to automatically report metrics or even run full deployments. The creators propose this approach:
- CI/CD for ML - CML
- Environment management - Docker/Packer
- Infrastructure as Code - Terraform/Docker Machine
- Data as Code - DVC
We could also integrate it with Airflow to automate pipelines, and with Kubeflow and Seldon to train and deploy models; we'll talk about all of these later.
2. MLFlow
MLFlow from Databricks began as a library for logging experiments. The service suits small projects as well as large companies. MLFlow provides tools for the full Machine Learning cycle:
- Tracking (storing metrics and model configs, and relaunching pipeline experiments)
- Projects (restarting experiments in a conda or Docker environment)
- Models (running a REST server or Docker image with your model)
- Model Registry (storing models at different stages: Staging, Production, Archived)
By the way, you can use DVC together with MLFlow. MLFlow is easy to use, unlike Kubeflow, which in exchange offers greater system completeness and integration with other services.
3. Kubeflow
Kubeflow is a framework from Google with Kubernetes under the hood and support for Jupyter Notebooks. It has modules for the full life cycle of models:
- Pipelines (orchestration of ML models, experiments, storing artifacts, and metrics).
- Training (supports TensorFlow, PyTorch, MXNet, Chainer)
- HyperParameters Search (NAS and search for optimal hyperparameters).
- Serving (supports a bunch of solutions: KFServing, Seldon, NVIDIA Triton, BentoML)
- Fairing (for fast training and deployment)
- Metadata (for storing metrics and model files)
- Multitenancy (for access control)
Kubeflow has the most impressive set of services and the largest number of integrations of all these systems, which lets you build the necessary parts of your system from independent blocks. However, installing Kubeflow will be harder for a Data Scientist or ML Engineer, even though it is configured in Python, so it is better to hand this off to DevOps. This framework is an ideal solution for high-load enterprise applications, especially if you already actively use Kubernetes.
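To give a flavor of the Serving module, KFServing deploys a model from a declarative Kubernetes resource; a sketch (the service name and storage path are illustrative, not real endpoints):

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris            # illustrative service name
spec:
  predictor:
    sklearn:
      # illustrative bucket path to a trained sklearn model
      storageUri: gs://my-bucket/models/sklearn/iris
```

Applying this manifest with kubectl is enough for KFServing to spin up an autoscaled HTTP endpoint around the model.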
1. Pipelines for repetitive tasks
Most developers are familiar with repetitive tasks and automating them via cron jobs, but when working with data we often need to run tasks sequentially, or in parallel once previous ones have finished. The most popular solutions for such tasks are Airflow, Luigi, and Argo.
Let's take a look at Airflow:
- pipelines are written in Python
- MySQL or Postgres as the database
- tasks are organized into a graph (DAG)
- a large number of integrations and operators: executing bash commands and Python code, sending emails, HTTP, SQL, sensors (waiting for an event), etc.
Airflow is the most full-featured and mature tool, but you need to invest time to understand it well. If you want something simpler, we recommend Luigi; if you prefer to write pipelines in YAML and actively work with Kubernetes, Argo is for you.
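The core idea shared by these tools, tasks ordered by a dependency graph, can be sketched with the standard library's graphlib; the extract/transform/load task names are hypothetical stand-ins, and this is not the Airflow API:

```python
from graphlib import TopologicalSorter

order = []  # records the sequence in which tasks actually ran

def extract():
    order.append("extract")

def transform():
    order.append("transform")

def load():
    order.append("load")

# Edges read "task: set of upstream tasks it depends on",
# mirroring how Airflow wires tasks inside a DAG.
dag = {"transform": {"extract"}, "load": {"transform"}}

# static_order() yields tasks only after their dependencies.
for task_name in TopologicalSorter(dag).static_order():
    {"extract": extract, "transform": transform, "load": load}[task_name]()

print(order)  # -> ['extract', 'transform', 'load']
```

A real scheduler adds retries, parallel execution, and persistence on top of exactly this ordering logic.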
2. Deploy without Kubeflow: Seldon
Seldon is an open-source platform for deploying ML models to the cloud or on-premise with Kubernetes support. It consists of three parts: Seldon Core, alibi for monitoring and ML model explainability, and alibi-detect for outlier and drift detection.
Seldon Core supports the most popular ML frameworks (TensorFlow, PyTorch, sklearn, xgboost, Spark, etc.). The service wraps the model in a REST or gRPC API and lets you containerize the resulting microservice and monitor it.
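By way of illustration, Seldon Core's Python wrapper convention only requires a class exposing a predict(X, feature_names) method; a toy sketch (the class name and weights are invented, and the "model" is a trivial linear rule):

```python
class MyModel:
    """Toy model following Seldon Core's Python wrapper convention:
    a class with predict(X, feature_names) can be containerized and
    served by Seldon as a REST/gRPC microservice."""

    def __init__(self):
        # Stand-in for loading real trained weights from disk.
        self.weights = [0.5, -0.25]

    def predict(self, X, feature_names=None):
        # X is a batch of feature rows; return one score per row.
        return [
            sum(w * x for w, x in zip(self.weights, row))
            for row in X
        ]

model = MyModel()
print(model.predict([[2.0, 4.0], [1.0, 0.0]]))  # -> [0.0, 0.5]
```

Seldon's tooling builds the container image around such a class, so the data science code stays free of serving boilerplate.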
Seldon impresses with its simplicity compared to Kubeflow, and with its model- and cloud-agnostic approach.
If you are interested in other solutions, take a look at Cortex, with Go under the hood and the ability to deploy models on AWS with scaling and monitoring out of the box.
3. Worth mentioning
- h2o AutoML, AWS SageMaker - let the machines learn without ML Engineers
- Pachyderm - a data lake and git for production data pipelines
- Optuna - automatic hyperparameter search
- Neptune.ai - a tool for experiment tracking and collaboration
Machine Learning, compared to Software Engineering, is a young field that is gradually adopting good practices and working approaches. It is very important to make ML work (experiments, data storage, metrics, and models) not only impressive in presentations but also reproducible in production. We hope that the tools described above will help you set up the ML side of your work, bring even more benefit to people and companies, and generate profit.
If you still have questions about the services for building Reproducible Machine Learning, contact us or schedule a call to review your case in detail.