Engineering
Definition of MLOps and Various Tools
Mario (Manseok) Cho
Solution Architect / Consultant
Jan 17, 2022
What is MLOps?
As machine learning and deep learning have become major IT trends, a wide range of research and adoption efforts are underway, and many of those research outcomes are now being applied to people's daily lives. However, 87% of machine learning projects according to The Overflow, and 90% according to Redapt, are never completed to the final service stage. The typical flow from machine learning development to service is that data scientists analyze data, and the resulting features or models are deployed to a service product to run online or batch predictions. Getting there takes more than data analysis: to keep delivering model performance to customers, you also need to manage reproducibility, release model and data versions, manage the combination of data, training/prediction code, and model versions, and monitor for data drift.
MLOps, short for Machine Learning Operations, refers to the technologies and tools for managing the entire lifecycle of machine learning, including data preprocessing, model development, deployment, and operations. The concepts of "data" and "trained models" are added to the traditional software development methodology of "DevOps," along with the overarching management of the machine learning lifecycle. To the basic tools of DevOps, CI (Continuous Integration)/CD (Continuous Delivery), MLOps adds CT (Continuous Training).
DevOps and MLOps
(Figure: The Difference Between MLOps and DevOps)
In CI (Continuous Integration), testing and validation of data schemas and models become critical elements in addition to testing code and components. CD (Continuous Delivery) covers not only the deployment of a single software package or service, but also the ML training pipeline itself. CT (Continuous Training) is a new concept unique to ML systems; its purpose is to automatically retrain models and deploy them to the system so they can keep serving inference. CE (Continuous Evaluation) is a mechanism for monitoring the accuracy of a running inference model in real time. Through CE, it is possible to detect degradation in the quality of the machine learning system, which often occurs as the dynamics of the service environment change.
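To make the CE and CT ideas above concrete, here is a minimal, self-contained sketch in Python. The function names (evaluate_on_recent_data, launch_training_pipeline), the model name, and the 0.90 threshold are illustrative assumptions rather than the API of any particular tool; the point is only that a monitoring step feeds an automated retraining trigger.

```python
# A minimal, self-contained sketch of the CE -> CT loop described above.
# evaluate_on_recent_data() and launch_training_pipeline() stand in for a real
# evaluation job and training pipeline; both names are assumptions, not a real API.
import random

ACCURACY_THRESHOLD = 0.90  # assumed service-level target for the live model


def evaluate_on_recent_data(model_name: str) -> float:
    """CE: score the serving model on freshly labeled production data.
    A random score keeps this sketch runnable without real data."""
    return random.uniform(0.80, 0.99)


def launch_training_pipeline(model_name: str, reason: str) -> None:
    """CT: trigger the automated retrain/validate/redeploy pipeline."""
    print(f"retraining {model_name}: {reason}")


def continuous_evaluation_step(model_name: str) -> None:
    accuracy = evaluate_on_recent_data(model_name)
    print(f"{model_name}: accuracy on recent data = {accuracy:.3f}")
    if accuracy < ACCURACY_THRESHOLD:
        launch_training_pipeline(model_name, reason="accuracy below threshold")


continuous_evaluation_step("churn-classifier")
```

In a real system the evaluation step would score the serving model on freshly labeled production data, and the trigger would start the kind of automated training pipeline described in the following sections.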
A World Without MLOps
Data Management: All data is stored on in-house servers, and the data used for analysis is copied to a hard disk and delivered to the development team. Versus: All data is stored in the cloud, and necessary subsets for analysis are retrieved using SQL.
Hyperparameter Tuning: Combinations of hyperparameters or data are rewritten by hand to train and evaluate ML models. Versus: Use experiment management tools to develop models, with mathematical optimization used for hyperparameter searching (a minimal example follows this list).
ML Process Analysis: For each parameter change, a researcher sequentially performs a series of ML tests—preprocessing, training, and evaluation—as described in a Jupyter notebook. Versus: The ML process is integrated and automated, allowing for rapid iteration of training and testing.
Implementing the Inference Model for Deployment: The notebook and requirements.txt used for training are handed over to an implementation manager. The manager then extracts the inference code and configures an inference web API server using tools like Flask. Versus: Deployment to the inference web API server is automated through CI/CD tools. By using a service like GCP's AI Platform Prediction to create an inference endpoint, the modeler can focus solely on creating the machine learning model.
Continuous Training: Developing a model takes months, so retraining is rare. Versus: To respond to ever-changing environments, retraining is performed regularly using the latest data.
CI/CD: Model updates are infrequent, so CI and CD are not considered. Versus: A model trained with new data is automatically subjected to precision validation, testing, and deployment.
Monitoring: No monitoring is performed to detect things like performance degradation of the model, making it vulnerable to changes in the environment's dynamics. Versus: Statistical information about model performance is collected based on live data. If performance falls below a certain level, the retraining pipeline is executed automatically.
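As a concrete example of the "with MLOps" side of the hyperparameter tuning item above, the sketch below uses Optuna, one of the hyperparameter optimization tools listed later in this article, to search a regularization parameter automatically instead of rewriting values by hand. The dataset, model, and search space are illustrative choices.

```python
# A minimal sketch of mathematical hyperparameter search with Optuna.
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)


def objective(trial: optuna.Trial) -> float:
    # Search the regularization strength on a log scale.
    c = trial.suggest_float("C", 1e-3, 1e2, log=True)
    model = LogisticRegression(C=c, max_iter=5000)
    # 3-fold cross-validated accuracy is the value Optuna maximizes.
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("best params:", study.best_params, "best accuracy:", study.best_value)
```

Optuna records every trial in the study object, so the search history can be inspected afterwards or logged to an experiment management tool.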
ML Pipeline
MLOps constructs an ML pipeline that performs various functions centered around data and models. The ML pipeline can be organized into 7 stages (a minimal code sketch of these stages follows the list):
Data ingestion
Data validation
Data processing
Model training
Model validation and tuning
Model deployment
Monitoring performance and capturing new data
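The sketch below walks through the seven stages above as plain Python functions chained together; in practice each stage would run as a separate, often containerized, pipeline step under an orchestrator. The dataset, model, and 0.9 validation threshold are illustrative assumptions.

```python
# A minimal sketch of the 7 pipeline stages as chained Python functions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def ingest_data():
    # 1. Data ingestion: a toy dataset stands in for pulling data from storage.
    return load_iris(return_X_y=True)


def validate_data(X, y):
    # 2. Data validation: minimal schema/shape check.
    assert X.shape[0] == y.shape[0], "feature/label row counts must match"
    return X, y


def process_data(X, y):
    # 3. Data processing: scale features and split train/test.
    X = StandardScaler().fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)


def train_model(X_train, y_train):
    # 4. Model training.
    return RandomForestClassifier(random_state=42).fit(X_train, y_train)


def validate_model(model, X_test, y_test, threshold=0.9):
    # 5. Model validation and tuning: block deployment below an assumed threshold.
    acc = accuracy_score(y_test, model.predict(X_test))
    assert acc >= threshold, f"accuracy {acc:.3f} below threshold"
    return acc


def deploy_model(model):
    # 6. Model deployment: a real pipeline would push to a serving endpoint.
    print("deploying", type(model).__name__)


def run_pipeline():
    X, y = validate_data(*ingest_data())
    X_train, X_test, y_train, y_test = process_data(X, y)
    model = train_model(X_train, y_train)
    acc = validate_model(model, X_test, y_test)
    deploy_model(model)
    # 7. Monitoring and capturing new data would run continuously after deployment.
    print(f"pipeline finished, test accuracy = {acc:.3f}")


run_pipeline()
```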
The ML pipeline offers various advantages. Since new model development occurs continuously alongside the maintenance of existing models, the performance of the model in use can be tracked, and the service results can be understood by version. This information can be automatically shared with data scientists, developers, and service operations teams. The components that make up the ML pipeline use container technology, allowing for easy reproduction and service scaling from the development environment to the production environment.
Machine Learning Workflow and MLOps
Google has defined the levels of automation for the stages in a machine learning project's workflow:
MLOps Level 0: Manual process, no automation included.
MLOps Level 1: ML pipeline automation.
MLOps Level 2: CI/CD pipeline automation, including ML pipelines and CI/CD/CT pipelines.
While it would be great if every task in the machine learning workflow executed perfectly, some things don't go as planned. For example:
i) There aren't enough computational resources for training, so learning or experimentation cannot proceed.
A function to quickly scale up/out the machine learning environment is needed.
For instance, yesterday I was analyzing data, so the GPU nodes sat idle. Today I want to run multiple model training sessions or experiments, but a GPU can typically run only one process at a time, so they cannot be parallelized on a single device. Shared GPU resources are also difficult to use because they require administrator approval and are not free from interference between users.
An MLOps tool is needed that allows multiple processes to use GPU resources simultaneously, prevents interference between users, and helps to quickly scale up/out the model training environment through automated resource management.
ii) It is difficult to turn a trained model into an inference model.
Even if training or an experiment yields good results, providing it as a service (ML Serving) involves multiple difficult steps: optimizing the service, converting the model to fit the serving framework, and monitoring the service.
Framework or platform support is needed to automate the series of complex processes required to apply the model to an ML service. (Reference: Using MLOps to Bring ML to Production / The Promise of MLOps)
iii) It is difficult to reproduce the results of a model training experiment.
When software developers or data scientists discover undesirable results from ML training or experiments, it is difficult to easily reproduce the training and experimentation process to make corrections.
If it's hard to reproduce problems that occur during training or experimentation, and also difficult to reproduce them at the point of ML service delivery, then guaranteeing the quality of the ML service becomes even more difficult.
To improve this, consistent version management and guaranteed reproducibility of the data, training code, hyperparameters, trained models, and experiment results are necessary (a minimal experiment-tracking sketch follows item iv below).
iv) The model had sufficient accuracy during development, but in the ML service, the processing speed is slow, or it's difficult to scale the service.
The more sophisticated a machine learning model is, the more computational resources like GPU/CPU it requires. However, there are limits to increasing computational resources for an ML service while maintaining model accuracy, and scaling this automatically presents another challenge.
To effectively provide an ML service, it is necessary to optimize by adjusting the trade-off between processing speed and accuracy, and a tool is needed that automates this process or provides some recommended values.
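Reproducibility problems like the one in item iii are usually addressed by recording every run's parameters, metrics, and model artifacts. Below is a minimal sketch using MLflow, one of the experiment management tools listed in the next section; the experiment name, dataset, and parameters are illustrative assumptions.

```python
# A minimal sketch of experiment tracking for reproducibility with MLflow.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("iris-baseline")  # assumed experiment name

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 4, "random_state": 42}

with mlflow.start_run():
    # Log the hyperparameters so the exact configuration can be reproduced later.
    mlflow.log_params(params)
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    # Log the evaluation metric and the trained model artifact itself.
    mlflow.log_metric("test_accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```

Each run then appears in the MLflow tracking UI with its parameters, metric, and stored model, so a colleague can rerun the same configuration later.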
MLOps Tools
To solve the various problems in MLOps, there are different tools for each stage, and several projects aim to provide these as a bundled platform service.
Tools for Model Serving:
Flask, FastAPI, TensorFlow Serving, TorchServe, Triton Inference Server, OpenVINO, KFServing, BentoML, AI Platform Prediction, SageMaker, etc. (a minimal serving sketch follows this tool list)
Machine Learning Platforms: Managed ML, CI/CD Pipeline Provision
AWS SageMaker, GCP AI Platform, Azure Machine Learning, Kubeflow, Backend.AI, etc.
Experiment Management Tools:
MLFlow, ClearML (formerly Trains), Comet, Neptune.ai, etc.
Pipeline Management Tools:
Apache Airflow, Digdag, Metaflow, Kedro, PipelineX, etc.
Analysis and Modeling Tools:
Jupyter, JupyterLab
Hyperparameter Optimization:
Scikit-Optimize, Optuna
Data Collection and Management:
Various tools are provided by cloud services like AWS, GCP, and Azure.
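To illustrate the model serving category above, here is a minimal sketch of an inference endpoint built with FastAPI, one of the serving tools listed. For self-containment a toy model is trained at startup; a real service would instead load a versioned model artifact produced by the training pipeline. The file name, field names, and endpoint path are illustrative assumptions.

```python
# A minimal sketch of a model-serving endpoint with FastAPI.
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

app = FastAPI()

# Illustrative stand-in for loading a deployed, versioned model artifact.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)


class IrisFeatures(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@app.post("/predict")
def predict(features: IrisFeatures) -> dict:
    row = [[features.sepal_length, features.sepal_width,
            features.petal_length, features.petal_width]]
    return {"predicted_class": int(model.predict(row)[0])}

# Run with: uvicorn serve:app --reload  (assuming this file is saved as serve.py)
```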
Conclusion
MLOps is, in short, a "set of tools for smoothly advancing machine learning projects." The tools needed to provide machine learning product services are being actively developed and released. However, because there are so many tools for each stage of MLOps, building a machine learning service by combining them requires an understanding of each tool's characteristics and scope of application, and you have to think about how to combine them for each specific machine learning project. This is difficult and takes a great deal of trial and error to get right. Starting with this article, I will organize and share the characteristics, application scope, and usage of the various tools used in MLOps.
Reference
- The DevOps Handbook : Gene Kim, Jez Humble, Patrick Debois, John Willis
- Key requirements for an MLOps foundation : Craig Wiley (Google Cloud)
- The 2021 Accelerate State of DevOps: Elite performance, productivity and scaling : Nicole Forsgren (Research and Strategy, DORA)