Engineering

May 30, 2022

MLOps Definition and Various Tools - Part 2

• Mario (Manseok) Cho

  Solution Architect / Consultant

      The birth of Kubernetes, a core element of DevOps

      Summary of the announcement on the Google blog (for a Korean translation, please refer to the Korean blog)

      "Let me be clear. Are you saying you want to create an external version of the Borg task scheduler? Borg is one of our company's most critical competitive advantages, its very existence unknown outside the company, and you want to make it public and open source?" Kubernetes was born in the summer of 2013 in Urs Hölzle's office amid such discussions. Urs, a technical lead and principal architect behind Google's most important network innovations, listened as Craig McLuckie explained the idea of building an open-source container management system, though he didn't think gaining approval would be easy. To understand why, we need to look back at Google's history.

Google secretly built the world's best network infrastructure to power services like Google Search, Gmail, and YouTube. Having built it from scratch on a limited budget in the early days, Google experimented with container technology as one way to squeeze every last bit of performance from its servers. Although Google had secured massive computing resources to provide cloud IaaS platforms to customers, VM inefficiencies meant that resource utilization was lower than expected. Google chose container technology to solve this problem, viewing it as the future of computing.

Container technology enables easy scaling and efficient resource utilization. After Docker's public debut at PyCon US 2013, Google believed it could create a superior container management system based on its years of trial and error: it had already built an internal cluster management system called Borg, and the new project was initially named "Seven of Nine"[^1] in reference to it. Borg ran hundreds of thousands of jobs, greatly improving computing efficiency and maximizing data center capabilities. Google later extended this same infrastructure to Google Cloud Platform, making it available for anyone to use according to their needs.

The Kubernetes blog provides insight into early contributors' thoughts and development goals. They chose to make Kubernetes open source to leverage two advantages: the rapid development feedback loop created when users encounter issues and report them to the developer community, and the ability for talented developers to collaborate on solving problems. When Docker announced official support for Kubernetes in October 2017, Kubernetes cemented its position as the standard platform for DevOps.

In my previous post, I defined DevOps as a method or culture for rapidly incorporating new features or improvements into a service based on user feedback while keeping the service stable. AIOps extends this definition with additional capabilities for the data and machine learning models behind AI services. Since "DevOps" covers very broad ground, including cultural aspects beyond technology, I'll keep the focus on the technical side here.

      I recommend leveraging cloud platforms or SaaS (Managed Cloud Services) to effectively implement DevOps and achieve the shortest possible development time and service scalability. I personally believe that standing on the shoulders of giants (well-built cloud platforms) to rapidly deliver customer-desired services is the path to DevOps excellence. Achieving these requirements demands the ability to appropriately combine various managed services from AWS, GCP, and Azure to design and implement service architectures suited to business needs. When dealing with large-scale computing resources where cost reduction is critical and cloud platform service constraints are numerous, utilizing Kubernetes or Backend.AI platforms independently can be a good choice. To operate DevOps/AIOps services stably while rapidly delivering new services to users, you must consider applying appropriate automation tools and services from the architecture design stage to prevent management overhead growth.

      Proper Git Branch Design and Task Unit Management

When developing software collaboratively, it is essential to review the history of source code changes and to merge multiple contributors' work cleanly. Git is currently the most widely used of the various source code management tools, and there are several established approaches to splitting work into branches and merging them back, called git branching models. Well-known models include git-flow, github-flow, and gitlab-flow; by selecting an appropriate one, you can automate much of your branch management. Regardless of which model you use, the system should automatically build and verify the code whenever branches are created or merged. If source modifications are not kept small, the likelihood of merge conflicts or regressions increases, making it difficult to achieve DevOps's goal of "reducing lead time." When a new feature or improvement is too large to merge quickly, feature toggles (feature flags) help keep git branch lifespans as short as possible, as in the sketch below.
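As a rough illustration (the function names and the flag below are hypothetical), a feature toggle controlled by an environment variable lets half-finished work be merged into the main branch while staying disabled in production:

```python
import os

def feature_enabled(name: str) -> bool:
    """Read a feature toggle from an environment variable, e.g. FEATURE_NEW_RANKING=1."""
    return os.environ.get(f"FEATURE_{name.upper()}", "0") == "1"

def legacy_ranking(user_id: str) -> list[str]:
    return ["item-a", "item-b"]          # existing, stable behavior

def new_ranking(user_id: str) -> list[str]:
    return ["item-b", "item-a"]          # new logic, merged early but off by default

def recommend(user_id: str) -> list[str]:
    # The toggle lets the new code live on the main branch without being exposed yet.
    if feature_enabled("new_ranking"):
        return new_ranking(user_id)
    return legacy_ranking(user_id)

if __name__ == "__main__":
    print(recommend("user-123"))
```

Turning the flag on only in a staging environment (FEATURE_NEW_RANKING=1) exercises the new path without keeping a long-lived feature branch alive.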

      Infrastructure as Code

      Manually configuring infrastructure through cloud web consoles or installing libraries by directly SSH-ing into cloud server instances creates problems such as:

      • Having to manually repeat the same work when creating different environments
      • Difficulty understanding current infrastructure or server configurations
• Dependence on carrying out manual steps in a specific order
      • Needing a "deployment expert" to handle library dependency complexities

      To achieve DevOps/AIOps's goal of "shortening lead time" in infrastructure management, all infrastructure configurations should fundamentally be managed as code.

Popular infrastructure-as-code tools include Terraform, CloudFormation (AWS), and Deployment Manager (GCP). Recently, Terraform has become so popular that you could almost say "managing infrastructure as code equals using Terraform." However, when keeping tfstate files in the cloud, AWS requires an S3 bucket and GCP a GCS bucket to be created in advance, so it's good to also know CloudFormation or Deployment Manager so that even the creation and deletion of these state buckets can be automated.
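As a minimal sketch of that bootstrap step (the bucket name and region are placeholders, and this uses a plain boto3 script rather than CloudFormation), the tfstate bucket can be created with versioning enabled so that state history is preserved:

```python
import boto3

REGION = "ap-northeast-2"                 # placeholder region
BUCKET = "example-terraform-state"        # placeholder name; S3 bucket names must be globally unique

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket that will hold Terraform's tfstate files.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Enable versioning so previous state files can be recovered after a bad apply.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)
```

The Terraform backend configuration can then point at this bucket, and the script (or its CloudFormation equivalent) becomes part of the repository instead of a manual console step.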

      Container-Based and Microservices

Container technology ensures independence between the software stack and the application execution environment, solving dependency problems in service deployment and providing clear advantages for large-scale operations such as rollbacks and autoscaling. Thanks to these advantages, container-based deployment is used extensively in DevOps. For container-based cloud services, AWS offers ECS and Elastic Beanstalk. Given current trends, learning Kubernetes should be considered essential for DevOps engineers. However, AIOps differs from code-centric DevOps in being data-centric: for AI training, Kubernetes usage is gradually decreasing because of the need for high-density performance optimization, while for AI service deployment, Kubernetes is widely used to gain the same deployment convenience and horizontal scaling benefits as in DevOps.
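As one small example of that deployment convenience (the deployment name and namespace are placeholders, and this assumes the official kubernetes Python client with a working kubeconfig), horizontal scaling can be driven programmatically rather than by hand:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. ~/.kube/config).
config.load_kube_config()
apps = client.AppsV1Api()

# Scale a deployment out to 5 replicas; Kubernetes handles rolling out the pods.
apps.patch_namespaced_deployment_scale(
    name="model-serving",        # placeholder deployment name
    namespace="default",
    body={"spec": {"replicas": 5}},
)
```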

      Big Data Distributed Processing

Big data distributed processing can also be operated more efficiently and cost-effectively using cloud managed services. AWS offers EMR, while GCP provides Cloud Dataproc or Cloud Dataflow. In my opinion, from the perspective of "preferably not wanting to manage clusters" (the DevOps/AIOps goal), Cloud Dataflow is overwhelmingly more convenient than Cloud Dataproc on GCP: Cloud Dataproc requires direct cluster management, while Cloud Dataflow can automatically build, scale, and delete clusters at runtime. Dataflow is a managed service for Apache Beam. On AWS, there's essentially no option besides EMR. While AWS Batch can perform parallel distributed processing, it cannot handle MapReduce-style data processing. Since most EMR usage involves Spark, Spark knowledge is necessary, and in my experience Spark has a steeper learning curve than Apache Beam.
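Since Dataflow is essentially managed Apache Beam, a small Beam pipeline gives a feel for the programming model. This word-count sketch runs locally with the DirectRunner (the input and output paths are placeholders); the same code can be submitted to Dataflow by switching the pipeline options:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs locally with the DirectRunner; pass Dataflow options to run it as a managed job.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input.txt")          # placeholder input path
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("word_counts")        # placeholder output prefix
    )
```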

      Batch Processing and Workflow Control

Batch processing can also be implemented relatively easily by combining cloud managed services. AWS typically combines Step Functions with AWS Batch, while GCP typically uses Cloud Composer. Step Functions and AWS Batch aren't too difficult, but Cloud Composer (Airflow) has a considerably steep initial learning curve, and its web UI in particular feels dated compared to recent web trends. Both GCP and AWS also provide Airflow as a managed service.
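For a feel of what an Airflow workflow looks like (the DAG id, schedule, and commands below are illustrative and assume Airflow 2.x), a minimal daily batch DAG can be as small as this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal daily batch pipeline: extract, then load, in order.
with DAG(
    dag_id="daily_batch_example",          # illustrative DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting data")
    load = BashOperator(task_id="load", bash_command="echo loading data")

    extract >> load                         # load runs only after extract succeeds
```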

      CI/CD Pipeline Construction

CI/CD pipeline construction is the most important task for DevOps/AIOps developers. Various CI/CD tools exist in the market, with AWS providing CodeBuild and CodePipeline and GCP offering Cloud Build. The recently updated GitHub Actions has significantly improved caching and makes it easy to run jobs selectively with branch filters, which has considerably improved its usability.

      Expanding from DevOps to MLOps/AIOps

While data processing and training machine learning models on that data are primarily handled by data scientists rather than by DevOps or MLOps/AIOps developers, DevOps/MLOps/AIOps developers need some understanding of data analysis and model training for smooth communication. Useful baseline knowledge includes Jupyter Notebook or JupyterLab environments, the model deployment process, and API server configuration, for example serving a trained model behind a small HTTP API as sketched below.
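As a rough sketch of that last point (the framework choice, endpoint name, and "model" are all illustrative stand-ins rather than a real trained artifact), a minimal prediction API might look like this:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

# Stand-in for loading a real trained model artifact (e.g. from object storage).
def load_model():
    return lambda features: sum(features)   # dummy "model": returns the feature sum

model = load_model()

@app.post("/predict")
def predict(req: PredictRequest):
    # In a real service this would call the trained model on the validated input.
    return {"prediction": model(req.features)}
```

Running it with an ASGI server such as uvicorn exposes the /predict endpoint, which is roughly the shape most model-serving APIs take.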

      ML Workflow Control

Controlling the machine learning workflow, from data collection and preprocessing through training and model deployment, is the most important task for MLOps/AIOps developers. There's no "one ring to rule them all" standard for ML workflow control. AWS uses Step Functions; AWS's official documentation also describes ML workflow control with Airflow, and combinations of SageMaker and Airflow are used as well. GCP controls ML workflows with Cloud Composer (Airflow) and also uses Kubeflow Pipelines on GKE. However, since Airflow has long been used for DevOps pipelines and connects easily with general deployment pipelines, practitioners tend to prefer Airflow in practice.
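Extending the earlier batch example, an ML workflow in Airflow is simply a DAG whose tasks map to pipeline stages. The stage functions below are placeholders standing in for real preprocessing, training, and deployment code (or for calls that trigger managed training services):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder stage functions; in practice these would run real data/ML code
# or kick off managed jobs (e.g. SageMaker or Vertex AI training).
def preprocess():
    print("preprocessing raw data")

def train():
    print("training the model")

def deploy():
    print("deploying the trained model")

with DAG(
    dag_id="ml_pipeline_example",            # illustrative DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    t_pre = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t_train = PythonOperator(task_id="train", python_callable=train)
    t_deploy = PythonOperator(task_id="deploy", python_callable=deploy)

    t_pre >> t_train >> t_deploy             # stages run strictly in order
```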

      AutoML

      Services that automatically handle all or part of data preprocessing, learning algorithm and hyperparameter selection, training, and evaluation are collectively called "AutoML." I consider AWS's Amazon Personalize a representative AutoML-based service. For GCP, AutoML Vision is famous, and they introduced a new AutoML service called Vertex AI last year. Based on conversations with machine learning engineers and data scientists, tasks that previously required machine learning developers (Data Scientists) are increasingly being automated by AutoML, lowering the barrier to AI model development and leading to "AI democratization." For this reason, AutoML is clearly a core technology required in MLOps/AIOps, which aims to reduce machine learning service lead time. However, since AutoML uses tremendous computing resources, from a cost perspective, we might not witness these ideals quickly in reality.
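As a very rough sketch of what using such a service looks like from code (this assumes the Vertex AI Python SDK; the project, bucket, column, and display names are placeholders, and exact parameters may differ between SDK versions), an AutoML tabular training job can be kicked off in a few lines:

```python
from google.cloud import aiplatform

# Placeholder project and region; both are assumptions for this sketch.
aiplatform.init(project="my-gcp-project", location="us-central1")

# Register a tabular dataset from a CSV in Cloud Storage (placeholder path).
dataset = aiplatform.TabularDataset.create(
    display_name="churn-dataset",
    gcs_source=["gs://my-bucket/churn.csv"],
)

# Let AutoML search over models and hyperparameters for a classification target.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl",
    optimization_prediction_type="classification",
)

model = job.run(
    dataset=dataset,
    target_column="churned",                # placeholder label column
    budget_milli_node_hours=1000,           # compute budget, a major cost driver
)
```

The budget parameter is a reminder of the cost caveat above: the convenience of AutoML is paid for in compute hours.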

      Conclusion

      This article series defined DevOps/MLOps/AIOps and introduced technologies commonly used in DevOps/AIOps along with the birth story of Kubernetes. Mastering all these technologies in a short time is truly difficult. Countless projects solving various problems continue to be released not only from cloud providers like AWS and GCP but also from the vast open-source ecosystem.

      Personally, I think a lot about how to effectively solve MLOps/AIOps construction and operation problems. We need to minimize manual work and automate processes while simultaneously reducing service instability and lead time. How can we more effectively create data-centric pipelines? These problems already exist, and we receive new ones from customers daily. At Lablup, I want to continue solving these problems by standing on the shoulders of cloud giants and the open-source ecosystem they've created and opened.

      [^1] The (unofficial) heroine from Star Trek: Voyager.
