Tag : FastTrack
FastTrack Guide: Receiving Notifications for Model Training Results
By Jeongseok Kang
From the now-classic AlexNet to various large language models (LLMs) that are garnering a lot of attention these days, we train and evaluate various models to suit our needs. However, realistically, it's difficult for us to gauge when the training will end until we run the model multiple times and gain experience.
Backend.AI's excellent scheduling minimizes GPU idle time and allows model training to run even while we sleep. Then, what if we could receive the results of a model that finished training while we were asleep? In this article, we'll cover how to receive model training results as messages using the new feature of FastTrack and Slack.
This article is based on the Backend.AI FastTrack version 24.03.3.
Before We Start
This article does not cover how to create a Slack App and Bot. For detailed information, we recommend referring to the official documentation.
Creating a Pipeline
Let's create a pipeline for model training. A pipeline is a unit of work used in FastTrack. Each pipeline can be expressed as a collection of tasks, the minimum execution unit. Multiple tasks included in a single pipeline can have interdependencies, and they are executed sequentially according to these dependencies. Resource allocation can be set for each task, allowing flexible resource management.
When an execution command is sent to a pipeline, it is executed by replicating the exact state at that point, and this unit is called a pipeline job. Multiple pipeline jobs can be run from a single pipeline, and each pipeline job is generated from a single pipeline.
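The relationship between a pipeline and its pipeline jobs can be sketched in a few lines of Python. This is only a conceptual model, not FastTrack's implementation: the `Task`, `Pipeline`, and `PipelineJob` classes and their fields are hypothetical names chosen for illustration. The point is that running a pipeline copies its state at that moment, so later edits do not affect jobs that already exist.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Task:
    """Smallest execution unit (illustrative fields, not FastTrack's schema)."""
    name: str
    command: str
    depends_on: list[str] = field(default_factory=list)


@dataclass
class PipelineJob:
    """A frozen copy of a pipeline, taken when it was run."""
    pipeline_name: str
    tasks: dict[str, Task]


@dataclass
class Pipeline:
    """An editable collection of tasks."""
    name: str
    tasks: dict[str, Task] = field(default_factory=dict)

    def add_task(self, task: Task) -> None:
        self.tasks[task.name] = task

    def run(self) -> PipelineJob:
        # Deep-copy the current state so that later edits to the
        # pipeline cannot change a job that was already created.
        return PipelineJob(self.name, copy.deepcopy(self.tasks))


pipeline = Pipeline("slack-pipeline-0")
pipeline.add_task(Task("model-training-task", "sleep 3"))
first_job = pipeline.run()

# Editing the pipeline afterwards only affects jobs created from now on.
pipeline.add_task(
    Task("slack-alarm-task", "python notify.py", depends_on=["model-training-task"])
)
second_job = pipeline.run()

print(len(first_job.tasks), len(second_job.tasks))
```

Here one pipeline produced two jobs, and the first job still contains only the task that existed when it was started.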
Create Pipeline button
Click the Create Pipeline button ("+") at the top of the pipeline list.
Creating a Pipeline
You can specify the pipeline's name, description, location of the data store to use, environment variables to be applied commonly across the pipeline, and the method of pipeline initialization. Enter the name "slack-pipeline-0", and then click the "Create" button at the bottom to create the pipeline.
Creating Tasks
Dragging a Task
You can see that the new pipeline has been created. Now let's add some tasks. From the task template list (Task templates) at the top, drag and drop the "Custom Task" block onto the workspace below.
Entering the Task's Actions
A task details window appears on the right where you can enter the task's specifics. You can give it a name like `model-training-task` to indicate its role, and set it to use the `pytorch:1.11-py38-cuda11.3` image for model training. Since actual model training can take a long time, for this example, we'll have it execute the following simple commands:

    # Pause for 3 seconds to increase the execution time.
    sleep 3
    # Create a `result.txt` file in the pipeline-dedicated folder. Assume this is the accuracy of the trained model.
    echo "0.$RANDOM" > /pipeline/outputs/result.txt
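In a real training job, the same hand-off file would usually be written by the training script itself. The following is a minimal sketch of that step in Python, assuming the accuracy is a single float; `save_accuracy` is a hypothetical helper, and the random value is only a stand-in for an actual evaluation result. A temporary directory is used so the sketch also runs outside FastTrack; inside the task the output directory would be `/pipeline/outputs`.

```python
import random
import tempfile
from pathlib import Path


def save_accuracy(accuracy: float, output_dir: Path) -> Path:
    """Write the accuracy as plain text to result.txt and return the file path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    result_path = output_dir / "result.txt"
    result_path.write_text(f"{accuracy:.4f}\n")
    return result_path


# Stand-in for a real evaluation step; the value is random,
# just like in the shell commands above.
accuracy = random.random()

# Inside the task this would be Path("/pipeline/outputs").
demo_dir = Path(tempfile.mkdtemp())
result_file = save_accuracy(accuracy, demo_dir)
print(result_file.read_text().strip())
```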
Creating a Task (1)
Finally, enter the resource allocation for the task, and then click the "Save" button at the bottom to create the task.
Dragging Another Task
You can see that the `model-training-task` has been created in the workspace. This time, to create a task that reads the value from the `result.txt` file saved earlier and sends a Slack notification, drag another "Custom Task" block into the workspace below.
Entering the Task-level Environment Variable `SLACK_TOKEN`
For this task, set the name to `slack-alarm-task`, and enter the following script to send a notification to Slack:

    pip install slack-sdk
    python -c '
    import os
    from pathlib import Path
    from slack_sdk import WebClient

    # Values provided through environment variables.
    SLACK_BOT_TOKEN = os.environ.get("SLACK_TOKEN")
    JOB_ID = os.environ.get("BACKENDAI_PIPELINE_JOB_ID")

    def main():
        # Read the accuracy written by model-training-task and drop the trailing newline.
        result = Path("/pipeline/input1/result.txt").read_text().strip()
        client = WebClient(token=SLACK_BOT_TOKEN)
        client.chat_postMessage(
            channel="#notification",
            text="Pipeline job({}) finished with accuracy {}".format(JOB_ID, result),
        )

    if __name__ == "__main__":
        main()
    '
The code above uses two environment variables: `SLACK_TOKEN` and `BACKENDAI_PIPELINE_JOB_ID`. Environment variables in the `BACKENDAI_*` format are values automatically added by the Backend.AI and FastTrack systems, where `BACKENDAI_PIPELINE_JOB_ID` represents the unique identifier of the pipeline job in which each task is running.
The other environment variable, `SLACK_TOKEN`, is a task-level environment variable. This feature allows you to manage and change various values without modifying the code.
Creating a Task (2)
After allocating appropriate resources for the `slack-alarm-task`, click the "Save" button at the bottom to create the task.
Adding Task Dependencies
Adding Task Dependencies
Now there are two tasks (`model-training-task` and `slack-alarm-task`) in the workspace. Since `slack-alarm-task` should be executed after `model-training-task` completes, we need to add a dependency between the two tasks. Drag the mouse from the bottom of the task that should run first (`model-training-task`) to the top of the task that should run later (`slack-alarm-task`).
Running the Pipeline
Running the Pipeline (1)
You can see an arrow connecting from `model-training-task` to `slack-alarm-task`, indicating that the dependency has been added. Now, to run the pipeline, click the "Run" button in the top right.
Running the Pipeline (2)
Before running the pipeline, you can review a brief summary of it. After confirming the presence of the two tasks, click the "Run" button at the bottom.
Running the Pipeline (3)
The pipeline was successfully run, and a pipeline job was created. Click "OK" at the bottom to view the pipeline job information.
Pipeline Job
The pipeline job was created successfully. You can see that the model training (`model-training-task`) has completed, and `slack-alarm-task` is running.
Receiving Slack Notification
Slack Notification (1)
Slack Notification (2)
You can see that the pipeline job execution results have been delivered to the user via Slack. Now we can sleep soundly.
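As a variation, the notification step can be written without the `pip install slack-sdk` step by calling Slack's `chat.postMessage` Web API method directly with the Python standard library. The sketch below is an alternative under assumptions, not the script used in this article: `build_message`, `build_request`, and `notify` are hypothetical helper names, and the token shown in the demo lines is a placeholder. It also fails loudly when `SLACK_TOKEN` is missing or when Slack reports an error in its response.

```python
import json
import os
import urllib.request
from pathlib import Path

SLACK_API_URL = "https://slack.com/api/chat.postMessage"


def build_message(job_id: str, result_text: str) -> str:
    """Format the notification text, trimming the newline left by echo."""
    return "Pipeline job({}) finished with accuracy {}".format(
        job_id, result_text.strip()
    )


def build_request(token: str, channel: str, text: str) -> urllib.request.Request:
    """Prepare (but do not send) the HTTP request for chat.postMessage."""
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def notify(result_path: Path, channel: str = "#notification") -> None:
    """Read the result file and post it to Slack; raises on any failure."""
    token = os.environ.get("SLACK_TOKEN")
    if not token:
        raise RuntimeError("SLACK_TOKEN is not set for this task")
    job_id = os.environ.get("BACKENDAI_PIPELINE_JOB_ID", "unknown")
    text = build_message(job_id, result_path.read_text())
    request = build_request(token, channel, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    # Slack reports API-level failures in the JSON body.
    if not payload.get("ok"):
        raise RuntimeError(f"Slack API error: {payload.get('error')}")


# Offline demonstration: build the request without sending it.
demo_text = build_message("job-123", "0.12345\n")
demo_request = build_request("xoxb-placeholder", "#notification", demo_text)
print(demo_request.get_method(), demo_request.full_url)
print(demo_text)
```

Inside `slack-alarm-task`, the script would end with `notify(Path("/pipeline/input1/result.txt"))` instead of the demonstration lines.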
30 May 2024
23.09: September 2023 Update
By Lablup
In the second half of 2023, we released 23.09, a major release of Backend.AI. In 23.09, we've significantly enhanced the development, fine-tuning, and operational automation of generative AI. This release adds automatic scaling and load balancing of AI models based on workload, expands support for various GPUs/NPUs, and improves stability both when managing a single node and when managing clusters of 100 to 2000+ nodes. The team is working hard to squeeze every last bit out of it. Here are the main improvements since the last [23.03 July update](/posts/2023/07/31/Backend.AI-23.03-update).
Backend.AI Core & UI
- The Backend.AI Model Service feature has been officially released. You can now use Backend.AI to more efficiently prepare environments for inference services as well as for training large models such as LLMs. For more information, see the blog post "Backend.AI Model Service sneak peek".
- Added the ability to sign in to Backend.AI using OpenID single sign-on (SSO).
- If your kernel image supports it, you can enable the sudo command without a password in your compute session.
- Support for Redis Sentinel without HAProxy. To test this, we added the `--configure-ha` setting to the `install-dev.sh` file.
- Added the ability to use the RPC channel between Backend.AI Manager and Agent for authenticated and encrypted communication.
- Improved the CLI logging feature of Backend.AI Manager.
- Fixed an issue where Manager could not make an RPC connection when Backend.AI Agent was placed under a NAT environment.
- The Raft algorithm library, riteraft-py, will be renamed and developed as raftify.
- Support for the following new storage backends
- VAST Data
- KT Cloud NAS (Enterprise only)
Backend.AI FastTrack
- Improved UI for supporting various heterogeneous accelerators.
- Deleting a VFolder now uses an independent unique ID value instead of the storage name.
- Upgraded Django version to 4.2.5 and Node.js version to 20.
- Added pipeline template feature to create pipelines in a preset form.
- If a folder dedicated to a pipeline is deleted, it will be marked as disabled on the FastTrack UI.
- Improved the process of deleting pipelines.
- Added a per-task (session) accessible BACKENDAI_PIPELINE_TASK_ID environment variable.
- Actual execution time per task (session) is now displayed.
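As a small illustration of the new per-task variable, a task script could read it together with `BACKENDAI_PIPELINE_JOB_ID` to label its logs or output files. This is a sketch only: `pipeline_context` is a hypothetical helper, and the ID values passed in below are placeholders rather than the format FastTrack actually generates.

```python
import os


def pipeline_context(environ=os.environ) -> dict:
    """Collect the FastTrack identifiers visible to the current task, if any."""
    return {
        "job_id": environ.get("BACKENDAI_PIPELINE_JOB_ID"),
        "task_id": environ.get("BACKENDAI_PIPELINE_TASK_ID"),
    }


# Placeholder values for demonstration; inside a task, call
# pipeline_context() with no arguments to read the real environment.
context = pipeline_context(
    {
        "BACKENDAI_PIPELINE_JOB_ID": "job-123",
        "BACKENDAI_PIPELINE_TASK_ID": "task-456",
    }
)
print("job={job_id} task={task_id}".format(**context))
```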
Contribution Academy
During this period, junior developer mentees made the following code contributions through the 2023 Open Source Contribution Academy organized by NIPA.
- Created a button to copy an SSH/SFTP connection example to the clipboard.
- Refactored several Lit elements of the existing WebUI to React.
- Wrote various test code.
- Found and fixed environment variable and message errors that were not working properly.
Backend.AI is constantly evolving to provide a more powerful and user-friendly experience while supporting various environments in the AI ecosystem. Stay tuned for more updates!
Make your AI accessible with Backend.AI!
This post is automatically translated from Korean
26 September 2023
23.03: July 2023 Update
By Lablup
A wrap-up of the ongoing updates to Backend.AI 23.03 and 22.09. The development team is working hard to squeeze every last bit out.
Here are the most important changes in this update
- Enhanced storage manageability: Added per-user and per-project storage capacity management (quotas) with VFolder v3 architecture.
- Expanded NVIDIA compatibility: Support for CUDA v12 and NVIDIA H100 series.
- Extended hardware compatibility: Support for WARBOY accelerators from FuriosaAI.
Backend.AI Core & UI
- Supports CUDA v12 and NVIDIA H100 series.
- Supports the WARBOY accelerator, FuriosaAI's first NPU.
- Added storage capacity management function (Quota) by user and project by applying VFolder v3 architecture.
- However, it is limited to storage that supports Directory Quota.
- Fixed an error that caused multi-node cluster session creation to fail.
- Fixed an error where a compute session in the `PULLING` state was incorrectly labeled as `PREPARING`.
- Fixed an error in which the `CLONING` state was incorrectly displayed when cloning a data folder with the same name when multiple storage devices have the same folder.
- Improved the web terminal of a compute session to use zsh as the default shell if the zsh package is installed in the kernel image.
- Added the ability to know the health status of the (managed) storage proxy and event bus.
Backend.AI FastTrack
- Added the ability to set `multi-node` cluster mode by task.
- Fixed an error where environment variables set in `.env` were not applied to the frontend.
- Fixed an error recognizing out-of-date when accessing with a mobile browser.
- Added a field to show the cause message when a task-specific error occurs.
- Fixed other editor-related issues.
Backend.AI is constantly evolving to provide a more powerful and user-friendly experience while supporting various environments in the ever-changing AI ecosystem. Stay tuned for more updates!
Make your AI accessible with Backend.AI!
31 July 2023
23.03: May 2023 Update
By Lablup
A recap of the ongoing updates to Backend.AI 23.03 and 22.09. The development team is working hard to squeeze every last bit out.
Here are the most important changes in this update:
- Expanded hardware compatibility: Added idle checking for ATOM accelerators from Rebellions and support for Dell EMC storage backends.
- High-speed upload enhancements: Introduced SFTP functionality to support high-speed uploads to storage.
- Development Environment Enhancements: Enhanced the development environment by allowing sessions to be accessed in remote SSH mode from local Visual Studio Code.
- Increased manageability: Improved the user interface for administrators to make it easier to set up AI accelerators and manage resource groups.
Backend.AI Core & UI
- Added support for idle state checking of ATOM accelerators.
- Introduced SFTP functionality to support high-speed uploads directly to storage.
- Added ability to force periodic password updates based on administrator settings.
- Added an upload-only session (SYSTEM) tab.
- Added Inference type to the allowed session types.
- Added the ability to connect to a session in remote SSH mode from local Visual Studio Code.
- Added support for uploading folders from Folder Explorer.
- Improved the display of the amount of shared memory allocated when creating a session.
- Added support for Dell EMC storage backend.
- Improved the accuracy of container memory usage measurement.
- Improved the ability to run multiple agents concurrently on a single compute node.
- Added project/resource group name filter for administrators.
- Added user interface for administrators to set various AI accelerators, including GPUs, in resource presets/policies.
- Added a user interface for administrators to display the allocation and current usage of various accelerators, including GPUs.
- Added a user interface for administrators to set the visibility of resource groups.
- Provided a user interface for administrators to view the idle-checks value per session.
- Added recursion option when uploading vfolders in the CLI, and improved relative path handling.
- Added a recursive option in the CLI to terminate sessions with dependencies on specific session termination at once.
- Added a new mock-accelerator plugin for developers, replacing the old cuda-mock plugin.
- Added status and statistics checking API for internal monitoring of the storage proxy for developers.
Backend.AI FastTrack
- Improved searching for vfolders by name when adding pipeline modules.
- Added an indication to easily recognize success/failure after pipeline execution.
Backend.AI Forklift
- Bug fixes and stability improvements.
- Support for deleting build job history.
- Supports pagination of the build task list.
Backend.AI is constantly evolving to support a variety of environments in the ever-changing AI ecosystem, while providing a more robust and user-friendly experience. Stay tuned to see what's next!
Make your AI accessible with Backend.AI!
This post is automatically translated from Korean
31 May 2023
23.03: March 2023 Update
By Lablup
We're excited to announce version 23.03.0, the first major release of Backend.AI for 2023. Some features will continue to be rolled out in subsequent updates.
Specifically in this update:
- Support for the 'inference' service with a new computation session type.
- Support for 'model' management with a new storage folder type.
- Support for managing storage capacity on a per-user and per-project basis.
- Significant improvements to FastTrack's pipeline versioning and UI.
Backend.AI Core & UI (23.03)
- Added model management and inference session management capabilities.
- More advanced inference endpoint management and network routing layers will be added in subsequent updates.
- The codebase has been updated to be based on Python 3.11.
- Introduced React components to the frontend and leveraged Relay to introduce a faster and more responsive UI.
- Full support for cgroup v2 as an installation environment, starting with Ubuntu 22.04.
- Updated the vfolder structure to v3 for storage capacity management on a per-user and per-project basis.
- Kernel and sessions are now treated as separate database tables, and the state transition tracking process has been improved to work with less database load overall.
- Improved the way the agent displays the progress of the image download process when running a session.
- Improved the display of GPU usage per container in CUDA 11.7 and later environments.
- Scheduling priority can be specified by user and project within each resource group.
- Supports two-factor authentication (2FA) login based on one-time password (TOTP) to protect user accounts.
- Support for users to register their own SSH keypair for session access.
- Supports user interfaces for Graphcore IPUs and Rebellions ATOM devices.
Backend.AI Forklift (23.03)
- Added Dockerfile templates and advanced editing capabilities.
- Support for creating container images for inference.
- Extended image management capabilities to work with the Harbor registry.
Backend.AI FastTrack (23.03)
- Storage folder contents can be viewed directly from the FastTrack UI.
- Improved session state synchronisation with Core to event-based.
- You can set the maximum number of iterations for a pipeline schedule.
- If a task fails to execute, the pipeline job is automatically cancelled instead of waiting.
- Added pipeline versioning. You can track the history of changes to your pipeline's structure, and you can recall its contents at a specific point in time to continue working on it.
- You can modify pipelines in YAML format directly through the code editor.
Development and Research Framework Support
- Supports TensorFlow 2.12, PyTorch 1.13
- Support for NGC (NVIDIA GPU Cloud) TensorFlow 22.12 (tf2), NGC PyTorch 22.12, NGC Triton 22.08
- Added python-ff:23.01 image, which provides the same libraries and packages as Google Colab
In addition to what we've listed above, we've included many bug fixes and internal improvements.
Stay tuned for more to come!
This post is automatically translated from Korean
31 March 2023
Introducing FastTrack: Backend.AI MLOps Platform
By Jihyun Kang
Introducing FastTrack, the MLOps platform of Backend.AI. FastTrack allows you to organize each step of data preprocessing, training, validation, deployment, and inference into a single pipeline, and makes it easy to customize each step as you build your pipeline. In this article, we'll explain why you need an MLOps platform, how Backend.AI FastTrack came to be, and what makes it unique.
Rise of MLOps Platforms
Over the past few years, the IT industry, as well as most industries undergoing digital transformation, has been working hard to adopt AI to make meaningful predictions from scattered data and respond to rapidly changing markets. To make good use of AI in this process, organizations need to handle various stages such as model training and optimization, hardware adoption that takes data I/O into account, and model version management. The concept of MLOps (Machine Learning Operations) emerged from this. If you are unfamiliar with the concept, we recommend that you skim our 'MLOps series' before reading this article.
FastTrack: History
In 2019, we added the Backend.AI pipeline as a beta release to address the demand for DevOps pipelines. We developed and tested the ability to simplify the process of creating and managing complex pipelines, and to operate unidirectional pipelines that split into two or more paths in the middle. However, with the rise of the MLOps concept and the proliferation of various pipeline solutions such as AirFlow, MLFlow, and KubeFlow, we shifted our development direction to integrating and supporting open source pipeline tools instead of building a full-fledged pipeline feature of our own.
Meanwhile, AI development pipelines became increasingly complex, and it became clear that open-source MLOps pipeline tools were unable to meet the diverse needs of users. At this point, we decided to revive the pipeline feature of Backend.AI. While revitalizing and prototyping the Backend.AI pipeline, we changed the direction of development to an MLOps pipeline solution that works with the Backend.AI cluster but stands independently, so that we could directly address our users' requests.
With such a colorful history, Lablup's AI/MLOps solution is called 'FastTrack'. The name comes from airports and logistics, where a fast track is a lane that expedites passenger or customs clearance. FastTrack became available with Backend.AI 22.09 and is still being tested to meet our customers' standards.
FastTrack: What it is
FastTrack is a machine learning workflow platform that enables users to tailor multiple work units based on Backend.AI clusters and execute them as a Directed Acyclic Graph (DAG). Users can run sessions for each stage of the machine learning pipeline, linked through pre- and post-relationships, allowing them to integrate steps like data preprocessing, training, validation, deployment, monitoring, and optimization into a unified workflow as needed. This means users can more efficiently build and reuse models by structuring sessions into workflows and automatically scheduling them after each phase, rather than manually crafting them in a conventional Backend.AI cluster.
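To make the DAG idea concrete, here is a minimal sketch that derives a valid execution order from pre- and post-relationships using Python's standard `graphlib` module. It illustrates the general technique only and is not how FastTrack schedules sessions; the stage names come from the paragraph above, while the specific edges between them are an assumed example.

```python
from graphlib import CycleError, TopologicalSorter

# Each key lists the stages that must finish before it can start.
# The edges are an assumed example workflow.
dependencies = {
    "preprocessing": set(),
    "training": {"preprocessing"},
    "validation": {"training"},
    "optimization": {"training"},
    "deployment": {"validation", "optimization"},
}

# static_order() yields every stage after all of its prerequisites.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)

# "Acyclic" matters: a workflow whose stages wait on each other
# can never start, so a cycle is rejected with CycleError.
cycle_rejected = False
try:
    list(
        TopologicalSorter(
            {"training": {"validation"}, "validation": {"training"}}
        ).static_order()
    )
except CycleError:
    cycle_rejected = True
print("cycle rejected:", cycle_rejected)
```

Stages with no path between them, such as `validation` and `optimization` here, may appear in either order, which is what leaves a scheduler free to run them in parallel.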
FastTrack: Structure and features
FastTrack uses four terms: a workflow template is a pipeline, an executed workflow is a pipeline job, a unit of work in a workflow is a task, and a unit of work that is actually executed is a task instance. The following flowchart outlines the step-by-step progression of work within FastTrack.
Pipeline
A pipeline is a structured collection of data and tasks, represented by a Directed Acyclic Graph (DAG). In setting up an AI workflow, constructing a pipeline prompts FastTrack to create a specific folder in your Backend.AI cluster dedicated to pipelines. This setup facilitates the monitoring of training progress via artifacts. FastTrack streamlines the modification of task relationships with an intuitive drag-and-drop interface, allowing for immediate visual feedback in the form of a schematic flow and verification through a YAML file. Moreover, managing pipelines in YAML format allows for easy export, import, and sharing among users.
Pipeline Job
Within the FastTrack GUI, the progress of job units is indicated by the color of the nodes associated with each unit. As with pipelines, the information and relationships of the task instances are managed in YAML. Upon completion of all task instances, the pipeline job's status is displayed as either successful or failed.
Task
A task is the smallest unit of execution in a pipeline, and resources can be allocated to each task according to its purpose. For example, a task dedicated to model training can be given a large share of GPU resources while a preprocessing task is given few, which uses resources more efficiently. You can also specify the execution environment. Based on the images supported by the Backend.AI cluster, you can use images such as TensorFlow, PyTorch, Python 3.x, NGC TensorFlow, NGC PyTorch, etc. without a Docker build step. You can also mount virtual folders created by the Backend.AI cluster on a per-task basis as needed.
Task Instance
Task instances are physical objects created when a pipeline job is created, based on the task information that makes up the pipeline. Executing an AI workflow means that the task instances that make up the pipeline job are executed according to the specified preceding and following relationships. Task instances currently have a 1:1 correspondence with sessions in the Backend.AI cluster, equating the state of a session with the state of a task instance, but we plan to expand beyond sessions to other units of execution in the near future.
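One way to picture how a pipeline job's status follows from its task instances is the small sketch below. The state names and the aggregation rule are assumptions made for illustration; they are not FastTrack's actual states or logic.

```python
from enum import Enum


class State(Enum):
    """Hypothetical task instance / pipeline job states."""
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


def job_state(task_states: list[State]) -> State:
    """Derive a pipeline job state from the states of its task instances."""
    if not task_states:
        return State.PENDING
    # A single failed task instance marks the whole job as failed.
    if any(state is State.FAILED for state in task_states):
        return State.FAILED
    # The job succeeds only when every task instance has succeeded.
    if all(state is State.SUCCESS for state in task_states):
        return State.SUCCESS
    # Some work has started or finished, but not all of it.
    if any(state in (State.RUNNING, State.SUCCESS) for state in task_states):
        return State.RUNNING
    return State.PENDING


print(job_state([State.SUCCESS, State.RUNNING]).value)
print(job_state([State.SUCCESS, State.FAILED]).value)
print(job_state([State.SUCCESS, State.SUCCESS]).value)
```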
Wrap up
So far, we've covered MLOps with an introduction to FastTrack, the Backend.AI MLOps platform. The latest release of Backend.AI FastTrack is version 22.09 (as of November 2022). Our development plans include a range of user-friendly features, such as debugging pipelines, creating dependencies between pipelines, optimizing the usage of resources for tasks, and providing support for GitHub-based model and data repositories. True to Lablup's vision of empowering anyone to develop and use AI models from anywhere, FastTrack will simplify the process of building automated models. We look forward to your interest in our future endeavors.
29 November 2022