Mar 31, 2022

Releases

22.03: March 2022 update

Lablup
Lablup

Backend.AI

Backend.AI 22.03 / Enterprise R2 - March 2022 Update

Hello! We're excited to share the March 2022 improvements for Backend.AI 22.03 / Enterprise R2!

This is a comprehensive update released on a 6-month cycle. Major bug fixes and essential feature improvements will also be applied to Backend.AI 21.09 / Enterprise R2. Backend.AI 21.03 and earlier versions will no longer receive updates, except for urgent security patches for customers who are still using them.

Manager: 22.03.0
Agent: 22.03.0
Common: 22.03.0
Client SDK: 22.03.0
Storage-Proxy: 22.03.0
WebServer: 22.03.0
WebUI: 22.03.0
WSProxy: 22.03.0
Control Panel: 22.03
Core Release Notes: https://github.com/lablup/backend.ai/releases/tag/22.03.0

Enhanced MLOps Support

Official AppProxy v2 Support

Previously, when users accessed applications within containers via AppProxy, their traffic had to go through the Manager. This was not a major issue for simple web applications, but it made it difficult to scale to the heavy traffic of applications such as model serving.

While the existing method remains available, this version adds support for the newly improved AppProxy v2, which allows direct access without routing through the Manager. This enables a horizontally scalable architecture for serving container applications.

Batch Session Dependencies and Session State Webhooks

An option to specify existing batch sessions as dependencies when creating new ones has been added. This allows you to schedule tasks to run sequentially upon the successful completion of previous tasks.
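
The ordering that such dependencies imply can be sketched in a few lines of Python. This is only an illustration of the idea, not Backend.AI's scheduler or SDK: the session names, the `"success"` status string, and both helper functions are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical batch sessions; each key maps to the sessions it depends on.
dependencies = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "report": {"train", "evaluate"},
}

def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one order in which every session runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

def is_runnable(name: str, deps: dict[str, set[str]], status: dict[str, str]) -> bool:
    """A session may start only when all of its dependencies have succeeded."""
    return all(status.get(dep) == "success" for dep in deps[name])

order = execution_order(dependencies)
```

Here `is_runnable("train", dependencies, {"preprocess": "success"})` is true, while a failed or still-running `preprocess` keeps `train` waiting.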

In addition to the existing per-session event streaming API, we've added a new option to call webhooks at specified URLs whenever the scheduling or execution state of a session changes, delivering the relevant information with each call. These features can be used directly at the API and SDK level or through a separately published MLOps pipeline interface.
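
On the receiving side, a webhook endpoint could look roughly like the sketch below, which uses only the Python standard library. The payload schema is not described in these release notes, so the JSON field names `session_id` and `status`, the handler class, and the port are all assumptions made for illustration; the actual payload should be checked against the API documentation.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize_event(body: bytes) -> str:
    """Turn a JSON webhook body into a one-line log message.

    The field names "session_id" and "status" are assumptions for this sketch.
    """
    event = json.loads(body)
    return f"session {event['session_id']} is now {event['status']}"

class SessionWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read exactly the number of bytes announced by the caller.
        length = int(self.headers.get("Content-Length", 0))
        print(summarize_event(self.rfile.read(length)))
        # Acknowledge receipt with an empty 204 response.
        self.send_response(204)
        self.end_headers()

def serve(port: int = 8080) -> None:
    """Block forever, handling webhook calls on localhost."""
    HTTPServer(("127.0.0.1", port), SessionWebhookHandler).serve_forever()
```

Calling `serve()` starts the listener; the URL it exposes would then be the one registered as the webhook target.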

New Hardware Platform Support

Dell PowerScale Storage Integration

We've integrated the Dell PowerScale storage backend into Storage-Proxy, providing more detailed real-time statistics and monitoring capabilities when using that storage.

Heterogeneous Architecture Cluster Configuration Support

Runtime container images built for multiple architectures can now be registered in the registry and used. The system now recognizes the CPU architecture of the Backend.AI Agent on each compute node in the cluster and, during scheduling, selects images and allocates sessions to agents according to their architecture. Accordingly, we've opened the new cr.backend.ai/multiarch repository for deploying multi-architecture images.
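
The matching step can be pictured as a simple filter over agents. The following is a minimal sketch under stated assumptions, not the actual scheduler: the `Agent` class, the `candidate_agents` helper, and the architecture strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    agent_id: str
    architecture: str  # e.g. "x86_64" or "aarch64" (illustrative values)

def candidate_agents(image_archs: set[str], agents: list[Agent]) -> list[Agent]:
    """Keep only agents whose CPU architecture the image was built for."""
    return [agent for agent in agents if agent.architecture in image_archs]

agents = [Agent("agent-1", "x86_64"), Agent("agent-2", "aarch64")]

# An image built for a single architecture can run on matching agents only.
arm_only = candidate_agents({"aarch64"}, agents)

# A multi-architecture image can be scheduled on either agent.
multi = candidate_agents({"x86_64", "aarch64"}, agents)
```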

Stability and Performance Enhancements

Automatic Round-Down for GPU Fraction Allocation

Previously, to prevent excessive fragmentation of GPU resources, the system returned an allocation failure whenever a fractional GPU allocation would produce amounts finer than the preconfigured quantum size. (For example, if a user attempted to allocate 1.0 fGPU but the remaining resources only allowed an allocation of 0.33, 0.33, and 0.34, the request would fail with a quantum size of 0.1.)

Starting with this version, the system automatically rounds the allocation down to the quantum size, allowing for successful session creation even if slightly fewer resources are allocated than requested. (For example, under the same conditions, the system would now allocate 0.3, 0.3, and 0.3, for a total of 0.9 fGPU, and the session would be created successfully.)

When the actual allocated capacity is smaller than the requested capacity, the session information is updated to reflect the actual resource usage accurately. This improves convenience for customers who heavily utilize fractional GPU scaling features.
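
The round-down arithmetic from the example above can be reproduced with a short sketch. This is not the actual allocator code; the helper name `round_down_to_quantum` and the use of `Decimal` (to avoid binary floating-point error) are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_DOWN

def round_down_to_quantum(amount: Decimal, quantum: Decimal) -> Decimal:
    """Round `amount` down to the nearest multiple of `quantum`."""
    # Count how many whole quanta fit, then scale back up.
    steps = (amount / quantum).to_integral_value(rounding=ROUND_DOWN)
    return steps * quantum

quantum = Decimal("0.1")
available = [Decimal("0.33"), Decimal("0.33"), Decimal("0.34")]

# Each per-device share is rounded down independently: 0.3, 0.3, 0.3.
allocated = [round_down_to_quantum(share, quantum) for share in available]

# The session receives 0.9 fGPU in total instead of the requested 1.0.
total = sum(allocated, Decimal("0"))
```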

Database Usage Optimization

To address database bottlenecks when many users simultaneously request session creation and deletion, we have moved two operations from the database to Redis: session count tracking per authentication key and real-time statistics collection per container. We've also optimized the automatic resource correction queries that ran within single transactions and were causing excessive overhead.

Python 3.10 Upgrade

The Backend.AI server engine runtime environment has been upgraded to Python 3.10.

Other Improvements

Fixed conflicts with container resource constraint layers when running some applications that were built to use jemalloc installed as a system package within container images.
Disabled repetitive storage capacity scanning for container working directories to prevent excessive performance degradation in some environments.

Other User Interface and Usability Improvements

The file browser has been separated into a dedicated container managed by Storage-Proxy, rather than being part of a general compute session, which provides better file I/O performance. (BETA)
A pending timeout can now be set per resource group to automatically cancel session creation requests that are not scheduled within a specified time.
You can now specify the allowed session types for execution per resource group.
Virtual folders can now be mounted at arbitrary absolute path locations within containers.
Subdirectories of virtual folders can now be mounted under /home/work or at arbitrary absolute path locations within containers.
Session information can now be viewed more concisely in the WebUI.
The WebUI now distinguishes between batch sessions and interactive sessions.
The CLI now provides consistent JSON-formatted output for mutation operations that modify server information, not just for queries that retrieve information.
We have separated the backend.ai-client and backend.ai-client-cli packages to reduce the possibility of dependency conflicts when integrating the SDK with other applications and frameworks. (BETA)
Error details and related file paths are now communicated more effectively when Storage-Proxy errors are propagated through the Manager API.
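
As an illustration of the per-resource-group pending timeout mentioned in the list above, the cancellation decision amounts to a simple time comparison. The function name and parameters below are hypothetical, and the sketch says nothing about how the setting is actually stored or configured.

```python
from datetime import datetime, timedelta, timezone

def should_cancel(enqueued_at: datetime, now: datetime, pending_timeout: timedelta) -> bool:
    """Return True when a still-pending request has waited longer than the timeout."""
    return now - enqueued_at > pending_timeout

# A request enqueued 11 minutes ago exceeds a 10-minute pending timeout.
enqueued = datetime(2022, 3, 31, 9, 0, tzinfo=timezone.utc)
expired = should_cancel(enqueued, enqueued + timedelta(minutes=11), timedelta(minutes=10))
```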

Development and Research Framework Support

TensorFlow 2.7/2.8 Support: Support for TensorFlow 2.7/2.8 has been added. For TF 2.8, some companion components, such as TFX, are not included because compatible versions are not yet available.
PyTorch 1.10/1.11 Support: Support for PyTorch 1.10/1.11 has been added.
NGC TensorFlow/PyTorch 22.03, Triton 22.03 Support: The March 2022 versions of the NGC TensorFlow, NGC PyTorch, and NVIDIA Triton service images are now supported.
GPU-based Julia 1.7 and FluxML Support: We now support CUDA GPU acceleration-based Julia 1.7 and provide FluxML with GPU acceleration based on CUDA 11.3.
(Cloud) RStudio Support: RStudio is now supported directly on the web in Backend.AI Cloud without the need for a desktop application.
Weights & Biases Integration Support (Beta): We now provide integrated support for easily running W&B on Backend.AI. This feature is still in beta and is expected to be officially available at the end of April.
