Engineering
Raft Consensus algorithm for Backend.AI: Leader election
By Jeongseok Kang

High availability (HA) has become an indispensable concept when talking about modern applications. High availability is the ability of an IT system to remain nearly 100% accessible and reliable at all times by eliminating or minimizing downtime^1. Backend.AI, which is developed and serviced by Lablup, also employs various methods to maintain high availability.
Backend.AI architecture
Background
Backend.AI consists of many different components, including managers and agents, storage proxies, and web servers. Each of these components runs as multiple processes in a distributed environment to increase reliability, especially the manager, which is responsible for scheduling session execution and many core functions of Backend.AI. Currently, the manager has an Active-Active HA structure that ensures high availability through load balancing.
One of the many features of the Backend.AI Manager is event handling. Backend.AI raises various events, such as `AgentStartedEvent` and `DoScheduleEvent`, to track the lifecycle of agents and sessions and to provide optimal scheduling. For example, when a Backend.AI Agent process starts, it generates an `AgentStartedEvent`, and the Backend.AI Manager process receives this event and performs a specific action (`schedule()`). The Backend.AI Manager also raises a `DoScheduleEvent` internally to ensure periodic scheduling. This is where the problem arises. If you run multiple Backend.AI Manager processes for high availability and each process raises events with its own timer, this adds unnecessary load and can make the health of the entire system unreliable. The Backend.AI Manager therefore implements a `GlobalTimer` to ensure that only one manager process generates events within the same system. The `GlobalTimer` uses distributed locks to guarantee mutual exclusivity between processes, so that only one process generates events.

```python
@preserve_termination_log
async def generate_tick(self) -> None:
    try:
        await asyncio.sleep(self.initial_delay)
        if self._stopped:
            return
        while True:
            try:
                async with self._dist_lock:
                    if self._stopped:
                        return
                    await self._event_producer.produce_event(self._event_factory())
                    if self._stopped:
                        return
                    await asyncio.sleep(self.interval)
            except asyncio.TimeoutError:  # timeout raised from etcd lock
                if self._stopped:
                    return
                log.warn("timeout raised while trying to acquire lock. retrying...")
    except asyncio.CancelledError:
        pass
```
Currently, Backend.AI provides an interface for distributed locks, [AbstractDistributedLock](https://github.com/lablup/backend.ai/blob/2f90d03c4477eda8e0beeabb7fe4b067c56dae09/src/ai/backend/common/lock.py#L33-L44), and we have developed and are using [FileLock](https://github.com/lablup/backend.ai/blob/2f90d03c4477eda8e0beeabb7fe4b067c56dae09/src/ai/backend/common/lock.py#L47-L142), [EtcdLock](https://github.com/lablup/backend.ai/blob/2f90d03c4477eda8e0beeabb7fe4b067c56dae09/src/ai/backend/common/lock.py#L145-L190) based on the [etcd concurrency API](https://etcd.io/docs/v3.5/dev-guide/api_concurrency_reference_v3/), and [RedisLock](https://github.com/lablup/backend.ai/blob/2f90d03c4477eda8e0beeabb7fe4b067c56dae09/src/ai/backend/common/lock.py#L193-L248) based on [Redis Lock](https://redis.io/docs/manual/patterns/distributed-locks/) as actual implementations.

etcd is a distributed, open-source key-value store used to store and manage critical information needed to keep distributed systems running^2, most notably in Kubernetes.

```python
class AbstractDistributedLock(metaclass=abc.ABCMeta):
    def __init__(self, *, lifetime: Optional[float] = None) -> None:
        assert lifetime is None or lifetime >= 0.0
        self._lifetime = lifetime

    @abc.abstractmethod
    async def __aenter__(self) -> Any:
        raise NotImplementedError

    @abc.abstractmethod
    async def __aexit__(self, *exc_info) -> Optional[bool]:
        raise NotImplementedError
```
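Because every implementation follows this interface, callers use them as async context managers, exactly as `GlobalTimer.generate_tick()` does above. The snippet below is only a minimal usage sketch; the `FileLock` constructor arguments shown (lock file path, timeout) are assumptions for illustration, so check the linked source for the actual signature.

```python
# Minimal usage sketch of an AbstractDistributedLock implementation.
# Constructor arguments are illustrative assumptions, not the verified API.
import asyncio
from pathlib import Path

from ai.backend.common.lock import FileLock


async def critical_section() -> None:
    lock = FileLock(Path("/tmp/global-timer.lock"), timeout=30.0)
    async with lock:
        # Only one process in the cluster runs this block at a time,
        # which is how GlobalTimer serializes event generation.
        print("acquired the distributed lock")


asyncio.run(critical_section())
```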
Requirements
The `GlobalTimer` does a good job of controlling event generation on a per-process basis in a distributed environment. However, requirements are always changing, and the software needs to change with them. This time, the added requirement was to implement a rate limit for requests. With the current load-balancing scheme, we cannot guarantee that every request is handled by the same manager, which can lead to the following problem because the state of each manager is not shared:

1. Set the counters for both managers to 0 and the request count limit to 1.
2. The first request is received by manager 1.
3. The counter on manager 1 is increased by 1. (C1: 0 -> 1)
4. The counter reaches the maximum allowed number of requests, so the next request should be rejected.
5. Manager 2 receives the second request due to load balancing.
6. The counter on manager 2 has not reached the limit because it is still 0. (C2: 0)
7. Manager 2 processes the request.
8. The request count limit did not work!
Therefore, the following issue was filed to discuss ways to improve on these limitations.
Issue suggesting improvements to distributed timers (lablup/backend.ai#415)
To delegate global state management to a single manager process represented by a leader, we investigated consensus algorithms and decided to use the Raft Consensus Algorithm (hereafter Raft). Raft is used in projects such as etcd, the key-value store behind Kubernetes (https://kubernetes.io/docs/concepts/overview/components/#etcd), so we believe it has been well validated.
Raft consensus algorithm
The Raft algorithm was proposed in "In Search of an Understandable Consensus Algorithm"^3 submitted to USENIX in 2014. It was created to improve upon Paxos^4, the leading algorithm at the time, which was difficult to understand and implement in practice due to its complex consensus process, hence the title.
But our most important goal — and most difficult challenge — was understandability.
- In Search of an Understandable Consensus Algorithm
A Raft cluster typically consists of five nodes, because this allows up to two nodes to fail while still satisfying a quorum to keep the system running. Each node in a cluster is in one of three states: leader, follower, or candidate. In general, there can be at most one leader in each cluster, with the rest of the nodes being followers.

Glossary #1

- quorum: The minimum number of nodes required to make a decision. (N/2+1; see the snippet below)
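As a quick illustration of the formula above (my own snippet, not part of the aioraft-ng code), the majority check used throughout this article can be written as:

```python
# Quorum (majority) of an N-node cluster: floor(N / 2) + 1.
def quorum(n_nodes: int) -> int:
    return n_nodes // 2 + 1


assert quorum(5) == 3  # a 5-node cluster tolerates 2 failures
assert quorum(3) == 2  # a 3-node cluster tolerates 1 failure
```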
State transition diagram of a Raft node (Source: In Search of an Understandable Consensus Algorithm)
The Raft algorithm delegates all power to an elected leader and makes the flow of logs unidirectional, which makes the overall picture easier to understand. The Raft algorithm has the following characteristics:
Glossary #2
- term: The generation of the current leader or candidate. Incremented by 1 each time a leader election begins.
- index: Refers to the location of a specific value in the log.
- commit: Indicates that a specific value from the log was applied to the state machine.
- commitIndex: The highest index known to be committed.
- Election Safety: Each term has a maximum of one leader.
- Leader Append-Only: A leader never overwrites or deletes entries in its log; it only appends new entries.
- Log Matching: If two logs have values with the same index and term, all values up to that index are the same.
- Leader Completeness: If a value is committed to the log in a particular term, it is guaranteed to be present in the logs of the leaders of all subsequent terms.
- State Machine Safety: If one server applies a log value from a particular index to its state machine, another server cannot apply a different value from the same index.
Using the above features, Raft divides the entire consensus process into three independent parts.
- Leader election: If the existing leader is not working, a new leader must be elected.
- Log replication: The leader replicates the request logs it receives from clients to other nodes. The other nodes unconditionally accept the leader's logs.
- Safety: When one server applies a log value from a particular index to its state machine, another server cannot apply a different value from the same index.
In this article, we'll discuss the different states a Raft node can be in, and implement the leader election process in code.
Follower
Followers do not send requests themselves; they only receive and respond to requests from the leader or candidates. The behavior spec for a follower proposed in the paper and the code written based on it are shown below.
- Handle RPC requests from leaders and candidates.
```python
async def on_append_entries(
    self,
    *,
    term: int,
    leader_id: RaftId,
    prev_log_index: int,
    prev_log_term: int,
    entries: Iterable[raft_pb2.Log],
    leader_commit: int,
) -> Tuple[int, bool]:
    await self._reset_timeout()
    if term < (current_term := self.current_term):
        return (current_term, False)
    await self._synchronize_term(term)
    return (self.current_term, True)

async def on_request_vote(
    self,
    *,
    term: int,
    candidate_id: RaftId,
    last_log_index: int,
    last_log_term: int,
) -> Tuple[int, bool]:
    await self._reset_timeout()
    async with self._vote_request_lock:
        if term < (current_term := self.current_term):
            return (current_term, False)
        await self._synchronize_term(term)

        async with self._vote_lock:
            if self.voted_for in [None, candidate_id]:
                self._voted_for = candidate_id
                return (self.current_term, True)
        return (self.current_term, False)

async def _synchronize_term(self, term: int) -> None:
    if term > self.current_term:
        self._current_term.set(term)
        await self._change_state(RaftState.FOLLOWER)
        async with self._vote_lock:
            self._voted_for = None
```
- If a follower receives no requests from a leader or candidate for a period of time, it becomes a candidate.
```python
async def _wait_for_election_timeout(self, interval: float = 1.0 / 30) -> None:
    while self._elapsed_time < self._election_timeout:
        await asyncio.sleep(interval)
        self._elapsed_time += interval
    await self._change_state(RaftState.CANDIDATE)
```
Leaders must periodically announce their presence by sending heartbeat messages to their followers. If a follower does not receive any messages for a certain amount of time (`election_timeout`), it assumes that the cluster is leaderless and starts an election by becoming a candidate to become the new leader.

Candidate
The candidate's behavior spec and implementation code are as follows.
- Become a follower when you receive an `AppendEntries` RPC request from the new leader (see `on_append_entries()` for followers).
- Start an election with the following procedure:
- Increase term by 1. (term += 1)
- Vote for yourself.
- Initialize the election timeout.
- Send a `RequestVote` RPC request to the other nodes.
```python
async def _start_election(self) -> None:
    self._current_term.increase()
    async with self._vote_lock:
        self._voted_for = self.id

    current_term = self.current_term

    terms, grants = zip(
        *await asyncio.gather(
            *[
                asyncio.create_task(
                    self._client.request_vote(
                        to=server,
                        term=current_term,
                        candidate_id=self.id,
                        last_log_index=0,
                        last_log_term=0,
                    ),
                )
                for server in self._configuration
            ]
        )
    )
```
- If the candidate receives votes from a majority of the nodes, it becomes the leader.
```python
for term in terms:
    if term > current_term:
        await self._synchronize_term(term)
        break
else:
    if sum(grants) + 1 >= self.quorum:
        await self._change_state(RaftState.LEADER)
```
- If the election timeout occurs, start a new election.
```python
case RaftState.CANDIDATE:
    while self.__state is RaftState.CANDIDATE:
        await self._start_election()
        await self._reset_election_timeout()
        await self._initialize_volatile_state()
        if self.has_leadership():
            await self._initialize_leader_volatile_state()
            break
        await asyncio.sleep(self.__election_timeout)
```
Leader
- Send the first heartbeat message (an empty `AppendEntries` request) immediately after the election. Send heartbeat messages periodically thereafter.
```python
async def _publish_heartbeat(self) -> None:
    if not self.has_leadership():
        return
    terms, successes = zip(
        *await asyncio.gather(
            *[
                asyncio.create_task(
                    self._client.append_entries(
                        to=server,
                        term=self.current_term,
                        leader_id=self.id,
                        prev_log_index=0,
                        prev_log_term=0,
                        entries=(),
                        leader_commit=self._commit_index,
                    ),
                )
                for server in self._configuration
            ]
        )
    )
    for term in terms:
        if term > self.current_term:
            await self._synchronize_term(term)
            break
```
- When it receives a request from a client, it adds a value to the log. After applying that value to the state machine, send a response to the request.
- If the leader's last log index is greater than or equal to the nextIndex it tracks for a follower, replicate the log entries starting at nextIndex to that follower.
- If successful, update the leader's nextIndex and matchIndex.
- If it fails due to an inconsistency, it decrements the leader's nextIndex and tries again.
- If there exists a value N that satisfies the conditions below, update commitIndex to N.
- A majority of matchIndex values are greater than or equal to N (matchIndex >= N).
- The term of the Nth log entry is the same as the current term.
The leader manages a nextIndex and a matchIndex for each follower; a sketch of how they can be updated is shown after this list.
- nextIndex: The index of the next log entry to send to each follower.
- matchIndex: The highest index known to be successfully replicated on each follower.
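The snippet below is my own illustrative sketch, not part of the aioraft-ng code shown above, of how a leader could update nextIndex, matchIndex, and commitIndex after one AppendEntries round. Names such as `followers`, `results`, `log`, and `current_term` are assumptions made for the example.

```python
# Illustrative sketch of the leader's bookkeeping after one AppendEntries round.
# `followers` is a list of follower IDs, `results` pairs each follower with
# (success, index of the last entry replicated), and `log` uses 1-based indexing.
def update_indices(followers, next_index, match_index, results, log, current_term, commit_index):
    for follower, (ok, last_replicated) in zip(followers, results):
        if ok:
            # Success: advance both indices for this follower.
            match_index[follower] = last_replicated
            next_index[follower] = last_replicated + 1
        else:
            # Inconsistency: back off and retry from an earlier entry.
            next_index[follower] = max(1, next_index[follower] - 1)

    # Advance commitIndex to the largest N replicated on a majority of nodes
    # whose entry was written in the current term.
    total_nodes = len(followers) + 1          # followers plus the leader itself
    majority = total_nodes // 2 + 1
    for n in range(len(log), commit_index, -1):
        replicated = 1 + sum(1 for f in followers if match_index[f] >= n)
        if replicated >= majority and log[n - 1].term == current_term:
            return n
    return commit_index
```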
Conclusion
In this article, we've briefly covered the Raft algorithm and written code to perform a leader election. The remaining two features (log replication, membership changes) will face a variety of challenges in actual implementation, including timing issues. If you're interested in learning more about the Raft algorithm, we recommend reading the author's (Diego Ongaro) PhD thesis (CONSENSUS: BRIDGING THEORY AND PRACTICE)^6.
Finally, let's end by checking out how ChatGPT describes the Raft algorithm.
Raft algorithm explained by ChatGPT (Source: OpenAI ChatGPT 3.5)
This article is based on the code in lablup/aioraft-ng. Please also pay attention to lablup/raftify, the next generation Raft project currently under development at Lablup.
29 November 2023
Backend.AI Model Service Hands-on: Running GPT-NeoX
By Kyujin Cho

Backend.AI version 23.09 has been officially released to the public. We covered Model Service, a key feature of version 23.09, in our previous Sneak Peek: Backend.AI Model Service preview article. Since then, we have added a variety of new features, including GUI support, authentication token history management, and more, and we are going to walk you through them in a tutorial format to make the Backend.AI Model Service easy to understand. In this tutorial post, we will show you how to use the Backend.AI Model Service to run GPT-NeoX models on top of Triton Inference Server. Triton Inference Server is an open-source model inference framework from NVIDIA that makes it easy to serve TensorRT, FasterTransformer, and TensorRT-LLM models, as well as PyTorch, TensorFlow, vLLM, and many other models, over HTTP and gRPC^1.
Create a Model VFolder
- Navigate to the Data & Folders tab. Click the "New Folder" button to open the VFolder creation dialog.
- Create a new model folder. It does not matter how you name the folder, but make sure to set the "Usage" at the bottom to "Model". Once you have specified all the values, click the "Create" button at the bottom. Your model VFolder has now been created.
FasterTransformer Format Model Conversion
- Navigate to the "Sessions" tab. Click the "Start" button to open the session creation dialog.
- Select `ngc-pytorch` for "Running Environment" and `23.07` for "Version". Once you have made your selections, click the arrow icon in the lower right corner.
- A window appears for selecting the VFolder to mount in the session. To load the model, select the VFolder you just created under the "Model storage folder to mount" section. Once you have made your selections, click the arrow icon in the lower right corner.
- A window appears for specifying the amount of resources to be used by the session. You should allocate at least 16 CPU cores and 128 GB of RAM to ensure smooth model conversion. Once you have made your selections, click the arrow icon in the lower right corner.
- After confirming that all settings have been applied correctly, click the "Start" button below to start the session.
- Once the session is created, a popup will appear to select an app, as shown below. Click the "Console" app to access the terminal environment.
- Run the following shell script to download the GPT-NeoX 20B model and convert it to the FasterTransformer format. Note that where the script mentions `<VFolder name>`, you must replace it with the name of the model VFolder you created.

```shell
cd /home/work/<VFolder name>
pip install -U transformers bitsandbytes
git clone https://github.com/NVIDIA/FasterTransformer
git clone https://huggingface.co/EleutherAI/gpt-neox-20b
cd gpt-neox-20b
git lfs install
git lfs pull
```
The GPT-NeoX 20B model requires at least 40 GB of VRAM to run. If the physical GPUs you are using have less VRAM than this and you need to split the model across multiple GPUs, adjust the number in the `-i_g` parameter to match the number of GPUs you are using.

```shell
cd /home/work/<VFolder name>
mkdir -p triton-deploy/gpt-neox-20b-ft
python ~/<VFolder name>/FasterTransformer/examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py \
  -i /home/work/<VFolder name>/gpt-neox-20b \
  -o /home/work/<VFolder name>/triton-deploy/gpt-neox-20b-ft \
  -i_g 1 \
  -m_n GPT-NeoX-20B
```
- If you followed all the steps up to step 7, you should have the following folders under the VFolder.
```shell
work@main1[PRRLCIqu-session]:~/GPT-NeoX-Triton-FT$ ls -al
total 62
drwxr-xr-x  5 work work 11776 Oct 12 12:14 .
drwxr-xr-x  9 work work  4096 Oct 12 12:29 ..
drwxr-xr-x 14 work work 12800 Oct 12 11:24 FasterTransformer
drwxr-xr-x  3 work work 16896 Oct 12 10:18 gpt-neox-20b
drwxr-xr-x  3 work work 11776 Oct 12 11:56 triton-deploy
```
Now it's time to add the configuration file for Triton Inference Server. Create the file `triton-deploy/gpt-neox-20b-ft/config.pbtxt` and add the following contents.

If you set the value of the `-i_g` parameter to anything other than 1 in step 7, you must modify the value of `tensor_para_size` in the settings below to match the value of `-i_g`.
```
name: "gpt-neox-20b-ft"
backend: "fastertransformer"
default_model_filename: "gpt-neox-20b-ft"
max_batch_size: 1024

model_transaction_policy {
  decoupled: False
}

input [
  { name: "input_ids" data_type: TYPE_UINT32 dims: [ -1 ] },
  { name: "start_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "end_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "input_lengths" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } },
  { name: "request_output_len" data_type: TYPE_UINT32 dims: [ -1 ] },
  { name: "runtime_top_k" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "beam_search_diversity_rate" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "is_return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "beam_width" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "bad_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true },
  { name: "stop_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true },
  { name: "prompt_learning_task_name_ids" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "top_p_decay" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "top_p_min" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "top_p_reset_ids" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }
]
output [
  { name: "output_ids" data_type: TYPE_UINT32 dims: [ -1, -1 ] },
  { name: "sequence_length" data_type: TYPE_UINT32 dims: [ -1 ] },
  { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] },
  { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] }
]
instance_group [
  { count: 1 kind: KIND_CPU }
]
parameters { key: "tensor_para_size" value: { string_value: "1" } }
parameters { key: "pipeline_para_size" value: { string_value: "1" } }
parameters { key: "data_type" value: { string_value: "fp16" } }
parameters { key: "model_type" value: { string_value: "GPT-NeoX" } }
parameters { key: "model_checkpoint_path" value: { string_value: "/models/triton-deploy/gpt-neox-20b-ft/1-gpu" } }
parameters { key: "enable_custom_all_reduce" value: { string_value: "0" } }
```
- Finally, you need to add the Backend.AI Model Service definition file to the root of the VFolder, as `model-definition.yaml` (`model-definition.yml` is also acceptable). Let's take a closer look at the model definition file for running Triton Inference Server.
```yaml
models:
  - name: "GPT-NeoX"
    model_path: "/models/triton-deploy"
...
```
This is where you specify the model name and the path to the model.
The name and path you set here can be accessed by the model server process through the `BACKEND_MODEL_NAME` and `BACKEND_MODEL_PATH` environment variables, respectively.

```yaml
...
    service:
      start_command:
        - tritonserver
        - --model-repository=/models/triton-deploy
        - --disable-auto-complete-config
        - --log-verbose
        - "1"
...
```
This is the part that defines the command line syntax for starting the Model Server process.
```yaml
...
      port: 8000
...
```
This is where you fill in the port for API communication that the model server process exposes. If not specified, Triton Inference Server exposes port `8000` for HTTP API communication by default, so you will also write that port in the model definition file.

```yaml
...
      health_check:
        path: /v2/health/ready
        max_retries: 3
        max_wait_time: 5
        expected_status_code: 200
```
This is where you enable and set up the Health Check feature. If the Health Check feature is enabled, Backend.AI will continuously send HTTP GET requests to the path to verify that it returns an HTTP response code corresponding to `expected_status_code` (can be omitted; defaults to `200`). If the model server does not respond, or returns an undefined response code, Backend.AI determines that the session is unhealthy and excludes it from the service. When a session is excluded from the service, it is not automatically terminated; the Model Service administrator must manually take the appropriate action by checking container logs, etc.

The Health Check feature can be disabled by omitting the block entirely. If you do this, Backend.AI will not check the health of the model server and will always assume it is in a healthy state.

`max_wait_time` defines the API response timeout and must be a number in seconds. `max_retries` is the number of times the request is retried before the model server is judged to be unhealthy.
The finished model definition file looks like this.

```yaml
models:
  - name: "GPT-NeoX"
    model_path: "/models/triton-deploy"
    service:
      start_command:
        - tritonserver
        - --model-repository=/models/triton-deploy
        - --disable-auto-complete-config
        - --log-verbose
        - "1"
      port: 8000
      health_check:
        path: /v2/health/ready
        max_retries: 3
        max_wait_time: 5
```
More information about model definition files can be found in the Backend.AI WebUI documentation.
Now you're all set to run the Model Service.
Create a Model Service
- Navigate to the "Model Serving" tab. Click the "Start Service" button to open the Create Model Service window.
Let's take a look at each section in a little more detail.
Service name: This is where you specify the name of the Model Service. The name of the Model Service can be used as a subdomain of the Model Service Endpoint (coming soon).
Resource Group: This is the field to select the resource group where the Inference Session for the Model Service will be created.
Open your app to the outside world: When this feature is enabled, all API requests to the model server must be accompanied by an authentication header before they can be made. For more information about Model Service authentication, see the Backend.AI WebUI documentation.
Desired number of routes: A field to specify the number of inference sessions the Model Server process runs in. Setting this value to a number greater than 1 creates multiple identical sessions and enables the round-robin load balancer feature, which distributes API requests evenly among these sessions. This value can be modified at any time after Model Service creation.
A panel that specifies the amount of resources for the inference session.
The GPT-NeoX 20B model requires a minimum of 40 GB of vRAM to run. The relationship between fGPU units and vRAM in Backend.AI may apply differently depending on the settings of your Backend.AI. Consult with the administrator of your Backend.AI for more information. If you have set all the values correctly, press the "OK" button to create the Model Service.
- The Model Service has been created. If the model process in the inference session is not yet ready, the status will remain "PROVISIONING".
Click on the "INFERENCE" section of the "Sessions" tab and you'll see that an inference session has been created corresponding to the Model Service you created in 1.
Model Service administrators can click the clipboard icon in the "Control" row to view logs related to the model server processes in an inference session.
- When the Model Server process is running normally, the status of the route at the bottom and the status at the top will both change to "HEALTHY", and the address to access the Model Service will appear under "Service Endpoints".
You can now access the Triton Inference Server that ran the inference session through that address.
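For a quick smoke test, you can hit Triton's health endpoint (the same path configured under `health_check` above) through the Service Endpoint. The address below is a placeholder; substitute the value shown under "Service Endpoints", and note that if you did not enable "Open your app to the outside world", requests must also carry the authentication token described in the documentation.

```shell
# Hypothetical endpoint address; replace it with the "Service Endpoints" value from the WebUI.
curl -s -o /dev/null -w "%{http_code}\n" https://<your-service-endpoint>/v2/health/ready
# 200 means the Triton Inference Server behind the Model Service is ready.
```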
Conclusion
In this article, you've learned how to start serving LLM models using the Backend.AI Model Service. The Model Service feature is available in Backend.AI's Cloud Beta. Start serving your own models today!
1: Not supported by Backend.AI Model Service
This post is automatically translated from Korean
21 November 2023
- Navigate to the Data & Folders tab. Click the "New Folder" button to open the VFolder creation dialog.
High sky and plump horses, and Container Dieting
By Mario (Manseok) Cho

Introduction
Most Linux distributions, such as Ubuntu, RedHat, and CentOS, use glibc as the system's standard C library. When you install a library package, such as OpenSSL, with apt on Ubuntu or rpm (yum) on the RedHat line, it is dynamically linked with glibc by default.
GNU is an operating system project that includes a wide range of computer software. GNU is open source, developed and maintained by the Free Software Foundation (FSF). Examples of software created by GNU include compilers and development tools such as GCC, G++, and Make. GNU uses glibc as its standard C library, and glibc is licensed under the GNU Lesser General Public License.
musl is a Linux standard C library distributed under the MIT license. Its developer is Rich Felker, and while glibc uses dynamic linking, musl aims to implement a standard C library that conforms to POSIX standards using static linking. It also implements non-standard features of Linux, BSD, and glibc.
Differences between glibc and musl in the Linux environment
When you install a package on Linux, it uses glibc by default. If you've ever built a C/C++ program with gcc, you've most likely done a glibc-based, dynamically linked build. However, in addition to this common glibc dynamic build, you can also do a musl-based dynamic or static build.
There are the following differences between `*-linux-gnu` and `*-linux-musl`:

| Build target | Standard C library | Linking method |
|--------------|--------------------|----------------|
| `*-linux-gnu` | glibc | dynamic linking |
| `*-linux-musl` | musl | dynamic/static linking |

Consider the case of building an executable with Rust. When you install Rust on a Linux environment using rustup, `*-linux-gnu` is selected as the default target.

If you don't specify any other options, Rust builds the binary for the `*-linux-gnu` target and dynamically links it with glibc. To run a binary built this way, glibc must be installed in the Linux environment where it runs. If the binary also relies on external libraries such as OpenSSL (dynamically linked), you will need to install those libraries via a package manager such as apt as well. If you want regular users to run such dynamically linked binaries, you can bundle them into a package like a DEB or RPM that describes the dependencies on external libraries, and the package manager will then automatically find and install the appropriate dependent libraries. However, if you use a library that isn't registered with the package manager, or if there are subtle compatibility differences between the installed version and the version you developed against, there's a chance that the binary you build won't run as intended.

If you specify the `*-linux-musl` target, Rust statically links with musl when building the binary. If you rely on external libraries like OpenSSL, those are statically linked as well, so all of them end up embedded in a single binary file. This static binary can run on any Linux environment, as long as it matches the CPU architecture and the set of system calls provided by the Linux kernel. This makes it easier to distribute binaries because you only need to hand over a single binary, rather than a package like a DEB or RPM.

If this makes deploying binaries so easy, why isn't the `*-linux-musl` target the default for Linux environments?

The reason is that using musl makes build preparation somewhat more complicated. If a binary package created by a developer targets `*-linux-musl` and also relies on external libraries, those external libraries must also be statically linked with musl instead of dynamically linked with glibc. This means that all dependent libraries, as well as the main body of the program, must be built from source as static links using the musl toolchain.

Fortunately, you don't have to build everything from scratch for commonly used external libraries in Rust. By utilizing a Docker image that bundles frequently used libraries with the Rust compiler and gcc, you can easily create a musl-based static build. (In the command examples that follow, the `<distro>#` prompt is used to distinguish the container environment for each Linux distribution.)

```shell
$ docker run -it --name ubuntu ubuntu:22.04 bash
ubuntu# apt update && apt install -y curl gcc vim
```
Let's configure a dynamic link, glibc, and a static link, musl, in the Rust language environment, which is commonly used for development. First, install Rust on your Ubuntu environment.
```shell
ubuntu# curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
ubuntu# source $HOME/.cargo/env
```
Let's compare dynamic and static linking using Rust's default example, "Hello World" output.
First, let's build "Hello World" using glibc.
```shell
ubuntu# cd
ubuntu# cargo new --bin hello && cd $_
     Created binary (application) `hello` package
ubuntu# cargo build --release
   Compiling hello v0.1.0 (/root/hello)
    Finished release [optimized] target(s) in 0.35s
```
Let's use the `ldd` command to verify that the binary is dynamically linked in the glibc environment. We can see that `linux-vdso`, `libgcc_s`, `libc`, etc. are dynamically linked.

```shell
ubuntu# ldd target/release/hello
        linux-vdso.so.1 (0x00007fffe87df000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fdce9c3f000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdce9a17000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fdce9cc2000)
```
Now let's change the Rust target configuration to use a musl static link.
```shell
ubuntu# rustup target add x86_64-unknown-linux-musl
info: downloading component 'rust-std' for 'x86_64-unknown-linux-musl'
info: installing component 'rust-std' for 'x86_64-unknown-linux-musl'
 34.7 MiB / 34.7 MiB (100%) 8.6 MiB/s in 4s ETA: 0s

ubuntu# rustup show
Default host: x86_64-unknown-linux-gnu
rustup home:  /root/.rustup

installed targets for active toolchain
--------------------------------------
x86_64-unknown-linux-gnu
x86_64-unknown-linux-musl

active toolchain
----------------
stable-x86_64-unknown-linux-gnu (default)
rustc 1.72.0 (5680fa18f 2023-08-23)

ubuntu#
```
Let's build "Hello World" to verify that static links are configured correctly.
```shell
ubuntu# cargo build --release --target=x86_64-unknown-linux-musl
   Compiling hello v0.1.0 (/root/hello)
    Finished release [optimized] target(s) in 0.37s

ubuntu# ldd target/x86_64-unknown-linux-musl/release/hello
        statically linked
```
You can see that "Hello World" is configured as a static link using the musl environment.
Now let's run "Hello World" built with both dynamic and static links by copying the binaries on CentOS and Alpine environments. CentOS 8 uses glibc dynamic linking and Alpine Linux uses musl static linking.
CentOS Container Environment
```shell
$ docker run -it --name centos centos:centos8 bash
centos#
```
Alpine Container Environment
The Alpine distribution uses musl by default rather than glibc.
```shell
$ docker run -it --rm alpine:3.18
alpine#
```
Let's copy 'Hello World' into a glibc environment and a musl environment to see the behavior.
```shell
$ docker cp ubuntu:/root/hello/target/x86_64-unknown-linux-musl/release/hello .
$ docker cp hello centos:/root/
$ docker cp hello alpine:/root/
```
Let's check the behavior on CentOS.
```shell
centos# ./hello
Hello, world!
```
Let's check the behavior on Alpine.
```shell
alpine# ./hello
Hello, world!
```
Comparing glibc and musl using the Rust application 'slice'
Let's take the Rust application 'slice' and compare the container images created with glibc and musl.
The Rust implementation of 'slice', like Python's 'slice', is publicly available on the GitHub repository https://github.com/ChanTsune/slice. 'slice' is a tool that prints the contents of a file from the front or back, like 'head' or 'tail'. For example, the command below will print lines 10 through 20 from 'file.txt'.
$ slice 10:20 file.txt
When you build 'slice' in a Rust environment and create a container to use it, you can use it like this
$ docker run -i --rm -v `pwd`:`pwd` -w `pwd` slice
Let's build a container using glibc in the Ubuntu 22.04 environment.
```dockerfile
FROM rust:latest as builder

WORKDIR /work
RUN git clone https://github.com/ChanTsune/slice /work/.
RUN cargo build --release
RUN strip /work/target/release/slice -o /slice

FROM ubuntu:22.04
COPY --from=builder /slice /usr/local/bin/
ENTRYPOINT ["slice"]
```
This time, we'll create a container image based on Ubuntu 22.04 using musl static links.
```dockerfile
FROM rust:latest as builder

RUN rustup target add "$(uname -m)"-unknown-linux-musl
WORKDIR /work
RUN git clone https://github.com/ChanTsune/slice /work/.
RUN cargo build --release --target "$(uname -m)"-unknown-linux-musl
RUN strip /work/target/"$(uname -m)"-unknown-linux-musl/release/slice -o /slice

FROM ubuntu:22.04
COPY --from=builder /slice /usr/local/bin/
ENTRYPOINT ["slice"]
```
Let's create a container image based on the Alpine distribution using a musl static link.
```dockerfile
FROM rust:latest as builder

RUN rustup target add "$(uname -m)"-unknown-linux-musl
WORKDIR /work
RUN git clone https://github.com/ChanTsune/slice /work/.
RUN cargo build --release --target "$(uname -m)"-unknown-linux-musl
RUN strip /work/target/"$(uname -m)"-unknown-linux-musl/release/slice -o /slice

FROM alpine
COPY --from=builder /slice /
ENTRYPOINT ["slice"]
```
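The image listing below also includes a `distroless-musl` tag whose Dockerfile is not shown in this article. The following is only a plausible sketch of how such an image could be built, assuming Google's `gcr.io/distroless/static-debian12` base image:

```dockerfile
# Hypothetical sketch of the distroless-musl variant (not from the original article).
FROM rust:latest as builder

RUN rustup target add "$(uname -m)"-unknown-linux-musl
WORKDIR /work
RUN git clone https://github.com/ChanTsune/slice /work/.
RUN cargo build --release --target "$(uname -m)"-unknown-linux-musl
RUN strip /work/target/"$(uname -m)"-unknown-linux-musl/release/slice -o /slice

# distroless/static ships no shell or package manager, only CA certificates and
# timezone data, so it suits a fully statically linked binary.
FROM gcr.io/distroless/static-debian12
COPY --from=builder /slice /
ENTRYPOINT ["/slice"]
```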
If we compare the size of a glibc container image and a musl container image on Ubuntu 22.04 and a musl container image on Alpine, we can see that the container image with musl is smaller.
```shell
$ docker images
REPOSITORY   TAG                 IMAGE ID       CREATED              SIZE
slice        distroless-musl     d38a74f8568a   11 seconds ago       3.52MB
slice        alpine-musl         e3abb5f0aace   39 seconds ago       8.4MB
slice        ubuntu22.04-musl    467edd130e79   About a minute ago   78.9MB
slice        ubuntu22.04-glibc   09fe5ad40d56   3 minutes ago        78.8MB
```
In the Ubuntu environment, using glibc or musl doesn't make much difference in the size of the container image, but in the Alpine distribution, you can see that the container image size is reduced by about a tenth. This shows that by utilizing Alpine Linux with static builds, we can make our container images lightweight and reduce deployment time.
Conclusion
Using static links in programs that use the standard C library can simplify the process of deploying Linux binaries. It also reduces the size of the container image compared to dynamic linking and makes deployment convenient regardless of the distribution. When you replace glibc with musl, you benefit not only from the difference in container image size, but also from features newly supported by musl, such as mDNS (a multicast-DNS-based zero-config system) and NUMA cluster support. Furthermore, if you use distroless, a set of minimal base images distributed by Google, as your default container image together with musl, you can deploy and take advantage of even smaller container images.
This post is automatically translated from Korean
20 September 2023
Digging bitsandbytes issue
By Jeongseok Kang

Backend.AI is a popular choice for developing large language models (LLMs) because of its ease of use in running large clusters and distributed processing. In fact, we get a lot of feedback and requests from customers, and today I'd like to share how we solved one of them.
On April 4, 2023, we received a report of an issue where an error occurs when running certain packages in the container environment provided by the NGC Catalog[^1] (NVIDIA GPU Cloud). The NGC Catalog is a list of containers[^2] with optimized environments for developing AI/ML, metaverse, and high-performance computing applications, and because it is operated and distributed directly by NVIDIA, it is highly trusted and considered the standard for CUDA environments in particular. Therefore, an issue with this environment represents a potential risk that many users will face in the future, and we have decided to address this issue as a high priority.
Reproducing the problem
I first went through the process of reproducing the issue to determine the exact cause. In this case, I was running ViperGPT[^3], developed by Columbia University, and encountered an error in a package called `bitsandbytes`. ViperGPT has a dependency on `bitsandbytes` as shown below.

```
accelerate==0.18.0
backoff==2.2.1
// highlight-next-line
bitsandbytes==0.38.1
cityscapesscripts==2.2.1
git+https://github.com/openai/CLIP.git
decord==0.6.0
dill==0.3.6
...
```
I was able to reproduce the problem by simply importing `bitsandbytes`. The execution environment used the nvcr.io/nvidia/pytorch:22.05-py3 image.

```shell
$ pip install bitsandbytes  # 0.37.1
$ python
>> import bitsandbytes
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA exception! Error code: OS call failed or operation not supported on this OS
CUDA exception! Error code: initialization error
CUDA SETUP: CUDA runtime path found: /home/work/data/miniconda3/envs/vipergpt/lib/libcudart.so
/home/work/data/miniconda3/envs/vipergpt/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /home/work/data/miniconda3/envs/vipergpt/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/work/data/miniconda3/envs/vipergpt/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
```
`bitsandbytes` traverses all the CUDA devices installed in the execution environment and checks their compute capability[^4]. It checks the number of CUDA devices installed in the execution environment using `libcuda.so`, as shown below. We noticed that an error occurs when `cuDeviceGetCount()`[^5] is called: error 304, CUDA_ERROR_OPERATING_SYSTEM.

```python
def get_compute_capabilities(cuda):
    """
    1. find libcuda.so library (GPU driver) (/usr/lib)
       init_device -> init variables -> call function by reference
    2. call extern C function to determine CC
       (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE__DEPRECATED.html)
    3. Check for CUDA errors
       https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
    # bits taken from https://gist.github.com/f0k/63a664160d016a491b2cbea15913d549
    """

    nGpus = ct.c_int()
    cc_major = ct.c_int()
    cc_minor = ct.c_int()

    device = ct.c_int()

    # highlight-next-line
    check_cuda_result(cuda, cuda.cuDeviceGetCount(ct.byref(nGpus)))
    ccs = []
    for i in range(nGpus.value):
        check_cuda_result(cuda, cuda.cuDeviceGet(ct.byref(device), i))
        ref_major = ct.byref(cc_major)
        ref_minor = ct.byref(cc_minor)
        # 2. call extern C function to determine CC
        check_cuda_result(cuda, cuda.cuDeviceComputeCapability(ref_major, ref_minor, device))
        ccs.append(f"{cc_major.value}.{cc_minor.value}")

    return ccs
```
What is bitsandbytes?
Since the advent of Transformer, language models have shown high performance gains, and it has become a trend to increase the size of the model by stacking more Transformer blocks. This has led to a large number of GPU resources being required not only to train the model but also to service it. For example, to service GPT-3 with 175B parameters, eight 80GB A100 GPUs costing about $15,000 are required. This is a huge burden not only for individuals, but also for enterprises or research institutes, which is why there is a lot of research on lightweighting inference models for servicing.
bitsandbytes is the open-source release of LLM.int8()[^6], work by Tim Dettmers, a PhD candidate at the University of Washington, together with Facebook AI Research (now Meta AI). It has been shown to reduce the size of a model while maintaining performance by applying a vector-wise quantization method that treats each vector independently when computing matrix products, and by using a mix of 8-bit and 16-bit techniques that minimizes losses by representing important vectors in 16-bit. It has been merged into Hugging Face's Transformers implementation and is used in a variety of models including [Llama2](https://github.com/facebookresearch/llama-recipes/blob/cd82118b74d2fd739bd6227af33b661d04a97406/requirements.txt#L6), [QLoRA](https://github.com/artidoro/qlora/blob/6c6fc4653abd17ce550f48878a24c7bd8772e98a/requirements.txt#L1), [KoAlpaca](https://github.com/Beomi/KoAlpaca/blob/4596f882957d286b4d60559b97dcf783822d23f5/webui/requirements.txt#L5), and [KULLM](https://github.com/nlpai-lab/KULLM/blob/b7a78b62ed6cd9d83c51ad5a92a9dd40b9f35998/requirements.txt#L4).
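In practice, the Transformers integration means 8-bit loading is a one-line change. The snippet below is a minimal sketch of that usage (the model name is an arbitrary example, not something prescribed by this article), assuming `bitsandbytes` and `accelerate` are installed:

```python
# Minimal sketch: load a causal LM with 8-bit weights; Transformers delegates
# the quantization to bitsandbytes (LLM.int8()).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/polyglot-ko-5.8b"  # example model chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # place layers across the available GPUs
    load_in_8bit=True,   # enable 8-bit quantization via bitsandbytes
)
```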
Identify the cause
Now that we've located and reproduced the problem, it's time to get to the bottom of it. I looked to see if there were any similar cases, but I couldn't find any. Also, `cuInit()` was called normally, making it even more difficult to pinpoint the cause.

```python
import ctypes

count = ctypes.c_int()

libcuda = ctypes.CDLL("libcuda.so")
libcuda.cuInit(0)  # 0 (CUDA_SUCCESS)
libcuda.cuDeviceGetCount(ctypes.byref(count))  # 304 (CUDA_ERROR_OPERATING_SYSTEM)

libcudart = ctypes.CDLL("libcudart.so")
libcudart.cudaGetDeviceCount(ctypes.byref(count))  # 304 (CUDA_ERROR_OPERATING_SYSTEM)
```
I filed an issue on the GitHub repo (TimDettmers/bitsandbytes#264) for advice, and was told to update the package to the latest version and try again. After updating to version 0.38.0.post1, which was the latest at the time, I tested again, and the same problem occurred. I couldn't afford to lose too much time, so I decided to switch gears and remove the offending part.
Image source: Greco-Roman Mythology in Comics (Ghana Publishers)
Troubleshooting
My first approach was to use CUDA-Python[^7]. CUDA-Python is the CUDA Python Low-Level Bindings package officially distributed by NVIDIA. I had used it before and found it useful, so I immediately thought of it and decided to install and test it.
$ pip install cuda-python
```python
from cuda import cuda
from cuda import cudart

cuda.cuInit(0)  # (<CUresult.CUDA_SUCCESS: 0>,)
cudart.cudaGetDeviceCount()  # (<cudaError_t.cudaSuccess: 0>, 1)
```
Fortunately, `cudart.cudaGetDeviceCount()` worked fine, and I proceeded to test integrating it into `bitsandbytes`. However, calling `torch.cuda.is_available()` after calling `cuda.cuInit(0)` resulted in an error. This was because `cudaGetDeviceCount()` is called inside `torch.cuda.is_available()`.

```python
from cuda import cuda, cudart

cuda.cuInit(0)  # (<CUresult.CUDA_SUCCESS: 0>,)
cudart.cudaGetDeviceCount()  # (<cudaError_t.cudaSuccess: 0>, 1)

import bitsandbytes

# ...
# /opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
#   return torch._C._cuda_getDeviceCount() > 0
# ...
```
The problem seemed to be back to square one. I took a breath and calmly reread the error log above. Then something caught my eye.
torch._C._cuda_getDeviceCount() > 0
Note that `bitsandbytes` was already using PyTorch internally; in other words, it had a dependency on PyTorch. To be precise, `bitsandbytes` had a dependency on lion-pytorch, which had a dependency on PyTorch. And PyTorch already had an interface to the CUDA functions, which I decided to take advantage of this time.

Fortunately, all of the CUDA functions used by `bitsandbytes` existed in PyTorch. I made the following changes to the functions that were previously called via `libcuda.so` and `libcudart.so`.

| libcuda/libcudart | torch |
|-------------------|-------|
| `libcuda.cuDeviceGetCount()` | `torch.cuda.device_count()` |
| `libcuda.cuDeviceGet()` | `torch.cuda.device()` |
| `libcuda.cuDeviceComputeCapability()` | `torch.cuda.get_device_capability()` |
| `libcudart.cudaRuntimeGetVersion()` | `torch.version.cuda` |

After verifying that it worked after the change, I registered a PR in the GitHub repository (TimDettmers/bitsandbytes#375) to be applied to the distributed package version.
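As a rough sketch of the idea (not the actual patch), the same information can be read through PyTorch instead of calling libcuda/libcudart directly via ctypes:

```python
# Illustrative sketch: querying device count, compute capability, and CUDA
# runtime version through PyTorch instead of raw driver/runtime calls.
import torch


def get_compute_capabilities():
    ccs = []
    for i in range(torch.cuda.device_count()):               # replaces cuDeviceGetCount()
        major, minor = torch.cuda.get_device_capability(i)   # replaces cuDeviceComputeCapability()
        ccs.append(f"{major}.{minor}")
    return ccs


print(get_compute_capabilities())  # e.g. ['8.0'] on an A100
print(torch.version.cuda)          # replaces cudaRuntimeGetVersion()
```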
Postscript
On July 14, 2023, about two months after registering the PR, the patch was merged into the main branch and included in version 0.40.1.
I was also able to get some feedback from the author, Tim Dettmers, whose thoughts and philosophy are evident in this short article.
Through this opportunity, I was able to learn more about the LLM ecosystem. It was also the first time in a long while that I got to feel the fun of open-source work. I think the appeal of open-source work is that we can collaborate beyond spatial constraints and learn from each other's ideas. We run an open-source version of Backend.AI alongside an enterprise version, and we will always strive to provide a better user experience and a better developer experience.
[^1]: NVIDIA GPU Cloud
[^2]: The NGC catalog hosts containers for AI/ML, metaverse, and HPC applications that are performance-optimized, tested, and ready to deploy on GPU-powered on-prem, cloud, and edge systems.
[^3]: ViperGPT: Visual Inference via Python Execution for Reasoning, March 14, 2023.
[^4]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability
[^5]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE.html#group__CUDA__DEVICE_1g52b5ce05cb8c5fb6831b2c0ff2887c74
[^6]: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, November 10, 2022.
[^7]: https://developer.nvidia.com/cuda-python
This post is automatically translated from Korean
28 July 2023
Sneak Peek: Backend.AI Model Service Preview
By Kyujin Cho

Introduction
As super-sized AI models flood the market, there is a growing concern about not only developing the models, but also how to deliver them "well" and "efficiently" to users. Prior to Large Language Models (LLMs), the computing power of AI models was focused on training rather than inference, as the hardware requirements for attempting to make inferences with a trained model were much smaller than the computing power needed to train the model. Deployers of models could get enough power for inference from the NPU of a real user's end device (such as a smartphone). However, with the advent of LLMs, the tables were turned.
Take Meta's [OPT 175b](https://github.com/facebookresearch/metaseq) as an example: OPT-175b, as its name implies, has 175 billion parameters and requires roughly 320+ GB of GPU memory just to load them onto the GPU to perform inference tasks. That's a huge difference from the 4GB that pre-LLM image processing models used to require.
With this change in AI model behavior, efficiently managing service resources has become paramount to keeping your service running reliably. In this article, we'll preview Backend.AI's upcoming model service feature, Backend.AI Model Service, and show you how it will allow you to efficiently run your AI model from training to serving with a single infrastructure.

Backend.AI Model Service
Backend.AI Model Service is a model serving system that runs on top of the existing Backend.AI solution. It takes Backend.AI's tried-and-true container management technology and container app delivery system, AppProxy[^1], to the next level, enabling both AI training and model service in one infrastructure without installing additional components and by simply upgrading the existing Backend.AI infrastructure. It also supports an auto-scaling feature that automatically scales up and down inference sessions based on per-session GPU usage, number of API calls, or time of day, allowing you to effectively manage AI resources used for inference.
Inference Sessions
Inference sessions in Backend.AI are conceptually the same as traditional training sessions. You can use the same execution environment you've been using for training for inference sessions, or you can deploy a dedicated execution environment just for inference sessions. Inference sessions are volatile and stateless, so you can terminate them at any time if the session is not performing well. In this case, Backend.AI will attempt to recover the original state by creating a new inference session, while simultaneously forwarding inference requests to other living inference sessions to minimize downtime for the inference service.
Model storage
Models to be served through Backend.AI are managed as "model storage" units. Model storage consists of model files, code for model services, and model definition files.
Model definition file
The model definition file is where you define the information for running a service provider's model in the Backend.AI Model Service. The model definition file contains information about the model, the ports exposed by the model service, and a set of tasks that must be executed to run the model service. If your model service provides a health check feature that reports its own health, you can use that information to take action, such as excluding sessions from the service if they are in bad health.
```yaml
models:
  - name: "KoAlpaca-5.8B-model"
    model_path: "/models/KoAlpaca-5.8B"
    service:
      pre_start_actions:
        - action: run_command
          args:
            command: ["pip3", "install", "-r", "/models/requirements.txt"]
      start_command:
        - uvicorn
        - --app-dir
        - /models
        - chatbot-api:app
        - --port
        - "8000"
        - --host
        - "0.0.0.0"
      port: 8000
      health_check:
        path: /health
        max_retries: 10
```
Here is an example of a well-defined model definition file, which contains a set of steps to run the KoAlpaca 5.8B model as a model service.
Tutorial: Model Service with Backend.AI Model Service
In this tutorial, we'll actually use Backend.AI to service a KoAlpaca 5.8B model quantized to 8 bits.
Write the API server code
Write a simple API server to serve the model.
```python
import os
from typing import Any, List

from fastapi import FastAPI, Response
from fastapi.responses import RedirectResponse, StreamingResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
import numpy as np
from pydantic import BaseModel
import torch
from transformers import pipeline, AutoModelForCausalLM
import uvicorn

URL = "localhost:8000"
KOALPACA_MODEL = os.environ["BACKEND_MODEL_PATH"]

torch.set_printoptions(precision=6)

app = FastAPI()

model = AutoModelForCausalLM.from_pretrained(
    KOALPACA_MODEL,
    device_map="auto",
    load_in_8bit=True,
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=KOALPACA_MODEL,
)


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: List[Message]


BASE_CONTEXTS = [
    Message(role="맥락", content="KoAlpaca(코알파카)는 EleutherAI에서 개발한 Polyglot-ko 라는 한국어 모델을 기반으로, 자연어 처리 연구자 Beomi가 개발한 모델입니다."),
    Message(role="맥락", content="ChatKoAlpaca(챗코알파카)는 KoAlpaca를 채팅형으로 만든 것입니다."),
    Message(role="명령어", content="친절한 AI 챗봇인 ChatKoAlpaca 로서 답변을 합니다."),
    Message(role="명령어", content="인사에는 짧고 간단한 친절한 인사로 답하고, 아래 대화에 간단하고 짧게 답해주세요."),
]


def preprocess_messages(messages: List[Message]) -> List[Message]:
    ...


def flatten_messages(messages: List[Message]) -> str:
    ...


def postprocess(answer: List[Any]) -> str:
    ...


@app.post("/api/chat")
async def chat(req: ChatRequest) -> StreamingResponse:
    messages = preprocess_messages(req.messages)
    conversation_history = flatten_messages(messages)
    ans = pipe(
        conversation_history,
        do_sample=True,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        return_full_text=False,
        eos_token_id=2,
    )
    msg = postprocess(ans)

    async def iterator():
        yield msg.strip().encode("utf-8")

    return StreamingResponse(iterator())


@app.get("/health")
async def health() -> Response:
    return JSONResponse(content={"healthy": True})


@app.exception_handler(404)
async def custom_404_handler(_, __):
    return RedirectResponse("/404.html")


app.mount(
    "/",
    StaticFiles(directory=os.path.join(KOALPACA_MODEL, "..", "chatbot-ui"), html=True),
    name="html",
)
```
Create a model definition file
Create a model definition file for your API server.
```yaml
models:
  - name: "KoAlpaca-5.8B-model"
    model_path: "/models/KoAlpaca-Ployglot-5.8B"
    service:
      pre_start_actions:
        - action: run_command
          args:
            command: ["pip3", "install", "-r", "/models/requirements.txt"]
      start_command:
        - uvicorn
        - --app-dir
        - /models
        - chatbot-api:app
        - --port
        - "8000"
        - --host
        - "0.0.0.0"
      port: 8000
      health_check:
        path: /health
        max_retries: 10
```
In a session of the model service, model storage is always mounted under the `/models` path.

Prepare model storage
Add the model API server code you wrote, the model definition file, and the KoAlpaca model to your model storage, as in the example layout below.
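For reference, a plausible layout of the finished model storage might look like this; file and directory names other than `model-definition.yaml` are illustrative assumptions derived from the code above, not mandated names:

```
<model VFolder root>/
├── model-definition.yaml    # the Model Service definition shown above
├── requirements.txt         # packages installed by pre_start_actions
├── chatbot-api.py           # the FastAPI server referenced as chatbot-api:app
├── chatbot-ui/              # static files served by app.mount("/")
└── KoAlpaca-Ployglot-5.8B/  # model weights referenced by model_path
```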
Create a model service
With both the model file and the model definition file ready, you can now start the Backend.AI Model Service. The Model Service can be created using the `backend.ai service create` command in the Backend.AI CLI. The arguments accepted by `service create` are almost identical to those of the `backend.ai session create` command. After the image to use, you pass the ID of the model storage and the number of inference sessions to create initially.

Using `backend.ai service info`, you can check the status of the model service and the inference sessions belonging to the service. You can see that one inference session has been successfully created.

Use the Inference API
You can use the `backend.ai service get-endpoint` command to see the inference endpoint of a created model service. The inference endpoint keeps a unique, fixed value from the time the model service is created until it is removed. If a model service is backed by multiple inference sessions, AppProxy will distribute requests across them.

Restricting access to the Inference API
If you want to restrict who can access the inference API, you can enable authentication for the inference API by starting the model service with the `--public` option removed. Authentication tokens can be issued with the `backend.ai service generate-token` command.

Scaling inference sessions
The `backend.ai service scale` command allows you to change the scale of inference sessions belonging to the model service. Putting it all together, a typical CLI workflow is sketched below.
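The following shell sketch simply strings together the commands mentioned above. The placeholder arguments in angle brackets are assumptions, so check `backend.ai service --help` for the exact options available in your release.

```shell
# Hypothetical end-to-end workflow; replace the <...> placeholders with real values.
backend.ai service create <execution-image> <model-storage-id> <initial-session-count>
backend.ai service info <service-id>            # status of the service and its inference sessions
backend.ai service get-endpoint <service-id>    # print the inference endpoint URL
backend.ai service generate-token <service-id>  # issue an authentication token for the API
backend.ai service scale <service-id> <desired-session-count>
```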
Closing thoughts

So far, we've learned about Backend.AI Model Service and how to actually deploy a model service with the Model Service feature. Backend.AI Model Service is targeted for general availability in Backend.AI 23.03. We're working hard to make the Model Service feature publicly available in the near future, so stay tuned.
[^1]: Available from Backend.AI Enterprise.
This post is automatically translated from Korean
30 May 2023