Engineering

ON THIS PAGE

May 30, 2023

Engineering

Sneak Peek: Backend.AI Model Service Preview

Kyujin Cho
Senior Software Engineer

May 30, 2023

Engineering

Sneak Peek: Backend.AI Model Service Preview

Kyujin Cho
Senior Software Engineer

Introduction

As super-sized AI models flood the market, there is a growing concern about not only developing the models, but also how to deliver them "well" and "efficiently" to users. Prior to Large Language Models (LLMs), the computing power of AI models was focused on training rather than inference, as the hardware requirements for attempting to make inferences with a trained model were much smaller than the computing power needed to train the model. Deployers of models could get enough power for inference from the NPU of a real user's end device (such as a smartphone). However, with the advent of LLMs, the tables were turned.

Take Meta's [OPT 175b] (https://github.com/facebookresearch/metaseq) as an example: OPT-175b, as its name implies, has 175 billion parameters and requires roughly 320+ GB of GPU memory just to load them onto the GPU to perform inference tasks. That's a huge difference from the 4GB that pre-LLM image processing models used to require.
With this change in AI model behavior, efficiently managing service resources has become paramount to keeping your service running reliably. In this article, we'll preview Backend.AI's upcoming model service feature, Backend.AI Model Service, and show you how it will allow you to efficiently run your AI model from training to serving with a single infrastructure.

Backend.AI Model Service

Backend.AI Model Service is a model serving system that runs on top of the existing Backend.AI solution. It takes Backend.AI's tried-and-true container management technology and container app delivery system, AppProxy¹, to the next level, enabling both AI training and model service in one infrastructure without installing additional components and by simply upgrading the existing Backend.AI infrastructure. It also supports an auto-scaling feature that automatically scales up and down inference sessions based on per-session GPU usage, number of API calls, or time of day, allowing you to effectively manage AI resources used for inference.

Inference Sessions

Inference sessions in Backend.AI are conceptually the same as traditional training sessions. You can use the same execution environment you've been using for training for inference sessions, or you can deploy a dedicated execution environment just for inference sessions. Inference sessions are volatile and stateless, so you can terminate them at any time if the session is not performing well. In this case, Backend.AI will attempt to recover the original state by creating a new inference session, while simultaneously forwarding inference requests to other living inference sessions to minimize downtime for the inference service.

Model storage

Models to be served through Backend.AI are managed as "model storage" units. Model storage consists of model files, code for model services, and model definition files.

Model definition file

The model definition file is where you define the information for running a service provider's model in the Backend.AI Model Service. The model definition file contains information about the model, the ports exposed by the model service, and a set of tasks that must be executed to run the model service. If your model service provides a health check feature that reports its own health, you can use that information to take action, such as excluding sessions from the service if they are in bad health.

models:
  - name: "KoAlpaca-5.8B-model"
    model_path: "/models/KoAlpaca-5.8B"
    service:
      pre_start_actions:
        - action: run_command
          args:
            command: ["pip3", "install", "-r", "/models/requirements.txt"]
      start_command:
        - uvicorn
        - --app-dir
        - /models
        - chatbot-api:app
        - --port
        - "8000"
        - --host
        - "0.0.0.0"
      port: 8000
      health_check:
        path: /health
        max_retries: 10

Here is an example of a well-defined model definition file, which contains a set of steps to run the KoAlpaca 5.8B model as a model service.

Tutorial: Model Service with Backend.AI Model Service

In this tutorial, we'll actually use Backend.AI to service a KoAlpaca 5.8B model quantized to 8 bits.

Write the API server code

Write a simple API server to serve the model.

import os
from typing import Any, List

from fastapi import FastAPI, Response
from fastapi.responses import RedirectResponse, StreamingResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
import numpy as np
from pydantic import BaseModel
import torch
from transformers import pipeline, AutoModelForCausalLM
import uvicorn

URL = "localhost:8000"
KOALPACA_MODEL = os.environ["BACKEND_MODEL_PATH"]

torch.set_printoptions(precision=6)

app = FastAPI()

model = AutoModelForCausalLM.from_pretrained(
    KOALPACA_MODEL,
    device_map="auto",
    load_in_8bit=True,
)


pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=KOALPACA_MODEL,
)


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: List[Message]


BASE_CONTEXTS = [
    Message(role="맥락", content="KoAlpaca(코알파카)는 EleutherAI에서 개발한 Polyglot-ko 라는 한국어 모델을 기반으로, 자연어 처리 연구자 Beomi가 개발한 모델입니다."),
    Message(role="맥락", content="ChatKoAlpaca(챗코알파카)는 KoAlpaca를 채팅형으로 만든 것입니다."),
    Message(role="명령어", content="친절한 AI 챗봇인 ChatKoAlpaca 로서 답변을 합니다."),
    Message(role="명령어", content="인사에는 짧고 간단한 친절한 인사로 답하고, 아래 대화에 간단하고 짧게 답해주세요."),
]


def preprocess_messages(messages: List[Message]) -> List[Message]:
    ...


def flatten_messages(messages: List[Message]) -> str:
    ...


def postprocess(answer: List[Any]) -> str:
    ...


@app.post("/api/chat")
async def chat(req: ChatRequest) -> StreamingResponse:
    messages = preprocess_messages(req.messages)
    conversation_history = flatten_messages(messages)
    ans = pipe(
        conversation_history,
        do_sample=True,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        return_full_text=False,
        eos_token_id=2,
    )
    msg = postprocess(ans)

    async def iterator():
        yield msg.strip().encode("utf-8")

    return StreamingResponse(iterator())


@app.get("/health")
async def health() -> Response:
    return JSONResponse(content={"healthy": True})


@app.exception_handler(404)
async def custom_404_handler(_, __):
    return RedirectResponse("/404.html")


app.mount(
    "/",
    StaticFiles(directory=os.path.join(KOALPACA_MODEL, "..", "chatbot-ui"), html=True),
    name="html",
)

Create a model definition file

Create a model definition file for your API server.

models:
  - name: "KoAlpaca-5.8B-model"
    model_path: "/models/KoAlpaca-Ployglot-5.8B"
    service:
      pre_start_actions:
        - action: run_command
          args:
            command: ["pip3", "install", "-r", "/models/requirements.txt"]
      start_command:
        - uvicorn
        - --app-dir
        - /models
        - chatbot-api:app
        - --port
        - "8000"
        - --host
        - "0.0.0.0"
      port: 8000
      health_check:
        path: /health
        max_retries: 10

In a session of the model service, model storage is always mounted under the /models path.

Prepare model storage

Add the model API server code you wrote, the model definition file, and the KoAlpaca model to your model storage.

Create a model service

With both the model file and the model definition file ready, you can now start the Backend.AI Model Service. The Model Service can be created using the backend.ai service create command in the Backend.AI CLI. The arguments accepted by service create are almost identical to the backend.ai session create command. After the image to use, you pass the ID of the model storage and the number of inference sessions to initially create.

Using backend.ai service info, you can check the status of the model service and the inference sessions belonging to the service. You can see that one inference session has been successfully created.

Use the Reasoning API

You can use the backend.ai service get-endpoint command to see the inference endpoint of a created model service. The inference endpoint continues to have a unique value until a model service is created and removed. If a model service belongs to multiple inference sessions, AppProxy will distribute requests across the multiple inference sessions.

Restricting access to the Reasoning API

If you want to restrict who can access the inference API, you can enable authentication for the inference API by starting the model service with the --public option removed. Authentication tokens can be issued with the backend.ai service generate-token command.

Scaling inference sessions

The backend.ai service scale command allows you to change the scale of inference sessions belonging to the model service.

Closing thoughts

So far, we've learned about Backend.AI Model Service and how to actually deploy a model service with the Model Service feature. Backend.AI Model Service is targeted for general availability in Backend.AI 23.03. We're working hard to make the Model Service feature publicly available in the near future, so stay tuned.

---]

This post is automatically translated from Korean

Available from Backend.AI Enterprise. ↩

Blog

Engineering

Sneak Peek: Backend.AI Model Service Preview

Sneak Peek: Backend.AI Model Service Preview

Introduction

Backend.AI Model Service

Inference Sessions

Model storage

Model definition file

Tutorial: Model Service with Backend.AI Model Service

Write the API server code

Create a model definition file

Prepare model storage

Create a model service

Use the Reasoning API

Restricting access to the Reasoning API

Scaling inference sessions

Closing thoughts

Footnotes