Engineering

Jan 29, 2024

Backend.AI Meets Tool LLMs: Revolutionizing AI Interaction with Tools - Part 2

  • Sergey Leksikov

    Machine Learning Researcher

Part 2. Backend.AI Gorilla LLM model serving

Previously, we discussed Tool LLM capabilities and usage. In this article, we will demonstrate step by step how to run the Gorilla LLM model on Backend.AI Cloud using the Backend.AI Desktop app.

Figure 1. The Backend.AI Desktop app installed on macOS

  1. Press the Start button to open the session creation menu.

Figure 2. New session start interactive screen

  2. Select the NGC-PyTorch 23.07 image.

  3. Attach a vFolder, the working directory containing the model files; for example, a directory named api_helper/.

Figure 3. Attaching vFolder screen

  4. Select the resource amount: 128 GB of RAM and 5 fGPUs.

Figure 4. Resource selection screen

  5. Select the Visual Studio Code Desktop environment.

Figure 5. IDE environment selection screen

  6. In the /home/work/api_helper/ directory, create a server.py file.

  7. Create a requirements.txt file (a plausible sketch follows Figure 6).

Figure 6. Content of requirements.txt file
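The exact package list is shown in Figure 6. As a rough sketch, a minimal requirements.txt for serving a transformers model behind FastAPI might look like the following (these entries are assumptions for illustration, not a transcription of the figure):

    fastapi
    uvicorn
    pydantic
    torch
    transformers
    accelerate
    sentencepiece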

To install the requirements, run: pip install -r requirements.txt

Figure 7. Executing install requirements command

  8. In server.py, define the tokenizer and model loader using the transformers library (a sketch follows Figure 8).

Figure 8. Code snippet of server.py
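Figure 8 shows the actual snippet. A minimal sketch of the loading step could look like this; the HuggingFace repository ID used here is an assumption for illustration:

    # server.py (model loading) - illustrative sketch
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    MODEL_ID = "gorilla-llm/gorilla-7b-hf-v1"  # assumed repository name

    # Download the weights from the HuggingFace Hub and load them onto the GPU
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # half precision to fit into fGPU memory
        device_map="auto",          # place weights on the available GPU(s)
    )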

  9. Define the server IP address and port number (a sketch follows Figure 9).

Figure 9. Definition of server IP address and port number
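Figure 9 contains the actual values. Continuing the sketch above, a minimal FastAPI wrapper with an /inference endpoint and an explicit host and port (0.0.0.0:8000 is assumed here, matching the curl example later in this article) might look like:

    # server.py (API part) - minimal sketch, continuing the snippet above
    import uvicorn
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        text: str

    @app.post("/inference")
    def inference(query: Query):
        # Tokenize the prompt, generate a completion, and decode it back to text
        inputs = tokenizer(query.text, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=256)
        return {"result": tokenizer.decode(outputs[0], skip_special_tokens=True)}

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)  # assumed address and port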

  10. To run the model, type: python server.py

Figure 10. Starting a server.py

  11. Access the created server.

VSCode automatically creates a port tunneling session from your device to the Backend.AI Cloud server. You can check the server status by accessing the localhost address; the request will be tunneled to Backend.AI Cloud. In addition, you may define other custom endpoints according to your needs.

Figure 11. The server run log

Figure 12. VSCode Port Forwarding configuration

Figure 13. Accessing the root of a server

Up to this point, we created a compute session on Backend.AI Cloud, attached an api_helper/ vFolder containing requirements.txt and server.py, and started our FastAPI server, in which the Gorilla LLM is downloaded from the HuggingFace repository and loaded into the session's memory behind the /inference API endpoint.

  12. API inference testing. To test Gorilla LLM inference over the API, you can issue a curl request from your local machine's command line:
curl -X POST -H "Content-Type: application/json" -d '{"text":"Object detection on a photo. <<<api_domain>>>:"}' http://127.0.0.1:8000/inference

Figure 14. An example of curl request

Figure 15. The GPU workload on a server after receiving the request

Figure 16. The server logs of receiving the request and printing the result

  13. Define a UI web app. You can use any web technology to build a UI that displays the results in a friendlier way. For example, place HTML and JavaScript files in a static/ directory under the root of server.py, then define an endpoint for the web app (a sketch follows Figure 17).

Figure 17. Example of adding an html web app to a FastAPI server
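Figure 17 shows the actual code. One plausible way to wire this up in FastAPI, assuming a static/ directory containing index.html next to server.py, is:

    # server.py (web app part) - illustrative sketch
    from fastapi.responses import FileResponse
    from fastapi.staticfiles import StaticFiles

    # Serve JS/CSS assets from the static/ directory
    app.mount("/static", StaticFiles(directory="static"), name="static")

    @app.get("/app")
    def web_app():
        # Return the single-page UI that calls the /inference endpoint
        return FileResponse("static/index.html")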

  14. The Gorilla LLM web app prototype: an API-tuned large language model for API question answering and code generation.

Figure 18. Gorilla LLM web app prototype. Example 1

Figure 19. Gorilla LLM web app prototype. Example 2

Conclusion

Despite some difficulties in serving the Gorilla LLM, an LLM fine-tuned on your own API has great potential and promise, since it can provide more up-to-date results, with more accurate parameters and function calls, than commercial large models, and it is useful for tasks such as question answering over an API, code autocompletion, and API code execution.

Limitations and difficulties:

While trying to serve the Gorilla LLM model, the following issues had to be considered:

  • The model may generate a response in an unexpected format
  • The model may generate different results for the same question
  • Parsing and rendering the LLM response
  • Eliminating duplicate sentences and lines
