Engineering

Apr 23, 2026

Building Production RAG Systems: Lessons from Tariff Support

Sergey Leksikov
Machine Learning Researcher

At Lablup, we build AI infrastructure software. But infrastructure alone doesn't solve domain problems — you need systems that can reason over specialized knowledge at scale. Over the past year, we've developed two production RAG systems that tackle very different challenges: HSense, a multi-agent system for Korean HS code classification across 11,000+ HS codes, and a Backend.AI RAG Assistant that handles customer support queries across seven documentation projects.

This post covers what we learned building both — the architecture decisions, the approaches that didn't work, and how each system performs in evaluation.

The Problem with Naive RAG

The standard RAG recipe — chunk your documents, embed them, retrieve top-k, generate — works well for simple Q&A over homogeneous corpora. It breaks down when:

  1. Documents have hierarchical structure. Korean tariff law is a 2,000+ page framework with nested classification rules. Flat chunking destroys the relationships between sections.
  2. Classification requires multi-step reasoning. Determining the correct 10-digit HS code for a product isn't a lookup — it requires applying General Rules of Interpretation (GRI), navigating code hierarchies, and cross-referencing precedents.
  3. Knowledge is scattered across heterogeneous sources. Backend.AI documentation spans official docs and internal knowledge bases — each with different formats, quality levels, and relevance patterns.

We needed architectures that go beyond retrieve-and-generate.

HSense: Multi-Agent RAG for HS Code Classification

What is HSense?

HSense is an AI-based HS code classification system developed through the XaaS Frontier Project led by the Ministry of Science and ICT in Korea. It takes text- or image-based product descriptions as input, recommends the most appropriate HS codes for the items, and provides the basis for classification grounded in relevant laws and regulations.

Why Agents?

HS code classification is a complex decision problem. A single product might fall under multiple candidate codes, and the correct classification depends on material composition, intended use, GRI rule interpretation, and framework definitions — often simultaneously.

Rather than building one monolithic prompt that tries to do everything, we decomposed the task into four specialist roles:

| Agent | Role | Tools |
| --- | --- | --- |
| Tariff Officer (Coordinator) | Orchestrates the team, guides search direction, makes final decision | Delegates to other agents |
| GRI Rule Expert (Interpreter) | Explains classification rules, provides interpretation guidance | Rule-based reasoning |
| Framework Expert (Scholar) | Searches 2,000+ pages of text (~3M tokens) | RAG over FAISS vectorstore + SQLite FTS5 keyword search |
| HS Code Navigator (Pathfinder) | Navigates code hierarchies from 2-digit to 10-digit | SQLite database of code trees |

This mirrors how human customs specialists actually work — no single person holds all the context. The coordinator asks specific questions, specialists provide evidence, and classification emerges from structured deliberation.

Retrieval Architecture

The Framework Expert uses a dual-retrieval strategy:

  • Semantic search via FAISS vectorstore for meaning-based queries ("products made from mixed animal and vegetable fats")
  • Keyword search via SQLite FTS5 for precise lookups ("heading 1517", "subheading note 2")
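
The two retrieval paths can be sketched as follows. This is a minimal illustration, not the production code: the keyword side uses stdlib `sqlite3` with an FTS5 virtual table, while the FAISS semantic side is stubbed out; the table schema and sample rows are made up for the example.

```python
import sqlite3

# Keyword path: an FTS5 full-text index for precise lookups like "1517".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(heading, body)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [
        ("heading 1517", "Margarine; edible mixtures of animal or vegetable fats"),
        ("heading 1509", "Olive oil and its fractions"),
    ],
)

def keyword_search(query: str, k: int = 5) -> list:
    """Exact-term lookups (headings, note numbers) go through FTS5 MATCH."""
    cur = conn.execute(
        "SELECT heading, body FROM chunks WHERE chunks MATCH ? LIMIT ?",
        (query, k),
    )
    return cur.fetchall()

def semantic_search(query: str, k: int = 5) -> list:
    """Stub for the FAISS path: embed the query, then similarity-search top-k."""
    raise NotImplementedError  # e.g. vectorstore.similarity_search(query, k=k)

print(keyword_search('"1517"'))
```

Queries with exact identifiers hit the FTS5 index; meaning-based queries fall through to the vectorstore.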

Documents are converted from Korean PDFs to structured Markdown using marker, then chunked by Markdown headers using LangChain's header-based splitter. This preserves document structure better than character-count splitting.
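
To show why header-based splitting preserves structure, here is a minimal stand-in for what such a splitter does (the real pipeline uses LangChain's `MarkdownHeaderTextSplitter`; this dependency-free sketch just illustrates that each chunk carries its heading path as metadata):

```python
import re

def split_by_headers(markdown: str) -> list:
    """Split Markdown on #/##/### headers, attaching the heading path to each chunk."""
    chunks, path, body = [], {}, []

    def flush():
        if body:
            chunks.append({"metadata": dict(path), "text": "\n".join(body).strip()})
            body.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,3})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            # Drop deeper levels when a new header starts, keep ancestors.
            path = {k: v for k, v in path.items() if k < level}
            path[level] = m.group(2)
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Chapter 15\n## Heading 1517\nMargarine and edible mixtures.\n"
print(split_by_headers(doc))
```

A chunk retrieved later still knows it belongs under "Chapter 15 → Heading 1517", which character-count splitting throws away.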

The HS Code Navigator operates over a separate SQLite database containing the full code hierarchy — no vector search needed, just structured traversal.
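
Structured traversal is plain SQL. A toy version (schema and rows are illustrative, not the actual HSense database):

```python
import sqlite3

# Code tree stored as (code, parent, label) rows; traversal is a WHERE clause.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hs_codes (code TEXT PRIMARY KEY, parent TEXT, label TEXT)")
conn.executemany("INSERT INTO hs_codes VALUES (?, ?, ?)", [
    ("15", None, "Animal or vegetable fats and oils"),
    ("1517", "15", "Margarine; edible mixtures"),
    ("151710", "1517", "Margarine, excluding liquid margarine"),
])

def children(parent_code: str) -> list:
    """One step down the 2 -> 4 -> 6 -> 10 digit hierarchy."""
    cur = conn.execute(
        "SELECT code, label FROM hs_codes WHERE parent = ? ORDER BY code",
        (parent_code,),
    )
    return cur.fetchall()

print(children("15"))
```

The navigator agent drills down level by level, so every candidate code it proposes is guaranteed to exist in the hierarchy — something embedding similarity cannot guarantee.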

The Agent Framework

We use Agno as the multi-agent framework. Here's a simplified view of the team definition:

```python
from agno.agent import Agent
from agno.team import Team

team = Team(
    name="HSense Classification Team",
    mode="coordinate",  # coordinator delegates to specialists
    members=[tariff_officer, gri_expert, legal_expert, hs_navigator],
    success_criteria="Provide HS code classification with supporting evidence",
    max_discussion_rounds=5,
    stream_intermediate_steps=True,
)
```

Each agent has a Pydantic-modeled output schema for structured responses, built-in reasoning mode for complex cases, and conversation memory for multi-turn sessions.
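
A Pydantic output schema for an agent might look like the following. The field names here are hypothetical, not the exact schema HSense uses:

```python
from pydantic import BaseModel, Field

class ClassificationResult(BaseModel):
    """Illustrative structured response for the classification team."""
    hs_code: str                                  # e.g. "151710"
    confidence: float = Field(ge=0.0, le=1.0)     # bounded by validation
    evidence: list = []                           # supporting rules/citations

result = ClassificationResult(
    hs_code="151710",
    confidence=0.92,
    evidence=["GRI 1", "Chapter 15 Note 1"],
)
print(result.hs_code, result.confidence)
```

Because the schema is validated, a malformed model response fails loudly at parse time instead of silently corrupting the deliberation downstream.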

Performance

On a 500-case test set sampled from the Korean HS code classification database, using Gemma-3-27B-IT with multimodal image annotations:

| Metric | Score |
| --- | --- |
| Top-1 Accuracy | 92.40% |
| Top-3 Accuracy | 96.40% |
| Top-5 Accuracy | 97.40% |
| Top-20 Accuracy | 98.40% |
| 2-Digit (Chapter) Accuracy | 96.60% |
| F1-Score | 86.17% |
| Avg tokens per case | 1,549 |

For context, a 2024 benchmark by Bryce Judy¹ found commercial HS code classification models ranging from 44–89% Top-1 accuracy on 10-digit codes. Our 92.4% puts HSense at the top of that range — and we're running a 27B open-weight model, not GPT-4.

Multiple chapters hit 100% Top-1 accuracy (Chapters 11, 15, 20, 23, 29, 30, 33, 34, 35, 38, 40, 60, 82, 94). The toughest chapters — where products are ambiguous or rules are particularly nuanced — still hover around 50–62.5%.

What Didn't Work

Chunking vectorstores for hierarchical documents. This is our biggest lesson. Tariff law has a tree structure — Chapters contain Headings, Headings contain Subheadings, and Subheadings have Notes that modify interpretation of everything above them. Flat vector search over chunks loses these relationships. The SQLite-based navigation approach for code hierarchies works better.

Single-agent approaches. When one agent tries to handle everything — GRI interpretation, code navigation, framework search, and final decision — it gets confused. Context windows fill up with irrelevant information, and the model can't maintain focus. Splitting responsibilities improved both accuracy and debuggability.

Unconstrained coordinator decisions. Early versions had the coordinator making wrong high-level decisions (picking the wrong chapter), causing all specialists to waste effort in the wrong direction. We added backtracking capability: if evidence from specialists contradicts the initial direction, the coordinator can reset and explore alternatives.

Backend.AI RAG Assistant

Introduction

The Backend.AI RAG Assistant was developed to provide systematic support for Backend.AI customers by addressing their questions and requests. It answers user queries based on a GitHub repository containing Backend.AI documentation, along with part of Lablup’s internal knowledge base.

Architecture

The assistant answers customer questions by searching across seven documentation projects:

User Query
  → RequestClassifier (GPT-4.1-mini, multi-label routing)
  → VectorDBManager (7 FAISS indices, k=10 per project)
  → Confidence Filter (L2 > 1.5 removed)
  → Fusion Date Re-ranking (similarity-primary, date as tiebreaker)
  → Global Top-15 Selection
  → RAGManager (Qwen3.5-35B-A3B, LangChain v0.3, streaming)
  → Response

The RequestClassifier routes queries to relevant projects before retrieval. A question about "how to create a compute session" should search webui and backendai docs, not enterprise-guide or realworld_data_support. Classification uses GPT-4.1-mini with structured output parsing across 8 categories (7 documentation projects and one conversation category), incorporating the last 3 conversation messages for context.
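
The routing step reduces to fanning retrieval out only to the indices the classifier selected. In this sketch the classifier itself is stubbed (the real one is a GPT-4.1-mini structured-output call), and only the four project names mentioned above appear; index names are made up:

```python
from dataclasses import dataclass, field

# Hypothetical mapping from project label to its FAISS index handle.
PROJECT_INDICES = {
    "webui": "faiss_webui",
    "backendai": "faiss_backendai",
    "enterprise-guide": "faiss_enterprise_guide",
    "realworld_data_support": "faiss_realworld",
}

@dataclass
class RouteDecision:
    labels: list = field(default_factory=list)  # multi-label: several projects may apply
    is_conversation: bool = False               # chit-chat skips retrieval entirely

def indices_to_search(decision: RouteDecision) -> list:
    """Fan retrieval out only to the projects the classifier selected."""
    if decision.is_conversation:
        return []
    return [PROJECT_INDICES[l] for l in decision.labels if l in PROJECT_INDICES]

route = RouteDecision(labels=["webui", "backendai"])
print(indices_to_search(route))
```

Multi-label output matters: a deployment question can legitimately span two projects, and searching both beats forcing a single category.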

The Confidence Filter removes chunks with L2 distance > 1.5 — these are low-relevance results that would dilute the context. Fusion Date Re-ranking uses similarity as the primary signal and chunk date as a tiebreaker: when two chunks have similar L2 scores, the more recent one ranks higher.
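
Filter and re-rank can be sketched in a few lines. The 1.5 L2 threshold comes from the post; the tolerance for treating two scores as "similar" is an assumption for illustration:

```python
from datetime import date

def filter_and_rerank(chunks: list, max_l2: float = 1.5, tie_eps: float = 0.05) -> list:
    """Drop low-relevance chunks, then rank by similarity with date as tiebreaker."""
    kept = [c for c in chunks if c["l2"] <= max_l2]  # confidence filter
    # Bucket L2 scores so near-ties land in the same bucket (tie_eps is an
    # assumed tolerance), then break ties by recency: newer chunks rank first.
    return sorted(kept, key=lambda c: (round(c["l2"] / tie_eps),
                                       -c["date"].toordinal()))

chunks = [
    {"id": "old_fix", "l2": 0.80, "date": date(2024, 1, 10)},
    {"id": "new_fix", "l2": 0.82, "date": date(2025, 6, 1)},
    {"id": "noise",   "l2": 1.70, "date": date(2025, 6, 1)},  # filtered out
]
print([c["id"] for c in filter_and_rerank(chunks)])  # → ['new_fix', 'old_fix']
```

Similarity stays primary: a clearly closer chunk always wins, and recency only decides between near-equal candidates.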

The anti-hallucination system prompt enforces strict context-only answers: the model must never fabricate IPs, ports, paths, server names, or commands. When multiple solutions exist across different dates, it presents the most recent first as the primary recommendation, then lists alternatives with their dates.

Evaluation Results

We evaluated the pipeline on 100 quality-filtered technical Q&A test cases sampled from 25K real-world records. The generation model is Qwen3.5-35B-A3B, served on Backend.AI infrastructure. Two retrieval modes were compared:

| Metric | Sample Data | All Datasets |
| --- | --- | --- |
| Relevance | 0.809 | 0.824 |
| Informativeness | 0.775 | 0.792 |
| Info Security | 0.950 | 0.960 |
| Usability | 0.766 | 0.778 |
| Verbosity | 0.847 | 0.845 |
| SemScore | 0.703 | 0.704 |
| Overall Score | 0.813 | 0.823 |
| Retrieval top-1 hit rate | 45% | 45% |
| Retrieval top-3 hit rate | 60% | 59% |
| Avg Response Time | 3.02s | 3.32s |

All-datasets mode outperforms sample-data mode on nearly every metric, confirming that cross-project context adds value. Information Security scores 0.95+ — the anti-hallucination system prompt effectively prevents PII leakage and fabricated specifics.

Evaluation Methodology

We use LLM-as-judge evaluation with a single structured-output call that scores 6 metrics simultaneously — roughly 19x faster than the previous per-metric DeepEval GEval approach (~90 min vs ~28 hours for 200 evaluations).

Metrics (all 0.0–1.0):

  • Relevance: Does the response technically solve the problem? Accepts alternative valid approaches, not just exact-match to expected answer.
  • Informativeness: Completeness and depth of information provided.
  • Information Security: Checks for PII leakage — customer names, IPs, account IDs, internal infrastructure details.
  • Usability: Whether advice is practical, actionable, and applicable in real-world environments.
  • Verbosity: Response length appropriateness — penalizes both too short and too long.
  • SemScore: Cosine similarity via text-embedding-3-small embeddings between generated and expected answers. Objective and reproducible — no LLM judge variance.

The Overall Score is a weighted average: Relevance (0.25), Info Security (0.20), Informativeness (0.15), Usability (0.15), SemScore (0.15), Verbosity (0.10).
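
The weighted average is straightforward to reproduce. Plugging in the all-datasets column from the table above recovers the reported Overall Score (to within rounding, since the component scores are themselves rounded):

```python
# Weights as stated in the post; they sum to 1.0.
WEIGHTS = {
    "relevance": 0.25,
    "info_security": 0.20,
    "informativeness": 0.15,
    "usability": 0.15,
    "semscore": 0.15,
    "verbosity": 0.10,
}

def overall_score(metrics: dict) -> float:
    """Weighted average of the six per-metric scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[m] * metrics[m] for m in WEIGHTS)

# All-datasets column from the evaluation table:
scores = {"relevance": 0.824, "informativeness": 0.792, "info_security": 0.960,
          "usability": 0.778, "semscore": 0.704, "verbosity": 0.845}
print(round(overall_score(scores), 3))
```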

Test cases go through a two-stage quality filter: tag pre-filter (25K → 19K) then LLM classifier (500 → 323 valid → 100 sampled). Exact-match source chunks are removed after retrieval to prevent data leakage. The judge evaluates: "Would a DevOps engineer receiving this response be able to solve the problem?"

Shared Lessons

  1. Retrieval is the weakest link

Only 45% of test cases have the correct source document as the top-1 retrieval result (60% for top-3). This means more than half the time, the generation model works with suboptimal context. Improving retrieval — through hybrid BM25 + semantic search, cross-encoder re-ranking, or better chunking — will have a higher impact than switching models. Retrieval quality deserves more engineering attention than model selection.

  2. The judge prompt matters more than the generation model

Our single largest accuracy improvement (+0.163) came not from changing the generation model or retrieval parameters, but from fixing the judge prompt. The previous judge penalized technically valid alternative answers because they didn't match the expected text exactly. The new judge asks: "Would a DevOps engineer solve the problem with this response?" — accepting registry-1.docker.io as equivalent to docker.io when both solve Docker layer download issues.

  3. Structured retrieval beats vector search for structured data

HSense's HS code navigator uses SQLite, not FAISS. The Backend.AI classifier uses an LLM, not embeddings. When your data has known structure — hierarchies, categories, metadata fields — use that structure directly. Vector search is a fallback for unstructured content, not a universal solution.

  4. Date-aware retrieval changes response quality

When chunks from different time periods cover the same topic, the model can present outdated solutions as primary recommendations. Fusion date re-ranking and date metadata in context headers let the model label solutions with dates and recommend the most recent approach first. In 51% of test cases, the model presents multiple solution approaches with date context.

  5. Multi-agent systems need observability

When one agent makes a wrong decision, the entire team can spiral. HSense's coordinator once picked the wrong product chapter, causing three specialists to generate confident but wrong evidence. Without step-by-step logging and the ability to trace decision chains, these failures are invisible.

  6. Open-weight models are production-ready

Qwen3.5-35B-A3B achieves 0.823 Overall Score served entirely on our own GPU infrastructure with 3-second response times. For domain-specific RAG — where the retrieval context does most of the heavy lifting — the gap between open-weight and proprietary models is small and shrinking.

What's Next

For HSense:

  • Replacing flat vectorstore retrieval with hierarchy-aware search for tariff documents
  • Prompt optimization with automated A/B testing and feedback loops
  • Adding a product case RAG pipeline to supplement rule-based classification with historical precedents

For the Backend.AI assistant:

  • Fixing the chunking pipeline (proper boundary detection, index rebuild)
  • Hybrid BM25 + semantic search to improve the 45% top-1 retrieval hit rate
  • Cross-encoder re-ranking for more precise top-k selection
  • Expanding evaluation to cover troubleshooting and reasoning queries
  • Implementing RAGAS metrics (Faithfulness, Context Precision/Recall) alongside current G-Eval
  • Human evaluation via our Streamlit-based A/B testing platform

Both systems run on Backend.AI infrastructure, which means GPU allocation, model serving, and scaling are handled by the same platform we're building RAG for. There's a certain satisfaction in using your own product to improve your own product.

Beyond what's covered in this post, we also explore synthetic data generation and iterative refinement through domain expert feedback in a separate talk.


If you're interested in Backend.AI or building RAG systems on GPU infrastructure, check out backend.ai or reach out to us at info@lablup.com.

Footnotes

  1. [Benchmarking Harmonized Tariff Schedule Classification Models](https://arxiv.org/abs/2412.14179), arXiv:2412.14179
