Engineering
Building Production RAG Systems: Lessons from Tariff Support

Sergey Leksikov
Machine Learning Researcher
Apr 23, 2026
At Lablup, we build AI infrastructure software. But infrastructure alone doesn't solve domain problems — you need systems that can reason over specialized knowledge at scale. Over the past year, we've developed two production RAG systems that tackle very different challenges: HSense, a multi-agent system for Korean HS code classification across 11,000+ HS codes, and a Backend.AI RAG Assistant that handles customer support queries across seven documentation projects.
This post covers what we learned building both — the architecture decisions, the approaches that didn't work, and the analytical characteristics of our approach.
The Problem with Naive RAG
The standard RAG recipe — chunk your documents, embed them, retrieve top-k, generate — works well for simple Q&A over homogeneous corpora. It breaks down when:
- Documents have hierarchical structure. Korean tariff law is a 2,000+ page framework with nested classification rules. Flat chunking destroys the relationships between sections.
- Classification requires multi-step reasoning. Determining the correct 10-digit HS code for a product isn't a lookup — it requires applying General Rules of Interpretation (GRI), navigating code hierarchies, and cross-referencing precedents.
- Knowledge is scattered across heterogeneous sources. Backend.AI documentation spans official docs and internal knowledge bases — each with different formats, quality levels, and relevance patterns.
We needed architectures that go beyond retrieve-and-generate.
HSense: Multi-Agent RAG for HS code Classification
What is HSense?
HSense is an AI-based HS code classification system developed through the XaaS Frontier Project led by the Ministry of Science and ICT in Korea. It takes text- or image-based product descriptions as input, recommends the most appropriate HS codes for the items, and provides the basis for classification grounded in relevant laws and regulations.
Why Agents?
HS code classification is a complex decision problem. A single product might fall under multiple candidate codes, and the correct classification depends on material composition, intended use, GRI rule interpretation, and framework definitions — often simultaneously.
Rather than building one monolithic prompt that tries to do everything, we decomposed the task into four specialist roles:
| Agent | Role | Tools |
|---|---|---|
| Tariff Officer (Coordinator) | Orchestrates the team, guides search direction, makes final decision | Delegates to other agents |
| GRI Rule Expert (Interpreter) | Explains classification rules, provides interpretation guidance | Rule-based reasoning |
| Framework Expert (Scholar) | Searches 2,000+ pages of text (~3M tokens) | RAG over FAISS vectorstore + SQLite FTS5 keyword search |
| HS Code Navigator (Pathfinder) | Navigates code hierarchies from 2-digit to 10-digit | SQLite database of code trees |
This mirrors how human customs specialists actually work — no single person holds all the context. The coordinator asks specific questions, specialists provide evidence, and classification emerges from structured deliberation.
Retrieval Architecture
The Framework Expert uses a dual-retrieval strategy (a code sketch follows this list):
- Semantic search via FAISS vectorstore for meaning-based queries ("products made from mixed animal and vegetable fats")
- Keyword search via SQLite FTS5 for precise lookups ("heading 1517", "subheading note 2")
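A minimal sketch of the two paths, assuming a pre-built FAISS index and an FTS5 table over the same chunks (the embedding model, file names, and table schema here are placeholders, not the production configuration):

```python
import sqlite3

from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Semantic side: FAISS index built over the framework chunks.
vectorstore = FAISS.load_local(
    "framework_index",
    HuggingFaceEmbeddings(),  # placeholder; use whatever model the index was built with
    allow_dangerous_deserialization=True,
)

def semantic_search(query: str, k: int = 5):
    """Meaning-based lookup, e.g. 'products made from mixed animal and vegetable fats'."""
    return vectorstore.similarity_search(query, k=k)

# Keyword side: SQLite FTS5 virtual table over the same chunk text.
conn = sqlite3.connect("framework.db")

def keyword_search(query: str, k: int = 5):
    """Exact-term lookup, e.g. 'heading 1517' or 'subheading note 2'."""
    rows = conn.execute(
        "SELECT chunk_text FROM framework_fts WHERE framework_fts MATCH ? LIMIT ?",
        (query, k),
    )
    return [text for (text,) in rows]
```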
Documents are converted from Korean PDFs to structured Markdown using marker, then chunked by Markdown headers using LangChain's header-based splitter. This preserves document structure better than character-count splitting.
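For the header-based chunking step, a sketch with LangChain's splitter (the header-to-level mapping is illustrative, not the exact production config):

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "chapter"),
        ("##", "heading"),
        ("###", "subheading"),
    ]
)

# tariff_framework.md stands in for the Markdown produced by marker from the Korean PDF.
with open("tariff_framework.md", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

# Each chunk carries its header path in metadata, so a retrieved chunk still
# knows which chapter and heading it belongs to.
print(chunks[0].metadata)
```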
The HS Code Navigator operates over a separate SQLite database containing the full code hierarchy — no vector search needed, just structured traversal.
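The traversal itself is plain SQL; here is a sketch against an assumed parent-child table (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect("hs_codes.db")

def children(code: str) -> list[tuple[str, str]]:
    """One level down the tree: '15' -> 4-digit headings, '1517' -> 6-digit
    subheadings, and so on until a full 10-digit code is reached."""
    rows = conn.execute(
        "SELECT code, description FROM hs_codes WHERE parent_code = ? ORDER BY code",
        (code,),
    )
    return rows.fetchall()

# Drill down from a heading toward a 10-digit code.
for code, description in children("1517"):
    print(code, description)
```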
The Agent Framework
We use Agno as the multi-agent framework. Here's a simplified view of the team definition:

```python
from agno.agent import Agent
from agno.team import Team

team = Team(
    name="HSense Classification Team",
    mode="coordinate",  # coordinator delegates to specialists
    members=[tariff_officer, gri_expert, legal_expert, hs_navigator],
    success_criteria="Provide HS code classification with supporting evidence",
    max_discussion_rounds=5,
    stream_intermediate_steps=True,
)
```

Each agent has a Pydantic-modeled output schema for structured responses, built-in reasoning mode for complex cases, and conversation memory for multi-turn sessions.
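As a rough sketch of what one such agent declaration can look like (the schema fields are invented here, and parameter names vary across Agno versions):

```python
from pydantic import BaseModel, Field
from agno.agent import Agent

class ClassificationResult(BaseModel):
    """Illustrative output schema, not the exact production one."""
    hs_code: str = Field(description="Recommended 10-digit HS code")
    rationale: str = Field(description="Why this code applies")
    legal_basis: list[str] = Field(description="GRI rules and framework clauses cited")

tariff_officer = Agent(
    name="Tariff Officer",
    role="Coordinate the specialists and issue the final classification",
    response_model=ClassificationResult,  # structured, Pydantic-validated output
    reasoning=True,                       # built-in reasoning mode for hard cases
    add_history_to_messages=True,         # conversation memory for multi-turn sessions
)
```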
Performance
On a 500-case test set sampled from the Korean HS code classification database, using Gemma-3-27B-IT with multimodal image annotations:
| Metric | Score |
|---|---|
| Top-1 Accuracy | 92.40% |
| Top-3 Accuracy | 96.40% |
| Top-5 Accuracy | 97.40% |
| Top-20 Accuracy | 98.40% |
| 2-Digit (Chapter) Accuracy | 96.60% |
| F1-Score | 86.17% |
| Avg tokens per case | 1,549 |
For context, a 2024 benchmark by Bryce Judy¹ found commercial HS code classification models ranging from 44–89% Top-1 accuracy on 10-digit codes. Our 92.4% puts HSense at the top of that range — and we're running a 27B open-weight model, not GPT-4.
Multiple chapters hit 100% Top-1 accuracy (Chapters 11, 15, 20, 23, 29, 30, 33, 34, 35, 38, 40, 60, 82, 94). The toughest chapters — where products are ambiguous or rules are particularly nuanced — still hover around 50–62.5%.
What Didn't Work
Chunking vectorstores for hierarchical documents. This is our biggest lesson. Tariff law has a tree structure — Chapters contain Headings, Headings contain Subheadings, and Subheadings have Notes that modify interpretation of everything above them. Flat vector search over chunks loses these relationships. The SQLite-based navigation approach for code hierarchies works better.
Single-agent approaches. When one agent tries to handle everything — GRI interpretation, code navigation, framework search, and final decision — it gets confused. Context windows fill up with irrelevant information, and the model can't maintain focus. Splitting responsibilities improved both accuracy and debuggability.
Unconstrained coordinator decisions. Early versions had the coordinator making wrong high-level decisions (picking the wrong chapter), causing all specialists to waste effort in the wrong direction. We added backtracking capability: if evidence from specialists contradicts the initial direction, the coordinator can reset and explore alternatives.
Backend.AI RAG Assistant
Introduction
The Backend.AI RAG Assistant was developed to provide systematic support for Backend.AI customers by addressing their questions and requests. It answers user queries based on a GitHub repository containing Backend.AI documentation, along with part of Lablup’s internal knowledge base.
Architecture
The assistant answers customer questions by searching across seven documentation projects:
```
User Query
→ RequestClassifier (GPT-4.1-mini, multi-label routing)
→ VectorDBManager (7 FAISS indices, k=10 per project)
→ Confidence Filter (L2 > 1.5 removed)
→ Fusion Date Re-ranking (similarity-primary, date as tiebreaker)
→ Global Top-15 Selection
→ RAGManager (Qwen3.5-35B-A3B, LangChain v0.3, streaming)
→ Response
```
The RequestClassifier routes queries to relevant projects before retrieval. A question about "how to create a compute session" should search webui and backendai docs, not enterprise-guide or realworld_data_support. Classification uses GPT-4.1-mini with structured output parsing across 8 categories (7 documentation projects and one conversation category), incorporating the last 3 conversation messages for context.
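A minimal sketch of that routing step using LangChain structured output (the category list is abbreviated and the prompt is simplified):

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

CATEGORIES = [
    "backendai", "webui", "enterprise-guide", "realworld_data_support",
    # ...the remaining documentation projects, plus "conversation"
]

class RouteDecision(BaseModel):
    categories: list[str] = Field(description="Every category whose index should be searched")

router = ChatOpenAI(model="gpt-4.1-mini").with_structured_output(RouteDecision)

def classify(query: str, history: list[str]) -> list[str]:
    # The last 3 conversation messages keep follow-up questions routed correctly.
    context = "\n".join(history[-3:])
    decision = router.invoke(
        f"Conversation so far:\n{context}\n\nUser query: {query}\n\n"
        f"Choose the relevant categories from: {CATEGORIES}"
    )
    return decision.categories
```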
The Confidence Filter removes chunks with L2 distance > 1.5 — these are low-relevance results that would dilute the context. Fusion Date Re-ranking uses similarity as the primary signal and chunk date as a tiebreaker: when two chunks have similar L2 scores, the more recent one ranks higher.
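Conceptually, the filter-and-rank step looks like the sketch below; the exact tie-breaking rule in production may differ (here, L2 scores are bucketed to one decimal to decide what counts as "similar"):

```python
from datetime import date

def fuse(chunks: list[dict], k: int = 15) -> list[dict]:
    """chunks come from all selected project indices (k=10 each) with
    illustrative fields: {'text': str, 'l2': float, 'date': datetime.date}."""
    # Confidence filter: drop low-relevance hits.
    kept = [c for c in chunks if c["l2"] <= 1.5]
    # Similarity first; among near-ties, the more recent chunk wins.
    kept.sort(key=lambda c: (round(c["l2"], 1), -c["date"].toordinal()))
    return kept[:k]  # global top-15 goes into the generation context

ranked = fuse([
    {"text": "older fix", "l2": 0.82, "date": date(2023, 5, 1)},
    {"text": "newer fix", "l2": 0.84, "date": date(2025, 1, 10)},
    {"text": "off-topic", "l2": 1.70, "date": date(2025, 2, 1)},
])
# -> "newer fix" outranks "older fix" (same similarity bucket, later date);
#    "off-topic" is removed by the confidence filter.
```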
The anti-hallucination system prompt enforces strict context-only answers: the model must never fabricate IPs, ports, paths, server names, or commands. When multiple solutions exist across different dates, it presents the most recent first as the primary recommendation, then lists alternatives with their dates.
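A rough paraphrase of those constraints as a system prompt (not the exact production wording):

```python
SYSTEM_PROMPT = """You are a Backend.AI support assistant.
Answer only from the provided context chunks.
Never fabricate IP addresses, ports, file paths, server names, or commands.
When the context contains multiple dated solutions for the same problem,
present the most recent one first as the primary recommendation, then list
the alternatives with their dates.
If the context does not contain the answer, say so instead of guessing."""
```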
Evaluation Results
We evaluated the pipeline on 100 quality-filtered technical Q&A test cases sampled from 25K real-world data. The generation model is Qwen3.5-35B-A3B, served on Backend.AI infrastructure. Two retrieval modes were compared:
| Metric | Sample Data | All Datasets |
|---|---|---|
| Relevance | 0.809 | 0.824 |
| Informativeness | 0.775 | 0.792 |
| Info Security | 0.950 | 0.960 |
| Usability | 0.766 | 0.778 |
| Verbosity | 0.847 | 0.845 |
| SemScore | 0.703 | 0.704 |
| Overall Score | 0.813 | 0.823 |
| Retrieval top-1 hit rate | 45% | 45% |
| Retrieval top-3 hit rate | 60% | 59% |
| Avg Response Time | 3.02s | 3.32s |
All-datasets mode outperforms sample-data mode on nearly every metric, confirming that cross-project context adds value. Information Security scores 0.95+ — the anti-hallucination system prompt effectively prevents PII leakage and fabricated specifics.
Evaluation Methodology
We use LLM-as-judge evaluation with a single structured-output call that scores the metrics simultaneously — over an order of magnitude faster than the previous per-metric DeepEval GEval approach (~90 min vs ~28 hours for 200 evaluations).
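A minimal sketch of that single judge call, assuming LangChain structured output; the judge model here is a placeholder, and SemScore is computed separately from embeddings as described below:

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class JudgeScores(BaseModel):
    """LLM-judged metrics from one call; SemScore is added afterwards."""
    relevance: float = Field(ge=0.0, le=1.0)
    informativeness: float = Field(ge=0.0, le=1.0)
    info_security: float = Field(ge=0.0, le=1.0)
    usability: float = Field(ge=0.0, le=1.0)
    verbosity: float = Field(ge=0.0, le=1.0)
    rationale: str

judge = ChatOpenAI(model="gpt-4.1-mini").with_structured_output(JudgeScores)  # placeholder model

def judge_case(question: str, expected: str, generated: str) -> JudgeScores:
    return judge.invoke(
        "Would a DevOps engineer receiving this response be able to solve the problem?\n"
        f"Question: {question}\n\nExpected answer: {expected}\n\nGenerated answer: {generated}"
    )
```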
Metrics (all 0.0–1.0):
- Relevance: Does the response technically solve the problem? Accepts alternative valid approaches, not just exact-match to expected answer.
- Informativeness: Completeness and depth of information provided.
- Information Security: Checks for PII leakage — customer names, IPs, account IDs, internal infrastructure details.
- Usability: Whether advice is practical, actionable, and applicable in real-world environments.
- Verbosity: Response length appropriateness — penalizes both too short and too long.
- SemScore: Cosine similarity between generated and expected answers, computed with text-embedding-3-small embeddings. Objective and reproducible — no LLM judge variance.
The Overall Score is a weighted average: Relevance (0.25), Info Security (0.20), Informativeness (0.15), Usability (0.15), SemScore (0.15), Verbosity (0.10).
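The aggregation itself is a one-liner; with the all-datasets numbers from the table above, it reproduces the reported Overall Score within rounding (metric key names are mine):

```python
WEIGHTS = {
    "relevance": 0.25,
    "info_security": 0.20,
    "informativeness": 0.15,
    "usability": 0.15,
    "semscore": 0.15,
    "verbosity": 0.10,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average over the six 0.0-1.0 metrics."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

print(overall_score({
    "relevance": 0.824, "info_security": 0.960, "informativeness": 0.792,
    "usability": 0.778, "semscore": 0.704, "verbosity": 0.845,
}))  # ~0.82, consistent with the 0.823 reported above
```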
Test cases go through a two-stage quality filter: tag pre-filter (25K → 19K) then LLM classifier (500 → 323 valid → 100 sampled). Exact-match source chunks are removed after retrieval to prevent data leakage. The judge evaluates: "Would a DevOps engineer receiving this response be able to solve the problem?"
Shared Lessons
- Retrieval is the weakest link
Only 45% of test cases have the correct source document as the top-1 retrieval result (60% for top-3). This means that more than half the time, the generation model works with suboptimal context. Improving retrieval through hybrid BM25 + semantic search, cross-encoder re-ranking, or better chunking will have a higher impact than switching models; a minimal hybrid-retrieval sketch follows below. Retrieval quality deserves more engineering attention than model selection.
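One direction for closing that gap (listed in the roadmap below) is hybrid retrieval. A sketch with LangChain's ensemble retriever, assuming the chunk list and FAISS index already exist:

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

def build_hybrid_retriever(docs: list[Document], vectorstore: FAISS) -> EnsembleRetriever:
    bm25 = BM25Retriever.from_documents(docs, k=10)               # keyword signal (needs rank_bm25)
    semantic = vectorstore.as_retriever(search_kwargs={"k": 10})  # existing FAISS signal
    # Reciprocal rank fusion across the two ranked lists; weights are a starting guess.
    return EnsembleRetriever(retrievers=[bm25, semantic], weights=[0.4, 0.6])

# hybrid = build_hybrid_retriever(docs, vectorstore)
# results = hybrid.invoke("how to create a compute session")
```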
- The judge prompt matters more than the generation model
Our single largest accuracy improvement (+0.163) came not from changing the generation model or retrieval parameters, but from fixing the judge prompt. The previous judge penalized technically valid alternative answers because they didn't match the expected text exactly. The new judge asks: "Would a DevOps engineer solve the problem with this response?" — accepting registry-1.docker.io as equivalent to docker.io when both solve Docker layer download issues.
- Structured retrieval beats vector search for structured data
HSense's HS code navigator uses SQLite, not FAISS. The Backend.AI classifier uses an LLM, not embeddings. When your data has known structure — hierarchies, categories, metadata fields — use that structure directly. Vector search is a fallback for unstructured content, not a universal solution.
- Date-aware retrieval changes response quality
When chunks from different time periods cover the same topic, the model can present outdated solutions as primary recommendations. Fusion date re-ranking and date metadata in context headers let the model label solutions with dates and recommend the most recent approach first. In 51% of test cases, the model presents multiple solution approaches with date context.
- Multi-agent systems need observability
When one agent makes a wrong decision, the entire team can spiral. HSense's coordinator once picked the wrong product chapter, causing three specialists to generate confident but wrong evidence. Without step-by-step logging and the ability to trace decision chains, these failures are invisible.
- Open-weight models are production-ready
Qwen3.5-35B-A3B achieves 0.823 Overall Score served entirely on our own GPU infrastructure with 3-second response times. For domain-specific RAG — where the retrieval context does most of the heavy lifting — the gap between open-weight and proprietary models is small and shrinking.
What's Next
For HSense:
- Replacing flat vectorstore retrieval with hierarchy-aware search for tariff documents
- Prompt optimization with automated A/B testing and feedback loops
- Adding a product case RAG pipeline to supplement rule-based classification with historical precedents
For the Backend.AI assistant:
- Fixing the chunking pipeline (proper boundary detection, index rebuild)
- Hybrid BM25 + semantic search to improve the 45% top-1 retrieval hit rate
- Cross-encoder re-ranking for more precise top-k selection
- Expanding evaluation to cover troubleshooting and reasoning queries
- Implementing RAGAS metrics (Faithfulness, Context Precision/Recall) alongside current G-Eval
- Human evaluation via our Streamlit-based A/B testing platform
Both systems run on Backend.AI infrastructure, which means GPU allocation, model serving, and scaling are handled by the same platform we're building RAG for. There's a certain satisfaction in using your own product to improve your own product.
Beyond what's covered in this post, we also explore synthetic data generation and iterative refinement through domain expert feedback in a separate talk.
If you're interested in Backend.AI or building RAG systems on GPU infrastructure, check out backend.ai or reach out to us at info@lablup.com.
Footnotes
1. [Benchmarking Harmonized Tariff Schedule Classification Models](https://arxiv.org/abs/2412.14179), arXiv ↩