Engineering
Building Production RAG Systems: Lessons from Tariff Support

Sergey Leksikov
Machine Learning Researcher
Apr 23, 2026
At Lablup, we build AI infrastructure software. But infrastructure alone doesn't solve domain problems — you need systems that can reason over specialized knowledge at scale. Over the past year, we've developed two production RAG systems that tackle very different challenges: HSense, a multi-agent system for Korean HS code classification across 11,000+ HS codes, and a Backend.AI RAG Assistant that handles customer support queries across seven documentation projects.
This post covers what we learned building both — the architecture decisions, the approaches that didn't work, and the analytical characteristics of our approach.
The Problem with Naive RAG
The standard RAG recipe — chunk your documents, embed them, retrieve top-k, generate — works well for simple Q&A over homogeneous corpora. It breaks down when:
- Documents have hierarchical structure. Korean tariff law is a 2,000+ page framework with nested classification rules. Flat chunking destroys the relationships between sections.
- Classification requires multi-step reasoning. Determining the correct 10-digit HS code for a product isn't a lookup — it requires applying General Rules of Interpretation (GRI), navigating code hierarchies, and cross-referencing precedents.
- Knowledge is scattered across heterogeneous sources. Backend.AI documentation spans official docs and internal knowledge bases — each with different formats, quality levels, and relevance patterns.
We needed architectures that go beyond retrieve-and-generate.
HSense: Multi-Agent RAG for HS code Classification
What is HSense?
HSense is an AI-based HS code classification system developed through the XaaS Frontier Project led by the Ministry of Science and ICT in Korea. It takes text- or image-based product descriptions as input, recommends the most appropriate HS codes for the items, and provides the basis for classification grounded in relevant laws and regulations.
Why Agents?
HS code classification is a complex decision problem. A single product might fall under multiple candidate codes, and the correct classification depends on material composition, intended use, GRI rule interpretation, and framework definitions — often simultaneously.
Rather than building one monolithic prompt that tries to do everything, we decomposed the task into four specialist roles:
| Agent | Role | Tools |
|---|---|---|
| Tariff Officer (Coordinator) | Orchestrates the team, guides search direction, makes final decision | Delegates to other agents |
| GRI Rule Expert (Interpreter) | Explains classification rules, provides interpretation guidance | Rule-based reasoning |
| Framework Expert (Scholar) | Searches 2,000+ pages of text (~3M tokens) | RAG over FAISS vectorstore + SQLite FTS5 keyword search |
| HS Code Navigator (Pathfinder) | Navigates code hierarchies from 2-digit to 10-digit | SQLite database of code trees |
This mirrors how human customs specialists actually work — no single person holds all the context. The coordinator asks specific questions, specialists provide evidence, and classification emerges from structured deliberation.
Retrieval Architecture
The Framework Expert uses a dual-retrieval strategy (a code sketch follows this list):
- Semantic search via FAISS vectorstore for meaning-based queries ("products made from mixed animal and vegetable fats")
- Keyword search via SQLite FTS5 for precise lookups ("heading 1517", "subheading note 2")
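A minimal sketch of the two paths, assuming a pre-built FAISS index and an FTS5 table over the same chunks (the embedding model, file names, and table schema here are placeholders, not the production configuration):

```python
import sqlite3

from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Semantic side: FAISS index built over the framework chunks.
vectorstore = FAISS.load_local(
    "framework_index",
    HuggingFaceEmbeddings(),  # placeholder; use whatever model the index was built with
    allow_dangerous_deserialization=True,
)

def semantic_search(query: str, k: int = 5):
    """Meaning-based lookup, e.g. 'products made from mixed animal and vegetable fats'."""
    return vectorstore.similarity_search(query, k=k)

# Keyword side: SQLite FTS5 virtual table over the same chunk text.
conn = sqlite3.connect("framework.db")

def keyword_search(query: str, k: int = 5):
    """Exact-term lookup, e.g. 'heading 1517' or 'subheading note 2'."""
    rows = conn.execute(
        "SELECT chunk_text FROM framework_fts WHERE framework_fts MATCH ? LIMIT ?",
        (query, k),
    )
    return [text for (text,) in rows]
```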
Documents are converted from Korean PDFs to structured Markdown using marker, then chunked by Markdown headers using LangChain's header-based splitter. This preserves document structure better than character-count splitting.
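For the header-based chunking step, a sketch with LangChain's splitter (the header-to-level mapping is illustrative, not the exact production config):

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "chapter"),
        ("##", "heading"),
        ("###", "subheading"),
    ]
)

# tariff_framework.md stands in for the Markdown produced by marker from the Korean PDF.
with open("tariff_framework.md", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

# Each chunk carries its header path in metadata, so a retrieved chunk still
# knows which chapter and heading it belongs to.
print(chunks[0].metadata)
```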
The HS Code Navigator operates over a separate SQLite database containing the full code hierarchy — no vector search needed, just structured traversal.
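The traversal itself is plain SQL; here is a sketch against an assumed parent-child table (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect("hs_codes.db")

def children(code: str) -> list[tuple[str, str]]:
    """One level down the tree: '15' -> 4-digit headings, '1517' -> 6-digit
    subheadings, and so on until a full 10-digit code is reached."""
    rows = conn.execute(
        "SELECT code, description FROM hs_codes WHERE parent_code = ? ORDER BY code",
        (code,),
    )
    return rows.fetchall()

# Drill down from a heading toward a 10-digit code.
for code, description in children("1517"):
    print(code, description)
```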
The Agent Framework
We use Agno as the multi-agent framework. Here's a simplified view of the team definition:

```python
from agno.agent import Agent
from agno.team import Team

team = Team(
    name="HSense Classification Team",
    mode="coordinate",  # coordinator delegates to specialists
    members=[tariff_officer, gri_expert, legal_expert, hs_navigator],
    success_criteria="Provide HS code classification with supporting evidence",
    max_discussion_rounds=5,
    stream_intermediate_steps=True,
)
```

Each agent has a Pydantic-modeled output schema for structured responses, built-in reasoning mode for complex cases, and conversation memory for multi-turn sessions.
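As a rough sketch of what one such agent declaration can look like (the schema fields are invented here, and parameter names vary across Agno versions):

```python
from pydantic import BaseModel, Field
from agno.agent import Agent

class ClassificationResult(BaseModel):
    """Illustrative output schema, not the exact production one."""
    hs_code: str = Field(description="Recommended 10-digit HS code")
    rationale: str = Field(description="Why this code applies")
    legal_basis: list[str] = Field(description="GRI rules and framework clauses cited")

tariff_officer = Agent(
    name="Tariff Officer",
    role="Coordinate the specialists and issue the final classification",
    response_model=ClassificationResult,  # structured, Pydantic-validated output
    reasoning=True,                       # built-in reasoning mode for hard cases
    add_history_to_messages=True,         # conversation memory for multi-turn sessions
)
```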
Performance
On a 500-case test set sampled from the Korean HS code classification database, using Gemma-3-27B-IT with multimodal image annotations:
| Metric | Score |
|---|---|
| Top-1 Accuracy | 92.40% |
| Top-3 Accuracy | 96.40% |
| Top-5 Accuracy | 97.40% |
| Top-20 Accuracy | 98.40% |
| 2-Digit (Chapter) Accuracy | 96.60% |
| F1-Score | 86.17% |
| Avg tokens per case | 1,549 |
For context, a 2024 benchmark by Bryce Judy¹ found commercial HS code classification models ranging from 44–89% Top-1 accuracy on 10-digit codes. Our 92.4% puts HSense at the top of that range — and we're running a 27B open-weight model, not GPT-4.
Multiple chapters hit 100% Top-1 accuracy (Chapters 11, 15, 20, 23, 29, 30, 33, 34, 35, 38, 40, 60, 82, 94). The toughest chapters — where products are ambiguous or rules are particularly nuanced — still hover around 50–62.5%.
What Didn't Work
Chunking vectorstores for hierarchical documents. This is our biggest lesson. Tariff law has a tree structure — Chapters contain Headings, Headings contain Subheadings, and Subheadings have Notes that modify interpretation of everything above them. Flat vector search over chunks loses these relationships. The SQLite-based navigation approach for code hierarchies works better.
Single-agent approaches. When one agent tries to handle everything — GRI interpretation, code navigation, framework search, and final decision — it gets confused. Context windows fill up with irrelevant information, and the model can't maintain focus. Splitting responsibilities improved both accuracy and debuggability.
Unconstrained coordinator decisions. Early versions had the coordinator making wrong high-level decisions (picking the wrong chapter), causing all specialists to waste effort in the wrong direction. We added backtracking capability: if evidence from specialists contradicts the initial direction, the coordinator can reset and explore alternatives.
Backend.AI RAG Assistant
Introduction
The Backend.AI RAG Assistant was developed to provide systematic support for Backend.AI customers by addressing their questions and requests. It answers user queries based on a GitHub repository containing Backend.AI documentation, along with part of Lablup’s internal knowledge base.
Architecture
The assistant answers customer questions by searching across seven documentation projects:
```
User Query
→ RequestClassifier (GPT-4.1-mini, multi-label routing)
→ VectorDBManager (7 FAISS indices, k=10 per project)
→ Confidence Filter (L2 > 1.5 removed)
→ Fusion Date Re-ranking (similarity-primary, date as tiebreaker)
→ Global Top-15 Selection
→ RAGManager (Qwen3.5-35B-A3B, LangChain v0.3, streaming)
→ Response
```
The RequestClassifier routes queries to relevant projects before retrieval. A question about "how to create a compute session" should search webui and backendai docs, not enterprise-guide or realworld_data_support. Classification uses GPT-4.1-mini with structured output parsing across 8 categories (7 documentation projects and one conversation category), incorporating the last 3 conversation messages for context.
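A minimal sketch of that routing step using LangChain structured output (the category list is abbreviated and the prompt is simplified):

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

CATEGORIES = [
    "backendai", "webui", "enterprise-guide", "realworld_data_support",
    # ...the remaining documentation projects, plus "conversation"
]

class RouteDecision(BaseModel):
    categories: list[str] = Field(description="Every category whose index should be searched")

router = ChatOpenAI(model="gpt-4.1-mini").with_structured_output(RouteDecision)

def classify(query: str, history: list[str]) -> list[str]:
    # The last 3 conversation messages keep follow-up questions routed correctly.
    context = "\n".join(history[-3:])
    decision = router.invoke(
        f"Conversation so far:\n{context}\n\nUser query: {query}\n\n"
        f"Choose the relevant categories from: {CATEGORIES}"
    )
    return decision.categories
```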
The Confidence Filter removes chunks with L2 distance > 1.5 — these are low-relevance results that would dilute the context. Fusion Date Re-ranking uses similarity as the primary signal and chunk date as a tiebreaker: when two chunks have similar L2 scores, the more recent one ranks higher.
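Conceptually, the filter-and-rank step looks like the sketch below; the exact tie-breaking rule in production may differ (here, L2 scores are bucketed to one decimal to decide what counts as "similar"):

```python
from datetime import date

def fuse(chunks: list[dict], k: int = 15) -> list[dict]:
    """chunks come from all selected project indices (k=10 each) with
    illustrative fields: {'text': str, 'l2': float, 'date': datetime.date}."""
    # Confidence filter: drop low-relevance hits.
    kept = [c for c in chunks if c["l2"] <= 1.5]
    # Similarity first; among near-ties, the more recent chunk wins.
    kept.sort(key=lambda c: (round(c["l2"], 1), -c["date"].toordinal()))
    return kept[:k]  # global top-15 goes into the generation context

ranked = fuse([
    {"text": "older fix", "l2": 0.82, "date": date(2023, 5, 1)},
    {"text": "newer fix", "l2": 0.84, "date": date(2025, 1, 10)},
    {"text": "off-topic", "l2": 1.70, "date": date(2025, 2, 1)},
])
# -> "newer fix" outranks "older fix" (same similarity bucket, later date);
#    "off-topic" is removed by the confidence filter.
```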
The anti-hallucination system prompt enforces strict context-only answers: the model must never fabricate IPs, ports, paths, server names, or commands. When multiple solutions exist across different dates, it presents the most recent first as the primary recommendation, then lists alternatives with their dates.
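A rough paraphrase of those constraints as a system prompt (not the exact production wording):

```python
SYSTEM_PROMPT = """You are a Backend.AI support assistant.
Answer only from the provided context chunks.
Never fabricate IP addresses, ports, file paths, server names, or commands.
When the context contains multiple dated solutions for the same problem,
present the most recent one first as the primary recommendation, then list
the alternatives with their dates.
If the context does not contain the answer, say so instead of guessing."""
```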
Evaluation Results
We evaluated the pipeline on 100 quality-filtered technical Q&A test cases sampled from 25K real-world data. The generation model is Qwen3.5-35B-A3B, served on Backend.AI infrastructure. Two retrieval modes were compared:
| Metric | Sample Data | All Datasets |
|---|---|---|
| Relevance | 0.809 | 0.824 |
| Informativeness | 0.775 | 0.792 |
| Info Security | 0.950 | 0.960 |
| Usability | 0.766 | 0.778 |
| Verbosity | 0.847 | 0.845 |
| SemScore | 0.703 | 0.704 |
| Overall Score | 0.813 | 0.823 |
| Retrieval top-1 hit rate | 45% | 45% |
| Retrieval top-3 hit rate | 60% | 59% |
| Avg Response Time | 3.02s | 3.32s |
All-datasets mode outperforms sample-data mode on nearly every metric, confirming that cross-project context adds value. Information Security scores 0.95+ — the anti-hallucination system prompt effectively prevents PII leakage and fabricated specifics.
Evaluation Methodology
We use LLM-as-judge evaluation with a single structured-output call that scores the metrics simultaneously — over an order of magnitude faster than the previous per-metric DeepEval GEval approach (~90 min vs ~28 hours for 200 evaluations).
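A minimal sketch of that single judge call, assuming LangChain structured output; the judge model here is a placeholder, and SemScore is computed separately from embeddings as described below:

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class JudgeScores(BaseModel):
    """LLM-judged metrics from one call; SemScore is added afterwards."""
    relevance: float = Field(ge=0.0, le=1.0)
    informativeness: float = Field(ge=0.0, le=1.0)
    info_security: float = Field(ge=0.0, le=1.0)
    usability: float = Field(ge=0.0, le=1.0)
    verbosity: float = Field(ge=0.0, le=1.0)
    rationale: str

judge = ChatOpenAI(model="gpt-4.1-mini").with_structured_output(JudgeScores)  # placeholder model

def judge_case(question: str, expected: str, generated: str) -> JudgeScores:
    return judge.invoke(
        "Would a DevOps engineer receiving this response be able to solve the problem?\n"
        f"Question: {question}\n\nExpected answer: {expected}\n\nGenerated answer: {generated}"
    )
```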
Metrics (all 0.0–1.0):
- Relevance: Does the response technically solve the problem? Accepts alternative valid approaches, not just exact-match to expected answer.
- Informativeness: Completeness and depth of information provided.
- Information Security: Checks for PII leakage — customer names, IPs, account IDs, internal infrastructure details.
- Usability: Whether advice is practical, actionable, and applicable in real-world environments.
- Verbosity: Response length appropriateness — penalizes both too short and too long.
- SemScore: Cosine similarity between generated and expected answers, computed with text-embedding-3-small embeddings. Objective and reproducible — no LLM judge variance.
The Overall Score is a weighted average: Relevance (0.25), Info Security (0.20), Informativeness (0.15), Usability (0.15), SemScore (0.15), Verbosity (0.10).
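The aggregation itself is a one-liner; with the all-datasets numbers from the table above, it reproduces the reported Overall Score within rounding (metric key names are mine):

```python
WEIGHTS = {
    "relevance": 0.25,
    "info_security": 0.20,
    "informativeness": 0.15,
    "usability": 0.15,
    "semscore": 0.15,
    "verbosity": 0.10,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average over the six 0.0-1.0 metrics."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

print(overall_score({
    "relevance": 0.824, "info_security": 0.960, "informativeness": 0.792,
    "usability": 0.778, "semscore": 0.704, "verbosity": 0.845,
}))  # ~0.82, consistent with the 0.823 reported above
```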
Test cases go through a two-stage quality filter: tag pre-filter (25K → 19K) then LLM classifier (500 → 323 valid → 100 sampled). Exact-match source chunks are removed after retrieval to prevent data leakage. The judge evaluates: "Would a DevOps engineer receiving this response be able to solve the problem?"
Shared Lessons
- Retrieval is the weakest link
Only 45% of test cases have the correct source document as the top-1 retrieval result (60% for top-3). This means that more than half the time, the generation model works with suboptimal context. Improving retrieval through hybrid BM25 + semantic search, cross-encoder re-ranking, or better chunking will have a higher impact than switching models; a minimal hybrid-retrieval sketch follows below. Retrieval quality deserves more engineering attention than model selection.
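One direction for closing that gap (listed in the roadmap below) is hybrid retrieval. A sketch with LangChain's ensemble retriever, assuming the chunk list and FAISS index already exist:

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

def build_hybrid_retriever(docs: list[Document], vectorstore: FAISS) -> EnsembleRetriever:
    bm25 = BM25Retriever.from_documents(docs, k=10)               # keyword signal (needs rank_bm25)
    semantic = vectorstore.as_retriever(search_kwargs={"k": 10})  # existing FAISS signal
    # Reciprocal rank fusion across the two ranked lists; weights are a starting guess.
    return EnsembleRetriever(retrievers=[bm25, semantic], weights=[0.4, 0.6])

# hybrid = build_hybrid_retriever(docs, vectorstore)
# results = hybrid.invoke("how to create a compute session")
```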
- The judge prompt matters more than the generation model
Our single largest accuracy improvement (+0.163) came not from changing the generation model or retrieval parameters, but from fixing the judge prompt. The previous judge penalized technically valid alternative answers because they didn't match the expected text exactly. The new judge asks: "Would a DevOps engineer solve the problem with this response?" — accepting registry-1.docker.io as equivalent to docker.io when both solve Docker layer download issues.
- Structured retrieval beats vector search for structured data
HSense's HS code navigator uses SQLite, not FAISS. The Backend.AI classifier uses an LLM, not embeddings. When your data has known structure — hierarchies, categories, metadata fields — use that structure directly. Vector search is a fallback for unstructured content, not a universal solution.
- Date-aware retrieval changes response quality
When chunks from different time periods cover the same topic, the model can present outdated solutions as primary recommendations. Fusion date re-ranking and date metadata in context headers let the model label solutions with dates and recommend the most recent approach first. In 51% of test cases, the model presents multiple solution approaches with date context.
- Multi-agent systems need observability
When one agent makes a wrong decision, the entire team can spiral. HSense's coordinator once picked the wrong product chapter, causing three specialists to generate confident but wrong evidence. Without step-by-step logging and the ability to trace decision chains, these failures are invisible.
- Open-weight models are production-ready
Qwen3.5-35B-A3B achieves 0.823 Overall Score served entirely on our own GPU infrastructure with 3-second response times. For domain-specific RAG — where the retrieval context does most of the heavy lifting — the gap between open-weight and proprietary models is small and shrinking.
What's Next
For HSense:
- Replacing flat vectorstore retrieval with hierarchy-aware search for tariff documents
- Prompt optimization with automated A/B testing and feedback loops
- Adding a product case RAG pipeline to supplement rule-based classification with historical precedents
For the Backend.AI assistant:
- Fixing the chunking pipeline (proper boundary detection, index rebuild)
- Hybrid BM25 + semantic search to improve the 45% top-1 retrieval hit rate
- Cross-encoder re-ranking for more precise top-k selection
- Expanding evaluation to cover troubleshooting and reasoning queries
- Implementing RAGAS metrics (Faithfulness, Context Precision/Recall) alongside current G-Eval
- Human evaluation via our Streamlit-based A/B testing platform
Both systems run on Backend.AI infrastructure, which means GPU allocation, model serving, and scaling are handled by the same platform we're building RAG for. There's a certain satisfaction in using your own product to improve your own product.
Beyond what's covered in this post, we also explore synthetic data generation and iterative refinement through domain expert feedback in a separate talk.
If you're interested in Backend.AI or building RAG systems on GPU infrastructure, check out backend.ai or reach out to us at info@lablup.com.
Footnotes
1. [Benchmarking Harmonized Tariff Schedule Classification Models](https://arxiv.org/abs/2412.14179), arXiv ↩