Engineering

Jun 16, 2026

Agent coding at long context: What KV cache offloading on VAST Data & Backend.AI buys you

  • Jinho Heo, Technical Writer

  • Kyujin Cho, Software Engineer

  • Anat Heilper, AI Architect Director @VAST Data

This article was co-authored by Lablup and VAST Data based on benchmark validation tests conducted together.

In an agent-coding session, multi-turn coding assistants reuse the same long base context turn after turn, and time-to-first-token (TTFT) is the cost users actually feel. The trouble starts when that base context grows into the hundreds of thousands of tokens: even when the same prefix is reused over and over, each turn tends to pay almost the full prefill cost again. Once the GPU's KV cache fills up, blocks from earlier turns keep getting evicted, and whatever is evicted has to be recomputed from scratch. So a workload that looks repetitive on paper, and ought to get faster for it, sees its TTFT settle at a baseline that won't drop.

How to handle workloads with long contexts

There are a few ways to address this. One is to give the GPU more memory: with a larger memory package, KV blocks wouldn't be evicted, recomputation would drop, and TTFT would fall along with overall wall-clock. But this isn't a real option for most teams, since GPUs with that much memory cost tens of thousands of dollars each and there is a hard physical ceiling on how much a memory package can carry. The second option is KV cache offloading: moving KV cache blocks out to external storage and pulling them back when they are needed.

KV cache offloading doesn't fight eviction. It lets blocks spill out of memory and makes them reusable from storage instead of recomputing them on every call. That is far more practical than buying bigger GPUs, but it carries one condition: on a regular NFS share, storage is slow enough that loading an offloaded block can cost more than recomputing it on the GPU. Using KV cache offloading effectively means combining RDMA, GPUDirect Storage, and a high-performance data platform, and whether tail latency improves depends on how that configuration is built.

Why agent coding defeats GPU-only KV cache

A chatbot prompt typically shares almost nothing across users, and an enterprise RAG service typically shares a single long prefix across most requests. Agent coding doesn't fit either pattern, and that middle position is exactly where GPU-only KV cache breaks down. Each agent session pins a different long prefix, and a long-running session takes many turns against that prefix.

Without offloading, two pressures collide. First, any single long-context prefix is large enough to consume a meaningful slice of HBM, since the KV cache grows in proportion to context length and sits on top of the model weights. Second, cycling through several such sessions guarantees eviction. By the time the same agent comes back for its next turn, its KV blocks have been pushed out by another agent's context, and the engine has to redo prefill from scratch.
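
To make the proportionality concrete, here is a minimal sketch of the usual per-sequence KV cache size estimate for a transformer with grouped-query attention. The layer count, KV head count, and head dimension below are hypothetical placeholders chosen for illustration; they are not published figures for Mistral Medium 3.5, and the helper name is ours.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache bytes for one sequence.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


# HYPOTHETICAL architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for tokens in (35_000, 70_000, 140_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Whatever the real architecture values are, the estimate is linear in the token count: doubling the context doubles the KV footprint, which is why a handful of 140K-token sessions can crowd each other out of HBM.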

KV cache offloading delivers its sharpest payoff on the workload shape that GPU-only KV cache handles worst: a heavy agent-coding loop that cycles through many long-context tasks against a single inference server. On that shape, wall-clock nearly halves and average TTFT more than halves, with the more useful number living in the shape of the TTFT distribution rather than the average alone. The measurements come from a Lablup and VAST Data benchmark on Backend.AI, with a VAST KV Cache VFolder mounted into the inference session.

Setup

Lablup and VAST Data verified the effectiveness of KV cache offloading by mounting the KV Cache VFolder of VAST AI OS v5.4 into an inference session in a Backend.AI 26.4 environment.

The benchmark ran inside a Backend.AI session on 8 H100 GPUs, with vLLM 0.20.0 and LMCache, the KV cache layer that handles the offload path, packaged in the vllm-openai:0.20.0-cuda12.9-ubuntu22.04 container image. The model was Mistral Medium 3.5 128B. The base context for every workload was a single 140K-token chunk of the Backend.AI source code.

The KV cache backend was a VAST KV Cache VFolder mounted into the session through Backend.AI's storage proxy. LMCache's gds_path setting pointed at the VAST mount, with an NVIDIA GPUDirect Storage (GDS) cuFile buffer (LMCACHE_CUFILE_BUFFER_SIZE) of 24 GiB per GPU, large enough to stage a full maximum-context KV cache while staying below vLLM's HBM allocation. From there, GDS handled the actual block movement, opening a direct path between GPU memory and the VAST cluster over the 100 Gbps fabric without staging through host RAM. No additional configuration was needed on the storage side; any VAST folder type works as a cache target.

Workload design

The benchmark cycles a single inference server through ten distinct agent contexts, with five turns per context. The engine takes the first turn against each context in order, then loops back for second turns, and continues. By the time any given context returns for its next turn, nine other long-context sessions have passed through the GPU, representing the worst-case ordering for GPU-resident KV cache.
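
The following sketch illustrates why this ordering is the worst case for a GPU-resident cache. It is a toy model under stated assumptions: it treats each context as a single cacheable unit and assumes least-recently-used (LRU) eviction, whereas a real engine manages fixed-size blocks and its eviction policy may differ. The function names are ours.

```python
from collections import OrderedDict


def round_robin_schedule(num_contexts=10, turns_per_context=5):
    """Turn 1 for every context in order, then turn 2 for every context, and so on."""
    return [ctx for _ in range(turns_per_context) for ctx in range(num_contexts)]


def count_lru_hits(schedule, capacity):
    """Count requests whose context is still resident in an LRU cache
    that can hold `capacity` whole contexts."""
    cache = OrderedDict()
    hits = 0
    for ctx in schedule:
        if ctx in cache:
            hits += 1
            cache.move_to_end(ctx)         # mark as most recently used
        else:
            cache[ctx] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used context
    return hits


schedule = round_robin_schedule()
for capacity in (1, 3, 9, 10):
    print(f"capacity={capacity:>2} contexts -> {count_lru_hits(schedule, capacity)} hits "
          f"out of {len(schedule)} requests")
```

In this model, any capacity below ten contexts yields zero hits, because the context that is about to return is always the one that was evicted longest ago. Only a cache that holds all ten contexts at once reuses anything.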

Benchmark result

[Figure: vast_offloading_combined.svg]

| Metric | Without Offloading | With Offloading | Effect |
| --- | --- | --- | --- |
| Total wall-clock | 1177.97 s | 601.41 s | 1.96× faster |
| Average TTFT | 22,104 ms | 10,573 ms | 2.09× faster |
| TPOT (per-token decode) | 14.5 ms | 14.5 ms | unchanged |

Total wall-clock and average TTFT both improve by roughly 2×, while TPOT stays identical at 14.5 ms. The gain is entirely in getting to the first token; once generation starts, the GPU runs at the same rate either way. That asymmetry is the signature of a prefill-stage intervention, not a decode-stage one. Even so, the average TTFT figure hides the more important pattern, which only shows up when first-visit turns are separated from subsequent reuses against the same context.
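
As a rough consistency check on that claim, the sketch below subtracts the total time spent waiting for first tokens from the wall-clock time in each run. It assumes the 50 turns (10 contexts × 5 turns) were served one at a time, so that wall-clock is approximately total TTFT plus everything else; that serialization is our assumption, not something the benchmark description states.

```python
TURNS = 10 * 5  # 10 contexts x 5 turns per context


def non_ttft_seconds(wall_clock_s, avg_ttft_ms, turns=TURNS):
    """Wall-clock time left over after subtracting the summed TTFT of all turns."""
    return wall_clock_s - turns * avg_ttft_ms / 1000.0


without_offload = non_ttft_seconds(1177.97, 22_104)
with_offload = non_ttft_seconds(601.41, 10_573)

print(f"non-TTFT time without offloading: {without_offload:.2f} s")
print(f"non-TTFT time with offloading:    {with_offload:.2f} s")
print(f"wall-clock speedup: {1177.97 / 601.41:.2f}x, TTFT speedup: {22_104 / 10_573:.2f}x")
```

Under that assumption, the time not spent on TTFT is nearly identical in the two runs, which is what one would expect if offloading changes only the prefill stage.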

[Figure: vast_ttft_by_turn.svg]

| TTFT by Turn Type | Without Offloading | With Offloading | Effect |
| --- | --- | --- | --- |
| First turn per context (cold) | ~22.3 s | ~26.3 s | ~4 s offload penalty |
| Subsequent turns per context (warm reuse) | ~22.0 s | ~6.6 s | ~3.3× faster |

As the table shows, the first turn for each context costs more than the baseline with offloading disabled, because the engine computes prefill and writes the resulting KV blocks out to VAST in the same pass. Once those blocks are in storage, every later turn against the same context reads them back instead of recomputing, and TTFT collapses into a tight band at roughly 30% of the GPU-only baseline. For the end user, that means generation starts after about six and a half seconds rather than twenty-two.
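
The per-turn-type figures can be tied back to the reported averages with a simple weighted mean. The sketch below uses the rounded values from the table, with 10 cold turns and 40 warm turns, so it only approximately reproduces the reported 22,104 ms and 10,573 ms. It also shows how quickly the cold-turn penalty is repaid within one session. The function names are ours.

```python
CONTEXTS, TURNS_PER_CONTEXT = 10, 5


def mean_ttft(cold_s, warm_s):
    """Weighted mean TTFT over one cold turn and four warm turns per context."""
    cold_turns = CONTEXTS
    warm_turns = CONTEXTS * (TURNS_PER_CONTEXT - 1)
    return (cold_turns * cold_s + warm_turns * warm_s) / (cold_turns + warm_turns)


def cumulative_ttft(cold_s, warm_s, turns):
    """Total TTFT one session has paid after `turns` turns."""
    return cold_s + (turns - 1) * warm_s


print(f"mean TTFT without offloading: {mean_ttft(22.3, 22.0):.2f} s")
print(f"mean TTFT with offloading:    {mean_ttft(26.3, 6.6):.2f} s")

for turns in range(1, TURNS_PER_CONTEXT + 1):
    base = cumulative_ttft(22.3, 22.0, turns)
    offl = cumulative_ttft(26.3, 6.6, turns)
    print(f"after turn {turns}: baseline {base:.1f} s, offloading {offl:.1f} s")
```

With these numbers, a session is behind by about 4 seconds after its first turn and already ahead after its first reuse, since each warm turn saves roughly 15 seconds.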

The one number worth internalizing before deploying this is the LMCache cuFile buffer (LMCACHE_CUFILE_BUFFER_SIZE), which for Mistral Medium 3.5 is 24 GiB per GPU. GDS needs that pinned HBM region to stage blocks between GPU memory and VAST, and it lives alongside vLLM's own allocator. On an 80 GB H100 where weights (~16 GB) and overhead (~3 GB) already claim ~19 GB, vLLM at 0.9 utilization leaves a resident KV pool near 53 GB, so a 24 GiB buffer is roughly 45% of it. What that figure measures, though, is staging capacity, not cache capacity. Cached prefixes no longer have to fit in HBM at all; they live on VAST and stream back at line rate. The buffer is the price of converting an HBM-bound cache into a storage-backed one, and what it buys is the ability to keep far more sessions live at once than any resident-only pool could hold. The wall-clock result above, finishing faster with a smaller resident cache, is the visible edge of that trade: less of the GPU's time goes to prefill that offload makes unnecessary.
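
The HBM arithmetic in the paragraph above can be sketched as follows. All inputs are the approximate figures quoted in the text. One caveat is ours: the text mixes GB and GiB, and the "roughly 45%" figure matches reading the 24 GiB buffer against a 53 GB pool without unit conversion. Converting 24 GiB to decimal GB first gives a somewhat larger share, so treat either ratio as approximate.

```python
GIB = 2**30
GB = 10**9


def resident_kv_pool_gb(hbm_gb=80, utilization=0.9, weights_gb=16, overhead_gb=3):
    """HBM left for vLLM's resident KV pool after weights and overhead (decimal GB)."""
    return hbm_gb * utilization - weights_gb - overhead_gb


pool_gb = resident_kv_pool_gb()
buffer_gib = 24
buffer_gb = buffer_gib * GIB / GB  # the same buffer expressed in decimal GB

print(f"resident KV pool:             ~{pool_gb:.0f} GB")
print(f"buffer share, units as given: {buffer_gib / pool_gb:.0%}")
print(f"buffer share, GiB converted:  {buffer_gb / pool_gb:.0%}")
```

Either way, the buffer is a large but fixed cost in HBM, while the amount of cached prefix that can live on storage is no longer bounded by HBM at all.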

Note: Enabling KV cache offloading introduces several seconds of additional latency on the first turn of a vLLM session. This latency was observed consistently across storage media, indicating that the root cause lies in the LMCache implementation rather than the underlying hardware configuration. Until a fix is available, this behavior should be documented as a known limitation of LMCache.

Achieving line-rate read throughput with VAST & Backend.AI

Against a 100 Gbps fabric with a theoretical ceiling of 12.5 GB/s, the VAST mount peaked at 10,693 MB/s of read throughput and 10,441 read ops/s, about 86% of the theoretical ceiling. In practice, the cache reads run close to line rate. Average read round-trip time stayed at 8.265 ms. At these numbers, the storage path is not the bottleneck.

This result directly satisfies a fundamental condition for KV cache offloading to outperform recomputation: storage must be fast enough to load the cache before the model would finish recomputing it from scratch. On a standard TCP NFS share, that condition fails. In this benchmark, NFS over a 100 Gbps fabric, GPUDirect Storage, and the VAST DASE (Disaggregated Shared-Everything) architecture do the heavy lifting on the storage side. Connecting that stack to the model runtime is where Backend.AI completes the pipeline and makes KV cache offloading the faster path.
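
The break-even condition can be written down directly. The sketch below is a simplification: it models a load as size divided by sustained throughput plus one round trip, and ignores overlap, parallel streams, and engine overhead. The measured throughput, round-trip time, and ~22 s recompute time come from the results above. The 40 GB KV size and the 1,000 MB/s "slow share" throughput are hypothetical values chosen for illustration, not measurements.

```python
def line_rate_fraction(measured_mb_s, link_gbps):
    """Measured throughput as a fraction of the link's theoretical ceiling."""
    ceiling_mb_s = link_gbps * 1000 / 8  # 100 Gbps -> 12,500 MB/s
    return measured_mb_s / ceiling_mb_s


def load_seconds(kv_bytes, throughput_mb_s, round_trip_ms=0.0):
    """Time to read a KV cache of `kv_bytes` at a sustained throughput."""
    return kv_bytes / (throughput_mb_s * 1e6) + round_trip_ms / 1000.0


def offload_wins(kv_bytes, throughput_mb_s, recompute_s, round_trip_ms=0.0):
    """True when loading from storage finishes before recomputation would."""
    return load_seconds(kv_bytes, throughput_mb_s, round_trip_ms) < recompute_s


KV_BYTES = 40e9     # HYPOTHETICAL KV cache size for one long context
RECOMPUTE_S = 22.0  # approximate full-prefill TTFT from the benchmark table

print(f"line-rate fraction: {line_rate_fraction(10_693, 100):.1%}")
for label, mb_s in (("measured VAST mount", 10_693), ("hypothetical slow share", 1_000)):
    t = load_seconds(KV_BYTES, mb_s, round_trip_ms=8.265)
    print(f"{label}: load {t:.1f} s, offload wins: "
          f"{offload_wins(KV_BYTES, mb_s, RECOMPUTE_S, 8.265)}")
```

The point of the comparison is that the same cache size flips from a clear win to a loss purely as a function of storage throughput, which is why the storage path determines whether offloading helps.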

Making this the right configuration

KV cache offloading on VAST pays off where several long, distinct prefixes cycle through the same inference server. The clearest production case is a heavy agent coding workflow: each agent carries its own system prompt, toolchain context, and in-progress code state, and those prefixes rotate continuously across requests. In this setup, offloading materially reduces TTFT by serving cached prefixes from storage rather than recomputing them on each call.

Other strong fits include long-context RAG systems where the retrieved document set rotates across requests, and multi-tenant coding-assistant services where each user or team pins a distinct context. A general-purpose chatbot with mostly one-shot prompts gets nothing from this configuration. The same is true for a RAG service built around a single long shared prefix that never gets evicted, because vLLM's own prefix caching already handles that case.


The configuration described above runs on Backend.AI for compute orchestration and VAST Data for the storage tier. The performance results reflect what the two deliver jointly. To reproduce this benchmark or evaluate it against your own agent workload, contact Lablup.

Reproduction summary

  • Inference engine: vLLM 0.20.0 + LMCache
  • Container image: vllm-openai:0.20.0-cuda12.9-ubuntu22.04
  • Model: Mistral Medium 3.5 128B
  • GPUs: 8× NVIDIA H100
  • Base context: 140K-token chunk of the Backend.AI source repository
  • Cache backend: VAST KV Cache VFolder mounted via Backend.AI; LMCache gds_path set to the VAST mount; GDS cuFile buffer (LMCACHE_CUFILE_BUFFER_SIZE) of 24 GiB per GPU, sized to fit one maximum-context KV cache while staying below vLLM's HBM allocation
  • Network: 100 Gbps fabric between GPU node and VAST Data cluster
  • Workload: Single inference server cycling through 10 distinct agent contexts, 5 turns per context
