Engineering

Apr 27, 2026

How to save GPU memory in LLM serving: Principles and operating conditions of KV cache offloading

  • Kyujin Cho, Software Engineer

  • Jinho Heo, Technical Writer

In LLM serving for agentic AI, context length is one of the variables that decides GPU memory headroom. As AI takes on more complex work, multi-turn conversations and agentic sessions stay open longer, with context piling up into the tens of thousands of tokens. Under that load, the element that consumes GPU memory fastest is the KV cache. The KV cache stores the key and value* tensors generated during inference so they can be reused for the next token, and unlike model weights it grows in proportion to the number of users and the context length. Open a 128K context window on Llama 3.1 70B and a single user's KV cache reaches 40GB[1], around half of an H100's 80GB of HBM.

  • Key is the vector matched against the query; Value is the vector carrying the actual information.
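
That 40GB figure is easy to reproduce with a back-of-the-envelope calculation. The sketch below plugs in Llama 3.1 70B's published architecture (80 layers, 8 KV heads under GQA, head dimension 128) and FP16 storage; swap in other numbers for other models.

```python
# Back-of-the-envelope KV cache size for Llama 3.1 70B (GQA, FP16).
# Architecture numbers come from the public model card; adjust for other models.
num_layers = 80          # transformer layers
num_kv_heads = 8         # KV heads under grouped-query attention
head_dim = 128           # dimension per head
bytes_per_elem = 2       # FP16
context_len = 128 * 1024 # 128K-token context window

# 2x for the K and V tensors, per layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
total_bytes = kv_bytes_per_token * context_len

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")    # ~320 KiB
print(f"{total_bytes / 1024**3:.0f} GiB at 128K context")  # ~40 GiB
```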

What if we moved the KV cache outside GPU memory? That question motivates KV cache offloading. The technique is being pushed forward across the industry through vLLM and LMCache integration, NVIDIA's Dynamo platform, and GPU-to-storage direct paths from vendors like VAST Data.

The payoff of KV cache offloading swings widely with conditions. Under the right setup it pays off; under the wrong one, it can be slower than not offloading at all. This article walks through what the KV cache is, how offloading actually moves data, and the conditions under which it helps.

How the KV Cache Is Built and Why It Pressures Memory

Inference with a Transformer-based LLM happens in two phases. The first is prefill. In this phase, the engine produces K/V vectors from the entire prompt the user submitted and stores them in the KV cache. Prefill computation parallelizes well, so the absolute compute power of the GPU cores has the biggest effect on this phase. Once prefill is done, the engine moves into decode. Decode generates each subsequent token from the KV cache built up from the prompt. Because decode reads the KV cache on every step, fast memory I/O determines how quickly the engine can churn out tokens. The asymmetry in resource demand between prefill and decode is exactly what motivates Prefill-Decode Disaggregation, where prefill-optimized and decode-optimized accelerators are paired together.
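
To make the asymmetry concrete, here is a stripped-down sketch of the two phases. The `model` interface and function names are illustrative, not an actual engine API: prefill runs one wide forward pass over the whole prompt, while decode feeds a single token per step and reads everything else from the cache.

```python
import torch

def prefill(model, prompt_ids):
    """Process the whole prompt at once; raw compute is the bottleneck."""
    # One forward pass over all prompt tokens builds K/V for every layer.
    logits, kv_cache = model(prompt_ids, past_kv=None)
    return logits[:, -1], kv_cache

def decode(model, first_logits, kv_cache, max_new_tokens):
    """Generate one token at a time; KV cache reads dominate."""
    token = first_logits.argmax(dim=-1, keepdim=True)
    output = [token]
    for _ in range(max_new_tokens - 1):
        # Only the newest token is fed in; everything else comes from the cache,
        # which is why memory bandwidth, not FLOPS, bounds this phase.
        logits, kv_cache = model(token, past_kv=kv_cache)
        token = logits[:, -1].argmax(dim=-1, keepdim=True)
        output.append(token)
    return torch.cat(output, dim=-1), kv_cache
```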

In an idealized environment, you'd hope each user finished a question and closed their session right away. Real usage doesn't look like that. Users juggle several conversations, picking up older threads at random and switching contexts mid-session. If you only have GPU memory to work with, the KV caches of inactive sessions sit on the GPU consuming space, and memory pressure grows steadily as more users overlap on the same machine. When the GPU finally runs out, the oldest KV blocks get evicted and recomputed, which wastes compute. That is the cost KV cache offloading is meant to address. The next question is how it actually moves the blocks.

Data Movement Paths

KV cache offloading splits into two flows: cache writes and cache reads. The whole technique only works if external KV cache can be loaded faster than the GPU can recompute it, which makes the read and write speed of the external store the deciding factor.

The most optimized read path bypasses CPU memory entirely. NVIDIA Magnum IO GPUDirect Storage (GDS) creates a direct data path between GPU and storage without involving the CPU[2]. GDS works across both local and remote storage backends. On local NVMe SSDs it uses PCIe peer-to-peer transfers; on remote storage it works over NVMe-over-Fabrics, NFS-over-RDMA, or RDMA-capable distributed file systems like WEKA, VAST Data, and DDN Exascaler. In every case, KV data lands directly in GPU memory from storage, freeing host RAM and CPU cycles from the transfer.

Token-Hash Cache Identification

To minimize data movement, the engine has to find and reuse the right KV blocks. vLLM and LMCache identify cache blocks by hashing token sequences. Rather than storing the entire prompt as one chunk, they split it into fixed-size blocks of tokens and compute a hash from each block's token-ID sequence. That hash becomes the block's identifier.

When a new inference request arrives, the engine first converts the prompt into a token-ID sequence, splits it into blocks, and computes the hash of each block. The engine then searches GPU memory first, falls back to CPU memory, and finally to external storage. Because the hash depends only on the token content, the cache block map doesn't need to be persistent. Multiple inference engine instances on different hosts don't need any coordination, and even if a storage system uses lifecycle policies for time-based expiration, deleted blocks just appear as cache misses. That's why heterogeneous backends like S3, Redis, and local disk can sit behind one interface.
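
A simplified sketch of that scheme is below. The block size and the chained hash (each block's ID folds in the previous block's hash, so identity depends on the full prefix) mirror the general approach; the exact formats vLLM and LMCache use differ in detail.

```python
import hashlib
from typing import Optional

BLOCK_SIZE = 256  # tokens per cache block; illustrative value

def block_hashes(token_ids: list[int]) -> list[str]:
    """Derive a content-based ID for each full block of the prompt.

    Chaining in the previous block's hash makes the ID depend on the whole
    prefix, so two prompts share a block only if they share everything before it.
    """
    hashes: list[str] = []
    prev: Optional[str] = None
    for start in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        h = hashlib.sha256()
        if prev is not None:
            h.update(prev.encode())
        h.update(b",".join(str(t).encode() for t in block))
        prev = h.hexdigest()
        hashes.append(prev)
    return hashes

def lookup(hashes: list[str], gpu: dict, cpu: dict, storage: dict) -> list[str]:
    """Tiered lookup: GPU memory first, then CPU memory, then external storage."""
    tiers = []
    for h in hashes:
        if h in gpu:
            tiers.append("gpu")
        elif h in cpu:
            tiers.append("cpu")
        elif h in storage:
            tiers.append("storage")
        else:
            tiers.append("miss")  # recompute this block during prefill
    return tiers
```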

The other piece is multi-user reuse. If two different users send the same system prompt, or include the same RAG document inside their prompts, the matching hashes let both reuse the cached KV blocks. In a RAG workload where the same retrieved document gets embedded in many prompts, the chunks covering that document match and the engine skips prefill for them. That cross-user reuse is what makes the system efficient at scale.

Prefill Stage: The Cost Curve of Loading vs. Recomputing

With the structure clear, the next question is whether loading KV cache from outside actually beats recomputing it on the GPU. The answer depends on workload and hardware, and several recent studies map out the territory. The paper "Compute Or Load KV Cache? Why Not Both?" (arXiv:2410.03065, 2024)[3] argues that prefix caching reduces GPU compute, but doesn't always produce the fastest time to first token (TTFT). Pulling KV cache from external storage means moving data, and the transfer can take longer than expected. With low-bandwidth storage like a SATA SSD or HDD, reading the KV cache alone takes seconds, and routing through network-attached remote storage stretches it further. On the other side, recomputing prefill on the GPU isn't free either. With a 70B-class model and a long context, regenerating the KV cache can cost tens of seconds. There is a crossover where "load time" and "recompute time" converge, and which side of that line a deployment lands on depends on the actual environment.
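
The crossover is easy to put rough numbers on. The sketch below reuses the paper's figures for a 70B model at a 72K-token context (roughly a 25GB cache and about 30 seconds of prefill) and compares them against a few illustrative, not measured, storage bandwidths:

```python
# Load-vs-recompute comparison using rough figures from arXiv:2410.03065:
# a 70B model at ~72K tokens of context produces ~25GB of KV cache and takes
# ~30s to prefill. The bandwidth tiers below are illustrative, not measured.
kv_cache_gb = 25
recompute_s = 30

bandwidth_gbps = {            # effective GB/s of the external KV store
    "HDD / SATA SSD": 0.5,
    "NFS over TCP (10GbE)": 0.8,
    "local NVMe": 3.0,
    "NFS-over-RDMA (400Gb)": 40.0,
}

for tier, bw in bandwidth_gbps.items():
    load_s = kv_cache_gb / bw
    verdict = "load wins" if load_s < recompute_s else "recompute wins"
    print(f"{tier:24s} {load_s:6.1f}s to load vs {recompute_s}s to recompute -> {verdict}")
```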

Two of the variables that decide the winning side are context length and model size. With short input and a small model, the GPU finishes prefill quickly, so going outside for KV cache can take longer than just regenerating it. In that regime, offloading doesn't help and may slow things down. Place the KV cache in a relatively fast tier like CPU memory, and loading wins in most cases. Reports from LMCache and vLLM benchmarks[4] show 3 to 10x latency reductions for the same reason. Push further into high-bandwidth RDMA storage and even external stores stay competitive. When VAST Data tested vLLM and LMCache on a DGX SuperPOD environment[5], loading a precomputed KV cache cut TTFT at 128K context from over 11 seconds down to 1.5 seconds. That result depended on a 400Gbps RDMA fabric, BlueField-3 DPUs, NVIDIA Magnum IO GPUDirect Storage, and a sufficiently long context all coming together.

Model Size, Context Length, and Network Bandwidth & Latency: The Variables That Tilt the Decision

Three variables matter when working through these cases: model size, context length, and the bandwidth and latency of the network. First, larger model size means higher prefill cost per token and a proportionally larger KV cache to store. The compute cost grows faster than cache size as models scale up, so KV cache offloading tilts in favor of larger models. Second, longer context grows attention recompute cost rapidly, while KV cache transfer time grows more gently in proportion to data size. Short contexts of a few hundred tokens often favor recomputing, while contexts in the tens of thousands of tokens make offloading the better bet. Third, if the storage sits on a regular NAS over plain NFS rather than a high-bandwidth NFS-over-RDMA fabric, the cached KV blocks won't arrive in time. Moving to 100-400Gbps NFS-over-RDMA (12-50 GB/s, microsecond-scale latency) cuts the round-trip time on small blocks and the transfer time on large ones.

Beyond Prefill: HBM Headroom and Session Mobility During Decode

Freeing HBM Slots to Scale Concurrency

When inactive chat sessions get pushed out to NAS, CPU memory, or disk, the GPU recovers memory it was holding. A GPU running inference for ten users with long contexts hits its memory ceiling and starts queuing new requests; with offloading, idle session caches move out and the GPU can take on more concurrent decode work. Decode itself doesn't get faster, but the GPU handles more streams in parallel, which shortens end-to-end inference times indirectly.
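
To put a rough number on that headroom: reusing the roughly 320KB-per-token figure from the earlier calculation, and assuming (for illustration only) 40GB of HBM left for KV after weights and activations, the session math looks like this:

```python
# Rough concurrency estimate: how many resident sessions fit in the KV budget.
# The 40GB headroom and the session lengths are illustrative assumptions.
kv_budget_gb = 40                      # HBM left for KV after weights/activations
kv_kib_per_token = 320                 # per-token KV footprint (Llama 3.1 70B, FP16, GQA)
session_tokens = [8_000, 16_000, 32_000]

for n in session_tokens:
    session_gb = n * kv_kib_per_token / 1024 / 1024
    print(f"{n:>6} tokens/session -> {session_gb:5.1f} GB each, "
          f"{int(kv_budget_gb // session_gb)} sessions resident")
```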

Session Migration: Avoiding Repeat Prefill

Offloading also avoids unnecessary prefill reruns. Multi-node AI clusters move workloads between GPUs frequently, whether for load balancing, elastic resource allocation, or failure recovery. If the local KV cache disappears with the GPU, the new GPU has to redo prefill from scratch. With KV cache offloading, the new GPU loads the stored KV blocks and resumes decode where the previous one left off.

Software Stacks That Support KV Cache Offloading

Beyond the vLLM + LMCache combination commonly used to implement KV cache offloading, several inference frameworks support it.

| Stack | Role | KV Cache Capabilities |
| --- | --- | --- |
| vLLM + LMCache | Inference engine + KV cache layer | CPU, disk, and remote tier offload (S3, Redis, etc.). The implementation discussed in this article. |
| NVIDIA Dynamo | Distributed inference framework | Disaggregated prefill/decode with KV cache transfer between GPUs (RDMA, NVMe-oF, S3) via the NIXL library[6] |
| llm-d | Kubernetes-based distributed scheduler | KV cache indexer tracks cache state across pods, prefix-aware routing. Runs on top of vLLM + LMCache[7] |
| KServe | Kubernetes model serving | LMCache KV offloading integration in the vLLM backend |
| SGLang | Inference engine | Built-in prefix caching, integration support for Dynamo disaggregation |

Beyond the inference and KV cache management frameworks themselves, libraries like NVIDIA NIXL (NVIDIA Inference Xfer Library) form the underlying transport layer that makes faster KV cache transfer possible.

Where It Helps and Where It Doesn't

KV cache offloading isn't a universal win. Picking workloads that match its strengths is what makes the difference. The table below maps common scenarios to whether offloading is likely to pay off.

| Scenario | Verdict | Reason |
| --- | --- | --- |
| General chatbot (mostly one-shot prompts) | Disadvantage | Each request has a different prefix, so previous KV blocks rarely get reused. Offload adds I/O without payoff. |
| Team-scale RAG or agentic (single long shared prefix) | Disadvantage | The shared prefix gets reused on every request and stays in GPU memory rather than being evicted. Offloading produces no hits. |
| General TCP NFS environment | Disadvantage | Transfer latency exceeds recompute latency. |
| Multi-team RAG or multi-codebase agentic (multiple long shared prefixes) | Advantage | Multiple prefix sets cycle through eviction, and the cost of reloading beats the cost of recomputing. |
| 70B+ long-context multi-turn (10K+ tokens) | Advantage | The crossover point where prefill cost overtakes RDMA transfer cost. |
| Cross-node session migration | Advantage | The new GPU resumes decode from stored KV instead of running prefill again. |

A note on the team-scale RAG case: it can look like a strong fit for offloading because the system prompt is long and retrieved knowledge produces repeating prompts across requests. In practice, the prefix's KV blocks stay resident on the GPU rather than getting evicted, so cache eviction rarely happens. Adding offloading on top only introduces hashing and movement overhead, which slows the system down. Stripping out the external offload layer and relying on vLLM's built-in prefix caching is the better fit there.

Closing Thoughts

KV cache offloading is a useful technique, but its useful range is narrow enough that knowing when it helps and when it doesn't is what matters. Before deploying it, check whether the target scenario falls inside the range where it actually pays off, and start with measurement rather than reasoning. Model size, context length, and network bandwidth and latency move together, which makes a single threshold hard to draw, but the standard practice is to pick two or three representative workloads, measure prefill time and cache load time directly, and use the token length where the two cross as the threshold for the deployment. Watching indicators such as KV cache usage, prefix cache hit rate, eviction rate, and effective cache throughput[7] over the first few days, while raising the CPU buffer size in steps, reveals the point where the cache hit rate stops improving. If the result lands outside the favorable region, leaning on the inference framework's built-in caching may be enough rather than adding an external offload layer. For scale-out deployments, consider a router that tracks KV state in real time, such as Backend.AI Continuum Router. Backend.AI has been validated against RDMA storage vendors including VAST Data, and existing workflows carry over without modification.
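
One way to turn "start with measurement" into practice is a small harness that times both paths at increasing prompt lengths and reports where they cross. The two inner functions are placeholders for whatever the deployment actually exposes, for example timed requests against the serving endpoint with a cold versus a warmed external cache; they are not an existing vLLM or LMCache API.

```python
def measure_prefill_ttft(prompt_tokens: int) -> float:
    """Placeholder: issue a cold request of this length and return TTFT in seconds."""
    raise NotImplementedError

def measure_cached_ttft(prompt_tokens: int) -> float:
    """Placeholder: issue the same request with the KV cache already offloaded and warm."""
    raise NotImplementedError

def find_crossover(lengths=(1_000, 4_000, 16_000, 32_000, 64_000, 128_000)):
    """Return the first prompt length at which loading beats recomputing."""
    for n in lengths:
        recompute = measure_prefill_ttft(n)
        load = measure_cached_ttft(n)
        print(f"{n:>7} tokens: recompute {recompute:.2f}s, load {load:.2f}s")
        if load < recompute:
            return n
    return None  # offloading never wins in the measured range
```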


If you're interested in inference infrastructure optimization including KV cache offloading, see the Backend.AI webpage or contact Lablup.

Footnotes

  1. Llama 3.1, 128K context, GQA (8 KV heads), 70B 128K KV cache ≈ 40GB. Meta, "Llama 3.1 - 405B, 70B & 8B with multilinguality and long context." Hugging Face Blog. https://huggingface.co/blog/llama31

  2. GPUDirect Storage supports both local NVMe (PCIe P2P) and remote storage (NVMe-oF, NFS-over-RDMA, RDMA-capable distributed file systems such as WEKA, VAST, and DDN Exascaler). NVIDIA, "GPUDirect Storage Overview Guide." https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html / NVIDIA Developer Blog, "GPUDirect Storage: A Direct Path Between Storage and GPU Memory." https://developer.nvidia.com/blog/gpudirect-storage/

  3. Bidirectional scheduling; SSD/HDD/NVMe bandwidth comparison; at 3 GB/s, loading a 25GB cache takes about 8s versus roughly 30s to recompute a 72K-token prefill on a 70B model; average 2.6× TTFT reduction. S. Jin, X. Liu, Q. Zhang, Z. M. Mao, "Compute Or Load KV Cache? Why Not Both?" arXiv:2410.03065 (2024). https://arxiv.org/abs/2410.03065

  4. LMCache + vLLM 3 to 10× latency reduction benchmark. LMCache Project, GitHub repository. https://github.com/LMCache/LMCache

  5. DGX SuperPOD reference architecture with 400Gbps RDMA, BlueField-3 DPUs, and GDS, 128K context TTFT 11s to 1.5s. VAST Data, "Accelerating Inference." https://www.vastdata.com/blog/accelerating-inference

  6. NVIDIA Dynamo distributed inference framework and the NIXL KV cache transfer library (RDMA, NVMe-oF, S3 support). NVIDIA, "Introducing NVIDIA Dynamo" (GTC 2025). https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/

  7. Production prefix caching hit rates (warm cache around 87%) and Prometheus metrics (kv_cache_usage_percent, prefix_cache_hit_rate, eviction rate, effective cache throughput). llm-d, "KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d." https://llm-d.ai/blog/kvcache-wins-you-can-see
