The KV cache stores precomputed key and value tensors from transformer attention layers during autoregressive generation. Instead of recomputing attention for all previous tokens, the model caches these tensors and reuses them, avoiding redundant computation and significantly accelerating inference.
Eliminates redundant recomputation of K/V matrices for cached tokens
Grows linearly with sequence length and model dimension
Essential optimization for token-by-token generation
Memory footprint becomes primary bottleneck at scale
KV cache memory consumption grows linearly with sequence length multiplied by model dimensions. For large models generating long sequences, this becomes the dominant memory bottleneck, often exceeding model parameter memory. A LLaMA-2 7B model can consume 4x more cache memory than parameters alone when generating 32K token sequences.
Scales with sequence length and model size exponentially
Fragmentation wastes 60-80% of GPU memory in traditional batching
Critical constraint limiting batch size and throughput
LLM inference has two distinct phases: prefill processes the entire prompt in parallel (compute-bound), and decode generates one token at a time (memory-bound). The KV cache enables efficient prefill-to-decode transition by bridging the computational gap while preserving all cached attention data.
KV cache size determines memory overhead per token generated
Batching strategies differ significantly between phases
Multi-Head Attention (MHA) remains the standard, but Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV head count while maintaining quality. This directly reduces cache memory proportionally to the ratio of query to key-value heads.
MHA: One KV head per query head (baseline)
MQA: Single KV head shared across all queries (aggressive reduction)
GQA: Multiple query heads share each KV head (balanced approach)
Reduces cache memory by 8-16× compared to full MHA
Quantization, token pruning, and sliding window attention reduce the KV cache footprint without architectural changes. Techniques like INT8/INT4 quantization, H₂O eviction, and attention sink phenomena enable serving longer sequences within fixed memory budgets.
INT8/INT4 quantization: 50-75% memory reduction with minimal quality loss
H₂O: Evicts unimportant middle tokens while keeping first and last
Sliding window: Mistral's 4K window reduces cache by 90%+ for very long sequences
Speculative decoding: Requires separate cache for draft models
PagedAttention eliminates fragmentation through OS-inspired virtual memory for KV cache. vLLM's implementation uses logical blocks mapped to physical memory, achieving 2-4× throughput improvements by enabling flexible cache sharing and prefix caching for system prompts.
Traditional allocation: 60-80% GPU memory wasted due to fragmentation
PagedAttention: Logical blocks mapped to physical pages dynamically
Copy-on-write enables efficient prompt sharing across sequences
2-4× throughput improvement in practical deployments
Major inference frameworks implement KV cache optimizations differently. vLLM leads with PagedAttention and continuous batching, TensorRT-LLM focuses on in-flight batching and graph optimization, while HuggingFace TGI balances flexibility with performance through modular architecture.
References and sources for further study on the topics covered in this deep dive.
Attention Mechanism & The Need for Caching
In transformer self-attention, the model computes three matrices from input: Queries (Q), Keys (K), and Values (V). The attention output for position t depends on scores computed from Qt against all K and V from positions 1 to t.
During autoregressive generation, each new token requires computing attention against all previous tokens. Without caching: each new token recomputes K and V for all context tokens—quadratic work. With caching: K and V from previous steps are reused, enabling linear scaling of inference cost.
Key Insight
Queries are never cached because they change with each new token. Keys and Values can be cached because they depend only on past tokens, which never change during generation.
Step-by-Step Generation with Cache
Step 1 (Prefill): Entire prompt [token₁, token₂, ..., token_n] is processed in parallel. K and V are computed for all positions and cached in GPU memory.
Step 2+ (Decode): For each new token:
Compute Q from the single new token
Retrieve cached K and V from all previous positions
Compute attention and generate next token
Append new K and V to cache
Memory Layout of KV Cache
For each layer, K and V are stored as separate dense tensors:
K: [batch_size, num_heads, seq_len, head_dim]
V: [batch_size, num_heads, seq_len, head_dim]
Total per layer: 2 × batch × heads × seq × head_dim × 2 bytes (FP16)
All layers' caches stack in memory, growing linearly with depth and sequence length. This dense storage enables efficient GPU access patterns but creates significant fragmentation when serving multiple requests of varying sequence lengths.
Impact: Cache dominates memory at long sequences. At 2K tokens, cache is ~3.4GB per batch. At 32K tokens, it explodes to 55GB—a 16× increase for 10× longer sequences.
Sequence Length Impact
Memory grows linearly with sequence length but the impact is non-linear on throughput and batch size:
Short Context (2K tokens)
Cache ~3.4 GB/batch. Multiple batches parallel. High utilization of GPU.
In traditional request batching, each request reserves contiguous cache space for its maximum possible length. If a request generates fewer tokens than reserved, that space is wasted. Across a batch of diverse requests, 60-80% of allocated cache remains unused.
Critical Problem
Fragmentation is the hidden crisis of KV cache—GPU memory is allocated but inaccessible to other requests, making effective batch size far smaller than theoretical capacity.
Compute-Bound vs Memory-Bound Inference
Prefill and decode exhibit opposite computational characteristics:
Prefill (Compute-Bound)
Process n tokens in parallel
High arithmetic intensity: (n² × d) work per (n × d) memory access
GPUs reach 50-70% peak FLOPS utilization
Batch size can be large (128-256)
Latency-insensitive; throughput-critical
Decode (Memory-Bound)
Generate 1 token at a time
Low arithmetic intensity: (cache × d) work per (d) memory access
GPUs reach only 10-20% peak FLOPS utilization
Batch size limited by memory: typically 8-32
Latency-critical; throughput secondary
The KV cache bridges these phases: prefill dumps large cache, decode consumes it gradually. Optimal systems amortize prefill latency across many decode tokens while maximizing cache utilization.
Batching Strategies
Static batching: Fixed batch size, requests padded to max length. Wasteful for variable-length requests.
Dynamic batching: Requests leave batch when finished. Poor scheduling leads to GPU idle time.
Continuous batching (vLLM): Requests enter/leave dynamically without synchronization points. Achieves near-optimal utilization by overlapping prefill and decode phases.
Phase Interaction: The Bridge
Multi-Head Attention (MHA) — Baseline
Standard transformer architecture. Each query head has corresponding key and value heads. In a model with 32 attention heads and hidden dimension 4096:
32
Query heads
32
Key heads
32
Value heads
1:1 ratio
Q:KV
Cache memory: 2 × 32 × seq_len × head_dim per token. Baseline reference.
Multi-Query Attention (MQA)
All query heads share a single key and value head. Reduces KV heads from 32 to 1—a 32× reduction in cache size with trade-offs:
32
Query heads
1
Key head
1
Value head
32:1 ratio
Q:KV
Used in Falcon, PaLM. Dramatic cache reduction but head-to-head comparisons show modest perplexity degradation on held-out benchmarks (1-2 point PPL increase).
Grouped-Query Attention (GQA)
Balanced middle ground: query heads grouped around fewer key/value heads. For instance, 32 query heads grouped into 8 groups × 4 queries per group, with 8 KV heads:
32
Query heads
8
Key heads
8
Value heads
4:1 ratio
Q:KV
Achieves ~8× cache reduction with <1 PPL quality gap. Used in LLaMA-2, Mistral, Llama-3 (recommended sweet spot).
Config: MQA, single KV head. Extreme cache efficiency.
Tradeoff Grid
MHA/GQA Strengths
Maximum model capacity and expressiveness
Well-established training procedures
Better quality on benchmarks
Flexible scaling to larger models
MHA/GQA Limitations
Large cache memory footprint
Limited to smaller batch sizes at long context
Higher inference cost and latency
Expensive to serve at scale
MQA Strengths
Minimal cache memory (32× reduction)
Highest batch sizes and throughput
Deployable on smaller hardware
Cost-effective at inference scale
MQA Limitations
Quality degradation vs full MHA
Requires retraining from scratch
Less proven at very large scales
Query-head bottleneck potential
Quantization of KV Cache
Reduce precision from FP16 (2 bytes per value) to INT8/INT4:
INT8: 1 byte per value. 50% memory reduction. Minimal quality loss with proper calibration.
INT4: 0.5 bytes per value. 75% reduction. Requires per-token or per-channel quantization.
FP8: Emerging standard. 1 byte, dynamic range similar to FP16, better than INT8.
Trade-off: Quantization introduces rounding error, but effects are often modest because attention values stabilize after early layers and are relatively smooth.
Token Eviction Strategies
H₂O (Heavy-Hitter Oracle): Keep the first and last few tokens (attention sinks) plus a sliding window of recent tokens. Evict middle tokens whose attention scores are low. Achieves 90% cache reduction with <2% perplexity increase on long sequences (32K+ tokens).
StreamingLLM: Similar to H₂O; focus on maintaining sink tokens (often BOS token) that appear in all attention patterns.
SnapKV: Dynamic importance scoring with eviction. Trade computation for cache reduction.
Attention Sink Phenomenon
Empirical finding: The first token(s) in a sequence (typically BOS or padding tokens) receive high attention scores in all layers and positions. These "sinks" are critical to model function but represent a tiny fraction of the sequence. Techniques that identify and preserve sinks while evicting others achieve dramatic cache reduction.
Sliding Window Attention
Mistral's approach: attention computed only over recent N tokens (e.g., 4096), ignoring older history. Reduces cache to constant size regardless of total sequence length.
4K tokens
Mistral window
90%+
Memory saved
<2% PPL
Quality loss
Trade-off: Cannot attend to very distant context. Suitable for many applications (e.g., dialogue, document processing) but not long-context tasks requiring full history.
Speculative Decoding with Cache
Speculative decoding uses a draft model to generate candidates, then verifies with the main model. Both require caches. The draft model (smaller) needs less cache, but total cache is 1.5-2× baseline. Requires careful memory budgeting and synchronization of cache states.
Traditional KV Cache Allocation Problem
Typical allocation: Preallocate contiguous cache buffer for each request's entire maximum length. If request terminates early or generation is shorter than max, that space is wasted and inaccessible to other requests.
Request A (max 8K)
Actually generates 2K tokens. 6K slots wasted.
Request B (max 8K)
Cannot reuse A's wasted space due to contiguous requirement.
Request C (max 4K)
Allocated but competes with A&B for remaining contiguous space.
GPU Memory
60-80% of total cache allocated is unused but blocked from other requests.
PagedAttention Solution
Inspired by OS virtual memory: KV cache divided into fixed-size physical blocks (e.g., 16K tokens per block). Each request has a logical block table that maps to physical blocks. As requests extend their cache, new physical blocks are allocated non-contiguously.
Key Mechanism
Logical blocks decoupled from physical blocks enable fragmentation-free allocation. Multiple requests can share physical blocks; compute kernels perform block-table lookup during attention.
Block Table Architecture
Copy-On-Write & Prefix Caching
If multiple requests share the same prompt (e.g., system prompt for many users), PagedAttention can share the prefix's cached blocks. When a request diverges (e.g., different user input after system prompt), it copies only the diverging portion. This saves significant memory when many requests share context.
2-4×
Throughput improvement
60% → 5%
Memory fragmentation
~10-20%
Compute overhead (block table)
vLLM Implementation
vLLM uses PagedAttention in its core execution model. Each request tracks a block table. During prefill, new blocks are allocated. During decode, the block table is indexed to gather K,V for attention. When a request completes, its blocks are returned to the free pool and can be reused immediately.
Integration with continuous batching: requests enter/leave the batch independently. New requests can immediately utilize freed blocks from completed requests. No synchronization between prefill/decode or between requests—maximum GPU utilization.
Performance Impact
Real vLLM deployments see:
2-4× improvement in requests/second
Batch sizes 4-8× larger at same latency SLO
Memory efficiency: ~80-85% cache utilization vs ~20% with traditional allocation
Prefix caching: additional 1.5-2× improvement when many requests share prompts
Framework Comparison Matrix
vLLM
KV Cache: PagedAttention with continuous batching. Prefix caching for system prompts. Token counting and abort-request support.
TensorRT-LLM
KV Cache: In-flight batching with dynamic buffer management. Fused compute kernels. Heavy NVIDIA optimization.
HuggingFace TGI
KV Cache: Modular approach; uses paged attention variant. Focus on distributed inference and scalability.
VLLM
KV Cache: Alternative paged approach; emphasis on integration with serving frameworks.
Feature Comparison Grid
Framework
PagedAttention
Prefix Caching
Distributed
vLLM
✓
✓
✓
TensorRT-LLM
✓
—
✓
HuggingFace TGI
✓
✓
✓
Deployment Considerations
vLLM: Easiest to deploy; PyTorch-native. Best for single-node setups and research. Excellent developer experience.
TensorRT-LLM: Highest performance on NVIDIA hardware. Complex compilation pipeline. Requires expertise. Best for large-scale production.
HuggingFace TGI: Good balance of usability and performance. Docker-first. Native Hugging Face integration. Suited for enterprises seeking managed inference.
Continuous Batching Strategy
All frameworks employ continuous batching or in-flight batching to maximize utilization:
Requests added to batch dynamically (no waiting for full batch)
Each request progresses independently through prefill/decode
Requests exit when finished; free resources immediately reclaimed
No synchronization barriers between requests—GPU stays busy
Outcome: 3-5× throughput improvement over batched inference where all requests in a batch must complete together.
Cross-Request KV Cache Sharing
LMCache project: external distributed KV store for cross-request cache sharing. Multiple model replicas can query a shared cache backend for previously computed K,V tensors. Enables:
Cache reuse across different users' requests with same prompt
Amortization of expensive prefill across many requests
Significant throughput gains for repetitive workloads (e.g., chatbots with fixed system prompts)
Emerging Pattern
Cache-as-a-service: Treat KV cache as a first-class distributed resource, not just per-request GPU memory.
Disaggregated Inference
Future architecture: separate clusters for prefill and decode phases:
Prefill cluster: Optimized for throughput. Heavy parallelism. Large batch sizes.
KV Store: Centralized cache between clusters. Enables asynchronous prefill without waiting for decode capacity.
Benefit: Independent scaling of prefill and decode workloads. If decode becomes bottleneck, add more decode nodes without reprovisioning prefill. Enables resource efficiency and elasticity.
Speculative Decoding & KV Cache
Speculative decoding generates multiple candidate tokens in parallel, then verifies with the main model. KV cache interaction:
Draft model maintains its own KV cache (smaller, faster)
Main model maintains separate KV cache for verification
If draft prediction correct: main model's cache extended
If draft prediction wrong: rollback and retry with fewer candidates
Total cache footprint: 1.5-2× baseline due to dual caches. But token generation speed can improve 2-3× if draft model has high correctness.
Multi-Tier Storage (Future)
Research direction: KV cache tiering across heterogeneous storage:
Hot (GPU VRAM)
Recent decode tokens. High-speed access needed for every step.