The KV cache stores the key and value tensors computed in each transformer attention layer during autoregressive generation. Rather than recomputing keys and values for every previous token at each step, the model computes them once, caches them, and reuses them for each new token, which removes redundant computation and significantly accelerates inference.

  • Eliminates redundant recomputation of K/V matrices for cached tokens
  • Grows linearly with sequence length and model dimension
  • Essential optimization for token-by-token generation
  • Memory footprint becomes primary bottleneck at scale
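
The reuse described above can be illustrated with a minimal single-head sketch in NumPy. The projection matrices, the random "token embeddings", and the `KVCache` class are all hypothetical stand-ins, not any framework's API; the point is only that projecting just the newest token and appending to the cache gives the same attention output as re-projecting the whole prefix.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(q, K, V):
    """Scaled dot-product attention for one query (d,) over K, V of shape (t, d)."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V


rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))  # stand-ins for 5 token embeddings


def step_without_cache(t):
    """Re-project keys and values for every token seen so far."""
    x = tokens[: t + 1]
    return attend(x[-1] @ Wq, x @ Wk, x @ Wv)


class KVCache:
    """Append-only store of key and value rows, one per processed token."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Project only the new token; earlier rows are reused as-is.
        self.keys.append(x @ Wk)
        self.values.append(x @ Wv)
        return attend(x @ Wq, np.stack(self.keys), np.stack(self.values))


cache = KVCache()
outputs_match = all(
    np.allclose(step_without_cache(t), cache.step(tokens[t]))
    for t in range(len(tokens))
)
print(outputs_match)  # True
```

Without the cache, step t performs t + 1 key and value projections; with it, each step performs exactly one, at the cost of storing one key row and one value row per token.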

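KV cache memory grows linearly with sequence length, batch size, and the number of layers and KV heads. For large models serving long sequences, it becomes the dominant memory consumer and can exceed the memory held by model parameters. With LLaMA-2 7B dimensions (32 layers, hidden size 4096, FP16), each token takes about 0.5 MiB of cache, so a single 32K-token sequence needs about 16 GiB, already more than the roughly 13 GB of FP16 weights; a batch of four such sequences needs several times the parameter memory.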

  • Memory per token: 2 (K and V) × layers × KV heads × head_dim × 2 bytes (FP16)
  • Scales linearly with sequence length, batch size, and layer count
  • Fragmentation and over-reservation waste 60-80% of KV cache memory under traditional contiguous allocation
  • Critical constraint limiting batch size and throughput
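
A rough sizing helper makes the arithmetic concrete. This is a sketch under stated assumptions: the function name and parameters are illustrative, it counts one key and one value vector per layer per KV head, and it ignores allocator overhead and padding. The example shape (32 layers, 32 KV heads, head dimension 128, FP16) is that of LLaMA-2 7B.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # Factor 2 = one key vector plus one value vector per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# LLaMA-2 7B shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
one_seq_32k = kv_cache_bytes(32, 32, 128, seq_len=32 * 1024)

print(per_token / 2**20)    # 0.5  (MiB per token)
print(one_seq_32k / 2**30)  # 16.0 (GiB for one 32K-token sequence)
```

Because every factor enters as a plain product, doubling the sequence length or the batch size doubles the cache.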

LLM inference has two distinct phases: prefill processes the entire prompt in parallel (compute-bound), and decode generates one token at a time (memory-bound). Prefill populates the KV cache for every prompt token in a single pass; each decode step then reads the whole cache and appends one new key/value entry per layer, so the prompt is never reprocessed.

  • Prefill phase: Highly parallelizable, FLOPS-limited processing
  • Decode phase: Sequential generation, memory bandwidth-limited
  • Each decode step reads the entire KV cache, so cache size sets the memory traffic per generated token
  • Batching strategies differ significantly between phases
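
Why the two phases hit different limits can be sketched with a roofline-style estimate. This is a deliberately crude, weights-only model: it assumes each FP16 weight is read once per forward pass and used for one multiply-add per token, and it ignores attention FLOPs, KV cache reads, and activations. The hardware figures are assumed round numbers for illustration, not the specification of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Weights-only approximation: each weight is read once and used for one
    multiply-add (2 FLOPs) per token in the pass.
    """
    return (2 * tokens_per_pass) / bytes_per_weight


def limiting_resource(tokens_per_pass: int, peak_flops: float,
                      peak_bytes_per_s: float) -> str:
    """Classify a pass as compute-bound or memory-bound with a roofline-style test."""
    ridge_point = peak_flops / peak_bytes_per_s  # FLOPs/byte where the two limits meet
    if arithmetic_intensity(tokens_per_pass) >= ridge_point:
        return "compute-bound"
    return "memory-bound"


# Assumed, illustrative accelerator: 300 TFLOP/s of FP16 compute, 2 TB/s of bandwidth.
PEAK_FLOPS = 300e12
PEAK_BYTES_PER_S = 2e12

prefill = limiting_resource(2048, PEAK_FLOPS, PEAK_BYTES_PER_S)  # 2048-token prompt
decode = limiting_resource(1, PEAK_FLOPS, PEAK_BYTES_PER_S)      # one new token
print(prefill, decode)  # compute-bound memory-bound
```

Under this model, batching more decode requests raises the tokens processed per weight read, which is why batch size matters so much for decode throughput, while each extra sequence also brings its own KV cache to store and read.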

Multi-Head Attention (MHA) remains the baseline, but Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce the number of KV heads while largely preserving quality. Cache memory shrinks by the ratio of query heads to key-value heads.

  • MHA: One KV head per query head (baseline)
  • MQA: Single KV head shared across all queries (aggressive reduction)
  • GQA: Multiple query heads share each KV head (balanced approach)
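  • Cache reduction equals query heads ÷ KV heads: 8× for GQA with 64 query heads and 8 KV heads (as in LLaMA-2 70B); for MQA it equals the full head count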
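
The three variants differ only in how query heads map onto KV heads, which a few lines can make explicit. The helper names below are hypothetical, and the sketch assumes the query-head count is a multiple of the KV-head count, with consecutive query heads grouped together.

```python
def kv_head_for(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Index of the KV head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def cache_reduction(num_query_heads: int, num_kv_heads: int) -> float:
    """How many times smaller the KV cache is than with one KV head per query head."""
    return num_query_heads / num_kv_heads


print([kv_head_for(h, 8, 8) for h in range(8)])  # MHA: [0, 1, 2, 3, 4, 5, 6, 7]
print([kv_head_for(h, 8, 2) for h in range(8)])  # GQA: [0, 0, 0, 0, 1, 1, 1, 1]
print([kv_head_for(h, 8, 1) for h in range(8)])  # MQA: [0, 0, 0, 0, 0, 0, 0, 0]
print(cache_reduction(64, 8))                    # 8.0
```

Only the KV heads are cached, so the cache shrinks by exactly the group size while the number of query heads, and the attention compute per token, stays the same.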

Quantization, token eviction, and sliding window attention reduce the KV cache footprint without changing the attention head layout. Techniques such as INT8/INT4 quantization, H₂O eviction, and attention-sink-based streaming enable serving longer sequences within fixed memory budgets.

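  • INT8/INT4 quantization: 50-75% memory reduction relative to FP16, with minimal quality loss
  • H₂O: Keeps "heavy-hitter" tokens with high accumulated attention plus the most recent tokens and evicts the rest; attention-sink methods instead keep the first few tokens and a recent window
  • Sliding window: Mistral 7B's 4,096-token window caps the cache with a rolling buffer, 8× smaller at 32K tokens and 90%+ smaller beyond roughly 41K tokens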
  • Speculative decoding: Requires separate cache for draft models
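
As one illustration of the quantization bullet, the sketch below applies symmetric per-token INT8 quantization to a block of cached keys. It is a simple textbook scheme chosen for clarity, not the method of any particular library; the function names, the random tensor, and the choice of one FP32 scale per token are assumptions.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-row INT8 quantization, so that x is approximately q * scale."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)


def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 128)).astype(np.float16)  # (tokens, head_dim)

q, scale = quantize_int8(keys.astype(np.float32))
restored = dequantize_int8(q, scale)

fp16_bytes = keys.nbytes              # 2 bytes per element
int8_bytes = q.nbytes + scale.nbytes  # 1 byte per element plus one scale per token
max_abs_error = np.abs(restored - keys.astype(np.float32)).max()

print(int8_bytes / fp16_bytes)  # slightly above 0.5 because the scales are stored too
```

The rounding error per element is bounded by half a quantization step, that is, half of that token's scale. Production schemes typically choose the grouping (per token, per channel, or per block) to keep outliers from inflating the scale.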

PagedAttention nearly eliminates fragmentation by applying OS-style virtual memory paging to the KV cache. vLLM's implementation maps logical blocks to non-contiguous physical blocks and reports 2-4× throughput improvements over earlier serving systems, in part by enabling flexible cache sharing and prefix caching for system prompts.

  • Traditional allocation: 60-80% of KV cache memory wasted to fragmentation and over-reservation
  • PagedAttention: Logical blocks mapped to physical pages dynamically
  • Copy-on-write enables efficient prompt sharing across sequences
  • 2-4× throughput improvement in practical deployments
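
The block-table idea can be sketched as a toy allocator with reference counts. This is not vLLM's implementation: the class names, the free-list policy, and the block size are hypothetical, and only block bookkeeping is modeled (no tensors are stored or copied).

```python
class BlockAllocator:
    """Pool of fixed-size physical blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        self.ref_count[block] += 1
        return block

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


class Sequence:
    """One request's block table: logical block index -> physical block id."""

    def __init__(self, allocator, block_size):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            # Last block is full (or there is none): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: take a private block before writing to a shared one.
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self):
        """New sequence sharing every existing block, e.g. a common prompt."""
        child = Sequence(self.allocator, self.block_size)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


allocator = BlockAllocator(num_blocks=8)
parent = Sequence(allocator, block_size=4)
for _ in range(6):          # 6 tokens -> one full block and one half-full block
    parent.append_token()
child = parent.fork()
child.append_token()        # writes into the shared half-full block -> copy-on-write

print(parent.block_table, child.block_table)  # [7, 6] [7, 5]
```

The full first block stays shared between both sequences, and only the partially filled last block is duplicated. Since blocks are allocated one at a time, the unused space per sequence is at most one block rather than a whole pre-reserved maximum-length buffer.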

Major inference frameworks implement KV cache optimizations differently. vLLM leads with PagedAttention and continuous batching, TensorRT-LLM focuses on in-flight batching and graph optimization, while HuggingFace TGI balances flexibility with performance through modular architecture.

  • vLLM: PagedAttention, continuous batching, prefix caching
  • TensorRT-LLM: In-flight batching, graph optimization, NVIDIA integration
  • HuggingFace TGI: Modular design, dynamic batching, distributed inference
  • Each trades off flexibility, performance, and deployment complexity

Future research explores distributed KV stores for serving massive models, cross-request caching with LMCache, speculative decoding with shared KV cache, and hardware co-design for KV cache operations. Multi-tier storage (GPU→CPU→SSD) may extend effective cache capacity beyond GPU memory.

  • LMCache: Distributed KV store for cross-request cache sharing
  • Disaggregated inference: Separate prefill and decode clusters
  • Speculative decoding: Requires efficient cache sharing with draft models
  • Hardware innovations targeting KV cache compute patterns

Sources & References

References and sources for further study on the topics covered in this deep dive.