KV Cache in LLM Inference

Attention Mechanism & The Need for Caching

In transformer self-attention, the model computes three matrices from input: Queries (Q), Keys (K), and Values (V). The attention output for position t depends on scores computed from Q_t against all K and V from positions 1 to t.

Attention equation: Attention(Q, K, V) = softmax(Q K^T / √d_k) V

During autoregressive generation, each new token requires computing attention against all previous tokens. Without caching: each new token recomputes K and V for all context tokens—quadratic work. With caching: K and V from previous steps are reused, enabling linear scaling of inference cost.

Key Insight

Queries are never cached because they change with each new token. Keys and Values can be cached because they depend only on past tokens, which never change during generation.

Step-by-Step Generation with Cache

Step 1 (Prefill): Entire prompt [token₁, token₂, ..., token_n] is processed in parallel. K and V are computed for all positions and cached in GPU memory.

Step 2+ (Decode): For each new token:

Compute Q from the single new token
Retrieve cached K and V from all previous positions
Compute attention and generate next token
Append new K and V to cache

Memory Layout of KV Cache

For each layer, K and V are stored as separate dense tensors:

K: [batch_size, num_heads, seq_len, head_dim]
V: [batch_size, num_heads, seq_len, head_dim]
Total per layer: 2 × batch × heads × seq × head_dim × 2 bytes (FP16)

All layers' caches stack in memory, growing linearly with depth and sequence length. This dense storage enables efficient GPU access patterns but creates significant fragmentation when serving multiple requests of varying sequence lengths.

Memory Calculation Formula

Cache memory per batch for one layer:

M = 2 × batch × seq_len × num_heads × head_dim × 2 bytes

Total across L layers and all parameters:

M_cache = M × L (per batch element)

Case Study: LLaMA-2 Memory Profile

7B params

Model size

13.8 GB

Model weights (FP16)

32 heads

Attention heads

Cache per batch at 32K tokens:

55 GB

Cache only (seq=32K)

4:1

Cache:params ratio

1

Batch size max (80GB GPU)

Impact: Cache dominates memory at long sequences. At 2K tokens, cache is ~3.4GB per batch. At 32K tokens, it explodes to 55GB—a 16× increase for 10× longer sequences.

Sequence Length Impact

Memory grows linearly with sequence length but the impact is non-linear on throughput and batch size:

Short Context (2K tokens)

Cache ~3.4 GB/batch. Multiple batches parallel. High utilization of GPU.

Medium Context (8K tokens)

Cache ~13.6 GB/batch. Batch size reduced. GPU remains fairly loaded.

Long Context (32K tokens)

Cache ~55 GB/batch. Single batch only. GPU severely underutilized.

Very Long (128K tokens)

Cache ~220 GB/batch. Multiple GPUs required. Extreme fragmentation risk.

Fragmentation Waste

In traditional request batching, each request reserves contiguous cache space for its maximum possible length. If a request generates fewer tokens than reserved, that space is wasted. Across a batch of diverse requests, 60-80% of allocated cache remains unused.

Critical Problem

Fragmentation is the hidden crisis of KV cache—GPU memory is allocated but inaccessible to other requests, making effective batch size far smaller than theoretical capacity.

Compute-Bound vs Memory-Bound Inference

Prefill and decode exhibit opposite computational characteristics:

Prefill (Compute-Bound)

Process n tokens in parallel
High arithmetic intensity: (n² × d) work per (n × d) memory access
GPUs reach 50-70% peak FLOPS utilization
Batch size can be large (128-256)
Latency-insensitive; throughput-critical

Decode (Memory-Bound)

Generate 1 token at a time
Low arithmetic intensity: (cache × d) work per (d) memory access
GPUs reach only 10-20% peak FLOPS utilization
Batch size limited by memory: typically 8-32
Latency-critical; throughput secondary

The KV cache bridges these phases: prefill dumps large cache, decode consumes it gradually. Optimal systems amortize prefill latency across many decode tokens while maximizing cache utilization.

Batching Strategies

Static batching: Fixed batch size, requests padded to max length. Wasteful for variable-length requests.

Dynamic batching: Requests leave batch when finished. Poor scheduling leads to GPU idle time.

Continuous batching (vLLM): Requests enter/leave dynamically without synchronization points. Achieves near-optimal utilization by overlapping prefill and decode phases.

Phase Interaction: The Bridge

Multi-Head Attention (MHA) — Baseline

Standard transformer architecture. Each query head has corresponding key and value heads. In a model with 32 attention heads and hidden dimension 4096:

32

Query heads

32

Key heads

32

Value heads

1:1 ratio

Q:KV

Cache memory: 2 × 32 × seq_len × head_dim per token. Baseline reference.

Multi-Query Attention (MQA)

All query heads share a single key and value head. Reduces KV heads from 32 to 1—a 32× reduction in cache size with trade-offs:

32

Query heads

1

Key head

1

Value head

32:1 ratio

Q:KV

Used in Falcon, PaLM. Dramatic cache reduction but head-to-head comparisons show modest perplexity degradation on held-out benchmarks (1-2 point PPL increase).

Grouped-Query Attention (GQA)

Balanced middle ground: query heads grouped around fewer key/value heads. For instance, 32 query heads grouped into 8 groups × 4 queries per group, with 8 KV heads:

32

Query heads

8

Key heads

8

Value heads

4:1 ratio

Q:KV

Achieves ~8× cache reduction with <1 PPL quality gap. Used in LLaMA-2, Mistral, Llama-3 (recommended sweet spot).

Real-World Model Comparison

GPT-4

Config: MHA with ~100+ heads. Proprietary architecture. Likely uses internal optimizations.

LLaMA-2 7B

Config: 32 heads, MHA. Large cache. 13B variant uses GQA in some variants.

Mistral 7B

Config: 32 query heads, 8 KV heads (GQA). Aggressive cache reduction.

Falcon 40B

Config: MQA, single KV head. Extreme cache efficiency.

Tradeoff Grid

MHA/GQA Strengths

Maximum model capacity and expressiveness
Well-established training procedures
Better quality on benchmarks
Flexible scaling to larger models

MHA/GQA Limitations

Large cache memory footprint
Limited to smaller batch sizes at long context
Higher inference cost and latency
Expensive to serve at scale

MQA Strengths

Minimal cache memory (32× reduction)
Highest batch sizes and throughput
Deployable on smaller hardware
Cost-effective at inference scale

MQA Limitations

Quality degradation vs full MHA
Requires retraining from scratch
Less proven at very large scales
Query-head bottleneck potential

Quantization of KV Cache

Reduce precision from FP16 (2 bytes per value) to INT8/INT4:

INT8: 1 byte per value. 50% memory reduction. Minimal quality loss with proper calibration.
INT4: 0.5 bytes per value. 75% reduction. Requires per-token or per-channel quantization.
FP8: Emerging standard. 1 byte, dynamic range similar to FP16, better than INT8.

Trade-off: Quantization introduces rounding error, but effects are often modest because attention values stabilize after early layers and are relatively smooth.

Token Eviction Strategies

H₂O (Heavy-Hitter Oracle): Keep the first and last few tokens (attention sinks) plus a sliding window of recent tokens. Evict middle tokens whose attention scores are low. Achieves 90% cache reduction with <2% perplexity increase on long sequences (32K+ tokens).

StreamingLLM: Similar to H₂O; focus on maintaining sink tokens (often BOS token) that appear in all attention patterns.

SnapKV: Dynamic importance scoring with eviction. Trade computation for cache reduction.

Attention Sink Phenomenon

Empirical finding: The first token(s) in a sequence (typically BOS or padding tokens) receive high attention scores in all layers and positions. These "sinks" are critical to model function but represent a tiny fraction of the sequence. Techniques that identify and preserve sinks while evicting others achieve dramatic cache reduction.

Sliding Window Attention

Mistral's approach: attention computed only over recent N tokens (e.g., 4096), ignoring older history. Reduces cache to constant size regardless of total sequence length.

4K tokens

Mistral window

90%+

Memory saved

<2% PPL

Quality loss

Trade-off: Cannot attend to very distant context. Suitable for many applications (e.g., dialogue, document processing) but not long-context tasks requiring full history.

Speculative Decoding with Cache

Speculative decoding uses a draft model to generate candidates, then verifies with the main model. Both require caches. The draft model (smaller) needs less cache, but total cache is 1.5-2× baseline. Requires careful memory budgeting and synchronization of cache states.

Traditional KV Cache Allocation Problem

Typical allocation: Preallocate contiguous cache buffer for each request's entire maximum length. If request terminates early or generation is shorter than max, that space is wasted and inaccessible to other requests.

Request A (max 8K)

Actually generates 2K tokens. 6K slots wasted.

Request B (max 8K)

Cannot reuse A's wasted space due to contiguous requirement.

Request C (max 4K)

Allocated but competes with A&B for remaining contiguous space.

GPU Memory

60-80% of total cache allocated is unused but blocked from other requests.

PagedAttention Solution

Inspired by OS virtual memory: KV cache divided into fixed-size physical blocks (e.g., 16K tokens per block). Each request has a logical block table that maps to physical blocks. As requests extend their cache, new physical blocks are allocated non-contiguously.

Key Mechanism

Logical blocks decoupled from physical blocks enable fragmentation-free allocation. Multiple requests can share physical blocks; compute kernels perform block-table lookup during attention.

Block Table Architecture

Copy-On-Write & Prefix Caching

If multiple requests share the same prompt (e.g., system prompt for many users), PagedAttention can share the prefix's cached blocks. When a request diverges (e.g., different user input after system prompt), it copies only the diverging portion. This saves significant memory when many requests share context.

2-4×

Throughput improvement

60% → 5%

Memory fragmentation

~10-20%

Compute overhead (block table)

vLLM Implementation

vLLM uses PagedAttention in its core execution model. Each request tracks a block table. During prefill, new blocks are allocated. During decode, the block table is indexed to gather K,V for attention. When a request completes, its blocks are returned to the free pool and can be reused immediately.

Integration with continuous batching: requests enter/leave the batch independently. New requests can immediately utilize freed blocks from completed requests. No synchronization between prefill/decode or between requests—maximum GPU utilization.

Performance Impact

Real vLLM deployments see:

2-4× improvement in requests/second
Batch sizes 4-8× larger at same latency SLO
Memory efficiency: ~80-85% cache utilization vs ~20% with traditional allocation
Prefix caching: additional 1.5-2× improvement when many requests share prompts

Framework Comparison Matrix

vLLM

KV Cache: PagedAttention with continuous batching. Prefix caching for system prompts. Token counting and abort-request support.

TensorRT-LLM

KV Cache: In-flight batching with dynamic buffer management. Fused compute kernels. Heavy NVIDIA optimization.

HuggingFace TGI

KV Cache: Modular approach; uses paged attention variant. Focus on distributed inference and scalability.

VLLM

KV Cache: Alternative paged approach; emphasis on integration with serving frameworks.

Feature Comparison Grid

Framework	PagedAttention	Prefix Caching	Distributed
vLLM	✓	✓	✓
TensorRT-LLM	✓	—	✓
HuggingFace TGI	✓	✓	✓

Deployment Considerations

vLLM: Easiest to deploy; PyTorch-native. Best for single-node setups and research. Excellent developer experience.
TensorRT-LLM: Highest performance on NVIDIA hardware. Complex compilation pipeline. Requires expertise. Best for large-scale production.
HuggingFace TGI: Good balance of usability and performance. Docker-first. Native Hugging Face integration. Suited for enterprises seeking managed inference.

Continuous Batching Strategy

All frameworks employ continuous batching or in-flight batching to maximize utilization:

Requests added to batch dynamically (no waiting for full batch)
Each request progresses independently through prefill/decode
Requests exit when finished; free resources immediately reclaimed
No synchronization barriers between requests—GPU stays busy

Outcome: 3-5× throughput improvement over batched inference where all requests in a batch must complete together.

Cross-Request KV Cache Sharing

LMCache project: external distributed KV store for cross-request cache sharing. Multiple model replicas can query a shared cache backend for previously computed K,V tensors. Enables:

Cache reuse across different users' requests with same prompt
Amortization of expensive prefill across many requests
Significant throughput gains for repetitive workloads (e.g., chatbots with fixed system prompts)

Emerging Pattern

Cache-as-a-service: Treat KV cache as a first-class distributed resource, not just per-request GPU memory.

Disaggregated Inference

Future architecture: separate clusters for prefill and decode phases:

Prefill cluster: Optimized for throughput. Heavy parallelism. Large batch sizes.
Decode cluster: Optimized for latency. Smaller batches. Specialized hardware (e.g., TPU pods).
KV Store: Centralized cache between clusters. Enables asynchronous prefill without waiting for decode capacity.

Benefit: Independent scaling of prefill and decode workloads. If decode becomes bottleneck, add more decode nodes without reprovisioning prefill. Enables resource efficiency and elasticity.

Speculative Decoding & KV Cache

Speculative decoding generates multiple candidate tokens in parallel, then verifies with the main model. KV cache interaction:

Draft model maintains its own KV cache (smaller, faster)
Main model maintains separate KV cache for verification
If draft prediction correct: main model's cache extended
If draft prediction wrong: rollback and retry with fewer candidates

Total cache footprint: 1.5-2× baseline due to dual caches. But token generation speed can improve 2-3× if draft model has high correctness.

Multi-Tier Storage (Future)

Research direction: KV cache tiering across heterogeneous storage:

Hot (GPU VRAM)

Recent decode tokens. High-speed access needed for every step.

Warm (CPU RAM)

Older prefill tokens. Accessed occasionally. Enables larger effective cache.

Cold (SSD/NVMe)

Very old sequences. Rarely accessed. Enables serving 1M+ token sequences.

Trade-off: Latency increase from accessing lower tiers; throughput gains from serving longer sequences with more concurrent requests.

Hardware Co-Design

New hardware optimized for KV cache operations:

Specialized memory hierarchies for cache-heavy workloads
Hardware block table accelerators
Efficient gather/scatter for non-contiguous cache access
In-memory compute to avoid cache movement

Timeline of Key Developments

2017

Original Transformer paper. KV cache becomes standard optimization.

2019

MQA introduced. First attempt at head sharing to reduce cache.

2021

Flash Attention. Improves attention compute efficiency.

2023

vLLM PagedAttention. Major breakthrough in cache fragmentation.

2023

GQA widespread. LLaMA-2, Mistral adopt grouped-query design.

2024

LMCache, disaggregated inference emerge. Cross-request caching exploration.

2025+

Hardware specialized for KV cache. Multi-tier storage exploration.

09 · Sources

Sources & References

Foundational Transformer Paper

Vaswani et al. — Attention Is All You Need (2017) — Seminal paper introducing the Transformer architecture with self-attention mechanisms. Foundation for understanding KV-cache optimization.

Efficiency & KV-Cache Analysis

Pope et al. — Efficiently Scaling Transformers: A Comprehensive Survey — Comprehensive survey of transformer scaling techniques, memory optimization, and KV-cache strategies.

Chen et al. — The State and Fate of Linguistic Diversity and Multilingualism in the EU — Analysis of attention patterns and memory requirements in large language models.

Large Language Model Architecture

Child et al. — Generating Long Sequences with Sparse Transformers (2019) — Techniques for efficient attention with sparse patterns, motivating KV-cache optimizations.

Raffel et al. — Exploring the Limits of Transfer Learning with Unified Text-to-Text Transformer (2020) — T5 model analysis with discussion of efficient decoding and attention optimization.

Reference & Implementation

Wikipedia: Transformer (Machine Learning) — Comprehensive overview of transformer architecture and self-attention mechanisms.