Attention Is All You Need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
NeurIPS 2017 · The Transformer Architecture
Outline 1

What We'll Cover

  • I Core Architecture & Positional Encoding
  • II Scaled Dot-Product & Multi-Head Attention
  • III Feed-Forward, Layer Norm & Masking
  • IV Results & Modern Variants
§3.1 2

The Encoder-Decoder Stack

6 identical layers of self-attention and feed-forward networks
I
§3.1 · Overview 3

Encoder-Decoder Architecture

The Transformer consists of an encoder stack and a decoder stack, each with N=6 identical layers.

Encoder: N=6 identical layers, d_model=512 dimensions
Decoder: N=6 identical layers, d_model=512 dimensions

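Each encoder layer has two sub-layers: multi-head self-attention and a position-wise feed-forward network. Each decoder layer adds a third sub-layer, encoder-decoder cross-attention, between the two.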

Why stacking?

Each layer refines the representations from the layer below, building a hierarchy of increasingly abstract features.

Figure: Encoder-decoder stack. Encoder (N=6 layers): multi-head self-attention, feed-forward (d_model=512). Decoder (N=6 layers): masked self-attention, encoder-decoder cross-attention over the encoder context.
§3.1 · Input 4

Embeddings + Positional Encoding

Tokens are first embedded into d_model=512 dimensions, then summed with positional encodings of the same size.

Embedding e(token) ∈ ℝ^512

The attention mechanism is order-agnostic: it doesn't inherently know sequence position. Positional encodings inject position information.

Why needed?

Without positions, self-attention treats the input as an unordered set: "The cat sat" and "sat cat The" yield the same per-token representations, just in a different order.

Figure: Input pipeline. Token IDs [2, 42, 187, ...] → token embedding (d_model=512) + positional encoding PE(pos, 2i) → encoder input of shape (seq, 512).
§3.1 · Positional Enc. 5

Sinusoidal Positional Encoding

Position information uses sine and cosine waves at different frequencies.

Formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Even dimensions use sin, odd dimensions use cos. This creates a unique signal for each position.
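The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' code; the function name `sinusoidal_encoding` is an assumption made here, and d_model defaults to the 512 used on these slides.

```python
import math

def sinusoidal_encoding(pos, d_model=512):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        # Each (sin, cos) pair shares one frequency; frequency falls as i grows.
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even dimension
        pe[2 * i + 1] = math.cos(angle)  # odd dimension
    return pe

pe0 = sinusoidal_encoding(0)
pe1 = sinusoidal_encoding(1)
print(pe0[:4])  # [0.0, 1.0, 0.0, 1.0]
print(len(pe1))  # 512
```

At pos=0 every sine is 0 and every cosine is 1, which is why the first row of the heatmap is a flat alternating pattern.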

Advantage

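For any fixed offset k, PE(pos+k) is a linear function of PE(pos), which the authors hypothesized would make it easy to learn attention by relative position. Sinusoids may also let the model extrapolate to sequences longer than those seen in training.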
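One way to see why relative offsets are easy to express: shifting a position by k rotates each (sin, cos) pair by an angle that depends only on k, not on pos. The sketch below checks this numerically using the angle-addition identities. The helper names `pe_pair` and `shift_pair` are hypothetical, introduced only for this illustration.

```python
import math

def frequency(i, d_model=512):
    """Angular frequency of the i-th (sin, cos) pair."""
    return 1.0 / (10000 ** (2 * i / d_model))

def pe_pair(pos, i, d_model=512):
    """The (even, odd) components of the encoding for pair index i."""
    w = frequency(i, d_model)
    return math.sin(pos * w), math.cos(pos * w)

def shift_pair(pair, k, i, d_model=512):
    """Rotate a (sin, cos) pair by k positions without knowing pos."""
    w = frequency(i, d_model)
    s, c = pair
    return (s * math.cos(k * w) + c * math.sin(k * w),
            c * math.cos(k * w) - s * math.sin(k * w))

pos, k, i = 17, 5, 3
shifted = shift_pair(pe_pair(pos, i), k, i)
direct = pe_pair(pos + k, i)
print(all(abs(a - b) < 1e-9 for a, b in zip(shifted, direct)))  # True
```

The rotation coefficients depend only on k and the pair's frequency, so a single linear map takes the encoding of any position to the encoding k steps later.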

Modern: GPT-2 uses learned PE; LLaMA uses RoPE.

Figure: PE heatmap (position × dimension), starting at pos=0. Higher frequencies oscillate rapidly; lower frequencies change slowly.
§3.2 6

Scaled Dot-Product Attention

"Attention is a soft dictionary lookup: find the most relevant keys for each query."
II
§3.2 · Core Formula 7

Scaled Dot-Product Attention

For each query, compute similarity to all keys, then use softmax to create a probability distribution over values.

Attention formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V

Q (queries): what am I looking for? K (keys): what are you? V (values): what do you contain?

The √d_k scaling keeps the dot products from growing with d_k, which would otherwise make the softmax too peaked.
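The attention formula can be sketched directly with Python lists. This is a minimal, unbatched illustration with no masking or learned projections; the function names are assumptions made here, and a real implementation would use a tensor library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]   # two identical keys
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))  # [[5.0, 5.0]]
```

With two identical keys the weights are 0.5 each, so the output is the plain average of the two value vectors.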

Figure: Attention flow. Q, K, V → scores QK^T / √d_k → softmax weights → output (weighted sum of values).
§3.2 · Scaling 8

Why Scale by √d_k?

Without scaling, dot products grow with d_k. Large dot products push softmax into saturation (tiny gradients).

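With d_k=64 and query/key components that are independent with mean 0 and variance 1, unscaled scores have mean μ=0 and standard deviation σ=√64=8. After softmax, one value dominates.

Dividing by √d_k=8 brings σ back to about 1, so softmax distributes probability more evenly.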
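The variance claim can be checked empirically. The sketch below assumes query and key components are independent standard normal draws, which is an idealization and not a property of trained weights. The sample count and seed are arbitrary choices made here.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
d_k = 64

def random_dot(d):
    """Dot product of two random vectors with unit-variance components."""
    q = [random.gauss(0.0, 1.0) for _ in range(d)]
    k = [random.gauss(0.0, 1.0) for _ in range(d)]
    return sum(qi * ki for qi, ki in zip(q, k))

unscaled = [random_dot(d_k) for _ in range(2000)]
scaled = [s / math.sqrt(d_k) for s in unscaled]

std_unscaled = statistics.pstdev(unscaled)
std_scaled = statistics.pstdev(scaled)
print(f"unscaled std: {std_unscaled:.2f}")  # close to sqrt(64) = 8
print(f"scaled std:   {std_scaled:.2f}")    # close to 1
```

Each of the 64 products has variance 1, so their sum has variance 64; dividing by √64 restores unit variance regardless of d_k.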

Result

Better gradient flow during backprop; model learns more nuanced attention patterns.

Figure: Softmax distribution. Unscaled (peaked): large variance, one key dominates. Scaled (÷√64, distributed): stable variance, balanced attention.
§3.2 · Multi-Head 9

Multi-Head Attention (h=8 heads)

Instead of one large attention head, use h=8 parallel heads, each with d_k = d_v = 64.

Each head sees its own learned 64-dim projection of Q, K, V. The head outputs are concatenated: 8 × 64 = 512.

Multi-head:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
MultiHead(Q,K,V) = Concat(head_1,...,head_8)W^O

This allows the model to attend to different subspaces simultaneously.
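The dimensions above imply a simple parameter budget. The arithmetic below is a sketch that counts only the projection matrices named in the formulas and ignores any bias terms, so it may differ slightly from a specific implementation.

```python
d_model = 512
h = 8
d_k = d_v = d_model // h  # 64 per head

# Per head: W_i^Q and W_i^K are (d_model x d_k), W_i^V is (d_model x d_v).
per_head = 2 * d_model * d_k + d_model * d_v

# Output projection W^O maps the concatenation (h * d_v) back to d_model.
concat_dim = h * d_v
w_o = concat_dim * d_model

total = h * per_head + w_o
print(concat_dim)  # 512
print(total)       # 1048576
```

Because d_k = d_model / h, the total equals 4 × d_model², the same as a single full-width head with an output projection: splitting into heads changes how the capacity is used, not how much there is.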

Figure: Multi-head attention (h=8). Input (seq_len, 512) → heads h1 ... h8, 64 dims each → concat + linear W^O (8×64 → 512) → output (seq_len, 512), same shape as input.
§3.2 · Variants 10

Self-Attention vs Cross-Attention

Self-Attention (Encoder): Q, K, V all come from the same sequence. Each token attends to all tokens.

Self-attention:
Q = X W_Q
K = X W_K
V = X W_V

Cross-Attention (Decoder→Encoder): Q from decoder, K and V from encoder. Decoder accesses encoder context.

Cross-attention:
Q = decoder_X W_Q
K = encoder_X W_K
V = encoder_X W_V
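The practical difference shows up in the shape of the score matrix. The sketch below uses toy sizes (d=4, five encoder tokens, three decoder tokens) and identity projection matrices, all of which are assumptions chosen for illustration.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention_scores(X_q, X_kv, W_Q, W_K):
    """Unnormalized scores Q K^T; queries come from X_q, keys from X_kv."""
    Q = matmul(X_q, W_Q)
    K = matmul(X_kv, W_K)
    K_T = [list(col) for col in zip(*K)]
    return matmul(Q, K_T)

d = 4
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
encoder_X = [[float(t + j) for j in range(d)] for t in range(5)]  # 5 source tokens
decoder_X = [[float(t * j) for j in range(d)] for t in range(3)]  # 3 target tokens

self_scores = attention_scores(encoder_X, encoder_X, identity, identity)
cross_scores = attention_scores(decoder_X, encoder_X, identity, identity)
print(len(self_scores), len(self_scores[0]))    # 5 5
print(len(cross_scores), len(cross_scores[0]))  # 3 5
```

Self-attention scores are square (every token against every token of the same sequence), while cross-attention scores have one row per decoder position and one column per encoder position.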
Figure: Attention types. Self-attention: one sequence. Cross-attention: decoder supplies queries, encoder supplies K and V.
§3.3 11

Feed-Forward & Layer Norm

The feed-forward sub-layer, plus the residual and normalization wrapper around every sub-layer
III
§3.3 · Feed-Forward 12

Position-Wise Feed-Forward

After attention, each position independently goes through a two-layer MLP: 512 → 2048 → 512.

FFN: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

The hidden layer (2048 dims) is 4× wider than d_model. This adds model capacity and non-linearity.

Why "position-wise"?

The same MLP is applied independently to each position. Its parameters are shared across the sequence, but not across layers.
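A toy version of the position-wise feed-forward network makes the parameter sharing concrete. The sizes here (d_model=2, hidden width 4) and the hand-picked weights are assumptions for illustration; the paper's sizes are 512 and 2048.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    # zip(*W1) iterates over the columns of W1, one per hidden unit.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hj * w for hj, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same FFN, with the same weights, to every position."""
    return [ffn(x, W1, b1, W2, b2) for x in X]

W1 = [[1.0, -1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0]]   # shape (d_model=2, d_ff=4)
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # shape (4, 2)
b2 = [0.0, 0.0]

X = [[3.0, -2.0], [3.0, -2.0], [0.5, 1.0]]  # three positions
Y = position_wise_ffn(X, W1, b1, W2, b2)
print(Y[0] == Y[1])  # True: identical inputs give identical outputs
```

Positions 0 and 1 hold the same vector and therefore produce the same output, since no information moves between positions inside this sub-layer.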

Figure: Feed-forward block. Input (seq_len, d_model=512) → linear W_1 + ReLU (512 → 2048) → linear W_2 (2048 → 512) → output (seq_len, 512).
§3.3 · Norm & Skip 13

Layer Normalization & Residual Connections

Each sub-layer (attention, FFN) is wrapped with residual connection + layer norm.

Sub-layer output: LayerNorm(x + Sublayer(x))

Layer Norm: normalizes features per position (not across batch). Stabilizes training.

Residual: x + f(x) allows gradients to flow directly to earlier layers. Eases training of deep networks.
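The wrapper LayerNorm(x + Sublayer(x)) can be sketched for a single position as follows. This simplified version omits the learned gain and bias that layer normalization normally carries, and the `eps` value and function names are assumptions made here.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's features to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_block(x, sublayer):
    """Post-norm wrapper: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    residual_sum = [a + b for a, b in zip(x, y)]  # skip connection
    return layer_norm(residual_sum)

x = [1.0, 2.0, 3.0, 4.0]
out = sublayer_block(x, lambda v: [2.0 * a for a in v])  # stand-in sub-layer
mean_out = sum(out) / len(out)
var_out = sum((v - mean_out) ** 2 for v in out) / len(out)
print(round(mean_out, 6), round(var_out, 3))
```

The statistics are computed over the feature dimension of one position only, which is why the result does not depend on batch size or on other positions.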

Why stack 6 layers?

Residuals + layer norm are what make the 6-layer stacks trainable without degradation.

Figure: Residual + norm block. x follows the residual path around multi-head attention (or feed-forward), is added back, then passes through LayerNorm to the next layer.
§3.3 · Decoder 14

Decoder: Causal Masking

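The decoder is autoregressive: at inference it generates output one token at a time. During training the full target sequence is available, so the decoder must be prevented from attending to future tokens.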

Causal mask: set attention scores to -∞ for all future positions. After softmax, weights become 0.

Masking: scores_ij ← -∞ if j > i
(attend only to positions ≤ the current position)

This ensures the model generates tokens sequentially without peeking ahead.
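The masking rule is short enough to sketch directly. The example below applies it to a 3×3 score matrix of zeros so the resulting weights are easy to read; the function names are assumptions made for this illustration.

```python
import math

def causal_mask(scores):
    """Set scores[i][j] to -inf wherever key position j is after query position i."""
    n = len(scores)
    return [[scores[i][j] if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

def softmax(xs):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0] * 3 for _ in range(3)]
weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Every row still sums to 1, but the probability mass is confined to the lower triangle, so position i never receives information from positions after it.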

Figure: Causal attention mask (query position × key position). Lower triangle = allowed; upper triangle = masked (-∞). A token can only attend to past positions and itself.
§5 15

Experiments & Results

Machine translation at scale: WMT 2014 English-German and English-French
IV
§5 · Training 16

Training & Evaluation

Hardware: 8 NVIDIA P100 GPUs. Speed: ~4,500 tokens/sec per GPU.

Total training time: 3.5 days for big model (12 hours for base).

Datasets: WMT 2014 English-German and WMT 2014 English-French, with training distributed across multiple GPUs.

Metric: BLEU score (bilingual evaluation understudy). Higher = better translation quality.

Figure: Training timeline for the big model, 0h to 84h (3.5 days), on 8 × P100 at ~4.5k tokens/sec.
§5 · Results 17

BLEU Scores: State-of-the-Art

Transformer achieves 28.4 BLEU on EN→DE and 41.8 BLEU on EN→FR.

Big model results:
EN→DE: 28.4 BLEU (best)
EN→FR: 41.8 BLEU (best)
Training: 3.5 days

Outperforms previously reported RNN- and CNN-based models at a fraction of their training cost, because self-attention parallelizes far better than sequential RNNs.

Key Win

Self-attention is inherently parallel. Unlike LSTMs, all positions are processed simultaneously during training.

Figure: BLEU comparison against the previous state of the art. EN-DE: 28.4; EN-FR: 41.8.
§5 · Modern 18

Modern Variants: Decoder-Only & Encoder-Only

GPT / Decoder-Only: no encoder. Generate next token autoregressively. Scales to 100B+ parameters.

BERT / Encoder-Only: no decoder. Bidirectional context. Powerful for classification, understanding tasks.

T5 / Encoder-Decoder: keeps both. Good for sequence-to-sequence (translation, summarization).

Paradigm

Transformer is a foundation. Different tasks demand different masking & architectures.

Figure: Model families. GPT: decoder, autoregressive (LLMs, chat). BERT: encoder, bidirectional (classification). T5: encoder-decoder, unified (translation).
§5 · PE Variants 19

Modern Positional Encoding Variants

Learned PE (GPT-2): treat position embeddings as learnable parameters. Simple, effective for fixed sequence lengths.

RoPE (LLaMA): rotate query/key vectors by angle proportional to position. Enables extrapolation to longer sequences.

ALiBi (no sinusoid): adds a bias to attention scores that grows linearly with the distance between query and key. Even simpler: no position embeddings are added to the input.

Trend

Moving away from explicit sinusoids toward implicit, learnable, or distance-based schemes.

Figure: PE methods. Sinusoidal (original): fixed. Learned (GPT-2): flexible. RoPE (LLaMA): rotation.

Key insight: relative position often matters more than absolute position. RoPE and ALiBi encode the distance between tokens rather than raw position indices.
Summary 20

Key Takeaways

1. Self-Attention is Parallelizable: Unlike RNNs, Transformers process all positions of a sequence simultaneously during training, which greatly shortens training time.

2. Multi-Head Attention: h=8 heads, each projecting to d_k=64 dims, allow the model to attend to different subspaces.

3. Positional Encoding: Injects sequence order via sinusoids, learned embeddings, or rotation. Without it, attention cannot distinguish word order.

4. Residual + Layer Norm: Enable training of 6-deep stacks. Gradients flow directly through skip connections.

5. Simplicity & Generality: Encoder-decoder, decoder-only (GPT), encoder-only (BERT) all use the same building block.

6. Scaling Laws: Transformer performance scales smoothly with model size and data. Foundation for modern LLMs.

Conclusion 21

Why Transformers Won

Parallelism: GPU-friendly. RNNs are sequential; Transformers are parallel.

Long-Range Dependencies: Attention has an O(1) path length between any two positions. In an RNN the path is O(n), so long-range signals must survive many recurrent steps and gradients tend to vanish.

Scaling: Performance improves consistently with more data, compute, and parameters. Enables billion-parameter models.

Interpretability: Attention weights can be inspected to see which tokens each position attends to, which offers some insight that RNN hidden states do not.

Flexibility: Same architecture adapted for language, vision, speech, code. Universal building block.

Timeline: 2017 Transformer · 2018-19 BERT, GPT-2 · 2020-22 GPT-3, T5 · 2023+ LLMs

One paper changed machine learning forever.