Attention Is All You Need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin

NeurIPS 2017 · The Transformer Architecture

Outline 1

What We'll Cover

I Core Architecture & Positional Encoding
II Scaled Dot-Product & Multi-Head Attention
III Feed-Forward & Layer Norm
IV Results & Modern Variants

§3.1 2

The Encoder-Decoder Stack

6 identical layers of self-attention and feed-forward networks

I

§3.1 · Overview 3

Encoder-Decoder Architecture

The Transformer consists of an encoder stack and a decoder stack, each with N=6 identical layers.

Encoder N=6 identical layers
d_model=512 dimensions

Decoder N=6 identical layers
d_model=512 dimensions

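Each encoder layer has two sub-layers: multi-head self-attention + feed-forward. Each decoder layer adds a third, between the two: multi-head attention over the encoder output.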

Why stacking?

Each layer refines the representations of the previous one, building a hierarchy of increasingly abstract features.

§3.1 · Input 4

Embeddings + Positional Encoding

Tokens are first embedded into d_model=512 dimensions, then summed element-wise with positional encodings of the same dimension.

Embedding e(token) ∈ ℝ^512

The attention mechanism is order-agnostic: it has no inherent notion of sequence position. Positional encodings inject that information.

Why needed?

Without positions, "The cat sat" and "sat cat The" are the same set of tokens to the model: each token gets the same representation in either order.
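The order-blindness can be checked on a toy example. This is a minimal sketch, assuming bare self-attention with Q = K = V and no learned projections; the 2-dim embeddings and the helper names (`self_attention`, `encode`) are made up for illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Bare self-attention with Q = K = V = x and no positional signal."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Made-up 2-dim embeddings; real models use learned 512-dim vectors.
emb = {"The": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}

def encode(tokens):
    """Map each token to its attention output for the given word order."""
    return dict(zip(tokens, self_attention([emb[t] for t in tokens])))

a = encode(["The", "cat", "sat"])
b = encode(["sat", "cat", "The"])

# Compare each token's output vector across the two orderings.
same = all(
    math.isclose(u, v, abs_tol=1e-12)
    for t in emb
    for u, v in zip(a[t], b[t])
)
print(same)
```

Each token's output is unchanged when the sentence is shuffled, which is why a positional signal has to be added to the embeddings.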

§3.1 · Positional Enc. 5

Sinusoidal Positional Encoding

Position information uses sine and cosine waves at different frequencies.

Formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Even dimensions use sin, odd dimensions use cos. This creates a unique signal for each position.

Advantage

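For any fixed offset k, PE(pos+k) is a linear function of PE(pos), so the authors hypothesized that the model can easily learn to attend by relative position. They also suggest it may extrapolate to sequences longer than those seen in training.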

Modern: GPT-2 uses learned PE; LLaMA uses RoPE.
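The sinusoidal formula can be sketched in a few lines. This is an illustrative implementation, not the authors' code; `positional_encoding` and `shift_pair` are hypothetical helper names. The second helper shows the linear-offset property: each (sin, cos) pair at position pos+k is a rotation of the pair at pos.

```python
import math

def positional_encoding(max_len, d_model=512):
    """Return a max_len x d_model table of sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # i walks over the even dimensions, so i plays the role of 2i in the formula.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def shift_pair(sin_p, cos_p, k, freq):
    """Rotate one (sin, cos) pair forward by k positions at angular frequency freq."""
    return (
        sin_p * math.cos(k * freq) + cos_p * math.sin(k * freq),
        cos_p * math.cos(k * freq) - sin_p * math.sin(k * freq),
    )

pe = positional_encoding(50)
print(len(pe), len(pe[0]))  # table shape: positions x dimensions
```

Because the shift is a fixed rotation per frequency, it does not depend on pos, which is the property behind the relative-position argument.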

§3.2 6

Scaled Dot-Product Attention

"Attention is a soft dictionary lookup: find the most relevant keys for each query."

II

§3.2 · Core Formula 7

Scaled Dot-Product Attention

For each query, compute its similarity to all keys, then use softmax to turn the scores into weights for averaging the values.

Attention Formula Attention(Q, K, V) = softmax(QK^T / √d_k) V

Q (queries): what am I looking for? K (keys): what are you? V (values): what do you contain?

The √d_k scaling prevents the softmax from becoming too peaked.
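The formula maps directly to code. The following is a minimal sketch in plain Python, assuming Q, K, V are lists of row vectors; it omits batching, masking, and learned projections.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention([[0.0, 0.0]], K, V))   # neutral query: plain average of the values
print(attention([[10.0, 0.0]], K, V))  # query aligned with the first key
```

A query that matches no key in particular returns the mean of the values; a query strongly aligned with one key returns almost exactly that key's value, which is the "soft dictionary lookup" reading.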

§3.2 · Scaling 8

Why Scale by √d_k?

Without scaling, the variance of the dot products grows with d_k. Large-magnitude scores push softmax into saturation (tiny gradients).

With d_k=64 and query/key components of mean 0 and variance 1: unscaled scores have μ=0, σ=√64=8. After softmax, the largest score takes nearly all the weight.

Dividing by √d_k=8 brings the scores back to σ ≈ 1, so softmax distributes probability more evenly.

Result

Better gradient flow during backprop; model learns more nuanced attention patterns.
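A quick simulation illustrates the effect. This sketch assumes query and key components are independent standard normals (the same assumption behind σ = √d_k); the sample count, the seed, and the example score vector are arbitrary choices.

```python
import math
import random
import statistics

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_std(d_k, n_samples=2000, seed=0):
    """Empirical standard deviation of q.k for random unit-variance q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

raw_std = dot_product_std(64)         # close to sqrt(64) = 8
scaled_std = raw_std / math.sqrt(64)  # close to 1

# Four scores spread over a few standard deviations, before and after scaling.
raw_scores = [16.0, 8.0, 0.0, -8.0]
scaled_scores = [s / math.sqrt(64) for s in raw_scores]
peak_raw = max(softmax(raw_scores))        # nearly all weight on one key
peak_scaled = max(softmax(scaled_scores))  # weight shared across keys
print(raw_std, scaled_std, peak_raw, peak_scaled)
```

With the unscaled scores the largest softmax weight is essentially 1, so the other keys receive almost no gradient; after dividing by 8 the weight is spread across several keys.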

§3.2 · Multi-Head 9

Multi-Head Attention (h=8 heads)

Instead of one large attention head, use h=8 parallel heads, each with d_k = d_v = d_model / h = 64.

Each head sees its own 64-dim projection of Q, K, V. The 8 outputs are concatenated (8 × 64 = 512) and passed through the output projection W^O.

Multi-Head head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
MultiHead(Q,K,V) = Concat(head_1,...,head_8)W^O

This allows the model to attend to different subspaces simultaneously.
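The two formulas above can be sketched as follows. To keep the pure-Python example fast it uses toy sizes (d_model=16, h=4, so d_k=4) rather than the paper's 512 and 8; the random weights stand in for learned parameters, and all helper names are illustrative.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
            for W_Q, W_K, W_V in heads]
    # Concatenate the per-head outputs position by position, then project.
    concat = [[v for out in outs for v in out[t]] for t in range(len(X))]
    return matmul(concat, W_O)

d_model, h = 16, 4
d_k = d_model // h
rng = random.Random(0)
heads = [tuple(rand_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(h * d_k, d_model, rng)
X = rand_matrix(5, d_model, rng)  # 5 positions, each a d_model-dim vector
Y = multi_head_self_attention(X, heads, W_O)
print(len(Y), len(Y[0]))  # output keeps the input shape: 5 x d_model
```

Since h × d_k equals d_model, the total cost is similar to one full-width head, while each head is free to learn a different attention pattern.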

§3.2 · Variants 10

Self-Attention vs Cross-Attention

Self-Attention (encoder; the decoder uses a masked variant): Q, K, V all come from the same sequence. In the encoder, each token attends to all tokens.

Self-Att. Q = X W_Q
K = X W_K
V = X W_V

Cross-Attention (Decoder→Encoder): Q from decoder, K and V from encoder. Decoder accesses encoder context.

Cross-Att. Q = decoder_X W_Q
K = encoder_X W_K
V = encoder_X W_V
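The only difference between the two variants is where Q, K, and V come from. In this sketch the identity matrix stands in for the learned projections W_Q, W_K, W_V, and the input values are made up; the point is the shapes.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def self_attention(X, W_Q, W_K, W_V):
    # Queries, keys, and values are all projections of the same sequence.
    return attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))

def cross_attention(decoder_X, encoder_X, W_Q, W_K, W_V):
    # Queries come from the decoder; keys and values come from the encoder.
    return attention(matmul(decoder_X, W_Q), matmul(encoder_X, W_K), matmul(encoder_X, W_V))

identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned projections
encoder_X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 source positions
decoder_X = [[1.0, 0.0], [0.0, 1.0]]                          # 2 target positions

self_out = self_attention(encoder_X, identity, identity, identity)
cross_out = cross_attention(decoder_X, encoder_X, identity, identity, identity)
print(len(self_out), len(cross_out))  # one output row per query position
```

The output always has one row per query, so cross-attention returns one vector per decoder position, each a weighted mix of encoder values.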

§3.3 11

Feed-Forward & Layer Norm

The feed-forward sub-layer, plus the residual connection and layer norm wrapped around every sub-layer

III

§3.3 · Feed-Forward 12

Position-Wise Feed-Forward

After attention, each position independently goes through a two-layer MLP: 512 → 2048 → 512.

FFN FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

The hidden layer (2048 dims) is 4× wider than d_model. This adds model capacity and non-linearity.

Why "position-wise"?

The same MLP is applied independently to each position. Parameters are shared across positions within a layer, but differ from layer to layer.
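A toy version of the FFN makes the "position-wise" part concrete. This sketch uses d_model=2 and a hidden width of 4 instead of 512 and 2048, with hand-picked weights; `ffn` and `position_wise` are illustrative names.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def position_wise(seq, W1, b1, W2, b2):
    # The same weights are applied to every position, with no mixing between positions.
    return [ffn(x, W1, b1, W2, b2) for x in seq]

# Hand-picked toy weights: 2 -> 4 -> 2.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, -0.1]

seq = [[1.0, 2.0], [0.0, -1.0]]
out = position_wise(seq, W1, b1, W2, b2)
print(out)
```

Because positions never interact here, all mixing of information across the sequence has to happen in the attention sub-layer.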

§3.3 · Norm & Skip 13

Layer Normalization & Residual Connections

Each sub-layer (attention, FFN) is wrapped in a residual connection followed by layer norm.

Sub-Layer LayerNorm(x + Sublayer(x))

Layer Norm: normalizes features per position (not across batch). Stabilizes training.

Residual: x + f(x) allows gradients to flow directly to earlier layers. Eases training of deep networks.

Why stack 6 layers?

Depth adds capacity, and residuals + layer norm keep activations and gradients well-behaved, so a 6-layer stack trains stably.
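The wrapper LayerNorm(x + Sublayer(x)) can be sketched for a single position vector. This assumes the learned gain and bias are left at their initial values (1 and 0); `post_ln_sublayer` is a hypothetical name, and the doubling function is a stand-in for attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's features to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x)), applied to a single position."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

def double(x):
    # Stand-in for a real sub-layer.
    return [2.0 * v for v in x]

y = post_ln_sublayer([1.0, 2.0, 3.0, 4.0], double)
print(y)
```

The statistics are computed over the features of one position, so the result does not depend on the other positions or on the batch.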

§3.3 · Decoder 14

Decoder: Causal Masking

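The decoder generates output one token at a time, so position i must not attend to later positions. During training the whole target sequence is processed in parallel, and the mask is what enforces this.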

Causal mask: set attention scores to -∞ for all future positions. After softmax, weights become 0.

Masking scores_ij ← -∞ if j > i
(attend only to ≤ current position)

This ensures the model generates tokens sequentially without peeking ahead.
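The masking rule can be sketched directly on a matrix of raw scores. This is a minimal illustration with hypothetical names; row i holds the scores of query position i against every key position j.

```python
import math

def causal_attention_weights(scores):
    """Apply scores_ij <- -inf for j > i, then a row-wise softmax."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # always finite: position i can attend to itself
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores everywhere, so the mask alone shapes the weights.
w = causal_attention_weights([[0.0] * 3 for _ in range(3)])
for row in w:
    print(row)
```

Each row spreads its weight uniformly over the positions up to and including its own, and every entry above the diagonal is exactly zero.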

§5 15

Experiments & Results

Machine translation at scale: WMT 2014 English-German and English-French

IV

§5 · Training 16

Training & Evaluation

Hardware: one machine with 8 NVIDIA P100 GPUs. Step time: ~0.4 s (base model), ~1.0 s (big model).

Total training time: 3.5 days for big model (12 hours for base).

Dataset WMT 2014 English-German (~4.5M sentence pairs)
WMT 2014 English-French (~36M sentence pairs)
Multi-GPU distributed training

Metric: BLEU score (bilingual evaluation understudy). Higher = better translation quality.
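The training times follow from the step counts and step times reported in the paper: 100,000 steps at about 0.4 s for the base model, 300,000 steps at about 1.0 s for the big model. A short check of that arithmetic (the helper name is illustrative):

```python
def training_hours(steps, seconds_per_step):
    """Wall-clock hours for a run with a fixed time per step."""
    return steps * seconds_per_step / 3600

base_hours = training_hours(100_000, 0.4)     # about 11 hours, reported as 12
big_days = training_hours(300_000, 1.0) / 24  # about 3.5 days
print(round(base_hours, 1), round(big_days, 2))
```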

§5 · Results 17

BLEU Scores: State-of-the-Art

Transformer achieves 28.4 BLEU on EN→DE and 41.8 BLEU on EN→FR.

Big Model Results EN→DE: 28.4 BLEU (best)
EN→FR: 41.8 BLEU (best)
Training: 3.5 days

Outperforms previous RNN/CNN baselines. Faster training. Better parallelization than sequential RNNs.

Key Win

Self-attention is inherently parallel. Unlike LSTMs, all positions are processed simultaneously during training (decoding at inference is still token by token).

§5 · Modern 18

Modern Variants: Decoder-Only & Encoder-Only

GPT / Decoder-Only: no encoder. Generate next token autoregressively. Scales to 100B+ parameters.

BERT / Encoder-Only: no decoder. Bidirectional context. Powerful for classification, understanding tasks.

T5 / Encoder-Decoder: keeps both. Good for sequence-to-sequence (translation, summarization).

Paradigm

Transformer is a foundation. Different tasks demand different masking & architectures.

§5 · PE Variants 19

Modern Positional Encoding Variants

Learned PE (GPT-2): treat position embeddings as learnable parameters. Simple, effective for fixed sequence lengths.

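RoPE (LLaMA): rotate query/key vectors by an angle proportional to position, so their dot product depends only on the relative offset. Extrapolating far beyond the trained length usually needs extra techniques.

ALiBi (no position embeddings): add a per-head linear penalty to attention scores that grows with query-key distance. Nothing positional is added to the inputs.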

Trend

Moving away from additive absolute sinusoids toward learned, rotary, or distance-based relative schemes.
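The rotary idea can be illustrated with a small sketch. It assumes the adjacent-pair layout and base 10000 of the original RoPE formulation; production implementations often arrange the pairs differently, and `rope` is a hypothetical helper name. The check at the end shows that the query-key score depends only on the offset between the two positions.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]  # made-up query vector
k = [1.1, 0.4, -0.6, 0.9]  # made-up key vector

score_near = dot(rope(q, 3), rope(k, 7))       # positions 3 and 7, offset 4
score_shifted = dot(rope(q, 13), rope(k, 17))  # positions 13 and 17, offset 4
print(math.isclose(score_near, score_shifted, abs_tol=1e-9))
```

Shifting both positions by the same amount leaves the score unchanged, so position enters attention only through the relative offset.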

Summary 20

Key Takeaways

1. Self-Attention is Parallelizable: Unlike RNNs, Transformers process all positions simultaneously during training, which cuts training time substantially.

2. Multi-Head Attention: h=8 heads, each projecting to d_k=64 dims, allow the model to attend to different subspaces.

3. Positional Encoding: Injects sequence order via sinusoids, learned embeddings, or rotation. Critical for understanding position.

4. Residual + Layer Norm: Enable training of 6-deep stacks. Gradients flow directly through skip connections.

5. Simplicity & Generality: Encoder-decoder, decoder-only (GPT), encoder-only (BERT) all use the same building block.

6. Scaling Laws: Later work found that Transformer performance improves predictably with model size and data. Foundation for modern LLMs.

Conclusion 21

Why Transformers Won

Parallelism: GPU-friendly. RNNs are sequential; Transformers are parallel.

Long-Range Dependencies: Self-attention connects any two positions with an O(1) path, versus O(n) sequential steps in an RNN, where long paths make gradients vanish.

Scaling: Performance improves consistently with more data, compute, and parameters. Enables billion-parameter models.

Interpretability: Attention weights can be inspected to see which tokens a head focuses on, though they are only a partial explanation of a prediction. Still easier to examine than RNN hidden states.

Flexibility: Same architecture adapted for language, vision, speech, code. Universal building block.