The Transformer consists of stacked encoders and decoders, each with 6 identical layers.
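Each encoder layer has two sub-layers: multi-head self-attention and a position-wise feed-forward network. Each decoder layer adds a third: cross-attention over the encoder output.
Each layer refines the representations, building a hierarchy of increasingly abstract features.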
Tokens are first embedded into d_model=512 dimensions, then summed with positional encodings of the same size.
The attention mechanism is order-agnostic: it has no inherent notion of sequence position. Positional encodings inject that information.
Without positions, "The cat sat" and "sat cat The" look identical to the model.
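The order-agnostic behaviour can be checked numerically. The sketch below is a minimal illustration, assuming identity projections (Q = K = V = x) and random vectors as stand-ins for the three token embeddings; the names `self_attention` and `perm` are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Assumption: identity projections (Q = K = V = x) to keep the sketch short.
    weights = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return weights @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))    # stand-ins for "The", "cat", "sat"
perm = np.array([2, 1, 0])         # reordered as "sat", "cat", "The"

out = self_attention(x)
out_permuted = self_attention(x[perm])

# Reordering the input only reorders the output rows; nothing else changes.
same_up_to_order = np.allclose(out[perm], out_permuted)
```

Each token receives exactly the same output vector in both orderings, so without positional information the model has no signal to tell the two sentences apart.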
Position information uses sine and cosine waves at different frequencies.
Even dimensions use sin, odd dimensions use cos. This creates a unique signal for each position.
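Relative positions are easy to express: for any fixed offset k, PE(pos+k) is a linear function of PE(pos). The original paper hypothesized that this may also let the model extrapolate to sequences longer than those seen in training.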
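The sinusoidal scheme can be sketched in a few lines of NumPy. This is a minimal illustration that assumes an even d_model and the base 10000 from the original paper; the function name is illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int = 512) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2): 0, 2, 4, ...
    # One frequency per (sin, cos) pair; higher dimensions oscillate more slowly.
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cos
    return pe

pe = sinusoidal_positional_encoding(max_len=50)

# Every position gets a distinct row (checked after rounding to 6 decimals).
num_unique_rows = np.unique(pe.round(6), axis=0).shape[0]
```

The resulting matrix is added element-wise to the token embeddings, so it must share their width (512 here).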
Modern variants: GPT-2 uses learned positional embeddings; LLaMA uses rotary position embeddings (RoPE).
For each query, compute similarity to all keys, then use softmax to turn the scores into weights that sum to 1. The output is the weighted sum of the values.
Q (queries): what am I looking for? K (keys): what are you? V (values): what do you contain?
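The query/key/value computation can be sketched as follows. It is a single-head, unbatched illustration with random inputs; the sequence lengths (3 queries, 5 keys) are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of every query to every key
    weights = softmax(scores)         # each row is a distribution over the keys
    return weights @ V, weights       # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 64))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` says how much one query draws from each value vector.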
The 1/√d_k scaling prevents the softmax from becoming too peaked.
Without scaling, the variance of the dot products grows with d_k. Large-magnitude scores push softmax into saturation (tiny gradients).
With d_k=64 and query/key components of mean 0 and variance 1: unscaled scores have μ≈0, σ≈8. After softmax, one weight tends to dominate.
Dividing by √d_k=8 keeps variance stable: σ ≈ 1, softmax distributes probability more evenly.
Better gradient flow during backprop; model learns more nuanced attention patterns.
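The σ≈8 versus σ≈1 figures can be reproduced with a quick simulation. This sketch assumes query and key components are independent with mean 0 and variance 1; the sample count and the choice of 10 keys are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_k, n_samples = 64, 50_000

# Assumption: query and key components are independent, mean 0, variance 1.
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

raw = (q * k).sum(axis=1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)     # divided by sqrt(d_k) = 8

raw_std, scaled_std = raw.std(), scaled.std()

# Treat the first 10 scores as one query's scores against 10 keys.
raw_peak = softmax(raw[:10]).max()
scaled_peak = softmax(scaled[:10]).max()
```

The measured standard deviations land near 8 and 1, and the largest softmax weight is smaller once the scores are scaled.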
Instead of one large attention head, use h=8 parallel heads, each with d_k = d_v = 64.
Each head sees its own 64-dim projection of Q, K, V. All head outputs are concatenated (8 × 64 = 512) and passed through a final linear projection.
This allows the model to attend to different subspaces simultaneously.
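A compact sketch of the split-attend-concatenate pattern follows. It assumes the per-head projections are stacked into single (512, 512) matrices, a common implementation shortcut; the weights are random placeholders, and biases and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads=8):
    """x: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_k = d_model // num_heads                          # 512 // 8 = 64

    def split_heads(t):
        # (n, d_model) -> (num_heads, n, d_k)
        return t.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (num_heads, n, n)
    heads = softmax(scores) @ V                         # (num_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # 8 x 64 -> 512
    return concat @ W_o                                 # final output projection

rng = np.random.default_rng(0)
d_model, n = 512, 10
x = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out = multi_head_self_attention(x, W_q, W_k, W_v, W_o)
```

Each head computes its own (n, n) attention pattern, which is what lets different heads focus on different relationships.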
Self-Attention (Encoder): Q, K, V all come from the same sequence. Each token attends to all tokens.
Cross-Attention (Decoder→Encoder): Q from decoder, K and V from encoder. Decoder accesses encoder context.
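The only difference between the two variants is where Q, K, and V come from. The sketch below uses a single head and, purely for brevity, reuses one set of random projection matrices for both calls; a real model would keep separate weights for each attention block. The sequence lengths (7 source tokens, 4 target tokens) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries_from, keys_values_from, W_q, W_k, W_v):
    Q = queries_from @ W_q
    K = keys_values_from @ W_k
    V = keys_values_from @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
encoder_out = rng.standard_normal((7, d_model))   # 7 source tokens
decoder_in = rng.standard_normal((4, d_model))    # 4 target tokens
W_q, W_k, W_v = (
    rng.standard_normal((d_model, d_k)) / np.sqrt(d_model) for _ in range(3)
)

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_weights = attention(encoder_out, encoder_out, W_q, W_k, W_v)
# Cross-attention: queries from the decoder, keys and values from the encoder.
cross_out, cross_weights = attention(decoder_in, encoder_out, W_q, W_k, W_v)
```

The self-attention weights form a square (7, 7) matrix, while the cross-attention weights are (4, 7): one row per target token, one column per source token.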
After attention, each position independently goes through a two-layer MLP: 512 → 2048 → 512.
The hidden layer (2048 dims) is 4× wider than d_model. This adds model capacity and non-linearity.
The same MLP is applied to every position independently: parameters are shared across positions, but each layer has its own.
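The position-wise feed-forward network can be sketched as below, using the ReLU activation from the original paper. The weight initialisation (scale 0.02) and the sequence length of 10 are illustrative assumptions.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (n, 512) -> (n, 512) through a 2048-wide hidden layer with ReLU."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # (n, 2048)
    return hidden @ W2 + b2                 # (n, 512)

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal((10, d_model))
out = position_wise_ffn(x, W1, b1, W2, b2)

# Position independence: row 3 of the full result equals the FFN applied to row 3 alone.
single = position_wise_ffn(x[3:4], W1, b1, W2, b2)
rows_match = np.allclose(out[3], single[0])

# Parameter count for one FFN: 512*2048 + 2048 + 2048*512 + 512.
num_params = W1.size + b1.size + W2.size + b2.size
```

Because no information moves between rows here, all mixing across positions happens in the attention sub-layer.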
Each sub-layer (attention, FFN) is wrapped with a residual connection followed by layer norm: LayerNorm(x + Sublayer(x)).
Layer Norm: normalizes features per position (not across batch). Stabilizes training.
Residual: x + f(x) allows gradients to flow directly to earlier layers. Eases training of deep networks.
Residuals + layer norm enable stable training of 6-deep networks without degradation.
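The residual-plus-normalization wrapper can be sketched as follows. It uses the post-norm arrangement of the original Transformer; the `tanh` sub-layer is only a stand-in for attention or the FFN, and `eps` and the input scaling are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature axis of each position, not across the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def sublayer_connection(x, sublayer, gamma, beta):
    # Post-norm arrangement: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 512
x = rng.standard_normal((10, d_model)) * 5.0 + 3.0   # deliberately off-scale input
gamma, beta = np.ones(d_model), np.zeros(d_model)
W = rng.standard_normal((d_model, d_model)) * 0.02

# Stand-in sub-layer; a real layer would use attention or the FFN here.
out = sublayer_connection(x, lambda h: np.tanh(h @ W), gamma, beta)
```

With gamma = 1 and beta = 0, every output row has mean close to 0 and standard deviation close to 1, regardless of the scale of the input.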
During inference, the decoder generates output one token at a time. During training it sees the whole target sequence at once, so it must be prevented from attending to future tokens.
Causal mask: set attention scores to -∞ for all future positions. After softmax, weights become 0.
This ensures the model generates tokens sequentially without peeking ahead.
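The masking step can be sketched with a boolean upper-triangular matrix. The scores here are random placeholders and the sequence length of 5 is arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # True above the diagonal marks future positions (key index > query index).
    return np.triu(np.ones((n, n), dtype=bool), k=1)

rng = np.random.default_rng(0)
n = 5
scores = rng.standard_normal((n, n))           # placeholder attention scores

masked_scores = np.where(causal_mask(n), -np.inf, scores)
weights = softmax(masked_scores)               # exp(-inf) = 0 for future positions
```

Row i of `weights` is non-zero only in columns 0 through i, so the first token can attend only to itself.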
Hardware: 8 NVIDIA P100 GPUs. Speed: ~4,500 tokens/sec per GPU.
Total training time: 3.5 days for big model (12 hours for base).
Metric: BLEU score (bilingual evaluation understudy). Higher = better translation quality.
The big Transformer achieves 28.4 BLEU on WMT 2014 EN→DE and 41.8 BLEU on EN→FR.
Outperforms previous RNN/CNN baselines. Faster training. Better parallelization than sequential RNNs.
Self-attention is inherently parallel. Unlike in LSTMs, all positions are processed simultaneously.
GPT / Decoder-Only: no encoder. Generate next token autoregressively. Scales to 100B+ parameters.
BERT / Encoder-Only: no decoder. Bidirectional context. Powerful for classification, understanding tasks.
T5 / Encoder-Decoder: keeps both. Good for sequence-to-sequence (translation, summarization).
Transformer is a foundation. Different tasks demand different masking & architectures.
Learned PE (GPT-2): treat position embeddings as learnable parameters. Simple and effective, but tied to a fixed maximum sequence length.
RoPE (LLaMA): rotate query/key vectors by angle proportional to position. Enables extrapolation to longer sequences.
ALiBi (no sinusoid): add a bias to attention scores that grows linearly with query-key distance. Even simpler: no positional encoding is added to the embeddings.
Trend: moving away from fixed sinusoids toward learned, rotary, or distance-based schemes.
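The rotary and distance-bias ideas can be sketched as below. The RoPE function assumes the interleaved-pair convention (dimensions 0 and 1 rotate together, then 2 and 3, and so on) with base 10000; some implementations pair dimensions differently. The ALiBi function covers only past positions and a single head with an illustrative slope of 0.5; future positions would be handled by the causal mask.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... of a 1-D vector."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def alibi_bias(n, slope):
    """(n, n) bias added to attention scores: -slope * distance to each earlier key."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    return -slope * np.maximum(i - j, 0)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same offset (2) at different absolute positions gives the same score.
score_near = rope(q, 3) @ rope(k, 1)
score_far = rope(q, 10) @ rope(k, 8)

bias = alibi_bias(4, slope=0.5)
```

With RoPE the query-key score depends only on the relative offset, and with ALiBi the penalty is a fixed function of distance, which is why neither needs a separate positional embedding table.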
1. Self-Attention is Parallelizable: Unlike RNNs, Transformers process all positions simultaneously. Massive speedup.
2. Multi-Head Attention: h=8 heads, each projecting to d_k=64 dims, allow the model to attend to different subspaces.
3. Positional Encoding: Injects sequence order via sinusoids, learned embeddings, or rotation. Critical for understanding position.
4. Residual + Layer Norm: Enable training of 6-deep stacks. Gradients flow directly through skip connections.
5. Simplicity & Generality: Encoder-decoder, decoder-only (GPT), encoder-only (BERT) all use the same building block.
6. Scaling Laws: Transformer performance scales smoothly with model size and data. Foundation for modern LLMs.
Parallelism: GPU-friendly. RNNs are sequential; Transformers are parallel.
Long-Range Dependencies: Attention gives an O(1) path length between any two positions. In an RNN the path grows as O(n), which aggravates vanishing gradients.
Scaling: Performance improves consistently with more data, compute, and parameters. Enables billion-parameter models.
Interpretability: Attention weights show which tokens each head attends to for a given prediction. Easier to inspect than RNN hidden states, though not a complete explanation of model behavior.
Flexibility: Same architecture adapted for language, vision, speech, code. Universal building block.