Parameter explosion: A 224×224×3 image has about 150K input values (150,528). A single fully-connected hidden layer of 1000 units then requires about 150M parameters.
No spatial structure: Fully-connected layers treat every pixel equally, ignoring that nearby pixels are related.
No translation invariance: A cat in the top-left corner looks different to the network than the same cat in the center.
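The parameter count above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `dense_layer_params` is hypothetical and the bias term is an assumption the original figure does not mention.

```python
def dense_layer_params(n_inputs: int, n_units: int, bias: bool = True) -> int:
    """Learnable parameters in one fully-connected layer (weights plus optional biases)."""
    return n_inputs * n_units + (n_units if bias else 0)

n_inputs = 224 * 224 * 3                      # height x width x RGB channels
params = dense_layer_params(n_inputs, 1000)   # one hidden layer of 1000 units
print(f"{n_inputs:,} inputs -> {params:,} parameters")
```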
Images have local structure. Small filters that detect edges, corners, and textures are far more parameter-efficient than learning pixel-by-pixel relationships.
A kernel (or filter) is a small matrix of learned weights. It slides over the input, computing element-wise products and summing.
Example: A 5×5 kernel with learnable weights detects patterns like edges or textures across the image.
Key benefits: Shared weights across space, local connectivity, fewer parameters, translation invariance.
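To make the sliding-kernel idea concrete, here is a minimal NumPy sketch of a single-channel convolution with stride 1 and no padding (strictly a cross-correlation, which is what deep learning libraries usually compute). The 3×3 vertical-edge kernel is hand-picked for illustration; in a CNN the weights would be learned, and the function name `conv2d_valid` is hypothetical.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` (stride 1, no padding), summing element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every position (weight sharing).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-picked vertical-edge detector (illustrative, not learned).
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)

image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # dark left half, bright right half
response = conv2d_valid(image, edge_kernel)
print(response)
```

The response is non-zero only where the window straddles the dark-to-bright boundary, wherever that boundary sits in the image.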
Stride: How many pixels the kernel moves each step. Stride=1 shifts one pixel at a time; stride=2 shifts two pixels, skipping every other position.
Padding: Adding zero-valued border pixels to preserve spatial dimensions. A 32×32 input with 5×5 kernel and padding=2 produces 32×32 output (without padding, output shrinks to 28×28).
output_size = floor((input_size − kernel_size + 2×padding) / stride) + 1
High stride = smaller output (downsampling). Padding preserves edge information.
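The output-size formula can be wrapped in a small helper to check the 32×32 example. This is a sketch; the function name is hypothetical, and integer (floor) division is assumed when the stride does not divide evenly.

```python
def conv_output_size(input_size: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

same = conv_output_size(32, 5, padding=2)               # padding preserves the size
shrunk = conv_output_size(32, 5, padding=0)             # no padding: output shrinks
halved = conv_output_size(32, 5, padding=2, stride=2)   # stride 2 downsamples
print(same, shrunk, halved)
```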
Pooling layers reduce spatial dimensions by applying a stateless operation (max or average) over small windows.
Max pooling: Selects the maximum value in each window. Captures the strongest feature response.
Average pooling: Computes the average. Smoother, less sensitive to noise.
Benefits: Reduces activation size and computation (and parameters in later fully-connected layers), adds some translation invariance, can reduce overfitting, enlarges receptive fields.
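Max and average pooling over non-overlapping windows can be sketched in NumPy as follows. The function name `pool2d` is hypothetical, and the window size doubles as the stride here, which is a common but not universal choice.

```python
import numpy as np

def pool2d(x: np.ndarray, size: int = 2, mode: str = "max") -> np.ndarray:
    """Pool a 2-D feature map over non-overlapping size x size windows."""
    h_out, w_out = x.shape[0] // size, x.shape[1] // size
    # Trim rows/columns that do not fill a whole window, then group into windows.
    windows = x[:h_out * size, :w_out * size].reshape(h_out, size, w_out, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "avg":
        return windows.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")

feature_map = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)
print(avg_pooled)
```

Note that the operation has no learnable weights; it only summarizes each window.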
The receptive field of a neuron is the region of the input image that influences its output.
In layer 1, a 5×5 convolution has a 5×5 receptive field. But in layer 2, stacking another 5×5 convolution expands the receptive field to 9×9.
Deep networks see the whole image: Shallow layers detect edges; deeper layers see textures, shapes, objects. Large receptive fields are crucial for understanding context.
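The receptive-field growth described above follows a simple recurrence: each layer adds (kernel_size − 1) times the product of all earlier strides. A minimal sketch, with a hypothetical helper name and layers given as (kernel_size, stride) pairs:

```python
def receptive_field(layers):
    """Receptive field (along one dimension) after a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

one_layer = receptive_field([(5, 1)])
two_layers = receptive_field([(5, 1), (5, 1)])
with_pooling = receptive_field([(5, 1), (2, 2), (5, 1)])   # 2x2 pool with stride 2 in between
print(one_layer, two_layers, with_pooling)
```

Inserting a stride-2 pooling layer makes every later layer grow the receptive field twice as fast, which is one reason pooling helps deep layers see more of the image.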
LeNet (1998): Designed by Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner for handwritten digit recognition.
Architecture: 2 convolutional layers + 3 fully-connected layers. Total: ~60K parameters.
Impact: One of the first successful CNNs; demonstrated that local feature learning works. Used in bank check processing.
LeNet proved that convolutions + shared weights could solve real-world vision tasks efficiently.
Input: 32×32 grayscale. Output: 10 classes (digits 0-9).
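The ~60K figure can be reproduced with a rough parameter count. This sketch assumes the commonly cited LeNet-5 layer sizes (6 and 16 feature maps with 5×5 kernels, then 120, 84, and 10 units), which the slides do not state, and it assumes a fully connected second convolution rather than the sparse connection scheme of the original paper.

```python
def conv_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights plus one bias per output channel."""
    return (kernel_size * kernel_size * c_in + 1) * c_out

def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out

# Assumed layer sizes (commonly cited for LeNet-5; not given in the text above).
layers = {
    "conv1 (1->6, 5x5)": conv_params(5, 1, 6),
    "conv2 (6->16, 5x5)": conv_params(5, 6, 16),
    "fc1 (16*5*5->120)": dense_params(16 * 5 * 5, 120),
    "fc2 (120->84)": dense_params(120, 84),
    "fc3 (84->10)": dense_params(84, 10),
}
total = sum(layers.values())
for name, count in layers.items():
    print(f"{name:22s} {count:>7,}")
print(f"{'total':22s} {total:>7,}")
```

Most of the parameters sit in the first fully-connected layer, not in the convolutions.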
AlexNet: Designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Won ImageNet 2012, cutting top-5 error from about 26% to about 15%, a stunning 11-point improvement.
Architecture: 5 convolutional + 3 fully-connected layers. ~60M parameters. Popularized ReLU activation and dropout regularization for large-scale vision.
AlexNet sparked the modern deep learning era. Proved that CNNs could learn complex image representations at scale with GPUs.
Input: 224×224 RGB (ImageNet). Output: 1000 classes.
VGGNet (Visual Geometry Group): 16-19 layers using only 3×3 convolutions stacked deeply. Key insight: two stacked 3×3 convs cover the same 5×5 receptive field as one 5×5 conv, with more non-linearity and fewer parameters.
ResNet (Microsoft): 50-152 layers with skip connections (identity shortcuts). Each block outputs input + transformation. Mitigates the vanishing gradient problem and makes very deep networks trainable.
y = x + F(x), where F is the learned layers. Preserves original signal, enabling 100+ layer networks.
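Both ideas above can be illustrated in a few lines. The sketch first compares weight counts for two 3×3 convolutions against one 5×5 convolution (assuming equal input and output channel counts and ignoring biases), then implements y = x + F(x) with a small two-layer F standing in for the learned layers. All names and sizes are illustrative assumptions.

```python
import numpy as np

def conv_weights(kernel_size: int, channels: int, n_layers: int = 1) -> int:
    """Weights (biases ignored) in n_layers convs with `channels` inputs and outputs."""
    return n_layers * kernel_size * kernel_size * channels * channels

channels = 64                                      # illustrative width
two_3x3 = conv_weights(3, channels, n_layers=2)    # 18 * C^2
one_5x5 = conv_weights(5, channels)                # 25 * C^2
print(two_3x3, one_5x5)

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """y = x + F(x), where F(x) = w2 @ relu(w1 @ x) is a stand-in for the learned layers."""
    hidden = np.maximum(0.0, w1 @ x)
    return x + w2 @ hidden

rng = np.random.default_rng(0)
x = rng.normal(size=8)
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)
# If F outputs zero, the block reduces to the identity and the signal passes through unchanged.
y_identity = residual_block(x, w1, np.zeros((8, 8)))
```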
Batch Normalization: Normalizes layer inputs to mean 0, variance 1 within each batch. Stabilizes training, allows higher learning rates, acts as regularizer.
Dropout: Randomly deactivate neurons with probability p during training. Prevents co-adaptation, forces network to learn robust features. Not applied at test time.
Batch norm + dropout = faster training + better generalization. Enables deeper, more powerful networks.
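A minimal sketch of both techniques in NumPy follows. It covers training-mode batch normalization only (inference normally uses running statistics, omitted here), and the dropout uses the "inverted" scaling by 1/(1 − p), a common convention that the text above does not specify.

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: float = 1.0, beta: float = 0.0, eps: float = 1e-5) -> np.ndarray:
    """Normalize each feature (column) to mean 0, variance 1 over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def dropout(x: np.ndarray, p: float, training: bool, rng: np.random.Generator) -> np.ndarray:
    """Zero each activation with probability p during training; identity at test time."""
    if not training or p == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)      # inverted dropout: rescale so the expected value is unchanged

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # 64 examples, 4 features
normalized = batch_norm(batch)
train_out = dropout(normalized, p=0.5, training=True, rng=rng)
test_out = dropout(normalized, p=0.5, training=False, rng=rng)
```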
Sequential data: Text, time series, audio, video. Each element depends on previous elements.
Variable length: Sentences range from 5 to 1000+ tokens. Fully-connected layers expect fixed input size.
Long-range dependencies: In "The bank executive who works in New York said the …", a later pronoun such as "she" refers back to a subject ("executive") many words away.
Need architecture that processes variable-length sequences, remembers past context, and handles long-range dependencies.
RNNs maintain a hidden state h that updates at each time step. Same weights are used for every token.
Key idea: The hidden state acts as memory, carrying information forward through time. Weight sharing across timesteps is parameter-efficient.
Training: Backpropagation through time (BPTT). Gradients flow backward through unrolled sequence.
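The hidden-state update can be sketched as a short loop. The tanh update rule and the weight names `w_xh`, `w_hh`, `b_h` are the standard vanilla-RNN formulation, assumed here since the text above does not give the equation.

```python
import numpy as np

def rnn_forward(inputs: np.ndarray, w_xh: np.ndarray, w_hh: np.ndarray, b_h: np.ndarray) -> np.ndarray:
    """Run a vanilla RNN over a sequence of shape (T, input_dim); return all hidden states (T, hidden_dim)."""
    h = np.zeros(w_hh.shape[0])
    states = []
    for x_t in inputs:
        # The same weights are applied at every time step.
        h = np.tanh(w_xh @ x_t + w_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
w_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.5
w_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.5
b_h = np.zeros(hidden_dim)

short_states = rnn_forward(rng.normal(size=(3, input_dim)), w_xh, w_hh, b_h)
long_states = rnn_forward(rng.normal(size=(7, input_dim)), w_xh, w_hh, b_h)
print(short_states.shape, long_states.shape)
```

The same parameters handle sequences of length 3 and 7, which is how RNNs accommodate variable-length input.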
Problem: When training RNNs with BPTT, gradients shrink exponentially as they flow backward through many time steps. After ~20-30 steps, gradient ≈ 0.
Root cause: The chain rule gives dL/dh_0 = dL/dh_T × dh_T/dh_{T-1} × ... × dh_1/dh_0. Each factor is a Jacobian; when their norms are below 1, the product decays exponentially with T (norms above 1 cause exploding gradients instead).
Vanilla RNNs struggle to learn long-range dependencies. Early time steps stop receiving meaningful gradient updates.
Discovered: Hochreiter (1991), Bengio et al. (1994)
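The exponential decay can be observed numerically by multiplying the per-step Jacobians of a small tanh RNN. This is an illustrative sketch: the recurrent matrix is deliberately scaled to spectral norm 0.5, an assumption chosen to make the decay obvious, and inputs are omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16
q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))
w_hh = 0.5 * q                          # orthogonal matrix scaled to spectral norm 0.5 (assumed)

h = rng.normal(size=hidden_dim)
grad = np.eye(hidden_dim)               # accumulates dh_t/dh_0
norms = []
for t in range(30):
    h = np.tanh(w_hh @ h)
    jacobian = np.diag(1.0 - h ** 2) @ w_hh    # dh_t/dh_{t-1} for h_t = tanh(W h_{t-1})
    grad = jacobian @ grad
    norms.append(np.linalg.norm(grad, 2))

print(f"after 1 step:   {norms[0]:.3e}")
print(f"after 30 steps: {norms[-1]:.3e}")
```

Because tanh' is at most 1, each factor has norm at most 0.5 here, so the product is bounded by 0.5^t.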
LSTM cells: Replace vanilla RNN cells with gated units. Three gates (forget, input, output) control information flow.
Cell state (c): Long-term memory with additive updates. Forget gate selects what to discard; input gate selects what to add.
Hidden state (h): Short-term output. Output gate controls what of the cell state is exposed.
Additive cell updates (+ operator) let gradients flow largely unimpeded while the forget gate stays near 1. Multiplicative gates provide selective memory.
Forget gate: f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f). Values in (0,1). Multiplies cell state; near 0 = discard, near 1 = keep.
Input gate: i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i). Controls how much of the new candidate (tanh) to add to the cell state.
Cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(...). Additive update ensures gradient flow.
Output gate: o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o). Controls exposure of cell state as hidden state: h_t = o_t ⊙ tanh(c_t).
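The four gate equations combine into a single time step as sketched below. The candidate term uses the standard form tanh(W_c · [h_{t-1}, x_t] + b_c), which is an assumption because the text above elides the argument of tanh; the parameter dictionary layout is likewise hypothetical.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state h_t and cell state c_t."""
    z = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])     # candidate (assumed standard form)
    c_t = f_t * c_prev + i_t * c_tilde                       # additive cell update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {}
for gate in "fioc":
    params[f"W_{gate}"] = rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1
    params[f"b_{gate}"] = np.zeros(hidden_dim)

x = rng.normal(size=input_dim)
h0, c0 = np.zeros(hidden_dim), np.zeros(hidden_dim)
h1, c1 = lstm_step(x, h0, c0, params)
```

Setting the forget gate near 1 and the input gate near 0 copies the cell state forward unchanged, which is the path gradients can follow across many steps.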
GRU (Gated Recurrent Unit): Simplified LSTM variant. Combines forget and input gates into a single update gate. Only two gates (reset, update) vs. three in LSTM.
Trade-off: Fewer parameters, slightly faster to train. Similar performance to LSTM. Good when GPU memory is constrained.
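The parameter saving can be estimated directly: an LSTM layer has four weight sets (three gates plus the candidate) while a GRU has three (two gates plus the candidate). The sketch below assumes one bias vector per weight set; some implementations use two, which changes the totals slightly but not the ratio. The layer sizes are illustrative.

```python
def gated_rnn_params(input_size: int, hidden_size: int, n_weight_sets: int) -> int:
    """Parameters in one gated recurrent layer with `n_weight_sets` gate/candidate transforms."""
    per_set = hidden_size * (hidden_size + input_size) + hidden_size   # weights + one bias vector
    return n_weight_sets * per_set

lstm_params = gated_rnn_params(128, 256, n_weight_sets=4)   # forget, input, output, candidate
gru_params = gated_rnn_params(128, 256, n_weight_sets=3)    # reset, update, candidate
print(lstm_params, gru_params, gru_params / lstm_params)
```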
Bidirectional RNN: Run two RNNs—one forward (left-to-right), one backward (right-to-left). Concatenate outputs. Gives access to future context too.
Deep RNN: Stack multiple RNN layers. Layer 1 processes input sequence; layer 2 processes layer 1's outputs. Learn hierarchical temporal representations.
Practical: BiLSTM (bidirectional LSTM) was long a standard choice for NLP tagging tasks (POS tagging, NER). Stacking 2-3 layers is common; deeper stacks give diminishing returns at higher cost.
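The bidirectional idea can be sketched with two small vanilla RNNs: one reads the sequence left to right, the other reads it reversed, and the backward states are flipped back so both halves line up per position. A plain tanh RNN is used instead of an LSTM to keep the example short; all names are hypothetical.

```python
import numpy as np

def run_rnn(inputs: np.ndarray, w_xh: np.ndarray, w_hh: np.ndarray) -> np.ndarray:
    """Vanilla tanh RNN; returns hidden states of shape (T, hidden_dim)."""
    h = np.zeros(w_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(w_xh @ x_t + w_hh @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(inputs, fwd_weights, bwd_weights):
    """Concatenate forward and backward hidden states at each position."""
    forward = run_rnn(inputs, *fwd_weights)
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]   # flip back to the original order
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
fwd = (rng.normal(size=(hidden_dim, input_dim)) * 0.5, rng.normal(size=(hidden_dim, hidden_dim)) * 0.5)
bwd = (rng.normal(size=(hidden_dim, input_dim)) * 0.5, rng.normal(size=(hidden_dim, hidden_dim)) * 0.5)
seq = rng.normal(size=(4, input_dim))
outputs = bidirectional_rnn(seq, fwd, bwd)      # shape (4, 8)
```

A deep RNN would feed `outputs` as the input sequence of a second layer.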
Seq2seq architecture: Encoder LSTM reads input sequence, outputs final hidden state (context vector). Decoder LSTM generates output sequence, conditioned on context.
Applications: Machine translation ("Hello" → "Hola"), summarization, dialogue, image captioning.
Problem: Single context vector bottleneck. Can't capture all information from long sequences.
Attention mechanisms (2015+) allow decoder to focus on relevant encoder states, not just final state.
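Attention over encoder states can be sketched as a weighted average whose weights depend on the current decoder state. Dot-product scoring is assumed here for brevity; the text above does not say which scoring function is used, and other variants (for example a small feed-forward scorer) exist.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())       # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state: np.ndarray, encoder_states: np.ndarray):
    """Return a context vector and attention weights over encoder states of shape (T, d)."""
    scores = encoder_states @ decoder_state      # one score per encoder position
    weights = softmax(scores)
    context = weights @ encoder_states           # weighted average of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))         # 6 source positions, 8-dim states
decoder_state = rng.normal(size=8)
context, weights = attend(decoder_state, encoder_states)
print(weights.round(3))
```

The decoder recomputes the context at every output step, so it is no longer limited to the single final encoder state.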
CNNs excel at: Images, spatial grids. Hierarchical feature learning. Fast parallelization. Modern object detection, segmentation.
RNNs/LSTMs excel at: Variable-length sequences. Temporal dependencies. Language modeling, machine translation, time series prediction.
Hybrid: Video = sequence of images. Use CNN to extract spatial features per frame, RNN to model temporal patterns. CNNs in encoder for image captioning.
CNN if data has spatial structure; RNN if sequential/temporal; hybrid if both.
Transformers, key idea: "Attention Is All You Need." Replace RNNs entirely with stacked self-attention layers. No recurrence = massive parallelization.
Self-attention: Each token attends to all other tokens in parallel. Learn which tokens are relevant to which. No bottleneck like seq2seq.
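Benefits: Train much faster than RNNs on parallel hardware, since all positions are processed at once. Better long-range dependency modeling (any two tokens are one attention step apart, so there is no step-by-step gradient decay). Dominates NLP (BERT, GPT, T5).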
Transformers replaced RNNs for most NLP. Later adapted to vision (Vision Transformers).
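Self-attention can be sketched as scaled dot-product attention in which queries, keys, and values are all projections of the same token sequence. This is a single-head sketch without masking, positional encodings, or multi-head splitting; the projection matrices `w_q`, `w_k`, `w_v` follow the standard formulation and are assumptions relative to the text above.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray, w_v: np.ndarray):
    """Single-head scaled dot-product self-attention over tokens x of shape (T, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (T, T): every token scores every token
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))                # 5 tokens, model width 16
w_q, w_k, w_v = (rng.normal(size=(16, 8)) * 0.1 for _ in range(3))
attended, attn_weights = self_attention(tokens, w_q, w_k, w_v)
print(attended.shape, attn_weights.shape)
```

All rows of the score matrix are computed with matrix multiplications, with no loop over time steps, which is where the parallelism comes from.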
Vision Transformers (Dosovitskiy et al., 2021): Divide image into fixed-size patches (e.g., 16×16 pixels), treat each patch as a token, apply Transformer. No convolutions.
Performance: Surpasses ResNet when pre-trained on large datasets. Scales better with data/compute. Has become a standard choice for vision tasks.
Impact: CNNs (inductive bias for locality) give way to attention (pure learning). Shows local connectivity isn't essential with enough data.
Vision increasingly moves from CNNs → Transformers. CNNs remain efficient for small data, edge devices.
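The patch-splitting step of a Vision Transformer can be sketched with NumPy reshapes. The function name `patchify` is hypothetical, and the learned linear projection and position embeddings that normally follow are omitted.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches: (num_patches, patch_size*patch_size*C)."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)           # (grid_h, grid_w, patch, patch, C)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
tokens = patchify(image)            # one row per patch, in row-major patch order
print(tokens.shape)
```

A 224×224 RGB image becomes a sequence of 14×14 = 196 tokens, short enough for all-to-all attention.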
Transfer learning: Pre-train on large dataset (ImageNet, COCO), fine-tune on domain-specific small dataset. Leverages learned features.
Why effective: Lower layers learn universal features (edges, textures). Higher layers adapt to specific task. Reusing lower layers saves parameters and data.
Strategies: (1) Freeze backbone, train only head. (2) Fine-tune all layers with low learning rate. (3) Progressive unfreezing.
Impact: Most modern applications use pre-trained models. Training from scratch is rare (except when the target domain is very different from the pre-training data).
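Strategy (1), freezing the backbone and training only the head, can be illustrated with a toy NumPy example. A fixed random projection stands in for pre-trained backbone weights, and a logistic-regression head is trained by gradient descent on synthetic labels; every name, size, and hyperparameter here is an illustrative assumption, not a recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
backbone_w = rng.normal(size=(10, 32)) * 0.3     # stand-in for pre-trained weights (frozen)
head_w = np.zeros(32)                            # new task-specific head (trainable)

x = rng.normal(size=(200, 10))
y = (x[:, 0] > 0).astype(float)                  # toy binary labels

def features(inputs: np.ndarray) -> np.ndarray:
    """Frozen feature extractor: backbone_w is read here but never updated."""
    return np.maximum(0.0, inputs @ backbone_w)

def loss_and_grad(w: np.ndarray):
    """Binary cross-entropy of the head and its gradient with respect to the head weights only."""
    f = features(x)
    p = 1.0 / (1.0 + np.exp(-(f @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, f.T @ (p - y) / len(y)

backbone_before = backbone_w.copy()
initial_loss, _ = loss_and_grad(head_w)
for _ in range(200):
    _, grad = loss_and_grad(head_w)
    head_w -= 0.1 * grad                         # only the head receives updates
final_loss, _ = loss_and_grad(head_w)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Strategies (2) and (3) would additionally update `backbone_w`, typically with a smaller learning rate than the head.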
Narrative: CNNs dominated vision for roughly a decade after AlexNet; RNNs/LSTMs dominated language. Transformers unified both domains, enabling foundation models (BERT, GPT, etc.).
CNNs remain essential: Efficient on constrained devices, small datasets, real-time vision. Foundation for hybrid models (CNN + Transformer backbones). Inductive bias still valuable.
RNNs/LSTMs: Specialized RNN variants (Gated Recurrent Units, Bidirectional LSTMs) still used where Transformers are overkill or latency matters. Strong teaching architectures for understanding sequential processing.
Conceptual legacy: CNNs + RNNs introduced fundamental ideas (parameter sharing, local connectivity, gating, attention mechanisms) that shaped all modern architectures.
Transformers are now dominant, but CNNs/RNNs remain deployed in production. Hybrid architectures (Perceiver, Flamingo) blend ideas from these paradigms.
CNNs: Local connectivity, shared weights, pooling. Ideal for images. Efficient, well-understood, inductive bias for locality.
RNNs/LSTMs: Hidden state, sequential processing, gating mechanisms. Handle variable-length sequences. LSTMs mitigate vanishing gradients via additive cell updates.
Transformers: All-to-all attention, no recurrence, massively parallel. Scale to foundation models. Now dominant in NLP, expanding to vision.
In practice: Transfer learning from pre-trained models is standard. Fine-tuning usually beats training from scratch. Hybrid architectures combine strengths of all three.
Landmark Papers:
• LeNet: LeCun et al. (1998) — "Gradient-Based Learning Applied to Document Recognition"
• LSTM: Hochreiter & Schmidhuber (1997) — "Long Short-Term Memory"
• AlexNet: Krizhevsky et al. (2012) — "ImageNet Classification with Deep Convolutional Neural Networks"
• ResNet: He et al. (2015) — "Deep Residual Learning for Image Recognition"
• Transformers: Vaswani et al. (2017) — "Attention Is All You Need"
• ViT: Dosovitskiy et al. (2021) — "An Image is Worth 16x16 Words"
Courses & Textbooks:
• Goodfellow, Bengio, Courville (2016) — "Deep Learning" (MIT Press)
• Karpathy's CS231n and Manning's CS224n (Stanford online)
• "Dive into Deep Learning" (D2L.ai) — free online with code