CNNs & RNNs
Neural Architectures for Spatial & Sequential Data

Deep learning fundamentals for ML students
From images to sequences — the two pillars of modern architecture
Contents 1

Learning Map

  • I. Convolutional Neural Networks
  • II. Recurrent Neural Networks
  • III. Classic Architectures
  • IV. Modern Context & Evolution
§1 2

Convolutional Neural Networks

Capturing spatial structure in images and grids
I
§1 · Problem Space 3

Why Fully-Connected Layers Fail on Images

Parameter explosion: A 224×224×3 image has about 150K input values (50,176 pixels × 3 channels). A single fully-connected hidden layer of 1000 units therefore needs about 150M weights.

No spatial structure: Fully-connected layers treat every pixel equally, ignoring that nearby pixels are related.

No translation invariance: A cat in the top-left corner looks different to the network than the same cat in the center.

Core Insight

Images have local structure. Small filters that detect edges, corners, and textures are more powerful than learning pixel-by-pixel relationships.

Figure: a 224×224×3 input (about 150K values) densely connected to a hidden layer.

Problem: Dense connections
  • Every pixel connects to every neuron
  • Ignores spatial locality
  • Not translation invariant
§1 · Convolution 4

The Convolution Operation

A kernel (or filter) is a small matrix of learned weights. It slides over the input, computing element-wise products and summing.

Example: A 5×5 kernel with learnable weights detects patterns like edges or textures across the image.

Convolution Output: output[i,j] = Σ_{m,n} kernel[m,n] × input[i+m, j+n]

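Key benefits: Shared weights across space, local connectivity, fewer parameters, and translation equivariance (a shifted input gives a correspondingly shifted feature map; pooling then adds a degree of invariance).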

The same 5×5 kernel is reused thousands of times, drastically reducing parameters vs. fully-connected.
Figure: a 5×5 kernel sliding over the input image.

Convolution Process
  1. Position the kernel over an input patch
  2. Multiply kernel weights and input values element-wise
  3. Sum all products into a single output value
  4. Slide the kernel and repeat (stride controls the movement)
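The four steps above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration for a single-channel input with no padding; the function name `conv2d_valid` and the averaging kernel are hypothetical choices for the example, not part of the slides. Like most deep learning libraries, it computes a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over a 2-D `image` without padding and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]   # step 1: position kernel over a patch
            out[i, j] = np.sum(patch * kernel)  # steps 2-3: multiply element-wise, then sum
    return out                                  # step 4 is the double loop itself

# Toy 7×7 input and a hypothetical 5×5 averaging kernel (learned weights in practice).
image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((5, 5)) / 25.0
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)
```

The same 25 kernel weights are reused at all nine output positions here, which is the weight sharing the slide refers to.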
§1 · Stride & Padding 5

Controlling Spatial Dimensions

Stride: How many pixels the kernel moves each step. Stride=1 shifts one pixel at a time; stride=2 moves two pixels per step, skipping every other position.

Padding: Adding zero-valued border pixels to preserve spatial dimensions. A 32×32 input with 5×5 kernel and padding=2 produces 32×32 output (without padding, output shrinks to 28×28).

Output Dimension Formula

output_size = floor((input_size − kernel_size + 2×padding) / stride) + 1

High stride = smaller output (downsampling). Padding preserves edge information.

Figure: an input surrounded by a padding border (P), with the kernel moving 2 px per step (stride=2).

Example: 32×32 Input, 5×5 Kernel
  • No padding, stride=1: (32 − 5) / 1 + 1 = 28, so 28×28
  • Padding=2, stride=1: (32 − 5 + 4) / 1 + 1 = 32, so 32×32
  • No padding, stride=2: floor((32 − 5) / 2) + 1 = 14, so 14×14
  • Padding=2, stride=2: floor((32 − 5 + 4) / 2) + 1 = 16, so 16×16
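The four worked examples can be checked with a small helper. This is a sketch; the function name `conv_output_size` is an assumption for illustration, and integer division plays the role of rounding down when the stride does not divide evenly.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# The four cases from the slide: 32×32 input, 5×5 kernel.
for padding, stride in [(0, 1), (2, 1), (0, 2), (2, 2)]:
    size = conv_output_size(32, 5, padding=padding, stride=stride)
    print(f"padding={padding}, stride={stride}: {size}x{size}")
```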
§1 · Pooling 6

Pooling: Downsampling & Feature Aggregation

Pooling layers reduce spatial dimensions by applying a stateless operation (max or average) over small windows.

Max pooling: Selects the maximum value in each window. Captures the strongest feature response.

Average pooling: Computes the average. Smoother, less sensitive to noise.

Benefits: Reduces computation and the size of later layers, adds some translation invariance, helps limit overfitting, and enlarges the receptive field of subsequent layers.

Pooling has no learnable parameters—it's a deterministic operation.
Figure: 2×2 max pooling with stride=2 reduces a 4×4 input grid to the 2×2 output [9, 2; 4, 6].

Max Pooling 2×2 with stride=2
  • Top-left 2×2 block [3,9,5,8]: max = 9
  • Top-right 2×2 block [2,1,...]: max = 2 (and so on)

Result:
  • 4×4 input reduces to 2×2 output
  • Most important features are preserved
  • Spatial dimensions halved, robustness increased
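A minimal NumPy sketch of the max pooling described above. The 4×4 input is a hypothetical grid: only the top-left block [3,9,5,8] and the first two values of the top-right block come from the slide, and the remaining entries were chosen so that the result matches the slide's output.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Take the maximum over each size×size window of a 2-D array."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

# Hypothetical 4×4 input (values beyond the slide's are made up for illustration).
x = np.array([[3, 9, 2, 1],
              [5, 8, 0, 2],
              [4, 1, 6, 3],
              [2, 0, 5, 1]])
pooled = max_pool2d(x)
print(pooled)  # [[9 2]
               #  [4 6]]
```

The function has no weights to learn, which matches the point that pooling is a deterministic operation.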
§1 · Receptive Field 7

Understanding Receptive Field

The receptive field of a neuron is the region of the input image that influences its output.

In layer 1, a 5×5 convolution has a 5×5 receptive field. But in layer 2, stacking another 5×5 convolution expands the receptive field to 9×9.

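Receptive Field Growth: RF_L = 1 + Σ_{l=1..L} (kernel_size_l − 1) × Π_{i<l} stride_i (the product runs over the strides of all earlier layers; with stride 1 everywhere this reduces to 1 + Σ(kernel_size_l − 1))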

Deep networks see the whole image: Shallow layers detect edges; deeper layers see textures, shapes, objects. Large receptive fields are crucial for understanding context.

Figure: nested receptive fields, 5×5 at layer 1 and 9×9 at layer 2.

Receptive Field Growth (stride 1 throughout)
  • Layer 1 (5×5 kernel): RF = 5×5
  • Layer 2 (5×5 kernel): RF = 5 + (5−1)×1 = 9, so 9×9
  • Layer 3 (5×5 kernel): RF = 9 + (5−1)×1 = 13, so 13×13

Deeper = larger context = better high-level understanding
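The growth rule can be sketched as a short loop. The function name `receptive_field` and the `(kernel_size, stride)` layer description are assumptions for this example. The `jump` variable tracks how many input pixels one step in the current feature map corresponds to, which is how strides in earlier layers enlarge the contribution of later ones.

```python
def receptive_field(layers):
    """Receptive field (one dimension) of a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer adds (k - 1) steps at the current spacing
        jump *= stride                  # strides compound for all later layers
    return rf

print(receptive_field([(5, 1)]))                  # 5
print(receptive_field([(5, 1), (5, 1)]))          # 9
print(receptive_field([(5, 1), (5, 1), (5, 1)]))  # 13
# A 2×2 stride-2 pooling layer between two 5×5 convs widens the field faster:
print(receptive_field([(5, 1), (2, 2), (5, 1)]))  # 14
```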
§1 · LeNet (1998) 8

LeNet-5: The Pioneer

Designed by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner for handwritten digit recognition.

Architecture: 2 convolutional layers + 3 fully-connected layers. Total: ~60K parameters.

Impact: One of the first CNNs deployed at scale; demonstrated that local feature learning works. Used in bank check processing.

Historical Significance

LeNet proved that convolutions + shared weights could solve real-world vision tasks efficiently.

Input: 32×32 grayscale. Output: 10 classes (digits 0-9).

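Figure: LeNet-5 pipeline: Input 32×32 → Conv1 (6 filters, 5×5) → Pool1 2×2 → Conv2 (16 filters, 5×5) → Pool2 2×2 → FC 120 → Output 10.

LeNet-5 Architecture
  • Input: 32×32 grayscale image
  • C1: 6 filters 5×5 → 28×28×6
  • S2: 2×2 subsampling (average pooling in the original paper; max pooling is common in re-implementations) → 14×14×6
  • C3: 16 filters 5×5 → 10×10×16
  • S4: 2×2 subsampling → 5×5×16 (flatten to 400)
  • F5, F6: Fully-connected 120 → 84 → 10 (softmax in modern re-implementations)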
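The "~60K parameters" figure can be reproduced from the layer shapes above. This sketch assumes every C3 filter sees all 6 input channels and that each unit has one bias; the original paper used a sparser C3 connection scheme, so its count is slightly lower. The helper names are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output filter."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)

def dense_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return out_features * (in_features + 1)

lenet_layers = {
    "C1": conv_params(1, 6, 5),            # 156
    "C3": conv_params(6, 16, 5),           # 2,416 (full connectivity assumed)
    "F5": dense_params(5 * 5 * 16, 120),   # 48,120
    "F6": dense_params(120, 84),           # 10,164
    "Output": dense_params(84, 10),        # 850
}
total = sum(lenet_layers.values())
print(total)  # 61706
```

Most of the parameters sit in the first fully-connected layer, not in the convolutions, which illustrates how cheap shared kernels are.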
§1 · AlexNet (2012) 9

AlexNet: The Deep Learning Revolution

Designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Won ImageNet 2012 with a top-5 error of about 15%, versus about 26% for the runner-up: a stunning 11-point improvement.

Architecture: 5 convolutional + 3 fully-connected layers. ~60M parameters. Popularized ReLU activations and dropout regularization in large-scale CNNs.

Breakthrough Moment

AlexNet sparked the modern deep learning era. Proved that CNNs could learn complex image representations at scale with GPUs.

Input: 224×224 RGB (ImageNet). Output: 1000 classes.

Figure: AlexNet pipeline: Input 224 → C1 11×11 → C2 5×5 → C3 3×3 → C4 3×3 → C5 3×3 → FC 4096 → Out 1000.

Key Innovations
  • ReLU Activation: Faster training than tanh/sigmoid
  • Dropout: Random neuron deactivation prevents overfitting
  • GPU Training: NVIDIA GPUs made 60M parameters trainable
  • Data Augmentation: Crops, flips, color shifts during training
  • Large Scale: Trained on 1.2M ImageNet images
§1 · VGG & ResNet 10

Deeper Networks: VGGNet (2014) & ResNet (2015)

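VGGNet (Visual Geometry Group): 16-19 layers using only 3×3 convolutions stacked deeply. Key insight: two stacked 3×3 convs cover the same 5×5 receptive field as one 5×5 conv, with an extra non-linearity and fewer parameters (18C² vs. 25C² weights for C input and output channels).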
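The parameter saving is simple arithmetic, sketched here for a hypothetical width of 64 channels (biases ignored). The helper name `conv_weights` is an assumption for the example.

```python
def conv_weights(kernel_size, channels):
    """Weight count of a conv layer with `channels` input and output channels."""
    return kernel_size * kernel_size * channels * channels

channels = 64  # hypothetical layer width
two_3x3 = 2 * conv_weights(3, channels)
one_5x5 = conv_weights(5, channels)
print(two_3x3, one_5x5)    # 73728 102400
print(two_3x3 / one_5x5)   # 0.72
```

The ratio 18/25 = 0.72 is independent of the channel count, so the stacked version uses 28% fewer weights at any width.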

ResNet (Microsoft): 50-152 layers (shallower 18- and 34-layer variants also exist) with skip connections (identity shortcuts). Each block outputs input + transformation, which eases optimization and mitigates vanishing gradients in very deep networks.

Skip Connection

y = x + F(x), where F is the learned layers. Preserves original signal, enabling 100+ layer networks.
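A toy sketch of y = x + F(x) in NumPy. Here F is a stand-in made of two small matrix multiplications with a ReLU in between, not the convolutional block used in ResNet; the names are hypothetical. It shows that when F contributes nothing, the block reduces to the identity, so the original signal passes through unchanged.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """y = x + F(x), with F(x) = w2 @ relu(w1 @ x) standing in for the learned layers."""
    return x + w2 @ relu(w1 @ x)

x = np.array([1.0, -2.0, 3.0])

# With all-zero weights F(x) = 0, so the block passes x through untouched.
w_zero = np.zeros((3, 3))
y_identity = residual_block(x, w_zero, w_zero)

# With identity weights F(x) = relu(x), which is added on top of x.
w_eye = np.eye(3)
y_added = residual_block(x, w_eye, w_eye)
print(y_identity, y_added)  # [ 1. -2.  3.] [ 2. -2.  6.]
```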

Figure: VGGNet as a plain stack of 3×3 convs (×16), next to a ResNet block in which a skip path adds x to the conv output, y = x + F(x).

Architecture Comparison
  Feature            VGGNet          ResNet
  Depth              16-19           50-152
  Skip Connections   None            Yes (identity)
  Main Innovation    3×3 stacking    y = x + F(x)
§1 · Regularization 11

Batch Normalization & Dropout

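Batch Normalization: Normalizes layer inputs to mean 0, variance 1 within each mini-batch, then applies a learned scale γ and shift β. Stabilizes training, allows higher learning rates, acts as a mild regularizer. At test time, running averages of the mean and variance replace the batch statistics.

Batch Norm: x_norm = (x − batch_mean) / sqrt(batch_var + ε), y = γ × x_norm + β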

Dropout: Randomly deactivate neurons with probability p during training. Prevents co-adaptation, forces network to learn robust features. Not applied at test time.

Combined Effect

Batch norm + dropout = faster training + better generalization. Enables deeper, more powerful networks.

Figure: batch normalization maps activations with arbitrary statistics (μ=?, σ=?) to μ=0, σ=1; dropout (p=0.5) randomly deactivates units.

Benefits & Impact
Batch Norm:
  • Stabilizes training, allows higher learning rates
  • Reduces internal covariate shift
  • Acts as regularizer (slightly reduces overfitting)
Dropout:
  • Prevents co-adaptation of neurons
  • Reduces overfitting significantly
  • Not applied at test time (activations are rescaled so that expected values match between training and testing)
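Both operations fit in a few lines of NumPy. This is a training-mode sketch under stated assumptions: the batch norm omits the running statistics a real layer keeps for test time, and the dropout uses the common "inverted" variant, which rescales kept activations by 1/(1−p) during training so nothing needs to change at test time. Function names and the toy batch are hypothetical.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x                        # identity at test time
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # 64 examples, 4 features

normed = batch_norm_train(batch, gamma=np.ones(4), beta=np.zeros(4))
dropped = dropout(normed, p=0.5, rng=rng)
evaluated = dropout(normed, p=0.5, rng=rng, training=False)

print(normed.mean(axis=0).round(6))  # approximately 0 for every feature
print(normed.var(axis=0).round(3))   # approximately 1 for every feature
```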
§2 12

Recurrent Neural Networks

Processing sequential and temporal data
II
§2 · Sequential Data 13

Why CNNs Fail on Sequences

Sequential data: Text, time series, audio, video. Each element depends on previous elements.

Variable length: Sentences range from 5 to 1000+ tokens. Fully-connected layers expect fixed input size.

Long-range dependencies: In "The bank executive who works in New York said she would …", the pronoun "she" refers back to a subject many words away.

Core Challenge

Need architecture that processes variable-length sequences, remembers past context, and handles long-range dependencies.

Figure: a token sequence ("The cat sat on ... mat") of variable length in temporal order, and the sentence "The bank executive who works in New York said she would..." with "she" marked as referring back to the subject.

Why Standard Networks Fail
  • CNNs: Fixed spatial extent, no temporal memory
  • Fully-connected: Require fixed input size, no parameter sharing across time
  • Solution: RNNs process one token at a time, maintain hidden state
  • Hidden state: Memory of what the network has seen so far
§2 · Vanilla RNN 14

The Vanilla RNN: Hidden State & Weight Sharing

RNNs maintain a hidden state h that updates at each time step. Same weights are used for every token.

RNN Update:
h_t = tanh(W_h × h_{t-1} + W_x × x_t + b)
y_t = W_y × h_t + b_y

Key idea: The hidden state acts as memory, carrying information forward through time. Weight sharing across timesteps is parameter-efficient.

Training: Backpropagation through time (BPTT). Gradients flow backward through unrolled sequence.

Figure: folded view (a single RNN cell with input x_t, hidden state h, output y_t) next to the same cell unfolded through time, with the same weights repeated at every step.

RNN Computation
  • Weight sharing: W_h, W_x, W_y are fixed for all timesteps
  • Hidden state: h carries information forward (memory)
  • tanh: Non-linearity allows learning complex temporal patterns
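The update rule can be unrolled over a sequence in a short loop. This is a forward-pass sketch only (no training); the dimensions and random weights are hypothetical. The same W_h, W_x, and W_y are applied at every step, so the function works for any sequence length.

```python
import numpy as np

def rnn_forward(xs, h0, W_h, W_x, b, W_y, b_y):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t + b) and y_t = W_y h_t + b_y over a sequence."""
    h = h0
    hidden_states, outputs = [], []
    for x_t in xs:                        # one token at a time
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hidden_states.append(h)
        outputs.append(W_y @ h + b_y)
    return np.stack(hidden_states), np.stack(outputs)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 3, 4, 2, 5  # hypothetical sizes
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_y = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b, b_y = np.zeros(hidden_dim), np.zeros(output_dim)
h0 = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))
hs, ys = rnn_forward(xs, h0, W_h, W_x, b, W_y, b_y)
hs_short, _ = rnn_forward(xs[:3], h0, W_h, W_x, b, W_y, b_y)  # shorter sequence, same weights
print(hs.shape, ys.shape, hs_short.shape)  # (5, 4) (5, 2) (3, 4)
```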
§2 · Vanishing Gradient 15

The Vanishing Gradient Problem

Problem: When training RNNs with BPTT, gradients tend to shrink exponentially as they flow backward through many time steps. After roughly 20-30 steps, the gradient is often close to 0.

Root cause: The chain rule: dL/dh_0 = dL/dh_T × dh_T/dh_{T-1} × ... × dh_1/dh_0. Each factor is a Jacobian; when its norm is below 1, the repeated product decays exponentially (and when it is above 1, gradients can explode instead).

Consequence

In practice, vanilla RNNs struggle to learn long-range dependencies (> 5-10 steps). Early time steps stop receiving meaningful gradient updates.

Discovered: Hochreiter (1991), Bengio et al. (1994)

Figure: gradient magnitude plotted against time step (t=0, 5, 10, 20, 30, 50).

Chain Rule & Gradient Decay
  • dL/dh_0 = dL/dh_T × ∏(dh_i/dh_{i-1})
  • Each multiplication: |dh_i/dh_{i-1}| < 1 (usually 0.1 - 0.9)
  • After 30 steps: (0.5)^30 ≈ 10^-9 (gradient essentially zero)
  • Result: Network cannot learn dependencies > 5-10 timesteps
  • Solution: LSTM, GRU (gating mechanisms limit gradient decay)
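The decay arithmetic is easy to reproduce. This sketch treats every step as multiplying the gradient by the same scalar factor, which is a simplification of the Jacobian product; the function name is hypothetical.

```python
def gradient_scale(per_step_factor, num_steps):
    """Factor by which a gradient is scaled after flowing back through num_steps steps."""
    return per_step_factor ** num_steps

for factor in (0.9, 0.5, 0.1):
    scales = [gradient_scale(factor, steps) for steps in (5, 10, 30)]
    print(factor, ["%.1e" % s for s in scales])

print(gradient_scale(0.5, 30))  # about 9.3e-10, i.e. roughly 10^-9
```

Even a mild factor of 0.9 shrinks the gradient to a few percent of its size after 30 steps, while a factor slightly above 1 would grow without bound instead.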
§2 · LSTM 16

LSTMs: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)

LSTM cells: Replace vanilla RNN cells with gated units. Three gates (forget, input, output) control information flow.

Cell state (c): Long-term memory with additive updates. Forget gate selects what to discard; input gate selects what to add.

Hidden state (h): Short-term output. Output gate controls what of the cell state is exposed.

Key Innovation

Additive cell updates (+ operator) let gradients flow largely unimpeded as long as the forget gate stays near 1. Multiplicative gates provide selective memory.

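Figure: LSTM cell. Inputs x_t and h_{t-1} feed the forget (f), input (i), and output (o) gates. The cell state c is multiplied by the forget gate, the gated tanh candidate is added, and the output gate multiplies tanh(c) to produce h_t.

Gates & Operations: × = multiplication (selective), + = addition (memory), sigmoid/tanh = non-linearity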
§2 · LSTM Gates 17

LSTM Gates Explained

Forget gate: f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f). Values in [0,1]. Multiplies cell state; 0 = discard, 1 = keep.

Input gate: i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i). Controls how much of the new candidate (tanh) to add to the cell state.

Cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t, where c̃_t = tanh(W_c · [h_{t-1}, x_t]) is the candidate. Additive update ensures gradient flow.

Output gate: o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o). Controls exposure of cell state as hidden state h_t.

Sigmoid outputs are smooth gate values in [0,1]; tanh candidates are centered in [-1,1].
Figure: the four LSTM components side by side.
  • Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f). Output: [0,1] per cell
  • Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i). Controls new info
  • Candidate: c̃_t = tanh(W_c · [h_{t-1}, x_t]). Output: [-1,1]
  • Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o). Controls h_t exposure

LSTM Recurrence Equations
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

Why this works:
  • Addition (+) preserves gradients: dc_t/dc_{t-1} can be ≈ 1
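A single LSTM step following the gate equations above, as a NumPy sketch. The parameter dictionary layout and sizes are assumptions for illustration, and the candidate includes a bias `b_c` (set to zero here), as most implementations do. The example uses extreme hand-set biases to show the "remember everything, write nothing" regime in which the cell state is copied forward unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    hx = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ hx + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ hx + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ hx + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ hx + params["b_c"])   # candidate
    c = f * c_prev + i * c_tilde                            # additive cell update
    h = o * np.tanh(c)
    return h, c

hidden_dim, input_dim = 2, 3  # hypothetical sizes
params = {name: np.zeros((hidden_dim, hidden_dim + input_dim))
          for name in ("W_f", "W_i", "W_o", "W_c")}
params.update({name: np.zeros(hidden_dim) for name in ("b_f", "b_i", "b_o", "b_c")})

x_t = np.ones(input_dim)
h_prev = np.zeros(hidden_dim)
c_prev = np.array([2.0, -2.0])

# All-zero parameters: every gate is 0.5 and the candidate is 0, so c is halved.
h_half, c_half = lstm_step(x_t, h_prev, c_prev, params)

# Forget gate saturated open, input gate saturated shut: the cell state is preserved.
keep = dict(params, b_f=np.full(hidden_dim, 100.0), b_i=np.full(hidden_dim, -100.0))
h_keep, c_keep = lstm_step(x_t, h_prev, c_prev, keep)
print(c_half, c_keep)  # [ 1. -1.] [ 2. -2.]
```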
§2 · GRU 18

GRU: Gated Recurrent Unit (Cho et al., 2014)

Simplified LSTM variant. Combines forget and input gates into a single update gate. Only two gates (reset, update) vs. three in LSTM.

GRU Equations:
r_t = σ(W_r · [h_{t-1}, x_t]) (reset gate)
z_t = σ(W_z · [h_{t-1}, x_t]) (update gate)
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t]) (candidate)
h_t = (1-z_t) ⊙ h̃_t + z_t ⊙ h_{t-1}

Trade-off: Fewer parameters, slightly faster to train. Similar performance to LSTM. Good when GPU memory is constrained.

Figure: GRU cell. Inputs x_t and h_{t-1} feed the reset (r) and update (z) gates; the candidate is blended with the previous state to give h_t.

GRU vs LSTM
  • GRU: 2 gates, no separate cell state, faster, simpler
  • LSTM: 3 gates, explicit cell state, richer expressiveness, proven track record
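A GRU step written directly from the four equations, as a NumPy sketch with bias terms omitted as on the slide. Sizes and weights are hypothetical. With all-zero weights both gates equal 0.5 and the candidate is 0, so the new state is exactly half the old one, which makes the blending step easy to verify by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time step; returns the new hidden state."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx)                                        # reset gate
    z = sigmoid(W_z @ hx)                                        # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate
    return (1.0 - z) * h_tilde + z * h_prev                      # blend old and new

hidden_dim, input_dim = 2, 3  # hypothetical sizes
W_zero = np.zeros((hidden_dim, hidden_dim + input_dim))
h_prev = np.array([1.0, -1.0])
x_t = np.ones(input_dim)

h_new = gru_step(x_t, h_prev, W_zero, W_zero, W_zero)
print(h_new)  # [ 0.5 -0.5]
```

The step needs three weight matrices where the LSTM step needs four, which is where the parameter saving comes from.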
§2 · Bidirectional & Deep 19

Bidirectional & Deep RNNs

Bidirectional RNN: Run two RNNs—one forward (left-to-right), one backward (right-to-left). Concatenate outputs. Gives access to future context too.

Deep RNN: Stack multiple RNN layers. Layer 1 processes input sequence; layer 2 processes layer 1's outputs. Learn hierarchical temporal representations.

Practical: BiLSTM (bidirectional LSTM) is standard for NLP tasks (POS tagging, NER). Stacking 2-3 layers is common; deeper becomes redundant/expensive.

Bidirectional requires seeing the full sequence first (not suitable for real-time prediction).
Figure: a unidirectional RNN (h1 → h2 → h3 → h4), a bidirectional RNN (forward and backward passes with concatenated outputs), and a deep RNN (3 layers) with stacked layers.

Why Stack?
  • Layer 1 learns low-level temporal patterns
  • Layer 2 learns abstract sequence-level features
  • Usually 2-3 layers is optimal
  • Beyond ~4 layers: diminishing returns
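The bidirectional idea can be sketched with two plain tanh RNNs in NumPy: one reads the sequence left to right, the other reads it reversed, and their states are concatenated per position. Weights and sizes are hypothetical, and a real BiLSTM would use LSTM cells instead. The example perturbs the last input to show that only the backward half of the first position's representation can see it.

```python
import numpy as np

def run_rnn(xs, W_h, W_x):
    """Return the hidden state at every position of a simple tanh RNN."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(xs, forward_weights, backward_weights):
    """Concatenate forward states with backward states aligned to the same positions."""
    forward = run_rnn(xs, *forward_weights)
    backward = run_rnn(xs[::-1], *backward_weights)[::-1]  # read reversed, then re-align
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 4  # hypothetical sizes
fwd = (rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, input_dim)))
bwd = (rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, input_dim)))

xs = rng.normal(size=(seq_len, input_dim))
out = bidirectional_rnn(xs, fwd, bwd)

# Change only the LAST token and recompute.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
out_changed = bidirectional_rnn(xs_changed, fwd, bwd)

print(out.shape)  # (4, 8)
print(np.array_equal(out[0, :hidden_dim], out_changed[0, :hidden_dim]))  # True: forward half unaffected
print(np.array_equal(out[0, hidden_dim:], out_changed[0, hidden_dim:]))  # False: backward half sees the future
```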
§2 · Seq2Seq 20

Sequence-to-Sequence: Encoder-Decoder

Architecture: Encoder LSTM reads input sequence, outputs final hidden state (context vector). Decoder LSTM generates output sequence, conditioned on context.

Applications: Machine translation ("Hello" → "Hola"), summarization, dialogue, image captioning.

Problem: Single context vector bottleneck. Can't capture all information from long sequences.

Next: Attention

Attention mechanisms (2015+) allow decoder to focus on relevant encoder states, not just final state.

Figure: the encoder reads "hello world <EOS>" into a context vector c; the decoder starts from c and emits "hola mundo <EOS>".

Seq2Seq Flow
  1. Encoder: Read input sequence, output context vector c
  2. Context: Single vector summarizes entire input
  3. Decoder: Generate output tokens autoregressively using c

  • Problem: Context bottleneck loses information from long sequences
  • Solution (next): Attention lets decoder focus on relevant encoder states
  • Breakthrough: "Attention Is All You Need" (Transformers, 2017)
§3 21

Modern Context & Evolution

From CNNs and RNNs to Transformers and Vision Transformers
III
§3 · When to Use What 22

CNN vs RNN: When to Use Which?

CNNs excel at: Images, spatial grids. Hierarchical feature learning. Fast parallelization. Modern object detection, segmentation.

RNNs/LSTMs excel at: Variable-length sequences. Temporal dependencies. Language modeling, machine translation, time series prediction.

Hybrid: Video = sequence of images. Use CNN to extract spatial features per frame, RNN to model temporal patterns. CNNs in encoder for image captioning.

Practical Rule

CNN if data has spatial structure; RNN if sequential/temporal; hybrid if both.

Figure: side-by-side comparison of use cases.
  • CNNs: Images, Detection, Segmentation, Fast Inference
  • RNNs/LSTMs: Text/NLP, Translation, Time Series, Variable Length
  • Hybrid (CNN + RNN): Video analysis (CNN per frame + RNN for temporal), image captioning (CNN encoder + RNN decoder), visual QA
§3 · Transformers 23

Transformers: Beyond Recurrence (Vaswani et al., 2017)

Key idea: "Attention is All You Need." Replace RNNs entirely with stacked self-attention layers. No recurrence = massive parallelization.

Self-attention: Each token attends to all other tokens in parallel. Learn which tokens are relevant to which. No bottleneck like seq2seq.

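Benefits: Training parallelizes across all sequence positions, so it is typically far faster than RNN training on modern accelerators. Better long-range dependency modeling (no gradient decay through time). Dominates NLP (BERT, GPT, T5).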

Paradigm Shift

Transformers replaced RNNs for most NLP. Later adapted to vision (Vision Transformers).

Figure: self-attention over the tokens "I love Python"; every token attends to all others in parallel.

RNN vs Transformer
  Aspect             RNN (LSTM)              Transformer
  Parallelization    Sequential (slow)       Parallel (fast)
  Long-range deps    Harder (vanishing)      Easier (all-to-all)
  Memory             Low                     High (O(n²))
  Training speed     Slower (step by step)   Much faster (parallel)
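A single-head self-attention layer in NumPy, as a sketch of "every token attends to all others". It uses the scaled dot-product form from Vaswani et al.; the three-token input, the dimensions, and the random projection matrices are hypothetical, and multi-head attention, masking, and positional encodings are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a (tokens, features) matrix."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # one score per pair of tokens: O(n²)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
num_tokens, model_dim = 3, 4                 # e.g. "I", "love", "Python"
X = rng.normal(size=(num_tokens, model_dim))
W_q, W_k, W_v = (rng.normal(size=(model_dim, model_dim)) for _ in range(3))

attended, weights = self_attention(X, W_q, W_k, W_v)
print(attended.shape, weights.shape)  # (3, 4) (3, 3)
print(weights.sum(axis=1))            # each row sums to 1 (up to rounding)
```

All pairwise scores come from one matrix product with no loop over time steps, which is why the computation parallelizes; the n×n weight matrix is also the source of the O(n²) memory cost in the table.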
§3 · Vision Transformers 24

Vision Transformers (ViT): CNNs Under Pressure

Vision Transformers (Dosovitskiy et al., 2021): Divide the image into fixed-size patches (e.g. 16×16 pixels each, so a 224×224 image becomes 14×14 = 196 tokens), then apply a Transformer. No convolutions.

Performance: Surpasses ResNet when pre-trained on large datasets. Scales better with data/compute. Has become a standard choice for many vision tasks.

Impact: CNNs (inductive bias for locality) give way to attention (pure learning). Shows local connectivity isn't essential with enough data.

Paradigm Shift

Vision increasingly moves from CNNs → Transformers. CNNs remain efficient for small data, edge devices.

Figure: a 224×224 image split into 16×16 patches, which are fed to a Transformer (self-attention).

ViT Advantages
  • Weak inductive bias: Learns from data, not hard-coded assumptions
  • Better scaling: Performance improves with larger datasets/models
  • Unified architecture: Same model for vision, language, multimodal
  • Trade-off: Needs more data than CNNs; higher memory for self-attention
  • Status: SOTA on ImageNet, increasingly standard industry choice
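The patch-splitting step can be sketched with a reshape in NumPy. The function name `image_to_patches` is hypothetical, and the learned linear projection and position embeddings that ViT applies afterwards are not shown. For a 224×224 RGB image and 16×16 patches this yields 196 tokens of 16×16×3 = 768 values each.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly"
    grid_h, grid_w = h // patch_size, w // patch_size
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (grid_h, grid_w, patch, patch, C)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

# Synthetic image whose values encode their own position, to make checks easy.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```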
§3 · Transfer Learning 25

Transfer Learning & Fine-Tuning

Transfer learning: Pre-train on large dataset (ImageNet, COCO), fine-tune on domain-specific small dataset. Leverages learned features.

Why effective: Lower layers learn universal features (edges, textures). Higher layers adapt to the specific task. Reusing the lower layers reduces the data and compute needed for the new task.

Strategies: (1) Freeze backbone, train only head. (2) Fine-tune all layers with low learning rate. (3) Progressive unfreezing.

Impact: Most modern applications use pre-trained models. Training from scratch is comparatively rare, except when the target domain differs strongly from the pre-training data.

Figure: Phase 1 (pre-training): ImageNet, 1.2M images → ResNet-50 → softmax. Phase 2 (fine-tuning): domain data, 100-1000 images → frozen backbone, trained head.

Transfer Learning Benefits
  • Data efficiency: Pre-trained features can substantially reduce the amount of labeled data needed
  • Faster convergence: Start from learned features, not random weights
  • Better performance: Often outperforms training from scratch on small datasets
  • Standard practice: Most real-world applications start from pre-trained models

Strategies:
  1. Freeze backbone, fine-tune classifier only (low data)
  2. Fine-tune all layers with low learning rate (medium data)
§3 · History 26

Key Milestones: From LeNet to Transformers

Timeline of Breakthroughs
1991-1994: Hochreiter, Bengio et al. identify the vanishing gradient problem
1997: LSTM (Hochreiter & Schmidhuber): gated recurrent cells
1998: LeNet-5 (LeCun et al.): pioneering CNN for digit recognition
2012: AlexNet (Krizhevsky et al.): deep learning revolution, ImageNet
2014: VGGNet, GRU (Cho et al.): simplification & depth
2015: ResNet (He et al.): skip connections, 152 layers
2017: Transformers (Vaswani et al.): "Attention Is All You Need"
2021: Vision Transformers (Dosovitskiy et al.): ViT surpasses ResNet

Narrative: CNNs dominated vision for 15 years. RNNs/LSTMs for language. Transformers unified both domains, enabling foundation models (BERT, GPT, etc.).

§3 · Legacy 27

The Legacy of CNNs & RNNs

CNNs remain essential: Efficient on constrained devices, small datasets, real-time vision. Foundation for hybrid models (CNN + Transformer backbones). Inductive bias still valuable.

RNNs/LSTMs: Specialized RNN variants (Gated Recurrent Units, Bidirectional LSTMs) still used where Transformers are overkill or latency matters. Strong teaching architectures for understanding sequential processing.

Conceptual legacy: CNNs + RNNs introduced fundamental ideas (parameter sharing, local connectivity, gating, attention mechanisms) that shaped all modern architectures.

Future Outlook

Transformers now dominant, but CNNs/RNNs remain deployed in production. Hybrid architectures (Perceiver, Flamingo) blend all three paradigms.

Figure: timeline of eras: CNN Era (1998-2012), RNN Era (1997-2017), Transformer Era (2017-present).

Current Usage (2026)
  • CNNs: Still SOTA for many vision tasks, efficient inference, mobile/edge, real-time applications
  • RNNs: Specialized applications (time series forecasting, speech, where latency is critical)
  • Transformers: Dominant in NLP (LLMs), increasingly in vision (ViT), multimodal (CLIP)
  • Hybrid: ConvNets + Transformers (e.g., Swin Transformer), domain-specific fusion

Key Insight: Transformers don't replace CNNs/RNNs entirely. Each has optimal use cases. Understanding all three paradigms is essential for modern ML engineers.
Summary 28

Key Takeaways

CNNs: Spatial Structure

Local connectivity, shared weights, pooling. Ideal for images. Efficient, well-understood, inductive bias for locality.

RNNs/LSTMs: Temporal Dependencies

Hidden state, sequential processing, gating mechanisms. Handle variable-length sequences. LSTMs solve vanishing gradient via additive cell updates.

Transformers: Attention Paradigm

All-to-all attention, no recurrence, massively parallel. Scales to foundation models. Now dominant in NLP, expanding to vision.

Practical Reality

Transfer learning from pre-trained models is standard. Fine-tuning beats training from scratch. Hybrid architectures combine strengths of all three.

Resources 29

Further Reading & References

Landmark Papers:

• LeNet: LeCun et al. (1998), "Gradient-Based Learning Applied to Document Recognition"

• LSTM: Hochreiter & Schmidhuber (1997), "Long Short-Term Memory"

• AlexNet: Krizhevsky et al. (2012), "ImageNet Classification with Deep Convolutional Neural Networks"

• ResNet: He et al. (2015), "Deep Residual Learning for Image Recognition"

• Transformers: Vaswani et al. (2017), "Attention Is All You Need"

• ViT: Dosovitskiy et al. (2021), "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale"

Courses & Textbooks:

• Goodfellow, Bengio, Courville (2016), "Deep Learning" (MIT Press)

• Stanford CS231n (computer vision) and CS224n (NLP), available online

• "Dive into Deep Learning" (D2L.ai), free online with code