CNNs & RNNs
Neural Architectures for Spatial & Sequential Data

Deep learning fundamentals for ML students

From images to sequences — the two pillars of modern architecture

Contents 1

Learning Map

I. Convolutional Neural Networks
II. Recurrent Neural Networks
III. Classic Architectures
IV. Modern Context & Evolution

§1 2

Convolutional Neural Networks

Capturing spatial structure in images and grids

I

§1 · Problem Space 3

Why Fully-Connected Layers Fail on Images

Parameter explosion: A 224×224×3 image has about 150K input values (224 × 224 × 3 = 150,528). A single fully-connected hidden layer of 1000 units therefore requires about 150M weights.

No spatial structure: Fully-connected layers treat every input position independently, ignoring that nearby pixels are strongly related.

No translation invariance: A cat in the top-left corner looks different to the network than the same cat in the center.

Core Insight

Images have local structure. Small filters that detect edges, corners, and textures are more powerful than learning pixel-by-pixel relationships.

§1 · Convolution 4

The Convolution Operation

A kernel (or filter) is a small matrix of learned weights. It slides over the input, computing element-wise products and summing.

Example: A 5×5 kernel with learnable weights detects patterns like edges or textures across the image.

Convolution Output output[i,j] = Σ_m Σ_n kernel[m,n] × input[i+m, j+n]
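The sliding-window sum can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding and a stride of 1; the function name `conv2d_valid` and the toy edge-detecting kernel are made up for this example, and a real framework would use a far faster implementation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            # Element-wise product of the kernel and the patch at (i, j).
            for m in range(kh):
                for n in range(kw):
                    total += kernel[m][n] * image[i + m][j + n]
            output[i][j] = total
    return output


# Toy input: dark left half, bright right half (a vertical edge).
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical hand-set kernel that responds to dark-to-bright transitions.
edge_kernel = [[-1, 1], [-1, 1]]
result = conv2d_valid(image, edge_kernel)
print(result)  # the middle column responds to the edge
```

The same four kernel weights are reused at every position, which is the weight sharing described on this slide.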

Key benefits: Shared weights across space, local connectivity, fewer parameters, translation equivariance (a shifted input produces a correspondingly shifted feature map).

The same 5×5 kernel is reused thousands of times, drastically reducing parameters vs. fully-connected.
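To make the saving concrete, the parameter counts can be compared directly. The sketch below assumes a convolutional layer with 64 filters of size 5×5 over 3 input channels; the filter count is an illustrative assumption, not a figure from these slides.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully-connected layer."""
    return n_inputs * n_units + n_units


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per filter for a square-kernel conv layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


fc_count = dense_params(224 * 224 * 3, 1000)   # fully-connected on a 224x224x3 image
conv_count = conv_params(5, 3, 64)             # 64 filters is an assumed example size
print(f"fully-connected: {fc_count:,}")
print(f"convolutional:   {conv_count:,}")
```

The convolutional count does not depend on the image size at all, only on the kernel size and the channel counts.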

§1 · Stride & Padding 5

Controlling Spatial Dimensions

Stride: How many pixels the kernel moves each step. Stride=1 shifts the kernel one pixel at a time; stride=2 shifts it two pixels, skipping every other position.

Padding: Adding zero-valued border pixels to preserve spatial dimensions. A 32×32 input with 5×5 kernel and padding=2 produces 32×32 output (without padding, output shrinks to 28×28).

Output Dimension Formula

output_size = floor((input_size − kernel_size + 2×padding) / stride) + 1

High stride = smaller output (downsampling). Padding preserves edge information.
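The output dimension formula is easy to check numerically. A minimal sketch, assuming integer (floor) division when the stride does not divide evenly; the function name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


print(conv_output_size(32, 5, padding=2))            # 32: padding preserves the size
print(conv_output_size(32, 5))                       # 28: no padding shrinks the output
print(conv_output_size(32, 5, padding=2, stride=2))  # 16: stride 2 halves the size
```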

§1 · Pooling 6

Pooling: Downsampling & Feature Aggregation

Pooling layers reduce spatial dimensions by applying a stateless operation (max or average) over small windows.

Max pooling: Selects the maximum value in each window. Captures the strongest feature response.

Average pooling: Computes the average. Smoother, less sensitive to noise.

Benefits: Reduces computation and the size of later layers, adds a degree of local translation invariance, can reduce overfitting, and enlarges the receptive field of subsequent layers.

Pooling has no learnable parameters—it's a deterministic operation.
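Both pooling variants can be sketched with one small function. This assumes non-overlapping square windows (stride equal to the window size), which is a common but not universal choice; `pool2d` is a name made up for this example.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Downsample with non-overlapping size x size windows (max or average)."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
print(pool2d(fmap, mode="max"))  # [[4, 2], [2, 8]]
print(pool2d(fmap, mode="avg"))  # [[2.5, 1.0], [0.75, 6.5]]
```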

§1 · Receptive Field 7

Understanding Receptive Field

The receptive field of a neuron is the region of the input image that influences its output.

In layer 1, a 5×5 convolution has a 5×5 receptive field. But in layer 2, stacking another 5×5 convolution expands the receptive field to 9×9.

Receptive Field Growth RF_L = 1 + Σ_{l=1..L} (kernel_size_l − 1) × Π_{i<l} stride_i (with stride 1 everywhere: RF = 1 + Σ (kernel_size_l − 1))

Deep networks see the whole image: Shallow layers detect edges; deeper layers see textures, shapes, objects. Large receptive fields are crucial for understanding context.
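Receptive field growth can be computed layer by layer: each layer adds (kernel_size − 1) input pixels, scaled by the product of the strides of all earlier layers. The sketch below uses that standard recurrence; the `(kernel_size, stride)` tuple format is an assumption of this example.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel_size, stride) pairs."""
    rf = 1
    jump = 1  # product of the strides of all earlier layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(5, 1)]))          # 5
print(receptive_field([(5, 1), (5, 1)]))  # 9, matching the two-layer example
print(receptive_field([(3, 2), (3, 1)]))  # 7: an earlier stride widens later layers
```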

§1 · LeNet (1998) 8

LeNet-5: The Pioneer

Designed by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner for handwritten digit recognition.

Architecture: 2 convolutional layers + 3 fully-connected layers. Total: ~60K parameters.

Impact: One of the first widely deployed CNNs; it demonstrated that local feature learning works. Used in bank check processing.

Historical Significance

LeNet proved that convolutions + shared weights could solve real-world vision tasks efficiently.

Input: 32×32 grayscale. Output: 10 classes (digits 0-9).

§1 · AlexNet (2012) 9

AlexNet: The Deep Learning Revolution

Designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Won ILSVRC (ImageNet) 2012 with a top-5 error of about 15%, versus about 26% for the runner-up: a stunning 11-point improvement.

Architecture: 5 convolutional + 3 fully-connected layers. ~60M parameters. Popularized ReLU activations and dropout regularization in large-scale CNNs.

Breakthrough Moment

AlexNet sparked the modern deep learning era. Proved that CNNs could learn complex image representations at scale with GPUs.

Input: 224×224 RGB (ImageNet). Output: 1000 classes.

§1 · VGG & ResNet 10

Deeper Networks: VGGNet (2014) & ResNet (2015)

VGGNet (Visual Geometry Group): 16-19 layers using only 3×3 convolutions stacked deeply. Key insight: two stacked 3×3 convs cover the same 5×5 receptive field as one 5×5 conv, with an extra non-linearity and fewer parameters (18C² vs. 25C² weights for C channels).
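The parameter saving from stacking small kernels is simple arithmetic. A sketch, assuming the same number of input and output channels C in every layer and ignoring biases; C = 64 is an illustrative choice.

```python
def conv_weights(kernel_size, channels, n_layers=1):
    """Weight count for n_layers of square convs with `channels` in and out."""
    return n_layers * kernel_size * kernel_size * channels * channels


C = 64  # assumed channel count for illustration
two_3x3 = conv_weights(3, C, n_layers=2)  # 2 * 9 * C^2
one_5x5 = conv_weights(5, C)              # 25 * C^2
print(two_3x3, one_5x5, two_3x3 / one_5x5)
```

Both options cover a 5×5 receptive field, but the stacked version uses 72% of the weights and applies two non-linearities instead of one.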

ResNet (Microsoft Research): 18-152 layers with skip connections (identity shortcuts). Each block outputs input + transformation. This eases gradient flow and addresses the degradation problem, where simply adding layers to a plain deep network makes training accuracy worse.

Skip Connection

y = x + F(x), where F is the transformation learned by the block's layers. Preserves the original signal, enabling 100+ layer networks.
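A skip connection is just an element-wise addition around a learned transformation. The sketch below treats F as an arbitrary Python callable on a list of floats; in a real ResNet block F would be a small stack of convolutions, which is omitted here.

```python
def residual_block(x, transform):
    """Return x + F(x), where `transform` plays the role of F."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]


# If F outputs zeros, the block reduces to the identity mapping.
identity_out = residual_block([1.0, 2.0], lambda v: [0.0 for _ in v])
# A non-trivial F only has to model the change relative to the input.
shifted_out = residual_block([1.0, 2.0], lambda v: [0.5 * a for a in v])
print(identity_out, shifted_out)
```

Because the identity path is always present, a block can fall back to "do nothing" simply by driving F toward zero.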

§1 · Regularization 11

Batch Normalization & Dropout

Batch Normalization: Normalizes each feature to mean 0, variance 1 within each mini-batch, then applies a learnable scale γ and shift β. Stabilizes training, allows higher learning rates, acts as a mild regularizer.

Batch Norm x_norm = (x - batch_mean) / sqrt(batch_var + ε), y = γ × x_norm + β
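The normalization step can be sketched for a single feature across a mini-batch. This is a training-time illustration only; it assumes a learnable scale `gamma` and shift `beta` as in the standard formulation and leaves out the running statistics that frameworks keep for inference.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(normalized, out_mean, out_var)  # mean close to 0, variance close to 1
```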

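Dropout: Randomly deactivates neurons with probability p during training. Prevents co-adaptation and forces the network to learn robust features. Not applied at test time; with the common "inverted" variant, kept activations are scaled by 1/(1−p) during training so no rescaling is needed at test time.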
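A minimal sketch of dropout, assuming the "inverted" variant in which surviving activations are scaled by 1/(1−p) during training so that the layer becomes a no-op at test time. The `rng` parameter is added here only to make the example reproducible.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p while training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # identity at test time
    keep = 1.0 - p
    # Survivors are scaled up so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
train_out = dropout([1.0] * 1000, p=0.5, rng=rng)
test_out = dropout([1.0, 2.0], p=0.5, training=False)
print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(test_out)                         # unchanged
```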

Combined Effect

Batch norm typically speeds up and stabilizes training, and dropout typically improves generalization. Used with care, they enable deeper, more powerful networks.

§2 12

Recurrent Neural Networks

Processing sequential and temporal data

II

§2 · Sequential Data 13

Why Feed-Forward Networks Struggle with Sequences

Sequential data: Text, time series, audio, video. Each element depends on previous elements.

Variable length: Sentences range from 5 to 1000+ tokens. Fully-connected layers expect fixed input size.

Long-range dependencies: In "The bank executive who works in New York said the …", a later pronoun such as "she" refers back to a subject many words away.

Core Challenge

Need architecture that processes variable-length sequences, remembers past context, and handles long-range dependencies.

§2 · Vanilla RNN 14

The Vanilla RNN: Hidden State & Weight Sharing

RNNs maintain a hidden state h that updates at each time step. Same weights are used for every token.

RNN Update h_t = tanh(W_h × h_{t-1} + W_x × x_t + b)
y_t = W_y × h_t + b_y
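The update equations can be unrolled over a sequence in plain Python. This is a sketch with matrices stored as lists of rows; the helper `matvec` and the function names are assumptions of this example, and the tiny zero-valued weights are for demonstration only.

```python
import math


def matvec(W, v):
    """Multiply a list-of-rows matrix W by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(xs, h0, W_h, W_x, b, W_y, b_y):
    """Apply the same weights at every time step; return hidden states and outputs."""
    h = h0
    hidden_states, outputs = [], []
    for x_t in xs:
        pre = [a + c + d for a, c, d in zip(matvec(W_h, h), matvec(W_x, x_t), b)]
        h = [math.tanh(p) for p in pre]                       # h_t
        y = [a + c for a, c in zip(matvec(W_y, h), b_y)]      # y_t
        hidden_states.append(h)
        outputs.append(y)
    return hidden_states, outputs


# Two hidden units, one input feature, one output; sequences may have any length.
W_h = [[0.0, 0.0], [0.0, 0.0]]
W_x = [[0.0], [0.0]]
b = [0.0, 0.0]
W_y = [[0.0, 0.0]]
b_y = [0.7]
states, ys = rnn_forward([[1.0], [2.0], [3.0]], [0.0, 0.0], W_h, W_x, b, W_y, b_y)
print(len(states), ys)
```

The loop body never changes with the sequence length, which is how one set of weights handles inputs of 3 tokens or 3000.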

Key idea: The hidden state acts as memory, carrying information forward through time. Weight sharing across timesteps is parameter-efficient.

Training: Backpropagation through time (BPTT). Gradients flow backward through unrolled sequence.

§2 · Vanishing Gradient 15

The Vanishing Gradient Problem

Problem: When training RNNs with BPTT, gradients tend to shrink exponentially as they flow backward through many time steps. After a few tens of steps, the gradient is often close to 0.

Root cause: The chain rule: dL/dh_0 = dL/dh_T × dh_T/dh_{T-1} × ... × dh_1/dh_0. Each factor is a Jacobian; when their norms are below 1, the product decays exponentially (and when they are above 1, it can explode instead).
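The decay is easy to see in a one-dimensional toy RNN, h_t = tanh(w × h_{t-1}), where each chain-rule factor is dh_t/dh_{t-1} = w × (1 − h_t²). The scalar setup and the value w = 0.9 are assumptions chosen for illustration, not a model from these slides.

```python
import math


def gradient_through_time(w, steps, h0=0.5):
    """Track dh_t/dh_0 for a scalar RNN h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    history = []
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # one chain-rule factor per time step
        history.append(grad)
    return history


grads = gradient_through_time(w=0.9, steps=50)
for t in (1, 10, 30, 50):
    print(t, grads[t - 1])
```

Every factor here is below 0.9, so after T steps the gradient is smaller than 0.9 to the power T.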

Consequence

Vanilla RNNs struggle to learn long-range dependencies (in practice, beyond roughly 10-20 steps). Early time steps stop receiving meaningful gradient updates.

Discovered: Hochreiter (1991), Bengio et al. (1994)

§2 · LSTM 16

LSTMs: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)

LSTM cells: Replace vanilla RNN cells with gated units. Three gates (forget, input, output) control information flow.

Cell state (c): Long-term memory with additive updates. Forget gate selects what to discard; input gate selects what to add.

Hidden state (h): Short-term output. Output gate controls what of the cell state is exposed.

Key Innovation

Additive cell updates (+ operator) let gradients flow largely unimpeded when the forget gate stays near 1. Multiplicative gates provide selective memory.

§2 · LSTM Gates 17

LSTM Gates Explained

Forget gate: f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f). Values in [0,1]. Multiplies cell state; 0 = discard, 1 = keep.

Input gate: i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i). Controls how much of the new candidate (tanh) to add to the cell state.

Cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(...). Additive update ensures gradient flow.

Output gate: o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o). Controls exposure of cell state as hidden state: h_t = o_t ⊙ tanh(c_t).

Sigmoid outputs are smooth gate values in (0,1), not probabilities; tanh candidates are zero-centered in (-1,1).
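The four gate equations fit into a single step function. This sketch follows the standard LSTM formulation; the candidate weights `W_c` and `b_c` and the output rule h_t = o_t ⊙ tanh(c_t) are assumptions taken from that formulation, and the parameter dictionary layout is invented for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W · v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(h_prev, c_prev, x_t, p):
    """One LSTM time step; `p` maps names like "W_f" and "b_f" to parameters."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(p["W_f"], concat, p["b_f"])]    # forget gate
    i_t = [sigmoid(z) for z in affine(p["W_i"], concat, p["b_i"])]    # input gate
    o_t = [sigmoid(z) for z in affine(p["W_o"], concat, p["b_o"])]    # output gate
    g_t = [math.tanh(z) for z in affine(p["W_c"], concat, p["b_c"])]  # candidate (assumed names)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


hidden, inputs = 2, 1
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zeros for name in ("W_f", "W_i", "W_o", "W_c")}
# Forget gate pinned open and input gate pinned shut: the cell should just remember.
params.update({"b_f": [100.0] * hidden, "b_i": [-100.0] * hidden,
               "b_o": [0.0] * hidden, "b_c": [0.0] * hidden})
h, c = lstm_step([0.0, 0.0], [1.5, -0.5], [1.0], params)
print(c, h)
```

With the forget gate saturated at 1 and the input gate at 0, the cell state passes through unchanged, which is the mechanism that preserves long-term memory.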

§2 · GRU 18

GRU: Gated Recurrent Unit (Cho et al., 2014)

Simplified LSTM variant. Combines forget and input gates into a single update gate. Only two gates (reset, update) vs. three in LSTM.

GRU Equations r_t = σ(W_r · [h_{t-1}, x_t]) (reset gate)
z_t = σ(W_z · [h_{t-1}, x_t]) (update gate)
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t])
h_t = (1-z_t) ⊙ h̃_t + z_t ⊙ h_{t-1}
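The GRU equations translate line by line into code. A minimal sketch without bias terms, matching the equations as written on this slide; matrices are lists of rows and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a list-of-rows matrix W by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(h_prev, x_t, W_r, W_z, W_h):
    """One GRU time step following the four equations above."""
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    r_t = [sigmoid(v) for v in matvec(W_r, concat)]         # reset gate
    z_t = [sigmoid(v) for v in matvec(W_z, concat)]         # update gate
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t      # [r_t ⊙ h_{t-1}, x_t]
    h_tilde = [math.tanh(v) for v in matvec(W_h, gated)]    # candidate state
    return [(1.0 - z) * ht + z * hp
            for z, ht, hp in zip(z_t, h_tilde, h_prev)]


# With all-zero weights both gates sit at 0.5 and the candidate is 0,
# so the new state is exactly half of the old one.
zero_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h_next = gru_step([1.0, -2.0], [0.3], zero_w, zero_w, zero_w)
print(h_next)
```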

Trade-off: Fewer parameters, slightly faster to train. Similar performance to LSTM. Good when GPU memory is constrained.

§2 · Bidirectional & Deep 19

Bidirectional & Deep RNNs

Bidirectional RNN: Run two RNNs—one forward (left-to-right), one backward (right-to-left). Concatenate outputs. Gives access to future context too.

Deep RNN: Stack multiple RNN layers. Layer 1 processes input sequence; layer 2 processes layer 1's outputs. Learn hierarchical temporal representations.

Practical: BiLSTM (bidirectional LSTM) has long been a standard choice for sequence-labeling tasks in NLP (POS tagging, NER). Stacking 2-3 layers is common; deeper stacks tend to give diminishing returns at higher cost.

Bidirectional requires seeing the full sequence first (not suitable for real-time prediction).

§2 · Seq2Seq 20

Sequence-to-Sequence: Encoder-Decoder

Architecture: Encoder LSTM reads input sequence, outputs final hidden state (context vector). Decoder LSTM generates output sequence, conditioned on context.

Applications: Machine translation ("Hello" → "Hola"), summarization, dialogue, image captioning.

Problem: Single context vector bottleneck. Can't capture all information from long sequences.

Next: Attention

Attention mechanisms (2015+) allow decoder to focus on relevant encoder states, not just final state.

§3 21

Modern Context & Evolution

From CNNs and RNNs to Transformers and Vision Transformers

III

§3 · When to Use What 22

CNN vs RNN: When to Use Which?

CNNs excel at: Images, spatial grids. Hierarchical feature learning. Fast parallelization. Modern object detection, segmentation.

RNNs/LSTMs excel at: Variable-length sequences. Temporal dependencies. Language modeling, machine translation, time series prediction.

Hybrid: Video = sequence of images. Use CNN to extract spatial features per frame, RNN to model temporal patterns. CNNs in encoder for image captioning.

Practical Rule

CNN if data has spatial structure; RNN if sequential/temporal; hybrid if both.

§3 · Transformers 23

Transformers: Beyond Recurrence (Vaswani et al., 2017)

Key idea: "Attention is All You Need." Replace RNNs entirely with stacked self-attention layers. No recurrence = massive parallelization.

Self-attention: Each token attends to all other tokens in parallel. Learn which tokens are relevant to which. No bottleneck like seq2seq.
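Self-attention can be sketched as scaled dot-product attention, the formulation used in the Transformer paper. The query/key/value naming comes from that paper rather than from this slide, and the sketch skips the learned projection matrices and multiple heads that a real layer would include.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Each query attends to all keys; returns outputs and attention weights."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # how relevant each token is to this one
        mixed = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Two identical tokens attend to each other equally, so each output is the mean value.
tokens = [[1.0, 0.0], [1.0, 0.0]]
values = [[2.0], [4.0]]
out, attn = self_attention(tokens, tokens, values)
print(out, attn)
```

Every query row is independent of the others, so all tokens can be processed in parallel, unlike the step-by-step loop of an RNN.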

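Benefits: Train far faster than RNNs on parallel hardware, because all positions in a sequence are processed at once. Better long-range dependency modeling: any two tokens are connected by a single attention step rather than a long chain of recurrent updates. Dominates NLP (BERT, GPT, T5).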

Paradigm Shift

Transformers replaced RNNs for most NLP. Later adapted to vision (Vision Transformers).

§3 · Vision Transformers 24

Vision Transformers (ViT): CNNs Under Pressure

Vision Transformers (Dosovitskiy et al., 2021): Divide the image into fixed-size patches (e.g. 16×16 pixels), treat each patch as a token, and apply a Transformer. No convolutions.
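The patch-splitting step can be sketched on its own. This assumes a single-channel image stored as a list of rows whose sides are divisible by the patch size; the learned linear embedding and position embeddings that follow in a real ViT are omitted.

```python
def patchify(image, patch_size):
    """Split a 2D image into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for i in range(0, height, patch_size):
        for j in range(0, width, patch_size):
            patch = [image[i + a][j + b]
                     for a in range(patch_size) for b in range(patch_size)]
            patches.append(patch)
    return patches


# A 224x224 image whose pixel value encodes its position, for easy checking.
image = [[row * 224 + col for col in range(224)] for row in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 196 tokens of 256 values each
```

A 224×224 image therefore becomes a sequence of 14 × 14 = 196 tokens, which is what the Transformer actually sees.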

Performance: Matches or surpasses ResNets when pre-trained on large datasets. Scales better with data and compute. Has become a common backbone for vision tasks.

Impact: CNNs (inductive bias for locality) give way to attention (pure learning). Shows local connectivity isn't essential with enough data.

Paradigm Shift

Vision increasingly moves from CNNs → Transformers. CNNs remain efficient for small data, edge devices.

§3 · Transfer Learning 25

Transfer Learning & Fine-Tuning

Transfer learning: Pre-train on large dataset (ImageNet, COCO), fine-tune on domain-specific small dataset. Leverages learned features.

Why effective: Lower layers learn universal features (edges, textures). Higher layers adapt to specific task. Reusing lower layers reduces the labeled data and compute needed for the new task.

Strategies: (1) Freeze backbone, train only head. (2) Fine-tune all layers with low learning rate. (3) Progressive unfreezing.

Impact: Most modern applications use pre-trained models. Training from scratch is rare (except when the target domain is very different from the pre-training data).

§3 · History 26

Key Milestones: From LeNet to Transformers

Timeline of Breakthroughs 1991-1994: Hochreiter, Bengio et al. identify the vanishing gradient problem
1997: LSTM (Hochreiter & Schmidhuber), gated recurrent cells
1998: LeNet-5 (LeCun et al.), early successful CNN for digit recognition
2012: AlexNet (Krizhevsky et al.), deep learning revolution on ImageNet
2014: VGGNet, GRU (Cho et al.), simplification & depth
2015: ResNet (He et al.), skip connections, up to 152 layers
2017: Transformers (Vaswani et al.), "Attention is All You Need"
2021: Vision Transformers (Dosovitskiy et al.), ViT surpasses ResNet with large-scale pre-training

Narrative: CNNs dominated vision for roughly a decade after AlexNet, while RNNs/LSTMs dominated language. Transformers unified both domains, enabling foundation models (BERT, GPT, etc.).

§3 · Legacy 27

The Legacy of CNNs & RNNs

CNNs remain essential: Efficient on constrained devices, small datasets, real-time vision. Foundation for hybrid models (CNN + Transformer backbones). Inductive bias still valuable.

RNNs/LSTMs: Specialized RNN variants (Gated Recurrent Units, Bidirectional LSTMs) still used where Transformers are overkill or latency matters. Strong teaching architectures for understanding sequential processing.

Conceptual legacy: CNNs + RNNs introduced fundamental ideas (parameter sharing, local connectivity, gating, attention mechanisms) that shaped all modern architectures.

Future Outlook

Transformers are now dominant, but CNNs and RNNs remain deployed in production. Hybrid architectures (e.g. Perceiver, Flamingo) combine ideas from several of these paradigms.

Summary 28

Key Takeaways

CNNs: Spatial Structure

Local connectivity, shared weights, pooling. Ideal for images. Efficient, well-understood, inductive bias for locality.

RNNs/LSTMs: Temporal Dependencies

Hidden state, sequential processing, gating mechanisms. Handle variable-length sequences. LSTMs mitigate the vanishing gradient problem via additive cell updates.

Transformers: Attention Paradigm

All-to-all attention, no recurrence, massively parallel. Scales to foundation models. Now dominant in NLP, expanding to vision.

Practical Reality

Transfer learning from pre-trained models is standard. Fine-tuning usually beats training from scratch, especially when labeled data is limited. Hybrid architectures combine strengths of all three.
