Linear models are fundamentally limited: they cannot express non-linear decision boundaries or capture feature interactions. Neural networks overcome this through composition of nonlinear transformations. The universal approximation theorem guarantees that with enough hidden units and nonlinear activations, networks can approximate any continuous function on a compact domain. This power comes from learned feature representations that gradually transform raw inputs into higher-level abstractions.
Deep networks learn hierarchical representations where early layers capture low-level patterns and deeper layers combine them into semantic concepts. The nonlinearity enables the network to fold and twist the input space, creating arbitrary decision boundaries. Modern architectures exploit this to solve tasks that linear methods cannot: image recognition, language understanding, and complex reasoning all rely on the expressive power that depth and nonlinearity provide.
02
Network Structure
Neural networks are organized into layers: input, hidden, and output. Each layer contains units (neurons) connected to the previous layer via learnable weights and biases. For a unit in layer ℓ, the activation is computed from weighted inputs plus bias, then passed through an activation function. Common activations include ReLU (piecewise linear), sigmoid (smooth probability), and tanh (centered). The choice of activation affects optimization dynamics and representational capacity.
Architecture design involves choosing depth (number of hidden layers), width (units per layer), and activation functions. Wider networks increase capacity but may overfit; deeper networks learn hierarchical representations but become harder to train. Skip connections bypass layers, enabling very deep models. Modern practice favors ReLU and variants (Leaky ReLU, GELU) for their computational efficiency and training stability compared to older sigmoid/tanh activations.
03
Computing Activations
Forward propagation computes layer-by-layer activations by matrix multiplication and element-wise operations. For input x, each layer ℓ computes: z^[ℓ] = W^[ℓ]a^[ℓ-1] + b^[ℓ], then a^[ℓ] = σ(z^[ℓ]) where σ is the activation. Vectorized computation processes entire batches simultaneously, dramatically improving GPU utilization. A batch of N samples computes N predictions in parallel: Z^[ℓ] is (units × N), resulting in A^[ℓ] also (units × N).
Vectorization is essential for practical deep learning. Computing one sample at a time wastes GPU parallelism; batching amortizes overhead. The forward pass accumulates intermediate activations and pre-activations needed later for backpropagation. Memory usage scales with batch size, layer width, and model depth, creating practical trade-offs. Efficient implementations minimize data movement between GPU memory and computation.
04
Gradient Flow
Backpropagation computes gradients by applying the chain rule backward through the network. The key insight: derivatives flow from the loss L backward through the layers, combining with local layer derivatives to give the gradient for each parameter. If the loss gradient ∂L/∂a^[out] is known, then ∂L/∂z^[out] = ∂L/∂a^[out] ⊙ σ'(z^[out]). Then ∂L/∂W = ∂L/∂z · (a^[prev])^T, and derivatives propagate to previous layers.
Backprop is the reverse of forward propagation: information flows backward. This efficient algorithm computes all gradients in O(cost of forward pass), avoiding the naive approach of numerically differentiating each parameter. Modern frameworks automate backprop via computational graphs that track operations and automatically derive gradient rules. Understanding backprop is crucial for debugging, choosing architectures, and designing new layers.
05
Jacobians & Gradients
Jacobians are matrices of all first-order partial derivatives. For a vector function f: ℝⁿ → ℝᵐ, the Jacobian J is m×n with entries ∂f_i/∂x_j. In neural networks, Jacobians describe how each output changes with each input. For a single-layer network with weights W and bias b, the Jacobian of z = Wx+b is simply W. For composed functions, Jacobians multiply via the chain rule: ∂z/∂x = (∂z/∂y)(∂y/∂x).
Upstream gradients arrive from the loss function; downstream gradients propagate to earlier layers. At each layer, local Jacobians transform upstream gradients into downstream gradients. The product of Jacobians along the path determines sensitivity. Deep networks can suffer vanishing gradients (products approach zero) or exploding gradients (products grow exponentially). Modern techniques like normalization, careful initialization, and ReLU activations mitigate these issues, enabling stable training of very deep models.
06
Basic Loss Functions
Backward functions define gradient computation for each module. Linear layer: given upstream ∂L/∂z, compute ∂L/∂W = ∂L/∂z · a^T and ∂L/∂a = W^T · ∂L/∂z. ReLU: the gradient is zero where the pre-activation z is negative and passes through where it is positive (∂L/∂z = [z > 0] ⊙ ∂L/∂a). Sigmoid and tanh: gradients involve σ(1-σ) and 1-tanh². Softmax with cross-entropy loss has a simple gradient: ∂L/∂z = (softmax(z) - y), where y is the one-hot target.
Loss functions couple with final layers to produce clean gradients. Cross-entropy for classification creates ∂L/∂logits = (predictions - targets), enabling efficient gradient descent. MSE loss for regression yields ∂L/∂predictions = 2(predictions - targets). Proper loss-layer combinations are crucial: using squared error with softmax is suboptimal compared to softmax+cross-entropy. Understanding backward functions enables building custom layers and debugging gradient flow issues.
07
Advanced Techniques
Modern modules improve training stability and performance beyond basic layers. BatchNorm normalizes layer inputs to zero mean/unit variance, reducing internal covariate shift and enabling higher learning rates. Dropout randomly zeros activations during training, acting as an implicit ensemble and regularizer. Skip connections let gradients bypass layers, alleviating the vanishing gradient problem in very deep networks. Residual networks learn incremental changes via z = x + f(x) rather than z = f(x), enabling networks with hundreds of layers.
Attention mechanisms compute weighted combinations of inputs, with weights learned from content similarity. Self-attention relates each position to all others; this enables transformers to process sequences without recurrence. Layer normalization and various positional encodings support attention-based architectures. These modern components are composable: combining residuals, attention, and normalization creates powerful architectures for vision and language. Understanding these building blocks is essential for reading and implementing contemporary models.
08
Batch Processing
Vectorization processes entire batches simultaneously. Instead of computing one sample at a time, matrices have shape (features, batch_size). Forward pass: X is (input_dims, N), W is (hidden_dims, input_dims), so Z = W @ X + b broadcasts bias to N samples. Backward pass: dW = (dZ @ X.T) / N accumulates gradients across the batch. This amortizes weight matrix reads and enables GPU cores to work in parallel on different samples simultaneously.
GPU parallelism requires thinking in matrix operations. A typical GPU can launch thousands of threads; processing individual samples serializes these. Large batches (hundreds to thousands) best exploit hardware. Frameworks like PyTorch and TensorFlow automatically handle vectorization when you code with tensors. Understanding batch dimensions is critical: shape (32, 64, 28, 28) might represent 32 images of 64 channels at 28×28 resolution, and operations happen independently across the batch dimension, then are aggregated for gradient updates.
Linear models compute ŷ = w·x + b, producing predictions that are linear combinations of input features. In 2D, a linear classifier draws a straight line boundary; in higher dimensions, a hyperplane. This is fundamentally limiting: many problems require nonlinear boundaries. A simple example is XOR: no line separates the positive cases (0,1) and (1,0) from the negative cases (0,0) and (1,1). Linear models cannot express this function.
Neural networks overcome this through composition of nonlinear transformations. Each hidden layer applies a nonlinear function σ to weighted combinations: h₁ = σ(W₁x + b₁). Then the next layer transforms h₁ nonlinearly: h₂ = σ(W₂h₁ + b₂). Finally, outputs are computed from the deepest features: ŷ = W_out h_k + b_out. This composition of nonlinearities lets the network represent a far richer class of functions than any linear model.
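To make the XOR example concrete, here is a minimal sketch of a network with two ReLU hidden units and a linear output that computes XOR exactly. The weights are hand-picked for illustration (an assumption of this sketch, not learned values), and the names `relu` and `xor_net` are hypothetical.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def xor_net(x1, x2):
    """Two ReLU hidden units followed by a linear output unit.

    The weights are hand-picked for illustration, not learned.
    """
    h1 = relu(x1 + x2)        # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are on
    return h1 - 2.0 * h2      # linear read-out of the hidden features


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))
```

The hidden layer maps the four input points into a feature space where a single linear read-out separates them, which no linear model on the raw inputs can do.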
The universal approximation theorem states: any continuous function f: [0,1]ⁿ → ℝ can be approximated arbitrarily closely by a neural network with one hidden layer of sufficient width and a suitable nonlinear (non-polynomial) activation. This guarantees only that appropriate weights exist; finding them by training is another matter. In practice, deep networks (multiple hidden layers) are often more efficient: for some function classes they need far fewer parameters than shallow (one hidden layer) networks.
Feature learning is the key insight: hidden layers learn representations of data. Early layers in deep networks learn simple patterns (edges in vision). Deeper layers combine these into complex features (shapes, objects). This hierarchy emerges from backpropagation without explicit programming. Each layer's weights are optimized to minimize the loss when combined with all downstream layers. This joint optimization of features and classifiers is far more powerful than hand-engineered features plus simple models.
The depth-width trade-off: deep networks with fewer units per layer need fewer parameters than shallow wide networks for the same expressive power. However, deeper networks are harder to train due to gradient flow issues (vanishing/exploding gradients). Modern techniques like normalization, skip connections, and careful initialization mitigate these issues, enabling training of very deep networks (hundreds of layers).
Expressivity comes at a cost: overfitting. A network with many parameters can memorize training data without learning generalizable patterns. Regularization (L2 penalty on weights, dropout, early stopping) reduces overfitting. The bias-variance tradeoff persists: more expressive models have lower bias but higher variance. Proper evaluation on validation and test sets is essential to assess true generalization performance.
Section 02
Neural Network Architecture
A neural network is a directed acyclic graph of layers. Layers are organized sequentially: input layer (the data), hidden layers (learned representations), and output layer (predictions). Each layer contains units (sometimes called neurons, though this is a loose biological analogy). Each unit in layer ℓ receives inputs from all units in layer ℓ-1, weighted by learnable parameters W^[ℓ] and shifted by biases b^[ℓ].
The computation in a single unit is: z = Σᵢ wᵢ·aᵢ + b, then a = σ(z). Here aᵢ are inputs from previous layer, wᵢ are weights, b is bias, z is the pre-activation, σ is an activation function, and a is the output activation. This operation—weighted sum then nonlinearity—is repeated in parallel across all units in the layer, then propagated forward.
Activation functions introduce nonlinearity. ReLU (Rectified Linear Unit): σ(z) = max(0,z). This piecewise linear function is zero for negative inputs, linear for positive. It's computationally efficient and, unlike older activations, does not saturate for positive inputs (saturation means the derivative shrinks toward zero). Sigmoid: σ(z) = 1/(1+e^{-z}), smooth S-shaped curve from 0 to 1, useful for binary classification outputs. Tanh: σ(z) = (e^z - e^{-z})/(e^z + e^{-z}), similar to sigmoid but ranges over (-1, 1) with zero-centered output.
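The three activations above can be sketched in a few lines of standard-library Python. The function names are illustrative, and the two-branch form of `sigmoid` is an implementation choice of this sketch to avoid overflow for very negative inputs.

```python
import math


def relu(z):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid, written so that math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    """Hyperbolic tangent, a zero-centered relative of the sigmoid."""
    return math.tanh(z)


for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), sigmoid(z), tanh(z))
```

Tanh and sigmoid are related by tanh(z) = 2σ(2z) - 1, which is why they share the same saturating shape.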
ReLU and variants (Leaky ReLU, ELU) are modern standards for hidden layers. They avoid the vanishing gradient problem of sigmoid/tanh: gradients of sigmoid approach zero for extreme inputs, making deep networks hard to train. ReLU gradients are constant (1) for positive inputs. Variants like Leaky ReLU (f(z) = z if z > 0, else 0.01z) avoid the "dying ReLU" problem where a unit permanently outputs zero.
Depth refers to the number of hidden layers. A network with one hidden layer is "shallow," commonly used for simpler problems. Deep networks have many hidden layers, enabling hierarchical learning. The trade-off: more expressivity with depth, but harder optimization (gradient flow issues). Modern architectures use skip connections (residual connections) to enable training of 100+ layer networks.
Width refers to the number of units per hidden layer. Wider layers increase capacity. The universal approximation theorem guarantees that some finite width suffices for a single hidden layer, but it gives no practical bound (the required width can grow very quickly with input dimensionality and target accuracy) and it is not constructive: training may not find such solutions. In practice, moderate widths (comparable to input/output dimensions) often work well, though harder problems may need very wide hidden layers.
Architectural choices cascade: depth interacts with width, activation function choice, and training procedures. Convolutional networks use local connectivity and weight sharing for vision. Recurrent networks process sequences by maintaining hidden state. Transformers use attention mechanisms instead of convolutions or recurrence. The diversity of architectures reflects different inductive biases suited to different data structures (images, sequences, graphs).
Section 03
Forward Propagation: Computing Activations
Forward propagation computes layer-by-layer activations from input to output. For a network with L layers (the hidden layers followed by the output layer), the computation is: given input x = a^[0], for each layer ℓ = 1, 2, ..., L: (1) z^[ℓ] = W^[ℓ] a^[ℓ-1] + b^[ℓ], (2) a^[ℓ] = σ(z^[ℓ]). Finally, the output prediction is ŷ = a^[L].
Each step is a matrix-vector operation. W^[ℓ] has shape (num_units^[ℓ], num_units^[ℓ-1]); a^[ℓ-1] has shape (num_units^[ℓ-1]). The product W^[ℓ] a^[ℓ-1] has shape (num_units^[ℓ]). Then the bias b^[ℓ] (shape num_units^[ℓ]) is added element-wise to give z^[ℓ]. Finally, σ is applied element-wise. This chain of operations transforms raw features into increasingly abstract representations.
Vectorization processes multiple samples simultaneously. Instead of computing one sample at a time, inputs and activations are matrices where the second dimension is the batch index. X = A^[0] has shape (input_dims, batch_size); Z^[ℓ] = W^[ℓ] A^[ℓ-1] + b^[ℓ] (broadcasting b^[ℓ] across the batch columns); A^[ℓ] = σ(Z^[ℓ]). All N samples propagate in parallel through all layers. This is often orders of magnitude faster on GPUs than processing samples sequentially.
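The batched forward pass described above might look like the following NumPy sketch. The layer sizes, the choice of ReLU for hidden layers, and the linear last layer are assumptions made for illustration; `forward` and `params` are hypothetical names.

```python
import numpy as np


def relu(Z):
    """Element-wise ReLU."""
    return np.maximum(0.0, Z)


def forward(X, params):
    """Propagate a batch X of shape (input_dims, N) through all layers.

    params is a list of (W, b) pairs with W of shape (n_out, n_in) and
    b of shape (n_out, 1). Hidden layers use ReLU; the last layer is left
    linear here, which is an assumption of this sketch.
    """
    A = X
    activations = [A]  # stored for a later backward pass
    for index, (W, b) in enumerate(params):
        Z = W @ A + b  # b broadcasts across the N columns
        is_last = index == len(params) - 1
        A = Z if is_last else relu(Z)
        activations.append(A)
    return A, activations


rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3]  # hypothetical: 4 inputs, 5 hidden units, 3 outputs
params = [
    (rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
X = rng.normal(size=(4, 8))  # batch of N = 8 samples, one per column
Y_hat, activations = forward(X, params)
print(Y_hat.shape)  # (3, 8)
```

Each column of `Y_hat` equals the result of pushing the corresponding column of `X` through the network on its own, so batching changes the cost but not the answer.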
Intermediate activations and pre-activations are stored during forward propagation for use in backpropagation. The memory footprint is roughly O(L × batch_size × max_hidden_units). Very large models or batches may exceed GPU memory, requiring gradient accumulation (processing multiple mini-batches before updating) or checkpointing (recomputing some activations during backprop instead of storing all).
Numerical stability is crucial. Intermediate values can become very large (exploding) or tiny (vanishing). Softmax is traditionally computed as softmax(z)_i = e^{z_i} / Σⱼ e^{z_j}, but if z is large, exp(z) overflows. The numerically stable version subtracts max(z): softmax(z)_i = e^{z_i - max(z)} / Σⱼ e^{z_j - max(z)}, giving the same result without overflow.
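A minimal sketch of the max-subtraction trick in standard-library Python follows. The function name `softmax` and the test logits are illustrative; with logits around 1000, the naive formula would fail because `math.exp(1000.0)` raises `OverflowError`.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


large = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])
print(large)
print(small)  # same probabilities: softmax is invariant to a constant shift
```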
Batch normalization normalizes pre-activations in each layer to zero mean/unit variance across the mini-batch. This reduces internal covariate shift—the phenomenon that as earlier layers change during training, the distributions of inputs to later layers shift. Normalization allows higher learning rates and faster convergence. During training, normalization uses batch statistics; during inference, running statistics accumulated during training are used.
Section 04
Backpropagation: Computing Gradients
Backpropagation is an efficient algorithm for computing gradients of the loss with respect to all parameters. The key insight: derivatives compose via the chain rule. If loss is L, and we know ∂L/∂z^[L] (gradient at output), we can compute ∂L/∂z^[L-1] from ∂L/∂z^[L] and local derivatives of layer L-1. This propagates backward to the input layer, computing ∂L/∂W^[ℓ] for all ℓ.
The procedure: (1) Forward pass: compute and store all activations and pre-activations. (2) Compute the output gradient ∂L/∂z^[L] from the loss function and output activation. (3) Backward pass, for ℓ = L-1 down to 1: (a) compute ∂L/∂a^[ℓ] = (W^[ℓ+1])^T ∂L/∂z^[ℓ+1], (b) compute ∂L/∂z^[ℓ] = ∂L/∂a^[ℓ] ⊙ σ'(z^[ℓ]). (4) For every layer ℓ = 1, ..., L, compute the parameter gradients ∂L/∂W^[ℓ] = ∂L/∂z^[ℓ] (a^[ℓ-1])^T and ∂L/∂b^[ℓ] = ∂L/∂z^[ℓ].
The computational cost of backpropagation is a small constant multiple of the forward pass cost, typically about 2-3x because of the additional gradient computations. This is far better than numerical differentiation, which would require O(number of parameters) forward passes, making it infeasible for networks with millions of parameters.
Backprop relies on the chain rule. For composed functions h = g(f(x)), dh/dx = (dh/df)(df/dx). For deep networks: dL/dW^[1] = (dL/da^[L])(da^[L]/da^[L-1])...(da^[2]/da^[1])(da^[1]/dW^[1]). These products of Jacobians determine gradient magnitude. If the Jacobians consistently amplify their input (singular values > 1), the products explode (exploding gradients). If they consistently shrink it (singular values < 1), the products vanish (vanishing gradients).
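The steps above can be sketched for a two-layer network in NumPy. This is an illustration under stated assumptions, not a reference implementation: the hidden activation is tanh, the output layer is linear, the loss is half the mean squared error, and inputs use the (features, N) layout, so the bias gradient sums over the batch columns.

```python
import numpy as np


def forward(X, W1, b1, W2, b2):
    """Forward pass; returns the values the backward pass needs."""
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2  # linear output layer (assumption of this sketch)
    return Z1, A1, Z2


def loss(Z2, Y):
    """Half the squared error, averaged over the N columns."""
    return 0.5 * np.sum((Z2 - Y) ** 2) / Y.shape[1]


def backward(X, Y, W1, b1, W2, b2):
    """Gradients of the loss with respect to all parameters."""
    N = X.shape[1]
    Z1, A1, Z2 = forward(X, W1, b1, W2, b2)
    dZ2 = (Z2 - Y) / N                    # output gradient
    dW2 = dZ2 @ A1.T                      # parameter gradients, layer 2
    db2 = dZ2.sum(axis=1, keepdims=True)
    dA1 = W2.T @ dZ2                      # propagate to the hidden layer
    dZ1 = dA1 * (1.0 - A1 ** 2)           # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T                       # parameter gradients, layer 1
    db1 = dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
Y = rng.normal(size=(2, 5))
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))
dW1, db1, dW2, db2 = backward(X, Y, W1, b1, W2, b2)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)
```

A useful sanity check is to compare any single entry of `dW1` against a central finite difference of `loss`; the two should agree to several decimal places.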
Vanishing gradients are a key challenge in training deep networks with sigmoid/tanh. The derivative of sigmoid is σ(z)(1-σ(z)), which is ≤ 0.25. Multiplying through many layers: 0.25^100 ≈ 0, so gradients for early layers become negligible. Modern solutions include ReLU activations (constant gradient 1 for positive inputs), batch normalization (keeps activation ranges reasonable), and skip connections (provide direct gradient paths).
Section 05
Partial Derivatives: Jacobians & Gradient Flow
A Jacobian matrix captures all first-order partial derivatives of a vector-valued function. For f: ℝⁿ → ℝᵐ, the Jacobian J is m×n with J_ij = ∂f_i/∂x_j. In neural networks, understanding Jacobians explains how information flows through layers. For the linear layer z = Wx + b, the Jacobian with respect to x is simply W (the bias is constant in x and drops out): each output depends linearly on the inputs via the corresponding row of W.
For a layer a = σ(z) with z = Wx + b, where σ is a nonlinear activation, the Jacobian of a with respect to x is Jσ · W, where Jσ is the Jacobian of σ evaluated at z. For element-wise activations, Jσ is diagonal: Jσ = diag(σ'(z)). So J = diag(σ'(z)) W. The activation derivative σ'(z) scales each row of W, and hence how strongly gradients pass through the corresponding unit.
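The formula J = diag(σ'(z)) W can be checked numerically. The sketch below assumes a sigmoid activation and small random shapes chosen only for illustration, and compares the analytic Jacobian against central finite differences.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def layer(x, W, b):
    """One layer: a = sigmoid(W x + b)."""
    return sigmoid(W @ x + b)


def layer_jacobian(x, W, b):
    """Analytic Jacobian of the layer output with respect to x."""
    s = sigmoid(W @ x + b)
    return np.diag(s * (1.0 - s)) @ W  # diag(sigma'(z)) W


def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, one input coordinate at a time."""
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return J


rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)
J_analytic = layer_jacobian(x, W, b)
J_numeric = numerical_jacobian(lambda v: layer(v, W, b), x)
print(np.max(np.abs(J_analytic - J_numeric)))  # should be tiny
```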
Upstream gradients arrive from the loss function at the output layer. Downstream gradients flow to earlier layers. At each layer ℓ, the transpose of the local Jacobian transforms upstream gradients into downstream gradients: downstream = (J^[ℓ])^T · upstream. This is backpropagation: ∂L/∂z^[ℓ] = (J^[ℓ])^T · ∂L/∂z^[ℓ+1], where J^[ℓ] = ∂z^[ℓ+1]/∂z^[ℓ] is the Jacobian of the next pre-activation with respect to the current one.
The product of Jacobians along a path determines end-to-end sensitivity. For a 10-layer network where each layer's Jacobian scales gradients by at most 0.9 (for example, a spectral norm of 0.9), the product is at most 0.9^10 ≈ 0.35. For 100 layers: 0.9^100 ≈ 2.7×10⁻⁵. Gradients vanish. This is why deep sigmoid networks are notoriously hard to train: sigmoid's derivative never exceeds 0.25.
Modern activation functions mitigate this. ReLU's derivative is 1 (for z > 0) or 0 (for z < 0). For positive pre-activations, gradients flow unchanged through ReLU, so the activation itself does not shrink them (unless many units die). Batch normalization keeps intermediate values in reasonable ranges, which helps keep activation derivatives away from saturated regions. Skip connections provide direct gradient paths: for y = x + f(x), ∂L/∂x = ∂L/∂y + (∂f/∂x)^T ∂L/∂y, which includes a direct ∂L/∂y term, so the gradient does not vanish just because ∂f/∂x is small, even in very deep networks.
Exploding gradients also cause problems: if each layer amplifies gradients by a factor of 1.1, then 1.1^100 ≈ 1.4×10⁴. Gradient clipping (capping the gradient norm at a threshold) prevents this. Weight initialization is crucial: initializing weights from a distribution with variance 1/fan_in (where fan_in is the number of inputs), or 2/fan_in for ReLU layers as in He initialization, keeps the Jacobians' scale reasonable, enabling stable training from the start.
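The arithmetic behind these numbers is easy to reproduce. The helper `gradient_scale` below is a hypothetical name; it treats every layer as shrinking the gradient by the same scalar factor, which is a simplification of the matrix picture.

```python
def gradient_scale(factor, depth):
    """Overall scaling after `depth` layers that each multiply by `factor`."""
    return factor ** depth


for factor, depth in [(0.9, 10), (0.9, 100), (0.25, 100)]:
    print(factor, depth, gradient_scale(factor, depth))
```

The last case uses 0.25, the maximum of the sigmoid derivative, and shows how quickly a worst-case bound per layer compounds over 100 layers.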
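Both remedies can be sketched in NumPy. The function names are hypothetical, the clipping rule shown is the common "rescale by global norm" variant, and the initializer draws from a zero-mean normal with variance 1/fan_in as described above.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # 1e-12 guards against division by zero
    return [g * scale for g in grads], total_norm


def init_weights(fan_out, fan_in, rng):
    """Zero-mean normal weights with variance 1 / fan_in."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))


grads = [np.array([3.0, 4.0]), np.array([12.0])]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)

rng = np.random.default_rng(0)
W = init_weights(256, 512, rng)
print(W.var() * 512)  # close to 1
```

Clipping by the global norm preserves the direction of the update and only limits its length, which is why it is usually preferred over clipping each entry separately.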
Section 06
Backward Functions: Basic Loss Functions
Each layer type requires a defined backward function specifying how to compute parameter gradients from upstream gradients. These are composed to form the full backpropagation algorithm. A framework like PyTorch or TensorFlow stores these functions and applies them automatically; understanding them reveals how gradients flow.
Linear layer backward: Given upstream gradient ∂L/∂z (shape: num_output × batch_size), compute: (1) ∂L/∂W = (∂L/∂z) @ (a_prev)^T, summed over batch, (2) ∂L/∂b = sum(∂L/∂z, over batch), (3) downstream ∂L/∂a_prev = W^T @ ∂L/∂z. The derivatives reflect each parameter's role: W multiplies activations, so its gradient involves activations; b is broadcast, so gradient is summed over batch.
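A minimal NumPy sketch of these three formulas, using the (units, batch_size) layout from the text, is shown below. The function names and the small shapes are assumptions for illustration.

```python
import numpy as np


def linear_forward(W, b, A_prev):
    """Z = W A_prev + b, with b of shape (n_out, 1) broadcast over the batch."""
    return W @ A_prev + b


def linear_backward(dZ, W, A_prev):
    """Map the upstream gradient dZ to parameter and downstream gradients."""
    dW = dZ @ A_prev.T                    # the matrix product sums over the batch
    db = dZ.sum(axis=1, keepdims=True)    # b was broadcast, so its gradient is summed
    dA_prev = W.T @ dZ                    # gradient passed to the previous layer
    return dW, db, dA_prev


rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
A_prev = rng.normal(size=(3, 6))
dZ = rng.normal(size=(4, 6))  # stand-in for an upstream gradient
dW, db, dA_prev = linear_backward(dZ, W, A_prev)
print(dW.shape, db.shape, dA_prev.shape)  # (4, 3) (4, 1) (3, 6)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick check worth making whenever a backward function is written by hand.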
ReLU backward: Recall a = max(0, z). Derivative: da/dz = 1 if z > 0, else 0. So ∂L/∂z_i = [z_i > 0] * ∂L/∂a_i. In code: where z was positive, gradients pass through; where negative, gradients are zero. This is efficient: one element-wise multiply with a binary mask.
Sigmoid backward: σ(z) = 1/(1+e^{-z}). Derivative: σ'(z) = σ(z)(1-σ(z)). So ∂L/∂z = σ(z)(1-σ(z)) * ∂L/∂a. Since sigmoid outputs a, this is a(1-a) * ∂L/∂a. Small a or (1-a) leads to small gradients (vanishing).
Tanh backward: tanh'(z) = 1 - tanh(z)^2. Similar to sigmoid, gradients can vanish for extreme inputs. Modern networks prefer ReLU variants.
Softmax + Cross-Entropy: This is the standard for multi-class classification. Softmax normalizes logits z to probabilities: p = softmax(z), p_i = e^{z_i}/Σⱼ e^{z_j}. Cross-entropy loss: L = -Σᵢ y_i log(p_i), where y is one-hot. The gradient is ∂L/∂z = p - y. This is remarkably clean: gradients are simply the difference between predicted and target probabilities. This elegant gradient motivated the popularity of softmax+cross-entropy.
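The identity ∂L/∂z = p - y can be verified numerically with standard-library Python. The logits and target below are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, y):
    """L = -sum_i y_i log p_i with p = softmax(z); y is one-hot."""
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)


def grad_logits(z, y):
    """Analytic gradient of the loss with respect to the logits: p - y."""
    return [pi - yi for pi, yi in zip(softmax(z), y)]


z = [2.0, -1.0, 0.5]
y = [0.0, 1.0, 0.0]
analytic = grad_logits(z, y)

eps = 1e-6
numeric = []
for k in range(len(z)):
    z_plus = z[:k] + [z[k] + eps] + z[k + 1:]
    z_minus = z[:k] + [z[k] - eps] + z[k + 1:]
    numeric.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))

print(analytic)
print(numeric)
```

The gradient entries sum to zero because both p and y sum to one, so raising one logit's probability necessarily lowers the others.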
Mean Squared Error (MSE): For regression, L = (1/N) Σᵢ (yᵢ - ŷᵢ)². Gradient: ∂L/∂ŷᵢ = (2/N)(ŷᵢ - yᵢ); the constant factor 1/N is often dropped or absorbed into the learning rate, giving 2(ŷ - y). MSE penalizes large errors quadratically, useful for outputs with unbounded range. Cross-entropy is better for classification (outputs are probabilities).
Binary Cross-Entropy: For binary classification with sigmoid output: L = -[y log(σ(z)) + (1-y) log(1-σ(z))]. Gradient: ∂L/∂z = σ(z) - y, the same form as softmax+cross-entropy. This simple gradient is one reason sigmoid + BCE is standard for binary tasks.
Section 07
Modern Modules: Advanced Techniques
Batch Normalization (BN): Normalizes pre-activations across the mini-batch to zero mean and unit variance, then applies learnable scale and shift. For a layer's pre-activations z, compute batch mean μ_b and variance σ²_b across the batch, then normalize: z_norm = (z - μ_b) / sqrt(σ²_b + ε). Then scale-shift: z_out = γ z_norm + β, where γ and β are learned parameters (initialized to 1 and 0). This reduces internal covariate shift—the problem that as earlier layers change, distributions into later layers shift, forcing later layers to constantly re-adapt.
Benefits: enables higher learning rates, faster convergence, acts as regularizer (noise from batch statistics). During training, BN uses batch statistics; during inference, it uses running averages computed during training (exponential moving average of batch statistics). BN has been phenomenally successful; many modern architectures use it extensively.
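A forward-only sketch of batch normalization in NumPy follows, covering both the training and inference paths. The function name, the momentum value of 0.9, and the (units, batch_size) layout are assumptions of this sketch; real framework layers differ in details such as the variance estimator used for the running average.

```python
import numpy as np


def batchnorm_forward(Z, gamma, beta, running_mean, running_var,
                      training, momentum=0.9, eps=1e-5):
    """Batch normalization for pre-activations Z of shape (units, N).

    Returns the output and the (possibly updated) running statistics.
    """
    if training:
        mu = Z.mean(axis=1, keepdims=True)   # per-unit mean over the batch
        var = Z.var(axis=1, keepdims=True)   # per-unit variance over the batch
        running_mean = momentum * running_mean + (1.0 - momentum) * mu
        running_var = momentum * running_var + (1.0 - momentum) * var
    else:
        mu, var = running_mean, running_var  # inference uses running averages
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta, running_mean, running_var


rng = np.random.default_rng(0)
Z = rng.normal(3.0, 2.0, size=(4, 256))
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
run_mean, run_var = np.zeros((4, 1)), np.ones((4, 1))

out, run_mean, run_var = batchnorm_forward(Z, gamma, beta, run_mean, run_var, training=True)
print(out.mean(axis=1), out.var(axis=1))  # roughly 0 and 1 per unit

out_eval, _, _ = batchnorm_forward(Z, gamma, beta, run_mean, run_var, training=False)
```

After a single training step the running statistics are still far from the batch statistics, so `out_eval` is not yet normalized; the two modes only agree once the running averages have converged.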
Layer Normalization (LN): Normalizes across features (hidden units) instead of across the batch. For a sample's activations a (vector), compute z_norm = (a - mean(a)) / sqrt(var(a) + ε), then scale-shift. LN is independent of batch size—useful when batch size is small or varies. Transformers typically use LN before or after attention/MLP blocks.
Dropout: During training, randomly set each activation to 0 with probability p (typically 0.5), and scale the remaining activations by 1/(1-p). This is equivalent to multiplying activations by a random binary mask, then rescaling. Effect: prevents co-adaptation of units (unit A cannot rely on unit B always being present). Acts as implicit ensemble: training samples multiple sub-networks. During inference, use full network without dropout (no scaling needed due to the 1/(1-p) during training).
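Inverted dropout as described above fits in a few lines of NumPy. The function name and signature are hypothetical; the random generator is passed in explicitly so the example is reproducible.

```python
import numpy as np


def dropout(A, p, rng, training=True):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    if not training or p == 0.0:
        return A                        # inference: full network, no scaling
    mask = rng.random(A.shape) >= p     # True with probability 1 - p
    return A * mask / (1.0 - p)         # rescale so the expected value is unchanged


rng = np.random.default_rng(0)
A = np.ones((1000, 100))
dropped = dropout(A, p=0.5, rng=rng, training=True)
print(dropped.mean())                   # close to 1.0
print(np.unique(dropped))               # [0. 2.]
```

Because the rescaling happens at training time, the inference path needs no special handling, which matches the note above about inference using the full network.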
Skip Connections (Residual Connections): Instead of z = f(x), compute z = x + f(x). The network learns the residual (change) rather than absolute function. This enables very deep networks: gradients have a direct path from output back to input, preventing vanishing gradients. ResNet (He et al., 2016) showed that 100+ layer networks trained with skip connections exceed shallow networks in accuracy. Skip connections fundamentally changed deep learning, enabling architectures previously thought impossible to train.
Attention Mechanisms: Compute weighted combinations of values based on learned similarity (attention weights) between queries and keys. Self-attention relates each position to all others; used in Transformers. Query Q, Key K, Value V matrices are projections of the input features. Attention weights: α = softmax(Q K^T / sqrt(d_k)), with the softmax taken over keys for each query. Output: α V, a weighted sum of the values. This enables modeling long-range dependencies within a single layer, addressing RNN limitations. Transformers stack attention layers, achieving state-of-the-art on NLP and increasingly on vision tasks.
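A single-head self-attention step might be sketched as follows in NumPy. This is an illustration only: it uses one head, no masking, random projection matrices in place of learned ones, and a (seq_len, d_model) layout with one row per position.

```python
import numpy as np


def softmax_rows(S):
    """Row-wise softmax with the usual max subtraction for stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    alpha = softmax_rows(Q @ K.T / np.sqrt(d_k))  # (seq_len, seq_len) weights
    return alpha @ V, alpha                       # weighted sum of values


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 positions, 8 features (hypothetical sizes)
Wq, Wk, Wv = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 6))
out, alpha = self_attention(X, Wq, Wk, Wv)
print(out.shape, alpha.shape)  # (5, 6) (5, 5)
print(alpha.sum(axis=1))       # each row sums to 1
```

Row i of `alpha` says how much position i draws from every other position, so any pair of positions is connected in one step regardless of their distance.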
Positional Encoding: Transformers process all positions in parallel (unlike RNNs which process sequentially), losing order information. Positional encodings (learned or sinusoidal functions of position) are added to inputs to inject position information. Sinusoidal: PE(pos,2i) = sin(pos/10000^{2i/d}), PE(pos,2i+1) = cos(pos/10000^{2i/d}). This enables the transformer to learn position-dependent patterns.
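The sinusoidal formulas above translate directly into NumPy. The function name is hypothetical, and the sketch assumes an even model dimension d so that sine and cosine columns pair up.

```python
import numpy as np


def sinusoidal_positional_encoding(num_positions, d):
    """Return a (num_positions, d) matrix of sinusoidal encodings; d must be even."""
    pos = np.arange(num_positions)[:, None]     # column of positions
    i = np.arange(d // 2)[None, :]              # row of frequency indices
    angles = pos / (10000.0 ** (2 * i / d))     # pos / 10000^(2i/d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                # even columns: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                # odd columns:  PE(pos, 2i+1)
    return pe


pe = sinusoidal_positional_encoding(num_positions=10, d=8)
print(pe.shape)  # (10, 8)
print(pe[0])     # position 0: sines are 0, cosines are 1
```

The resulting matrix is added to the input embeddings row by row, one row per position.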
Section 08
Vectorization: Batch Processing
Vectorization is the practice of organizing computation in matrices/tensors to exploit parallel hardware. Instead of a loop over samples, we organize samples as a batch dimension and process all simultaneously. Example: computing predictions for N samples with a linear layer. Naive approach: for i in range(N): z[i] = W @ x[i] + b. This calls matrix-vector multiply N times, underutilizing GPU parallelism.
Vectorized approach: organize inputs as a matrix X of shape (input_dims, N), then Z = W @ X + b is a single matrix-matrix multiply that broadcasts b across columns. Modern GPU hardware excels at matrix operations: one product of two 4096×4096 matrices is far faster than 4096 separate products of a 4096×4096 matrix with a 4096×1 vector. Batching amortizes overhead and saturates GPU cores.
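The loop and the vectorized form compute the same numbers, which the following NumPy sketch confirms on small random data. The function names and shapes are illustrative; on a CPU with tiny matrices the speed difference will not be visible, only the equivalence.

```python
import numpy as np


def linear_loop(W, b, X):
    """One matrix-vector product per sample (the naive approach)."""
    Z = np.empty((W.shape[0], X.shape[1]))
    for i in range(X.shape[1]):
        Z[:, i] = W @ X[:, i] + b[:, 0]
    return Z


def linear_vectorized(W, b, X):
    """A single matrix-matrix product; b of shape (n_out, 1) broadcasts over columns."""
    return W @ X + b


rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
X = rng.normal(size=(3, 6))  # 6 samples, one per column
Z_loop = linear_loop(W, b, X)
Z_vec = linear_vectorized(W, b, X)
print(np.allclose(Z_loop, Z_vec))  # True
```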
Batch size trade-offs: Larger batches use hardware more efficiently but require more memory and may hurt generalization (large batches sometimes converge to sharper minima, which generalize worse). Typical practice: use the largest batch fitting in GPU memory (often 32-256 samples, or higher for large GPUs). Gradient accumulation: process several mini-batches, accumulate their gradients, then update once. This simulates a larger batch without the extra activation memory.
Shape conventions: Frameworks typically use (batch, features), or (batch, channels, height, width) for images, with the batch as the first dimension for most operations. This is the transpose of the (features, batch) layout used in the equations of the earlier sections. Understanding shapes is critical for debugging: a (32, 64) input is 32 samples of 64 features; after a layer with 128 units, output is (32, 128); activation functions and element-wise ops preserve shape.
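Gradient accumulation can be checked on a linear least-squares model, where the full-batch gradient is easy to write down. In this sketch the function names are hypothetical, data uses the (batch, features) layout, and each micro-batch gradient is weighted by its share of the samples so that uneven final chunks are handled correctly.

```python
import numpy as np


def grad_mse(w, X, y):
    """Gradient of (1/N) * sum((X w - y)^2) with respect to w; X is (N, d)."""
    return 2.0 * X.T @ (X @ w - y) / X.shape[0]


def accumulated_grad(w, X, y, micro_batch):
    """Accumulate micro-batch gradients into the full-batch gradient."""
    N = X.shape[0]
    total = np.zeros_like(w)
    for start in range(0, N, micro_batch):
        Xb = X[start:start + micro_batch]
        yb = y[start:start + micro_batch]
        total += grad_mse(w, Xb, yb) * (Xb.shape[0] / N)  # weight by chunk size
    return total


rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(64, 3)), rng.normal(size=64), rng.normal(size=3)
g_full = grad_mse(w, X, y)
g_accum = accumulated_grad(w, X, y, micro_batch=16)
print(np.allclose(g_full, g_accum))  # True
```

Only one micro-batch of activations needs to be held at a time, while the accumulated gradient matches what a single large batch would have produced.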
Frameworks like PyTorch and TensorFlow abstract vectorization details but expose it via tensor operations. Writing code at the tensor level (not loops) enables automatic batching and GPU acceleration. A simple pattern: x = torch.randn(batch_size, 784); y = net(x) propagates the entire batch through the network, much faster than looping.
Memory hierarchy: GPU memory (limited, fast), GPU compute (highly parallel), CPU memory (large, slow), disk (persistent). Strategies: fit model in GPU memory, use gradient checkpointing (recompute activations during backprop instead of storing) for large models, use mixed precision (16-bit floats where possible) to reduce memory footprint. Understanding these trade-offs is essential for scaling deep learning to large models and datasets.
Distributed training extends vectorization across multiple GPUs/TPUs. Data parallelism: different samples in a batch go to different devices, gradients are averaged. Model parallelism: different layers go to different devices. Gradient synchronization introduces communication overhead; the sweet spot depends on bandwidth and compute speed. Modern large-scale training uses sophisticated strategies (ZeRO, FSDP) to efficiently distribute training across thousands of devices.
Section 09 — Sources
Sources & References
Course Materials
CS229 Main Lecture Notes — Neural networks, backpropagation algorithm, and deep learning fundamentals