Stanford XCS229 · Machine Learning
Neural Networks
Part II · Deep Learning & Backpropagation
01

Beyond Linear Boundaries

Linear models are fundamentally limited: they cannot express non-linear decision boundaries or capture feature interactions. Neural networks overcome this through composition of nonlinear transformations. The universal approximation theorem guarantees that with enough hidden units and nonlinear activations, networks can approximate any continuous function. This power comes from learned feature representations that gradually transform raw inputs into higher-level abstractions.

Deep networks learn hierarchical representations where early layers capture low-level patterns and deeper layers combine them into semantic concepts. The nonlinearity enables the network to fold and twist the input space, creating arbitrary decision boundaries. Modern architectures exploit this to solve tasks that linear methods cannot: image recognition, language understanding, and complex reasoning all rely on the expressive power that depth and nonlinearity provide.

02

Network Structure

Neural networks are organized into layers: input, hidden, and output. Each layer contains units (neurons) connected to the previous layer via learnable weights and biases. For a unit in layer ℓ, the activation is computed from weighted inputs plus bias, then passed through an activation function. Common activations include ReLU (piecewise linear), sigmoid (smooth probability), and tanh (centered). The choice of activation affects optimization dynamics and representational capacity.

Architecture design involves choosing depth (number of hidden layers), width (units per layer), and activation functions. Wider networks increase capacity but may overfit; deeper networks learn hierarchical representations but become harder to train. Skip connections bypass layers, enabling very deep models. Modern practice favors ReLU and variants (Leaky ReLU, GELU) for their computational efficiency and training stability compared to older sigmoid/tanh activations.

03

Computing Activations

Forward propagation computes layer-by-layer activations by matrix multiplication and element-wise operations. For input x, each layer ℓ computes: z^[ℓ] = W^[ℓ]a^[ℓ-1] + b^[ℓ], then a^[ℓ] = σ(z^[ℓ]) where σ is the activation. Vectorized computation processes entire batches simultaneously, dramatically improving GPU utilization. A batch of N samples computes N predictions in parallel: Z^[ℓ] is (units × N), resulting in A^[ℓ] also (units × N).

Vectorization is essential for practical deep learning. Computing one sample at a time wastes GPU parallelism; batching amortizes overhead. The forward pass accumulates intermediate activations and pre-activations needed later for backpropagation. Memory usage scales with batch size, batch length, and model depth, creating practical trade-offs. Efficient implementations minimize data movement between GPU memory and computation.

04

Gradient Flow

Backpropagation computes gradients by applying the chain rule backward through the network. The key insight: derivatives flow from loss L backward through layers, combining with local layer computations to get gradients for each parameter. If loss ∂L/∂a^[out] is known, then ∂L/∂z^[out] = ∂L/∂a^[out] ⊙ σ\'(z^[out]). Then ∂L/∂W = ∂L/∂z · (a^[prev])^T and derivatives propagate to previous layers.

Backprop is the reverse of forward propagation: information flows backward. This efficient algorithm computes all gradients in O(cost of forward pass), avoiding the naive approach of numerically differentiating each parameter. Modern frameworks automate backprop via computational graphs that track operations and automatically derive gradient rules. Understanding backprop is crucial for debugging, choosing architectures, and designing new layers.

05

Jacobians & Gradients

Jacobians are matrices of all first-order partial derivatives. For a vector function f: ℝⁿ → ℝᵐ, the Jacobian J is m×n with entries ∂f_i/∂x_j. In neural networks, Jacobians describe how each output changes with each input. For a single-layer network with weights W and bias b, the Jacobian of z = Wx+b is simply W. For composed functions, Jacobians multiply via the chain rule: ∂z/∂x = (∂z/∂y)(∂y/∂x).

Upstream gradients arrive from the loss function; downstream gradients propagate to earlier layers. At each layer, local Jacobians transform upstream gradients into downstream gradients. The product of Jacobians along the path determines sensitivity. Deep networks can suffer vanishing gradients (products approach zero) or exploding gradients (products grow exponentially). Modern techniques like normalization, careful initialization, and ReLU activations mitigate these issues, enabling stable training of very deep models.

06

Basic Loss Functions

Backward functions define gradient computation for each module. Linear layer: given upstream ∂L/∂z, compute ∂L/∂W = ∂L/∂z · a^T and ∂L/∂a = W^T · ∂L/∂z. ReLU: gradient is zero where input is negative, passes through where positive (∂L/∂x = [z > 0] ⊙ ∂L/∂z). Sigmoid and tanh: gradients involve σ(1-σ) and 1-tanh². Softmax with cross-entropy loss has a simple gradient: ∂L/∂z = (softmax(z) - y), where y is the one-hot target.

Loss functions couple with final layers to produce clean gradients. Cross-entropy for classification creates ∂L/∂logits = (predictions - targets), enabling efficient gradient descent. MSE loss for regression yields ∂L/∂predictions = 2(predictions - targets). Proper loss-layer combinations are crucial: using squared error with softmax is suboptimal compared to softmax+cross-entropy. Understanding backward functions enables building custom layers and debugging gradient flow issues.

07

Advanced Techniques

Modern modules improve training stability and performance beyond basic layers. BatchNorm normalizes layer inputs to zero mean/unit variance, reducing internal covariate shift and enabling higher learning rates. Dropout randomly zeros activations during training, acting as implicit ensemble and regularizer. Skip connections let gradients bypass layers, alleviating vanishing gradient problem in very deep networks. Residual networks learn incremental changes via z = x + f(x) rather than z = f(x), enabling networks with hundreds of layers.

Attention mechanisms compute weighted combinations of inputs, with weights learned from content similarity. Self-attention relates each position to all others; this enables transformers to process sequences without recurrence. Layer normalization and various positional encodings support attention-based architectures. These modern components are composable: combining residuals, attention, and normalization creates powerful architectures for vision and language. Understanding these building blocks is essential for reading and implementing contemporary models.

08

Batch Processing

Vectorization processes entire batches simultaneously. Instead of computing one sample at a time, matrices have shape (features, batch_size). Forward pass: X is (input_dims, N), W is (hidden_dims, input_dims), so Z = W @ X + b broadcasts bias to N samples. Backward pass: dW = (dZ @ X.T) / N accumulates gradients across the batch. This amortizes weight matrix reads and enables GPU cores to work in parallel on different samples simultaneously.

GPU parallelism requires thinking in matrix operations. A typical GPU can launch thousands of threads; processing individual samples serializes these. Large batches (hundreds to thousands) best exploit hardware. Frameworks like PyTorch and TensorFlow automatically handle vectorization when you code with tensors. Understanding batch dimensions is critical: shape (32, 64, 28, 28) might represent 32 images of 64 channels at 28×28 resolution, and operations happen independently across the batch dimension, then are aggregated for gradient updates.