Week 3–4 · Adversarial Training & Wasserstein Distance
01
Adversarial Framework
Generative Adversarial Networks pit two neural networks in a min-max game: a generator G creates synthetic data from noise, while a discriminator D attempts to distinguish real from fake. This adversarial process drives both networks toward an equilibrium where the generator produces indistinguishable samples.
The framework formulates unsupervised learning as a competition. As D improves at detection, G is forced to generate more realistic samples. The Nash equilibrium represents the optimal solution where D cannot improve further.
02
GAN Objective & Loss
The GAN objective function V(G, D) combines two log-likelihood terms: the discriminator's log-probability of labeling real samples as real, and its log-probability of labeling generated samples as fake. D maximizes both terms, while G minimizes the second by making its samples harder to detect, which creates a zero-sum game.
The optimal discriminator D* has a closed form: it outputs the probability that a sample is real given the data. This theoretical insight guides practical training, even though computing D* exactly is intractable. The minimax formulation defines a theoretical global optimum where real and generated distributions align; however, practical optimization with gradient-based methods does not guarantee convergence to this point and often faces challenges such as mode collapse and training instability.
03
Training Dynamics
Training GANs requires alternating gradient updates: D is updated to maximize classification accuracy, while G is updated to fool D. Instability arises from this alternating optimization: when one network overpowers the other, training can diverge or collapse to trivial solutions.
Mode collapse occurs when G learns to generate only a subset of the data distribution's modes, evading D's discrimination while avoiding the complexity of full coverage. Non-convergence is endemic: theoretical guarantees are weak, and practice shows oscillation between network improvements rather than stable equilibrium. Careful hyperparameter tuning, architectural choices, and training protocols are essential.
04
Wasserstein GAN
Wasserstein GANs address training instability by replacing the divergence being minimized. Instead of KL or JS divergence, WGAN uses the Earth Mover's (Wasserstein-1) distance, which provides a smoother gradient landscape. The original WGAN clips the critic's weights to keep it Lipschitz-continuous, which is what lets the critic's objective estimate that distance (up to a constant factor).
Spectral normalization and gradient penalty further stabilize training by controlling discriminator gradients. Wasserstein loss provides a meaningful training signal even when distributions are disjoint, enabling mode coverage and more reliable convergence. These techniques have become standard practice in modern GAN training.
05
Conditional GANs
Conditional GANs inject class labels into both generator and discriminator, enabling controlled generation. The class information guides G to generate samples from a specific class, while D learns to discriminate both on authenticity and class consistency. This extends GANs from unsupervised learning to label-conditioned generation.
Pix2pix applies cGANs to image-to-image translation: paired images condition the generator to learn fine-grained transformations like sketch-to-photograph or day-to-night. The discriminator becomes a "patch discriminator," evaluating local realism rather than global authenticity, improving fine detail synthesis.
06
Progressive & Style GANs
Progressive GANs grow the network architecture layer-by-layer during training, starting with low resolution and gradually adding detail. This accelerates convergence and improves stability by allowing the generator to first learn coarse structure before refining details. StyleGAN extends this with a mapping network that translates latent codes into style parameters.
StyleGAN uses adaptive instance normalization (AdaIN) to inject style information at multiple scales, decoupling content from style. Latent interpolation in the intermediate style space produces smooth, high-quality transitions. These architectural innovations enable stable training of high-resolution image generation, exemplified by face synthesis at 1024×1024.
07
Evaluation Metrics
Evaluating GANs is challenging because they lack explicit likelihood models. Inception Score (IS) measures sample diversity and quality using a pre-trained classifier, but it cannot detect mode collapse within a class. Fréchet Inception Distance (FID) compares feature distributions between real and generated samples, providing more robust evaluation.
Precision and recall metrics directly measure mode coverage: precision quantifies the fraction of generated samples within the data manifold, while recall measures the fraction of data modes captured by the generator. Together with FID, these metrics provide comprehensive evaluation of both quality and diversity in generated samples.
08
GAN Theory & Limitations
GANs lack a likelihood-based training objective, making theoretical analysis difficult. However, the minimax framework guarantees convergence under restrictive assumptions (continuous distributions, sufficient capacity, infinite optimization). In practice, these assumptions rarely hold, and networks exhibit oscillation rather than convergence.
The diversity-quality tradeoff is fundamental: generating highly realistic samples requires fine-tuning to a narrow distribution, while maintaining diversity requires exploring the full data manifold. Truncation tricks trade diversity for quality by sampling latent codes from a restricted region. Understanding these limitations guides proper application: GANs excel at high-quality synthesis but require careful validation of mode coverage and diversity.
09
References & Further Reading
Generative Adversarial Networks introduced a paradigm shift through adversarial training. This section gathers foundational papers, improvements, and resources for understanding the GAN framework from theory to modern applications in image generation and beyond.
From the original formulation to conditional variants and architectural innovations, these materials trace the evolution and impact of adversarial training in deep generative modeling.
01
Adversarial Framework
A GAN consists of two competing neural networks. The generator G maps a noise vector z from a simple prior (uniform or Gaussian) to a sample in the data space. The discriminator D evaluates whether a sample is from the real data or generated by G. This two-player game drives mutual improvement.
The generator's objective is to produce samples indistinguishable from real data. The discriminator's objective is perfect classification. As the game progresses, G generates increasingly realistic samples to fool D, while D becomes more adept at detection. The process reaches Nash equilibrium when further improvement by either player is impossible.
The min-max formulation describes a system where G and D have opposing objectives, yet they improve together. G cannot observe real data directly—it only receives feedback through D's gradients. D must learn to extract discriminative features that capture the essence of the data distribution.
Generator and Discriminator Architectures
The generator is typically a deep transposed-convolution network that progressively upsamples latent vectors to high-dimensional samples. Early layers capture coarse structure, while later layers refine details. Skip connections and batch normalization stabilize training.
The discriminator mirrors this architecture: a convolutional network that progressively downsamples images to a binary classification. Unlike standard classifiers, the discriminator must learn features that distinguish real from fake, not classify into semantic categories. This unsupervised feature learning is a significant characteristic of GANs.
Architectural choices profoundly affect training dynamics. Networks with imbalanced capacity—one much larger than the other—lead to dominance by the larger network and training collapse. Symmetric architectures with careful scaling promote stable, balanced improvement.
Nash Equilibrium and Theoretical Convergence
The Nash equilibrium of the GAN game is achieved when the discriminator cannot improve (always outputs 0.5 probability) and the generator distribution matches the real data distribution. At this point, the generator has solved the learning problem.
However, reaching Nash equilibrium is non-trivial. The alternating optimization of separate networks does not guarantee convergence to equilibrium, even theoretically. In practice, networks exhibit cyclic behavior: G improves, then D adapts, leading to oscillation rather than stable equilibrium.
02
GAN Objective & Loss
The fundamental GAN objective is: min_G max_D V(G,D) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]. The discriminator maximizes this value (correctly classifying real and fake), while the generator minimizes it (fooling the discriminator). This zero-sum formulation creates the adversarial dynamic.
The first term measures the discriminator's accuracy on real samples: D(x) should be close to 1. The second term measures its accuracy on fake samples: D(G(z)) should be close to 0, so 1 - D(G(z)) is close to 1. High V indicates strong discriminator performance; low V indicates successful deception by the generator.
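As a small numerical illustration of how V behaves, the sketch below estimates V(G, D) from batches of discriminator outputs. The function name `gan_value` and the sample values are hypothetical; only the formula comes from the text above.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(G, D) from discriminator outputs in (0, 1).

    d_real holds D(x) on real samples, d_fake holds D(G(z)) on generated ones.
    eps only guards against log(0).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# A confident, correct discriminator: V is close to its maximum of 0.
v_strong = gan_value([0.99, 0.98], [0.01, 0.02])

# A discriminator that cannot separate real from fake: V = 2 * log(0.5) = -log 4.
v_fooled = gan_value([0.5, 0.5], [0.5, 0.5])

print(v_strong, v_fooled)
```

A high value means D is winning; the value -log 4 is what the objective takes when D outputs 0.5 everywhere.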
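In practice, the generator's loss is often reformulated. The original objective can lead to vanishing gradients when D(G(z)) is near 0, because log(1 - D(G(z))) saturates exactly when the discriminator confidently rejects generated samples, which is typical early in training. An alternative (the non-saturating loss) is max_G E_z[log D(G(z))], which provides stronger gradients in that regime by rewarding the generator when D classifies its samples as real.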
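The difference between the two generator objectives can be checked numerically. The sketch below assumes D ends in a sigmoid and looks at the gradient with respect to D's pre-sigmoid logit `a`, a simplification that leaves out the rest of the backward pass through D and G. The function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def saturating_grad(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)   (term from the original objective)
    return -sigmoid(a)

def non_saturating_grad(a):
    # d/da log(sigmoid(a)) = 1 - sigmoid(a)    (term from the alternative objective)
    return 1.0 - sigmoid(a)

# D is confident the sample is fake: D(G(z)) = sigmoid(-6), roughly 0.0025.
a = -6.0
g_sat = abs(saturating_grad(a))
g_ns = abs(non_saturating_grad(a))

print(g_sat, g_ns)
```

When D(G(z)) is near 0, the original term passes back a gradient of magnitude D(G(z)) itself, which is tiny, while the alternative passes back a magnitude close to 1.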
Optimal Discriminator
For a fixed generator G, the optimal discriminator D* can be derived in closed form. Taking the functional derivative of V with respect to D yields: D*(x) = p_data(x) / (p_data(x) + p_G(x)). The optimal discriminator outputs the posterior probability that a sample is real.
This theoretical result defines the "ground truth" discriminator, even though computing it exactly is intractable (as p_data or p_G are typically unknown). However, the formula guides intuition: an effective discriminator learns to estimate, at each point x, the share of the data density in the total density p_data(x) + p_G(x), which is exactly the information needed to distinguish real data from generated samples.
At Nash equilibrium, D* = 0.5 everywhere (all samples equally likely to be real or fake), and p_G = p_data (generator matches reality). Computing the optimal D for any finite, discrete dataset involves empirical probability estimation from samples.
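When both densities are known, D* can be evaluated directly. The sketch below uses two 1-D Gaussians as stand-ins for p_data and p_G; this is an illustrative assumption, since real data densities are not available in closed form. The helper names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def optimal_discriminator(x, p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) for density functions p_data, p_g."""
    pd, pg = p_data(x), p_g(x)
    return pd / (pd + pg)

x = np.linspace(-3.0, 5.0, 9)                      # -3, -2, ..., 5
p_data = lambda t: gaussian_pdf(t, 0.0, 1.0)       # assumed data density
p_g_off = lambda t: gaussian_pdf(t, 2.0, 1.0)      # generator shifted away from the data

d_off = optimal_discriminator(x, p_data, p_g_off)  # varies with x
d_eq = optimal_discriminator(x, p_data, p_data)    # generator matches the data

print(np.round(d_off, 3))
print(d_eq)
```

With a mismatched generator, D* is near 1 where the data density dominates and near 0 where the generator density dominates; once p_G = p_data it is 0.5 at every point.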
Divergence Interpretations
The GAN objective can be rewritten as minimizing the Jensen-Shannon divergence between p_data and p_G. This connects GANs to information-theoretic principles: the discriminator implicitly estimates divergence, and the generator minimizes it by moving closer to the data distribution.
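The rewriting can be made explicit by substituting the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_G(x)) into V. The derivation below follows the standard argument and uses natural logarithms.

```latex
\begin{aligned}
V(G, D^*)
  &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_G(x)}\right]
   + \mathbb{E}_{x \sim p_G}\!\left[\log \frac{p_G(x)}{p_{\mathrm{data}}(x) + p_G(x)}\right] \\
  &= -\log 4
   + \mathrm{KL}\!\left(p_{\mathrm{data}} \,\middle\|\, \tfrac{p_{\mathrm{data}} + p_G}{2}\right)
   + \mathrm{KL}\!\left(p_G \,\middle\|\, \tfrac{p_{\mathrm{data}} + p_G}{2}\right) \\
  &= -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\mathrm{data}} \,\|\, p_G\right).
\end{aligned}
```

Since the JSD is non-negative and zero only when the two distributions coincide, the minimum over G is -log 4, attained at p_G = p_data.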
Different GAN variants use different divergences: Wasserstein GANs minimize Earth Mover's distance, f-GANs minimize f-divergences. Each choice affects training dynamics, gradient quality, and convergence behavior. Divergence choice is a design decision with practical consequences.
03
Training Dynamics
Training alternates between discriminator and generator updates. Each iteration updates D for k steps (to improve classification), then updates G for 1 step (to improve generation). The choice of k and the step sizes profoundly affects convergence. Excessive D training can overpower G; insufficient D training leaves G with an uninformative signal.
The gradient flow from D to G carries classification information: large losses indicate unrealistic samples, guiding G toward improvement. However, this feedback is often noisy and non-stationary as D constantly changes, making G optimization unstable.
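Normalization layers matter for stability. Batch normalization normalizes activations to zero mean and unit variance over each mini-batch, which eases optimization in deep networks. DCGAN-style models use batch norm in both G and D. Because batch norm couples the samples within a batch, critics trained with a per-sample gradient penalty (WGAN-GP) typically use layer norm or no normalization instead, and real and fake samples are commonly passed through D in separate batches so that batch statistics do not leak information between them.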
Mode Collapse
Mode collapse is a significant and frequently discussed failure mode: G learns to generate only a narrow subset of the data distribution. For example, a GAN trained on MNIST might generate only images of the digits 0 and 3, completely ignoring other classes. From D's perspective, this is effective deception; from the perspective of learning, it is a failure to model the full data distribution.
Mode collapse arises from the optimization dynamics. Once G finds a region in data space that fools D, improving D only teaches G to refine that region rather than explore others. There is no mechanism forcing G to cover the full distribution. Diversity penalties, unrolled discriminators, and mixture models partially mitigate this.
Theoretical analysis shows that for Wasserstein distance and other proper metrics, mode collapse is less likely because the loss provides a continuous gradient even when distributions are completely disjoint. However, it can still occur in practice due to capacity limitations.
Non-Convergence and Oscillation
Unlike supervised learning with a fixed target distribution, GAN training is inherently non-stationary: the target (real/fake boundary) moves as D changes. This creates a moving target problem. G can appear to improve (producing more realistic samples) while D simultaneously improves (detecting more effectively).
Empirically, well-trained GANs exhibit cyclic behavior: G generates improving samples, D adapts, then cycles restart. True convergence in the theoretical sense (Nash equilibrium) is rarely achieved. Instead, practitioners use inception scores, FID, or manual inspection to determine when to stop training based on sample quality.
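The cycling can be reproduced in a much simpler setting than a GAN. The sketch below is a toy stand-in, not a GAN: it runs simultaneous gradient descent and ascent on the bilinear game f(x, y) = x * y, where x plays the minimizing role and y the maximizing role. The equilibrium is (0, 0), and the function name and step size are illustrative assumptions.

```python
import numpy as np

def simultaneous_gda(x0, y0, lr, steps):
    """Simultaneous gradient descent on x and ascent on y for f(x, y) = x * y."""
    x, y = x0, y0
    radii = [np.hypot(x, y)]            # distance from the equilibrium (0, 0)
    for _ in range(steps):
        grad_x, grad_y = y, x           # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(np.hypot(x, y))
    return x, y, radii

x_end, y_end, radii = simultaneous_gda(1.0, 1.0, lr=0.1, steps=100)
print(radii[0], radii[-1])
```

Each step multiplies the squared distance from the equilibrium by (1 + lr^2), so the iterates rotate around (0, 0) and drift outward instead of settling, which is the same qualitative behavior as the G/D cycles described above.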
Early stopping based on validation metrics is standard. Checkpointing throughout training and selecting the best checkpoint (not the final one) often yields better results than waiting for convergence that may never arrive.
Hyperparameter Sensitivity
GANs exhibit high sensitivity to hyperparameters. Learning rates that are too high cause oscillation and divergence; rates that are too low make training slow. The ratio of discriminator to generator learning rates matters significantly. Architecture choices (layer norm vs batch norm, activation functions, capacity ratios) heavily influence stability.
Initialization is critical: poor initialization can prevent the generator from even receiving useful gradients from the discriminator early in training. Spectral normalization on the discriminator provides automatic regularization, improving training stability across hyperparameter ranges.
04
Wasserstein GAN
Wasserstein GANs address fundamental training instability by replacing KL/JS divergence with Wasserstein distance (Earth Mover's distance). The Wasserstein distance between distributions p and q is the minimum cost of transporting mass from p to q. Geometrically, it measures how much "earth" must be moved and how far.
A key property is that Wasserstein distance provides meaningful gradients even when two distributions have disjoint supports (zero overlap). In contrast, the JS divergence is constant (log 2) and the KL divergence is infinite whenever the supports do not overlap, so neither yields a useful gradient. When p_data and p_G are disjoint, which is common early in training, standard GANs provide nearly zero gradients. The Wasserstein distance mitigates this issue.
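A small discrete example makes the contrast concrete. The sketch below places a point mass for the "data" at position 0 and a point mass for the "generator" at varying positions on a 1-D grid, then computes both quantities. The grid, the function names, and the CDF-based formula for the 1-D Wasserstein-1 distance are choices made for this illustration.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                    # terms with a = 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, positions):
    """W1 on sorted 1-D positions: integral of |CDF_p - CDF_q|."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return np.sum(cdf_gap * np.diff(positions))

positions = np.arange(10, dtype=float)
p = np.zeros(10)
p[0] = 1.0                              # "data": point mass at 0

results = []
for shift in (2, 5, 8):
    q = np.zeros(10)
    q[shift] = 1.0                      # "generator": point mass at `shift`
    results.append((shift, js_divergence(p, q), wasserstein_1d(p, q, positions)))

for shift, js, w1 in results:
    print(shift, js, w1)
```

The JS divergence stays at log 2 no matter how far apart the two masses are, while the Wasserstein distance equals the separation, so only the latter tells the generator which direction reduces the gap.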
The WGAN objective becomes: min_G max_D E_x[D(x)] - E_z[D(G(z))], where D is constrained to be 1-Lipschitz (the norm of its gradient with respect to the input is at most 1). This is the Kantorovich-Rubinstein dual form of the Wasserstein distance. The discriminator (now called a critic) outputs a real-valued score rather than a probability, and the maximized value of the objective is an estimate of the distance between the distributions.
Weight Clipping and Spectral Normalization
To enforce the Lipschitz constraint, the original WGAN used weight clipping: after each update, clip every weight to [-c, c] (for example, [-0.01, 0.01]). This is straightforward but crude: it only bounds the Lipschitz constant by some value that depends on c and the architecture, and it drives many weights to the clipping boundaries, limiting the critic's capacity and harming optimization. Spectral normalization addresses these limitations: each weight matrix is divided by its largest singular value, so every layer is 1-Lipschitz (and so is the whole network when the activations are 1-Lipschitz) while gradient flow is preserved.
Spectral normalization estimates the largest singular value of each weight matrix efficiently using power iteration. It adds little computational overhead and has become standard in modern GANs. Combined with appropriate learning rate scheduling and architecture design, it enables stable training for high-resolution image synthesis.
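A minimal NumPy sketch of the power-iteration estimate is shown below. It runs many iterations on a fixed matrix for clarity; implementations used in training typically run a single iteration per update and carry the vector `u` over between updates. The function name and iteration count are assumptions for this example.

```python
import numpy as np

def spectral_normalize(W, n_iters=200, seed=0):
    """Divide W by a power-iteration estimate of its largest singular value."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = W.T @ u                     # right singular vector estimate
        v /= np.linalg.norm(v)
        u = W @ v                       # left singular vector estimate
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                   # estimated top singular value
    return W / sigma, sigma

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 5))
W_sn, sigma_est = spectral_normalize(W)

sigma_true = np.linalg.svd(W, compute_uv=False)[0]
top_after = np.linalg.svd(W_sn, compute_uv=False)[0]
print(sigma_est, sigma_true, top_after)
```

After normalization the largest singular value of the matrix is approximately 1, so the linear map cannot stretch any input by more than a factor of about 1.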
Gradient penalty is another approach: add a regularization term L_gp = lambda * E[(||∇_x_hat D(x_hat)||_2 - 1)^2], where x_hat is sampled on straight lines between real and generated samples. The term encourages gradient norm 1 at those points and is more flexible than hard clipping. Other penalty formulations exist, but this interpolated-sample form (WGAN-GP) is empirically effective.
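The penalty can be sketched without an autodiff library by choosing a critic whose input gradient has a closed form. The toy critic D(x) = tanh(w . x) below is a hypothetical choice made only for that reason, and lambda = 10 is assumed as a typical setting rather than taken from the text.

```python
import numpy as np

def critic_input_grad(x, w):
    """Gradient of the toy critic D(x) = tanh(w . x) with respect to x, one row per sample."""
    return (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]

def gradient_penalty(x_real, x_fake, w, lam=10.0, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.uniform(size=(x_real.shape[0], 1))    # one mixing weight per real/fake pair
    x_hat = eps * x_real + (1.0 - eps) * x_fake     # points between real and fake samples
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, w), axis=1)
    return lam * np.mean((grad_norm - 1.0) ** 2)

rng = np.random.default_rng(1)
x_real = rng.normal(size=(64, 3))
x_fake = rng.normal(loc=2.0, size=(64, 3))

gp_zero_w = gradient_penalty(x_real, x_fake, w=np.zeros(3))
gp_small_w = gradient_penalty(x_real, x_fake, w=np.array([0.1, 0.0, 0.0]))
print(gp_zero_w, gp_small_w)
```

A critic with zero gradient everywhere is penalized by exactly lambda, because the penalty targets norm 1 rather than merely bounding the norm from above.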
Improved Training Stability
WGAN training exhibits greater stability than standard GANs. The loss curves are meaningful: decreasing loss corresponds to improving generation quality. Generator collapse is less frequent because the distance metric continues to provide feedback even when G generates mode-collapsed samples.
The improved signal relaxes the balance between the two networks: the critic can be trained for several steps per generator update (the original WGAN uses five), or close to optimality, without the generator's gradients vanishing. Hyperparameter tuning is less sensitive. These properties make WGAN and its variants (WGAN-GP, spectral normalization) practical enhancements widely adopted in industry.
05
Conditional GANs
Conditional GANs (cGANs) inject class information into both generator and discriminator. The generator receives both a latent code z and a condition c (class label, image, etc.) and generates samples from the conditional distribution p(x|c). The discriminator evaluates both authenticity and condition consistency.
This extension enables controlled generation: given a class label, the generator produces samples of that class. The discriminator learns to enforce consistency: detecting not just fake samples, but also real samples misaligned with their label. This label-conditioned setup improves both generator quality and discriminator features.
Mathematically, cGANs optimize: min_G max_D E_x[log D(x,c)] + E_z[log(1 - D(G(z,c), c))]. The discriminator now takes both sample and condition, enabling joint evaluation. Class information can be concatenated to latent vectors, injected at multiple layers, or added via more sophisticated mechanisms.
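The simplest of these options, concatenation, can be sketched as follows. The latent size of 16, the 10 classes, and the helper names are illustrative assumptions; the same one-hot code would also be supplied to the discriminator alongside the sample.

```python
import numpy as np

def one_hot(labels, num_classes):
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def conditioned_latent(z, labels, num_classes):
    """Generator input for a cGAN: latent code z concatenated with a one-hot condition c."""
    return np.concatenate([z, one_hot(labels, num_classes)], axis=1)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))            # batch of 4 latent codes
labels = np.array([3, 3, 7, 0])         # requested class for each sample
g_input = conditioned_latent(z, labels, num_classes=10)

print(g_input.shape)
```

The first two rows share a label but differ in z, which is how the generator is asked for two different samples of the same class.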
Pix2Pix and Image-to-Image Translation
Pix2pix applies cGANs to paired image-to-image translation: sketch-to-photograph, day-to-night, semantic segmentation map-to-image, etc. The condition is a source image, and the generator learns to translate it to a target domain while preserving spatial structure.
The key innovation is the patch discriminator (PatchGAN): instead of classifying the entire image as real/fake, it classifies each 70x70 patch, then averages the results. This local evaluation focuses the adversarial signal on high-frequency texture and fine detail, and it needs fewer parameters than a full-image discriminator.
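Pix2pix adds a reconstruction loss (the L1 distance between the generated image and the ground-truth target, preferred over L2 because it blurs less) to the adversarial loss. A reconstruction loss alone captures low-frequency structure but yields blurry results, so the hybrid objective keeps the output faithful to the target while the adversarial term supplies sharp, realistic detail. The balance between adversarial and reconstruction losses is a key design choice.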
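A rough sketch of the combined generator loss is given below. It assumes the discriminator returns a grid of per-patch probabilities, uses the non-saturating form of the adversarial term, and takes a weight of 100 on the L1 term as an assumed typical value. The function name, grid size, and image size are hypothetical.

```python
import numpy as np

def hybrid_generator_loss(d_fake_patches, generated, target, lam=100.0, eps=1e-12):
    """Adversarial term averaged over patch scores, plus a weighted L1 reconstruction term."""
    adv = -np.mean(np.log(d_fake_patches + eps))    # patch scores lie in (0, 1)
    l1 = np.mean(np.abs(generated - target))
    return adv + lam * l1, adv, l1

d_fake_patches = np.full((30, 30), 0.5)             # D is undecided on every patch
generated = np.zeros((256, 256, 3))
target = np.full((256, 256, 3), 0.1)                # ground-truth image, off by 0.1 per pixel

total, adv, l1 = hybrid_generator_loss(d_fake_patches, generated, target)
print(total, adv, l1)
```

With a large weight on the reconstruction term, even a small per-pixel error dominates the total, which is the lever that controls the balance discussed above.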
Class-Conditional Generation
In class-conditional GANs trained on MNIST or ImageNet, each class induces its own subregion in the data manifold. The generator learns these regions and can generate samples on-demand from any class. The latent space z controls variation within a class, while c controls class identity.
This separation enables interesting operations: interpolating between latent codes within a class produces smooth transitions; interpolating class codes produces class-blending effects. Disentanglement between z and c is often incomplete in practice, leading to some class information encoded in z and vice versa.
Class information is typically injected via concatenation with latent vectors or, in more sophisticated designs, via embedding layers and attention mechanisms. The choice affects how tightly class identity is enforced versus how much freedom z has to modulate style and details.
Additional Conditioning Mechanisms
Beyond class labels and source images, GANs can be conditioned on text descriptions (text-to-image synthesis), audio (speech-to-video), semantic layouts, depth maps, or any structured input. Each conditioning modality requires appropriate encoding (embeddings for discrete, CNNs for images, etc.).
Multi-modal conditioning is also possible: generating high-resolution faces from low-resolution faces and a text description. The generator must learn to reconcile potentially conflicting conditions and produce coherent outputs. This is more challenging but enables richer, more controllable generation.
06
Progressive & Style GANs
Progressive GANs accelerate and stabilize training by growing the network architecture gradually. Training begins with low-resolution generation (e.g., 4x4 images). After convergence, new layers are added to both generator and discriminator, progressively increasing resolution (8x8, 16x16, ..., 1024x1024). New layers start with small weights, smoothly fading in as training progresses.
This approach has multiple benefits: low-resolution training converges quickly and is stable; adding layers is like progressive regularization; the network learns hierarchical features naturally. High-frequency detail is learned only after coarse structure is solid, mirroring human perception and natural image statistics.
Progressive growth uses fade-in: new layers are added at small scale with a mixing coefficient alpha that grows from 0 to 1 over several iterations. This smooth transition prevents sharp changes in the loss landscape that would destabilize training. Progressive training requires careful coordination but yields superior convergence and quality.
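The fade-in blend itself is a one-line interpolation. The sketch below uses single-channel arrays and nearest-neighbor upsampling as simplifying assumptions; the arrays stand in for the old low-resolution output and the new layer's output.

```python
import numpy as np

def upsample_nearest(img, factor=2):
    """Nearest-neighbor upsampling of an (H, W) array."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def fade_in(low_res_out, new_layer_out, alpha):
    """Blend the upsampled old output with the new layer's output; alpha goes from 0 to 1."""
    return (1.0 - alpha) * upsample_nearest(low_res_out) + alpha * new_layer_out

low = np.arange(16, dtype=float).reshape(4, 4)      # stand-in for the 4x4 output
new = np.ones((8, 8))                               # stand-in for the new 8x8 output

start = fade_in(low, new, alpha=0.0)                # new layer has no influence yet
halfway = fade_in(low, new, alpha=0.5)
end = fade_in(low, new, alpha=1.0)                  # new layer fully in control
print(start.shape, halfway[0, 0], end[0, 0])
```

At alpha = 0 the network behaves exactly like the previous, lower-resolution network (just upsampled), so adding a layer does not cause a sudden jump in the output.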
StyleGAN and Style Injection
StyleGAN introduces a mapping network that transforms latent codes z (from a standard normal) to an intermediate latent space w. This intermediate space is more disentangled: moving in w corresponds to style changes (hair, pose, identity), while separate per-layer noise inputs modulate stochastic fine details. The separation of high-level and low-level variation is explicit.
Style is injected via adaptive instance normalization (AdaIN) at each layer: normalize each channel of the convolutional output to zero mean/unit variance, then scale and shift it by style vectors derived from w. This mechanism allows precise control: styles applied at early, low-resolution layers affect large structures such as pose and shape, while styles applied at later, high-resolution layers control fine texture and color.
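The AdaIN operation can be sketched for a single feature map as below. The scale and shift are passed in directly here; in StyleGAN they would come from a learned affine transform of w, which this example omits. The function and argument names are hypothetical.

```python
import numpy as np

def adain(features, style_scale, style_bias, eps=1e-8):
    """features: (C, H, W). Normalize each channel, then scale and shift it by the style."""
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    return style_scale[:, None, None] * normalized + style_bias[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=3.0, size=(3, 4, 4))
style_scale = np.array([2.0, 0.5, 1.0])
style_bias = np.array([1.0, -1.0, 0.0])

styled = adain(features, style_scale, style_bias)
print(styled.mean(axis=(1, 2)), styled.std(axis=(1, 2)))
```

After the operation, each channel's mean and standard deviation are set by the style rather than by the incoming activations, which is why the style at a given layer overrides statistics inherited from earlier layers.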
StyleGAN's loss remains adversarial, but the architecture and training procedure enable a high degree of control. Interpolation in w-space produces smooth, semantic transitions. Mixing styles (using different w for different layers) creates realistic variation. Truncation in w-space trades diversity for quality by restricting the range of w.
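Truncation in w-space is commonly written as an interpolation toward the average latent. The sketch below assumes that form, with `psi` as a hypothetical name for the truncation strength and random vectors standing in for mapped latents.

```python
import numpy as np

def truncate_w(w, w_avg, psi):
    """Pull w toward the average latent: psi = 1 leaves w unchanged, psi = 0 returns w_avg."""
    return w_avg + psi * (w - w_avg)

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 8))           # stand-in for mapped latents
w_avg = w.mean(axis=0)

w_full = truncate_w(w, w_avg, psi=1.0)
w_half = truncate_w(w, w_avg, psi=0.5)
w_none = truncate_w(w, w_avg, psi=0.0)
print(w.std(axis=0)[:3], w_half.std(axis=0)[:3])
```

Smaller psi shrinks the spread of the latents around their mean, which reduces variety in the outputs in exchange for staying in the region the generator models best.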
Advanced Architecture Components
StyleGAN uses several architectural innovations: a learned constant input (every image starts from the same learned tensor rather than from the latent code), per-layer noise inputs, style mixing regularization, and minibatch standard deviation in the discriminator (a batch statistic that helps D detect low variety), which it inherits from Progressive GAN. Together these components yield substantial improvements in quality and stability.
StyleGAN2 further improved training with weight demodulation in place of AdaIN (removing characteristic droplet artifacts), path length regularization (encouraging a fixed-size step in w-space to produce a fixed-magnitude change in the image), and lazy regularization (computing regularization terms less frequently). These enhancements enable even higher fidelity and better disentanglement.
07
Evaluation Metrics
Evaluating GANs is challenging because they lack tractable likelihood. Standard metrics for generative models (perplexity, log-likelihood) do not apply. Instead, evaluation typically focuses on two separate aspects: sample quality (realism) and sample diversity (mode coverage). No single metric captures both, requiring multiple complementary measures.
Inception Score (IS) uses a pre-trained ImageNet classifier to evaluate quality and diversity. A good generated sample should be classified confidently into one class (low entropy of p(y|x)), and the generated set as a whole should cover many classes (high entropy of the marginal p(y)). IS = exp(E_x[KL(p(y|x) || p(y))]) combines these criteria; higher is better.
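Given a matrix of classifier outputs, the score is a few lines of NumPy. The sketch below takes the probabilities as input and leaves out the classifier itself; the three toy inputs are hypothetical and chosen to show the extremes.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) rows of p(y|x). Returns exp(mean over x of KL(p(y|x) || p(y)))."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)    # p(y), averaged over samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

confident_diverse = np.eye(4)                                  # each sample a different class
confident_collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))    # every sample the same class
uncertain = np.full((4, 4), 0.25)                              # classifier cannot decide

is_diverse = inception_score(confident_diverse)
is_collapsed = inception_score(confident_collapsed)
is_uncertain = inception_score(uncertain)
print(is_diverse, is_collapsed, is_uncertain)
```

With K classes the score ranges from 1 to K: it reaches K only when predictions are both confident and spread evenly over the classes, and falls to 1 when either property is missing.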
IS has limitations: it depends on classifier capacity and bias, it cannot detect out-of-distribution (unrealistic) samples if they fool the classifier, and it is sensitive to the number of samples. Despite these flaws, it is widely reported and enables historical comparison.
Fréchet Inception Distance
Fréchet Inception Distance (FID) compares feature distributions between real and generated samples using a pre-trained Inception network. Samples are embedded into a feature space (typically the penultimate layer), and FID computes the Fréchet distance (equivalently, the squared 2-Wasserstein distance) between Gaussian distributions fit to the real and generated features.
FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2(Sigma_r Sigma_g)^(1/2)) where mu and Sigma are mean and covariance. Lower FID indicates closer distributions, interpreted as higher quality. FID is more robust than IS: it doesn't require high classification confidence, and it detects distribution mismatch more reliably.
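The formula can be evaluated directly once features are available. The sketch below takes feature arrays as given (the Inception network is omitted) and computes the trace of the matrix square root from the eigenvalues of Sigma_r Sigma_g, an implementation choice for this example that avoids a dedicated matrix square root routine.

```python
import numpy as np

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    diff = mu_r - mu_g
    # Tr((Sigma_r Sigma_g)^(1/2)) equals the sum of square roots of the eigenvalues of
    # Sigma_r Sigma_g, which are real and non-negative for valid covariance matrices.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)

def fid_from_features(feats_r, feats_g):
    """feats_r, feats_g: (N, D) arrays of features for real and generated samples."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    sigma_r = np.cov(feats_r, rowvar=False)
    sigma_g = np.cov(feats_g, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

eye = np.eye(3)
zero = np.zeros(3)
d_same = frechet_distance(zero, eye, zero, eye)
d_shift = frechet_distance(zero, eye, np.array([1.0, 0.0, 0.0]), eye)
d_scale = frechet_distance(zero, eye, zero, 4.0 * eye)
print(d_same, d_shift, d_scale)
```

The three cases separate the two parts of the formula: a mean shift contributes its squared length, and a covariance mismatch contributes through the trace term even when the means agree.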
FID correlates well with human perception of image quality and is now standard in papers and benchmarks. It is more stable than IS and less sensitive to the specific classifier. However, it still relies on an ImageNet-trained network, which may have domain-specific biases.
Precision and Recall
Precision measures the fraction of generated samples that lie on the real data manifold. High precision means the generator produces realistic samples; low precision means it generates out-of-distribution hallucinations. Precision is computed by checking if generated samples have nearest neighbors in the real data with distance below a threshold.
Recall measures the fraction of real data modes captured by the generator. High recall means comprehensive coverage; low recall means mode collapse. These metrics directly address the quality-diversity tradeoff: one GAN might achieve high precision (realistic samples, limited diversity) while another achieves high recall (comprehensive mode coverage, lower average quality).
Precision and recall require defining a threshold distance in feature space. Different thresholds give different values, so reporting multiple thresholds is common. These metrics offer a clear conceptual framework but are computationally expensive (requiring nearest neighbor search over large sample sets).
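A threshold-based version of both metrics can be sketched as follows. It uses a brute-force nearest-neighbor search and a single fixed threshold, both simplifying assumptions; the 1-D points stand in for feature vectors, and the function names are hypothetical.

```python
import numpy as np

def nearest_distances(a, b):
    """For each row of a, the Euclidean distance to its nearest row in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1)

def precision_recall(real, generated, threshold):
    # Precision: share of generated samples that land close to some real sample.
    precision = np.mean(nearest_distances(generated, real) <= threshold)
    # Recall: share of real samples that have some generated sample close by.
    recall = np.mean(nearest_distances(real, generated) <= threshold)
    return float(precision), float(recall)

# Two real modes near 0 and near 10; the generator covers only the mode near 0.
real = np.array([[0.0], [0.1], [10.0], [10.1]])
generated = np.array([[0.05], [0.0], [0.1]])

precision, recall = precision_recall(real, generated, threshold=0.2)
print(precision, recall)
```

The mode-collapsed generator scores perfect precision but only half recall, which is the pattern that a single quality number would hide.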
Other Evaluation Approaches
Human evaluation is considered a gold standard but is subjective. User studies comparing generated samples to real samples yield direct assessments of perceived quality. However, these are expensive to conduct and difficult to standardize across research groups.
Task-specific metrics are also employed: for super-resolution, PSNR and SSIM measure pixel-level fidelity; for segmentation-guided generation, IoU measures mask accuracy. For text-to-image synthesis, CLIP score measures alignment with descriptions. While not universally applicable, these metrics provide clear insights within their specific application domains.
08
GAN Theory & Limitations
GANs have no explicit likelihood-based training objective. While this enables flexible, unsupervised learning, it makes theoretical analysis difficult. Standard information theory and convergence guarantees don't directly apply. The minimax formulation provides some theoretical grounding, but assumptions (continuous distributions, infinite capacity, infinite optimization) are rarely satisfied in practice.
Goodfellow et al. proved that GANs converge under strong assumptions: if both networks have sufficient capacity, if the discriminator can be optimized to convergence in each iteration, and if the distributions are continuous. In practice, these conditions fail: networks are finite, optimization is incomplete, and discrete datasets violate continuity. Theoretical convergence guarantees are weak.
However, empirical convergence is frequently observed. When properly trained (good architectures, hyperparameters, regularization), GANs do learn meaningful distributions. The theory-practice gap suggests that regularization and architectural constraints implicitly provide conditions sufficient for convergence, even when formal theorems don't apply.
Implicit Density and Mode Coverage
GANs model the data distribution implicitly: the generator defines a density p_G through its transformations, but this density is never computed explicitly. This characteristic offers advantages (flexibility, scalability) and presents challenges (difficulty analyzing the learned representations of the generator). Unlike VAEs with explicit likelihood, GANs cannot compute p_G(x) for a given sample.
Implicit density creates a fundamental challenge: assessing mode coverage. The generator might ignore entire regions of the data space without any signal to correct it. Precision/recall metrics detect this post-hoc, but during training, there is no continuous "pressure" to maintain diversity. Some modes are simply never explored.
This contrasts with likelihood-based models (VAEs, autoregressive models) which have a continuous incentive to model all regions of the data: low-probability regions incur high likelihood loss. GANs lack this property—once a region is adequately fooling D, G has no incentive to improve there or explore elsewhere.
Truncation Trick and Quality-Diversity Tradeoff
The truncation trick samples latent codes from a restricted region (e.g., z ~ N(0, sigma^2) with sigma < 1 rather than sigma = 1). This biases the generator toward the mode of its learned distribution, trading diversity for quality. Highly truncated samples often exhibit high fidelity but demonstrate reduced variety.
This tradeoff is fundamental to GANs. Achieving both maximum diversity and maximum quality appears impossible: the generator's optimal strategy to fool D is to specialize in high-density regions of the data. Exploring low-density regions (rare classes, unusual variations) reduces average realism.
Modern GANs try to mitigate this with regularization (spectral normalization, gradient penalties) and careful choice of training duration. However, the underlying tradeoff remains. Practitioners must choose: use strong truncation for publication-quality results (with limited diversity), or relax truncation for better mode coverage (with lower average quality).
Distinguishability and the Fundamental Limit
At Nash equilibrium, the discriminator achieves D*(x) = 0.5 everywhere: all samples are equally likely to be real or fake. At this point, the generator distribution exactly matches the real data. However, this outcome is rarely reached in practice when the true data is high-dimensional and complex.
Binnacle et al. showed that unless the generator has infinite capacity and is trained to convergence, achieving perfect indistinguishability is impossible. The discriminator can always find some feature that distinguishes real from fake, leading to perpetual pressure on the generator to improve.
In practice, sufficient indistinguishability is achieved: samples are perceived as realistic by human observers, pass standard metrics, and are useful for applications. Perfect equilibrium is neither necessary nor achievable. The goal is sufficient alignment of distributions for the intended use case.