Evaluating Generative Models

Stanford XCS236 · Deep Generative Models

Why Evaluation Is Hard

Evaluating generative models is fundamentally challenging because no single "ground truth" metric captures every aspect of quality. Unlike supervised learning, where we optimize for accuracy against labeled targets, generative models must balance several competing objectives: sample quality (are generated images realistic?), diversity (does the model cover the entire data distribution?), and mode coverage (does it generate from all semantic categories?). A model tuned for high sample quality might produce repetitive, low-diversity outputs. Conversely, a model that maximizes diversity may sacrifice image realism.

The challenge is exacerbated by the subjective nature of "quality" in generative tasks. Visual quality is inherently perceptual; what humans prefer varies based on context, artistic style, and use case. Additionally, computational efficiency and scalability constraints mean evaluators must choose between comprehensive human studies (expensive, slow) and automatic metrics (fast but sometimes misaligned with human judgment). Different application domains—art generation, scientific imaging, medical synthesis—may require entirely different evaluation priorities.

At a glance: multiple quality dimensions; subjective perceptual quality; trade-offs between diversity and fidelity; domain-dependent context.

Log-Likelihood Metrics

Log-likelihood (LL) is a theoretically principled metric grounded in probability: it measures how much probability (density) the model assigns to observed data drawn from the true distribution. For a set of held-out test images, we compute the average negative log-likelihood (NLL), usually expressed in bits per dimension (BPD): the NLL in bits divided by the number of dimensions (pixels times channels). Lower NLL means the model assigns higher probability to real data, a direct measure of how well the learned distribution approximates the true data distribution. The metric is useful because it yields a rigorous, probabilistically interpretable score that needs only a held-out test set, with no pretrained feature extractor or reference statistics.

However, log-likelihood has significant practical limitations for image generation. Computing exact LL requires a tractable density, which flow-based and autoregressive models provide. VAEs and diffusion models typically yield only a variational bound on LL, and GANs define no explicit density at all, so these models need additional machinery (e.g., importance-weighted bounds) to produce even an estimate. For high-dimensional, continuous image distributions, LL can be dominated by low-level image statistics rather than semantic content, so a model can score well while producing blurry but probabilistically coherent samples. BPD normalizes for dimensionality, but comparisons across image resolutions and preprocessing choices remain non-trivial. Additionally, models with similar LL can produce visually very different outputs, which suggests LL captures different information than perceptual quality.
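The bits-per-dimension unit described above is a simple rescaling: BPD = NLL_nats / (D · ln 2), where D is the number of dimensions. The following is a minimal sketch of that conversion, assuming the model reports a per-image NLL in nats; the function name and the 32x32 RGB example are illustrative assumptions, not part of the course material.

```python
import math


def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-image NLL in nats to bits per dimension (BPD)."""
    if num_dims <= 0:
        raise ValueError("num_dims must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by num_dims
    # normalizes for the size of the image.
    return nll_nats / (num_dims * math.log(2.0))


# Hypothetical example: a 32x32 RGB image has 32 * 32 * 3 = 3072 dimensions.
num_dims = 32 * 32 * 3

# Sanity check: a model that spreads probability uniformly over 256
# intensity levels per dimension has NLL = num_dims * ln(256) nats,
# which corresponds to 8 bits per dimension.
uniform_nll_nats = num_dims * math.log(256.0)
uniform_bpd = nats_to_bits_per_dim(uniform_nll_nats, num_dims)
```

The uniform baseline is a useful reference point: any model that has learned something about 8-bit images should score below it.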

At a glance: NLL on a held-out test set; bits/dim as the standard unit; tractable for some models only; largely orthogonal to perception.

Fréchet Inception Distance

Fréchet Inception Distance (FID) is one of the most widely adopted metrics in generative modeling. The core idea is to extract feature representations from both real and generated images using an InceptionV3 network pretrained on ImageNet, fit a Gaussian to each feature set by computing its mean and covariance, and measure the Fréchet distance between the two Gaussians (their squared Wasserstein-2 distance). Formally, FID = ||μ_real − μ_gen||^2 + Tr(Σ_real + Σ_gen − 2(Σ_real·Σ_gen)^{1/2}). Lower FID indicates better alignment between the generated and real distributions. FID is appealing because it compares both the mean and the covariance of the feature distributions, which makes it sensitive to both sample quality and diversity, without requiring paired ground truth.
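The FID formula can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the Inception features have already been extracted into arrays of shape (num_samples, feature_dim), and the function names are hypothetical. Instead of forming the matrix square root explicitly, it uses the fact that the trace of (Σ_real·Σ_gen)^{1/2} equals the sum of the square roots of the eigenvalues of Σ_real·Σ_gen.

```python
import numpy as np


def gaussian_stats(features):
    """Mean vector and covariance matrix of a (num_samples, dim) array."""
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma


def frechet_distance(mu_real, sigma_real, mu_gen, sigma_gen):
    """Frechet distance between two Gaussians (the quantity FID reports)."""
    diff = mu_real - mu_gen
    # The product of two PSD matrices has real, non-negative eigenvalues
    # up to numerical error, so small negative or imaginary parts are
    # discarded before taking square roots.
    eigvals = np.linalg.eigvals(sigma_real @ sigma_gen)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(
        diff @ diff
        + np.trace(sigma_real)
        + np.trace(sigma_gen)
        - 2.0 * trace_sqrt
    )


# Stand-in "features": random vectors, NOT real Inception activations.
rng = np.random.default_rng(0)
real_features = rng.normal(size=(500, 4))
mu_r, sigma_r = gaussian_stats(real_features)

# Identical statistics give a distance of (numerically) zero.
same_distance = frechet_distance(mu_r, sigma_r, mu_r, sigma_r)

# Shifting every feature by 2 changes only the mean term.
shifted_distance = frechet_distance(mu_r, sigma_r, mu_r + 2.0, sigma_r)
```

In practice the features have many more dimensions than this toy example, so the covariance estimate needs many samples to be stable, which is one reason reported FID values depend on sample count.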

FID has well-known limitations that practitioners must understand. It depends entirely on InceptionV3 features, which were optimized for object classification on ImageNet and are not necessarily aligned with human perceptual preferences or generative quality. The metric can fail for out-of-distribution domains (e.g., medical imaging, artwork) where ImageNet features are less informative. FID is also sensitive to implementation details: the reference set, the number of samples, image preprocessing, and the choice of layer activations all change the score, so values are comparable only when computed under an identical protocol. Because FID is a single number, a high value does not reveal its cause: mode dropping (low diversity) with realistic individual samples can raise FID just as poor sample quality can. Finally, FID cannot assess fine-grained perceptual properties like text legibility in generated scenes, making it an incomplete quality measure.

At a glance: Inception feature space; Wasserstein distance; fast computation; limited domain transfer.

Inception Score

Inception Score (IS) quantifies both sample quality and diversity using a single number derived from an InceptionV3 classifier. The metric is based on KL divergence: IS = exp(E_x[KL(p(y|x) || p(y))]). For each generated image x, we extract the predicted class distribution p(y|x). Averaging over all generated images gives the marginal class distribution p(y). IS is high when p(y|x) is confident (a sharp, peaked distribution, indicating recognizable samples) and p(y) is close to uniform (diverse class coverage). IS ranges from 1 (worst: every p(y|x) equals the marginal p(y), for example when all predictions are uniform or all samples receive the same prediction) to the number of classes K (best: confident predictions spread evenly over all classes). Unlike raw likelihood, the score therefore has a fixed, bounded range that does not depend on image resolution.
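The definition above translates directly into code once the classifier outputs are available. The sketch below assumes the class probabilities p(y|x) have already been computed and stored as rows of a matrix; the function name and the toy inputs are illustrative assumptions, and published implementations typically also average over several splits of the samples.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """IS = exp(mean_x KL(p(y|x) || p(y))) from a (num_images, K) matrix."""
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal class distribution p(y): the average of the per-image rows.
    marginal = probs.mean(axis=0, keepdims=True)
    # Per-image KL divergence; eps guards against log(0).
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


K = 5  # hypothetical number of classes for the toy example

# Best case: confident predictions spread evenly over all K classes.
best_case = inception_score(np.eye(K))

# Worst case 1: every prediction is uniform (unrecognizable samples).
uniform_case = inception_score(np.full((8, K), 1.0 / K))

# Worst case 2: every image is confidently assigned to the same class
# (mode collapse), so p(y|x) equals p(y) for all x.
collapsed_case = inception_score(np.tile(np.eye(K)[0], (8, 1)))
```

The two worst cases show why a score of 1 does not by itself distinguish unrecognizable samples from a collapsed generator.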

Despite its elegance, IS has profound failure modes that limit its reliability. The metric assumes that high classifier confidence indicates high sample quality, a problematic assumption for adversarial examples and out-of-distribution images. IS is biased toward generating ImageNet classes, making it uninformative for non-natural-image domains. It cannot distinguish between genuine diversity and classifier confusion; a generator that produces slightly blurry but confidently classified images often scores better than one that generates diverse but slightly harder-to-classify variations. IS also ignores intra-class diversity (a generator that emits one confident image per class can score near the maximum), never compares generated samples with real data, and cannot assess fine details. Consequently, high IS has proven to correlate poorly with human perception in many cases, and researchers increasingly view IS as a coarse proxy rather than a reliable quality metric.

At a glance: score range 1 to K; KL divergence combines quality and diversity; prone to classifier bias; poor correlation with human judgment.

Precision & Recall

Precision and recall metrics directly address mode coverage and quality trade-offs by treating generative evaluation as a manifold estimation problem. The approach (Kynkäänniemi et al., 2019) embeds both real and generated samples as points in a feature space and uses nearest-neighbor statistics to approximate each manifold. Precision is the fraction of generated samples that have a real neighbor within a distance threshold: high precision means most generated samples land near real data (quality-focused). Recall is the fraction of real samples that have a generated neighbor within the corresponding threshold: high recall means the generator covers the real data manifold comprehensively (diversity and mode-coverage-focused). This pair decouples the quality-diversity trade-off, showing whether a model falls short on fidelity, on coverage, or on both.

Precision and recall are theoretically cleaner than FID in some respects: they measure fidelity and coverage separately and do not assume that the features are Gaussian. They still rely on distances in the chosen feature space being meaningful, and they introduce new complexities. The approach requires a distance threshold, typically set per sample as the distance to its k-th nearest neighbor within the same set, and the choice of k significantly affects the results. Computing nearest neighbors in high dimensions is computationally expensive and prone to curse-of-dimensionality artifacts. The metrics are also sensitive to the specific feature extractor and can still be misaligned with human judgment in edge cases. Despite these limitations, precision and recall have become increasingly popular because they provide interpretable, disentangled insights into model performance that single-number metrics lack.
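The nearest-neighbor construction can be sketched as follows. This is a simplified illustration in the spirit of the k-NN approach described above, assuming features are already extracted. The function names, the brute-force distance computation, and the one-dimensional toy layout are assumptions made for readability, and the code is not tuned for large sample counts.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor in the same set."""
    dists = pairwise_distances(points, points)
    # After sorting, column 0 is the zero distance of a point to itself,
    # so column k holds the k-th nearest *other* point.
    return np.sort(dists, axis=1)[:, k]


def covered_fraction(queries, support, k):
    """Fraction of queries inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)
    dists = pairwise_distances(queries, support)
    return float((dists <= radii[None, :]).any(axis=1).mean())


def precision_recall(real, generated, k=3):
    precision = covered_fraction(generated, real, k)  # generated near real?
    recall = covered_fraction(real, generated, k)     # real near generated?
    return precision, recall


# Toy features: real data has two modes, the generator covers only one.
real = np.array([[x, 0.0] for x in (0, 1, 2, 3, 4, 100, 101, 102, 103, 104)])
generated = np.array([[x, 0.0] for x in (0.5, 1.5, 2.5, 3.5)])
precision, recall = precision_recall(real, generated, k=3)
```

In this toy layout every generated point lies near real data, while the second real mode is never covered, so precision stays high and recall drops: the signature of mode dropping that a single-number metric would blur.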

At a glance: precision measures quality via nearest neighbors; recall measures coverage of the manifold; each metric lies in 0–1; interpretable trade-offs.

Perceptual Metrics

Perceptual metrics aim to align automated evaluation with human visual perception, either by using pretrained neural networks or by directly collecting human judgments. LPIPS (Learned Perceptual Image Patch Similarity, Zhang et al., 2018) computes a distance between two images in deep feature spaces (VGG, AlexNet, or SqueezeNet), using learned per-channel weights at each layer, and captures perceptual differences humans find salient. SSIM (Structural Similarity) compares luminance, contrast, and structure between an image and a reference; it works well for small distortions of a known target but less so for diverse generative tasks that have no paired reference. Human preference studies, while expensive in time and annotation cost, provide the reference standard: researchers show images to annotators and ask which is higher quality, establishing empirical benchmarks that automatic metrics should ultimately match.
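To make the luminance, contrast, and structure comparison concrete, here is a deliberately simplified sketch of SSIM that treats the whole image as a single window. This is an assumption made for brevity: the standard metric averages the same formula over local sliding windows, so this function will not reproduce values from established libraries. The function name and test images are hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized grayscale images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Small constants that stabilize the ratio when means or variances
    # are close to zero (the conventional 0.01 and 0.03 factors).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


# Toy images in [0, 1]: a horizontal gradient and its inverted copy.
gradient = np.tile(np.linspace(0.0, 1.0, 16), (16, 1))
inverted = 1.0 - gradient

ssim_identical = global_ssim(gradient, gradient)  # same structure
ssim_inverted = global_ssim(gradient, inverted)   # opposite structure
```

Both toy images have the same mean and variance, so only the structure term separates them, which illustrates why SSIM needs a paired reference and says little about unconditional samples.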

The tension between automatic metrics and human evaluation is fundamental. Automatic metrics are reproducible, fast, and enable large-scale model comparisons, which is essential for research velocity. However, they inevitably lose nuance: LPIPS, for example, depends on feature extractors trained on specific datasets. Human evaluation has its own limits: preference is context-dependent and subjective, and it varies across demographic groups and use cases. Recent work on preference-based evaluations (e.g., ranking-based human studies) shows better alignment with user satisfaction than single-number quality scores. Best practice increasingly involves combining multiple metrics: use FID and precision-recall for broad comparisons, LPIPS for fine-grained perceptual comparisons against references, and human studies for final validation, especially in deployed systems where user preferences matter.

At a glance: LPIPS is a neural distance; SSIM is structure-based; human judgment is the ground truth; results are context-dependent.

Downstream Tasks

Downstream task evaluation tests whether generated samples or learned representations actually enable good performance on practical applications. A natural approach is to train classifiers on generated images, or to use learned feature representations (e.g., from diffusion models or VAEs) as inputs to downstream tasks. A good (low) FID paired with low downstream accuracy suggests the metric is misleading; conversely, models with strong downstream performance are demonstrably useful. Common downstream benchmarks include training image classifiers on synthetic data and measuring accuracy on real test data, using learned features for semi-supervised learning, and evaluating representation quality via linear probes (training a simple classifier on frozen features). This approach connects evaluation directly to practical value: a model that produces diverse, low-FID samples but learns poor feature representations may be less useful than one with a modest FID but strong linear-probe accuracy.
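A linear probe, as mentioned above, is just a linear classifier trained on frozen features. The sketch below is a minimal version using softmax regression with plain gradient descent. The synthetic Gaussian "features" stand in for embeddings from a real generative model, and all names and hyperparameters are illustrative assumptions.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, lr=0.5, steps=300):
    """Train softmax regression on frozen features with gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def probe_accuracy(features, labels, weights, bias):
    predictions = (features @ weights + bias).argmax(axis=1)
    return float((predictions == labels).mean())


def make_split(rng, n_per_class, dim=8):
    """Synthetic stand-in for frozen features: two separated Gaussian blobs."""
    class0 = rng.normal(-3.0, 1.0, size=(n_per_class, dim))
    class1 = rng.normal(3.0, 1.0, size=(n_per_class, dim))
    features = np.vstack([class0, class1])
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return features, labels


rng = np.random.default_rng(0)
train_x, train_y = make_split(rng, 100)
test_x, test_y = make_split(rng, 50)

# Only the probe is trained; the features themselves are never updated.
weights, bias = fit_linear_probe(train_x, train_y, num_classes=2)
test_accuracy = probe_accuracy(test_x, test_y, weights, bias)
```

Because the probe has so little capacity, its held-out accuracy mostly reflects how linearly separable the frozen features already are, which is the property the evaluation is meant to isolate.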

Downstream evaluation reveals important gaps between automatic metrics and practical utility. A generator might achieve a competitive FID while concentrating on common, easy-to-render modes of the data: adequate for the headline score but insufficient for tasks requiring diverse poses, lighting, or compositions. Feature representations from generative models show varied quality: some generative pretraining approaches (e.g., masked image modeling) learn excellent features, while others focus purely on sample quality without learning transfer-useful representations. The key insight is that generative quality and representation quality are not always aligned. Modern evaluation practice increasingly emphasizes downstream tasks for generative models in real applications, especially in domains like scientific imaging or medical synthesis where practical downstream utility is paramount. However, computational cost means full downstream evaluation is often reserved for final model selection rather than hyperparameter tuning.

At a glance: classification on synthetic data; linear probes for feature quality; a practical utility test; task-dependent metric alignment.

Comprehensive Evaluation

Best-practice generative model evaluation combines multiple metrics to form a comprehensive picture. Start with automatic metrics that are fast and scalable: report FID (de-facto standard for image comparison), Precision-Recall (interpretable trade-off insights), and log-likelihood (where tractable, for theoretical grounding). Layer perceptual metrics like LPIPS or domain-specific structural measures as appropriate. For visual fidelity assessment, include SSIM or other reference-based metrics when you have paired data. Always report sample diversity metrics alongside quality metrics—a single number hides critical trade-offs. Use these for iterative model development and hyperparameter tuning.

For final validation and model selection, conduct human evaluation studies: ask annotators to rate quality on ordinal scales or perform pairwise comparisons. In production systems, track downstream task performance (e.g., classifier accuracy on synthetic training data, or downstream model quality when generative outputs are used as inputs). Benchmark datasets are essential: ImageNet-derived datasets (for FID/IS evaluation), domain-specific datasets (for perceptual alignment), and holdout human preference sets (for human correlation). When publishing or deploying, report multiple metrics with confidence intervals (where possible), disclose which datasets were used for evaluation, and acknowledge metric limitations. Avoid cherry-picking metrics that favor your approach; instead, present a balanced view including metrics where competitors excel. Emerging best practices emphasize open-source evaluation code, standardized evaluation protocols, and community benchmarks (e.g., GenEval, HPSv2) to improve reproducibility and comparability across research groups.
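One way to attach a confidence interval to a pairwise human comparison is a percentile bootstrap over the individual votes. The sketch below assumes each vote is recorded as 1 when annotators preferred model A and 0 otherwise. The vote counts are hypothetical illustration data, not results from any study, and the function name is an assumption.

```python
import random


def bootstrap_ci(values, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(num_resamples):
        # Resample the votes with replacement and record the mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2.0) * num_resamples)]
    upper = means[int((1.0 - alpha / 2.0) * num_resamples) - 1]
    return lower, upper


# Hypothetical votes: model A preferred in 70 of 100 pairwise comparisons.
votes = [1] * 70 + [0] * 30
win_rate = sum(votes) / len(votes)
ci_lower, ci_upper = bootstrap_ci(votes)
```

The same helper applies to any per-sample score; metrics computed from whole-set statistics, such as FID, need the resampling applied to the sample set and the metric recomputed each time.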

At a glance: 4–6 automatic metrics; human validation; downstream task tests; reproducible reporting.

References & Further Reading

Evaluating generative models remains one of the most challenging aspects of the field. This section compiles key papers, benchmark datasets, and practical guidelines for comprehensive evaluation combining likelihood metrics, perceptual quality measures, and downstream task performance.

From foundational work on FID and IS to modern composite evaluation strategies, these materials guide rigorous assessment of generative model quality across multiple dimensions.
