Reinforcement Learning

01

Markov Decision Processes

A Markov Decision Process (MDP) is the mathematical framework for sequential decision-making under uncertainty. It consists of states S, actions A, transition probabilities P(s'|s,a), rewards R(s,a), and a discount factor γ ∈ [0,1]. The Markov property ensures that future states depend only on the current state and action, not on the history of transitions.

The goal of reinforcement learning is to find an optimal policy π*(s) that maps states to actions, maximizing cumulative discounted reward: E[∑t=0∞ γt rt]. Key insights include the concept of value functions V(s) = E[cumulative future reward from state s] and Q-functions Q(s,a) = E[reward from taking action a in state s]. These functions satisfy recursive relationships known as Bellman equations.

4

Core Components

γ ≤ 1

Discount Factor

π*(s)

Optimal Policy

Bellman

Recursion

02

Value Iteration Algorithm

Value iteration is a dynamic programming algorithm that computes the optimal value function by repeatedly applying the Bellman optimality operator. Starting from an arbitrary initialization V(s), the algorithm updates: V(s) ← maxₐ [R(s,a) + γ E[V(s')]]. This update is guaranteed to converge to V*(s) because the Bellman operator is a contraction mapping with modulus γ < 1.

Once the optimal value function is computed, the optimal policy is extracted via π*(s) = argmaxₐ [R(s,a) + γ E[V*(s')]]. Value iteration has exponential convergence in the number of iterations, making it efficient for problems with moderate state spaces. The algorithm requires knowledge of the transition model P(s'|s,a) and reward function, limiting applicability to model-free settings.

Contraction Mapping

Bellman operator satisfies ||T V - T V'|| ≤ γ ||V - V'||, guaranteeing convergence at rate O(γ^n).

Policy Extraction

After convergence, optimal policy is deterministically extracted from value function in one sweep.

03

Policy Iteration Method

Policy iteration alternates between two steps: policy evaluation computes V^π(s) for a fixed policy π, and policy improvement updates π(s) ← argmaxₐ [R(s,a) + γ E[V^π(s')]]. This greedy improvement step is guaranteed to produce π' ≥ π in the sense of V^π'(s) ≥ V^π(s) for all states. The algorithm terminates when the improved policy equals the current policy.

Policy iteration typically converges in fewer iterations than value iteration because policy improvement can make substantial jumps. However, each iteration requires solving a system of linear equations for V^π (policy evaluation), which can be computationally expensive. In practice, generalized policy iteration (GPI) performs partial policy evaluation steps to balance computation cost with convergence speed.

Policy Improvement

Greedy improvement guarantees: V^π'(s) ≥ V^π(s) for every state, leading to monotonic improvement.

Faster Convergence

Often converges in fewer total iterations than value iteration despite higher per-iteration cost.

04

Learning Environment Models

In model-based reinforcement learning, the agent learns or uses a transition model P(s'|s,a) and reward model to plan optimal behavior. When the true model is unknown, the agent can learn an approximate model from experience (s,a,s',r) tuples using maximum likelihood or other supervised learning techniques. The learned model is then used with value iteration or policy iteration for planning.

The exploration vs. exploitation dilemma arises naturally: the agent must balance exploring unfamiliar state-action pairs (to improve the model) with exploiting actions that appear optimal under the current model. Methods like ε-greedy exploration, upper confidence bounds (UCB), and Thompson sampling address this tradeoff. Dyna-style algorithms combine learning from real experience with planning using the learned model to accelerate learning.

Model Learning

Learn P(s'|s,a) from experience transitions; use maximum likelihood or Bayesian methods for uncertainty.

Dyna Architecture

Interleave real experience collection with simulated planning steps using the learned model.

05

Continuous State MDPs

Many real-world problems have continuous state spaces (e.g., robot position, velocity) where discretization is impractical due to curse of dimensionality. Value function approximation addresses this by parameterizing V(s) ≈ V(s;θ) with a function approximator (neural network, linear basis functions, etc.). Fitted value iteration becomes: sample state transitions, compute targets using Bellman equation, and perform supervised learning to minimize ||V(s;θ) - target||².

Linear function approximation using basis functions φ(s) offers theoretical guarantees: V(s;θ) = θᵀφ(s). However, nonlinear approximators like neural networks are more flexible but lack convergence guarantees and can exhibit instability. Key practical challenges include managing function approximation error, choosing appropriate features, and handling off-policy learning where the data distribution doesn't match the current policy.

Basis Functions

Linear approximation V(s;θ) = θᵀφ(s) with convergence guarantees under gradient descent.

Neural Networks

Nonlinear function approximation enables complex value functions but requires careful regularization and stability techniques.

06

Linear Quadratic Regulator

The Linear Quadratic Regulator (LQR) is the optimal solution for continuous control of linear dynamical systems. Given dynamics ẋ = Ax + Bu and quadratic cost ∑(xᵀQx + uᵀRu), LQR computes an optimal linear policy u* = -Kx where K is derived from the algebraic Riccati equation. The solution is closed-form, computationally efficient, and optimal in expectation.

LQR serves as a foundation for many advanced control algorithms. The value function is quadratic: V(x) = xᵀPx where P is the solution to the Riccati equation. Finite-horizon variants of LQR can be solved via backward dynamic programming. Extensions to time-varying systems, partially observed systems (LQG), and constrained settings are well-studied. LQR's optimality and computational efficiency make it the gold standard for linear regulator problems.

Riccati Equation

Closed-form solution P via algebraic Riccati equation yields optimal feedback gain K = (R + BᵀPB)⁻¹BᵀPA.

Optimality Guarantee

Proven optimal in expectation for linear systems with quadratic costs; computational complexity O(n³) for n-dimensional state.

07

Differential Dynamic Programming

Differential Dynamic Programming (DDP) extends LQR to nonlinear systems by linearizing around a nominal trajectory. At each iteration, DDP computes a linear approximation of the value function along the trajectory, solves a local LQR problem, and takes a step toward the computed policy. Iterative LQR (iLQR) simplifies this by solving the local LQR subproblem without explicit second-order terms.

DDP and iLQR are trajectory optimization methods: they optimize a specific plan (sequence of states and actions) rather than a state-dependent policy. These algorithms converge rapidly (quadratically for DDP) near good solutions. Extensions include differential dynamic programming with constraints, iterative value function computation, and connections to probabilistic inference formulations of control. iLQR has become a practical workhorse in robotics due to its balance of efficiency and flexibility.

Trajectory Optimization

Optimizes finite-horizon plans via successive linearization and local LQR solution.

Fast Convergence

Quadratic convergence near optima; applicable to high-dimensional nonlinear systems in robotics.

08

Policy Gradient Methods

Policy gradient methods directly optimize the policy π(a|s;θ) parameterized by θ, using gradient ascent on the objective J(θ) = E[∑γᵗrₜ]. The policy gradient theorem shows that ∇J(θ) ∝ E[∇log π(a|s;θ) Q(s,a)], enabling unbiased gradient estimates. REINFORCE uses sampled rewards as estimates of Q, while more advanced methods use learned value functions V(s) as baselines to reduce variance.

Policy gradients enjoy universal function approximation (any policy can be approximated), but suffer from high variance in gradient estimates. Variance reduction techniques like baselines, generalized advantage estimation (GAE), and actor-critic methods significantly improve sample efficiency. Modern variants like PPO and TRPO use trust region optimization to stabilize learning. Policy gradients are particularly effective for continuous control and exploration-friendly policies (e.g., Gaussians).

Policy Gradient Theorem

∇J(θ) ∝ E[∇log π(a|s;θ) Q(s,a)] enables unbiased gradient estimation without value function.

Variance Reduction

Baselines V(s), GAE, and actor-critic methods dramatically reduce variance while maintaining unbiasedness.

09

References & Further Reading

Key References:

Markov Decision Processes

Value Iteration Algorithm

Contraction Mapping

Policy Extraction

Policy Iteration Method

Policy Improvement

Faster Convergence

Learning Environment Models

Model Learning

Dyna Architecture

Continuous State MDPs

Basis Functions

Neural Networks

Linear Quadratic Regulator

Riccati Equation

Optimality Guarantee

Differential Dynamic Programming

Trajectory Optimization

Fast Convergence

Policy Gradient Methods

Policy Gradient Theorem

Variance Reduction

References & Further Reading

Markov Decision Processes

Value Iteration Algorithm

Policy Iteration Method

Learning Environment Models

Continuous State MDPs

Linear Quadratic Regulator

Differential Dynamic Programming

Policy Gradient Methods

Sources & References

Course Materials

Textbooks & Papers