Markov Decision Processes
A Markov Decision Process (MDP) is the mathematical framework for sequential decision-making under uncertainty. It consists of states S, actions A, transition probabilities P(s'|s,a), rewards R(s,a), and a discount factor γ ∈ [0,1]. The Markov property ensures that future states depend only on the current state and action, not on the history of transitions.
The goal of reinforcement learning is to find an optimal policy π*(s) that maps states to actions, maximizing cumulative discounted reward: E[∑t=0∞ γt rt]. Key insights include the concept of value functions V(s) = E[cumulative future reward from state s] and Q-functions Q(s,a) = E[reward from taking action a in state s]. These functions satisfy recursive relationships known as Bellman equations.
Value Iteration Algorithm
Value iteration is a dynamic programming algorithm that computes the optimal value function by repeatedly applying the Bellman optimality operator. Starting from an arbitrary initialization V(s), the algorithm updates: V(s) ← maxₐ [R(s,a) + γ E[V(s')]]. This update is guaranteed to converge to V*(s) because the Bellman operator is a contraction mapping with modulus γ < 1.
Once the optimal value function is computed, the optimal policy is extracted via π*(s) = argmaxₐ [R(s,a) + γ E[V*(s')]]. Value iteration has exponential convergence in the number of iterations, making it efficient for problems with moderate state spaces. The algorithm requires knowledge of the transition model P(s'|s,a) and reward function, limiting applicability to model-free settings.
Contraction Mapping
Bellman operator satisfies ||T V - T V'|| ≤ γ ||V - V'||, guaranteeing convergence at rate O(γ^n).
Policy Extraction
After convergence, optimal policy is deterministically extracted from value function in one sweep.
Policy Iteration Method
Policy iteration alternates between two steps: policy evaluation computes V^π(s) for a fixed policy π, and policy improvement updates π(s) ← argmaxₐ [R(s,a) + γ E[V^π(s')]]. This greedy improvement step is guaranteed to produce π' ≥ π in the sense of V^π'(s) ≥ V^π(s) for all states. The algorithm terminates when the improved policy equals the current policy.
Policy iteration typically converges in fewer iterations than value iteration because policy improvement can make substantial jumps. However, each iteration requires solving a system of linear equations for V^π (policy evaluation), which can be computationally expensive. In practice, generalized policy iteration (GPI) performs partial policy evaluation steps to balance computation cost with convergence speed.
Policy Improvement
Greedy improvement guarantees: V^π'(s) ≥ V^π(s) for every state, leading to monotonic improvement.
Faster Convergence
Often converges in fewer total iterations than value iteration despite higher per-iteration cost.
Learning Environment Models
In model-based reinforcement learning, the agent learns or uses a transition model P(s'|s,a) and reward model to plan optimal behavior. When the true model is unknown, the agent can learn an approximate model from experience (s,a,s',r) tuples using maximum likelihood or other supervised learning techniques. The learned model is then used with value iteration or policy iteration for planning.
The exploration vs. exploitation dilemma arises naturally: the agent must balance exploring unfamiliar state-action pairs (to improve the model) with exploiting actions that appear optimal under the current model. Methods like ε-greedy exploration, upper confidence bounds (UCB), and Thompson sampling address this tradeoff. Dyna-style algorithms combine learning from real experience with planning using the learned model to accelerate learning.
Model Learning
Learn P(s'|s,a) from experience transitions; use maximum likelihood or Bayesian methods for uncertainty.
Dyna Architecture
Interleave real experience collection with simulated planning steps using the learned model.
Continuous State MDPs
Many real-world problems have continuous state spaces (e.g., robot position, velocity) where discretization is impractical due to curse of dimensionality. Value function approximation addresses this by parameterizing V(s) ≈ V(s;θ) with a function approximator (neural network, linear basis functions, etc.). Fitted value iteration becomes: sample state transitions, compute targets using Bellman equation, and perform supervised learning to minimize ||V(s;θ) - target||².
Linear function approximation using basis functions φ(s) offers theoretical guarantees: V(s;θ) = θᵀφ(s). However, nonlinear approximators like neural networks are more flexible but lack convergence guarantees and can exhibit instability. Key practical challenges include managing function approximation error, choosing appropriate features, and handling off-policy learning where the data distribution doesn't match the current policy.
Basis Functions
Linear approximation V(s;θ) = θᵀφ(s) with convergence guarantees under gradient descent.
Neural Networks
Nonlinear function approximation enables complex value functions but requires careful regularization and stability techniques.
Linear Quadratic Regulator
The Linear Quadratic Regulator (LQR) is the optimal solution for continuous control of linear dynamical systems. Given dynamics ẋ = Ax + Bu and quadratic cost ∑(xᵀQx + uᵀRu), LQR computes an optimal linear policy u* = -Kx where K is derived from the algebraic Riccati equation. The solution is closed-form, computationally efficient, and optimal in expectation.
LQR serves as a foundation for many advanced control algorithms. The value function is quadratic: V(x) = xᵀPx where P is the solution to the Riccati equation. Finite-horizon variants of LQR can be solved via backward dynamic programming. Extensions to time-varying systems, partially observed systems (LQG), and constrained settings are well-studied. LQR's optimality and computational efficiency make it the gold standard for linear regulator problems.
Riccati Equation
Closed-form solution P via algebraic Riccati equation yields optimal feedback gain K = (R + BᵀPB)⁻¹BᵀPA.
Optimality Guarantee
Proven optimal in expectation for linear systems with quadratic costs; computational complexity O(n³) for n-dimensional state.
Differential Dynamic Programming
Differential Dynamic Programming (DDP) extends LQR to nonlinear systems by linearizing around a nominal trajectory. At each iteration, DDP computes a linear approximation of the value function along the trajectory, solves a local LQR problem, and takes a step toward the computed policy. Iterative LQR (iLQR) simplifies this by solving the local LQR subproblem without explicit second-order terms.
DDP and iLQR are trajectory optimization methods: they optimize a specific plan (sequence of states and actions) rather than a state-dependent policy. These algorithms converge rapidly (quadratically for DDP) near good solutions. Extensions include differential dynamic programming with constraints, iterative value function computation, and connections to probabilistic inference formulations of control. iLQR has become a practical workhorse in robotics due to its balance of efficiency and flexibility.
Trajectory Optimization
Optimizes finite-horizon plans via successive linearization and local LQR solution.
Fast Convergence
Quadratic convergence near optima; applicable to high-dimensional nonlinear systems in robotics.
Policy Gradient Methods
Policy gradient methods directly optimize the policy π(a|s;θ) parameterized by θ, using gradient ascent on the objective J(θ) = E[∑γᵗrₜ]. The policy gradient theorem shows that ∇J(θ) ∝ E[∇log π(a|s;θ) Q(s,a)], enabling unbiased gradient estimates. REINFORCE uses sampled rewards as estimates of Q, while more advanced methods use learned value functions V(s) as baselines to reduce variance.
Policy gradients enjoy universal function approximation (any policy can be approximated), but suffer from high variance in gradient estimates. Variance reduction techniques like baselines, generalized advantage estimation (GAE), and actor-critic methods significantly improve sample efficiency. Modern variants like PPO and TRPO use trust region optimization to stabilize learning. Policy gradients are particularly effective for continuous control and exploration-friendly policies (e.g., Gaussians).
Policy Gradient Theorem
∇J(θ) ∝ E[∇log π(a|s;θ) Q(s,a)] enables unbiased gradient estimation without value function.
Variance Reduction
Baselines V(s), GAE, and actor-critic methods dramatically reduce variance while maintaining unbiasedness.
References & Further Reading
Key References: