
Deep reinforcement learning (DRL)

RL with deep neural networks for function approximation.

Definition

Deep reinforcement learning (DRL) extends classical reinforcement learning by replacing hand-crafted feature engineering and tabular value functions with deep neural networks. This allows RL algorithms to handle high-dimensional state spaces — such as raw pixels from a game screen or joint angles from a robot — that were previously intractable. The neural network acts as a universal function approximator for the policy, the value function, or both.

The pivotal breakthrough came with DQN (Mnih et al., 2015), which combined Q-learning with a convolutional network, experience replay, and target networks to reach performance comparable to a professional human games tester across a set of 49 Atari games. Since then, the field has produced a rich family of algorithms: value-based (DQN, Rainbow), policy gradient (REINFORCE), and actor-critic (A3C, PPO, SAC, TD3). Each family makes different trade-offs between sample efficiency, stability, and applicability to continuous versus discrete action spaces.

Training deep RL agents is notoriously unstable without specific stabilization techniques. Experience replay breaks temporal correlations by storing transitions in a buffer and sampling random mini-batches. Target networks are slowly updated copies of the value network that provide stable regression targets. Advantage estimation (e.g., GAE) reduces variance in policy gradient updates. Modern algorithms such as PPO and SAC incorporate these ideas by default, making them reliable baselines for continuous control, robotics, and LLM alignment via RLHF (DPO is a related preference-tuning method that sidesteps the RL loop entirely).

How it works

Neural network policy and value functions

The state (e.g., an image, sensor vector, or token embedding) is encoded by a neural network that outputs either action probabilities (policy network) or expected returns (value network). In actor-critic methods both heads can share a backbone.
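
A minimal sketch of such a network, assuming PyTorch and a discrete action space (the class name, layer sizes, and dimensions are illustrative, not from the source):

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared encoder with separate policy and value heads (illustrative sketch)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Shared backbone encodes the raw observation vector
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # logits over actions
        self.value_head = nn.Linear(hidden, 1)           # state-value estimate

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

# Example: sample an action from the policy distribution
net = ActorCritic(obs_dim=8, n_actions=4)
logits, value = net(torch.randn(1, 8))
action = torch.distributions.Categorical(logits=logits).sample()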

Experience replay

Transitions (state, action, reward, next state, done) are stored in a replay buffer. Random mini-batches are sampled for each gradient update, breaking harmful temporal correlations and improving data efficiency.
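
A minimal replay buffer sketch using only the standard library; the transition fields follow the tuple described above, and the capacity value is an illustrative default:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks temporal correlations between consecutive steps
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones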

Target networks

A copy of the value network — updated slowly (Polyak averaging) or periodically — provides stable regression targets. Without this, gradient updates can oscillate because the target changes every step.
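
A sketch of both update styles, assuming PyTorch modules online_net and target_net with identical architectures (function names and the tau default are illustrative):

import torch

def polyak_update(online_net: torch.nn.Module, target_net: torch.nn.Module, tau: float = 0.005):
    # Soft update: target <- tau * online + (1 - tau) * target
    with torch.no_grad():
        for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)

def hard_update(online_net: torch.nn.Module, target_net: torch.nn.Module):
    # Periodic update: copy the weights wholesale every N gradient steps (DQN style)
    target_net.load_state_dict(online_net.state_dict())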

Advantage estimation

Policy gradient methods compute the advantage A(s, a) = Q(s, a) − V(s) to tell the agent how much better an action was than average. Generalized Advantage Estimation (GAE) trades off bias against variance via a hyperparameter λ, and is standard in PPO.
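
A sketch of the GAE recursion over a single rollout, assuming NumPy arrays of per-step rewards, value estimates, and done flags (the function name and defaults are illustrative):

import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T (sketch).
    last_value bootstraps the value of the state after the final step."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        # Exponentially weighted sum of TD errors, controlled by lambda
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values)  # regression targets for the value function
    return advantages, returns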

Key algorithms

Algorithm | Family | Action space | Key trait
DQN | Value-based | Discrete | Experience replay + target networks
PPO | Actor-critic | Both | Clipped policy updates for stability
SAC | Actor-critic | Continuous | Entropy maximization for exploration
TD3 | Actor-critic | Continuous | Twin critics reduce overestimation bias
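
To make the "clipped policy updates" row concrete, here is a sketch of PPO's clipped surrogate loss, assuming PyTorch tensors of new and old action log-probabilities and advantages (the function name is illustrative):

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    # Probability ratio between the updated policy and the data-collecting policy
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Pessimistic (minimum) objective, negated to form a loss to minimize
    return -torch.min(unclipped, clipped).mean()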

When to use / When NOT to use

Scenario | Use DRL | Avoid DRL
High-dimensional observations (pixels, sensors) | Yes — neural nets handle raw inputs | Prefer tabular RL if the state space is small
Simulator available for safe trial-and-error | Yes — DRL typically needs millions of samples | Avoid if only limited real-world interaction is possible
Complex, long-horizon control tasks | Yes — PPO/SAC excel at continuous control | Imitation learning is faster if expert data exists
Limited compute or interpretability required | No — DRL is compute-intensive and opaque
Simple rule-based or low-dimensional problem | No — classical RL or optimization suffices

Comparisons

Method | State space | Stability | Sample efficiency | Typical use
Tabular Q-learning | Small discrete | High | Low | Toy environments
DQN | High-dim discrete | Medium | Low-medium | Atari games
PPO | Any | High | Medium | Robotics, RLHF
SAC | Continuous | High | Higher | Robot manipulation

Pros and cons

Pros | Cons
Handles raw pixel and high-dimensional inputs | Extremely sample-inefficient vs. supervised learning
State-of-the-art on games, robotics, and LLM alignment | Hyperparameter sensitivity; unstable without tricks
PPO/SAC are robust, general-purpose baselines | Reward misspecification leads to unexpected behavior
Scales with compute and model capacity | Requires simulator or large environment interaction budget

Code examples

Minimal PPO training loop using Stable-Baselines3:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Vectorized environment for parallel rollout collection
env = make_vec_env("LunarLander-v2", n_envs=4)

model = PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=3e-4,
    n_steps=2048,        # Steps per rollout per env
    batch_size=64,
    n_epochs=10,         # Gradient epochs per rollout
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,      # PPO clipping parameter
    verbose=1,
)

model.learn(total_timesteps=500_000)
model.save("ppo_lunar_lander")

# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
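
A saved model can later be reloaded for further training or deployment. Note that LunarLander-v2 belongs to Gymnasium's Box2D environments, so the Box2D extra typically needs to be installed; the snippet below is a sketch of reloading with Stable-Baselines3:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("LunarLander-v2", n_envs=1)
model = PPO.load("ppo_lunar_lander", env=env)  # restore weights and hyperparameters

obs = env.reset()
action, _ = model.predict(obs, deterministic=True)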

Practical resources

See also