Life Lessons from Reinforcement Learning
I have watched Shusen Wang’s video lectures on reinforcement learning. I think they will greatly help me with reading Sutton’s book on reinforcement learning, especially with the notation.
Below, I write down some life lessons, followed by an AI-generated compilation of Shusen Wang’s slides.
You need to define what you want
If you do not define your reward, there is no direction on what you are optimizing for.
You need to be clear what you are optimizing for.
Otherwise, you are likely not exactly optimizing what you really want to optimize.
If you want someone to optimize for something, you need to be very careful in defining the rewards.
For example, suppose you want someone to write correct and concise code that solves LeetCode problems. Without a careful definition of what “concise” means, you might end up with unreadable code.
Your job is to suggest the next move given your current position
You cannot change your current position. You can only move forward in time.
Similar to a large language model, you are a next token predictor.
Your job is to produce the best move for yourself.
You should be aware of what your current position is and what your available moves are, and choose the best move.
There is no point agonizing over past mistakes if it does not help you make a better decision.
There exists an ideal
There exists an ideal move for every position.
There exists a correct answer to whether one position is better than another.
There exists a correct answer to whether one move is better than another.
Even if you will never be able to calculate the ideal, it does not mean that the ideal does not exist. Even if you will never be able to calculate the correct answer, it does not mean that the correct answer does not exist.
For brevity, “God” here denotes the ideal.
We live with approximations
For many early-game chess positions, you will never know whether one move is better than another.
For many decisions in life, you cannot fully enumerate all the possible actions.
You can only estimate.
You can only estimate a finite number of times.
Yet, you still need to make the best decision. Yet, you still need to learn. You need to be able to act under uncertainty. You need to implement processes that still work even with approximations.
The best action that God takes for himself might not be the best action you could take for yourself
There are chess moves played by God that you do not understand.
If you do not truly understand why the move is played, it is a bad move for you, even though it is the best move for God.
God is not going to play all the actions for you.
You should figure out the best moves for yourself. The best move for another person might not be the best move for you.
You might not be taking the best action you could take for yourself
It is possible that the action you are likely to take is not the best action you could take.
For example, you might know that exercising regularly is the best action for your health, but you choose to skip it for immediate comfort.
You will over-estimate if you sample the best outcome
You have 10 coins. You flip each of them 100 times.
It is likely that one of the coins produces more than 50, or even more than 55 heads.
You might wrongly conclude that the best coin is the one that produces the most heads, when in reality, all coins are fair coins.
Similarly, suppose you have 10 possible actions to take, and you sample the results of each action. The action with the best sampled result will tend to look better than it really is.
There are some lessons from this.
Do not look at the most successful outcome and think that you could have become like that. Plenty of luck is involved.
If you look only at the sequence of most successful outcomes, you will overestimate the success of a given set of moves.
You can learn without playing through the entire game
There are reinforcement learning methods that learn estimates based on other estimates without waiting for a final outcome.
There are things to be learned even if you do not see something through to completion.
For example, you can build small projects and gain a better appreciation of how much work the full project would take, without actually completing it.
Summary of Shusen Wang’s slides
The slides on Deep Reinforcement Learning are available on GitHub.
1. Mathematical Foundations
Random Variables and Probability
Definition: A random variable is unknown; its values depend on outcomes of random events.
Notation:
- Uppercase letter $X$ for a random variable
- Lowercase letter $x$ for an observed value
Probability Mass Function (PMF) for discrete distributions:
\[\sum_{x \in \mathcal{X}} p(x) = 1\]
Probability Density Function (PDF) for continuous distributions:
\[\int_{\mathcal{X}} p(x) \, dx = 1\]
Key Concepts
- Domain: Random variable $X$ is in the domain $\mathcal{X}$
- PDF: Gives the relative likelihood that the value of the random variable would equal that sample
2. Reinforcement Learning Basics
Agent-Environment Interaction
The RL paradigm consists of:
- Agent: The learner/decision maker (e.g., Mario in Super Mario Bros)
- Environment: Everything the agent interacts with
- State $s_t$: Current situation of the agent
- Action $a_t$: Decision made by the agent
- Reward $r_t$: Feedback from the environment
Core Terminology
Policy $\pi$
Definition: Policy function $\pi: (s, a) \mapsto [0,1]$
\[\pi(a \vert s) = \mathbb{P}(A = a \vert S = s)\]
The policy represents the probability of taking action $a$ in state $s$.
State Transition
- State transition can be random
- Randomness comes from the environment
- An action leads from old state to new state
Return (Cumulative Future Reward)
Definition: Return (aka cumulative future reward)
\[U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \cdots\]
Question: At time $t$, are $R_t$ and $R_{t+1}$ equally important?
Discounted Return:
\[U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots + \gamma^{n-t} R_n\]
where $\gamma$ is the discount factor ($0 < \gamma \leq 1$).
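A quick worked example of the discounted return, with made-up rewards and an illustrative $\gamma$ (a small Python sketch, not from the slides):

```python
# Discounted return U_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
# for a short, illustrative reward sequence.
rewards = [1.0, 0.0, 2.0, 3.0]   # r_t, r_{t+1}, r_{t+2}, r_{t+3}
gamma = 0.9

u_t = sum(gamma**i * r for i, r in enumerate(rewards))
print(u_t)   # 1 + 0 + 0.81*2 + 0.729*3 = 4.807
```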
Action-Value Function $Q_\pi(s, a)$
Definition: Action-value function (Q-function)
\[Q_\pi(s_t, a_t) = \mathbb{E}[U_t \vert S_t = s_t, A_t = a_t]\]
This represents the expected discounted return when:
- Starting in state $s_t$
- Taking action $a_t$
- Following policy $\pi$ thereafter
Practical Environment: OpenAI Gym
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms.
- Website: https://gym.openai.com/
- Provides classical control problems like:
  - Cart Pole: Balance a pole on a cart
  - Pendulum: Control an inverted pendulum
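Here is a minimal sketch of the agent-environment loop, using the `gymnasium` package (the maintained successor to OpenAI Gym; the original `gym` package has slightly different `reset()`/`step()` signatures) and a purely random policy:

```python
# A minimal agent-environment interaction loop on CartPole with a random policy.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()      # random policy: a_t ~ Uniform
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                  # accumulate the undiscounted return
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()
```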
3. Value-Based Methods
Deep Q-Network (DQN) Architecture
Input: Screenshots/observations (state $s$)
Output: Q-values for all possible actions
Architecture:
\[\text{State } s \rightarrow \text{Conv Layer} \rightarrow \text{Dense Layer} \rightarrow \begin{cases} Q(s, \text{"left"}; w) \\ Q(s, \text{"right"}; w) \\ Q(s, \text{"up"}; w) \end{cases}\]
Key Properties:
- Input shape: Size of the screenshot
- Output shape: Dimension of action space
- Uses convolutional layers for visual input processing
- Dense layers for final Q-value estimation
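Below is a hedged PyTorch sketch of such a Q-network; the input resolution (4×84×84), layer sizes, and three-action output are my own illustrative assumptions, not values from the slides:

```python
# A sketch of a DQN-style Q-network: conv layers process the screenshot,
# dense layers output one Q-value per action.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions: int = 3):
        super().__init__()
        # Convolutional layers process the screenshot (assumed 4x84x84 input).
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        )
        # Dense layers map the conv features to one Q-value per action.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(s))       # shape: (batch, num_actions)

q = QNetwork()
screens = torch.zeros(1, 4, 84, 84)          # dummy batch of screenshots
print(q(screens).shape)                      # torch.Size([1, 3])
```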
Learning Process Example
Scenario: Driving from NYC to Atlanta
- Model $Q(\mathbf{w})$ estimates travel time (e.g., 1000 minutes)
- Actual travel time: 860 minutes
- Question: How do I update the model?
Process:
- Make prediction: $q = Q(\mathbf{w}) = 1000$
- Observe actual result: $y = 860$
- Update model to reduce prediction error
4. Temporal Difference Learning
TD Target and Learning
Temporal Difference (TD) Learning provides more reliable estimates:
Example Continuation:
- Model’s estimate: $Q(\mathbf{w}) = 1000$ minutes
- Intermediate observation: NYC → DC takes 300 minutes (actual)
- Updated estimate: $300 + 600 = 900$ minutes (TD target)
TD Target: $y_t = r_t + \gamma Q_\pi(s_{t+1}, a_{t+1})$
Key Insight: TD target $y = 900$ is more reliable than the original estimate of 1000.
Loss Function: $L = \frac{1}{2}(Q(\mathbf{w}) - y)^2$
Gradient: $\frac{\partial L}{\partial \mathbf{w}} = (1000 - 900) \cdot \frac{\partial Q(\mathbf{w})}{\partial \mathbf{w}}$
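A scalar sketch of this update, using the travel-time numbers from the example and treating the model as a single adjustable estimate (the learning rate is illustrative):

```python
# Scalar sketch of the TD update in the NYC -> Atlanta example.
q_nyc_atl = 1000.0          # Q(w): estimated NYC -> Atlanta time
q_dc_atl = 600.0            # estimated DC -> Atlanta time
r = 300.0                   # observed NYC -> DC time
gamma = 1.0                 # travel times are not discounted here

td_target = r + gamma * q_dc_atl        # y = 900, partly grounded in observation
td_error = q_nyc_atl - td_target        # delta = 100
lr = 0.05
q_nyc_atl -= lr * td_error              # for a scalar "model", dQ/dw = 1
print(td_target, td_error, q_nyc_atl)   # 900.0 100.0 995.0
```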
SARSA Algorithm
SARSA (State-Action-Reward-State-Action) is an on-policy TD learning method.
Algorithm Steps:
- Observe transition $(s_t, a_t, r_t, s_{t+1})$
- Sample $a_{t+1} \sim \pi(\cdot \vert s_{t+1})$ using current policy
- Compute TD target: $y_t = r_t + \gamma Q_\pi(s_{t+1}, a_{t+1})$
Tabular Version:
For each state-action pair $(s, a)$:
\[Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma Q(s', a') - Q(s, a)]\]
Key Properties:
- On-policy: Uses the same policy for action selection and evaluation
- Updates Q-values using the next action actually taken
- Converges to the optimal Q-function under certain conditions
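A minimal sketch of one tabular SARSA update; the table size, hyperparameters, and the epsilon-greedy behavior policy are assumptions for illustration:

```python
# One tabular SARSA update on an assumed small environment.
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1

def epsilon_greedy(Q, s, eps):
    """Behavior policy: mostly greedy, occasionally random."""
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_update(s, a, r, s_next, a_next):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# One illustrative transition (s, a, r, s'); the next action a' is sampled
# from the same epsilon-greedy policy, which is what makes SARSA on-policy.
s, a, r, s_next = 0, 1, 1.0, 2
a_next = epsilon_greedy(Q, s_next, eps)
sarsa_update(s, a, r, s_next, a_next)
```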
Derive TD Target
Mathematical Derivation:
Starting with the Bellman equation:
\[Q_\pi(s_t, a_t) = \mathbb{E}[U_t \vert S_t = s_t, A_t = a_t]\]
Using the recursive property $U_t = R_t + \gamma U_{t+1}$:
\[Q_\pi(s_t, a_t) = \mathbb{E}[R_t + \gamma U_{t+1} \vert S_t = s_t, A_t = a_t] = \mathbb{E}[R_t \vert S_t = s_t, A_t = a_t] + \gamma \mathbb{E}[Q_\pi(S_{t+1}, A_{t+1}) \vert S_t = s_t, A_t = a_t]\]
This leads to the TD target:
\[y_t = r_t + \gamma Q_\pi(s_{t+1}, a_{t+1})\]
5. Deep Q-Networks (DQN)
Experience Replay
Problem: Waste of Experience
- A transition: $(s_t, a_t, r_t, s_{t+1})$
- Experience: All transitions over time
- Previously: Discard $(s_t, a_t, r_t, s_{t+1})$ after using it once
- This is wasteful…
Solution: Experience Replay
- Store transitions in a replay buffer
- Sample random batches for training
- Improves data efficiency and stability
Benefits:
- Data Efficiency: Reuse experiences multiple times
- Stability: Break correlation between consecutive samples
- Better Learning: More diverse training batches
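A minimal replay-buffer sketch along these lines; the capacity and batch size are illustrative:

```python
# A replay buffer: store transitions, sample random mini-batches for training.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        # Random sampling breaks the correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```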
DQN Improvements
Key Innovations:
- Experience Replay: Store and reuse transitions
- Target Network: Separate network for computing targets
- Clipping: Clip gradients for stability
DQN Algorithm:
- Collect experience $(s_t, a_t, r_t, s_{t+1})$
- Store in replay buffer
- Sample mini-batch from buffer
- Compute TD targets using target network
- Update main network via gradient descent
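A hedged PyTorch sketch of one such training step, combining a sampled mini-batch, a target network, and gradient clipping; the `q_net`/`target_net` modules, optimizer, and hyperparameters are assumed, not prescribed by the slides:

```python
# One DQN training step on a sampled mini-batch of transitions.
import torch
import torch.nn.functional as F

def dqn_step(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch: tensors (actions as long dtype, dones as 0/1 floats)
    states, actions, rewards, next_states, dones = batch

    # Q(s_t, a_t) from the main network, for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD targets use the *target* network and are not backpropagated through.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping for stability, as mentioned in the slides.
    torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_norm=10.0)
    optimizer.step()
    return loss.item()
```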
6. Policy Gradient Methods
Policy gradient methods directly optimize the policy $\pi_\theta$.
REINFORCE with Baseline
Policy Gradient with Baseline
Core Formula1:
\[\frac{\partial V_\pi(s_t)}{\partial \theta} = \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \vert s_t; \theta)}{\partial \theta} \cdot (Q_\pi(s_t, A_t) - V_\pi(s_t))\right] = \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \vert s_t; \theta)}{\partial \theta} \cdot Q_\pi(s_t, A_t)\right]\]
Key Insight: $g(a_t) = \frac{\partial \ln \pi(a_t \vert s_t;\theta)}{\partial \theta} \cdot (Q_\pi(s_t, a_t) - V_\pi(s_t))$
Three Approximations
- Monte Carlo Approximation: Approximate expectation using one sample $a_t$
- Return Approximation: Approximate $Q_\pi(s_t, a_t)$ by $u_t = \sum_{i=t}^n \gamma^{i-t} r_i$
- Value Network Approximation: Approximate $V_\pi(s)$ by value network $v(s; w)$
Algorithm
- Observe trajectory: $s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, …, s_n, a_n, r_n$
- Compute return: $u_t = \sum_{i=t}^n \gamma^{i-t} r_i$
- Compute error: $\delta_t = v(s_t; w) - u_t$
- Update policy network: $\theta \leftarrow \theta - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \vert s_t;\theta)}{\partial \theta}$
- Update value network: $w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t;w)}{\partial w}$
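A hedged PyTorch sketch of these two updates for a single time step $t$, assuming `policy_net(s)` returns a probability vector over actions for one state, `value_net(s)` returns a scalar, and the two optimizers carry the learning rates $\beta$ and $\alpha$:

```python
# One REINFORCE-with-baseline update at time step t.
import torch

def reinforce_baseline_step(policy_net, value_net, policy_opt, value_opt,
                            s_t, a_t, u_t):
    # Error: delta_t = v(s_t; w) - u_t, treated as a constant in both updates.
    delta = (value_net(s_t) - u_t).detach()

    # Policy update: theta <- theta - beta * delta_t * d ln pi(a_t|s_t) / d theta
    log_prob = torch.log(policy_net(s_t)[a_t])
    policy_loss = delta * log_prob
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Value update: w <- w - alpha * delta_t * d v(s_t; w) / d w
    value_loss = delta * value_net(s_t)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```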
Network Architecture
- Policy Network: State $s$ → Conv → Dense → Softmax → Action probabilities
- Value Network: State $s$ → Conv → Dense → Scalar value $v(s; w)$
- Networks can share parameters in early layers
Advantage Actor-Critic (A2C)
Architecture
Two Neural Networks:
- Policy Network (Actor): $\pi(a \vert s; \theta)$ - outputs action probabilities
- Value Network (Critic): $v(s; w)$ - outputs state value estimate
Temporal Difference Learning
TD Target: $y_t = r_t + \gamma \cdot v(s_{t+1}; w)$
TD Error: $\delta_t = v(s_t; w) - y_t$
Key Difference from REINFORCE: Uses bootstrapping with $v(s_{t+1}; w)$ instead of waiting for complete returns
Advantage Function
\[\text{Advantage} = Q_\pi(s_t, a_t) - V_\pi(s_t)\]
Approximated as: $Q_\pi(s_t, a_t) \approx r_t + \gamma \cdot V_\pi(s_{t+1})$
A2C Algorithm
- Observe transition: $(s_t, a_t, r_t, s_{t+1})$
- Compute TD target: $y_t = r_t + \gamma \cdot v(s_{t+1}; w)$
- Compute TD error: $\delta_t = v(s_t; w) - y_t$
- Update critic: $w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t;w)}{\partial w}$
- Update actor: $\theta \leftarrow \theta + \beta \cdot \frac{\partial \ln \pi(a_t \vert s_t;\theta)}{\partial \theta} \cdot (y_t - v(s_t; w))$
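A hedged PyTorch sketch of one A2C update for a single transition, under the same assumptions as the REINFORCE sketch above (`actor(s)` returns action probabilities, `critic(s)` a scalar):

```python
# One A2C update for a single transition (s, a, r, s_next).
import torch

def a2c_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, gamma=0.99):
    with torch.no_grad():
        y = r + gamma * critic(s_next)      # TD target (bootstrapped)
    delta = (critic(s) - y).detach()        # TD error, treated as a constant

    # Critic update: w <- w - alpha * delta_t * d v(s; w) / d w
    critic_loss = delta * critic(s)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: theta <- theta + beta * d ln pi(a|s) / d theta * (y - v(s; w))
    advantage = -delta                      # y - v(s; w)
    actor_loss = -advantage * torch.log(actor(s)[a])
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```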
Mathematical Foundation
Theorem 1: \(Q_\pi(s_t, a_t) = \mathbb{E}_{S_{t+1}}[R_t + \gamma \cdot V_\pi(S_{t+1})]\)
Theorem 2: \(V_\pi(s_t) = \mathbb{E}_{A_t,S_{t+1}}[R_t + \gamma \cdot V_\pi(S_{t+1})]\)
REINFORCE versus A2C
Key Differences
REINFORCE with Baseline:
- Uses Monte Carlo return: $u_t = \sum_{i=t}^n \gamma^{i-t} r_i$
- Error: $\delta_t = v(s_t; w) - u_t$
- Updates after complete episodes
- No bootstrapping - uses actual returns
- Unbiased but high variance
A2C:
- Uses TD target: $y_t = r_t + \gamma \cdot v(s_{t+1}; w)$
- TD error: $\delta_t = v(s_t; w) - y_t$
- Can update after each step
- Uses bootstrapping from value function
- Lower variance but introduces bias
Multi-step A2C
Multi-step TD target: $y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m v(s_{t+m}; w)$
This provides a spectrum between one-step TD (m=1) and Monte Carlo (m=∞).
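A small sketch of the multi-step target; the rewards, bootstrap value, and $\gamma$ are illustrative:

```python
# m-step TD target: discounted sum of m real rewards plus a bootstrapped value.
def multi_step_td_target(rewards, v_boot, gamma=0.99):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * v(s_{t+m}; w)."""
    m = len(rewards)
    y = sum(gamma**i * r for i, r in enumerate(rewards))
    return y + gamma**m * v_boot

print(multi_step_td_target([1.0, 0.0, 2.0], v_boot=5.0))   # m = 3
```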
Trade-offs
- Variance: A2C has lower variance due to bootstrapping
- Bias: REINFORCE is unbiased, A2C has bias from value approximation
- Sample Efficiency: A2C is more sample efficient
- Update Frequency: REINFORCE requires complete episodes, A2C can update online
- Complexity: REINFORCE is simpler, A2C requires managing two networks
Trust Region Policy Optimization (TRPO)
TRPO is a policy optimization algorithm that is more robust and sample efficient than standard policy gradient algorithms.
Trust Region Concept
Trust Region: A neighborhood $\mathcal{N}(\theta_{old})$ of $\theta_{old}$ where we trust our approximation of the objective function.
Example definition: \(\mathcal{N}(\theta_{old}) = \{\theta : \vert\vert\theta - \theta_{old}\vert\vert_2 \leq \Delta\}\)
Objective Function
The objective function for policy optimization: \(J(\theta) = \mathbb{E}_{S,A}\left[\frac{\pi(A\vert S;\theta)}{\pi(A\vert S;\theta_{old})} \cdot Q_\pi(S,A)\right]\)
where S is sampled from state transitions and A is sampled from policy $\pi(A\vert S;\theta_{old})$.
TRPO Algorithm
TRPO repeats two steps:
- Approximation: Given $\theta_{old}$, construct $L(\theta\vert\theta_{old})$ that approximates $J(\theta)$ in the neighborhood of $\theta_{old}$
- Maximization: In the trust region $\mathcal{N}(\theta_{old})$, find $\theta_{new}$ by: \(\theta_{new} \leftarrow \arg\max_{\theta \in \mathcal{N}(\theta_{old})} L(\theta\vert\theta_{old})\)
Implementation Steps
- Collect Trajectory: Controlled by policy $\pi(\cdot\vert s;\theta_{old})$, the agent collects a trajectory: $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_n, a_n, r_n$
- Compute Returns: For $i = 1, \ldots, n$, compute discounted returns: $u_i = \sum_{k=i}^{n} \gamma^{k-i} \cdot r_k$
- Approximation: \(\tilde{L}(\theta\vert\theta_{old}) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i\vert s_i;\theta)}{\pi(a_i\vert s_i;\theta_{old})} \cdot u_i\)
- Maximization with Constraint: \(\theta_{new} \leftarrow \arg\max_\theta \tilde{L}(\theta\vert\theta_{old})\)
Subject to one of these constraints:
- Option 1: $\vert\vert\theta - \theta_{old}\vert\vert < \Delta$
- Option 2: $\frac{1}{n}\sum_{i=1}^{n} \text{KL}[\pi(\cdot\vert s_i;\theta_{old}) \vert\vert \pi(\cdot\vert s_i;\theta)] < \Delta$
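A hedged NumPy sketch of the quantities above: the discounted returns $u_i$, the surrogate objective $\tilde{L}$, and the KL constraint check. The maximization step itself (e.g., conjugate gradient plus line search) is omitted, and the inputs are assumed to be arrays collected from the trajectory:

```python
# TRPO building blocks; the constrained maximization step is omitted.
import numpy as np

def discounted_returns(rewards, gamma):
    """u_i = sum_{k=i}^{n} gamma^(k-i) * r_k, computed backward in one pass."""
    u = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        u[i] = running
    return u

def surrogate_objective(logp_new, logp_old, returns):
    """L~(theta|theta_old) = (1/n) * sum_i pi_new(a_i|s_i)/pi_old(a_i|s_i) * u_i."""
    ratios = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(ratios * np.asarray(returns)))

def kl_constraint_satisfied(kl_terms, delta):
    """Option 2: mean KL[pi_old(.|s_i) || pi_new(.|s_i)] must stay below delta."""
    return float(np.mean(kl_terms)) < delta
```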
TRPO vs Policy Gradient
- Both are policy-based reinforcement learning methods with the same objective function: $J(\theta) = \mathbb{E}_S[V_\pi(S)]$
- Policy Gradient: Maximizes $J(\theta)$ by stochastic gradient ascent
- TRPO: Maximizes $J(\theta)$ by trust-region algorithm
Why TRPO?
- More robust than policy gradient algorithms
- More sample efficient than policy gradient algorithms
7. AlphaGo and Monte Carlo Tree Search
Overview
AlphaGo is DeepMind’s groundbreaking Go-playing AI that defeated world champion players. It combines deep neural networks with Monte Carlo Tree Search (MCTS) for sophisticated game planning.
Monte Carlo Tree Search (MCTS)
Key Idea: Use look-ahead search to evaluate actions by simulating many possible future game trajectories.
Main Components:
- Select actions by exploring the game tree
- Look ahead to evaluate whether actions lead to wins or losses
- Repeat simulations many times
- Choose the action with the highest success rate
MCTS Algorithm Steps
MCTS consists of four iterative steps:
Step 1: Selection
- Start from the root node (current state $s_t$)
- Traverse the tree using a selection policy
- Balance exploration and exploitation using UCB (Upper Confidence Bound):
\[UCB = Q(s,a) + c \sqrt{\frac{\ln N(s)}{N(s,a)}}\]
where:
- $Q(s,a)$: Average value of action $a$ in state $s$
- $N(s)$: Number of times state $s$ was visited
- $N(s,a)$: Number of times action $a$ was selected in state $s$
- $c$: Exploration constant
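A small sketch of UCB-based selection over one node’s children; the counts and values are made up, and treating $N(s)$ as the sum of the child visit counts is my own assumption:

```python
# UCB-based action selection at one node of the search tree.
import math

def ucb_score(q, n_state, n_state_action, c=1.4):
    if n_state_action == 0:
        return float("inf")           # always try an untried action first
    return q + c * math.sqrt(math.log(n_state) / n_state_action)

def select_action(Q, N_sa, c=1.4):
    n_state = sum(N_sa.values())      # assumed: N(s) = total child visits
    return max(Q, key=lambda a: ucb_score(Q[a], n_state, N_sa[a], c))

Q = {"left": 0.4, "right": 0.6}       # average values per action
N_sa = {"left": 10, "right": 5}       # visit counts per action
print(select_action(Q, N_sa))         # "right": higher value and less explored
```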
Step 2: Expansion
Question: What will be the opponent’s action?
- Given player action $a_t$, opponent’s action $a_t'$ leads to new state $s_{t+1}$
- Opponent’s action is randomly sampled from policy:
\[a_t' \sim \pi(\cdot \vert s_t'; \theta)\]
where $s_t'$ is the state observed by the opponent
Step 3: Evaluation
- Run a rollout to the end of the game (step $T$)
- Player’s action: $a_k \sim \pi(\cdot \vert s_k; \theta)$
- Opponent’s action: $a_k' \sim \pi(\cdot \vert s_k'; \theta)$
- Receive terminal reward $r_T$:
- Win: $r_T = +1$
- Lose: $r_T = -1$
- Evaluate state $s_{t+1}$ using value network:
\[v(s_{t+1}; \mathbf{w})\]
Step 4: Backup
- MCTS repeats simulations many times
- Each child of $a_t$ has multiple recorded $V(s_{t+1})$ values
- Update action-value:
\[Q(a_t) = \text{mean}(\text{recorded } V\text{ values})\]
- The Q values will be used in Step 1 (selection)
Decision Making After MCTS
After running MCTS simulations:
- $N(a)$: Number of times action $a$ has been selected
- Final decision:
\[a_t = \arg\max_a N(a)\]
- Choose the action that was explored most frequently
Key Properties:
- For each new action, AlphaGo performs MCTS from scratch
- Initialize $Q(a)$ and $N(a)$ to zeros for each new position
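A toy sketch of the backup and final-decision steps, with made-up recorded values:

```python
# Step 4 (backup) and the final decision, on illustrative simulation records.
from statistics import mean

recorded_values = {                   # V(s_{t+1}) values recorded per action
    "move_a": [0.2, 0.4, 0.3],
    "move_b": [0.6, 0.5, 0.7, 0.6],
}
Q = {a: mean(vs) for a, vs in recorded_values.items()}   # Q(a) = mean of records
N = {a: len(vs) for a, vs in recorded_values.items()}    # N(a) = visit counts
a_t = max(N, key=N.get)                                  # a_t = argmax_a N(a)
print(Q, N, a_t)                                         # picks "move_b"
```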
AlphaGo’s Neural Networks
AlphaGo uses two deep neural networks:
1. Policy Network $\pi(a \vert s; \theta)$
- Predicts probability distribution over legal moves
- Trained in three steps:
- Behavior Cloning: Train on expert human games
- Policy Gradient: Improve through self-play
- Value Network Training: Train value function
2. Value Network $v(s; \mathbf{w})$
- Evaluates board positions
- Predicts probability of winning from state $s$
- Used in MCTS evaluation step
Training Process
Training in 3 steps:
- Train a policy network using behavior cloning
  - Learn from expert human games
  - Supervised learning on professional game datasets
- Train the policy network using the policy gradient algorithm
  - Self-play reinforcement learning
  - Improve beyond human-level play
- Train a value network
  - Predict game outcomes from board positions
  - Used to evaluate leaf nodes in MCTS
MCTS Summary
- MCTS has 4 steps: selection, expansion, evaluation, and backup
- To perform one action, AlphaGo repeats the 4 steps many times to calculate $Q(a)$ and $N(a)$ for every action $a$
- AlphaGo executes the action $a$ with the highest $N(a)$ value
- To perform the next action, AlphaGo performs MCTS all over again (initializing $Q(a)$ and $N(a)$ to zeros)
Key Innovations
AlphaGo’s Success Factors:
- Combination of Deep Learning and Tree Search: Neural networks guide MCTS
- Self-Play: Continuous improvement through playing against itself
- Value and Policy Networks: Dual network architecture for position evaluation and move selection
- Efficient Search: MCTS focuses computational resources on promising moves
8. Continuous Control
Discrete vs Continuous Control
Discrete Control:
- Finite action space (e.g., left, right, up, down)
- Can enumerate all actions
- Use softmax for action probabilities
Continuous Control:
- Infinite action space (e.g., steering angle, throttle)
- Cannot enumerate actions
- Need different policy representations
Policy Network for Continuous Actions
Architecture: State → Conv → Dense → Action output
Deterministic Policy Gradient (DPG)
Goal: Increase $q(s, a; \mathbf{w})$, where $a = \pi(s; \boldsymbol{\theta})$
Dual Networks:
- Policy Network (parameters: $\boldsymbol{\theta}$)
- Value Network (parameters: $\mathbf{w}$)
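A hedged PyTorch sketch of the corresponding actor update: ascend on $q(s, \pi(s; \boldsymbol{\theta}); \mathbf{w})$ with respect to $\boldsymbol{\theta}$ only. The `policy` and `value_net` modules are assumed (the value network takes a state-action pair), and the critic’s own TD update is omitted:

```python
# Deterministic policy gradient actor update: increase q(s, pi(s; theta); w).
import torch

def dpg_actor_step(policy, value_net, policy_opt, states):
    actions = policy(states)                    # a = pi(s; theta), deterministic
    # Ascend on q(s, a; w): minimize its negative with respect to theta only.
    actor_loss = -value_net(states, actions).mean()
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()
```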
9. Multi-Agent Reinforcement Learning
Self-Interested Multi-Agent Settings
Self-Interested Setting:
- Agents are self-interested; rewards may or may not conflict
- Example: Financial trading scenarios where agents compete
Centralized Actor-Critic Method
Centralized Training:
- Controller knows all observations, actions, and rewards
- Train $\pi(a^i \vert o ; \boldsymbol{\theta}^i)$ using policy gradient
- Train $q(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i)$ using TD algorithm
Centralized Execution:
- Controller makes decisions for all agents
- Each agent sends observation $o^i$ to controller
- Controller knows $\mathbf{o} = [o^1, o^2, \ldots, o^n]$
- Controller samples actions and sends $a^i$ to each agent
Footnotes
These are my footnotes
1. I find baselines really interesting.
This is how I understand why we should use baselines: what we can sample is finite.
Assume there are 100 actions, 10 of which (preferred) have a true action value of 1000, and the other 90 (dispreferred) have an action value of 999. Without a baseline, if we sample 10 actions, gradient ascent will make the dispreferred actions more likely as well; the preferred actions are just made more likely by a slightly larger factor (roughly 1000/999). With a baseline, gradient ascent will make the dispreferred actions less likely, and the preferred actions more likely.
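A hedged numerical check of this argument, with made-up numbers: 100 actions, the first 10 worth 1000, the other 90 worth 999, a fixed batch of 10 sampled actions, and one gradient step on softmax logits:

```python
# Compare one policy-gradient step with and without a baseline.
import numpy as np

n = 100
q_true = np.where(np.arange(n) < 10, 1000.0, 999.0)   # 10 preferred, 90 dispreferred
sampled = list(range(5, 15))      # fixed batch: actions 5-9 preferred, 10-14 dispreferred
baseline = q_true.mean()          # stand-in for V(s), about 999.1
lr = 1e-4

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def updated_probs(use_baseline):
    theta = np.zeros(n)                       # uniform softmax policy
    probs = softmax(theta)
    grad = np.zeros(n)
    for a in sampled:
        g = -probs.copy()
        g[a] += 1.0                           # d ln pi(a|s) / d theta
        weight = q_true[a] - (baseline if use_baseline else 0.0)
        grad += weight * g
    return softmax(theta + lr * grad)         # one gradient-ascent step

print(updated_probs(False)[12] > 1 / n)   # True: a sampled dispreferred action got MORE likely
print(updated_probs(True)[12] < 1 / n)    # True: with a baseline it gets LESS likely
```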
I was also thinking: why not use the average of the sampled action values as the baseline, instead of training a value function? I understand that we want to approximate the value function at a given state from just one trajectory. Approximating the value function with many trajectories might be much more expensive than training a value network.