Module 3.4 — Flagship Module

Agent-Based Control and Reinforcement Learning for Dynamical Systems

Bridge differential equations and reinforcement learning: control the damped pendulum with PID, Q-learning, and policy gradient methods.

Difficulty: Advanced
Estimated Time: 5–6 hours

Prerequisites

  • Modules 1.1–1.4: ODE fundamentals, numerical solvers
  • Module 3.1: Neural Differential Equations
  • Basic probability (expectations, sampling)

Why This Matters

Every controlled dynamical system is an ODE with a control input: $\dot{x} = f(x, u)$, where $x$ is the state and $u$ is the control. Classical control theory (PID, LQR, MPC) designs $u$ using analytical models. Reinforcement learning (RL) learns $u$ from interaction with the system, requiring no explicit model.

This module makes the connection concrete using a single physical system—the damped pendulum—controlled by three paradigms: PID (hand-crafted), Q-learning (tabular value-based), and REINFORCE (policy gradient). By solving the same task three ways, you see exactly where each approach shines and where it fails. The pendulum swing-up problem is a canonical benchmark in control and RL that captures the essential challenges: nonlinearity, underactuation, and the exploration-exploitation trade-off.

Learning Objectives

  1. Derive the controlled pendulum ODE from Newton’s second law.
  2. Implement the pendulum environment with RK4 integration.
  3. Design and implement a PID controller with gains $K_p = 10$, $K_i = 0.1$, $K_d = 2$.
  4. Define the Markov Decision Process (MDP) formulation of the pendulum control problem.
  5. State the Bellman optimality equation and explain its role in Q-learning.
  6. Implement Q-learning with a $20 \times 20 \times 5$ table and $\varepsilon$-greedy exploration.
  7. Implement the REINFORCE policy gradient algorithm with a softmax policy.
  8. Compare PID, Q-learning, and REINFORCE on cumulative reward, stability, and sample efficiency.
  9. Design reward functions that encode physical objectives (swing-up, stabilization, energy minimization).
  10. Explain when RL outperforms classical control and when classical control is sufficient.

Core Concepts

The Damped Pendulum Environment

The governing ODE is:

$$\ddot{\theta} + b\dot{\theta} + \frac{g}{L}\sin(\theta) = \frac{u}{mL^2}$$

with parameters: $g = 9.81$ m/s$^2$, $L = 1.0$ m, $m = 1.0$ kg, $b = 0.1$ (damping coefficient), $\Delta t = 0.05$ s.

State: $(\theta, \omega)$ where $\omega = \dot{\theta}$. Written as a first-order system:

$$\dot{\theta} = \omega, \quad \dot{\omega} = -b\omega - \frac{g}{L}\sin(\theta) + \frac{u}{mL^2}$$

Goal: swing up from $\theta = \pi$ (hanging down) to $\theta = 0$ (upright) and stabilize.

Reward: $r = -(\theta^2 + 0.1\omega^2 + 0.001u^2)$ at each step.

Definition 3.4.1 (Markov Decision Process). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s'|s,a)$ is the transition probability, $R(s,a)$ is the reward function, and $\gamma \in [0,1)$ is the discount factor. The pendulum MDP has continuous states $(\theta, \omega)$, discrete actions $u \in \{-2, -1, 0, +1, +2\}$ N·m, deterministic transitions (ODE integration), and $\gamma = 0.99$.
Definition 3.4.2 (State-Action Value Function). The Q-function $Q^\pi(s,a)$ is the expected cumulative discounted reward starting from state $s$, taking action $a$, and following policy $\pi$ thereafter: $$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R(s_k, a_k) \;\middle|\; s_0 = s, a_0 = a\right]$$
Definition 3.4.3 (Policy). A policy $\pi(a|s)$ maps states to a probability distribution over actions. A deterministic policy outputs a single action: $a = \pi(s)$. A stochastic policy samples: $a \sim \pi(\cdot|s)$. PID is a deterministic policy; $\varepsilon$-greedy Q-learning and the softmax REINFORCE policy are stochastic during training.
Definition 3.4.4 (Bellman Optimality Equation). The optimal Q-function satisfies: $$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s', a')$$ For the deterministic pendulum, $P(s'|s,a) = 1$ for the unique next state from ODE integration, so: $$Q^*(s,a) = R(s,a) + \gamma \max_{a'} Q^*(s', a')$$
Definition 3.4.5 (Temporal Difference Learning). TD learning updates the value estimate using the bootstrapped target: $$Q(s,a) \leftarrow Q(s,a) + \alpha\!\left[R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$ where $\alpha$ is the learning rate. This is the Q-learning update rule.
Definition 3.4.6 (PID Controller). A PID controller computes the control input as: $$u(t) = K_p e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d \dot{e}(t)$$ where $e(t) = \theta_{\text{target}} - \theta(t)$ is the tracking error. Discretized: $u_n = K_p e_n + K_i \sum_{k=0}^{n} e_k \Delta t + K_d \frac{e_n - e_{n-1}}{\Delta t}$.
Definition 3.4.7 (Policy Gradient Theorem). For a parameterized policy $\pi_\theta(a|s)$, the gradient of the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}[\sum_t \gamma^t R_t]$ is: $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]$$ where $G_t = \sum_{k=t}^{T} \gamma^{k-t} R_k$ is the return-to-go from step $t$.
Definition 3.4.8 (Reward Shaping). Reward shaping adds an auxiliary reward term $F(s, s')$ to the original reward to guide learning without changing the optimal policy. A common form is potential-based shaping: $F(s, s') = \gamma \Phi(s') - \Phi(s)$ for some potential function $\Phi$.
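The shaping term can be checked in a few lines. The quadratic potential below is an illustrative choice (an assumption, not prescribed by the module); any $\Phi$ gives a shaping term that preserves the optimal policy.

```python
import numpy as np

# Hypothetical potential Phi(s) = -theta^2: higher (less negative) nearer upright.
def phi_potential(theta, omega):
    return -theta**2

def shaping(s, s_next, gamma=0.99):
    """Potential-based term F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * phi_potential(*s_next) - phi_potential(*s)

# Moving toward the upright position earns a positive bonus:
print(shaping((2.8, -1.5), (2.73, -1.82)))   # > 0
print(shaping((0.0, 0.0), (0.0, 0.0)))       # 0.0 at the goal
```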
Theorem 3.4.1 (Bellman Optimality). The optimal value function $Q^*$ is the unique fixed point of the Bellman optimality operator $\mathcal{T}^*$: $$(\mathcal{T}^* Q)(s,a) = R(s,a) + \gamma \mathbb{E}_{s'}\!\left[\max_{a'} Q(s',a')\right]$$ Furthermore, $Q^*$ defines an optimal policy: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
Theorem 3.4.2 (Policy Gradient Theorem, Sutton et al. 2000). For any differentiable policy $\pi_\theta$, the gradient of the expected return is: $$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^\pi, a \sim \pi_\theta}\!\left[Q^{\pi_\theta}(s,a) \nabla_\theta \log \pi_\theta(a|s)\right]$$ where $d^\pi$ is the stationary state distribution under $\pi_\theta$. REINFORCE uses the sample return $G_t$ as an unbiased estimate of $Q^{\pi_\theta}$.
Proof sketch. Start with $J(\theta) = \mathbb{E}_{s_0}[V^{\pi_\theta}(s_0)]$. Expand $V$ using the policy: $V(s) = \sum_a \pi_\theta(a|s)Q(s,a)$. Differentiate w.r.t. $\theta$ using the product rule (the $\nabla Q$ terms telescope via recursion), yielding the result.
Theorem 3.4.3 (Convergence of Q-Learning). Q-learning converges to $Q^*$ with probability 1 provided: (1) every state-action pair is visited infinitely often, (2) the learning rate $\alpha_n$ satisfies $\sum \alpha_n = \infty$ and $\sum \alpha_n^2 < \infty$ (Robbins-Monro conditions), and (3) the MDP has bounded rewards.
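A quick numerical look at the Robbins-Monro conditions for the harmonic schedule $\alpha_n = 1/n$ (a standard textbook example, not a recommended tuning for the labs below):

```python
import numpy as np

# Partial sums for alpha_n = 1/n: sum alpha_n diverges (grows like log n),
# while sum alpha_n^2 converges to pi^2/6. A constant alpha fails the second
# condition, which is why Theorem 3.4.3 does not formally cover alpha = 0.1.
n = np.arange(1, 1_000_001)
alpha = 1.0 / n
print(alpha.sum())          # still growing without bound as n increases
print((alpha**2).sum())     # close to pi^2/6 ≈ 1.6449
```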

Common Misconceptions

Misconception 1: “RL always outperforms classical control.” For known linear systems with quadratic cost, LQR is provably optimal with a single matrix computation. PID controllers with well-tuned gains perform well for many engineering problems. RL excels when the dynamics are unknown, highly nonlinear, or when the cost function is complex.
Misconception 2: “Q-learning works well with continuous states directly.” Tabular Q-learning requires a finite state space. With continuous states, the table must be discretized, which suffers from the curse of dimensionality. For high-dimensional continuous states, function approximation (Deep Q-Networks) is needed.
Misconception 3: “More exploration is always better.” Too much exploration ($\varepsilon$ too high) wastes time on random actions and slows convergence. Too little exploration ($\varepsilon$ too low) causes the agent to get stuck in suboptimal policies. The $\varepsilon$-decay schedule balances this trade-off.
Misconception 4: “The reward function doesn’t matter much.” Reward design is critical. A reward that only penalizes angle $\theta$ (ignoring velocity $\omega$) can produce bang-bang oscillations. Including $\omega^2$ and $u^2$ terms produces smooth, energy-efficient control.
Misconception 5: “PID controllers can’t handle nonlinear systems at all.” PID controllers work well near a linearization point. For the pendulum near $\theta = 0$, PID stabilizes effectively. PID fails for the full swing-up because $\sin(\theta)$ is far from linear when $\theta \approx \pi$.
Misconception 6: “Policy gradient methods are always better than value-based methods.” Policy gradient has high variance and often requires millions of samples. Q-learning is more sample-efficient for discrete, low-dimensional action spaces. The best choice depends on the problem structure.

Worked Examples

Example 1: PID Controller Design

Linearize the pendulum around $\theta = 0$: $\sin(\theta) \approx \theta$, so $\ddot{\theta} + 0.1\dot{\theta} + 9.81\theta = u$.

Choose PID gains: $K_p = 10$, $K_i = 0.1$, $K_d = 2$. Target: $\theta_{\text{target}} = 0$.

Step-by-step from $\theta_0 = 0.5$, $\omega_0 = 0$, $\Delta t = 0.05$:

Step 0: $e_0 = 0 - 0.5 = -0.5$, integral $= -0.5 \times 0.05 = -0.025$, derivative $= -0.5/0.05 = -10$.
$u_0 = 10(-0.5) + 0.1(-0.025) + 2(-10) = -5.0 - 0.0025 - 20.0 = -25.0025$. Clamp to the torque limit: $u_0 = -2$.

Step 1: Integrate the ODE with $u = -2$ (one Euler step for the hand calculation, with $\sin(0.5) \approx 0.479$; the code lab uses RK4): $\omega_1 = 0 + 0.05(-0.1 \times 0 - 9.81 \times 0.479 - 2) = -0.335$, $\theta_1 = 0.5 + 0.05(-0.335) = 0.483$.
$e_1 = -0.483$, integral $= -0.025 + (-0.483)(0.05) = -0.049$, deriv $= (-0.483 + 0.5)/0.05 = 0.34$.
$u_1 = 10(-0.483) + 0.1(-0.049) + 2(0.34) = -4.83 - 0.005 + 0.68 = -4.155$. Clamp: $u_1 = -2$.

The PID controller demands torques well beyond the $\pm 2$ N·m actuator limit, so its output saturates during the initial transient. After about 50 steps (2.5 seconds), the pendulum stabilizes near $\theta = 0$. PID works well for stabilization but cannot perform the full swing-up from $\theta = \pi$.
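The two hand-computed steps can be reproduced in a few lines (semi-implicit Euler, as in the worked steps; small differences from the hand-rounded values are expected):

```python
import numpy as np

Kp, Ki, Kd, dt = 10.0, 0.1, 2.0, 0.05
g, L, m, b = 9.81, 1.0, 1.0, 0.1

theta, omega = 0.5, 0.0
integral, prev_error = 0.0, 0.0
raw_torques, clamped = [], []
for step in range(2):
    error = 0.0 - theta
    integral += error * dt
    derivative = (error - prev_error) / dt
    prev_error = error
    u_raw = Kp * error + Ki * integral + Kd * derivative
    u = float(np.clip(u_raw, -2.0, 2.0))
    raw_torques.append(u_raw); clamped.append(u)
    # one Euler step of the pendulum ODE with the clamped torque
    omega += dt * (-b * omega - (g / L) * np.sin(theta) + u / (m * L**2))
    theta += dt * omega

print(raw_torques)   # both raw outputs exceed the ±2 N·m limit
print(clamped)       # [-2.0, -2.0]
```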

Example 2: Q-Learning Update Step

Discretize: $\theta \in [-\pi, \pi]$ into 20 bins (bin width $= \pi/10$), $\omega \in [-8, 8]$ into 20 bins (bin width $= 0.8$). Actions: $\{-2, -1, 0, +1, +2\}$.

One Q-learning step:

  1. Observe state: $\theta = 2.8$, $\omega = -1.5$. Bins: $i_\theta = \lfloor(2.8 + \pi)/(2\pi/20)\rfloor = 18$, $i_\omega = \lfloor(-1.5 + 8)/0.8\rfloor = 8$.
  2. Choose action ($\varepsilon$-greedy): With $\varepsilon = 0.1$, probability 0.9 pick $a^* = \arg\max_a Q[18, 8, a]$. Suppose $Q[18, 8, :] = [-3.2, -2.1, -2.8, -1.5, -2.0]$. Then $a^* = 3$ (action $+1$ N·m).
  3. Execute action: Integrate pendulum with $u = +1$ for $\Delta t = 0.05$. Get $\theta' = 2.73$, $\omega' = -1.82$. Reward: $r = -(2.8^2 + 0.1(1.5)^2 + 0.001(1)^2) = -(7.84 + 0.225 + 0.001) = -8.066$.
  4. Compute bins for next state: $i'_\theta = 18$, $i'_\omega = 7$.
  5. Update Q: With $\alpha = 0.1$, $\gamma = 0.99$: $Q[18, 8, 3] \leftarrow -1.5 + 0.1[-8.066 + 0.99 \times \max_a Q[18, 7, a] - (-1.5)]$.
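The update in step 5 can be checked numerically. The next-state values $Q[18, 7, :]$ below are hypothetical, since the example does not specify them:

```python
import numpy as np

alpha, gamma = 0.1, 0.99
Q_sa = np.array([-3.2, -2.1, -2.8, -1.5, -2.0])    # Q[18, 8, :] from step 2
Q_next = np.array([-3.0, -2.5, -2.2, -1.8, -2.4])  # Q[18, 7, :] -- hypothetical

a = int(np.argmax(Q_sa))                           # greedy action: index 3 (+1 N·m)
r = -(2.8**2 + 0.1 * 1.5**2 + 0.001 * 1.0**2)      # reward from step 3: -8.066
td_target = r + gamma * Q_next.max()               # bootstrapped TD target
Q_sa[a] += alpha * (td_target - Q_sa[a])
print(f"a* = {a}, r = {r:.3f}, new Q = {Q_sa[a]:.4f}")
```

With these hypothetical next-state values, the updated entry moves from $-1.5$ to about $-2.33$: the table absorbs a fraction $\alpha$ of the TD error per visit.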

Example 3: REINFORCE Gradient Computation

Softmax policy over 5 actions with linear features: $\pi_\theta(a|s) = \frac{\exp(\theta_a^T \phi(s))}{\sum_{a'}\exp(\theta_{a'}^T \phi(s))}$ where $\phi(s) = [\theta, \omega, 1]^T$.

One trajectory of 3 steps:

  1. $s_0 = (2.5, 0)$, $a_0 = +2$, $r_0 = -6.25$
  2. $s_1 = (2.3, -0.8)$, $a_1 = +1$, $r_1 = -5.35$
  3. $s_2 = (2.0, -1.2)$, $a_2 = +2$, $r_2 = -4.14$

Returns (no discounting for simplicity): $G_0 = -6.25 - 5.35 - 4.14 = -15.74$, $G_1 = -5.35 - 4.14 = -9.49$, $G_2 = -4.14$.

Gradient: $\nabla_\theta J \approx \sum_{t=0}^{2} G_t \nabla_\theta \log \pi_\theta(a_t|s_t)$.

For softmax: $\nabla_\theta \log \pi_\theta(a|s) = \phi(s)(e_a - \pi_\theta(\cdot|s))$ where $e_a$ is the one-hot vector for action $a$. The large negative returns push probability away from actions taken in high-cost states.
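One gradient-accumulation pass over this trajectory can be sketched as follows, assuming zero-initialized weights (so the policy starts uniform; this initialization is an assumption, not stated in the example):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

phi = lambda th, om: np.array([th, om, 1.0])   # features [theta, omega, 1]
W = np.zeros((5, 3))                           # zero init => uniform policy

# Example 3 trajectory: actions +2 and +1 are indices 4 and 3 in {-2,-1,0,+1,+2}
trajectory = [((2.5, 0.0), 4), ((2.3, -0.8), 3), ((2.0, -1.2), 4)]
returns = [-15.74, -9.49, -4.14]               # G_0, G_1, G_2 from the text

grad = np.zeros_like(W)
for (s, a), G_t in zip(trajectory, returns):
    f = phi(*s)
    p = softmax(W @ f)                         # all 0.2 at zero weights
    grad += G_t * np.outer(np.eye(5)[a] - p, f)   # G_t * grad log pi(a|s)

# Negative returns push probability away from the actions taken:
print(grad[4, 2], grad[0, 2])   # bias column: taken-action row down, others up
```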

Interactive Code Labs

Code Lab 1: PID Controller for the Pendulum


import numpy as np
import matplotlib.pyplot as plt

# ── Pendulum Environment ─────────────────────────────────────
class Pendulum:
    """Damped pendulum: theta'' + b*theta' + (g/L)*sin(theta) = u/(m*L^2)"""
    def __init__(self):
        self.g = 9.81; self.L = 1.0; self.m = 1.0; self.b = 0.1
        self.dt = 0.05
        self.reset()

    def reset(self):
        self.theta = np.pi  # hanging down
        self.omega = 0.0
        return np.array([self.theta, self.omega])

    def _dynamics(self, theta, omega, u):
        dtheta = omega
        domega = -self.b * omega - (self.g / self.L) * np.sin(theta) + u / (self.m * self.L**2)
        return dtheta, domega

    def step(self, u):
        u = np.clip(u, -2.0, 2.0)
        # RK4 integration
        th, om = self.theta, self.omega
        k1_th, k1_om = self._dynamics(th, om, u)
        k2_th, k2_om = self._dynamics(th + 0.5*self.dt*k1_th, om + 0.5*self.dt*k1_om, u)
        k3_th, k3_om = self._dynamics(th + 0.5*self.dt*k2_th, om + 0.5*self.dt*k2_om, u)
        k4_th, k4_om = self._dynamics(th + self.dt*k3_th, om + self.dt*k3_om, u)
        self.theta += (self.dt / 6) * (k1_th + 2*k2_th + 2*k3_th + k4_th)
        self.omega += (self.dt / 6) * (k1_om + 2*k2_om + 2*k3_om + k4_om)
        # Wrap theta to [-pi, pi]
        self.theta = ((self.theta + np.pi) % (2*np.pi)) - np.pi
        reward = -(self.theta**2 + 0.1*self.omega**2 + 0.001*u**2)
        return np.array([self.theta, self.omega]), reward

# ── PID Controller ───────────────────────────────────────────
class PIDController:
    def __init__(self, Kp=10.0, Ki=0.1, Kd=2.0, dt=0.05):
        self.Kp, self.Ki, self.Kd, self.dt = Kp, Ki, Kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def reset(self):
        self.integral = 0.0
        self.prev_error = 0.0

    def compute(self, theta, target=0.0):
        error = target - theta
        # Wrap error to [-pi, pi]
        error = ((error + np.pi) % (2*np.pi)) - np.pi
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.Kp * error + self.Ki * self.integral + self.Kd * derivative
        return np.clip(u, -2.0, 2.0)

# ── Run PID simulation ──────────────────────────────────────
env = Pendulum()
pid = PIDController(Kp=10.0, Ki=0.1, Kd=2.0)
state = env.reset()
pid.reset()

n_steps = 500
thetas, omegas, controls, rewards = [], [], [], []

for t in range(n_steps):
    u = pid.compute(state[0])
    state, r = env.step(u)
    thetas.append(state[0]); omegas.append(state[1])
    controls.append(u); rewards.append(r)

time = np.arange(n_steps) * env.dt
fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True)
axes[0].plot(time, thetas, 'b-', lw=1.5)
axes[0].axhline(0, color='r', ls='--', alpha=0.5, label='target')
axes[0].set_ylabel('theta (rad)'); axes[0].legend(); axes[0].set_title('PID Control of Damped Pendulum')
axes[0].grid(True, alpha=0.3)

axes[1].plot(time, omegas, 'g-', lw=1.5)
axes[1].set_ylabel('omega (rad/s)'); axes[1].grid(True, alpha=0.3)

axes[2].plot(time, controls, 'r-', lw=1.5)
axes[2].set_ylabel('u (N*m)'); axes[2].set_xlabel('Time (s)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print(f"Total reward: {sum(rewards):.1f}")
print(f"Final theta: {thetas[-1]:.4f} rad, Final omega: {omegas[-1]:.4f} rad/s")

Exploration ideas:

  • Change starting position from $\pi$ (bottom) to $0.5$ (near top). Does PID succeed now?
  • Increase $K_p$ to 50. What happens?
  • Remove the integral term ($K_i = 0$). Is there steady-state error?

Code Lab 2: Q-Learning Agent


import numpy as np
import matplotlib.pyplot as plt

# ── Pendulum Environment (same as Code Lab 1) ───────────────
class Pendulum:
    def __init__(self):
        self.g = 9.81; self.L = 1.0; self.m = 1.0; self.b = 0.1
        self.dt = 0.05
        self.reset()
    def reset(self):
        self.theta = np.pi; self.omega = 0.0
        return np.array([self.theta, self.omega])
    def _dynamics(self, theta, omega, u):
        return omega, -self.b*omega - (self.g/self.L)*np.sin(theta) + u/(self.m*self.L**2)
    def step(self, u):
        u = np.clip(u, -2.0, 2.0)
        th, om = self.theta, self.omega
        k1t, k1o = self._dynamics(th, om, u)
        k2t, k2o = self._dynamics(th+0.5*self.dt*k1t, om+0.5*self.dt*k1o, u)
        k3t, k3o = self._dynamics(th+0.5*self.dt*k2t, om+0.5*self.dt*k2o, u)
        k4t, k4o = self._dynamics(th+self.dt*k3t, om+self.dt*k3o, u)
        self.theta += (self.dt/6)*(k1t+2*k2t+2*k3t+k4t)
        self.omega += (self.dt/6)*(k1o+2*k2o+2*k3o+k4o)
        self.theta = ((self.theta+np.pi)%(2*np.pi))-np.pi
        self.omega = np.clip(self.omega, -8.0, 8.0)
        reward = -(self.theta**2 + 0.1*self.omega**2 + 0.001*u**2)
        return np.array([self.theta, self.omega]), reward

# ── Q-Learning Setup ─────────────────────────────────────────
n_theta_bins = 20
n_omega_bins = 20
actions = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n_actions = len(actions)

Q = np.zeros((n_theta_bins, n_omega_bins, n_actions))

def discretize(theta, omega):
    i_th = int(np.clip((theta + np.pi) / (2*np.pi) * n_theta_bins, 0, n_theta_bins - 1))
    i_om = int(np.clip((omega + 8.0) / 16.0 * n_omega_bins, 0, n_omega_bins - 1))
    return i_th, i_om

# ── Training ─────────────────────────────────────────────────
alpha = 0.1      # learning rate
gamma = 0.99     # discount factor
n_episodes = 500
max_steps = 200

eps_start = 1.0
eps_end = 0.01
episode_rewards = []

env = Pendulum()

for ep in range(n_episodes):
    state = env.reset()
    eps = eps_start - (eps_start - eps_end) * ep / n_episodes
    total_reward = 0

    for step in range(max_steps):
        i_th, i_om = discretize(state[0], state[1])

        # Epsilon-greedy action selection
        if np.random.rand() < eps:
            a_idx = np.random.randint(n_actions)
        else:
            a_idx = np.argmax(Q[i_th, i_om, :])

        u = actions[a_idx]
        next_state, reward = env.step(u)
        total_reward += reward

        # Q-learning update
        ni_th, ni_om = discretize(next_state[0], next_state[1])
        td_target = reward + gamma * np.max(Q[ni_th, ni_om, :])
        Q[i_th, i_om, a_idx] += alpha * (td_target - Q[i_th, i_om, a_idx])

        state = next_state

    episode_rewards.append(total_reward)
    if (ep + 1) % 100 == 0:
        avg = np.mean(episode_rewards[-50:])
        print(f"Episode {ep+1}: avg reward (last 50) = {avg:.1f}, eps = {eps:.3f}")

# ── Evaluate learned policy ──────────────────────────────────
state = env.reset()
q_thetas, q_omegas, q_controls, q_rewards = [], [], [], []
for step in range(500):
    i_th, i_om = discretize(state[0], state[1])
    a_idx = np.argmax(Q[i_th, i_om, :])
    u = actions[a_idx]
    state, r = env.step(u)
    q_thetas.append(state[0]); q_omegas.append(state[1])
    q_controls.append(u); q_rewards.append(r)

# ── PID baseline for comparison ──────────────────────────────
class PIDController:
    def __init__(self, Kp=10.0, Ki=0.1, Kd=2.0, dt=0.05):
        self.Kp, self.Ki, self.Kd, self.dt = Kp, Ki, Kd, dt
        self.integral = 0.0; self.prev_error = 0.0
    def reset(self): self.integral = 0.0; self.prev_error = 0.0
    def compute(self, theta):
        error = ((0 - theta + np.pi) % (2*np.pi)) - np.pi
        self.integral += error * self.dt
        deriv = (error - self.prev_error) / self.dt
        self.prev_error = error
        return np.clip(self.Kp*error + self.Ki*self.integral + self.Kd*deriv, -2, 2)

env2 = Pendulum(); pid = PIDController(); state2 = env2.reset(); pid.reset()
pid_thetas = []
for _ in range(500):
    u = pid.compute(state2[0])
    state2, _ = env2.step(u)
    pid_thetas.append(state2[0])

# ── Plots ────────────────────────────────────────────────────
time = np.arange(500) * 0.05
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Training curve
axes[0,0].plot(episode_rewards, alpha=0.3, color='blue')
# Smoothed
window = 20
smoothed = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
axes[0,0].plot(range(window-1, n_episodes), smoothed, 'b-', lw=2)
axes[0,0].set_xlabel('Episode'); axes[0,0].set_ylabel('Total Reward')
axes[0,0].set_title('Q-Learning Training'); axes[0,0].grid(True, alpha=0.3)

# Q-learning vs PID trajectories
axes[0,1].plot(time, q_thetas, 'b-', lw=1.5, label='Q-Learning')
axes[0,1].plot(time, pid_thetas, 'r--', lw=1.5, label='PID')
axes[0,1].axhline(0, color='k', ls=':', alpha=0.3)
axes[0,1].set_xlabel('Time (s)'); axes[0,1].set_ylabel('theta (rad)')
axes[0,1].legend(); axes[0,1].set_title('Evaluation: theta(t)')
axes[0,1].grid(True, alpha=0.3)

# Q-learning controls
axes[1,0].plot(time, q_controls, 'b-', lw=1)
axes[1,0].set_xlabel('Time (s)'); axes[1,0].set_ylabel('u (N*m)')
axes[1,0].set_title('Q-Learning Control Input'); axes[1,0].grid(True, alpha=0.3)

# Q-table heatmap (best action per state)
best_actions = np.argmax(Q, axis=2)
im = axes[1,1].imshow(best_actions.T, origin='lower', aspect='auto',
                       extent=[-np.pi, np.pi, -8, 8], cmap='RdYlBu')
plt.colorbar(im, ax=axes[1,1], label='Best action index')
axes[1,1].set_xlabel('theta'); axes[1,1].set_ylabel('omega')
axes[1,1].set_title('Learned Policy (Q-table)')

plt.tight_layout()
plt.show()

print(f"Q-learning total reward: {sum(q_rewards):.1f}")
print(f"Q-learning final theta: {q_thetas[-1]:.4f}")

Exploration ideas:

  • Increase the Q-table resolution to $40 \times 40 \times 5$. Does performance improve? How does training time change?
  • Try different $\varepsilon$ decay schedules (exponential vs linear). Which converges faster?
  • Modify the reward to $r = \cos(\theta) - 1$ (maximum at $\theta = 0$). How does the learned policy change?
  • Add a continuous action baseline using the same Q-learning but with actions $\{-2, -1.5, -1, \ldots, 1.5, 2\}$ (9 actions).
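The module's third controller, REINFORCE, can be sketched along the same lines as the labs above. The learning rate, episode counts, and mean-return baseline below are illustrative assumptions, not tuned values, and a linear softmax policy is generally too weak for a reliable swing-up, so treat this as a starting point rather than a finished solution:

```python
import numpy as np

# Pendulum environment as in Code Labs 1 and 2 (RK4, wrapped theta, clipped omega)
class Pendulum:
    def __init__(self):
        self.g, self.L, self.m, self.b, self.dt = 9.81, 1.0, 1.0, 0.1, 0.05
    def reset(self):
        self.theta, self.omega = np.pi, 0.0
        return np.array([self.theta, self.omega])
    def _dyn(self, th, om, u):
        return om, -self.b*om - (self.g/self.L)*np.sin(th) + u/(self.m*self.L**2)
    def step(self, u):
        th, om, dt = self.theta, self.omega, self.dt
        k1t, k1o = self._dyn(th, om, u)
        k2t, k2o = self._dyn(th + 0.5*dt*k1t, om + 0.5*dt*k1o, u)
        k3t, k3o = self._dyn(th + 0.5*dt*k2t, om + 0.5*dt*k2o, u)
        k4t, k4o = self._dyn(th + dt*k3t, om + dt*k3o, u)
        self.theta = ((th + (dt/6)*(k1t+2*k2t+2*k3t+k4t) + np.pi) % (2*np.pi)) - np.pi
        self.omega = np.clip(om + (dt/6)*(k1o+2*k2o+2*k3o+k4o), -8.0, 8.0)
        reward = -(self.theta**2 + 0.1*self.omega**2 + 0.001*u**2)
        return np.array([self.theta, self.omega]), reward

actions = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
phi = lambda s: np.array([s[0], s[1], 1.0])   # linear features [theta, omega, 1]
W = np.zeros((5, 3))                          # softmax weights, one row per action

def policy(s):
    z = W @ phi(s)
    z -= z.max()                              # numerical stability
    p = np.exp(z)
    return p / p.sum()

env = Pendulum()
rng = np.random.default_rng(0)
lr, gamma = 1e-4, 0.99                        # illustrative hyperparameters
ep_returns = []

for ep in range(200):
    s = env.reset()
    states, acts, rews = [], [], []
    for _ in range(200):
        a = rng.choice(5, p=policy(s))        # sample from the stochastic policy
        s_next, r = env.step(actions[a])
        states.append(s); acts.append(a); rews.append(r)
        s = s_next
    ep_returns.append(sum(rews))
    # returns-to-go G_t, then one gradient-ascent step with a mean baseline
    G = np.zeros(len(rews))
    running = 0.0
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        G[t] = running
    G -= G.mean()                             # baseline for variance reduction
    grad = np.zeros_like(W)
    for s_t, a_t, G_t in zip(states, acts, G):
        p = policy(s_t)
        grad += G_t * np.outer(np.eye(5)[a_t] - p, phi(s_t))
    W += lr * grad

print(f"first/last episode return: {ep_returns[0]:.1f} / {ep_returns[-1]:.1f}")
```

The mean baseline does not change the expected gradient (it multiplies a term whose expectation is zero) but substantially reduces its variance, which is the main practical obstacle with REINFORCE.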

Agent Lens: From ODE to MDP — The Complete Picture

This module is the agent module. The Agent Lens here synthesizes the ODE-to-MDP mapping that has appeared as a sidebar throughout the curriculum.

The Explicit Mapping

| ODE Concept | MDP Concept | Pendulum Instance |
| --- | --- | --- |
| State $x(t)$ | State $s_t$ | $(\theta_t, \omega_t)$ |
| Control input $u(t)$ | Action $a_t$ | Torque $\in \{-2, -1, 0, +1, +2\}$ N·m |
| ODE integration to $x(t+\Delta t)$ | Transition $P(s_{t+1} \mid s_t, a_t)$ | RK4 step of the pendulum ODE |
| Cost functional $J = \int_0^T c(x,u)\,dt$ | Cumulative reward $\sum_t \gamma^t r_t$ | $\sum_t -(\theta_t^2 + 0.1\omega_t^2 + 0.001 u_t^2)$ |
| Optimal control law $u^*(x)$ | Optimal policy $\pi^*(s)$ | $\arg\max_a Q^*(s,a)$ |

Controller Comparison

| Property | PID | Q-Learning | Policy Gradient |
| --- | --- | --- | --- |
| Requires model? | Yes (linearized) | No | No |
| Sample efficiency | N/A (no training) | Moderate | Low |
| Handles nonlinearity | Near setpoint only | Yes (via discretization) | Yes |
| Stability guarantees | Yes (Lyapunov) | Asymptotic only | None in general |
| Continuous actions | Natural | Requires discretization | Natural |
| Swing-up capable? | No (from $\pi$) | Yes (with enough training) | Yes (with enough training) |

When RL outperforms classical control: unknown dynamics, high-dimensional state spaces, complex nonlinear constraints, when the cost function is hard to express analytically, and when the system changes over time (online adaptation).

When classical control is sufficient: known linear (or mildly nonlinear) dynamics, need for formal stability guarantees, real-time hard constraints, and when interpretability is critical.

Exercises

Analytical

Exercise 1. Derive the Bellman equation for the pendulum MDP with $\gamma = 0.99$ and the reward $r = -(\theta^2 + 0.1\omega^2 + 0.001u^2)$. Write it explicitly for state $(\theta, \omega)$ and action $u$.

Solution

$Q^*(\theta, \omega, u) = -(\theta^2 + 0.1\omega^2 + 0.001u^2) + 0.99 \max_{u'} Q^*(\theta', \omega', u')$ where $(\theta', \omega')$ is obtained by one RK4 step of the pendulum ODE with torque $u$.

Exercise 2. Linearize the pendulum around $\theta = 0$ and compute the transfer function $G(s) = \Theta(s)/U(s)$ in the Laplace domain.

Solution

Linearized: $\ddot{\theta} + 0.1\dot{\theta} + 9.81\theta = u$. Taking Laplace transforms: $(s^2 + 0.1s + 9.81)\Theta(s) = U(s)$. So $G(s) = 1/(s^2 + 0.1s + 9.81)$.
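The poles of $G(s)$ can be confirmed numerically:

```python
import numpy as np

# Poles of G(s) = 1/(s^2 + 0.1 s + 9.81)
poles = np.roots([1.0, 0.1, 9.81])
print(poles)
# Real part -0.05 (light damping); imaginary part ~3.13 rad/s, essentially
# the undamped natural frequency sqrt(g/L) = sqrt(9.81).
```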

Exercise 3. Show that the reward $r = -(\theta^2 + 0.1\omega^2 + 0.001u^2)$ is equivalent to a quadratic cost in LQR form $c = x^T Q x + u^T R u$ and identify $Q$ and $R$.

Solution

With $x = [\theta, \omega]^T$: $Q = \begin{pmatrix}1 & 0\\0 & 0.1\end{pmatrix}$, $R = [0.001]$. The reward is $r = -(x^T Q x + u^T R u)$.

Exercise 4. For the softmax policy $\pi_\theta(a|s) \propto \exp(\theta_a^T \phi(s))$, compute $\nabla_{\theta_a} \log \pi_\theta(a|s)$ and verify it equals $\phi(s)(1 - \pi_\theta(a|s))$.

Solution

$\log \pi_\theta(a|s) = \theta_a^T \phi(s) - \log \sum_{a'} \exp(\theta_{a'}^T \phi(s))$. $\nabla_{\theta_a} \log \pi_\theta(a|s) = \phi(s) - \pi_\theta(a|s)\phi(s) = \phi(s)(1 - \pi_\theta(a|s))$.
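A finite-difference check of this gradient formula (the feature vector and random weights below are arbitrary test values, not taken from the module):

```python
import numpy as np

rng = np.random.default_rng(1)
feat = np.array([2.5, 0.0, 1.0])             # phi(s) for an arbitrary state
theta_w = rng.normal(size=(5, 3))            # arbitrary softmax weights
a = 3                                        # action whose log-prob we differentiate

def log_pi(w, a):
    z = w @ feat
    z = z - z.max()                          # stable log-softmax
    return z[a] - np.log(np.exp(z).sum())

pi = np.exp(theta_w @ feat - (theta_w @ feat).max())
pi = pi / pi.sum()
analytic = feat * (1.0 - pi[a])              # the formula from the solution

eps = 1e-6
numeric = np.zeros(3)
for j in range(3):                           # central differences in theta_a
    wp, wm = theta_w.copy(), theta_w.copy()
    wp[a, j] += eps
    wm[a, j] -= eps
    numeric[j] = (log_pi(wp, a) - log_pi(wm, a)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # tiny: analytic gradient confirmed
```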

Computational

Exercise 5. Implement SARSA (on-policy TD) for the pendulum and compare its learning curve with Q-learning (off-policy). Run 500 episodes with the same hyperparameters.

Exercise 6. Implement a simple actor-critic: use the Q-table as the critic and a softmax policy (parameterized by a weight matrix) as the actor. Train for 500 episodes.

Exercise 7. Tune the PID gains by grid search over $K_p \in \{5, 10, 20, 50\}$, $K_d \in \{1, 2, 5\}$ (fix $K_i = 0.1$). Starting from $\theta = 0.3$ (near upright), report the gains that minimize settling time.

Exercise 8. Run Q-learning with $\gamma = 0.9, 0.95, 0.99, 0.999$ and compare total reward after 500 episodes. Plot learning curves for all four values.

Agentic

Exercise 9. Design a multi-objective reward function that balances swing-up speed, energy efficiency, and stabilization accuracy. Parameterize it as $r = -w_1 \theta^2 - w_2 \omega^2 - w_3 u^2$ and search over $(w_1, w_2, w_3)$ to find weights that achieve swing-up within 200 steps while keeping $\sum u^2$ below a threshold.

Exercise 10. Implement curriculum learning for the pendulum: first train the agent starting from $\theta = 0.5$ (easy), then gradually increase the starting angle to $\pi$ (hard) over the course of training. Compare with training from $\pi$ throughout.

Assessment

Quiz

Q1. What are the five components of an MDP?

Answer

State space $\mathcal{S}$, action space $\mathcal{A}$, transition function $P(s'|s,a)$, reward function $R(s,a)$, discount factor $\gamma$.

Q2. What does the Q-function $Q^\pi(s,a)$ represent?

Answer

The expected cumulative discounted reward when taking action $a$ in state $s$ and following policy $\pi$ thereafter.

Q3. Why does tabular Q-learning require discretization of the pendulum state?

Answer

A Q-table has a finite number of entries. Continuous states $(\theta, \omega)$ must be mapped to discrete bins to index into the table.

Q4. What is $\varepsilon$-greedy exploration?

Answer

With probability $\varepsilon$, choose a random action; with probability $1-\varepsilon$, choose the greedy action $\arg\max_a Q(s,a)$. $\varepsilon$ is decayed during training.

Q5. Why can’t PID swing up the pendulum from $\theta = \pi$?

Answer

PID is designed for linearized dynamics near $\theta = 0$. At $\theta = \pi$, $\sin(\theta) \approx \theta$ is invalid; the linearized controller does not produce the correct sequence of pushes needed for swing-up.

Q6. What is the role of the discount factor $\gamma$?

Answer

$\gamma$ controls how much the agent values future rewards versus immediate rewards. $\gamma = 0$: purely myopic. $\gamma \to 1$: values long-term outcomes equally. For the pendulum, $\gamma = 0.99$ encourages the agent to plan ahead for swing-up.

Q7. What is the REINFORCE gradient estimate?

Answer

$\nabla_\theta J \approx \sum_t G_t \nabla_\theta \log \pi_\theta(a_t|s_t)$ where $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$ is the return-to-go.

Q8. What are the Robbins-Monro conditions for Q-learning convergence?

Answer

The learning rate sequence must satisfy $\sum_n \alpha_n = \infty$ (visit each state-action infinitely) and $\sum_n \alpha_n^2 < \infty$ (variance decreases). A constant $\alpha = 0.1$ does not formally satisfy these, but works well in practice for finite problems.

Q9. Why is reward shaping important for the pendulum?

Answer

Without velocity and control penalties ($\omega^2$ and $u^2$ terms), the agent may learn bang-bang control that rapidly oscillates the torque, producing physically undesirable behaviour. Reward shaping encodes physical preferences (smooth control, energy efficiency).

Q10. Name one advantage and one disadvantage of policy gradient vs Q-learning.

Answer

Advantage: policy gradient handles continuous actions naturally. Disadvantage: high variance in gradient estimates requires many samples for convergence.

Mini-Project: Control Tournament

Implement PID, Q-learning, and REINFORCE for the damped pendulum. Run a controlled comparison.

  1. Train Q-learning and REINFORCE for 500 episodes each
  2. Evaluate each controller for 100 episodes from random initial states $\theta \sim \text{Uniform}[-\pi, \pi]$
  3. Record: mean cumulative reward, success rate (reaches $|\theta| < 0.1$ within 200 steps), computation time
  4. Produce a comparison table and learning curves
| Criterion | Weight | Description |
| --- | --- | --- |
| Correctness | 30% | All three controllers are correctly implemented and produce reasonable behaviour |
| Fair comparison | 25% | Evaluation uses identical initial conditions and environment settings |
| Analysis | 20% | Discusses strengths/weaknesses of each approach with evidence |
| Visualization | 15% | Clear learning curves, trajectory comparisons, and summary table |
| Code quality | 10% | Modular code with shared environment class |

References & Next Steps

References

  1. Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press.
  2. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). “Continuous control with deep reinforcement learning.” ICLR 2016.
  3. Doya, K. (2000). “Reinforcement learning in continuous time and space.” Neural Computation, 12(1), 219–245.
  4. Åström, K. J. & Murray, R. M. (2008). Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press.
  5. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). “Policy gradient methods for reinforcement learning with function approximation.” NeurIPS 2000.
  6. Williams, R. J. (1992). “Simple statistical gradient-following algorithms for connectionist reinforcement learning.” Machine Learning, 8, 229–256.

Next Steps