Module 3.4 — Flagship Module
Bridge differential equations and reinforcement learning: control the damped pendulum with PID, Q-learning, and policy gradient methods.
| | |
|---|---|
| Difficulty | Advanced |
| Estimated Time | 5–6 hours |
Every controlled dynamical system is an ODE with a control input: $\dot{x} = f(x, u)$, where $x$ is the state and $u$ is the control. Classical control theory (PID, LQR, MPC) designs $u$ using analytical models. Reinforcement learning (RL) learns $u$ from interaction with the system, requiring no explicit model.
This module makes the connection concrete using a single physical system—the damped pendulum—controlled by three paradigms: PID (hand-crafted), Q-learning (tabular value-based), and REINFORCE (policy gradient). By solving the same task three ways, you see exactly where each approach shines and where it fails. The pendulum swing-up problem is a canonical benchmark in control and RL that captures the essential challenges: nonlinearity, underactuation, and the exploration-exploitation trade-off.
The governing ODE is:
$$\ddot{\theta} + b\dot{\theta} + \frac{g}{L}\sin(\theta) = \frac{u}{mL^2}$$

with parameters $g = 9.81$ m/s$^2$, $L = 1.0$ m, $m = 1.0$ kg, $b = 0.1$ (damping coefficient), and time step $\Delta t = 0.05$ s.
State: $(\theta, \omega)$ where $\omega = \dot{\theta}$. Written as a first-order system:
$$\dot{\theta} = \omega, \quad \dot{\omega} = -b\omega - \frac{g}{L}\sin(\theta) + \frac{u}{mL^2}$$

Goal: swing up from $\theta = \pi$ (hanging down) to $\theta = 0$ (upright) and stabilize there.
Reward: $r = -(\theta^2 + 0.1\omega^2 + 0.001u^2)$ at each step.
Linearize the pendulum around $\theta = 0$: $\sin(\theta) \approx \theta$, so $\ddot{\theta} + 0.1\dot{\theta} + 9.81\theta = u$.
Choose PID gains: $K_p = 10$, $K_i = 0.1$, $K_d = 2$. Target: $\theta_{\text{target}} = 0$.
Step-by-step from $\theta_0 = 0.5$, $\omega_0 = 0$, $\Delta t = 0.05$:
Step 0: $e_0 = 0 - 0.5 = -0.5$, integral $= -0.5 \times 0.05 = -0.025$, derivative $= -0.5/0.05 = -10$.
$u_0 = 10(-0.5) + 0.1(-0.025) + 2(-10) = -5.0 - 0.0025 - 20.0 = -25.0025$. Clamp to the torque limit $[-2, 2]$: $u_0 = -2$.
Step 1: Integrate the ODE one step with $u = -2$ (a semi-implicit Euler step for the hand calculation; the lab code uses RK4). Using $\sin(0.5) = 0.479$: $\omega_1 = 0 + 0.05(-0.1 \times 0 - 9.81 \times 0.479 - 2) = -0.335$, $\theta_1 = 0.5 + 0.05(-0.335) = 0.483$.
$e_1 = -0.483$, integral $= -0.025 + (-0.483)(0.05) = -0.049$, deriv $= (-0.483 + 0.5)/0.05 = 0.34$.
$u_1 = 10(-0.483) + 0.1(-0.049) + 2(0.34) = -4.83 - 0.005 + 0.68 = -4.155$. Clamp: $u_1 = -2$.
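The two hand-computed steps can be checked with a short script (semi-implicit Euler, matching the hand calculation; exact values differ from the rounded hand arithmetic in the third decimal place):

```python
import numpy as np

# Verify the two hand-computed PID steps above.
Kp, Ki, Kd, dt = 10.0, 0.1, 2.0, 0.05
g, L, m, b = 9.81, 1.0, 1.0, 0.1

theta, omega = 0.5, 0.0
integral, prev_error = 0.0, 0.0
us, states = [], []
for _ in range(2):
    error = 0.0 - theta
    integral += error * dt
    derivative = (error - prev_error) / dt
    prev_error = error
    u_raw = Kp * error + Ki * integral + Kd * derivative
    u = float(np.clip(u_raw, -2.0, 2.0))
    us.append((u_raw, u))
    # One semi-implicit Euler step of the pendulum ODE
    omega += dt * (-b * omega - (g / L) * np.sin(theta) + u / (m * L**2))
    theta += dt * omega
    states.append((theta, omega))

print(us[0])      # (-25.0025, -2.0): step 0 saturates at the torque limit
print(states[0])  # approx (0.4832, -0.3352)
print(us[1])      # step 1 also saturates at -2.0
```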
The unclamped PID output far exceeds the $\pm 2$ N·m torque limit, so the control saturates while the error is large. After about 50 steps (2.5 seconds), the pendulum stabilizes near $\theta = 0$. PID works well for stabilization from small angles but cannot perform the full swing-up from $\theta = \pi$.
Discretize: $\theta \in [-\pi, \pi]$ into 20 bins (bin width $= \pi/10$), $\omega \in [-8, 8]$ into 20 bins (bin width $= 0.8$). Actions: $\{-2, -1, 0, +1, +2\}$.
One Q-learning step:
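The update named above can be traced numerically. The starting state (hanging down), the chosen action, and the zero-initialized Q-table below are illustrative assumptions, and a single semi-implicit Euler substep stands in for the lab's RK4 transition:

```python
import numpy as np

# Trace one tabular Q-learning update with illustrative values.
alpha, gamma, dt = 0.1, 0.99, 0.05
actions = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n_theta_bins = n_omega_bins = 20
Q = np.zeros((n_theta_bins, n_omega_bins, len(actions)))

def discretize(theta, omega):
    i_th = int(np.clip((theta + np.pi) / (2*np.pi) * n_theta_bins, 0, n_theta_bins - 1))
    i_om = int(np.clip((omega + 8.0) / 16.0 * n_omega_bins, 0, n_omega_bins - 1))
    return i_th, i_om

theta, omega, a_idx = np.pi, 0.0, 4     # hanging down, apply u = +2
u = actions[a_idx]
i_th, i_om = discretize(theta, omega)   # -> bins (19, 10)

reward = -(theta**2 + 0.1*omega**2 + 0.001*u**2)   # = -(pi^2 + 0.004)
omega_n = omega + dt * (-0.1*omega - 9.81*np.sin(theta) + u)
theta_n = ((theta + dt*omega_n + np.pi) % (2*np.pi)) - np.pi
ni_th, ni_om = discretize(theta_n, omega_n)

# TD target uses the greedy value at the next state (off-policy)
td_target = reward + gamma * np.max(Q[ni_th, ni_om, :])
Q[i_th, i_om, a_idx] += alpha * (td_target - Q[i_th, i_om, a_idx])
print(Q[i_th, i_om, a_idx])   # alpha * reward = -0.9874 (table was all zeros)
```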
Softmax policy over 5 actions with linear features: $\pi_\theta(a|s) = \frac{\exp(\theta_a^T \phi(s))}{\sum_{a'}\exp(\theta_{a'}^T \phi(s))}$ where $\phi(s) = [\theta, \omega, 1]^T$. (Note the overloaded symbol: $\theta_a$ is the weight vector for action $a$, while the $\theta$ inside $\phi(s)$ is the pendulum angle.)
One trajectory of 3 steps:
Returns (no discounting for simplicity): $G_0 = -6.25 - 5.35 - 4.14 = -15.74$, $G_1 = -5.35 - 4.14 = -9.49$, $G_2 = -4.14$.
Gradient: $\nabla_\theta J \approx \sum_{t=0}^{2} G_t \nabla_\theta \log \pi_\theta(a_t|s_t)$.
For the softmax policy, the gradient with respect to the full $5 \times 3$ weight matrix is the outer product $\nabla_\theta \log \pi_\theta(a|s) = (e_a - \pi_\theta(\cdot|s))\,\phi(s)^T$, where $e_a$ is the one-hot vector for action $a$. The large negative returns push probability away from the actions taken in high-cost states.
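The score-function formula can be verified against finite differences; the weight values and feature vector below are arbitrary, with rows of `W` playing the role of the per-action vectors $\theta_a$:

```python
import numpy as np

# Numerically verify the softmax score function.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))           # row a holds theta_a
phi = np.array([0.5, -1.0, 1.0])      # phi(s) = [theta, omega, 1]

def log_pi(W, a):
    logits = W @ phi
    return logits[a] - np.log(np.sum(np.exp(logits)))

a = 2
p = np.exp(W @ phi); p /= p.sum()
analytic = np.outer(np.eye(5)[a] - p, phi)   # (e_a - pi) outer phi(s)

numeric = np.zeros_like(W)
eps = 1e-6
for i in range(5):
    for j in range(3):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps; Wm[i, j] -= eps
        numeric[i, j] = (log_pi(Wp, a) - log_pi(Wm, a)) / (2*eps)

print(np.max(np.abs(analytic - numeric)))   # agreement to finite-difference accuracy
```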
```python
import numpy as np
import matplotlib.pyplot as plt

# ── Pendulum Environment ─────────────────────────────────────
class Pendulum:
    """Damped pendulum: theta'' + b*theta' + (g/L)*sin(theta) = u/(m*L^2)"""
    def __init__(self):
        self.g = 9.81; self.L = 1.0; self.m = 1.0; self.b = 0.1
        self.dt = 0.05
        self.reset()

    def reset(self):
        self.theta = np.pi   # hanging down
        self.omega = 0.0
        return np.array([self.theta, self.omega])

    def _dynamics(self, theta, omega, u):
        dtheta = omega
        domega = -self.b * omega - (self.g / self.L) * np.sin(theta) + u / (self.m * self.L**2)
        return dtheta, domega

    def step(self, u):
        u = np.clip(u, -2.0, 2.0)
        # RK4 integration
        th, om = self.theta, self.omega
        k1_th, k1_om = self._dynamics(th, om, u)
        k2_th, k2_om = self._dynamics(th + 0.5*self.dt*k1_th, om + 0.5*self.dt*k1_om, u)
        k3_th, k3_om = self._dynamics(th + 0.5*self.dt*k2_th, om + 0.5*self.dt*k2_om, u)
        k4_th, k4_om = self._dynamics(th + self.dt*k3_th, om + self.dt*k3_om, u)
        self.theta += (self.dt / 6) * (k1_th + 2*k2_th + 2*k3_th + k4_th)
        self.omega += (self.dt / 6) * (k1_om + 2*k2_om + 2*k3_om + k4_om)
        # Wrap theta to [-pi, pi]
        self.theta = ((self.theta + np.pi) % (2*np.pi)) - np.pi
        reward = -(self.theta**2 + 0.1*self.omega**2 + 0.001*u**2)
        return np.array([self.theta, self.omega]), reward

# ── PID Controller ───────────────────────────────────────────
class PIDController:
    def __init__(self, Kp=10.0, Ki=0.1, Kd=2.0, dt=0.05):
        self.Kp, self.Ki, self.Kd, self.dt = Kp, Ki, Kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def reset(self):
        self.integral = 0.0
        self.prev_error = 0.0

    def compute(self, theta, target=0.0):
        error = target - theta
        # Wrap error to [-pi, pi]
        error = ((error + np.pi) % (2*np.pi)) - np.pi
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.Kp * error + self.Ki * self.integral + self.Kd * derivative
        return np.clip(u, -2.0, 2.0)

# ── Run PID simulation ───────────────────────────────────────
env = Pendulum()
pid = PIDController(Kp=10.0, Ki=0.1, Kd=2.0)
state = env.reset()
pid.reset()
n_steps = 500
thetas, omegas, controls, rewards = [], [], [], []
for t in range(n_steps):
    u = pid.compute(state[0])
    state, r = env.step(u)
    thetas.append(state[0]); omegas.append(state[1])
    controls.append(u); rewards.append(r)

time = np.arange(n_steps) * env.dt
fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True)
axes[0].plot(time, thetas, 'b-', lw=1.5)
axes[0].axhline(0, color='r', ls='--', alpha=0.5, label='target')
axes[0].set_ylabel('theta (rad)'); axes[0].legend(); axes[0].set_title('PID Control of Damped Pendulum')
axes[0].grid(True, alpha=0.3)
axes[1].plot(time, omegas, 'g-', lw=1.5)
axes[1].set_ylabel('omega (rad/s)'); axes[1].grid(True, alpha=0.3)
axes[2].plot(time, controls, 'r-', lw=1.5)
axes[2].set_ylabel('u (N*m)'); axes[2].set_xlabel('Time (s)')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Total reward: {sum(rewards):.1f}")
print(f"Final theta: {thetas[-1]:.4f} rad, Final omega: {omegas[-1]:.4f} rad/s")
```
Exploration ideas:
```python
import numpy as np
import matplotlib.pyplot as plt

# ── Pendulum Environment (same as Code Lab 1) ────────────────
class Pendulum:
    def __init__(self):
        self.g = 9.81; self.L = 1.0; self.m = 1.0; self.b = 0.1
        self.dt = 0.05
        self.reset()

    def reset(self):
        self.theta = np.pi; self.omega = 0.0
        return np.array([self.theta, self.omega])

    def _dynamics(self, theta, omega, u):
        return omega, -self.b*omega - (self.g/self.L)*np.sin(theta) + u/(self.m*self.L**2)

    def step(self, u):
        u = np.clip(u, -2.0, 2.0)
        th, om = self.theta, self.omega
        k1t, k1o = self._dynamics(th, om, u)
        k2t, k2o = self._dynamics(th+0.5*self.dt*k1t, om+0.5*self.dt*k1o, u)
        k3t, k3o = self._dynamics(th+0.5*self.dt*k2t, om+0.5*self.dt*k2o, u)
        k4t, k4o = self._dynamics(th+self.dt*k3t, om+self.dt*k3o, u)
        self.theta += (self.dt/6)*(k1t+2*k2t+2*k3t+k4t)
        self.omega += (self.dt/6)*(k1o+2*k2o+2*k3o+k4o)
        self.theta = ((self.theta+np.pi) % (2*np.pi)) - np.pi
        self.omega = np.clip(self.omega, -8.0, 8.0)
        reward = -(self.theta**2 + 0.1*self.omega**2 + 0.001*u**2)
        return np.array([self.theta, self.omega]), reward

# ── Q-Learning Setup ─────────────────────────────────────────
n_theta_bins = 20
n_omega_bins = 20
actions = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n_actions = len(actions)
Q = np.zeros((n_theta_bins, n_omega_bins, n_actions))

def discretize(theta, omega):
    i_th = int(np.clip((theta + np.pi) / (2*np.pi) * n_theta_bins, 0, n_theta_bins - 1))
    i_om = int(np.clip((omega + 8.0) / 16.0 * n_omega_bins, 0, n_omega_bins - 1))
    return i_th, i_om

# ── Training ─────────────────────────────────────────────────
alpha = 0.1        # learning rate
gamma = 0.99       # discount factor
n_episodes = 500
max_steps = 200
eps_start = 1.0
eps_end = 0.01
episode_rewards = []
env = Pendulum()
for ep in range(n_episodes):
    state = env.reset()
    eps = eps_start - (eps_start - eps_end) * ep / n_episodes
    total_reward = 0
    for step in range(max_steps):
        i_th, i_om = discretize(state[0], state[1])
        # Epsilon-greedy action selection
        if np.random.rand() < eps:
            a_idx = np.random.randint(n_actions)
        else:
            a_idx = np.argmax(Q[i_th, i_om, :])
        u = actions[a_idx]
        next_state, reward = env.step(u)
        total_reward += reward
        # Q-learning update
        ni_th, ni_om = discretize(next_state[0], next_state[1])
        td_target = reward + gamma * np.max(Q[ni_th, ni_om, :])
        Q[i_th, i_om, a_idx] += alpha * (td_target - Q[i_th, i_om, a_idx])
        state = next_state
    episode_rewards.append(total_reward)
    if (ep + 1) % 100 == 0:
        avg = np.mean(episode_rewards[-50:])
        print(f"Episode {ep+1}: avg reward (last 50) = {avg:.1f}, eps = {eps:.3f}")

# ── Evaluate learned policy ──────────────────────────────────
state = env.reset()
q_thetas, q_omegas, q_controls, q_rewards = [], [], [], []
for step in range(500):
    i_th, i_om = discretize(state[0], state[1])
    a_idx = np.argmax(Q[i_th, i_om, :])
    u = actions[a_idx]
    state, r = env.step(u)
    q_thetas.append(state[0]); q_omegas.append(state[1])
    q_controls.append(u); q_rewards.append(r)

# ── PID baseline for comparison ──────────────────────────────
class PIDController:
    def __init__(self, Kp=10.0, Ki=0.1, Kd=2.0, dt=0.05):
        self.Kp, self.Ki, self.Kd, self.dt = Kp, Ki, Kd, dt
        self.integral = 0.0; self.prev_error = 0.0

    def reset(self):
        self.integral = 0.0; self.prev_error = 0.0

    def compute(self, theta):
        error = ((0 - theta + np.pi) % (2*np.pi)) - np.pi
        self.integral += error * self.dt
        deriv = (error - self.prev_error) / self.dt
        self.prev_error = error
        return np.clip(self.Kp*error + self.Ki*self.integral + self.Kd*deriv, -2, 2)

env2 = Pendulum(); pid = PIDController(); state2 = env2.reset(); pid.reset()
pid_thetas = []
for _ in range(500):
    u = pid.compute(state2[0])
    state2, _ = env2.step(u)
    pid_thetas.append(state2[0])

# ── Plots ────────────────────────────────────────────────────
time = np.arange(500) * 0.05
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Training curve (raw and smoothed)
axes[0,0].plot(episode_rewards, alpha=0.3, color='blue')
window = 20
smoothed = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
axes[0,0].plot(range(window-1, n_episodes), smoothed, 'b-', lw=2)
axes[0,0].set_xlabel('Episode'); axes[0,0].set_ylabel('Total Reward')
axes[0,0].set_title('Q-Learning Training'); axes[0,0].grid(True, alpha=0.3)
# Q-learning vs PID trajectories
axes[0,1].plot(time, q_thetas, 'b-', lw=1.5, label='Q-Learning')
axes[0,1].plot(time, pid_thetas, 'r--', lw=1.5, label='PID')
axes[0,1].axhline(0, color='k', ls=':', alpha=0.3)
axes[0,1].set_xlabel('Time (s)'); axes[0,1].set_ylabel('theta (rad)')
axes[0,1].legend(); axes[0,1].set_title('Evaluation: theta(t)')
axes[0,1].grid(True, alpha=0.3)
# Q-learning controls
axes[1,0].plot(time, q_controls, 'b-', lw=1)
axes[1,0].set_xlabel('Time (s)'); axes[1,0].set_ylabel('u (N*m)')
axes[1,0].set_title('Q-Learning Control Input'); axes[1,0].grid(True, alpha=0.3)
# Q-table heatmap (best action per state)
best_actions = np.argmax(Q, axis=2)
im = axes[1,1].imshow(best_actions.T, origin='lower', aspect='auto',
                      extent=[-np.pi, np.pi, -8, 8], cmap='RdYlBu')
plt.colorbar(im, ax=axes[1,1], label='Best action index')
axes[1,1].set_xlabel('theta'); axes[1,1].set_ylabel('omega')
axes[1,1].set_title('Learned Policy (Q-table)')
plt.tight_layout()
plt.show()
print(f"Q-learning total reward: {sum(q_rewards):.1f}")
print(f"Q-learning final theta: {q_thetas[-1]:.4f}")
```
Exploration ideas:
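The module's third paradigm, REINFORCE, can be sketched in the same framework. The softmax-linear policy matches the worked example ($\phi(s) = [\theta, \omega, 1]$); the learning rate, episode budget, and return normalization below are illustrative assumptions, and a run this short will not reliably achieve swing-up:

```python
import numpy as np

# Minimal REINFORCE sketch for the pendulum swing-up (illustrative
# hyperparameters; not a tuned Code Lab solution).
class Pendulum:
    def __init__(self):
        self.g, self.L, self.m, self.b, self.dt = 9.81, 1.0, 1.0, 0.1, 0.05
    def reset(self):
        self.theta, self.omega = np.pi, 0.0
        return np.array([self.theta, self.omega])
    def _dyn(self, th, om, u):
        return om, -self.b*om - (self.g/self.L)*np.sin(th) + u/(self.m*self.L**2)
    def step(self, u):
        u = np.clip(u, -2.0, 2.0)
        th, om, dt = self.theta, self.omega, self.dt
        k1t, k1o = self._dyn(th, om, u)
        k2t, k2o = self._dyn(th+0.5*dt*k1t, om+0.5*dt*k1o, u)
        k3t, k3o = self._dyn(th+0.5*dt*k2t, om+0.5*dt*k2o, u)
        k4t, k4o = self._dyn(th+dt*k3t, om+dt*k3o, u)
        self.theta += (dt/6)*(k1t+2*k2t+2*k3t+k4t)
        self.omega += (dt/6)*(k1o+2*k2o+2*k3o+k4o)
        self.theta = ((self.theta+np.pi) % (2*np.pi)) - np.pi
        self.omega = np.clip(self.omega, -8.0, 8.0)
        return np.array([self.theta, self.omega]), -(self.theta**2 + 0.1*self.omega**2 + 0.001*u**2)

actions = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
rng = np.random.default_rng(0)
W = np.zeros((5, 3))              # rows are the theta_a of the softmax policy
lr, gamma = 0.01, 0.99
env = Pendulum()
returns_log = []
for ep in range(200):
    s, traj = env.reset(), []
    for t in range(200):
        phi = np.array([s[0], s[1], 1.0])
        logits = W @ phi
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(5, p=p)
        s, r = env.step(actions[a])
        traj.append((phi, a, r, p))
    # Discounted returns-to-go, computed backward
    G, Gs = 0.0, []
    for _, _, r, _ in reversed(traj):
        G = r + gamma * G
        Gs.append(G)
    Gs = np.array(Gs[::-1])
    Gs = (Gs - Gs.mean()) / (Gs.std() + 1e-8)   # normalize as a crude baseline
    # Policy gradient ascent: grad log pi = (e_a - pi) outer phi
    for (phi, a, _, p), g in zip(traj, Gs):
        W += lr * g * np.outer(np.eye(5)[a] - p, phi)
    returns_log.append(sum(step[2] for step in traj))
print(f"episode return, first vs last: {returns_log[0]:.1f} vs {returns_log[-1]:.1f}")
```

Return normalization keeps the update magnitudes bounded; a learned value-function baseline (as in Exercise 6) is the standard, lower-variance alternative.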
This module is the curriculum's dedicated agent module: the Agent Lens below synthesizes the ODE-to-MDP mapping that has appeared as a sidebar throughout the curriculum.
| ODE Concept | MDP Concept | Pendulum Instance |
|---|---|---|
| State $x(t)$ | State $s_t$ | $(\theta_t, \omega_t)$ |
| Control input $u(t)$ | Action $a_t$ | Torque $\in \{-2, -1, 0, +1, +2\}$ |
| ODE integration $x(t+\Delta t)$ | Transition $P(s_{t+1}|s_t,a_t)$ | RK4 step of pendulum ODE |
| Cost functional $J = \int_0^T c(x,u)\,dt$ | Cumulative reward $\sum \gamma^t r_t$ | $\sum -(\theta^2 + 0.1\omega^2 + 0.001u^2)$ |
| Optimal control law $u^*(x)$ | Optimal policy $\pi^*(s)$ | $\arg\max_a Q^*(s,a)$ |
| Property | PID | Q-Learning | Policy Gradient |
|---|---|---|---|
| Requires model? | Yes (linearized) | No | No |
| Sample efficiency | N/A (no training) | Moderate | Low |
| Handles nonlinearity | Near setpoint only | Yes (via discretization) | Yes |
| Stability guarantees | Yes (Lyapunov) | Asymptotic only | None in general |
| Continuous actions | Natural | Requires discretization | Natural |
| Swing-up capable? | No (from $\pi$) | Yes (with enough training) | Yes (with enough training) |
When RL outperforms classical control: unknown dynamics, high-dimensional state spaces, complex nonlinear constraints, when the cost function is hard to express analytically, and when the system changes over time (online adaptation).
When classical control is sufficient: known linear (or mildly nonlinear) dynamics, need for formal stability guarantees, real-time hard constraints, and when interpretability is critical.
Exercise 1. Derive the Bellman equation for the pendulum MDP with $\gamma = 0.99$ and the reward $r = -(\theta^2 + 0.1\omega^2 + 0.001u^2)$. Write it explicitly for state $(\theta, \omega)$ and action $u$.
$Q^*(\theta, \omega, u) = -(\theta^2 + 0.1\omega^2 + 0.001u^2) + 0.99 \max_{u'} Q^*(\theta', \omega', u')$ where $(\theta', \omega')$ is obtained by one RK4 step of the pendulum ODE with torque $u$.
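Under the same discretization as the Q-learning lab, this fixed-point equation can be solved directly by value iteration. In the sketch below, a deterministic semi-implicit Euler transition between bin centers stands in for the RK4 step, so the values are approximate:

```python
import numpy as np

# Value iteration on the discretized pendulum grid.
n_th, n_om, gamma, dt = 20, 20, 0.99, 0.05
actions = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
th_c = -np.pi + (np.arange(n_th) + 0.5) * (2*np.pi / n_th)   # bin centers
om_c = -8.0 + (np.arange(n_om) + 0.5) * (16.0 / n_om)

def discretize(th, om):
    i = int(np.clip((th + np.pi) / (2*np.pi) * n_th, 0, n_th - 1))
    j = int(np.clip((om + 8.0) / 16.0 * n_om, 0, n_om - 1))
    return i, j

# Precompute deterministic transitions and rewards per (state, action)
nxt = np.zeros((n_th, n_om, 5, 2), dtype=int)
rew = np.zeros((n_th, n_om, 5))
for i, th in enumerate(th_c):
    for j, om in enumerate(om_c):
        for k, u in enumerate(actions):
            om2 = np.clip(om + dt*(-0.1*om - 9.81*np.sin(th) + u), -8.0, 8.0)
            th2 = ((th + dt*om2 + np.pi) % (2*np.pi)) - np.pi
            nxt[i, j, k] = discretize(th2, om2)
            rew[i, j, k] = -(th**2 + 0.1*om**2 + 0.001*u**2)

V = np.zeros((n_th, n_om))
for sweep in range(2500):     # gamma^2500 is negligible, so this converges
    Vn = np.max(rew + gamma * V[nxt[..., 0], nxt[..., 1]], axis=2)
    done = np.max(np.abs(Vn - V)) < 1e-8
    V = Vn
    if done:
        break
print(V.shape, float(V.max()))   # V <= 0 everywhere since every reward is <= 0
```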
Exercise 2. Linearize the pendulum around $\theta = 0$ and compute the transfer function $G(s) = \Theta(s)/U(s)$ in the Laplace domain.
Linearized: $\ddot{\theta} + 0.1\dot{\theta} + 9.81\theta = u$. Taking Laplace transforms: $(s^2 + 0.1s + 9.81)\Theta(s) = U(s)$. So $G(s) = 1/(s^2 + 0.1s + 9.81)$.
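The poles of this transfer function confirm a very lightly damped oscillator near the natural frequency $\sqrt{9.81} \approx 3.13$ rad/s; a quick numerical check:

```python
import numpy as np

# Poles and damping of G(s) = 1 / (s^2 + 0.1 s + 9.81)
poles = np.roots([1.0, 0.1, 9.81])
wn = np.sqrt(9.81)            # natural frequency, rad/s
zeta = 0.1 / (2 * wn)         # damping ratio
print(poles)                  # -0.05 +/- 3.132j
print(wn, zeta)               # ~3.132 rad/s, ~0.016: very lightly damped
```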
Exercise 3. Show that the reward $r = -(\theta^2 + 0.1\omega^2 + 0.001u^2)$ is equivalent to a quadratic cost in LQR form $c = x^T Q x + u^T R u$ and identify $Q$ and $R$.
With $x = [\theta, \omega]^T$: $Q = \begin{pmatrix}1 & 0\\0 & 0.1\end{pmatrix}$, $R = [0.001]$. The reward is $r = -(x^T Q x + u^T R u)$.
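The identity can be spot-checked numerically (`Qm` names the state-cost matrix here to avoid clashing with the Q-table):

```python
import numpy as np

# Verify r = -(x^T Q x + u^T R u) with Q = diag(1, 0.1), R = [0.001]
Qm = np.diag([1.0, 0.1])
R = np.array([[0.001]])
rng = np.random.default_rng(1)
for _ in range(100):
    theta, omega, u = rng.normal(size=3)
    x = np.array([theta, omega])
    reward = -(theta**2 + 0.1*omega**2 + 0.001*u**2)
    cost = x @ Qm @ x + u * R[0, 0] * u
    assert np.isclose(reward, -cost)
print("reward matches -(x^T Q x + u^T R u) on 100 random samples")
```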
Exercise 4. For the softmax policy $\pi_\theta(a|s) \propto \exp(\theta_a^T \phi(s))$, compute $\nabla_{\theta_a} \log \pi_\theta(a|s)$ and verify it equals $\phi(s)(1 - \pi_\theta(a|s))$.
$\log \pi_\theta(a|s) = \theta_a^T \phi(s) - \log \sum_{a'} \exp(\theta_{a'}^T \phi(s))$. $\nabla_{\theta_a} \log \pi_\theta(a|s) = \phi(s) - \pi_\theta(a|s)\phi(s) = \phi(s)(1 - \pi_\theta(a|s))$.
Exercise 5. Implement SARSA (on-policy TD) for the pendulum and compare its learning curve with Q-learning (off-policy). Run 500 episodes with the same hyperparameters.
Exercise 6. Implement a simple actor-critic: use the Q-table as the critic and a softmax policy (parameterized by a weight matrix) as the actor. Train for 500 episodes.
Exercise 7. Tune the PID gains by grid search over $K_p \in \{5, 10, 20, 50\}$, $K_d \in \{1, 2, 5\}$ (fix $K_i = 0.1$). Starting from $\theta = 0.3$ (near upright), report the gains that minimize settling time.
Exercise 8. Run Q-learning with $\gamma = 0.9, 0.95, 0.99, 0.999$ and compare total reward after 500 episodes. Plot learning curves for all four values.
Exercise 9. Design a multi-objective reward function that balances swing-up speed, energy efficiency, and stabilization accuracy. Parameterize it as $r = -w_1 \theta^2 - w_2 \omega^2 - w_3 u^2$ and search over $(w_1, w_2, w_3)$ to find weights that achieve swing-up within 200 steps while keeping $\sum u^2$ below a threshold.
Exercise 10. Implement curriculum learning for the pendulum: first train the agent starting from $\theta = 0.5$ (easy), then gradually increase the starting angle to $\pi$ (hard) over the course of training. Compare with training from $\pi$ throughout.
Q1. What are the five components of an MDP?
State space $\mathcal{S}$, action space $\mathcal{A}$, transition function $P(s'|s,a)$, reward function $R(s,a)$, discount factor $\gamma$.
Q2. What does the Q-function $Q^\pi(s,a)$ represent?
The expected cumulative discounted reward when taking action $a$ in state $s$ and following policy $\pi$ thereafter.
Q3. Why does tabular Q-learning require discretization of the pendulum state?
A Q-table has a finite number of entries. Continuous states $(\theta, \omega)$ must be mapped to discrete bins to index into the table.
Q4. What is $\varepsilon$-greedy exploration?
With probability $\varepsilon$, choose a random action; with probability $1-\varepsilon$, choose the greedy action $\arg\max_a Q(s,a)$. $\varepsilon$ is decayed during training.
Q5. Why can’t PID swing up the pendulum from $\theta = \pi$?
PID is designed around the dynamics linearized at $\theta = 0$. At $\theta = \pi$ the small-angle approximation $\sin(\theta) \approx \theta$ is invalid, and with torque limited to $\pm 2$ N·m the linear feedback law cannot generate the back-and-forth pumping sequence needed to inject enough energy for swing-up.
Q6. What is the role of the discount factor $\gamma$?
$\gamma$ controls how much the agent values future rewards relative to immediate ones. $\gamma = 0$: purely myopic. $\gamma \to 1$: future rewards count nearly as much as immediate ones. For the pendulum, $\gamma = 0.99$ encourages the agent to plan ahead for swing-up.
Q7. What is the REINFORCE gradient estimate?
$\nabla_\theta J \approx \sum_t G_t \nabla_\theta \log \pi_\theta(a_t|s_t)$ where $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$ is the return-to-go.
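The return-to-go is computed with a single backward pass; using $\gamma = 1$ and the rewards from the worked three-step trajectory reproduces the $G_t$ values given earlier:

```python
import numpy as np

# Return-to-go G_t = sum_{k>=t} gamma^{k-t} r_k via one backward pass
def returns_to_go(rewards, gamma=0.99):
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

print(returns_to_go([-6.25, -5.35, -4.14], gamma=1.0))   # [-15.74  -9.49  -4.14]
```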
Q8. What are the Robbins-Monro conditions for Q-learning convergence?
The learning-rate sequence must satisfy $\sum_n \alpha_n = \infty$ (steps stay large enough to overcome any initialization) and $\sum_n \alpha_n^2 < \infty$ (the update noise is eventually averaged out), and every state-action pair must be visited infinitely often. A constant $\alpha = 0.1$ does not formally satisfy these conditions, but works well in practice for finite problems.
Q9. Why is reward shaping important for the pendulum?
Without velocity and control penalties ($\omega^2$ and $u^2$ terms), the agent may learn bang-bang control that rapidly oscillates the torque, producing physically undesirable behaviour. Reward shaping encodes physical preferences (smooth control, energy efficiency).
Q10. Name one advantage and one disadvantage of policy gradient vs Q-learning.
Advantage: policy gradient handles continuous actions naturally. Disadvantage: high variance in gradient estimates requires many samples for convergence.
Implement PID, Q-learning, and REINFORCE for the damped pendulum. Run a controlled comparison.
| Criterion | Weight | Description |
|---|---|---|
| Correctness | 30% | All three controllers are correctly implemented and produce reasonable behaviour |
| Fair comparison | 25% | Evaluation uses identical initial conditions and environment settings |
| Analysis | 20% | Discusses strengths/weaknesses of each approach with evidence |
| Visualization | 15% | Clear learning curves, trajectory comparisons, and summary table |
| Code quality | 10% | Modular code with shared environment class |