PROJECT 4.3
| Difficulty | Advanced |
|---|---|
| Estimated Time | 5–7 hours |
| Prerequisites | Module 3.1 (Neural ODEs), Module 3.5 (SINDy & Equation Discovery) |
A central challenge in scientific machine learning is recovering interpretable governing equations directly from observed data. The Sparse Identification of Nonlinear Dynamics (SINDy) framework addresses this by expressing the time derivatives of state variables as sparse linear combinations of candidate nonlinear functions drawn from a library. When the true dynamics are sparse in the chosen library, SINDy can recover the exact governing equations from data — even in the presence of measurement noise.
In this project you will apply SINDy to discover the governing equations of the SIR (Susceptible-Infected-Recovered) epidemiological model, one of the foundational compartmental models in mathematical epidemiology. The SIR system is

$$\frac{dS}{dt} = -\frac{\beta}{N} S I, \qquad \frac{dI}{dt} = \frac{\beta}{N} S I - \gamma I, \qquad \frac{dR}{dt} = \gamma I,$$
with parameters $\beta = 0.3$ (transmission rate) and $\gamma = 0.1$ (recovery rate), and the conservation law $S(t) + I(t) + R(t) = N$ (total population). The basic reproduction number is $\mathcal{R}_0 = \beta / \gamma = 3.0$, indicating a strongly transmissible disease.
Your task is to: (1) generate synthetic SIR trajectory data with controlled noise levels, (2) implement the SINDy algorithm to discover the governing equations from this noisy data, (3) compare the discovered model against simpler baselines, and (4) build an iterative model refinement agent that automatically tunes SINDy hyperparameters (sparsity threshold, library composition, noise filtering) to maximize equation recovery accuracy. This project connects differential equations, data-driven modeling, and automated scientific discovery.
The following code generates clean SIR trajectories using a high-accuracy ODE solver, then adds controlled noise to simulate realistic observational data. Multiple noise levels and sampling rates are produced for systematic evaluation.
```python
import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

# --- SIR model parameters ---
N_POP = 1000        # total population
BETA = 0.3 / N_POP  # transmission rate (per capita)
GAMMA = 0.1         # recovery rate
T_FINAL = 160.0     # simulation duration (days)

# Initial conditions: small outbreak
S0 = 999.0
I0 = 1.0
R0_INIT = 0.0

# Reproduction number
R0 = (BETA * N_POP) / GAMMA
print(f"Basic reproduction number R0 = {R0:.1f}")


def sir_rhs(t, y, beta=BETA, gamma=GAMMA):
    """Right-hand side of the SIR system.

    Args:
        t: time (unused, autonomous system)
        y: state vector [S, I, R]
        beta: transmission rate
        gamma: recovery rate

    Returns:
        dydt: time derivatives [dS/dt, dI/dt, dR/dt]
    """
    S, I, R = y
    dSdt = -beta * S * I
    dIdt = beta * S * I - gamma * I
    dRdt = gamma * I
    return [dSdt, dIdt, dRdt]


def generate_sir_data(dt=0.5, noise_levels=None, seed=42):
    """Generate SIR trajectory data with multiple noise levels.

    Args:
        dt: sampling interval in days
        noise_levels: list of relative noise levels (fraction of max value)
        seed: random seed for reproducibility

    Returns:
        t: time points, shape (M,)
        y_clean: clean trajectory, shape (M, 3)
        noisy_data: dict mapping noise_level -> noisy trajectory, shape (M, 3)
    """
    if noise_levels is None:
        noise_levels = [0.0, 0.01, 0.02, 0.05, 0.10]
    rng = np.random.default_rng(seed)
    # Solve ODE with high accuracy; the dt/2 margin keeps floating-point
    # round-off from producing an evaluation point past T_FINAL
    t_eval = np.arange(0, T_FINAL + dt / 2, dt)
    sol = solve_ivp(
        sir_rhs, [0, T_FINAL], [S0, I0, R0_INIT],
        method='RK45', t_eval=t_eval,
        rtol=1e-10, atol=1e-12, max_step=0.1
    )
    t = sol.t
    y_clean = sol.y.T  # shape (M, 3)
    # Verify conservation law
    pop_error = np.max(np.abs(y_clean.sum(axis=1) - N_POP))
    print(f"Max population conservation error: {pop_error:.2e}")
    # Add noise at each level
    noisy_data = {}
    for noise_level in noise_levels:
        if noise_level == 0.0:
            noisy_data[noise_level] = y_clean.copy()
        else:
            sigma = noise_level * np.max(y_clean, axis=0)  # per-variable scaling
            noise = rng.normal(0, 1, size=y_clean.shape) * sigma[np.newaxis, :]
            y_noisy = y_clean + noise
            # Clip to non-negative values (physical constraint)
            y_noisy = np.clip(y_noisy, 0, None)
            noisy_data[noise_level] = y_noisy
    return t, y_clean, noisy_data


def estimate_derivatives(t, y, method='central4'):
    """Estimate time derivatives from discrete data.

    Args:
        t: time points, shape (M,)
        y: trajectory data, shape (M, 3)
        method: 'central4' for 4th-order central differences,
                'savgol' for Savitzky-Golay filter

    Returns:
        t_d: interior time points, shape (M',)
        y_d: interpolated state at interior points, shape (M', 3)
        dydt: estimated derivatives, shape (M', 3)
    """
    dt = t[1] - t[0]
    M = len(t)
    if method == 'central4':
        # 4th-order central finite differences (interior points only)
        dydt = np.zeros_like(y)
        # Interior: (-y[i+2] + 8*y[i+1] - 8*y[i-1] + y[i-2]) / (12*dt)
        for i in range(2, M - 2):
            dydt[i] = (-y[i+2] + 8*y[i+1] - 8*y[i-1] + y[i-2]) / (12 * dt)
        return t[2:-2], y[2:-2], dydt[2:-2]
    elif method == 'savgol':
        from scipy.signal import savgol_filter
        # Window length must be odd, polynomial order 3
        window = min(21, M if M % 2 == 1 else M - 1)
        y_smooth = np.zeros_like(y)
        dydt = np.zeros_like(y)
        for col in range(3):
            y_smooth[:, col] = savgol_filter(y[:, col], window, 3)
            dydt[:, col] = savgol_filter(y[:, col], window, 3, deriv=1, delta=dt)
        # Trim edges where the filter is unreliable (at least one point)
        trim = max(window // 4, 1)
        return t[trim:-trim], y_smooth[trim:-trim], dydt[trim:-trim]
    else:
        raise ValueError(f"Unknown method: {method}")


# --- Generate datasets for all noise levels and sampling rates ---
NOISE_LEVELS = [0.0, 0.01, 0.02, 0.05, 0.10]
SAMPLING_INTERVALS = [0.1, 0.5, 1.0, 2.0]

datasets = {}
for dt in SAMPLING_INTERVALS:
    t, y_clean, noisy = generate_sir_data(dt=dt, noise_levels=NOISE_LEVELS)
    datasets[dt] = {'t': t, 'clean': y_clean, 'noisy': noisy}
    print(f"dt={dt}: {len(t)} samples, t in [{t[0]:.1f}, {t[-1]:.1f}]")

# Save primary dataset
primary = datasets[0.5]
np.savez('sir_data.npz',
         t=primary['t'],
         y_clean=primary['clean'],
         **{f"y_noise_{nl}": primary['noisy'][nl] for nl in NOISE_LEVELS})
print(f"Saved {len(NOISE_LEVELS)} noise variants to sir_data.npz")
```
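The `.npz` archive written above stores one array per key. A minimal round-trip sketch of that layout (using toy arrays and a hypothetical `sir_data_demo.npz` filename so it runs standalone):

```python
import numpy as np

# Toy stand-ins for the real trajectories, saved with the same key scheme
t = np.arange(0, 5, 0.5)
y = np.column_stack([999 - t, 1 + t, np.zeros_like(t)])
np.savez('sir_data_demo.npz', t=t, y_clean=y, **{'y_noise_0.05': y})

# Reload: keys are addressed exactly as written, e.g. 'y_noise_0.05'
data = np.load('sir_data_demo.npz')
print(sorted(data.files))  # ['t', 'y_clean', 'y_noise_0.05']
y_noisy_5pct = data['y_noise_0.05']
```

Note that the noise level becomes part of the key string, so `data['y_noise_0.05']` retrieves the 5% variant.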
The primary dataset uses $\Delta t = 0.5$ days and produces 321 observations over 160 days. The epidemic peaks at approximately $t \approx 47$ days with $I_{\max} \approx 300$ individuals. After the peak, infection decays exponentially as the susceptible pool is depleted, and the system reaches a final size of approximately $R(\infty) \approx 940$.
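The quoted final size can be checked against the classic SIR final-size relation $r_\infty = 1 - s_0\, e^{-\mathcal{R}_0 r_\infty}$, where $r_\infty = R(\infty)/N$ and $s_0 = S(0)/N$; a quick fixed-point sketch:

```python
import numpy as np

# Final-size relation: r_inf = 1 - s0 * exp(-R0 * r_inf), solved by
# fixed-point iteration (the map is a contraction near the root)
R0, s0, N = 3.0, 0.999, 1000
r = 0.5  # initial guess
for _ in range(100):
    r = 1.0 - s0 * np.exp(-R0 * r)
print(f"Predicted final size R(inf) = {N * r:.1f}")  # close to the ~940 quoted above
```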
The first baseline assumes the SIR structure is known and estimates $\beta$ and $\gamma$ by minimizing the sum of squared residuals between the observed trajectory and the ODE solution. This represents the traditional approach where the model form is specified a priori and only the parameters are learned.
```python
from scipy.optimize import minimize


def sir_residual(params, t_obs, y_obs):
    """Compute residual between SIR model and observations.

    Args:
        params: [beta, gamma] to optimize
        t_obs: observation times, shape (M,)
        y_obs: observed trajectory, shape (M, 3)

    Returns:
        total squared residual (scalar)
    """
    beta_opt, gamma_opt = params
    if beta_opt <= 0 or gamma_opt <= 0:
        return 1e12
    sol = solve_ivp(
        sir_rhs, [t_obs[0], t_obs[-1]], y_obs[0],
        args=(beta_opt, gamma_opt),
        method='RK45', t_eval=t_obs,
        rtol=1e-8, atol=1e-10
    )
    if sol.status != 0 or sol.y.shape[1] != len(t_obs):
        return 1e12
    residual = np.sum((sol.y.T - y_obs)**2)
    return residual


def fit_sir_parameters(t_obs, y_obs, beta_init=1e-4, gamma_init=0.05):
    """Estimate SIR parameters via least-squares fitting.

    Args:
        t_obs: observation times
        y_obs: observed trajectory (possibly noisy)
        beta_init, gamma_init: initial guesses

    Returns:
        beta_hat, gamma_hat: estimated parameters
        result: full optimization result
    """
    result = minimize(
        sir_residual, [beta_init, gamma_init],
        args=(t_obs, y_obs),
        method='Nelder-Mead',
        options={'maxiter': 5000, 'xatol': 1e-10, 'fatol': 1e-10}
    )
    beta_hat, gamma_hat = result.x
    print(f"Fitted: beta={beta_hat:.6e}, gamma={gamma_hat:.6f}")
    print(f"True:   beta={BETA:.6e}, gamma={GAMMA:.6f}")
    print(f"Relative error: beta={abs(beta_hat-BETA)/BETA:.4%}, "
          f"gamma={abs(gamma_hat-GAMMA)/GAMMA:.4%}")
    return beta_hat, gamma_hat, result
```
The second baseline exploits the early-phase approximation. When $S \approx N$ and $I$ is small, the infected compartment grows approximately exponentially: $I(t) \approx I_0 e^{(\beta N - \gamma)t}$. By fitting an exponential to the early growth phase, we can estimate $\mathcal{R}_0$ and $\gamma$ independently.
```python
def estimate_from_initial_growth(t, y_obs, growth_window=30):
    """Estimate R0 and gamma from the initial exponential growth phase.

    During early growth (S ~ N), I(t) ~ I0 * exp((beta*N - gamma)*t).
    The growth rate r = beta*N - gamma gives R0 = 1 + r/gamma.

    Args:
        t: time points
        y_obs: observed trajectory, shape (M, 3)
        growth_window: number of initial points to use for fit

    Returns:
        r_hat: estimated growth rate
        R0_hat: estimated reproduction number
        beta_hat, gamma_hat: estimated parameters
    """
    # Extract early-phase I(t) data
    I_early = y_obs[:growth_window, 1]
    t_early = t[:growth_window]
    # Fit log(I) = log(I0) + r*t via linear regression
    # Only use points where I > 0
    mask = I_early > 0
    log_I = np.log(I_early[mask])
    t_fit = t_early[mask]
    # Linear regression: log(I) = a + r*t
    coeffs = np.polyfit(t_fit, log_I, 1)
    r_hat = coeffs[0]  # growth rate
    # Estimate gamma from recovery phase (after peak)
    peak_idx = np.argmax(y_obs[:, 1])
    # After peak, dI/dt < 0, and if S is small: dI/dt ~ -gamma * I
    I_decay = y_obs[peak_idx:peak_idx + 50, 1]
    t_decay = t[peak_idx:peak_idx + 50]
    mask_decay = I_decay > 10  # only use points with sufficient counts
    if mask_decay.sum() > 5:
        log_I_decay = np.log(I_decay[mask_decay])
        t_d = t_decay[mask_decay]
        decay_coeffs = np.polyfit(t_d, log_I_decay, 1)
        gamma_hat = -decay_coeffs[0]
    else:
        gamma_hat = 0.1  # fallback
    R0_hat = 1.0 + r_hat / gamma_hat
    beta_hat = (r_hat + gamma_hat) / N_POP
    print(f"Growth rate r = {r_hat:.4f}")
    print(f"Estimated gamma = {gamma_hat:.4f} (true: {GAMMA})")
    print(f"Estimated R0 = {R0_hat:.2f} (true: {R0:.1f})")
    print(f"Estimated beta = {beta_hat:.6e} (true: {BETA:.6e})")
    return r_hat, R0_hat, beta_hat, gamma_hat
```
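A quick arithmetic check of the early-growth relations with the true parameters ($\beta N = 0.3$, $\gamma = 0.1$):

```python
import numpy as np

beta_N, gamma = 0.3, 0.1  # true beta*N and gamma from the setup above
r = beta_N - gamma        # early-phase exponential growth rate

print(f"growth rate r = {r:.2f} per day")
print(f"doubling time = {np.log(2) / r:.2f} days")  # roughly 3.5 days
print(f"R0 = 1 + r/gamma = {1 + r / gamma:.1f}")    # recovers R0 = 3.0
```

So during the first few weeks the infected count should roughly double every 3.5 days, which is what the log-linear fit above measures.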
Before describing the agent, we provide the core SINDy algorithm. The method constructs a library matrix $\Theta(\mathbf{X}) \in \mathbb{R}^{M \times p}$ from the state data, then solves $\dot{\mathbf{X}} = \Theta(\mathbf{X})\, \Xi$ for the sparse coefficient matrix $\Xi \in \mathbb{R}^{p \times n}$ using sequential thresholded least squares (STLS).
```python
def build_library(X, poly_order=2):
    """Build the SINDy candidate function library.

    Args:
        X: state data, shape (M, n) where n = number of state variables
        poly_order: maximum polynomial order

    Returns:
        Theta: library matrix, shape (M, p)
        labels: list of string labels for each library function
    """
    M, n = X.shape
    labels = ['1']
    Theta = [np.ones(M)]
    # Linear terms
    var_names = ['S', 'I', 'R']
    for i in range(n):
        labels.append(var_names[i])
        Theta.append(X[:, i])
    # Quadratic terms
    if poly_order >= 2:
        for i in range(n):
            for j in range(i, n):
                labels.append(f"{var_names[i]}*{var_names[j]}")
                Theta.append(X[:, i] * X[:, j])
    Theta = np.column_stack(Theta)
    return Theta, labels


def sindy_stls(Theta, dXdt, threshold=0.1, max_iter=20, alpha=0.0):
    """Sparse regression via Sequential Thresholded Least Squares (STLS).

    Args:
        Theta: library matrix, shape (M, p)
        dXdt: derivative data, shape (M, n)
        threshold: sparsity threshold lambda
        max_iter: maximum number of thresholding iterations
        alpha: L2 regularization strength (ridge parameter)

    Returns:
        Xi: sparse coefficient matrix, shape (p, n)
    """
    M, p = Theta.shape
    n = dXdt.shape[1]
    # Initial least-squares solution (with optional ridge regularization)
    if alpha > 0:
        Xi = np.linalg.solve(
            Theta.T @ Theta + alpha * np.eye(p),
            Theta.T @ dXdt
        )
    else:
        Xi, _, _, _ = np.linalg.lstsq(Theta, dXdt, rcond=None)
    for iteration in range(max_iter):
        Xi_prev = Xi.copy()
        # Threshold: zero out small coefficients
        small_mask = np.abs(Xi) < threshold
        Xi[small_mask] = 0.0
        # Re-solve least squares for remaining nonzero coefficients
        for j in range(n):
            nonzero = np.where(Xi[:, j] != 0)[0]
            if len(nonzero) == 0:
                continue
            if alpha > 0:
                Xi[nonzero, j] = np.linalg.solve(
                    Theta[:, nonzero].T @ Theta[:, nonzero]
                    + alpha * np.eye(len(nonzero)),
                    Theta[:, nonzero].T @ dXdt[:, j]
                )
            else:
                Xi[nonzero, j], _, _, _ = np.linalg.lstsq(
                    Theta[:, nonzero], dXdt[:, j], rcond=None
                )
        # Check convergence
        if np.allclose(Xi, Xi_prev, atol=1e-12):
            print(f"STLS converged at iteration {iteration + 1}")
            break
    return Xi


def print_equations(Xi, labels, var_names=('dS/dt', 'dI/dt', 'dR/dt')):
    """Pretty-print the discovered equations."""
    n = Xi.shape[1]
    for j in range(n):
        terms = []
        for i, label in enumerate(labels):
            if abs(Xi[i, j]) > 1e-15:
                terms.append(f"{Xi[i, j]:+.6e} * {label}")
        equation = ' '.join(terms) if terms else '0'
        print(f"  {var_names[j]} = {equation}")
```
The iterative model refinement agent operates in a hyperparameter search loop. At each iteration $k$, the agent observes a state summarizing the current configuration and its outcome: the sparsity threshold $\lambda_k$, the library composition (polynomial order), the derivative-estimation / noise-filtering setting, and the quality of the resulting fit (its AICc and its number of active terms $n_k$). The agent selects one of the following actions at each refinement step: increase or decrease the sparsity threshold $\lambda$, change the library composition, or switch the noise-filtering / derivative-estimation method. Each step is scored by the reward

$$R_k = -\mathrm{AICc}_k - \mu\,\max\!\left(0,\; n_k - n_{\text{true}}\right),$$

where $n_{\text{true}} = 4$ is the number of nonzero terms in the true SIR dynamics (the terms $SI$ in $dS/dt$, $SI$ and $I$ in $dI/dt$, and $I$ in $dR/dt$), and $\mu = 50$ is a penalty for including spurious terms. The AICc component encourages parsimonious models that fit the data well, while the penalty discourages overfitting.
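The search loop can be sketched in miniature. This is a stateless, bandit-style simplification of the project's Q-learning agent, restricted to the threshold $\lambda$; `evaluate` is a hypothetical stand-in for one SINDy fit (in the project it would run the library construction and STLS above and compute AICc — here it is a fake smooth objective purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
lambdas = np.logspace(-5, 0, 11)  # candidate sparsity thresholds
Q = np.zeros(len(lambdas))        # running value estimate per threshold
counts = np.zeros(len(lambdas))
MU, N_TRUE = 50.0, 4

def evaluate(lam):
    # Placeholder objective: pretend AICc is minimized at lam ~ 3e-3
    # and that too-small thresholds admit 3 spurious terms
    aicc = (np.log10(lam) + 2.5) ** 2 * 100
    n_active = 4 + 3 * int(lam < 1e-4)
    return aicc, n_active

for episode in range(200):
    if rng.random() < 0.1:                 # explore
        a = int(rng.integers(len(lambdas)))
    else:                                  # exploit current best estimate
        a = int(np.argmax(Q))
    aicc, n_active = evaluate(lambdas[a])
    reward = -aicc - MU * max(0, n_active - N_TRUE)
    counts[a] += 1
    Q[a] += (reward - Q[a]) / counts[a]    # incremental mean update

best = lambdas[int(np.argmax(Q))]
print(f"best lambda ~ {best:.1e}")
```

The full agent additionally discretizes the state (noise level, library order, filter setting) and learns a Q-table over state-action pairs rather than a single value per threshold.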
The discovered model is evaluated against the true SIR equations using three metrics: the coefficient recovery error (relative error of the estimated nonzero coefficients), the support recovery F1 score (precision and recall of the identified set of nonzero terms), and the trajectory $L^2$ error obtained by integrating the discovered model forward from the initial conditions.
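Support recovery can be scored directly from the coefficient matrices. A minimal sketch, assuming `Xi_true` and `Xi_hat` are $(p, n)$ coefficient arrays as in the SINDy code above (`support_f1` is an illustrative helper, not part of the project API):

```python
import numpy as np

def support_f1(Xi_true, Xi_hat, tol=1e-10):
    """F1 score of the identified nonzero pattern vs the ground truth."""
    true_support = np.abs(Xi_true) > tol
    est_support = np.abs(Xi_hat) > tol
    tp = np.sum(true_support & est_support)   # correctly identified terms
    fp = np.sum(~true_support & est_support)  # spurious terms
    fn = np.sum(true_support & ~est_support)  # missed terms
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy check: one spurious and one missed term out of three true terms
A = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 2.0], [0.3, 0.0]])
print(support_f1(A, B))  # 2/3: precision = recall = 2/3
```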
| # | Item | Description |
|---|---|---|
| D1 | sir_data_gen.py | SIR model simulation, noise injection, and derivative estimation. Generates all 20 dataset variants (5 noise levels $\times$ 4 sampling rates). |
| D2 | sindy_core.py | SINDy library construction, STLS regression, equation printing, AICc computation. Supports configurable library order and sparsity threshold. |
| D3 | baselines.py | Least-squares parameter fitting and initial growth rate estimation baselines. Outputs fitted parameters and trajectory predictions. |
| D4 | refinement_agent.py | Q-learning agent for iterative SINDy hyperparameter tuning. Includes state discretization, $\epsilon$-greedy policy, and evaluation loop. |
| D5 | evaluate_all.py | Runs SINDy, baselines, and agent on all 20 problem instances. Outputs CSV with coefficient errors, support F1 scores, and trajectory errors. |
| D6 | Report (PDF, 5–7 pages) | Sections: Introduction, SIR Model & Data Generation, SINDy Method, Baselines, Agent-Based Refinement, Results, Discussion, Conclusion. |
| D7 | Plots | (a) SIR trajectories at each noise level (S, I, R vs time). (b) Discovered coefficient matrix heatmap vs ground truth for each noise level. (c) Support recovery F1 score vs noise level (SINDy, agent, baselines). (d) Trajectory prediction error vs noise level and sampling rate (heatmap). (e) Agent's $\lambda$ selection trajectory over refinement iterations. |
| D8 | Metrics table | For each method and each of the 20 problem instances: coefficient recovery error, support F1 score, trajectory $L^2$ error, number of active terms, and AICc. |
| Criterion | Weight | Description |
|---|---|---|
| Correctness | 30 | SIR simulation matches analytical properties ($\mathcal{R}_0 = 3.0$, conservation law $S + I + R = N$). SINDy correctly recovers the exact equations from clean data ($\sigma = 0$): $dS/dt = -\beta SI$, $dI/dt = \beta SI - \gamma I$, $dR/dt = \gamma I$, with coefficient errors below 1%. STLS converges and produces sparse solutions. Derivative estimation is validated against analytical derivatives. |
| Efficiency | 15 | SINDy pipeline runs in under 5 seconds per problem instance. Library construction and STLS use vectorized NumPy operations. The agent converges within 500 episodes of Q-learning (less than 2 minutes total training time). |
| Robustness | 25 | The agent-tuned SINDy achieves support recovery F1 $\geq 0.8$ for noise levels up to 5% and all sampling rates. At 10% noise, the method degrades gracefully (F1 $\geq 0.5$). The discovered model produces bounded, physically plausible trajectories (non-negative populations, conservation law approximately satisfied) when integrated forward. Performance is consistent across 5 random seeds. |
| Reproducibility | 15 | All random seeds are fixed (NumPy). A requirements.txt specifies exact package versions. Running python sir_data_gen.py && python evaluate_all.py reproduces all results. The output CSV matches the reported tables exactly. |
| Documentation | 15 | Report clearly explains the SINDy algorithm and its connection to sparse regression. Discussion section addresses: (a) when SINDy succeeds vs fails, (b) the role of derivative estimation quality, (c) how the agent's strategy adapts to noise level, and (d) limitations of the approach for more complex epidemic models (SEIR, stochastic). Code has module-level and function-level docstrings. Plots are publication-quality. |
| Total | 100 | |
solve_ivp documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.solve_ivp.html