When governing equations are unknown, discover them from data — use SINDy, sparse regression, and symbolic regression to extract interpretable dynamical systems from time-series measurements.
Classical differential equations assume you already know the governing law: $\dot{y} = -ky$, $\ddot{x} + \omega^2 x = 0$, and so on. But in many real-world settings—cell biology, climate science, epidemiology, materials engineering—the true equations are unknown. All you have is time-series data measured from the system.
Sparse Identification of Nonlinear Dynamics (SINDy), introduced by Brunton, Proctor, and Kutz (2016), attacks this problem head-on. The idea is strikingly simple: construct a library of candidate functions (polynomials, trigonometric terms, etc.), then use sparse regression to select the few terms that actually appear in the governing equation. The result is an interpretable, parsimonious differential equation discovered entirely from data.
Symbolic regression takes a complementary approach: search over the space of all mathematical expressions using genetic programming or reinforcement learning. Systems like Eureqa (Schmidt & Lipson, 2009) and AI Feynman (Udrescu & Tegmark, 2020) have rediscovered known physical laws from data alone.
This is the frontier of scientific machine learning: automated scientific discovery. Instead of a human scientist hypothesising equations and testing them, the algorithm discovers the equations directly. This module develops the mathematical foundations, derives the key results, and provides a complete working implementation.
After completing this module you will be able to:

- Formulate equation discovery as sparse regression over a library of candidate functions.
- Implement sequentially thresholded least squares (STLS) and tune its threshold $\tau$.
- State the coherence, RIP, and sample-complexity conditions under which sparse recovery succeeds.
- Apply SINDy to the Lotka–Volterra system, including numerical differentiation and noise handling.
- Compare SINDy with symbolic regression and select models using the Pareto front, cross-validation, and information criteria.
Given state measurements $\mathbf{X} \in \mathbb{R}^{m \times n}$ (where $m$ is the number of time samples and $n$ is the state dimension), the candidate library is the matrix

$$
\Theta(\mathbf{X}) \;=\; \begin{bmatrix} \mathbf{1} & \mathbf{X} & \mathbf{X}^{P_2} & \mathbf{X}^{P_3} & \cdots \end{bmatrix} \;\in\; \mathbb{R}^{m \times p},
$$
where each column is a candidate function evaluated at all $m$ time points. $\mathbf{X}^{P_k}$ denotes all $k$th-order polynomial combinations of the state variables. For a 2D system with states $(x, y)$ and polynomials up to degree 2, the library columns are $[1,\; x,\; y,\; x^2,\; xy,\; y^2]$.
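For concreteness, here is a minimal NumPy sketch of this degree-2 library for a 2D state (the helper name `poly_library_2d` is our choice, not part of any package):

```python
import numpy as np

def poly_library_2d(X):
    """Degree-2 polynomial library for a 2D state X of shape (m, 2).

    Columns: [1, x, y, x^2, x*y, y^2], matching the example above.
    """
    x, y = X[:, 0], X[:, 1]
    Theta = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    names = ["1", "x", "y", "x^2", "xy", "y^2"]
    return Theta, names
```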
The equation discovery problem is cast as: find a sparse coefficient matrix $\Xi \in \mathbb{R}^{p \times n}$ such that

$$
\dot{\mathbf{X}} \;\approx\; \Theta(\mathbf{X})\,\Xi.
$$
Each column $\boldsymbol{\xi}_k$ of $\Xi$ contains the coefficients for the $k$th state variable. Sparsity means most entries of $\Xi$ are exactly zero, so only a few candidate functions appear in each discovered equation.
Sequentially thresholded least squares (STLS) is an iterative algorithm for finding sparse solutions to $\dot{\mathbf{X}} = \Theta\,\Xi$:

1. Solve the full least-squares problem for $\Xi^{(0)} = \arg\min_\Xi \|\dot{\mathbf{X}} - \Theta\,\Xi\|_2$.
2. Set to zero every coefficient whose magnitude is below the threshold $\tau$.
3. Re-solve the least-squares problem using only the surviving library columns.
4. Repeat steps 2–3 until the set of non-zero coefficients no longer changes.
The threshold $\tau > 0$ is a hyperparameter controlling the sparsity level.
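A minimal NumPy sketch of this iteration follows; the function name `stls` and the convergence test are our choices, and library implementations such as PySINDy differ in detail:

```python
import numpy as np

def stls(Theta, X_dot, tau, max_iter=20):
    """Sequentially thresholded least squares (a minimal sketch).

    Theta : (m, p) candidate library
    X_dot : (m, n) time derivatives of the state
    tau   : hard threshold on coefficient magnitude
    Returns Xi of shape (p, n) with most entries exactly zero.
    """
    # Step 1: ordinary least squares over the full library
    Xi = np.linalg.lstsq(Theta, X_dot, rcond=None)[0]
    for _ in range(max_iter):
        small = np.abs(Xi) < tau          # Step 2: identify negligible coefficients
        Xi[small] = 0.0
        for k in range(X_dot.shape[1]):   # Step 3: re-fit on surviving columns only
            keep = ~small[:, k]
            if keep.any():
                Xi[keep, k] = np.linalg.lstsq(Theta[:, keep], X_dot[:, k],
                                              rcond=None)[0]
        if np.array_equal(small, np.abs(Xi) < tau):
            break                         # Step 4: sparsity pattern has stabilised
    return Xi
```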
The Pareto front is the set of models for which no other model is simultaneously more accurate (lower $\|\dot{\mathbf{X}} - \Theta\Xi\|$) and less complex (fewer non-zero terms $\|\Xi\|_0$). Selecting a model on this front balances parsimony against fidelity.
Searches the space of all closed-form expressions to find $\dot{x} = f(x_1, x_2, \ldots)$ fitting observed data. Unlike standard regression (fixed form), symbolic regression discovers both the structure and parameters simultaneously, using operators ($+, -, \times, \div$), elementary functions ($\sin, \cos, \exp, \log$), constants, and state variables.
Represents candidate equations as expression trees and evolves a population over generations. Selection favours low-error, small trees. Crossover swaps subtrees between parents. Mutation randomly modifies nodes. This powers tools like Eureqa and PySR.
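For comparison, here is a hedged sketch of how such a search might be invoked through PySR (listed in the resources below). It assumes PySR and its Julia backend are installed; the data here is a stand-in, and constructor arguments may vary across versions.

```python
# Sketch only: stand-in data, argument names follow recent PySR releases.
import numpy as np
from pysr import PySRRegressor

X = np.random.rand(200, 2) * 5           # stand-in for measured states (x, y)
y = X[:, 0] - 0.1 * X[:, 0] * X[:, 1]    # stand-in target, e.g. an estimated dx/dt

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "cos", "exp"],
)
model.fit(X, y)       # evolves a population of expression trees
print(model)          # reports the Pareto front of discovered expressions
```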
Theorem 3.5.1 (Donoho & Huo, 2001; Tropp, 2004).
Let $\Theta \in \mathbb{R}^{m \times p}$ be the candidate library with columns normalised to unit $\ell_2$-norm. Define the mutual coherence

$$
\mu(\Theta) \;=\; \max_{i \neq j} \left|\langle \boldsymbol{\theta}_i, \boldsymbol{\theta}_j \rangle\right|.
$$
If the true coefficient vector $\boldsymbol{\xi}^*$ has at most $s$ non-zero entries and

$$
s \;<\; \frac{1}{2}\left(1 + \frac{1}{\mu(\Theta)}\right),
$$
then $\boldsymbol{\xi}^*$ is the unique sparsest solution to $\dot{\mathbf{x}} = \Theta\boldsymbol{\xi}$, and it can be recovered exactly by $\ell_1$-minimisation (basis pursuit) or greedy algorithms (OMP). In practice, SINDy succeeds when the library columns corresponding to the true terms are sufficiently distinct from those that should not appear.
Theorem 3.5.2 (Zhang & Schaeffer, 2019).
Consider the STLS iteration applied to $\dot{\mathbf{x}} = \Theta\boldsymbol{\xi}$ with threshold $\tau$. If the candidate library $\Theta$ satisfies a restricted isometry property (RIP) with constant $\delta_{2s} < \sqrt{2} - 1$ for sparsity level $s$, and the minimum magnitude of the true non-zero coefficients satisfies $|\xi^*_j| > 2\tau$ for all active terms, then STLS converges in at most $s$ iterations to the correct sparsity pattern, and the recovered coefficients satisfy

$$
\|\hat{\boldsymbol{\xi}} - \boldsymbol{\xi}^*\|_2 \;\leq\; C\,\|\Theta_S^{\dagger}\|_2\,\|\boldsymbol{\epsilon}\|_2,
$$

where $\boldsymbol{\epsilon}$ is the error in the measured derivatives, $\Theta_S$ is the submatrix of $\Theta$ restricted to the true support $S$, and $C$ is a constant depending on the RIP constant. With exact data ($\boldsymbol{\epsilon} = \mathbf{0}$) the recovery is exact.
Theorem 3.5.3. For a library of $p$ candidate functions and true sparsity $s$, the minimum number of data samples $m$ required to uniquely identify the governing equation scales as

$$
m \;=\; \Omega\!\left(s \log\frac{p}{s}\right).
$$
This is a fundamental lower bound from compressed sensing theory (Candès & Tao, 2005). With $p = 6$ candidate functions and $s = 2$ active terms (as in Lotka–Volterra), the theoretical minimum is on the order of $m \approx s\log(p/s) = 2 \log 3 \approx 2.2$ samples—but in practice, noise, numerical differentiation errors, and coherence between library columns require $m$ to be substantially larger (typically $m \geq 10s$ to $100s$).
The Lotka–Volterra equations: $\dot{x} = \alpha x - \beta xy$, $\dot{y} = \delta xy - \gamma y$ with $\alpha=1.0$, $\beta=0.1$, $\delta=0.075$, $\gamma=1.5$.
Step 1 — Generate data. Integrate from $(x_0,y_0)=(10,5)$ over $t\in[0,20]$ with $m=2000$ points via RK45, producing $\mathbf{X}\in\mathbb{R}^{2000\times 2}$.
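A sketch of this step with SciPy's `solve_ivp` (the tight tolerances are our choice; any sufficiently accurate setting works):

```python
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta, delta, gamma = 1.0, 0.1, 0.075, 1.5

def lotka_volterra(t, z):
    """Right-hand side of the Lotka-Volterra system."""
    x, y = z
    return [alpha * x - beta * x * y, delta * x * y - gamma * y]

m = 2000
t_eval = np.linspace(0.0, 20.0, m)
sol = solve_ivp(lotka_volterra, (0.0, 20.0), [10.0, 5.0],
                t_eval=t_eval, method="RK45", rtol=1e-10, atol=1e-10)
X = sol.y.T                      # shape (2000, 2): columns are x(t), y(t)
dt = t_eval[1] - t_eval[0]
```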
Step 2 — Compute derivatives. Fourth-order centred finite differences: $\dot{x}_i \approx (-x_{i+2}+8x_{i+1}-8x_{i-1}+x_{i-2})/(12\Delta t)$.
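Continuing the sketch, the same stencil applied column-wise to the interior points (the four boundary rows are dropped, as noted in Step 3):

```python
def fourth_order_derivative(X, dt):
    """Fourth-order centred differences on interior points (rows 2..m-3)."""
    return (-X[4:] + 8 * X[3:-1] - 8 * X[1:-3] + X[:-4]) / (12 * dt)

X_dot = fourth_order_derivative(X, dt)   # shape (1996, 2)
X_int = X[2:-2]                          # state samples aligned with X_dot
```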
Step 3 — Build library. $\Theta = [1,\;x,\;y,\;x^2,\;xy,\;y^2] \in \mathbb{R}^{1996\times 6}$ (4 boundary rows lost).
Step 4 — STLS ($\tau=0.05$). Initial least squares for $\dot{x}$: $\boldsymbol{\xi}_1^{(0)} = [0.0002,\;1.0001,\;{-0.0003},\;{-0.0000},\;{-0.1000},\;0.0001]$. After thresholding, surviving terms: $x$ and $xy$. Re-solve: $\hat{\boldsymbol{\xi}}_1 = [1.0000,\;{-0.1000}]$. Converged in 1 iteration.
Result: $\dot{x} = 1.000\,x - 0.100\,xy$, $\dot{y} = 0.075\,xy - 1.500\,y$. Exact recovery of both equations.
| $\tau$ | Discovered $\dot{x}$ | Terms | Residual |
|---|---|---|---|
| 0.001 | $0.0002 + 1.0001x - 0.0003y - 0.1000xy + 0.0001y^2$ | 5 | $2.1\times10^{-6}$ |
| 0.05 | $1.0000x - 0.1000xy$ | 2 | $3.4\times10^{-6}$ |
| 0.5 | $1.0000x$ | 1 | $0.47$ |
| 2.0 | $0$ (empty) | 0 | $1.82$ |
At $\tau=0.001$: overfitting (spurious terms). At $\tau=0.05$: exact recovery. At $\tau=0.5$: underfitting ($xy$ interaction removed). The threshold must separate true coefficient magnitudes ($\geq 0.075$) from the noise floor ($\leq 0.001$).
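Reusing the `stls` and `poly_library_2d` sketches above, the sweep behind this table can be reproduced along the following lines (exact residual values depend on integration and differentiation settings):

```python
Theta, names = poly_library_2d(X_int)
for tau in [0.001, 0.05, 0.5, 2.0]:
    Xi = stls(Theta, X_dot, tau)
    n_terms = int(np.count_nonzero(Xi[:, 0]))            # terms in the dx/dt equation
    residual = np.linalg.norm(X_dot[:, 0] - Theta @ Xi[:, 0])
    terms = [f"{c:+.4f} {n}" for c, n in zip(Xi[:, 0], names) if c != 0.0]
    print(f"tau={tau:<6} terms={n_terms}  residual={residual:.2e}  " + " ".join(terms))
```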
1% noise: Derivative SNR $\approx 50$. STLS with $\tau=0.05$ recovers the correct pattern; coefficients slightly perturbed: $\dot{x}\approx 0.998x - 0.0999xy$.
5% noise: Derivative SNR $\approx 10$. Direct finite differences yield a spurious $x^2$ term. After Savitzky–Golay smoothing (window 21, order 3): $\dot{x}\approx 0.996x - 0.100xy$. Correct equation recovered.
Lesson: smooth noisy data before applying SINDy. Alternatives include integral (weak-form) SINDy formulations that avoid numerical differentiation entirely.
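A sketch of the smoothing step with SciPy's `savgol_filter`, where `X_noisy` stands for the noisy measurement array; the filter can also return smoothed derivative estimates directly via `deriv=1`:

```python
from scipy.signal import savgol_filter

# Smooth each state column before differentiation (window 21, polynomial order 3)
X_smooth = savgol_filter(X_noisy, window_length=21, polyorder=3, axis=0)

# Alternatively, estimate derivatives directly from the local polynomial fits
X_dot_sg = savgol_filter(X_noisy, window_length=21, polyorder=3,
                         deriv=1, delta=dt, axis=0)
```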
Complete NumPy/SciPy implementation: generate data, build library, run STLS, compare discovered vs. true equations.
Click "Run Code" to execute. Expected: STLS converges in 1-2 iterations, discovered equations match dx/dt = 1.0*x - 0.1*xy and dy/dt = 0.075*xy - 1.5*y, relative coefficient error on the order of 1e-5 or smaller.
notebooks/equation_discovery_full.ipynb.
| RL Concept | Equation Discovery Counterpart |
|---|---|
| State $s_t$ | Current candidate equations, library of functions, available data, noise estimate, residual errors |
| Action $a_t$ | Choose candidate functions for library, select threshold $\tau$, decide whether to collect more data, apply preprocessing (smoothing, subsampling) |
| Reward $r_t$ | Parsimony (fewer terms) + accuracy (low residual) + predictive power (forward simulation matches held-out data) |
| Policy $\pi_\theta$ | Bayesian model selection (BIC/AIC), cross-validation for $\tau$, active experimentation (perturb system for informative data) |
| Episode | One complete SINDy run: library construction, STLS, validation |
| Environment | The unknown dynamical system from which data is measured |
Active experimentation. A passive agent uses available data. An active agent designs experiments—choosing initial conditions and perturbations that maximise distinguishability between candidate models. If two candidate equations produce identical predictions in the current regime, the agent perturbs the system into a regime where they differ.
Model selection as exploration–exploitation. Exploitation: commit to the current best model. Exploration: test alternatives (different terms, thresholds). AIC/BIC and cross-validation balance this trade-off.
Hierarchical discovery. Low level: run SINDy with fixed library and $\tau$. Mid level: select library composition and $\tau$. High level: decide whether to collect more data, switch to symbolic regression, or declare the equation discovered. This mirrors how human scientists approach equation discovery.
Show that a polynomial library up to degree $d$ in $n$ variables has $\binom{n+d}{d}$ columns. Compute for $n=3$, $d=3$.
The total number of monomials up to degree $d$ in $n$ variables is $\sum_{k=0}^{d}\binom{n+k-1}{k} = \binom{n+d}{d}$. For $n=3$, $d=3$: $\binom{6}{3}=20$ columns (1 constant + 3 linear + 6 quadratic + 10 cubic).
Compute $\mu(\Theta)$ for $\Theta=[1,\,x,\,x^2]$ with $x_i=i/m$, $m=100$. Which column pair has highest coherence?
After normalising columns: $\langle\hat{\mathbf{1}},\hat{x}\rangle\approx 0.866$, $\langle\hat{\mathbf{1}},\hat{x}^2\rangle\approx 0.745$, $\langle\hat{x},\hat{x}^2\rangle\approx 0.968$. So $\mu\approx 0.968$, from the $(x,x^2)$ pair. This high coherence illustrates why polynomial libraries make sparse recovery harder. Centring and normalising the data reduces coherence.
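A few lines to check these numbers numerically (with $x_i = i/m$, $i = 1, \dots, m$); the discrete values differ slightly from the continuum approximations quoted above:

```python
import numpy as np

m = 100
x = np.arange(1, m + 1) / m
Theta = np.column_stack([np.ones(m), x, x**2])
Theta_hat = Theta / np.linalg.norm(Theta, axis=0)   # unit-norm columns
G = np.abs(Theta_hat.T @ Theta_hat)                 # pairwise inner products
np.fill_diagonal(G, 0.0)
print(G.round(3))         # off-diagonals ~0.87, 0.75, 0.97 (continuum: 0.866, 0.745, 0.968)
print("mu =", G.max())    # mutual coherence, dominated by the (x, x^2) pair
```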
With measured $\tilde{x}_i = x_i + \epsilon_i$ ($\epsilon_i$ i.i.d., variance $\sigma^2$) and first-order differences $\dot{\tilde{x}}_i=(\tilde{x}_{i+1}-\tilde{x}_i)/\Delta t$, show the derivative noise variance is $2\sigma^2/(\Delta t)^2$.
$\dot{\tilde{x}}_i = \dot{x}_i + (\epsilon_{i+1}-\epsilon_i)/\Delta t$. Since $\epsilon_{i+1},\epsilon_i$ are independent: $\text{Var}((\epsilon_{i+1}-\epsilon_i)/\Delta t) = 2\sigma^2/(\Delta t)^2$. For $\Delta t=0.01$, $\sigma=0.01$: derivative noise std $=\sqrt{2}\approx 1.41$, which can dominate the signal.
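A quick Monte Carlo check of the variance formula, with the values quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, dt = 0.01, 0.01
eps = rng.normal(0.0, sigma, size=1_000_000)     # i.i.d. measurement noise
deriv_noise = (eps[1:] - eps[:-1]) / dt          # noise passed through first differences
print(deriv_noise.std())                         # empirical std, ~1.41
print(np.sqrt(2) * sigma / dt)                   # analytical value sqrt(2)*sigma/dt
```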
With $m=1996$ points, compute BIC $= m\ln(\text{RSS}/m)+k\ln(m)$ for: (a) 2-term model ($k=2$, RSS $=6.8\times10^{-6}$) and (b) 5-term model ($k=5$, RSS $=4.2\times10^{-6}$). Which does BIC prefer?
BIC$_2 = 1996\ln(3.41\times10^{-9})+2(7.60)\approx-38{,}902$. BIC$_5 = 1996\ln(2.10\times10^{-9})+5(7.60)\approx-39{,}841$. BIC$_5$ is slightly lower, but the 3 extra terms have coefficients $<0.001$ (noise artefacts). Cross-validation alongside BIC robustly selects the 2-term model.
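The arithmetic can be verified in a couple of lines:

```python
import numpy as np

m = 1996
for k, rss in [(2, 6.8e-6), (5, 4.2e-6)]:
    bic = m * np.log(rss / m) + k * np.log(m)    # BIC = m ln(RSS/m) + k ln(m)
    print(f"k={k}: BIC = {bic:,.0f}")
# k=2: BIC ~ -38,902;  k=5: BIC ~ -39,841
```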
Apply SINDy to $\dot{x}=v$, $\dot{v}=-\omega^2 x$ with $\omega=2$, $(x_0,v_0)=(1,0)$, $t\in[0,10]$, library $[1,x,v,x^2,xv,v^2]$, $\tau=0.1$. Report discovered equations and coefficient errors.
Sweep $\tau\in[0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]$ on Lotka–Volterra data. Record non-zero terms and simulation error for each. Plot the Pareto front and identify the optimal $\tau$.
Add noise $\sigma\in[0, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1]$ (fraction of signal std). Run SINDy with and without Savitzky–Golay smoothing (window 21, order 3). Report coefficient error and whether the correct sparsity pattern is recovered.
Discover the Lorenz system ($\sigma=10$, $\rho=28$, $\beta=8/3$) using degree-2 polynomials (10 columns for 3 variables). Apply SINDy with $\tau=0.1$ and verify correct recovery.
Design an agent that discovers Lotka–Volterra using minimum data. The agent chooses initial conditions for each experiment: (1) Run SINDy. (2) Identify ambiguous terms (coefficients near $\tau$). (3) Choose $(x_0,y_0)$ maximising Fisher information for ambiguous terms. (4) Collect data and append. (5) Repeat until stable for 3 iterations. Compare total data vs. a single trajectory.
Implement an agent that selects $\tau$ via 5-fold temporal cross-validation: for each $\tau\in[0.001,0.005,0.01,0.02,0.05,0.1,0.2,0.5]$, train SINDy on 4 folds, validate by forward simulation on the held-out fold. Select minimum-error $\tau$, re-train on all data. Test on 2% noisy Lotka–Volterra data.
Q1. What does each column of the candidate library $\Theta$ represent?
A candidate nonlinear function (e.g., $1$, $x$, $xy$, $y^2$) evaluated at all $m$ measurement times. Sparse regression selects which functions actually appear in the governing equation.
Q2. Why does STLS iterate rather than threshold once?
Removing terms changes the least-squares fit, potentially revealing additional small coefficients. Iteration continues until the sparsity pattern stabilises, ensuring self-consistency.
Q3. What happens if the true dynamics involve $\sin(x)$ but the library contains only polynomials?
SINDy returns a polynomial approximation ($x - x^3/6 + \cdots$), inaccurate outside the training data range. SINDy can only discover terms present in its library.
Q4. How does mutual coherence $\mu(\Theta)$ affect sparse recovery?
Higher $\mu$ means columns are more similar, making it harder to distinguish active from inactive terms. Theorem 3.5.1 requires $s < \frac{1}{2}(1+1/\mu)$; near $\mu=1$, only $s=1$ can be recovered.
Q5. Why is numerical differentiation problematic for noisy data?
Derivative noise variance scales as $O(\sigma^2/(\Delta t)^2)$. Small $\Delta t$ (needed for accuracy) amplifies noise severely. Smoothing or integral formulations (weak SINDy) mitigate this.
Q6. What is the key difference between SINDy and symbolic regression?
SINDy searches a fixed, pre-specified library via sparse regression. Symbolic regression searches the entire space of mathematical expressions via combinatorial optimisation. SINDy is faster with a good library; symbolic regression is more flexible but computationally expensive.
Q7. How many data points are needed for a 3-term equation from 20 candidates (Theorem 3.5.3)?
Theoretical lower bound: $m=\Omega(3\log(20/3))\approx 6$. In practice, noise requires $m\geq 30$–$300$ for reliable recovery.
Q8. What role does the Pareto front play in equation discovery?
It displays the accuracy–complexity trade-off. Models on the front are non-dominated. A sharp “knee” typically indicates the true equation complexity.
Q9. How does cross-validation select $\tau$?
Split data into temporal folds. For each $\tau$, train SINDy on training folds, forward-simulate on validation fold, measure prediction error. Select $\tau$ with lowest average error, avoiding overfitting (small $\tau$) and underfitting (large $\tau$).
Q10. Why is forward simulation a stronger test than residual error?
Residual error measures instantaneous fit. Forward simulation integrates over time, compounding errors. A model with low residual but wrong dynamics diverges during simulation. It tests actual long-term behaviour.
Scenario: Mystery time-series from a damped oscillator $y''+by'+ky=0$ with unknown $b,k$. Position $y(t)$ at 500 points over $t\in[0,15]$.
Tasks: (1) Convert to first-order system: $\dot{y}_1=y_2$, $\dot{y}_2=-ky_1-by_2$. (2) Estimate velocity via finite differences. (3) Build library $[1,y_1,y_2,y_1^2,y_1 y_2,y_2^2]$, apply SINDy. (4) Report $b$ and $k$. (5) Validate by forward simulation. (6) Compare SINDy with direct least squares on $\ddot{y}=-b\dot{y}-ky$.
Test data: $b=0.3$, $k=4.0$, $(y_0,\dot{y}_0)=(2.0,0)$.
| Criterion | Excellent | Good | Needs Work |
|---|---|---|---|
| Equation form | Both correct, no spurious terms | Correct with 1–2 small spurious terms | Incorrect or missing terms |
| Coefficient accuracy | Relative error $<1\%$ | $<5\%$ | $>5\%$ |
| Forward simulation | Matches full $[0,15]$ | Matches $[0,10]$, drifts later | Diverges within $[0,5]$ |
| Method comparison | Quantitative with error analysis | Qualitative comparison | No comparison |
| Code quality | Clean, commented, modular | Functional with comments | Unclear or incomplete |
notebooks/equation_discovery_full.ipynb — Lorenz system, noise robustness, Savitzky–Golay preprocessing, symbolic regression comparison.
PySINDy documentation (pysindy.readthedocs.io).
PySR (astroautomata.com/PySR).