Module 3.5: Data-Driven Discovery of Unknown Equations

When governing equations are unknown, discover them from data — use SINDy, sparse regression, and symbolic regression to extract interpretable dynamical systems from time-series measurements.

Prerequisites: Module 1.1–1.3 (First- and second-order ODEs, systems of ODEs), Module 0.1 (Linear algebra), and familiarity with least squares ($\min_\xi \|\Theta\xi - \dot{X}\|_2^2$).
Difficulty: Advanced  |  Estimated time: 4–5 hours

Why This Matters


Classical differential equations assume you already know the governing law: $\dot{y} = -ky$, $\ddot{x} + \omega^2 x = 0$, and so on. But in many real-world settings—cell biology, climate science, epidemiology, materials engineering—the true equations are unknown. All you have is time-series data measured from the system.

Sparse Identification of Nonlinear Dynamics (SINDy), introduced by Brunton, Proctor, and Kutz (2016), attacks this problem head-on. The idea is strikingly simple: construct a library of candidate functions (polynomials, trigonometric terms, etc.), then use sparse regression to select the few terms that actually appear in the governing equation. The result is an interpretable, parsimonious differential equation discovered entirely from data.

Symbolic regression takes a complementary approach: search over the space of all mathematical expressions using genetic programming or reinforcement learning. Systems like Eureqa (Schmidt & Lipson, 2009) and AI Feynman (Udrescu & Tegmark, 2020) have rediscovered known physical laws from data alone.

This is the frontier of scientific machine learning: automated scientific discovery. Instead of a human scientist hypothesising equations and testing them, the algorithm discovers the equations directly. This module develops the mathematical foundations, derives the key results, and provides a complete working implementation.

Learning Objectives


After completing this module you will be able to:

  1. Construct a candidate library $\Theta(\mathbf{X})$ of nonlinear functions from state measurements.
  2. Formulate the equation discovery problem as sparse regression: $\dot{\mathbf{X}} = \Theta(\mathbf{X})\,\Xi$.
  3. Implement Sequential Thresholded Least Squares (STLS) and explain how it promotes sparsity.
  4. Apply SINDy to the Lotka–Volterra predator–prey system and recover the true coefficients.
  5. Quantify the effect of measurement noise on equation discovery and apply smoothing as preprocessing.
  6. Select the sparsity threshold $\tau$ using cross-validation and information criteria (AIC/BIC).
  7. Describe how symbolic regression searches expression space via genetic programming.
  8. Compare data-driven equation discovery with physics-based modelling in interpretability, generalisability, and data requirements.
  9. Assess discovered equations by forward simulation and residual analysis.
  10. Design an active-experimentation strategy to gather maximally informative data.

Core Concepts


Definitions

Definition 3.5.1 — Candidate Library $\Theta(\mathbf{X})$

Given state measurements $\mathbf{X} \in \mathbb{R}^{m \times n}$ (where $m$ is the number of time samples and $n$ is the state dimension), the candidate library is the matrix

$$\Theta(\mathbf{X}) = \begin{bmatrix} \mathbf{1} & \mathbf{X} & \mathbf{X}^{P_2} & \mathbf{X}^{P_3} & \cdots & \sin(\mathbf{X}) & \cdots \end{bmatrix} \in \mathbb{R}^{m \times p},$$

where each column is a candidate function evaluated at all $m$ time points. $\mathbf{X}^{P_k}$ denotes all $k$th-order polynomial combinations of the state variables. For a 2D system with states $(x, y)$ and polynomials up to degree 2, the library columns are $[1,\; x,\; y,\; x^2,\; xy,\; y^2]$.
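The library construction generalises directly to any state dimension and degree. A minimal sketch (the `poly_library` helper and the `x0`/`x1` column naming are illustrative, not from the module's codebase):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_library(X, degree):
    """Columns: all monomials in the state variables up to `degree`."""
    m, n = X.shape
    cols, names = [np.ones(m)], ["1"]
    for d in range(1, degree + 1):
        # combinations_with_replacement enumerates each monomial exactly once
        for idx in combinations_with_replacement(range(n), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
            names.append("*".join(f"x{j}" for j in idx))
    return np.column_stack(cols), names

X = np.random.rand(5, 2)          # 5 samples of a 2-state system
Theta, names = poly_library(X, degree=2)
print(names)                      # ['1', 'x0', 'x1', 'x0*x0', 'x0*x1', 'x1*x1']
```

For the 2D degree-2 case this reproduces the six columns $[1, x, y, x^2, xy, y^2]$ listed above.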

Definition 3.5.2 — Sparse Regression for Dynamics

The equation discovery problem is cast as: find a sparse coefficient matrix $\Xi \in \mathbb{R}^{p \times n}$ such that

$$\dot{\mathbf{X}} \approx \Theta(\mathbf{X})\,\Xi.$$

Each column $\boldsymbol{\xi}_k$ of $\Xi$ contains the coefficients for the $k$th state variable. Sparsity means most entries of $\Xi$ are exactly zero, so only a few candidate functions appear in each discovered equation.

Definition 3.5.3 — Sequential Thresholded Least Squares (STLS)

STLS is an iterative algorithm for finding sparse solutions to $\dot{\mathbf{X}} = \Theta\,\Xi$:

  1. Initialise: Solve $\Xi^{(0)} = \arg\min_\Xi \|\dot{\mathbf{X}} - \Theta\Xi\|_F^2$ by ordinary least squares.
  2. Threshold: Set all entries $|\Xi^{(k)}_{ij}| < \tau$ to zero.
  3. Re-solve: For each column, re-fit the least-squares problem using only the surviving (non-zero) candidate terms.
  4. Repeat steps 2–3 until convergence (the sparsity pattern stabilises).

The threshold $\tau > 0$ is a hyperparameter controlling the sparsity level.
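The four steps above map almost line-for-line onto NumPy. A minimal sketch with a synthetic sparse-recovery check (the `stls` function name and the test problem are illustrative):

```python
import numpy as np

def stls(Theta, dX, tau, max_iter=10):
    """Sequential Thresholded Least Squares (Definition 3.5.3)."""
    # Step 1: ordinary least squares
    Xi, *_ = np.linalg.lstsq(Theta, dX, rcond=None)
    for _ in range(max_iter):
        # Step 2: zero out small coefficients
        small = np.abs(Xi) < tau
        Xi[small] = 0.0
        # Step 3: re-fit each column on its surviving terms
        for k in range(dX.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                Xi[keep, k], *_ = np.linalg.lstsq(Theta[:, keep], dX[:, k],
                                                  rcond=None)
        # Step 4: stop once the sparsity pattern stabilises
        if np.array_equal(small, np.abs(Xi) < tau):
            break
    return Xi

# Synthetic check: 6 candidates, 2 truly active (columns 1 and 4)
rng = np.random.default_rng(1)
Theta = rng.standard_normal((200, 6))
xi_true = np.array([[0.0], [1.0], [0.0], [0.0], [-0.1], [0.0]])
dX = Theta @ xi_true + 1e-4 * rng.standard_normal((200, 1))
Xi = stls(Theta, dX, tau=0.05)
print("active terms:", np.nonzero(Xi[:, 0])[0])
```

Note that the smallest true coefficient ($-0.1$) exceeds the threshold $\tau = 0.05$ in magnitude, which is exactly the condition under which STLS can keep it.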

Definition 3.5.4 — Pareto Front (Accuracy vs. Complexity)

The Pareto front is the set of models for which no other model is simultaneously more accurate (lower $\|\dot{\mathbf{X}} - \Theta\Xi\|$) and less complex (fewer non-zero terms $\|\Xi\|_0$). Selecting a model on this front balances parsimony against fidelity.
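Extracting the front from a set of fitted models is a simple dominance check. A sketch (the `pareto_front` helper is hypothetical; four of the (terms, residual) pairs echo Example 3.5.2, and the last two are made-up dominated models):

```python
def pareto_front(models):
    """Return the non-dominated (n_terms, residual) pairs, sorted by complexity."""
    front = []
    for c, r in models:
        # A model is dominated if another is at least as good in both
        # objectives and strictly better in one.
        dominated = any(c2 <= c and r2 <= r and (c2 < c or r2 < r)
                        for c2, r2 in models)
        if not dominated:
            front.append((c, r))
    return sorted(front)

models = [(5, 2.1e-6), (2, 3.4e-6), (1, 0.47), (0, 1.82),
          (4, 5.0e-6), (3, 8.0e-6)]
print(pareto_front(models))
```

The two dominated models are discarded; the sharp drop in residual between 1 and 2 terms marks the "knee" at the true complexity.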

Definition 3.5.5 — Symbolic Regression

Searches the space of all closed-form expressions to find $\dot{x} = f(x_1, x_2, \ldots)$ fitting observed data. Unlike standard regression (fixed form), symbolic regression discovers both the structure and parameters simultaneously, using operators ($+, -, \times, \div$), elementary functions ($\sin, \cos, \exp, \log$), constants, and state variables.

Definition 3.5.6 — Genetic Programming for Equation Search

Represents candidate equations as expression trees and evolves a population over generations. Selection favours low-error, small trees. Crossover swaps subtrees between parents. Mutation randomly modifies nodes. This powers tools like Eureqa and PySR.

Theorems and Key Results

Theorem 3.5.1 — Sparse Recovery via Mutual Incoherence

(Donoho & Huo, 2001; Tropp, 2004)

Let $\Theta \in \mathbb{R}^{m \times p}$ be the candidate library with columns normalised to unit $\ell_2$-norm. Define the mutual coherence

$$\mu(\Theta) = \max_{i \neq j} \frac{|\langle \boldsymbol{\theta}_i, \boldsymbol{\theta}_j \rangle|}{\|\boldsymbol{\theta}_i\| \|\boldsymbol{\theta}_j\|}.$$

If the true coefficient vector $\boldsymbol{\xi}^*$ has at most $s$ non-zero entries and

$$s < \tfrac{1}{2}\left(1 + \frac{1}{\mu(\Theta)}\right),$$

then $\boldsymbol{\xi}^*$ is the unique sparsest solution to $\dot{\mathbf{x}} = \Theta\boldsymbol{\xi}$, and it can be recovered exactly by $\ell_1$-minimisation (basis pursuit) or greedy algorithms (OMP). In practice, SINDy succeeds when the library columns corresponding to the true terms are sufficiently distinct from those that should not appear.

Theorem 3.5.2 — Convergence of STLS

(Zhang & Schaeffer, 2019)

Consider the STLS iteration applied to $\dot{\mathbf{x}} = \Theta\boldsymbol{\xi}$ with threshold $\tau$. If the candidate library $\Theta$ satisfies a restricted isometry property (RIP) with constant $\delta_{2s} < \sqrt{2} - 1$ for sparsity level $s$, and the minimum magnitude of the true non-zero coefficients satisfies $|\xi^*_j| > 2\tau$ for all active terms, then STLS converges in at most $s$ iterations to the correct sparsity pattern, and the recovered coefficients satisfy

$$\|\hat{\boldsymbol{\xi}} - \boldsymbol{\xi}^*\|_2 \leq C \cdot \frac{\|\text{noise}\|_2}{\sigma_{\min}(\Theta_S)},$$

where $\Theta_S$ is the submatrix of $\Theta$ restricted to the true support $S$ and $C$ is a constant depending on the RIP constant.

Theorem 3.5.3 — Information-Theoretic Limits of Equation Discovery

For a library of $p$ candidate functions and true sparsity $s$, the minimum number of data samples $m$ required to uniquely identify the governing equation scales as

$$m = \Omega\!\left(s \log \frac{p}{s}\right).$$

This is a fundamental lower bound from compressed sensing theory (Candès & Tao, 2005). With $p = 6$ candidate functions and $s = 2$ active terms (as in Lotka–Volterra), the theoretical minimum is on the order of $m \approx 2 \log 3 \approx 2.2$ samples—but in practice, noise, numerical differentiation errors, and coherence between library columns require $m$ to be substantially larger (typically $m \geq 10s$ to $100s$).

Common Misconceptions

Misconception 1: “SINDy always finds the correct equations.”
Reality: SINDy can only discover terms present in the library. If the true dynamics involve $\sin(x)$ but the library contains only polynomials, SINDy returns a polynomial approximation. Noise and poor threshold selection also cause errors.
Misconception 2: “More candidate functions are always better.”
Reality: Expanding the library increases mutual coherence $\mu(\Theta)$, making sparse recovery harder. A library of 100 terms when only 3 are active is much harder than a library of 10. Domain knowledge to constrain the library is extremely valuable.
Misconception 3: “Noise doesn't affect SINDy much.”
Reality: SINDy requires computing derivatives, which amplifies noise: if measurement noise is $O(\epsilon)$, derivative noise is $O(\epsilon/\Delta t)$. Smoothing (Savitzky–Golay) or integral formulations (weak SINDy) are essential for noisy data.
Misconception 4: “Symbolic regression is just curve fitting.”
Reality: Curve fitting assumes a fixed form and optimises parameters. Symbolic regression discovers both the functional form and parameters, searching over all mathematical expressions—a combinatorially vast space.
Misconception 5: “You need millions of data points.”
Reality: Compressed sensing shows data scales as $O(s \log(p/s))$—tens to hundreds of points. The key is richness: sampling diverse states matters more than sheer volume.

Worked Examples


Example 3.5.1 — SINDy for the Lotka–Volterra System

The Lotka–Volterra equations: $\dot{x} = \alpha x - \beta xy$, $\dot{y} = \delta xy - \gamma y$ with $\alpha=1.0$, $\beta=0.1$, $\delta=0.075$, $\gamma=1.5$.

Step 1 — Generate data. Integrate from $(x_0,y_0)=(10,5)$ over $t\in[0,20]$ with $m=2000$ points via RK45, producing $\mathbf{X}\in\mathbb{R}^{2000\times 2}$.

Step 2 — Compute derivatives. Fourth-order centred finite differences: $\dot{x}_i \approx (-x_{i+2}+8x_{i+1}-8x_{i-1}+x_{i-2})/(12\Delta t)$.

Step 3 — Build library. $\Theta = [1,\;x,\;y,\;x^2,\;xy,\;y^2] \in \mathbb{R}^{1996\times 6}$ (4 boundary rows lost).

Step 4 — STLS ($\tau=0.05$). Initial least squares for $\dot{x}$: $\boldsymbol{\xi}_1^{(0)} = [0.0002,\;1.0001,\;{-0.0003},\;{-0.0000},\;{-0.1000},\;0.0001]$. After thresholding, surviving terms: $x$ and $xy$. Re-solve: $\hat{\boldsymbol{\xi}}_1 = [1.0000,\;{-0.1000}]$. Converged in 1 iteration.

Result: $\dot{x} = 1.000\,x - 0.100\,xy$, $\dot{y} = 0.075\,xy - 1.500\,y$. Exact recovery of both equations.

Example 3.5.2 — Effect of the Threshold $\tau$

| $\tau$ | Discovered $\dot{x}$ | Terms | Residual |
|---|---|---|---|
| 0.001 | $0.0002 + 1.0001x - 0.0003y - 0.1000xy + 0.0001y^2$ | 5 | $2.1\times10^{-6}$ |
| 0.05 | $1.0000x - 0.1000xy$ | 2 | $3.4\times10^{-6}$ |
| 0.5 | $1.0000x$ | 1 | $0.47$ |
| 2.0 | $0$ (empty) | 0 | $1.82$ |

At $\tau=0.001$: overfitting (spurious terms). At $\tau=0.05$: exact recovery. At $\tau=0.5$: underfitting ($xy$ interaction removed). The threshold must separate true coefficient magnitudes ($\geq 0.075$) from the noise floor ($\leq 0.001$).

Example 3.5.3 — SINDy with Measurement Noise

1% noise: Derivative SNR $\approx 50$. STLS with $\tau=0.05$ recovers the correct pattern; coefficients slightly perturbed: $\dot{x}\approx 0.998x - 0.0999xy$.

5% noise: Derivative SNR $\approx 10$. Direct finite differences yield a spurious $x^2$ term. After Savitzky–Golay smoothing (window 21, order 3): $\dot{x}\approx 0.996x - 0.100xy$. Correct equation recovered.

Lesson: Always smooth data before SINDy. Alternatives include integral formulations (weak SINDy) that avoid differentiation entirely.
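SciPy's Savitzky–Golay filter can return the smoothed derivative directly (`deriv=1` with the sample spacing passed as `delta`), avoiding a separate differentiation pass. A sketch on a stand-in damped-oscillation signal (the signal itself is illustrative):

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.linspace(0, 20, 2000)
dt = t[1] - t[0]
x = 10 * np.exp(-0.05 * t) * np.cos(t)                    # illustrative signal
noisy = x + 0.05 * x.std() * rng.standard_normal(t.size)  # ~5% noise

# Smoothed derivative in one pass (window 21, order 3, as in Example 3.5.3)
dx_smooth = savgol_filter(noisy, window_length=21, polyorder=3,
                          deriv=1, delta=dt)
dx_raw = np.gradient(noisy, dt)        # naive finite differences on noisy data
dx_true = np.gradient(x, dt)

print("raw FD error:  ", np.sqrt(np.mean((dx_raw - dx_true) ** 2)))
print("savgol error:  ", np.sqrt(np.mean((dx_smooth - dx_true) ** 2)))
```

The raw finite-difference error is dominated by the $O(\sigma/\Delta t)$ noise amplification; the filtered derivative is typically more than an order of magnitude more accurate here.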

Interactive Code Lab: SINDy for Lotka–Volterra Discovery


Complete NumPy/SciPy implementation: generate data, build library, run STLS, compare discovered vs. true equations.

Expected behaviour: STLS converges in 1–2 iterations, the discovered equations match dx/dt = 1.0*x - 0.1*xy and dy/dt = 0.075*xy - 1.5*y, and the relative coefficient error is on the order of 1e-5 or smaller.
For the full Lorenz system example and noise analysis, see notebooks/equation_discovery_full.ipynb.
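A complete, self-contained sketch of the lab, following the steps of Example 3.5.1 (helper names are illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp

# True Lotka–Volterra system (Example 3.5.1)
alpha, beta, delta, gamma = 1.0, 0.1, 0.075, 1.5

def lotka_volterra(t, s):
    x, y = s
    return [alpha * x - beta * x * y, delta * x * y - gamma * y]

# Step 1: generate data
t = np.linspace(0, 20, 2000)
sol = solve_ivp(lotka_volterra, (0, 20), [10.0, 5.0], t_eval=t,
                rtol=1e-10, atol=1e-10)
X = sol.y.T                                   # (2000, 2)
dt = t[1] - t[0]

# Step 2: fourth-order centred differences (loses 2 rows at each boundary)
def d_dt(u):
    return (-u[4:] + 8 * u[3:-1] - 8 * u[1:-3] + u[:-4]) / (12 * dt)

dX = np.column_stack([d_dt(X[:, k]) for k in range(2)])
x, y = X[2:-2, 0], X[2:-2, 1]                 # interior points, length 1996

# Step 3: library [1, x, y, x^2, xy, y^2]
Theta = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
names = ["1", "x", "y", "x^2", "xy", "y^2"]

# Step 4: Sequential Thresholded Least Squares
def stls(Theta, dX, tau=0.05, max_iter=10):
    Xi, *_ = np.linalg.lstsq(Theta, dX, rcond=None)
    for _ in range(max_iter):
        small = np.abs(Xi) < tau
        Xi[small] = 0.0
        for k in range(dX.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                Xi[keep, k], *_ = np.linalg.lstsq(Theta[:, keep], dX[:, k],
                                                  rcond=None)
        if np.array_equal(small, np.abs(Xi) < tau):   # pattern stabilised
            break
    return Xi

Xi = stls(Theta, dX, tau=0.05)
for k, lhs in enumerate(["dx/dt", "dy/dt"]):
    eqn = " ".join(f"{Xi[j, k]:+.4f}*{names[j]}"
                   for j in range(6) if Xi[j, k] != 0)
    print(f"{lhs} = {eqn}")
```

With a tight integrator tolerance the only error sources are finite-difference truncation, so the discovered coefficients match the true values to several decimal places.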

Agent Lens: The Equation Discovery Agent


Reinforcement Learning Analogy
| RL Concept | Equation Discovery Counterpart |
|---|---|
| State $s_t$ | Current candidate equations, library of functions, available data, noise estimate, residual errors |
| Action $a_t$ | Choose candidate functions for library, select threshold $\tau$, decide whether to collect more data, apply preprocessing (smoothing, subsampling) |
| Reward $r_t$ | Parsimony (fewer terms) + accuracy (low residual) + predictive power (forward simulation matches held-out data) |
| Policy $\pi_\theta$ | Bayesian model selection (BIC/AIC), cross-validation for $\tau$, active experimentation (perturb system for informative data) |
| Episode | One complete SINDy run: library construction, STLS, validation |
| Environment | The unknown dynamical system from which data is measured |

Why this perspective is powerful

Active experimentation. A passive agent uses available data. An active agent designs experiments—choosing initial conditions and perturbations that maximise distinguishability between candidate models. If two candidate equations produce identical predictions in the current regime, the agent perturbs the system into a regime where they differ.

Model selection as exploration–exploitation. Exploitation: commit to the current best model. Exploration: test alternatives (different terms, thresholds). AIC/BIC and cross-validation balance this trade-off.

Hierarchical discovery. Low level: run SINDy with fixed library and $\tau$. Mid level: select library composition and $\tau$. High level: decide whether to collect more data, switch to symbolic regression, or declare the equation discovered. This mirrors how human scientists approach equation discovery.

Automated scientific discovery dates to BACON (Langley et al., 1987), which rediscovered Kepler's third law from numerical data. Modern SINDy and symbolic regression continue this tradition with vastly more powerful algorithms.

Exercises


Analytical Exercises

Exercise 3.5.1 (Library Size)

Show that a polynomial library up to degree $d$ in $n$ variables has $\binom{n+d}{d}$ columns. Compute for $n=3$, $d=3$.

The total number of monomials up to degree $d$ in $n$ variables is $\sum_{k=0}^{d}\binom{n+k-1}{k} = \binom{n+d}{d}$. For $n=3$, $d=3$: $\binom{6}{3}=20$ columns (1 constant + 3 linear + 6 quadratic + 10 cubic).
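A quick check of the counting identity with Python's `math.comb`:

```python
from math import comb

n, d = 3, 3
by_degree = [comb(n + k - 1, k) for k in range(d + 1)]
print(by_degree, "->", sum(by_degree))   # [1, 3, 6, 10] -> 20
assert sum(by_degree) == comb(n + d, d) == 20
```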

Exercise 3.5.2 (Mutual Coherence)

Compute $\mu(\Theta)$ for $\Theta=[1,\,x,\,x^2]$ with $x_i=i/m$, $m=100$. Which column pair has highest coherence?

After normalising columns: $\langle\hat{\mathbf{1}},\hat{x}\rangle\approx 0.866$, $\langle\hat{\mathbf{1}},\hat{x}^2\rangle\approx 0.745$, $\langle\hat{x},\hat{x}^2\rangle\approx 0.968$. So $\mu\approx 0.968$, from the $(x,x^2)$ pair. This high coherence illustrates why polynomial libraries make sparse recovery harder. Centring and normalising the data reduces coherence.
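These figures can be checked numerically (the sampling $x_i = i/m$ for $i = 1, \ldots, m$ is assumed; the discrete values agree with the quoted approximations to two decimals):

```python
import numpy as np

m = 100
x = np.arange(1, m + 1) / m                       # x_i = i/m, i = 1..m
Theta = np.column_stack([np.ones(m), x, x**2])
That = Theta / np.linalg.norm(Theta, axis=0)      # unit-norm columns
G = np.abs(That.T @ That)                         # Gram matrix of normalised columns
np.fill_diagonal(G, 0.0)
print("pairwise coherences:", np.round(G[np.triu_indices(3, 1)], 3))
print("mu =", round(G.max(), 3))
```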

Exercise 3.5.3 (Noise Amplification)

With measured $\tilde{x}_i = x_i + \epsilon_i$ ($\epsilon_i$ i.i.d., variance $\sigma^2$) and first-order differences $\dot{\tilde{x}}_i=(\tilde{x}_{i+1}-\tilde{x}_i)/\Delta t$, show the derivative noise variance is $2\sigma^2/(\Delta t)^2$.

$\dot{\tilde{x}}_i = \dot{x}_i + (\epsilon_{i+1}-\epsilon_i)/\Delta t$. Since $\epsilon_{i+1},\epsilon_i$ are independent: $\text{Var}((\epsilon_{i+1}-\epsilon_i)/\Delta t) = 2\sigma^2/(\Delta t)^2$. For $\Delta t=0.01$, $\sigma=0.01$: derivative noise std $=\sqrt{2}\approx 1.41$, which can dominate the signal.
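A Monte Carlo check of the $2\sigma^2/(\Delta t)^2$ variance formula, differencing pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, sigma, m = 0.01, 0.01, 200_000
eps = sigma * rng.standard_normal(m)

d_eps = np.diff(eps) / dt          # first-order difference of pure noise
print("empirical std:", d_eps.std())
print("predicted std:", np.sqrt(2) * sigma / dt)
```

Both values come out near $\sqrt{2} \approx 1.41$, confirming the amplification.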

Exercise 3.5.4 (BIC Model Selection)

With $m=1996$ points, compute BIC $= m\ln(\text{RSS}/m)+k\ln(m)$ for: (a) 2-term model ($k=2$, RSS $=6.8\times10^{-6}$) and (b) 5-term model ($k=5$, RSS $=4.2\times10^{-6}$). Which does BIC prefer?

BIC$_2 = 1996\ln(3.41\times10^{-9})+2(7.60)=-38{,}907$. BIC$_5 = 1996\ln(2.10\times10^{-9})+5(7.60)=-39{,}842$. BIC$_5$ is slightly lower, but the 3 extra terms have coefficients $<0.001$ (noise artefacts). Cross-validation alongside BIC robustly selects the 2-term model.
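The two BIC values can be reproduced directly (small differences from the quoted figures come from intermediate rounding in the hand calculation):

```python
import numpy as np

def bic(m, rss, k):
    """BIC = m*ln(RSS/m) + k*ln(m), as defined in Exercise 3.5.4."""
    return m * np.log(rss / m) + k * np.log(m)

m = 1996
bic2 = bic(m, 6.8e-6, 2)
bic5 = bic(m, 4.2e-6, 5)
print(f"BIC(2-term) = {bic2:,.0f}")
print(f"BIC(5-term) = {bic5:,.0f}")
```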

Computational Exercises

Exercise 3.5.5 (SINDy for Harmonic Oscillator)

Apply SINDy to $\dot{x}=v$, $\dot{v}=-\omega^2 x$ with $\omega=2$, $(x_0,v_0)=(1,0)$, $t\in[0,10]$, library $[1,x,v,x^2,xv,v^2]$, $\tau=0.1$. Report discovered equations and coefficient errors.

Exercise 3.5.6 (Threshold Sweep)

Sweep $\tau\in[0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]$ on Lotka–Volterra data. Record non-zero terms and simulation error for each. Plot the Pareto front and identify the optimal $\tau$.

Exercise 3.5.7 (Noise Robustness)

Add noise $\sigma\in[0, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1]$ (fraction of signal std). Run SINDy with and without Savitzky–Golay smoothing (window 21, order 3). Report coefficient error and whether the correct sparsity pattern is recovered.

Exercise 3.5.8 (Lorenz System)

Discover the Lorenz system ($\sigma=10$, $\rho=28$, $\beta=8/3$) using degree-2 polynomials (20 columns for 3 variables). Apply SINDy with $\tau=0.1$ and verify correct recovery.

Agentic Exercises

Exercise 3.5.9 (Active Experiment Agent)

Design an agent that discovers Lotka–Volterra using minimum data. The agent chooses initial conditions for each experiment: (1) Run SINDy. (2) Identify ambiguous terms (coefficients near $\tau$). (3) Choose $(x_0,y_0)$ maximising Fisher information for ambiguous terms. (4) Collect data and append. (5) Repeat until stable for 3 iterations. Compare total data vs. a single trajectory.

Exercise 3.5.10 (Cross-Validation Model Selection)

Implement an agent that selects $\tau$ via 5-fold temporal cross-validation: for each $\tau\in[0.001,0.005,0.01,0.02,0.05,0.1,0.2,0.5]$, train SINDy on 4 folds, validate by forward simulation on the held-out fold. Select minimum-error $\tau$, re-train on all data. Test on 2% noisy Lotka–Volterra data.

Assessment


Concept Quiz

Q1. What does each column of the candidate library $\Theta$ represent?

A candidate nonlinear function (e.g., $1$, $x$, $xy$, $y^2$) evaluated at all $m$ measurement times. Sparse regression selects which functions actually appear in the governing equation.

Q2. Why does STLS iterate rather than threshold once?

Removing terms changes the least-squares fit, potentially revealing additional small coefficients. Iteration continues until the sparsity pattern stabilises, ensuring self-consistency.

Q3. What happens if the true dynamics involve $\sin(x)$ but the library contains only polynomials?

SINDy returns a polynomial approximation ($x - x^3/6 + \cdots$), inaccurate outside the training data range. SINDy can only discover terms present in its library.

Q4. How does mutual coherence $\mu(\Theta)$ affect sparse recovery?

Higher $\mu$ means columns are more similar, making it harder to distinguish active from inactive terms. Theorem 3.5.1 requires $s < \frac{1}{2}(1+1/\mu)$; near $\mu=1$, only $s=1$ can be recovered.

Q5. Why is numerical differentiation problematic for noisy data?

Derivative noise variance scales as $O(\sigma^2/(\Delta t)^2)$. Small $\Delta t$ (needed for accuracy) amplifies noise severely. Smoothing or integral formulations (weak SINDy) mitigate this.

Q6. What is the key difference between SINDy and symbolic regression?

SINDy searches a fixed, pre-specified library via sparse regression. Symbolic regression searches the entire space of mathematical expressions via combinatorial optimisation. SINDy is faster with a good library; symbolic regression is more flexible but computationally expensive.

Q7. How many data points are needed for a 3-term equation from 20 candidates (Theorem 3.5.3)?

Theoretical lower bound: $m=\Omega(3\log(20/3))\approx 6$. In practice, noise requires $m\geq 30$–$300$ for reliable recovery.

Q8. What role does the Pareto front play in equation discovery?

It displays the accuracy–complexity trade-off. Models on the front are non-dominated. A sharp “knee” typically indicates the true equation complexity.

Q9. How does cross-validation select $\tau$?

Split data into temporal folds. For each $\tau$, train SINDy on training folds, forward-simulate on validation fold, measure prediction error. Select $\tau$ with lowest average error, avoiding overfitting (small $\tau$) and underfitting (large $\tau$).

Q10. Why is forward simulation a stronger test than residual error?

Residual error measures instantaneous fit. Forward simulation integrates over time, compounding errors. A model with low residual but wrong dynamics diverges during simulation. It tests actual long-term behaviour.

Mini-Project: Discover the Governing Equations

Scenario: Mystery time-series from a damped oscillator $y''+by'+ky=0$ with unknown $b,k$. Position $y(t)$ at 500 points over $t\in[0,15]$.

Tasks: (1) Convert to first-order system: $\dot{y}_1=y_2$, $\dot{y}_2=-ky_1-by_2$. (2) Estimate velocity via finite differences. (3) Build library $[1,y_1,y_2,y_1^2,y_1 y_2,y_2^2]$, apply SINDy. (4) Report $b$ and $k$. (5) Validate by forward simulation. (6) Compare SINDy with direct least squares on $\ddot{y}=-b\dot{y}-ky$.

Test data: $b=0.3$, $k=4.0$, $(y_0,\dot{y}_0)=(2.0,0)$.
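A starter sketch for generating the test series from these parameters (function and variable names are illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp

b, k = 0.3, 4.0                       # test-data parameters from the brief

def damped_osc(t, s):
    y1, y2 = s                        # y1 = position, y2 = velocity
    return [y2, -k * y1 - b * y2]

t = np.linspace(0, 15, 500)
sol = solve_ivp(damped_osc, (0, 15), [2.0, 0.0], t_eval=t,
                rtol=1e-10, atol=1e-10)
y = sol.y[0]                          # position only, as in the scenario
print("samples:", y.shape[0], " y(0) =", y[0])
```

The discovery pipeline then sees only `t` and `y`; velocity must be estimated numerically in Task 2.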

| Criterion | Excellent | Good | Needs Work |
|---|---|---|---|
| Equation form | Both correct, no spurious terms | Correct with 1–2 small spurious terms | Incorrect or missing terms |
| Coefficient accuracy | Relative error $<1\%$ | $<5\%$ | $>5\%$ |
| Forward simulation | Matches full $[0,15]$ | Matches $[0,10]$, drifts later | Diverges within $[0,5]$ |
| Method comparison | Quantitative with error analysis | Qualitative comparison | No comparison |
| Code quality | Clean, commented, modular | Functional with comments | Unclear or incomplete |

References & Next Steps


Key References
  • Brunton, Proctor & Kutz (2016). “Discovering governing equations from data by sparse identification of nonlinear dynamical systems.” PNAS 113(15), 3932–3937.
  • Champion, Lusch, Kutz & Brunton (2019). “Data-driven discovery of coordinates and governing equations.” PNAS 116(45), 22445–22451.
  • Udrescu & Tegmark (2020). “AI Feynman: A physics-inspired method for symbolic regression.” Science Advances 6(16), eaay2631.
  • Schmidt & Lipson (2009). “Distilling free-form natural laws from experimental data.” Science 324(5923), 81–85.
  • Candès & Tao (2005). “Decoding by linear programming.” IEEE Trans. Info. Theory 51(12), 4203–4215.
Further Resources
  • Notebook: notebooks/equation_discovery_full.ipynb — Lorenz system, noise robustness, Savitzky–Golay preprocessing, symbolic regression comparison.
  • PySINDy: Open-source SINDy implementation with constrained regression, ensemble methods, and weak formulation (pysindy.readthedocs.io).
  • PySR: Symbolic regression via multi-population evolutionary search (astroautomata.com/PySR).