Replace hand-crafted dynamics with learned vector fields — parameterize $\frac{d\mathbf{y}}{dt} = f_\theta(t, \mathbf{y})$ with a neural network and solve with standard ODE integrators.
Classical differential equations require an explicit formula for the right-hand side: $\dot{y} = -ky$, $\dot{y} = \sin(y) - y^3$, and so on. But what if the governing law is unknown—buried in experimental data from a biological system, a financial market, or a robotic controller? Neural Ordinary Differential Equations (Neural ODEs) answer this question by replacing the hand-crafted right-hand side with a neural network $f_\theta$, then solving the resulting initial-value problem with an off-the-shelf ODE solver.
Introduced by Chen et al. (NeurIPS 2018), Neural ODEs have opened entirely new avenues:
This module develops the mathematical framework, derives the adjoint method from first principles, and provides a complete NumPy implementation of a toy Neural ODE so you can see every gradient flow by hand.
After completing this module you will be able to:
A Neural Ordinary Differential Equation is an initial-value problem

$$\frac{d\mathbf{y}}{dt} = f_\theta(t, \mathbf{y}(t)), \qquad \mathbf{y}(t_0) = \mathbf{y}_0,$$
where $f_\theta : \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$ is a neural network with parameters $\theta \in \mathbb{R}^p$, and $\mathbf{y}(t) \in \mathbb{R}^d$ is the state vector. The solution at any target time $t_1$ is obtained by numerical integration:

$$\mathbf{y}(t_1) = \mathbf{y}(t_0) + \int_{t_0}^{t_1} f_\theta(t, \mathbf{y}(t))\, dt = \operatorname{ODESolve}(\mathbf{y}(t_0), f_\theta, t_0, t_1).$$
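In code, $\operatorname{ODESolve}$ is simply a numerical integrator. A minimal fixed-step Euler sketch (adaptive solvers such as RK45 are what is used in practice; the function name and step count here are illustrative):

```python
import numpy as np

def odesolve_euler(f, y0, t0, t1, n_steps=100):
    """Minimal fixed-step Euler ODESolve: integrates dy/dt = f(t, y) from t0 to t1."""
    y, t = np.asarray(y0, dtype=float), t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        y = y + dt * f(t, y)   # one forward Euler step
        t += dt
    return y

# Example with a known field standing in for f_theta: dy/dt = -y, y(0) = 1.
approx = odesolve_euler(lambda t, y: -y, [1.0], 0.0, 1.0, n_steps=1000)
print(approx)   # close to exp(-1) ~ 0.3679
```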
Given a scalar loss $L(\mathbf{y}(t_1))$, the adjoint state is defined as

$$\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{y}(t)}.$$
The adjoint evolves backward in time according to the adjoint ODE (see Theorem 3.1.1 below).
A continuous-depth network replaces a sequence of discrete residual layers

$$\mathbf{h}_{k+1} = \mathbf{h}_k + \Delta t\, f_\theta(\mathbf{h}_k)$$
with the limit as step size $\Delta t \to 0$, recovering the ODE $\dot{\mathbf{h}} = f_\theta(\mathbf{h})$. The “depth” is now a continuous variable (integration time) rather than a layer index.
A Latent ODE model consists of:
- an encoder (typically an RNN run over the observations) that infers an approximate posterior over the initial latent state $\mathbf{z}(t_0)$;
- a Neural ODE that evolves the latent state $\mathbf{z}(t)$ continuously through time;
- a decoder that maps each latent state $\mathbf{z}(t_i)$ back to observation space.
Training uses the evidence lower bound (ELBO), enabling principled handling of irregularly sampled and partially observed data.
An Augmented Neural ODE extends the state space by appending extra dimensions:

$$\frac{d}{dt}\begin{bmatrix}\mathbf{y}(t)\\ \mathbf{z}(t)\end{bmatrix} = f_\theta\!\left(t, \begin{bmatrix}\mathbf{y}(t)\\ \mathbf{z}(t)\end{bmatrix}\right), \qquad \begin{bmatrix}\mathbf{y}(t_0)\\ \mathbf{z}(t_0)\end{bmatrix} = \begin{bmatrix}\mathbf{y}_0\\ \mathbf{0}\end{bmatrix},$$

where $\mathbf{z}(t) \in \mathbb{R}^q$ holds the augmented coordinates, initialised to zero.
This overcomes a limitation of standard Neural ODEs: because ODE flows are homeomorphisms (continuous bijections), a Neural ODE in $\mathbb{R}^d$ cannot represent mappings that change topology (e.g., mapping a single connected component to two). Augmentation lifts the dynamics into a higher-dimensional space where such mappings become possible.
An alternative to the adjoint method: treat each step of the numerical solver as a layer in a computational graph and apply standard backpropagation. Memory cost is $O(N)$ where $N$ is the number of solver steps, but it produces exact gradients of the discretised objective (as opposed to the continuous adjoint, which solves a different ODE backward).
(Pontryagin, 1962; Chen et al., 2018)
Let $\mathbf{y}(t)$ solve the Neural ODE $\dot{\mathbf{y}} = f_\theta(t, \mathbf{y})$ on $[t_0, t_1]$, and let $L = L(\mathbf{y}(t_1))$ be a scalar loss. Define the adjoint $\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{y}(t)}$. Then:
(a) The adjoint satisfies the ODE solved backward from $t_1$ to $t_0$:

$$\frac{d\mathbf{a}}{dt} = -\mathbf{a}(t)^\top \frac{\partial f_\theta}{\partial \mathbf{y}}(t, \mathbf{y}(t)).$$
(b) The gradient with respect to parameters is:

$$\frac{dL}{d\theta} = -\int_{t_1}^{t_0} \mathbf{a}(t)^\top \frac{\partial f_\theta}{\partial \theta}(t, \mathbf{y}(t))\, dt.$$
Memory advantage: The adjoint method requires $O(1)$ memory in the number of solver steps (only the current state and adjoint need be stored), compared to $O(N)$ for backpropagation through all $N$ solver steps.
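The $O(1)$-memory claim can be made concrete on a scalar example. The sketch below uses the linear field $f(y) = w\,y$ with loss $L = \frac{1}{2}y(T)^2$ (an illustrative choice, since then the exact gradient has the closed form $T\,y(T)^2$): the backward pass integrates the state, adjoint, and parameter-gradient accumulator together, so nothing from the forward pass is stored except the final state.

```python
import numpy as np

# Adjoint-method gradient for dy/dt = w*y with L = 0.5*y(T)^2.
w, y0, T, N = -0.5, 2.0, 1.0, 2000
dt = T / N

# Forward pass: keep only the final state (O(1) memory).
y = y0
for _ in range(N):
    y = y + dt * w * y

# Backward pass: integrate (y, a, g) jointly from t = T down to t = 0.
a = y        # a(T) = dL/dy(T) = y(T)
g = 0.0      # accumulates dL/dw = integral of a * (df/dw) dt, with df/dw = y
for _ in range(N):
    g = g + dt * a * y      # accumulate the parameter gradient
    a = a + dt * a * w      # da/dt = -a * (df/dy) = -a*w, stepped backward
    y = y - dt * w * y      # rewind the state instead of storing it

analytic = T * (y0 * np.exp(w * T)) ** 2   # closed form: T * y(T)^2
print(g, analytic)                          # agree up to O(dt) solver error
```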
Let $g : [0,T] \times \mathbb{R}^d \to \mathbb{R}^d$ be a continuous vector field. For any $\varepsilon > 0$ and compact set $K \subset \mathbb{R}^d$, there exists a feedforward neural network $f_\theta$ with a single hidden layer and finitely many neurons such that

$$\sup_{(t,\mathbf{y}) \in [0,T] \times K} \left\| f_\theta(t, \mathbf{y}) - g(t, \mathbf{y}) \right\| < \varepsilon.$$
Consequently, if $g$ is additionally Lipschitz in $\mathbf{y}$ (so that its flow is well defined), the Neural ODE $\dot{\mathbf{y}} = f_\theta(t,\mathbf{y})$ can approximate the flow of $\dot{\mathbf{y}} = g(t,\mathbf{y})$ to arbitrary accuracy on compact time-state domains.
This follows from the classical Universal Approximation Theorem (Cybenko 1989, Hornik 1991) applied to vector-valued functions on $[0,T] \times K$.
A Residual Network with $N$ layers of the form $\mathbf{h}_{k+1} = \mathbf{h}_k + \frac{T}{N} f_\theta(\mathbf{h}_k)$ is exactly the forward Euler discretisation of the ODE $\dot{\mathbf{h}} = f_\theta(\mathbf{h})$ on $[0, T]$ with step size $\Delta t = T/N$. In the limit $N \to \infty$ (infinite depth), the ResNet converges to the Neural ODE solution.
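Both the equivalence and the $O(1/N)$ global error can be checked numerically. The sketch below uses a field with a known flow ($\dot{h} = -h$, so $h(1) = h(0)\,e^{-1}$, an illustrative stand-in for $f_\theta$):

```python
import numpy as np

def f(h):
    # Stand-in for f_theta with known flow: dh/dt = -h  =>  h(1) = h(0) * exp(-1)
    return -h

def resnet_forward(h0, N, T=1.0):
    # N residual layers h_{k+1} = h_k + (T/N) f(h_k), i.e. forward Euler with dt = T/N
    h = h0
    for _ in range(N):
        h = h + (T / N) * f(h)
    return h

exact = np.exp(-1.0)
errors = {N: abs(resnet_forward(1.0, N) - exact) for N in (10, 20, 40)}
print(errors)   # the error roughly halves each time N doubles: global error O(1/N)
```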
Consider a 1D Neural ODE with a single-neuron hidden layer:

$$f_\theta(y) = w_2 \tanh(w_1 y + b_1) + b_2,$$
with parameters $\theta = (w_1, b_1, w_2, b_2) = (1.5, 0, -2, 0)$ and initial condition $y(0)=1$. We solve $\dot{y} = f_\theta(y)$ on $[0, 0.5]$ using Euler's method with $\Delta t = 0.25$.
$f_\theta(y_0) = -2\,\tanh(1.5 \cdot 1) = -2\,\tanh(1.5) \approx -2 \times 0.9051 = -1.8102$
$y_1 = y_0 + \Delta t \cdot f_\theta(y_0) = 1 + 0.25 \times (-1.8102) = 1 - 0.4526 = 0.5474$
$f_\theta(y_1) = -2\,\tanh(1.5 \times 0.5474) = -2\,\tanh(0.8211) \approx -2 \times 0.6756 = -1.3513$
$y_2 = 0.5474 + 0.25 \times (-1.3513) = 0.5474 - 0.3378 = 0.2096$
With these parameters, the network has learned a vector field that drives $y$ toward zero. Near the origin, $\tanh(w_1 y) \approx w_1 y$, so $f_\theta(y) \approx w_2 w_1 y = -3y$; for large $|y|$, $\tanh$ saturates and $|f_\theta|$ is capped at $|w_2| = 2$, so the decay there is slower than linear. For comparison, the exact solution of $\dot{y} = -2y$ gives $y(0.5) = e^{-1} \approx 0.3679$, while accurate integration of the learned field gives $y(0.5) \approx 0.306$. The remaining gap to our estimate $0.2096$ is discretisation error: two Euler steps of size $\Delta t = 0.25$ are far too coarse for an accurate solution.
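The two hand-computed steps, and the split between model mismatch and discretisation error, can be checked numerically (a sketch; the fine step $\Delta t = 10^{-4}$ stands in for the exact flow of the learned field):

```python
import numpy as np

def f(y):
    # The hand-worked vector field: f_theta(y) = -2 * tanh(1.5 * y)
    return -2.0 * np.tanh(1.5 * y)

def euler(y0, t_end, dt):
    y, n = y0, int(round(t_end / dt))
    for _ in range(n):
        y = y + dt * f(y)
    return y

coarse = euler(1.0, 0.5, 0.25)   # the two hand steps -> ~0.2096
fine = euler(1.0, 0.5, 1e-4)     # near-exact solution of the learned field -> ~0.306
print(coarse, fine)              # compare both against exp(-1) ~ 0.3679
```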
Continuing Example 3.1.1, suppose we observe $y^*(0.5) = 0.37$ and use $L = \frac{1}{2}(y(0.5) - y^*)^2$. We compute $\frac{dL}{dw_2}$ via the adjoint method.
$\mathbf{a}(t_1) = \frac{\partial L}{\partial y(0.5)} = y(0.5) - y^* = 0.2096 - 0.37 = -0.1604$
The adjoint ODE in 1D is $\frac{da}{dt} = -a \cdot \frac{\partial f_\theta}{\partial y}$, with $\frac{\partial f_\theta}{\partial y} = w_2 w_1 \operatorname{sech}^2(w_1 y) = -3\,(1 - \tanh^2(1.5y))$. Along the trajectory:

$\frac{\partial f}{\partial y}\big|_{y_1} = -3\,(1 - 0.6756^2) \approx -1.6307, \qquad \frac{\partial f}{\partial y}\big|_{y_0} = -3\,(1 - 0.9051^2) \approx -0.5424$

Euler steps backward in time from $t=0.5$ to $t=0$ (each step uses the Jacobian at the left endpoint of the corresponding forward step, so it coincides with exact backpropagation through that Euler step):

$a_1 = a_2\left(1 + \Delta t \cdot \frac{\partial f}{\partial y}\big|_{y_1}\right) = -0.1604 \times (1 + 0.25 \times (-1.6307)) = -0.1604 \times 0.5923 = -0.0950$

$a_0 = a_1\left(1 + \Delta t \cdot \frac{\partial f}{\partial y}\big|_{y_0}\right) = -0.0950 \times (1 + 0.25 \times (-0.5424)) = -0.0950 \times 0.8644 = -0.0821$

The parameter sensitivity is $\frac{\partial f_\theta}{\partial w_2} = \tanh(w_1 y)$:

$\frac{\partial f_\theta}{\partial w_2}\big|_{y_0} = \tanh(1.5) \approx 0.9051, \qquad \frac{\partial f_\theta}{\partial w_2}\big|_{y_1} = \tanh(0.8211) \approx 0.6756, \qquad \frac{\partial f_\theta}{\partial w_2}\big|_{y_2} = \tanh(0.3144) \approx 0.3044$

Using the trapezoidal rule to approximate $\frac{dL}{dw_2} = \int_0^{0.5} a(t)\, \frac{\partial f_\theta}{\partial w_2}\Big|_{y(t)}\, dt$ with nodes $t = 0, 0.25, 0.5$:

$\frac{dL}{dw_2} \approx 0.125\,\big[(-0.0821)(0.9051) + (-0.0950)(0.6756)\big] + 0.125\,\big[(-0.0950)(0.6756) + (-0.1604)(0.3044)\big] \approx -0.0173 - 0.0141 = -0.0314$

(With only two coarse solver steps this is a rough estimate; exact backpropagation through the discretised two-step objective gives $\approx -0.0486$. Both agree on the sign.)

This negative gradient tells us that increasing $w_2$ (from $-2$ toward $0$) would decrease the loss, which makes sense: $y(0.5) \approx 0.2096$ undershoots the target $0.37$, and a less negative $w_2$ weakens the decay, leaving $y(0.5)$ larger and closer to the target.
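As a cross-check on the sign of the gradient, one can differentiate the discretised two-step objective directly with central finite differences (a minimal sketch of the same toy setup):

```python
import numpy as np

def f(y, w1=1.5, w2=-2.0):
    # Toy vector field f_theta(y) = w2 * tanh(w1 * y)   (b1 = b2 = 0)
    return w2 * np.tanh(w1 * y)

def y_final(w2, y0=1.0, dt=0.25, n_steps=2):
    # Two forward Euler steps on [0, 0.5], as in the worked example
    y = y0
    for _ in range(n_steps):
        y = y + dt * f(y, w2=w2)
    return y

y_end = y_final(-2.0)             # matches the hand computation, ~0.2096
target = 0.37
loss = lambda w2: 0.5 * (y_final(w2) - target) ** 2

eps = 1e-6
grad_fd = (loss(-2.0 + eps) - loss(-2.0 - eps)) / (2 * eps)
print(y_end, grad_fd)             # the gradient dL/dw2 is negative
```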
Below is a complete, runnable implementation of a Neural ODE that learns the dynamics $\dot{y} = -2y$ from noisy trajectory data. The “network” is a one-hidden-layer MLP with $H=8$ hidden units and $\tanh$ activation. Gradients are estimated via finite differences.
Constraint note: This runs in-browser using only NumPy.
For a full PyTorch implementation with torchdiffeq,
see the companion notebook.
Click "Run Code" to execute. Expected output: loss decreasing from ~0.1 to ~0.001, and learned vector field values close to -2y at each test point.
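For reference, here is a minimal sketch consistent with the description above (one hidden layer, $H = 8$, $\tanh$ activation, finite-difference gradients). The data generation, initial scales, learning rate, and epoch count are illustrative stand-ins for the actual lab code:

```python
import numpy as np

np.random.seed(0)
H = 8
params = [np.random.randn(H),        # W1: input -> hidden weights
          np.zeros(H),               # b1
          0.1 * np.random.randn(H),  # W2: hidden -> output weights
          np.zeros(1)]               # b2

def f_net(y, params):
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 * y + b1) + b2[0]

def rollout(params, y0=1.0, dt=0.05, n=20):
    ys, y = [y0], y0
    for _ in range(n):
        y = y + dt * f_net(y, params)   # forward Euler on [0, 1]
        ys.append(y)
    return np.array(ys)

t = np.linspace(0.0, 1.0, 21)
data = np.exp(-2.0 * t) + 0.01 * np.random.randn(t.size)   # noisy dy/dt = -2y data

def loss(params):
    return np.mean((rollout(params) - data) ** 2)

lr, eps = 0.1, 1e-5
first = loss(params)
for epoch in range(300):
    for p in params:                    # central finite-difference gradient
        g = np.zeros_like(p)
        for i in range(p.size):
            old = p[i]
            p[i] = old + eps; lp = loss(params)
            p[i] = old - eps; lm = loss(params)
            p[i] = old
            g[i] = (lp - lm) / (2 * eps)
        p -= lr * g
print(f"loss: {first:.4f} -> {loss(params):.4f}")
```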
A Neural ODE can be viewed through the lens of a continuous-time agent operating in a dynamical environment. The correspondence is:
| RL Concept | Neural ODE Counterpart |
|---|---|
| State $s_t$ | Hidden representation $\mathbf{y}(t)$ |
| Action $a_t$ | Instantaneous velocity $f_\theta(t, \mathbf{y}(t))$, applied continuously |
| Policy $\pi_\theta(a|s)$ | The neural network $f_\theta$ mapping state to action (deterministic) |
| Environment dynamics | $\dot{\mathbf{y}} = f_\theta(t, \mathbf{y})$ (the agent is the dynamics) |
| Reward / Return | $-L(\mathbf{y}(t_1))$: negative of reconstruction or prediction loss |
| Policy gradient $\nabla_\theta J$ | Adjoint gradient $\frac{dL}{d\theta}$, computed by solving the adjoint ODE backward |
| Episode | One forward integration from $t_0$ to $t_1$ |
Continuous-time policy gradient. The adjoint method can be read as a deterministic, continuous-time analogue of the REINFORCE policy gradient. In discrete RL, the policy gradient theorem gives

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],$$
where $G_t$ is the return-to-go. In the Neural ODE setting, the sum becomes an integral, the stochastic policy becomes deterministic, and the role of the return-to-go is played by the adjoint state $\mathbf{a}(t)$. The adjoint ODE propagates “credit” backward through continuous time, much as $G_t$ propagates reward backward through discrete time steps.
Adaptive computation as adaptive planning. An adaptive ODE solver (e.g., Dormand–Prince) takes smaller steps where the dynamics are complex and larger steps where they are simple. This is analogous to an agent spending more “thinking time” on difficult decisions—a form of adaptive computation depth that emerges naturally from the numerical method.
Latent ODE as a world model. In model-based RL, an agent learns a “world model” to predict future states. A Latent ODE serves exactly this purpose: given partial observations, it infers the latent state and predicts forward trajectories. The encoder is the perception module, the Neural ODE is the transition model, and the decoder maps predicted states back to observations.
Show that the ResNet update $\mathbf{h}_{k+1} = \mathbf{h}_k + \frac{1}{N} f_\theta(\mathbf{h}_k)$ with $N$ layers is the forward Euler discretisation of $\dot{\mathbf{h}} = f_\theta(\mathbf{h})$ on $[0, 1]$ with step size $\Delta t = 1/N$. What is the local truncation error?
Forward Euler for $\dot{\mathbf{h}} = f_\theta(\mathbf{h})$ with step size $\Delta t$:

$$\mathbf{h}_{k+1} = \mathbf{h}_k + \Delta t\, f_\theta(\mathbf{h}_k).$$
Setting $\Delta t = 1/N$ and $t_k = k/N$, we get $\mathbf{h}_{k+1} = \mathbf{h}_k + \frac{1}{N} f_\theta(\mathbf{h}_k)$, which is exactly the ResNet update.
The local truncation error of forward Euler is $O(\Delta t^2) = O(1/N^2)$. The global error after $N$ steps is $O(\Delta t) = O(1/N)$. As $N \to \infty$, the ResNet converges to the ODE solution (assuming $f_\theta$ is Lipschitz).
For the scalar ODE $\dot{y} = -\alpha y$ with $y(0) = y_0$ and loss $L = \frac{1}{2}(y(T) - y^*)^2$: (a) write down the adjoint ODE; (b) solve it in closed form; (c) use the adjoint formula to compute $\frac{dL}{d\alpha}$ and verify against direct differentiation.
(a) Here $f(y) = -\alpha y$, so $\frac{\partial f}{\partial y} = -\alpha$. The adjoint ODE is:

$$\frac{da}{dt} = -a \cdot \frac{\partial f}{\partial y} = \alpha\, a, \qquad a(T) = y(T) - y^*.$$
(b) This is a linear ODE with solution $a(t) = a(T) e^{\alpha(t - T)}$, i.e., $a(t) = (y_0 e^{-\alpha T} - y^*) e^{\alpha(t-T)}$.
(c) We have $\frac{\partial f}{\partial \alpha} = -y$, so:

$$\frac{dL}{d\alpha} = \int_0^T a(t)\, \frac{\partial f}{\partial \alpha}\, dt = -\int_0^T a(t)\, y(t)\, dt.$$
Substituting $a(t) = (y_0 e^{-\alpha T} - y^*) e^{\alpha(t-T)}$ and $y(t) = y_0 e^{-\alpha t}$, the integrand is constant in $t$ (since $e^{\alpha(t-T)} e^{-\alpha t} = e^{-\alpha T}$):

$$\frac{dL}{d\alpha} = -\int_0^T (y_0 e^{-\alpha T} - y^*)\, y_0\, e^{-\alpha T}\, dt = -T\, y_0\, e^{-\alpha T}\,(y_0 e^{-\alpha T} - y^*).$$
Verification: $y(T) = y_0 e^{-\alpha T}$, so $L = \frac{1}{2}(y_0 e^{-\alpha T} - y^*)^2$. Direct differentiation: $\frac{dL}{d\alpha} = (y_0 e^{-\alpha T} - y^*) \cdot (-T y_0 e^{-\alpha T})$, which matches the adjoint result exactly.
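The match can also be confirmed numerically against a finite difference in $\alpha$ (the constants below are arbitrary illustrative choices):

```python
import numpy as np

# Check for dy/dt = -alpha*y with L = 0.5*(y(T) - y*)^2:
# adjoint-derived closed form vs. central finite difference in alpha.
alpha, y0, T, y_star = 1.3, 2.0, 1.5, 0.2

def L(a):
    return 0.5 * (y0 * np.exp(-a * T) - y_star) ** 2

# Closed-form gradient from the adjoint derivation above:
adjoint_grad = -T * y0 * np.exp(-alpha * T) * (y0 * np.exp(-alpha * T) - y_star)

eps = 1e-6
fd_grad = (L(alpha + eps) - L(alpha - eps)) / (2 * eps)
print(adjoint_grad, fd_grad)   # the two values agree closely
```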
Prove that a Neural ODE $\dot{\mathbf{y}} = f_\theta(t,\mathbf{y})$ in $\mathbb{R}^2$ with Lipschitz $f_\theta$ cannot map a disk to an annulus. Hint: Use the fact that ODE flows are homeomorphisms and homeomorphisms preserve simple connectedness.
A Neural ODE is solved with an adaptive RK45 solver that takes $N$ steps. The network $f_\theta$ has $p$ parameters and the state dimension is $d$.
(a) Backprop through solver: $O(N \cdot d)$ for storing all intermediate states, plus $O(p)$ for the graph of $f_\theta$ at each step. Total: $O(N(d + p))$.
(b) Adjoint method: $O(d + p)$. Only the current state, adjoint, and accumulated parameter gradient need to be stored. The forward trajectory is recomputed backward.
(c) The adjoint method is preferable when $N$ is large (many solver steps), $d$ is large (high-dimensional state), or memory is the bottleneck. However, if $N$ is small or gradient accuracy is critical (stiff systems), backprop through the solver may be better despite higher memory cost.
Modify the code lab to use a 4th-order Runge-Kutta solver instead of Euler. Compare the trajectory accuracy for the same step size $\Delta t = 0.05$. How many Euler steps would you need to match RK4 accuracy?
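A starting point for this exercise: a classical RK4 step, compared with Euler on the known test problem $\dot{y} = -2y$ (this sketch does not solve the full exercise, but shows the accuracy gap at $\Delta t = 0.05$):

```python
import numpy as np

def f(y):
    return -2.0 * y     # test problem with exact solution y(t) = exp(-2t)

def rk4_step(y, dt):
    # One classical 4th-order Runge-Kutta step for dy/dt = f(y)
    k1 = f(y)
    k2 = f(y + 0.5 * dt * k1)
    k3 = f(y + 0.5 * dt * k2)
    k4 = f(y + dt * k3)
    return y + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

dt, n = 0.05, 20        # integrate on [0, 1]
y_euler = y_rk4 = 1.0
for _ in range(n):
    y_euler = y_euler + dt * f(y_euler)
    y_rk4 = rk4_step(y_rk4, dt)

exact = np.exp(-2.0)
print(abs(y_euler - exact), abs(y_rk4 - exact))   # roughly 1e-2 vs 1e-7
```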
Extend the code lab to learn a 2D dynamical system: the damped harmonic oscillator $\dot{x} = v$, $\dot{v} = -kx - cv$ with $k=1, c=0.5$. Use a 2-input, 2-output MLP. Generate training trajectories from multiple initial conditions and train the Neural ODE.
```python
import numpy as np

np.random.seed(42)

# True system: dx/dt = v, dv/dt = -x - 0.5v
def true_rhs(state):
    x, v = state
    return np.array([v, -x - 0.5 * v])

# Euler solver for 2D: fine internal step dt, recording the state at each
# requested time in t_eval
def euler_2d(f, y0, t_eval, params, dt=0.01):
    traj = [y0.copy()]
    y = y0.copy()
    t = t_eval[0]
    idx = 1
    while idx < len(t_eval):
        y = y + dt * f(y, params)
        t += dt
        if t >= t_eval[idx] - 1e-10:
            traj.append(y.copy())
            idx += 1
    return np.array(traj)

# 2D MLP: R^2 -> R^H -> R^2
H = 16
W1 = np.random.randn(H, 2) * 0.5
b1 = np.zeros(H)
W2 = np.random.randn(2, H) * 0.5
b2 = np.zeros(2)
params = [W1, b1, W2, b2]

def f_net_2d(y, params):
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ y + b1)
    return W2 @ h + b2

# Generate training data from three initial conditions
t_span = np.linspace(0, 4, 50)
ics = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
       np.array([-0.5, 0.5])]
data = []
for ic in ics:
    traj = euler_2d(lambda y, p: true_rhs(y), ic, t_span, None)
    data.append((ic, traj))

# Training via central finite-difference gradients
lr = 0.005
for epoch in range(80):
    total_loss = 0.0
    for ic, traj_true in data:
        traj_pred = euler_2d(f_net_2d, ic, t_span, params)
        total_loss += np.mean((traj_pred - traj_true) ** 2)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}: loss = {total_loss:.6f}")
    # finite-diff gradient update: perturb each parameter entry in turn
    eps = 1e-5
    for i, p in enumerate(params):
        g = np.zeros_like(p)
        it = np.nditer(p, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            old = p[ix]
            p[ix] = old + eps
            lp = sum(np.mean((euler_2d(f_net_2d, ic, t_span, params) - tr) ** 2)
                     for ic, tr in data)
            p[ix] = old - eps
            lm = sum(np.mean((euler_2d(f_net_2d, ic, t_span, params) - tr) ** 2)
                     for ic, tr in data)
            g[ix] = (lp - lm) / (2 * eps)
            p[ix] = old
            it.iternext()
        params[i] -= lr * g
print("Done!")
```
Run the toy Neural ODE code lab with learning rates $\eta \in \{0.001, 0.005, 0.01, 0.05, 0.1\}$. For each, record the final loss after 120 epochs. Plot (or tabulate) loss vs. learning rate. What happens for large $\eta$? Explain in terms of the loss landscape.
For the scalar Neural ODE with $f_\theta(y) = w_2 \tanh(w_1 y + b_1) + b_2$:
Design an “adaptive depth” strategy for the Neural ODE: at each Euler step, the agent decides whether to take another step or to stop early, based on the magnitude of $f_\theta(y)$. Specifically:
Implement a curriculum learning strategy: start by training the Neural ODE on short time horizons ($T=0.5$) and progressively increase to $T=2.0$. Compare against training on the full interval from the start.
Q1. What is the fundamental equation defining a Neural ODE?
Answer: (b)
Q2. The adjoint method computes gradients with what memory complexity in the number of solver steps $N$?
Answer: (d). The adjoint method requires only constant memory in $N$, since it recomputes the forward trajectory during the backward pass.
Q3. A standard (non-augmented) Neural ODE in $\mathbb{R}^d$ defines a flow map that is:
Answer: (b). By the existence-uniqueness theorem, trajectories cannot cross, so the flow is a homeomorphism (in fact a diffeomorphism if $f_\theta$ is smooth).
Q4. The ResNet update $\mathbf{h}_{k+1} = \mathbf{h}_k + \frac{1}{N}f_\theta(\mathbf{h}_k)$ corresponds to which numerical method applied to the Neural ODE?
Answer: (b)
Q5. In the adjoint method, the adjoint state $\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{y}(t)}$ satisfies which equation?
Answer: (b)
Q6. Why might the adjoint method produce less accurate gradients than backpropagation through the solver?
Answer: (b). The adjoint ODE is the continuous limit; numerical errors in both the forward and backward solves can compound, especially for stiff or chaotic systems.
Q7. What advantage do Latent ODE models have over standard RNNs for time-series modelling?
Answer: (b). Since the ODE can be evaluated at any time $t$, it handles variable time gaps naturally, unlike RNNs which assume fixed time steps.
Q8. Augmented Neural ODEs address which limitation of standard Neural ODEs?
Answer: (b)
Q9. In the agent-lens view of a Neural ODE, what corresponds to the RL “policy”?
Answer: (c). The network $f_\theta$ determines what “action” (velocity) to apply at each state, analogous to a deterministic policy.
Q10. For the Universal Approximation Theorem applied to Neural ODEs, which condition is required?
Answer: (b). The classical universal approximation theorem requires only continuity on a compact set; a single hidden layer with sufficiently many neurons suffices.
You are given noisy trajectory data from an unknown 2D dynamical system (which is secretly the Van der Pol oscillator $\dot{x} = y$, $\dot{y} = \mu(1-x^2)y - x$ with $\mu = 1$, but you should treat the dynamics as unknown).
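A sketch of a ground-truth data generator for this project (RK4 integrator; the noise level, time grid, and initial condition are illustrative choices):

```python
import numpy as np

# Van der Pol oscillator with mu = 1: dx/dt = y, dy/dt = mu*(1 - x^2)*y - x
def vdp(state, mu=1.0):
    x, y = state
    return np.array([y, mu * (1.0 - x ** 2) * y - x])

def rk4(f, y0, t_grid):
    # Classical RK4 over a fixed time grid
    ys = [np.asarray(y0, dtype=float)]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h, y = t1 - t0, ys[-1]
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        ys.append(y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(ys)

np.random.seed(1)
t = np.linspace(0.0, 10.0, 201)
traj = rk4(vdp, [2.0, 0.0], t)
noisy = traj + 0.05 * np.random.randn(*traj.shape)   # observation noise
print(traj.shape)   # one trajectory of 201 (x, v) samples
```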
| Criterion | Excellent (A) | Satisfactory (B) | Needs Work (C) |
|---|---|---|---|
| Implementation | Complete, correct NumPy code; RK4 or adaptive solver | Working Euler solver; minor issues | Incomplete or non-functional code |
| Training | Loss converges below 0.01; proper hyperparameter tuning documented | Loss converges below 0.05; default hyperparameters | Loss does not converge or training crashes |
| Evaluation | Quantitative MSE comparison; multiple held-out ICs tested | Single held-out IC; qualitative comparison | No held-out evaluation |
| Phase Portrait | Clear vector field comparison; limit cycle visible | Vector field plotted/tabulated but incomplete comparison | Missing or incorrect |
| Analysis | Insightful discussion of limit cycle capture, extrapolation, and failure modes | Basic discussion | No analysis provided |
torchdiffeq for adaptive solvers and automatic adjoint differentiation.