Module 3.1: Neural Differential Equations

Replace hand-crafted dynamics with learned vector fields — parameterize $\frac{d\mathbf{y}}{dt} = f_\theta(t, \mathbf{y})$ with a neural network and solve with standard ODE integrators.

Prerequisites: Module 1.4 (Numerical Methods for ODEs), Module 1.1 (First-Order ODEs), and a basic understanding of neural networks (forward pass, backpropagation, gradient descent).
Difficulty: Advanced  |  Estimated time: 4–5 hours

Why This Matters


Classical differential equations require an explicit formula for the right-hand side: $\dot{y} = -ky$, $\dot{y} = \sin(y) - y^3$, and so on. But what if the governing law is unknown—buried in experimental data from a biological system, a financial market, or a robotic controller? Neural Ordinary Differential Equations (Neural ODEs) answer this question by replacing the hand-crafted right-hand side with a neural network $f_\theta$, then solving the resulting initial-value problem with an off-the-shelf ODE solver.

Introduced by Chen et al. (NeurIPS 2018), Neural ODEs have opened entirely new avenues:

  • Continuous-depth networks: Instead of stacking discrete residual layers, the hidden state evolves continuously, enabling adaptive computation.
  • Memory-efficient training: The adjoint sensitivity method computes gradients without storing intermediate activations, making it possible to train arbitrarily deep models.
  • Irregularly sampled time series: Latent ODE models handle missing data and variable time gaps naturally, unlike RNNs that assume fixed sampling.
  • Scientific discovery: Fitting a Neural ODE to physical measurements yields a differentiable, continuous-time surrogate model that can be interrogated, extrapolated, or composed with known physics.

This module develops the mathematical framework, derives the adjoint method from first principles, and provides a complete NumPy implementation of a toy Neural ODE so you can see every gradient flow by hand.

Learning Objectives


After completing this module you will be able to:

  1. State the Neural ODE initial-value problem $\frac{d\mathbf{y}}{dt} = f_\theta(t,\mathbf{y})$ and explain how it generalises discrete residual networks.
  2. Implement a forward Euler / RK4 solver to propagate a Neural ODE forward in time.
  3. Derive the adjoint ODE for computing $\frac{dL}{d\theta}$ without storing intermediate states.
  4. Compare memory complexity of backpropagation-through-the-solver vs. the adjoint method.
  5. Train a simple Neural ODE on synthetic trajectory data using NumPy and finite-difference gradients.
  6. Explain the connection between ResNets ($\mathbf{h}_{t+1} = \mathbf{h}_t + f_\theta(\mathbf{h}_t)$) and the Euler discretisation of a Neural ODE.
  7. Describe the Latent ODE framework for modelling irregularly sampled time series.
  8. Identify failure modes: stiff dynamics, non-unique solutions, and numerical instability.
  9. Articulate the universal approximation result for neural-network right-hand sides.
  10. Frame the Neural ODE as a continuous-time agent and connect adjoint gradients to policy optimisation in reinforcement learning.

Core Concepts


Definitions

Definition 3.1.1 — Neural ODE

A Neural Ordinary Differential Equation is an initial-value problem

$$\frac{d\mathbf{y}}{dt} = f_\theta(t, \mathbf{y}(t)), \qquad \mathbf{y}(t_0) = \mathbf{y}_0,$$

where $f_\theta : \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$ is a neural network with parameters $\theta \in \mathbb{R}^p$, and $\mathbf{y}(t) \in \mathbb{R}^d$ is the state vector. The solution at any target time $t_1$ is obtained by numerical integration:

$$\mathbf{y}(t_1) = \mathbf{y}(t_0) + \int_{t_0}^{t_1} f_\theta(t, \mathbf{y}(t))\,dt.$$

Definition 3.1.2 — Adjoint State

Given a scalar loss $L(\mathbf{y}(t_1))$, the adjoint state is defined as

$$\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{y}(t)} \in \mathbb{R}^d.$$

The adjoint evolves backward in time according to the adjoint ODE (see Theorem 3.1.1 below).

Definition 3.1.3 — Continuous-Depth Network

A continuous-depth network replaces a sequence of discrete residual layers

$$\mathbf{h}_{t+1} = \mathbf{h}_t + f_\theta(\mathbf{h}_t)$$

with the limit as step size $\Delta t \to 0$, recovering the ODE $\dot{\mathbf{h}} = f_\theta(\mathbf{h})$. The “depth” is now a continuous variable (integration time) rather than a layer index.

Definition 3.1.4 — Latent ODE

A Latent ODE model consists of:

  1. An encoder (often an RNN or attention mechanism) that maps an observed time series $\{(\mathbf{x}_i, t_i)\}$ to an approximate posterior $q(\mathbf{z}_0)$ over initial latent states.
  2. A Neural ODE that propagates $\mathbf{z}_0$ forward to any desired time: $\mathbf{z}(t) = \mathbf{z}_0 + \int_0^t f_\theta(s, \mathbf{z}(s))\,ds$.
  3. A decoder that maps $\mathbf{z}(t_i)$ back to observation space to reconstruct $\mathbf{x}_i$.

Training uses the evidence lower bound (ELBO), enabling principled handling of irregularly sampled and partially observed data.

Definition 3.1.5 — Augmented Neural ODE

An Augmented Neural ODE extends the state space by appending extra dimensions:

$$\frac{d}{dt}\begin{pmatrix}\mathbf{y} \\ \mathbf{a}\end{pmatrix} = \begin{pmatrix}f_\theta(t, \mathbf{y}, \mathbf{a}) \\ g_\theta(t, \mathbf{y}, \mathbf{a})\end{pmatrix}, \qquad \mathbf{a}(t_0) = \mathbf{0}.$$

This overcomes a limitation of standard Neural ODEs: because ODE flows are homeomorphisms (continuous bijections), a Neural ODE in $\mathbb{R}^d$ cannot represent mappings that change topology (e.g., mapping a single connected component to two). Augmentation lifts the dynamics into a higher-dimensional space where such mappings become possible.

Definition 3.1.6 — Backpropagation Through the Solver (Discretise-then-Optimise)

An alternative to the adjoint method: treat each step of the numerical solver as a layer in a computational graph and apply standard backpropagation. Memory cost is $O(N)$ where $N$ is the number of solver steps, but it produces exact gradients of the discretised objective (as opposed to the continuous adjoint, which solves a different ODE backward).

Theorems and Key Results

Theorem 3.1.1 — Adjoint ODE for Gradient Computation

(Pontryagin, 1962; Chen et al., 2018)

Let $\mathbf{y}(t)$ solve the Neural ODE $\dot{\mathbf{y}} = f_\theta(t, \mathbf{y})$ on $[t_0, t_1]$, and let $L = L(\mathbf{y}(t_1))$ be a scalar loss. Define the adjoint $\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{y}(t)}$. Then:

(a) The adjoint satisfies the ODE solved backward from $t_1$ to $t_0$:

$$\frac{d\mathbf{a}}{dt} = -\mathbf{a}(t)^\top \frac{\partial f_\theta}{\partial \mathbf{y}}(t, \mathbf{y}(t)), \qquad \mathbf{a}(t_1) = \frac{\partial L}{\partial \mathbf{y}(t_1)}.$$

(b) The gradient with respect to parameters is:

$$\frac{dL}{d\theta} = -\int_{t_1}^{t_0} \mathbf{a}(t)^\top \frac{\partial f_\theta}{\partial \theta}(t, \mathbf{y}(t))\,dt.$$

Memory advantage: The adjoint method requires $O(1)$ memory in the number of solver steps (only the current state and adjoint need be stored), compared to $O(N)$ for backpropagation through all $N$ solver steps.

Proof sketch. Consider a small perturbation $\delta \mathbf{y}(t)$ to the trajectory. By the chain rule, the change in loss is $\delta L = \mathbf{a}(t_1)^\top \delta \mathbf{y}(t_1)$. The perturbation evolves as $\frac{d}{dt}\delta \mathbf{y} = \frac{\partial f_\theta}{\partial \mathbf{y}} \delta \mathbf{y}$ (linearisation). We want $\frac{d}{dt}[\mathbf{a}^\top \delta \mathbf{y}] = 0$ so that the inner product is preserved backward in time. Differentiating: $\dot{\mathbf{a}}^\top \delta \mathbf{y} + \mathbf{a}^\top \dot{\delta \mathbf{y}} = 0$, giving $\dot{\mathbf{a}}^\top \delta \mathbf{y} + \mathbf{a}^\top \frac{\partial f}{\partial \mathbf{y}} \delta \mathbf{y} = 0$. Since this must hold for all $\delta \mathbf{y}$, we obtain $\dot{\mathbf{a}} = -\left(\frac{\partial f}{\partial \mathbf{y}}\right)^\top \mathbf{a}$. Part (b) follows by applying the same argument to perturbations in $\theta$.

Theorem 3.1.2 — Universal Approximation for ODE Right-Hand Sides

Let $g : [0,T] \times \mathbb{R}^d \to \mathbb{R}^d$ be a continuous vector field. For any $\varepsilon > 0$ and compact set $K \subset \mathbb{R}^d$, there exists a feedforward neural network $f_\theta$ with a single hidden layer and finitely many neurons such that

$$\sup_{t \in [0,T]} \sup_{\mathbf{y} \in K} \|f_\theta(t, \mathbf{y}) - g(t, \mathbf{y})\| < \varepsilon.$$

Consequently, the Neural ODE $\dot{\mathbf{y}} = f_\theta(t,\mathbf{y})$ can approximate the flow of any ODE $\dot{\mathbf{y}} = g(t,\mathbf{y})$ to arbitrary accuracy on compact time-state domains.

This follows from the classical Universal Approximation Theorem (Cybenko 1989, Hornik 1991) applied to vector-valued functions on $[0,T] \times K$.

Result 3.1.3 — ResNet-to-Neural-ODE Correspondence

A Residual Network with $N$ layers of the form $\mathbf{h}_{k+1} = \mathbf{h}_k + \frac{T}{N} f_\theta(\mathbf{h}_k)$ is exactly the forward Euler discretisation of the ODE $\dot{\mathbf{h}} = f_\theta(\mathbf{h})$ on $[0, T]$ with step size $\Delta t = T/N$. In the limit $N \to \infty$ (infinite depth), the ResNet converges to the Neural ODE solution.
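This correspondence can be checked numerically. The sketch below (using a hypothetical fixed, untrained weight matrix as the residual map) treats the same update both as an $N$-layer ResNet and as forward Euler with step $T/N$, and shows the global error against a near-exact reference shrinking roughly like $1/N$:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, 0.3, (4, 4))          # fixed toy weights (illustrative, untrained)
f = lambda h: np.tanh(W @ h)            # shared residual map / vector field
h0 = rng.normal(size=4)

def resnet(h0, N, T=1.0):
    """N residual layers h_{k+1} = h_k + (T/N) f(h_k) -- literally forward Euler."""
    h = h0.copy()
    for _ in range(N):
        h = h + (T / N) * f(h)
    return h

ref = resnet(h0, 200_000)               # very fine discretisation ~ the ODE solution
for N in (10, 100, 1000):
    err = np.linalg.norm(resnet(h0, N) - ref)
    print(f"N = {N:4d}   error = {err:.2e}")   # shrinks roughly like 1/N
```

Each tenfold increase in depth cuts the error by roughly a factor of ten, consistent with Euler's $O(1/N)$ global error.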

Common Misconceptions

Misconception 1: “Neural ODEs can represent any mapping.”
Reality: The flow of an ODE is a diffeomorphism (smooth, invertible map). This means a standard Neural ODE in $\mathbb{R}^d$ cannot map one connected component to two disjoint components, or cross trajectories. Augmented Neural ODEs partially overcome this limitation by lifting the dynamics to a higher-dimensional space.
Misconception 2: “The adjoint method always gives more accurate gradients.”
Reality: The adjoint method solves a different ODE backward. Numerical error in the backward solve can make adjoint gradients less accurate than backpropagation through the solver (discretise-then-optimise), especially for stiff or chaotic dynamics. The adjoint's advantage is memory, not necessarily gradient accuracy.
Misconception 3: “Neural ODEs are always more efficient than ResNets.”
Reality: Adaptive solvers can take many small steps in regions of complex dynamics, making a single forward pass potentially more expensive than a fixed-depth ResNet. The computational cost is data-dependent and can vary significantly across inputs.
Misconception 4: “You need automatic differentiation to train Neural ODEs.”
Reality: While autodiff frameworks make training convenient, gradients can be computed via finite differences, the adjoint method implemented manually, or even evolutionary strategies. Our code lab below demonstrates training with finite-difference gradients in pure NumPy.
Misconception 5: “Deeper (longer integration time) always means more expressive.”
Reality: Longer integration can cause gradient explosion/vanishing and numerical instability. The expressivity of a Neural ODE depends on the network architecture of $f_\theta$ and the state dimension, not just the integration interval. Furthermore, trajectories cannot cross in the state space (uniqueness theorem), which fundamentally limits expressivity regardless of depth.
Misconception 6: “Neural ODEs are only for time-series data.”
Reality: Neural ODEs have been applied to image classification (continuous-depth CNNs), generative modelling (continuous normalising flows), point cloud processing, and molecular dynamics—anywhere the “depth” of a network can be reinterpreted as a continuous variable.

Worked Examples


Example 3.1.1 — Forward Pass of a Neural ODE (by hand)

Consider a 1D Neural ODE with a single-neuron hidden layer:

$$f_\theta(y) = w_2 \tanh(w_1 y + b_1) + b_2$$

with parameters $\theta = (w_1, b_1, w_2, b_2) = (1.5, 0, -2, 0)$ and initial condition $y(0)=1$. We solve $\dot{y} = f_\theta(y)$ on $[0, 0.5]$ using Euler's method with $\Delta t = 0.25$.

Step 1: $t=0 \to t=0.25$

$f_\theta(y_0) = -2\,\tanh(1.5 \cdot 1) = -2\,\tanh(1.5) \approx -2 \times 0.9051 = -1.8102$

$y_1 = y_0 + \Delta t \cdot f_\theta(y_0) = 1 + 0.25 \times (-1.8102) = 1 - 0.4526 = 0.5474$

Step 2: $t=0.25 \to t=0.5$

$f_\theta(y_1) = -2\,\tanh(1.5 \times 0.5474) = -2\,\tanh(0.8211) \approx -2 \times 0.6756 = -1.3513$

$y_2 = 0.5474 + 0.25 \times (-1.3513) = 0.5474 - 0.3378 = 0.2096$

Interpretation

With these parameters, the network has learned a vector field that drives $y$ toward zero—qualitatively similar to $\dot{y} = -2y$ for small $y$ (since $\tanh(w_1 y) \approx w_1 y$ near the origin). The true analytical solution of $\dot{y} = -2y$ gives $y(0.5) = e^{-1} \approx 0.3679$. The discrepancy arises because $\tanh$ saturates for large arguments, making the effective decay rate slower than $-2y$ when $|y|$ is large.
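The two hand-computed Euler steps can be replayed in a few lines of NumPy (same parameters and step size as above):

```python
import numpy as np

w1, b1, w2, b2 = 1.5, 0.0, -2.0, 0.0           # parameters from Example 3.1.1
f = lambda y: w2 * np.tanh(w1 * y + b1) + b2

y, dt = 1.0, 0.25                               # y(0) = 1, Euler step 0.25
for k in (1, 2):
    y = y + dt * f(y)
    print(f"y_{k} = {y:.4f}")                   # y_1 = 0.5474, y_2 = 0.2096
```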

Example 3.1.2 — Adjoint Gradient Computation

Continuing Example 3.1.1, suppose we observe $y^*(0.5) = 0.37$ and use $L = \frac{1}{2}(y(0.5) - y^*)^2$. We compute $\frac{dL}{dw_2}$ via the adjoint method.

Step 1: Terminal condition

$\mathbf{a}(t_1) = \frac{\partial L}{\partial y(0.5)} = y(0.5) - y^* = 0.2096 - 0.37 = -0.1604$

Step 2: Euler steps backward for the adjoint, $t=0.5 \to t=0$

The adjoint ODE in 1D is $\frac{da}{dt} = -a \cdot \frac{\partial f_\theta}{\partial y}$, with $\frac{\partial f_\theta}{\partial y} = w_2 w_1 \,\text{sech}^2(w_1 y + b_1)$. At $y_2 = 0.2096$ (note $w_1 y_2 = 0.3144$):

$$\frac{\partial f_\theta}{\partial y}\Big|_{y_2} = (-2)(1.5)\left(1 - \tanh^2(0.3144)\right) \approx -3 \times 0.9073 = -2.7220$$

Stepping backward from $t=0.5$ to $t=0.25$ with step size $\Delta t = 0.25$:

$a_1 = a_2 - \Delta t \cdot \left(-a_2 \cdot \frac{\partial f}{\partial y}\big|_{y_2}\right) = -0.1604 - 0.25 \times (-0.4366) = -0.0512$

since $-a_2 \cdot \frac{\partial f}{\partial y}\big|_{y_2} = -(-0.1604)(-2.7220) = -0.4366$. One more step, $t=0.25 \to t=0$, using $\frac{\partial f}{\partial y}\big|_{y_1} = (-2)(1.5)(1 - \tanh^2(0.8211)) \approx -1.6307$:

$a_0 = a_1 - \Delta t \cdot \left(-a_1 \cdot \frac{\partial f}{\partial y}\big|_{y_1}\right) = -0.0512 - 0.25 \times (-0.0835) = -0.0303$

Step 3: Accumulate parameter gradient

Since $\frac{\partial f_\theta}{\partial w_2} = \tanh(w_1 y + b_1)$:

$\frac{\partial f}{\partial w_2}\big|_{y_0} = \tanh(1.5) \approx 0.9051, \qquad \frac{\partial f}{\partial w_2}\big|_{y_1} = \tanh(0.8211) \approx 0.6756, \qquad \frac{\partial f}{\partial w_2}\big|_{y_2} = \tanh(0.3144) \approx 0.3044$

Using the composite trapezoidal rule to approximate $\frac{dL}{dw_2} = \int_0^{0.5} a(t)\,\frac{\partial f_\theta}{\partial w_2}\,dt$:

$$\frac{dL}{dw_2} \approx \Delta t \left[\tfrac{1}{2} a_0 \frac{\partial f}{\partial w_2}\Big|_{y_0} + a_1 \frac{\partial f}{\partial w_2}\Big|_{y_1} + \tfrac{1}{2} a_2 \frac{\partial f}{\partial w_2}\Big|_{y_2}\right] = 0.25\left[\tfrac{1}{2}(-0.0303)(0.9051) + (-0.0512)(0.6756) + \tfrac{1}{2}(-0.1604)(0.3044)\right] \approx -0.0182$$

This negative gradient tells us that increasing $w_2$ (making it less negative, weakening the decay) would decrease the loss—which makes sense because $y(0.5) = 0.2096$ undershoots the target $0.37$, so a weaker vector field would leave $y(0.5)$ closer to it. For comparison, backpropagating exactly through the two Euler steps gives $\frac{dL}{dw_2} \approx -0.0486$: with such coarse steps the continuous adjoint is only a rough estimate of the discrete gradient, a concrete instance of Misconception 2 below.
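Arithmetic like this is easy to mistype. A short script (reusing the example's parameters) reruns the forward pass, the backward adjoint pass over the full interval, and, for comparison, the exact gradient of the two-step discretised loss via central finite differences:

```python
import numpy as np

w1, b1, w2, b2 = 1.5, 0.0, -2.0, 0.0
y_star, dt = 0.37, 0.25

f     = lambda y, w2: w2 * np.tanh(w1 * y + b1) + b2
dfdy  = lambda y, w2: w2 * w1 * (1 - np.tanh(w1 * y + b1) ** 2)
dfdw2 = lambda y: np.tanh(w1 * y + b1)

def forward(w2):
    ys = [1.0]
    for _ in range(2):                         # two Euler steps on [0, 0.5]
        ys.append(ys[-1] + dt * f(ys[-1], w2))
    return ys

ys = forward(w2)
a2 = ys[2] - y_star                            # terminal adjoint
a1 = a2 - dt * (-a2 * dfdy(ys[2], w2))         # backward Euler steps
a0 = a1 - dt * (-a1 * dfdy(ys[1], w2))
g = [dfdw2(y) for y in ys]
grad_adjoint = dt * (0.5 * a0 * g[0] + a1 * g[1] + 0.5 * a2 * g[2])

# exact gradient of the discretised loss, via central finite differences
L = lambda w: 0.5 * (forward(w)[-1] - y_star) ** 2
eps = 1e-6
grad_discrete = (L(w2 + eps) - L(w2 - eps)) / (2 * eps)

print(f"adjoint estimate : {grad_adjoint:+.4f}")   # about -0.018
print(f"discrete gradient: {grad_discrete:+.4f}")  # about -0.049
```

The gap between the two numbers is exactly the coarse-$\Delta t$ mismatch between the continuous adjoint and backpropagation through the solver; it closes as the step size shrinks.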

Interactive Code Lab: Toy Neural ODE in NumPy


Below is a complete, runnable implementation of a Neural ODE that learns the dynamics $\dot{y} = -2y$ from noisy trajectory data. The “network” is a one-hidden-layer MLP with $H=8$ hidden units and $\tanh$ activation. Gradients are estimated via finite differences.

Constraint note: This runs in-browser using only NumPy. For a full PyTorch implementation with torchdiffeq, see the companion notebook.

Expected behaviour of the lab: the loss decreases from roughly 0.1 to about 0.001, and the learned vector field values approach $-2y$ at each test point.
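The interactive lab itself is not reproduced on this page, but the following sketch implements the setup described above (one-hidden-layer MLP with $H=8$ and $\tanh$, fixed-step Euler solver, central finite-difference gradients); the hyperparameters here are illustrative choices, not the lab's exact values:

```python
import numpy as np

rng = np.random.default_rng(42)

# One-hidden-layer MLP f_theta: R -> R^H -> R with tanh activation
H = 8
params = [rng.normal(0, 0.5, (H, 1)), np.zeros(H),     # W1, b1
          rng.normal(0, 0.5, (1, H)), np.zeros(1)]     # W2, b2

def f_net(y, params):
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ np.atleast_1d(y) + b1)
    return (W2 @ h + b2)[0]

def euler(y0, ts, params):
    """Fixed-step Euler solve of dy/dt = f_net(y); returns y at each t in ts."""
    ys = [y0]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        ys.append(ys[-1] + (t1 - t0) * f_net(ys[-1], params))
    return np.array(ys)

# Noisy trajectories of the true system dy/dt = -2y
ts = np.linspace(0.0, 1.0, 21)
data = [(y0, y0 * np.exp(-2 * ts) + rng.normal(0, 0.01, ts.shape))
        for y0 in (1.0, 0.5, -0.8)]

def loss(params):
    return sum(np.mean((euler(y0, ts, params) - traj) ** 2) for y0, traj in data)

# Gradient descent with central finite-difference gradients
lr, eps, history = 0.01, 1e-5, []
for epoch in range(121):
    history.append(loss(params))
    if epoch % 30 == 0:
        print(f"epoch {epoch:3d}   loss {history[-1]:.5f}")
    for p in params:
        g = np.zeros_like(p)
        it = np.nditer(p, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            old = p[ix]
            p[ix] = old + eps; lp = loss(params)
            p[ix] = old - eps; lm = loss(params)
            p[ix] = old
            g[ix] = (lp - lm) / (2 * eps)
            it.iternext()
        p -= lr * g

# Learned vector field vs the true -2y
for y in (-1.0, -0.5, 0.5, 1.0):
    print(f"f_theta({y:+.1f}) = {f_net(y, params):+.3f}   (true {-2*y:+.3f})")
```

Finite differences cost two loss evaluations per parameter per epoch, which is why this approach only scales to toy models; the adjoint method or backpropagation replaces that $O(p)$ factor with a single backward pass.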

Agent Lens: The Neural ODE as a Continuous-Time Agent


Reinforcement Learning Analogy

A Neural ODE can be viewed through the lens of a continuous-time agent operating in a dynamical environment. The correspondence is:

  • State $s_t$ ↔ hidden representation $\mathbf{y}(t)$
  • Action $a_t$ ↔ instantaneous velocity $f_\theta(t, \mathbf{y}(t))$, applied continuously
  • Policy $\pi_\theta(a|s)$ ↔ the neural network $f_\theta$ mapping state to action (deterministic)
  • Environment dynamics ↔ $\dot{\mathbf{y}} = f_\theta(t, \mathbf{y})$ (the agent is the dynamics)
  • Reward / return ↔ $-L(\mathbf{y}(t_1))$: negative of the reconstruction or prediction loss
  • Policy gradient $\nabla_\theta J$ ↔ adjoint gradient $\frac{dL}{d\theta}$, computed by solving the adjoint ODE backward
  • Episode ↔ one forward integration from $t_0$ to $t_1$

Why this perspective is powerful

Continuous-time policy gradient. The adjoint method can be read as a deterministic, continuous-time analogue of the REINFORCE policy gradient. In discrete RL, the policy gradient theorem gives

$$\nabla_\theta J = \mathbb{E}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right],$$

where $G_t$ is the return-to-go. In the Neural ODE setting, the sum becomes an integral, the stochastic policy becomes deterministic, and the return-to-go becomes the adjoint state $\mathbf{a}(t)$. The adjoint ODE propagates “credit” backward through continuous time, exactly as $G_t$ propagates reward backward through discrete time steps.

Adaptive computation as adaptive planning. An adaptive ODE solver (e.g., Dormand–Prince) takes smaller steps where the dynamics are complex and larger steps where they are simple. This is analogous to an agent spending more “thinking time” on difficult decisions—a form of adaptive computation depth that emerges naturally from the numerical method.

Latent ODE as a world model. In model-based RL, an agent learns a “world model” to predict future states. A Latent ODE serves exactly this purpose: given partial observations, it infers the latent state and predicts forward trajectories. The encoder is the perception module, the Neural ODE is the transition model, and the decoder maps predicted states back to observations.

The connection between optimal control (Pontryagin's Maximum Principle, 1962) and neural network training was recognised early by LeCun (1988) and later formalised by Weinan E (2017). The Neural ODE paper (Chen et al., 2018) made this connection practical by providing an efficient implementation of the adjoint method within modern automatic differentiation frameworks.

Exercises


Analytical Exercises

Exercise 3.1.1 (Euler Discretisation)

Show that the ResNet update $\mathbf{h}_{k+1} = \mathbf{h}_k + \frac{1}{N} f_\theta(\mathbf{h}_k)$ with $N$ layers is the forward Euler discretisation of $\dot{\mathbf{h}} = f_\theta(\mathbf{h})$ on $[0, 1]$ with step size $\Delta t = 1/N$. What is the local truncation error?

Forward Euler for $\dot{\mathbf{h}} = f_\theta(\mathbf{h})$ with step size $\Delta t$:

$$\mathbf{h}(t + \Delta t) \approx \mathbf{h}(t) + \Delta t \cdot f_\theta(\mathbf{h}(t))$$

Setting $\Delta t = 1/N$ and $t_k = k/N$, we get $\mathbf{h}_{k+1} = \mathbf{h}_k + \frac{1}{N} f_\theta(\mathbf{h}_k)$, which is exactly the ResNet update.

The local truncation error of forward Euler is $O(\Delta t^2) = O(1/N^2)$. The global error after $N$ steps is $O(\Delta t) = O(1/N)$. As $N \to \infty$, the ResNet converges to the ODE solution (assuming $f_\theta$ is Lipschitz).

Exercise 3.1.2 (Adjoint ODE Derivation)

For the scalar ODE $\dot{y} = -\alpha y$ with $y(0) = y_0$ and loss $L = \frac{1}{2}(y(T) - y^*)^2$:

  1. Write the adjoint ODE and its terminal condition.
  2. Solve the adjoint ODE analytically.
  3. Compute $\frac{dL}{d\alpha}$ and verify by direct differentiation of $y(T) = y_0 e^{-\alpha T}$.

(a) Here $f(y) = -\alpha y$, so $\frac{\partial f}{\partial y} = -\alpha$. The adjoint ODE is:

$$\frac{da}{dt} = -a(t) \cdot (-\alpha) = \alpha\, a(t), \qquad a(T) = y(T) - y^*$$

(b) This is a linear ODE with solution $a(t) = a(T) e^{\alpha(t - T)}$, i.e., $a(t) = (y_0 e^{-\alpha T} - y^*) e^{\alpha(t-T)}$.

(c) We have $\frac{\partial f}{\partial \alpha} = -y$, so:

$$\frac{dL}{d\alpha} = -\int_T^0 a(t) \cdot (-y(t))\,dt = -\int_0^T a(t)\, y(t)\,dt$$

Substituting $a(t) = (y_0 e^{-\alpha T} - y^*) e^{\alpha(t-T)}$ and $y(t) = y_0 e^{-\alpha t}$:

$$\frac{dL}{d\alpha} = -(y_0 e^{-\alpha T} - y^*)\, y_0 \int_0^T e^{\alpha(t-T)} e^{-\alpha t}\,dt = -(y_0 e^{-\alpha T} - y^*)\, y_0\, T\, e^{-\alpha T}$$

Verification: $y(T) = y_0 e^{-\alpha T}$, so $L = \frac{1}{2}(y_0 e^{-\alpha T} - y^*)^2$. Direct differentiation: $\frac{dL}{d\alpha} = (y_0 e^{-\alpha T} - y^*) \cdot (-T y_0 e^{-\alpha T})$, which matches exactly.
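As a quick sanity check on part (c), the closed-form gradient can be compared against a finite difference of $L(\alpha)$ directly (the values below are arbitrary test numbers):

```python
import numpy as np

y0, y_star, alpha, T = 1.0, 0.3, 2.0, 1.0      # arbitrary test values

# gradient from the adjoint computation in part (c)
yT = y0 * np.exp(-alpha * T)
grad_adjoint = -(yT - y_star) * y0 * T * np.exp(-alpha * T)

# finite difference of L(alpha) = 0.5 * (y0 exp(-alpha T) - y*)^2
L = lambda a: 0.5 * (y0 * np.exp(-a * T) - y_star) ** 2
eps = 1e-7
grad_fd = (L(alpha + eps) - L(alpha - eps)) / (2 * eps)

print(grad_adjoint, grad_fd)                   # both ≈ +0.0223
```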

Exercise 3.1.3 (Topology Constraint)

Prove that a Neural ODE $\dot{\mathbf{y}} = f_\theta(t,\mathbf{y})$ in $\mathbb{R}^2$ with Lipschitz $f_\theta$ cannot map a disk to an annulus. Hint: Use the fact that ODE flows are homeomorphisms and homeomorphisms preserve simple connectedness.

Exercise 3.1.4 (Memory Analysis)

A Neural ODE is solved with an adaptive RK45 solver that takes $N$ steps. The network $f_\theta$ has $p$ parameters and the state dimension is $d$.

  1. What is the memory cost of backpropagation through the solver?
  2. What is the memory cost of the adjoint method?
  3. Under what conditions is the adjoint method preferable?

(a) Backprop through solver: $O(N \cdot d)$ for storing all intermediate states, plus $O(p)$ for the graph of $f_\theta$ at each step. Total: $O(N(d + p))$.

(b) Adjoint method: $O(d + p)$. Only the current state, adjoint, and accumulated parameter gradient need to be stored. The forward trajectory is recomputed backward.

(c) The adjoint method is preferable when $N$ is large (many solver steps), $d$ is large (high-dimensional state), or memory is the bottleneck. However, if $N$ is small or gradient accuracy is critical (stiff systems), backprop through the solver may be better despite higher memory cost.

Computational Exercises

Exercise 3.1.5 (Implement RK4 Solver)

Modify the code lab to use a 4th-order Runge-Kutta solver instead of Euler. Compare the trajectory accuracy for the same step size $\Delta t = 0.05$. How many Euler steps would you need to match RK4 accuracy?
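A minimal classical RK4 step that can be slotted into the code lab's solver loop (a sketch; the `params` argument mirrors the lab's solver signature):

```python
import numpy as np

def rk4_step(f, y, dt, params):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(y, params)."""
    k1 = f(y, params)
    k2 = f(y + 0.5 * dt * k1, params)
    k3 = f(y + 0.5 * dt * k2, params)
    k4 = f(y + dt * k3, params)
    return y + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Sanity check on dy/dt = -2y, whose exact solution is e^{-2t}
f = lambda y, p: -2.0 * y
y, dt = 1.0, 0.05
for _ in range(20):                      # integrate to t = 1
    y = rk4_step(f, y, dt, None)
print(abs(y - np.exp(-2.0)))             # ~1e-7, vs ~1e-2 for Euler at this dt
```

Each RK4 step costs four evaluations of $f$ instead of one, but the $O(\Delta t^4)$ global error usually more than repays that fourfold cost.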

Exercise 3.1.6 (2D System)

Extend the code lab to learn a 2D dynamical system: the damped harmonic oscillator $\dot{x} = v$, $\dot{v} = -kx - cv$ with $k=1, c=0.5$. Use a 2-input, 2-output MLP. Generate training trajectories from multiple initial conditions and train the Neural ODE.

import numpy as np
np.random.seed(42)

# True system: dx/dt = v, dv/dt = -x - 0.5v
def true_rhs(state):
    x, v = state
    return np.array([v, -x - 0.5*v])

# Euler solver for 2D
def euler_2d(f, y0, t_eval, params, dt=0.01):
    traj = [y0.copy()]
    y = y0.copy()
    t = t_eval[0]
    idx = 1
    while idx < len(t_eval):
        y = y + dt * f(y, params)
        t += dt
        if t >= t_eval[idx] - 1e-10:
            traj.append(y.copy())
            idx += 1
    return np.array(traj)

# 2D MLP: R^2 -> R^H -> R^2
H = 16
W1 = np.random.randn(H, 2) * 0.5
b1 = np.zeros(H)
W2 = np.random.randn(2, H) * 0.5
b2 = np.zeros(2)
params = [W1, b1, W2, b2]

def f_net_2d(y, params):
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ y + b1)
    return W2 @ h + b2

# Generate data
t_span = np.linspace(0, 4, 50)
ics = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
       np.array([-0.5, 0.5])]
data = []
for ic in ics:
    traj = euler_2d(lambda y, p: true_rhs(y), ic, t_span, None)
    data.append((ic, traj))

# Training (finite diff)
lr = 0.005
for epoch in range(80):
    total_loss = 0
    for ic, traj_true in data:
        traj_pred = euler_2d(f_net_2d, ic, t_span, params)
        total_loss += np.mean((traj_pred - traj_true)**2)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}: loss = {total_loss:.6f}")
    # finite-diff gradient update
    eps = 1e-5
    for i, p in enumerate(params):
        g = np.zeros_like(p)
        it = np.nditer(p, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            old = p[ix]
            p[ix] = old + eps
            lp = sum(np.mean((euler_2d(f_net_2d, ic, t_span, params) - tr)**2) for ic, tr in data)
            p[ix] = old - eps
            lm = sum(np.mean((euler_2d(f_net_2d, ic, t_span, params) - tr)**2) for ic, tr in data)
            g[ix] = (lp - lm) / (2 * eps)
            p[ix] = old
            it.iternext()
        params[i] -= lr * g
print("Done!")

Exercise 3.1.7 (Learning Rate Sensitivity)

Run the toy Neural ODE code lab with learning rates $\eta \in \{0.001, 0.005, 0.01, 0.05, 0.1\}$. For each, record the final loss after 120 epochs. Plot (or tabulate) loss vs. learning rate. What happens for large $\eta$? Explain in terms of the loss landscape.

Exercise 3.1.8 (Analytic vs. Finite-Difference Gradients)

For the scalar Neural ODE with $f_\theta(y) = w_2 \tanh(w_1 y + b_1) + b_2$:

  1. Derive $\frac{\partial f_\theta}{\partial w_1}$, $\frac{\partial f_\theta}{\partial b_1}$, $\frac{\partial f_\theta}{\partial w_2}$, $\frac{\partial f_\theta}{\partial b_2}$ analytically.
  2. Implement a version of the training loop that uses these analytic local gradients combined with backpropagation through the Euler solver (chain rule across steps) instead of finite differences.
  3. Compare convergence speed (epochs to reach loss $< 0.005$) of the two approaches.

Agentic Exercises

Exercise 3.1.9 (Adaptive Depth Agent)

Design an “adaptive depth” strategy for the Neural ODE: at each Euler step, the agent decides whether to take another step or to stop early, based on the magnitude of $f_\theta(y)$. Specifically:

  1. Define a stopping criterion: stop if $\|f_\theta(\mathbf{y})\| < \varepsilon$ for some threshold $\varepsilon$.
  2. Implement this in the Euler solver.
  3. Train the Neural ODE and compare the average number of steps taken vs. the fixed-step solver. How does the accuracy change?
  4. Discuss: in an RL context, this is analogous to what kind of agent behaviour?

Exercise 3.1.10 (Curriculum Learning for Neural ODEs)

Implement a curriculum learning strategy: start by training the Neural ODE on short time horizons ($T=0.5$) and progressively increase to $T=2.0$. Compare against training on the full interval from the start.

  1. Implement a schedule: $T_{\text{epoch}} = 0.5 + 1.5 \cdot \min(1, \text{epoch}/60)$.
  2. Record the loss trajectory for both approaches.
  3. Explain the analogy to curriculum learning in RL, where the agent first learns easy tasks before progressing to harder ones.

Assessment


Quiz (10 Questions)

Q1. What is the fundamental equation defining a Neural ODE?

  • (a) $\mathbf{y}_{t+1} = \sigma(W\mathbf{y}_t + b)$
  • (b) $\frac{d\mathbf{y}}{dt} = f_\theta(t, \mathbf{y}(t))$
  • (c) $\nabla^2 u = f$
  • (d) $\mathbf{y}(t_1) = e^{At}\mathbf{y}(t_0)$

Answer: (b)

Q2. The adjoint method computes gradients with what memory complexity in the number of solver steps $N$?

  • (a) $O(N^2)$
  • (b) $O(N \log N)$
  • (c) $O(N)$
  • (d) $O(1)$

Answer: (d). The adjoint method requires only constant memory in $N$, since it recomputes the forward trajectory during the backward pass.

Q3. A standard (non-augmented) Neural ODE in $\mathbb{R}^d$ defines a flow map that is:

  • (a) A surjection but not an injection
  • (b) A homeomorphism (continuous bijection with continuous inverse)
  • (c) An arbitrary nonlinear map
  • (d) A linear transformation

Answer: (b). By the existence-uniqueness theorem, trajectories cannot cross, so the flow is a homeomorphism (in fact a diffeomorphism if $f_\theta$ is smooth).

Q4. The ResNet update $\mathbf{h}_{k+1} = \mathbf{h}_k + \frac{1}{N}f_\theta(\mathbf{h}_k)$ corresponds to which numerical method applied to the Neural ODE?

  • (a) Backward Euler
  • (b) Forward Euler
  • (c) Runge-Kutta 4
  • (d) Adams-Bashforth

Answer: (b)

Q5. In the adjoint method, the adjoint state $\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{y}(t)}$ satisfies which equation?

  • (a) $\dot{\mathbf{a}} = f_\theta(t, \mathbf{a})$
  • (b) $\dot{\mathbf{a}} = -\mathbf{a}^\top \frac{\partial f_\theta}{\partial \mathbf{y}}$
  • (c) $\dot{\mathbf{a}} = \mathbf{a}^\top \frac{\partial f_\theta}{\partial \theta}$
  • (d) $\dot{\mathbf{a}} = -\nabla^2 L$

Answer: (b)

Q6. Why might the adjoint method produce less accurate gradients than backpropagation through the solver?

  • (a) It uses a lower-order ODE solver
  • (b) It solves a different (continuous) ODE backward, accumulating numerical error that doesn't match the discrete forward pass
  • (c) It doesn't use the chain rule
  • (d) It ignores the initial condition

Answer: (b). The adjoint ODE is the continuous limit; numerical errors in both the forward and backward solves can compound, especially for stiff or chaotic systems.

Q7. What advantage do Latent ODE models have over standard RNNs for time-series modelling?

  • (a) They are always faster to train
  • (b) They naturally handle irregularly sampled time points and missing data
  • (c) They require less data
  • (d) They always have lower loss

Answer: (b). Since the ODE can be evaluated at any time $t$, it handles variable time gaps naturally, unlike RNNs which assume fixed time steps.

Q8. Augmented Neural ODEs address which limitation of standard Neural ODEs?

  • (a) Slow training speed
  • (b) The inability to represent maps that change topology (e.g., separating connected components)
  • (c) The need for a numerical solver
  • (d) High memory usage

Answer: (b)

Q9. In the agent-lens view of a Neural ODE, what corresponds to the RL “policy”?

  • (a) The ODE solver
  • (b) The loss function $L$
  • (c) The neural network $f_\theta$ mapping state to instantaneous velocity
  • (d) The initial condition $\mathbf{y}_0$

Answer: (c). The network $f_\theta$ determines what “action” (velocity) to apply at each state, analogous to a deterministic policy.

Q10. For the Universal Approximation Theorem applied to Neural ODEs, which condition is required?

  • (a) The target vector field must be analytic
  • (b) The target vector field must be continuous on a compact domain
  • (c) The state dimension must be at least 3
  • (d) The network must have at least 10 layers

Answer: (b). The classical universal approximation theorem requires only continuity on a compact set; a single hidden layer with sufficiently many neurons suffices.

Mini-Project: Learning an Unknown 2D Oscillator

Project Description

You are given noisy trajectory data from an unknown 2D dynamical system (which is secretly the Van der Pol oscillator $\dot{x} = y$, $\dot{y} = \mu(1-x^2)y - x$ with $\mu = 1$, but you should treat the dynamics as unknown).

Tasks
  1. Data generation (provided): Generate 5 trajectories from different initial conditions, each with 100 time points over $[0, 10]$, with Gaussian noise ($\sigma = 0.05$).
  2. Model: Implement a 2D Neural ODE with a 2-hidden-layer MLP ($2 \to 32 \to 32 \to 2$) using NumPy. Use $\tanh$ activations.
  3. Training: Train for at least 200 epochs using finite-difference gradients. Report the loss curve.
  4. Evaluation: From a held-out initial condition, propagate the learned ODE and compare to the true trajectory. Report MSE over the trajectory.
  5. Phase portrait: Evaluate the learned vector field on a grid in $[-3,3]^2$ and produce a quiver plot (or a table of vector field values). Compare qualitatively to the true Van der Pol vector field.
  6. Analysis: Discuss whether the learned system captures the limit cycle. What happens if you extrapolate beyond the training time horizon?
Rubric
  • Implementation — Excellent (A): complete, correct NumPy code; RK4 or adaptive solver. Satisfactory (B): working Euler solver; minor issues. Needs Work (C): incomplete or non-functional code.
  • Training — Excellent (A): loss converges below 0.01; hyperparameter tuning documented. Satisfactory (B): loss converges below 0.05; default hyperparameters. Needs Work (C): loss does not converge or training crashes.
  • Evaluation — Excellent (A): quantitative MSE comparison; multiple held-out ICs tested. Satisfactory (B): single held-out IC; qualitative comparison. Needs Work (C): no held-out evaluation.
  • Phase Portrait — Excellent (A): clear vector field comparison; limit cycle visible. Satisfactory (B): vector field plotted/tabulated but incomplete comparison. Needs Work (C): missing or incorrect.
  • Analysis — Excellent (A): insightful discussion of limit cycle capture, extrapolation, and failure modes. Satisfactory (B): basic discussion. Needs Work (C): no analysis provided.

References & Next Steps


References

  1. R. T. Q. Chen, Y. Rubanova, J. Bettencourt, D. Duvenaud, “Neural Ordinary Differential Equations,” NeurIPS, 2018. arXiv:1806.07366
  2. L. S. Pontryagin, V. G. Boltyanskii, R. V. Gamkrelidze, E. F. Mishchenko, The Mathematical Theory of Optimal Processes, Wiley, 1962.
  3. E. Dupont, A. Doucet, Y. W. Teh, “Augmented Neural ODEs,” NeurIPS, 2019. arXiv:1904.01681
  4. Y. Rubanova, R. T. Q. Chen, D. Duvenaud, “Latent ODEs for Irregularly-Sampled Time Series,” NeurIPS, 2019. arXiv:1907.03907
  5. W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, D. Duvenaud, “FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models,” ICLR, 2019. arXiv:1810.01367
  6. K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, 4(2):251–257, 1991.
  7. E. Weinan, “A Proposal on Machine Learning via Dynamical Systems,” Communications in Mathematics and Statistics, 5(1):1–11, 2017.
  8. P. Kidger, “On Neural Differential Equations,” DPhil Thesis, University of Oxford, 2022. arXiv:2202.02435

Next Steps

Where to go from here

  • Full implementation: Work through the PyTorch Neural ODE notebook which uses torchdiffeq for adaptive solvers and automatic adjoint differentiation.
  • Next module: Module 3.2 — Physics-Informed Neural Networks (PINNs) builds on Neural ODEs by embedding PDE residuals directly into the loss function.
  • Advanced topics: Module 3.3 covers Neural Operators (DeepONet, FNO) for learning mappings between function spaces.
  • Agent connection: Module 3.4 applies reinforcement learning to ODE/PDE control, using the continuous-time policy gradient ideas introduced here.