Module 3.3: Neural Operators
Learn the entire solution operator $\mathcal{G}: f \mapsto u$—once trained, get instant predictions for any new input function.
| Difficulty | Estimated Time |
|---|---|
| Advanced | 4–5 hours |
A classical PDE solver computes one solution at a time: change the forcing function, initial condition, or geometry and you must re-solve from scratch. Neural operators learn the entire solution operator—a mapping from input functions to output functions. Once trained, they predict solutions for new inputs in milliseconds, enabling real-time simulation, rapid design optimization, and uncertainty quantification that would be infeasible with traditional solvers.
Two architectures dominate the field. DeepONet (Lu et al. 2021) decomposes the operator into a branch network (encodes the input function) and a trunk network (encodes the query location), combining them via a learned basis expansion. The Fourier Neural Operator (FNO, Li et al. 2021) performs convolution in the spectral domain, achieving resolution invariance and computational efficiency. Both architectures are backed by universal approximation theorems for nonlinear operators.
Learn the operator $\mathcal{G}[f](x) = \int_0^x f(s)\,ds$.
Setup: Sample $f$ as random sine series: $f(s) = \sum_{k=1}^{3} a_k \sin(k\pi s)$ with $a_k \sim \mathcal{N}(0,1)$. Compute $F(x) = \int_0^x f(s)\,ds$ analytically.
Branch network: Input: $[f(s_1), \ldots, f(s_{50})]$ at 50 equally-spaced sensor points $s_j \in [0,1]$. Architecture: 50 → 40 → $p$.
Trunk network: Input: query location $x$. Architecture: 1 → 40 → $p$.
Forward pass for one input: Let $f(s) = 2\sin(\pi s) - \sin(2\pi s)$. Evaluate at sensors to get a 50-dimensional vector. Pass through branch to get $\mathbf{b} = [b_1, \ldots, b_p]$. For query $x = 0.5$, pass through trunk to get $\boldsymbol{\tau} = [\tau_1, \ldots, \tau_p]$. Output: $\hat{F}(0.5) = \sum_{k} b_k \tau_k + b_0$.
Training: Generate 1000 random $f$’s. MSE loss between predicted and true antiderivatives at random query points. After training, test on unseen $f$’s: relative $L^2$ error $\sim 10^{-3}$.
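Under this setup, training pairs can be generated in a few lines. The sketch below (function names are illustrative, not from the module's codebase) exploits the fact that the antiderivative of each sine mode is available in closed form, so no numerical integration is needed:

```python
import numpy as np

def f_eval(a, s):
    """Input function f(s) = sum_k a_k sin(k*pi*s)."""
    return sum(a[k - 1] * np.sin(k * np.pi * s) for k in range(1, len(a) + 1))

def antiderivative(a, x):
    """Analytic F(x) = int_0^x f(s) ds = sum_k a_k (1 - cos(k*pi*x)) / (k*pi)."""
    return sum(a[k - 1] * (1 - np.cos(k * np.pi * x)) / (k * np.pi)
               for k in range(1, len(a) + 1))

rng = np.random.default_rng(0)
a = rng.normal(size=3)                 # a_k ~ N(0, 1)
s = np.linspace(0, 1, 50)              # 50 equally spaced sensor points
f_sensors = f_eval(a, s)               # branch input, shape (50,)
x_query = rng.uniform(0, 1, size=10)   # trunk inputs for this sample
F_true = antiderivative(a, x_query)    # regression targets
```

Repeating this for 1000 random coefficient vectors $a$ yields the full training set.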
Illustrate one spectral convolution on a function discretized at $n = 8$ points.
Input: $v = [1, 2, 3, 4, 3, 2, 1, 0]$ (single channel).
Step 1: FFT. $\hat{v} = \text{FFT}(v) = [16,\ -4.83-4.83i,\ 0,\ 0.83-0.83i,\ 0,\ 0.83+0.83i,\ 0,\ -4.83+4.83i]$ (5 unique modes due to the conjugate symmetry of a real signal).
Step 2: Truncate. Keep $k_{\max} = 3$ modes: $\hat{v}_{\text{trunc}} = [16,\ -4.83-4.83i,\ 0,\ 0, 0, 0, 0, 0]$.
Step 3: Multiply. For weights $R = [1.5,\ 0.8+0.1i,\ -0.3+0.5i]$ (one per kept mode): $\hat{w}_0 = 1.5 \times 16 = 24$, $\hat{w}_1 = (0.8+0.1i)(-4.83-4.83i) \approx -3.38-4.35i$, $\hat{w}_2 = (-0.3+0.5i) \times 0 = 0$.
Step 4: IFFT. Zero-pad back to 8 modes and compute $w = \text{IFFT}(\hat{w})$ to get the output in physical space. The result captures the low-frequency structure of $v$ transformed by the learned weights.
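The four steps map directly onto `numpy.fft`. Using `rfft`/`irfft` (which store only the unique modes of a real signal) keeps the output real automatically:

```python
import numpy as np

v = np.array([1, 2, 3, 4, 3, 2, 1, 0], dtype=float)
v_hat = np.fft.rfft(v)                        # Step 1: 5 unique modes for n = 8

k_max = 3
R = np.array([1.5, 0.8 + 0.1j, -0.3 + 0.5j])  # one learned weight per kept mode
w_hat = np.zeros_like(v_hat)
w_hat[:k_max] = R * v_hat[:k_max]             # Steps 2-3: truncate, then multiply

w = np.fft.irfft(w_hat, n=len(v))             # Step 4: back to physical space
print(np.round(w, 3))                         # smooth low-frequency output
```

In a real FNO layer the same operation runs over many channels at once, and the weights $R$ are learned by gradient descent.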
Learn $\mathcal{G}: f \mapsto u$ where $-u''(x) = f(x)$, $u(0) = u(1) = 0$.
Data generation: For each of 1000 training samples, draw $f$ as a random Fourier series with 1–5 sine modes and solve $-u'' = f$ with a tridiagonal finite-difference solver (see Code Lab below; the lab uses 500 samples for speed).
Training: Branch input: $f$ at 64 sensor points. Trunk input: query $x$. Loss: $\frac{1}{N}\sum_{j}\|u_{\text{pred}}(x_j) - u_{\text{true}}(x_j)\|^2$ averaged over samples and query points.
After 5000 epochs: Mean relative test error $\approx 5 \times 10^{-3}$. The trained model predicts the solution for a new $f$ in a single forward pass ($\sim$1 ms) versus $\sim$10 ms for the tridiagonal solve—a 10x speedup that grows with problem dimension.
Build and train a minimal DeepONet to learn the Poisson operator $-u'' = f$, $u(0) = u(1) = 0$.
For full FNO and DeepONet implementations in PyTorch, see notebooks/neural_operator_full.ipynb.
import numpy as np
import matplotlib.pyplot as plt
# ── 1. Data generation: solve -u'' = f with u(0)=u(1)=0 ─────
def solve_poisson(f_vals, x_grid):
"""Solve -u''=f via tridiagonal finite differences."""
n = len(x_grid)
dx = x_grid[1] - x_grid[0]
# Interior points only (exclude boundaries)
A_diag = 2.0 * np.ones(n - 2)
A_off = -1.0 * np.ones(n - 3)
# Solve tridiagonal system: (1/dx^2) * A * u_int = f_int
rhs = dx**2 * f_vals[1:-1]
# Thomas algorithm
n_int = n - 2
c = np.zeros(n_int); d = np.zeros(n_int)
c[0] = A_off[0] / A_diag[0] if n_int > 1 else 0
d[0] = rhs[0] / A_diag[0]
    for i in range(1, n_int):
        # A_off has length n_int - 1, so indices i-1 and i are always in range
        m = A_diag[i] - A_off[i - 1] * c[i - 1]
        c[i] = A_off[i] / m if i < n_int - 1 else 0
        d[i] = (rhs[i] - A_off[i - 1] * d[i - 1]) / m
u_int = np.zeros(n_int)
u_int[-1] = d[-1]
for i in range(n_int - 2, -1, -1):
u_int[i] = d[i] - c[i] * u_int[i+1]
u = np.zeros(n)
u[1:-1] = u_int
return u
n_points = 32
x_grid = np.linspace(0, 1, n_points)
def random_forcing(n_samples):
"""Random Fourier forcing functions."""
F = np.zeros((n_samples, n_points))
for i in range(n_samples):
        n_modes = np.random.randint(1, 6)  # 1-5 modes (upper bound is exclusive)
for k in range(1, n_modes + 1):
a = np.random.randn() * 2.0
F[i] += a * np.sin(k * np.pi * x_grid)
return F
N_train, N_test = 500, 100
f_train = random_forcing(N_train)
u_train = np.array([solve_poisson(f, x_grid) for f in f_train])
f_test = random_forcing(N_test)
u_test = np.array([solve_poisson(f, x_grid) for f in f_test])
# ── 2. Tiny DeepONet in NumPy ────────────────────────────────
class TinyDeepONet:
def __init__(self, n_sensors, trunk_in, hidden, p, seed=42):
rng = np.random.RandomState(seed)
sc = lambda m, n: rng.randn(m, n) * np.sqrt(2.0/(m+n))
# Branch: n_sensors -> hidden -> p
self.Wb1 = sc(n_sensors, hidden)
self.bb1 = np.zeros(hidden)
self.Wb2 = sc(hidden, p)
self.bb2 = np.zeros(p)
# Trunk: trunk_in -> hidden -> p
self.Wt1 = sc(trunk_in, hidden)
self.bt1 = np.zeros(hidden)
self.Wt2 = sc(hidden, p)
self.bt2 = np.zeros(p)
# Bias
self.b0 = np.zeros(1)
self.params = [self.Wb1, self.bb1, self.Wb2, self.bb2,
self.Wt1, self.bt1, self.Wt2, self.bt2, self.b0]
def branch(self, f_sensors):
"""f_sensors: (batch, n_sensors) -> (batch, p)"""
h = np.tanh(f_sensors @ self.Wb1 + self.bb1)
return h @ self.Wb2 + self.bb2
def trunk(self, x_query):
"""x_query: (n_query, 1) -> (n_query, p)"""
h = np.tanh(x_query @ self.Wt1 + self.bt1)
return h @ self.Wt2 + self.bt2
def forward(self, f_sensors, x_query):
"""Output: (batch, n_query)"""
b = self.branch(f_sensors) # (batch, p)
t = self.trunk(x_query) # (n_query, p)
return b @ t.T + self.b0 # (batch, n_query)
# ── 3. Training ──────────────────────────────────────────────
p = 20 # basis dimension
net = TinyDeepONet(n_points, 1, 32, p)
x_query = x_grid.reshape(-1, 1) # (n_points, 1)
lr = 0.001
batch_size = 32
n_epochs = 2000
losses = []
for epoch in range(n_epochs):
idx = np.random.choice(N_train, batch_size, replace=False)
f_batch = f_train[idx] # (batch, n_points)
u_batch = u_train[idx] # (batch, n_points)
# Forward
u_pred = net.forward(f_batch, x_query) # (batch, n_points)
loss = np.mean((u_pred - u_batch)**2)
losses.append(loss)
# Numerical gradient update (parameter perturbation)
eps = 1e-5
for pi in range(len(net.params)):
grad = np.zeros_like(net.params[pi])
flat = net.params[pi].ravel()
grad_flat = grad.ravel()
# Sample subset of parameters to update (for speed)
n_param = len(flat)
sample_idx = np.random.choice(n_param, min(50, n_param), replace=False)
for j in sample_idx:
old_val = flat[j]
flat[j] = old_val + eps
net.params[pi] = flat.reshape(net.params[pi].shape)
u_p = net.forward(f_batch, x_query)
loss_p = np.mean((u_p - u_batch)**2)
flat[j] = old_val - eps
net.params[pi] = flat.reshape(net.params[pi].shape)
u_m = net.forward(f_batch, x_query)
loss_m = np.mean((u_m - u_batch)**2)
grad_flat[j] = (loss_p - loss_m) / (2 * eps)
flat[j] = old_val
net.params[pi] = flat.reshape(net.params[pi].shape)
net.params[pi] -= lr * grad.reshape(net.params[pi].shape)
# Keep references in sync
(net.Wb1, net.bb1, net.Wb2, net.bb2,
net.Wt1, net.bt1, net.Wt2, net.bt2, net.b0) = net.params
if (epoch + 1) % 500 == 0:
print(f"Epoch {epoch+1}: MSE = {loss:.6f}")
# ── 4. Evaluation ────────────────────────────────────────────
u_pred_test = net.forward(f_test, x_query)
errors = np.linalg.norm(u_pred_test - u_test, axis=1) / (np.linalg.norm(u_test, axis=1) + 1e-10)
print(f"\nMean relative L2 error: {errors.mean():.4f}")
print(f"Max relative L2 error: {errors.max():.4f}")
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for col in range(3):
i = np.random.randint(N_test)
axes[0, col].plot(x_grid, f_test[i], 'k-', lw=1.5)
axes[0, col].set_title(f'Forcing f(x) #{i}')
axes[0, col].set_xlabel('x'); axes[0, col].grid(True, alpha=0.3)
axes[1, col].plot(x_grid, u_test[i], 'b-', lw=2, label='True')
axes[1, col].plot(x_grid, u_pred_test[i], 'r--', lw=2, label='DeepONet')
axes[1, col].set_title(f'Solution, err={errors[i]:.3e}')
axes[1, col].set_xlabel('x'); axes[1, col].legend()
axes[1, col].grid(True, alpha=0.3)
plt.suptitle('DeepONet: Poisson Operator Learning', fontsize=14)
plt.tight_layout()
plt.show()
Note: This toy implementation uses numerical gradients and trains slowly. Production implementations use PyTorch autograd for exact, efficient gradients. See notebooks/neural_operator_full.ipynb for full DeepONet and FNO implementations.
The agent lens. Operator learning can itself be framed as a sequential decision problem; the table below maps it onto MDP components, and the exploration ideas that follow sketch concrete agent designs.
| MDP Component | Operator-Learning Interpretation |
|---|---|
| State | Current model parameters, training/validation loss, library of PDE families already learned |
| Action | Choose architecture (DeepONet vs FNO vs hybrid), select number of Fourier modes $k_{\max}$, choose sensor locations, decide training distribution of input functions |
| Reward | Low relative prediction error on unseen inputs + fast inference time + generalization to new resolutions |
| Policy | Meta-learning across PDE families; active data acquisition (run expensive solver only for most informative inputs) |
Active data acquisition. Generating training data requires running a classical PDE solver, which is expensive. A smart agent chooses which input functions to solve next based on where the current model is most uncertain. This reduces the total number of expensive simulations needed. Techniques from Bayesian optimization and active learning apply directly.
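A minimal sketch of the selection step, assuming a small ensemble of independently trained surrogate models (the ensemble and the variance-based score are illustrative choices, not a prescribed method):

```python
import numpy as np

def select_next(ensemble_preds, candidate_ids, n_pick=1):
    """Pick the candidate inputs where ensemble members disagree most.

    ensemble_preds: (n_models, n_candidates, n_query) array of predictions
    on a pool of candidate forcing functions that have NOT been solved yet.
    """
    # Disagreement = variance across models, averaged over query points
    disagreement = ensemble_preds.var(axis=0).mean(axis=1)
    order = np.argsort(disagreement)[::-1]   # most uncertain first
    return [candidate_ids[i] for i in order[:n_pick]]
```

The selected candidates are then passed to the expensive classical solver, and the resulting input-output pairs are added to the training set before retraining.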
Architecture search as agent design. Choosing between DeepONet and FNO, selecting $k_{\max}$, choosing network width/depth, and setting the basis dimension $p$ are all design decisions. A meta-learning agent could try different configurations on small-scale problems and transfer the best architecture to the full-scale problem.
Multi-fidelity learning. An agent can combine cheap low-resolution data with expensive high-resolution data, learning to correct the low-fidelity predictions rather than learning from scratch. This is analogous to curriculum learning: start with easy (low-res) tasks, then refine with hard (high-res) tasks.
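A toy illustration of the residual idea, with stand-in data (the functions and bias term here are invented for illustration): the correction that the high-fidelity model must learn is much smaller than the solution itself.

```python
import numpy as np

# Coarse (cheap) and fine (expensive) grids for the same problem
x_lo = np.linspace(0, 1, 16)
x_hi = np.linspace(0, 1, 64)

u_exact = lambda x: np.sin(np.pi * x) / np.pi**2         # stand-in "truth"
u_lo = u_exact(x_lo) + 0.05 * np.sin(3 * np.pi * x_lo)   # coarse: biased
u_hi = u_exact(x_hi)                                     # fine: accurate

u_lo_lifted = np.interp(x_hi, x_lo, u_lo)   # lift coarse solution to fine grid
residual = u_hi - u_lo_lifted               # target for the correction model
```

Because the residual has a much smaller amplitude than $u$ itself, far fewer expensive high-resolution samples are needed to learn it accurately.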
Exercise 1. Write out the DeepONet prediction formula for the antiderivative operator $\mathcal{G}[f](x) = \int_0^x f(s)\,ds$ with $p = 3$ basis functions, 4 sensor points, and query point $x = 0.7$.
$\hat{F}(0.7) = b_1(f(s_1), f(s_2), f(s_3), f(s_4))\tau_1(0.7) + b_2(\cdots)\tau_2(0.7) + b_3(\cdots)\tau_3(0.7) + b_0$, where $b_k$ are branch network outputs and $\tau_k$ are trunk network outputs.
Exercise 2. Explain why the FNO spectral convolution is more efficient than a pointwise convolution in physical space for long-range interactions.
In physical space, capturing interactions between points separated by distance $d$ requires a convolution kernel of width at least $d$, costing $O(d)$ per output point and $O(n \cdot d)$ in total. In frequency space, each retained mode mixes information from the entire domain, and the learned transform costs $O(k_{\max} \cdot d_{\text{in}} \cdot d_{\text{out}})$. The FFT costs $O(n \log n)$, making the total $O(n \log n + k_{\max} d_{\text{in}} d_{\text{out}})$, which is cheaper when $k_{\max} \ll n$.
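The cost argument rests on the convolution theorem: circular convolution in physical space equals pointwise multiplication in frequency space. A direct check in NumPy:

```python
import numpy as np

n = 64
rng = np.random.default_rng(1)
v = rng.normal(size=n)
kernel = rng.normal(size=n)               # full-support kernel (worst case)

# Direct circular convolution: O(n) per output point, O(n^2) total
direct = np.array([sum(kernel[j] * v[(i - j) % n] for j in range(n))
                   for i in range(n)])

# Spectral route: FFT, pointwise multiply, IFFT -- O(n log n) total
spectral = np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(v)).real
```

Both routes produce the same output; only the cost differs.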
Exercise 3. State the conditions under which Chen & Chen’s universal approximation theorem applies. Does it require the operator to be linear?
The theorem requires: (1) the operator $\mathcal{G}$ is continuous, (2) the input set $K$ is compact in $C([0,1])$, (3) the output is in $C(\Omega)$. Linearity is not required—the theorem applies to nonlinear operators.
Exercise 4. For a 1D FNO with $k_{\max} = 5$ Fourier modes on a grid of $n = 64$ points, how many learnable parameters does one spectral convolution layer have (for single input/output channel)?
Each of the $k_{\max} = 5$ modes has one complex weight, so $5 \times 2 = 10$ real parameters; the pointwise linear transform $W$ adds 1 more, for 11 in total (12 if a bias term is counted). For $d_{\text{in}} = d_{\text{out}} = d$ channels: $5 \times d^2 \times 2$ spectral parameters plus $d^2$ for $W$, plus $d$ bias terms if used.
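A small counting helper (hypothetical, just to make the arithmetic checkable) under the convention that each complex spectral weight contributes two real parameters:

```python
def fno_layer_params(k_max, d, bias=True):
    """Parameters in one 1D spectral convolution layer with d channels."""
    spectral = 2 * k_max * d * d    # one complex weight per mode per channel pair
    pointwise = d * d               # the pointwise linear transform W
    return spectral + pointwise + (d if bias else 0)
```

For example, `fno_layer_params(5, 1, bias=False)` reproduces the single-channel count above.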
Exercise 5. Modify the code lab to learn the antiderivative operator $\mathcal{G}[f](x) = \int_0^x f(s)\,ds$ instead of the Poisson operator. Generate data analytically and compare training convergence.
Exercise 6. Implement a 1D spectral convolution layer in NumPy: take a vector of length 32, apply FFT, multiply the first 8 modes by random weights, apply IFFT, and plot the result.
Exercise 7. Train the toy DeepONet on forcing functions with amplitudes scaled to $[-1, 1]$, then test on forcing functions with amplitudes in $[-3, 3]$. How well does the operator extrapolate?
Exercise 8. Implement a simple resolution-transfer experiment: train the DeepONet on $n = 32$ point grids, then evaluate at $n = 64$ by querying the trunk network at 64 locations. Report the error compared to the 64-point FD solution.
Exercise 9. Design an active data acquisition agent: maintain a pool of candidate forcing functions, train the DeepONet on a small initial set, then iteratively select the candidate with highest predicted uncertainty (approximated as ensemble disagreement) and add its solution to the training set. Compare with random acquisition.
Exercise 10. Implement a model selection agent that trains both a DeepONet and a simple fully-connected network on the same data and automatically selects the architecture with lower validation error. Run on 3 different PDE families and report which architecture wins each time.
Q1. What is the difference between a function and an operator?
A function maps finite-dimensional inputs (numbers or vectors) to finite-dimensional outputs. An operator maps functions to functions. The Poisson solution operator maps a forcing function $f$ to the solution function $u$.
Q2. What do the branch and trunk networks in DeepONet represent?
The branch network encodes the input function (evaluated at sensor points) into a coefficient vector. The trunk network encodes the query location into a basis vector. Their inner product gives the operator output at the query point.
Q3. How does the FNO achieve global spatial interaction efficiently?
By performing convolution in the Fourier domain. The FFT transforms the input to frequency space at $O(n \log n)$ cost, a pointwise multiply captures global interactions, and the inverse FFT returns to physical space.
Q4. What does resolution invariance mean for a neural operator?
A model trained on one grid resolution can be evaluated at a different resolution without retraining. FNOs achieve this because spectral weights are defined in mode space (independent of grid size), and DeepONets because the trunk network takes continuous coordinates.
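For the spectral side, this property can be verified directly: the same mode-space weights define the same operator on any grid fine enough to resolve the kept modes (the weights and test function below are arbitrary illustrations):

```python
import numpy as np

def spectral_layer(v, R):
    """Apply fixed spectral weights R (one per kept mode) to a signal of any length."""
    v_hat = np.fft.rfft(v)
    w_hat = np.zeros_like(v_hat)
    w_hat[:len(R)] = R * v_hat[:len(R)]
    return np.fft.irfft(w_hat, n=len(v))

R = np.array([0.0, 2.0, -1.0])                   # weights for modes 0, 1, 2
f = lambda x: np.cos(2 * np.pi * x) + 0.5 * np.sin(4 * np.pi * x)

x32 = np.linspace(0, 1, 32, endpoint=False)
x64 = np.linspace(0, 1, 64, endpoint=False)
out32 = spectral_layer(f(x32), R)
out64 = spectral_layer(f(x64), R)                # same weights, finer grid
```

Because the FFT coefficients of the kept modes scale with grid size in exactly the way the inverse FFT undoes, the two outputs sample the same continuous function, so `out32` agrees with `out64` at the shared grid points.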
Q5. What is the role of $k_{\max}$ in the FNO?
$k_{\max}$ is the number of Fourier modes retained in the spectral convolution. It controls the frequency resolution: higher $k_{\max}$ captures finer spatial features but increases model size and overfitting risk.
Q6. Why does the universal approximation theorem for operators (Chen & Chen 1995) not guarantee practical accuracy?
The theorem is existential: it guarantees approximation is possible with enough basis functions but says nothing about how many are needed, how to find the right parameters, or how much training data is required. In practice, optimization difficulties and finite data limit accuracy.
Q7. What is the main advantage of neural operators over PINNs?
Neural operators learn the entire solution map and amortize the cost: training is expensive but inference for any new input is nearly instant. PINNs must retrain (or re-optimize) for each new set of conditions.
Q8. How is training data generated for operator learning?
By sampling many input functions from a distribution and solving the PDE for each using a classical solver. The input-output pairs $(f_i, u_i)$ form the training set.
Q9. What is the computational cost of one FNO forward pass on a 1D grid of $n$ points?
$O(L \cdot (d\,n \log n + k_{\max} d^2))$ where $L$ is the number of layers, $d$ is the channel width, and $k_{\max}$ is the number of Fourier modes. The $d\,n \log n$ term comes from the per-channel FFT/IFFT in each layer.
Q10. In the agent lens, what does “active data acquisition” mean?
Choosing which input functions to simulate next based on where the current model is most uncertain, rather than sampling randomly. This reduces the total number of expensive simulations needed to achieve a target accuracy.
Train a DeepONet in NumPy to learn $\mathcal{G}[f](x) = \int_0^x f(s)\,ds$.
| Criterion | Weight | Description |
|---|---|---|
| Correctness | 30% | In-distribution relative error below 5% |
| Generalization | 25% | Analysis of out-of-distribution performance with discussion |
| Visualization | 20% | Clear plots of predictions, errors, and training curves |
| Analysis | 15% | Discussion of basis dimension, sensor placement, and training distribution effects |
| Code quality | 10% | Modular, well-commented code |
For complete DeepONet and FNO implementations: notebooks/neural_operator_full.ipynb