PROJECT 4.1

Agent-Based Simulation of Physical Systems

Difficulty Advanced
Estimated Time 5 – 7 hours
Prerequisites Module 1.3 (Numerical Methods), Module 1.4 (Higher-Order ODEs), Module 3.4 (RL for Dynamical Systems)

1. Problem Statement

The gravitational three-body problem is one of the oldest unsolved problems in classical mechanics. Given three point masses interacting through Newtonian gravity, the system of coupled second-order ODEs is:

$$m_i \ddot{\mathbf{r}}_i = \sum_{j \neq i} \frac{G\, m_i\, m_j\, (\mathbf{r}_j - \mathbf{r}_i)}{|\mathbf{r}_j - \mathbf{r}_i|^3}, \quad i = 1, 2, 3$$

Naively integrating this system with a fixed time step and a single integration method leads to either excessive computation (small fixed step) or numerical blow-up near close encounters (large fixed step). Your task is to build a reinforcement learning agent that adaptively selects both the time step size and the integration method at each simulation step, aiming to maintain the relative energy error below a prescribed threshold while minimizing the total number of force evaluations (a proxy for computational cost).

The agent must handle the full range of dynamical regimes that arise in the three-body problem: wide hierarchical orbits where large steps suffice, close binary interactions requiring small steps and symplectic methods, and chaotic scattering events where the trajectory is sensitive to integration accuracy. A successful solution demonstrates that learned policies can outperform hand-tuned adaptive stepping heuristics on this challenging benchmark.

2. Modeling Assumptions

  1. Point-mass approximation. Each body is a point mass with no spatial extent. A close approach with $|\mathbf{r}_i - \mathbf{r}_j| < \epsilon_{\text{soft}}$ is treated as a collision; the softening length $\epsilon_{\text{soft}} = 10^{-4}$ AU also regularizes the force law so it never becomes singular.
  2. Newtonian gravity only. No relativistic corrections, radiation pressure, or tidal effects are included. The gravitational constant is $G = 4\pi^2$ in units where distances are in AU, masses in solar masses, and time in years.
  3. Two-dimensional dynamics. All motion is restricted to the $xy$-plane. This simplifies the state space from 18 to 12 phase-space dimensions while preserving the essential chaotic dynamics of the three-body problem.
  4. Bounded simulation domain. If any body moves beyond $R_{\max} = 100$ AU from the center of mass, the simulation is terminated as an ejection event.
  5. Energy conservation as the quality metric. The relative energy error is $\Delta E / |E_0| = |E(t) - E_0| / |E_0|$, where $E_0$ is the initial total energy. The threshold is $\delta = 10^{-6}$.
  6. Discrete method choices. The agent selects from exactly three integration methods: Forward Euler (1 evaluation per step), classical RK4 (4 evaluations per step), and velocity Verlet (2 evaluations per step, symplectic).
  7. Time-step range. The time step is constrained to $dt \in [10^{-6}, 10^{-1}]$ years. The agent chooses a multiplier $m \in \{0.25, 0.5, 1.0, 2.0, 4.0\}$ applied to the current $dt$.
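Assumption 7 can be made concrete with a few lines of code. The sketch below is a minimal illustration of the multiplier-and-clamp update; the name apply_multiplier and the constant names are ours, not part of the specification:

```python
import numpy as np

DT_MIN, DT_MAX = 1e-6, 1e-1                 # allowed time-step range (years)
MULTIPLIERS = (0.25, 0.5, 1.0, 2.0, 4.0)    # multipliers the agent may choose

def apply_multiplier(dt, mult):
    """Scale the current time step and clamp it to [DT_MIN, DT_MAX]."""
    return float(np.clip(mult * dt, DT_MIN, DT_MAX))
```

Note that the clamp is what keeps repeated 4.0x (or 0.25x) choices from driving $dt$ outside the allowed range.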

3. Synthetic Data Generation

There is no external dataset for this project. Instead, you must generate initial conditions for the three-body system and use numerical integration to produce reference trajectories. The following Python code generates a single random initial condition and computes a high-accuracy reference solution using scipy.integrate.solve_ivp with the DOP853 method (8th-order Runge-Kutta) at very tight tolerances.

import numpy as np
from scipy.integrate import solve_ivp

# --- Constants ---
G = 4.0 * np.pi**2        # AU^3 / (M_sun * yr^2)
SOFTENING = 1e-4           # AU
T_FINAL = 5.0              # years
N_SCENARIOS = 50           # number of initial conditions to generate

def generate_initial_conditions(seed=None):
    """Generate random initial conditions for a 3-body system in 2D.

    Returns:
        masses: array of shape (3,), in solar masses [1, 10]
        state0: array of shape (12,), [x1,y1,x2,y2,x3,y3,vx1,vy1,vx2,vy2,vx3,vy3]
    """
    rng = np.random.default_rng(seed)

    # Random masses between 1 and 10 solar masses
    masses = rng.uniform(1.0, 10.0, size=3)

    # Random positions in [-5, 5] AU
    positions = rng.uniform(-5.0, 5.0, size=6)

    # Shift to center-of-mass frame
    total_mass = masses.sum()
    com_x = (masses[0]*positions[0] + masses[1]*positions[2] + masses[2]*positions[4]) / total_mass
    com_y = (masses[0]*positions[1] + masses[1]*positions[3] + masses[2]*positions[5]) / total_mass
    positions[0] -= com_x; positions[1] -= com_y
    positions[2] -= com_x; positions[3] -= com_y
    positions[4] -= com_x; positions[5] -= com_y

    # Random velocities in [-2, 2] AU/yr
    velocities = rng.uniform(-2.0, 2.0, size=6)

    # Shift to zero total momentum
    px = masses[0]*velocities[0] + masses[1]*velocities[2] + masses[2]*velocities[4]
    py = masses[0]*velocities[1] + masses[1]*velocities[3] + masses[2]*velocities[5]
    velocities[0] -= px / (3*masses[0]); velocities[1] -= py / (3*masses[0])
    velocities[2] -= px / (3*masses[1]); velocities[3] -= py / (3*masses[1])
    velocities[4] -= px / (3*masses[2]); velocities[5] -= py / (3*masses[2])

    state0 = np.concatenate([positions, velocities])
    return masses, state0


def gravitational_rhs(t, state, masses):
    """Right-hand side of the 3-body equations of motion in 2D."""
    pos = state[:6].reshape(3, 2)
    vel = state[6:].reshape(3, 2)
    acc = np.zeros_like(pos)
    for i in range(3):
        for j in range(3):
            if i == j:
                continue
            rij = pos[j] - pos[i]
            dist = np.sqrt(np.dot(rij, rij) + SOFTENING**2)
            acc[i] += G * masses[j] * rij / dist**3
    return np.concatenate([vel.flatten(), acc.flatten()])


def total_energy(state, masses):
    """Compute kinetic + potential energy of the 3-body system."""
    pos = state[:6].reshape(3, 2)
    vel = state[6:].reshape(3, 2)
    KE = 0.5 * sum(masses[i] * np.dot(vel[i], vel[i]) for i in range(3))
    PE = 0.0
    for i in range(3):
        for j in range(i+1, 3):
            rij = pos[j] - pos[i]
            dist = np.sqrt(np.dot(rij, rij) + SOFTENING**2)
            PE -= G * masses[i] * masses[j] / dist
    return KE + PE


def generate_reference_solution(masses, state0, t_final=T_FINAL, n_eval=10000):
    """High-accuracy reference solution using DOP853."""
    t_eval = np.linspace(0, t_final, n_eval)
    sol = solve_ivp(
        gravitational_rhs, [0, t_final], state0,
        args=(masses,), method='DOP853',
        rtol=1e-13, atol=1e-15, t_eval=t_eval, max_step=1e-3
    )
    return sol.t, sol.y.T  # shape (n_eval, 12)


# --- Generate all scenarios ---
scenarios = []
for i in range(N_SCENARIOS):
    masses, state0 = generate_initial_conditions(seed=42 + i)
    E0 = total_energy(state0, masses)
    t_ref, y_ref = generate_reference_solution(masses, state0)
    scenarios.append({
        'masses': masses, 'state0': state0,
        'E0': E0, 't_ref': t_ref, 'y_ref': y_ref
    })
    print(f"Scenario {i}: masses={masses.round(2)}, E0={E0:.6f}")

Generate at least 50 scenarios. Use 40 for training the RL agent and hold out 10 for evaluation. Store each scenario as a dictionary containing the masses, initial state, initial energy, and reference trajectory.
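Since Deliverable D2 expects a single scenarios.npz file, the heterogeneous scenario dictionaries must be flattened into named arrays before saving. One workable layout (the function names and the indexed-key scheme are our choice, not a requirement) is:

```python
import numpy as np

def save_scenarios(scenarios, path="scenarios.npz"):
    """Flatten a list of scenario dicts into indexed arrays for np.savez."""
    flat = {}
    for i, s in enumerate(scenarios):
        for key, val in s.items():
            flat[f"{key}_{i}"] = np.asarray(val)   # e.g. "masses_0", "E0_3"
    np.savez_compressed(path, n=len(scenarios), **flat)

def load_scenarios(path="scenarios.npz"):
    """Invert save_scenarios: rebuild the list of scenario dicts."""
    data = np.load(path)
    n = int(data["n"])
    keys = ("masses", "state0", "E0", "t_ref", "y_ref")
    return [{k: data[f"{k}_{i}"] for k in keys} for i in range(n)]
```

A round trip (save, then load) should reproduce every array exactly, which makes a convenient unit test for the grader's reproducibility run.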

4. Baselines

4.1 Classical Baseline: Fixed-Step RK4

The classical baseline uses the standard fourth-order Runge-Kutta method with a fixed time step of $dt = 0.001$ years. This is the simplest "safe" choice: it provides reasonable accuracy for moderately separated configurations but wastes computation during quiescent phases and may still violate the energy threshold during close encounters.

def rk4_step(f, t, state, dt, masses):
    """Single step of the classical RK4 method."""
    k1 = f(t, state, masses)
    k2 = f(t + 0.5*dt, state + 0.5*dt*k1, masses)
    k3 = f(t + 0.5*dt, state + 0.5*dt*k2, masses)
    k4 = f(t + dt, state + dt*k3, masses)
    return state + (dt / 6.0) * (k1 + 2*k2 + 2*k3 + k4)


def run_fixed_rk4(masses, state0, t_final, dt=0.001):
    """Run fixed-step RK4 simulation. Returns trajectory and evaluation count."""
    t = 0.0
    state = state0.copy()
    trajectory = [state.copy()]
    times = [0.0]
    n_evals = 0

    while t < t_final:
        dt_actual = min(dt, t_final - t)
        state = rk4_step(gravitational_rhs, t, state, dt_actual, masses)
        t += dt_actual
        n_evals += 4  # RK4 uses 4 function evaluations per step
        trajectory.append(state.copy())
        times.append(t)

    return np.array(times), np.array(trajectory), n_evals

Record the total number of force evaluations, final energy error, and maximum energy error over the trajectory. These serve as the primary comparison metrics.
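Extracting the final and maximum relative energy error from a completed run can be done with a small helper. This sketch assumes the energies have already been computed by mapping total_energy from Section 3 over the trajectory rows; the function name is ours:

```python
import numpy as np

def energy_error_metrics(energies, E0):
    """Relative energy error |E(t) - E0| / |E0| along a trajectory.

    energies: sequence of total energies, one per saved state.
    E0: initial total energy of the scenario.
    """
    err = np.abs(np.asarray(energies, dtype=float) - E0) / abs(E0)
    return {"final": float(err[-1]), "max": float(err.max())}
```

Reporting both numbers matters: a symplectic method can have a small final error while oscillating above the threshold mid-trajectory, and only the maximum catches that.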

4.2 Heuristic Baseline: Velocity-Adaptive Stepping

A simple adaptive heuristic adjusts the time step inversely proportional to the maximum velocity magnitude in the system. When bodies are moving fast (close encounter), the step shrinks; when they are moving slowly (wide separation), the step grows.

def run_velocity_adaptive(masses, state0, t_final, dt_base=0.001):
    """Heuristic adaptive stepping based on max velocity magnitude."""
    t = 0.0
    state = state0.copy()
    trajectory = [state.copy()]
    times = [0.0]
    n_evals = 0
    dt = dt_base

    while t < t_final:
        # Compute max velocity magnitude
        vel = state[6:].reshape(3, 2)
        v_max = max(np.linalg.norm(vel[i]) for i in range(3))

        # Adaptive step: dt ~ 1 / v_max, clamped to [1e-6, 0.1]
        dt = np.clip(dt_base / max(v_max, 0.1), 1e-6, 0.1)
        dt_actual = min(dt, t_final - t)

        state = rk4_step(gravitational_rhs, t, state, dt_actual, masses)
        t += dt_actual
        n_evals += 4
        trajectory.append(state.copy())
        times.append(t)

    return np.array(times), np.array(trajectory), n_evals

This heuristic is intuitive but does not account for the structure of the integration error (it does not switch methods) and cannot learn from past experience across different initial conditions.
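The environment of Section 5 also needs the other two integrators named in Assumption 6, which Section 4 does not show. Minimal sketches follow; euler_step mirrors rk4_step's signature, while verlet_step (our naming) takes an acceleration function of positions, since velocity Verlet works on the split position/velocity form:

```python
import numpy as np

def euler_step(f, t, state, dt, masses):
    """Forward Euler: one RHS evaluation per step."""
    return state + dt * f(t, state, masses)

def verlet_step(accel, pos, vel, dt, masses):
    """Velocity Verlet: two force evaluations per step, symplectic.

    accel(pos, masses) must return the accelerations for given positions.
    """
    a0 = accel(pos, masses)
    pos_new = pos + dt * vel + 0.5 * dt**2 * a0
    a1 = accel(pos_new, masses)
    vel_new = vel + 0.5 * dt * (a0 + a1)
    return pos_new, vel_new
```

A quick sanity check on the simple harmonic oscillator (where accel = -pos) is a good unit test before wiring these into the environment.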

5. Agent Approach Specification

5.1 State Space

At each simulation step $k$, the agent observes:

$$\mathbf{s}_k = \bigl(\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3,\; \mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3,\; \Delta E_k,\; dt_k,\; m_{\text{method}}\bigr) \in \mathbb{R}^{15}$$

where $\mathbf{r}_i, \mathbf{v}_i \in \mathbb{R}^2$ are the positions and velocities of the three bodies (12 components), $\Delta E_k = |E(t_k) - E_0| / |E_0|$ is the current relative energy error (1 component), the current time step enters on a log scale as $\log_{10} dt_k$ (1 component), and $m_{\text{method}} \in \{0, 1, 2\}$ is the index of the currently selected integration method (1 component; one-hot encoded as 3 components in practice, giving 17 total inputs).
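Assembling the observation vector is a one-liner worth getting right once. A minimal sketch (build_observation is our name; the ordering of components is our choice, as long as it is fixed):

```python
import numpy as np

def build_observation(state, dE, dt, method_idx):
    """Concatenate the 12 phase-space components, the relative energy
    error, log10(dt), and a 3-way one-hot method encoding."""
    one_hot = np.zeros(3)
    one_hot[method_idx] = 1.0
    return np.concatenate([state, [dE, np.log10(dt)], one_hot])
```

Feeding $dt$ on a log scale matters because the allowed range spans five orders of magnitude; a raw $dt$ input would be nearly indistinguishable from zero for most of that range.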

5.2 Action Space

The agent outputs a discrete action $a_k \in \{0, 1, \ldots, 14\}$, which encodes a joint decision:

  • Time-step multiplier $\mu \in \{0.25,\, 0.5,\, 1.0,\, 2.0,\, 4.0\}$ — applied as $dt_{k+1} = \text{clip}(\mu \cdot dt_k,\; 10^{-6},\; 10^{-1})$.
  • Method index $m \in \{0, 1, 2\}$ corresponding to Euler, RK4, Verlet respectively.

The total action space has $5 \times 3 = 15$ discrete actions.
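Decoding the flat action index into the (multiplier, method) pair can use any fixed bijection; the multiplier-major layout below is our choice, not mandated by the spec:

```python
def decode_action(a):
    """Map a flat action index in 0..14 to (dt multiplier, method index).

    Method indices follow Section 5.2: 0 = Euler, 1 = RK4, 2 = Verlet.
    """
    multipliers = (0.25, 0.5, 1.0, 2.0, 4.0)
    return multipliers[a // 3], a % 3
```

Whatever layout you pick, keep it identical in the environment and in any analysis script (e.g. the method-selection heatmap of Deliverable D6e), or the plots will be silently wrong.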

5.3 Reward Function

$$r_k = \begin{cases} -\,c_{\text{method}} & \text{if } \Delta E_k < \delta = 10^{-6} \\[4pt] -1000 & \text{if } \Delta E_k \geq \delta \end{cases}$$

where $c_{\text{method}}$ is the computational cost of the chosen method: $c_{\text{Euler}} = 1$, $c_{\text{Verlet}} = 2$, $c_{\text{RK4}} = 4$. This reward structure incentivizes the agent to advance the simulation as cheaply as possible (fewest force evaluations) while maintaining energy conservation. The large penalty for violating the threshold ensures constraint satisfaction.
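The reward above is simple enough to express directly. A sketch (step_reward is our name; the cost tuple is ordered by the method indices of Section 5.2):

```python
def step_reward(dE, method_idx, delta=1e-6):
    """Per-step reward: negative method cost within tolerance, else -1000.

    Costs are force evaluations per step, indexed as Euler, RK4, Verlet.
    """
    costs = (1, 4, 2)
    return -1000.0 if dE >= delta else -float(costs[method_idx])
```

Because the per-step penalty equals the number of force evaluations, the undiscounted episode return (absent violations) is exactly minus the total evaluation count, so maximizing return directly minimizes the cost metric of Section 4.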

5.4 Training Procedure

  1. Environment. Wrap the three-body integrator as a Gymnasium-compatible environment. Each episode corresponds to simulating one scenario from $t = 0$ to $t = T_{\text{final}}$. The episode terminates early if the energy error exceeds $\delta$ or a body is ejected beyond $R_{\max}$.
  2. Algorithm. Use Proximal Policy Optimization (PPO) with a shared encoder for the actor and critic. Architecture: 2-layer MLP with 128 hidden units each, ReLU activations.
  3. Hyperparameters. Learning rate: $3 \times 10^{-4}$. Discount factor $\gamma = 0.99$. GAE parameter $\lambda = 0.95$. Clip ratio $\epsilon = 0.2$. Mini-batch size: 64. Number of epochs per update: 10. Total training steps: $5 \times 10^5$.
  4. Curriculum. Begin training with "easy" scenarios (well-separated bodies, small velocities), then gradually introduce harder configurations (close initial separations, high relative velocities) by controlling the random seed distribution.
  5. Evaluation. Every 10,000 training steps, evaluate on the held-out 10 scenarios. Report mean and max energy error, total force evaluations, and episode completion rate.
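The termination logic of item 1 is easy to get subtly wrong, so it is worth isolating. A sketch of one possible check (function name and the reason strings are ours; positions are assumed to be in the center-of-mass frame, so distance from the origin stands in for distance from the center of mass):

```python
import numpy as np

def check_termination(t, t_final, dE, pos, delta=1e-6, r_max=100.0):
    """Return (done, reason) for one step of the episode.

    Checks, in our chosen order: energy violation, ejection, time horizon.
    pos has shape (3, 2), in the center-of-mass frame.
    """
    if dE >= delta:
        return True, "energy_violation"
    if np.any(np.linalg.norm(pos, axis=1) > r_max):
        return True, "ejection"
    if t >= t_final:
        return True, "completed"
    return False, ""
```

Returning a reason string makes the completion-rate statistics of item 5 trivial to compute, and maps naturally onto Gymnasium's terminated flag plus an info dict.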

6. Deliverables

  D1: threebody_env.py. Gymnasium environment wrapping the 3-body integrator with Euler, RK4, and Verlet methods. Includes reset(), step(), and render().
  D2: generate_scenarios.py. Script to generate 50+ initial conditions and reference trajectories. Saves to scenarios.npz.
  D3: train_agent.py. PPO training script with logging. Saves checkpoints every 50k steps.
  D4: evaluate.py. Evaluation script: runs the agent, fixed-RK4, and heuristic on the held-out scenarios. Outputs a CSV with metrics.
  D5: Report (PDF, 4–6 pages). Sections: Introduction, Methods (environment and agent), Results (plots and tables), Discussion, Conclusion.
  D6: Plots.
      (a) Orbital trajectories for 3 representative scenarios.
      (b) Energy error vs. time for agent, fixed-RK4, and heuristic.
      (c) Cumulative force evaluations vs. time for all three methods.
      (d) Training reward curve (mean ± std over 5 seeds).
      (e) Heatmap of the agent's method selection over the trajectory.
  D7: Metrics table. For each method and each held-out scenario: max energy error, mean energy error, total force evaluations, wall-clock time, and completion rate.

7. Grading Rubric

  Correctness (30 pts): Euler, RK4, and Verlet are implemented correctly and pass unit tests against known two-body analytical solutions. Energy conservation for fixed-step RK4 matches the reference. Agent actions are applied correctly (dt clipping, method switching).
  Efficiency (25 pts): The trained agent uses fewer total force evaluations than fixed-step RK4 on at least 7 of 10 held-out scenarios while maintaining the energy threshold, and improves on the heuristic baseline on at least 5 of 10 scenarios.
  Robustness (20 pts): The agent completes at least 8 of 10 held-out scenarios without triggering the energy violation penalty or ejection termination. Performance is consistent across 3 random training seeds.
  Reproducibility (15 pts): A single requirements.txt and README.md allow the grader to reproduce all results by running python generate_scenarios.py && python train_agent.py && python evaluate.py. Random seeds are fixed. All hyperparameters are in a config file or argparse.
  Documentation (10 pts): The report is clearly written with proper figures. Code has docstrings and inline comments for non-obvious steps. Plots are labeled and legible.
  Total: 100 pts.

8. References & Links

  1. Hairer, E., Lubich, C., & Wanner, G. (2006). Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations. Springer.
  2. Aarseth, S. J. (2003). Gravitational N-Body Simulations: Tools and Algorithms. Cambridge University Press.
  3. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
  4. Breen, P. G., Foley, C. N., Boekholt, T., & Portegies Zwart, S. (2020). "Newton vs the machine: solving the chaotic three-body problem using deep neural networks." Monthly Notices of the Royal Astronomical Society, 494(2), 2465–2470.
  5. Gymnasium documentation: https://gymnasium.farama.org/
  6. Stable-Baselines3 PPO: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html