An ablation study on DiffusionDriveV2 that replaces its DDIM diffusion sampler with a rectified flow matching model and attempts RL fine-tuning via GRPO/PPO. Built on the NAVSIM benchmark.
DiffusionDriveV2 achieves 91.2 PDMS using DDIM + GRPO-style RL fine-tuning. This project replaces the DDIM denoiser with a rectified flow matching ODE and attempts the same RL fine-tuning. After extensive experimentation (20+ configurations across multi-step chains, single-step REINFORCE, PPO, DPPO, and Flow-GRPO approaches), we conclude that truncated flow matching (t_max=0.05) with single-step inference is structurally incompatible with RL fine-tuning.
The pretrained IL flow model achieves 0.94 PDM score but resists all RL improvement attempts due to an unresolvable tension between exploration noise scale and proposal quality.
4 cameras + LiDAR
|
TransFuser Backbone (ResNet-34 + BEV) [frozen during RL]
|
BEV Features [B, 256, H, W]
|
Flow Trajectory Head [trainable]
|
20 K-Means Anchors x G groups = 80 proposals
|
Rectified Flow: single pass at t=0.02
|
PDM Scorer -> best trajectory
The trajectory head operates in normalized [-1, 1] space and uses 20 pre-computed anchor trajectories. At inference, a single decoder pass at t=0.02 with input x_t = 0.98 * anchor produces 20 trajectory predictions. The decoder is a 2-layer CustomTransformerDecoder with cross-BEV attention, time-embedding modulation, and per-mode classification + regression heads.
navsim/agents/flow_drive_agent/
flow_model.py - IL flow model (pretraining)
flow_agent.py - Lightning wrapper for IL
flow_rl_model.py - RL flow model (fine-tuning attempts)
flow_rl_agent.py - Lightning wrapper for RL
navsim/planning/script/config/
training/flow_rl_mini.yaml - navmini RL config (fast iteration)
training/flow_rl_navtrain.yaml - navtrain RL config (full training)
common/agent/flow_rl_agent.yaml - agent hyperparameters
navsim/planning/training/
agent_lightning_module.py - RLAgentLightningModule (multi-epoch PPO)
scripts/training/
run_flow_rl_navtrain.sh - Launch script
CHANGES.md - Detailed experiment log with all attempts
File: flow_model.py
- Train: t ~ U(0, 0.05), x_t = (1-t)*anchor + t*noise, decoder predicts clean x_0
- Inference: single pass at t=0.02 with zero noise input
- Achieves 0.94 PDM on navtrain evaluation
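The interpolation used for training and inference can be sketched as follows. This is a minimal NumPy illustration, not the code in flow_model.py: the function names, the tensor layout `[batch, modes, horizon, dims]`, and the choice of standard Gaussian noise are assumptions made for the example.

```python
import numpy as np

T_MAX = 0.05    # upper end of the truncated training range (from the text)
T_INFER = 0.02  # fixed inference time (from the text)


def interpolate(anchor, noise, t):
    """Rectified-flow interpolation x_t = (1 - t) * anchor + t * noise."""
    # Reshape t so a per-sample time broadcasts over the trailing axes.
    t = np.asarray(t, dtype=anchor.dtype).reshape(-1, *([1] * (anchor.ndim - 1)))
    return (1.0 - t) * anchor + t * noise


def training_sample(anchor, rng):
    """Draw t ~ U(0, T_MAX) per sample and build the noisy decoder input."""
    t = rng.uniform(0.0, T_MAX, size=anchor.shape[0])
    # Assumption: noise is standard Gaussian in the normalized [-1, 1] space.
    noise = rng.standard_normal(anchor.shape).astype(anchor.dtype)
    return interpolate(anchor, noise, t), t


def inference_input(anchor):
    """Zero-noise input at t = T_INFER, which reduces to 0.98 * anchor."""
    return interpolate(anchor, np.zeros_like(anchor), T_INFER)
```

With zero noise the inference input collapses to a fixed scaling of the anchors, so the only stochasticity in this setup comes from whatever exploration noise is added on top.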
File: flow_rl_model.py
| # | Approach | Diversity (adv_frac_zero) | Stability | Outcome |
|---|---|---|---|---|
| 1 | Multi-step chain (t=0.15->0), z-inversion PPO | 0.60 | Unstable | Reward 0.88->0.60 |
| 2 | Single-step REINFORCE, output noise | 0.89 | Stable | Zero gradient |
| 3 | Input noise diversity | 0.87 | Stable | No policy movement |
| 4 | Two-step chain | 0.85 | Stable | Zero gradient |
| 5 | DPPO multi-step (5 steps, per-step log-prob) | 0.60 | Unstable | Reward 0.86->0.26 |
| 6 | Flow-GRPO symmetric adv, sigma=0.15 | 0.54 | Unstable | Reward 0.21 at step 0 |
| 7 | Multiplicative noise, sigma_logprob=1.0 | 0.89 | Stable | kl_ref grows, no improvement |
| 8 | Symmetric adv, sigma=0.04, z-inversion PPO | 0.50 | Unstable | Reward 0.44->0.01 |
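To illustrate why a high adv_frac_zero goes together with a vanishing gradient, here is a small sketch of group-normalized (GRPO-style) advantages. It assumes adv_frac_zero is the fraction of advantage entries that are numerically zero; the exact definition used in flow_rl_model.py may differ, and the function names are hypothetical.

```python
import numpy as np


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for rewards of shape [num_scenes, group_size]."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # If every proposal in a scene scores the same, the numerator is zero
    # and that scene contributes nothing to the policy gradient.
    return (rewards - mean) / (std + eps)


def adv_frac_zero(advantages, tol=1e-8):
    """Assumed metric: fraction of advantage entries that are numerically zero."""
    return float(np.mean(np.abs(advantages) < tol))


# Scene 0: all proposals get the same PDM score -> zero advantages.
# Scene 1: proposals differ -> non-zero advantages.
rewards = np.array([[1.0, 1.0, 1.0, 1.0],
                    [1.0, 0.0, 1.0, 0.0]])
adv = group_advantages(rewards)
frac = adv_frac_zero(adv)
```

Under this reading, a value near 0.9 means that roughly nine in ten sampled proposals carry no learning signal, which matches the "Zero gradient" outcomes in the table.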
There is no exploration noise level that simultaneously:
- Creates enough trajectory diversity for PDM to differentiate proposals (needs >3m displacement)
- Keeps proposals within the safe driving corridor (needs <1.5m displacement)
This is because:
- PDM has binary cliff metrics (collision, drivable_area) that create discontinuous reward
- The pretrained model at 0.94 is already near-optimal with minimal room to improve
- Single-step noise produces position offsets, not shape diversity (PDM responds to curvature/timing changes)
| Property | DDV2 (DDIM) | Flow Matching (this project) |
|---|---|---|
| Diversity | 10-step chain compounds noise -> shape diversity | Single-step offset -> no shape diversity |
| Train/test gap | None (alpha-schedule consistency) | Fatal (chain damages t=0.02 inference) |
| Decoder training range | Full schedule (t=0 to 1000) | Truncated (t in [0, 0.05] only) |
| Exploration mechanism | Per-step noise within DDIM formula | Must be injected externally |
DDV2's key structural advantage: the DDIM scheduler's alpha-schedule makes the decoder's job (predict x_0) identical regardless of which timestep pair is used. As a result, there is no distribution gap between training at timesteps [18,16,...,0] and inferring at [10,0]. Flow matching with truncated t cannot replicate this property.
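For reference, the standard DDIM update with an x_0-predicting decoder can be written as below. This is the textbook form, assuming cumulative alphas $\bar\alpha_t$ and a per-step noise scale $\sigma_t$; DDV2's exact parameterization is not reproduced here.

$$
\hat\epsilon = \frac{x_t - \sqrt{\bar\alpha_t}\,\hat x_0}{\sqrt{1 - \bar\alpha_t}}, \qquad
x_s = \sqrt{\bar\alpha_s}\,\hat x_0 + \sqrt{1 - \bar\alpha_s - \sigma_t^2}\,\hat\epsilon + \sigma_t z, \quad z \sim \mathcal{N}(0, I)
$$

The decoder only ever supplies $\hat x_0$ from $(x_t, t)$; the choice of the target timestep $s$ enters only through the scheduler coefficients. The $\sigma_t z$ term is the per-step exploration noise that sits inside the update itself.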
Single-step flow matching with a truncated noise schedule is structurally incompatible with RL fine-tuning for trajectory planning with PDM rewards. The architecture provides no mechanism for generating reward-differentiable trajectory diversity without at least one of the following:
- Damaging inference behavior (multi-step chains at untrained timesteps)
- Producing degenerate proposals (large exploration noise -> collisions)
- Providing no gradient signal (small exploration noise -> identical PDM scores)
- Retrain flow model on t ~ U(0, 1) with multi-step ODE inference -> eliminates train/test gap, enables DPPO
- Change inference to multi-step (5 deterministic Euler steps) -> matches RL training distribution
- Use DDV2's DDIM-based RL (known to work, the reference implementation)
- Accept 0.94 as ceiling for this single-step architecture
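The multi-step option in the first two items could look roughly like the sketch below. It is a generic deterministic Euler integrator for the interpolation x_t = (1-t)*x_0 + t*noise with an x_0-predicting decoder. `predict_x0` is a hypothetical stand-in for the trajectory head, and the uniform time grid is an assumption.

```python
import numpy as np


def euler_sample(predict_x0, x_start, t_start=1.0, num_steps=5):
    """Integrate the flow ODE from t_start down to 0 with deterministic Euler steps.

    predict_x0(x, t) is a hypothetical callable returning the decoder's
    estimate of the clean trajectory x_0 given the noisy input x at time t.
    """
    ts = np.linspace(t_start, 0.0, num_steps + 1)
    x = x_start
    for t, s in zip(ts[:-1], ts[1:]):
        x0_hat = predict_x0(x, t)
        # Under x_t = (1 - t) * x_0 + t * noise, dx/dt = noise - x_0 = (x_t - x_0) / t.
        velocity = (x - x0_hat) / t
        x = x + (s - t) * velocity  # Euler step from time t to time s < t
    return x
```

With a perfect x_0 predictor every step lands exactly on the interpolation path and the final output equals x_0. With a learned decoder, each step must see inputs at times it was trained on, which is why this option is paired with retraining on t ~ U(0, 1).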
# Environment
export NAVSIM_DEVKIT_ROOT=/path/to/navsim
export NAVSIM_EXP_ROOT=/path/to/navsim/exp
export PYTHONPATH=/path/to/flow_drive_rl:$PYTHONPATH
# IL pretraining (flow matching)
python navsim/planning/script/run_training.py --config-name flow_il_mini
# RL fine-tuning (navmini, fast iteration)
python navsim/planning/script/run_training.py --config-name flow_rl_mini
# RL fine-tuning (navtrain, full)
python navsim/planning/script/run_training.py --config-name flow_rl_navtrain

- DiffusionDriveV2 - DDIM + GRPO baseline
- NAVSIM - Benchmark
- Flow-GRPO - Flow matching RL theory
- DPPO - Diffusion Policy Policy Optimization