Flow Matching RL for Trajectory Planning (Negative Result)

An ablation study on DiffusionDriveV2 that replaces its DDIM diffusion sampler with a rectified flow matching model and attempts RL fine-tuning via GRPO/PPO. Built on the NAVSIM benchmark.

Summary

DiffusionDriveV2 achieves 91.2 PDMS using DDIM + GRPO-style RL fine-tuning. This project replaces the DDIM denoiser with a rectified flow matching ODE and attempts the same RL fine-tuning. After extensive experimentation (20+ configurations across multi-step chains, single-step REINFORCE, PPO, DPPO, and Flow-GRPO approaches), we conclude that truncated flow matching (t_max=0.05) with single-step inference is structurally incompatible with RL fine-tuning.

The pretrained IL flow model achieves 0.94 PDM score but resists all RL improvement attempts due to an unresolvable tension between exploration noise scale and proposal quality.

Architecture

4 cameras + LiDAR
      |
  TransFuser Backbone (ResNet-34 + BEV) [frozen during RL]
      |
  BEV Features [B, 256, H, W]
      |
  Flow Trajectory Head [trainable]
      |
  20 K-Means Anchors x G groups = 80 proposals
      |
  Rectified Flow: single pass at t=0.02
      |
  PDM Scorer -> best trajectory

The trajectory head operates in normalized [-1, 1] space and uses 20 pre-computed anchor trajectories. At inference, a single decoder pass at t=0.02 with input x_t = 0.98 * anchor produces 20 trajectory predictions. The decoder is a 2-layer CustomTransformerDecoder with cross-BEV attention, time-embedding modulation, and per-mode classification + regression heads.

Codebase Structure

navsim/agents/flow_drive_agent/
  flow_model.py       - IL flow model (pretraining)
  flow_agent.py       - Lightning wrapper for IL
  flow_rl_model.py    - RL flow model (fine-tuning attempts)
  flow_rl_agent.py    - Lightning wrapper for RL

navsim/planning/script/config/
  training/flow_rl_mini.yaml     - navmini RL config (fast iteration)
  training/flow_rl_navtrain.yaml - navtrain RL config (full training)
  common/agent/flow_rl_agent.yaml - agent hyperparameters

navsim/planning/training/
  agent_lightning_module.py - RLAgentLightningModule (multi-epoch PPO)

scripts/training/
  run_flow_rl_navtrain.sh  - Launch script

CHANGES.md              - Detailed experiment log with all attempts

IL Pretraining (Stage 1)

File: flow_model.py

Train: t ~ U(0, 0.05), x_t = (1-t)*anchor + t*noise, decoder predicts clean x_0
Inference: single pass at t=0.02 with zero noise input
Achieves 0.94 PDM on navtrain evaluation
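
The training and inference inputs described above can be sketched in a few lines. This is a minimal illustration and not the code in flow_model.py: the function names, the flattened-list anchor, and the standard-normal noise are assumptions made for the example, and the decoder and its regression target are left out.

```python
import random

T_MAX = 0.05  # upper end of the truncated time range quoted above


def training_input(anchor, rng, t_max=T_MAX):
    """Sample t ~ U(0, t_max) and build x_t = (1 - t) * anchor + t * noise."""
    t = rng.uniform(0.0, t_max)
    # Assumption: noise is standard normal, drawn independently per coordinate.
    x_t = [(1.0 - t) * a + t * rng.gauss(0.0, 1.0) for a in anchor]
    return x_t, t


def inference_input(anchor, t=0.02):
    """Inference input with the noise term set to zero: x_t = (1 - t) * anchor."""
    return [(1.0 - t) * a for a in anchor]


rng = random.Random(0)
anchor = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical flattened anchor in [-1, 1]
x_t, t = training_input(anchor, rng)
x_inf = inference_input(anchor)       # equals 0.98 * anchor at t = 0.02
```

Because t never exceeds 0.05, the decoder only ever sees inputs that sit very close to a scaled anchor, which is the truncation the later sections refer to.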

RL Fine-Tuning Attempts (Stage 2) - Negative Result

File: flow_rl_model.py

Approaches Tried

#	Approach	adv_frac_zero (higher = less diversity)	Stability	Outcome
1	Multi-step chain (t=0.15->0), z-inversion PPO	0.60	Unstable	Reward 0.88->0.60
2	Single-step REINFORCE, output noise	0.89	Stable	Zero gradient
3	Input noise diversity	0.87	Stable	No policy movement
4	Two-step chain	0.85	Stable	Zero gradient
5	DPPO multi-step (5 steps, per-step log-prob)	0.60	Unstable	Reward 0.86->0.26
6	Flow-GRPO symmetric adv, sigma=0.15	0.54	Unstable	Reward 0.21 at step 0
7	Multiplicative noise, sigma_logprob=1.0	0.89	Stable	kl_ref grows, no improvement
8	Symmetric adv, sigma=0.04, z-inversion PPO	0.50	Unstable	Reward 0.44->0.01

The Fundamental Problem

There is no exploration noise level that simultaneously:

Creates enough trajectory diversity for PDM to differentiate proposals (needs >3 m displacement)
Keeps proposals within the safe driving corridor (needs <1.5 m displacement)

This is because:

PDM has binary cliff metrics (collision, drivable_area) that create discontinuous reward
The pretrained model at 0.94 is already near-optimal with minimal room to improve
Single-step noise produces position offsets, not shape diversity (PDM responds to curvature/timing changes)
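
The link between identical PDM scores and a vanishing gradient can be made concrete with group-relative advantages. The sketch below assumes that adv_frac_zero is the fraction of proposals whose advantage is zero and that advantages are normalized within each group as in GRPO; neither definition is spelled out in this README, and the reward values are invented purely for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def adv_frac_zero(groups, tol=1e-8):
    """Fraction of proposals whose advantage is numerically zero (assumed definition)."""
    advs = [a for g in groups for a in group_advantages(g)]
    return sum(abs(a) < tol for a in advs) / len(advs)


# Illustrative, made-up rewards (not measured results).
# Small noise: every proposal in a group gets the same PDM score.
small_noise = [[0.94, 0.94, 0.94, 0.94], [0.91, 0.91, 0.91, 0.91]]
# Large noise: scores differ, but mostly because proposals hit a cliff metric.
large_noise = [[0.94, 0.0, 0.0, 0.90], [0.0, 0.0, 0.88, 0.0]]

frac_small = adv_frac_zero(small_noise)  # all advantages vanish
frac_large = adv_frac_zero(large_noise)  # advantages exist, driven by failures
```

In the small-noise case every advantage is zero, so a policy-gradient update has nothing to push on; in the large-noise case the signal exists but mostly says "do not collide", which the pretrained model already knows.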

Why DDV2's DDIM Approach Works

Property	DDV2 (DDIM)	Flow Matching (this project)
Diversity	10-step chain compounds noise -> shape diversity	Single-step offset -> no shape diversity
Train/test gap	None (alpha-schedule consistency)	Fatal (chain damages t=0.02 inference)
Decoder training range	Full schedule (t=0 to 1000)	Truncated (t in [0, 0.05] only)
Exploration mechanism	Per-step noise within DDIM formula	Must be injected externally

DDV2's key structural advantage: the DDIM scheduler's alpha-schedule makes the decoder's job (predict x_0) identical regardless of which timestep pair is used. Training at timesteps [18,16,...,0] and inferring at [10,0] therefore leaves no distribution gap. Flow matching with truncated t cannot replicate this property.
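
The timestep-pair property can be illustrated with the standard deterministic DDIM update (eta = 0), written in terms of an x_0 prediction. This is a generic textbook sketch, not DDV2's sampler: the cumulative-alpha values are arbitrary, the oracle x_0 stands in for the decoder, and the per-step noise DDV2 injects during RL is omitted.

```python
import math


def ddim_step(x_t, x0_hat, abar_t, abar_next):
    """Deterministic DDIM jump between two cumulative-alpha levels (requires abar_t < 1)."""
    # Recover the implied noise from the current sample and the x_0 prediction.
    eps_hat = [(x - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)
               for x, x0 in zip(x_t, x0_hat)]
    # Re-noise the x_0 prediction to the target level with the same noise.
    return [math.sqrt(abar_next) * x0 + math.sqrt(1.0 - abar_next) * e
            for x0, e in zip(x0_hat, eps_hat)]


def noisy(x0, eps, abar):
    """Forward-process sample at cumulative alpha abar."""
    return [math.sqrt(abar) * a + math.sqrt(1.0 - abar) * e for a, e in zip(x0, eps)]


x0 = [0.3, -0.2]   # hypothetical clean trajectory coordinates
eps = [1.0, -0.5]  # hypothetical fixed noise draw

# One jump (0.5 -> 0.9) versus two jumps (0.5 -> 0.7 -> 0.9) with an oracle x_0.
direct = ddim_step(noisy(x0, eps, 0.5), x0, 0.5, 0.9)
two_hop = ddim_step(ddim_step(noisy(x0, eps, 0.5), x0, 0.5, 0.7), x0, 0.7, 0.9)
```

With an accurate x_0 prediction, both routes land on the same sample, so the decoder is asked the same question whichever schedule is used. That is the sense in which changing the timestep list between training and inference costs nothing.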

Conclusion

Single-step flow matching with a truncated noise schedule is structurally incompatible with RL fine-tuning for trajectory planning with PDM rewards. The architecture provides no mechanism for generating reward-differentiable trajectory diversity without one of the following:

Damaging inference behavior (multi-step chains at untrained timesteps)
Producing degenerate proposals (large exploration noise -> collisions)
Providing no gradient signal (small exploration noise -> identical PDM scores)

Viable Paths Forward

Retrain flow model on t ~ U(0, 1) with multi-step ODE inference -> eliminates train/test gap, enables DPPO
Change inference to multi-step (5 deterministic Euler steps) -> matches RL training distribution
Use DDV2's DDIM-based RL (known to work, the reference implementation)
Accept 0.94 as ceiling for this single-step architecture
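
The first two paths both depend on multi-step ODE inference. A minimal sketch of deterministic Euler integration follows, assuming the interpolation from the IL section, x_t = (1 - t) * x_0 + t * noise, so that an x_0 prediction implies the velocity (x_t - x_0) / t. The names euler_sample and predict_x0 are hypothetical, and the oracle predictor below stands in for a decoder retrained on t ~ U(0, 1).

```python
def euler_sample(x_start, predict_x0, t_start=1.0, n_steps=5):
    """Integrate the flow ODE from t_start down to 0 with n_steps Euler steps."""
    x = list(x_start)
    # Evenly spaced times t_start, ..., 0; every step starts at t > 0.
    times = [t_start * (1.0 - k / n_steps) for k in range(n_steps + 1)]
    for t, t_next in zip(times[:-1], times[1:]):
        x0_hat = predict_x0(x, t)
        # Velocity implied by the x_0 prediction: dx/dt = (x_t - x_0) / t.
        v = [(xi - x0i) / t for xi, x0i in zip(x, x0_hat)]
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x


target = [0.2, -0.4, 0.6]        # hypothetical clean trajectory
start = [1.3, 0.7, -0.9]         # hypothetical noise sample at t = 1
oracle = lambda x, t: target     # stand-in for a trained decoder
result = euler_sample(start, oracle)
```

With a perfect predictor the final step lands exactly on the target. With a real decoder each intermediate x is only valid if the model was trained at that t, which is why retraining on the full time range is listed alongside the multi-step inference change.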

Running

# Environment
export NAVSIM_DEVKIT_ROOT=/path/to/navsim
export NAVSIM_EXP_ROOT=/path/to/navsim/exp
export PYTHONPATH=/path/to/flow_drive_rl:$PYTHONPATH

# IL pretraining (flow matching)
python navsim/planning/script/run_training.py --config-name flow_il_mini

# RL fine-tuning (navmini, fast iteration)
python navsim/planning/script/run_training.py --config-name flow_rl_mini

# RL fine-tuning (navtrain, full)
python navsim/planning/script/run_training.py --config-name flow_rl_navtrain

References

DiffusionDriveV2 - DDIM + GRPO baseline
NAVSIM - Benchmark
Flow-GRPO - Flow matching RL theory
DPPO - Diffusion Policy Policy Optimization
