akashprakas/flow_drive_rl

Flow Matching RL for Trajectory Planning (Negative Result)

An ablation study on DiffusionDriveV2 that replaces its DDIM diffusion sampler with a rectified flow matching model and attempts RL fine-tuning via GRPO/PPO. Built on the NAVSIM benchmark.


Summary

DiffusionDriveV2 achieves 91.2 PDMS using DDIM + GRPO-style RL fine-tuning. This project replaces the DDIM denoiser with a rectified flow matching ODE and attempts the same RL fine-tuning. After extensive experimentation (20+ configurations across multi-step chains, single-step REINFORCE, PPO, DPPO, and Flow-GRPO approaches), we conclude that truncated flow matching (t_max=0.05) with single-step inference is structurally incompatible with RL fine-tuning.

The pretrained IL flow model achieves a PDM score of 0.94 on navtrain evaluation but resists every RL improvement attempt, because of an unresolvable tension between exploration noise scale and proposal quality.


Architecture

4 cameras + LiDAR
      |
  TransFuser Backbone (ResNet-34 + BEV) [frozen during RL]
      |
  BEV Features [B, 256, H, W]
      |
  Flow Trajectory Head [trainable]
      |
  20 K-Means Anchors x G groups = 80 proposals
      |
  Rectified Flow: single pass at t=0.02
      |
  PDM Scorer -> best trajectory

The trajectory head operates in normalized [-1, 1] space and uses 20 pre-computed anchor trajectories. At inference, a single decoder pass at t=0.02 with input x_t = 0.98 * anchor produces 20 trajectory predictions. The decoder is a 2-layer CustomTransformerDecoder with cross-BEV attention, time-embedding modulation, and per-mode classification + regression heads.


Codebase Structure

navsim/agents/flow_drive_agent/
  flow_model.py       - IL flow model (pretraining)
  flow_agent.py       - Lightning wrapper for IL
  flow_rl_model.py    - RL flow model (fine-tuning attempts)
  flow_rl_agent.py    - Lightning wrapper for RL

navsim/planning/script/config/
  training/flow_rl_mini.yaml     - navmini RL config (fast iteration)
  training/flow_rl_navtrain.yaml - navtrain RL config (full training)
  common/agent/flow_rl_agent.yaml - agent hyperparameters

navsim/planning/training/
  agent_lightning_module.py - RLAgentLightningModule (multi-epoch PPO)

scripts/training/
  run_flow_rl_navtrain.sh  - Launch script

CHANGES.md              - Detailed experiment log with all attempts

IL Pretraining (Stage 1)

File: flow_model.py

  • Train: t ~ U(0, 0.05), x_t = (1-t)*anchor + t*noise, decoder predicts clean x_0
  • Inference: single pass at t=0.02 with zero noise input
  • Achieves 0.94 PDM on navtrain evaluation
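
The training and inference inputs described above can be sketched without any framework. This is a minimal illustration, not the code in flow_model.py: the function names, the list-of-(x, y) trajectory layout, and the choice of Gaussian noise are assumptions. Only the interpolation formula, the range t ~ U(0, 0.05), and the inference time t=0.02 come from the description above.

```python
import random

T_MAX = 0.05    # upper bound of the truncated training range (from the text)
T_INFER = 0.02  # fixed inference time (from the text)


def interpolate(anchor, noise, t):
    """x_t = (1 - t) * anchor + t * noise, applied per waypoint coordinate."""
    return [
        ((1.0 - t) * ax + t * nx, (1.0 - t) * ay + t * ny)
        for (ax, ay), (nx, ny) in zip(anchor, noise)
    ]


def make_training_input(anchor, rng):
    """Sample t ~ U(0, T_MAX) and Gaussian noise, then build x_t (hypothetical helper)."""
    t = rng.uniform(0.0, T_MAX)
    noise = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in anchor]
    return t, interpolate(anchor, noise, t)


def make_inference_input(anchor):
    """Zero-noise input at t = T_INFER, i.e. x_t = 0.98 * anchor."""
    zero_noise = [(0.0, 0.0) for _ in anchor]
    return interpolate(anchor, zero_noise, T_INFER)


# Toy anchor in normalized [-1, 1] space: a straight path with 4 waypoints.
anchor = [(0.1 * i, 0.0) for i in range(1, 5)]
t, x_t = make_training_input(anchor, random.Random(0))
x_infer = make_inference_input(anchor)
```

Because t never exceeds 0.05, every training input stays within a few percent of its anchor, which is why the decoder only ever sees lightly perturbed anchors.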

RL Fine-Tuning Attempts (Stage 2) - Negative Result

File: flow_rl_model.py

Approaches Tried

| # | Approach                                      | Diversity (adv_frac_zero) | Stability | Outcome                      |
|---|-----------------------------------------------|---------------------------|-----------|------------------------------|
| 1 | Multi-step chain (t=0.15->0), z-inversion PPO | 0.60                      | Unstable  | Reward 0.88->0.60            |
| 2 | Single-step REINFORCE, output noise           | 0.89                      | Stable    | Zero gradient                |
| 3 | Input noise diversity                         | 0.87                      | Stable    | No policy movement           |
| 4 | Two-step chain                                | 0.85                      | Stable    | Zero gradient                |
| 5 | DPPO multi-step (5 steps, per-step log-prob)  | 0.60                      | Unstable  | Reward 0.86->0.26            |
| 6 | Flow-GRPO symmetric adv, sigma=0.15           | 0.54                      | Unstable  | Reward 0.21 at step 0        |
| 7 | Multiplicative noise, sigma_logprob=1.0       | 0.89                      | Stable    | kl_ref grows, no improvement |
| 8 | Symmetric adv, sigma=0.04, z-inversion PPO    | 0.50                      | Unstable  | Reward 0.44->0.01            |

The Fundamental Problem

There is no exploration noise level that simultaneously:

  1. Creates enough trajectory diversity for PDM to differentiate proposals (needs >3m displacement)
  2. Keeps proposals within the safe driving corridor (needs <1.5m displacement)

This is because:

  • PDM has binary cliff metrics (collision, drivable_area) that create discontinuous reward
  • The pretrained model at 0.94 is already near-optimal with minimal room to improve
  • Single-step noise produces position offsets, not shape diversity (PDM responds to curvature/timing changes)
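
The link between identical PDM scores and a vanishing learning signal can be made concrete. The sketch below assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation) and assumes that adv_frac_zero means the fraction of proposals whose advantage is zero. Neither definition is taken from flow_rl_model.py, and the reward values are invented for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def frac_zero(advantages, tol=1e-8):
    """Fraction of advantages that are numerically zero (assumed meaning of adv_frac_zero)."""
    return sum(abs(a) <= tol for a in advantages) / len(advantages)


# Small noise: every proposal stays in the corridor and is scored identically.
identical = group_advantages([0.94, 0.94, 0.94, 0.94])

# Large noise: some proposals fall off a binary cliff (e.g. a collision) and score 0.
cliff = group_advantages([0.94, 0.0, 0.94, 0.0])
```

In the first group every advantage is zero, so a policy-gradient update has nothing to push on. In the second group the advantages are nonzero, but only because half the proposals are degenerate.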

Why DDV2's DDIM Approach Works

| Property               | DDV2 (DDIM)                                      | Flow Matching (this project)             |
|------------------------|--------------------------------------------------|------------------------------------------|
| Diversity              | 10-step chain compounds noise -> shape diversity | Single-step offset -> no shape diversity |
| Train/test gap         | None (alpha-schedule consistency)                | Fatal (chain damages t=0.02 inference)   |
| Decoder training range | Full schedule (t=0 to 1000)                      | Truncated (t in [0, 0.05] only)          |
| Exploration mechanism  | Per-step noise within DDIM formula               | Must be injected externally              |

DDV2's key structural advantage is that the DDIM scheduler's alpha-schedule makes the decoder's job (predict x_0) identical regardless of which timestep pair is used. As a result, training at timesteps [18,16,...,0] and inferring at [10,0] leaves no distribution gap. Flow matching with truncated t cannot replicate this property.
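
The consistency property can be checked numerically with the standard DDIM update written in x_0-prediction form. This is a scalar sketch of the published DDIM formula, not DDV2's scheduler code: the alpha values are placeholders, and an oracle that returns the true x_0 stands in for the decoder.

```python
import math


def ddim_step(x_t, x0_hat, alpha_t, alpha_prev, sigma=0.0, noise=0.0):
    """One DDIM update from cumulative alpha_t to alpha_prev using a predicted x_0.

    sigma = 0 gives the deterministic sampler; sigma > 0 injects per-step noise.
    """
    # Noise implied by the current sample and the x_0 prediction.
    eps_hat = (x_t - math.sqrt(alpha_t) * x0_hat) / math.sqrt(1.0 - alpha_t)
    direction = math.sqrt(max(1.0 - alpha_prev - sigma**2, 0.0)) * eps_hat
    return math.sqrt(alpha_prev) * x0_hat + direction + sigma * noise


def forward_sample(x0, eps, alpha):
    """Closed-form forward process: x_t = sqrt(alpha) * x_0 + sqrt(1 - alpha) * eps."""
    return math.sqrt(alpha) * x0 + math.sqrt(1.0 - alpha) * eps


x0, eps = 0.3, -1.2        # toy clean value and noise draw
alphas = [0.5, 0.7, 0.9]   # placeholder cumulative alphas, noisiest first

x_start = forward_sample(x0, eps, alphas[0])
# Two-hop path (0.5 -> 0.7 -> 0.9) versus a direct jump (0.5 -> 0.9).
x_mid = ddim_step(x_start, x0, alphas[0], alphas[1])
x_two_hop = ddim_step(x_mid, x0, alphas[1], alphas[2])
x_direct = ddim_step(x_start, x0, alphas[0], alphas[2])
```

With an exact x_0 prediction and sigma = 0, both paths land on the same point of the forward process. That is the sense in which the choice of timestep pairs does not change what the decoder has to predict; a nonzero sigma is where per-step exploration noise would enter.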


Conclusion

Single-step flow matching with a truncated noise schedule is structurally incompatible with RL fine-tuning for trajectory planning with PDM rewards. The architecture provides no mechanism for generating reward-differentiable trajectory diversity without one of the following:

  • Damaging inference behavior (multi-step chains at untrained timesteps)
  • Producing degenerate proposals (large exploration noise -> collisions)
  • Providing no gradient signal (small exploration noise -> identical PDM scores)

Viable Paths Forward

  1. Retrain flow model on t ~ U(0, 1) with multi-step ODE inference -> eliminates train/test gap, enables DPPO
  2. Change inference to multi-step (5 deterministic Euler steps) -> matches RL training distribution
  3. Use DDV2's DDIM-based RL (known to work, the reference implementation)
  4. Accept 0.94 as ceiling for this single-step architecture
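
Paths 1 and 2 both rely on multi-step Euler integration of the flow ODE. Under the Stage 1 interpolation x_t = (1-t)*x_0 + t*noise, a decoder that predicts x_0 implies the velocity (x_t - x0_hat) / t. The sketch below is a scalar illustration with a hypothetical oracle predictor standing in for the decoder. The step count of 5 comes from path 2; the function name, the uniform time grid, and the start at t=1 are assumptions.

```python
def euler_sample(x_start, predict_x0, num_steps=5, t_start=1.0):
    """Integrate from t_start down to 0 with uniform deterministic Euler steps.

    predict_x0(x_t, t) stands in for the decoder, which predicts the clean x_0.
    """
    x = x_start
    times = [t_start * (1.0 - k / num_steps) for k in range(num_steps + 1)]
    for t, s in zip(times[:-1], times[1:]):
        x0_hat = predict_x0(x, t)
        velocity = (x - x0_hat) / t   # dx/dt for x_t = (1 - t) * x_0 + t * noise
        x = x + (s - t) * velocity    # s < t, so this moves toward the data end
    return x


# Oracle predictor: always returns the true clean value (hypothetical stand-in).
true_x0, noise = 0.25, 1.5
x_noise_end = (1.0 - 1.0) * true_x0 + 1.0 * noise   # x_t at t = 1 is pure noise
result = euler_sample(x_noise_end, lambda x_t, t: true_x0)
```

With a perfect predictor the chain recovers x_0 for any step count. A learned decoder only behaves this way at timesteps it was trained on, which is why path 1 pairs multi-step inference with retraining on t ~ U(0, 1).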

Running

# Environment
export NAVSIM_DEVKIT_ROOT=/path/to/navsim
export NAVSIM_EXP_ROOT=/path/to/navsim/exp
export PYTHONPATH=/path/to/flow_drive_rl:$PYTHONPATH

# IL pretraining (flow matching)
python navsim/planning/script/run_training.py --config-name flow_il_mini

# RL fine-tuning (navmini, fast iteration)
python navsim/planning/script/run_training.py --config-name flow_rl_mini

# RL fine-tuning (navtrain, full)
python navsim/planning/script/run_training.py --config-name flow_rl_navtrain
