A BEVFormer reimplementation in pure PyTorch for camera-only 3D object detection on nuScenes, with no MMDetection or MMCV dependencies. Written primarily as an educational reference. (Although it is named "tiny" and intended for learning, adding a custom CUDA kernel for deformable attention should make this repo good enough for real use, and probably a bit faster than the original implementation. If you want to load the pretrained weights from the original repo, however, that is likely easier to do with the original repo.)
This repo includes the core model architecture, a nuScenes data pipeline, PyTorch Lightning training, nuScenes metric integration, and unit tests for the main components. Temporal self-attention and spatial cross-attention are written cleanly with einops, favoring readability. To dig into a single component, use its unit test to exercise that module in isolation.
Key differences from the original implementation:
- Reference point pre-computation: The 2D and 3D reference point calculations, which the original computes inside the model's forward pass, are moved outside it since they are constant across steps. Only the lidar2img projection is kept in the forward path.
- Single decoder computation graph: The original implementation built a redundant computation graph for the regression head (once for iterative reference point updates and once for final outputs). This has been reduced to a single pass.
- Yaw convention: The original saves yaw in the SECOND coordinate system. Here, yaw is stored directly in the nuScenes coordinate system, which slightly changes the ego-motion shift calculation in the temporal self-attention.
- Regression head: In the original implementation, the regression head output has cx, cy, cz converted to metric units. Here the output is kept in the normalized [0, 1] space, and the ground truth is normalized before the loss calculation.
- Readability: Most complex tensor operations have been rewritten using einops for clarity.
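The regression head difference listed above can be illustrated with a small sketch. It is not taken from this repo: the function names and the `PC_RANGE` values are assumptions chosen for illustration, and NumPy is used only to keep the example dependency-light. The idea is to map ground-truth box centers into the same [0, 1] space that the head predicts in before computing the loss, and to map predictions back to metres only when metric boxes are needed.

```python
import numpy as np

# Assumed detection range (x_min, y_min, z_min, x_max, y_max, z_max) in metres.
# Illustrative values only; use the range defined in your own config.
PC_RANGE = (-51.2, -51.2, -5.0, 51.2, 51.2, 3.0)


def normalize_centers(centers, pc_range=PC_RANGE):
    """Map metric (cx, cy, cz) rows of shape (N, 3) into [0, 1] per axis."""
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    return (np.asarray(centers, dtype=np.float64) - lo) / (hi - lo)


def denormalize_centers(centers_norm, pc_range=PC_RANGE):
    """Inverse of normalize_centers: map [0, 1] values back to metres."""
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    return np.asarray(centers_norm, dtype=np.float64) * (hi - lo) + lo


# Two hypothetical ground-truth centers: one at the middle of the range,
# one on its boundary.
gt_centers = np.array([[0.0, 0.0, -1.0], [51.2, -51.2, 3.0]])
gt_norm = normalize_centers(gt_centers)      # targets for the loss
gt_back = denormalize_centers(gt_norm)       # metric values for evaluation
```

Keeping predictions and targets in the same bounded space means the loss sees values of similar magnitude on every axis; the inverse mapping is only needed when exporting boxes for evaluation.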
Known limitation: The original implementation has a good image augmentation pipeline, including very useful transforms such as GridMask; none of it is available here yet. Image resizing is also not supported, because the corresponding intrinsic matrix scaling transform has not been implemented. (This is a small gap: a simple scale transform would suffice.)
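As a rough sketch of the missing scale transform, the snippet below rescales a 3x3 camera intrinsic matrix when an image is resized. It is not part of this repo: `scale_intrinsics` and `scale_lidar2img` are hypothetical helper names, the camera values are made up, and the 4x4 shape assumed for the lidar2img matrix should be checked against the actual data pipeline.

```python
import numpy as np


def scale_intrinsics(K, src_hw, dst_hw):
    """Return the 3x3 intrinsics for an image resized from src_hw to dst_hw.

    Resizing multiplies pixel coordinates by (sx, sy), so the first row of K
    (fx, skew, cx) scales by sx and the second row (fy, cy) scales by sy.
    """
    src_h, src_w = src_hw
    dst_h, dst_w = dst_hw
    S = np.diag([dst_w / src_w, dst_h / src_h, 1.0])
    return S @ np.asarray(K, dtype=np.float64)


def scale_lidar2img(lidar2img, src_hw, dst_hw):
    """Apply the same pixel scaling to a projection matrix assumed to be 4x4."""
    src_h, src_w = src_hw
    dst_h, dst_w = dst_hw
    S = np.eye(4)
    S[0, 0] = dst_w / src_w
    S[1, 1] = dst_h / src_h
    return S @ np.asarray(lidar2img, dtype=np.float64)


# Hypothetical camera: a 900x1600 image downscaled by half to 450x800.
K = np.array([[1200.0, 0.0, 800.0],
              [0.0, 1200.0, 450.0],
              [0.0, 0.0, 1.0]])
K_half = scale_intrinsics(K, src_hw=(900, 1600), dst_hw=(450, 800))
```

This ignores the half-pixel offset that some resize implementations introduce, which is usually negligible at these resolutions.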
pip install -r requirements.txt
pip install -e .
Use the helper script:
bash scripts/prepare_data.sh
This runs:
python tools/data_converter/create_data.py nuscenes \
--root-path data/nuscenes \
--canbus data/can_bus \
--version v1.0-mini \
--out-dir data
The default tiny config expects temporal nuScenes info files under data/nuscenes/.
Use the provided script:
bash scripts/train_tiny.sh
Or run the trainer directly:
python tools/train.py --config configs/bevformer_tiny.yaml
The training script also supports checkpoint resume and dot-notation config overrides:
python tools/train.py --config configs/bevformer_tiny.yaml \
--resume work_dirs/nuscenes_mini/checkpoints/last.ckpt
python tools/train.py --config configs/bevformer_tiny.yaml \
train.batch_size=2 optimizer.lr=1e-4 data.load_interval=5
Outputs are written under work_dirs/nuscenes_mini/, including checkpoints, TensorBoard logs, and nuScenes eval artifacts. (The exact location depends on the work_dir set in your config; the path above is for the current demo YAML.)
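For readers unfamiliar with dot-notation overrides such as `train.batch_size=2`, here is a minimal sketch of how they can be applied to a nested config dictionary. It is not the parser used by tools/train.py, whose internals are not shown here: `parse_value` and `apply_overrides` are hypothetical names, and the choice of `ast.literal_eval` for type inference is an assumption.

```python
import ast


def parse_value(text):
    """Interpret '2' as int, '1e-4' as float, and leave other text as a string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a nested dict, modifying it in place."""
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config


# Hypothetical base config, overridden with the values from the command above.
cfg = {"train": {"batch_size": 1}, "optimizer": {"lr": 1e-3}}
apply_overrides(cfg, ["train.batch_size=2", "optimizer.lr=1e-4", "data.load_interval=5"])
```

Values that are not valid Python literals, such as file paths or `v1.0-trainval`, fall through unchanged as strings.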
Step 1 — Generate PKL files (one-time):
python tools/data_converter/create_data.py nuscenes \
--root-path /path/to/nuscenes \
--canbus /path/to/nuscenes_full \
--version v1.0 \
--out-dir /path/to/nuscenes_full
Step 2 — Train on full dataset (no file edits, overrides passed on CLI):
python tools/train.py \
"data.root=/path/to/nuscenes" \
"data.train_ann=/path/to/nuscenes_full/nuscenes_infos_temporal_train.pkl" \
"data.val_ann=/path/to/nuscenes_full/nuscenes_infos_temporal_val.pkl" \
"data.version=v1.0-trainval" \
"data.eval_set=val"
All thanks to the very neatly written BEVFormer paper and its implementation.