This repository contains the official PyTorch implementation of our paper "Unity in Diversity: Video Editing via Gradient-Latent Purification". Our method enables precise video editing by leveraging gradient-based latent purification techniques to achieve consistent and high-quality results across diverse video content.
- Gradient-Latent Purification: Novel approach for consistent video editing
- High-Quality Results: Maintains temporal coherence across frames
- Flexible Configuration: Easy-to-use YAML configuration system
- Multiple Input Formats: Support for various video and image formats
- GPU Accelerated: Optimized for NVIDIA GPUs with CUDA support
- Python 3.9.19
- CUDA-compatible NVIDIA GPU
- CUDA 12.1 or compatible version
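Before installing, you can confirm that the GPU driver and CUDA toolkit are visible with two standard commands:

```bash
nvidia-smi      # driver and GPU visibility
nvcc --version  # CUDA toolkit version
```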
- Create and activate a conda environment:

  ```bash
  conda create -n ulg python=3.9.19
  conda activate ulg
  ```
- Install PyTorch with CUDA support:

  ```bash
  pip install torch==2.2.2+cu121 torchvision --extra-index-url https://download.pytorch.org/whl/cu121
  ```
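  To verify that the installed build can see your GPU:

  ```bash
  python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
  ```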
- Install the remaining dependencies:

  ```bash
  pip install -r requirements.txt
  ```
If you're in mainland China, use the Hugging Face mirror:
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
Run video editing with the default settings:

```bash
python inference.py
```
For users in mainland China, route Hugging Face downloads through a mirror:

```bash
HF_ENDPOINT=https://hf-mirror.com python inference.py
```
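To keep the mirror setting for the whole shell session instead of prefixing every command, export it once:

```bash
export HF_ENDPOINT=https://hf-mirror.com
python inference.py
```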
Use a specific configuration file:

```bash
python inference.py --config configs/your_config.yaml
```
- Video Input: Place your input video in the `data/` directory
- Configuration: Modify the configuration file in the `configs/` directory
- DDIM Inversion: Run preprocessing if needed (a frame-extraction sketch follows this list):

  ```bash
  bash inversion.sh
  ```
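The `ddim_inversion_png.yaml` name suggests the inversion step consumes PNG frames. If your source is a video file, here is a minimal frame-extraction sketch using OpenCV; the `data/frames/` output directory and zero-padded naming are illustrative assumptions, not the repository's required layout:

```python
import os

import cv2  # pip install opencv-python


def extract_frames(video_path: str, out_dir: str, n_frames: int = 8) -> int:
    """Save up to n_frames frames of a video as zero-padded PNGs; return the count."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved = 0
    while saved < n_frames:
        ok, frame = cap.read()
        if not ok:  # video ended before n_frames were read
            break
        cv2.imwrite(os.path.join(out_dir, f"{saved:05d}.png"), frame)
        saved += 1
    cap.release()
    return saved


extract_frames("data/your_video.mp4", "data/frames")
```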
- `configs/ddim_inversion_png.yaml`: Configuration for DDIM inversion
- `configs/dog_robotic.yaml`: Example configuration for the dog-to-robot transformation
- `n_frames`: Number of frames to process
- `image_size`: Target image resolution [height, width]
- `n_steps`: Number of diffusion steps
- `cfg_txt`: Text guidance scale
- `cfg_img`: Image guidance scale
Run the provided example:

```bash
python inference.py --config configs/dog_robotic.yaml
```

Or pass custom arguments directly:

```bash
python inference.py \
    --video_path "data/your_video.mp4" \
    --prompt "your editing prompt" \
    --output_dir "results/your_result"
```
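To edit several clips in one pass, you can wrap the CLI in a small driver script; the video paths and prompts below are placeholders:

```python
import subprocess

# Placeholder (video, prompt) pairs; replace with your own data.
jobs = [
    ("data/dog.mp4", "a robotic dog"),
    ("data/cat.mp4", "a watercolor cat"),
]

for video, prompt in jobs:
    out_dir = "results/" + prompt.replace(" ", "_")
    subprocess.run(
        ["python", "inference.py",
         "--video_path", video,
         "--prompt", prompt,
         "--output_dir", out_dir],
        check=True,  # raise if an edit run fails
    )
```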
```
├── configs/              # Configuration files
│   ├── ddim_inversion_png.yaml
│   └── dog_robotic.yaml
├── data/                 # Input data directory
├── ddim_inversion/       # DDIM inversion results
├── dds/                  # DDS-related modules
├── results/              # Output results
├── unity_pipeline/       # Core pipeline implementation
│   ├── pipelines/
│   └── utils/
├── inference.py          # Main inference script
├── preprocess.py         # Data preprocessing
├── utils.py              # Utility functions
└── requirements.txt      # Dependencies
```
The configuration system uses YAML files. Key settings include:
```yaml
# General settings
seed: 8888
device: "cuda:0"

# Data settings
image_size: [512, 512]
n_frames: 8

# DDIM inversion settings
inverse_config:
  n_steps: 100
  cfg_txt: 1.0
  cfg_img: 1.0
```
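As a quick illustration (the project may load configs differently, e.g. via OmegaConf), such a file can be read with PyYAML:

```python
import yaml  # pip install pyyaml

with open("configs/dog_robotic.yaml") as f:
    cfg = yaml.safe_load(f)

# Nested YAML keys map directly to dictionaries.
print(cfg["image_size"], cfg["n_frames"])
print(cfg["inverse_config"]["n_steps"])
```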
- CUDA out of memory: Reduce `image_size` or `n_frames`
- Missing dependencies: Ensure all packages in `requirements.txt` are installed
- Slow download: Use mirror sources if you're in mainland China
- Use smaller image sizes for faster processing
- Adjust `n_steps` based on quality requirements
- Enable mixed precision if supported (see the sketch below)
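For reference, mixed precision in PyTorch generally means running the forward pass under `torch.autocast`. This standalone snippet is only illustrative, not the repository's actual code path:

```python
import torch

# Autocast runs matmul-heavy ops in float16 while keeping
# precision-sensitive ops in float32.
model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(8, 512, device="cuda")

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)

print(y.dtype)  # torch.float16
```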
If you find this work useful in your research, please consider citing:
```bibtex
@inproceedings{gao2025unity,
  title={Unity in Diversity: Video Editing via Gradient-Latent Purification},
  author={Gao, Junyu and Yang, Kunlin and Yao, Xuan and Hu, Yufan},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={23401--23411},
  year={2025}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please feel free to submit a Pull Request.
For questions or issues, please:
- Open an issue on GitHub
- Contact us at [[email protected]]
Note: This code is released for research purposes. Please ensure proper attribution when using this work.