Yiqing Shi, Yiren Song, Mike Zheng Shou
Abstract: We present Edit2Perceive, a unified framework for diverse dense prediction tasks. We demonstrate that image editing diffusion models (specifically FLUX.1 Kontext), rather than text-to-image generators, provide a better inductive bias for deterministic dense perception. Our model achieves state-of-the-art performance across Zero-shot Monocular Depth Estimation, Surface Normal Estimation, and Interactive Matting, supporting efficient single-step deterministic inference.
- Dec 19, 2025: Inference code release, with model weights
- Dec 23, 2025: Training code release
```bash
git clone https://github.com/showlab/Edit2Perceive.git
cd Edit2Perceive

# Create environment (Python 3.12 recommended)
conda create -n e2p python=3.12
conda activate e2p

# Install dependencies
pip install -r requirements.txt
```

Step 1: Download Base Model (FLUX.1-Kontext)

```bash
# If huggingface is not available, use mirror
export HF_ENDPOINT=https://hf-mirror.com

hf download black-forest-labs/FLUX.1-Kontext-dev --exclude "transformer/" --local-dir ./FLUX.1-Kontext-dev
```
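If you prefer to script the download, `huggingface_hub` offers the same filtering. A minimal sketch (the `ignore_patterns` value mirrors the `--exclude "transformer/"` flag above):

```python
# Sketch: fetch FLUX.1-Kontext-dev while skipping the original transformer
# weights, since Edit2Perceive ships its own transformer checkpoints.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="black-forest-labs/FLUX.1-Kontext-dev",
    local_dir="./FLUX.1-Kontext-dev",
    ignore_patterns=["transformer/*"],
)
```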
Step 2: Download Edit2Perceive Weights

Place the models in the `ckpts/` directory.

- Option A: LoRA weights (small size, fast validation)

  ```bash
  hf download Seq2Tri/Edit2Perceive --local-dir ckpts/ --include "*lora.safetensors"
  ```

- Option B: Full model weights (best quality)

  ```bash
  hf download Seq2Tri/Edit2Perceive --local-dir ckpts/ --exclude "*lora.safetensors"
  ```

Required directory structure:
```
Edit2Perceive/
├── ckpts/
│   ├── depth.safetensors
│   ├── depth_lora.safetensors
│   ├── normal.safetensors
│   ├── normal_lora.safetensors
│   ├── matting.safetensors
│   └── matting_lora.safetensors
├── FLUX.1-Kontext-dev/
└── ...
```
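With the weights in place, you can sanity-check that a checkpoint loads before launching anything heavier (a minimal sketch using the `safetensors` API; substitute any file from the tree above):

```python
# Sketch: verify a downloaded Edit2Perceive checkpoint loads cleanly.
from safetensors.torch import load_file

state_dict = load_file("ckpts/depth.safetensors")
n_params = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {n_params / 1e6:.1f}M parameters")
```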
Launch the web demo:

```bash
python app.py  # Visit http://localhost:7860
```

Run inference on images without the UI:

```bash
python inference.py
```
Please download the evaluation datasets from the links below:

- Depth: Evaluation Dataset
- Normal: Evaluation Dataset
- Matting: P3M-10k, AM-2k, AIM-500
Before running evaluation, change the `gt_path` in `utils/eval_multiple_datasets.py` to your dataset path, then run:
```bash
# Depth
python utils/eval_multiple_datasets.py --task depth --state_dict ckpts/depth.safetensors

# Normal
python utils/eval_multiple_datasets.py --task normal --state_dict ckpts/normal.safetensors

# Matting
python utils/eval_multiple_datasets.py --task matting --state_dict ckpts/matting.safetensors
```
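For context, zero-shot depth benchmarks are usually scored after a per-image least-squares scale-and-shift alignment between prediction and ground truth (the standard affine-invariant protocol). A minimal sketch of AbsRel and δ1 under that protocol; this is illustrative, not necessarily the exact code in `utils/eval_multiple_datasets.py`:

```python
import numpy as np

def absrel_delta1(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray):
    """Affine-invariant depth metrics over valid (mask) pixels with gt > 0."""
    p, g = pred[mask], gt[mask]
    # Solve min_{s,t} ||s*p + t - g||^2 for scale s and shift t.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    p = s * p + t
    absrel = float(np.mean(np.abs(p - g) / g))
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    return absrel, delta1
```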
Set up the datasets for your target task.

Datasets for Depth Estimation (Hypersim & Virtual KITTI 2)

- Hypersim Dataset
  - Download:

    ```bash
    python preprocess/depth/download_hypersim.py --contains color.hdf5 depth_meters.hdf5 position.hdf5 render_entity_id.hdf5 normal_cam.hdf5 normal_world.hdf5 --silent
    ```

  - Preprocess (a ray-to-planar depth conversion sketch follows this list):

    ```bash
    python preprocess/depth/preprocess_hypersim.py --dataset_dir /path/to/hypersim --output_dir /path/to/output
    ```

- Virtual KITTI 2 Dataset
  - Download from here.
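Note that Hypersim's `depth_meters.hdf5` stores distance along the camera ray, not planar depth, so preprocessing normally converts it. A minimal sketch of that conversion, assuming Hypersim's usual 1024×768 renders and the commonly cited 886.81-pixel focal length (the repo's `preprocess_hypersim.py` may handle this differently):

```python
import numpy as np

def ray_distance_to_planar_depth(dist: np.ndarray, focal: float = 886.81) -> np.ndarray:
    """Convert per-pixel distance-to-camera (H, W) into planar depth Z."""
    h, w = dist.shape
    u = np.linspace(-0.5 * (w - 1), 0.5 * (w - 1), w)
    v = np.linspace(-0.5 * (h - 1), 0.5 * (h - 1), h)
    uu, vv = np.meshgrid(u, v)
    # Rescale each ray so that its Z component becomes the planar depth.
    return dist * focal / np.sqrt(uu**2 + vv**2 + focal**2)
```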
Datasets for Surface Normal (Hypersim, InteriorVerse & Sintel)

- Hypersim Dataset
  - Download:

    ```bash
    python preprocess/depth/download_hypersim.py --contains color.hdf5 position.hdf5 normal_cam.hdf5 normal_world.hdf5 render_entity_id.hdf5 --silent
    ```

  - Preprocess (see the normal-encoding sketch after this list):

    ```bash
    python preprocess/normal/preprocess_hypersim_normals.py --dataset_dir /path/to/hypersim --output_dir /path/to/output
    ```

- InteriorVerse Dataset
  - Refer to download instructions.
  - Preprocess:

    ```bash
    python preprocess/normal/preprocess_interiorverse_normals.py --dataset_dir /path/to/interiorverse --output_dir /path/to/output
    ```

- Sintel Dataset
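Preprocessed normal maps are commonly stored as RGB images by mapping each unit-vector component from [−1, 1] to [0, 255]. A minimal sketch of that encoding (the repo's scripts may use a different sign or axis convention):

```python
import numpy as np

def encode_normals_as_rgb(normals: np.ndarray) -> np.ndarray:
    """Map unit normals (H, W, 3) in [-1, 1] to uint8 RGB in [0, 255]."""
    n = normals / (np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8)
    return ((n + 1.0) * 0.5 * 255.0).round().astype(np.uint8)
```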
Datasets for Interactive Matting (Comp-1k, Distinctions, AM-2k, COCO)

- Sources: Composition-1k, Distinctions-646, AM-2k, COCO-Matting.
- Special Instructions:
  - Distinctions-646: You must generate the merged dataset yourself, using backgrounds sampled from VOC2012. Refer to `preprocess/matting/preprocess_distinctions_646.py` (a compositing sketch follows this list).
  - COCO-Matting: Download `trimap_alpha` and `train2017.zip`. Manually split the `trimap_alpha` images (concatenated along the width) into single alpha channels (see the splitting sketch below).
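Both special cases come down to a few lines of array manipulation. For Distinctions-646, the merged images follow the standard compositing equation I = αF + (1 − α)B; a minimal sketch (illustrative, not the exact logic of `preprocess_distinctions_646.py`):

```python
import numpy as np

def composite(fg: np.ndarray, bg: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Alpha-composite a foreground over a background: I = a*F + (1-a)*B."""
    a = alpha.astype(np.float32)[..., None] / 255.0  # (H, W, 1) in [0, 1]
    return (a * fg.astype(np.float32) + (1.0 - a) * bg.astype(np.float32)).astype(np.uint8)
```

For COCO-Matting, splitting a width-concatenated image is one slice per part (the number of parts per file is an assumption here; check the downloaded `trimap_alpha` layout):

```python
def split_width_concat(img: np.ndarray, parts: int) -> list:
    """Split an image concatenated along the width into `parts` equal slices."""
    w = img.shape[1] // parts
    return [img[:, i * w:(i + 1) * w] for i in range(parts)]
```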
Update the `--dataset_base_path` in the scripts located in `scripts/*.sh`. Note: the dataset order is strict and must not be changed.

```bash
# Example for Depth (Hypersim + VKITTI2)
--dataset_base_path "/path/to/Hypersim/processed_depth,/path/to/vkitti2"

# Example for Normal (Hypersim + InteriorVerse + Sintel)
--dataset_base_path "/path/to/Hypersim/processed_normal,/path/to/InteriorVerse/processed_normal,/path/to/sintel"

# Example for Matting
--dataset_base_path "/path/to/composition-1k,/path/to/Distinctions-646,/path/to/AM-2k,/path/to/COCO-Matte"
```

Execute the corresponding script for LoRA or full fine-tuning; for more details on training, refer to training_args_instructions.md.
```bash
# Depth Estimation
bash scripts/Kontext_depth_lora.sh
bash scripts/Kontext_depth.sh

# Surface Normal Estimation
bash scripts/Kontext_normal_lora.sh
bash scripts/Kontext_normal.sh

# Interactive Matting
bash scripts/Kontext_matting_lora.sh
bash scripts/Kontext_matting.sh
```

If you find our work useful in your research, please consider citing our paper:
```bibtex
@misc{shi2025edit2perceive,
      title={Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers},
      author={Yiqing Shi and Yiren Song and Mike Zheng Shou},
      year={2025},
      eprint={2511.18673},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.18673},
}
```

If you have any questions, please feel free to contact Yiqing Shi at [email protected].
