This is the official PyTorch implementation of the ACM MM 2024 paper "LoMOE: Localized Multi-Object Editing via Multi-Diffusion". All the published data is available on our project page.
This code was tested with python=3.9, pytorch=2.0.1, and torchvision=0.15.2. Please follow the official PyTorch instructions to install the PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended.
Create a conda environment with the following dependencies:
conda create -n lomoe python=3.9
conda activate lomoe
conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install accelerate==0.20.3 diffusers==0.12.1 einops==0.7.0 ipython transformers==4.26.1 salesforce-lavis==1.0.2
Start by downloading the SOE and MOE datasets from our project page to ./benchmark/data.
To generate the prompt, the inverted latent, and the intermediate latents for an image, first run the inversion script located at ./lomoe/invert/inversion.py. Then, to apply edits, use ./lomoe/edit/main.py. A sample image and corresponding masks for single- and multi-object edit operations are provided in ./lomoe/sample/.
The invert/inversion.py script takes the following arguments:
--input_image: Path to the input image.
--results_folder: Path to store the prompt, the inverted latent, and the intermediate latents.
CUDA_VISIBLE_DEVICES=0 python invert/inversion.py \
--input_image "sample/single/init_image.jpg" \
--results_folder "invert/output/single"
CUDA_VISIBLE_DEVICES=0 python invert/inversion.py \
--input_image "sample/multi/init_image.png" \
--results_folder "invert/output/multi"
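The edit commands below consume the files that inversion.py writes under --results_folder. As a reference, here is a small sketch of that output layout, with the path names taken from the commands in this README (the helper name and return shape are our own, not part of the codebase):

```python
from pathlib import Path

def inversion_outputs(results_folder: str, image_stem: str) -> dict:
    """Files inversion.py is expected to produce, based on the paths
    consumed by edit/main.py in this README. Illustrative helper only."""
    root = Path(results_folder)
    return {
        "prompt": root / "prompt" / f"{image_stem}.txt",         # used for --bg_prompt / --bg_negative
        "latent": root / "inversion" / f"{image_stem}.pt",       # used for --latent
        "latent_list": root / "latentlist" / f"{image_stem}.pt", # used for --latent_list
    }

outs = inversion_outputs("invert/output/single", "init_image")
print(outs["prompt"])  # invert/output/single/prompt/init_image.txt
```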
The edit/main.py script takes the following arguments:
--mask_paths: Paths to the object masks.
--num_fgmasks: Number of foreground masks (defaults to 1).
--bg_prompt: Path to the background prompt (we use the prompt generated by inversion.py).
--bg_negative: Path to the background negative prompt (we use the prompt generated by inversion.py).
--fg_prompts: Edit prompts corresponding to the masks.
--fg_negative: The foreground negative prompt (we use "artifacts, blurry, smooth texture, bad quality, distortions, unrealistic, distorted image").
--W: Output image width.
--H: Output image height.
--seed: The seed to initialize the random number generators (defaults to 0).
--sd_version: The Stable Diffusion version to be used (use the same as in inversion.py).
--steps: The number of diffusion timesteps (use the same as in inversion.py).
--ca_coef: Cross-attention preservation loss coefficient (defaults to 1.0).
--seg_coef: Background loss coefficient (defaults to 1.75).
--bootstrapping: Value of the bootstrapping parameter (defaults to 20).
--latent: Path to the inverted latent produced by inversion.py.
--latent_list: Path to the latent list produced by inversion.py.
--rec_path: Path to save the reconstructed input image.
--edit_path: Path to save the edited image.
--save_path: Path to save the merged reconstructed and edited image.
CUDA_VISIBLE_DEVICES=0 python edit/main.py \
--mask_paths "sample/single/mask_1.jpg" \
--bg_prompt "invert/output/single/prompt/init_image.txt" \
--bg_negative "invert/output/single/prompt/init_image.txt" \
--fg_negative "artifacts, blurry, smooth texture, bad quality, distortions, unrealistic, distorted image" \
--H 512 \
--W 512 \
--bootstrapping 20 \
--latent 'invert/output/single/inversion/init_image.pt' \
--latent_list 'invert/output/single/latentlist/init_image.pt' \
--rec_path 'results/single/1_reconstruction.png' \
--edit_path 'results/single/2_edit.png' \
--fg_prompts "a red dog collar" \
--seed 1234 \
--save_path 'results/single/3_merged.png'
CUDA_VISIBLE_DEVICES=0 python edit/main.py \
--mask_paths "sample/multi/mask_1.png" "sample/multi/mask_2.png" \
--bg_prompt "invert/output/multi/prompt/init_image.txt" \
--bg_negative "invert/output/multi/prompt/init_image.txt" \
--fg_negative "artifacts, blurry, smooth texture, bad quality, distortions, unrealistic, distorted image" "artifacts, blurry, smooth texture, bad quality, distortions, unrealistic, distorted image" \
--H 512 \
--W 512 \
--bootstrapping 20 \
--latent 'invert/output/multi/inversion/init_image.pt' \
--latent_list 'invert/output/multi/latentlist/init_image.pt' \
--rec_path 'results/multi/1_reconstruction.png' \
--edit_path 'results/multi/2_edit.png' \
--fg_prompts "a crochet bird" "an origami bird" \
--num_fgmasks 2 \
--seed 1234 \
--save_path 'results/multi/3_merged.png'
To compute the classical and neural metrics, use compute_metrics.py in ./benchmark/metrics/{SOE/MOE}. This covers the SRC and TGT CLIP scores, BG LPIPS, BG PSNR, BG MSE, BG SSIM, and the Structural Distance. The compute_aesthetic.py script in ./benchmark/metrics/{SOE/MOE} computes the aesthetic metrics, including HPS, IR, and the Aesthetic Score. This script also requires additional dependencies, namely HPSv2 and ImageReward.
NOTE: The compute_metrics.py and compute_aesthetic.py scripts expect a folder containing edits for all images in the dataset. Please modify the code to run them on a smaller subset or single images.
CUDA_VISIBLE_DEVICES=0 python compute_metrics.py --folder_name PATH_TO_SAVED_EDITS
CUDA_VISIBLE_DEVICES=0 python compute_aesthetic.py --folder_name PATH_TO_SAVED_EDITS
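The BG metrics compare the edited image to the source image over background pixels only, i.e. outside the edited (masked) region. As a minimal pure-Python sketch of that idea for BG MSE and BG PSNR (a simplified stand-in written for illustration, not the actual code in compute_metrics.py):

```python
import math

def bg_mse_psnr(src, edit, mask, max_val=255.0):
    """Masked MSE and PSNR over background pixels only.
    mask == 1 marks the edited foreground, which is excluded.
    Simplified stand-in for the BG metrics in compute_metrics.py."""
    se, n = 0.0, 0
    for s_row, e_row, m_row in zip(src, edit, mask):
        for s, e, m in zip(s_row, e_row, m_row):
            if m == 0:  # background pixel: include in the error sum
                se += (s - e) ** 2
                n += 1
    mse = se / n
    psnr = float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)
    return mse, psnr

# toy 2x2 grayscale example: the edited pixel is masked out of the metric
src  = [[10, 10], [10, 10]]
edit = [[10, 26], [10, 200]]  # pixel (1,1) was edited
mask = [[0, 0], [0, 1]]       # exclude the edited pixel
mse, psnr = bg_mse_psnr(src, edit, mask)
```

A perfectly preserved background gives MSE 0 (infinite PSNR); residual background drift from the diffusion process shows up as a finite PSNR.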
If you use LoMOE or find this work useful for your research, please cite it using the following BibTeX entry.
@InProceedings{Chakrabarty_2024_ACMMM,
author = {Chakrabarty$^*$, Goirik and Chandrasekar$^*$, Aditya and Hebbalaguppe, Ramya and Prathosh, AP},
title = {LoMOE: Localized Multi-Object Editing via Multi-Diffusion},
booktitle = {ACM Multimedia 2024},
month = {October},
year = {2024}
}
