- Preparation: Identify the task type (single task or multi-task composition) of your creative idea, and prepare all the required materials (images, videos, prompts, etc.)
- Preprocessing: Select the appropriate preprocessing method based on the task name, then preprocess your materials to meet the model's input requirements.
- Inference: Based on the preprocessed materials, perform VACE inference to obtain results.
VACE, as a unified video generation solution, simultaneously supports Video Generation, Video Editing, and complex composition tasks. Specifically:
- Video Generation: No video input. Concepts are injected into the model through semantic understanding of text and reference materials, covering T2V (Text-to-Video Generation) and R2V (Reference-to-Video Generation) tasks.
- Video Editing: With video input. The input video is modified at the pixel level, globally or locally, covering V2V (Video-to-Video Editing) and MV2V (Masked Video-to-Video Editing).
- Composition Task: Two or more of the single tasks above are composed into a complex composition task, such as Reference Anything (Face R2V + Object R2V), Move Anything (Frame R2V + Layout V2V), Animate Anything (R2V + Pose V2V), Swap Anything (R2V + Inpainting MV2V), and Expand Anything (Object R2V + Frame R2V + Outpainting MV2V).
Single tasks and compositional tasks are illustrated in the diagram below:
- Super-high-resolution videos will be resized to an appropriate spatial size.
- Overly long videos will be trimmed or uniformly sampled down to around 5 seconds.
- For users who need long video generation, we recommend generating 5-second clips one by one, using the `firstclip` video extension task to maintain temporal consistency.
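As a rough illustration of the trimming rule above, uniform temporal sampling picks evenly spaced frames to fit a fixed frame budget. The exact logic lives in the repository's preprocessing code; the frame counts and shapes here are illustrative assumptions:

```python
import numpy as np

def uniform_sample(frames: np.ndarray, target_frames: int) -> np.ndarray:
    """Uniformly sample `target_frames` frames from a (T, H, W, C) video array."""
    total = frames.shape[0]
    if total <= target_frames:
        return frames
    # Evenly spaced indices covering the whole clip.
    idx = np.linspace(0, total - 1, target_frames).round().astype(int)
    return frames[idx]

# A 10-second clip at 16 fps (160 frames) sampled down to ~5 seconds (80 frames).
video = np.zeros((160, 480, 832, 3), dtype=np.uint8)
clip = uniform_sample(video, 80)
print(clip.shape)  # (80, 480, 832, 3)
```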
User-collected materials need to be preprocessed into VACE-recognizable inputs, including `src_video`, `src_mask`, `src_ref_images`, and `prompt`.
Specific descriptions are as follows:
- `src_video`: The video to be edited for input into the model, such as a condition video (Depth, Pose, etc.) or an in/outpainting input video. Gray areas (value 127) represent missing video parts. In the first-frame R2V task, the first frame is the reference frame while subsequent frames are left gray. The missing parts of the in/outpainting `src_video` are also set to gray.
- `src_mask`: A 3D mask with the same shape as `src_video`. White areas represent the parts to be generated, while black areas represent the parts to be retained.
- `src_ref_images`: Reference images for R2V. Salient object segmentation can be performed to keep the background white.
- `prompt`: A text describing the content of the output video. Prompt extension can be used to achieve better generation results for LTX-Video and for English users of Wan2.1. Use descriptive prompts instead of instructions.
Among them, `prompt` is required, while `src_video`, `src_mask`, and `src_ref_images` are optional. For instance, the MV2V task requires `src_video`, `src_mask`, and `prompt`, while the R2V task only requires `src_ref_images` and `prompt`.
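The gray/white/black conventions above can be made concrete with a small sketch. This builds `src_video` and `src_mask` arrays for a first-frame extension task; the shapes and the 80-frame default are illustrative assumptions, not the repository's actual implementation:

```python
import numpy as np

def make_firstframe_inputs(first_frame: np.ndarray, expand_num: int = 80):
    """Build (src_video, src_mask) for a first-frame extension task.

    first_frame: (H, W, 3) uint8 image used as the reference frame.
    Returns src_video (T, H, W, 3) and src_mask (T, H, W) arrays.
    """
    h, w, _ = first_frame.shape
    t = expand_num + 1  # reference frame + frames to be generated
    # Missing parts of src_video are gray (value 127).
    src_video = np.full((t, h, w, 3), 127, dtype=np.uint8)
    src_video[0] = first_frame
    # White (255) marks regions to generate; black (0) marks regions to keep.
    src_mask = np.full((t, h, w), 255, dtype=np.uint8)
    src_mask[0] = 0
    return src_video, src_mask

src_video, src_mask = make_firstframe_inputs(np.zeros((480, 832, 3), dtype=np.uint8))
print(src_video.shape)  # (81, 480, 832, 3)
```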
Both command line and Gradio demo are supported.
- Command Line: Refer to the `run_vace_preproccess.sh` script and invoke it based on the task type. An example command is as follows:
```bash
python vace/vace_preproccess.py --task depth --video assets/videos/test.mp4
```
- Gradio Interactive: Launch the graphical interface for data preprocessing and perform preprocessing on the interface. The specific command is as follows:
```bash
python vace/gradios/preprocess_demo.py
```
VACE is an all-in-one model supporting various task types. However, different preprocessing is required for these task types. The specific task types and descriptions are as follows:
| Task | Subtask | Annotator | Input modal | Params | Note |
|---|---|---|---|---|---|
| txt2vid | txt2vid | / | / | / | |
| control | depth | DepthVideoAnnotator | video | / | |
| control | flow | FlowVisAnnotator | video | / | |
| control | gray | GrayVideoAnnotator | video | / | |
| control | pose | PoseBodyFaceVideoAnnotator | video | / | |
| control | scribble | ScribbleVideoAnnotator | video | / | |
| control | layout_bbox | LayoutBboxAnnotator | two bboxes 'x1,y1,x2,y2 x1,y1,x2,y2' | / | Move linearly from the first box to the second box. |
| control | layout_track | LayoutTrackAnnotator | video | mode='masktrack/bboxtrack/label/caption' maskaug_mode(optional)='original/original_expand/hull/hull_expand/bbox/bbox_expand' maskaug_ratio(optional)=0~1.0 | Mode represents different methods of subject tracking. |
| extension | frameref | FrameRefExpandAnnotator | image | mode='firstframe' expand_num=80 (default) | |
| extension | frameref | FrameRefExpandAnnotator | image | mode='lastframe' expand_num=80 (default) | |
| extension | frameref | FrameRefExpandAnnotator | two images 'a.jpg,b.jpg' | mode='firstlastframe' expand_num=80 (default) | Images are separated by commas. |
| extension | clipref | FrameRefExpandAnnotator | video | mode='firstclip' expand_num=80 (default) | |
| extension | clipref | FrameRefExpandAnnotator | video | mode='lastclip' expand_num=80 (default) | |
| extension | clipref | FrameRefExpandAnnotator | two videos 'a.mp4,b.mp4' | mode='firstlastclip' expand_num=80 (default) | Videos are separated by commas. |
| repainting | inpainting_mask | InpaintingAnnotator | video | mode='salient' | Use salient as a fixed mask. |
| repainting | inpainting_mask | InpaintingAnnotator | video + mask | mode='mask' | Use mask as a fixed mask. |
| repainting | inpainting_bbox | InpaintingAnnotator | video + bbox 'x1, y1, x2, y2' | mode='bbox' | Use bbox as a fixed mask. |
| repainting | inpainting_masktrack | InpaintingAnnotator | video | mode='salientmasktrack' | Use salient mask for dynamic tracking. |
| repainting | inpainting_masktrack | InpaintingAnnotator | video | mode='salientbboxtrack' | Use salient bbox for dynamic tracking. |
| repainting | inpainting_masktrack | InpaintingAnnotator | video + mask | mode='masktrack' | Use mask for dynamic tracking. |
| repainting | inpainting_bboxtrack | InpaintingAnnotator | video + bbox 'x1, y1, x2, y2' | mode='bboxtrack' | Use bbox for dynamic tracking. |
| repainting | inpainting_label | InpaintingAnnotator | video + label | mode='label' | Use label for dynamic tracking. |
| repainting | inpainting_caption | InpaintingAnnotator | video + caption | mode='caption' | Use caption for dynamic tracking. |
| repainting | outpainting | OutpaintingVideoAnnotator | video | direction=left/right/up/down expand_ratio=0~1.0 | Outpainting directions can be combined arbitrarily. |
| reference | image_reference | SubjectAnnotator | image | mode='salient/mask/bbox/salientmasktrack/salientbboxtrack/masktrack/bboxtrack/label/caption' maskaug_mode(optional)='original/original_expand/hull/hull_expand/bbox/bbox_expand' maskaug_ratio(optional)=0~1.0 | Use different methods to obtain the subject region. |
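For intuition on what the outpainting preprocessing produces, the sketch below pads a frame on the right by `expand_ratio`, filling the new area with gray (missing video) and marking it white in the mask. This is a simplified assumption of the behavior; the actual `OutpaintingVideoAnnotator` also handles multiple directions and resizing:

```python
import numpy as np

def outpaint_right(frame: np.ndarray, expand_ratio: float):
    """Pad a (H, W, 3) frame on the right; return (src_frame, mask)."""
    h, w, _ = frame.shape
    pad = int(w * expand_ratio)
    src = np.full((h, w + pad, 3), 127, dtype=np.uint8)  # gray = missing content
    src[:, :w] = frame
    mask = np.zeros((h, w + pad), dtype=np.uint8)
    mask[:, w:] = 255  # white = region to be generated
    return src, mask

src, mask = outpaint_right(np.zeros((480, 832, 3), dtype=np.uint8), 0.25)
print(src.shape)  # (480, 1040, 3)
```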
Moreover, VACE supports combining tasks to accomplish more complex objectives. The following examples illustrate how tasks can be combined, but these combinations are not limited to the examples provided:
| Task | Subtask | Annotator | Input modal | Params | Note |
|---|---|---|---|---|---|
| composition | reference_anything | ReferenceAnythingAnnotator | image_list | mode='salientmasktrack/salientbboxtrack/masktrack/bboxtrack/label/caption' | Input no more than three images. |
| composition | animate_anything | AnimateAnythingAnnotator | image + video | mode='salientmasktrack/salientbboxtrack/masktrack/bboxtrack/label/caption' | Video for conditional redrawing; images for reference generation. |
| composition | swap_anything | SwapAnythingAnnotator | image + video | mode='masktrack/bboxtrack/label/caption' maskaug_mode(optional)='original/original_expand/hull/hull_expand/bbox/bbox_expand' maskaug_ratio(optional)=0~1.0 | Video for conditional redrawing; images for reference generation. Comma-separated mode: first for video, second for images. |
| composition | expand_anything | ExpandAnythingAnnotator | image + image_list | mode='masktrack/bboxtrack/label/caption' direction=left/right/up/down expand_ratio=0~1.0 expand_num=80 (default) | First image for extension edit; others for reference. Comma-separated mode: first for video, second for images. |
| composition | move_anything | MoveAnythingAnnotator | image + two bboxes | expand_num=80 (default) | First image for initial frame reference; others represented by linear bbox changes. |
| composition | more_anything | ... | ... | ... | ... |
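As a sketch of how a reference image is prepared for R2V-style inputs, the background outside the subject region is set to white. The real pipeline obtains the subject region via salient-object segmentation or tracking (per the `SubjectAnnotator` modes above); here the mask is simply given, which is an assumption for illustration:

```python
import numpy as np

def whiten_background(image: np.ndarray, subject_mask: np.ndarray) -> np.ndarray:
    """Keep pixels where subject_mask > 0; set the rest to white (255)."""
    out = np.full_like(image, 255)
    keep = subject_mask.astype(bool)
    out[keep] = image[keep]
    return out

img = np.zeros((64, 64, 3), dtype=np.uint8)   # a black test image
mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1                         # subject in the center
ref = whiten_background(img, mask)
print(ref[0, 0].tolist(), ref[32, 32].tolist())  # [255, 255, 255] [0, 0, 0]
```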
Both command line and Gradio demo are supported.
- Command Line: Refer to the `run_vace_ltx.sh` and `run_vace_wan.sh` scripts and invoke them based on the task type. The input data needs to be preprocessed to obtain parameters such as `src_video`, `src_mask`, `src_ref_images`, and `prompt`. An example command is as follows:
```bash
python vace/vace_wan_inference.py --src_video <path-to-src-video> --src_mask <path-to-src-mask> --src_ref_images <paths-to-src-ref-images> --prompt <prompt>  # wan
python vace/vace_ltx_inference.py --src_video <path-to-src-video> --src_mask <path-to-src-mask> --src_ref_images <paths-to-src-ref-images> --prompt <prompt>  # ltx
```
- Gradio Interactive: Launch the graphical interface for model inference and perform inference through interactions on the interface. The specific command is as follows:
```bash
python vace/gradios/vace_wan_demo.py  # wan
python vace/gradios/vace_ltx_demo.py  # ltx
```
- End-to-End Inference: Refer to the `run_vace_pipeline.sh` script and invoke it based on the task type and input data. This pipeline includes both preprocessing and model inference, so it requires only user-provided materials; however, it offers relatively less flexibility. An example command is as follows:
```bash
python vace/vace_pipeline.py --base wan --task depth --video <path-to-video> --prompt <prompt>  # wan
python vace/vace_pipeline.py --base ltx --task depth --video <path-to-video> --prompt <prompt>  # ltx
```
We provide test examples for the different tasks so that users can validate according to their needs. These include the task, sub-tasks, original inputs (`ori_videos` and `ori_images`), model inputs (`src_video`, `src_mask`, `src_ref_images`, `prompt`), and model outputs.
- VACE-LTX-Video-0.9
- The prompt significantly impacts video generation quality for LTX-Video, and it must be extended in accordance with the model's prompt requirements. We also provide an input parameter to enable prompt extension (`--use_prompt_extend`).
- This model is intended for experimental research validation within the VACE paper and may not guarantee performance in real-world scenarios. However, its inference speed is very fast: it can generate a video in 25 seconds with 40 steps on an A100 GPU, making it suitable for preliminary data and creative validation.
- VACE-Wan2.1-1.3B-Preview
- This model largely preserves the video quality of the original Wan2.1-T2V-1.3B while supporting various tasks.
- When you encounter failure cases with specific tasks, we recommend trying again with a different seed and adjusting the prompt.