
Add LingBot-World support#1083

Draft
radicalyyyahaha wants to merge 3 commits into hao-ai-lab:main from radicalyyyahaha:lingbot-world-adapt

Conversation

@radicalyyyahaha

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @radicalyyyahaha, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request significantly expands the FastVideo framework's capabilities by integrating full support for LingBot-World models. It introduces a new image-to-video pipeline tailored for these models, refines the model configuration to accommodate their specific architectures, and makes the weight loading process more adaptable to various checkpoint formats. Additionally, it includes a dedicated conversion script to simplify the preparation of LingBot-World checkpoints and implements a performance enhancement for multi-process operations.

Highlights

  • LingBot-World Model Support: Introduced comprehensive support for LingBot-World models, including specific parameter name mappings for official checkpoints and a dedicated image-to-video pipeline.
  • Flexible Weight Loading: Enhanced the weight loading mechanism to be more robust, allowing for fallback to different file types (.pth, .bin) and intelligent extraction of tensor state dictionaries from various checkpoint wrappers, including stripping common prefixes for VAE weights.
  • New WanCamImageToVideoPipeline: Added a new WanCamImageToVideoPipeline designed for LingBot/Wan camera-control image-to-video checkpoints, which utilizes two transformer experts and a FlowUniPCMultistepScheduler.
  • Dynamic Model Configuration: Updated WanVideoArchConfig to include new optional parameters (dim, num_heads, in_dim, out_dim, model_type) and logic to dynamically adjust model dimensions and attention heads based on these inputs.
  • Performance Optimization: Optimized multi-process execution by avoiding unnecessary CPU transfers for CUDA IPC when sending result tensors, improving overall performance.
  • Checkpoint Conversion Script: Provided a new utility script to convert LingBot-World checkpoints into a FastVideo-compatible repository layout, streamlining the integration process for these models.
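The "intelligent extraction of tensor state dictionaries from various checkpoint wrappers" could look roughly like the sketch below. This is a hypothetical illustration based only on the PR description: the function name `extract_tensor_state_dict` comes from the changelog, but the wrapper-key list and the duck-typed tensor check are assumptions (real FastVideo code would check `isinstance(v, torch.Tensor)`).

```python
def _looks_like_tensor(value):
    # Duck-typed stand-in (assumption) so this sketch runs without torch;
    # the real loader would use isinstance(value, torch.Tensor).
    return hasattr(value, "dtype") and hasattr(value, "shape")

# Wrapper keys commonly seen in research checkpoints (assumed list).
_WRAPPER_KEYS = ("state_dict", "model", "module", "ema")

def extract_tensor_state_dict(obj):
    """Recursively unwrap a loaded checkpoint until a flat
    name -> tensor mapping is found; return None if there is none."""
    if isinstance(obj, dict):
        # Already a plain tensor state dict?
        if obj and all(_looks_like_tensor(v) for v in obj.values()):
            return obj
        # Otherwise descend into well-known wrapper keys.
        for key in _WRAPPER_KEYS:
            if key in obj:
                found = extract_tensor_state_dict(obj[key])
                if found is not None:
                    return found
    return None
```

The recursive descent means a checkpoint saved as `{"model": {"state_dict": {...}}}` and one saved as a bare state dict both resolve to the same mapping.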


Changelog
  • fastvideo/configs/models/dits/wanvideo.py
    • Added extensive parameter name mappings for official Wan/LingBot checkpoint naming conventions.
    • Introduced new optional configuration parameters (dim, num_heads, in_dim, out_dim, model_type) to WanVideoArchConfig.
    • Modified the __post_init__ method to dynamically set num_attention_heads, in_channels, out_channels, attention_head_dim, and hidden_size based on the newly added optional parameters.
  • fastvideo/models/loader/component_loader.py
    • Imported the new extract_tensor_state_dict utility function.
    • Expanded the allowed patterns for weight loading to include .pth files.
    • Added _class_name and architectures to the list of keys that are popped from the model_config during loading.
    • Refactored VAE weight loading to prioritize .safetensors and then fallback to .pt, .pth, or .bin files, including logic to strip common prefixes from VAE weights.
    • Implemented a warning mechanism to filter and log unsupported WanModel configuration fields.
    • Included WanModel in the list of models that permit non-strict weight loading.
  • fastvideo/models/loader/weight_utils.py
    • Modified pt_weights_iterator to handle torch.load gracefully, attempting weights_only=True first and falling back to weights_only=False for legacy checkpoints, followed by tensor state dictionary extraction.
    • Added a new function extract_tensor_state_dict to recursively find and extract a plain tensor state dictionary from various common checkpoint wrappers.
  • fastvideo/models/registry.py
    • Added 'WanModel' as an alias for 'WanTransformer3DModel' in the _IMAGE_TO_VIDEO_DIT_MODELS registry.
  • fastvideo/pipelines/basic/wan/wan_cam_i2v_pipeline.py
    • Added a new file defining WanCamImageToVideoPipeline, a camera-control image-to-video pipeline for LingBot/Wan models.
    • The pipeline is configured to use FlowUniPCMultistepScheduler and orchestrates various stages including input validation, prompt encoding, conditioning, timestep preparation, latent preparation, image VAE encoding, denoising (with two transformers), and decoding.
  • fastvideo/registry.py
  • Registered new Wan2_2_I2V_A14B_SamplingParam and Wan2_2_I2V_A14B_Config entries for robbyant/lingbot-world-base-cam, with pipeline detection keyed on the 'lingbot-world' and 'wancamimagetovideopipeline' identifiers.
  • fastvideo/worker/multiproc_executor.py
    • Modified worker_busy_loop to send output_batch.output directly via pipe, avoiding an unnecessary .cpu() transfer when using CUDA IPC for result tensors, which improves performance.
  • scripts/checkpoint_conversion/lingbot_world_to_fastvideo.py
    • Added a new script to convert LingBot-World checkpoints into a FastVideo-compatible repository layout.
    • This script handles copying and normalizing transformer configurations, managing tokenizer, text encoder, and VAE components, and generating a model_index.json file for proper integration.
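The `pt_weights_iterator` fallback described above is a two-step load: try the safe `weights_only=True` path first, and only retry with full unpickling for legacy checkpoints. A minimal sketch of that control flow, with the loader injected as a parameter so it can be exercised without torch installed (in FastVideo the loader would be `torch.load`; the helper name is hypothetical):

```python
def load_checkpoint_with_fallback(loader, path, map_location="cpu"):
    """Prefer the safe weights_only load; fall back to full unpickling
    only for legacy checkpoints that store non-tensor pickled objects."""
    try:
        return loader(path, map_location=map_location, weights_only=True)
    except Exception:
        # Legacy checkpoint: wrapper objects are rejected by the safe
        # loader, so retry with weights_only=False.
        return loader(path, map_location=map_location, weights_only=False)
```

Trying `weights_only=True` first matters because it refuses to execute arbitrary pickled code, so untrusted modern checkpoints never reach the unsafe path.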
Activity
  • The pull request was initiated by radicalyyyahaha with the goal of adding LingBot-World support.
  • The changes involve modifications to model configurations, weight loading utilities, and the introduction of a new pipeline and conversion script.
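The VAE prefix stripping mentioned in the changelog amounts to detecting a prefix shared by every key and removing it. A dependency-free sketch (the prefix list and function name are assumptions for illustration, not the actual FastVideo code):

```python
# Prefixes a converted VAE checkpoint might carry (assumed list).
COMMON_VAE_PREFIXES = ("vae.", "module.", "model.")

def strip_common_prefix(state_dict):
    """If every key shares one known prefix, return a copy with that
    prefix removed so keys match the target parameter names; otherwise
    return the state dict unchanged."""
    for prefix in COMMON_VAE_PREFIXES:
        if state_dict and all(k.startswith(prefix) for k in state_dict):
            return {k[len(prefix):]: v for k, v in state_dict.items()}
    return state_dict
```

Requiring that *every* key carries the prefix avoids mangling checkpoints where only some submodules happen to start with, say, `model.`.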

@gemini-code-assist (Contributor) left a comment
Code Review

This pull request adds support for LingBot-World models. The changes include new parameter mappings, more flexible model configuration and weight loading, a new pipeline, and a conversion script. The changes are generally good, improving compatibility and performance. However, I found a potential issue in the __post_init__ method of WanVideoArchConfig where hidden_size is being incorrectly overwritten, which could lead to model configuration errors. I've provided a suggestion to fix this.

Comment on lines 166 to 178

def __post_init__(self):
    if self.num_heads is not None:
        self.num_attention_heads = self.num_heads
    if self.in_dim is not None:
        self.in_channels = self.in_dim
    if self.out_dim is not None:
        self.out_channels = self.out_dim
    if self.dim is not None and self.num_attention_heads > 0:
        self.attention_head_dim = self.dim // self.num_attention_heads
        self.hidden_size = self.dim
    super().__post_init__()
    self.out_channels = self.out_channels or self.in_channels
    self.hidden_size = self.num_attention_heads * self.attention_head_dim
Severity: high

In the __post_init__ method, self.hidden_size is set on line 175 based on self.dim, but it's then unconditionally overwritten on line 178. This can lead to an incorrect hidden_size if self.dim is not a multiple of self.num_attention_heads due to integer division. For example, if dim was 5001 and num_attention_heads was 40, attention_head_dim would be 125, and hidden_size would be incorrectly set to 40 * 125 = 5000 instead of the intended 5001. The logic should be restructured to correctly prioritize self.dim for hidden_size when it's available.

Suggested change

Current:

def __post_init__(self):
    if self.num_heads is not None:
        self.num_attention_heads = self.num_heads
    if self.in_dim is not None:
        self.in_channels = self.in_dim
    if self.out_dim is not None:
        self.out_channels = self.out_dim
    if self.dim is not None and self.num_attention_heads > 0:
        self.attention_head_dim = self.dim // self.num_attention_heads
        self.hidden_size = self.dim
    super().__post_init__()
    self.out_channels = self.out_channels or self.in_channels
    self.hidden_size = self.num_attention_heads * self.attention_head_dim

Suggested:

def __post_init__(self):
    if self.num_heads is not None:
        self.num_attention_heads = self.num_heads
    if self.in_dim is not None:
        self.in_channels = self.in_dim
    if self.out_dim is not None:
        self.out_channels = self.out_dim
    super().__post_init__()
    if self.dim is not None and self.num_attention_heads > 0:
        self.attention_head_dim = self.dim // self.num_attention_heads
        self.hidden_size = self.dim
    else:
        self.hidden_size = self.num_attention_heads * self.attention_head_dim
    self.out_channels = self.out_channels or self.in_channels
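The rounding issue can be checked with the reviewer's own numbers. This is a standalone repro of the arithmetic, not FastVideo code:

```python
# Reviewer's example: a dim that is not a multiple of the head count.
dim, num_attention_heads = 5001, 40

# Integer division truncates the per-head dimension.
attention_head_dim = dim // num_attention_heads  # 5001 // 40 == 125

# Original order: hidden_size is recomputed from the truncated head dim
# after super().__post_init__(), losing the remainder.
buggy_hidden_size = num_attention_heads * attention_head_dim  # 40 * 125 == 5000

# Suggested order: hidden_size keeps the checkpoint's dim when it is given.
fixed_hidden_size = dim  # 5001
```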
