
[feat] Add stable_audio T2A Generation #1080

Open
happy-harvey wants to merge 1 commit into hao-ai-lab:main from happy-harvey:harvey/audio_dev

Conversation

@happy-harvey

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @happy-harvey, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the FastVideo library by introducing robust Text-to-Audio generation capabilities. It integrates the Stable Audio Open model, providing a complete pipeline from text prompt to audio waveform. The changes encompass new model configurations, specialized pipeline stages for audio processing, and an updated API, making it straightforward for users to leverage this new functionality. The addition of comprehensive documentation and examples ensures a smooth onboarding experience for generating high-quality audio from textual descriptions.

Highlights

  • New Feature: Text-to-Audio (T2A) Generation: Introduced support for Text-to-Audio generation using Stability AI's Stable Audio Open 1.0 model, expanding the library's capabilities beyond video.
  • Comprehensive Stable Audio Integration: Integrated Stable Audio's Diffusion Transformer (DiT), Oobleck VAE pretransform, and MultiConditioner components, along with k-diffusion v-prediction sampling, into the FastVideo framework.
  • Updated Documentation and Examples: Added a dedicated documentation page for Stable Audio, detailing installation (including Python 3.12 dependency workarounds), usage, supported models, and limitations. A new example script demonstrates basic and advanced T2A generation.
  • Flexible Model Loading: Enhanced model loading mechanisms to support unified checkpoints (e.g., model.safetensors or model.ckpt) for Stable Audio models, allowing for easier integration of pre-trained weights.
  • New generate_audio Method: The VideoGenerator class now includes a generate_audio method, providing a streamlined API for text-to-audio inference with configurable parameters.


Changelog
  • docs/inference/inference_quick_start.md
    • Added a link to the new Stable Audio documentation page.
  • docs/inference/stable_audio.md
    • New file detailing Stable Audio Open T2A generation, including supported models, installation, usage, and limitations.
  • docs/inference/support_matrix.md
    • Updated the supported models table to include Stable Audio Open 1.0 (T2A).
    • Added a new section for Stable Audio Open 1.0 features.
  • examples/inference/basic/stable_audio_basic.py
    • New example script for basic and custom parameter usage of Stable Audio T2A generation.
  • fastvideo/configs/models/dits/__init__.py
    • Imported and added StableAudioDiTConfig to the __all__ list.
  • fastvideo/configs/models/dits/stable_audio.py
    • New file defining StableAudioDiTArchConfig and StableAudioDiTConfig for the Stable Audio Diffusion Transformer.
  • fastvideo/configs/pipelines/stable_audio.py
    • New file defining StableAudioPipelineConfig for the Stable Audio T2A pipeline, including audio-specific parameters.
  • fastvideo/configs/sample/stable_audio.py
    • New file defining StableAudioSamplingParam with default sampling parameters for T2A generation.
  • fastvideo/entrypoints/video_generator.py
    • Added a new generate_audio method to the VideoGenerator class for text-to-audio generation.
  • fastvideo/models/dits/stable_audio.py
    • New file implementing StableAudioDiTModel, a wrapper for stable-audio-tools DiffusionTransformer.
  • fastvideo/models/loader/component_loader.py
    • Modified to support loading unified checkpoints (.ckpt files) and inferring paths based on transformer_key_prefix.
  • fastvideo/models/loader/fsdp_load.py
    • Updated maybe_load_fsdp_model to accept weight_key_prefix and handle .ckpt files for unified checkpoint loading.
  • fastvideo/models/loader/weight_utils.py
    • Added unified_checkpoint_weights_iterator for loading weights from .ckpt files.
    • Modified safetensors_weights_iterator to support key_prefix for filtering.
  • fastvideo/models/registry.py
    • Registered StableAudioDiTModel in the model registry.
  • fastvideo/models/stable_audio/__init__.py
    • New __init__.py file to export Stable Audio components.
  • fastvideo/models/stable_audio/conditioner.py
    • New file defining StableAudioConditioner for T5 and NumberEmbedder conditioning, loading weights from unified checkpoints.
  • fastvideo/models/stable_audio/pretransform.py
    • New file defining StableAudioPretransform for Oobleck VAE, handling encoding/decoding and loading weights.
  • fastvideo/models/stable_audio/sampling.py
    • New file defining sample_stable_audio for k-diffusion v-prediction sampling compatible with Stable Audio.
  • fastvideo/pipelines/basic/stable_audio/__init__.py
    • New __init__.py for the Stable Audio pipeline.
  • fastvideo/pipelines/basic/stable_audio/stable_audio_pipeline.py
    • New file defining StableAudioPipeline to orchestrate T2A generation, supporting unified and HuggingFace model loading.
  • fastvideo/pipelines/pipeline_batch_info.py
    • Added audio-specific fields (sample_rate, duration_seconds, seconds_start, seconds_total) to ForwardBatch.
  • fastvideo/pipelines/stages/__init__.py
    • Imported and added new Stable Audio pipeline stages to the __all__ list.
  • fastvideo/pipelines/stages/stable_audio_conditioning.py
    • New file defining StableAudioConditioningStage to process text prompts and seconds conditioning.
  • fastvideo/pipelines/stages/stable_audio_decoding.py
    • New file defining StableAudioDecodingStage to decode latents into audio waveforms.
  • fastvideo/pipelines/stages/stable_audio_denoising.py
    • New file defining StableAudioDenoisingStage for k-diffusion v-prediction sampling.
  • fastvideo/pipelines/stages/stable_audio_input_validation.py
    • New file defining StableAudioInputValidationStage for audio-specific input validation.
  • fastvideo/pipelines/stages/stable_audio_latent_preparation.py
    • New file defining StableAudioLatentPreparationStage to prepare initial noise latents based on audio duration.
  • fastvideo/registry.py
    • Registered StableAudioPipelineConfig and StableAudioSamplingParam for model detection and configuration.
  • mkdocs.yml
    • Updated documentation navigation to include the new 'Stable Audio' page.
  • pyproject.toml
    • Added a new [stable-audio] extra dependency group with specific installation notes for stable-audio-tools.
  • tests/local_tests/pipelines/test_stable_audio_pipeline_smoke.py
    • New smoke test for the Stable Audio pipeline, verifying basic functionality.
  • tests/local_tests/stable_audio/test_parity.py
    • New parity tests to ensure FastVideo's Stable Audio components match stable-audio-tools outputs.
Activity
  • The pull request introduces a major new feature: Text-to-Audio generation, indicating a significant expansion of the library's capabilities.
  • Extensive new code was added across various modules (configs, models, pipelines, stages) to support the integration of Stable Audio.
  • Documentation was created and updated to guide users through the new T2A functionality, including installation and usage examples.
  • New test files were added, including a smoke test and detailed parity tests, demonstrating a commitment to correctness and stability for the new feature.
  • Modifications to core loading utilities reflect an effort to support diverse model checkpoint formats, particularly for unified Stable Audio models.
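The checkpoint-loading changes mentioned above can be pictured with a toy iterator: filter a unified checkpoint's state dict by a key prefix and strip the prefix before yielding, so that a submodule (e.g. the DiT) only sees its own weights. The names below are illustrative, not the PR's actual signatures.

```python
def prefixed_weights_iterator(state_dict, key_prefix=None):
    """Yield (name, weight) pairs, keeping only keys under key_prefix.

    Toy illustration of prefix-filtered loading from a unified checkpoint:
    entries outside the prefix (e.g. VAE weights when loading the DiT)
    are skipped, and the prefix is stripped so names match the submodule.
    """
    for name, weight in state_dict.items():
        if key_prefix is None:
            yield name, weight
        elif name.startswith(key_prefix):
            yield name[len(key_prefix):], weight

# A unified checkpoint mixes DiT and VAE weights under different prefixes.
unified = {
    "model.dit.proj.weight": [1.0],
    "model.dit.proj.bias": [0.0],
    "pretransform.vae.enc.weight": [2.0],
}
dit_weights = dict(prefixed_weights_iterator(unified, key_prefix="model.dit."))
assert list(dit_weights) == ["proj.weight", "proj.bias"]
```

The real utilities additionally read from .ckpt or .safetensors files, but the filter-and-strip step is the essence of the `key_prefix` behavior described in the changelog.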

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces text-to-audio generation capabilities using the Stable Audio Open model. The changes are extensive and well-executed, including the core pipeline implementation, new configuration files, a command-line example, comprehensive documentation, and both smoke and parity tests to ensure correctness. The integration with the existing VideoGenerator and model loading infrastructure is thoughtfully handled, particularly the support for Stable Audio's unified checkpoint format. I've found one minor issue in a utility function within the example script that could be improved for robustness.

audio_np = np.clip(audio_np, -1.0, 1.0)
audio_int16 = (audio_np * 32767.0).astype(np.int16)
if audio_int16.ndim == 1:
    audio_int16 = audio_int16[:, None]


Severity: medium

The logic for handling 1D (mono) audio arrays is incorrect. Reshaping a 1D array of shape (T,) to (T, 1) using [:, None] causes num_channels to be interpreted as T and num_frames as 1 in the subsequent lines. To handle mono audio correctly, the array should be reshaped to (1, T). While this may not affect the current Stable Audio model, which produces stereo output, fixing it will make the save_audio_wav utility more robust for general use.

Suggested change:
- audio_int16 = audio_int16[:, None]
+ audio_int16 = audio_int16[np.newaxis, :]
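The reviewer's point is easy to verify with a small NumPy check, assuming the (num_channels, num_frames) layout the comment implies:

```python
import numpy as np

T = 4
mono = np.zeros(T, dtype=np.int16)   # mono signal, shape (T,)

wrong = mono[:, None]                # shape (T, 1): read as T channels, 1 frame
right = mono[np.newaxis, :]          # shape (1, T): 1 channel, T frames

assert wrong.shape == (T, 1)
assert right.shape == (1, T)
```

With the (T, 1) shape, a WAV writer that takes axis 0 as channels would emit a T-channel, 1-frame file instead of a 1-channel, T-frame one.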

@happy-harvey force-pushed the harvey/audio_dev branch 3 times, most recently from d9ffee4 to 911861e on February 13, 2026 at 01:13.