
[feat] Add stable_audio T2A Generation #1080

Open
happy-harvey wants to merge 1 commit into hao-ai-lab:main from happy-harvey:harvey/audio_dev

Conversation

@happy-harvey

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @happy-harvey, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the FastVideo library by introducing robust Text-to-Audio generation capabilities. It integrates the Stable Audio Open model, providing a complete pipeline from text prompt to audio waveform. The changes encompass new model configurations, specialized pipeline stages for audio processing, and an updated API, making it straightforward for users to leverage this new functionality. The addition of comprehensive documentation and examples ensures a smooth onboarding experience for generating high-quality audio from textual descriptions.

Highlights

  • New Feature: Text-to-Audio (T2A) Generation: Introduced support for Text-to-Audio generation using Stability AI's Stable Audio Open 1.0 model, expanding the library's capabilities beyond video.
  • Comprehensive Stable Audio Integration: Integrated Stable Audio's Diffusion Transformer (DiT), Oobleck VAE pretransform, and MultiConditioner components, along with k-diffusion v-prediction sampling, into the FastVideo framework.
  • Updated Documentation and Examples: Added a dedicated documentation page for Stable Audio, detailing installation (including Python 3.12 dependency workarounds), usage, supported models, and limitations. A new example script demonstrates basic and advanced T2A generation.
  • Flexible Model Loading: Enhanced model loading mechanisms to support unified checkpoints (e.g., model.safetensors or model.ckpt) for Stable Audio models, allowing for easier integration of pre-trained weights.
  • New generate_audio Method: The VideoGenerator class now includes a generate_audio method, providing a streamlined API for text-to-audio inference with configurable parameters.


Changelog
  • docs/inference/inference_quick_start.md
    • Added a link to the new Stable Audio documentation page.
  • docs/inference/stable_audio.md
    • New file detailing Stable Audio Open T2A generation, including supported models, installation, usage, and limitations.
  • docs/inference/support_matrix.md
    • Updated the supported models table to include Stable Audio Open 1.0 (T2A).
    • Added a new section for Stable Audio Open 1.0 features.
  • examples/inference/basic/stable_audio_basic.py
    • New example script for basic and custom parameter usage of Stable Audio T2A generation.
  • fastvideo/configs/models/dits/__init__.py
    • Imported and added StableAudioDiTConfig to the __all__ list.
  • fastvideo/configs/models/dits/stable_audio.py
    • New file defining StableAudioDiTArchConfig and StableAudioDiTConfig for the Stable Audio Diffusion Transformer.
  • fastvideo/configs/pipelines/stable_audio.py
    • New file defining StableAudioPipelineConfig for the Stable Audio T2A pipeline, including audio-specific parameters.
  • fastvideo/configs/sample/stable_audio.py
    • New file defining StableAudioSamplingParam with default sampling parameters for T2A generation.
  • fastvideo/entrypoints/video_generator.py
    • Added a new generate_audio method to the VideoGenerator class for text-to-audio generation.
  • fastvideo/models/dits/stable_audio.py
    • New file implementing StableAudioDiTModel, a wrapper for stable-audio-tools DiffusionTransformer.
  • fastvideo/models/loader/component_loader.py
    • Modified to support loading unified checkpoints (.ckpt files) and inferring paths based on transformer_key_prefix.
  • fastvideo/models/loader/fsdp_load.py
    • Updated maybe_load_fsdp_model to accept weight_key_prefix and handle .ckpt files for unified checkpoint loading.
  • fastvideo/models/loader/weight_utils.py
    • Added unified_checkpoint_weights_iterator for loading weights from .ckpt files.
    • Modified safetensors_weights_iterator to support key_prefix for filtering.
  • fastvideo/models/registry.py
    • Registered StableAudioDiTModel in the model registry.
  • fastvideo/models/stable_audio/__init__.py
    • New __init__.py file to export Stable Audio components.
  • fastvideo/models/stable_audio/conditioner.py
    • New file defining StableAudioConditioner for T5 and NumberEmbedder conditioning, loading weights from unified checkpoints.
  • fastvideo/models/stable_audio/pretransform.py
    • New file defining StableAudioPretransform for Oobleck VAE, handling encoding/decoding and loading weights.
  • fastvideo/models/stable_audio/sampling.py
    • New file defining sample_stable_audio for k-diffusion v-prediction sampling compatible with Stable Audio.
  • fastvideo/pipelines/basic/stable_audio/__init__.py
    • New __init__.py for the Stable Audio pipeline.
  • fastvideo/pipelines/basic/stable_audio/stable_audio_pipeline.py
    • New file defining StableAudioPipeline to orchestrate T2A generation, supporting unified and HuggingFace model loading.
  • fastvideo/pipelines/pipeline_batch_info.py
    • Added audio-specific fields (sample_rate, duration_seconds, seconds_start, seconds_total) to ForwardBatch.
  • fastvideo/pipelines/stages/__init__.py
    • Imported and added new Stable Audio pipeline stages to the __all__ list.
  • fastvideo/pipelines/stages/stable_audio_conditioning.py
    • New file defining StableAudioConditioningStage to process text prompts and seconds conditioning.
  • fastvideo/pipelines/stages/stable_audio_decoding.py
    • New file defining StableAudioDecodingStage to decode latents into audio waveforms.
  • fastvideo/pipelines/stages/stable_audio_denoising.py
    • New file defining StableAudioDenoisingStage for k-diffusion v-prediction sampling.
  • fastvideo/pipelines/stages/stable_audio_input_validation.py
    • New file defining StableAudioInputValidationStage for audio-specific input validation.
  • fastvideo/pipelines/stages/stable_audio_latent_preparation.py
    • New file defining StableAudioLatentPreparationStage to prepare initial noise latents based on audio duration.
  • fastvideo/registry.py
    • Registered StableAudioPipelineConfig and StableAudioSamplingParam for model detection and configuration.
  • mkdocs.yml
    • Updated documentation navigation to include the new 'Stable Audio' page.
  • pyproject.toml
    • Added a new [stable-audio] extra dependency group with specific installation notes for stable-audio-tools.
  • tests/local_tests/pipelines/test_stable_audio_pipeline_smoke.py
    • New smoke test for the Stable Audio pipeline, verifying basic functionality.
  • tests/local_tests/stable_audio/test_parity.py
    • New parity tests to ensure FastVideo's Stable Audio components match stable-audio-tools outputs.
Activity
  • The pull request introduces a major new feature: Text-to-Audio generation, indicating a significant expansion of the library's capabilities.
  • Extensive new code was added across various modules (configs, models, pipelines, stages) to support the integration of Stable Audio.
  • Documentation was created and updated to guide users through the new T2A functionality, including installation and usage examples.
  • New test files were added, including a smoke test and detailed parity tests, demonstrating a commitment to correctness and stability for the new feature.
  • Modifications to core loading utilities reflect an effort to support diverse model checkpoint formats, particularly for unified Stable Audio models.
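The checkpoint-loading changes mentioned above can be pictured with a toy iterator: filter a unified checkpoint's state dict by a key prefix and strip the prefix before yielding, so that a submodule (e.g. the DiT) only sees its own weights. The names below are illustrative, not the PR's actual signatures.

```python
def prefixed_weights_iterator(state_dict, key_prefix=None):
    """Yield (name, weight) pairs, keeping only keys under key_prefix.

    Toy illustration of prefix-filtered loading from a unified checkpoint:
    entries outside the prefix (e.g. VAE weights when loading the DiT)
    are skipped, and the prefix is stripped so names match the submodule.
    """
    for name, weight in state_dict.items():
        if key_prefix is None:
            yield name, weight
        elif name.startswith(key_prefix):
            yield name[len(key_prefix):], weight

# A unified checkpoint mixes DiT and VAE weights under different prefixes.
unified = {
    "model.dit.proj.weight": [1.0],
    "model.dit.proj.bias": [0.0],
    "pretransform.vae.enc.weight": [2.0],
}
dit_weights = dict(prefixed_weights_iterator(unified, key_prefix="model.dit."))
assert list(dit_weights) == ["proj.weight", "proj.bias"]
```

The real utilities additionally read from .ckpt or .safetensors files, but the filter-and-strip step is the essence of the `key_prefix` behavior described in the changelog.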

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces text-to-audio generation capabilities using the Stable Audio Open model. The changes are extensive and well-executed, including the core pipeline implementation, new configuration files, a command-line example, comprehensive documentation, and both smoke and parity tests to ensure correctness. The integration with the existing VideoGenerator and model loading infrastructure is thoughtfully handled, particularly the support for Stable Audio's unified checkpoint format. I've found one minor issue in a utility function within the example script that could be improved for robustness.

audio_np = np.clip(audio_np, -1.0, 1.0)
audio_int16 = (audio_np * 32767.0).astype(np.int16)
if audio_int16.ndim == 1:
    audio_int16 = audio_int16[:, None]


Severity: medium

The logic for handling 1D (mono) audio arrays is incorrect. Reshaping a 1D array of shape (T,) to (T, 1) using [:, None] causes num_channels to be interpreted as T and num_frames as 1 in the subsequent lines. To handle mono audio correctly, the array should be reshaped to (1, T). While this may not affect the current Stable Audio model, which produces stereo output, fixing it will make the save_audio_wav utility more robust for general use.

Suggested change:
- audio_int16 = audio_int16[:, None]
+ audio_int16 = audio_int16[np.newaxis, :]
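The reviewer's point is easy to verify with a small NumPy check, assuming the (num_channels, num_frames) layout the comment implies:

```python
import numpy as np

T = 4
mono = np.zeros(T, dtype=np.int16)   # mono signal, shape (T,)

wrong = mono[:, None]                # shape (T, 1): read as T channels, 1 frame
right = mono[np.newaxis, :]          # shape (1, T): 1 channel, T frames

assert wrong.shape == (T, 1)
assert right.shape == (1, T)
```

With the (T, 1) shape, a WAV writer that takes axis 0 as channels would emit a T-channel, 1-frame file instead of a 1-channel, T-frame one.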

@happy-harvey force-pushed the harvey/audio_dev branch 3 times, most recently from d9ffee4 to 911861e on February 13, 2026 at 01:13.