[CI] Add preprocessing pipeline tests #1152
AjAnubolu wants to merge 7 commits into hao-ai-lab:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request strengthens the project's continuous integration by adding a comprehensive end-to-end test for the Text-to-Video (T2V) preprocessing pipeline. The new test verifies that the pipeline correctly processes video and caption data into validated Parquet output, improving reliability and preventing regressions in this critical data-preparation step. Because the test is wired into Buildkite and Modal, it runs automatically whenever the preprocessing code paths change, providing immediate feedback on the health of the system.
Activity
Code Review
This pull request introduces an end-to-end CI test for the T2V preprocessing pipeline, which is a valuable addition for ensuring the stability of this critical data processing path. The changes correctly configure the new test in Buildkite and Modal. The test script itself is well-structured, covering data download, pipeline execution, and validation of the output. I have provided a few suggestions to improve the test script's robustness and align it with best practices.
```python
        local_dir=str(RAW_DATA_DIR),
        repo_type="dataset",
        resume_download=True,
        token=os.environ.get("HF_TOKEN"),
```
The CI environment sets the Hugging Face token in the HF_API_KEY environment variable, but the code is attempting to access HF_TOKEN. This will likely cause authentication to fail when downloading the test dataset. Please use HF_API_KEY to ensure consistency with the CI configuration.
```diff
-        token=os.environ.get("HF_TOKEN"),
+        token=os.environ.get("HF_API_KEY"),
```
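If both variable names remain in circulation (CI vs. local runs), a small helper that prefers `HF_API_KEY` and falls back to `HF_TOKEN` avoids breaking either environment. This is only a sketch; the helper name `get_hf_token` is hypothetical, and the variable names come from the review comment above.

```python
import os
from typing import Optional


def get_hf_token() -> Optional[str]:
    """Return the Hugging Face token, preferring the CI's HF_API_KEY
    and falling back to HF_TOKEN for local development runs."""
    return os.environ.get("HF_API_KEY") or os.environ.get("HF_TOKEN")
```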
```python
        "--model_path",
        MODEL_PATH,
        "--data_merge_path",
        os.path.join(RAW_DATA_DIR, "merge_1_sample.txt"),
```
The file primarily uses pathlib.Path for path manipulations, which is great for readability and cross-platform compatibility. For consistency, it would be better to use pathlib's / operator here as well, converting to a string only when passing it to the subprocess call.
```diff
-        os.path.join(RAW_DATA_DIR, "merge_1_sample.txt"),
+        str(RAW_DATA_DIR / "merge_1_sample.txt"),
```
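To illustrate the pattern: `pathlib`'s `/` operator composes path segments, and `str()` is applied only at the boundary where an API (such as a subprocess argv list) requires a string. The `RAW_DATA_DIR` value below is a stand-in, not the test's actual constant.

```python
from pathlib import Path

# Stand-in for the test script's RAW_DATA_DIR constant.
RAW_DATA_DIR = Path("data") / "raw"

# Compose the path with the / operator; no manual separator handling.
merge_path = RAW_DATA_DIR / "merge_1_sample.txt"

# Convert to str only where required, e.g. when building subprocess args.
argv = ["--data_merge_path", str(merge_path)]
```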
```python
    for i in range(table.num_rows):
        row = {
            col: table.column(col)[i].as_py()
            for col in EXPECTED_T2V_COLUMNS
        }

        # VAE latent
        assert len(row["vae_latent_bytes"]) > 0, (
            f"Row {i}: vae_latent_bytes is empty")
        assert len(row["vae_latent_shape"]) == 4, (
            f"Row {i}: vae_latent_shape should have 4 elements "
            f"(C,T,H,W), got {row['vae_latent_shape']}")

        # Text embedding
        assert len(row["text_embedding_bytes"]) > 0, (
            f"Row {i}: text_embedding_bytes is empty")

        # Caption
        assert isinstance(row["caption"], str) and row["caption"], (
            f"Row {i}: caption is empty or not a string")

        # Media type
        assert row["media_type"] == "video", (
            f"Row {i}: expected media_type='video', "
            f"got '{row['media_type']}'")

        # Dimensions
        assert row["width"] > 0, (
            f"Row {i}: width must be positive, got {row['width']}")
        assert row["height"] > 0, (
            f"Row {i}: height must be positive, got {row['height']}")
```
The current method of iterating through the table by building a row dictionary inside the loop is inefficient, as it accesses each column individually for every row. A more idiomatic and performant approach with pyarrow is to convert the entire table to a list of Python dictionaries using table.to_pylist() before the loop. This improves both readability and performance.
```python
    for i, row in enumerate(table.to_pylist()):
        # VAE latent
        assert len(row["vae_latent_bytes"]) > 0, (
            f"Row {i}: vae_latent_bytes is empty")
        assert len(row["vae_latent_shape"]) == 4, (
            f"Row {i}: vae_latent_shape should have 4 elements "
            f"(C,T,H,W), got {row['vae_latent_shape']}")

        # Text embedding
        assert len(row["text_embedding_bytes"]) > 0, (
            f"Row {i}: text_embedding_bytes is empty")

        # Caption
        assert isinstance(row["caption"], str) and row["caption"], (
            f"Row {i}: caption is empty or not a string")

        # Media type
        assert row["media_type"] == "video", (
            f"Row {i}: expected media_type='video', "
            f"got '{row['media_type']}'")

        # Dimensions
        assert row["width"] > 0, (
            f"Row {i}: width must be positive, got {row['width']}")
        assert row["height"] > 0, (
            f"Row {i}: height must be positive, got {row['height']}")
```
Eigensystem left a comment:
Looks great! Could you also add preprocessing tests for the new preprocessing pipeline? You can compare the results of the old preprocessing pipeline with the new one. The entry point for the new preprocessing is fastvideo/pipelines/preprocess/v1_preprocess_new.py