[CI] Add preprocessing pipeline tests #1152

Open
AjAnubolu wants to merge 7 commits into hao-ai-lab:main from AjAnubolu:ci/preprocessing-tests

Conversation

@AjAnubolu
Collaborator

Summary

  • Adds e2e CI test for T2V preprocessing pipeline (videos + captions → validated Parquet output)
  • Wires into Buildkite/Modal, triggers on changes to preprocessing code paths

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's continuous integration by adding a comprehensive end-to-end test for the Text-to-Video (T2V) preprocessing pipeline. This new test ensures the pipeline correctly processes video and caption data into validated Parquet output, improving reliability and preventing regressions in this critical data preparation step. The integration into Buildkite and Modal means these tests will automatically run when changes are made to the preprocessing code paths, providing immediate feedback on the health of the system.

Highlights

  • New End-to-End CI Test: Introduced a comprehensive end-to-end CI test specifically for the Text-to-Video (T2V) preprocessing pipeline, ensuring the integrity of video and caption data processing.
  • Buildkite Integration: Integrated the newly added preprocessing test into the Buildkite continuous integration system, enabling automated execution.
  • Conditional Test Triggering: Configured the CI pipeline to automatically trigger the preprocessing tests when changes are detected in relevant preprocessing code paths, ensuring immediate feedback on system health.

Changelog
  • .buildkite/pipeline.yml
    • Added a new conditional Buildkite step for "Preprocessing Tests" that triggers on changes to fastvideo/pipelines/preprocess/**, fastvideo/workflow/preprocess/**, fastvideo/dataset/preprocessing_datasets.py, fastvideo/dataset/dataloader/schema.py, fastvideo/tests/preprocessing/**, pyproject.toml, and docker/Dockerfile.python3.12.
  • .buildkite/scripts/pr_test.sh
    • Extended the script to handle a new preprocessing test type, executing a Modal command to run the corresponding tests.
  • fastvideo/tests/modal/pr_test.py
    • Implemented a new Modal function run_preprocessing_tests configured with GPU, timeout, secrets, and volumes, which executes pytest specifically for the fastvideo/tests/preprocessing/ directory.
  • fastvideo/tests/preprocessing/test_preprocessing_t2v.py
    • Added a new end-to-end test that downloads a sample dataset, executes the T2V preprocessing pipeline using torchrun, and performs structural validation of the generated Parquet files, checking for expected columns, row counts, and content integrity for VAE latents, text embeddings, captions, media type, and dimensions.
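
The trigger paths listed above for .buildkite/pipeline.yml can be sanity-checked locally. The sketch below is not how Buildkite evaluates the condition; it is a stand-alone approximation using Python's fnmatch, and the helper name touches_preprocessing is hypothetical.

```python
from fnmatch import fnmatch

# Trigger paths copied from the changelog entry for .buildkite/pipeline.yml.
PREPROCESSING_PATTERNS = [
    "fastvideo/pipelines/preprocess/**",
    "fastvideo/workflow/preprocess/**",
    "fastvideo/dataset/preprocessing_datasets.py",
    "fastvideo/dataset/dataloader/schema.py",
    "fastvideo/tests/preprocessing/**",
    "pyproject.toml",
    "docker/Dockerfile.python3.12",
]


def touches_preprocessing(changed_files):
    """Return True if any changed path matches a trigger pattern.

    Hypothetical helper. fnmatch's "*" also matches "/", so a pattern
    ending in "/**" matches everything below that directory.
    """
    return any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in PREPROCESSING_PATTERNS
    )
```

Under this approximation, a change set that only edits files outside the listed paths would not schedule the preprocessing step.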
Activity
  • No human activity (comments, reviews, or progress updates) has been recorded on this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an end-to-end CI test for the T2V preprocessing pipeline, which is a valuable addition for ensuring the stability of this critical data processing path. The changes correctly configure the new test in Buildkite and Modal. The test script itself is well-structured, covering data download, pipeline execution, and validation of the output. I have provided a few suggestions to improve the test script's robustness and align it with best practices.

local_dir=str(RAW_DATA_DIR),
repo_type="dataset",
resume_download=True,
token=os.environ.get("HF_TOKEN"),

high

The CI environment sets the Hugging Face token in the HF_API_KEY environment variable, but the code is attempting to access HF_TOKEN. This will likely cause authentication to fail when downloading the test dataset. Please use HF_API_KEY to ensure consistency with the CI configuration.

Suggested change
- token=os.environ.get("HF_TOKEN"),
+ token=os.environ.get("HF_API_KEY"),
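
If local runs should keep working with HF_TOKEN while CI provides HF_API_KEY, one option is a small fallback helper instead of a straight rename. This is only a sketch: resolve_hf_token is a hypothetical name and is not part of this PR.

```python
import os


def resolve_hf_token(environ=None):
    """Return the first non-empty Hugging Face token found, else None.

    Hypothetical helper: checks HF_API_KEY (the name the review says CI
    sets) first, then falls back to HF_TOKEN.
    """
    env = os.environ if environ is None else environ
    for name in ("HF_API_KEY", "HF_TOKEN"):
        value = env.get(name)
        if value:
            return value
    return None
```

The download call could then pass token=resolve_hf_token(), which yields None when neither variable is set.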

"--model_path",
MODEL_PATH,
"--data_merge_path",
os.path.join(RAW_DATA_DIR, "merge_1_sample.txt"),

medium

The file primarily uses pathlib.Path for path manipulations, which is great for readability and cross-platform compatibility. For consistency, it would be better to use pathlib's / operator here as well, converting to a string only when passing it to the subprocess call.

Suggested change
- os.path.join(RAW_DATA_DIR, "merge_1_sample.txt"),
+ str(RAW_DATA_DIR / "merge_1_sample.txt"),
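
A quick check that the two spellings produce the same string, so the change is purely stylistic. The RAW_DATA_DIR value here is a placeholder for illustration; the real one is defined in the test module.

```python
import os
from pathlib import Path

# Placeholder value; the real RAW_DATA_DIR lives in the test module.
RAW_DATA_DIR = Path("data") / "raw"

# os.path.join accepts path-like objects and returns a str.
joined_with_os = os.path.join(RAW_DATA_DIR, "merge_1_sample.txt")

# pathlib's "/" operator returns a Path, so convert before handing it to subprocess.
joined_with_pathlib = str(RAW_DATA_DIR / "merge_1_sample.txt")

assert joined_with_os == joined_with_pathlib
```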

Comment on lines +122 to +152
    for i in range(table.num_rows):
        row = {
            col: table.column(col)[i].as_py()
            for col in EXPECTED_T2V_COLUMNS
        }

        # VAE latent
        assert len(row["vae_latent_bytes"]) > 0, (
            f"Row {i}: vae_latent_bytes is empty")
        assert len(row["vae_latent_shape"]) == 4, (
            f"Row {i}: vae_latent_shape should have 4 elements "
            f"(C,T,H,W), got {row['vae_latent_shape']}")

        # Text embedding
        assert len(row["text_embedding_bytes"]) > 0, (
            f"Row {i}: text_embedding_bytes is empty")

        # Caption
        assert isinstance(row["caption"], str) and row["caption"], (
            f"Row {i}: caption is empty or not a string")

        # Media type
        assert row["media_type"] == "video", (
            f"Row {i}: expected media_type='video', "
            f"got '{row['media_type']}'")

        # Dimensions
        assert row["width"] > 0, (
            f"Row {i}: width must be positive, got {row['width']}")
        assert row["height"] > 0, (
            f"Row {i}: height must be positive, got {row['height']}")

medium

The current method of iterating through the table by building a row dictionary inside the loop is inefficient, as it accesses each column individually for every row. A more idiomatic and performant approach with pyarrow is to convert the entire table to a list of Python dictionaries using table.to_pylist() before the loop. This improves both readability and performance.

    for i, row in enumerate(table.to_pylist()):
        # VAE latent
        assert len(row["vae_latent_bytes"]) > 0, (
            f"Row {i}: vae_latent_bytes is empty")
        assert len(row["vae_latent_shape"]) == 4, (
            f"Row {i}: vae_latent_shape should have 4 elements "
            f"(C,T,H,W), got {row['vae_latent_shape']}")

        # Text embedding
        assert len(row["text_embedding_bytes"]) > 0, (
            f"Row {i}: text_embedding_bytes is empty")

        # Caption
        assert isinstance(row["caption"], str) and row["caption"], (
            f"Row {i}: caption is empty or not a string")

        # Media type
        assert row["media_type"] == "video", (
            f"Row {i}: expected media_type='video', "
            f"got '{row['media_type']}'")

        # Dimensions
        assert row["width"] > 0, (
            f"Row {i}: width must be positive, got {row['width']}")
        assert row["height"] > 0, (
            f"Row {i}: height must be positive, got {row['height']}")
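
Because to_pylist() returns plain Python dicts, the per-row checks could also be factored into a function that runs without any Parquet file. The sketch below is an assumption about how that might look, not code from this PR: collect_row_errors is a hypothetical name, the sample values are invented, and it collects every failure instead of stopping at the first failed assert.

```python
def collect_row_errors(rows):
    """Return messages for rows that fail the structural checks.

    `rows` is a list of dicts, the shape produced by pyarrow's
    Table.to_pylist(). Hypothetical helper mirroring the asserts above.
    """
    errors = []
    for i, row in enumerate(rows):
        if not row["vae_latent_bytes"]:
            errors.append(f"Row {i}: vae_latent_bytes is empty")
        if len(row["vae_latent_shape"]) != 4:
            errors.append(
                f"Row {i}: vae_latent_shape should have 4 elements "
                f"(C,T,H,W), got {row['vae_latent_shape']}")
        if not row["text_embedding_bytes"]:
            errors.append(f"Row {i}: text_embedding_bytes is empty")
        if not (isinstance(row["caption"], str) and row["caption"]):
            errors.append(f"Row {i}: caption is empty or not a string")
        if row["media_type"] != "video":
            errors.append(
                f"Row {i}: expected media_type='video', "
                f"got '{row['media_type']}'")
        for dim in ("width", "height"):
            if row[dim] <= 0:
                errors.append(
                    f"Row {i}: {dim} must be positive, got {row[dim]}")
    return errors


# Invented sample rows for illustration only.
GOOD_ROW = {
    "vae_latent_bytes": b"\x00\x01",
    "vae_latent_shape": [16, 8, 60, 104],
    "text_embedding_bytes": b"\x02",
    "caption": "a cat walking",
    "media_type": "video",
    "width": 832,
    "height": 480,
}
BAD_ROW = dict(GOOD_ROW, vae_latent_shape=[16, 8, 60], caption="", width=0)
```

In the test this could be used as `errors = collect_row_errors(table.to_pylist())` followed by `assert not errors, "\n".join(errors)`.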

Collaborator

@Eigensystem Eigensystem left a comment

Looks great! Could you also add preprocessing tests for the new preprocessing pipeline? You could compare the output of the old preprocessing pipeline with that of the new one. The entry point for the new pipeline is fastvideo/pipelines/preprocess/v1_preprocess_new.py
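
One possible shape for such a comparison, sketched under several assumptions: both pipelines emit the columns validated in this PR, captions are unique within the sample, and rows are loaded as lists of dicts (for example via to_pylist()). The function name compare_outputs is hypothetical. Latent and embedding bytes are deliberately left out, since the two pipelines might not be bit-identical.

```python
# Metadata fields assumed to be directly comparable across pipelines.
COMPARED_FIELDS = ("vae_latent_shape", "media_type", "width", "height")


def compare_outputs(old_rows, new_rows):
    """Return a list of differences between two preprocessing outputs.

    Hypothetical helper. Rows are matched by caption so that row order
    does not matter; this assumes captions are unique.
    """
    if len(old_rows) != len(new_rows):
        return [f"row count differs: {len(old_rows)} vs {len(new_rows)}"]

    new_by_caption = {row["caption"]: row for row in new_rows}
    diffs = []
    for old in old_rows:
        new = new_by_caption.get(old["caption"])
        if new is None:
            diffs.append(
                f"caption missing from new output: {old['caption']!r}")
            continue
        for field in COMPARED_FIELDS:
            if old[field] != new[field]:
                diffs.append(
                    f"{old['caption']!r}: {field} differs, "
                    f"{old[field]!r} vs {new[field]!r}")
    return diffs
```

An empty result means the compared metadata agrees; comparing the decoded latents with a numeric tolerance would be a separate step.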

@AjAnubolu added the "go" (Trigger Buildkite CI) label on Mar 3, 2026
Labels

go Trigger Buildkite CI

2 participants