[CI] Add preprocessing pipeline tests #1152
AjAnubolu wants to merge 7 commits into hao-ai-lab:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request strengthens the project's continuous integration by adding a comprehensive end-to-end test for the Text-to-Video (T2V) preprocessing pipeline. The new test verifies that the pipeline correctly processes video and caption data into validated Parquet output, improving reliability and preventing regressions in this critical data-preparation step. Because the test is wired into Buildkite and Modal, it runs automatically whenever the preprocessing code paths change, providing immediate feedback on the health of the system.
Activity
Code Review
This pull request introduces an end-to-end CI test for the T2V preprocessing pipeline, which is a valuable addition for ensuring the stability of this critical data processing path. The changes correctly configure the new test in Buildkite and Modal. The test script itself is well-structured, covering data download, pipeline execution, and validation of the output. I have provided a few suggestions to improve the test script's robustness and align it with best practices.
```python
        local_dir=str(RAW_DATA_DIR),
        repo_type="dataset",
        resume_download=True,
        token=os.environ.get("HF_TOKEN"),
```
The CI environment sets the Hugging Face token in the HF_API_KEY environment variable, but the code is attempting to access HF_TOKEN. This will likely cause authentication to fail when downloading the test dataset. Please use HF_API_KEY to ensure consistency with the CI configuration.
```diff
-        token=os.environ.get("HF_TOKEN"),
+        token=os.environ.get("HF_API_KEY"),
```
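If both variable names remain in circulation (CI vs. local runs), a small helper that prefers `HF_API_KEY` and falls back to `HF_TOKEN` avoids breaking either environment. This is only a sketch; the helper name `get_hf_token` is hypothetical, and the variable names come from the review comment above.

```python
import os
from typing import Optional


def get_hf_token() -> Optional[str]:
    """Return the Hugging Face token, preferring the CI's HF_API_KEY
    and falling back to HF_TOKEN for local development runs."""
    return os.environ.get("HF_API_KEY") or os.environ.get("HF_TOKEN")
```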
```python
        "--model_path",
        MODEL_PATH,
        "--data_merge_path",
        os.path.join(RAW_DATA_DIR, "merge_1_sample.txt"),
```
The file primarily uses pathlib.Path for path manipulations, which is great for readability and cross-platform compatibility. For consistency, it would be better to use pathlib's / operator here as well, converting to a string only when passing it to the subprocess call.
```diff
-        os.path.join(RAW_DATA_DIR, "merge_1_sample.txt"),
+        str(RAW_DATA_DIR / "merge_1_sample.txt"),
```
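To illustrate the pattern: `pathlib`'s `/` operator composes path segments, and `str()` is applied only at the boundary where an API (such as a subprocess argv list) requires a string. The `RAW_DATA_DIR` value below is a stand-in, not the test's actual constant.

```python
from pathlib import Path

# Stand-in for the test script's RAW_DATA_DIR constant.
RAW_DATA_DIR = Path("data") / "raw"

# Compose the path with the / operator; no manual separator handling.
merge_path = RAW_DATA_DIR / "merge_1_sample.txt"

# Convert to str only where required, e.g. when building subprocess args.
argv = ["--data_merge_path", str(merge_path)]
```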
```python
    for i in range(table.num_rows):
        row = {
            col: table.column(col)[i].as_py()
            for col in EXPECTED_T2V_COLUMNS
        }

        # VAE latent
        assert len(row["vae_latent_bytes"]) > 0, (
            f"Row {i}: vae_latent_bytes is empty")
        assert len(row["vae_latent_shape"]) == 4, (
            f"Row {i}: vae_latent_shape should have 4 elements "
            f"(C,T,H,W), got {row['vae_latent_shape']}")

        # Text embedding
        assert len(row["text_embedding_bytes"]) > 0, (
            f"Row {i}: text_embedding_bytes is empty")

        # Caption
        assert isinstance(row["caption"], str) and row["caption"], (
            f"Row {i}: caption is empty or not a string")

        # Media type
        assert row["media_type"] == "video", (
            f"Row {i}: expected media_type='video', "
            f"got '{row['media_type']}'")

        # Dimensions
        assert row["width"] > 0, (
            f"Row {i}: width must be positive, got {row['width']}")
        assert row["height"] > 0, (
            f"Row {i}: height must be positive, got {row['height']}")
```
The current method of iterating through the table by building a row dictionary inside the loop is inefficient, as it accesses each column individually for every row. A more idiomatic and performant approach with pyarrow is to convert the entire table to a list of Python dictionaries using table.to_pylist() before the loop. This improves both readability and performance.
```python
    for i, row in enumerate(table.to_pylist()):
        # VAE latent
        assert len(row["vae_latent_bytes"]) > 0, (
            f"Row {i}: vae_latent_bytes is empty")
        assert len(row["vae_latent_shape"]) == 4, (
            f"Row {i}: vae_latent_shape should have 4 elements "
            f"(C,T,H,W), got {row['vae_latent_shape']}")

        # Text embedding
        assert len(row["text_embedding_bytes"]) > 0, (
            f"Row {i}: text_embedding_bytes is empty")

        # Caption
        assert isinstance(row["caption"], str) and row["caption"], (
            f"Row {i}: caption is empty or not a string")

        # Media type
        assert row["media_type"] == "video", (
            f"Row {i}: expected media_type='video', "
            f"got '{row['media_type']}'")

        # Dimensions
        assert row["width"] > 0, (
            f"Row {i}: width must be positive, got {row['width']}")
        assert row["height"] > 0, (
            f"Row {i}: height must be positive, got {row['height']}")
```
Eigensystem left a comment:
Looks great! Could you also add preprocessing tests for the new preprocessing pipeline? You can compare the results of the old preprocessing pipeline with the new one. The entry point for the new preprocessing is fastvideo/pipelines/preprocess/v1_preprocess_new.py