Nvtx tracer #1087

Open
Ohm-Rishabh wants to merge 6 commits into hao-ai-lab:main from Ohm-Rishabh:nvtx_tracer
Conversation

@Ohm-Rishabh
Contributor

This PR adds NVTX range markers to the codebase.

To generate a trace:

    python fastvideo/tests/training/Vanilla/mfu_calculation.py --profile

To add a custom range:

    with nvtx_range("range name"):
        # ...code block...
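The usage pattern above can be sketched as follows. The import path `fastvideo.profiler` is taken from this PR's changelog; the fallback no-op is only there so the snippet runs without fastvideo or CUDA installed, and the `train_step` body is a made-up stand-in for real model code.

```python
from contextlib import contextmanager

try:
    from fastvideo.profiler import nvtx_range  # utility added in this PR
except ImportError:
    @contextmanager
    def nvtx_range(name):
        # Fallback: no NVTX bindings available, run the block unchanged.
        yield

def train_step(x):
    # Each phase gets its own named range, visible as a bar in Nsight Systems.
    with nvtx_range("forward"):
        y = x * 2
    with nvtx_range("loss"):
        loss = y - 1
    return loss

print(train_step(3))  # → 5
```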

@gemini-code-assist
Contributor

Summary of Changes

Hello @Ohm-Rishabh, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the profiling capabilities of the codebase by integrating NVTX range markers into key computational graphs and the training loop. This allows developers to gain deeper insights into performance bottlenecks using NVIDIA's profiling tools. Additionally, the changes include updates to a training test script to leverage these new profiling features and minor cleanups within the training pipeline for improved stability and data logging.

Highlights

  • NVTX Profiling Integration: Introduced NVTX (NVIDIA Tools Extension) range markers across critical components of the model architecture and training pipeline, enabling detailed performance profiling with tools like Nsight Systems.
  • New NVTX Utilities: Added dedicated nvtx_range, nvtx_annotate, and nvtx_mark utilities in fastvideo/profiler.py to provide flexible and independent NVTX tracing capabilities.
  • MFU Calculation Test Enhancements: Updated the MFU calculation test script to support Nsight profiling, adjusted training parameters (e.g., batch size, max steps), and refined environment variable settings for more robust testing and profiling.
  • Training Pipeline Refinements: Streamlined the training pipeline by removing audio muxing logic and improving the robustness of metric logging, ensuring accurate capture of model architecture details.

Changelog
  • examples/training/finetune/wan_t2v_1.3B/crush_smol/preprocess_wan_data_t2v.sh
    • Updated GPU assignment and video batch size for preprocessing script.
  • fastvideo/attention/layer.py
    • Imported nvtx_range for profiling.
    • Wrapped QKV stacking, all-to-all scatter, rotary embedding application, QKV preprocessing, attention implementation, and all-to-all gather operations with NVTX ranges in DistributedAttention.forward.
    • Wrapped rotary embedding application and attention implementation with NVTX ranges in LocalAttention.forward.
  • fastvideo/layers/layernorm.py
    • Imported nvtx_range for profiling.
    • Wrapped RMSNorm.forward_native with an NVTX range.
    • Wrapped ScaleResidual.forward with an NVTX range.
    • Wrapped FP32LayerNorm.forward with an NVTX range.
    • Wrapped ScaleResidualLayerNormScaleShift.forward with an NVTX range.
    • Wrapped LayerNormScaleShift.forward with an NVTX range.
  • fastvideo/layers/linear.py
    • Imported nvtx_range for profiling.
    • Wrapped the quant_method.apply call in ReplicatedLinear.forward with an NVTX range.
  • fastvideo/layers/mlp.py
    • Imported nvtx_range for profiling.
    • Wrapped fc_in, act, and fc_out calls in MLP.forward with NVTX ranges.
  • fastvideo/models/dits/wanvideo.py
    • Imported nvtx_range for profiling.
    • Wrapped KV computation, attention, and output projection in WanT2VCrossAttention.forward with NVTX ranges.
    • Wrapped scale shift table processing, QKV computation, self-attention, cross-attention, feed-forward, and MLP residual operations in WanTransformerBlock.forward with NVTX ranges.
    • Wrapped rotary position embedding, patch embedding, sequence model parallel shard, attention mask creation, condition embedder, transformer block iterations, output normalization, all-gather with unpad, and projection out operations in WanTransformer3DModel.forward with NVTX ranges.
  • fastvideo/models/loader/component_loader.py
    • Updated the pipeline configuration's DIT config to ensure downstream code can access the actual model architecture.
  • fastvideo/profiler.py
    • Integrated torch.cuda.nvtx.range_push and range_pop into the ProfilerController.region context manager.
    • Added new standalone NVTX utilities: nvtx_range (context manager), nvtx_annotate (decorator), and nvtx_mark (instantaneous marker).
  • fastvideo/tests/training/Vanilla/mfu_calculation.py
    • Updated DATA_PATH to a combined parquet dataset.
    • Changed MASTER_PORT and added CUDA_VISIBLE_DEVICES and FASTVIDEO_ATTENTION_BACKEND environment variables.
    • Adjusted train_batch_size and max_train_steps for testing.
    • Added logging for training metrics including batch size, sequence length, context length, step time, hidden dimension, number of layers, and FFN dimension.
    • Modified test_distributed_training to accept a profile argument and integrated nsys profile command for Nsight tracing.
  • fastvideo/training/training_pipeline.py
    • Imported nvtx_range for profiling.
    • Removed shutil and tempfile imports.
    • Wrapped transformer_forward, loss computation (sharded and non-sharded), backward pass, reduce_loss, clip_grad_norm, optimizer_step, and the main train_step loop with NVTX ranges.
    • Removed audio muxing functionality (_mux_audio method and its calls).
    • Adjusted the logic for logging visualization and validation steps.
    • Removed the try-except block around metric logging, ensuring all metrics are consistently captured.
Activity
  • The pull request introduces NVTX range markers and profiling utilities to the codebase.
  • The author, Ohm-Rishabh, has integrated these markers into various core components and the training pipeline.
  • The MFU calculation test script has been updated to utilize the new profiling capabilities and adjusted for specific training parameters.
  • Minor refactorings and cleanups were performed in the training pipeline, including the removal of audio muxing.
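
The nsys integration mentioned above (wrapping the training command when --profile is passed) might be structured like this. The `train.py` target and the specific nsys flags are illustrative assumptions, not the PR's literal command line.

```python
import sys

def build_command(profile: bool):
    """Return the training command, optionally prefixed with nsys profile."""
    # Hypothetical training entry point; the PR invokes its own test script.
    base = [sys.executable, "train.py"]
    if profile:
        # Prefix with Nsight Systems so the whole run is traced to trace.nsys-rep.
        return ["nsys", "profile", "-o", "trace", "--force-overwrite", "true"] + base
    return base

cmd = build_command(profile=True)
print(cmd[0])  # → nsys
```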
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces NVTX range markers throughout the codebase to improve profiling capabilities with NVIDIA Nsight Systems. The changes are well-implemented, adding a new nvtx_range utility and applying it to key performance-critical sections in various models and layers. This will provide valuable, fine-grained insights into GPU execution.

My review includes a few suggestions to improve the portability of example and test scripts by removing hardcoded GPU device IDs. I also noted some leftover debug code and a significant out-of-scope change (removal of audio muxing logic) that would be better handled in a separate pull request to maintain clarity and focus.

I was unable to create individual review comments, so my feedback is collected below.

examples/training/finetune/wan_t2v_1.3B/crush_smol/preprocess_wan_data_t2v.sh (line 4)

medium

Hardcoding CUDA_VISIBLE_DEVICES makes this example script difficult to run on different machines where GPU 2 might not be available or suitable. It would be more robust to either expect the user to set this environment variable before running the script or to parameterize it.

# Set the CUDA_VISIBLE_DEVICES environment variable to select a specific GPU, e.g.:
# export CUDA_VISIBLE_DEVICES=0

fastvideo/tests/training/Vanilla/mfu_calculation.py (line 36)

medium

Hardcoding CUDA_VISIBLE_DEVICES in a test script reduces its portability and can cause failures on systems where the specified GPUs (5, 6) are not available. It's better to let the user or the test environment configure this. Please remove this line and let the environment (e.g., a CI runner or the user's shell) control which GPUs are visible to the script.
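One portable way to implement this suggestion in the test script is to respect an externally provided CUDA_VISIBLE_DEVICES and only fall back to a default. The default of "0" here is an illustrative choice, not something the PR specifies.

```python
import os

# Keep the caller's GPU selection if set; otherwise fall back to GPU 0.
# setdefault only writes the key when it is absent from the environment.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

This lets a CI runner or a developer's shell export CUDA_VISIBLE_DEVICES before launching the script, while standalone runs still work out of the box.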

fastvideo/training/training_pipeline.py (line 835)

medium

This commented-out return statement appears to be a leftover from a debugging session. It should be removed to keep the code clean.

fastvideo/training/training_pipeline.py (lines 951-1042)

medium

The removal of the _mux_audio static method and its related logic is a significant change. While this might be a valid cleanup, it seems unrelated to the main purpose of this pull request, which is to add NVTX tracers. Including unrelated changes makes the PR harder to review and understand. It would be better to submit this change in a separate PR with a descriptive title and explanation.

@SolitaryThinker SolitaryThinker added the go Trigger Buildkite CI label Feb 26, 2026