
Add perf model coverage for TE _Linear, _LayerNormLinear, and LayerNormFn ops#548

Open
gphuang wants to merge 4 commits into main from 538-te-linear-layernorm-perf-models

Conversation


@gphuang gphuang commented Mar 19, 2026

Summary

Closes #538 (sub-issue of #516)

Adds performance models (GFLOPS + TB/s) for TransformerEngine's fused linear ops that appear in Primus traces:

  • te_linear(GEMM): forward perf model for _Linear — extracts M, N, K from the weight and input shapes in the trace event
  • te_layer_norm_linear(GEMM): forward perf model for _LayerNormLinear — same GEMM modeling, ignores the LayerNorm contribution (negligible relative to GEMM)
  • te_layer_norm_fn(Normalization): TB/s model for standalone LayerNormFn
  • Backward ops (_LinearBackward, _LayerNormLinearBackward, LayerNormFnBackward) are categorized correctly but do not have standalone GFLOPS (backward traces lack weight shape; use --enable_pseudo_ops for decomposed GFLOPS)
  • Fixes categorize_torch_op to also check dict_cat2names["Normalization"] so dynamically mapped norm ops like LayerNormFn are categorized as NORM_fwd instead of other

Models benefiting: Qwen3-8B, Zebra-Llama-1B, Kimi-K2 (any TE-based model)
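The forward GEMM model described above can be sketched as follows. This is an illustrative stand-alone function, not the actual TraceLens class: the real `te_linear` perf model extracts M, N, K from the trace event's shape metadata, and TE stores weights as (out_features, in_features).

```python
def te_linear_flops(input_shape, weight_shape):
    """Forward GEMM FLOPS for a TE _Linear-style op (illustrative sketch).

    input_shape:  (M, K) activation shape from the trace event
    weight_shape: (N, K) weight shape, (out_features, in_features)
    """
    m, k = input_shape
    n, k_w = weight_shape
    assert k == k_w, "inner (K) dimensions must match"
    # One multiply plus one add per multiply-accumulate
    return 2 * m * n * k

# Example: M=8192 tokens through a 4096x4096 projection
flops = te_linear_flops((8192, 4096), (4096, 4096))
```

The same formula applies to `_LayerNormLinear`, since the LayerNorm FLOPs are negligible next to the GEMM.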

Test plan

  • 17 new unit tests in tests/test_te_linear_ops.py:
    • Mapping tests: _Linear, _LayerNormLinear, LayerNormFn registered in op_to_perf_model_class_map
    • Categorization tests: forward ops → GEMM/NORM_fwd, backward ops → GEMM/NORM_bwd
    • FLOPS tests: verify 2*M*N*K for symmetric and asymmetric GEMM shapes
    • Bytes tests: verify bf16 data movement calculation
    • Normalization model instantiation and bytes
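The FLOPS tests above take roughly this shape; the real suite in tests/test_te_linear_ops.py exercises the actual te_linear class rather than this stand-in helper, and the asymmetric shape here is just an example.

```python
def gemm_flops(m, n, k):
    # 2*M*N*K: one multiply + one add per MAC
    return 2 * m * n * k

def test_flops_symmetric():
    # symmetric GEMM: M = N = K
    assert gemm_flops(1024, 1024, 1024) == 2 * 1024**3

def test_flops_asymmetric():
    # asymmetric GEMM, e.g. an MLP up-projection
    assert gemm_flops(8192, 11008, 4096) == 2 * 8192 * 11008 * 4096
```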

@gphuang gphuang self-assigned this Mar 19, 2026
@gphuang gphuang changed the title from "Add perf model coverage for TE _Linear and _LayerNormLinear fused ops" to "Add perf model coverage for TE _Linear, _LayerNormLinear, and LayerNormFn ops" Mar 19, 2026
@gphuang gphuang added the perf_model Add performance model for calculating TFLOPS/s and TB/s label Mar 19, 2026
@gphuang gphuang force-pushed the 538-te-linear-layernorm-perf-models branch from 9abc098 to 7b99c81 Compare March 19, 2026 12:07
@gphuang gphuang marked this pull request as ready for review March 19, 2026 13:36
Copilot AI review requested due to automatic review settings March 19, 2026 13:36

Copilot AI left a comment


Pull request overview

Adds TraceLens performance-model coverage and categorization for TransformerEngine fused linear and LayerNorm ops seen in Primus traces, improving GEMM/NORM attribution and enabling GFLOPS/TB/s reporting where shape metadata is available.

Changes:

  • Added new perf-model classes for TE _Linear, _LayerNormLinear (GEMM-based) and LayerNormFn (Normalization-based).
  • Registered TE ops in op_to_perf_model_class_map and updated categorize_torch_op so dynamically mapped normalization ops categorize as NORM_* (plus TE backward-name special-cases).
  • Added a dedicated unit test suite validating mapping, categorization, and basic FLOPs/bytes calculations for the new TE models.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
TraceLens/PerfModel/perf_model.py Introduces new TE perf-model classes (te_linear, te_layer_norm_linear, te_layer_norm_fn).
TraceLens/PerfModel/torch_op_mapping.py Maps TE op names to perf models and adjusts categorization logic for normalization + TE backward names.
tests/test_te_linear_ops.py Adds unit tests for TE mapping/categorization and FLOPs/bytes computations.


gphuang added a commit that referenced this pull request Mar 19, 2026
Addresses Copilot Round 1 review on PR #548:
- te_linear.bytes(): bpe_mat2 now uses weight dtype (dtype_A_B[1])
  instead of activation dtype, fixing bytes for mixed-precision configs
- te_layer_norm_linear.bytes(): same fix
- Added mixed-precision unit tests (FP8 weight + BF16 activation)

Made-with: Cursor
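The data-movement fix above can be sketched like this: each operand gets its own bytes-per-element, so FP8 weights with BF16 activations are counted correctly. Names and the parameter layout are illustrative, not the actual TraceLens API.

```python
def gemm_bytes(m, n, k, bpe_act=2, bpe_weight=2, bpe_out=2):
    """Data movement for an MxK @ KxN GEMM with per-operand dtypes."""
    read_a = m * k * bpe_act      # activations (e.g. BF16 -> 2 bytes/element)
    read_b = k * n * bpe_weight   # weights (e.g. FP8 -> 1 byte, post-fix)
    write_c = m * n * bpe_out     # output
    return read_a + read_b + write_c

# Mixed precision: BF16 activations, FP8 weights, BF16 output
mixed = gemm_bytes(8192, 4096, 4096, bpe_act=2, bpe_weight=1, bpe_out=2)
```

Before the fix, `bpe_weight` was taken from the activation dtype, inflating the byte count for FP8-weight configurations.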
@gphuang gphuang force-pushed the 538-te-linear-layernorm-perf-models branch from 95ec22c to cc63818 Compare March 19, 2026 14:24
@gphuang gphuang requested a review from Copilot March 19, 2026 14:32

Copilot AI left a comment


Pull request overview

Adds TransformerEngine (TE) fused linear and standalone LayerNorm performance-model coverage to TraceLens so TE-heavy Primus traces can report meaningful GFLOPS/TB/s instead of falling into “other”.

Changes:

  • Add new perf model classes for TE _Linear, _LayerNormLinear (GEMM-based) and LayerNormFn (Normalization-based).
  • Register TE ops in op_to_perf_model_class_map and adjust categorize_torch_op so LayerNormFn* is categorized as normalization and TE backward ops are categorized consistently.
  • Add a dedicated unit test suite covering mapping, categorization, FLOPs, and bytes for the new TE models.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
TraceLens/PerfModel/perf_model.py Introduces te_linear, te_layer_norm_linear, and te_layer_norm_fn perf model classes.
TraceLens/PerfModel/torch_op_mapping.py Maps TE op names to the new perf models and updates categorization logic for normalization and TE backward ops.
tests/test_te_linear_ops.py Adds unit tests validating mapping, categorization, and GEMM FLOPs/bytes calculations for TE ops.


gphuang added 3 commits March 19, 2026 14:53
Closes #538

- te_linear(GEMM): full GFLOPS + TB/s for _Linear forward
- te_layer_norm_linear(GEMM): full GFLOPS + TB/s for _LayerNormLinear forward
- te_layer_norm_fn(Normalization): TB/s model for standalone LayerNormFn
- Backward ops categorized correctly via categorize_torch_op()
- Fixes categorize_torch_op to check dict_cat2names["Normalization"]

Made-with: Cursor
Addresses Copilot Round 1 review on PR #548:
- te_linear.bytes(): bpe_mat2 now uses weight dtype (dtype_A_B[1])
  instead of activation dtype, fixing bytes for mixed-precision configs
- te_layer_norm_linear.bytes(): same fix
- Added mixed-precision unit tests (FP8 weight + BF16 activation)

Made-with: Cursor
…ation base

Copilot Round 2: the docstring said "TB/s only — no significant FLOPS"
but the class inherits Normalization.flops() which returns non-zero FLOPS,
consistent with all other Normalization subclasses.  Updated to say
"memory-bound; reports both FLOPS and TB/s".

Made-with: Cursor
@gphuang gphuang force-pushed the 538-te-linear-layernorm-perf-models branch from 6acf8d1 to 7e11da8 Compare March 19, 2026 14:57
@gphuang gphuang requested a review from Copilot March 19, 2026 15:00

Copilot AI left a comment


Pull request overview

Adds performance-model coverage for TransformerEngine fused linear and LayerNorm ops seen in Primus traces, improving TraceLens’ GFLOPS/TB/s reporting and reducing “other” categorization.

Changes:

  • Add new perf model classes for TE _Linear, _LayerNormLinear (GEMM-based) and LayerNormFn (Normalization-based).
  • Extend op_to_perf_model_class_map and categorize_torch_op to correctly map/categorize TE forward + backward op names (including dynamic Normalization categorization).
  • Add a new unit test suite covering TE mapping, categorization, and GEMM FLOPs/bytes (including mixed precision).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
TraceLens/PerfModel/perf_model.py Introduces te_linear, te_layer_norm_linear, and te_layer_norm_fn perf model implementations.
TraceLens/PerfModel/torch_op_mapping.py Registers new TE ops and adjusts categorization logic for Normalization + TE backward names.
tests/test_te_linear_ops.py Adds unit tests for TE op mapping/categorization and GEMM FLOPs/bytes calculations.


- test_te_layer_norm_fn_bytes: assert exact expected bytes instead of > 0
- te_layer_norm_fn docstring: document optional Input[2] = beta
- torch_op_mapping.py comment: fix to say memory-bound with both FLOPS and TB/s

Made-with: Cursor
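The exact-bytes assertion for the LayerNorm model might look like the sketch below. This is an assumed form of the data-movement accounting, including the optional beta input documented above; the real te_layer_norm_fn may count operands differently.

```python
def layer_norm_bytes(n_tokens, hidden, bpe=2, has_beta=True):
    """Memory traffic for a standalone LayerNorm (illustrative sketch)."""
    read_x = n_tokens * hidden * bpe              # input activations
    read_gamma = hidden * bpe                     # scale weight (gamma)
    read_beta = hidden * bpe if has_beta else 0   # optional bias, Input[2]
    write_y = n_tokens * hidden * bpe             # output activations
    return read_x + read_gamma + read_beta + write_y
```

Asserting the exact total (rather than `> 0`) catches accidental double-counting or a dropped operand.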
@gphuang gphuang requested review from ajassani and olehtika March 20, 2026 06:28


Development

Successfully merging this pull request may close these issues.

Add linear category for fused Linear/LayerNorm ops