
Add perf model coverage for TE _Linear, _LayerNormLinear, and LayerNormFn ops#548

Open
gphuang wants to merge 4 commits into main from 538-te-linear-layernorm-perf-models

Conversation


@gphuang gphuang commented Mar 19, 2026

Summary

Closes #538 (sub-issue of #516)

Adds performance models (GFLOPS + TB/s) for TransformerEngine's fused linear ops that appear in Primus traces:

  • te_linear(GEMM): forward perf model for _Linear — extracts M, N, K from the weight and input shapes in the trace event
  • te_layer_norm_linear(GEMM): forward perf model for _LayerNormLinear — same GEMM modeling, ignores the LayerNorm contribution (negligible relative to GEMM)
  • te_layer_norm_fn(Normalization): TB/s model for standalone LayerNormFn
  • Backward ops (_LinearBackward, _LayerNormLinearBackward, LayerNormFnBackward) are categorized correctly but do not have standalone GFLOPS (backward traces lack weight shape; use --enable_pseudo_ops for decomposed GFLOPS)
  • Fixes categorize_torch_op to also check dict_cat2names["Normalization"] so dynamically mapped norm ops like LayerNormFn are categorized as NORM_fwd instead of other

Models benefiting: Qwen3-8B, Zebra-Llama-1B, Kimi-K2 (any TE-based model)
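The forward GEMM model described above can be sketched as follows. This is an illustrative stand-alone function, not the actual TraceLens class: the real `te_linear` perf model extracts M, N, K from the trace event's shape metadata, and TE stores weights as (out_features, in_features).

```python
def te_linear_flops(input_shape, weight_shape):
    """Forward GEMM FLOPS for a TE _Linear-style op (illustrative sketch).

    input_shape:  (M, K) activation shape from the trace event
    weight_shape: (N, K) weight shape, (out_features, in_features)
    """
    m, k = input_shape
    n, k_w = weight_shape
    assert k == k_w, "inner (K) dimensions must match"
    # One multiply plus one add per multiply-accumulate
    return 2 * m * n * k

# Example: M=8192 tokens through a 4096x4096 projection
flops = te_linear_flops((8192, 4096), (4096, 4096))
```

The same formula applies to `_LayerNormLinear`, since the LayerNorm FLOPs are negligible next to the GEMM.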

Test plan

  • 17 new unit tests in tests/test_te_linear_ops.py:
    • Mapping tests: _Linear, _LayerNormLinear, LayerNormFn registered in op_to_perf_model_class_map
    • Categorization tests: forward ops → GEMM/NORM_fwd, backward ops → GEMM/NORM_bwd
    • FLOPS tests: verify 2*M*N*K for symmetric and asymmetric GEMM shapes
    • Bytes tests: verify bf16 data movement calculation
    • Normalization model instantiation and bytes
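The FLOPS tests above take roughly this shape; the real suite in tests/test_te_linear_ops.py exercises the actual te_linear class rather than this stand-in helper, and the asymmetric shape here is just an example.

```python
def gemm_flops(m, n, k):
    # 2*M*N*K: one multiply + one add per MAC
    return 2 * m * n * k

def test_flops_symmetric():
    # symmetric GEMM: M = N = K
    assert gemm_flops(1024, 1024, 1024) == 2 * 1024**3

def test_flops_asymmetric():
    # asymmetric GEMM, e.g. an MLP up-projection
    assert gemm_flops(8192, 11008, 4096) == 2 * 8192 * 11008 * 4096
```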

@gphuang gphuang self-assigned this Mar 19, 2026
@gphuang gphuang changed the title from "Add perf model coverage for TE _Linear and _LayerNormLinear fused ops" to "Add perf model coverage for TE _Linear, _LayerNormLinear, and LayerNormFn ops" Mar 19, 2026
@gphuang gphuang added the perf_model Add performance model for calculating TFLOPS/s and TB/s label Mar 19, 2026
@gphuang gphuang force-pushed the 538-te-linear-layernorm-perf-models branch from 9abc098 to 7b99c81 Compare March 19, 2026 12:07
@gphuang gphuang marked this pull request as ready for review March 19, 2026 13:36
Copilot AI review requested due to automatic review settings March 19, 2026 13:36

Copilot AI left a comment


Pull request overview

Adds TraceLens performance-model coverage and categorization for TransformerEngine fused linear and LayerNorm ops seen in Primus traces, improving GEMM/NORM attribution and enabling GFLOPS/TB/s reporting where shape metadata is available.

Changes:

  • Added new perf-model classes for TE _Linear, _LayerNormLinear (GEMM-based) and LayerNormFn (Normalization-based).
  • Registered TE ops in op_to_perf_model_class_map and updated categorize_torch_op so dynamically mapped normalization ops categorize as NORM_* (plus TE backward-name special-cases).
  • Added a dedicated unit test suite validating mapping, categorization, and basic FLOPs/bytes calculations for the new TE models.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
TraceLens/PerfModel/perf_model.py Introduces new TE perf-model classes (te_linear, te_layer_norm_linear, te_layer_norm_fn).
TraceLens/PerfModel/torch_op_mapping.py Maps TE op names to perf models and adjusts categorization logic for normalization + TE backward names.
tests/test_te_linear_ops.py Adds unit tests for TE mapping/categorization and FLOPs/bytes computations.


gphuang added a commit that referenced this pull request Mar 19, 2026
Addresses Copilot Round 1 review on PR #548:
- te_linear.bytes(): bpe_mat2 now uses weight dtype (dtype_A_B[1])
  instead of activation dtype, fixing bytes for mixed-precision configs
- te_layer_norm_linear.bytes(): same fix
- Added mixed-precision unit tests (FP8 weight + BF16 activation)

Made-with: Cursor
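The data-movement fix above can be sketched like this: each operand gets its own bytes-per-element, so FP8 weights with BF16 activations are counted correctly. Names and the parameter layout are illustrative, not the actual TraceLens API.

```python
def gemm_bytes(m, n, k, bpe_act=2, bpe_weight=2, bpe_out=2):
    """Data movement for an MxK @ KxN GEMM with per-operand dtypes."""
    read_a = m * k * bpe_act      # activations (e.g. BF16 -> 2 bytes/element)
    read_b = k * n * bpe_weight   # weights (e.g. FP8 -> 1 byte, post-fix)
    write_c = m * n * bpe_out     # output
    return read_a + read_b + write_c

# Mixed precision: BF16 activations, FP8 weights, BF16 output
mixed = gemm_bytes(8192, 4096, 4096, bpe_act=2, bpe_weight=1, bpe_out=2)
```

Before the fix, `bpe_weight` was taken from the activation dtype, inflating the byte count for FP8-weight configurations.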
@gphuang gphuang force-pushed the 538-te-linear-layernorm-perf-models branch from 95ec22c to cc63818 Compare March 19, 2026 14:24
@gphuang gphuang requested a review from Copilot March 19, 2026 14:32

Copilot AI left a comment


Pull request overview

Adds TransformerEngine (TE) fused linear and standalone LayerNorm performance-model coverage to TraceLens so TE-heavy Primus traces can report meaningful GFLOPS/TB/s instead of falling into “other”.

Changes:

  • Add new perf model classes for TE _Linear, _LayerNormLinear (GEMM-based) and LayerNormFn (Normalization-based).
  • Register TE ops in op_to_perf_model_class_map and adjust categorize_torch_op so LayerNormFn* is categorized as normalization and TE backward ops are categorized consistently.
  • Add a dedicated unit test suite covering mapping, categorization, FLOPs, and bytes for the new TE models.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
TraceLens/PerfModel/perf_model.py Introduces te_linear, te_layer_norm_linear, and te_layer_norm_fn perf model classes.
TraceLens/PerfModel/torch_op_mapping.py Maps TE op names to the new perf models and updates categorization logic for normalization and TE backward ops.
tests/test_te_linear_ops.py Adds unit tests validating mapping, categorization, and GEMM FLOPs/bytes calculations for TE ops.


gphuang added 3 commits March 19, 2026 14:53
Closes #538

- te_linear(GEMM): full GFLOPS + TB/s for _Linear forward
- te_layer_norm_linear(GEMM): full GFLOPS + TB/s for _LayerNormLinear forward
- te_layer_norm_fn(Normalization): TB/s model for standalone LayerNormFn
- Backward ops categorized correctly via categorize_torch_op()
- Fixes categorize_torch_op to check dict_cat2names["Normalization"]

Made-with: Cursor
Addresses Copilot Round 1 review on PR #548:
- te_linear.bytes(): bpe_mat2 now uses weight dtype (dtype_A_B[1])
  instead of activation dtype, fixing bytes for mixed-precision configs
- te_layer_norm_linear.bytes(): same fix
- Added mixed-precision unit tests (FP8 weight + BF16 activation)

Made-with: Cursor
…ation base

Copilot Round 2: the docstring said "TB/s only — no significant FLOPS"
but the class inherits Normalization.flops() which returns non-zero FLOPS,
consistent with all other Normalization subclasses.  Updated to say
"memory-bound; reports both FLOPS and TB/s".

Made-with: Cursor
@gphuang gphuang force-pushed the 538-te-linear-layernorm-perf-models branch from 6acf8d1 to 7e11da8 Compare March 19, 2026 14:57
@gphuang gphuang requested a review from Copilot March 19, 2026 15:00

Copilot AI left a comment


Pull request overview

Adds performance-model coverage for TransformerEngine fused linear and LayerNorm ops seen in Primus traces, improving TraceLens’ GFLOPS/TB/s reporting and reducing “other” categorization.

Changes:

  • Add new perf model classes for TE _Linear, _LayerNormLinear (GEMM-based) and LayerNormFn (Normalization-based).
  • Extend op_to_perf_model_class_map and categorize_torch_op to correctly map/categorize TE forward + backward op names (including dynamic Normalization categorization).
  • Add a new unit test suite covering TE mapping, categorization, and GEMM FLOPs/bytes (including mixed precision).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
TraceLens/PerfModel/perf_model.py Introduces te_linear, te_layer_norm_linear, and te_layer_norm_fn perf model implementations.
TraceLens/PerfModel/torch_op_mapping.py Registers new TE ops and adjusts categorization logic for Normalization + TE backward names.
tests/test_te_linear_ops.py Adds unit tests for TE op mapping/categorization and GEMM FLOPs/bytes calculations.


- test_te_layer_norm_fn_bytes: assert exact expected bytes instead of > 0
- te_layer_norm_fn docstring: document optional Input[2] = beta
- torch_op_mapping.py comment: fix to say memory-bound with both FLOPS and TB/s

Made-with: Cursor
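The exact-bytes assertion for the LayerNorm model might look like the sketch below. This is an assumed form of the data-movement accounting, including the optional beta input documented above; the real te_layer_norm_fn may count operands differently.

```python
def layer_norm_bytes(n_tokens, hidden, bpe=2, has_beta=True):
    """Memory traffic for a standalone LayerNorm (illustrative sketch)."""
    read_x = n_tokens * hidden * bpe              # input activations
    read_gamma = hidden * bpe                     # scale weight (gamma)
    read_beta = hidden * bpe if has_beta else 0   # optional bias, Input[2]
    write_y = n_tokens * hidden * bpe             # output activations
    return read_x + read_gamma + read_beta + write_y
```

Asserting the exact total (rather than `> 0`) catches accidental double-counting or a dropped operand.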
@gphuang gphuang requested review from ajassani and olehtika March 20, 2026 06:28


Development

Successfully merging this pull request may close these issues.

Add linear category for fused Linear/LayerNorm ops