Add perf model coverage for TE _Linear, _LayerNormLinear, and LayerNormFn ops#548
Conversation
force-pushed from 9abc098 to 7b99c81
Pull request overview
Adds TraceLens performance-model coverage and categorization for TransformerEngine fused linear and LayerNorm ops seen in Primus traces, improving GEMM/NORM attribution and enabling GFLOPS/TB/s reporting where shape metadata is available.
Changes:
- Added new perf-model classes for TE `_Linear`, `_LayerNormLinear` (GEMM-based) and `LayerNormFn` (Normalization-based).
- Registered TE ops in `op_to_perf_model_class_map` and updated `categorize_torch_op` so dynamically mapped normalization ops categorize as `NORM_*` (plus TE backward-name special cases).
- Added a dedicated unit test suite validating mapping, categorization, and basic FLOPs/bytes calculations for the new TE models.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `TraceLens/PerfModel/perf_model.py` | Introduces new TE perf-model classes (`te_linear`, `te_layer_norm_linear`, `te_layer_norm_fn`). |
| `TraceLens/PerfModel/torch_op_mapping.py` | Maps TE op names to perf models and adjusts categorization logic for normalization + TE backward names. |
| `tests/test_te_linear_ops.py` | Adds unit tests for TE mapping/categorization and FLOPs/bytes computations. |
Addresses Copilot Round 1 review on PR #548:
- `te_linear.bytes()`: `bpe_mat2` now uses the weight dtype (`dtype_A_B[1]`) instead of the activation dtype, fixing byte counts for mixed-precision configs
- `te_layer_norm_linear.bytes()`: same fix
- Added mixed-precision unit tests (FP8 weight + BF16 activation)

Made-with: Cursor
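The dtype fix can be sketched with a small illustrative helper (the function name and signature are assumptions for illustration, not the actual TraceLens API): with FP8 weights and BF16 activations, the weight-read term must use the weight's bytes-per-element, not the activation's.

```python
# Hedged sketch of the bytes fix: the mat2 (weight) term uses the weight
# dtype's bytes-per-element, not the activation's. Illustrative names only.
def gemm_bytes(M, N, K, bpe_act, bpe_weight, bpe_out):
    read_a = M * K * bpe_act        # activation read
    read_b = K * N * bpe_weight    # weight read (fix: weight dtype, not activation dtype)
    write_c = M * N * bpe_out      # output write
    return read_a + read_b + write_c

# FP8 weight (1 byte/elt) + BF16 activation and output (2 bytes/elt)
print(gemm_bytes(4096, 4096, 4096, 2, 1, 2))  # → 83886080
```

Before the fix, a model like this would have charged the weight read at 2 bytes/element, overstating traffic for FP8-weight configurations.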
force-pushed from 95ec22c to cc63818
Pull request overview
Adds TransformerEngine (TE) fused linear and standalone LayerNorm performance-model coverage to TraceLens so TE-heavy Primus traces can report meaningful GFLOPS/TB/s instead of falling into “other”.
Changes:
- Add new perf model classes for TE `_Linear`, `_LayerNormLinear` (GEMM-based) and `LayerNormFn` (Normalization-based).
- Register TE ops in `op_to_perf_model_class_map` and adjust `categorize_torch_op` so `LayerNormFn*` is categorized as normalization and TE backward ops are categorized consistently.
- Add a dedicated unit test suite covering mapping, categorization, FLOPs, and bytes for the new TE models.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `TraceLens/PerfModel/perf_model.py` | Introduces `te_linear`, `te_layer_norm_linear`, and `te_layer_norm_fn` perf model classes. |
| `TraceLens/PerfModel/torch_op_mapping.py` | Maps TE op names to the new perf models and updates categorization logic for normalization and TE backward ops. |
| `tests/test_te_linear_ops.py` | Adds unit tests validating mapping, categorization, and GEMM FLOPs/bytes calculations for TE ops. |
Closes #538
- `te_linear` (GEMM): full GFLOPS + TB/s for `_Linear` forward
- `te_layer_norm_linear` (GEMM): full GFLOPS + TB/s for `_LayerNormLinear` forward
- `te_layer_norm_fn` (Normalization): TB/s model for standalone `LayerNormFn`
- Backward ops categorized correctly via `categorize_torch_op()`
- Fixes `categorize_torch_op` to check `dict_cat2names["Normalization"]`

Made-with: Cursor
…ation base

Copilot Round 2: the docstring said "TB/s only — no significant FLOPS", but the class inherits `Normalization.flops()`, which returns non-zero FLOPS, consistent with all other Normalization subclasses. Updated to say "memory-bound; reports both FLOPS and TB/s".

Made-with: Cursor
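The Round 2 point can be illustrated with a toy model: a memory-bound LayerNorm still performs non-zero FLOPs (mean, variance, normalize, scale/shift), it is just dominated by memory traffic. The constants and function names below are illustrative assumptions, not the actual `Normalization` base-class implementation.

```python
# Toy model of a memory-bound LayerNorm over an (M, N) input.
# Assumption: roughly 5 ops per element for mean/var/normalize/scale-shift.
def layer_norm_flops(M, N):
    return 5 * M * N  # non-zero, so FLOPS is still worth reporting

def layer_norm_bytes(M, N, bpe):
    read_x = M * N * bpe   # input read
    write_y = M * N * bpe  # output write
    params = 2 * N * bpe   # gamma and (optional) beta
    return read_x + write_y + params
```

For any realistic M, the bytes term scales the same as the FLOPs term, so arithmetic intensity stays low and the op stays memory-bound; reporting both metrics, as the updated docstring says, is consistent.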
force-pushed from 6acf8d1 to 7e11da8
Pull request overview
Adds performance-model coverage for TransformerEngine fused linear and LayerNorm ops seen in Primus traces, improving TraceLens’ GFLOPS/TB/s reporting and reducing “other” categorization.
Changes:
- Add new perf model classes for TE `_Linear`, `_LayerNormLinear` (GEMM-based) and `LayerNormFn` (Normalization-based).
- Extend `op_to_perf_model_class_map` and `categorize_torch_op` to correctly map/categorize TE forward + backward op names (including dynamic Normalization categorization).
- Add a new unit test suite covering TE mapping, categorization, and GEMM FLOPs/bytes (including mixed precision).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `TraceLens/PerfModel/perf_model.py` | Introduces `te_linear`, `te_layer_norm_linear`, and `te_layer_norm_fn` perf model implementations. |
| `TraceLens/PerfModel/torch_op_mapping.py` | Registers new TE ops and adjusts categorization logic for Normalization + TE backward names. |
| `tests/test_te_linear_ops.py` | Adds unit tests for TE op mapping/categorization and GEMM FLOPs/bytes calculations. |
- `test_te_layer_norm_fn_bytes`: assert exact expected bytes instead of > 0
- `te_layer_norm_fn` docstring: document optional Input[2] = beta
- `torch_op_mapping.py` comment: fix to say memory-bound with both FLOPS and TB/s

Made-with: Cursor
Summary
Closes #538 (sub-issue of #516)
Adds performance models (GFLOPS + TB/s) for TransformerEngine's fused linear ops that appear in Primus traces:
- `te_linear` (GEMM): forward perf model for `_Linear`; extracts M, N, K from the weight and input shapes in the trace event
- `te_layer_norm_linear` (GEMM): forward perf model for `_LayerNormLinear`; same GEMM modeling, ignoring the LayerNorm contribution (negligible relative to the GEMM)
- `te_layer_norm_fn` (Normalization): TB/s model for standalone `LayerNormFn`
- Backward ops (`_LinearBackward`, `_LayerNormLinearBackward`, `LayerNormFnBackward`) are categorized correctly but do not have standalone GFLOPS (backward traces lack weight shapes; use `--enable_pseudo_ops` for decomposed GFLOPS)
- Updates `categorize_torch_op` to also check `dict_cat2names["Normalization"]` so dynamically mapped norm ops like `LayerNormFn` are categorized as `NORM_fwd` instead of `other`

Models benefiting: Qwen3-8B, Zebra-Llama-1B, Kimi-K2 (any TE-based model)
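The categorization change can be sketched as follows. This is a simplified stand-in for `categorize_torch_op`, with a hypothetical `dict_cat2names` and op set, not the actual TraceLens code; it only shows the shape of the fix, namely consulting the dynamic Normalization name set in addition to the static map.

```python
# Hedged sketch: after the static GEMM name check, also consult the dynamic
# category->names dict so ops like "LayerNormFn" land in NORM_* not "other".
dict_cat2names = {"Normalization": {"LayerNormFn"}}  # illustrative contents

def categorize_torch_op(name):
    base = name.removesuffix("Backward")             # TE backward special case
    suffix = "bwd" if name.endswith("Backward") else "fwd"
    if base in dict_cat2names["Normalization"]:
        return f"NORM_{suffix}"
    if base in {"_Linear", "_LayerNormLinear"}:
        return f"GEMM_{suffix}"
    return "other"

print(categorize_torch_op("LayerNormFn"))          # → NORM_fwd
print(categorize_torch_op("LayerNormFnBackward"))  # → NORM_bwd
print(categorize_torch_op("_Linear"))              # → GEMM_fwd
```

Without the `dict_cat2names["Normalization"]` lookup, the first two calls would fall through to `"other"`, which is exactly the misattribution the PR fixes.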
Test plan
`tests/test_te_linear_ops.py`:
- `_Linear`, `_LayerNormLinear`, `LayerNormFn` registered in `op_to_perf_model_class_map`
- Forward ops categorized as `GEMM`/`NORM_fwd`, backward ops → `GEMM`/`NORM_bwd`
- FLOPs = `2*M*N*K` for symmetric and asymmetric GEMM shapes
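The FLOPs assertion in the test plan follows the standard GEMM count: each of the M×N output elements needs K multiply-add pairs, hence `2*M*N*K`. A minimal sketch of that check (helper name is illustrative, not the TraceLens test code):

```python
# Sketch of the test-plan FLOPs check: a forward GEMM performs 2*M*N*K FLOPs
# (one multiply + one add per inner-product step, M*N*K steps total).
def gemm_flops(M, N, K):
    return 2 * M * N * K

# symmetric shape
assert gemm_flops(1024, 1024, 1024) == 2 * 1024**3
# asymmetric shape
assert gemm_flops(8192, 1024, 4096) == 2 * 8192 * 1024 * 4096
```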