* Fallback for Torch Compile, HipGraph
* Prevent Triton analysis: TraceLens not supported
* Specify limitations for speculative analysis
* Flag communication analysis and route users to TraceLens
* Rename BatchNorm analyzer to Norm Analyzer, update MoE classification to natural language
* Flag Reduce analysis limitations
* Update CPU bottleneck threshold
* graph+capture instructions
* Updated README.md
* graph+capture instructions
* Add docs for Agentic Mode
* Moved capture folder analysis to one script
* Adding MI350X support for Agentic Mode
* Update Inference_analysis.md
* Added roofline conceptual details
* Added roofline conceptual details
* Added roofline conceptual details
* Update trace_capture_merge_experimental.py
* Fixed a bug in capture-graph augmentation
* Fixed a bug in capture-graph augmentation
* Update docs/Inference_analysis.md (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update docs/Inference_analysis.md (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update docs/Inference_analysis.md (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update docs/Inference_analysis.md (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Added number of kernels for grouping
* Remove perf summarization for bwd events
* Fixed timestamps for graph+capture and name_to_uid list
* Skip ops with 0 events
* Extend the cpu_root_node list of the main tree
* Consider python function nodes if enabled
* Consider python function nodes if enabled
* Revert "consider python function nodes if enabled" (reverts commit d2b4250d67ab1f3278a3d9f133ef69542b9edb8e)
* Discard python function when finding leaf cpu node
* Rename method to get_df_kernel_launchers_unique_args_module
* Docker image building script
* Add tests created by Cursor
* Run black on tests
* Added sglang patches
* Add copyrights
* Very linty
* Hook up the standalone test analysis to GitHub Actions
* Hard-code plotting functionality to reduce context
* Organize imports for plots
* Added Dockerfile and run script
* Add venv support instead of only Docker
* Remove PYTHONPATH
* Initial commit for evals
* Unified Docker creation script
* Apply annotation to aiter varlen attention
* Moved inference perf models to extension files, restructured the extension files
* Perf model extensions for other ops
* Revert "added sglang patches"
* Revert "Support TraceLens installed in a venv instead of a docker container"
* Remove local paths
* Resolve merge conflict
* Add options
* Updating evals, orchestrator for repeatability
* Add README.md for evals
* Remove intermediates
* Routing output to eval results directory
* Split up deterministic and LLM checks, add repeatability test
* Add unit tests for attention, conv, and moe eval cases:
  - Created evals/unit_tests/ with 16 test cases (3 attention, 7 conv, 6 moe)
  - Each case contains trace JSON + analysis_output_ref (standalone_analysis.md + perf_report_csvs)
  - Updated unit_test_traces.csv with all 16 entries
  - Updated run_evals.sh with container detection for running inside Docker
  (Made-with: Cursor)
* Update run_evals.sh
* Reverting change to run_evals.sh
* Remove container details
* feat: categorize flash_attn::_flash_attn_backward as SDPA_bwd:
  - Add flash_attn::_flash_attn_backward to op_to_perf_model_class_map (flash_attention_backward) so it is classified as SDPA_bwd instead of other
  - Add sdpa_bwd to CATEGORY_SKILL_MAP in platform_specs (sdpa-analysis)
  - Add --category option to sdpa_analysis.py (sdpa_fwd | sdpa_bwd) so Standalone can run SDPA analysis for the backward category
  (Made-with: Cursor)
* Reformat with black
* Cursor fixed the BWD attention perf model
* Store capture traces in a subdir
* Enable deterministic plotting
* Tweaks to report to make it look better
* Plot generation tweaks
* Reformat
* Remove improvement quantification unless estimated through roofline
* Mitigating race conditions in file write
* Repeatability fixes: Cursor CLI retry, update tolerance, graceful exit
* Streamlining orchestrator + remove redundancy
* Evals update with non-deterministic estimate removal
* Refactored utils for standalone analysis
* Remove GEMM ref. pending addition
* Added changes for sglang graph capture
* Update dsr1_fp8_mi355x_sglang_graph.sh
* Update dsr1_fp8_mi355x_sglang_graph.sh
* Update skills files to recognize changes in feat/sdpa_bwd_category
* Constraining impact estimate template, further validation
* Update evals to reflect deterministic evals
* Evals v1: clean up
* Evals v1: reformatting
* Update guidance on agent updates
* Adds estimated e2e column and assumes perf improvement reaches 75-100% of roofline
* Fix plot
* Restore tables
* feat: MCP server
* feat: code format
* feat: lint & copyright header
* feat: lint
* Ask for ssh-all permissions if full_network fails
* Ask for virtual env path
* Insight instead of issue
* Include flops/byte column
* Feat/semantic breakdown: add semantic breakdown skill (#33)
  - Initial commit
  - Added integration with standalone mode analysis
  - Added gzip loading to the semantic breakdown script
  - Added tracediff-like report generation script
  - Major refactor to single folder
  - Added output format tweaks + documentation
* Reverted changes made to the standalone orchestrator when merging the semantic skills (#40)
* Insights instead of issues
* Improve Trace2Tree runtime (#522): each invocation of `_get_graph_gpu_events` searched for the GPU events associated with a graph launch by scanning every event in the trace for the matching linking key, which is where the bulk of the runtime was going; `add_gpu_ops_to_tree` was also traversing all events, even non-GPU ones. Now, during the initial tree traversal in `_preprocess_and_index_events`, the GPU event uids are cached in an array and the CPU-op-to-GPU-kernel links are cached in a dict. This improves runtime significantly: a trace that previously took 40 minutes to process now takes around 1 minute.
* Enhance trace annotation with GPU busy duration: refactor extract_iteration to include gpu_busy_duration and update related functions for consistency
* Addressed comments; trimmed md files
* Extract report template from orchestrator into standalone file: move the ~170-line report template and formatting rules from Step 10 of the orchestrator into a new standalone_analysis_template.md file. This delays context bloat until the agent actually needs the template (after analysis), reducing hallucination risk.
  - Create standalone_analysis_template.md with full report structure and formatting rules as HTML comments
  - Replace Step 10 body with concise read/copy/fill instructions
  - Reduce validation retries from 2 to 1 (template guarantees structure)
  - Orchestrator shrinks from 838 to 654 lines (-22%)
  (Made-with: Cursor)
* README update
* Update Inference_analysis.md
* venv support
* Added remote file path lookup
* Using prefix for every command
* MoE perf models for SGLang
* SGLang roofline and graph+capture changes
* Added support for inference ops
* Inference analysis related features
* Roofline for unquantized GEMM
* Added unfused moe roofline
* Removed redundant code
* Details on perf models
* Added perf report sanity check, moved eager/graph+capture analysis to single script
* Fixed warning message
* Add Magpie profiling skill (experimental)
* Magpie skill: add instructions for custom vLLM graph capture profiling
* Add option to set delay iterations
* Fixed number of steps
* Changes to add Docker image building and trace splitting in main steps
* Copied file to avoid dependency
* Modified documentation for modified script name
* Modified documentation for modified script name
* Reverting virtual env changes
* Ensure output_dir is passed as string to python functions
* Removed arbitrary efficiency targets
* Removed prefix if local
* Remove section per request from @tsrikris
* Fix formatting in Inference_analysis.md
* Hardcoded some commands and removed ssh permissions feature
* Adding golden ref generation script
* Golden Report Unit: delete unwanted files
* Added more env prefixes
* Removed ssh network specification
* Removed yes-container/yes-venv option in table
* README update for CLI-based use
* Revise Quickstart Guide for TraceLens installation: updated the installation step and reordered the quickstart guide for clarity
* Minor cleanup on standalone analysis orchestrator
* vLLM v0.17.0 patches, modified documentation
* Documentation
* Update eval framework and refresh MOE reference data. Eval script fixes:
  - quality_scripted_evals: allow extra columns in generated CSVs (only fail on missing reference columns), fixing a num_kernels false positive
  - workflow-llm-eval skill: accept both Insight and Issue as valid P-item field labels
  - workflow_scripted_evals: check metrics for zero impact estimates when plot_data.json is absent (valid skip); improve error message for missing subagent findings

  MOE reference data:
  - Restructure moe test cases to match gemm directory layout
  - Add moe_04_top2_dense and moe_06_tiny_batch_many_experts test cases
  - Update all 6 MOE reference reports: Issue -> Insight field label
  - Refresh reference CSVs and traces from Unit-tests-new
  - Update unit_test_traces.csv with all 6 MOE cases

  (Made-with: Cursor)
* Change default steady state iterations to 32 (#59)
* Format eval scripts with black (Made-with: Cursor)
* Format all Python files with black to pass CI (Made-with: Cursor)
* Update patch for vllm version 0.17.0
* Added rooflines for ck moe
* Improve sub_regions selection logic for decode-only: refactor sub_regions filtering logic to handle empty cases
* Refactor conditional block for decode_only check
* Updating patch files
* Replace unit_tests directory with compressed archive: pack all eval unit test data (traces + references) into a single unit_tests.tar.gz archive; the eval scripts auto-expand the archive at runtime if the directory is missing or stale
  - 510 files -> 1 archive (65 KB compressed)
  - run_evals.sh and run_repeatability.sh auto-expand before running
  - evals/unit_tests/ added to .gitignore (generated at runtime)
  - evals/results/ and evals/repeatability_results/ also gitignored
  (Made-with: Cursor)
* Reset NODE and CONTAINER variables in run_evals.sh: clear NODE and CONTAINER variables for default values
* Minor changes to graph capture merge
* Profiling skills edits
* Added caching logic for capture trees
* Only extension matching, since capture traces are in a different folder
* Perf models for quantized and batched GEMMs
* Remove GDNAttention from the performance model extensions mapping
* Changes to move inference-specific script into TraceLens
* Fix code snippets in Inference_analysis.md: updated code snippets for generating performance reports and steady-state analysis in the inference analysis documentation
* Update command for splitting inference trace annotation
* Changes to shape profiler
* Updates to MoE performance memory estimation
* MoE memory estimates improved: black formatting
* Fixed a bug for roots without graph launch
* Error checking improvement
* Added datatype for Float4_e2m1fn_x2
* Added memset in cuda runtime events; improved caching of trees
* Updating cuda graph patches
* Added sparse_fwd kernel
* Added datatypes
* Added new rooflines
* Potential fix for pull request finding (Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>)
* Potential fix for pull request finding (Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>)
* Potential fix for pull request finding: remove whitespace (Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>)
* Small bug fixes, added __init__
* Create build-wheel.yml
* Added a check for decode step len
* Manual merging; add group-by-parent-module as flag
* Removed internal/deprecated files
* Added newer trace capture merge file

--------

Co-authored-by: Tharun Adithya Srikrishnan <tsrikris@amd.com>
Co-authored-by: Deval Shah <devashah@amd.com>
Co-authored-by: devalshahamd <deval.shah@amd.com>
Co-authored-by: mohbasit <mohbasit@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Gabe Weisz <gabe.weisz@amd.com>
Co-authored-by: DCCS-5239 <mohbasit@chi-mi300x-005.ord.vultr.cpe.ice.amd.com>
Co-authored-by: gabeweisz <162640284+gabeweisz@users.noreply.github.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Ahmedhasssan-aig <ahasssan@amd.com>
Co-authored-by: Ahmedhasssan-aig <Ahmed.Hasssan@amd.com>
Co-authored-by: xiaofei.zheng@amd.com <xiaofei.zheng>
Co-authored-by: Spandan More <spandan.more@amd.com>
Co-authored-by: Ahmed Hasssan <ahasssan@chi-mi300x-007.ord.vultr.cpe.ice.amd.com>
Co-authored-by: Akash Haridas <akash.haridas@amd.com>
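The Trace2Tree runtime fix described in #522 above replaces a per-launch linear scan with a one-time index built during the initial traversal. A minimal sketch of that pattern, assuming a simplified event schema (dicts with `cat`, `uid`, and a `correlation` linking key); the real TraceLens event structures and function signatures differ:

```python
# Hypothetical sketch of the #522 caching optimization, not TraceLens's
# actual implementation. Events are assumed to be dicts with "cat", "uid",
# and a "correlation" linking key for illustration.
from collections import defaultdict


def preprocess_and_index_events(events):
    """One pass over the trace: cache GPU event uids in a list and the
    CPU-op-to-GPU-kernel links in a dict keyed by the linking key."""
    gpu_event_uids = []
    links_by_correlation = defaultdict(list)
    for ev in events:
        if ev.get("cat") == "kernel":  # GPU event
            gpu_event_uids.append(ev["uid"])
            links_by_correlation[ev["correlation"]].append(ev["uid"])
    return gpu_event_uids, links_by_correlation


def get_graph_gpu_events(launch_event, links_by_correlation):
    """O(1) dict lookup per graph launch, instead of re-scanning every
    event in the trace for a matching linking key."""
    return links_by_correlation.get(launch_event["correlation"], [])
```

Building the index once makes each subsequent `get_graph_gpu_events`-style query constant time, which is consistent with the reported drop from roughly 40 minutes to about 1 minute on a large trace.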
Changes in tree_perf made in 85fa323 added num_kernels as a groupby condition, which changed the report format; this is now controlled by a flag. There should probably be tests for that flag. A change in torch_op_mapping was also made in 3fcc166 that altered how ops are categorized in the ops_summary_by_category sheet. This improved op categorization but changed the ops_summary_by_category report output, leading to failing tests. It should probably be a separate PR, since it would require regenerating the perf reports so tests don't fail.
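The flag-controlled grouping described above can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual tree_perf code: the row schema and function name are made up, and only the shape of the change (an optional extra groupby key behind a flag, defaulting to the old report format) is taken from the note.

```python
# Hypothetical sketch of a flag-controlled groupby column, in the spirit of
# the num_kernels change described above. Names are illustrative only.
from collections import defaultdict


def summarize_ops(rows, group_by_num_kernels=False):
    """Sum durations per op; with the flag off, output matches the old
    report format, so existing report tests keep passing by default."""
    totals = defaultdict(float)
    for r in rows:
        if group_by_num_kernels:
            key = (r["op_name"], r["num_kernels"])  # new, finer grouping
        else:
            key = (r["op_name"],)  # original grouping
        totals[key] += r["duration_us"]
    return dict(totals)
```

Defaulting the flag to the legacy behavior is what makes the format change opt-in; a test for each flag value would pin both output shapes.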
Syncing private and public changes.