* Fallback for Torch Compile, HipGraph
* Prevent Triton analysis: TraceLens not supported
* Specify limitations for speculative analysis
* Flag communication analysis and route users to TraceLens
* Rename BatchNorm analyzer to Norm Analyzer, update MoE classification to natural language
* Flag Reduce analysis limitations
* Update CPU bottleneck threshold
* graph+capture instructions
* Updated README.md
* graph+capture instructions
* Add docs for Agentic Mode
* Moved capture folder analysis to one script
* Adding MI350X support for Agentic Mode
* Update Inference_analysis.md
* Added roofline conceptual details
* Added roofline conceptual details
* Added roofline conceptual details
* Update trace_capture_merge_experimental.py
* Fixed a bug in capture-graph augmentation
* Fixed a bug in capture-graph augmentation
* Update docs/Inference_analysis.md (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update docs/Inference_analysis.md (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update docs/Inference_analysis.md (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update docs/Inference_analysis.md (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Added number of kernels for grouping
* Remove perf summarization for bwd events
* Fixed timestamps for graph+capture and name_to_uid list
* Skip ops with 0 events
* Extend the cpu_root_node list of the main tree
* Consider python function nodes if enabled
* Consider python function nodes if enabled
* Revert "consider python function nodes if enabled" (reverts commit d2b4250d67ab1f3278a3d9f133ef69542b9edb8e)
* Discard python function when finding leaf cpu node
* Rename method to get_df_kernel_launchers_unique_args_module
* Docker image building script
* Add tests created by Cursor
* Run black on tests
* Added sglang patches
* Add copyrights
* Very linty
* Hook up the standalone test analysis to GitHub Actions
* Hard-code plotting functionality to reduce context
* Organize imports for plots
* Added Dockerfile and run script
* Add venv support instead of only Docker
* Remove PYTHONPATH
* Initial commit for evals
* Unified Docker creation script
* Apply annotation to aiter varlen attention
* Moved inference perf models to extension files, restructured the extension files
* Perf model extensions for other ops
* Revert "added sglang patches"
* Revert "Support TraceLens installed in a venv instead of a docker container"
* Remove local paths
* Resolve merge conflict
* Add options
* Updating evals, orchestrator for repeatability
* Add README.md for evals
* Remove intermediates
* Routing output to eval results directory
* Split up deterministic and LLM checks, add repeatability test
* Add unit tests for attention, conv, and moe eval cases:
  - Created evals/unit_tests/ with 16 test cases (3 attention, 7 conv, 6 moe)
  - Each case contains trace JSON + analysis_output_ref (standalone_analysis.md + perf_report_csvs)
  - Updated unit_test_traces.csv with all 16 entries
  - Updated run_evals.sh with container detection for running inside Docker
  (Made-with: Cursor)
* Update run_evals.sh
* Reverting change to run_evals.sh
* Remove container details
* feat: categorize flash_attn::_flash_attn_backward as SDPA_bwd:
  - Add flash_attn::_flash_attn_backward to op_to_perf_model_class_map (flash_attention_backward) so it is classified as SDPA_bwd instead of other
  - Add sdpa_bwd to CATEGORY_SKILL_MAP in platform_specs (sdpa-analysis)
  - Add --category option to sdpa_analysis.py (sdpa_fwd | sdpa_bwd) so Standalone can run SDPA analysis for the backward category
  (Made-with: Cursor)
* Reformat with black
* Cursor fixed the BWD attention perf model
* Store capture traces in a subdir
* Enable deterministic plotting
* Tweaks to report to make it look better
* Plot generation tweaks
* Reformat
* Remove improvement quantification unless estimated through roofline
* Mitigating race conditions in file write
* Repeatability fixes: Cursor CLI retry, update tolerance, graceful exit
* Streamlining orchestrator + remove redundancy
* Evals update with non-deterministic estimate removal
* Refactored utils for standalone analysis
* Remove GEMM ref. pending addition
* Added changes for sglang graph capture
* Update dsr1_fp8_mi355x_sglang_graph.sh
* Update dsr1_fp8_mi355x_sglang_graph.sh
* Update skills files to recognize changes in feat/sdpa_bwd_category
* Constraining impact estimate template, further validation
* Update evals to reflect deterministic evals
* Evals v1: clean up
* Evals v1: reformatting
* Update guidance on agent updates
* Adds estimated e2e column and assumes perf improvement reaches 75-100% of roofline
* Fix plot
* Restore tables
* feat: MCP server
* feat: code format
* feat: lint & copyright header
* feat: lint
* Ask for ssh-all permissions if full_network fails
* Ask for virtual env path
* Insight instead of issue
* Include flops/byte column
* Feat/semantic breakdown: add semantic breakdown skill (#33)
  - Initial commit
  - Added integration with standalone mode analysis
  - Added gzip loading to the semantic breakdown script
  - Added tracediff-like report generation script
  - Major refactor to single folder
  - Added output format tweaks + documentation
* Reverted changes made to the standalone orchestrator when merging the semantic skills (#40)
* Insights instead of issues
* Improve Trace2Tree runtime (#522): each invocation of `_get_graph_gpu_events` searched for the GPU events associated with a graph launch by scanning every event in the trace for the matching linking key, which is where the bulk of the runtime was going; `add_gpu_ops_to_tree` was also traversing all events, even non-GPU ones. Now, during the initial tree traversal in `_preprocess_and_index_events`, the GPU event uids are cached in an array and the CPU-op-to-GPU-kernel links are cached in a dict. This improves runtime significantly: a trace that previously took 40 minutes to process now takes around 1 minute.
* Enhance trace annotation with GPU busy duration: refactor extract_iteration to include gpu_busy_duration and update related functions for consistency
* Addressed comments; trimmed md files
* Extract report template from orchestrator into standalone file: move the ~170-line report template and formatting rules from Step 10 of the orchestrator into a new standalone_analysis_template.md file. This delays context bloat until the agent actually needs the template (after analysis), reducing hallucination risk.
  - Create standalone_analysis_template.md with full report structure and formatting rules as HTML comments
  - Replace Step 10 body with concise read/copy/fill instructions
  - Reduce validation retries from 2 to 1 (template guarantees structure)
  - Orchestrator shrinks from 838 to 654 lines (-22%)
  (Made-with: Cursor)
* README update
* Update Inference_analysis.md
* venv support
* Added remote file path lookup
* Using prefix for every command
* MoE perf models for SGLang
* SGLang roofline and graph+capture changes
* Added support for inference ops
* Inference analysis related features
* Roofline for unquantized GEMM
* Added unfused moe roofline
* Removed redundant code
* Details on perf models
* Added perf report sanity check, moved eager/graph+capture analysis to single script
* Fixed warning message
* Add Magpie profiling skill (experimental)
* Magpie skill: add instructions for custom vLLM graph capture profiling
* Add option to set delay iterations
* Fixed number of steps
* Changes to add Docker image building and trace splitting in main steps
* Copied file to avoid dependency
* Modified documentation for modified script name
* Modified documentation for modified script name
* Reverting virtual env changes
* Ensure output_dir is passed as string to python functions
* Removed arbitrary efficiency targets
* Removed prefix if local
* Remove section per request from @tsrikris
* Fix formatting in Inference_analysis.md
* Hardcoded some commands and removed ssh permissions feature
* Adding golden ref generation script
* Golden Report Unit: delete unwanted files
* Added more env prefixes
* Removed ssh network specification
* Removed yes-container/yes-venv option in table
* README update for CLI-based use
* Revise Quickstart Guide for TraceLens installation: updated the installation step and reordered the quickstart guide for clarity
* Minor cleanup on standalone analysis orchestrator
* vLLM v0.17.0 patches, modified documentation
* Documentation
* Update eval framework and refresh MOE reference data. Eval script fixes:
  - quality_scripted_evals: allow extra columns in generated CSVs (only fail on missing reference columns), fixing a num_kernels false positive
  - workflow-llm-eval skill: accept both Insight and Issue as valid P-item field labels
  - workflow_scripted_evals: check metrics for zero impact estimates when plot_data.json is absent (valid skip); improve error message for missing subagent findings

  MOE reference data:
  - Restructure moe test cases to match gemm directory layout
  - Add moe_04_top2_dense and moe_06_tiny_batch_many_experts test cases
  - Update all 6 MOE reference reports: Issue -> Insight field label
  - Refresh reference CSVs and traces from Unit-tests-new
  - Update unit_test_traces.csv with all 6 MOE cases

  (Made-with: Cursor)
* Change default steady state iterations to 32 (#59)
* Format eval scripts with black (Made-with: Cursor)
* Format all Python files with black to pass CI (Made-with: Cursor)
* Update patch for vllm version 0.17.0
* Added rooflines for ck moe
* Improve sub_regions selection logic for decode-only: refactor sub_regions filtering logic to handle empty cases
* Refactor conditional block for decode_only check
* Updating patch files
* Replace unit_tests directory with compressed archive: pack all eval unit test data (traces + references) into a single unit_tests.tar.gz archive; the eval scripts auto-expand the archive at runtime if the directory is missing or stale
  - 510 files -> 1 archive (65 KB compressed)
  - run_evals.sh and run_repeatability.sh auto-expand before running
  - evals/unit_tests/ added to .gitignore (generated at runtime)
  - evals/results/ and evals/repeatability_results/ also gitignored
  (Made-with: Cursor)
* Reset NODE and CONTAINER variables in run_evals.sh: clear NODE and CONTAINER variables for default values
* Minor changes to graph capture merge
* Profiling skills edits
* Added caching logic for capture trees
* Only extension matching, since capture traces are in a different folder
* Perf models for quantized and batched GEMMs
* Remove GDNAttention from the performance model extensions mapping
* Changes to move inference-specific script into TraceLens
* Fix code snippets in Inference_analysis.md: updated code snippets for generating performance reports and steady-state analysis in the inference analysis documentation
* Update command for splitting inference trace annotation
* Changes to shape profiler
* Updates to MoE performance memory estimation
* MoE memory estimates improved: black formatting
* Fixed a bug for roots without graph launch
* Error checking improvement
* Added datatype for Float4_e2m1fn_x2
* Added memset in cuda runtime events; improved caching of trees
* Updating cuda graph patches
* Added sparse_fwd kernel
* Added datatypes
* Added new rooflines
* Potential fix for pull request finding (Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>)
* Potential fix for pull request finding (Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>)
* Potential fix for pull request finding: remove whitespace (Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>)
* Small bug fixes, added __init__
* Create build-wheel.yml
* Added a check for decode step len
* Manual merging; add group-by-parent-module as flag
* Removed internal/deprecated files
* Added newer trace capture merge file

--------

Co-authored-by: Tharun Adithya Srikrishnan <tsrikris@amd.com>
Co-authored-by: Deval Shah <devashah@amd.com>
Co-authored-by: devalshahamd <deval.shah@amd.com>
Co-authored-by: mohbasit <mohbasit@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Gabe Weisz <gabe.weisz@amd.com>
Co-authored-by: DCCS-5239 <mohbasit@chi-mi300x-005.ord.vultr.cpe.ice.amd.com>
Co-authored-by: gabeweisz <162640284+gabeweisz@users.noreply.github.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Ahmedhasssan-aig <ahasssan@amd.com>
Co-authored-by: Ahmedhasssan-aig <Ahmed.Hasssan@amd.com>
Co-authored-by: xiaofei.zheng@amd.com <xiaofei.zheng>
Co-authored-by: Spandan More <spandan.more@amd.com>
Co-authored-by: Ahmed Hasssan <ahasssan@chi-mi300x-007.ord.vultr.cpe.ice.amd.com>
Co-authored-by: Akash Haridas <akash.haridas@amd.com>
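The Trace2Tree runtime fix described in #522 above replaces a per-launch linear scan with a one-time index built during the initial traversal. A minimal sketch of that pattern, assuming a simplified event schema (dicts with `cat`, `uid`, and a `correlation` linking key); the real TraceLens event structures and function signatures differ:

```python
# Hypothetical sketch of the #522 caching optimization, not TraceLens's
# actual implementation. Events are assumed to be dicts with "cat", "uid",
# and a "correlation" linking key for illustration.
from collections import defaultdict


def preprocess_and_index_events(events):
    """One pass over the trace: cache GPU event uids in a list and the
    CPU-op-to-GPU-kernel links in a dict keyed by the linking key."""
    gpu_event_uids = []
    links_by_correlation = defaultdict(list)
    for ev in events:
        if ev.get("cat") == "kernel":  # GPU event
            gpu_event_uids.append(ev["uid"])
            links_by_correlation[ev["correlation"]].append(ev["uid"])
    return gpu_event_uids, links_by_correlation


def get_graph_gpu_events(launch_event, links_by_correlation):
    """O(1) dict lookup per graph launch, instead of re-scanning every
    event in the trace for a matching linking key."""
    return links_by_correlation.get(launch_event["correlation"], [])
```

Building the index once makes each subsequent `get_graph_gpu_events`-style query constant time, which is consistent with the reported drop from roughly 40 minutes to about 1 minute on a large trace.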
Changes in tree_perf made in 85fa323 added num_kernels as a groupby condition, which changed the report format; this is now controlled by a flag. There should probably be tests for that flag. A change in torch_op_mapping was also made in 3fcc166 that altered how ops are categorized in the ops_summary_by_category sheet. This improved op categorization but changed the ops_summary_by_category report output, leading to failing tests. It should probably be a separate PR, since it would require regenerating the perf reports so tests don't fail.
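The flag-controlled grouping described above can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual tree_perf code: the row schema and function name are made up, and only the shape of the change (an optional extra groupby key behind a flag, defaulting to the old report format) is taken from the note.

```python
# Hypothetical sketch of a flag-controlled groupby column, in the spirit of
# the num_kernels change described above. Names are illustrative only.
from collections import defaultdict


def summarize_ops(rows, group_by_num_kernels=False):
    """Sum durations per op; with the flag off, output matches the old
    report format, so existing report tests keep passing by default."""
    totals = defaultdict(float)
    for r in rows:
        if group_by_num_kernels:
            key = (r["op_name"], r["num_kernels"])  # new, finer grouping
        else:
            key = (r["op_name"],)  # original grouping
        totals[key] += r["duration_us"]
    return dict(totals)
```

Defaulting the flag to the legacy behavior is what makes the format change opt-in; a test for each flag value would pin both output shapes.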
Syncing private and public changes.