Perform a second pass in unified table creation to find GPU events wi… by devalshahamd · Pull Request #511 · AMD-AGI/TraceLens

devalshahamd · 2026-03-06T00:15:30Z

This pull request enhances the logic in the traverse method of tree_perf.py to ensure GPU kernel events that are not associated with any CPU operation in their parent hierarchy are properly included in the collected events. The update identifies "orphan" GPU kernels, groups them by their immediate parent event, and creates synthetic parent events to represent these relationships.

Improvements to GPU kernel event handling:

Added logic to identify GPU kernel events (including kernel, gpu_memcpy, gpu_memset) that do not have a CPU operation in their parent hierarchy, ensuring these "orphan" kernels are not missed in the collection process.
Grouped orphan GPU kernels by their immediate parent event and generated synthetic parent events to accurately represent the relationship in the collected events.

…thput cpu_op in the hierarchy

Copilot

Pull request overview

Enhances unified perf event collection in TreePerfAnalyzer.collect_unified_perf_events() to ensure GPU kernels that aren’t associated with any cpu_op ancestor (“orphan” kernels) are still surfaced for unified perf analysis.

Changes:

Adds a second pass after cpu_root_nodes traversal to detect GPU kernels not captured by CPU-op-based traversal.
Filters out already-collected GPU event UIDs (and optionally NCCL kernels) and identifies kernels without any cpu_op in their parent chain.
Groups orphan kernels by immediate parent event and appends synthetic collected events to represent them.
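
The grouping step listed above can be sketched on plain dicts. This is an illustration under stated assumptions, not the code in tree_perf.py: the `build_synthetic_rows` helper, the `parent` key holding a parent UID, and the sample events are made up for the example. Only `UID`, `gpu_events`, `parent_to_orphans`, and the `dict(parent_evt)` copy appear in the PR, and attaching the kernel through `gpu_events` on the parent-derived row is assumed.

```python
from collections import defaultdict


def build_synthetic_rows(orphan_kernels, uid_to_event):
    """Group orphan kernels by immediate parent and emit one row per kernel.

    Hypothetical helper: events are plain dicts whose "parent" key holds the
    parent's UID (or None). The real tree API may differ.
    """
    parent_to_orphans = defaultdict(list)
    for kernel in orphan_kernels:
        parent_to_orphans[kernel.get("parent")].append(kernel)

    rows = []
    for parent_uid, kernels in parent_to_orphans.items():
        if parent_uid is None:
            continue  # mirrors the PR as submitted: parentless kernels are skipped
        parent_evt = uid_to_event[parent_uid]
        for kernel in kernels:
            synthetic = dict(parent_evt)               # shallow copy of the parent's fields
            synthetic["gpu_events"] = [kernel["UID"]]  # assumed linkage field
            rows.append(synthetic)
    return rows


# Made-up sample data: one runtime launch with two orphan kernels, plus one
# kernel that has no parent at all.
events = {
    10: {"UID": 10, "name": "hipLaunchKernel", "cat": "cuda_runtime", "parent": None},
}
orphans = [
    {"UID": 11, "name": "gemm_kernel", "cat": "kernel", "parent": 10},
    {"UID": 12, "name": "copy_kernel", "cat": "kernel", "parent": 10},
    {"UID": 13, "name": "stray_kernel", "cat": "kernel", "parent": None},
]
rows = build_synthetic_rows(orphans, events)
print([(r["UID"], r["gpu_events"]) for r in rows])  # [(10, [11]), (10, [12])]
```

In this sketch both rows carry the parent's UID and the parentless kernel produces no row at all, which are the two behaviors the review comments on this PR question.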


Copilot · 2026-03-06T00:20:25Z

TraceLens/TreePerf/tree_perf.py

+                continue
+            parent_evt = self.tree.get_UID2event(parent_uid)
+            for kernel in kernels:
+                synthetic = dict(parent_evt)


synthetic = dict(parent_evt) copies the parent event's UID, so if a runtime parent has multiple orphan kernels this will append multiple collected rows with the same UID. This makes UID-based identification/aggregation ambiguous in downstream tables. Consider generating a unique UID for synthetic events (e.g., based on the kernel UID) and/or adding an explicit flag/field indicating the row is synthetic.

Suggested change

-                synthetic = dict(parent_evt)
+                synthetic = dict(parent_evt)
+                # Assign a unique UID for the synthetic event to avoid clashes
+                synthetic["UID"] = f"{parent_uid}_synthetic_{kernel['UID']}"
+                # Explicitly mark this row as synthetic for downstream consumers
+                synthetic["is_synthetic"] = True
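
To make the UID concern concrete, here is a small sketch comparing the two approaches. The sample parent and kernel dicts are made up, and indexing rows by UID stands in for whatever UID-based aggregation a downstream table might do; that consumer is an assumption, not something shown in the PR.

```python
parent_evt = {"UID": 10, "name": "hipLaunchKernel"}
kernels = [{"UID": 11}, {"UID": 12}]

# As submitted: every synthetic row inherits the parent's UID.
shared = []
for kernel in kernels:
    row = dict(parent_evt)
    row["gpu_events"] = [kernel["UID"]]
    shared.append(row)

# With the suggested change: a unique UID plus an explicit marker.
unique = []
for kernel in kernels:
    row = dict(parent_evt)
    row["UID"] = f"{parent_evt['UID']}_synthetic_{kernel['UID']}"
    row["is_synthetic"] = True
    row["gpu_events"] = [kernel["UID"]]
    unique.append(row)

# Indexing rows by UID silently drops one row in the first case.
by_uid_shared = {row["UID"]: row for row in shared}
by_uid_unique = {row["UID"]: row for row in unique}
print(len(by_uid_shared), len(by_uid_unique))  # 1 2
```

One trade-off to weigh: the suggested UID is a string, so if the existing UIDs are integers the column would become mixed-type.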

Copilot · 2026-03-06T00:20:25Z

TraceLens/TreePerf/tree_perf.py

+                continue
+            parent_evt = self.tree.get_UID2event(parent_uid)
+            for kernel in kernels:
+                synthetic = dict(parent_evt)


dict(parent_evt) is a shallow copy, so mutable fields like args and children will be shared between parent_evt and synthetic. If any later code mutates those nested structures (even just for formatting/enrichment), it can accidentally affect the original tree event. Prefer constructing a minimal synthetic dict with only the required fields, or use a deep copy and then strip fields that shouldn't carry over (e.g., children).

Suggested change

-                synthetic = dict(parent_evt)
+                # Create a deep copy to avoid sharing mutable nested structures (e.g., args).
+                synthetic = copy.deepcopy(parent_evt)
+                # Synthetic events should not carry over the original children hierarchy.
+                synthetic.pop("children", None)
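
The aliasing risk can be reproduced in a few lines. The event below is a made-up dict with `args` and `children` fields; the `stream` key is purely illustrative.

```python
import copy

parent_evt = {"UID": 10, "args": {"stream": 0}, "children": [11, 12]}

# Shallow copy: the nested "args" dict is shared with the original event.
shallow = dict(parent_evt)
shallow["args"]["stream"] = 7
after_shallow = parent_evt["args"]["stream"]
print(after_shallow)  # 7, the original event was modified

parent_evt["args"]["stream"] = 0  # reset before the second half

# Deep copy: nested structures are duplicated, so edits stay local.
deep = copy.deepcopy(parent_evt)
deep.pop("children", None)
deep["args"]["stream"] = 7
after_deep = parent_evt["args"]["stream"]
print(after_deep)                # 0, the original event is untouched
print("children" in parent_evt)  # True, pop only affected the copy
```

Note that `copy.deepcopy` also copies the whole `children` list before it is popped; building a minimal dict with only the needed fields, the other option raised in the comment, avoids that cost.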

Copilot · 2026-03-06T00:20:25Z

TraceLens/TreePerf/tree_perf.py

+            parent_to_orphans[parent_uid].append(kernel)
+
+        for parent_uid, kernels in parent_to_orphans.items():
+            if parent_uid is None:


Orphan kernels whose get_parent_event(kernel) is None are currently dropped (if parent_uid is None: continue). That contradicts the goal of ensuring unlinked GPU kernels are included. Consider emitting a synthetic top-level event for these kernels (or attaching them to a dedicated "orphan" bucket) so they still appear in collected and in the unified perf table.

Suggested change

-            if parent_uid is None:
+            if parent_uid is None:
+                # Kernels with no parent at all: emit a synthetic top-level event
+                # so they are still visible in the unified perf table.
+                for kernel in kernels:
+                    synthetic = dict(kernel)
+                    # Treat this kernel as its own "root" unified event, anchored
+                    # by its own UID in gpu_events.
+                    synthetic["gpu_events"] = [kernel["UID"]]
+                    collected.append(synthetic)

Copilot · 2026-03-06T00:20:26Z

TraceLens/TreePerf/tree_perf.py

+        # Collect GPU kernels that have no cpu_op in their parent hierarchy.
+        # These are missed by the cpu_root_nodes traversal above.
+        collected_gpu_uids = set()
+        for evt in collected:
+            collected_gpu_uids.update(evt.get("gpu_events", []))
+
+        orphan_kernels = []
+        for evt in self.tree.events:
+            if self.event_to_category(evt) not in {"kernel", "gpu_memcpy", "gpu_memset"}:
+                continue
+            if evt["UID"] in collected_gpu_uids:
+                continue
+            if not include_nccl and "nccl" in evt.get("name", "").lower():
+                continue
+
+            has_cpu_op = False
+            parent = self.tree.get_parent_event(evt)
+            while parent is not None:
+                if self.event_to_category(parent) == "cpu_op":
+                    has_cpu_op = True
+                    break
+                parent = self.tree.get_parent_event(parent)
+
+            if not has_cpu_op:
+                orphan_kernels.append(evt)
+


This second-pass logic introduces new behavior (including unlinked runtime/kernel paths) but there are no unit tests exercising collect_unified_perf_events() for the "runtime has no cpu_op ancestor" case. Adding a focused test (similar to tests/test_kernel_launchers.py::TestUnlinkedRuntimeEvents) would help prevent regressions and validate NCCL filtering + synthetic event creation.
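
A focused test along these lines might look like the sketch below. It does not import TraceLens: `FakeTree`, `find_orphan_kernels`, the `cat` and `parent` keys, and the sample events are all hypothetical stand-ins, and the helper simply restates the second-pass loop from the diff so the test is self-contained. A real test would call collect_unified_perf_events() on an actual tree instead.

```python
import unittest


class FakeTree:
    """Hypothetical stand-in for the trace tree: events are dicts with a "parent" UID."""

    def __init__(self, events):
        self.events = events
        self._by_uid = {e["UID"]: e for e in events}

    def get_parent_event(self, evt):
        parent_uid = evt.get("parent")
        return None if parent_uid is None else self._by_uid[parent_uid]


GPU_CATEGORIES = {"kernel", "gpu_memcpy", "gpu_memset"}


def find_orphan_kernels(tree, collected_gpu_uids, include_nccl=True):
    """Restates the second-pass loop from the diff, reading the category from "cat"."""
    orphans = []
    for evt in tree.events:
        if evt["cat"] not in GPU_CATEGORIES:
            continue
        if evt["UID"] in collected_gpu_uids:
            continue
        if not include_nccl and "nccl" in evt.get("name", "").lower():
            continue
        # Walk up the parent chain until a cpu_op is found or the chain ends.
        parent = tree.get_parent_event(evt)
        while parent is not None and parent["cat"] != "cpu_op":
            parent = tree.get_parent_event(parent)
        if parent is None:
            orphans.append(evt)
    return orphans


class TestOrphanKernels(unittest.TestCase):
    def setUp(self):
        self.tree = FakeTree([
            {"UID": 1, "cat": "cpu_op", "name": "aten::mm", "parent": None},
            {"UID": 2, "cat": "cuda_runtime", "name": "hipLaunchKernel", "parent": 1},
            {"UID": 3, "cat": "kernel", "name": "gemm", "parent": 2},
            {"UID": 4, "cat": "cuda_runtime", "name": "hipLaunchKernel", "parent": None},
            {"UID": 5, "cat": "kernel", "name": "elementwise", "parent": 4},
            {"UID": 6, "cat": "kernel", "name": "ncclDevKernel_AllReduce", "parent": 4},
        ])

    def test_runtime_without_cpu_op_ancestor(self):
        orphans = find_orphan_kernels(self.tree, collected_gpu_uids=set())
        self.assertEqual([e["UID"] for e in orphans], [5, 6])

    def test_nccl_filtering(self):
        orphans = find_orphan_kernels(self.tree, set(), include_nccl=False)
        self.assertEqual([e["UID"] for e in orphans], [5])

    def test_already_collected_kernels_are_skipped(self):
        orphans = find_orphan_kernels(self.tree, {5}, include_nccl=False)
        self.assertEqual(orphans, [])
```

Saved as a module, this can be run with `python -m unittest`. Kernel 3 is the control case: its runtime parent sits under a cpu_op, so it must never be reported as an orphan.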

Perform a seconf pass in unified table creation to find GPU events wi…

cca739c

…thput cpu_op in the hierarchy

devalshahamd requested review from ajassani and Copilot March 6, 2026 00:15

Copilot started reviewing on behalf of devalshahamd March 6, 2026 00:16

Copilot AI reviewed Mar 6, 2026

devalshahamd wants to merge 1 commit into main from feat/unified_perf_fix


3 participants
