Skip to content

Perform a second pass in unified table creation to find GPU events wi…#511

Open
devalshahamd wants to merge 1 commit intomainfrom
feat/unified_perf_fix
Open

Perform a second pass in unified table creation to find GPU events wi…#511
devalshahamd wants to merge 1 commit intomainfrom
feat/unified_perf_fix

Conversation

@devalshahamd
Copy link
Contributor

This pull request enhances the logic in the traverse method of tree_perf.py to ensure GPU kernel events that are not associated with any CPU operation in their parent hierarchy are properly included in the collected events. The update identifies "orphan" GPU kernels, groups them by their immediate parent event, and creates synthetic parent events to represent these relationships.

Improvements to GPU kernel event handling:

  • Added logic to identify GPU kernel events (including kernel, gpu_memcpy, gpu_memset) that do not have a CPU operation in their parent hierarchy, ensuring these "orphan" kernels are not missed in the collection process.
  • Grouped orphan GPU kernels by their immediate parent event and generated synthetic parent events to accurately represent the relationship in the collected events.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Enhances unified perf event collection in TreePerfAnalyzer.collect_unified_perf_events() to ensure GPU kernels that aren’t associated with any cpu_op ancestor (“orphan” kernels) are still surfaced for unified perf analysis.

Changes:

  • Adds a second pass after cpu_root_nodes traversal to detect GPU kernels not captured by CPU-op-based traversal.
  • Filters out already-collected GPU event UIDs (and optionally NCCL kernels) and identifies kernels without any cpu_op in their parent chain.
  • Groups orphan kernels by immediate parent event and appends synthetic collected events to represent them.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

continue
parent_evt = self.tree.get_UID2event(parent_uid)
for kernel in kernels:
synthetic = dict(parent_evt)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synthetic = dict(parent_evt) copies the parent event's UID, so if a runtime parent has multiple orphan kernels this will append multiple collected rows with the same UID. This makes UID-based identification/aggregation ambiguous in downstream tables. Consider generating a unique UID for synthetic events (e.g., based on the kernel UID) and/or adding an explicit flag/field indicating the row is synthetic.

Suggested change
synthetic = dict(parent_evt)
synthetic = dict(parent_evt)
# Assign a unique UID for the synthetic event to avoid clashes
synthetic["UID"] = f"{parent_uid}_synthetic_{kernel['UID']}"
# Explicitly mark this row as synthetic for downstream consumers
synthetic["is_synthetic"] = True

Copilot uses AI. Check for mistakes.
continue
parent_evt = self.tree.get_UID2event(parent_uid)
for kernel in kernels:
synthetic = dict(parent_evt)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict(parent_evt) is a shallow copy, so mutable fields like args and children will be shared between parent_evt and synthetic. If any later code mutates those nested structures (even just for formatting/enrichment), it can accidentally affect the original tree event. Prefer constructing a minimal synthetic dict with only the required fields, or use a deep copy and then strip fields that shouldn't carry over (e.g., children).

Suggested change
synthetic = dict(parent_evt)
# Create a deep copy to avoid sharing mutable nested structures (e.g., args).
synthetic = copy.deepcopy(parent_evt)
# Synthetic events should not carry over the original children hierarchy.
synthetic.pop("children", None)

Copilot uses AI. Check for mistakes.
parent_to_orphans[parent_uid].append(kernel)

for parent_uid, kernels in parent_to_orphans.items():
if parent_uid is None:
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orphan kernels whose get_parent_event(kernel) is None are currently dropped (if parent_uid is None: continue). That contradicts the goal of ensuring unlinked GPU kernels are included. Consider emitting a synthetic top-level event for these kernels (or attaching them to a dedicated "orphan" bucket) so they still appear in collected/the unified perf table.

Suggested change
if parent_uid is None:
if parent_uid is None:
# Kernels with no parent at all: emit a synthetic top-level event
# so they are still visible in the unified perf table.
for kernel in kernels:
synthetic = dict(kernel)
# Treat this kernel as its own "root" unified event, anchored
# by its own UID in gpu_events.
synthetic["gpu_events"] = [kernel["UID"]]
collected.append(synthetic)

Copilot uses AI. Check for mistakes.
Comment on lines +1481 to +1506
# Collect GPU kernels that have no cpu_op in their parent hierarchy.
# These are missed by the cpu_root_nodes traversal above.
collected_gpu_uids = set()
for evt in collected:
collected_gpu_uids.update(evt.get("gpu_events", []))

orphan_kernels = []
for evt in self.tree.events:
if self.event_to_category(evt) not in {"kernel", "gpu_memcpy", "gpu_memset"}:
continue
if evt["UID"] in collected_gpu_uids:
continue
if not include_nccl and "nccl" in evt.get("name", "").lower():
continue

has_cpu_op = False
parent = self.tree.get_parent_event(evt)
while parent is not None:
if self.event_to_category(parent) == "cpu_op":
has_cpu_op = True
break
parent = self.tree.get_parent_event(parent)

if not has_cpu_op:
orphan_kernels.append(evt)

Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This second-pass logic introduces new behavior (including unlinked runtime/kernel paths) but there are no unit tests exercising collect_unified_perf_events() for the "runtime has no cpu_op ancestor" case. Adding a focused test (similar to tests/test_kernel_launchers.py::TestUnlinkedRuntimeEvents) would help prevent regressions and validate NCCL filtering + synthetic event creation.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants