Perform a second pass in unified table creation to find GPU events wi…#511
Perform a second pass in unified table creation to find GPU events wi…#511devalshahamd wants to merge 1 commit intomainfrom
Conversation
…thput cpu_op in the hierarchy
There was a problem hiding this comment.
Pull request overview
Enhances unified perf event collection in TreePerfAnalyzer.collect_unified_perf_events() to ensure GPU kernels that aren’t associated with any cpu_op ancestor (“orphan” kernels) are still surfaced for unified perf analysis.
Changes:
- Adds a second pass after
cpu_root_nodestraversal to detect GPU kernels not captured by CPU-op-based traversal. - Filters out already-collected GPU event UIDs (and optionally NCCL kernels) and identifies kernels without any
cpu_opin their parent chain. - Groups orphan kernels by immediate parent event and appends synthetic collected events to represent them.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| continue | ||
| parent_evt = self.tree.get_UID2event(parent_uid) | ||
| for kernel in kernels: | ||
| synthetic = dict(parent_evt) |
There was a problem hiding this comment.
synthetic = dict(parent_evt) copies the parent event's UID, so if a runtime parent has multiple orphan kernels this will append multiple collected rows with the same UID. This makes UID-based identification/aggregation ambiguous in downstream tables. Consider generating a unique UID for synthetic events (e.g., based on the kernel UID) and/or adding an explicit flag/field indicating the row is synthetic.
| synthetic = dict(parent_evt) | |
| synthetic = dict(parent_evt) | |
| # Assign a unique UID for the synthetic event to avoid clashes | |
| synthetic["UID"] = f"{parent_uid}_synthetic_{kernel['UID']}" | |
| # Explicitly mark this row as synthetic for downstream consumers | |
| synthetic["is_synthetic"] = True |
| continue | ||
| parent_evt = self.tree.get_UID2event(parent_uid) | ||
| for kernel in kernels: | ||
| synthetic = dict(parent_evt) |
There was a problem hiding this comment.
dict(parent_evt) is a shallow copy, so mutable fields like args and children will be shared between parent_evt and synthetic. If any later code mutates those nested structures (even just for formatting/enrichment), it can accidentally affect the original tree event. Prefer constructing a minimal synthetic dict with only the required fields, or use a deep copy and then strip fields that shouldn't carry over (e.g., children).
| synthetic = dict(parent_evt) | |
| # Create a deep copy to avoid sharing mutable nested structures (e.g., args). | |
| synthetic = copy.deepcopy(parent_evt) | |
| # Synthetic events should not carry over the original children hierarchy. | |
| synthetic.pop("children", None) |
| parent_to_orphans[parent_uid].append(kernel) | ||
|
|
||
| for parent_uid, kernels in parent_to_orphans.items(): | ||
| if parent_uid is None: |
There was a problem hiding this comment.
Orphan kernels whose get_parent_event(kernel) is None are currently dropped (if parent_uid is None: continue). That contradicts the goal of ensuring unlinked GPU kernels are included. Consider emitting a synthetic top-level event for these kernels (or attaching them to a dedicated "orphan" bucket) so they still appear in collected/the unified perf table.
| if parent_uid is None: | |
| if parent_uid is None: | |
| # Kernels with no parent at all: emit a synthetic top-level event | |
| # so they are still visible in the unified perf table. | |
| for kernel in kernels: | |
| synthetic = dict(kernel) | |
| # Treat this kernel as its own "root" unified event, anchored | |
| # by its own UID in gpu_events. | |
| synthetic["gpu_events"] = [kernel["UID"]] | |
| collected.append(synthetic) |
| # Collect GPU kernels that have no cpu_op in their parent hierarchy. | ||
| # These are missed by the cpu_root_nodes traversal above. | ||
| collected_gpu_uids = set() | ||
| for evt in collected: | ||
| collected_gpu_uids.update(evt.get("gpu_events", [])) | ||
|
|
||
| orphan_kernels = [] | ||
| for evt in self.tree.events: | ||
| if self.event_to_category(evt) not in {"kernel", "gpu_memcpy", "gpu_memset"}: | ||
| continue | ||
| if evt["UID"] in collected_gpu_uids: | ||
| continue | ||
| if not include_nccl and "nccl" in evt.get("name", "").lower(): | ||
| continue | ||
|
|
||
| has_cpu_op = False | ||
| parent = self.tree.get_parent_event(evt) | ||
| while parent is not None: | ||
| if self.event_to_category(parent) == "cpu_op": | ||
| has_cpu_op = True | ||
| break | ||
| parent = self.tree.get_parent_event(parent) | ||
|
|
||
| if not has_cpu_op: | ||
| orphan_kernels.append(evt) | ||
|
|
There was a problem hiding this comment.
This second-pass logic introduces new behavior (including unlinked runtime/kernel paths) but there are no unit tests exercising collect_unified_perf_events() for the "runtime has no cpu_op ancestor" case. Adding a focused test (similar to tests/test_kernel_launchers.py::TestUnlinkedRuntimeEvents) would help prevent regressions and validate NCCL filtering + synthetic event creation.
This pull request enhances the logic in the
traversemethod oftree_perf.pyto ensure GPU kernel events that are not associated with any CPU operation in their parent hierarchy are properly included in the collected events. The update identifies "orphan" GPU kernels, groups them by their immediate parent event, and creates synthetic parent events to represent these relationships.Improvements to GPU kernel event handling:
kernel,gpu_memcpy,gpu_memset) that do not have a CPU operation in their parent hierarchy, ensuring these "orphan" kernels are not missed in the collection process.