
Prevent dynamic kernel construction during pending async work#4443

Open
shreyaskommuri wants to merge 3 commits into NVIDIA:main from shreyaskommuri:fix-make-kernel-async-guard

Conversation

@shreyaskommuri shreyaskommuri commented May 3, 2026

Summary

Fixes the minimum failure mode from #4359 by guarding `cudaq.make_kernel()` while async `sample_async` or `observe_async` work is pending. Instead of allowing unsafe concurrent access to the process-wide MLIR context and risking a segfault, CUDA-Q now raises an actionable `RuntimeError`.

This does not make dynamic kernel construction fully thread-safe; it documents and enforces the current constraint.

Changes

`python/cudaq/kernel/utils.py`

  • Adds `_active_async_work_count` (protected by `threading.Lock`) and three helpers: `register_async_work`, `unregister_async_work`, `check_no_active_async_work`.

`python/cudaq/kernel/kernel_builder.py`

  • `make_kernel()` calls `check_no_active_async_work()` before touching the global MLIR context.
  • Adds a constraint note to the docstring.

`python/cudaq/runtime/sample.py`

  • `AsyncSampleResult` now accepts `async_work_registered` to avoid double-counting when constructed from `sample_async`.
  • `get()` unregisters async work in a `finally` block (correct even when `impl.get()` throws).
  • `__del__` unregisters async work when the result is discarded without calling `get()`, preventing the counter from leaking and permanently blocking `make_kernel()`.
  • `sample_async` registers before the async launch and unregisters on exception.

`python/cudaq/runtime/observe.py`

  • Adds `_AsyncObserveResult` wrapping the C++ future with the same register/unregister lifecycle as `AsyncSampleResult`, including `__del__`.
  • `observe_async` now returns `_AsyncObserveResult` instead of the raw C++ future.

`docs/sphinx/using/examples/multi_gpu_workflows.rst`

  • Adds a `.. note::` documenting the build-before-dispatch constraint for `make_kernel`.
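
The note might read along these lines (wording is illustrative, not the exact text added to `multi_gpu_workflows.rst`):

```rst
.. note::

   Dynamic kernels share a process-wide MLIR context. Build all kernels
   with ``cudaq.make_kernel()`` before dispatching ``sample_async`` or
   ``observe_async``; constructing a kernel while async work is pending
   raises a ``RuntimeError``.
```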

Testing

  • `python3 -m py_compile` across all changed Python files — passes.
  • `git diff --check` — no whitespace errors.
  • Two new unit tests in `python/tests/builder/test_kernel_builder.py`:
    • `test_make_kernel_rejects_pending_async_work` — verifies `make_kernel` raises while an `AsyncSampleResult` is live, then succeeds after `get()`.
    • `test_make_kernel_allowed_after_gc_of_async_result` — verifies `make_kernel` succeeds after the result is dropped (`del`) without calling `get()`, exercising the `__del__` path.

Not tested

  • Targeted pytest run, because the local Python 3.14 environment does not have `pytest` installed.
  • GPU/mqpu reproducer, because this host is macOS CPU-only.

Guard cudaq.make_kernel while async sample or observe futures may still access the process-wide MLIR context, so users get an actionable error instead of a segfault.

Constraint: Python dynamic kernels share a process-wide MLIR context that is unsafe to mutate while async execution is pending.

Rejected: making MLIR kernel construction fully thread-safe in this patch; that is too broad for the issue's minimum accepted behavior and likely crosses C++/MLIR ownership boundaries.

Confidence: medium

Scope-risk: moderate

Directive: Keep dynamic kernel construction outside async dispatch loops unless the MLIR context ownership model is made thread-safe.

Tested: python3 -m py_compile python/cudaq/kernel/utils.py python/cudaq/kernel/kernel_builder.py python/cudaq/runtime/sample.py python/cudaq/runtime/observe.py python/tests/builder/test_kernel_builder.py; git diff --check

Not-tested: pytest target because local Python 3.14 environment does not have pytest installed; GPU/mqpu reproducer not run on this macOS CPU-only host.

Co-authored-by: OmX <omx@oh-my-codex.dev>
copy-pr-bot Bot commented May 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shreyaskommuri (Author)

Follow-up fixes on top of the original guard

After reviewing the initial implementation, I found two correctness bugs and one missing test case. These are now fixed in the same commit.


Bug 1 — `AsyncSampleResult.__del__` leaked the async-work count

Problem: If a caller discards an `AsyncSampleResult` without ever calling `.get()` (e.g. fire-and-forget, or an exception skips the call), `_active_async_work_count` stays > 0 permanently. Every subsequent `cudaq.make_kernel()` call in that process raises `RuntimeError`, which is worse than the segfault the guard was meant to prevent.

Fix: `__del__` now calls `unregister_async_work()` when `_async_work_registered` is still `True`. `unregister_async_work` already guards against going below zero, so a double-decrement is impossible.


Bug 2 — `_AsyncObserveResult` had no `__del__`

Problem: Same permanent-blockage issue on the observe path. The new `_AsyncObserveResult` class had `get()` correctly unregistering via `finally`, but nothing handled the GC path.

Fix: added `__del__` with the same pattern as the sample fix.


New test — GC path

Added `test_make_kernel_allowed_after_gc_of_async_result` to `test_kernel_builder.py`. The existing `test_make_kernel_rejects_pending_async_work` only exercised the `.get()` path; this new test uses `del future` to exercise `__del__` directly.

AsyncSampleResult.__del__ and the new _AsyncObserveResult.__del__ now
call unregister_async_work() when _async_work_registered is still True,
covering the case where the caller drops the future without calling
get().  Without this, _active_async_work_count stays > 0 permanently
and every subsequent cudaq.make_kernel() raises RuntimeError.

Also adds test_make_kernel_allowed_after_gc_of_async_result to exercise
the __del__ path directly, and removes the redundant blank line before
the literalinclude directive in multi_gpu_workflows.rst.

Signed-off-by: shreyaskommuri <shreyaskommuri@gmail.com>
