Skip to content

Conversation

ryanbaumann
Copy link

@ryanbaumann ryanbaumann commented Oct 17, 2025

[Work in progress, still evaluating root cause and test cases]

Summary

Fixes critical bug causing RuntimeError: Attempted to exit cancel scope in a different task when making sequential SSE requests in production (GCP Agent Engine). First request succeeds, subsequent requests fail with 75% failure rate.

Root Cause

Manual cancel_scope.cancel() at line 145 violates anyio task lifecycle. In production environments with concurrent request handling, cleanup executes in different task context than setup.

Solution

Remove manual cancel - let anyio's TaskGroup.aexit handle cleanup automatically. This follows anyio best practices and prevents the lifecycle violation.

Testing

  • Production: Reproduced on GCP Agent Engine (75% failure rate)
  • Local: Attempted to add a test case but difficult since couldn't reproduce the error locally.
  • Note: Bug is environment-dependent - manifests in production concurrent environments but dormant in simple local tests

Changes

  • Removed tg.cancel_scope.cancel() from src/mcp/client/sse.py:145
  • No API changes, fully backward compatible

Complete investigation: https://github.com/chalosalvador/google-adk-mcp-tools cc/ @chalosalvador

…uests in production

**Context:**
Discovered via Google ADK deployments to GCP Agent Engine. Sequential MCP server requests fail with RuntimeError after first request succeeds.

**Problem:**
Manual cancel_scope.cancel() at line 145 violates anyio task lifecycle. In production with concurrent request handling (ADK on Agent Engine), cleanup executes in different task context than setup, triggering 'Attempted to exit cancel scope in a different task' RuntimeError.

First request: succeeds (same task)
Subsequent requests: fail (different task context)
Failure rate: 75% in production

**Fix:**
Remove manual cancel. Let anyio TaskGroup.__aexit__ handle cleanup - this is the framework's job, not ours.

**Impact:**
- Unblocks ADK deployments to GCP Agent Engine
- Zero API changes
- Backward compatible

**Testing:**
Reproduced in GCP Agent Engine, validated locally with control environments.

**TODO:**
Need test case covering concurrent request patterns to prevent regression.

Reference: https://github.com/chalosalvador/google-adk-mcp-tools
…pendent bug

Adds regression test for the cancel_scope bug fixed in previous commit.

**Test Strategy:**
Documents expected behavior (sequential connections should work) with
detailed comments explaining why the bug doesn't manifest locally.

**Environment Dependency:**
- Production (GCP Agent Engine): bug manifests, 75% failure rate
- Local tests: bug dormant, both buggy and fixed code pass
- Reason: Concurrent request handling creates varying task contexts

**Test Value:**
- Regression protection against reintroducing manual cancel
- Documents expected behavior for sequential connections
- Explains testing limitations clearly
- Provides foundation for future concurrent testing infrastructure

Reference: https://github.com/chalosalvador/google-adk-mcp-tools
This was internal documentation not meant for upstream PR.
@ryanbaumann ryanbaumann changed the title fix(sse): Remove manual cancel_scope.cancel() to prevent task lifecycle violation [WIP] fix(sse): Remove manual cancel_scope.cancel() to prevent task lifecycle violation Oct 18, 2025
Improved the fix for the cancel scope lifecycle violation by adding
graceful handling of ClosedResourceError. When streams close during
shutdown, the error handler may attempt to send exceptions to already
closed streams, which should be ignored as this is expected behavior.

This maintains the original fix (removing manual tg.cancel_scope.cancel())
while handling the race condition it exposed in error cleanup.
The test_sse_sequential_connections.py file doesn't effectively test the
bug because it's environment-dependent (only manifests in GCP Agent Engine
with concurrent request handling, not in simple sequential pytest execution).

The existing test suite (663 tests) validates the fix works correctly.
The bug and fix are thoroughly documented in code comments in sse.py.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant