@pmilovanov

Bug Description

The aiohttp transport in litellm does not properly propagate timeout parameters, resulting in ClientTimeout being created with all None values. This allows requests to hang indefinitely during SSL write operations.

Impact

🚨 Critical Production Issue:

  • Dataflow jobs hung for 12+ minutes per request during SSL operations
  • No timeout enforcement despite passing timeout=60
  • Complete job failures due to worker timeouts
  • Affects all Vertex AI/Gemini users through litellm

Root Cause

In litellm/llms/custom_httpx/aiohttp_transport.py:261:

async def handle_async_request(self, request: httpx.Request) -> httpx.Response:
    timeout = request.extensions.get("timeout", {})  # ← Returns empty dict {}!
    
    # Later creates ClientTimeout with None values:
    timeout=ClientTimeout(
        sock_connect=timeout.get("connect"),  # None
        sock_read=timeout.get("read"),        # None  
        connect=timeout.get("pool"),          # None
    )

The issue: request.extensions.get("timeout", {}) returns {} instead of the timeout configuration, so every subsequent timeout.get(...) lookup yields None and aiohttp applies no limit.
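The lookup pattern can be isolated without any network code. The sketch below is illustrative only: `TimeoutStandIn` and `build_timeout` are hypothetical names, and the dataclass merely imitates the three `ClientTimeout` fields used in the excerpt (in aiohttp, a `None` field disables that timeout check).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutStandIn:
    """Hypothetical stand-in for aiohttp.ClientTimeout; None means 'no limit'."""
    sock_connect: Optional[float] = None
    sock_read: Optional[float] = None
    connect: Optional[float] = None


def build_timeout(extensions: dict) -> TimeoutStandIn:
    # Same lookup pattern as the excerpt: a missing key falls back to {}
    timeout = extensions.get("timeout", {})
    return TimeoutStandIn(
        sock_connect=timeout.get("connect"),
        sock_read=timeout.get("read"),
        connect=timeout.get("pool"),
    )


# No "timeout" extension at all: every field silently becomes None
missing = build_timeout({})

# A populated extension carries its values through
populated = build_timeout({"timeout": {"connect": 5.0, "read": 60.0, "pool": 5.0}})
```

With an empty extensions mapping, no error is raised and all three fields end up as `None`, which is why the failure is silent.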

Evidence from Production

Stack trace showing a 717-second (roughly 12-minute) hang:

File "litellm/llms/custom_httpx/aiohttp_transport.py", line 207, in handle_async_request
    response = await client_session.request(
...
File "/usr/local/lib/python3.11/ssl.py", line 930, in write
    return self._sslobj.write(data)

Full details in Dataflow job: 2025-10-16_14_44_34-5867955337894223011

Reproduction

This PR includes:

  1. reproduce_timeout_bug.py - Demonstrates the bug with diagnostic logging
  2. demonstrate_fix.py - Shows the workaround
  3. TIMEOUT_BUG_REPRODUCTION.md - Complete documentation
  4. Diagnostic logging in aiohttp_transport.py to prove the issue

To reproduce:

export VERTEXAI_PROJECT=your-gcp-project
pip install -e .
python reproduce_timeout_bug.py

You will see:

[TIMEOUT DEBUG] timeout dict: {}
[TIMEOUT DEBUG] ClientTimeout values: {'sock_connect': None, 'sock_read': None, 'connect': None}

This proves that despite passing timeout=30, aiohttp receives no timeout.

Current Workaround

import litellm
litellm.disable_aiohttp_transport = True  # Use httpx native transport instead

This forces litellm to use httpx's native transport, which correctly propagates timeouts.
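As a complementary guard that is not part of this PR, callers could also wrap each request in an outer deadline with `asyncio.wait_for`. The sketch below is a minimal illustration using only the standard library; `call_with_deadline` and `slow_request` are hypothetical names, and the sleeping coroutine stands in for a hung request. It only helps when the event loop is still running: a call that blocks the loop thread itself cannot be cancelled this way.

```python
import asyncio


async def call_with_deadline(coro, deadline_s: float):
    """Cancel `coro` if it has not finished within `deadline_s` seconds."""
    return await asyncio.wait_for(coro, timeout=deadline_s)


async def slow_request() -> str:
    # Hypothetical stand-in for a request that never completes
    await asyncio.sleep(3600)
    return "done"


async def main() -> str:
    try:
        return await call_with_deadline(slow_request(), deadline_s=0.05)
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising
        return "timed out"


result = asyncio.run(main())
```

An outer deadline of this kind bounds the total time per call regardless of which transport is in use, at the cost of not distinguishing connect, read, and pool phases.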

Proposed Fix

The aiohttp transport needs to handle different timeout formats:

DEFAULT_TIMEOUT = 60  # seconds; the default value is open to maintainer input

async def handle_async_request(self, request: httpx.Request) -> httpx.Response:
    timeout_ext = request.extensions.get("timeout")

    # Normalize the extension into the dict shape this transport reads
    if isinstance(timeout_ext, httpx.Timeout):
        timeout = {
            "connect": timeout_ext.connect,
            "read": timeout_ext.read,
            "pool": timeout_ext.pool,
        }
    elif isinstance(timeout_ext, (int, float)):
        timeout = {"connect": timeout_ext, "read": timeout_ext, "pool": timeout_ext}
    else:
        # Missing or empty extension: the defaults below fill every field
        timeout = timeout_ext or {}

    def pick(key):
        # Use the default for missing *and* explicit None values, so a dict
        # such as {"connect": None} cannot silently disable the timeout
        value = timeout.get(key)
        return DEFAULT_TIMEOUT if value is None else value

    client_timeout = ClientTimeout(
        sock_connect=pick("connect"),
        sock_read=pick("read"),
        connect=pick("pool"),
    )
    # ... then pass timeout=client_timeout to client_session.request(...)
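The normalization step can be exercised on its own, without httpx or aiohttp installed. The following is a sketch under stated assumptions, not the transport's actual code: `normalize_timeout`, `KEYS`, and `DEFAULT_TIMEOUT_S` are hypothetical names, the 60-second default simply mirrors the value proposed above, and any object exposing `connect`, `read`, and `pool` attributes is treated like an `httpx.Timeout`.

```python
from types import SimpleNamespace

DEFAULT_TIMEOUT_S = 60.0  # assumed default, mirrors the value proposed above
KEYS = ("connect", "read", "pool")


def normalize_timeout(ext, default=DEFAULT_TIMEOUT_S):
    """Return a dict with a concrete number for every key in KEYS."""
    if isinstance(ext, (int, float)):
        # A single number applies to every phase
        raw = {key: ext for key in KEYS}
    elif isinstance(ext, dict):
        raw = {key: ext.get(key) for key in KEYS}
    else:
        # Covers None and Timeout-like objects with .connect/.read/.pool
        raw = {key: getattr(ext, key, None) for key in KEYS}
    # Missing and explicit None values both fall back to the default
    return {key: float(default if value is None else value)
            for key, value in raw.items()}


from_nothing = normalize_timeout(None)
from_number = normalize_timeout(30)
from_partial = normalize_timeout({"connect": None, "read": 10})
from_object = normalize_timeout(SimpleNamespace(connect=1, read=2, pool=None))
```

Keeping the normalization in a pure function like this would make the eventual unit tests straightforward, since each input shape can be checked without opening a connection.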

Related Issues

Checklist

  • Bug demonstrated with reproduction script
  • Production evidence documented
  • Workaround provided
  • Fix proposed
  • Unit tests (awaiting maintainer feedback on approach)

Questions for Maintainers

  1. Should httpx.Timeout objects be converted to the dict format the aiohttp transport reads?
  2. What should the default timeout values be when not specified?
  3. Is there a reason request.extensions["timeout"] comes through as empty dict?

This is a critical bug causing production failures. Happy to refine the fix based on your guidance.

This demonstrates a critical bug where litellm's aiohttp transport
creates ClientTimeout with all None values, allowing indefinite hangs
during SSL operations.

Added:
- reproduce_timeout_bug.py: Shows timeout dict is empty {}
- demonstrate_fix.py: Shows workaround using httpx transport
- TIMEOUT_BUG_REPRODUCTION.md: Full documentation with evidence
- Diagnostic logging in aiohttp_transport.py to prove the bug

The bug occurs because request.extensions.get('timeout', {}) returns
empty dict instead of timeout configuration, resulting in:
  ClientTimeout(sock_connect=None, sock_read=None, connect=None)

Impact: Production Dataflow jobs hung for 12+ minutes per request.

Evidence from production stack trace included in documentation.
@vercel

vercel bot commented Oct 17, 2025

@pmilovanov is attempting to deploy a commit to the CLERKIEAI Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@pmilovanov pmilovanov marked this pull request as draft October 17, 2025 16:50
@pmilovanov pmilovanov closed this Oct 24, 2025