dshivashankar1994

Description

Implement post-fork reinitialization of threading locks in the metrics measurement consumer to prevent deadlocks and data duplication in forked child processes.

This change adds fork-safety mechanisms to SynchronousMeasurementConsumer by:

  • Registering fork callbacks using os.register_at_fork() to detect process forks
  • Reinitializing threading locks in child processes after fork
  • Implementing lazy storage reinitialization to prevent data duplication
  • Clearing stale async instrument references

This addresses the deadlock reported in Flask/Gunicorn applications with gevent workers: threads got stuck trying to acquire locks that were held at fork time, which caused request timeouts and memory leaks.

Closes #4345

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

The fork-safety implementation has been tested with:

  • ProcessPoolExecutor integration: Tested with concurrent.futures.ProcessPoolExecutor to ensure no deadlocks
  • Backward compatibility: Ensured single-process applications remain unaffected

Does This PR Require a Contrib Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

Technical Implementation Details

Root Cause Analysis:
After fork(), the child process receives a copy of the parent's memory, including any locks that were held at fork time, but only the thread that called fork() continues to run. A lock held by a different thread in the parent therefore stays locked in the child, with no owner left to release it. In gevent environments, this caused threads to wait indefinitely on such locks, as shown in the stack trace from issue #4345.
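The failure mode can be reproduced outside OpenTelemetry with plain threads. The sketch below is illustrative only and is not taken from the PR: the function name `child_can_acquire_inherited_lock` and its `timeout` parameter are hypothetical, and it assumes a Unix platform where `os.fork` is available. It uses ordinary threads rather than gevent, so it shows the general mechanism, not the exact trace from the issue. On Python 3.12+ the fork may emit a DeprecationWarning because the process is multi-threaded.

```python
import os
import threading


def child_can_acquire_inherited_lock(timeout=0.5):
    """Fork while another thread holds a lock; report whether the child can take it."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        # Hold the lock until the parent tells us to let go.
        with lock:
            held.set()
            release.wait()

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()  # the lock is definitely held from here on

    pid = os.fork()
    if pid == 0:
        # Child: the holder thread was not copied, but the lock is still
        # marked as locked, so nobody will ever release it here.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    # Parent: collect the child's verdict, then clean up the holder thread.
    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WEXITSTATUS(status) == 0
```

Calling `child_can_acquire_inherited_lock()` returns `False`: the child times out instead of acquiring the lock. Without the timeout, the child would block forever, which is the deadlock described above.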

Solution Approach:

  1. Fork Detection: Uses os.register_at_fork(after_in_child=...) to register cleanup callbacks
  2. Lock Reinitialization: Calls _at_fork_reinit() (a private CPython method, available since Python 3.9) on threading.Lock instances to reset them to the unlocked state
  3. Lazy Storage Cleanup: Implements _needs_storage_reinit flag to defer expensive operations until first use
  4. Data Integrity: Clears _instrument_view_instrument_matches cache to prevent duplicate metrics
  5. Async Cleanup: Resets async instruments list to avoid stale references
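The five steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `ForkSafeConsumer` is a hypothetical stand-in for SynchronousMeasurementConsumer, the attribute names mirror the ones mentioned in this description, and the storage is reduced to a plain dict. The weak reference and the fallback to a fresh `threading.Lock()` when `_at_fork_reinit` is missing are choices made for this sketch.

```python
import os
import threading
import weakref


class ForkSafeConsumer:
    """Hypothetical stand-in for SynchronousMeasurementConsumer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._needs_storage_reinit = False
        self._instrument_view_instrument_matches = {}
        self._async_instruments = []
        # Step 1: register the child-side hook where the platform supports it.
        if hasattr(os, "register_at_fork"):
            # Fork hooks cannot be unregistered, so hold the consumer weakly
            # to avoid keeping it alive for the life of the process.
            weak_reinit = weakref.WeakMethod(self._after_fork_in_child)

            def _hook():
                method = weak_reinit()
                if method is not None:
                    method()

            os.register_at_fork(after_in_child=_hook)

    def _after_fork_in_child(self):
        # Step 2: reset the lock; fall back to a new lock on older Pythons.
        if hasattr(self._lock, "_at_fork_reinit"):
            self._lock._at_fork_reinit()
        else:
            self._lock = threading.Lock()
        # Step 5: drop references to the parent's async instruments.
        self._async_instruments = []
        # Step 3: only set a flag here; the real cleanup happens on first use.
        self._needs_storage_reinit = True

    def _ensure_storage(self):
        # Step 4: clear inherited matches so the child does not re-export
        # data points that the parent already recorded.
        if self._needs_storage_reinit:
            self._instrument_view_instrument_matches.clear()
            self._needs_storage_reinit = False

    def consume_measurement(self, instrument_name, value):
        with self._lock:
            self._ensure_storage()
            self._instrument_view_instrument_matches.setdefault(
                instrument_name, []
            ).append(value)
```

The hook can also be exercised without forking by calling `_after_fork_in_child()` directly on an instance whose lock is held, which makes the behavior straightforward to unit test.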

Performance Considerations:

  • Only registers fork handler if os.register_at_fork exists (Python 3.7+)
  • Uses lazy reinitialization to minimize fork overhead
  • Gracefully handles exceptions during reinitialization
  • No behavioral change for single-process applications, since the callback only runs in a forked child
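One way to get the guarded registration and the exception handling listed above is to wrap the reinitialization callback before registering it. The helper names `make_safe_hook` and `register_child_hook` below are hypothetical and are not part of the PR or of the OpenTelemetry SDK; this is a sketch of the pattern only.

```python
import logging
import os

_logger = logging.getLogger(__name__)


def make_safe_hook(reinit):
    """Wrap a reinit callable so that its failures are logged, not raised."""

    def _hook():
        try:
            reinit()
        except Exception:
            # A failed cleanup should not break the freshly forked child.
            _logger.exception("post-fork reinitialization failed")

    return _hook


def register_child_hook(reinit):
    """Register reinit to run in forked children; return False if unsupported."""
    if not hasattr(os, "register_at_fork"):
        return False
    os.register_at_fork(after_in_child=make_safe_hook(reinit))
    return True
```

Returning a boolean lets the caller know whether fork handling is active on the current platform without having to repeat the `hasattr` check.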

This fix is intended to make OpenTelemetry metrics work reliably under pre-fork server models, resolving the deadlock that caused request timeouts and memory leaks in Flask/Gunicorn deployments.

@dshivashankar1994 dshivashankar1994 requested a review from a team as a code owner October 9, 2025 12:10

CLA Not Signed

@xrmx xrmx moved this to Ready for review in @xrmx's Python PR digest Oct 9, 2025
@xrmx xrmx moved this from Ready for review to Reviewed PRs that need fixes in @xrmx's Python PR digest Oct 9, 2025
xrmx (Contributor) commented Oct 9, 2025

@dshivashankar1994 Thanks for the PR but you need to sign the CLA in order to contribute to OpenTelemetry.

Development

Successfully merging this pull request may close these issues.

Deadlock accessing metric reader storage when running under Flask/Gunicorn with gevent workers