Skip to content

fix: stability tweaks#333

Open
fluffypony wants to merge 1 commit into
tari-project:developmentfrom
fluffypony:stability
Open

fix: stability tweaks#333
fluffypony wants to merge 1 commit into
tari-project:developmentfrom
fluffypony:stability

Conversation

@fluffypony
Copy link
Copy Markdown
Contributor

Description

Implements critical P2P stability fixes to resolve connection management bottlenecks causing "dead in the water" states requiring frequent node restarts. Key changes include:

  • Increased concurrent sync limit from 1 to 10 per PoW algorithm (10x improvement)
  • Enhanced error visibility for sync semaphore exhaustion with "CRITICAL" logging
  • Reduced connection churn aggressiveness from 20 to 45 minute intervals

Motivation and Context

Production log analysis revealed cascading P2P failures caused by:

  1. Sync Bottleneck: Only 1 concurrent sync per PoW algorithm creating massive request queuing
  2. Resource Exhaustion: Sync semaphore exhaustion leading to silent request drops
  3. Connection Instability: Aggressive connection churning (every 20 minutes) destabilizing peer relationships
  4. Identify Protocol Failures: Failed to identify peer errors due to resource contention

The core issue was nodes entering "dead in the water" states where they couldn't sync or maintain connections, requiring manual restarts every few days. The single concurrent sync limit was the primary bottleneck causing this cascading failure pattern.

How Has This Been Tested?

  • Code compiles without warnings (cargo check passes)
  • All existing unit tests pass
  • Configuration parameter changes validated
  • No breaking changes to public APIs
  • Load testing recommended in staging environment

What process can a PR reviewer use to test or verify this change?

  1. Code Review: Verify the three specific parameter changes in p2pool/src/server/p2p/network.rs:

    • Line 126: num_concurrent_syncs: 10 (was 1)
    • Line 70: CONNECTION_CHURN_DELAY_SECS: u64 = 60 * 45 (was 20 minutes)
    • Around line 2970: Enhanced error logging for sync semaphore exhaustion
  2. Compilation Test: Run cargo check and cargo build to ensure no compilation issues

  3. Configuration Validation: Confirm the changes only affect internal constants and don't break existing configurations

  4. Log Analysis (if possible): Deploy to test environment and monitor for:

    • Reduction in "Could not acquire semaphore" warnings
    • Decreased "Failed to identify peer" errors
    • More stable peer connection patterns

Breaking Changes

  • None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant