@ggivo (Contributor) commented Dec 1, 2025

[automatic-failover] Integrate health checks with probing policies and retry logic

Summary

This PR integrates health check functionality into the automatic failover feature, providing endpoint health monitoring with configurable probing policies and retry logic. The implementation is ported from Jedis and adapted to Lettuce's architecture.

Key Features

Health Check System

  • HealthCheckImpl: Core implementation with periodic health status monitoring
  • HealthCheckStrategy: Configurable strategy interface for health check behavior
    • Interval, timeout, number of probes, delay between probes
    • Pluggable health check logic via doHealthCheck(RedisURI)
  • HealthStatusManager: Centralized management of health checks across multiple endpoints
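
As a rough illustration of the strategy contract listed above, here is a minimal sketch. The type names, default values, and method names other than doHealthCheck are assumptions made for this example, not the signatures merged in the PR; a String stands in for RedisURI so the snippet has no Lettuce dependency.

```java
import java.time.Duration;

// Hypothetical stand-in for the PR's health status type.
enum HealthStatus { UNKNOWN, HEALTHY, UNHEALTHY }

// Hypothetical shape of a strategy: four tunables plus the pluggable probe.
interface HealthCheckStrategySketch {

    // How often a health check round is scheduled (default chosen arbitrarily).
    default Duration interval() { return Duration.ofSeconds(1); }

    // Upper bound for a single probe (default chosen arbitrarily).
    default Duration timeout() { return Duration.ofSeconds(1); }

    // Number of probes per round (default chosen arbitrarily).
    default int numProbes() { return 3; }

    // Pause between consecutive probes in a round (default chosen arbitrarily).
    default Duration delayBetweenProbes() { return Duration.ofMillis(100); }

    // Pluggable probe logic; the PR passes a RedisURI here.
    HealthStatus doHealthCheck(String endpoint);
}

// Toy strategy for illustration only: endpoints on port 6379 count as healthy.
class PortBasedStrategy implements HealthCheckStrategySketch {

    @Override
    public int numProbes() { return 5; }

    @Override
    public HealthStatus doHealthCheck(String endpoint) {
        return endpoint.endsWith(":6379") ? HealthStatus.HEALTHY : HealthStatus.UNHEALTHY;
    }
}
```

A custom strategy only has to supply doHealthCheck; the tunables can be overridden selectively, as PortBasedStrategy does for numProbes.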

Probing Policies

  • ANY_SUCCESS: Returns HEALTHY if any probe succeeds
  • ALL_SUCCESS: Returns HEALTHY only if all probes succeed
  • MAJORITY_SUCCESS: Returns HEALTHY if a majority of probes succeed
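
The three policies can be sketched as a function of the probe results. This is an illustrative sketch, not the PR's code: the evaluate method and its parameters are hypothetical, it assumes at least one probe, and it assumes MAJORITY_SUCCESS means a strict majority (a tie counts as UNHEALTHY). The real implementation may also stop probing early once the outcome is decided.

```java
// Hypothetical stand-in for the PR's health status type.
enum HealthStatus { HEALTHY, UNHEALTHY }

// Hypothetical evaluation of a completed round of probes.
enum ProbingPolicySketch {

    ANY_SUCCESS {
        @Override
        HealthStatus evaluate(int successes, int total) {
            return successes > 0 ? HealthStatus.HEALTHY : HealthStatus.UNHEALTHY;
        }
    },

    ALL_SUCCESS {
        @Override
        HealthStatus evaluate(int successes, int total) {
            return successes == total ? HealthStatus.HEALTHY : HealthStatus.UNHEALTHY;
        }
    },

    MAJORITY_SUCCESS {
        @Override
        HealthStatus evaluate(int successes, int total) {
            // Strict majority is an assumption: 1 of 2 does not qualify.
            return successes * 2 > total ? HealthStatus.HEALTHY : HealthStatus.UNHEALTHY;
        }
    };

    // successes: probes that passed; total: probes executed (assumed >= 1).
    abstract HealthStatus evaluate(int successes, int total);
}
```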

Retry Logic

  • Configurable numProbes for multiple health check attempts
  • Configurable delay between probe retries
  • Thread-safe status updates with timestamp-based conflict resolution
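
One way to read "timestamp-based conflict resolution" is that a status update is applied only if it is newer than the one already stored, so a slow, stale check result cannot overwrite a fresher one. The sketch below shows that idea with a compare-and-set loop; the class and method names are hypothetical and the PR's actual mechanism may differ.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the PR's health status type.
enum HealthStatus { UNKNOWN, HEALTHY, UNHEALTHY }

// Immutable pair of a status and the time at which it was observed.
final class TimestampedStatus {

    final HealthStatus status;
    final long timestamp;

    TimestampedStatus(HealthStatus status, long timestamp) {
        this.status = status;
        this.timestamp = timestamp;
    }
}

// Hypothetical holder that resolves concurrent updates by timestamp.
final class StatusHolderSketch {

    private final AtomicReference<TimestampedStatus> current =
            new AtomicReference<>(new TimestampedStatus(HealthStatus.UNKNOWN, Long.MIN_VALUE));

    // Applies the update only if it is strictly newer than the stored one.
    // Returns true when the update was applied, false when it was stale.
    boolean update(HealthStatus status, long timestamp) {
        TimestampedStatus candidate = new TimestampedStatus(status, timestamp);
        while (true) {
            TimestampedStatus seen = current.get();
            if (timestamp <= seen.timestamp) {
                return false; // a newer (or equally new) result already won
            }
            if (current.compareAndSet(seen, candidate)) {
                return true;
            }
            // Another thread changed the value in between; re-read and retry.
        }
    }

    HealthStatus status() {
        return current.get().status;
    }
}
```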

MultiDbClient Integration

  • Health checks automatically created/started when databases are added
  • Health checks automatically stopped when databases are removed
  • Health checks stopped on connection close
  • Endpoints without health checks return HEALTHY by default
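
The add/remove behaviour and the HEALTHY default can be pictured with a small registry. This is a sketch under assumptions: the class and its methods are hypothetical, endpoints are plain strings rather than RedisURI, and starting or stopping the actual checks is reduced to a map entry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the PR's health status type.
enum HealthStatus { UNKNOWN, HEALTHY, UNHEALTHY }

// Hypothetical registry mirroring the integration rules in the list above.
final class HealthRegistrySketch {

    private final Map<String, HealthStatus> statuses = new ConcurrentHashMap<>();

    // Called when a database with a health check is added; the check starts as UNKNOWN.
    void register(String endpoint) {
        statuses.put(endpoint, HealthStatus.UNKNOWN);
    }

    // Called when a database is removed or the connection closes.
    void unregister(String endpoint) {
        statuses.remove(endpoint);
    }

    // Called by a running check to publish its latest result.
    void report(String endpoint, HealthStatus status) {
        statuses.replace(endpoint, status); // ignored if the endpoint was unregistered
    }

    // Endpoints without a health check are reported as HEALTHY by default.
    HealthStatus getHealthStatus(String endpoint) {
        return statuses.getOrDefault(endpoint, HealthStatus.HEALTHY);
    }
}
```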

Implementation Details

Architecture:

  • Uses static ExecutorService (cached thread pool) for health check execution
  • Per-instance ScheduledExecutorService for periodic scheduling

Lifecycle Management:

  • Health checks start with UNKNOWN status
  • Transition to HEALTHY/UNHEALTHY after first check completes
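
The architecture and lifecycle notes above can be sketched together: a per-instance scheduler triggers each round, the probe itself runs on a shared cached pool, and the scheduler thread blocks on the future with a timeout. All names here are hypothetical, the probe is a plain Callable, and treating a timeout or an exception as UNHEALTHY is an assumption of this sketch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the PR's health status type.
enum HealthStatus { UNKNOWN, HEALTHY, UNHEALTHY }

final class PeriodicCheckSketch {

    // Shared by all instances: unbounded cached pool for the blocking probe work.
    private static final ExecutorService WORKERS =
            Executors.newCachedThreadPool(r -> daemon(r, "sketch-health-worker"));

    // One scheduler per instance, used only to trigger rounds.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> daemon(r, "sketch-health-scheduler"));

    private final Callable<Boolean> probe;
    private final long timeoutMillis;

    // Stays UNKNOWN until the first check completes.
    private volatile HealthStatus status = HealthStatus.UNKNOWN;

    PeriodicCheckSketch(Callable<Boolean> probe, long timeoutMillis) {
        this.probe = probe;
        this.timeoutMillis = timeoutMillis;
    }

    private static Thread daemon(Runnable r, String name) {
        Thread t = new Thread(r, name);
        t.setDaemon(true);
        return t;
    }

    void start(long intervalMillis) {
        scheduler.scheduleAtFixedRate(this::runOnce, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // One round: submit the probe to the shared pool and block for the result.
    void runOnce() {
        Future<Boolean> result = WORKERS.submit(probe);
        try {
            boolean ok = Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
            status = ok ? HealthStatus.HEALTHY : HealthStatus.UNHEALTHY;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the probe that overran its budget
            status = HealthStatus.UNHEALTHY;
        } catch (ExecutionException e) {
            status = HealthStatus.UNHEALTHY;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the last known status
        }
    }

    HealthStatus status() {
        return status;
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

The blocking get() is the reason each in-flight check occupies a scheduler thread for up to the timeout, which is the trade-off raised in the follow-up discussion of the executor choice.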

Testing

Unit Tests

  • HealthCheckCollectionUnitTests - Collection management
  • HealthCheckImplUnitTests - Core functionality
  • HealthCheckImplRetryLogicUnitTests - Health check retry logic
  • HealthCheckImplProbingPolicyUnitTests - Probing policies
  • TestHealthCheckStrategy - Shared test helper, with controllable health statuses per endpoint

Integration Tests

  • HealthCheckIntegrationTest
    • Health check configuration and lifecycle
    • Failover triggering on unhealthy status
    • Circuit breaker coordination
    • Dynamic database add/remove
    • Health status transitions

Files Changed

Main Implementation:

  • src/main/java/io/lettuce/core/failover/health/ - Health check package (10 files)
  • src/main/java/io/lettuce/core/failover/MultiDbClientImpl.java - Integration
  • src/main/java/io/lettuce/core/failover/StatefulRedisMultiDbConnectionImpl.java - Connection lifecycle

Tests:

  • src/test/java/io/lettuce/core/failover/health/
  • src/test/java/io/lettuce/core/failover/HealthCheckIntegrationTest.java - Integration tests

Related Issues

  • CAE-1685: Integrate health checks and probing

Breaking Changes

None - This is a new feature addition.

Follow-up Items for Discussion

  1. Static ExecutorService vs EventExecutorGroup

    • Current: Uses static ExecutorService (unbounded cached thread pool) for health check execution
    • Consideration: Migrate to EventExecutorGroup from ClientResources for better lifecycle management and bounded thread pool
    • Trade-offs: Health checks use blocking future.get() which requires dedicated threads; EventExecutorGroup is shared across components
    • Decision needed: Keep per-instance scheduler + shared worker pool, or refactor to fully async pattern
  2. StatusTracker.waitForHealthStatus() Blocking Behavior

    • Current: Synchronous blocking wait for health status
    • Issue: Blocks calling thread until health status reaches desired state or timeout
    • Future work: Revisit once connectAsync() is introduced to provide non-blocking alternative
    • Potential solution: Add CompletableFuture<HealthStatus> awaitHealthStatus() method
  3. HealthCheckImpl.stop() Blocking Shutdown

    • Current: Synchronous shutdown with scheduler.awaitTermination()
    • Issue: Blocks calling thread during graceful shutdown (up to 1 second)
    • Future work: Provide CompletableFuture<Void> closeAsync() for non-blocking shutdown
    • Alignment: Should align with Lettuce's async shutdown patterns when available
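
To make follow-up item 2 more concrete, here is one possible shape for a non-blocking wait. It is only a sketch of the idea: the tracker class, the onStatusChange hook, and the desired-status parameter are assumptions for illustration (the proposal above names a no-argument awaitHealthStatus()), and timeouts are left to the caller, for example via future.get(timeout, unit).

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the PR's health status type.
enum HealthStatus { UNKNOWN, HEALTHY, UNHEALTHY }

// Hypothetical tracker that completes futures instead of blocking callers.
final class StatusTrackerSketch {

    private static final class Waiter {

        final HealthStatus desired;
        final CompletableFuture<HealthStatus> future = new CompletableFuture<>();

        Waiter(HealthStatus desired) {
            this.desired = desired;
        }
    }

    private final List<Waiter> waiters = new CopyOnWriteArrayList<>();
    private volatile HealthStatus current = HealthStatus.UNKNOWN;

    // Returns a future that completes once the status equals the desired one.
    CompletableFuture<HealthStatus> awaitHealthStatus(HealthStatus desired) {
        Waiter waiter = new Waiter(desired);
        waiters.add(waiter);
        // Check after registering so an update racing with registration is not missed.
        if (current == desired) {
            waiter.future.complete(desired);
            waiters.remove(waiter);
        }
        return waiter.future;
    }

    // Invoked by the health check whenever it publishes a new status.
    void onStatusChange(HealthStatus newStatus) {
        current = newStatus;
        for (Waiter waiter : waiters) { // snapshot iteration, safe against concurrent removal
            if (waiter.desired == newStatus) {
                waiter.future.complete(newStatus); // no-op if already completed
                waiters.remove(waiter);
            }
        }
    }
}
```

A caller that still needs the synchronous behaviour can block on the returned future, so the existing waitForHealthStatus() could in principle be layered on top of such a method.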

ggivo added 28 commits November 26, 2025 15:31
Changes
 - add connection.getHealthStatus(RedisURI endpoint)
 - HEALTHY - returned for Databases without health checks configured
 - add test
Changes
 - add test to ensure health status changes from custom health checks are reflected
 - Should start health checks automatically when connection is created
 - Should stop health checks when connection is closed
 - rename health check thread names to lettuce-*
 - clean up warnings
 - format
 - javadocs & author updated
 - Update StatefulMultiDbConnectionIntegrationTests to account for the additional test server added in MultiDbTestSupport
 - JUnit 4 @After replaced with its JUnit 5 equivalent
@ggivo ggivo changed the base branch from main to feature/automatic-failover-1 December 1, 2025 11:22
@ggivo ggivo requested review from atakavci and tishun and removed request for atakavci December 1, 2025 11:52
@ggivo ggivo marked this pull request as ready for review December 1, 2025 11:53
@ggivo ggivo requested review from a-TODO-rov and atakavci December 2, 2025 08:31
@atakavci (Collaborator) left a comment

Some of these are nothing more than questions.
The one to take more seriously is the order of registering listeners.
I'll try to go another round with a clear mind.

@atakavci (Collaborator) left a comment

LGTM

@ggivo ggivo merged commit be6a0d1 into feature/automatic-failover-1 Dec 5, 2025
13 of 14 checks passed
@ggivo ggivo deleted the topic/failover/CAE-1685-integrate-healthchecks-and-probing branch December 5, 2025 11:47