Skip to content

fix(gateway): make readiness health checks dependency-aware#1328

Open
alangou wants to merge 2 commits into
NVIDIA:mainfrom
alangou:alangou/os-156-update-gateway-health-check-to-account-for-database
Open

fix(gateway): make readiness health checks dependency-aware#1328
alangou wants to merge 2 commits into
NVIDIA:mainfrom
alangou:alangou/os-156-update-gateway-health-check-to-account-for-database

Conversation

@alangou
Copy link
Copy Markdown
Contributor

@alangou alangou commented May 12, 2026

Summary

This PR makes gateway readiness signals dependency-aware instead of always healthy, while keeping liveness intentionally lightweight.
It adds a configurable database readiness timeout, wires it end-to-end (CLI/env -> server config -> health router), and aligns Helm defaults so application timeout stays below Kubernetes probe timeout.
It also extends readiness coverage with Docker e2e and updates the Rust test task so openshell-server test-only coverage runs with test-support.

Related Issue

  • closes OS-156
  • Parent initiative: OSGH-111 (Runtime Reliability)

Changes

  • Added dependency-aware health behavior in openshell-server:
    • /healthz remains liveness-only (200 when process is responsive)
    • /readyz and /health perform DB connectivity checks and return 503 when unavailable
    • readiness payload includes structured dependency details (checks.database.status, latency, error)
  • Added bounded timeout handling for DB readiness checks.
  • Added configurable readiness timeout:
    • CLI flag: --readiness-db-timeout-secs
    • Env var: OPENSHELL_READINESS_DB_TIMEOUT_SECS
    • Core/server config wiring and runtime validation (> 0)
  • Added persistence connectivity helpers (ping / close) for both SQLite and Postgres stores.
  • Added Prometheus readiness metrics:
    • openshell_server_readiness_database_healthy gauge (1 healthy, 0 unhealthy)
    • openshell_server_readiness_database_probe_duration_seconds histogram labeled by outcome (success, db_error, timeout)
    • health tests validate metric emission and /metrics exposure
  • Scoped pool teardown APIs to test support only:
    • Store::close and backend close methods are behind #[cfg(any(test, feature = "test-support"))]
  • Updated existing integration tests that instantiate health_router so they provide a real Store.
  • Added readiness integration test coverage:
    • crates/openshell-server/tests/health_endpoint_integration.rs
  • Added Docker e2e coverage for readiness:
    • e2e/rust/tests/readyz_health.rs
    • e2e test target in e2e/rust/Cargo.toml
    • Docker e2e wrapper exposes a dedicated health port for /readyz probing
  • Helm chart updates for timeout customization and safe defaults:
    • server.readinessDbTimeoutSecs passed to gateway args
    • probes.readiness.timeoutSeconds default set to 2s
    • app default timeout set to 1s so app timeout < probe timeout
  • Test task update for test-support coverage by default in Rust test lane:
    • tasks/test.toml runs workspace tests excluding openshell-server, then runs openshell-server with --features test-support

Why /healthz still returns 200 when DB is down

/healthz is kept as a pure liveness probe by design. If liveness depended on DB, transient DB outages could trigger unnecessary pod restarts and CrashLoop behavior without fixing the dependency outage. Readiness (/readyz) is the dependency-aware signal used to remove unhealthy pods from traffic.

Why close is test-only

close is used to simulate dependency outages in tests. Exposing it in runtime code would make it possible to tear down an active pool under live traffic. Until a dedicated graceful-shutdown flow exists, keeping it behind test support prevents accidental production use.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Validation run:

  • mise run e2e
  • mise run ci

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Copy link
Copy Markdown
Collaborator

@TaylorMutch TaylorMutch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of focused comments on the readiness changes.

Comment thread crates/openshell-server/src/http.rs Outdated
Comment thread crates/openshell-server/src/persistence/mod.rs
@alangou alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch 5 times, most recently from 2cb34c9 to 05b6998 Compare May 13, 2026 12:44
@alangou
Copy link
Copy Markdown
Contributor Author

alangou commented May 13, 2026

@TaylorMutch I removed the hardcoded value for the database timeout some documentation and deployments resources has been updated to reflect that change (helm chart, doc)

@alangou alangou requested a review from TaylorMutch May 13, 2026 13:42
@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label May 13, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for 05b6998. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

@alangou alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch from 05b6998 to 2b11909 Compare May 18, 2026 09:22
@alangou
Copy link
Copy Markdown
Contributor Author

alangou commented May 18, 2026

@TaylorMutch I have rebased the branch PR

Copy link
Copy Markdown
Collaborator

@TaylorMutch TaylorMutch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes I think are needed on this PR, but it's getting closer.

Comment thread docs/sandboxes/manage-gateways.mdx Outdated
openshell gateway info
```

Gateways expose `/healthz` as a liveness signal and `/readyz` as a dependency-aware readiness signal. `/readyz` checks database connectivity and reports unhealthy when the probe exceeds the configured timeout. Tune that timeout with `--readiness-db-timeout-secs` or `OPENSHELL_READINESS_DB_TIMEOUT_SECS` when you need a slower or stricter readiness window.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably only relevant on the Kubernetes docs which care about these things; should this move into the Kubernetes docs?

Comment thread e2e/with-docker-gateway.sh Outdated
--config "${GATEWAY_CONFIG}"
--bind-address 0.0.0.0
--port "${HOST_PORT}"
--health-port "${HEALTH_PORT}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't currently depend on the health port in any configuration other than Kubernetes, so I would suggest we not add it into the docker e2e path and only exercise it on the kubernetes e2e tests (which would enable the health port by default).

Comment thread deploy/helm/openshell/values.yaml Outdated
# Database readiness timeout in seconds used by /readyz and /health.
# Keep this lower than probes.readiness.timeoutSeconds so the gateway can
# return a readiness decision before Kubernetes times out the probe.
readinessDbTimeoutSecs: 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a possible footgun; the readiness check should be pretty instantaneous right? And if we have to add in a configuration option to the server to determine this, that seems strange to me.

Perhaps the health check for DB health should run in the background and the health check handler only checks if we detect the state to be unhealthy? That would remove this race condition. WDYT?

Copy link
Copy Markdown
Contributor Author

@alangou alangou May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that works for me. The database healthcheck would run in background (with the configurable timeout) and update some sort of store in the gateway. When /readyz is queried the store is read. This would indeed be almost instant

@TaylorMutch TaylorMutch self-assigned this May 19, 2026
@alangou alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch 3 times, most recently from d008505 to ab4deb4 Compare May 19, 2026 15:45
alangou added 2 commits May 19, 2026 18:06
Emit Prometheus readiness metrics for database probes (healthy gauge and
outcome-labeled latency histogram) with coverage in health HTTP tests.
Restrict Store::close behind test support cfg to prevent accidental runtime
pool shutdown under live traffic.

Signed-off-by: Adrien Langou <alangou@nvidia.com>
Signed-off-by: Adrien Langou <alangou@nvidia.com>
@alangou alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch from ab4deb4 to a167000 Compare May 19, 2026 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants