Native Auto-Heal / Self-Healing Orchestration for Unhealthy Containers #198

mnaiman · 2026-03-22T10:35:40Z

mnaiman
Mar 22, 2026

Summary
I would like to propose adding an Auto-Heal engine to Drydock. While Docker can restart containers that crash, it lacks a native mechanism to automatically restart containers that become unhealthy while still running.

Current third-party solutions (like docker-autoheal or docker-surgeon) are often unmaintained, lack multi-host support, or require complex sidecar setups. Since Drydock already has secure access to multiple Docker hosts (via sockets/HTTPS), it is perfectly positioned to act as a centralized self-healing orchestrator.

The Problem
When a container enters an unhealthy state:

- Manual intervention is currently required to restart it.
- Standard Docker restart_policy usually triggers only on process exit, not on health check failure.
- Existing tools are often "single-host only," so a multi-node environment needs a separate auto-heal instance on every host.

Proposed Solution
Integrate a monitoring loop within Drydock that polls the health status of containers across all connected hosts and performs corrective actions (Stop/Start or Restart) based on predefined rules.
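
A minimal sketch of what one pass of such a monitoring loop might look like, in Python for illustration only; it is not Drydock's implementation. It assumes each container is described by `docker inspect`-style JSON, where `State.Health.Status` is `starting`, `healthy`, or `unhealthy` and the `Health` key is absent when no healthcheck is defined. The names `health_status`, `poll_once`, and the `restart` callback are hypothetical.

```python
from typing import Callable, Iterable, List, Optional


def health_status(inspect: dict) -> Optional[str]:
    """Return State.Health.Status, or None if the container has no healthcheck."""
    health = inspect.get("State", {}).get("Health")
    return health.get("Status") if health else None


def poll_once(inspects: Iterable[dict], restart: Callable[[str], None]) -> List[str]:
    """Restart every running container that reports 'unhealthy'.

    `restart` is a hypothetical callback that receives the container ID.
    Returns the IDs that were restarted during this pass.
    """
    restarted: List[str] = []
    for info in inspects:
        running = info.get("State", {}).get("Running", False)
        if running and health_status(info) == "unhealthy":
            restart(info["Id"])
            restarted.append(info["Id"])
    return restarted
```

Containers without a healthcheck are skipped rather than treated as unhealthy, since there is no health signal to act on.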

Key Features

- Multi-Host Awareness: Leverage Drydock’s existing host connections to monitor containers globally from a single UI/service.
- Label-Based Configuration: Use Docker labels to opt in specific containers. Examples:
  - drydock.autoheal=true
  - drydock.autoheal.action=restart
  - drydock.autoheal.delay=30s (to prevent boot-loops)
- Dual-Trigger Logic:
  - Health Status: Monitor the native Docker Healthcheck API (healthy vs unhealthy).
  - Log Monitoring: For containers without native health checks, trigger a restart if a specific keyword (e.g., "Fatal Error" or "Connection Timeout") appears in the logs.
- Notifications: Integrate with Drydock’s notification system to alert users when a container has been automatically recovered.
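
To illustrate how a delay setting could prevent boot-loops, here is a small Python sketch of a per-container cooldown. It is an assumption about one possible design, not existing Drydock behavior; `RestartThrottle` and its `allow` method are hypothetical names, and the clock is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Dict


class RestartThrottle:
    """Allow at most one restart per container within `delay_seconds`."""

    def __init__(self, delay_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._delay = delay_seconds
        self._clock = clock
        self._last_restart: Dict[str, float] = {}

    def allow(self, container_id: str) -> bool:
        """Return True and record the time if a restart is permitted now."""
        now = self._clock()
        last = self._last_restart.get(container_id)
        if last is not None and now - last < self._delay:
            return False  # still inside the cooldown window
        self._last_restart[container_id] = now
        return True
```

A denied call does not reset the window, so a container that stays unhealthy becomes eligible again once the original delay has elapsed.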

Why Drydock?
Drydock already possesses the "keys to the kingdom"—it has the connectivity and the GUI. Adding this would turn Drydock from a management dashboard into a proactive stability tool, filling a significant gap in the Docker ecosystem.

s-b-e-n-s-o-n · 2026-03-22T19:01:51Z

s-b-e-n-s-o-n
Mar 22, 2026
Maintainer

Thanks for the thoughtful proposal! This aligns well with where Drydock is headed.

What fits naturally:

Drydock already has container actions (start/stop/restart) shipped since v1.2.0, and our roadmap includes health check gates (v2.1.0) and expanded container operations (v2.2.0). Adding proactive health monitoring with auto-restart sits right in that trajectory.

The label-based opt-in you're describing fits perfectly with the existing dd.* label convention. Something like:

dd.autoheal=true
dd.autoheal.action=restart (default) or stop
dd.autoheal.delay=30s
dd.autoheal.max_retries=3
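
As a rough illustration, the labels above could be parsed along these lines. This is a Python sketch under stated assumptions: the label names are the ones proposed in this thread and are not a shipped API, `restart` is the default action as stated above, and the fallback values for delay and max_retries simply reuse the example values. `AutoHealConfig`, `parse_duration`, and `parse_autoheal_labels` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Mapping, Optional

_DURATION = re.compile(r"^(\d+)(ms|s|m|h)$")
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Convert strings such as '30s' or '2m' to seconds."""
    match = _DURATION.match(text.strip())
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


@dataclass(frozen=True)
class AutoHealConfig:
    action: str = "restart"       # default per the proposal above
    delay_seconds: float = 30.0   # assumption: reuses the example value
    max_retries: int = 3          # assumption: reuses the example value


def parse_autoheal_labels(labels: Mapping[str, str]) -> Optional[AutoHealConfig]:
    """Return the auto-heal settings, or None if the container did not opt in."""
    if labels.get("dd.autoheal", "").strip().lower() != "true":
        return None
    action = labels.get("dd.autoheal.action", "restart").strip().lower()
    if action not in ("restart", "stop"):
        raise ValueError(f"unsupported action: {action!r}")
    return AutoHealConfig(
        action=action,
        delay_seconds=parse_duration(labels.get("dd.autoheal.delay", "30s")),
        max_retries=int(labels.get("dd.autoheal.max_retries", "3")),
    )
```

Returning None for containers without `dd.autoheal=true` keeps the feature strictly opt-in.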

And since Drydock already has multi-host awareness via the agent architecture, this would work across all connected hosts out of the box — which is exactly the gap you identified with single-host tools.

Notifications on auto-heal events would plug directly into the existing trigger system (Slack, webhook, email, etc.) with no extra work.

Where I'd push back — log monitoring:

The log keyword matching ("dual-trigger") is a significantly larger scope. Parsing container logs reliably means dealing with encoding detection, multiline formats, log rotation, rate limiting, and buffering — it's essentially a log aggregation engine. There are mature, dedicated tools for that (Loki + Promtail, Dozzle, etc.) that do it far better than we could as a side feature. I'd rather keep Drydock focused and let users pair it with a purpose-built log stack.

Rough timeline:

Health-status event notifications could land as early as v1.6–v1.7 as a new trigger event type. The full auto-heal loop (monitor → restart → verify → notify) fits best alongside the health check gates in v2.1.0.
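
The monitor → restart → verify → notify loop might be sketched roughly as follows. This Python example is illustrative only and not the planned implementation: `heal` and its `get_status`, `restart`, and `notify` callbacks are hypothetical, and the retry and delay parameters mirror the proposed labels.

```python
import time
from typing import Callable


def heal(container_id: str,
         get_status: Callable[[str], str],
         restart: Callable[[str], None],
         notify: Callable[[str], None],
         max_retries: int = 3,
         delay_seconds: float = 30.0,
         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Restart an unhealthy container, verify it, and report the outcome.

    Returns True if the container reports 'healthy' within `max_retries`
    restarts, otherwise False.
    """
    for attempt in range(1, max_retries + 1):
        restart(container_id)
        sleep(delay_seconds)  # give the healthcheck time to report again
        # Simplification: anything other than 'healthy' (including
        # 'starting') counts as a failed attempt.
        if get_status(container_id) == "healthy":
            notify(f"{container_id}: recovered after {attempt} restart(s)")
            return True
    notify(f"{container_id}: still unhealthy after {max_retries} restart(s), giving up")
    return False
```

Capping the attempts means a container that cannot recover ends with a single "giving up" notification instead of restarting indefinitely.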

Appreciate you taking the time to write this up — it's a good signal that the multi-host health gap is real.

mnaiman · 2026-03-22T19:09:41Z

mnaiman
Mar 22, 2026
Author

Perfect, thanks. Looking forward to it.

As for log monitoring, it was low priority. I have manually adjusted my healthchecks to be log-aware, so if a keyword appears, the container is now reported as unhealthy.
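
A log-aware healthcheck of this kind could look roughly like the Python sketch below. It is an assumption about the approach, not the exact script used here: `check_log` is a hypothetical helper, the log path in the usage comment is made up, and the keywords are the examples from the original proposal. Docker's HEALTHCHECK treats exit code 0 as healthy and 1 as unhealthy.

```python
import os
from typing import Sequence


def check_log(path: str, keywords: Sequence[str], tail_bytes: int = 65536) -> int:
    """Return 0 (healthy) or 1 (unhealthy) based on the tail of a log file."""
    try:
        with open(path, "rb") as handle:
            handle.seek(0, os.SEEK_END)
            size = handle.tell()
            # Read only the last `tail_bytes` bytes so the check stays cheap.
            handle.seek(max(0, size - tail_bytes))
            tail = handle.read().decode("utf-8", errors="replace")
    except OSError:
        return 1  # assumption: a missing or unreadable log counts as unhealthy
    return 1 if any(keyword in tail for keyword in keywords) else 0


# Hypothetical usage from a script invoked by HEALTHCHECK:
#   import sys
#   sys.exit(check_log("/var/log/app/app.log", ["Fatal Error", "Connection Timeout"]))
```

One caveat of this sketch: if the log file survives a restart (for example on a volume), the old keyword stays in the tail and the container is marked unhealthy again, so the log would need to be rotated or truncated on startup.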

Waiting patiently for the autoheal feature. 😊
The key is to have some kind of trigger that sends, e.g., an email saying that something happened and that Drydock performed an action (container restart).

s-b-e-n-s-o-n · 2026-05-12T13:31:06Z

s-b-e-n-s-o-n
May 12, 2026
Maintainer

This is tracked on the roadmap at Phase 9.4.

The label-based opt-in (dd.autoheal=true, dd.autoheal.action, dd.autoheal.delay, dd.autoheal.max_retries) fits cleanly with the existing dd.* convention, and multi-host coverage comes for free via the agent architecture.

Health-status event notifications (alerting when a container enters an unhealthy state, even before auto-restart is implemented) could land earlier, in v1.6–v1.7, as a new trigger event type. The full monitor → restart → verify → notify loop is targeted alongside health check gates in the v2.1.0 window.

The log-keyword trigger is out of scope; for that use case, pairing Drydock with a dedicated log tool (Loki, Dozzle) and surfacing the result as a Docker healthcheck is the right pattern, as you've already discovered.
