
Default failure-rate policy can cascade service-specific restart loops into SEVERE full container stop #6

@thomaswitt

Summary

The default Async::Service::Policy behavior (6 failures / 60s) is reasonable, but
in multi-service Falcon setups a failure loop in one service can stop the entire
container, including healthy critical services (e.g. web).

This is expected given the policy design, but it is easy to hit with common
integrations (for example ActiveJob Inline service mode).

In my case, this was triggered by ActiveJob Inline service mode restart churn (tracked separately in async-job-adapter-active_job).
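
For context, here is a minimal sketch of the kind of multi-service falcon.rb
layout where this bites. The service names and the empty background-jobs body
are placeholders; the exact environment module for the job service depends on
the integration you use.

  #!/usr/bin/env falcon-host
  # frozen_string_literal: true

  require "falcon/environment/rack"
  require "falcon/environment/supervisor"

  # Critical, healthy web service.
  service "web" do
    include Falcon::Environment::Rack
  end

  # Process supervisor.
  service "supervisor" do
    include Falcon::Environment::Supervisor
  end

  # Non-critical background-jobs service. If its child process keeps exiting,
  # the default failure-rate policy eventually stops the whole container,
  # taking web and supervisor down with it.
  service "background-jobs" do
    # Include the job-service environment provided by your integration here.
  end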

Observed behavior

With:

  • async-service 0.20.1
  • async-container 0.34.2
  • falcon 0.54.2

When one child process repeatedly exits with a non-success status, the logs show:

  • Process is blocking, sending kill signal...
  • Failure rate exceeded threshold, stopping container!

Then container.stop(true) stops all services.

Why this is problematic in practice

In mixed-service containers, a restart loop in one non-critical service can take
down the web process even when the web service itself is healthy.

Request

Could you consider one or more of:

  1. Stronger docs/examples for container_policy in Falcon configs (service-scoped handling).
  2. Clear guidance for integrations that should not run under the strict default escalation
    without a custom policy.
  3. An optional policy mode or helper pattern for service-aware escalation (e.g. ignore/soft-fail selected service names).

Repro context

The source of the loop in my case is tracked separately in async-job-adapter-active_job
(Inline processor in service mode); this issue is about the operability of the
default policy behavior in multi-service containers.

Workaround

We now use container_policy do ... end in falcon.rb with a custom policy that ignores escalation for the background-jobs service but keeps the strict default behavior for web and supervisor. A sketch is included below.
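
The sketch below shows the shape of that workaround. Async::Service::Policy is the
class named above, but the hook name it overrides and the per-service
container_policy wiring are assumptions about async-service 0.20.x internals, so
verify against the installed version before copying.

  # falcon.rb (excerpt)
  require "async/service"

  # Lenient policy: let async-container keep restarting the child, but never
  # escalate repeated failures into stopping the whole container.
  class TolerantPolicy < Async::Service::Policy
    # Assumed hook name: the method invoked when a child process exits
    # unsuccessfully. In the strict default this is what eventually leads to
    # container.stop(true) once 6 failures occur within 60s; here it is a no-op.
    def failed(*)
      # Intentionally empty: tolerate restart churn for this service.
    end
  end

  # Attach the lenient policy only to the background-jobs service;
  # web and supervisor keep the strict default policy.
  service "background-jobs" do
    container_policy do
      TolerantPolicy.new
    end
  end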

TL;DR: We were using :async_job with its default Inline backend; in Falcon service mode that worker exited immediately and was restarted in a tight loop, and async-service's failure-rate policy escalated those repeated exits into a full container stop. We mitigated this by making the policy service-aware (ignore background-jobs escalation, keep strict web/supervisor enforcement), so background jobs still run and the web app no longer gets taken down.
