[hyperactor] watchdog; use it to monitor channel stalls #1633

mariusae · 2025-10-21T21:33:14Z

Stack from ghstack (oldest at bottom):

Introduce `hyperactor::sync::watchdog::Watchdog`, designed to monitor progress (i.e., the absence of stalls) of asynchronous code.

Stalls are particularly pernicious when mixed with timeouts and other policy; watchdogs let us carefully log the state of the program to diagnose these kinds of issues, and can also be used to implement policy.

We use the watchdog to monitor "net" channels for stalls. Specifically, we ensure that the client is live (within a configurable watchdog timeout). We refactor logging to capture the pertinent state, so that it can be used both by the watchdog and in the normal log messages.

We also reconnect the channel on watchdog failures, in case the stalls are due to lower-level library or systems issues.
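To make the description concrete for reviewers, here is a minimal sketch of the idea, not the code in this PR. Every name in it (`Watchdog::progress`, `Watchdog::check`, `Stalled`, `ClientState`, `supervise`) is hypothetical and is not the real `hyperactor::sync::watchdog::Watchdog` API. It uses only the standard library and takes the current time as an argument so that the behavior is deterministic; the actual implementation monitors asynchronous code and may be structured quite differently.

```rust
use std::fmt;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the watchdog described above; this is NOT the
/// real `hyperactor::sync::watchdog::Watchdog` API.
#[derive(Debug)]
struct Watchdog {
    timeout: Duration,
    last_progress: Instant,
}

/// Returned when no progress was recorded within the timeout.
#[derive(Debug, PartialEq)]
struct Stalled {
    idle_for: Duration,
}

impl Watchdog {
    fn new(timeout: Duration, now: Instant) -> Self {
        Self { timeout, last_progress: now }
    }

    /// Record that the monitored code made progress.
    fn progress(&mut self, now: Instant) {
        self.last_progress = now;
    }

    /// Report a stall if the last recorded progress is older than the timeout.
    fn check(&self, now: Instant) -> Result<(), Stalled> {
        let idle_for = now.saturating_duration_since(self.last_progress);
        if idle_for > self.timeout {
            Err(Stalled { idle_for })
        } else {
            Ok(())
        }
    }
}

/// Hypothetical snapshot of client state, formatted the same way for normal
/// log messages and for watchdog reports.
#[derive(Debug, Default)]
struct ClientState {
    unacked: usize,
    reconnects: u32,
}

impl fmt::Display for ClientState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unacked={} reconnects={}", self.unacked, self.reconnects)
    }
}

/// One supervision step: on a stall, log the captured state and reconnect.
/// Returns true if a reconnect was triggered.
fn supervise(dog: &mut Watchdog, state: &mut ClientState, now: Instant) -> bool {
    match dog.check(now) {
        Ok(()) => false,
        Err(stall) => {
            eprintln!("client stalled for {:?}: {}", stall.idle_for, state);
            state.reconnects += 1; // placeholder for tearing down and redialing
            dog.progress(now); // restart the clock for the new connection
            true
        }
    }
}
```

In this sketch, a client loop would call `progress` whenever it does useful work, and a separate supervisor would call `supervise` periodically. A check inside the timeout does nothing; a check past the timeout logs the state once, counts a reconnect, and resets the clock so that the fresh connection gets a full timeout of its own.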

Differential Revision: D85079295

NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!


meta-cla bot added the CLA Signed label Oct 21, 2025

mariusae mentioned this pull request Oct 21, 2025

[hyperactor] implement a grace period for stalled client loops #1634

Open

meta-codesync bot added fb-exported meta-exported labels Oct 21, 2025
