
@mariusae mariusae commented Oct 21, 2025

Introduce hyperactor::sync::watchdog::Watchdog, designed to monitor progress (i.e., lack of stalls) of asynchronous code.

Stalls are particularly pernicious when mixed with timeouts and other policy. Watchdogs let us log the state of the program carefully when a stall occurs, so that these issues can be diagnosed, and they can also be used to implement policy.
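
To illustrate the idea, here is a minimal sketch of a progress watchdog. It is not the `hyperactor::sync::watchdog::Watchdog` API: the type, its fields, and the method names `pet` and `stalled_for` are hypothetical, and it uses only the standard library rather than an async runtime. Monitored code records progress, and a monitor asks whether the last recorded progress is older than the timeout.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a watchdog; not the hyperactor type.
#[derive(Clone)]
struct Watchdog {
    /// Maximum time allowed between two progress reports.
    timeout: Duration,
    /// Time of the most recent progress report, shared between
    /// the monitored code and the monitor.
    last_progress: Arc<Mutex<Instant>>,
}

impl Watchdog {
    fn new(timeout: Duration) -> Self {
        Self {
            timeout,
            last_progress: Arc::new(Mutex::new(Instant::now())),
        }
    }

    /// Called by the monitored code each time it makes progress.
    fn pet(&self) {
        *self.last_progress.lock().unwrap() = Instant::now();
    }

    /// Returns the time since the last progress report if it
    /// exceeds the timeout, and `None` otherwise.
    fn stalled_for(&self) -> Option<Duration> {
        let elapsed = self.last_progress.lock().unwrap().elapsed();
        (elapsed > self.timeout).then_some(elapsed)
    }
}

fn main() {
    let watchdog = Watchdog::new(Duration::from_millis(200));

    // The monitored code reports progress.
    watchdog.pet();
    assert!(watchdog.stalled_for().is_none());

    // Simulate a stall: no progress for longer than the timeout.
    std::thread::sleep(Duration::from_millis(300));
    if let Some(idle) = watchdog.stalled_for() {
        eprintln!("watchdog: no progress for {:?}", idle);
    }
}
```

Returning the stall duration, rather than a boolean, lets the caller include it in the diagnostic log or feed it into a policy decision.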

We use the watchdog to monitor "net" channels for stalls. Specifically, we ensure that the client is live (within a configurable watchdog timeout). We refactor logging to capture the pertinent state, so that the same state is reported both by the watchdog and in the normal log messages.

We also reconnect the channel on watchdog failures, in case the stalls are due to lower-level library or systems issues.
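
The two points above (shared state for logging, and reconnecting on watchdog failure) can be sketched roughly as follows. Everything here is an assumption for illustration: `ChannelState`, its fields, `Action`, and `check` are hypothetical names and do not reflect the actual "net" channel code. The state is formatted in one place so the watchdog and ordinary log messages print the same thing, and a stall yields a reconnect decision.

```rust
use std::fmt;
use std::time::{Duration, Instant};

/// Hypothetical snapshot of the state worth logging for a channel.
struct ChannelState {
    peer: String,
    unacked: usize,
    last_progress: Instant,
    reconnects: u32,
}

/// A single formatting path, usable by both the watchdog and
/// regular log messages.
impl fmt::Display for ChannelState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "peer={} unacked={} reconnects={}",
            self.peer, self.unacked, self.reconnects
        )
    }
}

#[derive(Debug, PartialEq)]
enum Action {
    Healthy,
    Reconnect,
}

/// Checks liveness at time `now`. On a stall, logs the state and
/// asks the caller to reconnect, resetting the progress clock.
fn check(state: &mut ChannelState, now: Instant, timeout: Duration) -> Action {
    let idle = now.duration_since(state.last_progress);
    if idle <= timeout {
        return Action::Healthy;
    }
    eprintln!("watchdog: stalled for {:?}: {}", idle, state);
    state.reconnects += 1;
    state.last_progress = now;
    Action::Reconnect
}

fn main() {
    let start = Instant::now();
    let mut state = ChannelState {
        peer: "peer-a".to_string(),
        unacked: 3,
        last_progress: start,
        reconnects: 0,
    };
    let timeout = Duration::from_secs(5);

    // Within the timeout: nothing to do.
    assert_eq!(check(&mut state, start + Duration::from_secs(1), timeout), Action::Healthy);
    // Past the timeout: log the state and reconnect.
    assert_eq!(check(&mut state, start + Duration::from_secs(6), timeout), Action::Reconnect);
}
```

Passing `now` explicitly keeps the check deterministic, which makes the stall path easy to exercise in tests without sleeping.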

Differential Revision: D85079295

NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 21, 2025
Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

2 participants