[hyperactor] watchdog; use it to monitor channel stalls #1633
Stack from ghstack (oldest at bottom):
Introduce `hyperactor::sync::watchdog::Watchdog`, designed to monitor the progress (i.e., the absence of stalls) of asynchronous code. Stalls are particularly pernicious when mixed with timeouts and other policy; watchdogs let us carefully log the state of the program to diagnose these kinds of issues, and can also be used to implement policy.
We use the watchdog to monitor "net" channels for stalls. Specifically, we check that the client is live within a configurable watchdog timeout. We refactor logging to capture the pertinent state, so that it can be used both in the watchdog and in the normal log messages.
We also reconnect the channel on watchdog failures, in case the stalls are due to lower-level library or systems issues.
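For readers unfamiliar with the pattern, here is a minimal, standard-library-only sketch of a watchdog that detects a stall and triggers a reconnect. It is illustrative only: the `Watchdog` type below, its `pet` and `check` methods, and the `reconnects` counter are hypothetical names and do not reflect the actual `hyperactor::sync::watchdog::Watchdog` API. Timestamps are passed in explicitly so the behavior is deterministic.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical watchdog sketch (NOT the hyperactor API): it remembers the
/// last time the monitored code reported progress.
#[derive(Clone)]
struct Watchdog {
    timeout: Duration,
    last_progress: Arc<Mutex<Instant>>,
}

impl Watchdog {
    fn new(timeout: Duration, now: Instant) -> Self {
        Self {
            timeout,
            last_progress: Arc::new(Mutex::new(now)),
        }
    }

    /// Record that the monitored code made progress at `now`.
    fn pet(&self, now: Instant) {
        *self.last_progress.lock().unwrap() = now;
    }

    /// Return the idle duration if it exceeds the timeout, else `None`.
    fn check(&self, now: Instant) -> Option<Duration> {
        let last = *self.last_progress.lock().unwrap();
        let idle = now.saturating_duration_since(last);
        if idle > self.timeout {
            Some(idle)
        } else {
            None
        }
    }
}

fn main() {
    let start = Instant::now();
    let wd = Watchdog::new(Duration::from_secs(5), start);
    let mut reconnects = 0u32; // assumed stand-in for a real reconnect action

    // Simulated checks at 3s and 6s after the last recorded progress.
    for secs in [3u64, 6] {
        let now = start + Duration::from_secs(secs);
        if let Some(idle) = wd.check(now) {
            // On a stall: log the pertinent state, then apply policy.
            eprintln!("channel stalled for {:?}; reconnecting", idle);
            reconnects += 1;
            wd.pet(now); // the fresh connection starts a new deadline
        }
    }
    println!("reconnects: {}", reconnects);
}
```

Sharing the timestamp behind an `Arc<Mutex<_>>` lets the monitored task and the monitoring task hold clones of the same watchdog; a real asynchronous implementation would likely run the check on a timer rather than in a loop.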
Differential Revision: D85079295
NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!