-
Notifications
You must be signed in to change notification settings - Fork 3k
Description
Desired Feature
Introduce functionality to allow for the recording of an Agent's execution state (checkpointing) and the ability to resume execution from the last recorded checkpoint after an interruption or process restart.
Rationale
Currently, if an Agent is performing a long-running, multi-step task (especially those involving external tool calls, human feedback loops, or complex state transitions), any unexpected process interruption (e.g., system crash, deployment restart, or manual stop) results in the loss of all progress. The task must be restarted from the beginning, leading to wasted resources (API calls, computation time) and a poor user experience.
Implementing a checkpointing mechanism would provide:
Fault Tolerance: Agents can seamlessly recover from transient errors or unexpected shutdowns.
Resource Efficiency: Avoid re-running costly or time-consuming steps.
Long-Task Handling: Enable the Agent to manage very long conversations or task sequences that might exceed typical execution time limits.