Conversation
| - The graph is marked `FAILED` if at least one node fails | ||
| - Concurrently running nodes are allowed to finish by default (graceful exit) | ||
| - Customers can override this for fail-fast behavior | ||
| - `CANCELLED`: explicitly cancelled by the caller (e.g., via a hook that sets `cancelNode`) |
There was a problem hiding this comment.
Should we have INTERRUPTED for future usage?
There was a problem hiding this comment.
It'll be included in v1. I only gave the blurb about it in the overview because there is already a lot of content in this doc to discuss.
| enum Status { | ||
| PENDING = 'PENDING', | ||
| EXECUTING = 'EXECUTING', | ||
| COMPLETED = 'COMPLETED', |
There was a problem hiding this comment.
interrupt is a follow up?
There was a problem hiding this comment.
Yes correct. It'll follow the same pattern as Python though which already went through a design review.
| ``` | ||
|
|
||
| - Base state class shared across multi-agent patterns (Graph, Swarm, etc.) | ||
| - `user`: mutable section for customer-defined state. Typed via Zod schema inference (`z.infer<typeof schema>`) so customers get compile-time type safety and runtime validation without writing generics manually. Customers update it directly (e.g., `state.user.counter += 1`). |
There was a problem hiding this comment.
so customers get compile-time type safety and runtime validation without writing generics
Can you expand on this a bit? Whereis/what is schema coming from in this case?
Wondering if AgentState should be using the same pattern
There was a problem hiding this comment.
The schema is passed in by the user into the GraphBuilder. This will then give users a well typed state object they are free to manipulate in code through node functions, hook callbacks, etc.
An alternative approach I considered was using generics, but that became cumbersome because it would require making everything above MultiAgentState generic as well.
Documentation Deployment CompleteYour documentation preview has been successfully deployed! Preview URL: https://d3ehv1nix5p99z.cloudfront.net/pr-543/ |
| readonly status: Status | ||
| readonly duration: number | ||
| readonly error?: Error | ||
| readonly output: ContentBlock[] |
There was a problem hiding this comment.
Might be worth having input on here for completeness TBH
There was a problem hiding this comment.
We could. NodeInput, NodeState, and NodeResult are meant to be flexible and actually could share fields with one another. I actually wanted to consider combining everything into NodeState and will discuss that afterward. For now though, I think this is a good start because it mirrors what we did in Python.
| class NodeResult { | ||
| readonly nodeId: string | ||
| readonly status: Status | ||
| readonly duration: number |
There was a problem hiding this comment.
Question : What is the trade off vs timestamp here
There was a problem hiding this comment.
timestamp is a point in time where duration is a time delta. timestamp would be something like a DateTime type where as duration would be a number measuring seconds.
There was a problem hiding this comment.
yes I mean we expose start & finish time instead of duration. user can have more extensibility if there is a use case.
| const graph = new GraphBuilder() | ||
| .addNode(researcher) | ||
| .addNode(writer) | ||
| .addNode(reviewer) |
There was a problem hiding this comment.
So does this mean in addNode we are going to add sugar so you can pass in Node or Agent and we'll just wrap as needed? Or are we making Agent actually implement Node
There was a problem hiding this comment.
We wrap the Agent instance into an AgentNode instance on behalf of the customer.
| class NodeResult { | ||
| readonly nodeId: string | ||
| readonly status: Status | ||
| readonly duration: number |
There was a problem hiding this comment.
nit: we should normalize/standardize on a duration of some sort
There was a problem hiding this comment.
Agreed. We did not do this in Python. There are multiple duration fields where some are measured in seconds, others in milliseconds. Here I figured seconds was a good standard.
| - Downstream nodes receive content blocks built from upstream dependency outputs | ||
|
|
||
| ```typescript | ||
| class NodeResult { |
There was a problem hiding this comment.
should we track the parent node? would be None for the entry point node
There was a problem hiding this comment.
I'm not sure we need to track it in result. Users can already see the graph structure directly going through graph.state and graph.config. But could be added here if there is a good reason. It is an additive thing which is backwards compatible.
Unshure
left a comment
There was a problem hiding this comment.
What proposed changes here are different than what we have in python?
| ### Node | ||
|
|
||
| ```typescript | ||
| type NodeInput = string | ContentBlock[] |
There was a problem hiding this comment.
nit: Should this be more general?
https://github.com/strands-agents/sdk-typescript/blob/main/src/agent/agent.ts#L118
There was a problem hiding this comment.
Yes this can be extended and made more general. We could use invokeArgs for the time being, but I figured I would consider a NodeInput incase we wanted to expand even further. But for now, probably worth replacing with InvokeArgs. Same for GraphInput. Could be replaced by InvokeArgs.
| maxNodeExecutions: 10, | ||
| executionTimeout: 60, | ||
| userSchema: z.object({ | ||
| drafts: z.number().default(0), | ||
| approved: z.boolean().default(false), | ||
| }), |
There was a problem hiding this comment.
nit: Should all of these be their own add...() functions for the builder?
There was a problem hiding this comment.
FWIW, I have not seen the builder pattern much in JS/TS.
IMHO the add methods would be just noise vs making it a config option
| executionTimeout?: number | ||
| maxConcurrency?: number | ||
| hookProviders?: HookProvider[] | ||
| userSchema?: z.ZodObject<z.ZodRawShape> |
There was a problem hiding this comment.
Is there a way that external things can contribute? I'm thinking if you add a hook, what's the mechansim for a hook to add state as well that doesn't necessarily match this schema? Should that be allowed and/or is that desirable?
|
|
||
| ```typescript | ||
| interface NodeConfig { | ||
| timeout?: number |
There was a problem hiding this comment.
retry:? whether on timeout or error
There was a problem hiding this comment.
Yeah we can add retry configs here. NodeConfig is setup to be additive so any more ideas you have pitch them.
| - `multiAgentState`: reference to the current `MultiAgentState` (e.g., `GraphState`, `SwarmState`) for reading execution metadata and updating user state | ||
|
|
||
| ```typescript | ||
| interface NodeConfig { |
There was a problem hiding this comment.
Agent as Node, Function and Multiagent as Node use same NodeConfig interface.
Anything else we should include?
There was a problem hiding this comment.
We could create a MutliAgentNodeConfig or AgentNodeConfig if need be but right now I'm not seeing a use case. But they would extend NodeConfig if we decided on it.
| } | ||
| ``` | ||
|
|
||
| - Wraps a `MultiAgentBase` instance, enabling nested orchestration (e.g., a `Graph` as a node in another `Graph`) |
There was a problem hiding this comment.
Can we get the multiagent interface to match the agent one? From python, I think MultiAgentBase is pretty much the same as AgentBase, and I was thinking about getting rid of one
| - `state` is optional on `stream()`; if not provided, the node creates a default `NodeState` internally. Callers pass it to override node-scoped context (e.g., `executionCount` when a node is retried within the same invocation). | ||
|
|
||
| ```typescript | ||
| class AgentNode extends Node { |
There was a problem hiding this comment.
Do we plan to do AgentBase for other remote agents?
| - Telemetry, hooks, and interrupts are planned for multi-agent patterns but follow closely from existing SDK patterns and should not require a separate design discussion | ||
| - Session management will be reviewed separately | ||
|
|
||
| ## Interfaces |
There was a problem hiding this comment.
High level
This doc does a good job of telling us what we are doing. But it feels like the onus is on the reader to figure out how this differs from the python implementation. I think it would be helpful to call out early and explicitly exactly what we are changing (or potentially controversially keeping) compared to the existing impl.
For example, are we keeping the (questionable) OR edge logic we use in python, or are we going to use AND for consistency?
|
|
||
| Not covered in this document: | ||
|
|
||
| - Swarm builds on the same core interfaces introduced here (Status, Node, streaming events) and should not require a separate design discussion |
There was a problem hiding this comment.
(outside scope of this doc) one thing i'd like to see is how we can actually get rid of the tool injection logic. I don't like that swarm is so intrusive.
Probably we should switch to structured output
| } | ||
| ``` | ||
|
|
||
| - Implements `MultiAgentBase` for uniform orchestration |
There was a problem hiding this comment.
What's in MultiAgentBase? Is the a world in which we ditch MultiAgentBase and just do AgentBase? In Python I think MultiAgentBase just includes serialize_state & deserialize_state and maybe those go on Agentbase instead? (almost like a snapshot)
| ```typescript | ||
| class GraphResult { | ||
| readonly status: Status | ||
| readonly results: Record<string, NodeResult> |
There was a problem hiding this comment.
Can we have a list of node results? This is one of the limitations we had with initial version of the graph. It didn't allow for cycles so each agent can have a single result.
With the cyclic graphs, each agent can generate multiple node results
| @@ -0,0 +1,351 @@ | |||
| # TypeScript SDK - Graph - Design | |||
There was a problem hiding this comment.
high level q: Is there any deviation in the dev ex of using graphs in ts vs python?
There was a problem hiding this comment.
Slight deviations yes. Apologies I didn't have time to include in the doc. I'll address after review.
| - Aggregate result from a graph execution | ||
|
|
||
| ```typescript | ||
| class GraphState extends MultiAgentState { |
There was a problem hiding this comment.
again for cyclic graphs. what if one execution succeeded and one failed?
this is pretty annoying for Python bc we had to think about backwards compatibility with the already established data types/classes
| // | | ||
| // formatOutput (condition: approved) | ||
| // | ||
| const graph = new GraphBuilder() |
There was a problem hiding this comment.
Have we considered alternative/simpler devx's?
GraphBuilder.build(
flows: [
[ researcher, writer],
[ writer, reviewer],
[ reviewer, writer, (state) => !state.user.approved],
[ reviewer, formatOutput, (state) => !state.user.approved],
],
options: {
maxNodeExecutions: 10,
executionTimeout: 60,
userSchema: z.object({
drafts: z.number().default(0),
approved: z.boolean().default(false),
}),
}
)or even
GraphBuilder.build(
flows: [
[ researcher, writer, reviewer],
[ reviewer, writer, (state) => !state.user.approved],
[ reviewer, formatOutput, (state) => !state.user.approved],
],
options: {
maxNodeExecutions: 10,
executionTimeout: 60,
userSchema: z.object({
drafts: z.number().default(0),
approved: z.boolean().default(false),
}),
}
)| - Implements `MultiAgentBase` for uniform orchestration | ||
| - `state` is optional; if not provided, the graph creates a fresh `GraphState` internally | ||
| - Uses eager execution — as soon as a node completes, newly-ready downstream nodes start immediately (no batch waves) | ||
| - A node is "ready" when all of its incoming edge sources have completed and all edge conditions evaluate to true |
There was a problem hiding this comment.
We get rid of any for sure right
|
|
||
| - Directed edge between two nodes | ||
| - `handler`: optional function evaluated at runtime to determine whether the edge should be traversed | ||
| - Unconditional edges are always traversed when the source node completes |
There was a problem hiding this comment.
so this is OR logic on edges, then?
| class GraphBuilder { | ||
| addNode(agent: Agent | MultiAgentBase | NodeHandler, config?: NodeConfig): GraphBuilder | ||
| addEdge(source: string, target: string, handler?: EdgeHandler): GraphBuilder | ||
| build(config?: GraphConfig): Graph |
There was a problem hiding this comment.
nit: if we are doing builder pattern, i'd expect the configs to be set-able on the builder directly, i.e. builder.set_timeout()
|
|
||
| Each graph invocation could be assigned a unique state ID. The graph engine would store and retrieve state through a session manager keyed by this ID, rather than holding state directly as fields on graph and node instances. When invoking a graph, customers could specify a state ID to resume from a previous invocation's state. | ||
|
|
||
| This matters for concurrent isolation — if state lives as instance members on the graph or node objects, every invocation overwrites the same fields and concurrent calls interfere with each other. An in-memory session manager that stores state in a table (keyed by state ID) solves this without requiring external storage. It gives each invocation its own isolated state while keeping the same graph instance reusable across concurrent calls. |
There was a problem hiding this comment.
We can explore this, but I am curious on the repercussions; this would lean more towards returning a state "object" rather than having state on the graph/swarm
For graphs/swarm this might make a lot of sense to start with off the bat - though a POC would help here
|
|
||
| ### Remote Node | ||
|
|
||
| A remote node would execute on a different process or machine — wrapping an agent behind an API, an MCP server, or an A2A endpoint. It would serialize input, send it over the wire, and deserialize the response back into `ContentBlock[]`. The `Node` abstraction already supports this — a `RemoteNode` subclass would implement `_stream()` with network calls instead of local execution, and the graph engine wouldn't need to change since it calls `node.stream()` uniformly. |
There was a problem hiding this comment.
This differs from RemoteAgent in that it elevates the concept of remote execution outside of an agent? I like the idea.
There's nothing preventing this today in Python is there? Assuming we support it via FunctionNode?
| .addNode(reviewer) | ||
| .addNode(formatOutput) | ||
| .addEdge('researcher', 'writer') | ||
| .addEdge('writer', 'reviewer') |
There was a problem hiding this comment.
Referencing by id/name is odd IMHO; why not pass the object itself
| .addNode(formatOutput) | ||
| .addEdge('researcher', 'writer') | ||
| .addEdge('writer', 'reviewer') | ||
| .addEdge('reviewer', 'writer', (state) => !state.user.approved) |
There was a problem hiding this comment.
For extensibility I'd suggest:
| .addEdge('reviewer', 'writer', (state) => !state.user.approved) | |
| .addEdge('reviewer', 'writer', ({state}) => !state.user.approved) |
But otherwise like elevating this here
TS vs Python Diff (what Kiro says)
Comparison of the TS graph design against the current Python Node Architecture — Polymorphic hierarchy vs isinstance branchingPython uses a single TS introduces a polymorphic hierarchy: abstract
Execution Model — Eager vs batch-basedPython uses batch execution: nodes in a batch run in parallel via TS uses eager execution: as soon as any node completes, downstream nodes that are now ready start immediately. No waiting for batch siblings. This matters for graphs with mixed-latency nodes — in Python, a fast node in batch N+1 waits for all of batch N to finish, even if its specific dependency completed early. TS also adds Failure Semantics — Graceful vs fail-fastPython is fail-fast: exceptions propagate up, sibling tasks in the same batch are cancelled immediately. TS captures errors in TS also adds User State — Typed via Zod vs no equivalentTS introduces Python's Agent State IsolationTS Python requires opting in via Builder API — Config object vs setter methodsPython uses individual fluent setters: TS consolidates into a single .build({
maxNodeExecutions: 10,
executionTimeout: 60,
maxConcurrency: 5,
userSchema: z.object({ ... }),
})Both auto-detect entry points (nodes with no incoming edges). Python additionally allows explicit Streaming Events — Typed objects vs dictsPython yields plain dicts (events call TS yields typed event objects with discriminated unions via the if (event.type === 'multiAgentNodeStartEvent') { ... }Features in Python not covered by TS design (intentionally deferred)
The design doc explicitly states these follow existing SDK patterns and don't need separate design discussion. Features in TS design not in Python
|
| - Forwards input to the nested pattern and wraps its streaming events with the node's ID | ||
|
|
||
| ```typescript | ||
| type NodeHandler = (input: NodeInput, state: MultiAgentState) => NodeHandlerResult | Promise<NodeHandlerResult> | AsyncGenerator<MultiAgentStreamEvent, NodeHandlerResult, undefined> |
There was a problem hiding this comment.
In general I think we should prefer objects as arguments instead of individual arguments:
| type NodeHandler = (input: NodeInput, state: MultiAgentState) => NodeHandlerResult | Promise<NodeHandlerResult> | AsyncGenerator<MultiAgentStreamEvent, NodeHandlerResult, undefined> | |
| type NodeHandler = ({input: NodeInput, state: MultiAgentState}) => NodeHandlerResult | Promise<NodeHandlerResult> | AsyncGenerator<MultiAgentStreamEvent, NodeHandlerResult, undefined> |
Allows folks to pick and choose and allows us to extend in the future
Description
Related Issues
Type of Change
Checklist
mkdocs serveBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.