designs - typescript - graph #543

Draft
pgrayy wants to merge 1 commit into strands-agents:main from pgrayy:design-typescript-graphs

Conversation

Member

pgrayy commented Feb 12, 2026

Description

Related Issues

Type of Change

  • New content
  • Content update/revision
  • Structure/organization improvement
  • Typo/formatting fix
  • Bug fix
  • Other (please describe):

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using mkdocs serve
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- The graph is marked `FAILED` if at least one node fails
- Concurrently running nodes are allowed to finish by default (graceful exit)
- Customers can override this for fail-fast behavior
- `CANCELLED`: explicitly cancelled by the caller (e.g., via a hook that sets `cancelNode`)
Member

Should we have INTERRUPTED for future usage?

Member Author

It'll be included in v1. I only gave the blurb about it in the overview because there is already a lot of content in this doc to discuss.
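
For illustration only, the status rules quoted at the top of this thread could be folded into a graph-level status roughly as follows. The status names come from the design text; the helper name `deriveGraphStatus`, the `cancelledByCaller` flag, and the precedence of `CANCELLED` over `FAILED` are assumptions made for this sketch and are not confirmed by the doc.

```typescript
// Status values named in the design doc (INTERRUPTED is deferred, so it is left out here).
type Status = 'PENDING' | 'EXECUTING' | 'COMPLETED' | 'FAILED' | 'CANCELLED'

// Hypothetical helper: fold per-node statuses into one graph-level status.
function deriveGraphStatus(nodeStatuses: Status[], cancelledByCaller = false): Status {
  const has = (wanted: Status): boolean => nodeStatuses.some((s) => s === wanted)
  if (cancelledByCaller || has('CANCELLED')) return 'CANCELLED' // assumed to take precedence
  if (has('FAILED')) return 'FAILED' // at least one node failed
  if (has('EXECUTING')) return 'EXECUTING' // no failure so far, work still in flight
  if (has('COMPLETED')) return 'COMPLETED' // untraversed nodes may remain PENDING
  return 'PENDING'
}
```

Adding `INTERRUPTED` later would be an additive change to the union and one more branch in the helper.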

enum Status {
PENDING = 'PENDING',
EXECUTING = 'EXECUTING',
COMPLETED = 'COMPLETED',
Contributor

interrupt is a follow up?

Member Author

Yes correct. It'll follow the same pattern as Python though which already went through a design review.

- Base state class shared across multi-agent patterns (Graph, Swarm, etc.)
- `user`: mutable section for customer-defined state. Typed via Zod schema inference (`z.infer<typeof schema>`) so customers get compile-time type safety and runtime validation without writing generics manually. Customers update it directly (e.g., `state.user.counter += 1`).
Member

so customers get compile-time type safety and runtime validation without writing generics

Can you expand on this a bit? Where is the schema coming from in this case, and what is it?

Wondering if AgentState should be using the same pattern.

Member Author

The schema is passed in by the user into the GraphBuilder. This will then give users a well typed state object they are free to manipulate in code through node functions, hook callbacks, etc.

An alternative approach I considered was using generics, but that became cumbersome because it would require making everything above MultiAgentState generic as well.

@strands-agent
Contributor

Documentation Deployment Complete

Your documentation preview has been successfully deployed!

Preview URL: https://d3ehv1nix5p99z.cloudfront.net/pr-543/

readonly status: Status
readonly duration: number
readonly error?: Error
readonly output: ContentBlock[]
Member

Might be worth having input on here for completeness TBH

Member Author

We could. NodeInput, NodeState, and NodeResult are meant to be flexible and actually could share fields with one another. I actually wanted to consider combining everything into NodeState and will discuss that afterward. For now though, I think this is a good start because it mirrors what we did in Python.

class NodeResult {
readonly nodeId: string
readonly status: Status
readonly duration: number
Contributor

Question: what is the trade-off versus a timestamp here?

Member Author

A timestamp is a point in time, whereas a duration is a time delta. A timestamp would be something like a DateTime type, whereas a duration would be a number measuring seconds.

Contributor

Yes, I mean we could expose start and finish times instead of a duration. That gives users more extensibility if there is a use case.

const graph = new GraphBuilder()
.addNode(researcher)
.addNode(writer)
.addNode(reviewer)
Member

So does this mean that in addNode we are going to add sugar so you can pass in a Node or an Agent and we'll just wrap as needed? Or are we making Agent actually implement Node?

Member Author

We wrap the Agent instance into an AgentNode instance on behalf of the customer.

class NodeResult {
readonly nodeId: string
readonly status: Status
readonly duration: number
Member

nit: we should normalize/standardize on a duration of some sort

Member Author

Agreed. We did not do this in Python. There are multiple duration fields where some are measured in seconds, others in milliseconds. Here I figured seconds was a good standard.

- Downstream nodes receive content blocks built from upstream dependency outputs

```typescript
class NodeResult {
```
Member

should we track the parent node? would be None for the entry point node

Member Author

I'm not sure we need to track it in result. Users can already see the graph structure directly going through graph.state and graph.config. But could be added here if there is a good reason. It is an additive thing which is backwards compatible.

Member

Unshure left a comment

What proposed changes here are different than what we have in python?

### Node

```typescript
type NodeInput = string | ContentBlock[]
```
Member

Member Author

Yes, this can be extended and made more general. We could use InvokeArgs for the time being, but I figured I would consider a NodeInput in case we wanted to expand even further. For now, though, it is probably worth replacing it with InvokeArgs. The same goes for GraphInput, which could also be replaced by InvokeArgs.

Comment on lines +312 to +317
maxNodeExecutions: 10,
executionTimeout: 60,
userSchema: z.object({
drafts: z.number().default(0),
approved: z.boolean().default(false),
}),
Member

nit: Should all of these be their own add...() functions for the builder?

Member

FWIW, I have not seen the builder pattern much in JS/TS.


IMHO the add methods would be just noise vs making it a config option

executionTimeout?: number
maxConcurrency?: number
hookProviders?: HookProvider[]
userSchema?: z.ZodObject<z.ZodRawShape>
Member

Is there a way that external things can contribute? I'm thinking that if you add a hook, what's the mechanism for the hook to add state as well that doesn't necessarily match this schema? Should that be allowed, and is it desirable?


```typescript
interface NodeConfig {
timeout?: number
```
Member

retry? Whether on timeout or on error.

Member Author

Yeah, we can add retry configs here. NodeConfig is set up to be additive, so pitch any more ideas you have.
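
As a rough sketch of what an additive retry option might look like, the block below extends a `NodeConfig` with a `retry` field and retries on both errors and timeouts. Only `timeout` appears in the design doc; `RetryConfig`, `maxAttempts`, `NodeTimeoutError`, `runWithRetry`, and `withTimeout` are hypothetical names, and `timeout` is assumed to be in seconds.

```typescript
// Hypothetical retry settings; only `timeout` appears in the design doc.
interface RetryConfig {
  maxAttempts: number
}

interface NodeConfig {
  timeout?: number // assumed to be seconds
  retry?: RetryConfig // assumed addition
}

// Raised by this sketch when an attempt exceeds `timeout`.
class NodeTimeoutError extends Error {}

// Reject if `promise` does not settle within `seconds`. The underlying work is
// not cancelled; this sketch only stops waiting for it.
function withTimeout<T>(promise: Promise<T>, seconds: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new NodeTimeoutError(`timed out after ${seconds}s`)),
      seconds * 1000,
    )
    promise.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      },
    )
  })
}

// Run `task` up to `maxAttempts` times, retrying on errors and on timeouts alike.
async function runWithRetry<T>(task: () => Promise<T>, config: NodeConfig = {}): Promise<T> {
  const maxAttempts = config.retry?.maxAttempts ?? 1
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if (config.timeout === undefined) return await task()
      return await withTimeout(task(), config.timeout)
    } catch (error) {
      lastError = error
    }
  }
  throw lastError
}
```

Because both fields are optional, existing configs would keep working unchanged, which matches the additive intent described above.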

- `multiAgentState`: reference to the current `MultiAgentState` (e.g., `GraphState`, `SwarmState`) for reading execution metadata and updating user state

```typescript
interface NodeConfig {
```
Contributor

JackYPCOnline commented Feb 12, 2026

Agent as Node, Function as Node, and MultiAgent as Node all use the same NodeConfig interface.
Anything else we should include?

Member Author

We could create a MultiAgentNodeConfig or AgentNodeConfig if need be, but right now I'm not seeing a use case. They would extend NodeConfig if we decided on it.

- Wraps a `MultiAgentBase` instance, enabling nested orchestration (e.g., a `Graph` as a node in another `Graph`)
Contributor

Can we get the multiagent interface to match the agent one? From python, I think MultiAgentBase is pretty much the same as AgentBase, and I was thinking about getting rid of one

Member

👍 had a similar question.

- `state` is optional on `stream()`; if not provided, the node creates a default `NodeState` internally. Callers pass it to override node-scoped context (e.g., `executionCount` when a node is retried within the same invocation).
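
A minimal sketch of the optional-state behavior described in this bullet, assuming a `NodeState` that carries only `executionCount` and a default of `1` for the first execution. The class `EchoNode` and the string-based event and result types are placeholders for illustration, not the SDK's types.

```typescript
// Assumed shape: the doc only names `executionCount` as node-scoped context.
interface NodeState {
  executionCount: number
}

// Placeholder node that reports which execution it is on.
class EchoNode {
  readonly id: string

  constructor(id: string) {
    this.id = id
  }

  // `state` is optional; a default NodeState is created when the caller omits it.
  async *stream(
    input: string,
    state: NodeState = { executionCount: 1 },
  ): AsyncGenerator<string, string, undefined> {
    yield `${this.id} started (execution ${state.executionCount})`
    return `${this.id}: ${input}`
  }
}
```

A caller retrying the node within the same invocation would pass `{ executionCount: 2 }` explicitly instead of relying on the default.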

```typescript
class AgentNode extends Node {
```
Contributor

Do we plan to do AgentBase for other remote agents?

- Telemetry, hooks, and interrupts are planned for multi-agent patterns but follow closely from existing SDK patterns and should not require a separate design discussion
- Session management will be reviewed separately

## Interfaces
Member

High level

This doc does a good job of telling us what we are doing. But it feels like the onus is on the reader to figure out how this differs from the python implementation. I think it would be helpful to call out early and explicitly exactly what we are changing (or potentially controversially keeping) compared to the existing impl.

For example, are we keeping the (questionable) OR edge logic we use in python, or are we going to use AND for consistency?


Not covered in this document:

- Swarm builds on the same core interfaces introduced here (Status, Node, streaming events) and should not require a separate design discussion
Contributor

(outside scope of this doc) one thing i'd like to see is how we can actually get rid of the tool injection logic. I don't like that swarm is so intrusive.

Probably we should switch to structured output

- Implements `MultiAgentBase` for uniform orchestration
Member

What's in MultiAgentBase? Is there a world in which we ditch MultiAgentBase and just do AgentBase? In Python I think MultiAgentBase just includes serialize_state & deserialize_state, and maybe those go on AgentBase instead? (almost like a snapshot)

```typescript
class GraphResult {
readonly status: Status
readonly results: Record<string, NodeResult>
```
Contributor

Can we have a list of node results? This is one of the limitations we had with initial version of the graph. It didn't allow for cycles so each agent can have a single result.

With the cyclic graphs, each agent can generate multiple node results

@@ -0,0 +1,351 @@
# TypeScript SDK - Graph - Design
Member

high level q: Is there any deviation in the dev ex of using graphs in ts vs python?

Member Author

Slight deviations yes. Apologies I didn't have time to include in the doc. I'll address after review.

- Aggregate result from a graph execution

```typescript
class GraphState extends MultiAgentState {
```
Contributor

again for cyclic graphs. what if one execution succeeded and one failed?

this is pretty annoying for Python bc we had to think about backwards compatibility with the already established data types/classes
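
To make the cyclic-graph concern concrete, here is one possible shape for keeping every execution of a node rather than a single result. `ResultHistory`, `recordResult`, and `latestResult` are hypothetical names; the `NodeResult` fields are trimmed to the ones quoted in this review.

```typescript
type Status = 'PENDING' | 'EXECUTING' | 'COMPLETED' | 'FAILED' | 'CANCELLED'

// Trimmed to the fields quoted in this review.
interface NodeResult {
  nodeId: string
  status: Status
  duration: number // seconds
}

// Hypothetical variant of GraphResult.results: one entry per execution, in order.
type ResultHistory = Record<string, NodeResult[]>

// Append a result to the node's history, creating the list on first use.
function recordResult(history: ResultHistory, result: NodeResult): void {
  const runs = history[result.nodeId] ?? []
  runs.push(result)
  history[result.nodeId] = runs
}

// Most recent execution of a node, or undefined if it never ran.
function latestResult(history: ResultHistory, nodeId: string): NodeResult | undefined {
  const runs = history[nodeId]
  return runs === undefined ? undefined : runs[runs.length - 1]
}
```

With this shape, a node that failed once and then succeeded on a later pass keeps both entries, so the caller can decide which one matters.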

// |
// formatOutput (condition: approved)
//
const graph = new GraphBuilder()
Member

Have we considered alternative/simpler devx's?

GraphBuilder.build({
  flows: [
    [researcher, writer],
    [writer, reviewer],
    [reviewer, writer, (state) => !state.user.approved],
    [reviewer, formatOutput, (state) => state.user.approved],
  ],
  options: {
    maxNodeExecutions: 10,
    executionTimeout: 60,
    userSchema: z.object({
      drafts: z.number().default(0),
      approved: z.boolean().default(false),
    }),
  },
})

or even

GraphBuilder.build({
  flows: [
    [researcher, writer, reviewer],
    [reviewer, writer, (state) => !state.user.approved],
    [reviewer, formatOutput, (state) => state.user.approved],
  ],
  options: {
    maxNodeExecutions: 10,
    executionTimeout: 60,
    userSchema: z.object({
      drafts: z.number().default(0),
      approved: z.boolean().default(false),
    }),
  },
})
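
One way the `flows` idea above could map onto the existing edge model is to expand each flow into pairwise edges. This is a sketch under assumptions: nodes are referenced by string ID to keep the example self-contained, `expandFlows` is a hypothetical helper, and a trailing condition is applied to every edge in its flow, which the comment above does not specify for chains longer than two nodes.

```typescript
interface UserState {
  approved: boolean
}

type EdgeHandler = (state: { user: UserState }) => boolean

// A flow is a chain of node IDs, optionally followed by a condition function.
type Flow = Array<string | EdgeHandler>

interface Edge {
  source: string
  target: string
  handler?: EdgeHandler
}

// Hypothetical helper: expand each flow into pairwise edges.
function expandFlows(flows: Flow[]): Edge[] {
  const edges: Edge[] = []
  for (const flow of flows) {
    const last = flow[flow.length - 1]
    const handler = typeof last === 'function' ? last : undefined
    const nodeIds = flow.filter((item): item is string => typeof item === 'string')
    for (let i = 0; i + 1 < nodeIds.length; i++) {
      edges.push({ source: nodeIds[i], target: nodeIds[i + 1], handler })
    }
  }
  return edges
}
```

The second form in the comment, `[researcher, writer, reviewer]`, would expand to the same two unconditional edges as the first form's separate pairs.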

- Implements `MultiAgentBase` for uniform orchestration
- `state` is optional; if not provided, the graph creates a fresh `GraphState` internally
- Uses eager execution — as soon as a node completes, newly-ready downstream nodes start immediately (no batch waves)
- A node is "ready" when all of its incoming edge sources have completed and all edge conditions evaluate to true
Contributor

We get rid of any for sure right
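
For reference, the readiness rule quoted above ("all incoming edge sources have completed and all edge conditions evaluate to true") can be written as a small predicate. The `SketchState` shape and the `isReady` name are assumptions for illustration; the real `GraphState` is not shown in this thread.

```typescript
// Assumed minimal state: which nodes have completed, plus user state.
interface SketchState {
  completedNodeIds: string[]
  user: { approved: boolean }
}

type EdgeHandler = (state: SketchState) => boolean

interface Edge {
  source: string
  target: string
  handler?: EdgeHandler
}

// AND semantics: every incoming edge needs a completed source and a passing condition.
// A node with no incoming edges (an entry point) is trivially ready.
function isReady(nodeId: string, edges: Edge[], state: SketchState): boolean {
  const incoming = edges.filter((edge) => edge.target === nodeId)
  return incoming.every(
    (edge) =>
      state.completedNodeIds.indexOf(edge.source) !== -1 &&
      (edge.handler === undefined || edge.handler(state)),
  )
}
```

Under strict AND, a node inside a cycle (such as `writer` with incoming edges from both `researcher` and `reviewer`) would not become ready on the first pass, which is one reason the OR-versus-AND question raised in this review matters.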


- Directed edge between two nodes
- `handler`: optional function evaluated at runtime to determine whether the edge should be traversed
- Unconditional edges are always traversed when the source node completes
Contributor

so this is OR logic on edges, then?

class GraphBuilder {
addNode(agent: Agent | MultiAgentBase | NodeHandler, config?: NodeConfig): GraphBuilder
addEdge(source: string, target: string, handler?: EdgeHandler): GraphBuilder
build(config?: GraphConfig): Graph
Contributor

nit: if we are doing builder pattern, i'd expect the configs to be set-able on the builder directly, i.e. builder.set_timeout()


Each graph invocation could be assigned a unique state ID. The graph engine would store and retrieve state through a session manager keyed by this ID, rather than holding state directly as fields on graph and node instances. When invoking a graph, customers could specify a state ID to resume from a previous invocation's state.

This matters for concurrent isolation — if state lives as instance members on the graph or node objects, every invocation overwrites the same fields and concurrent calls interfere with each other. An in-memory session manager that stores state in a table (keyed by state ID) solves this without requiring external storage. It gives each invocation its own isolated state while keeping the same graph instance reusable across concurrent calls.
Member

We can explore this, but I am curious on the repercussions; this would lean more towards returning a state "object" rather than having state on the graph/swarm

For graphs/swarm this might make a lot of sense to start with off the bat - though a POC would help here
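
As a starting point for such a POC, the in-memory table keyed by state ID could look roughly like this. `InMemoryStateStore`, its methods, and the ID format are hypothetical; the doc only describes the idea of a session manager keyed by state ID.

```typescript
// Hypothetical in-memory session manager: one isolated state object per state ID.
class InMemoryStateStore<S> {
  private readonly table = new Map<string, S>()
  private nextId = 0

  // Register state for a new invocation and return its ID.
  create(initial: S): string {
    const stateId = `state-${this.nextId++}`
    this.table.set(stateId, initial)
    return stateId
  }

  // Look up state from a previous invocation; throws on unknown IDs.
  get(stateId: string): S {
    const state = this.table.get(stateId)
    if (state === undefined) throw new Error(`unknown state ID: ${stateId}`)
    return state
  }
}
```

Because each invocation reads and writes only the entry for its own ID, two concurrent calls on the same graph instance would not overwrite each other's fields.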


### Remote Node

A remote node would execute on a different process or machine — wrapping an agent behind an API, an MCP server, or an A2A endpoint. It would serialize input, send it over the wire, and deserialize the response back into `ContentBlock[]`. The `Node` abstraction already supports this — a `RemoteNode` subclass would implement `_stream()` with network calls instead of local execution, and the graph engine wouldn't need to change since it calls `node.stream()` uniformly.
Member

This differs from RemoteAgent in that it elevates the concept of remote execution outside of an agent? I like the idea.

There's nothing preventing this today in Python is there? Assuming we support it via FunctionNode?
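
A hedged sketch of the serialize, send, deserialize loop described in the Remote Node paragraph. The `Transport` function type is an assumption that stands in for an HTTP, MCP, or A2A call; `ContentBlock` is reduced to a text-only shape; and the class does not extend the doc's `Node` base so that the example stays self-contained. The `_stream()` signature shown here is also assumed.

```typescript
// Simplified stand-in for the SDK's ContentBlock (text only).
interface ContentBlock {
  text: string
}

// Hypothetical transport: takes a serialized request, resolves with a serialized reply.
type Transport = (requestBody: string) => Promise<string>

class RemoteNode {
  readonly id: string
  private readonly send: Transport

  constructor(id: string, send: Transport) {
    this.id = id
    this.send = send
  }

  // Serialize the input, send it over the wire, and deserialize the reply.
  // This sketch yields no intermediate events and only returns the final output.
  async *_stream(input: ContentBlock[]): AsyncGenerator<never, ContentBlock[], undefined> {
    const reply = await this.send(JSON.stringify({ nodeId: this.id, input }))
    const parsed: unknown = JSON.parse(reply)
    if (!Array.isArray(parsed)) throw new Error('remote reply is not a ContentBlock[]')
    return parsed.map((block: { text?: unknown }) => ({ text: String(block.text ?? '') }))
  }
}
```

Injecting the transport keeps the node testable without a network and leaves the choice of protocol to the caller.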

.addNode(reviewer)
.addNode(formatOutput)
.addEdge('researcher', 'writer')
.addEdge('writer', 'reviewer')
Member

Referencing by id/name is odd IMHO; why not pass the object itself

.addNode(formatOutput)
.addEdge('researcher', 'writer')
.addEdge('writer', 'reviewer')
.addEdge('reviewer', 'writer', (state) => !state.user.approved)
Member

For extensibility I'd suggest:

Suggested change
- .addEdge('reviewer', 'writer', (state) => !state.user.approved)
+ .addEdge('reviewer', 'writer', ({state}) => !state.user.approved)

But otherwise like elevating this here

Contributor

mkmeral commented Feb 12, 2026

TS vs Python Diff (what Kiro says)

Don't take it at face value, this is to augment the offline conversation right now

Comparison of the TS graph design against the current Python multiagent/graph.py implementation.

Node Architecture — Polymorphic hierarchy vs isinstance branching

Python uses a single GraphNode dataclass with executor: AgentBase | MultiAgentBase and dispatches via isinstance checks in _execute_node (~50 lines of branching).

TS introduces a polymorphic hierarchy: an abstract Node with subclasses AgentNode, MultiAgentNode, and FunctionNode. Each subclass implements _stream(), and the base stream() handles boilerplate (duration, status, error handling). The graph engine calls node.stream() uniformly, so no type-checking is needed.

FunctionNode is new — it wraps plain/async/generator functions via NodeHandler, enabling transformation, routing, and side-effect nodes without wrapping everything in an Agent. This is a gap in the Python SDK.
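
The hierarchy described above can be sketched as a template method. This is a simplified, non-streaming illustration: `BaseNode` stands in for the doc's abstract `Node` (renamed to avoid clashing with the DOM `Node` type), `run`/`execute` stand in for `stream()`/`_stream()`, and `AgentLike` is a placeholder because the real Agent interface is not shown in this thread.

```typescript
interface NodeResult {
  nodeId: string
  status: 'COMPLETED' | 'FAILED'
  duration: number // seconds
  output?: string
  error?: Error
}

abstract class BaseNode {
  readonly id: string

  constructor(id: string) {
    this.id = id
  }

  // Shared boilerplate: timing, status, and error capture live in the base class.
  async run(input: string): Promise<NodeResult> {
    const startedAt = Date.now()
    try {
      const output = await this.execute(input)
      return { nodeId: this.id, status: 'COMPLETED', duration: (Date.now() - startedAt) / 1000, output }
    } catch (error) {
      const wrapped = error instanceof Error ? error : new Error(String(error))
      return { nodeId: this.id, status: 'FAILED', duration: (Date.now() - startedAt) / 1000, error: wrapped }
    }
  }

  // Subclasses supply only the node-specific work.
  protected abstract execute(input: string): Promise<string>
}

// Placeholder for an agent; not the SDK's Agent type.
interface AgentLike {
  name: string
  invoke(prompt: string): Promise<string>
}

class AgentNode extends BaseNode {
  private readonly agent: AgentLike

  constructor(agent: AgentLike) {
    super(agent.name)
    this.agent = agent
  }

  protected execute(input: string): Promise<string> {
    return this.agent.invoke(input)
  }
}

class FunctionNode extends BaseNode {
  private readonly handler: (input: string) => string | Promise<string>

  constructor(id: string, handler: (input: string) => string | Promise<string>) {
    super(id)
    this.handler = handler
  }

  protected async execute(input: string): Promise<string> {
    return this.handler(input)
  }
}
```

The caller only ever invokes `run` on a `BaseNode`, so adding a new node kind does not touch the engine, and a thrown error surfaces as a `FAILED` result rather than an exception.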

Execution Model — Eager vs batch-based

Python uses batch execution: nodes in a batch run in parallel via asyncio.Queue, then after the entire batch completes, newly-ready nodes are discovered and form the next batch.

TS uses eager execution: as soon as any node completes, downstream nodes that are now ready start immediately. No waiting for batch siblings.

This matters for graphs with mixed-latency nodes — in Python, a fast node in batch N+1 waits for all of batch N to finish, even if its specific dependency completed early.

TS also adds maxConcurrency to limit parallel node execution, which Python doesn't have.
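
To illustrate the difference, here is a minimal eager scheduler with a concurrency cap. It is a sketch under several assumptions: the graph is acyclic, edges are unconditional, each node runs once, and a failing task simply rejects the whole run (the graceful failure handling discussed in this thread is left out). `runEagerly` and its parameters are hypothetical names.

```typescript
interface Edge {
  source: string
  target: string
}

type Task = () => Promise<void>

// Whenever a node finishes, start any node whose dependencies are now all
// complete, keeping at most `maxConcurrency` nodes running at a time.
// Resolves with the order in which nodes completed.
function runEagerly(tasks: Record<string, Task>, edges: Edge[], maxConcurrency: number): Promise<string[]> {
  const completionOrder: string[] = []
  const started = new Set<string>()
  const completed = new Set<string>()
  let running = 0

  return new Promise<string[]>((resolve, reject) => {
    const schedule = (): void => {
      for (const nodeId of Object.keys(tasks)) {
        if (running >= maxConcurrency) break
        if (started.has(nodeId)) continue
        const ready = edges
          .filter((edge) => edge.target === nodeId)
          .every((edge) => completed.has(edge.source))
        if (!ready) continue
        started.add(nodeId)
        running++
        tasks[nodeId]().then(() => {
          running--
          completed.add(nodeId)
          completionOrder.push(nodeId)
          schedule() // eager: re-check readiness as soon as any node finishes
        }, reject)
      }
      // Nothing is running and nothing more could be started: the run is over.
      if (running === 0) resolve(completionOrder)
    }
    schedule()
  })
}
```

With a slow node `a`, a fast node `b`, and a node `c` that depends only on `b`, this scheduler lets `c` start as soon as `b` finishes instead of waiting for `a`, which is the mixed-latency case described above.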

Failure Semantics — Graceful vs fail-fast

Python is fail-fast: exceptions propagate up, sibling tasks in the same batch are cancelled immediately.

TS captures errors in NodeResult instead of throwing. Concurrently running nodes are allowed to finish (graceful exit). The graph is marked FAILED if any node fails. Downstream dependents of the failed node stay PENDING. Customers can override for fail-fast behavior.

TS also adds CANCELLED status and graph.cancel() for explicit graceful cancellation, which Python doesn't support.

User State — Typed via Zod vs no equivalent

TS introduces MultiAgentState.user typed via Zod schema inference (z.infer<typeof schema>), giving compile-time type safety and runtime validation. Customers pass userSchema in GraphConfig and mutate state.user directly from edge conditions and function nodes.

Python's GraphState has no user-defined state section. Edge conditions receive the full GraphState but there's no structured way to carry custom state across nodes.

Agent State Isolation

TS AgentNode always captures and restores agent messages/state per execution — the same agent instance can run multiple times without carrying over conversation history.

Python requires opting in via reset_on_revisit on the builder. Without it, revisited nodes accumulate state from prior executions.
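
The capture-and-restore behavior described for the TS `AgentNode` can be sketched as a snapshot taken before the run and restored afterwards. `ChatAgentSketch` is a stand-in with a mutable `messages` array, and `runIsolated` is a hypothetical helper; the real agent state is likely richer than a message list.

```typescript
interface Message {
  role: 'user' | 'assistant'
  text: string
}

// Minimal stand-in for an agent with mutable conversation history.
class ChatAgentSketch {
  messages: Message[] = []

  async invoke(prompt: string): Promise<string> {
    this.messages.push({ role: 'user', text: prompt })
    const reply = `reply #${this.messages.length}`
    this.messages.push({ role: 'assistant', text: reply })
    return reply
  }
}

// Snapshot the agent's messages before the run and restore them afterwards,
// so a revisited node starts from the same history every time.
async function runIsolated(agent: ChatAgentSketch, prompt: string): Promise<string> {
  const snapshot = agent.messages.map((message) => ({ ...message }))
  try {
    return await agent.invoke(prompt)
  } finally {
    agent.messages = snapshot
  }
}
```

The `finally` block restores the snapshot even when the run throws, so a failed execution does not leak partial history into the next one.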

Builder API — Config object vs setter methods

Python uses individual fluent setters: set_max_node_executions(), set_execution_timeout(), set_node_timeout(), set_graph_id(), etc.

TS consolidates into a single GraphConfig object passed to build():

.build({
  maxNodeExecutions: 10,
  executionTimeout: 60,
  maxConcurrency: 5,
  userSchema: z.object({ ... }),
})

Both auto-detect entry points (nodes with no incoming edges). Python additionally allows explicit set_entry_point().

Streaming Events — Typed objects vs dicts

Python yields plain dicts (events call .as_dict() before yielding).

TS yields typed event objects with discriminated unions via the type field, enabling pattern matching:

if (event.type === 'multiAgentNodeStartEvent') { ... }
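
A slightly fuller sketch of how the `type` discriminant enables narrowing. Only the `'multiAgentNodeStartEvent'` tag appears in the comparison above; the stop event, its tag, and the field names are hypothetical.

```typescript
interface NodeStartEvent {
  type: 'multiAgentNodeStartEvent'
  nodeId: string
}

// Hypothetical second event, added so the union has something to discriminate.
interface NodeStopEvent {
  type: 'multiAgentNodeStopEvent'
  nodeId: string
  duration: number // seconds
}

type MultiAgentStreamEvent = NodeStartEvent | NodeStopEvent

// The compiler narrows `event` in each case, so `duration` is only
// accessible where it exists.
function describeEvent(event: MultiAgentStreamEvent): string {
  switch (event.type) {
    case 'multiAgentNodeStartEvent':
      return `${event.nodeId} started`
    case 'multiAgentNodeStopEvent':
      return `${event.nodeId} finished in ${event.duration}s`
  }
}
```

With plain dicts, the equivalent check happens only at runtime; here a misspelled tag or a missing case is caught at compile time.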

Features in Python not covered by TS design (intentionally deferred)
  • Interrupts (_InterruptState, interrupt resumption, MultiAgentNodeInterruptEvent)
  • Session management (serialize_state / deserialize_state, SessionManager)
  • Hooks (BeforeNodeCallEvent, AfterNodeCallEvent, BeforeMultiAgentInvocationEvent, etc.)
  • OpenTelemetry tracing

The design doc explicitly states these follow existing SDK patterns and don't need separate design discussion.

Features in TS design not in Python
  • FunctionNode for arbitrary function/async/generator nodes
  • Typed user state via Zod schema
  • maxConcurrency limit
  • graph.cancel() for graceful cancellation
  • Graceful failure handling (errors in results, concurrent nodes finish)
  • Eager execution model
  • RemoteNode as a future consideration

- Forwards input to the nested pattern and wraps its streaming events with the node's ID

```typescript
type NodeHandler = (input: NodeInput, state: MultiAgentState) => NodeHandlerResult | Promise<NodeHandlerResult> | AsyncGenerator<MultiAgentStreamEvent, NodeHandlerResult, undefined>
```
Member

In general I think we should prefer objects as arguments instead of individual arguments:

Suggested change
- type NodeHandler = (input: NodeInput, state: MultiAgentState) => NodeHandlerResult | Promise<NodeHandlerResult> | AsyncGenerator<MultiAgentStreamEvent, NodeHandlerResult, undefined>
+ type NodeHandler = (args: { input: NodeInput; state: MultiAgentState }) => NodeHandlerResult | Promise<NodeHandlerResult> | AsyncGenerator<MultiAgentStreamEvent, NodeHandlerResult, undefined>

Allows folks to pick and choose and allows us to extend in the future

8 participants