Bug Description
When using parallel agent stepping (step_agents_parallel, step_agents_multithreaded, or do_async), if any single agent's astep() or step() raises an exception, all other agent tasks are cancelled/abandoned and the entire simulation crashes.
This is particularly problematic for LLM-backed agents where transient failures are expected and common (rate limits, timeouts, malformed JSON responses, network errors, etc.). A single flaky LLM response should not kill a 100-agent simulation.
Environment
- mesa-llm: v0.3.0 (commit a8161c7)
- mesa: 3.5.0
- Python: 3.12
Root Cause
Three parallel execution paths in mesa_llm/parallel_stepping.py lack error isolation:
1. step_agents_parallel() (line 31): asyncio.gather(*tasks) is called without return_exceptions=True, so the first exception propagates to the caller at once and the remaining coroutines are abandoned
2. step_agents_multithreaded() (lines 52–53): bare future.result() in a loop with no try/except, so the first exception aborts the loop before the remaining agents' results are collected
3. _agentset_do_async() (line 124): same asyncio.gather issue as item 1
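For reference, a minimal standard-library sketch of the isolation these paths are missing. The function names below (flaky, healthy, gather_isolated, threads_isolated) are placeholders for illustration, not mesa-llm code, and the real fix may look different.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def flaky():
    raise RuntimeError("LLM timeout")


async def healthy():
    await asyncio.sleep(0.01)  # stand-in for a slow LLM call
    return "ok"


async def gather_isolated():
    # With return_exceptions=True, gather waits for every awaitable and
    # returns exceptions as ordinary items in the result list.
    return await asyncio.gather(flaky(), healthy(), return_exceptions=True)


def flaky_sync():
    raise RuntimeError("LLM timeout")


def healthy_sync():
    return "ok"


def threads_isolated():
    outcomes = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(flaky_sync), pool.submit(healthy_sync)]
        for future in futures:
            try:
                outcomes.append(future.result())
            except Exception as exc:  # keep collecting the remaining futures
                outcomes.append(exc)
    return outcomes


async_outcomes = asyncio.run(gather_isolated())
thread_outcomes = threads_isolated()
```

In both cases the failure ends up as an item in the result list next to the successful "ok", so the caller can decide what to do with it.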
Reproduction
import asyncio

from mesa.model import Model
from mesa.agent import Agent
from mesa_llm.parallel_stepping import step_agents_parallel


class DummyModel(Model):
    def __init__(self):
        super().__init__(seed=42)


class FailingAgent(Agent):
    def __init__(self, model):
        super().__init__(model)

    async def astep(self):
        raise RuntimeError("LLM timeout")


class WorkingAgent(Agent):
    def __init__(self, model):
        super().__init__(model)
        self.counter = 0

    async def astep(self):
        self.counter += 1


async def main():
    m = DummyModel()
    failing = FailingAgent(m)
    working = WorkingAgent(m)
    await step_agents_parallel([failing, working])
    print(working.counter)  # Never reached


asyncio.run(main())
Actual Behavior
Traceback (most recent call last):
  File "repro.py", line 28, in main
    await step_agents_parallel([failing, working])
  File ".../mesa_llm/parallel_stepping.py", line 31, in step_agents_parallel
    await asyncio.gather(*tasks)
  File "repro.py", line 14, in astep
    raise RuntimeError("LLM timeout")
RuntimeError: LLM timeout
WorkingAgent.astep() is cancelled and never completes. The entire simulation crashes.
Expected Behavior
- Failed agents should be isolated — other agents complete normally
- Failures should be logged with agent ID and exception details
- Users should be able to inspect which agents failed and why (e.g., via a structured result object)
- An optional "raise" mode should be available for debugging, but the default should be resilient
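One possible shape for this, as a rough sketch only. StepResult, step_agents_isolated, and the on_error parameter are hypothetical names invented here, not existing mesa-llm API. The sketch assumes each agent exposes astep() and a unique_id attribute.

```python
import asyncio
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class StepResult:  # hypothetical result object, not part of mesa-llm
    succeeded: list = field(default_factory=list)  # ids of agents that finished
    failed: dict = field(default_factory=dict)  # agent id -> exception


async def step_agents_isolated(agents, on_error="log"):
    """Hypothetical resilient variant of step_agents_parallel."""
    if on_error not in ("log", "raise"):
        raise ValueError(f"unknown on_error mode: {on_error!r}")

    # Every agent runs to completion; exceptions come back as values.
    outcomes = await asyncio.gather(
        *(agent.astep() for agent in agents), return_exceptions=True
    )

    result = StepResult()
    for agent, outcome in zip(agents, outcomes):
        agent_id = getattr(agent, "unique_id", repr(agent))
        if isinstance(outcome, BaseException):
            logger.error("Agent %s failed: %r", agent_id, outcome)
            result.failed[agent_id] = outcome
        else:
            result.succeeded.append(agent_id)

    # Debug mode: re-raise the first failure after all agents have stepped.
    if on_error == "raise" and result.failed:
        raise next(iter(result.failed.values()))
    return result
```

With the default on_error="log", the caller gets a StepResult back and can retry or skip the agents listed in failed. With on_error="raise", the first failure is re-raised only after every agent has had a chance to step.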