bug: Parallel stepping crashes entire simulation when a single agent's step fails #220

@yashhzd

Bug Description

When using parallel agent stepping (step_agents_parallel, step_agents_multithreaded, or do_async), if any single agent's astep() or step() raises an exception, all other agent tasks are cancelled/abandoned and the entire simulation crashes.

This is particularly problematic for LLM-backed agents where transient failures are expected and common (rate limits, timeouts, malformed JSON responses, network errors, etc.). A single flaky LLM response should not kill a 100-agent simulation.

Environment

  • mesa-llm: v0.3.0 (commit a8161c7)
  • mesa: 3.5.0
  • Python: 3.12

Root Cause

Three parallel execution paths in mesa_llm/parallel_stepping.py lack error isolation:

  1. step_agents_parallel() (line 31): asyncio.gather(*tasks) is called without return_exceptions=True, so the first exception propagates straight to the caller and the outcomes of the other coroutines are discarded (gather() does not cancel them itself, but nothing awaits them afterwards)
  2. step_agents_multithreaded() (lines 52–53): bare future.result() in a loop with no try/except, so the first exception aborts the loop and the remaining agents' results are never collected
  3. _agentset_do_async() (line 124): same asyncio.gather issue as item 1
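
To illustrate the two isolation patterns these code paths are missing, here is a minimal standalone sketch. It does not use mesa or mesa-llm; flaky, sync_flaky, gather_isolated, and threads_isolated are hypothetical stand-ins rather than names from the library, and the real fix in parallel_stepping.py may look different.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed


async def flaky(i):
    # Hypothetical stand-in for an agent's astep(); agent 0 simulates an LLM timeout.
    if i == 0:
        raise RuntimeError("LLM timeout")
    await asyncio.sleep(0)
    return i


async def gather_isolated(n):
    # With return_exceptions=True, gather() returns exceptions as ordinary
    # results (in input order) instead of raising the first one to the caller.
    return await asyncio.gather(*(flaky(i) for i in range(n)), return_exceptions=True)


def sync_flaky(i):
    # Hypothetical stand-in for a synchronous step().
    if i == 0:
        raise RuntimeError("LLM timeout")
    return i


def threads_isolated(n):
    # Wrapping each future.result() in try/except keeps one failure
    # from aborting the collection loop.
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(sync_flaky, i): i for i in range(n)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                results[i] = fut.result()
            except Exception as exc:
                errors[i] = exc
    return results, errors


async_results = asyncio.run(gather_isolated(3))
thread_results, thread_errors = threads_isolated(3)
```

In both cases the failure of worker 0 ends up as a value the caller can inspect, while workers 1 and 2 still report their results.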

Reproduction

from mesa.model import Model
from mesa.agent import Agent
from mesa_llm.parallel_stepping import step_agents_parallel

class DummyModel(Model):
    def __init__(self):
        super().__init__(seed=42)

class FailingAgent(Agent):
    def __init__(self, model):
        super().__init__(model)

    async def astep(self):
        raise RuntimeError("LLM timeout")

class WorkingAgent(Agent):
    def __init__(self, model):
        super().__init__(model)
        self.counter = 0

    async def astep(self):
        self.counter += 1

import asyncio

async def main():
    m = DummyModel()
    failing = FailingAgent(m)
    working = WorkingAgent(m)
    await step_agents_parallel([failing, working])
    print(working.counter)  # Never reached

asyncio.run(main())

Actual Behavior

Traceback (most recent call last):
  File "repro.py", line 28, in main
    await step_agents_parallel([failing, working])
  File ".../mesa_llm/parallel_stepping.py", line 31, in step_agents_parallel
    await asyncio.gather(*tasks)
  File "repro.py", line 14, in astep
    raise RuntimeError("LLM timeout")
RuntimeError: LLM timeout

The exception propagates out of step_agents_parallel(), so the outcome of WorkingAgent.astep() is never observed and the print() call is never reached. The entire simulation crashes.

Expected Behavior

  • Failed agents should be isolated — other agents complete normally
  • Failures should be logged with agent ID and exception details
  • Users should be able to inspect which agents failed and why (e.g., via a structured result object)
  • An optional "raise" mode should be available for debugging, but the default should be resilient
