Why multi-agent LLM systems fail, and how to contain it
Why multi-agent LLM systems fail, grounded in the MAST failure taxonomy and mapped to how each failure propagates across agent topologies and the containment levers that bound the blast radius.
When a multi-agent system returns a wrong answer, you face one decision before any other: was this a fault you should have contained, or one you should have recovered from? Most teams instrument for recovery. They add retries, fallbacks, a human in the loop, then discover that the error had already reached three other agents before anything flagged it. Most multi-agent failures work this way: they are propagation failures, where a fault in one agent reaches others. This page maps where they travel, how far you let them, and which structural levers actually stop the spread.
Which failures spread, and which stay put? The containment map
The short answer: system-design and inter-agent failures spread, and verification failures escape. That distinction decides where you spend engineering effort, because a fault that propagates is one you must contain at the source, while a fault that only escapes at the boundary is one you must gate at the exit. Recovery, which just re-runs the failed step, does nothing for either once the corrupted state has been read by a downstream agent.
The taxonomy below is the load-bearing asset of this page. It takes the failure categories from MAST, the first empirically grounded taxonomy of multi-agent LLM failures, developed from 150 expert-annotated execution traces (kappa 0.88) and scaled to a 1,600+ trace dataset across seven frameworks via an LLM-as-judge pipeline (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657, preprint, as of 2026-07). To those categories it adds the layer MAST does not: for each one, how the error propagates, what its blast radius looks like, and which containment lever bounds it. The propagation and containment columns are our synthesis, grounded in MAST’s modes and in OWASP’s cascading-failure risk class (OWASP Top 10 for Agentic Applications, ASI08, as of 2025-12). Treat them as an analytic framework, not measured percentages.
| MAST category (representative modes) | How the error propagates | Blast-radius signature | Primary containment lever | Contain or recover? |
|---|---|---|---|---|
| System-design & specification (disobey task/role spec, step repetition, loss of conversation history, no termination awareness) | A bad instruction or a lost context is inherited by every agent spawned under it; loops re-enter the same broken state | Structural and recurring: the fault is wired into the seams, so it reappears everywhere the spec is read | Fix the orchestration contract; enforce termination conditions and explicit history hand-offs as invariants | Contain. A retry re-runs the same broken spec and reproduces the fault |
| Inter-agent misalignment (fail to ask for clarification, task derailment, information withholding, ignored input, reasoning-action mismatch) | One agent’s partial or wrong message becomes another agent’s trusted input; small drift compounds hop to hop | Directional fan-out along the communication edges, worst in hub-and-spoke and debate topologies | Typed, verified hand-offs; clarification gates before commitment; cross-checks between peers | Contain at the edge. Downstream recovery inherits the already-corrupted message |
| Verification & termination (premature termination, no or incomplete verification, incorrect verification) | The system stops or approves before the error is caught, so a wrong result exits labeled as a success | Escape to output: bounded inside the system, maximal at the boundary where a user acts on it | An independent verifier the producing agents can’t overrule; hard acceptance gates on the exit | Neither, once shipped. You can only gate the exit before it leaves |
Read the last column first. It is the design decision the rest of this page defends: for the first two categories, containment is the metric that separates architectures that survive a fault from ones that amplify it; for the third, the only move left is a gate, because a wrong answer that has already left the system cannot be pulled back. In MAST’s own corpus, system-design and specification failures were the single largest category of annotated failures, inter-agent misalignment second, and verification the smallest of the three, across the paper’s 1,600+ trace dataset. We report that ordering rather than exact per-category percentages on purpose: the decimals shifted between the paper’s preprint versions and are published without confidence intervals, so the rank is the finding you can lean on.
On the numbers in this piece. Every external figure resolves to a cited primary source, labeled preprint or standards-body, with the date we checked it. Where a source reports a point estimate without an interval, we say so. The compounding-reliability figures further down are deterministic arithmetic from stated assumptions, not a measured benchmark.
Why a chain of capable agents is an unreliable system
Strong single-agent scores do not add up to a reliable pipeline, and the reason is multiplication. Model a workflow as a series of agent hops, each of which either preserves or corrupts the state it receives. If each hop succeeds independently with probability p, the end-to-end success of an N-hop chain is p raised to the N. That is the whole mechanism behind “every demo worked, production doesn’t.”
The arithmetic is unforgiving. The values below are deterministic, computed from the stated assumption of independent, uniform per-step reliability. No measurement is implied.
| Per-step reliability p | 5 hops | 10 hops | 20 hops |
|---|---|---|---|
| 0.99 | 0.95 | 0.90 | 0.82 |
| 0.95 | 0.77 | 0.60 | 0.36 |
| 0.90 | 0.59 | 0.35 | 0.12 |
A 95%-reliable agent, which sounds shippable in isolation, yields roughly a one-in-three end-to-end success rate over a twenty-hop workflow. Drop to 90% per step and the same workflow succeeds about one time in eight. Push each step to 99% and you claw back to four-in-five, which is why so much reliability work is really per-step hardening.
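The table can be reproduced in a few lines. This is a minimal sketch under the same stated assumption (independent, uniform per-step reliability); the function name `chain_success` is ours, chosen for illustration.

```python
def chain_success(p: float, hops: int) -> float:
    """End-to-end success of a chain whose hops each succeed independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return p ** hops


# Rebuild the table: rows are per-step reliability, columns are hop counts.
HOPS = (5, 10, 20)
table = {p: [round(chain_success(p, n), 2) for n in HOPS] for p in (0.99, 0.95, 0.90)}

for p, row in table.items():
    cells = "  ".join(f"{n} hops: {v:.2f}" for n, v in zip(HOPS, row))
    print(f"p={p:.2f}  {cells}")
```

Swapping in your own hop count and per-step estimate is the quickest way to see where a given workflow sits on the curve.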
What counts as a “step” matters here. A single agent turn is rarely one atomic action; it can be a plan, a tool call, a parse of the tool’s output, and a hand-off, each with its own chance of going wrong. So the effective number of hops in a workflow that looks like five agents is often much larger, and the per-step reliability you should plug in is lower than the number you’d quote for the model in isolation. The curve is steeper than it first appears.
Two caveats sit next to this model. First, the independence assumption rarely holds: when a poisoned piece of shared state is read by many agents at once, their failures correlate. For a series chain at a fixed per-step rate, that correlation does not push the all-steps-clean rate below p raised to the N, because positively correlated failures concentrate on fewer runs and leave more runs entirely clean. What it worsens is blast radius, since one poisoned write corrupts many hops at once. It can also add a common-cause failure that lowers the effective per-step rate you should plug in. Second, the model has no containment in it. Add a verification gate that catches and quarantines a fault before it reaches the next hop, and you break the exponent, so the curve stops being the story. That is the entire argument for spending on containment before squeezing another point of per-step accuracy: containment changes the shape of the curve, per-step accuracy only shifts it. Understanding how far a single failure can reach is the prerequisite for either.
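One way to make the second caveat concrete is to split runs into three outcomes instead of two. The sketch below rests on a simplified model of our own, not a measured one: each hop corrupts state with probability 1 - p, a gate after that hop catches the corruption with probability `catch` and halts the run, and a corruption that slips past its gate is never caught later. The name `chain_outcomes` and the parameter `catch` are hypothetical.

```python
def chain_outcomes(p: float, hops: int, catch: float) -> dict[str, float]:
    """Split runs of a gated chain into clean, quarantined, and shipped-wrong.

    Assumed model (ours, for illustration): the first corrupted hop is either
    caught by its gate (run halts) or missed (the wrong result ships).
    """
    clean = p ** hops        # no hop ever corrupted the state
    faulted = 1.0 - clean    # some hop was the first to corrupt it
    return {
        "clean": clean,
        "quarantined": catch * faulted,            # first fault stopped at its own hop
        "shipped_wrong": (1.0 - catch) * faulted,  # first fault slipped past its gate
    }


for catch in (0.0, 0.5, 0.9):
    out = chain_outcomes(p=0.95, hops=20, catch=catch)
    print(catch, {k: round(v, 3) for k, v in out.items()})
```

Under these assumptions, a gate that catches 90% of faults on a 20-hop chain at p = 0.95 moves the share of runs that ship a wrong answer from about 0.64 to about 0.06. The clean rate is unchanged; what the gate changes is whether a failure is silent or visible.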
System-design and specification failures: the largest category
These are the failures wired in before the agents ever run, and MAST found them to be the most common category in its corpus. They share a signature: the fault lives in the setup, so it reproduces on every execution and every retry. You can’t recover your way out of a broken contract.
Disobeying the task or role specification is the base case. An agent given an ambiguous or overloaded prompt does something adjacent to what you meant, and because downstream agents treat its output as authoritative, the deviation is now the workflow’s ground truth. Step repetition is the loop failure: an agent re-enters a state it has already processed, burning budget and, worse, re-emitting a slightly different result each pass so that later agents can’t tell which version is canonical. Loss of conversation history removes the context an agent needs to stay consistent with earlier decisions, and in a shared-memory design that gap propagates to everyone reading the same store. Unawareness of termination conditions is the failure that never ends: no agent holds the authority or the signal to declare the task done, so the system spins.
Propagation here is structural. The error is not an unlucky draw from a sound process; it is a faithful draw from a mis-specified one, so it recurs deterministically rather than randomly. That makes containment an architecture problem, addressed before runtime rather than during it. The levers that bound it are contractual: explicit, testable role and task specifications; idempotent steps so a repeat is a no-op instead of a fork; an authoritative history hand-off contract so no agent silently loses state; and hard termination conditions owned by the orchestrator. A workflow that gets these right narrows the class of faults that can even enter the topology. It buys resistance to cascades at the seams, where that resistance is cheapest.
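Two of these contractual levers, orchestrator-owned termination and detection of step repetition, can be sketched as a small control loop. Everything here is hypothetical scaffolding: `run_workflow`, `step`, and `is_done` are illustrative names, and state is modeled as a plain string so it can be compared across turns.

```python
from typing import Callable


def run_workflow(
    step: Callable[[str], str],
    is_done: Callable[[str], bool],
    state: str,
    max_turns: int = 8,
) -> str:
    """Run `step` until `is_done` holds; the orchestrator, not the agent, decides when to stop."""
    seen: set[str] = set()  # states already processed in this run
    for _ in range(max_turns):
        if is_done(state):
            return state  # termination is an explicit, checked condition
        if state in seen:
            # Step repetition: the workflow re-entered a state it already handled.
            raise RuntimeError(f"loop detected at state {state!r}")
        seen.add(state)
        state = step(state)
    if is_done(state):
        return state
    # Running out of turns is a failure, never an implicit "done".
    raise RuntimeError("turn budget exhausted without meeting the termination condition")
```

The design choice worth noting is that exhausting the budget raises instead of returning the last state, so a partial result cannot leave the loop looking finished.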
Inter-agent misalignment: where coordination actually breaks
Coordination failures are the category that makes multi-agent systems distinct from a single long prompt, and they are the ones a single-model benchmark can never surface. MAST groups six modes here, and every one of them is a message-passing defect: the pathology is in what one agent tells another, or fails to.
Consider the shape of it. An agent that fails to ask for clarification proceeds on a guess, and that guess is transmitted downstream as fact. Task derailment is drift: each agent nudges the objective slightly, and over several hops the system is confidently solving a different problem than the one you posed. Information withholding is the silent version: an agent holds back a detail a peer needed, and the peer, having no way to know the gap exists, fills it with a plausible fabrication. Ignoring another agent’s input discards a correct signal that was actually present in the conversation. Reasoning-action mismatch, one of the more frequent modes MAST logged, is the agent whose stated plan and executed action diverge, so the trace reads as correct while the behavior is not. Conversation reset throws away accumulated alignment and starts a sub-dialogue from scratch.
Picture an orchestrator whose research agent returns a figure it never actually verified, phrased with the same confidence it uses for checked facts. The writing agent quotes the figure, the reviewing agent checks the prose rather than the source, and the number ships. No single agent lied; each trusted the hop before it. That is the common thread across all six modes: trust without verification. Each agent treats an upstream message as clean input, so one corrupted hand-off becomes every downstream agent’s premise. This is directional fan-out, and its blast radius depends on the wiring. In a hub-and-spoke orchestrator the bad message reaches every worker the hub then dispatches; in a debate it can converge the whole panel onto one agent’s confident error.
Containment for this category lives on the edges between agents, not inside them.
- Typed, verified hand-offs. Treat an inter-agent message like an API boundary: validate structure and key claims before the receiver acts, so a malformed or low-confidence hand-off is rejected instead of trusted.
- Clarification gates. Require an agent to resolve a flagged ambiguity before it commits, turning a silent guess into an explicit question.
- Peer cross-checks. Have an independent agent check a claim against the source before it becomes a shared premise, so information withholding and reasoning-action mismatch surface at the hop where they occur.
The goal of each lever is the same: catch the corrupted message at the edge it crosses, so the fraction of faults held to a single hop stays high and drift never gets the second hop it needs to compound.
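The first two levers can be sketched as a validation step at the receiving edge. This is an illustration under assumptions, not a prescribed schema: the `Claim` and `Handoff` types, the `accept` function, and the 0.8 threshold are all hypothetical, and self-reported confidence is a weak signal that a real system would back with a peer cross-check against the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    source: Optional[str]  # where the producing agent says it checked the claim
    confidence: float      # producer's self-reported confidence in [0, 1]


@dataclass(frozen=True)
class Handoff:
    sender: str
    claims: tuple[Claim, ...]
    open_questions: tuple[str, ...] = ()  # ambiguities the sender flagged but did not resolve


class HandoffRejected(Exception):
    """Raised when a message fails validation at the edge it is crossing."""


def accept(handoff: Handoff, min_confidence: float = 0.8) -> Handoff:
    """Validate a hand-off before the receiver acts on it; reject instead of trusting."""
    if handoff.open_questions:
        # Clarification gate: a flagged ambiguity must be resolved before commitment.
        raise HandoffRejected(f"unresolved ambiguity: {handoff.open_questions[0]}")
    for claim in handoff.claims:
        if not claim.source:
            raise HandoffRejected(f"unsourced claim: {claim.text!r}")
        if not 0.0 <= claim.confidence <= 1.0:
            raise HandoffRejected(f"malformed confidence on claim: {claim.text!r}")
        if claim.confidence < min_confidence:
            raise HandoffRejected(f"low-confidence claim: {claim.text!r}")
    return handoff
```

In the research-agent scenario above, the unverified figure would arrive with no `source` and be rejected at the first hop instead of becoming the writing agent's premise.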
Verification and termination: the failures that ship
This is the smallest MAST category and the most dangerous, because it is the one whose failures leave the building. The first two categories produce internal faults; this one decides whether an internal fault becomes an external one. Recovery is not the primary metric for these systems; containment is. For verification failures, containment means gating the exit before a wrong answer is ever emitted.
Premature termination stops the workflow before the task is actually complete, returning a partial result dressed as a finished one. No or incomplete verification means nothing checked the output against the requirements at all, so any error from the previous two categories flows straight through. Incorrect verification is the quiet trap: a check ran, and it was wrong, which is worse than no check because it manufactures false confidence. A skipped check leaves a visible gap a reviewer might close; a passing-but-wrong check closes the gap with a green light nobody reopens. MAST reports that even after targeted interventions on the top failure modes, task-completion rates stayed low. Verification is a structural gap in these systems, and patching one mode leaves the category open.
The blast radius here is unusual: near zero inside the system and maximal at the boundary. A verification failure does not spread to other agents; it escapes to the user, who then acts on a result the system certified as correct. That asymmetry dictates the containment design. The verifier must be independent, because an agent that also produced the answer can’t be trusted to grade it, the same conflict that makes a model a biased judge of its own output. The gate must be hard: a result that fails verification does not ship with a warning, it does not ship at all. And termination must be a checked condition, not a default the system falls into when it runs out of turns. A production tracing tool will tell you a verification failure happened; it will not tell you how far the next one will reach, which is the question containment is built to answer.
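The independence and hard-gate requirements can be expressed as a thin release function. A minimal sketch, assuming agents are identified by name and the verifier is a plain predicate; `release`, `VerificationFailed`, and the parameter names are ours.

```python
from typing import Callable


class VerificationFailed(Exception):
    """Raised when a result fails the exit gate; nothing is returned to the caller."""


def release(
    result: str,
    producer: str,
    verifier_name: str,
    verify: Callable[[str], bool],
) -> str:
    """Return `result` only if an independent verifier accepts it."""
    if verifier_name == producer:
        # An agent that produced the answer is not allowed to grade it.
        raise ValueError("verifier must be independent of the producing agent")
    if not verify(result):
        # Hard gate: a failing result does not ship with a warning, it does not ship.
        raise VerificationFailed("result failed the exit gate and was not released")
    return result
```

The gate is only as good as the `verify` predicate: an incorrect check still passes wrong results, which is why the verifier itself needs testing.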
How the same fault behaves in different topologies
A failure mode is only half the picture. The other half is the topology it runs on, because the same corrupted message has a completely different blast radius depending on the wiring. This is the crux of OWASP’s cascading-failures risk class (ASI08), where false signals cascade through automated pipelines with escalating impact. The table below crosses the four common topologies with their dominant propagation behavior. It uses a different lens than the taxonomy table above, on purpose: this one is about wiring, the previous one about failure class.
| Topology | Dominant propagation behavior | MAST modes that bite hardest | Containment that fits the wiring |
|---|---|---|---|
| Orchestrator–worker (hub-and-spoke) | The hub trusts a worker’s result and redistributes it, so one bad return fans out to every sibling the hub then dispatches | Task derailment, information withholding, ignored input | Verify at the hub before redistribution; circuit-break a worker that returns low-confidence output |
| Sequential pipeline (chain) | Monotonic compounding: each stage consumes the last, so an early error is amplified with no path back | Step repetition, premature termination, no verification | Stage-level acceptance gates; idempotent stages; a checkpoint that can halt the chain |
| Multi-agent debate / voting | Convergence to a shared error: a confident wrong agent herds the panel, and majority vote launders the mistake | Reasoning-action mismatch, fail to clarify, ignored input | Independent seeding; a verifier outside the vote; weight by evidence, not by confidence |
| Shared-state / blackboard | Global poisoning: one bad write to shared memory is read by every agent at once, so errors correlate | Loss of conversation history, conversation reset, information withholding | Validate on write; version and scope the store; isolate agents into separate failure domains |
The pattern across the rows is that topology sets the ceiling on blast radius and containment lowers it. A chain compounds but is easy to gate stage by stage. A blackboard is efficient but has the widest reach: a single poisoned write is inherited globally rather than passed hop by hop, so one fault in a shared store lands on every agent at once instead of decaying along a path. Debate can correct errors or amplify them, and which one you get depends on whether the panel is genuinely independent or just several samples herding toward the loudest confidence. Independence is the load-bearing property: three agents seeded from the same context and prompt give you one opinion sampled three times rather than three independent views, so a majority vote over correlated samples reports agreement, not correctness. No topology is reliability-optimal by itself. What separates a safe one from a dangerous one is whether you’ve measured and bounded its failure behavior.
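The ceiling-and-lowering idea can be sketched as reachability on a directed graph of who reads whose output. This is an illustration under assumptions: the toy graphs, the `blast_radius` function, and the `gated` parameter are hypothetical, and a gated node is assumed to catch every fault that reaches it, which a real gate will not.

```python
from collections import deque


def blast_radius(
    edges: dict[str, list[str]],
    origin: str,
    gated: frozenset[str] = frozenset(),
) -> tuple[int, int]:
    """Worst-case (depth, breadth) of a fault starting at `origin`.

    depth   = most hops the fault travels along shortest paths
    breadth = number of other nodes it reaches
    Nodes in `gated` are assumed to stop the fault: not corrupted, not forwarding.
    """
    depth_of = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt in gated or nxt in depth_of:
                continue
            depth_of[nxt] = depth_of[node] + 1
            queue.append(nxt)
    return max(depth_of.values()), len(depth_of) - 1


chain = {"a": ["b"], "b": ["c"], "c": ["d"]}
hub = {"w1": ["hub"], "hub": ["w2", "w3", "w4"]}
blackboard = {"a": ["store"], "store": ["a", "b", "c", "d"]}

print("chain from a:     ", blast_radius(chain, "a"))
print("hub from w1:      ", blast_radius(hub, "w1"))
print("hub, gated at hub:", blast_radius(hub, "w1", gated=frozenset({"hub"})))
print("blackboard from a:", blast_radius(blackboard, "a"))
```

The ungated numbers are the ceiling the wiring allows; adding a node to `gated` shows how much of that reach a single well-placed check removes.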
Containment beats recovery: the levers that bound blast radius
Containment is a small, transferable set of design moves, borrowed directly from how resilient distributed systems already handle cascading service failures: circuit breakers at the orchestration layer, idempotent agent actions, bounded error-correction, and isolating agents into failure domains to limit blast radius. These are standard resilience patterns; the shift is applying them to LLM agents, the same class of fault OWASP’s ASI08 flags for agentic systems. Mapped onto the MAST modes, they form a coherent containment layer.
- Circuit breakers at the orchestration layer. When an agent or a tool starts returning low-confidence or malformed output, the orchestrator trips a breaker and stops routing work through it, instead of letting each caller rediscover the failure. This is the direct counter to fan-out in hub-and-spoke topologies.
- Idempotent actions. If a step can be safely repeated with no additional effect, step repetition becomes a no-op instead of a fork, and retries stop manufacturing divergent versions of the same result.
- Bounded error-correction. Cap how hard the system tries to self-heal. Unbounded correction is its own cascade: one agent’s fix triggers another’s, whose recovery attempt triggers a third, and the system amplifies the original problem instead of damping it.
- Failure-domain isolation. Scope shared state and tool access so a poisoned write or a rogue agent is confined to a blast domain rather than inherited by the whole topology. This is the strongest single lever against blackboard-style global poisoning.
- Independent verification gates. An exit gate the producing agents can’t overrule turns the verification category from an escape hatch into a wall.
None of these levers is exotic; the failure is that agent pipelines ship without the isolation and breakers a service team would treat as mandatory. The borrowed vocabulary is deliberate: a payments service that called three flaky dependencies in series without a circuit breaker would never pass review, yet the equivalent agent pipeline often ships because the components are LLMs and the scaffolding feels optional. It is not. Deciding which lever to spend on is a triage against the two taxonomy tables: find the modes your topology amplifies, and buy the containment that bounds those first. You can browse the vocabulary and the underlying work through the reliability glossary and our research index as you make that call.
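The circuit-breaker lever translates almost directly from service code. A minimal sketch, assuming the orchestrator can judge an output with a caller-supplied `is_acceptable` predicate; the class and exception names are ours, and the half-open and timed-reset behavior a production breaker would have is deliberately left out.

```python
from typing import Callable


class CircuitOpen(Exception):
    """The breaker has tripped; the worker is not called."""


class BadOutput(Exception):
    """The worker returned output the orchestrator refused to pass on."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(
        self,
        worker: Callable[[str], str],
        task: str,
        is_acceptable: Callable[[str], bool],
    ) -> str:
        if self.is_open:
            # Stop routing work here so callers do not each rediscover the failure.
            raise CircuitOpen("worker disabled after repeated bad output")
        output = worker(task)
        if is_acceptable(output):
            self.consecutive_failures = 0  # a good result resets the count
            return output
        self.consecutive_failures += 1
        raise BadOutput("output rejected at the orchestration layer")
```

Note that a rejected output raises instead of being returned, so a bad result is never redistributed to sibling workers even before the breaker trips.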
How to measure whether your system actually contains failures
Containment is a claim, and an unmeasured claim is a hope. Three metrics make it measurable, and each has a canonical definition in the containment vocabulary:
- Blast radius. How far a single injected fault reaches, measured as propagation depth (how many hops) and breadth (how many agents touched) before something stops it. A high blast-radius measurement means the fault fanned out; a low one means it was caught early.
- Containment rate. The fraction of seeded faults held at hop one. This is the number that separates a chain that gates well from one that compounds, and it is always reported with its interval.
- Cascade resistance. A topology’s tendency to damp a fault versus amplify it, which lets you compare two architectures on the axis that predicts production reliability better than single-run task scores do. A topology’s cascade-resistance score is what ranks one wiring against another before either ships.
The measurement method is fault injection, not passive observation: introduce a controlled fault at a known point, then trace how far it propagates across the topology, and repeat enough times to put an interval on the containment rate. How many times depends on the precision you need and the variance you observe; a handful of runs gives you a wide interval, and a tight interval on a rare failure needs many.
A worked illustration, on schematic inputs: inject 100 identical faults into a three-agent chain, and suppose the stage gates quarantine the fault at the first hop 80 times, at the second hop 15 times, and let it reach the output 5 times. The one-hop containment rate is 80 out of 100, and the mean propagation depth is 1.25 hops: (80×1 + 15×2 + 5×3) / 100 = 125 / 100. Those figures are chosen to show the arithmetic, not read off a run, and both still need an interval before you’d trust them on a real system. This is where the eval-rigor half of reliability meets the propagation half. One run tells you what happened once; it doesn’t tell you what your containment rate is, any more than one coin flip tells you the coin is fair. The same discipline that demands confidence intervals on eval scores demands them on containment numbers, including ours.
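The same arithmetic, with an interval attached, fits in a short script. The counts are the schematic ones from the illustration above, not measurements. The Wilson score interval is our choice of method for a binomial proportion; the text does not prescribe one, and other intervals would be equally defensible.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    phat = successes / trials
    denom = 1 + z * z / trials
    center = (phat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(phat * (1 - phat) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Schematic counts: hop at which each injected fault was stopped (3 = reached the output).
stopped_at = {1: 80, 2: 15, 3: 5}

runs = sum(stopped_at.values())
containment_rate = stopped_at[1] / runs
mean_depth = sum(depth * count for depth, count in stopped_at.items()) / runs
low, high = wilson_interval(stopped_at[1], runs)

print(f"one-hop containment rate: {containment_rate:.2f} (95% Wilson interval {low:.2f} to {high:.2f})")
print(f"mean propagation depth:   {mean_depth:.2f} hops")
```

On these inputs the interval runs from roughly 0.71 to 0.87, which shows how much uncertainty 100 injections still leave around a point estimate of 0.80.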
On that last point, plainly: this site points toward a reliability profiler, a pre-launch instrument designed to inject controlled faults, measure how far they cascade across a multi-agent topology, and report a containment rate with a confidence interval. That is design intent. No containment number has been measured, so this page claims none, and every external figure on it traces back to a source you can verify. The flagship is held to that same standard.
Where this goes next
The move that changes your reliability posture is small: stop asking whether an agent can recover from a fault and start asking how far that fault reaches before anything stops it. That reframing turns a vague worry about “agents being flaky” into three measurable quantities (reach, containment, and resistance) that you can put intervals on and hold a system to.
From here, the containment vocabulary is the next layer to internalize: the precise definitions of blast radius, containment rate, and cascade resistance, each with the measurement method behind it. The taxonomy on this page tells you which failures spread and which escape. The vocabulary tells you how to measure whether your system holds. And the research program publishes each propagation and containment figure under the method and interval that measured it.