Reliability testing

How to test agent reliability beyond a single eval run

Single-run eval samples agent reliability once. Rigorous testing measures it across many runs with confidence intervals, statistical power, pass^k, and fault injection for cascade propagation.

A 95%-per-run agent clears all five of five runs only about 77 percent of the time (computed from stated assumptions, not measured). That gap, between a per-run pass rate and a five-run one, is the whole problem with judging an agent from a single eval run. One run returns one number, and re-running the identical suite moves it, because decoding is stochastic and the tool calls underneath it are not perfectly ordered. Single-run eval tells you the agent can pass. Whether it will pass, run after run and hop after hop, is a different measurement, and this page is the method for making it.

The reliability-testing framework: five dimensions one run cannot reach

Reliability is a statistical property of a system under repetition, and single-run eval measures none of it. The table below lays out the method in full. It is keyed on the five testing dimensions that separate a reliability test from a task score, and for each one it gives what a single run structurally cannot see, the rigorous method that recovers it, and the judgment call for when that dimension is the constraint actually binding your release.

Testing dimension | What single-run eval misses | The rigorous method | Judgment: when this dimension binds
Run-to-run variance | One run is one draw; stochastic decoding and nondeterministic tool ordering make the same suite score differently next time, and a single number hides the spread | Re-run the identical suite N times; report the mean pass rate with its across-run standard deviation | Binding the moment your pass rate moves between runs on identical inputs; high variance means no single number is decision-grade
Sample size for a target power | A one-run delta cannot separate a real regression from sampling noise, so an underpowered suite both misses true drops and blocks on chance ones | Size N from the smallest regression you must catch and the power you want to catch it with, via the normal-approximation power calculation | Binding whenever you gate a release on an eval delta; an undersized suite ships the regression you thought you tested for
Confidence intervals | A bare pass rate (0.86) is not comparable across versions; you cannot tell a genuine improvement from run noise without its spread | Report every pass rate with a 95% interval (Wilson for a proportion, bootstrap for compound metrics) and compare the intervals | Binding whenever you rank or compare systems; two point estimates whose intervals overlap are not a ranking
pass^k and reliability@k | pass@k rewards succeeding once and climbs toward 1 as k grows, so it scores best-of-k capability, not the every-run consistency an autonomous agent needs | Track pass^k, the per-task probability all k independent runs pass, and report its suite-level mean as reliability@k, at the k that matches how many times the agent runs before a human checks its output | Binding for repeated or unattended execution; a high pass@k over a low pass^k is a system that can succeed and routinely will not
Cascade / propagation injection | Task eval scores the final answer and never sees a mid-pipeline fault travel between agents, which is where multi-agent systems break | Inject a controlled fault at a known hop, trace its blast radius in depth and breadth, and report a containment rate with an interval | Binding for any multi-agent topology; single-run task eval is blind to the failure class that dominates multi-agent production

Four dimensions are about the statistics of a single agent scored many times: variance, power, intervals, and the pass^k reframe. The fifth is about the structure of many agents scored together: how a fault moves once it starts. A serious reliability test covers both halves, because a system can be statistically stable on every isolated task and still cascade the first time a real fault lands mid-pipeline and travels from agent to agent.

The rest of this page works down the table, dimension by dimension, with the arithmetic each one rests on.

One run is one sample, so treat it like one

The first correction is the cheapest and the one teams skip most. Run your suite again.

An agent’s pass rate is a random variable. Temperature above zero makes generation a sampling process; retrieval order, tool latency, and concurrent state make the scaffold around it nondeterministic too. So the 86 percent you recorded this morning is one realization of a distribution, and the first question to ask of any eval number is how wide that distribution is. The pass@k work on code generation made the same point from the capability side: the authors found that repeated sampling was a surprisingly effective strategy, which only has teeth because each sample lands somewhere different (Chen et al., Evaluating Large Language Models Trained on Code, arXiv:2107.03374, preprint, as of 2026-07).

Before you can re-run anything, decide what one run is. A trial is the whole unit you will hold a user to, which for an agent usually means an end-to-end task attempt including its tool calls and hand-offs, scored pass or fail on the outcome that matters. Scoring at the wrong granularity quietly corrupts the statistics: count individual tokens or individual tool calls as trials and you inflate N with events the user never experiences as separate successes, which shrinks your interval to a width the system has not earned. Fix the unit first, then repeat it.

The method is to re-run the identical suite N times at that fixed unit and record the pass rate each time. Report the mean and the standard deviation across runs. If the spread is a few points, your single number was roughly safe to quote. If the spread is fifteen points, your “86 percent” was a coin you happened to see land heads, and every downstream decision built on it inherited that noise.
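The re-run step above can be sketched in a few lines. This is a minimal illustration, not a harness: the function name summarize_runs and the five pass rates are hypothetical, and the numbers are made up to show the arithmetic, not measured on any system.

```python
import statistics


def summarize_runs(run_pass_rates):
    """Return the mean and sample standard deviation of suite pass rates
    recorded across repeated runs of one frozen suite."""
    if len(run_pass_rates) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    mean = statistics.mean(run_pass_rates)
    spread = statistics.stdev(run_pass_rates)  # sample SD, n - 1 denominator
    return mean, spread


# Hypothetical pass rates from five re-runs of the same suite (illustrative only).
illustrative_runs = [0.86, 0.81, 0.90, 0.84, 0.88]
mean_rate, spread = summarize_runs(illustrative_runs)
print(f"mean pass rate {mean_rate:.3f}, across-run SD {spread:.3f}, "
      f"runs {len(illustrative_runs)}")
```

Five runs is enough to see whether a spread exists, not enough to pin it down; the run count that supports a decision is a separate question.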

Variance is itself a property of the system you are shipping. A high-variance agent is less reliable than a low-variance one at the same mean, because a user meets a single run rather than the average of many, and two agents that both score 0.86 on average are not interchangeable if one holds a two-point spread and the other swings twenty. Several distinct sources feed that spread, and separating them changes what you fix: sampling temperature and top-p set the decode-level noise, nondeterministic retrieval and tool ordering add scaffold-level noise, and a drifting external dependency adds a slow trend that repeated runs on one afternoon will not even see. Measuring the spread is the prerequisite for everything below, because both the interval you report and the power you can achieve are functions of it.

pass@k measures capability; pass^k measures reliability

The single most useful reframe in agent testing is the difference between succeeding once and succeeding every time. Conflating the two is how a system looks shippable and behaves unreliably.

pass@k comes from code generation, where you sample k attempts and keep any that works. Formally, a problem is scored as solved when at least one of the k samples passes the tests, and the reported score is the fraction of problems solved that way (Chen et al., arXiv:2107.03374, preprint, as of 2026-07). pass@k is the right metric when a human or verifier picks the best of several drafts. For an autonomous agent that runs unattended it certifies the wrong thing: it rewards getting the answer once in k tries and says nothing about the other k-1.

Reliability is the consistency counterpart to pass@k. Call it pass^k: the probability that all k independent runs pass. Under the working assumption of independent runs with a constant per-run pass probability p, pass^k equals p raised to the k, while best-of-k pass@k equals 1 minus (1 minus p) raised to the k. The two pull apart fast, and the gap between them is precisely what single-run and best-of-k reporting hide. The table is computed from those stated assumptions and is not measured on any system.

Per-run pass rate p | pass@5 (best of 5, any passes) | pass^5 (all 5 pass) | Capability-reliability gap | Judgment
0.99 | ~1.000 | 0.951 | ~0.05 | Even a near-perfect agent slips ~5 points once you require all five
0.95 | ~1.000 | 0.774 | ~0.23 | A “shippable” 95% agent fails at least one of five runs roughly a quarter of the time
0.90 | ~1.000 | 0.590 | ~0.41 | Best-of-five reads as flawless while all-five is barely better than even
0.80 | 0.9997 | 0.328 | ~0.67 | pass@k certifies a success the system will almost never repeat cleanly

Read across any row and the story is the same: best-of-five sits at or near certainty while all-five falls well short of it. The gap widens with k. As k rises, pass@k inflates toward certainty and pass^k collapses toward zero, so a leaderboard reporting best-of-k improves as you sample harder while the system a user actually depends on gets less reliable. The number you track should match how your system runs. If an agent executes a twenty-step workflow with no human between steps, the relevant quantity is closer to pass^20, and at 0.95 per step that is about 0.36, computed from the same assumptions. Choose k to match the number of consecutive successes your product needs, then report pass^k at that k.
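The two formulas are short enough to check directly. The sketch below assumes, as the text does, independent runs with a constant per-run pass probability p; the function names pass_at_k and pass_pow_k are illustrative, and nothing here is measured.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent runs passes."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p, k):
    """Probability that all k independent runs pass (pass^k)."""
    return p ** k


# Reproduce the k = 5 comparison under the stated independence assumption.
for p in (0.99, 0.95, 0.90, 0.80):
    any_pass = pass_at_k(p, 5)
    all_pass = pass_pow_k(p, 5)
    print(f"p={p:.2f}  pass@5={any_pass:.4f}  pass^5={all_pass:.3f}  "
          f"gap={any_pass - all_pass:.2f}")

# The twenty-step unattended workflow at 0.95 per step.
print(f"pass^20 at p=0.95: {pass_pow_k(0.95, 20):.2f}")
```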

Independence is the assumption doing the work here, and real runs bend it: when the k runs of one task are coupled (a shared seed, cached state, a common hard sub-step), they pass or fail together, so the true all-pass rate sits above p raised to the k, not below it. Positive correlation makes p raised to the k a conservative floor; the risk that actually bites is a heterogeneous suite, where a healthy average pass^k masks a cluster of hard tasks that almost never pass clean. Report the assumption, and the task-to-task spread, alongside the number.

Choosing k is the design decision that gives the metric its meaning, and it is a product question before it is a statistical one. Set k to the number of consecutive attempts your system runs before a human, a verifier, or a hard gate inspects the output. A chatbot with a person reading every reply lives near k of 1, so best-of-k pass@k and pass^k at a small k are both reasonable to track. A nightly agent that executes a fifty-step pipeline unattended lives near k of 50, where only a high pass^k has any bearing on whether the run completes clean. We track the per-task all-pass quantity as pass^k, and report its suite-level aggregate as reliability@k; the label matters less than reporting it at the k your system actually operates at, because a pass^k quoted at a convenient small k is as misleading as a pass rate quoted with no interval.
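To compute pass^k from recorded runs rather than from an assumed p, you need an estimator. The page does not prescribe one; the sketch below swaps in one common choice as an assumption: with c passes in n runs of a task, C(c, k) / C(n, k) is the fraction of size-k subsets of those runs that all pass, the all-pass counterpart of the unbiased pass@k estimator. Task names and counts are hypothetical.

```python
from math import comb


def pass_pow_k_estimate(n, c, k):
    """Estimate the probability that k runs all pass, from c passes in n runs.

    Assumption (not from the source): uses C(c, k) / C(n, k), which treats
    the n recorded runs as exchangeable draws for one task.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when k > c


def reliability_at_k(task_results, k):
    """Suite-level mean of per-task pass^k, plus the per-task values."""
    per_task = {task: pass_pow_k_estimate(n, c, k)
                for task, (n, c) in task_results.items()}
    return sum(per_task.values()) / len(per_task), per_task


# Hypothetical suite: (runs, passes) per task, ten runs each.
illustrative_suite = {
    "lookup": (10, 10),
    "refund": (10, 10),
    "multi_step_booking": (10, 6),
}
suite_mean, per_task = reliability_at_k(illustrative_suite, k=5)
print(f"reliability@5 = {suite_mean:.3f}")
for task, value in per_task.items():
    print(f"  {task}: pass^5 = {value:.3f}")
```

In this made-up suite the mean looks tolerable while one task almost never passes five times in a row, which is why the per-task spread belongs next to the aggregate.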

How many runs, and the interval you are buying

“How many runs is enough” has an answer, and it falls out of the precision you need rather than from a round number someone picked. Every pass rate you report is an estimate of a proportion, and the confidence interval on a proportion narrows with the square root of the sample size.

Using the normal approximation to the binomial, the standard method for a proportion (NIST/SEMATECH e-Handbook, Sample sizes required, primary, as of 2026-07), a 95% interval has a half-width of about 1.96 times the square root of p times (1 minus p) over N. The half-width is widest at p near 0.5 and tighter toward the extremes. The table gives the worst-case half-width and the half-width at a strong 0.9 pass rate, computed from those stated assumptions and not measured.

Runs N | 95% half-width at p = 0.5 (worst case) | 95% half-width at p = 0.9 | Judgment: what this N lets you claim
10 | ~31 pp | ~19 pp | A half-width of 30 points decides nothing; it catches only catastrophic gaps
30 | ~18 pp | ~11 pp | Fine as a smoke test; too wide to gate a release
100 | ~10 pp | ~6 pp | Sees large regressions; blind to small real ones
384 | ~5 pp | ~3 pp | The common budget when you need to resolve a five-point change
1000 | ~3 pp | ~2 pp | Needed to call a small but genuine delta with confidence

The shape of the curve is the operational lesson. Going from 10 runs to 100 roughly cuts the interval by a factor of three; going from 100 to 1000 buys only another factor of three for ten times the compute. Precision gets expensive precisely where release decisions get interesting, which is why picking N should be a deliberate budget.
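The half-widths in the table come from one line of arithmetic, sketched here under the same normal approximation. The function name half_width is illustrative; z is fixed at 1.96 for a 95% interval.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def half_width(p, n, z=Z_95):
    """Normal-approximation half-width of the interval on a proportion."""
    return z * sqrt(p * (1.0 - p) / n)


for n in (10, 30, 100, 384, 1000):
    worst = 100 * half_width(0.5, n)
    strong = 100 * half_width(0.9, n)
    print(f"N={n:>4}: +/-{worst:.0f} pp at p=0.5, +/-{strong:.0f} pp at p=0.9")
```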

Power is the same calculation pointed at a decision instead of an estimate. If you gate releases on eval deltas, the question shifts from “what is the pass rate” to “will this suite detect the regression you would block on.” To catch a 5-point drop from a 0.9 baseline at 80% power and a two-sided 5% significance level, the one-sample normal-approximation size is about (1.96 plus 0.84) squared times 0.9 times 0.1 over 0.05 squared, which is roughly 280 runs per version (computed from stated assumptions, not measured). Run fewer and power falls with it: below roughly 140 runs on the same assumptions, power drops under 50%, so a real 5-point regression clears your gate more often than it is caught, and the test that felt like a safeguard was theater. Comparing two versions head to head needs more, since each carries its own sampling error.
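The sizing arithmetic above, as a sketch. It assumes the one-sample normal approximation with the baseline variance p(1 - p) used throughout, and the power function ignores the negligible far tail of the two-sided test; both function names are illustrative. Exact z-values from the standard library give a count a few runs above the rounded hand calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def runs_for_power(baseline, drop, power=0.80, alpha=0.05):
    """Runs needed to detect `drop` from `baseline` (one-sample, two-sided)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    variance = baseline * (1.0 - baseline)  # assumption: baseline variance
    return ceil((z_alpha + z_power) ** 2 * variance / drop ** 2)


def approx_power(baseline, drop, n, alpha=0.05):
    """Approximate chance that n runs flag a true regression of size `drop`."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    sigma = sqrt(baseline * (1.0 - baseline))
    return _STD_NORMAL.cdf(drop * sqrt(n) / sigma - z_alpha)


needed = runs_for_power(baseline=0.9, drop=0.05)
print(f"runs for 80% power: {needed}")
print(f"power with 100 runs: {approx_power(0.9, 0.05, 100):.2f}")
```

On these assumptions a 100-run suite catches the 5-point drop well under half the time, which is the underpowered case the paragraph warns about.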

You can buy back much of that cost with paired testing. If you run both the old and the new version against the same frozen inputs and the same seeds, the run-to-run noise both versions share, item difficulty and decode noise, cancels in the difference, so a paired comparison detects a smaller true delta at the same N than two independent samples do. When the inputs cannot be held identical, stratify the suite so the hard and easy cases are represented in the same proportion every run, and report the pass rate per stratum rather than one blended number that a shift in the case mix can move on its own.
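A small sketch of why pairing helps, under the assumption that both versions are scored pass (1) or fail (0) on the same frozen items in the same order. The item outcomes are hypothetical, and the function names paired_delta and unpaired_standard_error are illustrative. The paired standard error is taken from the per-item differences; the unpaired one treats the two versions as independent samples.

```python
from math import sqrt
from statistics import mean, stdev


def paired_delta(old_passes, new_passes):
    """Mean per-item difference (new - old) and its standard error."""
    if len(old_passes) != len(new_passes):
        raise ValueError("paired testing needs the same items for both versions")
    diffs = [new - old for old, new in zip(old_passes, new_passes)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


def unpaired_standard_error(old_passes, new_passes):
    """Standard error of the difference if the samples were independent."""
    p_old, p_new = mean(old_passes), mean(new_passes)
    return sqrt(p_old * (1 - p_old) / len(old_passes)
                + p_new * (1 - p_new) / len(new_passes))


# Hypothetical outcomes on 20 frozen items: the new version fails the same
# four hard items as the old one, plus two more.
old_version = [1] * 16 + [0] * 4
new_version = [1] * 14 + [0] * 6

delta, se_paired = paired_delta(old_version, new_version)
se_unpaired = unpaired_standard_error(old_version, new_version)
print(f"delta {delta:+.2f}, paired SE {se_paired:.3f}, unpaired SE {se_unpaired:.3f}")
```

The shared hard items contribute nothing to the paired differences, so the paired standard error comes out smaller. The advantage shrinks as the two versions' outcomes become less correlated across items.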

The discipline that ties these together is simple and non-negotiable: no reliability number ships without its N and its interval. A pass rate without an interval cannot be compared to last week’s, cannot be ranked against a rival, and cannot certify anything. The normal approximation above overstates precision when p sits near 0 or 1, so report a Wilson interval for a proportion and a bootstrap interval for compound metrics; the arithmetic here is the floor to build on.
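Both intervals named above fit in a short sketch. The Wilson interval is the standard closed form for a proportion; the bootstrap shown is the plain percentile variant, which is one choice among several and is an assumption here. The per-task scores are hypothetical, and the function names are illustrative.

```python
import random
from math import sqrt
from statistics import mean


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a proportion, clamped to [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def bootstrap_interval(values, metric, resamples=2000, seed=0):
    """Percentile bootstrap 95% interval for metric(values)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    stats = sorted(metric(rng.choices(values, k=len(values)))
                   for _ in range(resamples))
    return stats[int(0.025 * resamples)], stats[int(0.975 * resamples) - 1]


lo, hi = wilson_interval(passes=86, n=100)
print(f"86/100 passes: Wilson 95% interval [{lo:.3f}, {hi:.3f}]")

# Hypothetical per-task scores for a compound metric (for example pass^k).
illustrative_scores = [1.0, 1.0, 0.9, 0.6, 0.0, 1.0, 0.8, 1.0]
b_lo, b_hi = bootstrap_interval(illustrative_scores, mean)
print(f"suite mean {mean(illustrative_scores):.3f}, "
      f"bootstrap 95% interval [{b_lo:.3f}, {b_hi:.3f}]")
```

Unlike the normal approximation, the Wilson interval does not collapse to zero width when every run passes: ten passes out of ten still leaves a lower bound well below 1.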

Test the topology as well as the task

Everything so far tests a single agent scored many times. A multi-agent system has a second failure surface that no amount of task-level repetition reveals: what happens when a fault starts inside the pipeline and moves. This is the fifth dimension, and it is a different kind of test.

The failure class is real and catalogued. MAST derived its 14 recurring failure modes in 3 categories from 150 expert-annotated traces (kappa 0.88), then applied them across a 1,600+ trace dataset spanning 7 frameworks; one category, inter-agent misalignment, is error propagation under another name (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657, preprint, as of 2026-07). OWASP files the same behavior as a distinct agentic risk class, ASI08, describing false signals that cascade through automated pipelines with escalating impact (OWASP Top 10 for Agentic Applications, as of 2025-12). Both name the class. Neither hands you a test procedure for it, which is the gap this dimension fills.

The procedure is fault injection, borrowed from resilience engineering and pointed at agents. Introduce a controlled fault at a known hop, a stale value, a confidently wrong field, a dropped message, then trace how far it travels before something stops it. The mechanics of why a well-formed wrong answer sails through structural validation and how far it then reaches are worked out in the companion pillar on how a single fault propagates across agent topologies; the point here is that those mechanics are testable. The failure-class view, mapping each observed mode to how it travels, lives in the analysis of why multi-agent systems fail.

Injection turns propagation into two numbers you can put an interval on: blast radius, meaning how deep and how wide the fault reached, and containment rate, meaning the fraction of injections held to the hop where they landed. Repeat the injection enough times to bound the containment rate the same way you bounded the pass rate above; a containment number without an interval cannot separate a well-wired topology from a lucky one. Where the same injection reaches far in one wiring and stops in another, you are measuring the topology’s cascade resistance.
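A sketch of how injection trials might be summarized. Everything about the record shape is an assumption for illustration: the InjectionTrial fields, the rule that a fault counts as contained when it reaches zero hops past the injection point, and the forty made-up trials. The interval is a Wilson interval on the containment proportion.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean


@dataclass(frozen=True)
class InjectionTrial:
    depth: int    # hops the fault travelled past the injection point
    breadth: int  # distinct downstream agents that consumed the faulty value


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion, clamped to [0, 1]."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def containment_report(trials):
    """Containment rate with its N and interval, plus blast-radius summaries."""
    n = len(trials)
    contained = sum(1 for t in trials if t.depth == 0)  # assumed definition
    escaped = [t for t in trials if t.depth > 0]
    return {
        "n": n,
        "containment_rate": contained / n,
        "interval_95": wilson_interval(contained, n),
        "mean_depth_when_escaped": mean(t.depth for t in escaped) if escaped else 0.0,
        "max_breadth": max(t.breadth for t in trials),
    }


# Hypothetical trials: 31 contained, 6 shallow escapes, 3 deep fan-outs.
illustrative_trials = ([InjectionTrial(0, 0)] * 31
                       + [InjectionTrial(1, 1)] * 6
                       + [InjectionTrial(3, 4)] * 3)
report = containment_report(illustrative_trials)
print(report)
```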

Two design notes keep an injection test rigorous. Inject the fault types your system actually produces, weighted the way error propagation shows up in your traces. A type check already catches the crashes; the faults worth injecting are the well-formed wrong values it waves through. And instrument the trace so you can attribute an escape to the hop that let it through, because a containment rate you cannot decompose tells you that you failed without telling you where. The resilience patterns that make containment achievable, and how far each analogy from distributed systems actually holds, are laid out in the note on porting service-reliability patterns to agents.

A containment rate is also a per-topology number, and blending it across wirings hides the thing you are testing for. The same injected fault that a sequential chain gates at one stage can fan out across every worker a hub dispatches or poison a shared store that all agents read, so a system that runs several topologies at once earns a containment measurement per topology, each with its own N and interval. Report them separately and the weak wiring shows up as a low containment rate on one row instead of averaging into a comfortable aggregate. The propagation dimension runs on the same statistics as the rest: the injection is the experiment, the containment rate is its estimate, and the estimate is worth nothing until it carries an interval built from enough repetitions to bound it.
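The per-topology breakdown is a grouping step, sketched below with hypothetical topology labels and made-up outcomes. It reports only the rate and N for each wiring; each row would still need its own interval before it is quoted.

```python
from collections import defaultdict


def containment_by_topology(trials):
    """Group (topology, contained) pairs into {topology: (rate, n)}."""
    counts = defaultdict(lambda: [0, 0])  # topology -> [contained, total]
    for topology, contained in trials:
        counts[topology][0] += int(contained)
        counts[topology][1] += 1
    return {name: (held / total, total) for name, (held, total) in counts.items()}


# Hypothetical injection outcomes across three wirings, 40 trials each.
illustrative_trials = (
    [("sequential_chain", True)] * 38 + [("sequential_chain", False)] * 2
    + [("hub_and_spoke", True)] * 36 + [("hub_and_spoke", False)] * 4
    + [("shared_store", True)] * 14 + [("shared_store", False)] * 26
)

per_topology = containment_by_topology(illustrative_trials)
blended = sum(held for _, held in illustrative_trials) / len(illustrative_trials)

print(f"blended containment: {blended:.2f}")
for name, (rate, n) in per_topology.items():
    print(f"  {name}: {rate:.2f} (N={n})")
```

In this made-up data the blended figure sits near 0.73 while the shared-store wiring holds only 0.35, the kind of row an aggregate averages away.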

Assembling the five into one protocol

The dimensions compose into a single test plan rather than a menu you pick from, and the order matters because each stage feeds the next. A reliability test worth the name runs roughly like this.

  1. Fix the suite and the k. Freeze the task set and decide how many consecutive successes your product needs before a human intervenes. That k sets which pass^k you are testing and stops the target from drifting between runs.
  2. Measure variance first. Run the frozen suite enough times to estimate the across-run spread. The spread you observe drives both the interval you can report and the sample size the next stage needs.
  3. Size the run count to the decision. Set N from the smallest regression you would block a release on and the power you want to catch it with. An undersized suite makes every later number decorative.
  4. Report intervals, then pass^k. Attach a 95% interval to the pass rate and compute pass^k at your chosen k. A ranking survives only where the intervals do not overlap.
  5. Inject faults and measure containment. For any multi-agent topology, add propagation testing on top of task testing, and report the containment rate with its own interval.

Each stage narrows what the next can legitimately claim. Skip stage two and you cannot size stage three. Skip stage three and stages four and five report precise-looking numbers with no power behind them. The protocol is cumulative, which is why bolting a confidence interval onto an underpowered ten-run suite produces a tidy interval around a number that still cannot decide anything.

Which stage is worth the most effort depends on where your system actually lives. A single agent behind a human reviewer is dominated by the statistical half: variance, power, intervals, pass^k at a small k. A deep autonomous pipeline is dominated by the fifth dimension, because its per-step numbers can all look fine while a mid-chain fault owns the output. Spend against your architecture.

The decision rule, and where this goes

Testing agent reliability comes down to refusing to let a single number stand in for a distribution. Report no reliability figure without its N and its interval. Track pass^k at the k your system runs before anyone checks it. Power the suite to the regression you would actually block a release on. And for any multi-agent topology, injection-test how far a fault propagates and what fraction you hold to a single hop, because task eval is structurally blind to the failure class that dominates multi-agent production. If your system is one agent with a human in the loop and no coordination, the propagation half of this does not pay off yet, and you can stop after the statistics.

The instrument this site is building toward is a pre-launch reliability profiler: its design intent is to run a suite many times, inject controlled faults, and report pass^k and a containment rate with confidence intervals, scored against field norms. Pre-launch means no such number is measured yet, so none is asserted here. Hold the profiler to that same standard when it publishes.

From here, the reliability glossary gives each metric a canonical definition and the measurement method behind it, and the research program turns those propagation and containment figures into measured runs, each reporting the pass^k/power protocol on this page with its interval.