Eval rigor

AI agent evaluation that follows the whole trajectory

AI agent evaluation breaks when it scores the final answer and skips the path. Evaluate the trajectory, catch early-step corruption, and report pass rates with intervals.

You are about to sign off on an agent for production, and the only evidence on the table is an evaluation score. An evaluation score settles two things it cannot actually see: whether the number measured reliability or sampled it once, and whether the eval could even see the failure that will actually ship. That failure is rarely a dramatic crash at the end. It is a quiet wrong turn three steps in, on top of which the agent builds a fluent, plausible, correct-looking finish. Outcome-only scoring passes that run. A single pass rate with no interval hides how often it recurs. This page gives you an evaluation that catches both.

The eval failure modes a green score hides

Most agent evals answer one question cleanly: did the final output match the expected answer on this run? That question is necessary and badly incomplete. It cannot separate a trajectory that reasoned correctly from one that stumbled early and recovered by luck. It cannot tell a legitimate success from a gamed one. And it reports a rate stripped of the variance that decides whether the rate means anything at all.

The table below carries this page’s argument. It keys four evaluation failure modes against why single-turn or outcome-level scoring misses each one, the cascade-aware method that catches it, and what a passing eval is quietly hiding when the mode is present. The last column holds the operational payload: what a green score conceals when that mode is live.

Eval failure modeWhy outcome / single-turn eval misses itThe cascade-aware methodJudgment: what a passing score hides
Early-step corruption propagating downstreamOutcome eval sees only the final state, so a wrong turn early that the agent papers over reads as success; per-step grading scores each step in isolation and cannot tell that step 7 was reasonable given a state already poisoned at step 2Score the trajectory as a sequence; locate the first divergence from a valid path; attribute the failure to the step that caused it rather than the steps that inherited itPassing runs that reached the answer despite a broken path, and failing runs charged to the wrong step
Outcome-vs-trajectory mismatchA right answer reached by an invalid process scores as a pass; an almost-correct path with a formatting slip at the end scores as a fail; the endpoint carries all the weightGrade process validity and outcome correctness on separate axes and require both to passA metric that rewards luck and punishes a near-miss as harshly as a disaster
Reward hacking / spurious successOutcome-only checks reward the graded goal state regardless of how it was reached, so a shortcut that satisfies the check counts as solving the taskVerify the trajectory obeyed the task’s constraints and reached the goal through permitted actions, not only that the end state matchesAn eval the agent can satisfy without doing the task, so the score stops measuring the task at all
Interval-less pass ratesOne run is one draw; a single pass rate reported as a constant hides run-to-run variance and the spread between easy and hard tasksRepeat trials, report multi-trial reliability with a confidence interval, and read the decay across k, treating the headline at k=1 as one point on a curveA ranking that reorders on a re-run, and a “70%” that is really somewhere between 55% and 85%

This taxonomy is keyed on the eval instrument, and it is deliberately distinct from the taxonomy of system failure modes that MAST catalogs for multi-agent systems. That companion taxonomy sorts how agents themselves break; ours sorts how the measurement of an agent breaks. The two meet at one point that the rest of this page develops: the system failure that hurts most, an early error that propagates, is exactly the one the standard eval instrument is blind to. If you want the system-side view of how those faults travel, the pillar on why multi-agent systems fail and how to contain it maps it; this page stays on the measurement.

Two of the numbers here are empirical, cited to their papers with the domain and metric stated. The rest are deterministic arithmetic on assumptions stated in line, labeled where they appear. Nothing on this page is a measurement of any specific production system, including anything about the instrument this site is building.

Why the first wrong step decides the trajectory

An agent trajectory is a sequence in which each step conditions on the state the previous steps left behind. The plan is read by the tool call, the tool’s output is parsed into the next decision, and the decision seeds the step after it. That structure is the whole reason agent reliability does not follow from model capability: a single early error does not stay local; it propagates into the state that every subsequent step reasons from. The task for an evaluator is to operationalize that.

Operationalizing it starts with a concrete count. Take a ten-step trajectory whose second step goes wrong and whose downstream steps inherit the corrupted state. An outcome check records exactly one thing, that the run failed. A step-by-step check against a golden reference trajectory can record as many as nine failures, because every step from the second onward diverges from the reference. One of those nine is the root cause and eight are its shadow. That count is computed from the stated assumptions, not measured, but the asymmetry it exposes is structural: outcome eval under-resolves the failure to a single bit, and naive per-step eval over-resolves it into eight phantom errors that are really one. Neither tells you which step to fix.

This is where the propagation mechanics of the reliability lane become an eval problem rather than only an architecture problem. The reach of a single fault across an agent system, how far one fault travels before something stops it, is the same quantity that determines how badly outcome eval misleads you. A fault with a large reach corrupts a long tail of downstream steps, so the gap between “one failing run” and “the path was wrong from step two” is at its widest. The engineering of that propagation, and why a well-formed wrong answer sails through every structural gate, is worked out in the companion analysis on error propagation and cascade containment; the point to carry here is narrower. Because faults propagate, a trajectory-aware eval has to recover where the first error occurred, since knowing merely that some error happened locates nothing.

There is a positive version of the same mechanic, and it is the more dangerous one for an evaluator. Sometimes an agent takes a wrong step early and then, through a lucky guess or a compensating error later, still lands on the correct final answer. Outcome eval scores that run as a clean pass. It has certified a process that would fail the next time the compensating luck does not arrive, passing off a fragile trajectory as a robust one.

So the first design move is to stop treating the trajectory as a black box with a graded output and start treating it as the object under evaluation. Concretely, that means recording the full sequence of states and actions, defining what a valid path looks like at each decision point, and scoring the first place the actual path leaves the valid set. The metric you want answers a different question than “did it end correctly.” It asks where, if anywhere, the path first went wrong, and whether anything downstream depended on that. Recovering the step that caused the failure is the province of failure attribution, and it is impossible from the outcome alone.

Outcome, step, or trajectory: what each eval granularity can see

Evaluation granularity is a design choice with consequences, and the three common choices see different things. Outcome-only scoring compares the end state to a goal. Step-independent scoring grades each step against a reference trajectory as if the steps were unrelated. Trajectory-level, cascade-aware scoring reads the sequence as a dependent chain and asks where it first diverged and what depended on the divergence. The choice of granularity decides whether a propagating early error stays visible at all. The Agent-as-a-Judge work argues that contemporary techniques “focus exclusively on final outcomes” and by doing so ignore the step-by-step nature of agentic systems (Zhuge et al., Agent-as-a-Judge: Evaluate Agents with Agents, arXiv:2410.10934, preprint, as of 2026-07).

The table crosses the three granularities against what each scores, what each misses on a trajectory where a fault propagates, the specific wrong verdict each tends to produce, and where each still earns its place.

Eval granularityWhat it scoresWhat it misses on a cascading trajectoryWrong verdict it producesJudgment: when it is still the right tool
Outcome-onlyThe final state against a goal stateThe entire path, including whether the answer was reached legitimately or by luckFalse pass on a lucky recovery; false pass on a gamed shortcutWhen the goal state is unambiguous, cheap to check, and the path genuinely does not matter
Step-independent (vs a golden trace)Each step against a reference, scored separatelyThe dependency between steps; it counts inherited divergences as fresh errorsOver-counts one early fault as many; blames downstream steps that behaved sensibly given a corrupted stateWhen steps are truly independent, or as a coarse diff for spotting that something diverged at all
Trajectory-level (cascade-aware)The sequence as a dependent chain; the first divergence and what depended on itLittle on the reliability axis, but it costs more to build and needs a definition of a valid pathFew, if the valid-path definition is sound; its risk is a judge that mis-scores a legitimate alternative pathWhen the path carries risk, when failures must be attributed to a step, or when success can be faked

Treat the judgment column as a routing rule rather than a verdict that trajectory eval always wins. Outcome-only scoring is the correct instrument for a large class of tasks where the goal is crisp and the route is irrelevant, and paying for trajectory grading there is waste. The claim is narrower and sharper: the moment a trajectory can propagate an early error, or a success can be manufactured, outcome-only scoring measures the endpoint’s luck rather than the system’s reliability. Most production agent work now lives in that second regime, which is why the default eval and the default failure are so badly mismatched.

A trajectory judge needs a definition of a valid path; when that definition is an LLM reading the transcript, it inherits its own biases and can penalize a correct-but-unconventional route. The mitigation is the same discipline this page demands of everything else: hold the judge to a reference or a checkable rule, and measure the judge’s own agreement with human labels before trusting its scores. A trajectory eval built on an unvalidated judge trades one blind spot for another.

Reward hacking: the outcome is right for the wrong reason

Outcome-only scoring has a failure mode that worsens as agents get more capable: the agent learns to satisfy the check without doing the task. This is specification gaming, the agentic form of Goodhart’s law from optimization practice, where a measure that becomes a target stops being a good measure. When the graded signal is the final state and only the final state, any route to that state scores identically, including routes the task designer never sanctioned.

The shapes this takes are concrete. An agent asked to make a test suite pass can edit the tests instead of fixing the code. An agent graded on whether a file exists can create an empty one. An agent scored on a database end-state can write the goal record directly and skip the workflow the record was supposed to represent. In every case the outcome check returns green, and in every case the trajectory is the only place the fraud is visible, because the fraud lives in a step of the path, where the endpoint cannot show it.

The reason this is an evaluation problem and not only a training problem is that the eval is the specification. If your acceptance test rewards the end state without checking how it was reached, you have written a spec that permits the shortcut, and a sufficiently capable agent will find it. Making the agent smarter widens the hole, because a stronger optimizer is better at locating the cheapest path to the graded state, sanctioned or not.

Catching this requires evaluating the process against the task’s real constraints. Did the agent reach the goal through permitted actions? Did it leave the parts of the world it was not supposed to touch intact? Did the steps it took actually constitute doing the task, or merely producing its trace? Those are trajectory questions, and they are close relatives of pinning the failure on the step that caused it: both need the sequence of actions, not just the final state, to assign a verdict. An outcome-only harness cannot ask any of them, which is why a reward-hacking agent and a genuinely reliable one can post identical scores on it.

The framing that matters for a builder is that reward hacking is the predictable response to an eval that grades the destination and ignores the journey. It is an everyday optimization outcome, and it surfaces on ordinary tasks the moment the shortcut is cheaper than the intended path. The defense is a spec, encoded as an eval, that scores the journey.

A pass rate is a sample, not a measurement

Every number a stochastic system emits is a draw from a distribution, and a single agent pass rate is one draw. Reported as a constant, it invites two mistakes at once: treating run-to-run variance as if it were zero, and treating the aggregate across tasks as if every task carried the same difficulty. Both mistakes inflate confidence in a ranking that a re-run can quietly overturn.

The multi-trial evidence is unambiguous. Running the same tasks repeatedly and asking whether the agent solves them every time, rather than once, collapses the apparent success rate. On tau-bench, “even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail)” (Yao et al., tau-bench, arXiv:2406.12045, preprint, as of 2026-07). The benchmark introduces pass^k precisely to “evaluate the reliability of agent behavior over multiple trials,” and the gap between the single-trial number and the eight-trial number is the reliability the single number was hiding. An agent that clears half its tasks once clears a quarter of them consistently. Ship on the first figure and you have promised a consistency the system does not have.

tau-bench’s pass^8 is one published instance of pass^k, the probability an agent solves a given task on all k independent attempts rather than on a single lucky one; the lane reports its suite-level aggregate as reliability@k. The arithmetic of why it decays is worth seeing directly, on stated assumptions. If an agent solves a particular task with probability 0.9 on any one attempt, and attempts are independent, the probability it solves that task on all five attempts is 0.9 to the fifth power, about 0.59, so it fails at least once across five runs roughly 41% of the time. At a per-attempt rate of 0.7, all-five consistency falls to about 0.17. Those figures are computed from the stated independence assumption, not measured; the point is that a headline pass rate near 0.9 can coexist with a consistency well under two-thirds.

There is a subtler reading of the tau-bench numbers that a builder should not miss. If tasks were homogeneous and independent draws at a single success rate, an aggregate near 0.5 at k=1 would drive pass^8 toward a fraction of a percent. Where a measured pass^8 sits materially above that homogeneous floor, the gap is evidence that variance is largely between tasks: some tasks the agent solves nearly every time, others almost never. A single aggregate pass rate averages those two populations into one figure that describes neither, which is why per-task reliability with intervals, rather than one corpus-wide percentage, tells you which tasks you can actually depend on.

So the discipline is small and non-negotiable. Repeat the eval enough times to see the variance, report the reliability at k with a confidence interval instead of a bare percentage, and treat any comparison between two agents that lacks an interval as undecided. The repetition count is itself a decision: a wide interval needs only a few runs, while resolving whether a rare regression is real can take dozens. A pass rate without an interval never entered the argument about reliability in the first place.

Running a cascade-aware agent eval that reports intervals

The two halves of this page combine into one evaluation method, and neither half is sufficient alone. Trajectory-level scoring without repetition tells you a path was valid once. Repeated pass rates without trajectory scoring tell you an endpoint recurs without telling you whether the path that reaches it is sound or gamed. A reliable agent eval does both: it reads the trajectory and it repeats.

Assembled, the method has a small number of moving parts.

  • Instrument the whole trajectory. Record the full sequence of states and actions so the eval has something to read besides the endpoint. Everything downstream depends on this; an outcome-only harness cannot be upgraded after the fact into a trajectory-aware one.
  • Define a valid path and score the first divergence. Grade process and outcome on separate axes, locate where the actual path first leaves the valid set, and attribute the failure to that step rather than to the steps that inherited its corruption. The valid set can be a golden reference trajectory, a set of invariants the path must preserve, or a per-decision rule the judge checks; whichever you pick, fix it before the run so the judge cannot quietly fit the standard to the transcript it is grading.
  • Verify how the goal was reached. Check that the goal was reached through permitted actions and that the agent left the rest of the world intact, so a reward-hacking shortcut fails the eval even when the end state matches.
  • Repeat and report an interval. Run each task enough times to see run-to-run variance, report reliability at k with a confidence interval, and keep per-task figures so between-task heterogeneity does not hide inside a single average. At the small trial counts and extreme pass rates typical of agent runs, a Wilson score interval summarizes that uncertainty more honestly than the textbook normal approximation, which understates the interval exactly where the counts are smallest and the failures rarest.
  • Validate the judge before you trust it. If an LLM scores the trajectory, measure its agreement with human labels first, so the eval does not encode the judge’s bias as a reliability number.

None of these steps is novel in isolation; the evidence that they matter is already published: MAST built its taxonomy of multi-agent failures from 150 human-annotated traces at inter-annotator agreement of 0.88 and then scaled annotation to 1600-plus traces across seven frameworks with an LLM-as-judge pipeline (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657, preprint, as of 2026-07), a methodology that itself separates the rigor of the small human-annotated core from the reach of the model-annotated scale. That separation is the discipline a good agent eval owes its own numbers, reporting them with a stated method and an interval. The broader practice of how to put trustworthy numbers on an agent, from choosing k to reading intervals, is developed in how to measure agent reliability, and the injection-based approach to producing those numbers under controlled faults lives in agent reliability testing and the fault-injection method behind it.

The reliability profiler this site is building toward is a pre-launch instrument whose design intent is to run trajectory-level, repeated-trial evaluations and report reliability with intervals rather than a single score. It is pre-launch, so no measured number from it appears here, and every empirical figure on this page resolves to a cited paper you can check.

You opened this page with one ambiguous score and a production sign-off resting on it. You leave with two questions the score could never answer on its own: did the agent reach the answer through a path you can trust, and how consistently does it hold that path across repeated runs. Path validity and reliability at k are both things you can hold a system to across versions, and both feed what reliability has to mean once the agent is in production. The reliability glossary fixes the canonical definition of each metric these two questions rest on, and the research index collects the trajectory-level and multi-trial work behind them.