How we measure
Every number we publish about AI-agent reliability is a choice: which method, which assumptions, which source. This page explains those choices in plain language, so you can decide whether to trust a result and check our work if you want to.
We publish meta-analyses, original benchmarks, and experiments on AI-agent reliability. The goal is the same every time: a clear, defensible number, produced by a method you can check and reported with an honest sense of how much to trust it. To earn that trust we hold ourselves to a few fixed rules about how results are produced and described.
How we pick a method
For most measurements there is more than one accepted method. When we publish an analysis, a benchmark, or an experiment, we choose by the same order of preference every time:
- A documented statistical procedure, where one exists: for example, a bias-corrected bootstrap confidence interval, or a paired significance test for comparing two systems on the same runs.
- A widely cited, peer-reviewed method, when there is no single standard: for example, a published estimator for eval variance or for inter-judge agreement, cited to the paper it comes from.
- Plain arithmetic, for quantities with a single agreed definition, such as a containment rate (faults contained over faults injected), where the result is not a matter of opinion once the runs are fixed.
Where a respected alternative exists, we say so in the write-up and explain why we chose the default. We never silently average competing methods to manufacture a single tidy number.
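As a rough illustration of the first and third items, here is a minimal Python sketch that computes a containment rate and a bootstrap confidence interval around it. It is a sketch under stated assumptions, not our production pipeline: it uses the plain percentile bootstrap rather than the bias-corrected variant named above, and the function names, the seed, and the 41-of-50 outcomes are hypothetical.

```python
import random


def containment_rate(contained):
    """Faults contained over faults injected, for a list of 0/1 outcomes."""
    if not contained:
        raise ValueError("need at least one injected fault")
    return sum(contained) / len(contained)


def percentile_bootstrap_ci(contained, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval (simpler than the bias-corrected variant)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(contained)
    # Resample the runs with replacement and recompute the rate each time.
    rates = sorted(
        containment_rate([contained[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = rates[int(alpha * n_resamples)]
    hi = rates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, hi


# Hypothetical outcomes: 41 faults contained out of 50 injected.
outcomes = [1] * 41 + [0] * 9
rate = containment_rate(outcomes)
low, high = percentile_bootstrap_ci(outcomes)
print(f"containment {rate:.0%} (95% CI {low:.0%}-{high:.0%})")
```

The rate itself is fixed once the runs are fixed. The interval depends on the resampling method and seed, which is why a write-up names both.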
What “estimate” actually means
Most of our results are estimates, and we use that word on purpose. An estimate is a careful, defensible projection from measured runs, reported with an honest account of its uncertainty. It describes the runs and conditions it was measured under, and stops there. Two things make a result an estimate:
- The metric models behavior rather than guaranteeing it. A containment or blast-radius number describes the runs we sampled, under the topology and fault mix we chose. A different workload or fault set can move it.
- The conditions are held steady by assumption. A reliability projection holds the model, harness, prompts, and fault distribution steady. A real deployment rarely stays that steady.
So when a benchmark reports “containment 82% (95% CI 76–88%),” the 82% is exact arithmetic on the runs sampled, yet it remains an estimate, because a different fault mix or a rerun could move it. Where we can, we show that margin of error so that a single number isn't mistaken for a guarantee.
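To make the dependence on the fault mix concrete, the following Python sketch weights per-fault containment rates by two different fault mixes. Everything in it is hypothetical and chosen only for illustration: the fault-type names, the per-fault rates, and both mixes.

```python
def overall_containment(per_fault_rate, fault_mix):
    """Mix-weighted containment: sum over fault types of share * rate."""
    if set(per_fault_rate) != set(fault_mix):
        raise ValueError("rates and mix must cover the same fault types")
    if abs(sum(fault_mix.values()) - 1.0) > 1e-9:
        raise ValueError("fault mix shares must sum to 1")
    return sum(fault_mix[kind] * per_fault_rate[kind] for kind in fault_mix)


# Hypothetical per-fault containment rates (not measured results).
rates = {"timeout": 0.90, "bad_tool_output": 0.80, "prompt_injection": 0.60}

# Two hypothetical fault mixes over the same fault types.
mix_a = {"timeout": 0.5, "bad_tool_output": 0.4, "prompt_injection": 0.1}
mix_b = {"timeout": 0.2, "bad_tool_output": 0.3, "prompt_injection": 0.5}

result_a = overall_containment(rates, mix_a)
result_b = overall_containment(rates, mix_b)
print(f"mix A: {result_a:.0%}, mix B: {result_b:.0%}")
```

The same system, with identical per-fault behavior, reports a different headline rate under each mix, so a containment figure only carries meaning alongside the fault mix it was measured under.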
How we cite our sources
When an analysis or benchmark relies on an external figure (a metric definition, a statistical procedure, a baseline number), we cite it with three fixed parts: the publisher who actually issued it (the standards body, the journal, the paper's authors; never a blog that re-posted it), a direct link so you can confirm it in one click, and the date we retrieved and recorded it. That is our standard citation format across research write-ups: a benchmark pins the paper a method came from the same way a meta-analysis pins each study it pools. Here is exactly how a pinned source reads:
Example: how we pin a source
- e-Handbook of Statistical Methods: confidence intervals and hypothesis tests. U.S. National Institute of Standards and Technology (NIST/SEMATECH). Retrieved .
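One way to keep the three fixed parts from going missing is to store each citation as a structured record and check it before publishing. The Python sketch below is an assumption about how that could look, not a description of our tooling: the `PinnedSource` class, its field names, and the example record (including its URL and date) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass(frozen=True)
class PinnedSource:
    """Hypothetical record for a citation with its three fixed parts."""
    title: str
    publisher: str   # who actually issued the figure
    url: str         # direct link to the source
    retrieved: date  # date the source was read and recorded

    def problems(self):
        """Return a list of reasons this record is not fully pinned."""
        issues = []
        if not self.publisher.strip():
            issues.append("missing publisher")
        parsed = urlparse(self.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            issues.append("link is not a direct http(s) URL")
        if self.retrieved > date.today():
            issues.append("retrieved date is in the future")
        return issues

    def render(self):
        """Format the record as a one-line citation."""
        return (f"{self.title}. {self.publisher}. {self.url}. "
                f"Retrieved {self.retrieved.isoformat()}.")


# Placeholder values for illustration only; not a real source.
example = PinnedSource(
    title="Example Handbook of Methods",
    publisher="Example Standards Body",
    url="https://example.org/handbook",
    retrieved=date(2024, 1, 15),
)
print(example.problems())
print(example.render())
```

A record with a non-empty `problems()` list would be either fixed or labeled as an assumption before the figure it supports is stated as fact.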
How often we review
Publishing a result is the start of maintaining it. We sort published work into review bands by how fast it goes stale:
- Benchmarks & datasets: Re-run and re-reported when the underlying models, harness, or baselines change (the events that move a measured result), and we note when each result was last reviewed.
- Published analyses: Reviewed when the statistical methods they rest on are revised, or when new studies materially change a pooled conclusion.
- Owner: The LatentEval editorial process owns these reviews. The maintainers stand behind the math; we do not attach a named individual “reviewed by” byline (see below).
Our no-fabrication rule
This is the rule that overrides the others. We do not invent numbers, sources, reviewers, accuracy claims, or outcomes to make a result look more authoritative than it is. Concretely:
- No invented citations. If we can't pin a figure to a real, linkable publisher, we don't state it as fact. We either label it plainly as an assumption or leave it out.
- No fake precision. We don't report a containment rate to two decimals when the method is only good to a few points. Where a result has a meaningful margin of error, we show it, so a rough estimate isn't read as a precise one.
- No fabricated authority. No invented benchmark results, no made-up citations, no cherry-picked baseline, no “validated by” badge we can't back up. We compare against the strongest existing defense we can find. If we ever add an external reviewer, their real name and credentials will appear here.
- No overclaiming. Our numbers describe the systems and conditions we measured, and no more. Applying them to your specific deployment calls for your own testing, and the suite-wide disclaimer says so. Individual write-ups add precise caveats where the stakes are higher.
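The "no fake precision" rule above can be sketched in Python as a formatting step that ties the number of reported decimals to the width of the interval. The specific convention used here (report to the decimal place of the half-width's leading digit) is an assumption made for illustration, not a rule stated on this page, and the function names and inputs are hypothetical.

```python
import math


def decimals_supported(half_width_pct):
    """Decimal places to report, given the CI half-width in percentage points."""
    if half_width_pct <= 0:
        raise ValueError("half-width must be positive")
    # Assumed convention: match the leading digit of the half-width,
    # and never go coarser than whole percentage points.
    return max(0, -math.floor(math.log10(half_width_pct)))


def report(rate, low, high):
    """Format a rate and its interval with only the precision the interval supports."""
    half_width_pct = 100 * (high - low) / 2
    d = decimals_supported(half_width_pct)
    return f"{100 * rate:.{d}f}% (95% CI {100 * low:.{d}f}-{100 * high:.{d}f}%)"


# Hypothetical inputs: the same point estimate with a wide and a narrow interval.
wide = report(0.8237, 0.76, 0.88)
narrow = report(0.8237, 0.8197, 0.8277)
print(wide)
print(narrow)
```

With a half-width of about six points the extra digits in 0.8237 are dropped, while the much narrower interval keeps one decimal place.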
More on how we work
Everything here is meant to be checkable. If a figure looks wrong or a source has moved, that's a bug we want to fix.
- About: who publishes and stands behind this work.
- Disclaimer: the limits of an estimate.
- Privacy: how we handle data and what we measure, described plainly.