Site map
Every page on the site, in one place: the research library, the glossary, the Everyday AI guides, the AI for Builders guides, and the pages behind them. Looking for the machine-readable version? Crawlers use the XML sitemap.
Main pages
- Home
- About
- Methodology
- Privacy
- Disclaimer
- Site map
Research
- AI agent evaluation that follows the whole trajectory
- AI agent reliability, from consistency to containment
- Bias-correct your LLM-as-a-judge eval before reporting it
- Do microservices resilience patterns port to AI agents?
- Error propagation and cascade containment in agent systems
- How many runs a reliable eval needs to catch a regression
- How to evaluate a RAG pipeline beyond a single score
- How to measure agent reliability past a single pass rate
- How to test agent reliability beyond a single eval run
- Is your eval difference statistically significant?
- Is your LLM-as-a-judge reliable? Test the evaluator
- LLM evals: which methods to trust and where they lie
- LLM-as-a-judge bias, and the tests that catch it
- Make AI agents reliable in production: a budget playbook
- Multi-agent orchestration patterns and the failures they amplify
- Multi-agent systems, defined by how they fail
- What LLM evals are, and what each type can certify
- When RAG retrieves wrong chunks: failure modes and containment
- Where RAGAS wins at RAG evaluation, and where it stops
- Why multi-agent LLM systems fail, and how to contain it
Glossary
- Blast radius (agent systems)
- Cascade resistance
- Chaos engineering for AI agents
- Consensus and voting reliability
- Construct validity (benchmarks)
- Containment rate
- Error propagation (multi-agent)
- Eval confidence interval
- Eval reproducibility
- Failure attribution (agents)
- Fault injection (agents)
- Multi-agent debate failure
- Orchestrator-worker reliability
- reliability@k and pass^k
Everyday AI
- The model you never picked is the one about to bill you.
- You can't hear the difference. Your AI sounds just as sure when it's wrong.
- You flipped the switch you could see. The other one is still on.
- You told your shopping agent yes. Amazon wants a court to say no.
- Your AI reads your calendar. A stranger's invite can give it orders.
- Your browser can now click Buy for you. It reads the page's instructions too.
- Your shopping agent has your card. It can be paid to steer you.
AI for Builders
- A refused Claude call returns HTTP 200, empty content. Dashboards log success.
- The MCP changelog told you what's deprecated. It stayed silent on what breaks.
- The prompt wording is a hyperparameter you never swept.
- You picked the model at the top. The harness picked it for you.
- You reach up because the task looks hard. Only the invoice changes.
- Your dashboard stays green while your users still get wrong answers.