Your dashboard stays green while your users still get wrong answers.
LangChain's State of Agent Engineering survey of 1,340 practitioners: 89% run observability, only 52.4% run offline evals. An offline eval is what scores whether the answer was right.
Every trace in your dashboard is green. Latency sits inside budget, no span threw, every tool call came back 200, token spend is flat against last week. And you still cannot answer the one question a user actually cares about: were the answers right?
You have perfect visibility into a process and no measurement of its output.
This is the industry’s default posture, shared by the teams doing everything else right. In LangChain’s State of Agent Engineering survey of 1,340 practitioners, 89% had wired up observability (±1.7 pts) while only 52.4% ran offline evals (±2.7 pts), and 32% named quality, not latency and not cost, as the single biggest barrier to shipping an agent (±2.5 pts) [1]. The teams furthest along have the most traces and often no measure of whether the answers behind those traces were correct.
Observability inspects the span your agent emitted. An eval scores the answer it returned. They are two different measurements, and a dashboard of green traces is not evidence of correctness. Build a graded eval set that scores outputs against known-good answers and run it as a gate on every prompt and model change. Skip it and your first signal that quality regressed will be a user, arriving days after the trace that caused it scrolled off the screen.
The refund answer was wrong. The trace was green anyway.
Two words are doing the work in that survey, so pin them before going further.
Observability is the tracing and metrics layer: spans, latency, tool-call status, retries, exceptions, cost. It answers one question well, “what happened during this run,” and it is genuinely necessary. You cannot debug a ten-step agent you cannot see.
Evals are the other measurement: you take an output, compare it against a reference or a rubric, and produce a score. That answers a different question, “was the result correct,” and it is the one your trace viewer cannot touch. (What an eval actually is, in one page.)
A support agent reads a policy doc and answers a refund question. Every span is ok, the retrieval tool returned in 240ms, the model replied in one pass, the run cost a tenth of a cent. The answer it returned was “the refund window is 90 days.” The policy says 30. Nothing in your observability stack has an opinion about that, because nothing errored.
Your observability stack has a field for latency and a field for errors. It has no field for wrong.
The whole industry watches more than it measures.
The survey puts numbers on the habit. All three come from the same 1,340 responses, and each margin is the derived 95% sampling error for that proportion [1]:
- 89% (±1.7 pts) have implemented observability for their agents.
- 52.4% (±2.7 pts) run offline evals on a test set.
- 32% (±2.5 pts) name quality as the top barrier to deployment, ahead of latency and cost.
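The margins in that list can be reproduced in a few lines. This is a minimal sketch, assuming the simple-proportion formula given in the footnote (1.96 × √(p(1-p)/n)) and treating the 1,340 responses as a simple random sample, which a self-selected survey is not. The function and variable names are illustrative, not taken from the report.

```python
import math

N = 1340     # survey responses, per the footnote
Z_95 = 1.96  # normal critical value for a 95% interval

def margin_pts(p: float, n: int = N, z: float = Z_95) -> float:
    """Half-width of a 95% interval for a simple proportion, in percentage points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

# Point estimates as reported in the survey.
reported = {
    "observability": 0.89,
    "offline evals": 0.524,
    "quality is top barrier": 0.32,
}

margins = {name: round(margin_pts(p), 1) for name, p in reported.items()}
for name, p in reported.items():
    print(f"{name}: {100 * p:.1f}% ± {margins[name]} pts")
```

The rounded results match the ±1.7, ±2.7, and ±2.5 point figures quoted above. As the footnote says, they are a lower bound on the true uncertainty.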
Observability is adopted far more often than offline evals: a gap of about 37 points between watching the run and scoring the result. There is a plain reason. Tracing comes close to free: import the SDK, and spans, latency, and token counts arrive on every call whether you asked for them or not. A correctness score does not arrive free. Someone has to decide what “right” means for this task, assemble reference answers, and pick a scorer. You instrument what is easy to instrument and defer what is hard to define.
This is not a vendor failure: the same platforms that sell the tracing also sell the eval harness. LangChain’s own product does both halves. Teams adopted the half that installs itself.
Barely half of teams score their agents’ outputs, and quality is the number-one thing keeping agents out of production. Those two facts are the same fact.
Observability says yes to every liveness question and no to every correctness one.
Line up the failures each measurement catches, question by question. Neither substitutes for the other, and the blind spot runs to half the table.
| Question | Observability | Evals |
|---|---|---|
| Did the run finish? | yes | no |
| Did a tool call fail or retry? | yes | no |
| Were latency and cost in budget? | yes | no |
| Did the chain error, stall, or loop? | yes | no |
| Was the answer correct? | no | yes |
| Did quality regress after a prompt or model change? | no | yes |
| Is a fluent, well-formed answer actually right? | no | yes |
The top rows are liveness: is the machine running. The bottom rows are correctness: is the machine right. Observability owns the top and is blind to the bottom by construction, because a correct answer and a confidently wrong one produce the identical trace. Same spans, same status, same latency band. The same shape shows up one layer down: a refused Claude call returns HTTP 200, so it reads as a success on every dashboard you own. To your dashboard, the wrong “90 days” and the right “30 days” are the same run.
Observability is a liveness measurement wearing a quality dashboard’s clothes. It tells you the agent worked, never that the work was good.
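The identical-trace claim is easy to make concrete. Below is a minimal sketch, assuming a hypothetical `Trace` record that holds only the kind of fields a tracing layer records, and a deliberately crude `is_correct` check standing in for a real scorer. None of these names come from a real tracing SDK, and for simplicity the two runs share exact values rather than just a latency band.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trace:
    """Hypothetical trace summary: status, tool result, timing, retries."""
    status: str
    tool_http_status: int
    latency_ms: int
    retries: int

@dataclass(frozen=True)
class Run:
    trace: Trace
    output: str

POLICY_REFERENCE = "30 days"  # known-good answer for the refund question

def is_correct(output: str, reference: str = POLICY_REFERENCE) -> bool:
    """Crude stand-in for an eval: does the output contain the reference figure?"""
    return reference in output

green = Trace(status="ok", tool_http_status=200, latency_ms=240, retries=0)
right = Run(trace=green, output="The refund window is 30 days.")
wrong = Run(trace=green, output="The refund window is 90 days.")

# Observability view: the two runs are indistinguishable.
same_trace = right.trace == wrong.trace

# Eval view: they are not.
verdicts = (is_correct(right.output), is_correct(wrong.output))
print(same_trace, verdicts)
```

Nothing in `Trace` can separate the two runs. The only field that differs is the output, and that is the one field the tracing layer never inspects.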
The costliest failure is a confident, fluent answer that is completely wrong.
Rank agent failures by what they cost you. The cheap ones are loud: a timeout, an unhandled exception, a tool that 500s. Observability catches every one, because each throws something your alerts already watch.
The expensive one is silent. The model returns a fluent, well-structured, completely wrong answer, and it costs the most precisely because nothing fires. No span errors. No retry. No latency spike. It sails through every check in the top half of that table and lands in front of a user as settled fact. Confident phrasing is the exact signal humans are wired to trust, which is why a wrong answer that sounds sure does the most damage.
This is the same trap that lets a benchmark score look healthy while measuring the wrong thing: a green number, like a green trace, is only as honest as whether it is checking correctness at all. Catching the silent failure is a scoring problem, and scoring is what an eval does. An eval’s verdict is only worth what its score measures. That question has a name, construct validity, and it is the reason LLM-as-a-judge scoring needs its own rigor.
The failure that survives every observability check is the one that looks exactly like success. Only a correctness score sees it.
You already capture the run. Now score the output.
The good news in that 89% number: you already have the raw material. Your traces are a labeled dataset waiting to happen. Turning watching into measuring is a smaller lift than starting cold.
The distinction is what you assert on. Observability checks the span; an eval checks the output.
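```python
from types import SimpleNamespace

# Hypothetical stand-ins so the snippet runs; swap in your tracer's span and a real scorer.
span = SimpleNamespace(status="ok", latency_ms=240)
def judge(output: str, reference: str) -> float:
    return 1.0 if reference.split("-")[0] in output else 0.0  # toy check: is "30" in the answer?

# Observability: did the run complete? All green.
assert span.status == "ok"      # no exception, tool call returned
assert span.latency_ms < 2000   # within budget

# The output below passes every check above:
output = "The refund window is 90 days."  # the policy says 30

# Eval: does the output match ground truth? This is the assertion
# observability structurally cannot make.
score = judge(output, reference="30-day refund window")
assert score >= 0.9, "eval gate failed"  # raises here: the answer says 90, not 30
```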
A workable path from traces to a real gate:
- Pick one high-traffic agent path. Pull 30 to 50 recent traces and label each with a known-good answer. That labeled set is your eval, and the size you need is a measurable question, not a guess.
- Score outputs, not spans. Use exact-match or a unit test where the answer is checkable; use an LLM judge where it is not, and report the judge’s numbers with the same rigor you would demand of any measurement.
- Run the set as a gate in CI on every prompt and model change. A green trace is not a passing grade, and a model swap you never scored is a quality change you never measured. That is the eval that tells you a cheaper model still clears the bar.
- Then add online evals on a sample of live traffic, and alert on eval-score regression the way you already alert on error rate. Right is a metric too.
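The first three steps above fit in one small script. This is a sketch under stated assumptions: the two-row `EVAL_SET`, the `candidate_agent` stub, and the containment scorer are all hypothetical placeholders, and the 0.9 threshold simply mirrors the assertion in the earlier snippet. A real set would hold the 30 to 50 labeled traces, and an uncheckable answer would need a judge rather than a string match.

```python
import re
from typing import Callable

# Hypothetical labeled set; in practice, rows pulled from real traces and labeled by hand.
EVAL_SET = [
    {"question": "How long is the refund window?", "reference": "30 days"},
    {"question": "How long is a return label valid?", "reference": "14 days"},
]

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not fail a row."""
    return re.sub(r"\s+", " ", text.strip().lower())

def contains_reference(output: str, reference: str) -> float:
    """Checkable-answer scorer: 1.0 if the normalized reference appears in the output."""
    return 1.0 if normalize(reference) in normalize(output) else 0.0

def run_gate(agent: Callable[[str], str], threshold: float = 0.9):
    """Score every row and return (mean score, passed)."""
    scores = [contains_reference(agent(row["question"]), row["reference"])
              for row in EVAL_SET]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= threshold

def candidate_agent(question: str) -> str:
    """Stand-in for the agent under test; this one regressed on the refund question."""
    answers = {
        "How long is the refund window?": "The refund window is 90 days.",
        "How long is a return label valid?": "Labels are valid for 14 days.",
    }
    return answers[question]

mean, passed = run_gate(candidate_agent)
exit_code = 0 if passed else 1  # in CI, finish with: raise SystemExit(exit_code)
print(f"eval score {mean:.2f}, gate {'passed' if passed else 'failed'}")
```

Wired into CI, the non-zero exit code blocks the prompt or model change, the same way a failing unit test would.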
The instrumentation to measure correctness is already in place. What is left is to assert on the output your traces already captured.
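The last step in the list, alerting on eval-score regression, can be sketched as a rolling window over scores from sampled live traffic. Everything here is an assumption made for illustration: the `ScoreMonitor` class, the 10% sample rate, the window size, the 0.92 baseline, and the 0.05 allowed drop. A production setup would tune these and send the alert through whatever already pages you on error rate.

```python
import random
from collections import deque
from statistics import mean
from typing import Optional

class ScoreMonitor:
    """Rolling mean of eval scores on sampled traffic, with a regression check."""

    def __init__(self, baseline: float, max_drop: float = 0.05, window: int = 200,
                 sample_rate: float = 0.1, seed: Optional[int] = None) -> None:
        self.baseline = baseline        # score the offline eval set achieved
        self.max_drop = max_drop        # tolerated drop before alerting
        self.sample_rate = sample_rate  # fraction of live runs to score
        self.scores = deque(maxlen=window)
        self._rng = random.Random(seed)

    def should_sample(self) -> bool:
        """Decide whether to score this live run."""
        return self._rng.random() < self.sample_rate

    def record(self, score: float) -> None:
        self.scores.append(score)

    def regressed(self) -> bool:
        """True once the window is full and its mean sits more than max_drop below baseline."""
        if len(self.scores) < self.scores.maxlen:
            return False
        return mean(self.scores) < self.baseline - self.max_drop

# Usage with a tiny window so the behavior is visible.
monitor = ScoreMonitor(baseline=0.92, window=5)
for score in [1.0, 1.0, 0.0, 1.0, 0.0]:
    monitor.record(score)
alert = monitor.regressed()
print("alert" if alert else "ok")
```

Waiting for a full window before alerting trades detection speed for fewer false alarms. A smaller window fires sooner but is noisier.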
Instrument the run, then grade it.
A dashboard full of green is the most comfortable way to ship an agent that quietly got worse. Observability tells you the run happened; only a score tells you the run was right, and a swap or a regression you never scored is a reliability change you cannot see. If you want the measurement side of this done properly, that is the entire subject of the research desk: start with what LLM evals are and are not and how to measure agent reliability past a single pass rate.
Footnotes
1. LangChain, “State of Agent Engineering” (1,340 responses, collected November 2025): 89% implemented observability, 52.4% run offline evals on test sets, and roughly 32% (about one third) cite quality as the top barrier to deployment and adoption. The report publishes point estimates without confidence intervals; the ±1.7, ±2.7, and ±2.5 point margins quoted inline are the derived 95% sampling error for a simple proportion at n=1,340 (1.96 × √(p(1-p)/n)), a lower bound on true uncertainty for a self-selected practitioner sample. https://www.langchain.com/state-of-agent-engineering