A feature scores 95% accuracy in evaluation. In production, users report it "doesn't understand" their requests. The evaluation suite is not wrong. It is measuring something other than production behavior.
This is the central problem of AI evaluation infrastructure: the tools were built for deterministic software. AI systems are probabilistic, context-dependent, and non-reproducible. The tooling gap is not incremental. It is categorical.
Where Current Evaluation Breaks
Static benchmarks against dynamic reality. Benchmarks are fixed snapshots of yesterday's problem distribution. User behavior drifts continuously. A benchmark that represented production traffic six months ago may overlap with current input patterns by less than 60%. The score stays green. The user experience degrades.
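The drift described above can be made measurable. A minimal sketch, assuming incoming requests are already tagged with category labels by some upstream classifier (the labels here are hypothetical): histogram intersection between the benchmark's category distribution and current traffic's, where 1.0 means the benchmark still mirrors production and a score below 0.6 corresponds to the sub-60% overlap scenario.

```python
from collections import Counter

def pattern_overlap(benchmark: list[str], traffic: list[str]) -> float:
    """Histogram intersection over input categories.
    1.0 = benchmark mirrors production traffic; lower = drift."""
    dist_b = {k: v / len(benchmark) for k, v in Counter(benchmark).items()}
    dist_t = {k: v / len(traffic) for k, v in Counter(traffic).items()}
    # Sum the shared probability mass across all categories seen in either set.
    return sum(min(dist_b.get(k, 0.0), dist_t.get(k, 0.0))
               for k in set(dist_b) | set(dist_t))

# Hypothetical labels: six months ago vs. today.
benchmark = ["search"] * 50 + ["summarize"] * 40 + ["translate"] * 10
traffic   = ["search"] * 20 + ["summarize"] * 30 + ["agent_task"] * 50
overlap = pattern_overlap(benchmark, traffic)  # 0.5 -- benchmark has gone stale
```

Tracking this number over time turns "the benchmark feels stale" into an alertable signal.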
Deterministic testing assumptions. Unit tests assert that f(x) = y. LLM outputs are stochastic. The same prompt produces different outputs across runs and model versions, even at a fixed temperature. Traditional pass/fail testing cannot express "this output is acceptable" for a system where acceptable outputs form a distribution, not a point.
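What replaces f(x) = y is an assertion on the distribution: sample the model N times, check each output against a property rather than an exact string, and assert on the pass rate. A sketch, with a stubbed `generate` standing in for a real model call:

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Stand-in for a stochastic model call: same prompt, varying surface form.
    (Hypothetical stub; a real system would call the model API.)"""
    rng = random.Random(seed)
    templates = [
        "Paris is the capital of France.",
        "The capital of France is Paris.",
        "France's capital city is Paris.",
        "I think you mean the capital of France, which is Paris.",
    ]
    return rng.choice(templates)

def acceptable(output: str) -> bool:
    """Property check: any output naming Paris as France's capital passes."""
    return "Paris" in output and "France" in output

def eval_pass_rate(prompt: str, n_samples: int = 20) -> float:
    """The unit of assertion is the pass *rate*, not any single output."""
    passes = sum(acceptable(generate(prompt, seed=i)) for i in range(n_samples))
    return passes / n_samples

rate = eval_pass_rate("What is the capital of France?")
assert rate >= 0.95  # threshold on the distribution, not equality on a point
```

The threshold (0.95 here) is a policy decision, not a technical one: it encodes how much output variance the product can tolerate.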
Surface-level monitoring. Token counts, latency percentiles, and error rates describe the mechanics of inference. They do not describe reasoning quality. A model that hallucinates confidently produces normal telemetry. The failure is semantic, and the monitoring infrastructure has no semantic layer.
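A semantic layer can start crudely. One illustrative heuristic, assuming the system retrieves context before answering: measure how much of the answer is actually supported by that context. This is a deliberately simple lexical sketch -- production systems use learned scorers -- but it already separates a grounded answer from a confident hallucination that looks identical in latency and token-count telemetry.

```python
def groundedness(answer: str, context: str) -> float:
    """Crude semantic signal: fraction of content words in the answer
    that appear in the retrieved context. A confident hallucination
    scores low here even though its telemetry looks normal.
    (Illustrative heuristic; real systems use learned scorers.)"""
    stop = {"the", "a", "an", "is", "are", "was", "of", "in", "to", "and"}
    answer_words = {w.lower().strip(".,") for w in answer.split()} - stop
    context_words = {w.lower().strip(".,") for w in context.split()} - stop
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

context = "The report covers Q3 revenue of 4.2 million dollars."
grounded = groundedness("Q3 revenue was 4.2 million dollars.", context)
hallucinated = groundedness("Q3 revenue doubled to 9 million euros.", context)
assert grounded > hallucinated
```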
Scale mismatch in human evaluation. Human raters provide high-quality signal but cannot scale to production velocity. Inter-rater reliability is typically 0.6-0.8 on subjective quality judgments. A single rater evaluating the same output on different days will disagree with themselves 15-25% of the time. Human evaluation is a calibration tool, not a monitoring system.
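Agreement figures like the 0.6-0.8 above are commonly reported as chance-corrected statistics such as Cohen's kappa. A minimal two-rater computation, to make the number concrete:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for agreement
    expected by chance. kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum((count_a[l] / n) * (count_b[l] / n)
              for l in set(count_a) | set(count_b))
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
b = ["good", "bad",  "bad", "good", "bad", "good", "good", "bad"]
kappa = cohens_kappa(a, b)  # 0.5: 75% raw agreement, but half was expected by chance
```

The gap between raw agreement (0.75) and kappa (0.5) is the point: on subjective quality labels, a substantial fraction of apparent agreement is chance.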
What the Infrastructure Requires
Continuous production evaluation. Evaluation must happen on live traffic, not on a test set selected months before deployment. Every user interaction is a potential evaluation signal. The infrastructure must sample, score, and aggregate quality metrics from production data in near-real-time.
Failure pattern detection. AI failures are not random. They cluster around specific input patterns, context configurations, and user interaction sequences. The eval system must surface these clusters automatically -- identifying that the model fails systematically on negation, or on requests requiring multi-step reasoning, or on inputs exceeding a certain complexity threshold. This is unsupervised pattern recognition over failure cases, not hand-written test assertions.
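A minimal sketch of the aggregation step, under stated assumptions: failures arrive as raw prompts, and cheap keyword-based feature extractors stand in for what would, in practice, be clustering over embeddings. The feature names (`negation`, `long_input`, and so on) are hypothetical labels for illustration.

```python
from collections import Counter

def failure_features(prompt: str) -> list[str]:
    """Cheap, illustrative feature flags; real systems would cluster
    embeddings rather than match keywords."""
    feats = []
    lowered = prompt.lower()
    if any(w in lowered for w in ("not", "never", "except")):
        feats.append("negation")
    if len(prompt.split()) > 30:
        feats.append("long_input")
    if prompt.count("?") > 1:
        feats.append("multi_question")
    return feats or ["other"]

def surface_failure_clusters(failures: list[str], min_count: int = 2) -> dict[str, int]:
    """Aggregate failed inputs by feature; features that recur across
    failures are candidate systematic failure modes, not noise."""
    counts = Counter(f for prompt in failures for f in failure_features(prompt))
    return {feat: n for feat, n in counts.items() if n >= min_count}

failures = [
    "Do not include weekends in the total",
    "List everything except archived items",
    "Never use the cached value here",
    "Summarize this document",
]
clusters = surface_failure_clusters(failures)
# "negation" dominates: the failures cluster on negated instructions
```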
Reasoning traces. When output quality degrades, the debugging question is "why did the reasoning go wrong," not "what was the output." Evaluation infrastructure must capture intermediate reasoning steps, confidence distributions, and alternative paths considered. Without this, debugging a production failure requires reproducing it -- which, for a stochastic system, may be impossible.
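The data to capture can be sketched as a small record per request. Field names here are illustrative, not a standard schema; the point is that the trace carries enough to debug a failure without reproducing it.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    """What debugging a stochastic failure requires: intermediate steps,
    per-step confidence, and the paths considered but rejected.
    (Hypothetical schema for illustration.)"""
    request_id: str
    steps: list[str] = field(default_factory=list)
    confidences: list[float] = field(default_factory=list)
    alternatives: list[str] = field(default_factory=list)

    def log_step(self, step: str, confidence: float) -> None:
        self.steps.append(step)
        self.confidences.append(confidence)

    def weakest_step(self) -> str:
        """The lowest-confidence step is the first place to look
        when the final output is wrong."""
        idx = min(range(len(self.confidences)), key=self.confidences.__getitem__)
        return self.steps[idx]

trace = ReasoningTrace(request_id="req-123")
trace.log_step("parse the date range from the request", 0.95)
trace.log_step("resolve 'last quarter' to concrete dates", 0.40)
trace.log_step("sum revenue over the resolved range", 0.90)
```

Given this trace, a reviewer goes straight to the date-resolution step instead of rerunning the request and hoping the failure recurs.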
Adaptive test generation. Edge cases discovered in production must flow back into the evaluation suite automatically. The test corpus must evolve at the same rate as user behavior. Static test suites become stale faster than they can be manually updated.
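The feedback loop reduces to a corpus that ingests production failures directly, with deduplication so the suite does not fill with near-identical repeats. A sketch under simplifying assumptions -- exact-text dedup here, where a real system would dedup on embeddings:

```python
class AdaptiveTestCorpus:
    """Evaluation cases flow in from production failures rather than
    being hand-written. (Sketch: dedup on normalized text stands in
    for semantic dedup on embeddings.)"""

    def __init__(self):
        self.cases: list[dict] = []
        self._seen: set[str] = set()

    def ingest_failure(self, prompt: str, expected_property: str) -> bool:
        """Promote a production failure into a regression case.
        Returns True if the case was new to the corpus."""
        key = prompt.strip().lower()
        if key in self._seen:
            return False
        self._seen.add(key)
        self.cases.append({"prompt": prompt, "check": expected_property})
        return True

corpus = AdaptiveTestCorpus()
corpus.ingest_failure("List files not modified today", "respects negation")
corpus.ingest_failure("list files not modified today", "respects negation")  # duplicate
```

Each ingested case then runs against every subsequent model version, so a failure discovered once in production cannot silently regress.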
The Gap
Current infrastructure measures AI systems with tools designed for deterministic software. The evaluation layer for probabilistic reasoning systems -- continuous, production-integrated, semantically aware, and self-updating -- does not exist as a mature category.
The teams that build reliable AI systems today do so with custom internal tooling, stitched together from logging pipelines, ad-hoc scoring scripts, and manual review processes. This is where evaluation infrastructure was for traditional software in the early 2000s, before CI/CD became a category.
The gap will be filled. The question is whether it happens through purpose-built infrastructure or through continued accumulation of ad-hoc solutions that break at scale.