
The AI Evals Rebuild: How to Actually Test AI Systems

• 4 min read

The first three posts exposed what's broken in AI evaluation. Here's the rebuild: throw out the benchmarks and test against production reality.

After three posts tearing apart the AI evaluation industry, my inbox exploded.

"Okay, we get it. Evals are broken. So what should we build instead?"

Fair question. I've spent the last two years helping companies rebuild their evaluation systems from scratch. Not with fancier benchmarks or more sophisticated metrics. But with something radically simpler.

Here's the controversial truth: The best AI evaluation system is no evaluation system.

Let me explain.

The Production Mirror Principle

Last month, I worked with a payments company struggling with their fraud detection AI. They had built an elaborate evaluation framework:

  • 50,000 synthetic fraud cases
  • 15 different accuracy metrics
  • Automated benchmark suite running nightly
  • Beautiful dashboards showing 97.3% accuracy

Their actual fraud detection rate in production? 62%.

We threw it all out.

Instead, we built something stupidly simple: we ran the new model alongside the old one on real transactions. Its output drove no decisions and triggered no actions. We just logged what each model would have done.

After one week of shadow mode, we knew exactly how the new model would perform. No benchmarks. No synthetic data. Just reality.

The first principle of real evaluation: Test where you deploy, not where it's convenient.

The Three Pillars That Actually Work

1. Shadow Mode Everything

Before any model touches production, it runs in shadow mode for at least a week.

Here's what that means:

  • The model sees real production inputs
  • It generates predictions but doesn't act on them
  • You compare what it would have done vs. what actually happened

A fintech company I advised discovered their "improved" model would have blocked 40% of legitimate transactions. The benchmarks showed 94% accuracy. Shadow mode revealed the truth.
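Here's roughly what shadow mode looks like in code. This is a minimal sketch, not the payments company's actual system; the model interfaces, the log fields, and the stub models at the bottom are all stand-ins for whatever you actually run:

```python
import json
import time
from typing import Any, Callable

# Hypothetical model interfaces: any callable that takes a transaction
# dict and returns a decision. Illustrative only, not a real API.
LiveModel = Callable[[dict], Any]
ShadowModel = Callable[[dict], Any]

def handle_transaction(txn: dict, live: LiveModel, shadow: ShadowModel,
                       log=print) -> Any:
    """Serve the live model's decision; run the shadow model on the same
    input and log what it *would* have done, without acting on it."""
    start = time.time()
    live_decision = live(txn)          # this is what production acts on
    live_ms = (time.time() - start) * 1000

    try:
        start = time.time()
        shadow_decision = shadow(txn)  # never acted on, only logged
        shadow_ms = (time.time() - start) * 1000
        log(json.dumps({
            "txn_id": txn.get("id"),
            "live_decision": live_decision,
            "shadow_decision": shadow_decision,
            "live_latency_ms": round(live_ms, 1),
            "shadow_latency_ms": round(shadow_ms, 1),
            "agree": live_decision == shadow_decision,
        }))
    except Exception as exc:
        # A shadow failure must never touch the live path.
        log(json.dumps({"txn_id": txn.get("id"), "shadow_error": str(exc)}))

    return live_decision

if __name__ == "__main__":
    # Stub models standing in for the real old and new fraud models.
    old_model = lambda t: "approve" if t["amount"] < 500 else "review"
    new_model = lambda t: "approve" if t["amount"] < 200 else "review"
    handle_transaction({"id": "t1", "amount": 350}, old_model, new_model)
```

After a week, you diff those logs against what actually happened. That diff is your evaluation.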

2. Incremental Rollout with Kill Switches

Start with 0.1% of traffic. Not 10%. Not 1%. Zero point one percent.

I learned this from a disaster at a recommendation startup. They rolled out their new model to 20% of users. Within 3 hours, revenue dropped 30%. The rollback took 6 hours. They lost $2 million.

Here's the framework that actually works:

  • 0.1% for 24 hours
  • 1% for 3 days
  • 5% for a week
  • 25% for a week
  • 50% for a week
  • 100% only after proving value at each stage

And the critical part: One-button rollback at every stage. Not "we can roll back if needed." One button. 10 seconds. Done.
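A staged rollout gate with a kill switch fits in a few dozen lines. This is a sketch under assumptions: in a real system the current stage and the kill switch would live in a config store you can flip without a deploy, which is stubbed here as module-level state.

```python
import hashlib

# Rollout stages from the list above: 0.1%, 1%, 5%, 25%, 50%, 100%.
STAGES = [0.001, 0.01, 0.05, 0.25, 0.50, 1.0]

current_stage = 0      # index into STAGES; bumped only after each stage proves out
kill_switch = False    # the "one button": flip it and everyone gets the old model

def use_new_model(user_id: str) -> bool:
    """Deterministically bucket a user so they always see the same model."""
    if kill_switch:
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < STAGES[current_stage] * 10_000

def rollback() -> None:
    """One-button rollback: every request routes back to the old model."""
    global kill_switch
    kill_switch = True

if __name__ == "__main__":
    # At stage 0 (0.1%), roughly 10 of every 10,000 users get the new model.
    served_new = sum(use_new_model(f"user-{i}") for i in range(10_000))
    print(f"{served_new} of 10,000 users on the new model")
```

The point of the deterministic hash is that a user doesn't flip between models request to request, and the point of the kill switch is that rollback is a flag flip, not a deploy.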

3. User Feedback Loops Beat Synthetic Benchmarks

The best evaluation metric I've ever seen? A thumbs up/thumbs down button.

A customer support AI I worked with had elaborate BLEU scores, perplexity measurements, and response quality metrics. We replaced it all with two buttons: "This helped" and "This didn't help."

Within a month, we had 10,000 real user evaluations. We discovered:

  • The model was great at simple questions (85% helpful)
  • Terrible at billing issues (20% helpful)
  • Actively harmful for technical support (users preferred no response)

No benchmark would have caught this. Real users did immediately.
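Turning those two buttons into the breakdown above takes almost no code. A minimal sketch, assuming you tag each conversation with a category (the tagging step itself, and the category names, are assumptions for illustration):

```python
from collections import defaultdict

# One record per vote: the question category and whether the user
# hit "This helped". Straight from the two-button widget.
feedback = [
    {"category": "simple", "helpful": True},
    {"category": "billing", "helpful": False},
    {"category": "technical", "helpful": False},
    # ... thousands more
]

def helpfulness_by_category(records):
    """Aggregate thumbs up/down into a per-category helpful rate."""
    totals = defaultdict(lambda: [0, 0])   # category -> [helpful, total]
    for r in records:
        totals[r["category"]][0] += int(r["helpful"])
        totals[r["category"]][1] += 1
    return {cat: helped / total for cat, (helped, total) in totals.items()}

if __name__ == "__main__":
    for cat, rate in sorted(helpfulness_by_category(feedback).items()):
        print(f"{cat:>10}: {rate:.0%} helpful")
```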

The Minimum Viable Eval Stack

You need exactly four things:

1. A/B Testing Framework

Not complex. Dead simple. Route X% of traffic to model A, Y% to model B. Compare business metrics, not AI metrics.
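Here's how simple "dead simple" can be. A sketch, with fake traffic and an illustrative revenue metric; the assignment function and field names are mine, not a real framework:

```python
import hashlib
from collections import defaultdict

def assign_arm(user_id: str, split: float = 0.5) -> str:
    """Deterministic assignment: `split` fraction of users get model B."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "model_b" if bucket < split * 100 else "model_a"

outcomes = defaultdict(list)   # arm -> list of business outcomes (revenue here)

def record_outcome(user_id: str, revenue: float) -> None:
    outcomes[assign_arm(user_id)].append(revenue)

def compare() -> None:
    """Compare a business metric (revenue per user), not an AI metric."""
    for arm, values in sorted(outcomes.items()):
        avg = sum(values) / len(values) if values else 0.0
        print(f"{arm}: n={len(values)}, revenue/user=${avg:.2f}")

if __name__ == "__main__":
    # Fake traffic for illustration only.
    for i in range(1000):
        record_outcome(f"user-{i}", revenue=1.0 if i % 3 else 4.0)
    compare()
```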

2. Basic Logging

  • Input
  • Output
  • Latency
  • User action after seeing output
  • Business outcome

That's it. If you're logging more than this, you're overthinking it.
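For concreteness, here's what one log line with exactly those five fields might look like. The field names and the `print` sink are placeholders; the point is the shape, not the schema:

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class AIRequestLog:
    """One log line per AI request -- the five fields and nothing else."""
    input_text: str
    output_text: str
    latency_ms: float
    user_action: Optional[str] = None       # e.g. "accepted", "edited", "ignored"
    business_outcome: Optional[str] = None  # e.g. "converted", "ticket_filed"
    timestamp: float = 0.0

def log_request(record: AIRequestLog) -> None:
    record.timestamp = record.timestamp or time.time()
    print(json.dumps(asdict(record)))       # stand-in for your real log sink

if __name__ == "__main__":
    log_request(AIRequestLog(
        input_text="Where is my order?",
        output_text="Your order shipped on Tuesday.",
        latency_ms=412.0,
        user_action="accepted",
        business_outcome="no_ticket_filed",
    ))
```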

3. Cost Tracking

Every AI request costs money. Track it per model, per user, per feature. I've seen companies burn $50K in a weekend because they didn't know their new model was 100x more expensive.
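Per-model, per-feature cost tracking can start as a dictionary. A sketch with made-up prices; plug in whatever your provider actually charges:

```python
from collections import defaultdict

# Illustrative per-1K-token prices -- not real pricing.
PRICE_PER_1K_TOKENS = {"old_model": 0.002, "new_model": 0.06}

spend = defaultdict(float)   # (model, feature) -> dollars

def track_cost(model: str, feature: str, tokens: int) -> float:
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    spend[(model, feature)] += cost
    return cost

if __name__ == "__main__":
    track_cost("old_model", "support_chat", tokens=1200)
    track_cost("new_model", "support_chat", tokens=1200)
    for (model, feature), dollars in sorted(spend.items()):
        print(f"{model:>10} / {feature}: ${dollars:.4f}")
```

Even this crude version would have caught the 100x price difference before the weekend, not after it.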

4. User Feedback Widget

Thumbs up. Thumbs down. Optional text field. Nothing more.

The Case Study That Changed My Mind

A logistics company came to me with 47 different models for route optimization. They had built an entire evaluation platform:

  • Synthetic city simulations
  • Millions of test deliveries
  • Complex scoring algorithms
  • 6-person team just managing evaluations

We replaced it all with one Google Sheet.

Every morning, drivers rated yesterday's routes: Good, Okay, or Bad. That's it.

Within two weeks, we identified the 3 models that actually improved driver satisfaction. We killed the other 44.

Delivery times improved 15%. Driver retention increased 20%. The evaluation system that couldn't predict this? Deprecated.

Why Industry Benchmarks Are Worse Than Useless

Here's what nobody wants to admit: Industry benchmarks actively make your AI worse.

When you optimize for benchmarks, you're optimizing for someone else's problem. You're solving imaginary challenges instead of real ones.

I consulted for a medical AI company that spent 18 months improving their score on medical benchmarks. They went from 72% to 91%. Impressive, right?

When they deployed to hospitals, doctors hated it. The model was great at benchmark diseases nobody sees. It was terrible at common cases that weren't in the benchmark.

Benchmarks test what's easy to measure, not what matters.

The Questions That Actually Matter

Forget "What's your F1 score?" Here are the questions that predict AI success:

  1. What happens when this fails? If you can't answer this in one sentence, you're not ready for production.

  2. How do we know it's failing? Not "the metrics look bad." How does a human notice something's wrong?

  3. How fast can we turn it off? Measured in seconds, not minutes.

  4. What's the worst-case scenario cost? Both in dollars and user trust.

  5. Who's responsible when it breaks? "The AI team" is not an answer.

The Controversial Part

Here's what really upsets the AI evaluation industry: Most companies don't need AI evaluation. They need AI monitoring.

Evaluation assumes you can predict production behavior in a lab. You can't.

Monitoring assumes production will surprise you. It will.

The companies succeeding with AI aren't the ones with the best benchmarks. They're the ones who can detect and fix production issues fastest.

Your 5-Step Eval Rebuild

  1. Kill your benchmarks. Today. They're not helping.

  2. Start shadow mode. Pick your most critical AI system. Run the next version in shadow for a week.

  3. Add user feedback. Two buttons. That's all.

  4. Track business metrics, not AI metrics. Revenue, user retention, support tickets. Not accuracy scores.

  5. Build rollback first. Before you deploy anything, make sure you can undo it in under 30 seconds.

The Path Forward

The evaluation industry wants you to believe you need complex tools, sophisticated metrics, and massive benchmark suites.

You don't.

You need to know three things:

  • Does it work where we deploy it?
  • Do users prefer it?
  • Can we afford it?

Everything else is academic theater.

The next time someone shows you a 97% accuracy score, ask them one question: "What's your rollback time?"

If they can't answer in seconds, they're not ready for production.

And production is the only evaluation that matters.
