
The AI Evals Rebuild: How to Actually Test AI Systems

• 4 min read

The first three posts exposed what's broken in AI evaluation. Here's the rebuild: throw out the benchmarks and test against production reality.

After three posts tearing apart the AI evaluation industry, my inbox exploded.

"Okay, we get it. Evals are broken. So what should we build instead?"

Fair question. I've spent the last two years helping companies rebuild their evaluation systems from scratch. Not with fancier benchmarks or more sophisticated metrics. But with something radically simpler.

Here's the controversial truth: The best AI evaluation system is no evaluation system.

Let me explain.

The Production Mirror Principle

Last month, I worked with a payments company struggling with their fraud detection AI. They had built an elaborate evaluation framework:

  • 50,000 synthetic fraud cases
  • 15 different accuracy metrics
  • Automated benchmark suite running nightly
  • Beautiful dashboards showing 97.3% accuracy

Their actual fraud detection rate in production? 62%.

We threw it all out.

Instead, we built something stupidly simple: we ran the new model alongside the old one on real transactions. Its output drove no decisions and triggered no actions. We just logged what each model would have done.

After one week of shadow mode, we knew exactly how the new model would perform. No benchmarks. No synthetic data. Just reality.

The first principle of real evaluation: Test where you deploy, not where it's convenient.

The Three Pillars That Actually Work

1. Shadow Mode Everything

Before any model touches production, it runs in shadow mode for at least a week.

Here's what that means:

  • The model sees real production inputs
  • It generates predictions but doesn't act on them
  • You compare what it would have done vs. what actually happened

A fintech company I advised discovered their "improved" model would have blocked 40% of legitimate transactions. The benchmarks showed 94% accuracy. Shadow mode revealed the truth.
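Here's roughly what shadow mode looks like in code. This is a minimal sketch, not the payments company's actual system; the model interfaces, the log fields, and the stub models at the bottom are all stand-ins for whatever you actually run:

```python
import json
import time
from typing import Any, Callable

# Hypothetical model interfaces: any callable that takes a transaction
# dict and returns a decision. Illustrative only, not a real API.
LiveModel = Callable[[dict], Any]
ShadowModel = Callable[[dict], Any]

def handle_transaction(txn: dict, live: LiveModel, shadow: ShadowModel,
                       log=print) -> Any:
    """Serve the live model's decision; run the shadow model on the same
    input and log what it *would* have done, without acting on it."""
    start = time.time()
    live_decision = live(txn)          # this is what production acts on
    live_ms = (time.time() - start) * 1000

    try:
        start = time.time()
        shadow_decision = shadow(txn)  # never acted on, only logged
        shadow_ms = (time.time() - start) * 1000
        log(json.dumps({
            "txn_id": txn.get("id"),
            "live_decision": live_decision,
            "shadow_decision": shadow_decision,
            "live_latency_ms": round(live_ms, 1),
            "shadow_latency_ms": round(shadow_ms, 1),
            "agree": live_decision == shadow_decision,
        }))
    except Exception as exc:
        # A shadow failure must never touch the live path.
        log(json.dumps({"txn_id": txn.get("id"), "shadow_error": str(exc)}))

    return live_decision

if __name__ == "__main__":
    # Stub models standing in for the real old and new fraud models.
    old_model = lambda t: "approve" if t["amount"] < 500 else "review"
    new_model = lambda t: "approve" if t["amount"] < 200 else "review"
    handle_transaction({"id": "t1", "amount": 350}, old_model, new_model)
```

After a week, you diff those logs against what actually happened. That diff is your evaluation.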

2. Incremental Rollout with Kill Switches

Start with 0.1% of traffic. Not 10%. Not 1%. Zero point one percent.

I learned this from a disaster at a recommendation startup. They rolled out their new model to 20% of users. Within 3 hours, revenue dropped 30%. The rollback took 6 hours. They lost $2 million.

Here's the framework that actually works:

  • 0.1% for 24 hours
  • 1% for 3 days
  • 5% for a week
  • 25% for a week
  • 50% for a week
  • 100% only after proving value at each stage

And the critical part: One-button rollback at every stage. Not "we can roll back if needed." One button. 10 seconds. Done.
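A staged rollout gate with a kill switch fits in a few dozen lines. This is a sketch under assumptions: in a real system the current stage and the kill switch would live in a config store you can flip without a deploy, which is stubbed here as module-level state.

```python
import hashlib

# Rollout stages from the list above: 0.1%, 1%, 5%, 25%, 50%, 100%.
STAGES = [0.001, 0.01, 0.05, 0.25, 0.50, 1.0]

current_stage = 0      # index into STAGES; bumped only after each stage proves out
kill_switch = False    # the "one button": flip it and everyone gets the old model

def use_new_model(user_id: str) -> bool:
    """Deterministically bucket a user so they always see the same model."""
    if kill_switch:
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < STAGES[current_stage] * 10_000

def rollback() -> None:
    """One-button rollback: every request routes back to the old model."""
    global kill_switch
    kill_switch = True

if __name__ == "__main__":
    # At stage 0 (0.1%), roughly 10 of every 10,000 users get the new model.
    served_new = sum(use_new_model(f"user-{i}") for i in range(10_000))
    print(f"{served_new} of 10,000 users on the new model")
```

The point of the deterministic hash is that a user doesn't flip between models request to request, and the point of the kill switch is that rollback is a flag flip, not a deploy.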

3. User Feedback Loops Beat Synthetic Benchmarks

The best evaluation metric I've ever seen? A thumbs up/thumbs down button.

A customer support AI I worked with had elaborate BLEU scores, perplexity measurements, and response quality metrics. We replaced it all with two buttons: "This helped" and "This didn't help."

Within a month, we had 10,000 real user evaluations. We discovered:

  • The model was great at simple questions (85% helpful)
  • Terrible at billing issues (20% helpful)
  • Actively harmful for technical support (users preferred no response)

No benchmark would have caught this. Real users did immediately.
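Turning those two buttons into the breakdown above takes almost no code. A minimal sketch, assuming you tag each conversation with a category (the tagging step itself, and the category names, are assumptions for illustration):

```python
from collections import defaultdict

# One record per vote: the question category and whether the user
# hit "This helped". Straight from the two-button widget.
feedback = [
    {"category": "simple", "helpful": True},
    {"category": "billing", "helpful": False},
    {"category": "technical", "helpful": False},
    # ... thousands more
]

def helpfulness_by_category(records):
    """Aggregate thumbs up/down into a per-category helpful rate."""
    totals = defaultdict(lambda: [0, 0])   # category -> [helpful, total]
    for r in records:
        totals[r["category"]][0] += int(r["helpful"])
        totals[r["category"]][1] += 1
    return {cat: helped / total for cat, (helped, total) in totals.items()}

if __name__ == "__main__":
    for cat, rate in sorted(helpfulness_by_category(feedback).items()):
        print(f"{cat:>10}: {rate:.0%} helpful")
```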

The Minimum Viable Eval Stack

You need exactly four things:

1. A/B Testing Framework

Not complex. Dead simple. Route X% of traffic to model A, Y% to model B. Compare business metrics, not AI metrics.
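Here's how simple "dead simple" can be. A sketch, with fake traffic and an illustrative revenue metric; the assignment function and field names are mine, not a real framework:

```python
import hashlib
from collections import defaultdict

def assign_arm(user_id: str, split: float = 0.5) -> str:
    """Deterministic assignment: `split` fraction of users get model B."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "model_b" if bucket < split * 100 else "model_a"

outcomes = defaultdict(list)   # arm -> list of business outcomes (revenue here)

def record_outcome(user_id: str, revenue: float) -> None:
    outcomes[assign_arm(user_id)].append(revenue)

def compare() -> None:
    """Compare a business metric (revenue per user), not an AI metric."""
    for arm, values in sorted(outcomes.items()):
        avg = sum(values) / len(values) if values else 0.0
        print(f"{arm}: n={len(values)}, revenue/user=${avg:.2f}")

if __name__ == "__main__":
    # Fake traffic for illustration only.
    for i in range(1000):
        record_outcome(f"user-{i}", revenue=1.0 if i % 3 else 4.0)
    compare()
```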

2. Basic Logging

  • Input
  • Output
  • Latency
  • User action after seeing output
  • Business outcome

That's it. If you're logging more than this, you're overthinking it.
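For concreteness, here's what one log line with exactly those five fields might look like. The field names and the `print` sink are placeholders; the point is the shape, not the schema:

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class AIRequestLog:
    """One log line per AI request -- the five fields and nothing else."""
    input_text: str
    output_text: str
    latency_ms: float
    user_action: Optional[str] = None       # e.g. "accepted", "edited", "ignored"
    business_outcome: Optional[str] = None  # e.g. "converted", "ticket_filed"
    timestamp: float = 0.0

def log_request(record: AIRequestLog) -> None:
    record.timestamp = record.timestamp or time.time()
    print(json.dumps(asdict(record)))       # stand-in for your real log sink

if __name__ == "__main__":
    log_request(AIRequestLog(
        input_text="Where is my order?",
        output_text="Your order shipped on Tuesday.",
        latency_ms=412.0,
        user_action="accepted",
        business_outcome="no_ticket_filed",
    ))
```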

3. Cost Tracking

Every AI request costs money. Track it per model, per user, per feature. I've seen companies burn $50K in a weekend because they didn't know their new model was 100x more expensive.
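Per-model, per-feature cost tracking can start as a dictionary. A sketch with made-up prices; plug in whatever your provider actually charges:

```python
from collections import defaultdict

# Illustrative per-1K-token prices -- not real pricing.
PRICE_PER_1K_TOKENS = {"old_model": 0.002, "new_model": 0.06}

spend = defaultdict(float)   # (model, feature) -> dollars

def track_cost(model: str, feature: str, tokens: int) -> float:
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    spend[(model, feature)] += cost
    return cost

if __name__ == "__main__":
    track_cost("old_model", "support_chat", tokens=1200)
    track_cost("new_model", "support_chat", tokens=1200)
    for (model, feature), dollars in sorted(spend.items()):
        print(f"{model:>10} / {feature}: ${dollars:.4f}")
```

Even this crude version would have caught the 100x price difference before the weekend, not after it.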

4. User Feedback Widget

Thumbs up. Thumbs down. Optional text field. Nothing more.

The Case Study That Changed My Mind

A logistics company came to me with 47 different models for route optimization. They had built an entire evaluation platform:

  • Synthetic city simulations
  • Millions of test deliveries
  • Complex scoring algorithms
  • 6-person team just managing evaluations

We replaced it all with one Google Sheet.

Every morning, drivers rated yesterday's routes: Good, Okay, or Bad. That's it.

Within two weeks, we identified the 3 models that actually improved driver satisfaction. We killed the other 44.

Delivery times improved 15%. Driver retention increased 20%. The evaluation system that couldn't predict this? Deprecated.

Why Industry Benchmarks Are Worse Than Useless

Here's what nobody wants to admit: Industry benchmarks actively make your AI worse.

When you optimize for benchmarks, you're optimizing for someone else's problem. You're solving imaginary challenges instead of real ones.

I consulted for a medical AI company that spent 18 months improving their score on medical benchmarks. They went from 72% to 91%. Impressive, right?

When they deployed to hospitals, doctors hated it. The model was great at benchmark diseases nobody sees. It was terrible at common cases that weren't in the benchmark.

Benchmarks test what's easy to measure, not what matters.

The Questions That Actually Matter

Forget "What's your F1 score?" Here are the questions that predict AI success:

  1. What happens when this fails? If you can't answer this in one sentence, you're not ready for production.

  2. How do we know it's failing? Not "the metrics look bad." How does a human notice something's wrong?

  3. How fast can we turn it off? Measured in seconds, not minutes.

  4. What's the worst-case scenario cost? Both in dollars and user trust.

  5. Who's responsible when it breaks? "The AI team" is not an answer.

The Controversial Part

Here's what really upsets the AI evaluation industry: Most companies don't need AI evaluation. They need AI monitoring.

Evaluation assumes you can predict production behavior in a lab. You can't.

Monitoring assumes production will surprise you. It will.

The companies succeeding with AI aren't the ones with the best benchmarks. They're the ones who can detect and fix production issues fastest.

Your 5-Step Eval Rebuild

  1. Kill your benchmarks. Today. They're not helping.

  2. Start shadow mode. Pick your most critical AI system. Run the next version in shadow for a week.

  3. Add user feedback. Two buttons. That's all.

  4. Track business metrics, not AI metrics. Revenue, user retention, support tickets. Not accuracy scores.

  5. Build rollback first. Before you deploy anything, make sure you can undo it in under 30 seconds.

The Path Forward

The evaluation industry wants you to believe you need complex tools, sophisticated metrics, and massive benchmark suites.

You don't.

You need to know three things:

  • Does it work where we deploy it?
  • Do users prefer it?
  • Can we afford it?

Everything else is academic theater.

The next time someone shows you a 97% accuracy score, ask them one question: "What's your rollback time?"

If they can't answer in seconds, they're not ready for production.

And production is the only evaluation that matters.
