Forget Perfect Data: Building a Usable Voice Profile Extractor

June 26, 2025 · 4 min read

60% accuracy is enough to ship. Your obsession with perfect data is why you have no revenue.

#ai #voice-ai #personality-replication #startup-engineering #data-science

60% accuracy is sufficient for revenue.

I shipped a voice profile extractor at 60% accuracy. Now it powers this entire blog's AI content, runs at 80% accuracy, and took one week to build.

Your voice AI project? Still waiting for clean data. Still "almost ready." Still making zero money.

Here's the uncomfortable truth the ML researchers won't tell you: Users don't care about your F1 scores. They care about whether your product works well enough to solve their problem. And "well enough" is way lower than you think.

I built the Jonathan Voice Engine with messy markdown files, basic regex patterns, and zero training data. It writes like me—contractions, contrarian takes, and all. No BERT. No transformers. Just pattern matching that ships.

The Perfect Data Trap (And Why You're Stuck In It)

Many years ago at a tiny startup, we burned through $200K and six months because the CTO read too many ML papers. "We need clean data," he said. "We need balanced demographics."

We needed revenue. We got bankruptcy.

While we were perfecting our data quality metrics:

  • Competitor A shipped with 100 crappy recordings
  • Competitor B used their founders' podcast transcripts
  • Competitor C literally used YouTube auto-captions

All three are still in business. We're not.

Here's the thing most people miss: your users don't give a shit about your F1 scores. They care about whether your product solves their problem. Our competitors got that. We didn't.

What Actually Matters in Voice Profile Extraction

Everyone thinks voice profile extraction is about sophisticated NLP models and transformer architectures. That's academic thinking. Here's what actually matters:

1. Consistent Patterns Beat Perfect Accuracy

My voice profiler tracks simple patterns:

  • Contraction frequency (I use them constantly)
  • Sentence and paragraph length (short, punchy, 2-4 sentences per paragraph)
  • Rhetorical questions (transition device)
  • Active voice ratio (>90%)
  • Signature phrases ("Here's the thing most people miss...")

That's it. No BERT. No transformers. Just pattern matching that works.
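
To make that concrete, here's a rough sketch of the kind of counting involved. The function, regexes, and thresholds below are illustrative, not my production code:

// Illustrative sketch of surface-pattern counting. Names, regexes, and
// thresholds are examples, not the actual Jonathan Voice Engine internals.
const CONTRACTIONS = /\b\w+'(s|t|re|ve|ll|d|m)\b/gi
const SIGNATURE_PHRASES = ["here's the thing", "here's what actually"]

function quickVoiceStats(text: string) {
  const sentences = text.split(/[.!?]+/).map(s => s.trim()).filter(Boolean)
  const words = text.split(/\s+/).filter(Boolean)

  return {
    contractionRate: (text.match(CONTRACTIONS) ?? []).length / Math.max(words.length, 1),
    avgSentenceLength: words.length / Math.max(sentences.length, 1),
    rhetoricalQuestions: (text.match(/\?/g) ?? []).length,
    // Crude active-voice proxy: sentences without a "was/were + past participle" shape.
    activeVoiceRatio:
      sentences.filter(s => !/\b(was|were|been|being)\s+\w+ed\b/i.test(s)).length /
      Math.max(sentences.length, 1),
    signatureHits: SIGNATURE_PHRASES.filter(p => text.toLowerCase().includes(p)).length,
  }
}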

2. Domain-Specific Markers Trump Generic Features

Generic voice analysis looks for things like "formality level" and "sentiment." Useless.

My system looks for:

  • Contrarian indicators ("conventional wisdom is wrong")
  • Specific framework references (startup bargain, strategic quality)
  • Industry context markers (security, startups, AI)
  • Experience-based examples ("At one startup I advised...", "In my experience...")

These domain markers are 10x more valuable than generic linguistic features.
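
If you're curious how little machinery that takes: a few keyword lists and a counter. The phrases below are examples of the idea, not my exact lists:

// Illustrative: domain markers as plain keyword lists, counted per category.
const DOMAIN_MARKERS = {
  contrarian: ["conventional wisdom is wrong", "everyone thinks", "that's backwards"],
  frameworks: ["startup bargain", "strategic quality"],
  industry: ["security", "startup", "ai"],
  experience: ["at one startup", "in my experience", "i've watched"],
}

function countDomainMarkers(text: string) {
  const lower = text.toLowerCase()
  return Object.fromEntries(
    Object.entries(DOMAIN_MARKERS).map(([category, phrases]) => [
      category,
      phrases.reduce((n, phrase) => n + (lower.includes(phrase) ? 1 : 0), 0),
    ]),
  )
}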

3. Fast Iteration Beats Slow Perfection

My development cycle:

  • Monday: Basic regex extraction (2 hours)
  • Tuesday: Statistical analysis layer (4 hours)
  • Wednesday: Validation scoring system (3 hours)
  • Thursday: Integration with Claude API (2 hours)
  • Friday: Testing and refinement (all day)

Total: one week to a working system.

The Architecture Nobody Tells You About

Here's the actual code structure that powers my voice engine:

// Core extraction pipeline
class VoiceProfileExtractor {
  extract(posts: string[]): VoiceProfile {
    return {
      tone: this.extractToneMarkers(posts),
      style: this.extractStylePatterns(posts),
      perspectives: this.extractBeliefs(posts),
      frameworks: this.extractFrameworks(posts),
      phrases: this.extractSignaturePhrases(posts),
    }
  }

  // The magic: simple pattern matching
  extractToneMarkers(posts: string[]) {
    return {
      directness: this.measureDirectness(posts), // No hedge words
      contrarian: this.measureContrarian(posts), // Challenge patterns
      empathy: this.measureEmpathy(posts), // "I understand" patterns
      pragmatism: this.measurePragmatism(posts), // "What works" focus
    }
  }
}

Notice what's missing? Machine learning. Deep learning. Any learning at all.

It's just measuring what's already there.
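
Using it is equally boring. Roughly, assuming the class above is filled in and in scope, and your posts live in a content/posts directory (adjust the path to taste):

// Hypothetical usage: feed raw markdown posts straight in, no cleaning.
import { readFileSync, readdirSync } from "node:fs"
import { join } from "node:path"

const dir = "content/posts"
const posts = readdirSync(dir)
  .filter(f => f.endsWith(".md"))
  .map(f => readFileSync(join(dir, f), "utf-8"))

const profile = new VoiceProfileExtractor().extract(posts)
console.log(JSON.stringify(profile, null, 2))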

Building Your Own: The Non-Obvious Steps

Want to build your own voice profiler? Here's what actually works:

Step 1: Start With Your Worst Data

Don't clean your data. Don't normalize it. Use it raw. Why? Because production data will be messy too. If your system can't handle your worst data, it's useless.

Step 2: Extract Observable Patterns First

Before you think about AI:

  • Count things (words, sentences, paragraphs)
  • Find patterns (phrases, structures, transitions)
  • Measure ratios (active/passive, short/long, direct/hedged)

You'll be shocked how far basic counting gets you.
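
"Find patterns" sounds vague, so here's one concrete version of it: count recurring three-word phrases across your posts and see what surfaces. The thresholds below are arbitrary; tweak them:

// Sketch: recurring trigrams across posts surface candidate signature
// phrases and transition habits. minCount and the slice size are arbitrary.
function topTrigrams(posts: string[], minCount = 3): Array<[string, number]> {
  const counts = new Map<string, number>()
  for (const post of posts) {
    const words = post.toLowerCase().replace(/[^a-z'\s]/g, " ").split(/\s+/).filter(Boolean)
    for (let i = 0; i + 2 < words.length; i++) {
      const gram = `${words[i]} ${words[i + 1]} ${words[i + 2]}`
      counts.set(gram, (counts.get(gram) ?? 0) + 1)
    }
  }
  return [...counts.entries()]
    .filter(([, n]) => n >= minCount)
    .sort((a, b) => b[1] - a[1])
    .slice(0, 20)
}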

Step 3: Build Validation Before Accuracy

Most people build a model then try to validate it. Backwards.

Build your validation system first:

  • Define what "sounds right" means quantitatively
  • Create scoring rubrics for each dimension
  • Test manually on 10-20 examples
  • THEN build the extraction system to hit those targets
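
A scoring rubric can be as dumb as weighted target ranges. The dimensions and ranges below are illustrative (they reuse the stats from the earlier sketch); the point is that the targets exist before the extractor does:

// Illustrative rubric: numeric targets for "sounds right", defined up front.
type Dimension = { name: string; target: [number, number]; weight: number }

const RUBRIC: Dimension[] = [
  { name: "contractionRate", target: [0.02, 0.08], weight: 0.3 },
  { name: "avgSentenceLength", target: [8, 16], weight: 0.3 },
  { name: "activeVoiceRatio", target: [0.9, 1.0], weight: 0.4 },
]

function rubricScore(measured: Record<string, number>): number {
  // 1.0 means every dimension landed inside its target range.
  return RUBRIC.reduce((score, { name, target: [lo, hi], weight }) => {
    const v = measured[name] ?? 0
    return score + (v >= lo && v <= hi ? weight : 0)
  }, 0)
}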

Step 4: Ship at 60% Accuracy

My voice engine shipped at 60% accuracy. Now it's at 80%.

Those 20 percentage points came from:

  • Real usage data
  • User feedback
  • Iterative improvements
  • Parameter tuning based on results

You can't get from 60% to 80% in development. You can only get there in production.

The Uncomfortable Truth About Voice AI

Here's what nobody wants to admit: Most voice profile extraction is solving the wrong problem.

You don't need to perfectly replicate someone's voice. You need to:

  • Capture their key perspectives
  • Maintain consistent tone
  • Apply their frameworks
  • Sound authentic enough to be useful

My AI doesn't write exactly like me. It writes like me on a good day, when I'm focused and articulate. That's actually more valuable than perfect replication.

Real Implementation Lessons

After building this system, here's what I learned:

1. Authenticity Scoring > Similarity Scoring

Don't measure how similar the output is to training data. Measure whether it feels authentic. My scoring system penalizes:

  • Academic language (-10%)
  • Hedge words (-15%)
  • Missing contractions (-20%)
  • Generic advice (-25%)

These penalties matter more than matching exact phrases.
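
In code, that's just a list of penalty rules. The detection patterns below are simplified stand-ins; the weights mirror the ones above:

// Sketch of penalty-based authenticity scoring. Patterns are crude proxies.
const PENALTIES: Array<{ label: string; pattern: RegExp; penalty: number }> = [
  { label: "academic language", pattern: /\b(utilize|facilitate|paradigm|furthermore)\b/i, penalty: 0.10 },
  { label: "hedge words", pattern: /\b(perhaps|arguably|somewhat|potentially)\b/i, penalty: 0.15 },
  { label: "missing contractions", pattern: /\b(do not|cannot|it is not|that is not)\b/i, penalty: 0.20 },
  { label: "generic advice", pattern: /\b(best practices|key takeaways)\b/i, penalty: 0.25 },
]

function authenticityScore(text: string): number {
  return PENALTIES.reduce(
    (score, { pattern, penalty }) => (pattern.test(text) ? score - penalty : score),
    1.0,
  )
}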

2. Context Injection > Model Training

Instead of training models, inject context at generation time:

  • Recent examples of target voice
  • Specific frameworks to reference
  • Domain-specific knowledge
  • Signature phrases to use

This approach is 100x faster than model training and surprisingly effective.
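
Mechanically, that means assembling a prompt at generation time. A rough sketch, with made-up field names and wording:

// Sketch of context injection: the profile travels in the prompt, not in weights.
type InjectedContext = {
  phrases: string[]
  frameworks: string[]
  recentExamples: string[]
}

function buildPrompt(ctx: InjectedContext, topic: string): string {
  return [
    "Write a blog post in the author's voice.",
    `Topic: ${topic}`,
    `Signature phrases to work in naturally: ${ctx.phrases.join("; ")}`,
    `Frameworks to reference where relevant: ${ctx.frameworks.join("; ")}`,
    "Recent examples of the target voice:",
    ...ctx.recentExamples.map(e => `---\n${e}`),
  ].join("\n\n")
}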

3. Human Validation > Automated Metrics

My best accuracy improvements came from:

  • Reading output and marking what felt wrong
  • Adjusting weights based on intuition
  • Testing edge cases manually
  • Getting feedback from blog readers

Fancy metrics didn't help. Human judgment did.

Ship Your Shitty V1

You know what's worse than shipping bad AI? Not shipping at all.

I've watched dozens of teams die waiting for perfect voice data. Meanwhile, some kid with a laptop and ChatGPT is eating their lunch.

My voice engine shipped with:

  • 60% accuracy
  • Obvious failure modes
  • Zero edge case handling
  • Embarrassing bugs

Now it powers this entire blog's AI content. Every point of improvement since launch came from production usage, real feedback, and iterative fixes.

The perfectionists are still polishing their datasets. I'm shipping content.

Stop optimizing for your ego. Start optimizing for learning speed. Ship the shitty v1. Fix it live. The market rewards velocity, not virtue.


Technical Note: Want to validate your own content for authenticity? The Jonathan Voice Engine can analyze any text:

echo "Your text here" | bun scripts/jonathan-voice.ts validate

It'll score your content across multiple dimensions and tell you exactly what's missing. Because here's the truth: measuring authenticity is more valuable than generating it.
