I built a multi-AI content pipeline: Gemini generates outlines and drafts, Claude validates voice consistency. One command produces a publishable blog post.
Three architectural decisions failed. The failures are more instructive than the system that works.
Dynamic Model Selection Was Wasted Engineering
I built logic to choose the "best" model per task based on topic, length, and complexity. The decision logic was arbitrary -- there is no defensible way to score whether a topic is "more creative" versus "more analytical." The switching overhead killed performance. Different models have different formatting habits, so a post that started in Gemini and switched to Claude mid-draft read like two different authors.
What works: fixed model assignment per pipeline stage. Gemini always outlines. Claude always validates. No switching. The consistency is the feature.
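A minimal sketch of what that looks like in code, assuming a generic `call_model` client (the stub below stands in for real Gemini/Claude API calls):

```python
# Fixed stage-to-model assignment: a lookup table, not a scoring function.
# Model names and the call_model stub are illustrative placeholders.
PIPELINE_STAGES = {
    "outline": "gemini",   # Gemini always outlines
    "draft": "gemini",     # Gemini always drafts
    "validate": "claude",  # Claude always validates voice
}

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real API client call.
    return f"[{model}] response to: {prompt}"

def run_stage(stage: str, prompt: str) -> str:
    model = PIPELINE_STAGES[stage]  # no topic scoring, no mid-draft switching
    return call_model(model, prompt)
```

The whole "selection engine" collapses to a dictionary lookup, which is the point.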
Complex Prompt Chains Degraded Output
I built prompt templates with conditional sections, variable substitution, and nested instructions that "optimized" based on topic domain, audience, and tone. The prompts became 1,500-word novels. Output quality dropped because the models spent tokens parsing meta-instructions instead of writing.
A 200-word prompt with clear instructions consistently beat the sophisticated version. Returns on prompt complexity follow an inverted U: quality rises as you add clarity, then falls as meta-instructions crowd out the actual task. Most teams are past the peak.
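A flat template in the spirit of the short version looks something like this (the wording is illustrative, not the actual prompt from the pipeline):

```python
# One topic, one audience, no conditional sections or nested instructions.
def build_prompt(topic: str, audience: str, outline: str) -> str:
    return (
        f"Write a blog post about {topic} for {audience}.\n"
        f"Follow this outline exactly:\n{outline}\n"
        "Use short paragraphs. Include one concrete example per section."
    )
```

Everything the model needs, nothing it has to parse around.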
Real-Time Quality Scoring Added Latency Without Value
I built a system to score content quality in real-time and retry with different parameters on low scores. The scoring required another LLM call (slow), produced inconsistent results across runs (unreliable), and the retry logic generated slight variations of the same mediocre output (useless).
What works: structural validation. Word count, section count, presence of examples. These are deterministic, fast, and sufficient. Iterate on the system over time, not on individual generations.
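A sketch of that kind of check, assuming the generated posts use markdown headings (the thresholds and the example-detection heuristic are illustrative; tune them to your own format):

```python
import re

# Deterministic structural checks: fast, repeatable, no extra LLM call.
def validate_structure(post: str, min_words: int = 800,
                       min_sections: int = 3) -> list[str]:
    problems = []
    if len(post.split()) < min_words:
        problems.append(f"under {min_words} words")
    headings = re.findall(r"^#{1,3} ", post, flags=re.MULTILINE)
    if len(headings) < min_sections:
        problems.append(f"fewer than {min_sections} sections")
    lowered = post.lower()
    if "for example" not in lowered and "e.g." not in lowered:
        problems.append("no examples detected")
    return problems  # empty list means the draft passes
```

Every check runs in microseconds and gives the same answer twice, which is exactly what the LLM-based scorer could not do.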
What Worked
Task decomposition. Outline-then-write was the real breakthrough. Gemini produces better outlines when that is its only job. It produces better content when given its own outline as input. This is not about multiple models. It is about giving each step a clear input and output.
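The decomposition is two calls, each with one clear input and one clear output; a sketch, with `generate` standing in for a real model client:

```python
# Outline-then-write: step two consumes step one's output verbatim.
def generate(task: str, payload: str) -> str:
    # Stand-in for a model call scoped to a single task.
    return f"<{task}>{payload}</{task}>"

def produce_post(topic: str) -> str:
    outline = generate("outline", topic)  # step 1: outlining is the only job
    return generate("write", outline)     # step 2: write from that outline
```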
Error handling. AI models fail constantly -- rate limits, timeouts, malformed outputs. Error handling strategy matters more than prompt strategy. Exponential backoff with jitter, fallback approaches (not fallback models), manual override for everything.
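Exponential backoff with full jitter is a standard pattern; a sketch with illustrative retry counts and delays (real code should catch the specific rate-limit and timeout exceptions of its API client, not bare `Exception`):

```python
import random
import time

def with_backoff(fn, retries=5, base=1.0, cap=30.0):
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error for manual override
            # Full jitter: sleep a random amount up to the capped exponential.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

The jitter matters: without it, every retrying client hammers the rate-limited API at the same instant.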
Boring infrastructure. The hard problems are state management, data flow, and cost tracking. Gemini's output format drifts between calls, so you need normalization layers. Multiple models mean multiple API bills. One inefficient prompt can cost hundreds of dollars before you notice.
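A normalization layer is unglamorous string cleanup; the patterns below are examples of the drift I would expect (stray code fences, lead-in chatter), not an exhaustive list; tune them to the failures you actually see:

```python
import re

def normalize(output: str) -> str:
    out = output.strip()
    out = re.sub(r"^```[a-zA-Z]*\n", "", out)                 # opening fence
    out = re.sub(r"\n?```$", "", out)                         # closing fence
    out = re.sub(r"^(Sure|Here is|Here's)[^\n]*\n", "", out)  # chatter line
    return out.strip()
```

Boring, but it runs on every model response before anything downstream sees it.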
The Honest Assessment
Pure automation produces mediocre content. The best output comes from an AI draft, a human edit for voice and insight, an AI expansion pass, and a human final review. The pipeline saves time on scaffolding so effort goes to substance.
Use one model until you hit real limitations. Add complexity only when you can measure the improvement. Spend engineering time on error handling and data flow, not on model selection logic.