i am jonathan haas.
i run EvalOps, a lab focused on making AI systems less brittle. right now that means stress-testing language models, measuring how they drift, and wiring guardrails back into production.
before this i shipped code at Snap, Carta, and DoorDash, then built ThreatKey. most days you can find me pairing with teams that want fewer surprises when they ship.
recent writing
Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings on real developer documentation queries, with analysis of cost, latency, and accuracy.
A comprehensive synthesis of 21 posts on developer experience (DX): patterns, principles, and practices for building exceptional developer tools.
It started with a Jupyter notebook. 'Look, I built a chatbot in 10 minutes!' Nine months later, three engineers had quit and the company almost folded.
projects i'm proud of
applied research shop pressure-testing evaluation guardrails with real teams.
field notes on hardening production systems before they fall apart.
multi-agent probes that flag conflicting model behavior before users see it.
hands-on playbook for shipping self-improving LLM apps without guesswork.