31st March 2024 - Link Blog
Your AI Product Needs Evals (via) Hamel Husain: “I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.”
I’ve been frustrated about this for a while: I know I need to move beyond “vibe checks” for the systems I have started to build on top of LLMs, but I was lacking a thorough guide about how to build automated (and manual) evals in a productive way.
Hamel has provided exactly the tutorial I was needing for this, with a really thorough example case-study.
Using GPT-4 to create test cases is an interesting approach: “Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.”
Also important: “... unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision.”
Hamel’s guide then covers the importance of traces for evaluating real-world performance of your deployed application, plus the pros and cons of leaning on automated evaluation using LLMs themselves.
Plus some wisdom from a footnote: “A reasonable heuristic is to keep reading logs until you feel like you aren’t learning anything new.”
Recent articles
- Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 - 16th April 2026
- Meta's new model is Muse Spark, and meta.ai chat has some interesting tools - 8th April 2026
- Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me - 7th April 2026