Quality Assurance

Synthesize, but Verify: The Data Flywheel Behind Useful AI Automation

Opening: Why this matters now

The easiest AI demo in the world is a model producing something plausible. A product description. A support reply. A defect image. A peer-review report. A compliance explanation. A benchmark answer. The output looks competent enough to be shown in a slide deck, which is often where corporate AI strategy goes to enjoy a short but well-lit life. ...

When the Judge Needs Judging: LLM Evaluators Under Cross-Examination

The dashboard says the judge is fine. The document disagrees. Judge is an easy word to trust. It suggests robes, procedure, and someone in the room who is supposed to be less confused than everyone else. In AI evaluation, the word has become dangerously comfortable. Product teams now use LLMs to score summaries, rank chatbot answers, approve RAG outputs, compare model releases, and decide whether another model’s response is “good enough.” The attraction is obvious: human review is expensive, slow, and occasionally insists on context. An LLM judge is fast, scalable, and does not ask why the evaluation rubric was written five minutes before the sprint review. ...


Back to the Drawing Board: How DiagramIR Quietly Fixes Math Diagrams for AI

A diagram is not a paragraph with lines attached. That sounds obvious, which is usually where software product teams get into trouble. Text can be judged by fluency, relevance, and whether the answer has wandered into confident nonsense. A geometry diagram has extra obligations. The side marked 8 should look longer than the side marked 3. The angle labelled $90^\circ$ should not be having an identity crisis. Labels should sit near the thing they label. The image should not be half outside the frame, unless the product strategy is “modern art, but for sixth grade”. ...

Assert Less, Observe More: AICL and the New QA Stack for LLM Apps

TL;DR for operators

LLM application testing should stop pretending that the whole product behaves like ordinary software. The database connector, retry logic, API wrapper, and schema validator still deserve normal unit, integration, and load tests. Fine. Keep those. They are not the problem. The problem starts when the product becomes a stateful language system: prompts are assembled dynamically, retrieval changes the context, tool calls modify the execution path, memory leaks across turns, and a model update can improve one workflow while quietly breaking another. At that point, exact-match assertions become less like QA and more like theatre with a YAML file. ...