
Assert Less, Observe More: AICL and the New QA Stack for LLM Apps
TL;DR: Traditional QA treats software as deterministic; LLM apps aren't. This paper proposes a three-layer view (System Shell → Prompt Orchestration → LLM Inference) and argues for a collaborative testing strategy: retain classical testing where it still fits, translate assertions into semantic checks, integrate AI-safety-style probes, and extend QA into runtime. The kicker is AICL, a compact agent-interaction protocol that bakes in observability, context isolation, and deterministic replay.

Why this matters for operators and product teams

LLM products now look like systems, not prompts. They combine RAG, tools, stateful multi-turn workflows, and sometimes multi-agent handoffs. The result is probabilistic behavior plus cross-layer failure modes. If you keep writing boolean, exact-match tests, you'll ship brittle releases and discover regressions in production. The fix isn't to abandon testing; it's to move from asserting single outputs to observing semantic behavior distributions. ...
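To make the shift from exact-match assertions to distribution-level checks concrete, here is a minimal Python sketch. It is an illustration only, not the paper's method: the `similarity` scorer is a crude word-overlap stand-in for whatever semantic scorer a team actually uses (embeddings, an LLM judge), the hard-coded `SAMPLES` stand in for repeated calls to a model, and the thresholds are hypothetical.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets.

    Assumption: a placeholder for a real semantic scorer.
    """
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def semantic_pass_rate(samples: List[str], reference: str,
                       min_similarity: float) -> float:
    """Fraction of sampled outputs that are close enough to the reference."""
    passed = sum(1 for s in samples if similarity(s, reference) >= min_similarity)
    return passed / len(samples)


# Hypothetical reference answer and four sampled outputs for the same prompt.
REFERENCE = "Refunds are available within 30 days of purchase."
SAMPLES = [
    "Refunds are available within 30 days of purchase.",
    "You can get a refund within 30 days of purchase.",
    "Refunds are available for 30 days after purchase.",
    "Our office is closed on public holidays.",
]

# Classical view: one boolean per output, exact string equality.
exact_match_rate = sum(s == REFERENCE for s in SAMPLES) / len(SAMPLES)

# Distribution view: score every sample, then gate on the pass rate.
rate = semantic_pass_rate(SAMPLES, REFERENCE, min_similarity=0.35)

MIN_PASS_RATE = 0.7  # hypothetical release gate
print(f"exact-match rate: {exact_match_rate:.2f}")
print(f"semantic pass rate: {rate:.2f} (gate: {MIN_PASS_RATE})")
assert rate >= MIN_PASS_RATE, "semantic pass rate below release gate"
```

The exact-match check rejects two paraphrases that a reader would accept, while the distribution check tolerates them and still flags the off-topic answer. In practice both the per-sample threshold and the pass-rate gate would need tuning against labeled examples.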