TL;DR for operators

LLM application testing should stop pretending that the whole product behaves like ordinary software. The database connector, retry logic, API wrapper, and schema validator still deserve normal unit, integration, and load tests. Fine. Keep those. They are not the problem.

The problem starts when the product becomes a stateful language system: prompts are assembled dynamically, retrieval changes the context, tool calls modify the execution path, memory leaks across turns, and a model update can improve one workflow while quietly breaking another. At that point, exact-match assertions become less like QA and more like theatre with a YAML file.

The paper behind this article, Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol, argues for a three-layer view of LLM applications: System Shell, Prompt Orchestration, and LLM Inference Core.1 That structure is useful because it tells operators where classical testing still works, where it must be translated into semantic evaluation, and where AI-safety and runtime monitoring need to enter the QA stack.

Its proposed protocol, AICL, is the most operational part of the paper. It defines typed agent-interaction messages such as QUERY, PLAN, RESULT, ERROR, MEMORY.STORE, MEMORY.RECALL, and COORD.DELEGATE, with metadata for identity, timestamps, model version, context scope, confidence, cost, latency, and provenance. In business language: it tries to make agent behaviour inspectable after the fact, not merely impressive during the demo.

The practical takeaway is simple: assert less, observe more. Keep deterministic tests for deterministic components. Add semantic regression suites for prompts and behaviours. Record enough structured interaction data to replay failures. Feed production traces back into CI. Do not claim that AICL “solves” QA; the paper does not empirically prove that. Treat it as a design pattern for making LLM applications testable enough to deserve serious deployment.

The familiar test suite fails at the unfamiliar boundary

Picture the normal release meeting.

The engineering team says the API tests pass. The product team says the chatbot answered the golden prompts correctly. The compliance team asks whether the system can leak memory from one customer workflow into another. The QA dashboard looks green, because naturally the dashboard has no field for “latent semantic contamination caused by multi-turn orchestration and retrieval drift.” A triumph of measurement, provided one does not measure the dangerous bit.

That is the setting this paper speaks to. The useful claim is not “LLMs are nondeterministic,” which everyone has now repeated enough to qualify as a mild industry ritual. The sharper claim is that LLM applications are not one thing. They are systems stitched from deterministic software, dynamic orchestration logic, probabilistic model inference, external tools, retrieval indexes, safety filters, memory stores, and runtime state.

Testing fails when these are treated as a single object.

A traditional application can often be tested around a stable relation: input, function, output. LLM applications still contain parts like that. But the moment a user request passes through prompt assembly, context compression, retrieval, tool selection, model generation, and post-processing, the test unit is no longer “one input.” It becomes a behavioural sequence. The output is not just an answer; it is the visible residue of a path.

That is why the paper’s three-layer architecture is more than a diagram. It is a way to stop asking one testing method to do three different jobs.

The three layers create three QA contracts

The paper decomposes LLM applications into three layers: the System Shell Layer, the Prompt Orchestration Layer, and the LLM Inference Core. This is the article’s centre of gravity. Once the layers are separated, the testing logic becomes much less mystical.

Layer What lives there Testing contract What breaks if you use the wrong contract
System Shell APIs, preprocessing, post-processing, tool adapters, retries, timeouts, human-in-the-loop interfaces, event triggers Classical unit, interface, integration, performance, and stress testing You overcomplicate deterministic engineering and miss basic reliability failures
Prompt Orchestration Prompt templates, context construction, retrieval insertion, memory handling, graph workflows, multi-agent handoffs Semantic checks, prompt-space coverage, state-transition tests, paraphrase robustness, trace-based replay You test single prompts while the real product fails through assembled context and hidden state
LLM Inference Core Model parameters, inference service, decoding settings, guardrails, model version Capability benchmarks, safety tests, hallucination checks, drift/regression analysis, probabilistic evaluation You expect reproducible exact outputs from a probabilistic component and confuse variation with failure

The System Shell remains close to traditional software. If a tool adapter sends malformed JSON, no LLM philosophy is required. Test the schema. If a retry policy creates duplicate transactions, that is an engineering bug, not an epistemological event. The paper is admirably unsentimental here: conventional testing is not obsolete. It is merely insufficient.

The Prompt Orchestration Layer is where QA becomes expensive. This layer decides what the model actually sees. It may combine user input, system instructions, retrieved documents, conversation history, memory, tool results, and hidden workflow state. Two user prompts that look similar may produce different final prompts. Two final prompts that look similar may represent different state histories. And one small preprocessing error can propagate through orchestration into a model response that appears “hallucinated” even when the root cause sits upstream.

The LLM Inference Core is the probabilistic engine. Here, exact equality is usually the wrong standard. The question is not whether the model produces the same sentence every time. The question is whether its behaviour remains acceptable across task variants, model versions, decoding configurations, adversarial inputs, and context conditions. That is a different kind of contract.

The mechanism is the message: predictability decreases as we move from shell to core, while testing cost rises. A serious QA stack should reflect that gradient instead of flattening everything into “prompt tests.”

Exact assertions do not disappear; they move upstairs

A lazy reading of the paper would say: “Traditional tests do not work for LLMs.” That is not quite right. Traditional tests work where the system is still traditional.

The paper’s better move is to reposition classical testing into three roles.

First, retain it for deterministic components. Unit tests, interface tests, integration tests, load tests, and exception-handling tests still belong in the shell layer. They set the operational baseline. Without them, the organisation ends up blaming “model behaviour” for failures caused by ordinary glue code. This is a surprisingly common form of technical self-flattery.

Second, translate old testing ideas into semantic equivalents. The old assertion expected == actual does not survive open-ended language output very well. But the underlying purpose of an assertion — deciding whether behaviour is acceptable — still matters. It needs new instruments: LLM-as-judge scoring, embedding similarity, multi-candidate comparison, semantic category coverage, and behavioural consistency across paraphrases.

Third, integrate testing with methods from AI evaluation and AI safety. LLM applications need checks for factuality, contextual consistency, bias, harmful content, jailbreak resistance, prompt injection, privacy leakage, and state contamination. These are not just “model properties” once the model is embedded in a product. They become system-level risks.

The shift is not from testing to vibes. It is from brittle assertions to structured observation.

A useful semantic test suite does not ask, “Did the model produce this exact sentence?” It asks questions like:

Testing question Better operational form
Is the answer correct? Does it satisfy task-specific semantic criteria across multiple valid phrasings?
Is the workflow robust? Does performance remain stable under paraphrases, typos, dialectal variation, incomplete inputs, and adversarial reformulations?
Did the release regress? Which capabilities changed across model, prompt, retrieval-index, or tool versions?
Can we debug a failure? Can we reconstruct the prompt, context, tool calls, memory operations, model version, and final output?
Is the agent safe enough? Does it resist direct and indirect prompt injection under realistic multi-turn conditions?

That table is not in the paper in exactly this form; it is the business translation. The paper directly provides the conceptual mapping. Cognaptus infers the operating checklist.

The paper’s “evidence” is a framework map, not a leaderboard

This matters because the paper is not an empirical benchmark paper. It does not run a new QA tool across a dataset and report percentage-point gains. There are no ablations proving that AICL reduces debugging time by a measured amount. There is no controlled enterprise deployment showing fewer incidents after adoption. Anyone claiming otherwise has added garnish from the imagination pantry.

The paper’s evidence is architectural and comparative. Its tables have specific purposes.

Paper element Likely purpose What it supports What it does not prove
Three-layer architecture Main analytical framework Different layers of LLM applications require different testing assumptions That these are the only possible layers or the best decomposition for every architecture
Mapping of traditional testing roles Applicability analysis Classical QA remains useful in the System Shell and partly translatable elsewhere That existing test tools already cover orchestration and inference risks adequately
Review of AI testing and AI-security methods Comparison with prior work Software engineering and AI-safety communities cover different parts of the QA problem That combining their methods is easy or already standardised
Six challenge categories Problem classification LLM QA failures cluster around semantic evaluation, open input space, state observability, regression, security/compliance, and multimodal integration The relative frequency or cost of each failure category in production
AICL protocol proposal Implementation design sketch Structured agent messages can improve replayability, context isolation, provenance, and evaluation hooks That AICL has been validated at scale, adopted by toolchains, or accepted as a standard

This is still valuable. Not every useful paper needs a leaderboard. Framework papers are valuable when they improve diagnosis. The danger is reading them as if they already delivered implementation proof.

For operators, the paper’s contribution is not “deploy AICL tomorrow and relax.” The contribution is a clearer separation between what can be asserted, what must be evaluated, what must be traced, and what must be monitored continuously.

Six QA failures come from one deeper problem: lost observability

The paper identifies six core challenges in testing LLM applications. They look diverse at first, but most point back to the same deeper mechanism: behaviour becomes hard to observe at the level where the failure is created.

Semantic evaluation is hard because natural language has multiple valid outputs. A customer-support agent may give a correct refund policy in five different phrasings. Exact-match testing punishes variation. On the other hand, a wrong answer can sound polished enough to pass a casual review. Language creates both false negatives and false confidence. Charming, obviously.

Open input robustness is hard because real users do not submit neat benchmark prompts. They write fragments, typos, mixed languages, emotional complaints, adversarial traps, and domain-specific shorthand. Traditional boundary-value analysis works beautifully for fields like “age must be between 0 and 120.” It is less majestic when the boundary is “a sarcastic complaint from a high-value client implying a policy exception without explicitly asking for one.”

Dynamic state and observability are harder still. Stateful agents accumulate context. They may store memory, retrieve prior facts, call tools, update a workflow, or branch into another agent. If a failure appears on turn seven, the answer alone is not enough evidence. The organisation needs the path: what was retrieved, which prompt template fired, what memory was recalled, which tool returned what, what model version ran, and what post-processor changed the result.

Capability evolution creates regression risk. A model update, fine-tune, prompt change, LoRA adapter, RAG index refresh, or safety-policy modification can improve one capability and degrade another. Traditional regression testing assumes that expected behaviours can be enumerated and compared. LLM applications need version-differential behavioural testing: not just “does it pass,” but “what changed, for whom, under which context, and is that change acceptable?”

Security and compliance expand the surface area. Prompt injection, jailbreaks, privacy leakage, harmful content, bias, supply-chain compromise, and jurisdiction-specific compliance issues are not cleanly contained in the model. They travel through tools, documents, memory, prompts, and user workflows. Testing the model alone is like checking the lock on one door in a building with twelve windows open.

Multimodal and system-level integration adds another layer of propagation. In multimodal applications, an image understanding error can distort a text plan; a speech recognition error can alter a retrieval query; a tool failure can be hidden by a fluent response. Performance also becomes semantic: latency and cost depend not only on traffic volume, but on context length, call-chain depth, modality, and agent coordination.

The paper’s unifying point is that LLM QA is a systems problem. The problem is not only that outputs are probabilistic. The problem is that the causal chain behind the output is often invisible.

AICL is a protocol for making agent behaviour replayable

AICL — Agent Interaction Communication Language — is the paper’s attempt to make LLM agent interactions testable by construction. It is schema-driven and transport-independent. Instead of letting agents communicate only through free-form natural language, AICL introduces typed messages with structured metadata.

The paper lists message types including HELLO, QUERY, PLAN, FACT or FACTS, RESULT, ERROR, MEMORY.STORE, MEMORY.RECALL, COORD.DELEGATE, and structured reasoning markers such as REASONING.START, REASONING.STEP, and REASONING.COMPLETE. Each message can carry metadata such as a unique ID, timestamp, protocol version, conversation identifier, context scope, model version, confidence, declared priors, task space, correlation pointer, reasoning trace, cost, latency, signature, and capability claims.

The business value is not that these names are magical. The value is that they force runtime behaviour into a form that a test harness can inspect.

A simplified AICL-style interaction looks like this:

[QUERY: tool:weather_now{location:"Shanghai"}
 | id:u!q1, ts:t(2025-08-15T02:00:00Z), ver:"1.2.0", ctx:[u!conv123]]

[RESULT: {data:{temp_c:29, cond:"rain"}, schema:"tool:weather_now/1"}
 | id:u!r1, of:u!q1, conf:0.93, model_version:"gpt-5-20250801", ts:t(...)]

The important part is not the weather. It is the correlation: RESULT points back to QUERY; the model version is recorded; the context is explicit; the output has a schema; confidence and provenance can be captured. A failure can later be replayed and compared.

That matters because many agent failures are not visible from the final answer. Suppose an enterprise procurement agent recommends a supplier. The final message may look reasonable. But the QA question is: did it retrieve the right policy version, call the approved vendor database, ignore expired supplier records, preserve tenant isolation, and explain uncertainty properly? Without structured traces, the answer is “trust us,” which is less a QA strategy than a scented candle.

AICL aims to support five test-oriented capabilities:

AICL feature Operational consequence
Deterministic replay identifiers and canonical encoding Past runs can become reproducible test cases instead of anecdotal screenshots
Explicit context scopes and memory operations State leakage and contamination become easier to detect
Typed QUERY, RESULT, and ERROR messages Tool calls and failures become machine-verifiable rather than buried in prose
Model version, confidence, priors, cost, and latency metadata Regression, calibration, and performance baselines become easier to compare
Reusable production logs Runtime incidents can feed offline regression suites

The last point is the most important. Mature LLM QA will not be built only from handcrafted golden prompts. It will be built from production traces, clustered by workflow, enriched with failure labels, and replayed across releases. AICL is best understood as a substrate for that loop.

The new QA stack is layered, not replaced

The obvious mistake is to replace one oversimplification with another. First came “just write unit tests.” Then came “just use LLM-as-judge.” Soon someone will say “just use AICL.” Progress, apparently, is changing the word after “just.”

The paper points toward a layered QA stack instead.

QA layer Keep Add Runtime extension
Shell reliability Unit tests, schema validation, integration tests, load tests, retries, timeouts Tool-contract tests and API conformance checks Service metrics, error budgets, incident traces
Prompt orchestration Workflow tests and template checks Semantic assertions, prompt-space coverage, paraphrase robustness, state-transition tests Conversation replay, context snapshots, memory isolation checks
Retrieval and tools Index freshness checks, tool availability, schema tests Citation/factuality checks, retrieved-context validation, tool-result consistency Drift monitoring, stale-document detection, tool failure analysis
Model behaviour Benchmarks and safety tests Capability regression, version-differential analysis, jailbreak and injection suites Behaviour drift alerts, sampled human review, release gates
Agent interaction Logs Structured AICL-style messages with provenance and correlation IDs Production logs promoted into regression suites

This is where the business relevance becomes concrete.

For a customer-support bot, the QA gain is not merely fewer hallucinations. It is knowing whether a bad refund answer came from outdated retrieval, a prompt-template regression, a model-version change, or a memory contamination bug.

For a financial research assistant, the gain is not “AI safety” in the abstract. It is replaying the exact context that produced an unsupported claim, checking whether the model ignored a source, and preventing the same path from surviving the next release.

For an internal operations agent, the gain is not a prettier prompt. It is a testable handoff chain: who delegated what to which tool, under which assumptions, with what latency, cost, confidence, and error handling.

The ROI is diagnostic compression. Failures become cheaper to reproduce, classify, and prevent. That is less glamorous than “autonomous agents,” but much more useful after the invoice arrives.

What the paper directly shows, what Cognaptus infers, and what remains uncertain

The paper’s analytical structure is strong enough to guide practice, but its boundaries should be kept visible.

Category Statement
What the paper directly shows LLM applications differ from traditional software because of probabilistic generation, context dependence, emergent capabilities, and system-level composition. A three-layer architecture helps map which testing methods remain applicable, which need semantic reinterpretation, and which require new runtime assurance mechanisms.
What the paper synthesises Traditional software testing, AI-oriented model testing, and AI-security research each cover part of the problem but leave gaps around orchestration, state, lifecycle regression, and system-level observability.
What the paper proposes A collaborative QA framework using retain, translate, integrate, and runtime strategies, plus AICL as a structured protocol for replayable and testable agent interactions.
What Cognaptus infers for operators Organisations should design LLM QA around behavioural traces and release gates, not only prompt-level golden tests. AICL-style logging can turn production failures into regression assets.
What remains uncertain AICL’s adoption path, tooling burden, evaluator reliability, privacy implications, overhead, compatibility with existing agent frameworks, and effectiveness in real production settings are not empirically established by the paper.

This distinction matters. AICL is a protocol proposal, not a market standard. A schema can improve observability only if teams actually instrument workflows, preserve useful metadata, protect sensitive traces, define evaluation criteria, and integrate replay into CI/CD. Otherwise it becomes another logging format that nobody reads until the post-mortem, at which point everyone agrees observability is important. Again.

There is also a privacy and governance issue. Recording context, memory operations, tool calls, confidence, model versions, and reasoning markers can help QA. It can also create sensitive audit material. In regulated environments, the trace itself becomes an asset that must be governed. More observability is good; uncontrolled observability is just a future compliance incident wearing a badge.

Finally, semantic evaluation remains fragile. LLM-as-judge methods can be inconsistent or biased. Embedding similarity can miss task-specific correctness. Human review is expensive. Multi-candidate voting can hide systematic errors if the judges share the same blind spot. The paper recognises these directions; it does not dissolve them.

The operating principle: test the path, not just the answer

The cleanest business lesson from the paper is that LLM application QA should move from answer checking to path checking.

An answer can be acceptable by accident. A fluent response can be produced from the wrong document. A correct tool result can be inserted into the wrong prompt. A safe model can become unsafe when surrounded by hostile retrieved content. A strong benchmark score can coexist with a production memory leak. The final text is not enough.

The testable object should be the interaction path:

  1. What did the user ask?
  2. What context was assembled?
  3. What was retrieved?
  4. What memory was recalled or stored?
  5. Which tool was called?
  6. What did the tool return?
  7. Which model version generated the response?
  8. What confidence, assumptions, cost, and latency were recorded?
  9. Which evaluator judged the result?
  10. Can the path be replayed after a release change?

That is the QA stack this paper gestures toward. Not a single magic benchmark. Not a perfect judge. Not a theological debate about whether LLMs are software. A structured way to observe probabilistic systems with enough discipline that failures can be made boring.

And boring, in production, is a compliment.

Conclusion: QA for LLM apps needs fewer heroic demos and more replay buttons

The old QA bargain was simple: specify expected behaviour, run tests, compare outputs. LLM applications do not fully honour that bargain. They are probabilistic, stateful, context-dependent, tool-using, and continuously evolving. Treating them as deterministic functions is not conservatism. It is denial with a test runner.

The paper’s contribution is to make the problem legible. The System Shell can still be tested like software. Prompt Orchestration needs semantic and state-aware evaluation. The LLM Inference Core needs probabilistic capability and safety testing. Cross-layer behaviour needs observability, replay, and runtime feedback.

AICL is interesting because it does not try to make agents less complex by wishing. It makes their interactions more inspectable. That is the right instinct. The future of LLM QA will not be built from perfect assertions. It will be built from structured traces, semantic regression, version-differential analysis, and production feedback loops that turn yesterday’s weird failure into tomorrow’s boring test case.

Assert less. Observe more. Then replay the thing that broke.

Cognaptus: Automate the Present, Incubate the Future.


  1. Wei Ma, Yixiao Yang, Qiang Hu, Shi Ying, Zhi Jin, Bo Du, Zhenchang Xing, Tianlin Li, Junjie Shi, Yang Liu, and Lingxiao Jiang, “Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol,” arXiv:2508.20737, 2025. https://arxiv.org/abs/2508.20737 ↩︎