The word “risk” is doing too much unpaid labor

A policy agent says: “Flag high-risk cases.”

An execution agent receives the instruction, nods politely in machine language, and flags what it considers high-risk. The dashboard looks normal. The audit trail says the instruction was followed. Everyone enjoys the comforting fiction that the system understood itself.

Then the failure arrives.

Not because the agents were irrational. Not because the workflow broke. Not because someone forgot to add another prompt beginning with “You are a careful assistant.” The problem is simpler and more annoying: the two agents used the same word with different operational meanings.

This is the problem addressed by Verifiable Semantics for Agent-to-Agent Communication, a paper by Philipp Schoenegger, Matt Carlson, Chris Schneider, and Chris Daly [1]. Its central move is refreshingly unsentimental. It does not ask whether two agents “really” share an internal concept. It asks whether they give compatible answers when tested on the same observable events.

That shift matters. In enterprise AI, meaning is not a literary seminar. Meaning is what happens when one system’s label becomes another system’s action.

The paper proposes a protocol for certifying a shared vocabulary between agents. A term enters the usable “core” only if the agents demonstrate sufficiently low disagreement on audited examples. Downstream reasoning is then restricted to that certified core. When model updates or contextual drift corrupt the core, the system re-audits and, where possible, renegotiates the shared vocabulary.

The contribution is not that agents can be made magically aligned. The contribution is more practical: make their disagreement measurable before letting them coordinate on important decisions. A small miracle, by enterprise standards.

Shared language is not shared meaning

The usual intuition is misleading. If two agents both use natural language, and both say “harmful,” “escalate,” “benign,” or “high-risk,” then their communication appears interpretable. Compared with opaque emergent protocols or “neuralese,” natural language feels safer because humans can read it.

Readable, yes. Verified, no.

The paper begins from a tension in agent communication:

| Communication mode | What it offers | What it fails to guarantee |
| --- | --- | --- |
| Natural language | Human readability and auditability | Stable shared semantics across agents |
| Emergent learned protocols | Task efficiency | Human interpretability and third-party verification |
| Stimulus-meaning certification | Behavioral verification of specific terms | Broad expressiveness without testing |

The third row is the paper’s bet. Instead of choosing between readable but unstable language and efficient but opaque protocols, it certifies a reduced vocabulary by testing agents directly.

This is the reader misconception worth removing early: natural language does not automatically solve semantic alignment. Two agents can use the same token and apply it differently. The failure is especially easy to miss because the transcript remains legible. The logs say “escalate.” The models say “escalate.” Unfortunately, one model means “send to senior reviewer,” while another means “mark as severe violation.” The word is shared. The policy is not.

For business systems, this is not a philosophical inconvenience. It is a governance problem. When autonomous components hand off decisions across compliance, moderation, credit risk, customer support, fraud detection, or trading operations, each handoff depends on terms surviving the journey.

The paper’s solution is to stop treating that survival as an assumption.

The mechanism: turn words into witnessed tests

The paper’s mechanism begins with a concrete unit: an observable event.

An event can be a content scenario, transaction, image, customer case, support ticket, compliance incident, or any other domain-specific input with a stable identifier. The important part is that both agents can be queried on the same event.

For each term, the system asks each agent whether the term applies to the event. The agent can assent, dissent, or remain neutral. These verdicts are recorded as witnessed tests in a public or auditable ledger.

A term’s “stimulus meaning” is then not defined by the model’s hidden representation. It is defined extensionally: by the pattern of responses the agent gives across events.

That is the heart of the paper.

The protocol does not require Agent A and Agent B to contain identical internal representations of “sensitive.” It only requires them to behave compatibly when asked whether “sensitive” applies to the same audited cases. For deployment, that is usually the more useful property. We do not need to inspect the cathedral inside the model’s mind. We need to know whether the door opens when both agents see the same key.
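The ledger of witnessed tests can be pictured as a plain record type. The sketch below is illustrative only: the names `Verdict`, `WitnessedTest`, and `stimulus_meaning`, and the event identifiers, are assumptions made for this example rather than the paper's own data model.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ASSENT = "assent"
    DISSENT = "dissent"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class WitnessedTest:
    """One ledger entry: an agent's verdict on one term for one event (hypothetical schema)."""
    agent_id: str
    term: str
    event_id: str
    verdict: Verdict


def stimulus_meaning(ledger, agent_id, term):
    """Extensional meaning of a term for one agent: event_id -> verdict."""
    return {
        test.event_id: test.verdict
        for test in ledger
        if test.agent_id == agent_id and test.term == term
    }


# Illustrative entries; real event identifiers would come from the domain.
ledger = [
    WitnessedTest("agent_a", "sensitive", "evt-001", Verdict.ASSENT),
    WitnessedTest("agent_b", "sensitive", "evt-001", Verdict.ASSENT),
    WitnessedTest("agent_a", "sensitive", "evt-002", Verdict.DISSENT),
    WitnessedTest("agent_b", "sensitive", "evt-002", Verdict.NEUTRAL),
]

meaning_a = stimulus_meaning(ledger, "agent_a", "sensitive")
print(meaning_a)
```

Nothing in this structure looks inside either model. The "meaning" is just the recorded pattern of verdicts, which is the point.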

The certification process has three important counts:

| Quantity | Meaning | Why it matters |
| --- | --- | --- |
| Audited exposures | Events where at least one agent gives a non-neutral verdict | Measures whether the term is active in the sampled event space |
| Eligible comparisons | Events where both agents give non-neutral verdicts | Defines the cases where contradiction can be measured |
| Contradictions | Eligible comparisons where one agent assents and the other dissents | Measures direct semantic disagreement |
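As a minimal sketch, the three counts can be computed from two verdict maps for a single term. The function name `audit_counts` and the string encoding of verdicts are assumptions made for this example.

```python
def audit_counts(verdicts_a, verdicts_b):
    """Count exposures, eligible comparisons, and contradictions for one term.

    Each argument maps event_id -> "assent" | "dissent" | "neutral".
    Only events seen by both agents are compared.
    """
    exposures = eligible = contradictions = 0
    for event_id in verdicts_a.keys() & verdicts_b.keys():
        a, b = verdicts_a[event_id], verdicts_b[event_id]
        if a != "neutral" or b != "neutral":
            exposures += 1          # at least one agent engaged with the term
        if a != "neutral" and b != "neutral":
            eligible += 1           # both engaged, so contradiction is measurable
            if a != b:
                contradictions += 1  # one assents, the other dissents
    return exposures, eligible, contradictions


agent_a = {"evt-1": "assent", "evt-2": "assent", "evt-3": "neutral", "evt-4": "neutral"}
agent_b = {"evt-1": "assent", "evt-2": "dissent", "evt-3": "dissent", "evt-4": "neutral"}

counts = audit_counts(agent_a, agent_b)
print(counts)  # (3, 2, 1)
```

Note that `evt-3` counts as an exposure but not as an eligible comparison: one agent stayed neutral, so there is nothing to contradict.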

A term is not certified merely because observed disagreement looks low. The paper uses a one-sided Wilson confidence bound to estimate an upper bound on the true contradiction rate from finite samples. A term enters the certified core only if that upper bound is below the chosen tolerance threshold and if coverage exceeds a minimum floor.

This second condition is easy to overlook, but operationally important. Without a coverage floor, a rarely used term could appear “safe” simply because the agents barely engage with it. The protocol prevents this cheap victory. A term must be both low-disagreement and sufficiently testable.
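A hedged sketch of the two-part test follows. The Wilson score upper bound is the standard textbook formula. The default tolerance `tau`, the `floor`, and the definition of coverage as eligible comparisons divided by audited events are assumptions chosen for illustration and may not match the paper's exact parameterisation.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper(contradictions, eligible, confidence=0.95):
    """One-sided Wilson upper bound on the true contradiction rate."""
    if eligible == 0:
        return 1.0  # no evidence at all: assume the worst
    z = NormalDist().inv_cdf(confidence)  # one-sided quantile, about 1.645 at 95%
    p = contradictions / eligible
    centre = p + z * z / (2 * eligible)
    margin = z * sqrt(p * (1 - p) / eligible + z * z / (4 * eligible**2))
    return (centre + margin) / (1 + z * z / eligible)


def certify(contradictions, eligible, audited, tau=0.05, floor=0.5, confidence=0.95):
    """Admit a term only if the bound is below tau AND coverage meets the floor.

    Assumption: coverage = eligible comparisons / audited events.
    """
    coverage = eligible / audited if audited else 0.0
    return wilson_upper(contradictions, eligible, confidence) < tau and coverage >= floor


print(round(wilson_upper(0, 100), 4))   # bound with zero contradictions in 100 comparisons
print(certify(0, 100, audited=150))     # enough evidence, enough coverage
print(certify(0, 20, audited=30))       # zero contradictions, but too few comparisons
print(certify(0, 100, audited=400))     # low bound, but coverage under the floor
```

The third call is the instructive one: zero observed contradictions on 20 comparisons still fails, because the upper bound on the true rate is about 12%. A clean record on a thin sample is not a certificate.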

The result is a certified core vocabulary: the set of terms that agents are allowed to use for consequential shared reasoning.

Certification is not the product; core-guarded reasoning is

Certification alone would be a nice audit report. Nice audit reports are where operational discipline goes to nap.

The paper’s stronger move is core-guarded reasoning. After certification, agents restrict consequential communication and decision-making to the certified core. Terms outside the core are not necessarily banned from all use. They can remain available for exploration, learning, or low-stakes communication. But they are not trusted for decisions requiring reproducible agreement.

This matters because the certification guarantee only becomes useful when it constrains behavior. If agents certify a core and then continue using uncertified terms in high-stakes handoffs, the protocol becomes decorative governance. We already have enough of that; it usually arrives in PDF form.

The paper frames the guarantee as reproducibility: if agents share a certified core and use only terms in that core, then their disagreement on certified terms is bounded by the chosen threshold, subject to the confidence level used in certification.
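One way to picture the guard is as a check that sits in front of every handoff. This is a minimal sketch, not the paper's implementation: `guard`, `UncertifiedTermError`, and the `consequential` flag are hypothetical names introduced here.

```python
class UncertifiedTermError(ValueError):
    """Raised when a consequential message relies on a term outside the core."""


def guard(message_terms, certified_core, consequential=True):
    """Block consequential messages that use uncertified terms.

    Returns the uncertified terms that were tolerated (low-stakes traffic only).
    """
    outside = sorted(set(message_terms) - set(certified_core))
    if consequential and outside:
        raise UncertifiedTermError(
            f"uncertified terms in consequential message: {outside}"
        )
    return outside


core = {"benign", "sensitive"}

print(guard(["benign"], core))                                 # nothing outside the core
print(guard(["escalate", "benign"], core, consequential=False))  # tolerated for exploration

try:
    guard(["escalate"], core)
except UncertifiedTermError as err:
    print("blocked:", err)
```

The design choice worth noting is that the guard fails loudly. A silent fallback would recreate the original problem with an extra layer of indirection.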

The mechanism can be read as a four-step pipeline:

| Step | Technical action | Operational interpretation |
| --- | --- | --- |
| 1. Define events | Build a domain event space with stable identifiers | Make the audit target concrete |
| 2. Test terms | Query both agents on term-event pairs | Observe meaning through behavior |
| 3. Certify core | Use statistical bounds and coverage floors | Admit only terms with bounded contradiction |
| 4. Guard reasoning | Restrict consequential communication to certified terms | Convert audit evidence into workflow control |

This is why the paper is best read mechanism-first. The business implication is not “semantic alignment is important.” That conclusion is obvious and therefore not worth charging readers for. The interesting part is how the paper turns meaning into a testable operational object.

Observable events become witnessed tests. Witnessed tests become certified terms. Certified terms become a guarded vocabulary. The guarded vocabulary becomes a control layer for agent workflows.

That is the architecture.

The key trade-off: smaller vocabulary, safer coordination

The protocol does not promise full expressiveness. It explicitly trades vocabulary breadth for reliability.

If the contradiction threshold is strict, fewer terms pass certification, but the terms that pass have stronger reliability guarantees. If the threshold is relaxed, the core becomes broader, but disagreement tolerance rises. The confidence level and coverage floor provide additional control.

This is a governance dial, not a universal constant.
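To see the dial turn, take a set of per-term upper bounds and sweep the tolerance. The bounds below are hypothetical numbers invented for this illustration, not results from the paper.

```python
def core_for(upper_bounds, tau):
    """Terms whose contradiction-rate upper bound is below the tolerance."""
    return {term for term, upper in upper_bounds.items() if upper < tau}


# Hypothetical upper bounds on the contradiction rate, one per term.
upper_bounds = {"benign": 0.013, "sensitive": 0.044, "spam": 0.082, "escalate": 0.25}

core_sizes = []
for tau in (0.02, 0.05, 0.10):
    core = core_for(upper_bounds, tau)
    core_sizes.append(len(core))
    print(f"tau={tau:.2f} -> core={sorted(core)}")
```

A stricter `tau` buys a tighter guarantee and pays for it in vocabulary. No setting certifies `escalate` here, which is information in its own right.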

A content moderation pipeline may accept a relatively narrow core for escalation decisions because false handoffs can create legal or reputational exposure. A brainstorming agent network may tolerate looser thresholds because the cost of disagreement is lower. A compliance workflow should probably not rely on the same semantic tolerance as a marketing ideation bot, though one suspects some procurement decks will try.

The paper’s trade-off analysis supports this framing. It shows that coverage and reliability move against each other: stricter certification reduces vocabulary coverage, while looser certification allows more terms into the core at the cost of higher guarded disagreement. Underlying alignment also matters. If agents are already mostly aligned, the protocol can maintain a larger core at low disagreement. If agents are poorly aligned, even strict thresholds may leave the system with a small usable vocabulary.

That outcome is not a defect. A small core can be a warning signal. If two agents cannot agree on the terms required for coordination, the correct answer is not to let them coordinate anyway with a nicer UI.

What the experiments actually show

The paper evaluates the mechanism in three stages: simulation, trade-off analysis, and a toy LLM validation. These should not be treated as equal kinds of evidence.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Simulation across divergence regimes | Main mechanism demonstration | Core-guarding reduces disagreement by filtering unsafe terms | Production-scale performance in real workflows |
| Drift, recertification, and renegotiation simulation | Lifecycle mechanism test | Re-auditing can detect drift and shrink the core; renegotiation can recover vocabulary | That the proposed renegotiation heuristic is sufficient in adversarial or complex organizations |
| Coverage-reliability analysis | Sensitivity and deployment trade-off | Threshold choices control vocabulary breadth and reliability | A universally optimal threshold |
| Fine-tuned LLM moderation task | Small real-model validation | The protocol can be applied to language models and reduce disagreement | General validity across larger models, domains, or multi-agent systems |

The simulation study uses two agents with a six-term color vocabulary and tests three divergence conditions: Noise-Only, Moderate Drift, and High Divergence. Across runs, the protocol audits terms on a sample of events, certifies a core, and compares unguarded disagreement with core-guarded disagreement on held-out events.

The results are clean:

| Condition | Unguarded disagreement | Core-guarded disagreement | Average core size |
| --- | --- | --- | --- |
| Noise-Only | 2.1% | 2.1% | 3.8 |
| Moderate Drift | 7.4% | 2.1% | 2.6 |
| High Divergence | 40.7% | 1.8% | 0.2 |

The interpretation is straightforward but worth stating precisely. Core-guarding does not make divergent agents agree on everything. It prevents them from relying on terms where agreement cannot be certified.

In the Noise-Only condition, guarded and unguarded disagreement are similar because there is little semantic divergence to filter. In Moderate Drift, unguarded disagreement rises to 7.4%, while guarded disagreement remains near 2.1% because the certified core excludes problematic terms. In High Divergence, unguarded disagreement reaches 40.7%, and the protocol nearly empties the core. In 95 of 100 high-divergence runs, no terms are certified.

That last result is more important than it looks. A system that refuses unsafe coordination is doing useful work. In many enterprise contexts, the most valuable output of an assurance layer is not “yes.” It is “no, and here is the audited reason.”
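The comparison between unguarded and core-guarded disagreement reduces to measuring the same quantity over two different term sets. The sketch below uses a tiny invented data set, so its numbers are not the paper's. The function name and the `(term, event_id)` keying are assumptions for this example.

```python
def disagreement_rate(verdicts_a, verdicts_b, terms):
    """Share of eligible (term, event) pairs, restricted to `terms`, that contradict."""
    eligible = contradictions = 0
    for (term, event_id), a in verdicts_a.items():
        b = verdicts_b.get((term, event_id))
        if term not in terms or b is None or "neutral" in (a, b):
            continue
        eligible += 1
        contradictions += a != b
    return contradictions / eligible if eligible else 0.0


# Hypothetical held-out verdicts: the agents agree on "red" and diverge on "blue".
agent_a = {("red", "e1"): "assent", ("red", "e2"): "dissent",
           ("blue", "e1"): "assent", ("blue", "e2"): "assent"}
agent_b = {("red", "e1"): "assent", ("red", "e2"): "dissent",
           ("blue", "e1"): "dissent", ("blue", "e2"): "assent"}

unguarded = disagreement_rate(agent_a, agent_b, terms={"red", "blue"})
guarded = disagreement_rate(agent_a, agent_b, terms={"red"})  # certified core only
print(unguarded, guarded)  # 0.25 0.0
```

Guarding did not change either agent. It only changed which terms were allowed to carry weight.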

Drift turns certificates into perishable goods

The paper also models semantic drift over time. This is necessary because agent systems are not static. Models are updated. Prompts change. Fine-tuning changes policy boundaries. Contexts shift. One department quietly modifies its escalation criteria and forgets to tell the orchestration layer. Very normal. Very doomed.

A term that was safe at time $t$ may become unsafe at time $t+1$.

The paper proposes recertification as the detection mechanism. At scheduled intervals, the system re-audits certified terms on fresh events. If the Wilson bound rises above the threshold, or coverage falls below the required floor, the term is removed from the core.

This is conservative by design. Recertification preserves reliability by shrinking the vocabulary when evidence deteriorates.
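A recertification pass can be sketched as a filter over the current core. The function below assumes the fresh audit has already produced an upper bound and a coverage figure per term. The name `recertify`, the defaults, and the sample numbers are hypothetical.

```python
def recertify(core, fresh_stats, tau=0.05, floor=0.5):
    """Split the current core into kept and removed terms after a fresh audit.

    fresh_stats maps term -> (upper_bound, coverage) measured on new events.
    A term with no fresh evidence is removed rather than trusted.
    """
    kept, removed = set(), set()
    for term in core:
        upper, coverage = fresh_stats.get(term, (1.0, 0.0))
        if upper < tau and coverage >= floor:
            kept.add(term)
        else:
            removed.add(term)
    return kept, removed


core = {"benign", "sensitive", "spam"}
fresh_stats = {
    "benign": (0.02, 0.90),     # still within tolerance
    "sensitive": (0.09, 0.80),  # bound drifted above the threshold
    "spam": (0.03, 0.20),       # bound fine, coverage collapsed
}

kept, removed = recertify(core, fresh_stats)
print(sorted(kept), sorted(removed))
```

The pass can only shrink the core. Getting vocabulary back is a separate step with a separate burden of proof.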

The simulation illustrates the lifecycle. When drift is injected and the core is frozen, guarded disagreement rises from about 2.2% to 4.6%. With recertification, the drifted term is removed, the core shrinks from 4.4 to 3.5 terms, and guarded disagreement returns to around 2%. The system pays for reliability with reduced expressiveness.

Then comes renegotiation. Recertification can remove terms, but removal alone does not restore lost vocabulary. The paper sketches a renegotiation mechanism: identify the more “entrenched” agent for an excluded term, use that agent’s interpretation as a reference policy, have the other agent adopt it, and then re-audit the term. In the simulation, renegotiation recovers the core to about 4.0 terms.

This part is useful, but it should be read with discipline. Recertification is the stronger operational idea. Renegotiation is more tentative. The entrenchment criterion is a plausible heuristic, not a full governance theory of whose semantics should prevail. In a real compliance system, the “more entrenched” model should not automatically win if its interpretation conflicts with law, policy, or business intent. Seniority is not semantics. It is merely one possible clue.
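The entrenchment heuristic, with the caveat above built in, might look like the following sketch. It reads "entrenched" as "has more historical decided (non-neutral) verdicts on the term". The `authority` override is an addition made here to reflect the point that an external policy owner should be able to outrank either agent; it is not part of the paper's sketch.

```python
def pick_reference_agent(history, term, authority=None):
    """Choose whose interpretation becomes the reference policy for a term.

    history maps agent_id -> list of (term, verdict) records.
    If an external authority is named, it wins outright (hypothetical override).
    Ties fall to whichever agent appears first in `history`.
    """
    if authority is not None:
        return authority
    decided = {
        agent: sum(1 for t, v in records if t == term and v != "neutral")
        for agent, records in history.items()
    }
    return max(decided, key=decided.get)


history = {
    "agent_a": [("escalate", "assent"), ("escalate", "dissent"), ("escalate", "assent")],
    "agent_b": [("escalate", "assent"), ("escalate", "neutral")],
}

print(pick_reference_agent(history, "escalate"))                            # agent_a
print(pick_reference_agent(history, "escalate", authority="policy_owner"))  # policy_owner
# After the other agent adopts the reference policy, the term must be re-audited
# before it can return to the core.
```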

The LLM validation is encouraging, but small

The paper’s real-model validation uses two LoRA adapters fine-tuned on Qwen2.5-3B-Instruct in a toy content moderation setting. The authors create a six-term vocabulary: harmful, misleading, sensitive, spam, benign, and escalate. The agents are fine-tuned on different label distributions to induce divergent policies.

The protocol audits the terms, certifies the aligned subset, and tests core-guarding on held-out events.

Only two of the six terms pass certification: “benign” with 0% contradiction and “sensitive” with 2% contradiction. Core-guarding reduces disagreement from 5.3% to 2.6%, a 51% reduction.
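The 51% figure is the relative reduction between the two reported rates:

$$
\frac{5.3\% - 2.6\%}{5.3\%} = \frac{2.7}{5.3} \approx 0.509 \approx 51\%
$$

In absolute terms the drop is 2.7 percentage points.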

| Condition | Terms used | Disagreement |
| --- | --- | --- |
| Unguarded | 6 | 5.3% |
| Core-guarded | 2 | 2.6% |

This is not production proof. It is a demonstration that the protocol can be applied to actual fine-tuned language models, not only to toy simulated agents.

The magnitude is meaningful because it shows the mechanism working in the intended direction: certify aligned terms, exclude misaligned terms, reduce downstream disagreement. The boundary is equally important: the experiment uses a small vocabulary, small models, a toy moderation task, and a limited held-out evaluation. It does not show that enterprise-scale agent networks can be certified cheaply across thousands of terms and many operational contexts.

The right interpretation is neither “solved” nor “irrelevant.” The result is a credible prototype signal.

A practical governance layer for agent workflows

For enterprise systems, the paper’s value is not in replacing alignment research. It is in suggesting a concrete control layer for multi-agent workflows.

Many organizations are moving toward agentic architectures: one agent classifies a case, another retrieves policy, another decides whether to escalate, another drafts the action, another checks compliance. These systems can fail even when each component performs acceptably in isolation. The handoff terms become the fragile points.

Stimulus-meaning certification can be translated into a practical governance pattern:

| Workflow layer | Certification question | Business control |
| --- | --- | --- |
| Intake classification | Do agents apply labels consistently to the same cases? | Reduce inconsistent routing |
| Compliance triage | Do “high-risk,” “sensitive,” and “escalate” mean the same thing across agents? | Prevent silent policy drift |
| Content moderation | Do moderation agents share category boundaries? | Improve consistency and auditability |
| Fraud or credit review | Do risk terms survive model updates and regional variation? | Support defensible decision handoffs |
| Autonomous operations | Do monitoring and execution agents agree on trigger terms? | Reduce unsafe automated actions |

The business value is cheaper diagnosis, not merely lower disagreement. When a workflow fails, operators need to know whether the failure came from model capability, policy design, retrieval quality, data coverage, or semantic mismatch between agents. A ledger of witnessed term tests gives a concrete place to look.

It also changes deployment discipline. Instead of approving an agent chain because its end-to-end demo looks good, the organization can ask: which handoff terms are certified, under what threshold, on what event sample, with what recertification schedule?

That question will ruin several beautiful demos. Good.

What Cognaptus would infer — and what the paper does not directly show

The paper directly shows three things.

First, a term-level certification protocol can bound contradiction rates using audited event samples and a Wilson confidence bound.

Second, restricting consequential reasoning to certified terms can reduce disagreement in simulations, especially as semantic divergence rises.

Third, the same basic mechanism can be applied to a small fine-tuned LLM moderation setup, reducing disagreement from 5.3% to 2.6%.

From this, Cognaptus would infer a broader design principle: enterprise agent systems need semantic gates, not just performance benchmarks. Before agents use terms to coordinate consequential actions, those terms should be tested as operational interfaces.

But several things remain uncertain.

The paper does not establish scalability to large vocabularies, many agents, or high-context enterprise language. It does not solve compositional meaning: certifying “high” and “risk” separately does not guarantee shared understanding of “high-risk.” It does not fully handle context-dependent terms whose meaning changes across domains. It does not remove the honesty assumption during audits. It does not prove that the renegotiation mechanism is sufficient for adversarial or politically complex organizations.

These are not fatal flaws. They are the edge of the current contribution.

The best way to use the paper is not to declare that semantic alignment is now solved. The best way is to extract an implementation pattern:

  1. Identify task-critical handoff terms.
  2. Build representative event sets with stable identifiers.
  3. Query each agent on term-event pairs.
  4. Certify terms using a statistical threshold and coverage floor.
  5. Restrict high-stakes decisions to the certified core.
  6. Recertify after model updates, prompt changes, or distribution shifts.
  7. Treat failed certification as a workflow risk signal, not as an inconvenience to be prompt-engineered away.

That is a serious governance workflow. It is also wonderfully impolite to vague AI assurance claims.
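The implementation pattern above can be compressed into a toy end-to-end run. Everything specific here is an assumption for illustration: the rule-based stand-in agents, the transaction events, the cutoffs, and the defaults for `tau` and `floor`. Real agents would be model calls, and coverage is again taken as eligible comparisons over audited events.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper(k, n, confidence=0.95):
    """One-sided Wilson upper bound for k contradictions in n eligible comparisons."""
    if n == 0:
        return 1.0
    z = NormalDist().inv_cdf(confidence)
    p = k / n
    return (p + z * z / (2 * n) + z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / (1 + z * z / n)


def certify_core(agent_a, agent_b, events, terms, tau=0.05, floor=0.5):
    """Steps 3-4 and 7: audit term-event pairs, certify, and report failures."""
    core, risk_signals = set(), {}
    for term in terms:
        eligible = contradictions = 0
        for event in events:
            a, b = agent_a(term, event), agent_b(term, event)
            if "neutral" in (a, b):
                continue
            eligible += 1
            contradictions += a != b
        upper = wilson_upper(contradictions, eligible)
        coverage = eligible / len(events) if events else 0.0
        if upper < tau and coverage >= floor:
            core.add(term)
        else:  # step 7: a failed term is a workflow risk signal
            risk_signals[term] = {"upper_bound": round(upper, 3), "coverage": round(coverage, 3)}
    return core, risk_signals


def make_agent(high_risk_cutoff):
    """Hypothetical stand-in agent: rules over a transaction amount."""
    def agent(term, event):
        if term == "high-risk":
            return "assent" if event["amount"] >= high_risk_cutoff else "dissent"
        if term == "benign":
            return "assent" if event["amount"] < 50 else "dissent"
        return "neutral"  # the agent does not engage with other terms
    return agent


terms = ["high-risk", "benign", "escalate"]                           # step 1
events = [{"id": f"txn-{i}", "amount": i} for i in range(200)]        # step 2
core, risk_signals = certify_core(make_agent(100), make_agent(150), events, terms)
print(core)          # step 5 would restrict high-stakes handoffs to this set
print(risk_signals)  # step 6 would rerun this after any model or prompt change
```

In this toy run the two agents place the "high-risk" cutoff at different amounts, so the term that opened this article fails certification. "escalate" fails for a different reason: neither agent engages with it, so coverage is zero.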

The boundary: term-level verification is not full semantic safety

The most important limitation is granularity. The protocol certifies individual terms. Business language rarely behaves so politely.

“Sensitive” can mean one thing in medical data, another in political speech, another in financial disclosure, and another in internal HR workflows. “High-risk” is a compound. “Escalate immediately unless benign” is a policy expression. Certifying isolated terms helps, but it does not automatically certify the compositional logic built from them.

The paper acknowledges this. If event sampling covers the relevant contexts, disagreement can surface. If it does not, the certificate may be too narrow for deployment. This pushes a practical burden onto event design. A weak event set produces weak assurance. No statistical interval rescues a lazy audit sample.
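A toy case makes the compositional gap concrete. Two hypothetical agents agree perfectly on "high" and on "risk" yet combine them differently, one with AND and one with OR. The predicates and events are invented for this example.

```python
def high(event):
    return event["amount"] >= 1000


def risk(event):
    return event["new_account"]


# Both agents share the atomic terms but compose them differently.
agent_a = {"high": high, "risk": risk, "high-risk": lambda e: high(e) and risk(e)}
agent_b = {"high": high, "risk": risk, "high-risk": lambda e: high(e) or risk(e)}

events = [
    {"amount": 1500, "new_account": True},
    {"amount": 1500, "new_account": False},
    {"amount": 200, "new_account": True},
    {"amount": 200, "new_account": False},
]

contradictions = {
    term: sum(agent_a[term](e) != agent_b[term](e) for e in events)
    for term in ("high", "risk", "high-risk")
}
print(contradictions)  # {'high': 0, 'risk': 0, 'high-risk': 2}
```

Auditing only the atomic terms would certify both and miss the disagreement entirely. The compound has to be audited as its own term, on events that separate the two readings.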

Pairwise certification is another boundary. The paper is framed around two agents. Real workflows may involve many agents with different roles and update cycles. Pairwise certification can extend in principle, but audit costs and transitivity questions will grow. Agent A agreeing with Agent B, and Agent B agreeing with Agent C, does not automatically mean the whole workflow has a clean shared semantic layer.
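The transitivity problem shows up even in a three-agent toy case. The sketch below uses raw observed contradiction rates for brevity, where a full audit would use the confidence bound. The verdict lists and the 15% tolerance are invented for illustration.

```python
from itertools import combinations

# Hypothetical verdicts on the same ten events for one term.
verdicts = {
    "agent_a": ["assent"] * 10,
    "agent_b": ["dissent"] + ["assent"] * 9,
    "agent_c": ["dissent"] * 2 + ["assent"] * 8,
}


def contradiction_rate(x, y):
    """Observed contradiction rate over events where both agents are non-neutral."""
    pairs = [(a, b) for a, b in zip(x, y) if "neutral" not in (a, b)]
    return sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0


rates = {
    (p, q): contradiction_rate(verdicts[p], verdicts[q])
    for p, q in combinations(verdicts, 2)
}

tolerance = 0.15
for pair, rate in rates.items():
    print(pair, rate, "ok" if rate <= tolerance else "FAIL")
```

A agrees with B within tolerance, and B with C, but A against C does not pass. Disagreements can add up along a chain, and the number of pairs to audit grows as n(n-1)/2 with the number of agents.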

The honesty assumption also deserves attention. The protocol assumes agents report verdicts reflecting their actual dispositions during certification. An adversarial or strategically aware agent could behave well during audits and diverge later. The authors argue that successful gaming would require multi-step planning and consistent deception beyond current architectures, but as agent capability rises, randomized audits, holdout events, and deployment-time monitoring will become necessary.

Finally, renegotiation is a sketch. In business settings, the right reference meaning may come from law, contract, operating policy, expert committee, or customer risk appetite—not from the agent with more historical decided verdicts. Entrenchment can be useful, but it should not be confused with authority.

The real lesson: make language an interface with tests

The most useful way to read this paper is as interface design.

In software engineering, interfaces need contracts. Inputs, outputs, types, tests, versioning, deprecation, and monitoring. Multi-agent systems are beginning to use natural language as an interface, but often without equivalent interface discipline. The result is a dangerous elegance: agents appear to coordinate because their messages are readable.

Stimulus-meaning certification says: readability is not enough. If a term controls action, test it.

That principle is likely to age well.

As agentic systems become more common, organizations will need to manage not only model performance but semantic compatibility. Which terms are safe? Which terms drifted after the latest fine-tuning? Which agent pair disagrees on “sensitive”? Which workflow should stop using “escalate” until the core is refreshed?

These questions are not glamorous. That is usually a sign they matter.

The paper does not give us a complete enterprise semantic safety platform. It gives a compact mechanism for turning shared vocabulary from an assumption into an auditable object. In the practical world of agent workflows, that is a meaningful step.

Because when AI agents need to work together, the first question is not whether they can speak.

It is whether they are certified to mean the same thing.

Cognaptus: Automate the Present, Incubate the Future.


  1. Philipp Schoenegger, Matt Carlson, Chris Schneider, and Chris Daly, “Verifiable Semantics for Agent-to-Agent Communication,” arXiv:2602.16424v2, 2026. https://arxiv.org/abs/2602.16424