Agents on the Wire: Protocols, Memory, and Guardrails for Real-World Agentic AI

TL;DR for operators

An agent demo usually fails in production for boring reasons. Not because the model suddenly forgot how to reason. Because the agent cannot reliably discover another agent, remember the right state, expose a stable contract, validate risky outputs, or execute generated code without turning the server into an involuntary escape room.

That is the useful reading of Derouiche, Brahmi, and Mazeni’s survey of agentic AI frameworks.¹ The paper is not a leaderboard. It does not say that LangGraph beats CrewAI, or that A2A will conquer MCP, or that one framework has achieved enterprise enlightenment after three GitHub stars and a conference booth. It does something more valuable: it maps the components that determine whether agentic AI becomes infrastructure or remains a collection of clever scripts.

The operational message is blunt:

Framework choice is workflow choice. CrewAI, AutoGen, LangGraph, Semantic Kernel, Agno, Google ADK, LlamaIndex, MetaGPT, SmolAgents, PydanticAI, and the OpenAI Agents SDK are not interchangeable skins over the same idea. They encode different assumptions about roles, graphs, memory, tools, orchestration, and control.
Protocols are becoming the real battleground. MCP, ACP, A2A, ANP, and Agora are attempts to make agents discoverable, callable, and composable. For now, the paper finds a fragmented protocol landscape: HTTP dominates and JSON-style schemas help, but semantic differences still make seamless interoperation fragile.
Memory is not just chat history with ambition. Short-term, long-term, semantic, procedural, and episodic memory create different operational risks. The moment memory persists across sessions, it becomes a governance object.
Guardrails remain uneven. Some frameworks provide validators, retry logic, schema checks, or flow-level controls. Many still require external safety logic. That matters when agents call APIs, execute code, or touch regulated data.
Service-computing readiness is incomplete. The paper’s strongest business insight is that agent platforms are still mostly task-centric. To become enterprise infrastructure, they need registries, publishing mechanisms, discovery, service contracts, policy layers, and auditable orchestration.

The takeaway for builders is not “pick the coolest agent framework.” It is: design the agent stack as separable infrastructure. Orchestration, communication, memory, guardrails, and service contracts should be modular. Otherwise, you are not building an agent platform. You are building a very polite integration problem.

The expensive mistake is treating agent frameworks as interchangeable plumbing

Most enterprise AI projects begin with a deceptively simple question: which framework should we use?

That question sounds practical. It is also slightly dangerous. It assumes that the framework is the main object of choice. The paper pushes against that assumption by comparing frameworks across architecture, communication, memory, guardrails, applications, and service-computing readiness. The pattern that emerges is not “there are many tools.” Everyone already knew that. The pattern is that each framework defines a different theory of how work should happen.

A role-based framework assumes work resembles a team: researcher, planner, coder, reviewer, manager. A graph-based framework assumes work resembles state transitions: node, edge, condition, retry, branch, rollback. A conversational multi-agent framework assumes work resembles dialogue: agents exchange messages, invoke tools, and converge through interaction. A data-centric framework assumes work begins with retrieval and context management. An enterprise planner assumes work must connect to existing services, skills, policies, and controlled execution.

Those assumptions matter because they decide what is easy, what is visible, and what becomes painful later.

If a company builds a claims-processing agent on a role-based framework, it may move quickly at first. “Verifier agent,” “policy agent,” “customer-response agent,” and “supervisor agent” sound reassuringly organisational. Then the workflow hits its exceptions: the customer uploads a corrupted file, the policy changes mid-process, and a regulator requires an auditable trail of why a denial was issued. Suddenly, the system needs deterministic state, replay, versioned artefacts, error handling, and policy enforcement. A team metaphor is no longer enough. The agent needs service infrastructure.

That is why the paper’s comparison-based structure is useful. It lets us stop asking which framework is “best” and start asking which operational assumption we are buying.

What the paper directly contributes

The paper makes three main contributions.

First, it synthesises modern agentic AI frameworks into a practical taxonomy. The comparison covers orchestration style, communication mechanisms, memory design, guardrail support, and fit with service-oriented computing. This is not a minor bookkeeping exercise. In a market where every tool claims to build “autonomous agents,” taxonomy is an act of hygiene.

Second, it compares emerging communication protocols: MCP, ACP, A2A, ANP, and Agora. The protocol section is important because it shifts the discussion from agent internals to agent ecosystems. A framework helps build an agent. A protocol helps agents find, call, coordinate with, and understand one another.

Third, the paper identifies shared gaps across today’s agentic systems: weak runtime discovery, fragmented abstractions, incomplete guardrail layers, code execution risks, and limited service-contract maturity. This is the part buyers should underline. The paper is not saying the ecosystem is immature in a vague “early days” way. It is naming the missing infrastructure pieces.

Protocols are not about agents “chatting”; they are about contracts

It is tempting to describe agent protocols as ways for agents to talk to each other. That is true in the same sense that banking APIs are ways for banks to “talk.” Technically correct. Operationally useless.

The real purpose of a protocol is to reduce bespoke integration. It should answer questions such as:

What capabilities does this agent expose?
How does another agent discover those capabilities?
What message format is expected?
What artefact is returned?
Which transport layer is used?
How are identity, permissions, and context handled?
Can this interaction be audited, retried, or substituted?
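Those questions can be read as the fields of a capability contract. The following is a minimal sketch under stated assumptions: `CapabilityContract`, `Registry`, and every field name are hypothetical illustrations and are not taken from MCP, ACP, A2A, ANP, Agora, or the paper.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityContract:
    """Hypothetical descriptor; field names are illustrative, not from any protocol spec."""
    name: str                # capability the agent exposes
    version: str             # lets callers pin or substitute implementations
    input_schema: dict       # expected message fields -> type name
    output_artifact: str     # kind of artefact returned
    transport: str           # e.g. "https"
    required_scopes: frozenset = field(default_factory=frozenset)  # permissions
    retryable: bool = False  # whether a failed call may be repeated safely


class Registry:
    """In-memory catalogue so a caller can discover agents by capability name."""

    def __init__(self):
        self._by_name = {}

    def publish(self, agent_id, contract):
        self._by_name.setdefault(contract.name, []).append((agent_id, contract))

    def discover(self, name, caller_scopes):
        # Return only the agents this caller is permitted to invoke.
        granted = frozenset(caller_scopes)
        return [
            (agent_id, contract)
            for agent_id, contract in self._by_name.get(name, [])
            if contract.required_scopes <= granted
        ]


registry = Registry()
registry.publish(
    "claims-verifier-1",  # made-up agent identifier
    CapabilityContract(
        name="verify_claim",
        version="1.2.0",
        input_schema={"claim_id": "string"},
        output_artifact="verification_report",
        transport="https",
        required_scopes=frozenset({"claims:read"}),
        retryable=True,
    ),
)
matches = registry.discover("verify_claim", caller_scopes={"claims:read"})
```

A real protocol would add identity, signing, and a networked registry. The point of the sketch is only that each question in the list maps to something a machine can check before the call is made.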

The paper compares five modern protocol directions.

Protocol	What it is mainly good for	Operational strength	Operational weakness
MCP	Structured LLM-tool integration through a client-server model	Useful for tool access, schema validation, and context exchange	Discovery is comparatively manual; less naturally peer-to-peer
ACP	Cross-agent collaboration through structured messages around goals, actions, and intents	Fits multi-agent collaboration and REST-style integration	Still depends on standard adoption and compatible semantics
A2A	Agent-to-agent coordination through constructs such as Agent Cards, Task Objects, and Artifacts	Stronger capability discovery and enterprise-style handoff	Ecosystem maturity and cross-framework adoption remain open
ANP	Decentralised identity and semantic interoperability using DIDs and JSON-LD-style descriptions	Useful for open networks and agent markets	More complex; relies on semantic agreement and identity infrastructure
Agora	Meta-coordination through machine-interpretable Protocol Documents	Useful where agents may need to choose or construct protocols dynamically	More ambitious; practical adoption is still a major boundary

The paper’s key finding is that these protocols are moving toward service-oriented interoperability, but fragmentation persists. Some use custom performatives. Some use goal-oriented messages. Some emphasise JSON-LD or Protocol Documents. HTTP is common, which helps. Shared transport, however, is not the same as shared meaning. Anyone who has integrated two “RESTful” APIs knows this tragedy well.

The business interpretation is simple: protocol choice should follow the handoff problem. If the main issue is letting models call tools safely, MCP-style integration may be enough. If the issue is cross-team agent coordination, A2A or ACP-style structures become more relevant. If the issue is decentralised discovery and verifiable identity, ANP is more interesting. If the environment may contain multiple protocols, Agora’s meta-layer idea is conceptually attractive.

But attractive architecture is not production evidence. The paper is a comparative review, not a field report proving one protocol’s operational superiority.

Frameworks encode different theories of work

The paper’s framework survey covers a broad ecosystem: AutoGen, CrewAI, LangGraph, Semantic Kernel, Agno, Google ADK, MetaGPT, LlamaIndex, SmolAgents, PydanticAI, and the OpenAI Agents SDK. The important point is not that these names exist. The important point is how differently they make work legible.

Framework family	Typical strength	What to watch
Role/team-based systems such as CrewAI and MetaGPT	Fast conceptual mapping from human teams to agent teams	Static roles can become rigid when tasks evolve
Conversational multi-agent systems such as AutoGen	Rich interaction among agents and tools	Conversation flows can sprawl; code execution needs strong containment
Graph/state-machine systems such as LangGraph	Traceable orchestration, stateful workflows, retry logic, conditional routing	Discovery and publishing still require surrounding infrastructure
Enterprise planner/skill systems such as Semantic Kernel	Integration with structured skills, planners, and enterprise services	Service discovery and policy enforcement may still need external implementation
Data-centric systems such as LlamaIndex	Retrieval and indexed knowledge access for context-heavy tasks	Retrieval quality and memory governance become central risks
Minimal/schema-first systems such as SmolAgents and PydanticAI	Simplicity, transparency, typed modelling	Often require manual attachment of deeper memory, orchestration, or guardrails
Cloud/distributed systems such as Google ADK	Multi-agent orchestration aligned with scalable infrastructure	Experimental status and dependence on surrounding cloud services matter

The survey’s comparison suggests a useful architectural principle: do not choose a framework as if it were an operating system. Choose it as a layer.

LangGraph may be the orchestration spine for workflows that need traceability and recovery. CrewAI may be a useful collaboration layer where role clarity helps humans understand the system. LlamaIndex may handle knowledge-intensive retrieval. Semantic Kernel may serve enterprise skill integration. AutoGen may support rapid multi-agent interaction. The better stack may be compositional, assuming the interfaces are disciplined.

That last phrase is doing work. Without disciplined interfaces, composability becomes duct tape with better branding.

Memory is the continuity layer, not a decorative feature

The paper treats memory as a central comparison point, and rightly so. Memory is where an agent stops being a one-shot function call and starts becoming a system with continuity. That continuity can be valuable. It can also become a liability with excellent recall.

The paper distinguishes several memory types:

Short-term memory keeps immediate task or conversation context.
Long-term memory persists user preferences, task history, or learned knowledge across sessions.
Semantic memory stores concepts, representations, or reusable reasoning structures.
Procedural memory retains task flows, strategies, and learned routines.
Episodic memory preserves contextual snapshots of specific interactions or experiences.

The framework comparison shows uneven support. LangGraph emphasises stateful graph nodes. CrewAI includes agent-level memory with contextual and entity memory. AutoGen supports shared dialogue context. Semantic Kernel provides extensible memory modules. LlamaIndex centres retrieval from indexed data. PydanticAI and SmolAgents are more manual or externalised. Google ADK supports shared memory across system modules. MetaGPT includes implicit memory through role-based behaviour.

For business use, the sharp distinction is not “has memory” versus “does not have memory.” The distinction is whether memory is governed.

A customer-support agent remembering a user’s preferred language is useful. The same agent retaining sensitive complaint details indefinitely is not. A research agent remembering which sources were rejected is useful. The same agent treating a stale rejected source as permanently invalid is not. A coding agent remembering a project’s architecture is useful. The same agent retrieving old credentials from an incident transcript is the sort of feature that makes security teams develop facial twitches.

Memory needs eligibility rules, retention limits, provenance, redaction, versioning, and audit. In enterprise settings, memory should be an API with policy, not an accidental by-product of long context windows.
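What "an API with policy" might look like can be sketched in a few lines. This is an illustration under assumptions, not a design from the paper or from any framework: `GovernedMemory`, the retention table, and the redaction pattern are all hypothetical, and a real deployment would need far more careful redaction than one regular expression.

```python
import re
import time
from dataclasses import dataclass

# Hypothetical policy table: which memory kinds are eligible, and for how long.
RETENTION_SECONDS = {"short_term": 3600, "long_term": 90 * 86400}

# Deliberately crude redaction rule, for illustration only.
SECRET_PATTERN = re.compile(r"(api[_-]?key|password)\s*[:=]\s*\S+", re.IGNORECASE)


@dataclass
class MemoryRecord:
    kind: str
    text: str
    source: str        # provenance: where the memory came from
    created_at: float  # timestamp used to enforce retention


class GovernedMemory:
    def __init__(self, clock=time.time):
        self._records = []
        self._clock = clock
        self.audit_log = []  # every read and write leaves a trace

    def write(self, kind, text, source):
        if kind not in RETENTION_SECONDS:  # eligibility rule
            raise ValueError(f"memory kind not eligible for storage: {kind}")
        redacted = SECRET_PATTERN.sub("[REDACTED]", text)
        self._records.append(MemoryRecord(kind, redacted, source, self._clock()))
        self.audit_log.append(("write", kind, source))

    def read(self, kind):
        now = self._clock()
        live = [
            r for r in self._records
            if r.kind == kind and now - r.created_at <= RETENTION_SECONDS[kind]
        ]
        self.audit_log.append(("read", kind, len(live)))
        return live
```

The clock is injectable so that retention can be tested without waiting ninety days. Versioning and deletion on request are left out to keep the sketch short.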

Guardrails decide whether delegation becomes liability transfer

The paper’s guardrail comparison is one of its most practically important sections. It finds that guardrail support is emerging but uneven. AutoGen, LangGraph, Agno, and the OpenAI Agents SDK are described as having stronger native support through validators, retry logic, flow-level checks, trust layers, or schema validation. CrewAI, MetaGPT, and Google ADK are characterised as partial. LlamaIndex and Semantic Kernel validate at specific stages. SmolAgents prioritises developer control and lacks broad native guardrails.

This does not mean the weaker systems are unusable. It means buyers should stop confusing agent orchestration with agent assurance.

Guardrails in agentic systems should sit at multiple control points:

Control point	Example mechanism	Failure it reduces
Before tool calls	Schema validation, allowlists, permission checks	Wrong tool, wrong argument, wrong authority
During execution	Timeouts, rate limits, sandboxing, streaming monitors	Runaway cost, unsafe code, uncontrolled side effects
After output	Validators, policy checks, typed artefact inspection	Invalid answer, unsafe recommendation, non-compliant response
Across workflow	State checks, replay logs, approval gates	Irrecoverable drift, hidden decision paths, unauditable actions

The paper specifically flags code safety. Frameworks that generate or execute code can create severe risks: filesystem access, shell commands, unsafe imports, or external dependencies. The suggested mitigations are familiar but necessary: sandboxed environments such as Docker with strict capabilities, or restriction to pre-approved pure functions.
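The second mitigation, restriction to pre-approved pure functions, is simple enough to sketch. The names below (`APPROVED_TOOLS`, `guarded_call`, `ToolCallRejected`) are hypothetical, and the allowlist contents are placeholders; the pattern is the point, not the specific tools.

```python
import math

# Hypothetical allowlist: tool name -> (pure callable, expected argument types).
APPROVED_TOOLS = {
    "sqrt": (math.sqrt, (float,)),
    "round_to": (round, (float, int)),
}


class ToolCallRejected(Exception):
    """Raised when a requested call fails a pre-execution check."""


def guarded_call(tool_name, args):
    # Check 1: the tool must be on the allowlist.
    if tool_name not in APPROVED_TOOLS:
        raise ToolCallRejected(f"tool not on allowlist: {tool_name}")
    func, arg_types = APPROVED_TOOLS[tool_name]

    # Check 2: the argument count must match the declared signature.
    if len(args) != len(arg_types):
        raise ToolCallRejected(f"{tool_name} expects {len(arg_types)} argument(s)")

    # Check 3: each argument must have the declared type.
    for value, expected in zip(args, arg_types):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ToolCallRejected(f"argument {value!r} is not {expected.__name__}")

    return func(*args)
```

With this in place, `guarded_call("sqrt", [9.0])` returns `3.0`, while a model-generated request such as `guarded_call("os.system", ["rm -rf /"])` raises `ToolCallRejected` before anything executes. An allowlist of this kind complements sandboxing; it does not replace it.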

This is where agentic AI gets less glamorous and more useful. Production systems are not judged by whether the demo completes the happy path. They are judged by what happens when the agent is wrong, overconfident, under-specified, interrupted, maliciously prompted, or asked to operate on real infrastructure. Delightful little edge cases, all of them.

Service-computing readiness is the paper’s most business-relevant lens

The strongest section of the paper is its service-computing analysis. This is where the survey becomes more than a catalogue of tools.

The paper asks whether agentic AI frameworks are ready to integrate into service-computing ecosystems. That means dynamic discovery, publishing, composition, orchestration, reusable service contracts, policy enforcement, and agreement-like behaviour. In plain English: can agents behave like enterprise services rather than isolated assistants?

The answer is: partially.

Semantic Kernel and Google ADK are presented as relatively strong in composition, but still dependent on external registries and orchestration layers for full service-computing behaviour. LangGraph has strong composition patterns through its state-machine abstraction and extensibility hooks, but discovery still requires adapters or catalogues. CrewAI, AutoGen, Agno, and MetaGPT support useful planning or collaboration patterns, but need auxiliary registries or service wrappers to participate in dynamic service ecosystems.

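The paper also maps older web-service concepts, presented under a W3C label (though several of the WS-* specifications were standardised through OASIS or the Open Grid Forum), into the agentic AI context: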

Service-computing idea	Agentic AI translation	Why it matters
WSDL-like descriptions	Function, tool, or agent capability contracts	Agents need discoverable, versioned interfaces
BPEL-like orchestration	Explicit workflow sequences among agents	Agent processes need replayable structure and error semantics
WS-Policy-style constraints	Runtime parameters and behavioural policies	Agents need enforceable operating conditions
WS-Security-style mechanisms	Signed, authenticated, encrypted messages	Cross-agent communication needs trust and provenance
WS-Coordination	Sessions, roles, shared context, turn-taking	Multi-agent systems need managed interaction state
WS-Agreement	SLA or quality terms for agent selection	Agent delegation should consider reliability, latency, and cost

No one should read this as a call to revive the full ceremony of early-2000s enterprise middleware. Nobody needs more XML nostalgia in their life. The useful point is that the business problems solved by service computing never disappeared. We just renamed them after adding LLMs.

Agents need contracts. Agents need registries. Agents need policy. Agents need runtime coordination. Agents need auditable agreements about what they can do and what guarantees they offer. The agent ecosystem is rediscovering this, occasionally with the confidence of someone inventing plumbing in a house already full of pipes.
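The WS-Agreement row of the mapping above, quality terms for agent selection, can be made concrete with a small sketch. Everything here is assumed for illustration: `AgentOffer`, its fields, and the selection rule (cheapest offer that meets the reliability and latency thresholds) are hypothetical and do not come from the paper or from the WS-Agreement specification.

```python
from dataclasses import dataclass


@dataclass
class AgentOffer:
    """Hypothetical advertised quality terms for one agent."""
    agent_id: str
    success_rate: float    # fraction of tasks completed, 0.0 to 1.0
    p95_latency_ms: int    # 95th-percentile response time
    cost_per_call: float   # in whatever currency the registry uses


def select_agent(offers, min_success_rate, max_latency_ms):
    """Pick the cheapest offer that satisfies both quality thresholds."""
    eligible = [
        o for o in offers
        if o.success_rate >= min_success_rate and o.p95_latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None  # no agent meets the terms; the caller must escalate or relax them
    return min(eligible, key=lambda o: o.cost_per_call)


offers = [
    AgentOffer("fast-but-flaky", success_rate=0.90, p95_latency_ms=300, cost_per_call=0.01),
    AgentOffer("steady", success_rate=0.99, p95_latency_ms=800, cost_per_call=0.03),
    AgentOffer("premium", success_rate=0.995, p95_latency_ms=400, cost_per_call=0.10),
]
chosen = select_agent(offers, min_success_rate=0.98, max_latency_ms=1000)
```

Here `chosen` is the "steady" offer: it is the cheapest of the two that clear the reliability bar. Advertised terms are only useful if something measures whether they are met, which is where audit logs come back in.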

How to read the paper’s evidence

Because this paper is a systematic review and comparative analysis, its evidence is mostly classificatory and architectural. It does not run benchmark experiments, ablations, or controlled performance comparisons. That matters for interpretation.

Paper element	Likely purpose	What it supports	What it does not prove
Traditional vs modern agent comparison	Background taxonomy	Shows how LLM-powered agents differ from classical agents in autonomy, tools, memory, and context	Does not prove modern agents are more reliable in production
Agent communication protocol table	Main comparative evidence	Compares MCP, ACP, A2A, ANP, and Agora across format, semantics, discovery, transport, and use case	Does not prove one protocol will dominate or interoperate flawlessly
Framework design taxonomy figure	Conceptual synthesis	Organises major framework design patterns	Does not validate framework performance
Unified class model figure	Implementation-level abstraction	Extracts common structural components across frameworks	Does not guarantee these abstractions are sufficient for deployment
Memory support table	Main comparative evidence	Shows how frameworks differ across short-term, long-term, semantic, procedural, and episodic memory	Does not measure memory quality, retrieval accuracy, or privacy risk
Guardrail discussion	Main architectural finding	Identifies uneven native safety support and need for external enforcement	Does not evaluate guardrails under adversarial testing
Service-computing compatibility table	Main business-relevant evidence	Assesses discovery, publishing, and composition readiness	Does not prove enterprise readiness without pilots
W3C adaptation table	Exploratory architectural extension	Connects agent infrastructure to established service-computing concepts	Does not mean W3C standards are already implemented uniformly
CrewAI code listing	Implementation detail	Makes a role-based agent example concrete	Does not establish general framework superiority

This distinction is not academic pedantry. It changes how a CTO should use the paper. The survey is a map, not a stopwatch. It helps narrow architectural options and expose gaps. It should not be used as a procurement verdict by itself.

The buyer’s checklist: match the stack to the failure mode

The paper becomes most useful when translated into failure modes.

If the workflow fails because handoffs are ambiguous, prioritise protocols and typed artefacts. A2A-style constructs, ACP-style goal messages, or MCP plus a registry may matter more than the agent framework itself.

If the workflow fails because the agent loses context, prioritise memory design. Decide what belongs in short-term state, durable user memory, semantic retrieval, procedural playbooks, and episodic incident records. Then govern each layer separately.

If the workflow fails because execution is unsafe, prioritise guardrails and sandboxing. Validators, approval gates, deterministic policies, and code isolation should be built into the workflow boundary, not added after the first incident email.

If the workflow fails because it cannot be audited, prioritise graph/state orchestration and artefact lineage. A pretty conversation transcript is not the same as an operational record.

If the workflow fails because teams cannot reuse components, prioritise service contracts. Agents should publish capabilities, versions, inputs, outputs, constraints, and ownership. Otherwise, every new integration becomes artisanal middleware. Charming, perhaps. Scalable, no.
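For the auditability case in this checklist, the difference between a transcript and an operational record can be sketched as an append-only transition log that can be replayed and re-checked. The workflow states, `AuditedWorkflow`, and `replay` are hypothetical; no framework named in the paper is being reproduced here.

```python
import json

# Hypothetical claims workflow: allowed transitions between named states.
ALLOWED = {
    "received": {"verified", "rejected"},
    "verified": {"approved", "denied"},
}


class AuditedWorkflow:
    def __init__(self, initial="received"):
        self.state = initial
        self.log = []  # append-only list of JSON lines

    def transition(self, new_state, reason, policy_version):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        # Record why the step happened and under which policy version.
        self.log.append(json.dumps({
            "from": self.state,
            "to": new_state,
            "reason": reason,
            "policy_version": policy_version,
        }))
        self.state = new_state


def replay(log, initial="received"):
    """Rebuild the final state from the log alone, re-checking every step."""
    state = initial
    for line in log:
        entry = json.loads(line)
        if entry["from"] != state or entry["to"] not in ALLOWED.get(state, set()):
            raise ValueError("log is inconsistent with the workflow definition")
        state = entry["to"]
    return state
```

If a regulator asks why a claim was denied, the answer is a log entry with a reason and a policy version, and `replay` can confirm that the recorded path was one the workflow definition actually permits.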

What Cognaptus infers for business use

The paper directly shows that agentic AI frameworks differ materially across architecture, memory, communication, guardrails, and service-computing alignment. It also shows that modern protocols are emerging to address interoperability, though fragmentation remains. It identifies runtime discovery, code safety, static roles, and incompatible abstractions as major limitations.

From that, Cognaptus infers three practical design principles.

First, separate the orchestration spine from the collaboration layer. A graph or state-machine layer is often better for traceability, retries, and operational control. Role-based or conversational agents can still be useful, but they should not be the only representation of the workflow.

Second, treat memory as governed infrastructure. Memory should have policies for storage, retrieval, retention, deletion, and jurisdiction. This is especially true in finance, healthcare, customer support, legal operations, and internal enterprise automation.

Third, design for protocol translation early. The protocol ecosystem is not settled. An architecture that binds every tool, agent, and workflow to one framework’s internal abstraction is betting against interoperability. A service wrapper, registry, and typed artefact layer provide optionality.
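The third principle can be illustrated with a thin adapter layer: one internal typed request, several wire encodings. This is a sketch under assumptions. `TaskRequest` and both wire shapes are invented for illustration and deliberately do not claim to match the message format of MCP, A2A, or any other real protocol.

```python
from dataclasses import dataclass


@dataclass
class TaskRequest:
    """Hypothetical internal representation, independent of any wire protocol."""
    capability: str
    payload: dict
    trace_id: str


def to_rpc_style(req):
    # Illustrative RPC-like shape: method plus parameters.
    return {"method": req.capability, "params": req.payload, "id": req.trace_id}


def to_task_style(req):
    # Illustrative task-object shape: typed task plus metadata.
    return {
        "task": {"type": req.capability, "input": req.payload},
        "metadata": {"trace_id": req.trace_id},
    }


# Supporting a new protocol means adding one adapter, not rewriting the agents.
ADAPTERS = {"rpc": to_rpc_style, "task": to_task_style}


def encode(req, protocol):
    try:
        adapter = ADAPTERS[protocol]
    except KeyError:
        raise ValueError(f"no adapter registered for protocol: {protocol}") from None
    return adapter(req)
```

Both encodings could travel over HTTP as JSON and still disagree about where the capability name lives, which is the shared-transport, unshared-meaning problem in miniature.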

The uncertain part is magnitude. The paper does not measure cost savings, latency, success rates, handoff failure rates, incident reduction, or developer productivity. Those have to be tested in the target workflow. A bank’s compliance-review agent, a logistics exception-handler, and a software-engineering copilot will stress different parts of the stack.

The boundary: this is not a leaderboard

The paper’s limitation is also its usefulness. It is not pretending to be an empirical benchmark. There are no controlled experiments showing that one framework improves task success by a particular percentage. There are no ablations isolating memory, guardrails, or protocol choice. There are no stress tests under adversarial prompts, heavy concurrency, changing APIs, or regulated audit requirements.

That means the framework comparisons should be read as structured intelligence, not final judgement.

A vendor may update its guardrails after the paper’s documentation snapshot. A protocol may gain adoption or fade. A framework that looks incomplete in service discovery may become viable when paired with an external registry. A lightweight framework may outperform a heavier one for a narrow internal workflow. Context still has the annoying habit of mattering.

For enterprise readers, the right next step is not to crown a winner. It is to run a pilot with explicit measurements:

task completion rate;
handoff failure rate;
invalid tool-call rate;
retrieval accuracy;
policy violation rate;
cost per successful task;
time to debug failure;
audit replay completeness;
human approval burden;
incident severity under adversarial prompts.

That is where architecture becomes evidence.
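Three of the measurements listed above reduce to simple ratios once each pilot run is recorded. The sketch below assumes a hypothetical `TaskRun` record; the field names and the choice of which metrics to compute are illustrative, and none of the numbers come from the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Hypothetical per-task record collected during a pilot."""
    completed: bool
    tool_calls: int
    invalid_tool_calls: int
    cost_usd: float


def pilot_metrics(runs):
    total = len(runs)
    successes = sum(1 for r in runs if r.completed)
    calls = sum(r.tool_calls for r in runs)
    invalid = sum(r.invalid_tool_calls for r in runs)
    spend = sum(r.cost_usd for r in runs)
    # A metric is None when its denominator is zero, rather than a misleading 0.
    return {
        "task_completion_rate": successes / total if total else None,
        "invalid_tool_call_rate": invalid / calls if calls else None,
        "cost_per_successful_task": spend / successes if successes else None,
    }
```

Note that cost per successful task divides all spend, including the spend on failed runs, by the number of successes. That is the version a finance team will ask for.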

Conclusion: agents become useful when they become boring infrastructure

Agentic AI is often sold as autonomy. The paper’s quieter lesson is that autonomy needs plumbing.

Protocols make agents discoverable and callable. Memory makes them continuous. Guardrails make them safe enough to delegate to. Service contracts make them reusable. Registries make them composable. Orchestration makes them debuggable. Without those layers, an “agent platform” is mostly a prompt, a tool list, and optimism wearing a lanyard.

The useful future of agentic AI will not be defined by the framework with the loudest launch post. It will be defined by systems that can coordinate across boundaries, remember only what they should, expose stable contracts, execute safely, and fail in ways humans can inspect.

That is less cinematic than a swarm of autonomous digital workers. It is also much closer to how real businesses buy, deploy, and trust infrastructure.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hana Derouiche, Zaki Brahmi, and Haithem Mazeni, “Agentic AI Frameworks: Architectures, Protocols, and Design Challenges,” arXiv:2508.10146, 2025. https://arxiv.org/abs/2508.10146
