TL;DR for operators
An agent demo usually fails in production for boring reasons. Not because the model suddenly forgot how to reason. Because the agent cannot reliably discover another agent, remember the right state, expose a stable contract, validate risky outputs, or execute generated code without turning the server into an involuntary escape room.
That is the useful reading of Derouiche, Brahmi, and Mazeni’s survey of agentic AI frameworks [1]. The paper is not a leaderboard. It does not say that LangGraph beats CrewAI, or that A2A will conquer MCP, or that one framework has achieved enterprise enlightenment after three GitHub stars and a conference booth. It does something more useful: it maps the components that determine whether agentic AI becomes infrastructure or remains a collection of clever scripts.
The operational message is blunt:
- Framework choice is workflow choice. CrewAI, AutoGen, LangGraph, Semantic Kernel, Agno, Google ADK, LlamaIndex, MetaGPT, SmolAgents, PydanticAI, and the OpenAI Agents SDK are not interchangeable skins over the same idea. They encode different assumptions about roles, graphs, memory, tools, orchestration, and control.
- Protocols are becoming the real battleground. MCP, ACP, A2A, ANP, and Agora are attempts to make agents discoverable, callable, and composable. The paper finds that today's protocol landscape is fragmented: HTTP dominates and JSON-style schemas help, but semantic differences still make seamless interoperation fragile.
- Memory is not just chat history with ambition. Short-term, long-term, semantic, procedural, and episodic memory create different operational risks. The moment memory persists across sessions, it becomes a governance object.
- Guardrails remain uneven. Some frameworks provide validators, retry logic, schema checks, or flow-level controls. Many still require external safety logic. That matters when agents call APIs, execute code, or touch regulated data.
- Service-computing readiness is incomplete. The paper’s strongest business insight is that agent platforms are still mostly task-centric. To become enterprise infrastructure, they need registries, publishing mechanisms, discovery, service contracts, policy layers, and auditable orchestration.
The takeaway for builders is not “pick the coolest agent framework.” It is: design the agent stack as separable infrastructure. Orchestration, communication, memory, guardrails, and service contracts should be modular. Otherwise, you are not building an agent platform. You are building a very polite integration problem.
The expensive mistake is treating agent frameworks as interchangeable plumbing
Most enterprise AI projects begin with a deceptively simple question: which framework should we use?
That question sounds practical. It is also slightly dangerous. It assumes that the framework is the main object of choice. The paper pushes against that assumption by comparing frameworks across architecture, communication, memory, guardrails, applications, and service-computing readiness. The pattern that emerges is not “there are many tools.” Everyone already knew that. The pattern is that each framework defines a different theory of how work should happen.
A role-based framework assumes work resembles a team: researcher, planner, coder, reviewer, manager. A graph-based framework assumes work resembles state transitions: node, edge, condition, retry, branch, rollback. A conversational multi-agent framework assumes work resembles dialogue: agents exchange messages, invoke tools, and converge through interaction. A data-centric framework assumes work begins with retrieval and context management. An enterprise planner assumes work must connect to existing services, skills, policies, and controlled execution.
Those assumptions matter because they decide what is easy, what is visible, and what becomes painful later.
If a company builds a claims-processing agent on a role-based framework, it may move quickly at first. “Verifier agent,” “policy agent,” “customer-response agent,” and “supervisor agent” sound reassuringly organisational. Then the workflow hits its exceptions: the customer uploads a corrupted file, the policy changes mid-process, and a regulator requires an auditable trail of why the denial was issued. Suddenly, the system needs deterministic state, replay, versioned artefacts, error handling, and policy enforcement. A team metaphor is no longer enough. The agent needs service infrastructure.
That is why the paper’s comparison-based structure is useful. It lets us stop asking which framework is “best” and start asking which operational assumption we are buying.
What the paper directly contributes
The paper makes three main contributions.
First, it synthesises modern agentic AI frameworks into a practical taxonomy. The comparison covers orchestration style, communication mechanisms, memory design, guardrail support, and fit with service-oriented computing. This is not a minor bookkeeping exercise. In a market where every tool claims to build “autonomous agents,” taxonomy is an act of hygiene.
Second, it compares emerging communication protocols: MCP, ACP, A2A, ANP, and Agora. The protocol section is important because it shifts the discussion from agent internals to agent ecosystems. A framework helps build an agent. A protocol helps agents find, call, coordinate with, and understand one another.
Third, the paper identifies shared gaps across today’s agentic systems: weak runtime discovery, fragmented abstractions, incomplete guardrail layers, code execution risks, and limited service-contract maturity. This is the part buyers should underline. The paper is not saying the ecosystem is immature in a vague “early days” way. It is naming the missing infrastructure pieces.
Protocols are not about agents “chatting”; they are about contracts
It is tempting to describe agent protocols as ways for agents to talk to each other. That is true in the same sense that banking APIs are ways for banks to “talk.” Technically correct. Operationally useless.
The real purpose of a protocol is to reduce bespoke integration. It should answer questions such as:
- What capabilities does this agent expose?
- How does another agent discover those capabilities?
- What message format is expected?
- What artefact is returned?
- Which transport layer is used?
- How are identity, permissions, and context handled?
- Can this interaction be audited, retried, or substituted?
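One way to make the questions above concrete is to write them down as fields of a capability contract. The following is a minimal sketch, not any protocol's actual schema: `CapabilityContract`, its field names, the `summarise_claim` capability, and the `claims:read` scope are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityContract:
    """Hypothetical description of one capability an agent exposes."""
    name: str                     # what the agent can do
    version: str                  # lets callers pin or substitute implementations
    input_schema: dict            # expected message format (JSON-Schema-like)
    output_artifact: str          # kind of artefact returned
    transport: str                # transport layer, e.g. "https"
    required_scopes: tuple = ()   # permissions the caller must hold
    auditable: bool = True        # whether calls are logged for audit
    retryable: bool = False       # whether a failed call may be retried safely


def can_call(contract: CapabilityContract, caller_scopes: set) -> bool:
    """A caller may invoke the capability only if it holds every required scope."""
    return set(contract.required_scopes) <= caller_scopes


# Illustrative contract for an assumed claims-summary capability.
summarise = CapabilityContract(
    name="summarise_claim",
    version="1.2.0",
    input_schema={"type": "object", "required": ["claim_id"]},
    output_artifact="ClaimSummary",
    transport="https",
    required_scopes=("claims:read",),
    retryable=True,
)

print(can_call(summarise, {"claims:read"}))  # caller holds the required scope
print(can_call(summarise, set()))            # caller holds no scopes
```

The point of the sketch is that each question in the list maps to a field someone has to fill in, which is roughly what a protocol standardises.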
The paper compares five modern protocol directions.
| Protocol | What it is mainly good for | Operational strength | Operational weakness |
|---|---|---|---|
| MCP | Structured LLM-tool integration through a client-server model | Useful for tool access, schema validation, and context exchange | Discovery is comparatively manual; less naturally peer-to-peer |
| ACP | Cross-agent collaboration through structured messages around goals, actions, and intents | Fits multi-agent collaboration and REST-style integration | Still depends on standard adoption and compatible semantics |
| A2A | Agent-to-agent coordination through constructs such as Agent Cards, Task Objects, and Artifacts | Stronger capability discovery and enterprise-style handoff | Ecosystem maturity and cross-framework adoption remain open |
| ANP | Decentralised identity and semantic interoperability using DIDs and JSON-LD-style descriptions | Useful for open networks and agent markets | More complex; relies on semantic agreement and identity infrastructure |
| Agora | Meta-coordination through machine-interpretable Protocol Documents | Useful where agents may need to choose or construct protocols dynamically | More ambitious; practical adoption is still a major boundary |
The paper’s key finding is that these protocols are moving toward service-oriented interoperability, but fragmentation persists. Some use custom performatives. Some use goal-oriented messages. Some emphasise JSON-LD or Protocol Documents. HTTP is common, which helps. Shared transport, however, is not the same as shared meaning. Anyone who has integrated two “RESTful” APIs knows this tragedy well.
The business interpretation is simple: protocol choice should follow the handoff problem. If the main issue is letting models call tools safely, MCP-style integration may be enough. If the issue is cross-team agent coordination, A2A or ACP-style structures become more relevant. If the issue is decentralised discovery and verifiable identity, ANP is more interesting. If the environment may contain multiple protocols, Agora’s meta-layer idea is conceptually attractive.
But attractive architecture is not production evidence. The paper is a comparative review, not a field report proving one protocol’s operational superiority.
Frameworks encode different theories of work
The paper’s framework survey covers a broad ecosystem: AutoGen, CrewAI, LangGraph, Semantic Kernel, Agno, Google ADK, MetaGPT, LlamaIndex, SmolAgents, PydanticAI, and the OpenAI Agents SDK. The important point is not that these names exist. The important point is how differently they make work legible.
| Framework family | Typical strength | What to watch |
|---|---|---|
| Role/team-based systems such as CrewAI and MetaGPT | Fast conceptual mapping from human teams to agent teams | Static roles can become rigid when tasks evolve |
| Conversational multi-agent systems such as AutoGen | Rich interaction among agents and tools | Conversation flows can sprawl; code execution needs strong containment |
| Graph/state-machine systems such as LangGraph | Traceable orchestration, stateful workflows, retry logic, conditional routing | Discovery and publishing still require surrounding infrastructure |
| Enterprise planner/skill systems such as Semantic Kernel | Integration with structured skills, planners, and enterprise services | Service discovery and policy enforcement may still need external implementation |
| Data-centric systems such as LlamaIndex | Retrieval and indexed knowledge access for context-heavy tasks | Retrieval quality and memory governance become central risks |
| Minimal/schema-first systems such as SmolAgents and PydanticAI | Simplicity, transparency, typed modelling | Often require manual attachment of deeper memory, orchestration, or guardrails |
| Cloud/distributed systems such as Google ADK | Multi-agent orchestration aligned with scalable infrastructure | Experimental status and dependence on surrounding cloud services matter |
The survey’s comparison suggests a useful architectural principle: do not choose a framework as if it were an operating system. Choose it as a layer.
LangGraph may be the orchestration spine for workflows that need traceability and recovery. CrewAI may be a useful collaboration layer where role clarity helps humans understand the system. LlamaIndex may handle knowledge-intensive retrieval. Semantic Kernel may serve enterprise skill integration. AutoGen may support rapid multi-agent interaction. The better stack may be compositional, assuming the interfaces are disciplined.
That last phrase is doing work. Without disciplined interfaces, composability becomes duct tape with better branding.
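To illustrate what a graph-style orchestration spine buys in traceability, here is a minimal sketch of a state machine that records the path it takes. It is plain Python and does not use LangGraph's API; `run_graph` and the node names `retrieve`, `draft`, and `review` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = dict
# Each node takes the state and returns (updated state, next node name or None).
Node = Callable[[State], Tuple[State, Optional[str]]]


def run_graph(nodes: Dict[str, Node], start: str, state: State,
              max_steps: int = 20) -> Tuple[State, List[str]]:
    """Run nodes until one returns None as the next step; record the path taken."""
    trace: List[str] = []
    current: Optional[str] = start
    while current is not None:
        if len(trace) >= max_steps:
            raise RuntimeError("step budget exceeded; possible loop")
        trace.append(current)
        # Pass a shallow copy so each node works on its own top-level dict.
        state, current = nodes[current](dict(state))
    return state, trace


def retrieve(state: State):
    state["docs"] = ["policy-v3"]
    return state, "draft"


def draft(state: State):
    state["attempts"] = state.get("attempts", 0) + 1
    state["answer"] = f"draft {state['attempts']} using {state['docs'][0]}"
    return state, "review"


def review(state: State):
    # Conditional routing: send the first draft back once, then accept.
    if state["attempts"] < 2:
        return state, "draft"
    state["approved"] = True
    return state, None


final_state, trace = run_graph(
    {"retrieve": retrieve, "draft": draft, "review": review}, "retrieve", {})
print(trace)
```

The recorded `trace` is the operational record: it shows which nodes ran, in what order, and that the review step routed back to drafting once before approving.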
Memory is the continuity layer, not a decorative feature
The paper treats memory as a central comparison point, and rightly so. Memory is where an agent stops being a one-shot function call and starts becoming a system with continuity. That continuity can be valuable. It can also become a liability with excellent recall.
The paper distinguishes several memory types:
- Short-term memory keeps immediate task or conversation context.
- Long-term memory persists user preferences, task history, or learned knowledge across sessions.
- Semantic memory stores concepts, representations, or reusable reasoning structures.
- Procedural memory retains task flows, strategies, and learned routines.
- Episodic memory preserves contextual snapshots of specific interactions or experiences.
The framework comparison shows uneven support. LangGraph emphasises stateful graph nodes. CrewAI includes agent-level memory with contextual and entity memory. AutoGen supports shared dialogue context. Semantic Kernel provides extensible memory modules. LlamaIndex centres retrieval from indexed data. PydanticAI and SmolAgents are more manual or externalised. Google ADK supports shared memory across system modules. MetaGPT includes implicit memory through role-based behaviour.
For business use, the sharp distinction is not “has memory” versus “does not have memory.” The distinction is whether memory is governed.
A customer-support agent remembering a user’s preferred language is useful. The same agent retaining sensitive complaint details indefinitely is not. A research agent remembering which sources were rejected is useful. The same agent treating a stale rejected source as permanently invalid is not. A coding agent remembering a project’s architecture is useful. The same agent retrieving old credentials from an incident transcript is the sort of feature that makes security teams develop facial twitches.
Memory needs eligibility rules, retention limits, provenance, redaction, versioning, and audit. In enterprise settings, memory should be an API with policy, not an accidental by-product of long context windows.
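As a rough illustration of "memory as an API with policy", the sketch below applies eligibility rules, redaction, provenance, retention limits, and an audit log at the memory boundary. Everything here is an assumption for illustration: the `GovernedMemory` class, the retention periods, and the deliberately naive redaction pattern do not come from the paper or from any framework.

```python
import re
from dataclasses import dataclass
from typing import List

DAY = 24 * 3600
# Assumed policy: only these memory kinds may persist, each with a retention limit.
RETENTION_SECONDS = {"preference": 365 * DAY, "episodic": 30 * DAY}
# Illustrative redaction rule only; real deployments need far broader detection.
SECRET_PATTERN = re.compile(
    r"(password|api[_-]?key|token)\s*[:=]\s*\S+", re.IGNORECASE)


@dataclass(frozen=True)
class MemoryRecord:
    kind: str
    text: str
    source: str       # provenance: where the memory came from
    stored_at: float  # epoch seconds


class GovernedMemory:
    def __init__(self) -> None:
        self._records: List[MemoryRecord] = []
        self.audit_log: List[str] = []

    def write(self, kind: str, text: str, source: str, now: float) -> bool:
        # Eligibility: refuse kinds that have no retention policy.
        if kind not in RETENTION_SECONDS:
            self.audit_log.append(f"rejected kind={kind} source={source}")
            return False
        redacted = SECRET_PATTERN.sub("[REDACTED]", text)
        self._records.append(MemoryRecord(kind, redacted, source, now))
        self.audit_log.append(f"stored kind={kind} source={source}")
        return True

    def read(self, kind: str, now: float) -> List[MemoryRecord]:
        # Retention is enforced lazily: expired records are dropped before reads.
        self._records = [r for r in self._records
                         if now - r.stored_at <= RETENTION_SECONDS[r.kind]]
        self.audit_log.append(f"read kind={kind}")
        return [r for r in self._records if r.kind == kind]


mem = GovernedMemory()
mem.write("preference", "language=fr", "chat-1", now=0.0)
mem.write("episodic", "user said password: hunter2 then left", "chat-2", now=0.0)
print(mem.read("episodic", now=10.0)[0].text)
```

The design choice worth noting is that the agent never touches storage directly; every write and read passes through one place where policy can be changed and audited.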
Guardrails decide whether delegation becomes liability transfer
The paper’s guardrail comparison is one of its most practically important sections. It finds that guardrail support is emerging but uneven. AutoGen, LangGraph, Agno, and the OpenAI Agents SDK are described as having stronger native support through validators, retry logic, flow-level checks, trust layers, or schema validation. CrewAI, MetaGPT, and Google ADK are characterised as partial. LlamaIndex and Semantic Kernel validate at specific stages. SmolAgents prioritises developer control and lacks broad native guardrails.
This does not mean the weaker systems are unusable. It means buyers should stop confusing agent orchestration with agent assurance.
Guardrails in agentic systems should sit at multiple control points:
| Control point | Example mechanism | Failure it reduces |
|---|---|---|
| Before tool calls | Schema validation, allowlists, permission checks | Wrong tool, wrong argument, wrong authority |
| During execution | Timeouts, rate limits, sandboxing, streaming monitors | Runaway cost, unsafe code, uncontrolled side effects |
| After output | Validators, policy checks, typed artefact inspection | Invalid answer, unsafe recommendation, non-compliant response |
| Across workflow | State checks, replay logs, approval gates | Irrecoverable drift, hidden decision paths, unauditable actions |
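The first and third rows of the table can be sketched as a single wrapper around tool invocation. This is a minimal illustration under stated assumptions: the `TOOLS` registry, the `lookup_order` tool, the `orders:read` permission, and `guarded_call` are hypothetical and not part of any framework discussed in the paper.

```python
from typing import Any, Callable, Dict, Set


class GuardrailError(Exception):
    """Raised when a call is blocked before or after execution."""


def lookup_order(order_id: str) -> dict:
    # Stand-in for a real tool; returns a fixed record for illustration.
    return {"order_id": order_id, "status": "shipped"}


# Hypothetical allowlist: tool name -> function, argument types, permission.
TOOLS: Dict[str, Dict[str, Any]] = {
    "lookup_order": {
        "fn": lookup_order,
        "args": {"order_id": str},
        "permission": "orders:read",
    },
}


def guarded_call(tool: str, args: Dict[str, Any], permissions: Set[str],
                 validate_output: Callable[[Any], bool]) -> Any:
    # Before the call: allowlist, permission check, argument schema.
    spec = TOOLS.get(tool)
    if spec is None:
        raise GuardrailError(f"tool not allowlisted: {tool}")
    if spec["permission"] not in permissions:
        raise GuardrailError(f"missing permission: {spec['permission']}")
    if set(args) != set(spec["args"]):
        raise GuardrailError("unexpected or missing arguments")
    for name, expected_type in spec["args"].items():
        if not isinstance(args[name], expected_type):
            raise GuardrailError(f"wrong type for argument: {name}")
    result = spec["fn"](**args)
    # After the call: validate the output before it re-enters the workflow.
    if not validate_output(result):
        raise GuardrailError("output failed validation")
    return result


def status_is_known(result: dict) -> bool:
    return result.get("status") in {"shipped", "pending"}


print(guarded_call("lookup_order", {"order_id": "A1"},
                   {"orders:read"}, status_is_known))
```

Timeouts, sandboxing, and workflow-level approval gates are left out here; they belong to the "during execution" and "across workflow" rows and usually live outside the function that makes the call.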
The paper specifically flags code safety. Frameworks that generate or execute code can create severe risks: filesystem access, shell commands, unsafe imports, or external dependencies. The suggested mitigations are familiar but necessary: sandboxed environments such as Docker containers with tightly restricted capabilities, or restriction to pre-approved pure functions.
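The "pre-approved pure functions" option can be sketched as an evaluator that accepts only numeric literals, basic arithmetic, and an allowlist of functions, and rejects everything else. This is an illustrative sketch, not the paper's implementation; `safe_eval` and the contents of the allowlist are assumptions, and an evaluator like this complements a sandbox rather than replacing one.

```python
import ast
import math
import operator

# Assumed allowlist of side-effect-free functions and operators.
ALLOWED_FUNCS = {"sqrt": math.sqrt, "abs": abs, "min": min, "max": max}
ALLOWED_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str):
    """Evaluate an arithmetic expression using only allowlisted constructs."""

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        # Only plain numbers are accepted as literals (no strings, no booleans).
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_BINOPS:
            return ALLOWED_BINOPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        # Calls are allowed only to bare names present in the allowlist.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in ALLOWED_FUNCS
                and not node.keywords):
            return ALLOWED_FUNCS[node.func.id](*[ev(a) for a in node.args])
        raise ValueError(f"disallowed expression element: {type(node).__name__}")

    return ev(ast.parse(expr, mode="eval"))


print(safe_eval("sqrt(16) + max(2, 3) * 2"))  # 10.0
```

Attribute access, imports, and calls to anything outside the allowlist, for example `__import__('os').system('ls')` or `open('x')`, raise `ValueError` because no branch of the evaluator handles them.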
This is where agentic AI gets less glamorous and more useful. Production systems are not judged by whether the demo completes the happy path. They are judged by what happens when the agent is wrong, overconfident, under-specified, interrupted, maliciously prompted, or asked to operate on real infrastructure. Delightful little edge cases, all of them.
Service-computing readiness is the paper’s most business-relevant lens
The strongest section of the paper is its service-computing analysis. This is where the survey becomes more than a catalogue of tools.
The paper asks whether agentic AI frameworks are ready to integrate into service-computing ecosystems. That means dynamic discovery, publishing, composition, orchestration, reusable service contracts, policy enforcement, and agreement-like behaviour. In plain English: can agents behave like enterprise services rather than isolated assistants?
The answer is: partially.
Semantic Kernel and Google ADK are presented as relatively strong in composition, but still dependent on external registries and orchestration layers for full service-computing behaviour. LangGraph has strong composition patterns through its state-machine abstraction and extensibility hooks, but discovery still requires adapters or catalogues. CrewAI, AutoGen, Agno, and MetaGPT support useful planning or collaboration patterns, but need auxiliary registries or service wrappers to participate in dynamic service ecosystems.
The paper also maps older W3C service concepts into the agentic AI context:
| Service-computing idea | Agentic AI translation | Why it matters |
|---|---|---|
| WSDL-like descriptions | Function, tool, or agent capability contracts | Agents need discoverable, versioned interfaces |
| BPEL-like orchestration | Explicit workflow sequences among agents | Agent processes need replayable structure and error semantics |
| WS-Policy-style constraints | Runtime parameters and behavioural policies | Agents need enforceable operating conditions |
| WS-Security-style mechanisms | Signed, authenticated, encrypted messages | Cross-agent communication needs trust and provenance |
| WS-Coordination | Sessions, roles, shared context, turn-taking | Multi-agent systems need managed interaction state |
| WS-Agreement | SLA or quality terms for agent selection | Agent delegation should consider reliability, latency, and cost |
No one should read this as a call to revive the full ceremony of early-2000s enterprise middleware. Nobody needs more XML nostalgia in their life. The useful point is that the business problems solved by service computing never disappeared. We just renamed them after adding LLMs.
Agents need contracts. Agents need registries. Agents need policy. Agents need runtime coordination. Agents need auditable agreements about what they can do and what guarantees they offer. The agent ecosystem is rediscovering this, occasionally with the confidence of someone inventing plumbing in a house already full of pipes.
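A minimal sketch of what "registries" and "agreements" might look like in code: agents publish listings, callers discover them by capability, and selection applies agreement-style thresholds before choosing on cost. The `Registry` and `AgentListing` names, the fields, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AgentListing:
    agent_id: str
    capability: str
    version: str
    p95_latency_ms: int    # advertised latency
    cost_per_call: float   # advertised cost, in arbitrary units
    success_rate: float    # advertised reliability, 0.0 to 1.0


class Registry:
    def __init__(self) -> None:
        self._listings: List[AgentListing] = []

    def publish(self, listing: AgentListing) -> None:
        self._listings.append(listing)

    def discover(self, capability: str) -> List[AgentListing]:
        return [l for l in self._listings if l.capability == capability]

    def select(self, capability: str, max_latency_ms: int,
               min_success_rate: float) -> Optional[AgentListing]:
        """Pick the cheapest agent that meets the latency and reliability terms."""
        eligible = [l for l in self.discover(capability)
                    if l.p95_latency_ms <= max_latency_ms
                    and l.success_rate >= min_success_rate]
        return min(eligible, key=lambda l: l.cost_per_call, default=None)


registry = Registry()
registry.publish(AgentListing("ocr-fast", "extract_text", "2.0", 300, 0.05, 0.97))
registry.publish(AgentListing("ocr-cheap", "extract_text", "1.4", 1200, 0.01, 0.92))
registry.publish(AgentListing("ocr-solid", "extract_text", "2.1", 450, 0.03, 0.99))

chosen = registry.select("extract_text", max_latency_ms=500, min_success_rate=0.95)
print(chosen.agent_id if chosen else None)  # ocr-solid
```

Advertised numbers are only claims until something measures them, so a production registry would also need a way to verify listings against observed behaviour.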
How to read the paper’s evidence
Because this paper is a systematic review and comparative analysis, its evidence is mostly classificatory and architectural. It does not run benchmark experiments, ablations, or controlled performance comparisons. That matters for interpretation.
| Paper element | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Traditional vs modern agent comparison | Background taxonomy | Shows how LLM-powered agents differ from classical agents in autonomy, tools, memory, and context | Does not prove modern agents are more reliable in production |
| Agent communication protocol table | Main comparative evidence | Compares MCP, ACP, A2A, ANP, and Agora across format, semantics, discovery, transport, and use case | Does not prove one protocol will dominate or interoperate flawlessly |
| Framework design taxonomy figure | Conceptual synthesis | Organises major framework design patterns | Does not validate framework performance |
| Unified class model figure | Implementation-level abstraction | Extracts common structural components across frameworks | Does not guarantee these abstractions are sufficient for deployment |
| Memory support table | Main comparative evidence | Shows how frameworks differ across short-term, long-term, semantic, procedural, and episodic memory | Does not measure memory quality, retrieval accuracy, or privacy risk |
| Guardrail discussion | Main architectural finding | Identifies uneven native safety support and need for external enforcement | Does not evaluate guardrails under adversarial testing |
| Service-computing compatibility table | Main business-relevant evidence | Assesses discovery, publishing, and composition readiness | Does not prove enterprise readiness without pilots |
| W3C adaptation table | Exploratory architectural extension | Connects agent infrastructure to established service-computing concepts | Does not mean W3C standards are already implemented uniformly |
| CrewAI code listing | Implementation detail | Makes a role-based agent example concrete | Does not establish general framework superiority |
This distinction is not academic pedantry. It changes how a CTO should use the paper. The survey is a map, not a stopwatch. It helps narrow architectural options and expose gaps. It should not be used as a procurement verdict by itself.
The buyer’s checklist: match the stack to the failure mode
The paper becomes most useful when translated into failure modes.
If the workflow fails because handoffs are ambiguous, prioritise protocols and typed artefacts. A2A-style constructs, ACP-style goal messages, or MCP plus a registry may matter more than the agent framework itself.
If the workflow fails because the agent loses context, prioritise memory design. Decide what belongs in short-term state, durable user memory, semantic retrieval, procedural playbooks, and episodic incident records. Then govern each layer separately.
If the workflow fails because execution is unsafe, prioritise guardrails and sandboxing. Validators, approval gates, deterministic policies, and code isolation should be built into the workflow boundary, not added after the first incident email.
If the workflow fails because it cannot be audited, prioritise graph/state orchestration and artefact lineage. A pretty conversation transcript is not the same as an operational record.
If the workflow fails because teams cannot reuse components, prioritise service contracts. Agents should publish capabilities, versions, inputs, outputs, constraints, and ownership. Otherwise, every new integration becomes artisanal middleware. Charming, perhaps. Scalable, no.
What Cognaptus infers for business use
The paper directly shows that agentic AI frameworks differ materially across architecture, memory, communication, guardrails, and service-computing alignment. It also shows that modern protocols are emerging to address interoperability, though fragmentation remains. It identifies runtime discovery, code safety, static roles, and incompatible abstractions as major limitations.
From that, Cognaptus infers three practical design principles.
First, separate the orchestration spine from the collaboration layer. A graph or state-machine layer is often better for traceability, retries, and operational control. Role-based or conversational agents can still be useful, but they should not be the only representation of the workflow.
Second, treat memory as governed infrastructure. Memory should have policies for storage, retrieval, retention, deletion, and jurisdiction. This is especially true in finance, healthcare, customer support, legal operations, and internal enterprise automation.
Third, design for protocol translation early. The protocol ecosystem is not settled. An architecture that binds every tool, agent, and workflow to one framework’s internal abstraction is betting against interoperability. A service wrapper, registry, and typed artefact layer provide optionality.
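The third principle can be sketched as one internal typed request with thin encoders per wire format. The two wire shapes below are invented for illustration and are not the real MCP or A2A message schemas; `TaskRequest`, `encode`, and the format names `tool_call` and `task` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TaskRequest:
    """Internal, protocol-neutral representation of a request."""
    capability: str
    payload: dict
    correlation_id: str


def to_tool_call_style(req: TaskRequest) -> dict:
    # Invented "tool call" wire shape, for illustration only.
    return {"tool": req.capability, "arguments": req.payload,
            "id": req.correlation_id}


def to_task_style(req: TaskRequest) -> dict:
    # Invented "task object" wire shape, for illustration only.
    return {"task": {"id": req.correlation_id, "skill": req.capability,
                     "input": req.payload}}


ENCODERS: Dict[str, Callable[[TaskRequest], dict]] = {
    "tool_call": to_tool_call_style,
    "task": to_task_style,
}


def encode(req: TaskRequest, wire_format: str) -> dict:
    """Translate the internal request into the requested wire format."""
    if wire_format not in ENCODERS:
        raise ValueError(f"no encoder registered for: {wire_format}")
    return ENCODERS[wire_format](req)


req = TaskRequest("summarise_claim", {"claim_id": "C-42"}, "req-001")
print(encode(req, "tool_call"))
print(encode(req, "task"))
```

Workflows depend only on `TaskRequest`, so adopting or dropping a protocol means adding or removing one encoder rather than rewriting callers.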
The uncertain part is magnitude. The paper does not measure cost savings, latency, success rates, handoff failure rates, incident reduction, or developer productivity. Those have to be tested in the target workflow. A bank’s compliance-review agent, a logistics exception-handler, and a software-engineering copilot will stress different parts of the stack.
The boundary: this is not a leaderboard
The paper’s limitation is also its usefulness. It is not pretending to be an empirical benchmark. There are no controlled experiments showing that one framework improves task success by a particular percentage. There are no ablations isolating memory, guardrails, or protocol choice. There are no stress tests under adversarial prompts, heavy concurrency, changing APIs, or regulated audit requirements.
That means the framework comparisons should be read as structured intelligence, not final judgement.
A vendor may update its guardrails after the paper’s documentation snapshot. A protocol may gain adoption or fade. A framework that looks incomplete in service discovery may become viable when paired with an external registry. A lightweight framework may outperform a heavier one for a narrow internal workflow. Context still has the annoying habit of mattering.
For enterprise readers, the right next step is not to crown a winner. It is to run a pilot with explicit measurements:
- task completion rate;
- handoff failure rate;
- invalid tool-call rate;
- retrieval accuracy;
- policy violation rate;
- cost per successful task;
- time to debug failure;
- audit replay completeness;
- human approval burden;
- incident severity under adversarial prompts.
That is where architecture becomes evidence.
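Several of the pilot measurements listed above reduce to simple ratios over per-run records. The sketch below assumes a hypothetical `RunRecord` log format and invented sample data. It covers only five of the ten metrics, since the others (retrieval accuracy, time to debug, audit replay completeness, approval burden, incident severity) need richer instrumentation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical log entry for one pilot task run."""
    completed: bool
    handoffs: int
    failed_handoffs: int
    tool_calls: int
    invalid_tool_calls: int
    policy_violations: int
    cost_usd: float


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    # Undefined ratios are reported as None rather than as a misleading zero.
    return numerator / denominator if denominator else None


def summarise(runs: List[RunRecord]) -> Dict[str, Optional[float]]:
    successes = sum(1 for r in runs if r.completed)
    return {
        "task_completion_rate": _ratio(successes, len(runs)),
        "handoff_failure_rate": _ratio(
            sum(r.failed_handoffs for r in runs), sum(r.handoffs for r in runs)),
        "invalid_tool_call_rate": _ratio(
            sum(r.invalid_tool_calls for r in runs),
            sum(r.tool_calls for r in runs)),
        # Defined here as violations per run; other definitions are possible.
        "policy_violations_per_run": _ratio(
            sum(r.policy_violations for r in runs), len(runs)),
        # Total spend, including failed runs, divided by successful runs.
        "cost_per_successful_task": _ratio(
            sum(r.cost_usd for r in runs), successes),
    }


runs = [
    RunRecord(True, 4, 0, 10, 1, 0, 0.30),
    RunRecord(False, 2, 1, 5, 2, 1, 0.20),
    RunRecord(True, 4, 1, 5, 0, 0, 0.10),
]
for name, value in summarise(runs).items():
    print(name, value)
```

Charging failed runs to the cost of successful ones is a deliberate choice: it makes an unreliable but cheap-per-call stack look as expensive as it actually is.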
Conclusion: agents become useful when they become boring infrastructure
Agentic AI is often sold as autonomy. The paper’s quieter lesson is that autonomy needs plumbing.
Protocols make agents discoverable and callable. Memory makes them continuous. Guardrails make them safe enough to delegate to. Service contracts make them reusable. Registries make them composable. Orchestration makes them debuggable. Without those layers, an “agent platform” is mostly a prompt, a tool list, and optimism wearing a lanyard.
The useful future of agentic AI will not be defined by the framework with the loudest launch post. It will be defined by systems that can coordinate across boundaries, remember only what they should, expose stable contracts, execute safely, and fail in ways humans can inspect.
That is less cinematic than a swarm of autonomous digital workers. It is also much closer to how real businesses buy, deploy, and trust infrastructure.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hana Derouiche, Zaki Brahmi, and Haithem Mazeni, “Agentic AI Frameworks: Architectures, Protocols, and Design Challenges,” arXiv:2508.10146, 2025. https://arxiv.org/abs/2508.10146