Opening — Why this matters now
The AI industry has spent the last two years trying to turn large language models into workers. The result is a small circus of agents: coding agents, browser agents, research agents, support agents, spreadsheet agents, and agents that appear to exist mainly to summon other agents. Naturally, the next problem is not intelligence. It is management.
That is the useful provocation in *From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company*, a recent paper introducing OneManCompany (OMC) as a framework for treating multi-agent systems less like prompt chains and more like an operating organisation.[1]
The paper’s central argument is simple, almost offensively managerial: skills are not enough. A skilled agent can perform a task. A team of skilled agents can communicate. But a business process needs more than communication. It needs hiring, assignment, review, escalation, memory, permissioning, retirement, and the unglamorous machinery that prevents work from turning into a polite hallucination festival.
For business users, this matters because most agentic automation projects fail in the space between “the demo worked” and “the process survived Monday morning.” The demo has a prompt. The process has exceptions, missing data, role boundaries, audit trails, budget limits, and someone asking why the invoice workflow emailed a client with a half-finished answer. Very inconvenient. Also very real.
OMC is not just another multi-agent framework. In ambition, at least, it is an argument for an organisational layer: one that manages agents as a workforce rather than invoking them as disposable tools.
That distinction is where the business value lives.
Background — Context and prior art
Most agent systems today sit somewhere along a familiar ladder:
| Layer | What it solves | What it does not solve |
|---|---|---|
| Tool use | Gives one agent access to APIs, files, browsers, code, or search | Does not organise multiple workers |
| Skills | Packages reusable behaviours or functions | Usually lives inside one agent |
| Multi-agent chat | Lets agents exchange messages or play roles | Often lacks durable contracts, accountability, or lifecycle management |
| Workflow graphs | Defines task routes and dependencies | Can become brittle when the task changes |
| AI organisation | Manages agents as a workforce with roles, review, hiring, memory, and evolution | Still expensive, complex, and not yet fully proven outside selected domains |
The paper argues that current multi-agent systems remain constrained by three structural weaknesses.
First, team structures are often fixed before execution. The workflow designer decides in advance that there will be a planner, coder, reviewer, and tester. That works until the task quietly demands a data engineer, compliance reviewer, or API specialist. Reality does enjoy arriving uninvited.
Second, agents are coupled to their runtime environments. A LangGraph agent, a Claude Code session, and a script-based executor may each be useful, but they do not automatically behave like interchangeable employees inside one company. Without a common organisational interface, integration becomes a pile of backend-specific glue code.
Third, learning is usually shallow or session-bound. Many systems can revise a plan during a run. Fewer can remember what went wrong across projects, update standard operating procedures, evaluate underperforming agents, and replace them when needed. In human organisations this is called management. In AI systems it is still treated as a research feature, which tells us something about both AI and management.
OMC addresses this gap with three linked concepts:
- Talent–Container architecture: separate the agent’s identity and capabilities from the runtime where it executes.
- Explore–Execute–Review tree search: treat project execution as an iterative search over organisational strategies.
- Self-evolution and HR lifecycle: allow both agents and the organisation to improve through feedback, retrospectives, SOP updates, and performance management.
This is why the paper’s title moves “from skills to talent.” A skill is a reusable capability. A talent is a deployable worker with role, tools, behaviour, memory, and lifecycle. That is not a cosmetic renaming. It changes the unit of automation.
Analysis — What the paper does
1. Talent is not a tool. It is a portable employee profile.
OMC decomposes each AI employee into two parts:
| Component | Meaning in OMC | Business translation |
|---|---|---|
| Talent | The agent’s role, prompts, skills, tools, working principles, supporting resources, and configuration | The employee’s job profile, playbook, capabilities, and professional habits |
| Container | The runtime that hosts the talent: LangGraph, Claude Code, script-based executor, or another backend | The desk, machine, operating environment, and access layer |
| Employee | Talent plus Container | A managed AI worker ready to receive tasks |
This separation is powerful because it decouples who the agent is from where the agent runs. A research analyst talent could, in principle, run in different containers. A container could host different talents. This is the kind of abstraction boring engineers like and fragile automation systems desperately need.
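The paper does not publish this decomposition as code, but it is easy to sketch. In the hypothetical Python sketch below, `Talent`, `Container`, and `Employee` mirror the table above; all names and the stubbed `run` method are illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class Talent:
    """Who the agent is: a portable profile, independent of any runtime."""
    role: str
    system_prompt: str
    skills: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    working_principles: list[str] = field(default_factory=list)

@dataclass
class Container:
    """Where the agent runs: LangGraph, Claude Code, a script executor, etc."""
    backend: str

    def run(self, context: str) -> str:
        # A real container would dispatch to its backend; stubbed here.
        return f"[{self.backend}] {context.splitlines()[-1]}"

@dataclass
class Employee:
    """Talent plus Container: a managed worker ready to receive tasks."""
    talent: Talent
    container: Container

    def work_on(self, task: str) -> str:
        context = f"{self.talent.system_prompt}\nTask: {task}"
        return self.container.run(context)

# The same talent could, in principle, be rehosted in a different container.
analyst = Talent(role="research analyst", system_prompt="You analyse market data.")
employee = Employee(analyst, Container(backend="langgraph"))
```

Because `Talent` carries no runtime details, swapping `Container(backend="langgraph")` for another backend changes nothing about who the worker is, which is the whole point of the separation.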
The paper defines six organisational interfaces through which containers interact with the platform:
| Interface | Function | Why it matters operationally |
|---|---|---|
| Execution | Dispatch task and return output | Keeps runtime-specific execution behind a contract |
| Task | Manage per-agent queues and mutual exclusion | Prevents one agent from being assigned conflicting work at once |
| Event | Publish and subscribe to organisational events | Enables coordination without chaotic free-form chatter |
| Storage | Read and write persistent memory | Keeps project and employee knowledge durable |
| Context | Assemble role, guidance, and memory into execution context | Makes behaviour less dependent on improvised prompts |
| Lifecycle | Apply pre- and post-execution hooks | Supports validation, guardrails, audit, and self-improvement |
The operating system analogy in the appendix is not decorative. OMC treats heterogeneous agents like an operating system treats heterogeneous processes and devices: abstract the messy substrate behind stable interfaces. This is the right instinct. Enterprise AI does not need every agent to be identical. It needs them to be governable.
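The six interfaces can be collapsed into one structural contract. The sketch below uses a Python `Protocol`; the interface names come from the paper's table, while the method names and signatures are my assumptions.

```python
from typing import Any, Callable, Protocol

class OrgContainer(Protocol):
    """Illustrative contract covering OMC's six organisational interfaces."""

    # Execution: dispatch a task and return its output.
    def execute(self, task: dict) -> str: ...

    # Task: per-agent queue with mutual exclusion (one task at a time).
    def enqueue(self, task: dict) -> None: ...
    def next_task(self) -> "dict | None": ...

    # Event: publish and subscribe to organisational events.
    def publish(self, topic: str, payload: Any) -> None: ...
    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None: ...

    # Storage: durable project and employee memory.
    def read(self, key: str) -> Any: ...
    def write(self, key: str, value: Any) -> None: ...

    # Context: assemble role, guidance, and memory into an execution context.
    def build_context(self, task: dict) -> str: ...

    # Lifecycle: pre/post hooks for validation, guardrails, and audit.
    def pre_execute(self, task: dict) -> None: ...
    def post_execute(self, task: dict, output: str) -> None: ...
```

Any backend that satisfies this contract is governable by the platform, regardless of how messy its internals are; that is the operating-system move in miniature.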
2. The Talent Market turns capability gaps into hiring decisions.
OMC includes a Digital Talent Market. Instead of generating a fictional agent from a role prompt and hoping it has the implied capabilities — a classic industry pastime, right next to “just add RAG” — the system can recruit deployable agent packages.
The paper describes three supply channels:
| Talent source | How it is built | Best fit |
|---|---|---|
| Curated repository agents | Packaged from established open-source agent repositories | Mature domains with proven implementations |
| Prompt-sourced agents with skill assembly | Starts from curated specialist personas, then attaches tools and skills | Roles with clear descriptions but incomplete implementations |
| Dynamic cloud-skill assembly | Builds persona and skill set from retrieved modular skills | Niche or emerging domains with no mature templates |
The HR agent ranks candidates and presents them to the human CEO for approval. This matters because capability selection remains a governance decision, not just a retrieval problem. The paper’s version is still research-heavy, but the business pattern is already visible: future automation stacks may need something like vendor management for agents.
That includes provenance, permissions, performance history, and decommissioning. In other words, procurement will discover agents. We have been warned.
3. Explore–Execute–Review is a project loop, not a chat loop.
OMC’s coordination mechanism is described as an Explore–Execute–Review tree search. The system does not merely follow a fixed workflow graph. It expands a task tree dynamically, assigns subtasks to employees, executes them, reviews outputs, and revises strategy when needed.
The loop can be read as follows:
| Stage | What happens | Business analogue |
|---|---|---|
| Explore | Decide how to decompose work, assign employees, or recruit missing capabilities | Planning, staffing, and work breakdown |
| Execute | Agents perform assigned subtasks through containers | Operational execution |
| Review | Supervisors accept, reject, escalate, or trigger iteration | Quality control and management review |
The important detail is the review gate. A completed subtask does not automatically propagate downstream. It must be accepted by a supervisor. This is a direct response to one of the ugliest failure modes in agent systems: early errors quietly becoming input for later agents, which then make the error look more sophisticated. The technical term is error propagation. The office term is “who approved this?”
OMC also combines its task tree with dependency edges, forming a DAG-based execution layer. This allows sibling tasks to depend on one another: for example, frontend work can wait for an API contract even if both tasks share the same parent. A finite state machine governs task lifecycle. Retries are bounded. Timeouts and cost budgets act as circuit breakers. Deadlock detection prevents silent stalls.
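That machinery can be made concrete. The sketch below is a minimal reconstruction under stated assumptions, not the paper's implementation: a four-state task lifecycle, dependency edges, bounded retries, a supervisor review gate, and stall detection.

```python
from enum import Enum, auto

class State(Enum):
    PENDING = auto()   # waiting on dependencies
    READY = auto()     # dependencies accepted; eligible to run
    ACCEPTED = auto()  # passed supervisor review
    FAILED = auto()    # retries exhausted

MAX_RETRIES = 2  # bounded retries act as a circuit breaker

class Task:
    def __init__(self, name, deps=()):
        self.name, self.deps = name, list(deps)
        self.state, self.retries = State.PENDING, 0

def run_project(tasks, execute, review):
    """Advance the task DAG. A subtask's output never feeds downstream
    work until a supervisor accepts it (the review gate)."""
    while not all(t.state in (State.ACCEPTED, State.FAILED) for t in tasks):
        progressed = False
        for t in tasks:
            if t.state == State.PENDING and all(d.state == State.ACCEPTED for d in t.deps):
                t.state, progressed = State.READY, True
            elif t.state == State.READY:
                output = execute(t)          # the container does the work
                if review(t, output):        # supervisor acceptance gate
                    t.state = State.ACCEPTED
                elif t.retries < MAX_RETRIES:
                    t.retries += 1           # retry, ideally with reviewer feedback
                else:
                    t.state = State.FAILED   # stop instead of propagating the error
                progressed = True
        if not progressed:
            # Deadlock detection: no task can ever advance, e.g. a failed dependency.
            raise RuntimeError("stalled: unmet or failed dependencies")
    return tasks
```

The sibling-dependency case from the text falls out directly: `Task("frontend", deps=[api_contract])` stays `PENDING` until the API contract is accepted, even though both share a parent.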
This is not glamorous, but it is the machinery that separates a toy multi-agent demo from a process that can run without someone nervously refreshing a console.
4. Self-evolution is organisational memory, not mystical self-improvement.
The paper uses the language of self-evolution, but the mechanism is more practical than mystical. Agents update working principles after one-on-one feedback and task completion. Project retrospectives produce SOP updates. HR reviews agent performance every three projects. Agents that fail repeated reviews can enter a Performance Improvement Plan and eventually be offboarded.
This is almost funny because the paper imports human HR bureaucracy into AI systems. It is also sensible. A workforce that cannot evaluate, coach, or replace its own members is not a workforce. It is a loose collection of chat windows with confidence issues.
The key business point is this: OMC stores learning in agent profiles and organisational SOPs, not in model weights. That makes improvement cheaper and more auditable than retraining. It also makes it more fragile: if the written reflection is poor, the organisation learns the wrong lesson with excellent formatting.
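A minimal sketch of that idea, assuming profiles are plain JSON files (an assumption on my part; the paper does not specify a storage format):

```python
import json
import pathlib
from datetime import date

def record_lesson(profile_path, lesson, source):
    """Persist a retrospective lesson into an agent's profile file.
    The 'learning' is a plain, auditable document, not a weight update."""
    path = pathlib.Path(profile_path)
    profile = json.loads(path.read_text()) if path.exists() else {"working_principles": []}
    profile["working_principles"].append({
        "lesson": lesson,
        "source": source,  # e.g. "retrospective:project-12"
        "added": date.today().isoformat(),
    })
    path.write_text(json.dumps(profile, indent=2))
    return profile
```

Because the lesson is stored as text with provenance, a human can read, audit, or veto it, which is exactly why a badly written reflection is also easy to scale by accident.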
Findings — Results with visualization
The paper evaluates OMC on PRDBench, a benchmark of 50 project-level software development tasks based on product requirement documents. Unlike narrow coding benchmarks, PRDBench evaluates whether an agentic system can interpret requirements, decompose work, implement solutions, and satisfy executable test criteria.[2]
The reported result is strong:
| System type | Method | Success rate | Reported cost |
|---|---|---|---|
| Multi-agent | OMC | 84.67% | $345.59 total / about $6.91 per task |
| Minimal | Claude-4.5 | 69.19% | Not reported |
| Minimal | GPT-5.2 | 62.49% | Not reported |
| Commercial | CodeX | 62.09% | Not reported |
| Commercial | Claude Code | 56.65% | Not reported |
| Minimal | Qwen3-Coder | 43.84% | Not reported |
| Minimal | DeepSeek-V3.2 | 40.11% | Not reported |
A fair interpretation is not “OMC is universally better.” The fair interpretation is narrower and more useful: for complex project-level software tasks, organisational coordination can outperform stronger single-agent baselines, but at a visible coordination cost.
The paper also reports four cross-domain case studies:
| Case | Team pattern | Output | Reported cost / time |
|---|---|---|---|
| GitHub AI agent trend report | Researcher plus writer | Verified repository trend report emailed to user | About $4.50; under 10 minutes |
| Street-fighting web game | Game developer plus art designer | Playable prototype, revised after evaluator feedback | Cost breakdown shown in appendix |
| Illustrated audiobook short drama | Writer plus AV producer | Scripts, scene images, voice-over tracks, final videos | $1.57 for two videos and related assets |
| Automated research survey | Research scientists plus AI engineer | 17 structured documents, mind map, research ideas | $16.26; under one hour |
These cases are more demonstration than proof. They show generality, not statistical confidence. The authors are explicit that systematic evaluation beyond software development remains future work, and that self-evolution mechanisms have not yet been quantitatively ablated. Good. A paper that admits its limitations is already ahead of several product launches.
What seems genuinely new?
The novelty is not that OMC uses multiple agents. That field is crowded enough to need zoning laws. The contribution is the combination of four design choices:
| Design choice | Why it matters |
|---|---|
| Talent–Container separation | Allows heterogeneous backends to be governed through common contracts |
| Talent Market | Makes workforce composition dynamic rather than preconfigured |
| Review-gated task DAG | Reduces silent error propagation and unmanaged dependency failures |
| HR-style lifecycle | Turns learning into persistent profiles, SOPs, reviews, and replacement decisions |
This is a shift from agent orchestration to agent operations. Orchestration asks, “Which agent talks next?” Operations asks, “Who is qualified, what are they allowed to access, what counts as acceptance, what happens if they fail, and how does the organisation learn?”
The second question is less exciting. It is also the one that companies actually pay for.
Implications — Next steps and significance
For business automation: stop designing only the agent; design the organisation around it.
The practical lesson is blunt. An AI agent is not a business process. It is a worker inside a process. The process still needs operating design.
For a company building agentic automation, OMC suggests a better implementation checklist:
| Question | Weak implementation | Stronger implementation |
|---|---|---|
| Capability | “Use a smart model.” | Define roles, tools, access rights, and acceptance criteria |
| Assignment | “Let the agent decide.” | Match tasks to agent profiles and performance history |
| Review | “Check final output.” | Review subtasks before downstream propagation |
| Memory | “Store chat history.” | Maintain task logs, working principles, SOPs, and exception records |
| Failure | “Retry.” | Retry with context, then escalate, reassign, or offboard |
| Cost | “Agents are cheap.” | Route simple tasks to single agents; reserve teams for complex work |
This is where ROI becomes practical. Multi-agent coordination is not free. The paper reports about $6.91 per PRDBench task for OMC, and the authors openly note the cost-performance trade-off. That cost is irrational for a simple email rewrite. It may be trivial for a software defect, compliance review, financial reconciliation, or customer-facing workflow where one unreviewed error can be expensive.
The sensible enterprise architecture is not “agent everything.” It is adaptive dispatch: single-agent execution for simple tasks, multi-agent organisation for complex, risky, or multi-step work.
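Adaptive dispatch can be prototyped as a plain scoring router. Every signal and threshold below is illustrative; the point is only that routing should be an explicit, auditable decision rather than a default to the most expensive option.

```python
def dispatch(task: dict) -> str:
    """Route simple work to a single agent; reserve the multi-agent
    organisation for work whose complexity or risk justifies the
    coordination cost. Signals and thresholds are illustrative."""
    score = 0
    if task.get("steps", 1) > 3:             # needs multi-step decomposition
        score += 2
    if task.get("customer_facing", False):   # one bad output is expensive
        score += 2
    if task.get("specialist_roles", 0) > 1:  # needs more than one skill set
        score += 1
    if task.get("budget_usd", 0) > 50:       # worth paying for review gates
        score += 1
    return "multi_agent_org" if score >= 3 else "single_agent"

dispatch({"steps": 1})                           # -> "single_agent"
dispatch({"steps": 6, "customer_facing": True})  # -> "multi_agent_org"
```

A simple email rewrite never clears the threshold; a multi-step, customer-facing workflow does, which matches the paper's cost-performance framing.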
For governance: the organisational layer is a control layer.
Many AI governance discussions focus on model policy: what the model may say, what data it may access, what risks it may create. OMC reframes governance around operations:
- Which agent is authorised to use which tool?
- Which outputs require supervisor acceptance?
- Which dependencies block execution?
- Which failures trigger escalation?
- Which lessons become SOPs?
- Which agents should be retired?
This is closer to how real businesses manage risk. Governance is not a PDF policy admired once per quarter. It is embedded in queues, permissions, review gates, logs, and escalation paths.
The paper’s role-based tool access model is especially relevant. Each agent’s context includes only authorised tools. That means access control is handled before reasoning begins, rather than after the agent has already considered doing something creative. In security, creativity is not always a virtue.
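A minimal sketch of that pattern, with hypothetical roles and tools:

```python
# Role -> permitted tools. The grant is applied while building context,
# so an agent never even sees a tool outside its role.
TOOL_GRANTS = {
    "claims_reviewer": {"read_claim", "policy_lookup"},
    "payments_agent": {"read_claim", "issue_refund"},
}

TOOL_SPECS = {
    "read_claim": "read_claim(claim_id): fetch a claim record",
    "policy_lookup": "policy_lookup(clause): fetch policy text",
    "issue_refund": "issue_refund(claim_id, amount): pay out a refund",
}

def build_context(role: str, task: str) -> str:
    """Assemble role, task, and only the authorised tool specs."""
    allowed = sorted(TOOL_GRANTS.get(role, set()))
    tool_block = "\n".join(TOOL_SPECS[t] for t in allowed)
    return f"Role: {role}\nAuthorised tools:\n{tool_block}\nTask: {task}"
```

A claims reviewer's context simply never mentions `issue_refund`, so there is nothing for the model to get creative with before any guardrail fires.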
For AI vendors: marketplaces may move from tools to workers.
If the Talent Market idea matures, the agent ecosystem may evolve from tool stores to worker marketplaces. Today, many agent platforms advertise integrations: Slack, Gmail, GitHub, Notion, databases, browsers, APIs. Useful, yes. But enterprises eventually need complete work units: a claims reviewer, month-end close assistant, compliance documentation agent, procurement analyst, or QA auditor.
That requires more than a tool wrapper. It requires:
| Marketplace asset | Enterprise requirement |
|---|---|
| Role definition | Clear responsibility boundary |
| Tool bindings | Controlled operational capability |
| Benchmark record | Evidence of competence |
| Provenance | Source, maintainer, version history |
| Permission profile | What the agent can access and modify |
| Performance history | Quality, failure rate, cost, review outcomes |
| Offboarding path | Safe removal and replacement |
This is where agent markets become less like app stores and more like staffing platforms mixed with software supply-chain management. Delightful. More spreadsheets. But necessary spreadsheets.
For Cognaptus-style automation: the case study template becomes agentic by design.
OMC maps neatly onto real-world business automation cases. Consider a utility maintenance agent, a recruitment screening agent, a call-center QA agent, or a fleet maintenance agent. In each case, the value does not come from one heroic LLM. It comes from a managed division of labour:
| Business process | Possible AI organisation |
|---|---|
| Utility maintenance | Outage triage, dispatch, asset inspection, public notice drafting, reliability reporting |
| Recruitment screening | Resume screening, role matching, interview question generation, shortlist explanation, client briefing |
| Call-center QA | Transcript review, sentiment detection, compliance checking, coaching, QA reporting |
| Fleet maintenance | Vehicle health monitoring, maintenance scheduling, fuel efficiency analysis, incident summarisation, cost reporting |
The implementation question becomes: what are the employees, what are their tools, what are their review gates, and what organisational memory should persist?
This is a healthier design pattern than asking one omnipotent chatbot to “handle operations.” Omnipotence is usually just poor scoping with better branding.
Practical adoption model
A business considering this architecture should not start by recreating OMC. It should start by borrowing the organisational principles.
| Maturity level | What to implement first | Success signal |
|---|---|---|
| Level 1: Assisted workflow | One agent drafts, human approves | Faster drafting without loss of quality |
| Level 2: Role-specialised agents | Separate agents for extraction, analysis, drafting, review | Fewer errors from task mixing |
| Level 3: Review-gated workflow | Subtask outputs require acceptance before downstream use | Lower error propagation |
| Level 4: Persistent operating memory | SOPs, exception logs, reusable profiles | Better performance across repeated jobs |
| Level 5: Dynamic workforce | Recruit or activate specialist agents based on task needs | Broader task coverage without rebuilding workflows |
Most companies should live at Levels 2–4 for a while. Level 5 is attractive, but dynamic hiring of agents only makes sense after the organisation has clear roles, access controls, acceptance standards, and logs. Otherwise, the company is not hiring talent. It is inviting strangers into the workflow with API keys. A bold strategy, mostly for people who dislike sleep.
Limitations — Where the paper is still early
The paper’s framework is ambitious, but several constraints deserve attention.
First, the quantitative benchmark is software-heavy. PRDBench is useful because project-level software work genuinely tests planning, decomposition, implementation, and evaluation. Still, enterprise workflows in finance, healthcare, logistics, legal operations, and public services carry different failure modes. A coding benchmark does not prove operational readiness in regulated processes.
Second, the self-evolution component is plausible but under-tested. The authors state that one-on-ones, retrospectives, and performance reviews are implemented, but they have not yet isolated the contribution of each component. This matters because organisational memory can help or harm. Bad retrospectives produce bad SOPs. Bad SOPs scale mistakes with confidence, the most efficient form of nonsense.
Third, cost efficiency remains unresolved. OMC’s result is strong, but the baselines do not report comparable cost data. Without cost-normalised evaluation, it is hard to know whether OMC is better because it is better organised, because it spends more compute, or because both are true.
Fourth, human judgment remains central. The CEO approves talent selection, injects requirements, prunes branches, and decides when iteration should stop. This is not a weakness; it is reality. But it means the quality of the system depends partly on the quality of the human controller. Agentic automation does not remove management. It concentrates management into sharper decision points.
Conclusion — The return of the org chart
The strongest idea in this paper is that agentic AI needs an organisational abstraction. Not another prompt template. Not another agent chatroom. Not another dashboard where simulated employees compliment each other before failing a dependency.
It needs a layer that manages workforce composition, runtime heterogeneity, task decomposition, dependency control, review gates, memory, access rights, cost limits, and lifecycle accountability.
OMC is early, and parts of it remain closer to research prototype than enterprise infrastructure. But its direction is important. The future of AI automation will not be won by the single smartest agent. It will be won by systems that know when to hire, when to delegate, when to review, when to escalate, when to remember, and when to fire.
Apparently, the future of autonomous AI is middle management.
How reassuringly human.
Cognaptus: Automate the Present, Incubate the Future.
[1] Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu, Meng Fang, Weilin Luo, and Jun Wang, “From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company,” arXiv:2604.22446v1, April 24, 2026.

[2] The paper evaluates OMC on PRDBench, described as 50 project-level software development tasks based on structured product requirement documents, auxiliary data, test plans, and executable evaluation scripts.