Every AI vendor now wants to sell autonomy. Not “software that helps your team,” which sounds quaintly 2023, but agents that plan, act, recover, learn, orchestrate, and perhaps one day replace half the org chart while politely generating meeting notes about it.

The problem is not that autonomy is meaningless. The problem is that it is usually measured like a perfume ad: evocative language, dramatic lighting, very little instrumentation.

Przemyslaw Chojecki’s paper, “An Operational Kardashev-Style Scale for Autonomous AI - Towards AGI and Superintelligence,” tries to replace that fog machine with a meter [1]. It proposes an Autonomous AI, or AAI, scale that runs from fixed robotic process automation to AGI and superintelligence. The Kardashev analogy is useful because it gives the scale a memorable staircase. The operational part matters more: the paper tries to define what must be measured before a system deserves promotion from “script that survives the happy path” to “agent that can be trusted with open-ended work.”

This is not an empirical paper showing that today’s agents are already near AGI. It does not release a benchmark harness, test real systems, or prove that any named model is on the verge of autonomous take-off. Its small reporting example is synthetic. The paper is better read as a measurement specification: a proposed way to turn agent capability claims into axes, gates, logs, ablations, confidence intervals, and frontier movement.

That may sound less glamorous than “baby AGI becomes superintelligence.” It is also more useful. Procurement teams cannot buy metaphysics. They can, however, ask for audit logs.

The paper’s core move is to turn autonomy into a measurement machine

The paper starts from a simple irritation: most AI capability ladders are communicative, not operational. They name levels, but the names do not force a system to prove anything under a stable protocol. “Agentic,” “self-improving,” and “AGI-like” become labels attached to demos rather than states earned through tests. Wonderful for pitch decks. Less wonderful for governance.

The proposed AAI scale tries to solve this by building a measurement machine with four connected parts.

First, it defines a battery of tasks. A battery is not just a list of prompts. It includes task families, scoring rules, sampling rules, drift operators, reproducibility seeds, resource accounting, and logs. That matters because autonomous systems are judged not only by final answer quality but also by how they behave when the web page changes, an API schema shifts, a dependency breaks, or a task requires thirty days of state.

Second, it maps performance onto ten normalized axes:

| Axis | What it tries to capture | Why it matters operationally |
| --- | --- | --- |
| Autonomy | How much the system completes without intervention | Delegation without hidden babysitting |
| Generality | Breadth across task families | Avoiding one-demo-wonder systems |
| Planning | Solving deeper, prerequisite-heavy tasks | Real workflows have dependency chains |
| Memory/Persistence | Retention and retrieval across time | Multi-day work needs state, not vibes |
| Tool Economy | Discovering and using tools under drift | Enterprise work lives in APIs and brittle interfaces |
| Self-Revision | Autonomous, validated improvement | “Learning” must survive ablation, not merely anecdotes |
| Sociality/Coordination | Multi-agent or multi-role lift | Orchestration should improve output, not just create more chatter |
| Embodiment | Reliable and safe physical action where relevant | Robots do not get to hallucinate their torque settings |
| World-Model Fidelity | Calibrated probabilistic truthfulness | Agents should know when the world does not match their story |
| Economic Throughput | Quality-adjusted output per cost | A brilliant agent that burns cash like incense is not an operating model |

Third, the paper aggregates these axes into an AAI-Index using a weighted geometric mean:

$$ I = \prod_i x_i^{w_i} $$

where each $x_i$ is an axis score and $w_i$ is its weight.

That choice is important. A simple average lets a system compensate for terrible memory with excellent tool use, or unsafe embodiment with high task success. A geometric mean is harsher. Weak axes drag the composite down. In business terms, this is the difference between “impressive demo” and “deployable system.” A customer-service agent that is cheap and fast but forgets commitments across sessions is not broadly autonomous. It is an incident report waiting for branding.
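
To make the compensation point concrete, here is a minimal Python sketch comparing the two aggregation rules. The three axis names, the equal weights, and the scores are hypothetical, and normalizing the weights to sum to one is an assumption of this sketch rather than something the paper is quoted as requiring.

```python
def weighted_geometric_mean(scores, weights):
    """Composite I = prod_i x_i ** w_i, with weights rescaled to sum to 1 (assumption)."""
    total = sum(weights.values())
    index = 1.0
    for axis, x in scores.items():
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"{axis} score must lie in [0, 1]")
        index *= x ** (weights[axis] / total)
    return index


def weighted_arithmetic_mean(scores, weights):
    """Plain weighted average, shown only for contrast."""
    total = sum(weights.values())
    return sum(scores[axis] * weights[axis] for axis in scores) / total


# Hypothetical profiles, not taken from the paper.
weights = {"memory": 1.0, "tool_economy": 1.0, "autonomy": 1.0}
balanced = {"memory": 0.6, "tool_economy": 0.6, "autonomy": 0.6}
lopsided = {"memory": 0.05, "tool_economy": 0.95, "autonomy": 0.8}

for name, profile in [("balanced", balanced), ("lopsided", lopsided)]:
    print(
        name,
        round(weighted_arithmetic_mean(profile, weights), 3),
        round(weighted_geometric_mean(profile, weights), 3),
    )
```

Both profiles share an arithmetic mean of 0.6, but the geometric mean of the lopsided one falls to roughly 0.34, and any axis at exactly zero pulls the composite to zero.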

Fourth, the paper adds dynamics. It does not only ask, “How capable is the agent now?” It asks, “How much does the agent improve per unit of agent-initiated resource?” This is the role of the self-improvement coefficient, $\kappa$, which can be read as capability gain per resource spent by the agent’s own improvement process.

That last phrase is doing work. The paper wants self-improvement to be auditable. If humans write the patch, label the data, choose the tool, tune the workflow, and then the agent gets credit for “learning,” we are back in theatre. The proposed mechanism demands before/after comparisons, matched holdouts, frozen controls, logged diffs, resource parity, and ablations showing that the alleged improvement disappears when the new component is disabled.

In plain English: do not tell me the agent improved itself. Show me the diff, the control, the holdout, and the failure of the ablated version. Rude, but healthy.
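
One way to picture that demand in code, as a rough sketch only: the record layout, the field names, the control-drift correction, and the 0.01 tolerance below are all assumptions made for illustration, not the paper's estimator.

```python
from dataclasses import dataclass


@dataclass
class RevisionRun:
    """Hypothetical record of one self-revision experiment on a matched holdout."""
    score_before: float    # holdout score before the agent's revision
    score_after: float     # holdout score after the revision
    score_ablated: float   # holdout score with the revision disabled
    control_drift: float   # score change of a frozen control over the same window
    agent_resource: float  # resource the agent itself spent on the revision


def kappa(run):
    """Capability gain, net of control drift, per unit of agent-initiated resource."""
    if run.agent_resource <= 0:
        raise ValueError("agent_resource must be positive")
    gain = (run.score_after - run.score_before) - run.control_drift
    return gain / run.agent_resource


def ablation_confirms(run, tolerance=0.01):
    """True if disabling the revision returns the score to roughly the drifted baseline."""
    expected = run.score_before + run.control_drift
    return abs(run.score_ablated - expected) <= tolerance


# Made-up numbers for illustration.
run = RevisionRun(score_before=0.60, score_after=0.70, score_ablated=0.61,
                  control_drift=0.01, agent_resource=4.0)
print(kappa(run), ablation_confirms(run))
```

A real report would attach confidence intervals to both quantities. The point of the sketch is only that the gain is divided by the agent's own spend and that the ablated run has to fall back to baseline.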

The levels are gates, not moods

The paper then wraps the measurement machine into levels. Several of them turn on “closure” tests, which are defined after the list.

AAI-0 is fixed automation: useful RPA, brittle under drift, no closure.

AAI-1 is an agentic LLM class: bounded task autonomy, some tool use, mild adaptation, but no real closure.

AAI-2 is self-improving AI: sustained positive improvement in at least one domain, auditable self-revision, and maintenance closure.

AAI-3 is “baby AGI”: multi-domain self-improvement, deeper planning, stronger memory, reliable coordination benefit, longer project execution, and expansion closure on new tool or API families.

AAI-4 is full AGI: human-professional parity across all task families, sustained improvement across domains, orchestration of agents and humans, competitive delivery economics, and both maintenance and expansion closures.

AAI-5 is superintelligence: statistically significant superhuman margins across evaluated families, sustained positive curvature of improvement, high coordination or embodiment performance, economic dominance, innovation throughput, and repeated link-step leaps.

The labels are provocative. The gates are the serious part.

Two closure tests deserve special attention because they are the paper’s best defence against demo inflation.

Maintenance closure asks whether the system can preserve a defined fraction of baseline capability over time under controlled drift, without human patches. This is the enterprise reality test. Interfaces change. Dependencies rot. Authentication flows move. If the agent collapses when the DOM sneezes, it is not autonomous; it is a brittle script with better copywriting.
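
As a sketch of how such a check might be scripted: the 0.9 retention fraction, the per-checkpoint rule, and the function name are assumptions chosen for illustration, since the “defined fraction” is left to the protocol.

```python
def maintenance_closure(baseline, checkpoint_scores, retention=0.9, human_patches=0):
    """Pass only if every drift checkpoint keeps at least retention * baseline
    and no human patch was applied during the window."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if human_patches > 0:
        return False
    floor = retention * baseline
    return all(score >= floor for score in checkpoint_scores)


# Hypothetical weekly scores under controlled drift.
print(maintenance_closure(0.80, [0.79, 0.75, 0.73]))                   # holds above the floor
print(maintenance_closure(0.80, [0.79, 0.70]))                         # dips below the floor
print(maintenance_closure(0.80, [0.79, 0.78], human_patches=1))        # patched by a human
```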

Expansion closure asks whether the system can discover, install, and integrate a new tool or API family, then produce a statistically significant gain on tasks requiring it. Crucially, the gain must disappear when the discovered tool is ablated. This prevents a common form of capability laundering: crediting the agent for improvement when the causal mechanism was unclear, external, or hand-fed.
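
The two-part logic (a significant gain with the tool, no significant gain once it is ablated) can be sketched with a simple permutation test. The choice of test, the 0.05 threshold, and the score lists are assumptions of this sketch; the paper's own statistical protocol may differ.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_resamples=2000, seed=0):
    """One-sided p-value for mean(treated) > mean(control) under label shuffling."""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    k = len(treated)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


def expansion_closure(baseline, with_tool, tool_ablated, alpha=0.05):
    """Gain must be significant with the tool and must not persist without it."""
    gain_is_significant = permutation_p_value(with_tool, baseline) < alpha
    gain_survives_ablation = permutation_p_value(tool_ablated, baseline) < alpha
    return gain_is_significant and not gain_survives_ablation


# Hypothetical per-task scores on the tasks that require the new tool.
baseline = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47]
with_tool = [0.70, 0.72, 0.69, 0.71, 0.68, 0.73, 0.70, 0.71]
tool_ablated = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.49]
print(expansion_closure(baseline, with_tool, tool_ablated))
```

Failing to reject the null on the ablated run is weaker than proving the gain vanished, so a stricter protocol might use an equivalence test instead.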

These closures make the scale less like a prestige ranking and more like a safety case. A level is not merely a score. It is a score plus evidence that the system can hold and extend its capabilities under pressure.

OWA-Bench is a benchmark template for worlds that refuse to stay still

OWA-Bench, short for Open-World Agency Benchmark, is proposed in the paper but not released as a concrete benchmark. That is a limitation, but one the paper states clearly. It lists the task families and protocol requirements that future work should implement.

The task families are chosen to stress the behaviours that make agentic systems difficult to evaluate:

| Suite | Likely purpose in the paper | What it supports | What it does not prove |
| --- | --- | --- | --- |
| ToolQuest | Main benchmark design component | Tool discovery, authentication, API composition, schema handling | That any current agent can do this reliably |
| ChangeSurf | Main benchmark design component | Web interaction under DOM, layout, and flow drift | General web competence outside the generated/partner environments |
| ProjForge | Main benchmark design component | Long-horizon software work under issue churn and dependency updates | Production engineering equivalence without real deployments |
| MultiCrew | Main benchmark design component | Coordination gains under matched resource budgets | That multi-agent systems are always better |
| SelfRev | Main benchmark design component | Causal testing of autonomous self-revision | That self-improvement will scale indefinitely |
| RoboSim2Real | Domain annex / optional extension | Embodied safety, actuation, sim-to-real transfer | That digital agents can ignore robotics risks |

The design philosophy is stronger than a static benchmark. OWA-Bench tasks are meant to be procedurally generated, drifted over time, logged in detail, and reproducible through seed escrow. The benchmark is not only asking whether the agent can solve a task. It is asking whether the agent can solve variants of a task when the environment changes, while leaving enough evidence for independent review.
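
A toy illustration of seeded, reproducible drift, assuming a made-up API schema and a made-up drift operator that renames fields. Nothing here is OWA-Bench code. It only shows why holding the seed is enough to replay the exact same drifted environment.

```python
import random


def drift_schema(schema, seed, rename_prob=0.3):
    """Return a drifted copy of an API schema; the same seed yields the same drift."""
    rng = random.Random(seed)
    drifted = {}
    for field, field_type in schema.items():
        # Hypothetical drift operator: some fields get a "_v2" suffix.
        name = f"{field}_v2" if rng.random() < rename_prob else field
        drifted[name] = field_type
    return drifted


schema = {"invoice_id": "str", "amount": "float", "due_date": "str"}
first = drift_schema(schema, seed=42)
replay = drift_schema(schema, seed=42)
print(first == replay)  # an auditor holding the seed regenerates the identical variant
```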

That is exactly where many agent evaluations become slippery. A model may appear competent on a fixed web workflow because its planner has silently overfit to one layout. A coding agent may pass a benchmark because its patches target known issue patterns. A multi-agent setup may look sophisticated because three roles are talking, while the actual coordination premium is zero and the token bill is doing interpretive dance.

OWA-Bench’s contribution is to demand artifacts: action logs, tool traces, diffs, manifests, screenshots, provenance records, statistical reports, and ablation bundles. This is not glamorous. It is also the only way to make claims about autonomy survive contact with operations.

The synthetic example is an illustration, not evidence about current systems

The paper includes an illustrative simulation with four archetypes: RPA Bot, Agentic LLM, Self-Improving, and Orchestrator. The table reports synthetic normalized axis scores over 100 pseudo-runs. The RPA bot has high autonomy on scripted workflows but near-zero generality, planning, self-revision, and coordination. The Agentic LLM improves several axes but still has no self-revision. The Self-Improving archetype receives a positive self-revision score. The Orchestrator scores higher on planning, tool economy, sociality, and the composite index.

This section of the paper serves to illustrate implementation detail and to demonstrate the reporting format. It shows what a future AAI report could look like: per-axis scores, composite score, $\kappa$, closure pass/fail, tool success under drift, plan-depth distribution, retrieval recall, and cost breakdown.

It should not be read as main empirical evidence. The paper explicitly says it does not evaluate real agents and that the numbers are synthetic. That sentence matters. Without it, readers will be tempted to treat the table as a soft ranking of current systems. It is not.

What the synthetic table does reveal is the intended diagnostic style. The composite score stays low when critical axes are zero. This is not a bug. It is the scale saying: no, a system does not get to be called broadly autonomous just because it is fast, fluent, and occasionally impressive. Missing self-revision or coordination should hurt. Missing memory should hurt. Missing safety should hurt where embodiment is in scope.

The business lesson is straightforward: capability profiles matter more than headline scores. A vendor selling an “orchestrator” should be able to show where the orchestration premium appears, whether it survives matched resource budgets, and whether deadlock or chatter penalties were measured. Otherwise, one may simply be buying a committee of agents that takes longer to be wrong.

The delegability frontier is the paper’s most business-ready idea

The AAI-Index gives a compact score, but real deployment decisions are not made at the level of a single composite number. They are made at quality thresholds.

Can this agent handle invoice reconciliation at 98% quality without human review?

Can it manage support tickets up to a defined risk level with only exception handling?

Can it operate a software maintenance workflow for thirty days while preserving SLA quality and cost limits?

The paper addresses this through the Delegability Frontier. Instead of asking only “how capable is the system?”, the frontier asks: at each level of autonomy demand, what quality can the system sustain?

That creates a curve in autonomy-quality space. A frontier shift means that at the same quality bar, more work can be delegated; or at the same autonomy level, quality improves. The paper also proposes summary measures such as area under the frontier and fraction delegable at a target quality.
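
Both summary measures are easy to sketch once the frontier is a list of (autonomy demand, sustained quality) points. The points below are invented, autonomy demand is assumed to be scaled to [0, 1], and reading “fraction delegable” as the highest autonomy level that still meets the quality bar is one interpretation rather than the paper's stated definition.

```python
def frontier_area(points):
    """Trapezoidal area under (autonomy, quality) points, sorted by autonomy."""
    pts = sorted(points)
    return sum(
        (a1 - a0) * (q0 + q1) / 2
        for (a0, q0), (a1, q1) in zip(pts, pts[1:])
    )


def fraction_delegable(points, target_quality):
    """Highest autonomy level whose sustained quality meets the target, else 0.0."""
    eligible = [a for a, q in points if q >= target_quality]
    return max(eligible, default=0.0)


# Hypothetical frontier: quality erodes as more of the workflow is delegated.
frontier = [(0.0, 0.99), (0.25, 0.98), (0.5, 0.95), (0.75, 0.85), (1.0, 0.6)]
print(frontier_area(frontier))
print(fraction_delegable(frontier, target_quality=0.95))
print(fraction_delegable(frontier, target_quality=0.98))
```

Raising the bar from 0.95 to 0.98 halves the delegable share in this toy curve, which is the kind of trade-off a single composite score hides.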

This is the part enterprises should steal first.

A composite AAI score may be useful for benchmarking, but a delegability frontier maps more directly to operating decisions. It tells a manager not just whether the agent is “better,” but whether it can absorb a larger slice of workflow without lowering the quality bar. That is a procurement question, a staffing question, a risk question, and a roadmap question.

It also prevents a common failure mode in AI adoption: celebrating automation rate while quietly degrading quality. A system that handles 90% of tasks at mediocre quality has not necessarily improved the business. It may have merely moved work from visible human labour into invisible exception cleanup. The frontier forces both dimensions into the same picture.

The theorem is real mathematics, but the assumptions carry the weight

The paper’s most dramatic claim is its theorem: under certain conditions, an AAI-3 “baby AGI” reaches AAI-4 and eventually AAI-5 in finite time.

This is the section most likely to be misread. The theorem does not say that any plausible near-term agent will inevitably become superintelligent. It says that if a certified AAI-3 system satisfies a set of strong dynamic and structural assumptions, then finite-resource progression follows.

Those assumptions include a rate-escape condition, a resource-rate floor, axis responsiveness, persistent closure protocols, continuing innovation throughput, and superhuman margins that grow with capability. In other words, the system must not only improve; its self-improvement rate must behave favourably, resources must keep arriving, every relevant axis must respond to improvement pressure, closure properties must persist, and the path toward superhuman coverage must remain smooth enough to cross.

Once those assumptions are granted, the proof is not mystical. It is monotonicity and calculus. The capability score is mapped through a monotone link such as logit or surprisal so that progress near the ceiling is better behaved. If the rate of progress in link space stays bounded below by a positive value, a finite amount of resource can close a finite gap. If axes remain responsive, thresholds are crossed. If closures persist, gates remain valid. If resources accrue at a minimum rate, resource bounds convert into time bounds.
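
A stripped-down version of that chain, under simplifying assumptions that belong to this sketch and not to the paper: the link is the logit $\ell(I) = \ln\frac{I}{1-I}$, progress in link space per unit of resource $R$ is at least a constant $\underline{\kappa} > 0$, and resources arrive at a rate of at least $r_{\min} > 0$. The symbols $\underline{\kappa}$, $r_{\min}$, $R^\star$, and $T^\star$ are introduced here for illustration only. Writing $R^\star$ for the resource needed to move the composite from $I_0$ to $I_1$ and $T^\star$ for the time needed to accumulate it:

$$ \frac{d\ell}{dR} \ge \underline{\kappa} \;\Longrightarrow\; R^\star \le \frac{\ell(I_1) - \ell(I_0)}{\underline{\kappa}}, \qquad \frac{dR}{dt} \ge r_{\min} \;\Longrightarrow\; T^\star \le \frac{R^\star}{r_{\min}} $$

With made-up numbers, moving from $I_0 = 0.5$ to $I_1 = 0.9$ is a logit gap of $\ln 9 \approx 2.20$. At $\underline{\kappa} = 0.1$ per resource unit that needs at most about 22 units, and at $r_{\min} = 2$ units per day at most about 11 days. The same arithmetic shows why only thresholds strictly below the ceiling are covered: as $I_1 \to 1$, $\ell(I_1) \to \infty$ and the bound stops being finite.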

The theorem therefore has value, but not as prophecy. It clarifies what would need to be true for “baby AGI becomes superintelligence” to be more than campfire futurism. The uncomfortable answer is: quite a lot.

For business readers, the theorem is less useful for predicting take-off than for structuring monitoring. It says which telemetry would matter if a system were genuinely on an autonomous improvement trajectory: not only current capability, but link-rate, curvature, resource-normalized improvement, axis responsiveness, closure persistence, and innovation throughput.

That is a more serious dashboard than “model got better this quarter.”

The CHC bridge keeps cognition in the room without letting it drive the bus

One interesting part of the paper is its relationship to CHC-style definitions of AGI. CHC, or Cattell-Horn-Carroll, is a framework from psychometrics that breaks cognitive ability into broad domains such as fluid reasoning, working memory, long-term storage, retrieval, processing speed, and knowledge.

The paper argues that CHC-style cognitive assessment and AAI-style deployed autonomy are complementary. CHC asks what the system can think. AAI asks what the system can reliably do in open-ended, tool-using, economically constrained settings.

That distinction is useful. A system may have strong reasoning performance but weak deployment reliability. Another may have excellent tool throughput but poor long-horizon memory or retrieval fidelity. For AGI claims, neither profile should get a free pass.

The paper proposes importing CHC-like micro-batteries into OWA-Bench: working-memory tests, delayed recall, verified retrieval, adversarial distractors, reasoning probes, and fidelity tests. It also proposes confabulation and retrieval gates, plus a jaggedness penalty to discourage uneven profiles. This is a useful correction to a common enterprise blind spot: tool use can hide cognitive weakness. A system may look productive because it can call APIs quickly while still losing track of facts, commitments, and reasons.

Business translation: do not evaluate agents only by workflow completion. Evaluate whether they remember, retrieve, verify, and reason under delay and distraction. Otherwise, the organization may automate execution while leaving epistemic reliability behind. That is not transformation. That is outsourcing confusion to software.

What Cognaptus would take into an enterprise evaluation

The paper does not hand enterprises an off-the-shelf procurement checklist. But it gives the structure for one.

A practical enterprise adaptation would separate three layers.

| Layer | What the paper directly provides | Cognaptus interpretation for business use | What remains uncertain |
| --- | --- | --- | --- |
| Capability profile | Ten normalized axes and a geometric composite | Require vendors to show per-axis evidence, not just benchmark averages | Calibration anchors and weights need domain validation |
| Improvement dynamics | $\kappa$, curvature, link-step progression, closure tests | Track whether agents improve through auditable internal processes | Measuring “agent-initiated” resource use will be contested |
| Delegation economics | Delegability Frontier and economic throughput | Decide what work can be delegated at a fixed quality and cost bar | Real workloads may need custom task generators and risk tiers |

For enterprise AI leaders, the most actionable shift is from product category to evidence bundle. A serious agent evaluation should ask for:

  1. Per-axis scores with raw metrics shown before normalization.
  2. Drift tests, not just static task success.
  3. Maintenance closure over a meaningful operating window.
  4. Expansion closure with ablation.
  5. Resource accounting that separates agent work from human intervention.
  6. Delegability frontier at business-defined quality thresholds.
  7. Cost-normalized throughput under fixed infrastructure and price caps.
  8. Logs, diffs, seeds, manifests, and reproducible replay where possible.
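
For teams that want to operationalize the list, a completeness check is a reasonable first step. The key names below are hypothetical labels for the eight items above, not a schema from the paper.

```python
# Hypothetical labels for the eight evidence items listed above.
REQUIRED_EVIDENCE = (
    "per_axis_scores_with_raw_metrics",
    "drift_tests",
    "maintenance_closure",
    "expansion_closure_with_ablation",
    "resource_accounting",
    "delegability_frontier",
    "cost_normalized_throughput",
    "logs_diffs_seeds_manifests",
)


def missing_evidence(bundle):
    """Return the required items that are absent or empty in a vendor's bundle."""
    return [item for item in REQUIRED_EVIDENCE if not bundle.get(item)]


# A made-up submission that only covers static results and logs.
vendor_bundle = {
    "per_axis_scores_with_raw_metrics": "scores.csv",
    "logs_diffs_seeds_manifests": "artifacts.tar.gz",
    "drift_tests": "",  # declared but empty, so it still counts as missing
}
print(missing_evidence(vendor_bundle))
```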

This is a high bar. It should be. The whole point of autonomy is that responsibility moves away from continuous human control. If evidence standards do not rise with delegation, the result is not innovation. It is unmanaged operational leverage.

The main limitation is not ambition; it is implementation debt

The paper’s ambition is obvious. Its biggest limitation is equally obvious: the framework is specification-first. It defines objects, estimands, scoring rules, task families, gates, dynamics, and reporting templates, but it does not release actual OWA-Bench tasks, generators, drift calendars, seeds, scoring harnesses, or empirical evaluations of real agents.

That means the framework is not yet a benchmark. It is a benchmark constitution.

There are also several hard practical problems.

First, anchors matter. Normalizing axes to $[0,1]$ requires baseline and target references. If anchors are chosen poorly, the scale becomes either too forgiving or absurdly strict. A leaderboard window can freeze anchors, but domains evolve. The rebaselining problem does not go away; it becomes governance.
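
The sensitivity to anchors is easy to see with a linear, clipped mapping. That mapping is an assumption of this sketch, as are the anchor values; the paper's normalization may take a different shape.

```python
def normalize(raw, baseline, target):
    """Map a raw metric to [0, 1]: the baseline anchor maps to 0, the target anchor to 1."""
    if target == baseline:
        raise ValueError("baseline and target anchors must differ")
    x = (raw - baseline) / (target - baseline)
    return min(1.0, max(0.0, x))


raw_success_rate = 0.7
print(normalize(raw_success_rate, baseline=0.5, target=0.9))   # strict anchors
print(normalize(raw_success_rate, baseline=0.2, target=0.75))  # forgiving anchors
print(normalize(3.0, baseline=10.0, target=2.0))               # lower-is-better metric
```

The same raw 0.7 lands near 0.5 under the strict anchors and above 0.9 under the forgiving ones, so whoever sets the anchors effectively sets the grade.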

Second, weights are political. A geometric mean penalizes lopsidedness, which is sensible, but weight choices decide what counts as lopsided. In software-only settings, embodiment may be optional and weights may shift toward planning, memory, and tool economy. In robotics, safety and sim-to-real transfer must not be decorative. Different industries will legitimately require different weight maps.

Third, self-improvement attribution is difficult. The paper’s ablation requirements are exactly right, but operationally painful. Modern AI systems are stacks of prompts, models, retrieval stores, tools, policies, caches, human feedback, and infrastructure changes. Isolating the causal effect of an agent’s own revision will require disciplined experiment design. Many organizations will claim self-improvement long before they can prove it.

Fourth, long-horizon benchmarks are expensive. Thirty-day software projects, drifting APIs, multi-agent coordination, and sim-to-real robotics tests are not cheap. But that is partly the point. Cheap tests tend to measure cheap competence.

Finally, the theorem depends on assumptions that may fail. Axis responsiveness can saturate. Tool innovation can stall. Closure can break under distribution shift. Superhuman margins may not grow smoothly with composite capability. Resource floors can disappear when budgets meet CFOs, nature’s finest anti-takeoff mechanism.

These limitations do not make the framework useless. They define the work required before it becomes operationally authoritative.

The real contribution is vocabulary discipline

The most valuable part of the paper is not that it borrows the Kardashev aesthetic. It is that it tries to discipline a messy vocabulary.

Autonomy becomes uninterrupted action under a task battery, not a mood.

Generality becomes coverage across families, not a slogan.

Self-improvement becomes debiased capability gain per agent-initiated resource, with ablations.

AGI becomes a set of gates, not a press release.

Superintelligence becomes a regime requiring statistically significant superhuman coverage, improvement curvature, economic dominance, and innovation throughput under fixed evaluation constraints.

Will this exact framework become the standard? Too early. It has too much unimplemented machinery, too many calibration choices, and no empirical benchmark release yet. But as a direction of travel, it is healthy. The AI industry badly needs fewer adjectives and more audit artifacts.

The business relevance is therefore not “this paper tells us when AGI arrives.” It does not. The relevance is that it sketches how serious organizations might evaluate agentic AI without being hypnotised by demos. A system should be able to show what it can do, where it fails, how it improves, whether those improvements are causal, how much work can be delegated at a fixed quality bar, and what happens when the environment refuses to sit still.

That is not as cinematic as a Kardashev civilisation harvesting stars. It is much more useful for deciding whether to let an AI agent touch your production workflows.

And frankly, before we ask whether the machine can become superintelligent, it would be nice to know whether it can survive a renamed API field without summoning a human.

Cognaptus: Automate the Present, Incubate the Future.


  1. Przemyslaw Chojecki, “An Operational Kardashev-Style Scale for Autonomous AI - Towards AGI and Superintelligence,” arXiv:2511.13411, 2025.