Demo days are generous. A sales engineer opens a prepared workflow, the agent clicks through a familiar sequence, the dashboard turns green, and everyone politely pretends not to notice how much of the intelligence was smuggled into the setup.
ARC-AGI-3 is less polite.
The paper introduces an interactive benchmark for agentic intelligence: not a static puzzle, not a multiple-choice exam, and not a coding task with a unit test waiting like a benevolent parent. An agent enters a novel, abstract, turn-based environment. It receives no explicit objective. It must explore, infer the rules, identify what counts as success, build a working model of the environment, and execute a plan efficiently.[1]
That last word matters: efficiently.
The headline result is easy to quote and easy to misunderstand. At release, frontier models score below 1% on the official ARC-AGI-3 leaderboard, while humans can solve the environments under first-exposure conditions. That sounds like another “AI still lacks common sense” story. Tempting, yes. Also too cheap.
The more useful reading is sharper: ARC-AGI-3 separates three things that the market often bundles together under the flattering label of “agentic AI.” First, whether a model can adapt to a genuinely new task without being specially prepared. Second, whether a surrounding harness can make that model useful in a narrow environment. Third, whether a benchmark score is measuring intelligence or the amount of human design hidden in the wrapper.
For business leaders evaluating agents, that separation is the whole point.
ARC-AGI-3 changes the test from answering to discovering
Earlier ARC benchmarks tested abstraction from static examples. ARC-AGI-1 asked systems to infer transformation rules from small grid examples. ARC-AGI-2 increased the reasoning depth of that static format. ARC-AGI-3 moves into interaction.
That shift sounds modest until you notice what disappears.
In static benchmarks, the task frame is already supplied. The system knows that there is a puzzle, knows where the inputs and outputs are, and knows that the answer is some transformation of the visible data. In ARC-AGI-3, the agent sees a 64-by-64 grid with 16 possible colors and must act through a small set of permitted actions: key-like moves, undo, and sometimes cell selection. The interface is intentionally simple. The difficulty is not in dexterity or perception. It is in discovering what the environment is.
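To make that interface concrete, here is a minimal sketch of how an observation and an action could be represented. The 64-by-64 size, the 16 colors, and the three action kinds come from the description above; every name in the code (`ActionKind`, `Action`, `is_valid_frame`) is hypothetical and is not the benchmark's actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional, Tuple

GRID_SIZE = 64   # frame is 64 by 64 cells
NUM_COLORS = 16  # each cell holds one of 16 color indices


class ActionKind(Enum):
    MOVE = auto()         # a key-like move
    UNDO = auto()         # revert the previous action
    SELECT_CELL = auto()  # available only in some environments


@dataclass(frozen=True)
class Action:
    kind: ActionKind
    key: Optional[int] = None               # which key, for MOVE (hypothetical encoding)
    cell: Optional[Tuple[int, int]] = None  # (row, col), for SELECT_CELL


def is_valid_frame(frame: List[List[int]]) -> bool:
    """Check that a frame is a 64x64 grid of color indices in 0..15."""
    return len(frame) == GRID_SIZE and all(
        len(row) == GRID_SIZE and all(0 <= c < NUM_COLORS for c in row)
        for row in frame
    )


blank = [[0] * GRID_SIZE for _ in range(GRID_SIZE)]
print(is_valid_frame(blank))
```

Notice what the types leave out: there is no field for the goal, the rules, or a reward. That absence is the benchmark.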
The paper frames the target capability through four functions:
| Function | What ARC-AGI-3 demands | Why static benchmarks under-test it |
|---|---|---|
| Exploration | Probe the environment to reveal mechanics | Static tasks passively expose the relevant information |
| Modeling | Form a causal model of state transitions | Static tasks usually reduce the world to input-output mapping |
| Goal inference | Work out what success even means | Most benchmarks state the objective directly |
| Planning and execution | Reach the goal without wasting actions | Many benchmarks reward final correctness more than path quality |
This is why ARC-AGI-3 is not merely “ARC, but game-like.” The benchmark attacks a different failure mode. A model can be impressive at answering questions and still be poor at deciding which questions need to be asked. Anyone who has watched an agent confidently loop through the same browser failure for eight minutes has already seen the commercial version of this problem. The benchmark simply makes the embarrassment measurable.
Completion is not the same as intelligence when the path is wasteful
The paper’s scoring method, RHAE — Relative Human Action Efficiency — is the most business-relevant contribution because it rejects the comforting binary of solved versus unsolved.
For each level completed by an AI system, the score compares the number of AI actions to a human baseline, defined as the upper-median best human action count for that level. The level score uses a squared efficiency term, so inefficient solutions collapse quickly. If a human baseline is 10 actions and the AI takes 100 actions, the score is not a forgiving 10%. It is 1%.
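The arithmetic is short enough to sketch. The function names below are illustrative, and treating "upper-median" as Python's `statistics.median_high` is an assumption about the paper's definition, not something the paper confirms.

```python
from statistics import median_high


def human_baseline(best_action_counts):
    """Upper median of each participant's best action count for one level.

    Assumption: "upper-median" means the higher of the two middle values
    when the number of participants is even.
    """
    return median_high(best_action_counts)


def level_efficiency_score(human_actions, ai_actions):
    """Squared ratio of human to AI action counts, before any capping."""
    if human_actions <= 0 or ai_actions <= 0:
        raise ValueError("action counts must be positive")
    return (human_actions / ai_actions) ** 2


baseline = human_baseline([8, 10, 10, 14])     # hypothetical human results
score = level_efficiency_score(baseline, 100)  # AI needed 100 actions
print(baseline, f"{score:.2%}")
```

With a baseline of 10 and 100 AI actions, the linear ratio is 0.10 and the squared score is 0.01, which is the 1% quoted above.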
The exact benchmark score then adds several design choices: per-level normalization, weighting later levels more heavily, capping unusually high per-level scores at 115% of human baseline, and capping each environment score by the weighted fraction of levels completed. The metric is doing more than grading. It is encoding a theory of intelligence: adaptation that requires huge amounts of trial-and-error is not yet human-like adaptation.
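How those choices might combine for one environment can be sketched as follows. This is a reading of the prose, not the paper's formula: the linear level weights, the treatment of uncompleted levels as zero, and the placement of the 115% cap on the squared per-level score are all assumptions made for illustration.

```python
LEVEL_CAP = 1.15  # assumed reading of "115% of human baseline"


def environment_score(levels):
    """Aggregate per-level efficiency into one environment score.

    levels: list of (human_actions, ai_actions) in level order, where
    ai_actions is None if the AI did not complete that level.
    """
    if not levels:
        raise ValueError("an environment needs at least one level")
    weights = range(1, len(levels) + 1)  # hypothetical: later levels weigh more
    total_weight = sum(weights)
    weighted_score = 0.0
    completed_weight = 0
    for weight, (human, ai) in zip(weights, levels):
        if ai is None:
            continue  # uncompleted level contributes nothing
        completed_weight += weight
        weighted_score += weight * min((human / ai) ** 2, LEVEL_CAP)
    # The score cannot exceed the weighted fraction of levels completed.
    return min(weighted_score / total_weight, completed_weight / total_weight)


# Level 1 matched the human, level 2 took twice as long, level 3 unsolved.
print(environment_score([(10, 10), (20, 40), (30, None)]))
```

Under these assumptions, an agent that beats the human baseline on every level still tops out at 1.0, because the completion cap binds before the per-level bonus can accumulate.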
That matters because many business agent demos quietly optimize for eventual completion. Give the system enough retries, enough tools, enough prompt glue, enough human-written heuristics, and it may eventually reach the desired state. For operations teams, eventual completion can still be valuable. A slow agent that clears invoices overnight is not useless. But it is not the same capability as a system that understands a new process after a few interactions.
ARC-AGI-3 makes that distinction explicit:
| Evaluation question | Traditional demo answer | ARC-AGI-3-style answer |
|---|---|---|
| Did the agent finish? | Yes or no | How many actions did it need, relative to humans? |
| Did the model reason? | It produced a plausible chain | Did exploration reduce uncertainty efficiently? |
| Did the system generalize? | It worked in the demo environment | Did it work on unseen, private, out-of-distribution environments? |
| Was the solution autonomous? | The agent clicked the buttons | How much human strategy was embedded in the harness? |
The square in RHAE is a small mathematical insult with a serious purpose. It says: brute force is not just inelegant; it is evidence. If a system needs ten times the human action count, it may be solving, but it is not adapting like a human. Apparently, “eventually got there after exhausting the state space” is not a synonym for intelligence. A harsh standard. Also a useful one.
Public demos are not official evidence, and that distinction is not pedantry
ARC-AGI-3 divides its environments into a public demonstration set, a semi-private set, and a fully private set. The public demo set contains 25 environments. The semi-private and fully private sets each contain 55 environments. This is not a minor administrative detail. It is the benchmark’s immune system.
The public set is intentionally a front door: accessible, engaging, and useful for showing the benchmark format. The private sets are the actual evaluation instruments. They are harder, broader, and intentionally out-of-distribution relative to the public environments. The paper also notes that ARC-AGI-3 inverts the older public-to-private balance: public data becomes demonstration material, while private environments become the basis for meaningful evaluation.
This design responds to a familiar pathology. Once a benchmark becomes famous, it stops being a neutral measuring device and becomes a training target. Static benchmarks are especially vulnerable. Models can be trained on similar examples, synthetic variants, reasoning traces, or benchmark-shaped distributions. Performance rises, but the measurement becomes contaminated.
ARC-AGI-3’s answer is not “make the puzzles secret and hope.” It uses several layers of protection:
| Design choice | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Public demo environments | Format familiarization | Researchers can understand the interface | Public scores measure AGI progress |
| Semi-private environments | API-based frontier testing | Lower-leakage comparison across models | Absolute protection from contamination |
| Fully private environments | Official competition evaluation | Stronger out-of-distribution scoring | Real-world business ROI |
| Core-knowledge-only design | Avoid language/cultural dependence | Tests abstract adaptation rather than acquired trivia | All human cognition is captured |
| Human calibration | Ensure environments are solvable by people | The benchmark is hard for AI but not obscure for humans | Every failure mode is fully diagnosed |
This is one of the paper’s more mature points. Benchmark design is no longer about finding a hard test. It is about designing a test that remains hard after the industry has incentives to game it.
For businesses, the analogy is direct. A vendor demo is the public set. A pilot on your own messy data is the semi-private set. A surprise workflow that was not included in the prompt-engineering theater is the fully private set. Guess which one tells you whether the agent will survive contact with operations.
The harness problem: useful engineering, bad intelligence measurement
The paper is unusually clear about a distinction the AI industry often blurs because blurring it sells better.
Harnesses matter. A harness can manage context, store state, call tools, compress history, run code, search over actions, or inject human-designed strategies. Good harnesses can make models much more useful. The paper even describes pre-launch and community approaches that improved performance on public environments: graph-based exploration, reinforcement-style action prediction, orchestrator-subagent systems, and context-management wrappers.
But the official ARC-AGI-3 leaderboard deliberately does not treat harness-driven performance as evidence of general intelligence.
The reason is not snobbery. It is measurement hygiene.
If a researcher sees public ARC-AGI-3 environments and then handcrafts a strategy around them, the resulting system may perform well for reasons that do not transfer. The paper gives a telling example: in one environment variant, Opus 4.6 with a Duke harness reportedly jumps from 0.0% to 97.1%; in another environment, the same model remains at 0.0% with or without the harness. The interpretation is not that the harness is worthless. It is that targeted scaffolding can unlock a seen pattern without producing general adaptation.
That is exactly the distinction businesses need.
A domain-specific harness may be commercially valuable. If a claims-processing agent becomes reliable because engineers add state tracking, validation rules, document parsers, and escalation policies, excellent. That is automation engineering. It may generate ROI. It may deserve budget.
But do not confuse it with a generally intelligent agent that can enter any new workflow and figure things out. The former is a product architecture. The latter is the measurement target of ARC-AGI-3. Mixing them together creates beautiful slide decks and terrible procurement decisions.
A practical translation looks like this:
| System improvement | What it may mean commercially | What it does not automatically mean |
|---|---|---|
| Better prompt | Cheaper task setup | Better autonomous adaptation |
| Tool-using harness | More reliable domain execution | General intelligence progress |
| Memory/context compression | Longer workflow viability | Correct goal inference in new tasks |
| Synthetic task training | Better benchmark-shaped behavior | Robust out-of-distribution reasoning |
| Private workflow testing | Better deployment evidence | Universal agent capability |
This is where ARC-AGI-3 becomes more than a research benchmark. It gives buyers a language for separating “the model can do this” from “our engineers built enough scaffolding that the system works here.” Both can be useful. Only one should be sold as general agency.
The evidence is a benchmark construction story, not just a leaderboard story
The current Cognaptus article on this topic emphasized the sub-1% frontier model scores. That number deserves attention, but on its own it invites shallow interpretation. Low score, big gap, AI not ready. Fine. Also incomplete.
The stronger evidence is how carefully the benchmark attempts to avoid false difficulty and false progress.
First, the environments are designed around core knowledge priors: objectness, basic geometry and topology, intuitive physics, and agentness. They avoid language, numbers, recognizable symbols, cultural conventions, and familiar game references. This is meant to prevent the task from collapsing into learned trivia.
Second, the environments are tested for accidental solvability. Random-play runs of up to 50,000 steps, and then 1,000,000 steps, check that non-tutorial levels are not beaten by luck. A broader 1,000,000-step sweep also serves as fuzz testing for crashes and malformed transitions. Graph-based state-space exploration estimates reachability and random win probabilities, with an acceptance threshold: random play should solve a level no more often than 1 time in 10,000.
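The acceptance rule itself is simple to express. The sketch below is a plain Monte Carlo estimate, not the paper's graph-based procedure, and the `play_episode` callables are stand-ins for real environment rollouts; only the 1-in-10,000 threshold is taken from the text.

```python
import random

MAX_RANDOM_WIN_RATE = 1 / 10_000  # threshold quoted in the text


def random_win_rate(play_episode, episodes, seed=0):
    """Fraction of random-play episodes that end in a win.

    play_episode: hypothetical callable that takes a random.Random and
    returns True if random actions happened to solve the level.
    """
    rng = random.Random(seed)
    wins = sum(1 for _ in range(episodes) if play_episode(rng))
    return wins / episodes


def accept_level(play_episode, episodes=20_000):
    """Accept a level only if random play rarely solves it."""
    return random_win_rate(play_episode, episodes) <= MAX_RANDOM_WIN_RATE


def never_wins(rng):
    return False  # stand-in for a level random play cannot solve


def lucky_one_percent(rng):
    return rng.random() < 0.01  # stand-in for a level that is too easy to stumble through


print(accept_level(never_wins), accept_level(lucky_one_percent))
```

A sampled estimate this close to a 1-in-10,000 threshold is noisy unless the episode count is large, which is presumably one reason to pair random runs with state-space analysis.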
Third, human calibration is continuous rather than symbolic. The paper reports 486 unique participants, 414 candidate environments, and 2,893 total environment attempts. Candidate environments must pass an “easy for humans” bar. Each included environment is attempted by 10 people, and only environments fully solved by at least two independent participants are considered for inclusion. Many are solved by six or more. Testing sessions have time limits, compensation, incentives, exclusion of low-effort behavior, and replay review to diagnose where humans get stuck.
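The inclusion rule in that paragraph reduces to a small filter. The record type and field names here are hypothetical; the "at least two independent participants" threshold is the one stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    participant_id: str
    fully_solved: bool


def eligible_for_inclusion(attempts, min_solvers=2):
    """True if enough distinct participants fully solved the environment."""
    solvers = {a.participant_id for a in attempts if a.fully_solved}
    return len(solvers) >= min_solvers


# Ten hypothetical participants, two of whom solved every level.
attempts = [Attempt(f"p{i}", fully_solved=(i < 2)) for i in range(10)]
print(eligible_for_inclusion(attempts))
```

Counting distinct participant IDs, rather than solved attempts, is what keeps one persistent person from qualifying an environment alone.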
This matters because a benchmark can fail in two opposite ways. It can be too easy for models, in which case it saturates. Or it can be arbitrarily hard for everyone, in which case failure tells us little. ARC-AGI-3 is trying to occupy the more informative middle: solvable for humans under first exposure, resistant to random play, and difficult for frontier AI without special scaffolding.
That construction process is not glamorous. It is also the reason the result has teeth.
The official scores are small, but the business lesson is not “stop using agents”
At release, the paper reports official semi-private scores below 1% for several frontier systems: Anthropic Opus 4.6 at 0.50%, Google Gemini 3.1 Pro Preview at 0.40%, OpenAI GPT-5.4 High at 0.20%, and xAI Grok-4.20 Beta at 0.10%.
Those exact model names and scores should be read as a time-stamped benchmark snapshot, not as a universal law of AI. The more durable result is the shape of the failure: models struggle when the environment is novel, the goal is unstated, the interaction history matters, and efficient exploration is required.
That is also the shape of many business workflows after the demo is over.
The practical inference for companies is not “agents are useless.” It is more precise:
| What the paper directly shows | Cognaptus inference for business use | Boundary |
|---|---|---|
| Current frontier models score poorly on official ARC-AGI-3 without task-specific harnesses | First-contact adaptation remains weak; do not assume an agent can discover unfamiliar workflow logic by itself | ARC-AGI-3 is abstract, not a direct office workflow benchmark |
| Harnesses can improve performance on public environments | Domain scaffolding is a major near-term ROI path | Harness success may not transfer to unseen workflows |
| RHAE penalizes action waste sharply | Agent evaluation should measure retries, tool calls, resets, and human interventions, not just final completion | Business value may tolerate slow completion for low-risk batch tasks |
| Private/OOD environments are central to official measurement | Vendor pilots should include hidden test cases and messy edge cases | Private tests require careful design, not arbitrary traps |
| Human calibration anchors task solvability | Internal benchmarks need human baselines for comparison | Human baselines vary by domain expertise |
In other words, ARC-AGI-3 is a warning against lazy evaluation, not a ban on deployment.
A company can still automate structured workflows where the domain knowledge is available, correctness is verifiable, and the cost of mistakes is controlled. Coding agents, document extraction, compliance triage, invoice matching, report generation, market monitoring, and customer support workflows can all benefit from carefully engineered systems. But the evaluation question changes.
Do not ask only: “Can the agent complete the task?”
Ask:
- How many actions, retries, tool calls, and resets did it need?
- How much workflow-specific instruction was embedded by humans?
- Does performance survive on hidden cases not used during setup?
- Can the system identify the goal when the user does not specify it perfectly?
- Where does the agent fail: perception, memory, exploration, goal inference, or planning?
That last question is especially important. A failed agent is not a single category. It may have seen the relevant information but failed to store it. It may have stored it but failed to infer the rule. It may have inferred the rule but pursued the wrong objective. Or it may have planned correctly but executed inefficiently. ARC-AGI-3’s per-level and action-based framing pushes evaluation toward this diagnostic style.
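One way to keep that diagnosis honest is to label each failed run with the stage where it broke and then look at the distribution. The sketch below assumes a reviewer or automated check has already assigned the labels; the stage names mirror the question above and are not an official taxonomy.

```python
from collections import Counter
from enum import Enum


class FailureStage(Enum):
    PERCEPTION = "perception"
    MEMORY = "memory"
    EXPLORATION = "exploration"
    GOAL_INFERENCE = "goal_inference"
    PLANNING = "planning"


def failure_profile(labels):
    """Share of failed runs attributed to each stage (empty dict if none)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {stage.value: counts[stage] / total for stage in FailureStage}


# Hypothetical review of four failed runs.
profile = failure_profile([
    FailureStage.MEMORY,
    FailureStage.MEMORY,
    FailureStage.GOAL_INFERENCE,
    FailureStage.PLANNING,
])
print(profile)
```

A profile dominated by memory failures argues for a different fix than one dominated by goal inference, which is the practical payoff of not treating "failed" as one bucket.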
That is the part businesses should copy.
A better enterprise agent test should look more like ARC-AGI-3 than a product demo
For enterprise use, we do not need to clone ARC-AGI-3. Abstract colored-grid environments are useful for measuring general adaptation, but they are not a replacement for domain-specific validation. The better move is to borrow the evaluation philosophy.
A serious enterprise agent test should include four layers.
| Layer | Enterprise equivalent | What it catches |
|---|---|---|
| Public demo | Vendor-guided workflow | Basic interface and tool competence |
| Semi-private test | Company-designed pilot cases | Adaptation to local data and process variation |
| Hidden OOD test | Unseen edge cases and changed assumptions | Overfitting to the pilot script |
| Human baseline | Skilled employee completion path | Whether automation is efficient, not merely possible |
Then add an action-efficiency ledger. Count not only final accuracy but process cost: number of tool calls, failed actions, escalations, clarification requests, repeated observations, user interventions, and recovery attempts. For many workflows, this is closer to ROI than a final success rate.
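A ledger like this can start as a per-case counter. The sketch below is one possible shape, with event names taken from the list above; the class, its methods, and the choice to weight every event equally are assumptions to adapt to the workflow.

```python
from collections import Counter
from dataclasses import dataclass, field

EVENTS = (
    "tool_call",
    "failed_action",
    "escalation",
    "clarification_request",
    "repeated_observation",
    "user_intervention",
    "recovery_attempt",
)


@dataclass
class ActionLedger:
    case_id: str
    succeeded: bool = False
    counts: Counter = field(default_factory=Counter)

    def record(self, event, n=1):
        """Add n occurrences of a known event type."""
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        self.counts[event] += n

    def total_actions(self):
        """Unweighted process cost: every recorded event counts once."""
        return sum(self.counts.values())


ledger = ActionLedger(case_id="invoice-0001")  # hypothetical case
ledger.record("tool_call", 5)
ledger.record("failed_action")
ledger.record("user_intervention")
ledger.succeeded = True
print(ledger.total_actions(), dict(ledger.counts))
```

Keeping `succeeded` next to the counts makes it hard to report the success rate without also reporting what it cost.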
A model that completes 90% of cases after 35 steps may be worse than a simpler rules-based system that completes 70% in 3 steps and escalates the rest cleanly. Intelligence theater becomes expensive when every “autonomous” success burns compute, latency, and staff attention in the background.
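Whether the slower agent is actually worse depends on prices, so it helps to write the comparison down. The cost model below is deliberately crude and every number in it is hypothetical: it assumes each case consumes the full step count whether or not it succeeds, and that every unfinished case is handed to a person at a flat cost.

```python
def expected_cost_per_case(success_rate, steps_per_case, cost_per_step, cost_per_handoff):
    """Expected cost when every unfinished case goes to a human."""
    automation_cost = steps_per_case * cost_per_step
    handoff_cost = (1 - success_rate) * cost_per_handoff
    return automation_cost + handoff_cost


# Hypothetical prices: 0.05 per step, 4.00 per human-handled case.
agent = expected_cost_per_case(0.90, 35, 0.05, 4.00)
rules = expected_cost_per_case(0.70, 3, 0.05, 4.00)
print(f"agent={agent:.2f} rules={rules:.2f}")
```

At these prices the rules-based system is cheaper per case. Raise the handoff cost to 20.00 and the ordering flips, which is the point: the comparison has to be run with your own numbers rather than settled by the success rate alone.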
This is where the paper’s distinction between official and community leaderboards becomes commercially useful. Companies should maintain both views internally:
- A general capability view: how well does the base model or standard agent architecture handle new cases with minimal workflow-specific preparation?
- A deployment engineering view: how much performance can we obtain by adding workflow-specific harnesses, tools, memory, validators, and escalation rules?
The first view informs strategic model selection and long-term platform bets. The second view informs near-term automation ROI. Confusing them is how organizations end up buying AGI and receiving a brittle macro with a chatbot attached.
The boundary: ARC-AGI-3 is a strong signal, not a universal business proxy
The paper is ambitious, but its business interpretation needs boundaries.
ARC-AGI-3 environments are abstract, visual, turn-based, and deliberately stripped of language and domain knowledge. That is a strength for measuring fluid adaptation, but it also means the benchmark does not directly predict performance in legal review, sales operations, financial analysis, procurement, or software maintenance. Those domains often reward acquired knowledge, tool familiarity, retrieval quality, and domain-specific verification — exactly the ingredients ARC-AGI-3 tries to minimize.
There is also a scoring boundary. RHAE treats action efficiency as the central resource. In embodied or interactive environments, that makes sense. In business contexts, the relevant resource may be different: elapsed time, compute cost, error cost, audit burden, user interruption, compliance risk, or opportunity cost. Counting actions is still useful, but it should be adapted to the workflow.
Finally, harness engineering should not be dismissed just because it is excluded from the official AGI measurement. The paper itself recognizes that harness research can be economically valuable and may later migrate behind model APIs as first-party capability. In commercial terms, that migration is already one of the most important product dynamics in AI: yesterday’s wrapper becomes tomorrow’s default model behavior, and everyone pretends it was there all along.
So the correct business conclusion is not pessimism. It is measurement discipline.
The real message: agents need less theater and better scorekeeping
ARC-AGI-3 does not show that AI agents are doomed. It shows that the industry has been using the word “agent” to cover too many different phenomena.
A model that follows instructions is not the same as a system that discovers the objective. A harness that solves a public environment is not the same as general intelligence. A workflow that completes after repeated retries is not the same as efficient adaptation. A benchmark that can be trained against is not the same as a benchmark that measures first-contact generalization.
The paper’s contribution is to make these distinctions operational. It gives us an interactive format, an efficiency-based score, private out-of-distribution evaluation, and human-calibrated solvability. More importantly, it gives business readers a useful test of their own AI strategy.
When evaluating an agent, stop asking whether it can perform after the environment has been quietly arranged around it.
Ask what happens when the instructions are incomplete, the goal is implicit, the path is unfamiliar, and every wasted action counts.
That is closer to work. Less glamorous than the demo, certainly. But reality usually has poor stage lighting.
Cognaptus: Automate the Present, Incubate the Future.
[1] ARC Prize Foundation Development Team, “ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence,” arXiv:2603.24621v2, 17 Apr. 2026, https://arxiv.org/html/2603.24621.