ARC-AGI-3 — When AI Stops Guessing and Starts Thinking

Demo days are generous. A sales engineer opens a prepared workflow, the agent clicks through a familiar sequence, the dashboard turns green, and everyone politely pretends not to notice how much of the intelligence was smuggled into the setup.

ARC-AGI-3 is less polite.

The paper introduces an interactive benchmark for agentic intelligence: not a static puzzle, not a multiple-choice exam, and not a coding task with a unit test waiting like a benevolent parent. An agent enters a novel, abstract, turn-based environment. It receives no explicit objective. It must explore, infer the rules, identify what counts as success, build a working model of the environment, and execute a plan efficiently.¹

That last word matters: efficiently.

The headline result is easy to quote and easy to misunderstand. At release, frontier models score below 1% on the official ARC-AGI-3 leaderboard, while humans can solve the environments under first-exposure conditions. That sounds like another “AI still lacks common sense” story. Tempting, yes. Also too cheap.

The more useful reading is sharper: ARC-AGI-3 separates three things that the market often bundles together under the flattering label of “agentic AI.” First, whether a model can adapt to a genuinely new task without being specially prepared. Second, whether a surrounding harness can make that model useful in a narrow environment. Third, whether a benchmark score is measuring intelligence or the amount of human design hidden in the wrapper.

For business leaders evaluating agents, that separation is the whole point.

ARC-AGI-3 changes the test from answering to discovering

Earlier ARC benchmarks tested abstraction from static examples. ARC-AGI-1 asked systems to infer transformation rules from small grid examples. ARC-AGI-2 increased the reasoning depth of that static format. ARC-AGI-3 moves into interaction.

That shift sounds modest until you notice what disappears.

In static benchmarks, the task frame is already supplied. The system knows that there is a puzzle, knows where the inputs and outputs are, and knows that the answer is some transformation of the visible data. In ARC-AGI-3, the agent sees a 64-by-64 grid with 16 possible colors and must act through a small set of permitted actions: key-like moves, undo, and sometimes cell selection. The interface is intentionally simple. The difficulty is not in dexterity or perception. It is in discovering what the environment is.

The paper frames the target capability through four functions:

Function	What ARC-AGI-3 demands	Why static benchmarks under-test it
Exploration	Probe the environment to reveal mechanics	Static tasks passively expose the relevant information
Modeling	Form a causal model of state transitions	Static tasks usually reduce the world to input-output mapping
Goal inference	Work out what success even means	Most benchmarks state the objective directly
Planning and execution	Reach the goal without wasting actions	Many benchmarks reward final correctness more than path quality

This is why ARC-AGI-3 is not merely “ARC, but game-like.” The benchmark attacks a different failure mode. A model can be impressive at answering questions and still be poor at deciding which questions need to be asked. Anyone who has watched an agent confidently loop through the same browser failure for eight minutes has already seen the commercial version of this problem. The benchmark simply makes the embarrassment measurable.

Completion is not the same as intelligence when the path is wasteful

The paper’s scoring method, RHAE — Relative Human Action Efficiency — is the most business-relevant contribution because it rejects the comforting binary of solved versus unsolved.

For each level completed by an AI system, the score compares the number of AI actions to a human baseline, defined as the upper-median best human action count for that level. The level score uses a squared efficiency term, so inefficient solutions collapse quickly. If a human baseline is 10 actions and the AI takes 100 actions, the score is not a forgiving 10%. It is 1%, because the 10/100 ratio is squared.

$$ \text{Level efficiency} \approx \left(\frac{\text{human baseline actions}}{\text{AI actions}}\right)^2 $$

The exact benchmark score then adds several design choices: per-level normalization, weighting later levels more heavily, capping unusually high per-level scores at 115% of human baseline, and capping each environment score by the weighted fraction of levels completed. The metric is doing more than grading. It is encoding a theory of intelligence: adaptation that requires huge amounts of trial-and-error is not yet human-like adaptation.
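A minimal Python sketch of that scoring logic, for intuition only. The squared ratio, the 115% cap, the heavier weighting of later levels, and the completion cap come from the description above. The linear weights (level 1 counts once, level 2 twice, and so on), the reading of "upper-median" as Python's `statistics.median_high`, and every function name here are assumptions made for illustration, not the paper's reference implementation.

```python
from statistics import median_high
from typing import Optional, Sequence, Tuple

SCORE_CAP = 1.15  # per-level ceiling described in the text


def human_baseline(best_action_counts: Sequence[int]) -> int:
    # ASSUMPTION: "upper-median" means the higher of the two middle values
    # when the number of human solvers is even.
    return median_high(best_action_counts)


def level_efficiency(baseline: int, ai_actions: int) -> float:
    """Squared ratio of human baseline actions to AI actions, capped."""
    if baseline <= 0 or ai_actions <= 0:
        raise ValueError("action counts must be positive")
    return min((baseline / ai_actions) ** 2, SCORE_CAP)


def environment_score(levels: Sequence[Tuple[int, Optional[int]]]) -> float:
    """levels: (human baseline, AI actions or None if unsolved), in level order."""
    if not levels:
        raise ValueError("an environment needs at least one level")
    weights = range(1, len(levels) + 1)  # ASSUMPTION: linear weights
    total = sum(weights)
    earned = 0.0
    completed = 0
    for weight, (baseline, ai_actions) in zip(weights, levels):
        if ai_actions is None:
            continue  # unsolved levels earn nothing
        earned += weight * level_efficiency(baseline, ai_actions)
        completed += weight
    # The environment score cannot exceed the weighted share of levels completed.
    return min(earned / total, completed / total)


print(level_efficiency(10, 100))
print(environment_score([(10, 10), (10, 100), (10, None)]))
```

With a baseline of 10 and 100 AI actions, the level term comes out at roughly 0.01, the 1% from the example above. In the three-level case, matching the human on level one and taking ten times the human action count on level two yields about 0.17 under the assumed weights, well below the 0.5 weighted share of levels completed.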

That matters because many business agent demos quietly optimize for eventual completion. Give the system enough retries, enough tools, enough prompt glue, enough human-written heuristics, and it may eventually reach the desired state. For operations teams, eventual completion can still be valuable. A slow agent that clears invoices overnight is not useless. But it is not the same capability as a system that understands a new process after a few interactions.

ARC-AGI-3 makes that distinction explicit:

Evaluation question	Traditional demo answer	ARC-AGI-3-style answer
Did the agent finish?	Yes or no	How many actions did it need, relative to humans?
Did the model reason?	It produced a plausible chain	Did exploration reduce uncertainty efficiently?
Did the system generalize?	It worked in the demo environment	Did it work on unseen, private, out-of-distribution environments?
Was the solution autonomous?	The agent clicked the buttons	How much human strategy was embedded in the harness?

The square in RHAE is a small mathematical insult with a serious purpose. It says: brute force is not just inelegant; it is evidence. If a system needs ten times the human action count, it may be solving, but it is not adapting like a human. Apparently, “eventually got there after exhausting the state space” is not a synonym for intelligence. A harsh standard. Also a useful one.

Public demos are not official evidence, and that distinction is not pedantry

ARC-AGI-3 divides its environments into a public demonstration set, a semi-private set, and a fully private set. The public demo set contains 25 environments. The semi-private and fully private sets each contain 55 environments. This is not a minor administrative detail. It is the benchmark’s immune system.

The public set is intentionally a front door: accessible, engaging, and useful for showing the benchmark format. The private sets are the actual evaluation instruments. They are harder, broader, and intentionally out-of-distribution relative to the public environments. The paper also notes that ARC-AGI-3 inverts the older public-to-private balance: public data becomes demonstration material, while private environments become the basis for meaningful evaluation.

This design responds to a familiar pathology. Once a benchmark becomes famous, it stops being a neutral measuring device and becomes a training target. Static benchmarks are especially vulnerable. Models can be trained on similar examples, synthetic variants, reasoning traces, or benchmark-shaped distributions. Performance rises, but the measurement becomes contaminated.

ARC-AGI-3’s answer is not “make the puzzles secret and hope.” It uses several layers of protection:

Design choice	Likely purpose	What it supports	What it does not prove
Public demo environments	Format familiarization	Researchers can understand the interface	Public scores measure AGI progress
Semi-private environments	API-based frontier testing	Lower-leakage comparison across models	Absolute protection from contamination
Fully private environments	Official competition evaluation	Stronger out-of-distribution scoring	Real-world business ROI
Core-knowledge-only design	Avoid language/cultural dependence	Tests abstract adaptation rather than acquired trivia	All human cognition is captured
Human calibration	Ensure environments are solvable by people	The benchmark is hard for AI but not obscure for humans	Every failure mode is fully diagnosed

This is one of the paper’s more mature points. Benchmark design is no longer about finding a hard test. It is about designing a test that remains hard after the industry has incentives to game it.

For businesses, the analogy is direct. A vendor demo is the public set. A pilot on your own messy data is the semi-private set. A surprise workflow that was not included in the prompt-engineering theater is the fully private set. Guess which one tells you whether the agent will survive contact with operations.

The harness problem: useful engineering, bad intelligence measurement

The paper is unusually clear about a distinction the AI industry often blurs because blurring it sells better.

Harnesses matter. A harness can manage context, store state, call tools, compress history, run code, search over actions, or inject human-designed strategies. Good harnesses can make models much more useful. The paper even describes pre-launch and community approaches that improved performance on public environments: graph-based exploration, reinforcement-style action prediction, orchestrator-subagent systems, and context-management wrappers.

But the official ARC-AGI-3 leaderboard deliberately does not treat harness-driven performance as evidence of general intelligence.

The reason is not snobbery. It is measurement hygiene.

If a researcher sees public ARC-AGI-3 environments and then handcrafts a strategy around them, the resulting system may perform well for reasons that do not transfer. The paper gives a telling example: in one environment variant, Opus 4.6 with a Duke harness reportedly jumps from 0.0% to 97.1%; in another environment, the same model remains at 0.0% with or without the harness. The interpretation is not that the harness is worthless. It is that targeted scaffolding can unlock a seen pattern without producing general adaptation.

That is exactly the distinction businesses need.

A domain-specific harness may be commercially valuable. If a claims-processing agent becomes reliable because engineers add state tracking, validation rules, document parsers, and escalation policies, excellent. That is automation engineering. It may generate ROI. It may deserve budget.

But do not confuse it with a generally intelligent agent that can enter any new workflow and figure things out. The former is a product architecture. The latter is the measurement target of ARC-AGI-3. Mixing them together creates beautiful slide decks and terrible procurement decisions.

A practical translation looks like this:

System improvement	What it may mean commercially	What it does not automatically mean
Better prompt	Cheaper task setup	Better autonomous adaptation
Tool-using harness	More reliable domain execution	General intelligence progress
Memory/context compression	Longer workflow viability	Correct goal inference in new tasks
Synthetic task training	Better benchmark-shaped behavior	Robust out-of-distribution reasoning
Private workflow testing	Better deployment evidence	Universal agent capability

This is where ARC-AGI-3 becomes more than a research benchmark. It gives buyers a language for separating “the model can do this” from “our engineers built enough scaffolding that the system works here.” Both can be useful. Only one should be sold as general agency.

The evidence is a benchmark construction story, not just a leaderboard story

The current Cognaptus article on this topic emphasized the sub-1% frontier model scores. That number deserves attention, but on its own it invites shallow interpretation. Low score, big gap, AI not ready. Fine. Also incomplete.

The stronger evidence is how carefully the benchmark attempts to avoid false difficulty and false progress.

First, the environments are designed around core knowledge priors: objectness, basic geometry and topology, intuitive physics, and agentness. They avoid language, numbers, recognizable symbols, cultural conventions, and familiar game references. This is meant to prevent the task from collapsing into learned trivia.

Second, the environments are tested for accidental solvability. Random regimes run up to 50,000 steps and then 1,000,000 steps to check that non-tutorial levels are not beaten by luck. A broader 1,000,000-step sweep also serves as fuzz testing for crashes and malformed transitions. Graph-based state-space exploration estimates reachability and random win probabilities, with an acceptance threshold that random play should not solve a level more often than 1 in 10,000 times.
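The random-play check can be approximated with a few lines of Python. This is a simplified stand-in, not the paper's tooling: it estimates the win rate by sampling random episodes rather than by graph-based state-space exploration, and the "combination lock" environment, the function names, and the episode counts are all hypothetical. Only the 1-in-10,000 threshold comes from the text.

```python
import random
from typing import Callable, Sequence

THRESHOLD = 1 / 10_000  # random play should not win more often than this

StepFn = Callable[[int], bool]  # takes an action, returns True on a win


def make_lock(code: Sequence[int]) -> Callable[[], StepFn]:
    """Hypothetical toy level: win by entering `code` without a mistake."""
    def make_env() -> StepFn:
        progress = 0

        def step(action: int) -> bool:
            nonlocal progress
            # A wrong action resets the lock; a right one advances it.
            progress = progress + 1 if action == code[progress] else 0
            return progress == len(code)

        return step

    return make_env


def random_win_rate(make_env: Callable[[], StepFn], actions: Sequence[int],
                    episodes: int, max_steps: int, seed: int = 0) -> float:
    """Fraction of random-play episodes that win within max_steps."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    wins = 0
    for _ in range(episodes):
        step = make_env()
        for _ in range(max_steps):
            if step(rng.choice(actions)):
                wins += 1
                break
    return wins / episodes


ACTIONS = (0, 1, 2, 3)
trivial = random_win_rate(make_lock((0,)), ACTIONS, episodes=500, max_steps=20)
print(trivial, "accept" if trivial <= THRESHOLD else "reject")
```

A one-symbol lock is beaten by luck almost every time and would be rejected. Note the limits of sampling: resolving a rate near 1 in 10,000 takes far more than 10,000 episodes, which is presumably one reason exhaustive state-space analysis is preferable where the state graph is small enough to enumerate.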

Third, human calibration is continuous rather than symbolic. The paper reports 486 unique participants, 414 candidate environments, and 2,893 total environment attempts. Candidate environments must pass an “easy for humans” bar. Each included environment is attempted by 10 people, and only environments fully solved by at least two independent participants are considered for inclusion. Many are solved by six or more. Testing sessions have time limits, compensation, incentives, exclusion of low-effort behavior, and replay review to diagnose where humans get stuck.

This matters because a benchmark can fail in two opposite ways. It can be too easy for models, in which case it saturates. Or it can be arbitrarily hard for everyone, in which case failure tells us little. ARC-AGI-3 is trying to occupy the more informative middle: solvable for humans under first exposure, resistant to random play, and difficult for frontier AI without special scaffolding.

That construction process is not glamorous. It is also the reason the result has teeth.

The official scores are small, but the business lesson is not “stop using agents”

At release, the paper reports official semi-private scores below 1% for several frontier systems: Anthropic Opus 4.6 at 0.50%, Google Gemini 3.1 Pro Preview at 0.40%, OpenAI GPT-5.4 High at 0.20%, and xAI Grok-4.20 Beta at 0.10%.

Those exact model names and scores should be read as a time-stamped benchmark snapshot, not as a universal law of AI. The more durable result is the shape of the failure: models struggle when the environment is novel, the goal is unstated, the interaction history matters, and efficient exploration is required.

That is also the shape of many business workflows after the demo is over.

The practical inference for companies is not “agents are useless.” It is more precise:

What the paper directly shows	Cognaptus inference for business use	Boundary
Current frontier models score poorly on official ARC-AGI-3 without task-specific harnesses	First-contact adaptation remains weak; do not assume an agent can discover unfamiliar workflow logic by itself	ARC-AGI-3 is abstract, not a direct office workflow benchmark
Harnesses can improve performance on public environments	Domain scaffolding is a major near-term ROI path	Harness success may not transfer to unseen workflows
RHAE penalizes action waste sharply	Agent evaluation should measure retries, tool calls, resets, and human interventions, not just final completion	Business value may tolerate slow completion for low-risk batch tasks
Private/OOD environments are central to official measurement	Vendor pilots should include hidden test cases and messy edge cases	Private tests require careful design, not arbitrary traps
Human calibration anchors task solvability	Internal benchmarks need human baselines for comparison	Human baselines vary by domain expertise

In other words, ARC-AGI-3 is a warning against lazy evaluation, not a ban on deployment.

A company can still automate structured workflows where the domain knowledge is available, correctness is verifiable, and the cost of mistakes is controlled. Coding agents, document extraction, compliance triage, invoice matching, report generation, market monitoring, and customer support workflows can all benefit from carefully engineered systems. But the evaluation question changes.

Do not ask only: “Can the agent complete the task?”

Ask:

How many actions, retries, tool calls, and resets did it need?
How much workflow-specific instruction was embedded by humans?
Does performance survive on hidden cases not used during setup?
Can the system identify the goal when the user does not specify it perfectly?
Where does the agent fail: perception, memory, exploration, goal inference, or planning?

That last question is especially important. A failed agent is not a single category. It may have seen the relevant information but failed to store it. It may have stored it but failed to infer the rule. It may have inferred the rule but pursued the wrong objective. Or it may have planned correctly but executed inefficiently. ARC-AGI-3’s per-level and action-based framing pushes evaluation toward this diagnostic style.

That is the part businesses should copy.
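One way to copy it is to tag each failed run with the earliest stage that broke, rather than logging a bare "failed." The sketch below is illustrative: the stage names follow the list of questions above, while the `Stage` enum, the `first_failure` helper, and the idea of supplying one pass/fail check per stage are assumptions, not something the paper prescribes.

```python
from enum import Enum
from typing import Mapping, Optional


class Stage(Enum):
    # Ordered from earliest to latest point of failure.
    PERCEPTION = "perception"
    MEMORY = "memory"
    EXPLORATION = "exploration"
    GOAL_INFERENCE = "goal inference"
    PLANNING = "planning"


def first_failure(checks: Mapping[Stage, bool]) -> Optional[Stage]:
    """Return the earliest stage whose check failed, or None if none failed.

    Stages missing from `checks` are treated as not evaluated and skipped.
    """
    for stage in Stage:  # Enum iteration follows definition order
        if not checks.get(stage, True):
            return stage
    return None


run = {Stage.PERCEPTION: True, Stage.MEMORY: False, Stage.PLANNING: False}
print(first_failure(run))  # Stage.MEMORY
```

Reporting the earliest failing stage keeps a downstream symptom, such as a bad plan, from hiding its upstream cause, such as information that was seen but never stored.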

A better enterprise agent test should look more like ARC-AGI-3 than a product demo

For enterprise use, we do not need to clone ARC-AGI-3. Abstract colored-grid environments are useful for measuring general adaptation, but they are not a replacement for domain-specific validation. The better move is to borrow the evaluation philosophy.

A serious enterprise agent test should include four layers.

Layer	Enterprise equivalent	What it catches
Public demo	Vendor-guided workflow	Basic interface and tool competence
Semi-private test	Company-designed pilot cases	Adaptation to local data and process variation
Hidden OOD test	Unseen edge cases and changed assumptions	Overfitting to the pilot script
Human baseline	Skilled employee completion path	Whether automation is efficient, not merely possible

Then add an action-efficiency ledger. Count not only final accuracy but process cost: number of tool calls, failed actions, escalations, clarification requests, repeated observations, user interventions, and recovery attempts. For many workflows, this is closer to ROI than a final success rate.
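A ledger like this needs very little machinery. The following is a minimal sketch, assuming one ledger per case and a known human step count for the same case; the class name, the event identifiers, and the `ratio_to_human` measure are illustrative choices rather than an established standard.

```python
from collections import Counter
from dataclasses import dataclass, field

# Event kinds taken from the list in the text; the identifiers are illustrative.
EVENT_KINDS = frozenset({
    "tool_call", "failed_action", "escalation", "clarification_request",
    "repeated_observation", "user_intervention", "recovery_attempt",
})


@dataclass
class ActionLedger:
    human_steps: int  # steps a skilled employee needs for the same case
    counts: Counter = field(default_factory=Counter)

    def record(self, kind: str, n: int = 1) -> None:
        """Add n events of the given kind, rejecting unknown kinds."""
        if kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        self.counts[kind] += n

    def total(self) -> int:
        return sum(self.counts.values())

    def ratio_to_human(self) -> float:
        """Agent events per human step; 1.0 means parity with the baseline."""
        return self.total() / self.human_steps


ledger = ActionLedger(human_steps=5)
ledger.record("tool_call", 12)
ledger.record("failed_action", 3)
print(ledger.total(), ledger.ratio_to_human())  # 15 3.0
```

Rejecting unknown event kinds is deliberate: a ledger that silently accepts typos undercounts exactly the waste it exists to expose.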

A model that completes 90% of cases after 35 steps may be worse than a simpler rules-based system that completes 70% in 3 steps and escalates the rest cleanly. Intelligence theater becomes expensive when every “autonomous” success burns compute, latency, and staff attention in the background.
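The comparison is easy to check with back-of-envelope arithmetic. The sketch below uses the completion rates and step counts from the paragraph above; the prices (10 cents per step, 200 cents per human handoff) are invented for illustration, as is the simplification that every case burns its full step count and every unfinished case goes to a person at a flat cost.

```python
def cost_per_100_cases(completed: int, steps_per_case: int,
                       step_cost: int, handoff_cost: int) -> int:
    """Total cost, in cents, of running 100 cases.

    ASSUMPTIONS: every case consumes the full step count whether or not it
    succeeds, and every unfinished case is handed to a person at a flat cost.
    """
    if not 0 <= completed <= 100:
        raise ValueError("completed must be between 0 and 100")
    unfinished = 100 - completed
    return 100 * steps_per_case * step_cost + unfinished * handoff_cost


STEP_COST = 10      # cents per agent step (hypothetical)
HANDOFF_COST = 200  # cents per case passed to a human (hypothetical)

agent = cost_per_100_cases(90, 35, STEP_COST, HANDOFF_COST)
rules = cost_per_100_cases(70, 3, STEP_COST, HANDOFF_COST)
print(agent, rules)  # 37000 9000
```

Under these made-up prices the agent costs roughly four times as much per hundred cases. It only pulls ahead once a human handoff costs more than 1,600 cents, that is, 160 times the price of a single step. Whether that holds is an empirical question about the workflow, which is the point of keeping the numbers.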

This is where the paper’s distinction between official and community leaderboards becomes commercially useful. Companies should maintain both views internally:

A general capability view: how well does the base model or standard agent architecture handle new cases with minimal workflow-specific preparation?
A deployment engineering view: how much performance can we obtain by adding workflow-specific harnesses, tools, memory, validators, and escalation rules?

The first view informs strategic model selection and long-term platform bets. The second view informs near-term automation ROI. Confusing them is how organizations end up buying AGI and receiving a brittle macro with a chatbot attached.

The boundary: ARC-AGI-3 is a strong signal, not a universal business proxy

The paper is ambitious, but its business interpretation needs boundaries.

ARC-AGI-3 environments are abstract, visual, turn-based, and deliberately stripped of language and domain knowledge. That is a strength for measuring fluid adaptation, but it also means the benchmark does not directly predict performance in legal review, sales operations, financial analysis, procurement, or software maintenance. Those domains often reward acquired knowledge, tool familiarity, retrieval quality, and domain-specific verification — exactly the ingredients ARC-AGI-3 tries to minimize.

There is also a scoring boundary. RHAE treats action efficiency as the central resource. In embodied or interactive environments, that makes sense. In business contexts, the relevant resource may be different: elapsed time, compute cost, error cost, audit burden, user interruption, compliance risk, or opportunity cost. Counting actions is still useful, but it should be adapted to the workflow.

Finally, harness engineering should not be dismissed just because it is excluded from the official AGI measurement. The paper itself recognizes that harness research can be economically valuable and may later migrate behind model APIs as first-party capability. In commercial terms, that migration is already one of the most important product dynamics in AI: yesterday’s wrapper becomes tomorrow’s default model behavior, and everyone pretends it was there all along.

So the correct business conclusion is not pessimism. It is measurement discipline.

The real message: agents need less theater and better scorekeeping

ARC-AGI-3 does not show that AI agents are doomed. It shows that the industry has been using the word “agent” to cover too many different phenomena.

A model that follows instructions is not the same as a system that discovers the objective. A harness that solves a public environment is not the same as general intelligence. A workflow that completes after repeated retries is not the same as efficient adaptation. A benchmark that can be trained against is not the same as a benchmark that measures first-contact generalization.

The paper’s contribution is to make these distinctions operational. It gives us an interactive format, an efficiency-based score, private out-of-distribution evaluation, and human-calibrated solvability. More importantly, it gives business readers a useful test of their own AI strategy.

When evaluating an agent, stop asking whether it can perform after the environment has been quietly arranged around it.

Ask what happens when the instructions are incomplete, the goal is implicit, the path is unfamiliar, and every wasted action counts.

That is closer to work. Less glamorous than the demo, certainly. But reality usually has poor stage lighting.

Cognaptus: Automate the Present, Incubate the Future.

¹ ARC Prize Foundation Development Team, “ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence,” arXiv:2603.24621v2, 17 Apr. 2026, https://arxiv.org/html/2603.24621.
