AGI Benchmarks

Demo days are generous. A sales engineer opens a prepared workflow, the agent clicks through a familiar sequence, the dashboard turns green, and everyone politely pretends not to notice how much of the intelligence was smuggled into the setup. ARC-AGI-3 is less polite. The paper introduces an interactive benchmark for agentic intelligence: not a static puzzle, not a multiple-choice exam, and not a coding task with a unit test waiting like a benevolent parent. An agent enters a novel, abstract, turn-based environment. It receives no explicit objective. It must explore, infer the rules, identify what counts as success, build a working model of the environment, and execute a plan efficiently.[1] ...
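For readers who think better in code, here is a minimal Python sketch of the loop described above: explore, infer the rules, discover what success looks like, then plan. Everything in it is an assumption made for illustration. `ToyEnv`, its 4x4 grid, the opaque action ids, and the `explore` and `plan` helpers are hypothetical and are not the ARC-AGI-3 interface or anything taken from the paper. The real environments are far richer; this toy only shows the shape of the problem.

```python
from collections import deque


class ToyEnv:
    """Hypothetical turn-based grid. The agent is never told the goal
    or what the action ids mean; it only sees (state, solved)."""

    ACTIONS = (0, 1, 2, 3)  # opaque ids, meaning hidden from the agent
    SIZE = 4
    _MOVES = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0)}  # hidden rules
    _GOAL = (3, 2)  # hidden objective

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self._MOVES[action]
        row = min(max(self.pos[0] + dr, 0), self.SIZE - 1)
        col = min(max(self.pos[1] + dc, 0), self.SIZE - 1)
        self.pos = (row, col)
        return self.pos, self.pos == self._GOAL


def explore(env):
    """Try every action from every reachable state, recording what happens.
    Returns the learned transition model, the observed goal state, and the
    number of environment steps spent learning."""
    model = {}  # (state, action) -> next state
    goal = None
    start = env.reset()
    paths = {start: []}  # state -> action sequence that reaches it
    frontier = deque([start])
    steps_used = 0
    while frontier:
        state = frontier.popleft()
        for action in env.ACTIONS:
            env.reset()
            for earlier in paths[state]:  # replay the path back to `state`
                env.step(earlier)
            nxt, solved = env.step(action)
            steps_used += len(paths[state]) + 1
            model[(state, action)] = nxt
            if solved:
                goal = nxt  # success is inferred from feedback, not given
            if nxt not in paths:
                paths[nxt] = paths[state] + [action]
                frontier.append(nxt)
    return model, goal, steps_used


def plan(model, start, goal, actions):
    """Breadth-first search over the learned model, never the real env."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action in actions:
            nxt = model[(state, action)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None


env = ToyEnv()
model, goal, explore_cost = explore(env)
route = plan(model, env.reset(), goal, env.ACTIONS)
print(f"explored with {explore_cost} steps, solved in {len(route)}")
```

The gap between the two numbers is the point. Exhaustive exploration works on sixteen cells and collapses on anything interesting, which is why a benchmark that scores efficiency rewards agents that form good hypotheses early instead of trying everything.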