Opening — Why This Matters Now

For two years, the AI industry has been intoxicated by a single idea: more reasoning tokens equal more intelligence.

Chain-of-thought prompting. Inference-time scaling. “Extended thinking” modes. Adjustable reasoning effort. The narrative is simple: give models more room to think, and they will think better.

But here is the uncomfortable question: how do we know?

Most reasoning benchmarks remain single-shot, final-answer-only, and increasingly contaminated. If a model returns the correct output, we clap. If it fails, we shrug. What happens between the first token and the last remains opaque.

Pencil Puzzle Bench introduces something the industry has quietly needed: step-level, verifiable reasoning with deterministic feedback. Not just “Did you get it right?”—but “Which rule did you break, at which cell, and why?”

For anyone building agentic systems—trading bots, compliance agents, operational copilots—this shift from outcome-only evaluation to process-level verification is not academic. It is existential.


Background — The Benchmark Fatigue Problem

Traditional reasoning benchmarks (GSM8K, MATH, ARC) suffer from three structural limitations:

  1. Single-turn format — no iterative correction.
  2. Final-answer evaluation — no visibility into intermediate reasoning.
  3. Contamination risk — solutions increasingly present in training corpora.

Meanwhile, the industry has pivoted toward:

  • Inference-time compute scaling
  • Reinforcement learning for reasoning
  • Agentic tool use with multi-turn interaction

Yet evaluation has lagged behind capability claims.

Constraint-satisfaction puzzles—many NP-complete—offer an alternative testbed. They are:

  • Deterministic
  • Multi-step
  • Low-contamination
  • Fully verifiable at each move

This is the conceptual leap of Pencil Puzzle Bench: treat reasoning as a sequence of verifiable state transitions, not a monolithic answer.


The Framework — From Sudoku to System-Level Supervision

Dataset Scale

The framework includes:

| Tier | Puzzles | Varieties | Verified Unique Solutions |
|---|---|---|---|
| Full Dataset | 62,231 | 94 | 100% SAT-verified |
| Golden Benchmark | 300 | 20 | 100% |
| Agentic Baseline | 30 | 4 | 100% |

Every puzzle has:

  • Step-by-step solution traces
  • Programmatic constraint checking
  • Unique solution validation via SAT solver

That last point matters. Ambiguity is eliminated. No partial credit. No fuzzy grading.
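The uniqueness check rests on a standard SAT trick: find one solution, add a "blocking clause" that forbids it, and confirm the formula becomes unsatisfiable. A toy, brute-force sketch of that logic (real pipelines use an industrial solver; `solve` and `has_unique_solution` are illustrative names, not the benchmark's API):

```python
# Toy illustration of unique-solution validation via a SAT
# "blocking clause". Real pipelines use an industrial solver;
# this brute-force version only shows the logic on tiny formulas.
from itertools import product

def solve(clauses, n_vars):
    """Brute-force SAT: return a satisfying assignment or None.
    Literals are ints: 3 means x3 is true, -3 means x3 is false."""
    for bits in product([False, True], repeat=n_vars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        if all(any(assign[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assign
    return None

def has_unique_solution(clauses, n_vars):
    first = solve(clauses, n_vars)
    if first is None:
        return False  # unsatisfiable: no solution at all
    # Forbid the found solution, then ask for a different one.
    blocking = [(-v if val else v) for v, val in first.items()]
    return solve(clauses + [blocking], n_vars) is None

# (x1 OR x2) AND (NOT x1) has exactly one model: x1=False, x2=True.
print(has_unique_solution([[1, 2], [-1]], 2))  # True
# (x1 OR x2) alone has three models, so it is not unique.
print(has_unique_solution([[1, 2]], 2))        # False
```

If the blocked formula is still satisfiable, a second solution exists and the puzzle is rejected.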


Step-Level Verification: The Real Innovation

Each move follows a deterministic pipeline:

  1. Check current board
  2. Apply coordinate-based move
  3. Validate new violations
  4. Check completion

If a rule is broken, the engine reports the exact violated constraint.
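As a rough sketch of such a pipeline (the names `apply_move` and `check_violations` are invented for illustration, not the benchmark's actual interface):

```python
# Minimal sketch of a step-level verification loop; all names
# are illustrative, not the benchmark's actual API.

def check_violations(grid):
    """Return each violated constraint, e.g. a duplicate value in a row."""
    violations = []
    for r, row in enumerate(grid):
        seen = {}
        for c, v in enumerate(row):
            if v is not None and v in seen:
                violations.append(
                    f"Duplicate number {v} in row {r} (cols {seen[v]}, {c})")
            elif v is not None:
                seen[v] = c
    return violations

def apply_move(grid, move):
    r, c, value = move
    new_grid = [row[:] for row in grid]      # 1. check current board (copy it)
    new_grid[r][c] = value                   # 2. apply coordinate-based move
    violations = check_violations(new_grid)  # 3. validate new violations
    complete = all(v is not None for row in new_grid for v in row)  # 4. done?
    return new_grid, violations, complete

grid, violations, done = apply_move([[1, None], [None, 1]], (0, 1, 1))
print(violations)  # names the exact violated constraint and cell
```

The key property is determinism: the same board and move always yield the same verdict.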

Example (Nurikabe-style logic):

  • “Shaded cells form a 2×2 square”
  • “Loop crosses itself”
  • “Duplicate number in row”

This granularity enables:

  • Dense reward signals
  • Process reward modeling
  • Reinforcement learning from verifiable rewards
  • Curriculum learning based on measurable difficulty
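For instance, a dense reward signal falls straight out of the verifier. A minimal sketch, with purely illustrative shaping constants:

```python
# Sketch of turning step-level verifier output into a dense
# RL reward; the shaping constants are illustrative only.

def step_reward(violations, complete):
    """Per-move reward derived from deterministic verification."""
    if complete and not violations:
        return 1.0                       # solved: terminal reward
    if violations:
        return -0.1 * len(violations)    # penalty per violated constraint
    return 0.01                          # small bonus for a legal move

print(step_reward([], True))                            # 1.0
print(step_reward(["Duplicate number in row"], False))  # -0.1
```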

For enterprise AI systems, this mirrors real-world constraints:

| Puzzle World | Business World |
|---|---|
| Constraint violation | Regulatory breach |
| Illegal move | Invalid transaction |
| Local rule break | Control failure |
| Completion check | Audit validation |

In short: puzzles become compliance simulators for reasoning engines.


Two Axes of Capability — Depth vs. Iteration

The evaluation tested 51 models across two modes:

  • Direct Ask (single-shot solution)
  • Agentic (multi-turn iterative solving with tool feedback)

The results reveal two distinct improvement axes.

Axis 1 — Reasoning Effort Scaling

GPT-5.2 demonstrates an 81× improvement from minimal to maximum reasoning effort.

| Effort Level | Direct Ask Success |
|---|---|
| None | 0.33% |
| Low | 2.3% |
| Medium | 9.3% |
| High | 20.7% |
| XHigh | 27.0% |

More compute, more depth, better results.

But with a catch.

At maximum effort, 35% of requests fail due to timeouts—revealing a reliability tradeoff.

Capability rises. Stability falls.

That tradeoff is not theoretical. It directly affects production deployments.


Axis 2 — Agentic Iteration

Agentic solving changes the picture dramatically.

Top performance comparison:

| Model | Direct Ask | Agentic |
|---|---|---|
| GPT-5.2@xhigh | 27.0% | 56.0% |
| Claude Opus 4.6 (no extended thinking) | 0.3% | 30.0% |

Two striking patterns emerge:

  1. Models with weak internal reasoning gain massively from iteration.
  2. Even strong reasoners improve with feedback loops.

This “agentic gap” ranges from +4.8pp to +30pp depending on configuration.

The implication: reasoning depth and iterative correction are complementary, not substitutes.

If you are designing AI agents for operational environments, this matters. A mediocre planner with strong feedback loops may outperform a brilliant planner operating blindly.


Cost vs Capability — The Pareto Reality

The benchmark consumed ~$28,246 across 17,032 runs.

Cost per success varies by 66,822× across models.

This is not marginal. It is strategic.

You can choose:

  • Low cost with mediocre reliability
  • High cost with high capability
  • An optimized mid-frontier balance of cost and capability

In enterprise AI deployment, that becomes a portfolio decision.

Reasoning is no longer just a technical metric. It is a capital allocation question.
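That portfolio view can be made concrete with a Pareto filter over (cost, success-rate) points. The model names and numbers below are hypothetical, not the paper's per-model figures:

```python
# Sketch of the portfolio view: effective cost per solved task and
# a Pareto filter over (cost, success-rate) points. All names and
# numbers here are hypothetical.

models = {  # name: ($ per run, success rate)
    "cheap":    (0.02, 0.05),
    "mid":      (0.40, 0.30),
    "frontier": (3.00, 0.55),
    "wasteful": (3.50, 0.28),
}

for name, (cost, rate) in models.items():
    print(f"{name}: ${cost / rate:.2f} per success")

def pareto_front(candidates):
    """Keep entries not dominated on both cost (lower) and success (higher)."""
    return {
        name: (cost, rate)
        for name, (cost, rate) in candidates.items()
        if not any(c <= cost and r >= rate and (c, r) != (cost, rate)
                   for c, r in candidates.values())
    }

print(pareto_front(models))  # "wasteful" is dominated by "mid" and drops out
```

Anything off the frontier is paying more for less, which is exactly the 66,822× spread in miniature.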


What Actually Makes a Puzzle Hard?

Here is where the paper quietly changes the difficulty discourse.

Move count predicts solve rate poorly:

$$R^2_{adj} = 0.092$$

Instead, solution compressibility is a far stronger predictor:

$$R^2_{adj} = 0.385$$

Interpretation:

  • Highly compressible solutions contain repeated patterns.
  • Low-compression solutions require genuinely novel decisions at each step.
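One simple proxy for this idea (the paper's exact metric may differ) is the zlib compression ratio of the serialized solution trace:

```python
# A simple compressibility proxy: zlib compression ratio of the
# serialized move trace (the paper's exact metric may differ).
import random
import zlib

def compressibility(trace):
    """Compressed size / raw size; lower means more repeated structure."""
    raw = trace.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

random.seed(0)
repetitive = "R1C1=1;R1C2=2;" * 200  # pattern-heavy solution trace
novel = "".join(random.choice("RC=0123456789;") for _ in range(2800))

print(f"repetitive: {compressibility(repetitive):.3f}")  # near zero
print(f"novel:      {compressibility(novel):.3f}")       # far higher
```

A trace full of repeated local deductions compresses well; one requiring a fresh decision at every step does not.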

This insight generalizes beyond puzzles.

In real-world reasoning tasks:

  • Pattern-heavy domains are easier for LLMs.
  • High-entropy decision spaces expose true limits.

If you are evaluating AI for finance, logistics, or legal reasoning, measuring entropy of solution paths may be more informative than counting steps.


Infrastructure Limits — The Hidden Constraint

High reasoning effort leads to systematic timeouts around 2.5–3 hours.

These are not random failures. They reflect inference-time infrastructure ceilings.

That places a hard practical limit on “just think longer” strategies.

For enterprise systems, the lesson is blunt:

If your agent needs hours of uninterrupted reasoning to succeed, your architecture is wrong.

Intelligence must operate within bounded latency constraints.
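In practice that means wrapping expensive reasoning in an explicit latency budget with a fallback path. A minimal sketch, where `slow_reasoner` and `fast_fallback` stand in for real model calls:

```python
# Sketch of a bounded-latency budget with a fallback path.
# `slow_reasoner` and `fast_fallback` stand in for real model calls.
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def slow_reasoner():
    time.sleep(0.5)          # stands in for a long high-effort call
    return "deep answer"

def fast_fallback():
    return "bounded answer"

def solve_with_budget(budget_s):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_reasoner)
        try:
            return future.result(timeout=budget_s)
        except FuturesTimeout:
            return fast_fallback()  # degrade gracefully, stay within SLA

print(solve_with_budget(0.05))  # falls back: "bounded answer"
print(solve_with_budget(2.0))   # budget is generous: "deep answer"
```

Note that the executor still waits for the in-flight thread on exit; a production system would also cancel or abandon the timed-out call.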


Implications — Beyond Puzzles

Pencil Puzzle Bench is not about Sudoku.

It is about:

  • Process supervision
  • Step-level reinforcement learning
  • Agentic reliability
  • Infrastructure-aware reasoning

For businesses building AI agents, the benchmark demonstrates:

  1. Feedback loops matter as much as raw reasoning depth.
  2. Verification infrastructure is the key to scalable training.
  3. Capability gains can collapse reliability.
  4. Difficulty is entropy-driven, not length-driven.

In other words, we are entering an era where:

AI evaluation must look more like control systems engineering than trivia quizzes.

And that is a healthy shift.


Conclusion — The Agentic Frontier

Pencil Puzzle Bench reframes reasoning as a controlled, verifiable process rather than a theatrical performance of chain-of-thought.

The strongest models combine two axes:

  • Deep internal reasoning
  • Iterative external correction

Even then, 49% of puzzles remain unsolved.

The frontier is advancing—but it is not yet stable, nor universally reliable.

For organizations deploying AI agents, the lesson is clear:

Build systems that can verify, iterate, and recover—not just systems that can think loudly.

Because in production environments, intelligence without constraint-awareness is simply expensive improvisation.

And improvisation does not pass audits.

Cognaptus: Automate the Present, Incubate the Future.