Opening — Why This Matters Now

For two years, the AI industry has been intoxicated by a single idea: more reasoning tokens equal more intelligence.

Chain-of-thought prompting. Inference-time scaling. “Extended thinking” modes. Adjustable reasoning effort. The narrative is simple: give models more room to think, and they will think better.

But here is the uncomfortable question: how do we know?

Most reasoning benchmarks remain single-shot, final-answer-only, and increasingly contaminated. If a model returns the correct output, we clap. If it fails, we shrug. What happens between the first token and the last remains opaque.

Pencil Puzzle Bench introduces something the industry has quietly needed: step-level, verifiable reasoning with deterministic feedback. Not just “Did you get it right?”—but “Which rule did you break, at which cell, and why?”

For anyone building agentic systems—trading bots, compliance agents, operational copilots—this shift from outcome-only evaluation to process-level verification is not academic. It is existential.


Background — The Benchmark Fatigue Problem

Traditional reasoning benchmarks (GSM8K, MATH, ARC) suffer from three structural limitations:

  1. Single-turn format — no iterative correction.
  2. Final-answer evaluation — no visibility into intermediate reasoning.
  3. Contamination risk — solutions increasingly present in training corpora.

Meanwhile, the industry has pivoted toward:

  • Inference-time compute scaling
  • Reinforcement learning for reasoning
  • Agentic tool use with multi-turn interaction

Yet evaluation has lagged behind capability claims.

Constraint-satisfaction puzzles—many NP-complete—offer an alternative testbed. They are:

  • Deterministic
  • Multi-step
  • Low-contamination
  • Fully verifiable at each move

This is the conceptual leap of Pencil Puzzle Bench: treat reasoning as a sequence of verifiable state transitions, not a monolithic answer.


The Framework — From Sudoku to System-Level Supervision

Dataset Scale

The framework includes:

| Tier | Puzzles | Varieties | Verified Unique Solutions |
|---|---|---|---|
| Full Dataset | 62,231 | 94 | 100% SAT-verified |
| Golden Benchmark | 300 | 20 | 100% |
| Agentic Baseline | 30 | 4 | 100% |

Every puzzle has:

  • Step-by-step solution traces
  • Programmatic constraint checking
  • Unique solution validation via SAT solver

That last point matters. Ambiguity is eliminated. No partial credit. No fuzzy grading.
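The uniqueness check rests on a standard SAT trick: find one solution, add a "blocking clause" that forbids it, and confirm the formula becomes unsatisfiable. A toy, brute-force sketch of that logic (real pipelines use an industrial solver; `solve` and `has_unique_solution` are illustrative names, not the benchmark's API):

```python
# Toy illustration of unique-solution validation via a SAT
# "blocking clause". Real pipelines use an industrial solver;
# this brute-force version only shows the logic on tiny formulas.
from itertools import product

def solve(clauses, n_vars):
    """Brute-force SAT: return a satisfying assignment or None.
    Literals are ints: 3 means x3 is true, -3 means x3 is false."""
    for bits in product([False, True], repeat=n_vars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        if all(any(assign[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assign
    return None

def has_unique_solution(clauses, n_vars):
    first = solve(clauses, n_vars)
    if first is None:
        return False  # unsatisfiable: no solution at all
    # Forbid the found solution, then ask for a different one.
    blocking = [(-v if val else v) for v, val in first.items()]
    return solve(clauses + [blocking], n_vars) is None

# (x1 OR x2) AND (NOT x1) has exactly one model: x1=False, x2=True.
print(has_unique_solution([[1, 2], [-1]], 2))  # True
# (x1 OR x2) alone has three models, so it is not unique.
print(has_unique_solution([[1, 2]], 2))        # False
```

If the blocked formula is still satisfiable, a second solution exists and the puzzle is rejected.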


Step-Level Verification: The Real Innovation

Each move follows a deterministic pipeline:

  1. Check current board
  2. Apply coordinate-based move
  3. Validate new violations
  4. Check completion

If a rule is broken, the engine reports the exact violated constraint.
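As a rough sketch of such a pipeline (the names `apply_move` and `check_violations` are invented for illustration, not the benchmark's actual interface):

```python
# Minimal sketch of a step-level verification loop; all names
# are illustrative, not the benchmark's actual API.

def check_violations(grid):
    """Return each violated constraint, e.g. a duplicate value in a row."""
    violations = []
    for r, row in enumerate(grid):
        seen = {}
        for c, v in enumerate(row):
            if v is not None and v in seen:
                violations.append(
                    f"Duplicate number {v} in row {r} (cols {seen[v]}, {c})")
            elif v is not None:
                seen[v] = c
    return violations

def apply_move(grid, move):
    r, c, value = move
    new_grid = [row[:] for row in grid]      # 1. check current board (copy it)
    new_grid[r][c] = value                   # 2. apply coordinate-based move
    violations = check_violations(new_grid)  # 3. validate new violations
    complete = all(v is not None for row in new_grid for v in row)  # 4. done?
    return new_grid, violations, complete

grid, violations, done = apply_move([[1, None], [None, 1]], (0, 1, 1))
print(violations)  # names the exact violated constraint and cell
```

The key property is determinism: the same board and move always yield the same verdict.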

Example (Nurikabe-style logic):

  • “Shaded cells form a 2×2 square”
  • “Loop crosses itself”
  • “Duplicate number in row”

This granularity enables:

  • Dense reward signals
  • Process reward modeling
  • Reinforcement learning from verifiable rewards
  • Curriculum learning based on measurable difficulty
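For instance, a dense reward signal falls straight out of the verifier. A minimal sketch, with purely illustrative shaping constants:

```python
# Sketch of turning step-level verifier output into a dense
# RL reward; the shaping constants are illustrative only.

def step_reward(violations, complete):
    """Per-move reward derived from deterministic verification."""
    if complete and not violations:
        return 1.0                       # solved: terminal reward
    if violations:
        return -0.1 * len(violations)    # penalty per violated constraint
    return 0.01                          # small bonus for a legal move

print(step_reward([], True))                            # 1.0
print(step_reward(["Duplicate number in row"], False))  # -0.1
```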

For enterprise AI systems, this mirrors real-world constraints:

| Puzzle World | Business World |
|---|---|
| Constraint violation | Regulatory breach |
| Illegal move | Invalid transaction |
| Local rule break | Control failure |
| Completion check | Audit validation |

In short: puzzles become compliance simulators for reasoning engines.


Two Axes of Capability — Depth vs. Iteration

The evaluation tested 51 models across two modes:

  • Direct Ask (single-shot solution)
  • Agentic (multi-turn iterative solving with tool feedback)

The results reveal two distinct improvement axes.

Axis 1 — Reasoning Effort Scaling

GPT-5.2 demonstrates an 81× improvement from minimal to maximum reasoning effort.

| Effort Level | Direct Ask Success |
|---|---|
| None | 0.33% |
| Low | 2.3% |
| Medium | 9.3% |
| High | 20.7% |
| XHigh | 27.0% |

More compute, more depth, better results.

But with a catch.

At maximum effort, 35% of requests fail due to timeouts—revealing a reliability tradeoff.

Capability rises. Stability falls.

That tradeoff is not theoretical. It directly affects production deployments.


Axis 2 — Agentic Iteration

Agentic solving changes the picture dramatically.

Top performance comparison:

| Model | Direct Ask | Agentic |
|---|---|---|
| GPT-5.2@xhigh | 27.0% | 56.0% |
| Claude Opus 4.6 (no extended thinking) | 0.3% | 30.0% |

Two striking patterns emerge:

  1. Models with weak internal reasoning gain massively from iteration.
  2. Even strong reasoners improve with feedback loops.

This “agentic gap” ranges from +4.8pp to +30pp depending on configuration.

The implication: reasoning depth and iterative correction are complementary, not substitutes.

If you are designing AI agents for operational environments, this matters. A mediocre planner with strong feedback loops may outperform a brilliant planner operating blindly.


Cost vs Capability — The Pareto Reality

The benchmark consumed ~$28,246 across 17,032 runs.

Cost per success varies by 66,822× across models.

This is not marginal. It is strategic.

You can choose:

  • Low cost with mediocre reliability
  • High cost with high capability
  • An optimized mid-frontier balance of cost and capability

In enterprise AI deployment, that becomes a portfolio decision.

Reasoning is no longer just a technical metric. It is a capital allocation question.
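That portfolio view can be made concrete with a Pareto filter over (cost, success-rate) points. The model names and numbers below are hypothetical, not the paper's per-model figures:

```python
# Sketch of the portfolio view: effective cost per solved task and
# a Pareto filter over (cost, success-rate) points. All names and
# numbers here are hypothetical.

models = {  # name: ($ per run, success rate)
    "cheap":    (0.02, 0.05),
    "mid":      (0.40, 0.30),
    "frontier": (3.00, 0.55),
    "wasteful": (3.50, 0.28),
}

for name, (cost, rate) in models.items():
    print(f"{name}: ${cost / rate:.2f} per success")

def pareto_front(candidates):
    """Keep entries not dominated on both cost (lower) and success (higher)."""
    return {
        name: (cost, rate)
        for name, (cost, rate) in candidates.items()
        if not any(c <= cost and r >= rate and (c, r) != (cost, rate)
                   for c, r in candidates.values())
    }

print(pareto_front(models))  # "wasteful" is dominated by "mid" and drops out
```

Anything off the frontier is paying more for less, which is exactly the 66,822× spread in miniature.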


What Actually Makes a Puzzle Hard?

Here is where the paper quietly changes the difficulty discourse.

Move count predicts solve rate poorly:

$$R^2_{adj} = 0.092$$

Instead, solution compressibility is a far stronger predictor:

$$R^2_{adj} = 0.385$$

Interpretation:

  • Highly compressible solutions contain repeated patterns.
  • Low-compression solutions require genuinely novel decisions at each step.
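One simple proxy for this idea (the paper's exact metric may differ) is the zlib compression ratio of the serialized solution trace:

```python
# A simple compressibility proxy: zlib compression ratio of the
# serialized move trace (the paper's exact metric may differ).
import random
import zlib

def compressibility(trace):
    """Compressed size / raw size; lower means more repeated structure."""
    raw = trace.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

random.seed(0)
repetitive = "R1C1=1;R1C2=2;" * 200  # pattern-heavy solution trace
novel = "".join(random.choice("RC=0123456789;") for _ in range(2800))

print(f"repetitive: {compressibility(repetitive):.3f}")  # near zero
print(f"novel:      {compressibility(novel):.3f}")       # far higher
```

A trace full of repeated local deductions compresses well; one requiring a fresh decision at every step does not.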

This insight generalizes beyond puzzles.

In real-world reasoning tasks:

  • Pattern-heavy domains are easier for LLMs.
  • High-entropy decision spaces expose true limits.

If you are evaluating AI for finance, logistics, or legal reasoning, measuring entropy of solution paths may be more informative than counting steps.


Infrastructure Limits — The Hidden Constraint

High reasoning effort leads to systematic timeouts around 2.5–3 hours.

These are not random failures. They reflect inference-time infrastructure ceilings.

That places a hard practical limit on “just think longer” strategies.

For enterprise systems, the lesson is blunt:

If your agent needs hours of uninterrupted reasoning to succeed, your architecture is wrong.

Intelligence must operate within bounded latency constraints.
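In practice that means wrapping expensive reasoning in an explicit latency budget with a fallback path. A minimal sketch, where `slow_reasoner` and `fast_fallback` stand in for real model calls:

```python
# Sketch of a bounded-latency budget with a fallback path.
# `slow_reasoner` and `fast_fallback` stand in for real model calls.
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def slow_reasoner():
    time.sleep(0.5)          # stands in for a long high-effort call
    return "deep answer"

def fast_fallback():
    return "bounded answer"

def solve_with_budget(budget_s):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_reasoner)
        try:
            return future.result(timeout=budget_s)
        except FuturesTimeout:
            return fast_fallback()  # degrade gracefully, stay within SLA

print(solve_with_budget(0.05))  # falls back: "bounded answer"
print(solve_with_budget(2.0))   # budget is generous: "deep answer"
```

Note that the executor still waits for the in-flight thread on exit; a production system would also cancel or abandon the timed-out call.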


Implications — Beyond Puzzles

Pencil Puzzle Bench is not about Sudoku.

It is about:

  • Process supervision
  • Step-level reinforcement learning
  • Agentic reliability
  • Infrastructure-aware reasoning

For businesses building AI agents, the benchmark demonstrates:

  1. Feedback loops matter as much as raw reasoning depth.
  2. Verification infrastructure is the key to scalable training.
  3. Capability gains can collapse reliability.
  4. Difficulty is entropy-driven, not length-driven.

In other words, we are entering an era where:

AI evaluation must look more like control systems engineering than trivia quizzes.

And that is a healthy shift.


Conclusion — The Agentic Frontier

Pencil Puzzle Bench reframes reasoning as a controlled, verifiable process rather than a theatrical performance of chain-of-thought.

The strongest models combine two axes:

  • Deep internal reasoning
  • Iterative external correction

Even then, 49% of puzzles remain unsolved.

The frontier is advancing—but it is not yet stable, nor universally reliable.

For organizations deploying AI agents, the lesson is clear:

Build systems that can verify, iterate, and recover—not just systems that can think loudly.

Because in production environments, intelligence without constraint-awareness is simply expensive improvisation.

And improvisation does not pass audits.

Cognaptus: Automate the Present, Incubate the Future.