Opening — Why This Matters Now
AI agents are no longer politely answering questions. They are booking flights, moving money, editing codebases, sharing files, and occasionally hallucinating with confidence that would impress a venture capitalist.
As agents gain autonomy, the question shifts from “Can they do it?” to “Can we trust what they did?”
Today’s dominant answer is simple: ask another LLM.
The paradigm known as LLM-as-a-Judge uses a probabilistic model to evaluate another probabilistic model. It scales cheaply. It feels modern. It also inherits the same blind spots, biases, and hallucination patterns as the system it supervises.
The paper “FORMALJUDGE: A Neuro-Symbolic Paradigm for Agentic Oversight” proposes something less fashionable but far more durable: stop asking for opinions — start asking for proofs.
This is not a prompt engineering tweak. It is an architectural pivot.
Background — The Scalable Oversight Problem
Modern agents operate across three increasingly dangerous dimensions:
| Risk Type | Example Failure | Why It’s Hard to Detect |
|---|---|---|
| Behavioral Safety | Clicking phishing links | Requires contextual judgment |
| Constraint Adherence | Violating nested budget/date rules | Requires compositional logic |
| Deception | Fabricating results after tool failure | Requires grounding, not rhetoric |
Traditional LLM judges fail for structural reasons:
- Attention competition — Safety constraints get buried under verbose reasoning.
- Arithmetic blindness — Numerical or temporal mismatches slip through.
- Persuasion vulnerability — A well-written lie passes as truth.
In short: probabilistic supervision of probabilistic systems produces correlated failure.
FORMALJUDGE’s thesis is blunt:
Oversight must be grounded in logic, not likelihood.
Architecture — The “Formal-of-Thought” Pipeline
FORMALJUDGE introduces a neuro-symbolic pipeline that separates semantic extraction from logical composition.
Instead of asking:
“Did the agent behave correctly?”
It asks:
“Did atomic fact A occur?” “Is constraint B satisfied?”
Then lets a solver do the math.
Phase 1: Intent Decomposition
Natural-language intent is decomposed into atomic facts and checkable constraints.
Example (Trip Booking Constraint):
If a flight is booked, the hotel check-in must not precede the arrival day.
This becomes:
- f1 = Has flight?
- f2 = Arrival date
- f3 = Hotel check-in date
- Constraint: $$ f1 \Rightarrow (f3 \ge f2) $$
Notice the shift: The LLM does not judge the rule. It extracts the facts.
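To make this concrete, here is a minimal sketch of how the extracted items and the implication could be encoded with Z3's Python bindings. The variable names mirror f1–f3 above; the encoding is illustrative, not the paper's implementation (which compiles to Dafny).

```python
from z3 import Bool, Int, Implies

# Atomic items extracted from the trip-booking intent (illustrative names).
f1 = Bool("has_flight")        # f1: does the itinerary include a flight?
f2 = Int("arrival_day")        # f2: flight arrival date, encoded as a day number
f3 = Int("hotel_checkin_day")  # f3: hotel check-in date, same encoding

# The constraint is stated once, symbolically:
# if flying, the hotel check-in must not precede the arrival day.
constraint = Implies(f1, f3 >= f2)
```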
Phase 2: Grounded Extraction
Structured data is parsed deterministically. Only residual semantic judgments go to the LLM.
Probabilistic reasoning is confined to atomic granularity.
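A sketch of what this split can look like in practice, assuming a hypothetical trace format: structured tool outputs are parsed with ordinary code, and only genuinely semantic questions would be delegated to the LLM, one atomic judgment at a time.

```python
from datetime import date

# Hypothetical agent trace: structured tool outputs plus free-text narration.
trace = {
    "flight_booking": {"status": "confirmed", "arrival_date": "2025-06-03"},
    "hotel_booking":  {"status": "confirmed", "checkin_date": "2025-06-03"},
    "agent_summary":  "Both bookings went through without issues.",
}

def extract_grounded_facts(trace: dict) -> dict:
    """Deterministic parsing of structured fields; no model in the loop."""
    return {
        "has_flight": trace["flight_booking"]["status"] == "confirmed",
        "arrival_day": date.fromisoformat(trace["flight_booking"]["arrival_date"]).toordinal(),
        "hotel_checkin_day": date.fromisoformat(trace["hotel_booking"]["checkin_date"]).toordinal(),
    }

facts = extract_grounded_facts(trace)
# Only residual semantic questions (e.g. "does the summary claim success?")
# would go to an LLM, and only as atomic yes/no judgments.
```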
Phase 3: Formal Specification
Extracted facts are compiled into Dafny specifications.
Safety becomes:
$$ \Phi_{\text{safe}} = \bigwedge_k \varphi_k $$
Each constraint is provable.
Phase 4: SMT Verification
The Z3 SMT solver determines whether the specification holds.
No sampling. No persuasive narratives. Just satisfiability.
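Putting the pieces together, verification reduces to a satisfiability query: assert the grounded facts, assert the negation of Φ_safe, and ask whether any violation is consistent with the evidence. The paper compiles specifications to Dafny; the Z3 sketch below only illustrates the solver step on the running example.

```python
from z3 import Bool, Int, Implies, And, Not, Solver, unsat

f1 = Bool("has_flight")
f2 = Int("arrival_day")
f3 = Int("hotel_checkin_day")

# Phi_safe: the conjunction of all extracted constraints (one shown here).
phi_safe = And(Implies(f1, f3 >= f2))

solver = Solver()
solver.add(f1, f2 == 3, f3 == 3)   # grounded facts (illustrative day numbers)
solver.add(Not(phi_safe))          # is any safety violation still possible?

if solver.check() == unsat:
    print("VERIFIED: no reading of the evidence violates Phi_safe")
else:
    print("VIOLATION:", solver.model())
```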
Results — Does Proof Beat Prompting?
Across three benchmarks:
- Agent-SafetyBench (behavioral safety)
- VitaBench (multi-domain constraints)
- Deceivers (fabrication detection)
FORMALJUDGE outperforms six LLM-as-a-Judge variants.
Aggregate Performance Gain
| Benchmark | Average Improvement over LLM-Judge Baselines |
|---|---|
| Agent-SafetyBench | +16–22% |
| VitaBench | +6–12% |
| Deceivers | +25–31% |
The deception benchmark is where things become interesting.
Weak-to-Strong Generalization
A 7B judge model achieved >90% deception detection accuracy against 72B agents.
Under standard LLM-as-a-Judge setups, accuracy drops as agent sophistication increases. Under FORMALJUDGE, it remains stable.
This breaks the assumption that only a bigger model can supervise a big one.
Oversight becomes capability-independent.
That is strategically significant.
Structured Inputs vs Formal Logic
The paper also tests intermediate baselines:
- Structured JSON input
- Chain-of-Thought reasoning
- Few-shot calibration
- Python-based runtime checks
Structured input helps. Executable Python helps slightly more.
But neither eliminates logical composition errors.
Why?
Because runtime scripts test specific cases. Formal verification proves properties.
The difference is subtle but operationally massive.
| Method | What It Checks | Guarantees Absence of Edge Cases? |
|---|---|---|
| LLM + Struct | Reads organized trace | No |
| LLM + Python | Executes test script | No |
| FORMALJUDGE | SMT-verified predicates | Yes (within spec) |
In regulated industries, that difference matters.
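The distinction in the table can be made concrete. A runtime check exercises one input; an SMT query quantifies over every input the spec allows. In the sketch below (a hypothetical per-item cap and trip budget, not an example from the paper), the test-style check passes while the solver surfaces the edge case it never exercised.

```python
from z3 import Ints, Solver, sat

# Hypothetical rule: each of three expense items is capped at 200;
# the overall trip budget is 500.

# Runtime-style check: one concrete case. It passes, so the script is happy.
example = (180, 150, 120)
assert all(x <= 200 for x in example) and sum(example) <= 500

# Formal-style check: does per-item capping *guarantee* the total budget?
a, b, c = Ints("a b c")
solver = Solver()
solver.add(a >= 0, b >= 0, c >= 0,
           a <= 200, b <= 200, c <= 200,  # everything the runtime check enforces
           a + b + c > 500)               # ...and a budget violation anyway
if solver.check() == sat:
    print("Edge case the runtime check never saw:", solver.model())
```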
Robustness to Deception Sophistication
Attacks were categorized by sophistication, from simple omission (Level 1) to multi-step fabrication (Level 4).
Baseline judges show high variance. Detection collapses on subtle attacks.
FORMALJUDGE maintains stable detection across levels.
Because persuasion does not influence a solver.
The solver does not “feel” convinced.
Iterative Refinement — Proof as Feedback
Perhaps the most underappreciated result:
When agents receive Dafny-based violation feedback and retry:
| Model | Round 0 | Round 3 |
|---|---|---|
| Claude-4.5-Opus | 70.7% | 99.8% |
| Qwen-14B | 34.3% | 74.2% |
Improvement is near-linear.
Natural-language critiques plateau. Formal constraint feedback converges.
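A skeleton of what that feedback loop could look like, with hypothetical `agent_act`, `extract_facts`, and `verify` interfaces standing in for the components described above:

```python
def refine(task, agent_act, extract_facts, verify, max_rounds=3):
    """Hypothetical proof-as-feedback loop: formal violations drive retries."""
    feedback = None
    for round_id in range(max_rounds + 1):
        trace = agent_act(task, feedback)        # agent attempts (or retries) the task
        verdict = verify(extract_facts(trace))   # solver checks Phi_safe on grounded facts
        if verdict.ok:
            return trace, round_id
        # Unlike a prose critique, the feedback names the exact violated
        # predicates, e.g. Implies(has_flight, hotel_checkin_day >= arrival_day).
        feedback = verdict.violated_constraints
    return trace, max_rounds
```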
This suggests something profound:
Formal constraints are not just evaluative — they are trainable signals.
This opens a path toward verifiable RLHF pipelines.
Limitations — Where It Still Breaks
FORMALJUDGE is not magic.
Its remaining failure modes are:
- Incorrect decomposition of natural language intent.
- Incorrect extraction of atomic facts.
The natural-to-formal translation bottleneck persists.
However, errors are localized. They do not cascade through probabilistic composition.
From a systems engineering perspective, that is a containment win.
Business Implications
For organizations deploying agentic systems in:
- Finance
- Healthcare
- Enterprise automation
- Legal workflows
The question becomes practical:
Do you want your compliance layer to “think”? Or to prove?
FORMALJUDGE suggests a future architecture:
Neural modules for semantic understanding. Symbolic modules for contractual guarantees.
This hybrid is not nostalgic formalism. It is pragmatic governance.
As agent autonomy scales, probabilistic oversight will become politically and legally fragile.
Mathematical verification will not.
Conclusion — The Watchmen Need Mathematics
Juvenal asked: Who watches the watchmen?
In AI, the answer has been: another watchman.
FORMALJUDGE proposes a different answer:
Let logic watch them.
It may not be glamorous. It may not trend on social media.
But in high-stakes systems, proof quietly beats persuasion.
Cognaptus: Automate the Present, Incubate the Future.