Typechecked and Still Wrong

TL;DR for operators

The useful lesson from this paper is not “AI can formalize mathematics better.” That is the shiny wrapper. The operational lesson is nastier and more important: an AI-generated formal artifact can pass syntactic checks, be provable, and still fail to represent the original human intent. The type checker is not a mind reader. It is a very disciplined bureaucrat.

The paper introduces Bidirectional Provability Fingerprinting (BPF), a mechanism for checking whether a natural-language mathematical statement and a candidate Lean formalization have the same semantic neighborhood: what follows from them, and what would be enough to imply them.¹ The method does not require a single gold reference formalization. Instead, it generates semantic probes from the natural-language statement, formalizes those probes, tests entailment relations in Lean 4 + mathlib4, and compares the resulting fingerprint against what the source statement is expected to mean.

The supporting components matter. Counterfactual Probe Generation makes probes deliberately target likely drift classes, rather than spraying generic examples at the wall and hoping one sticks. The Equivalence Spectrum turns faithfulness into a continuous score with accept, review, and reject regions. Adaptive Probe Budget Allocation spends entailment-query budget where it is most informative. Faithfulness-Guided Decoding feeds drift evidence back into generation, so verification becomes part of repair rather than a ceremonial stamp after the model has already made a mess.

On the paper’s DRIFTBENCH benchmark, BPF + CPG detects 89.6% of drifted formalizations at a 3.0% false-positive rate, compared with 41.2% for provability and 63.3% for an LLM judge. Adding APBA keeps detection essentially unchanged while cutting relative wall-clock cost from 26× to 8×. FGD reduces drifted outputs from 19.4% to 10.3% at four sampled candidates, and to 8.7% at eight candidates. Those are not decorative improvements. They say the mechanism is doing work.

For business use, the paper is a warning label for every pipeline that converts human language into formal machinery: requirements into tickets, policies into controls, contracts into clauses, spreadsheets into analytics logic, prompts into workflow automation, and product intent into code. Passing a parser, schema validator, unit test, or LLM judge is not the same as preserving intent. Enterprises need semantic certification layers: discriminating tests, bidirectional checks, uncertainty bands, review routing, and repair loops. Naturally, this does not make experts obsolete. It gives experts better alarms, which is less glamorous and more useful.

The failure is not syntax; it is semantic substitution

Most enterprise AI failures are not flamboyant. They do not announce themselves with exploding dashboards or surreal hallucinations about fictional regulations. The dangerous ones look respectable.

A model converts a natural-language instruction into a structured artifact. The artifact validates. It runs. It may even produce a result. Everyone relaxes, because apparently the machine has passed the adult supervision phase. Then, weeks later, someone notices that “all active customers” quietly became “all customers with at least one transaction,” or that “must notify the regulator within 72 hours” became “should notify the compliance team promptly.” The artifact is valid. It is also wrong. Congratulations: you certified grammar.

This paper studies that problem in the highest-stakes toy universe available: formal mathematics. The input is a natural-language mathematical statement. The output is a Lean 4 statement. The conventional gates are typechecking and, sometimes, downstream provability. Both are necessary. Neither is enough.

The authors’ motivating example is the Extreme Value Theorem. A faithful formalization says that every continuous function on a nonempty compact set attains a maximum. A drifted version may omit the nonempty assumption. Another may invert quantifiers so that the result becomes a trivial statement: for every point, there exists a point whose function value is at least as large—take the same point. That statement can be provable. It is also not the theorem.

The obvious misconception is therefore the one the paper dismantles: a formal statement that typechecks, is provable, or satisfies an LLM judge is not automatically the statement the human meant.

This is the faithfulness gap. The paper’s contribution is to turn that gap from a philosophical complaint into an operational test.

BPF treats meaning as a neighborhood, not a string

The central mechanism is elegant because it refuses to chase a canonical translation. In formal mathematics, two competent formalizers may express the same theorem differently. They may use different structures, equivalent lemmas, or library idioms. Demanding one exact reference statement would be too brittle.

So the paper defines faithfulness through an interpretation distribution: the set of formal statements that competent mathematicians would accept as expressing the natural-language source. A candidate is faithful if it is provably equivalent to most acceptable interpretations under that distribution. The distribution itself is never fully materialized. That would be delightful, and impossible.

BPF approximates the relation through probes.

The core idea is inferential: a statement’s meaning is characterized by what it implies and what implies it. If two statements have the same consequence neighborhood inside the formal theory, they are hard to distinguish semantically. If they disagree on carefully selected probes, the disagreement becomes a witness of drift.

A simplified view looks like this:

Natural-language statement N
        +
Candidate formalization F
        |
        v
Counterfactual semantic probes
        |
        v
Formalize and typecheck probes
        |
        v
Lean entailment tests:
  Does F imply probe P?
  Does probe P imply F?
        |
        v
Bidirectional fingerprint
        |
        v
Equivalence score:
  ACCEPT / REVIEW / REJECT

The bidirectional part is not ornamental. A forward-only test asks what follows from the candidate. That can miss strict strengthenings. A backward-only test asks what implies the candidate. That can miss weakenings. The paper’s Proposition 4.2 formalizes the point: under a sufficiently closed probe set, matching both directions corresponds to provable equivalence. In plain language, one-way agreement is how subtle errors buy a suit and walk into production.

The detection theorem gives the mechanism its statistical shape. If probes for a drift class have witnessability rate $\alpha(D)$, then a budget of roughly

$$ k \geq \frac{1}{\alpha(D)} \log \frac{1}{\delta} $$

is enough to flag that drift with probability at least $1-\delta$, assuming the entailment oracle can complete the relevant proofs. The paper estimates witnessability rates of about 0.62 for quantifier inversion, 0.55 for hypothesis omission, 0.48 for conclusion generalization, and 0.31 for type coercion. Type coercion is hardest because it can preserve much of the surface behavior while changing the semantic substrate. Naturally, the most annoying bug is also the least cooperative.

Counterfactual probes are the difference between testing and hoping

A naive probe generator would produce special cases, generalizations, and edge cases from the source statement. That is useful, but it is not targeted. It resembles asking random audit questions instead of asking the questions fraud actually fails.

The paper’s Counterfactual Probe Generation changes the setup. For each drift class, it mechanically creates a drifted twin of the natural-language statement: swap quantifiers, drop a premise, generalize a conclusion, or alter type conventions. Then it asks an LLM to generate probes that discriminate between the original statement and the drifted twin.

This is a subtle design choice. The drifted twin is created mechanically, not by the LLM. That matters because an LLM-generated negative example may drift along several axes at once or become nonsense. Mechanical perturbation gives the contrast a stable label. The LLM is used to propose discriminating probes, not to decide what the failure class means from scratch.

The four canonical drift classes are practical, not merely mathematical taxonomy:

Drift class	What goes wrong	Why ordinary validation misses it
Quantifier inversion	“For all there exists” becomes “there exists for all,” or similar order changes.	The result may remain well-typed and sometimes provable while saying something else.
Hypothesis omission	A premise disappears.	The formal statement can become stronger, false, or misleading without syntax failure.
Conclusion generalization	The candidate claims more than the source intended.	Local plausibility hides global mismatch.
Type coercion	The object is silently retagged, such as changing number domains or total/partial structures.	Surface form can remain close while formal meaning shifts.

The empirical effect is visible in the budget curves. CPG reaches the performance of naive BPF with less than half the probe budget, and the paper reports that it outperforms generic probes at every budget level on three of the four drift classes. This is not just better prompting. It is better test design.

For enterprise readers, this is the first bridge out of formal mathematics. If an AI turns a policy into executable controls, do not merely test generic examples. Generate counterfactual variants of the policy: remove a condition, reverse an exception, broaden a threshold, narrow a role definition, alter a jurisdiction. Then test whether the generated artifact responds differently where it should. That is how intent preservation becomes observable.

The Equivalence Spectrum is a triage instrument, not a trophy score

Binary verdicts are too crude for semantic drift. A candidate with one suspicious fingerprint cell is not the same as one that fails half the probe set. The paper’s Equivalence Spectrum defines a score $E(F,N) \in [0,1]$, weighted by drift class, and calibrates it into three regions on held-out DRIFTBENCH data:

Region	Score range	Operational meaning
ACCEPT	$E \geq 0.93$	Candidate is likely faithful enough to pass the automated gate.
REVIEW	$0.78 \leq E < 0.93$	Candidate needs expert inspection.
REJECT	$E < 0.78$	Candidate shows substantial drift evidence.

This is one of the paper’s more business-relevant design moves. A continuous score enables ranking, review queues, and reward shaping. A binary label mostly enables argument.

The REVIEW band is the interesting part. It acknowledges that semantic certification is not a theological declaration of truth. It is a risk-routing layer. Some cases should be accepted quickly. Some should be rejected quickly. Some deserve human attention because the system has found enough smoke to avoid pretending the room is fine.

That maps cleanly to enterprise AI governance. The purpose of a certification layer is not to eliminate review. It is to spend review budget on the cases where review has the highest expected value. This is what audit teams, compliance teams, and platform teams actually need. Not more dashboards with green checkmarks. Fewer stupid green checkmarks.

APBA cuts the cost by asking the next best question

BPF’s dominant cost is entailment checking. Each probe requires formal reasoning under a tactic budget. The paper uses Lean 4 + mathlib4 with bounded proof-search tactics capped at 30 seconds per query. This makes the method stronger than a pure LLM judge, but also more expensive. Physics remains undefeated.

Adaptive Probe Budget Allocation addresses this by routing probes according to expected information gain. Instead of allocating uniformly across drift classes, APBA estimates which class is currently most informative and spends the next probe there. The paper’s theoretical account links the savings to heterogeneity in witnessability rates. If some drift classes are easier or more informative to test than others, uniform allocation wastes effort.

The empirical result matters. Full BPF + CPG costs 26× relative to typechecking in the paper’s setup. BPF + CPG + APBA preserves essentially the same F1 and detection rate while reducing relative wall-clock cost to 8×. In probe-budget scaling, uniform allocation needs about 28 probes to reach a detection rate near 0.87; APBA reaches the same level at about 9 probes, a 3.1× reduction.

This is the difference between a verification idea and a verification system. Many AI safety methods look excellent until someone asks how many times they must run the expensive component. APBA is the part of the paper that notices the invoice.

FGD turns verification into repair

Post-hoc verification catches drift after the model emits a candidate. Useful, but reactive. The paper’s Faithfulness-Guided Decoding uses the BPF score as a reward during sample-and-rerank generation.

The procedure is simple enough to survive contact with engineering:

Sample multiple candidate formalizations.
Score each candidate with a low-budget BPF pass.
Return the highest-scoring candidate.
If all candidates fall into REVIEW or REJECT territory, feed a witness probe back as a repair hint.

This is a good pattern because the verifier does not merely say “wrong.” It supplies evidence about the direction of wrongness. Hypothesis omission and conclusion generalization benefit most because the witness probe gives the generator something concrete to repair.

On the wild split, the paper reports that the baseline autoformalizer emits drifted statements at a rate of 19.4%. With FGD at four candidates, that falls to 10.3%, a 47% relative reduction. At eight candidates, it falls to 8.7%. Reranking alone saturates around 14%, so the re-prompt loop is doing meaningful work.

For business systems, this suggests a useful architecture: validators should produce repairable evidence, not just rejection codes. A contract clause generator should be told which obligation changed. A requirements parser should be told which condition was dropped. An analytics agent should be told which population boundary drifted. “Invalid output” is an error message. “You omitted the nonempty condition” is a path to repair.

What the experiments actually support

The paper’s evidence is unusually aligned with its mechanism. The experiments are not a separate circus. They test whether each stage of the pipeline earns its keep.

Evidence item	Likely purpose	What it supports	What it does not prove
Controlled DRIFTBENCH split	Main evidence	BPF + CPG detects labeled semantic drift better than typecheck, provability, BLEU-reference, back-translation, and LLM-judge baselines.	Universal performance outside Lean/mathlib or outside the constructed drift taxonomy.
Per-class detection table	Diagnostic breakdown	The method improves across all four drift classes; type coercion remains hardest.	That every future drift class is covered.
Wild split with expert annotation	Robustness / external validity within the paper	BPF + CPG aligns with experts better than LLM judges and approaches inter-annotator agreement.	That expert disagreement disappears or that the method is autonomous ground truth.
Probe budget curves	Sensitivity / cost analysis	CPG and APBA improve efficiency; adaptive routing cuts probe requirements.	That cost is negligible in production environments.
Ablations	Mechanism validation	Bidirectionality, CPG, and targeted probe types materially contribute to performance.	That the exact component choices are optimal.
FGD evaluation	Exploratory downstream extension	Fingerprint scores can reduce drift during generation, not only detect it afterward.	That generation can be made drift-free.
Convention-drift structural test	Boundary / implementation detail	Some invisible convention shifts require structural checks outside the main fingerprint.	That all convention drift is solved.

The main table is decisive for the paper’s central claim. Typecheck-only achieves F1 0.27 and detects 11.4% of drifted cases at a 3% false-positive rate. Provability improves to F1 0.42 and 41.2% detection. Back-translation reaches F1 0.66 and 58.9% detection. LLM-judge reaches F1 0.71 and 63.3% detection. BPF-naive reaches F1 0.85 and 83.2% detection. BPF + CPG reaches F1 0.91 and 89.6% detection. Adding APBA keeps F1 at 0.91 and detection at 89.4%, while reducing cost from 26× to 8×.

That progression is the paper in miniature. Syntactic validity is weak. Downstream proof is better but misses true-but-wrong statements. LLM judgment helps but is still not structured enough. Generic probes help. Counterfactual, bidirectional probes help more. Adaptive budgeting makes the useful version less expensive.

The per-class results sharpen the story. BPF + CPG detects 94% of quantifier inversion, 93% of hypothesis omission, 91% of conclusion generalization, and 66% of type coercion at the same operating point. Type coercion remains the laggard, consistent with its lower witnessability rate. That boundary is important: the method is good, not magical. We remain tragically unable to abolish hard cases by naming them.

The ablation table confirms the mechanism. Removing CPG drops F1 from 0.91 to 0.85. Using forward-only matching drops it to 0.80. Removing hypothesis-drop probes drops it to 0.82. Reducing the probe budget to 8 drops it to 0.83, while using APBA at that low budget brings it back to 0.89. These are not minor cosmetic deltas. They show that bidirectional matching, counterfactual targeting, and adaptive budget allocation are structural contributors.

The wild split is also useful because it moves beyond deterministic perturbations. The paper reports BPF + CPG agreement with expert annotators at $\kappa = 0.77$, compared with inter-annotator agreement of 0.81 and LLM-judge agreement of 0.58. That does not prove the system is an oracle. It shows that the mechanism behaves much more like expert review than like a naked model judgment.

The business analogue is specification drift

The obvious application is formal mathematics. The less obvious one is everything enterprises do with language.

Most organizations are already building autoformalization systems under less dignified names. A model reads a business requirement and produces a Jira ticket. A model reads a policy and produces a control checklist. A model reads a contract and extracts obligations. A model reads an analyst’s question and writes SQL. A model reads a customer support policy and builds an agent workflow. These are all translations from informal intent into formal or semi-formal artifacts.

They all have a faithfulness gap.

A generated SQL query can run and still answer the wrong question. A compliance control can be validly formatted and still omit an exception. A workflow can pass schema validation and still route the wrong customer segment. A contract extractor can populate every field and still reverse the obligation. A code generator can pass unit tests written for the wrong interpretation. The enterprise version of theorem provability is “the system ran.” Useful. Also insufficient.

The paper’s mechanism suggests a practical control pattern:

Paper mechanism	Enterprise analogue	Operational value
Natural-language theorem	Requirement, policy, contract clause, analytic question, workflow instruction	Source intent that must be preserved.
Candidate formalization	SQL, code, control logic, schema object, policy rule, ticket, API workflow	Executable or semi-executable artifact.
Counterfactual probes	Edge-case scenarios, policy variants, exception cases, negative examples	Tests whether the artifact changes behavior when intent changes.
Bidirectional entailment	Does the artifact imply required behavior, and do required cases imply the artifact?	Detects both overreach and underreach.
Equivalence Spectrum	Risk score with accept/review/reject bands	Routes human review budget.
APBA	Adaptive test selection	Reduces validation cost.
FGD	Repair loop with witness evidence	Improves generation rather than merely rejecting it.

The strongest inference for business is not “use Lean for everything,” although some formal-methods teams will be tempted and should be supervised. The inference is architectural: semantic validation should be designed as a discriminating probe system, not as an output-format check.

That means an enterprise AI platform should maintain libraries of counterfactual probes for recurring domains. For finance, probes around eligibility, thresholds, time windows, and aggregation boundaries. For compliance, probes around jurisdiction, obligation type, exception handling, and reporting deadlines. For procurement, probes around approval thresholds, vendor categories, conflict-of-interest flags, and contract renewal logic. For analytics, probes around population definitions, joins, missing values, and time periods.

The probes do not have to be Lean theorems. They need to be executable enough to expose consequence mismatches.

What the paper directly shows, and what Cognaptus infers

The distinction matters. Research papers are not enterprise architecture diagrams with better typography.

Layer	Claim
Directly shown by the paper	In Lean 4 + mathlib4 autoformalization, BPF + CPG detects controlled semantic drift much better than typecheck, provability, BLEU-reference, back-translation, and LLM-judge baselines.
Directly shown by the paper	Adaptive probe allocation preserves detection while cutting relative wall-clock cost substantially in the reported setup.
Directly shown by the paper	BPF scores can guide decoding and reduce drifted formalization outputs in the evaluated wild split.
Cognaptus inference	Similar semantic-certification patterns should help enterprise systems that translate informal intent into structured artifacts.
Cognaptus inference	Review bands are more useful than binary validation because they convert model uncertainty into operational routing.
Still uncertain	Whether the same magnitude of gains transfers to messy enterprise domains with weak oracles, ambiguous policies, incomplete test harnesses, and shifting definitions.

That final uncertainty is not a footnote-sized problem. Enterprises often lack a clean entailment oracle. Mathematics has Lean. SQL has execution against known data, but not always a complete semantic oracle. Contracts have legal interpretation. Policies have organizational context. Requirements have unstated assumptions, which are everyone’s favorite form of technical debt because they are free until they are not.

So the business translation is not to copy BPF literally. It is to copy the discipline: define expected consequence neighborhoods, generate discriminating probes, test both over- and under-coverage, score uncertainty, route review, and feed witness failures back into generation.

Boundaries that change how to deploy it

The paper is precise enough to say where the method weakens.

First, the entailment oracle is incomplete. Lean’s tactic stack may fail to prove a true entailment within the time budget. The paper models this with an oracle failure probability $\eta(\tau)$ and notes empirical $\eta(30s) \approx 0.07$ on DRIFTBENCH probes. Detection guarantees degrade accordingly. In the wild split, 63% of BPF false positives are traced to incomplete entailment-oracle outputs flipping a fingerprint cell, not necessarily to probe-generation failure. That is a useful diagnosis: stronger provers, cached fingerprints, or learned entailment classifiers with selective formal verification could reduce residual error.

Second, convention drift can be invisible to consequence probes. If a candidate uses a different structural convention but preserves all tested consequences, BPF may return faithful. The appendix proposes a complementary structural signature test, comparing predicted type signatures and structural keywords against the candidate. It adds 0.04 F1 on convention-drift cases in the wild split. The important phrase is “complementary,” not “solved.”

Third, probe labels matter. BPF relies on expected labels for probes generated from the natural-language source. If those labels are wrong, the fingerprint comparison inherits the error. The paper calibrates labels against expert annotation and gives PAC-style bounds under assumptions, but production systems would still need label governance.

Fourth, the benchmark is centered on Lean 4 and mathlib4. The authors report preliminary Isabelle/HOL results with F1 0.88 versus 0.91 on Lean 4, which is encouraging. But enterprise domains are not proof assistants with disciplined libraries. Their semantics are messier, their oracles weaker, and their conventions more political. Lovely combination.

Fifth, the cost is real. APBA reduces the wall-clock burden, but BPF + CPG + APBA still costs 8× relative to typechecking in the paper’s setup. That is acceptable for high-value artifacts, review triage, library construction, and safety-critical specifications. It may be excessive for low-risk, disposable outputs. Certification budgets should follow risk, not aesthetic purity.

The ROI is cheaper diagnosis, not just better accuracy

The headline number is detection accuracy. The more durable business value is diagnosis.

A normal validator says: pass or fail. An LLM judge says: probably equivalent, with the usual tone of confident fog. BPF returns witness probes and an equivalence score. That gives teams a reason for review, a direction for repair, and a way to prioritize cases.

This matters because most enterprise AI governance fails at the handoff between detection and action. A risk score appears. Someone asks what to do. The dashboard shrugs elegantly. BPF’s design is better because the evidence is closer to the failure mode. If a hypothesis was omitted, the witness can expose the missing condition. If a conclusion was generalized, the probe can show the overreach. If a type shifted, the system can route to structural checks.

In operational terms, the method supports four workflows:

Gatekeeping: reject high-drift outputs before they enter libraries, repositories, workflows, or knowledge bases.
Triage: send REVIEW-band cases to experts with witness probes attached.
Candidate selection: rank multiple generated artifacts by faithfulness score, not by fluency or superficial plausibility.
Repair: feed witness failures back into generation so the model fixes the semantic boundary it crossed.

That is a stronger governance loop than “ask another model whether the first model seems fine.” AI judging AI may be useful, but without structured probes it is often just a committee of mirrors.

The paper’s deeper message: AI needs intent tests

The long-term value of this paper is not limited to autoformalization. It gives a vocabulary for a broader class of AI systems: those that translate human intent into operational form.

That translation is where many organizations are placing their bets. Natural language is becoming the front end for software, analytics, operations, and compliance. But the front end is not the system. The system is the artifact that gets executed, reused, audited, and embedded downstream.

When the artifact is wrong, the damage compounds. A subtly drifted theorem can contaminate a formal library. A subtly drifted business rule can contaminate pricing, approvals, compliance evidence, or customer treatment. The same pattern appears with different costumes.

BPF is valuable because it asks the right operational question: not “does this output look valid?” but “does this output live in the same consequence neighborhood as the source intent?”

That is a more expensive question. It is also the question that matters.

The practical lesson is therefore simple, though not especially comforting: if an AI system converts language into something executable, the validation layer must test meaning under perturbation. It must look for the nearby ways the model could be wrong. It must distinguish accept, reject, and review. It must produce repair evidence. And it must admit that some drift is outside the current oracle.

The type checker can keep doing its job. It just should not be promoted to philosopher.

Cognaptus: Automate the Present, Incubate the Future.

Noor Islam S. Mohammad and Tamim Sheikh, “The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements,” arXiv:2606.16541, 2026, https://arxiv.org/pdf/2606.16541. ↩︎

TL;DR for operators#

The failure is not syntax; it is semantic substitution#

BPF treats meaning as a neighborhood, not a string#

Counterfactual probes are the difference between testing and hoping#

The Equivalence Spectrum is a triage instrument, not a trophy score#

APBA cuts the cost by asking the next best question#

FGD turns verification into repair#

What the experiments actually support#

The business analogue is specification drift#

What the paper directly shows, and what Cognaptus infers#

Boundaries that change how to deploy it#

The ROI is cheaper diagnosis, not just better accuracy#

The paper’s deeper message: AI needs intent tests#