RLVR | Cognaptus

Compile Once, Train Later: Offline RL Moves Code-Model Verification Upstream

Code assistants have a small accounting problem. Not the glamorous kind involving model capability, agentic workflows, or yet another dashboard with a glowing neural blob. The ordinary kind: every time a model proposes code during reinforcement learning, someone, or something, has to run it, test it, score it, and feed that score back into training. ...

Judge Math-Not by Its Parser

Opening: Why this matters now. The AI industry has discovered a wonderfully pedestrian way to misread progress: build models that can solve harder math problems, then grade them with evaluators that panic when 2040 minutes is not written as 34 hours. That is not a joke. It is the central irritation behind “Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity”, an arXiv paper that examines how mathematical reasoning benchmarks can be distorted by rigid symbolic verification.1 ...

When RL Needs a Tour Guide: OGER and the Business of Smarter Exploration

Training a reasoning model is starting to look less like feeding a student more textbooks and more like taking that student into a difficult city with a very opinionated guide. The guide should not carry the student through every street. That creates a tourist, not a navigator. But leaving the student alone with a reward signal that says only “correct” or “wrong” is not exactly enlightened pedagogy either. The student may find one narrow route, repeat it forever, and call that intelligence. We have all seen corporate training programs with roughly this level of imagination. ...

The Data Diet for Reasoning Models: Why Less (But Smarter) Wins

A model-training team has a familiar bad habit: when the model fails, it asks for more. More examples. More domains. More synthetic prompts. More compute. More benchmarks to average over until the unpleasant details become small enough to ignore. This habit is understandable. It is also expensive. And, according to SuperNova, it may be the wrong first instinct. ...

Synthetic Sense or Synthetic Nonsense? When AI Trains on Itself

Charts. Tables. Diagrams. Scanned forms. Product screenshots. Floor plans. Receipts with half-faded numbers and three suspiciously similar line items. This is where enterprise multimodal AI is supposed to become useful. Not in the demo where the model politely describes a golden retriever on a lawn, but in the operationally annoying question: which number, label, relation, or region in this visual object actually matters for the task? ...

From Retry to Recovery: Teaching AI Agents to Learn from Their Own Mistakes

A failed automation run usually tells you more than a successful one. A coding agent compiles the wrong program and receives a concrete error. A web-navigation agent clicks into the wrong product page and sees that the attributes do not match. A task agent tries an invalid action and the environment complains, patiently, like a machine that has seen too much. In each case, the system does not merely say “failed.” It gives clues. ...

Many Roads? Not Quite: Why LLM Alignment May Prefer a Single Moral Lane

Compliance teams like pluralism until the model has to make a decision. That is the quiet tension behind many enterprise AI alignment projects. We say we want models that “consider multiple perspectives,” “respect diverse values,” and “avoid one-size-fits-all answers.” Good. Nobody wants a moral reasoning system that behaves like a bureaucrat with a temperature setting of zero. But when the same system is deployed for policy review, customer escalation, internal audit, medical triage support, or financial compliance, pluralism quickly meets a less poetic requirement: the answer must be consistently defensible. ...

When Failure Pays Dividends: Recycling Reasoning in RLVR with SCOPE

Failure logs are usually where AI teams put the evidence that training was expensive. A reasoning model tries a problem. It gets most of the chain right. Then, near the end, it makes one bad algebraic turn, chooses the wrong case, or quietly invents a rule that mathematics did not approve. Under standard reinforcement learning from verifiable rewards, that rollout receives the same score as nonsense: zero. The model may have climbed nine floors and tripped on the final step; the reward system marks it as indistinguishable from someone who never entered the building. ...

ReSyn & the Rise of the Verifier: When Solving Is Hard but Checking Is Easy

Checking is the underrated job in every serious operation. A logistics manager may not instantly know the optimal route for a hundred deliveries, but she can quickly reject a route that violates vehicle capacity, time windows, or geography. A compliance officer may not draft the perfect contract clause, but he can often identify whether a clause violates a rule. A finance team may not generate the ideal capital allocation plan on first attempt, but it can test whether a proposed plan breaks liquidity, exposure, or leverage constraints. ...

Click, Fail, Learn: Why BEPA Might Be the First GUI Agent That Actually Improves

Clicking is easy. Clicking correctly, after the screen has changed, after a pop-up appears, after the previous attempt failed, and after the agent has only fifteen steps before the evaluator gives up — that is where GUI automation stops looking like a demo and starts looking like work. This is the problem behind BEPA, short for Bi-Level Expert-to-Policy Assimilation, introduced in the arXiv paper From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation.1 The paper is about training end-to-end GUI agents, but its practical message is broader: expert workflows are not automatically useful training data. They have to be translated into something the learner can actually perform. ...