LLMs | Cognaptus

Branching Out of the Box: Tree‑OPO Turns MCTS Traces into Better RL for Reasoning

The punchline Tree‑OPO takes something many labs already produce—MCTS rollouts from a stronger teacher—and treats them not just as answers but as a curriculum of prefixes. It then optimizes a student with GRPO-like updates, but with staged, tree-aware advantages instead of a flat group mean. The result in math reasoning (GSM8K) is a modest but consistent bump over standard GRPO while keeping memory/complexity low. Why this matters for practitioners: you can get more out of your expensive searches (or teacher traces) without training a value model or lugging around teacher logits during student training. ...

Plan, Then Rewrite: Why Explicit Intent Wins in Agent Workflows

When assistants coordinate multiple tools or agents, the biggest unforced error is planning off the raw chat log. RECAP (REwriting Conversations for Agent Planning) argues—and empirically shows—that a slim “intent rewriter” sitting between the dialogue and the planner yields better, cleaner plans, especially in the messy realities of ambiguity, intent drift, and mixed goals. The headline: rewriting the conversation into a concise, up‑to‑date intent beats throwing the whole transcript at your planner. ...

Brains Meet Brains: When LLMs Sit on Top of Supply Chain Optimizers

TL;DR Pair a classic mixed‑integer inventory redistribution model with an LLM-driven context layer and you get explainable optimization: the math still finds near‑optimal transfers, while the LLM translates them into role‑aware narratives, KPIs, and visuals. The result is faster buy‑in, fewer “why this plan?” debates, and tighter execution. Why this paper matters for operators Most planners don’t read constraint matrices. They read stockout risks, truck rolls, and WOS. The study demonstrates a working system where: ...

Judge, Jury, and Chain‑of‑Thought: Making Models StepWiser

TL;DR Generative judges that think before they judge—and are trained with online RL using stepwise labels—beat classic discriminative process reward models (PRMs). The StepWiser approach brings three wins: (1) higher accuracy at spotting the first bad step, (2) cleaner, more reliable inference via a “chunk‑reset” search that prunes bad steps while keeping overall length similar, and (3) better data selection for fine‑tuning. Why this matters (for builders and buyers) Most enterprise CoT systems fail not because they can’t produce long reasoning, but because they can’t police their own steps. Traditional PRMs act like a yes/no bouncer at each step—fast, but shallow. StepWiser reframes judging as its own reasoning task: the judge writes an analysis first, then issues a verdict. That small shift has big, practical consequences: ...

Memory With Intent: Why LLMs Need a Cognitive Workspace, Not Just a Bigger Window

TL;DR Today’s long-context and RAG systems scale storage, not thinking. Cognitive Workspace (CW) reframes memory as an active, metacognitive process: curate, plan, reuse, and consolidate. In tests, CW reports ~55–60% memory reuse and 17–18% net efficiency gains despite a 3.3× operation overhead—precisely because it thinks about what to remember and why. The Setup: Context ≠ Cognition Over the past 18 months we’ve cheered >1M-token windows and slicker attention kernels. But piling tokens into a context is like dumping files on a desk; it’s storage without stewardship. In knowledge work, what moves the needle is not how much you can “see” but how well you organize, recall, and reuse—with intent. ...

Forgetting by Design: Turning GDPR into a Systems Problem for LLMs

The “right to be forgotten” (GDPR Art. 17) has always seemed like kryptonite for large language models. Once a trillion-parameter system memorizes personal data, how can it truly be erased without starting training from scratch? Most prior attempts—whether using influence functions or alignment-style fine-tuning—felt like damage control: approximate, unverifiable, and too fragile to withstand regulatory scrutiny. This new paper, Unlearning at Scale, turns the problem on its head. It argues that forgetting is not a mathematical optimization problem, but a systems engineering challenge. If training can be made deterministic and auditable, then unlearning can be handled with the same rigor as database recovery or transaction rollbacks. ...

Paging Dr. Model: When AI Runs the Workup

What if the AI didn’t just answer a question—it ordered the right tests, asked for the right observations, and stopped when it had enough to call the case? A new paper introduces DxDirector-7B, a 7B-parameter medical LLM trained to act as the director of care, not the assistant. Instead of waiting for a physician to assemble clean inputs, the model starts from the patient’s vague chief complaint (e.g., “tummy pain and tired”) and then plans the diagnostic pathway, requesting only those clinician actions that software cannot perform (physical exams, labs, imaging). The goal is twofold: maximize diagnostic accuracy and minimize human workload. ...

$Cover image$

Count Us In: How Dual‑Agent LLMs Turn Math Slips into Teachable Moments

Large language models can talk through a solution like a star pupil—and still get the answer wrong. A new study of four modern LLMs across arithmetic, algebra, and number theory shows where they stumble (mostly procedural slips), when they recover (with a second agent), and how teams should redesign AI tutors and graders to be trustworthy in the real world. TL;DR for builders Single models still flub arithmetic. Even strong general models mis-add partial products or mis-handle carries. Reasoning-tuned models help—but not always. OpenAI o1 was consistently best; DeepSeek‑R1 “overthought” and missed basics. Two agents beat one. Peer‑review style “dual agents” dramatically raised accuracy, especially on Diophantine equations. Most errors are procedural, not conceptual. Think slips and symbolic manipulations—not deep misunderstandings. Step‑labeling works. A simple rubric (Correct / Procedural / Conceptual / Impasse) localizes faults and boosts formative feedback. What the paper really tested (and why that matters) Most benchmarks hide easy leakage and memorized patterns. Here, the authors build three item models—templates that generate many variants—to stress the models beyond memorization: ...

When AI Plays Lawmaker: Lessons from NomicLaw’s Multi-Agent Debates

When AI Plays Lawmaker: Lessons from NomicLaw’s Multi-Agent Debates Large Language Models are increasingly touted as decision-making aides in policy and governance. But what happens when we let them loose together in a legislative sandbox? NomicLaw — an open-source multi-agent simulation inspired by the self-amending game Nomic — offers a glimpse into how AI agents argue, form alliances, and shape collective rules without human scripts. The Experiment NomicLaw pits LLM agents against legally charged vignettes — from self-driving car collisions to algorithmic discrimination — in a propose → justify → vote loop. Each agent crafts a legal rule, defends it, and votes on a peer’s proposal. Scoring is simple: 10 points for a win, 5 for a tie. Two configurations were tested: ...

From GUI Novice to Digital Native: How SEAgent Teaches Itself Software Autonomously

If you’ve ever tried to automate your own software workflows using AI, you’ll know the hard part isn’t reasoning — it’s clicking the right button in a sea of ambiguous icons, drop-downs, and obscure UIs. For agents tasked with navigating GUIs like humans do, the real challenge isn’t logic — it’s context. Enter SEAgent: a self-evolving computer-use agent that doesn’t just learn to operate software — it teaches itself how to learn, using nothing but screenshots, feedback from its own past mistakes, and a clever curriculum. ...