Reinforcement Learning

Skeletons in the Proof Closet: When Lean Provers Need Hints, Not More Compute

Compute is a very convenient alibi. When an AI system fails, the modern reflex is to ask for more of it: more samples, more tokens, more search, more GPUs, more patience from whoever is paying the invoice. This habit is not always wrong. Sometimes the model really does need another attempt. Sometimes the winning answer is hiding in sample number 47. ...

Your Agent Remembers—But Can It Forget?

Memory is usually sold as a virtue. An AI agent with memory sounds safer, smarter, more personal, more autonomous. A warehouse robot remembers where boxes were placed. A navigation agent remembers which corridor led to the exit. A workflow agent remembers what the user asked yesterday and uses that context tomorrow. This is the comforting version of memory: the past as an asset. ...

Deep GraphRAG: Teaching Retrieval to Think in Layers

Retrieval has a management problem. Not the motivational-poster kind of management problem. The operational kind. A company asks its AI system a question about a contract, a customer dispute, a policy exception, or a technical incident. The answer is not sitting in one paragraph. It is distributed across definitions, transactions, policies, exceptions, and historical context. A flat vector search grabs a few semantically similar chunks and hopes the model can stitch them together. A global summarizer reads widely, compresses aggressively, and occasionally smooths away the exact fact that mattered. A local graph search follows nearby entities and may become very confident inside the wrong neighborhood. ...

GUI-Eyes: When Agents Learn Where to Look

Screenshots look simple until they are not. A human opening a dense professional application does not inspect every pixel with equal seriousness. We glance, zoom in mentally, ignore decorative clutter, search for the likely region, then focus. In other words, we do not merely “see” the interface. We decide where to look. ...

MatchTIR: Stop Paying Every Token the Same Salary

Payroll is a useful metaphor for agent training because it makes the absurdity obvious. Imagine a project team where one employee finds the right database, another enters the correct query, a third repeatedly calls the wrong API, and a fourth finally writes the report. If the report is accepted, everyone receives the same bonus. If it fails, everyone receives the same blame. Very democratic. Also very stupid. ...

Seeing Is Thinking: When Multimodal Reasoning Stops Talking and Starts Drawing

Image work has always had a small credibility problem: people can say where they looked, but we do not always know whether they actually looked there. The same problem shows up in multimodal AI. A model can answer a question about a chart, a photograph, a geometry diagram, or a robotic scene, then produce a neat textual chain of thought afterwards. It may sound procedural. It may mention “examining the relevant region.” It may even say “the graph shows…” with the confidence of a consultant holding a laser pointer. ...

When Agents Learn Without Learning: Test-Time Reinforcement Comes of Age

A team meeting usually ends with someone saying, “Let’s remember this for next time.” Human teams sometimes do. Agent teams usually do not. A group of LLM agents can debate, critique, revise, and produce a final answer. Then the whole episode often disappears into the landfill of inference logs: useful comments, bad guesses, decisive objections, elegant checks, all flattened into “the model answered correctly” or “the model failed.” Very modern. Very wasteful. ...

Scaling the Sandbox: When LLM Agents Need Better Worlds

Sandbox is a comforting word. It sounds safe, contained, childlike. Put an AI agent in a sandbox and let it practice. Nothing catches fire. Nobody accidentally cancels a real flight. No production database wakes up with 37 mysterious refund requests and a very confused compliance officer. The problem is that most agent sandboxes are either too fake to teach anything, too manual to scale, or too close to production to be relaxing. The agent has to learn how to navigate persistent state, business rules, incomplete user information, tool failures, and multi-step dependencies. A static API-call dataset does not teach that. A role-playing LLM pretending to be the environment may hallucinate the rules. A hand-built benchmark is useful, but expensive to multiply. ...

Click, Fail, Learn: Why BEPA Might Be the First GUI Agent That Actually Improves

Clicking is easy. Clicking correctly, after the screen has changed, after a pop-up appears, after the previous attempt failed, and after the agent has only fifteen steps before the evaluator gives up: that is where GUI automation stops looking like a demo and starts looking like work. This is the problem behind BEPA, short for Bi-Level Expert-to-Policy Assimilation, introduced in the arXiv paper "From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation" [1]. The paper is about training end-to-end GUI agents, but its practical message is broader: expert workflows are not automatically useful training data. They have to be translated into something the learner can actually perform. ...

STACKPLANNER: When Agents Learn to Forget

Enterprise agents usually fail in an undramatic way. They do not rebel. They do not suddenly become conscious. They do not announce, with cinematic timing, that humanity has been replaced by a spreadsheet. They simply lose the thread. A research agent searches once, finds something half-relevant, and keeps dragging that result through the rest of the task. A report-writing workflow collects too many fragments and then forgets which ones were actually useful. A coordinator delegates to sub-agents, receives noisy outputs, and treats every message as equally important because, apparently, all context is sacred now. By the final step, the system has not become more intelligent. It has become a very expensive meeting transcript. ...