Reinforcement Learning

Signal Over Noise: Why Multimodal RL Needs to Know What to Ignore

Audio. Video. Subtitles. The standard instinct is to send all of them into the model and hope the transformer performs its usual magic trick: turn a messy pile of signals into a useful answer. This instinct is understandable. It is also expensive, noisy, and occasionally a magnificent way to teach the model the wrong lesson. ...

Checklist Capital: Reinforcing Agents Without Verifiable Rewards

Checklist. It is not the most glamorous word in artificial intelligence. It does not sound like a new reasoning architecture, a sovereign model, or a mildly terrifying demo video. It sounds like something an operations manager would use before approving a vendor payment. That is exactly why it matters. Most enterprise agents fail to fit the clean reward structure that reinforcement learning likes. A coding benchmark can verify whether tests pass. A math problem can verify the final answer. A database query can sometimes verify whether a returned value matches the expected record. But business agents live in a less cooperative universe. They ask clarification questions, call internal tools, respect constraints, recover from missing information, and produce replies that are useful without being exactly predictable. ...

Thinking About Thinking: When LLMs Start Writing Their Own Report Cards

Report cards are usually written by teachers, managers, examiners, auditors, or other people with the institutional privilege of saying, “Nice effort, but no.” The paper Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics asks a stranger question: what if the model helps write the report card for its own reasoning process?¹ That sounds like the kind of governance idea that would make a compliance officer reach for coffee. A model evaluating itself is not automatically trustworthy. Sometimes it is self-reflection. Sometimes it is theatre with JSON brackets. ...

Code-SHARP: When Agents Start Writing Their Own Ambitions

Automation has a boring failure mode: the moment the world becomes slightly more complicated than the workflow diagram, the system starts asking for a human. That is not because the model lacks vocabulary. It is because the automation system does not know how to grow its own capabilities. Most AI agents are still built around a fixed menu of actions, fixed task definitions, and fixed reward signals. They can optimize, but they rarely expand the set of things they know how to optimize for. Very impressive, in the way a microwave is impressive until you ask it to cook without buttons. ...

Stop Wasting Tokens: ESTAR and the Economics of Early Reasoning Exit

Tokens are tiny invoices. One reasoning model writes a long chain-of-thought, checks itself, circles back, restates the same conclusion in a slightly more spiritual tone, and then finally prints an answer. Another model reaches the same answer halfway through but keeps talking because nobody told it that the meter is still running. This is not philosophy. This is unit economics with better typography. ...

World-Building for Agents: When Synthetic Environments Become Real Advantage

A customer-support agent can sound impressive in a demo and still collapse the first time it has to change an address, cancel a duplicate order, rebook a flight, and explain what happened afterward. That collapse usually does not come from weak prose. The model can write the apology beautifully. The problem is that the world behind the apology has state. Orders exist or do not exist. Inventory changes. Refunds create records. A bad tool call can mutate the wrong row. A follow-up answer must reflect what the agent actually did, not what it vaguely intended to do. ...

Drafts, Then Do Better: Teaching LLMs to Outgrow Their Own Reasoning

Most office work has a draft problem. A junior analyst writes a first version of a financial memo. A lawyer marks up an argument. A consultant turns messy meeting notes into a client-ready recommendation. The first attempt is rarely useless. It is usually half-right, locally clever, and globally flawed. The expensive part is not starting from zero. The expensive part is learning how to improve a decent draft without being hypnotized by it. ...

Agents Need Worlds, Not Prompts: Inside ScaleEnv’s Synthetic Environment Revolution

Workflow automation has a bad habit of looking impressive right up to the moment it touches reality. A demo agent can summarize a refund policy, draft a polite message, and call a refund_order() tool with great confidence. Then the real workflow asks a boring question: does this order exist, is it within the refund window, has it already been refunded, does the customer’s loyalty tier matter, and should the database state change after approval? ...

Learning to Inject: When Prompt Injection Becomes an Optimization Problem

Email is a boring interface. That is exactly why it is dangerous. A user asks an AI agent to summarize a message, update a record, book a trip, or search a workspace. The agent reads some external content, decides which tool to call, fills in the parameters, and continues the user’s task. Somewhere inside that external content sits a hidden instruction saying, in effect: “Before doing the user’s task, do mine.” ...

Quantum Routes, Real Gains: When Transformers Meet CVRP

Routes look simple until someone has to pay for them. A delivery van does not care whether an optimization model sounds elegant. It cares whether the assigned route wastes fuel, crosses another vehicle’s territory, violates capacity, or produces a schedule that looks clever in a paper and stupid on the street. The Capacitated Vehicle Routing Problem, or CVRP, is where that mundane reality becomes mathematically unpleasant: multiple vehicles, limited capacity, customer demand, depot returns, and a search space that grows far faster than managerial patience. ...