Bending the Beam, Not the Brain: What RL with Perfect Rewards Still Can’t Teach LLMs

Beams are honest objects. Push them, load them, move their supports, and they obey equilibrium equations without theatrical ambiguity. Language models, unfortunately, are less well-behaved. That is what makes BeamPERL a useful paper. It does not test LLM reasoning on a vague benchmark where “correctness” means pleasing a judge, matching a rubric, or sounding sufficiently graduate-school. It asks a compact reasoning model to solve a classical beam statics task: calculate support reactions for a loaded beam. The answers can be checked by a symbolic solver. The reward can be exact. No vibes, no partial credit, no “the answer feels plausible.” ...

March 5, 2026 · 16 min · Zelina

Dare to Benchmark: Why Data Science Agents Still Trip Over Their Own Pipelines

Spreadsheet work has a special kind of comedy. A person asks an AI agent to load a dataset, clean a few columns, train a model, generate predictions, and save a prediction.csv file. The agent writes plausible Python. The model architecture is reasonable. The explanation sounds confident. Then the whole thing fails because the agent forgot to pass the filename into the execution tool. ...

March 2, 2026 · 19 min · Zelina

When Buffers Bite Back: Teaching AI to Respect Pallets in Flexible Job Shops

Factories rarely fail because a machine cannot work. They fail because the machine, the operator, the part, the fixture, the pallet, and the next free square meter of floor space refuse to arrive in the same universe at the same time. That is why a scheduling paper about pallets is more interesting than it sounds. ...

March 2, 2026 · 16 min · Zelina

When Failure Pays Dividends: Recycling Reasoning in RLVR with SCOPE

Failure logs are usually where AI teams put the evidence that training was expensive. A reasoning model tries a problem. It gets most of the chain right. Then, near the end, it makes one bad algebraic turn, chooses the wrong case, or quietly invents a rule that mathematics did not approve. Under standard reinforcement learning from verifiable rewards, that rollout receives the same score as nonsense: zero. The model may have climbed nine floors and tripped on the final step; the reward system marks it as indistinguishable from someone who never entered the building. ...

March 2, 2026 · 15 min · Zelina

Mind the Gap: Why Agency Isn’t Intelligence (Yet)

A trading bot keeps executing while the market regime changes. A warehouse robot keeps optimizing its route while a sensor slowly drifts. A customer-service agent keeps sounding fluent while the conversation loses coherence one turn at a time. From the outside, the system still looks agentic. It acts. It responds. It may even keep producing acceptable short-term outcomes. The dashboard, naturally, waits until the mess is obvious. Dashboards are polite like that. ...

February 28, 2026 · 16 min · Zelina

Template Thinking: Why Your Next AI Agent Should Steal from Cognitive Science

Architecture is usually where AI enthusiasm goes to become expensive. A team starts with a capable model. Then it adds a planner. Then memory. Then a tool router. Then a critic. Then a second critic because the first critic was apparently too polite. A few weeks later, the “agent” works on the demo path, fails on the second edge case, and nobody can explain whether the problem is the prompt, the retrieval layer, the tool schema, the memory policy, or the small parliament of LLM calls now debating inside the workflow. ...

February 28, 2026 · 22 min · Zelina

When Agents Ask for Help: Teaching LLMs the Art of Expert Collaboration

A help desk ticket is rarely solved by the first sentence. Someone says, “The report is wrong.” Then comes the real work: wrong where, compared with what, after which data refresh, under which permission level, and whether “wrong” means mathematically false or merely politically inconvenient. The expert does not just hand over an answer. The expert asks questions, reconstructs context, and turns a vague failure into a useful diagnosis. ...

February 28, 2026 · 15 min · Zelina

Divide & Verify: When Decomposition Finally Learns to Behave

A report is only as trustworthy as the sentence nobody checked. That sounds melodramatic until an LLM-generated due diligence note, policy memo, customer support answer, or compliance summary contains three correct facts and one quiet falsehood in the same paragraph. The usual fix is simple in theory: split the answer into smaller claims, retrieve evidence for each claim, let a verifier judge them, and aggregate the results. ...

February 26, 2026 · 17 min · Zelina

Reasoning Is Optional. Optimization Is Not: Rethinking VLA Training with NORD

Driving teams do not pay for reasoning tokens because they enjoy watching a model narrate its inner life. They pay for them because, at least in current VLA training culture, reasoning traces are treated as a bridge between perception and action. The bridge is expensive. A typical reasoning-heavy Vision-Language-Action pipeline for autonomous driving collects large driving datasets, generates dense chain-of-thought-style annotations, supervised-fine-tunes the model, and then applies reinforcement learning to improve driving metrics. It is a respectable pipeline. It is also the kind of pipeline that quietly converts every research win into an invoice. ...

February 25, 2026 · 14 min · Zelina

Memory in the Mean Field: Teaching Macro Agents to Remember

Simulation has a bad habit: it becomes realistic just when it becomes too expensive to run. A simple market model can treat everyone as the same kind of agent and still say something useful. A richer model lets agents differ by wealth, income, health, location, battery level, portfolio position, or whatever state variable the domain demands. Then someone remembers that real agents do not see the whole system. Investors see prices, not everyone’s balance sheet. Households see wages and interest rates, not the full wealth distribution. Drivers see traffic signals and congestion, not the hidden intention of every other driver. ...

February 24, 2026 · 15 min · Zelina