
SokoBench: When Reasoning Models Lose the Plot

Opening — Why this matters now
The AI industry has grown comfortable with a flattering assumption: if a model can reason, it can plan. Multi-step logic, chain-of-thought traces, and ever-longer context windows have encouraged the belief that we are edging toward systems capable of sustained, goal-directed action. SokoBench quietly dismantles that assumption. By stripping planning down to its bare minimum, the paper reveals an uncomfortable truth: today’s large reasoning models fail not because problems are complex—but because they are long. ...

January 31, 2026 · 3 min · Zelina

When ERP Meets Attention: Teaching Transformers to Pack, Schedule, and Save Real Money

Opening — Why this matters now
Enterprise Resource Planning (ERP) systems are excellent at recording what has happened. They are far less impressive at deciding what should happen next. When decision-making involves combinatorial explosions—packing furnaces, sequencing machines, allocating scarce inputs—ERP often falls back on brittle heuristics, slow solvers, or human intuition. None scale gracefully. ...

January 31, 2026 · 4 min · Zelina

When LLMs Invent Languages: Efficiency, Secrecy, and the Limits of Natural Speech

Opening — Why this matters now
Large language models are supposed to speak our language. Yet as they become more capable, something uncomfortable emerges: when pushed to cooperate efficiently, models often abandon natural language altogether. This paper shows that modern vision–language models (VLMs) can spontaneously invent task-specific communication protocols—compressed, opaque, and sometimes deliberately unreadable to outsiders—without any fine-tuning. Just prompts. ...

January 31, 2026 · 3 min · Zelina

CAR-bench: When Agents Don’t Know What They Don’t Know

Opening — Why this matters now
LLM agents are no longer toys. They book flights, write emails, control vehicles, and increasingly operate in environments where getting it mostly right is not good enough. In real-world deployments, the failure mode that matters most is not ignorance—it is false confidence. Agents act when they should hesitate, fabricate when they should refuse, and choose when they should ask. ...

January 30, 2026 · 4 min · Zelina

Optimizing Agentic Workflows: When Agents Learn to Stop Thinking So Much

Opening — Why this matters now
Agentic AI is finally escaping the demo phase and entering production. And like most things that grow up too fast, it’s discovering an uncomfortable truth: thinking is expensive. Every planning step, every tool call, every reflective pause inside an LLM agent adds latency, cost, and failure surface. When agents are deployed across customer support, internal ops, finance tooling, or web automation, these inefficiencies stop being academic. They show up directly on the cloud bill—and sometimes in the form of agents confidently doing the wrong thing. ...

January 30, 2026 · 4 min · Zelina

Routing the Lottery: When Pruning Learns to Choose

Opening — Why this matters now
For years, pruning has promised a neat trick: take a bloated neural network, snip away most of its parameters, and still walk away with comparable performance. The Lottery Ticket Hypothesis made this idea intellectually fashionable by suggesting that large networks secretly contain sparse “winning tickets” capable of learning just as well as their dense parents. ...

January 30, 2026 · 4 min · Zelina

Safety by Design, Rewritten: When Data Defines the Boundary

Opening — Why this matters now
Safety-critical AI has a credibility problem. Not because it fails spectacularly—though that happens—but because we often cannot say where it is allowed to succeed. Regulators demand clear operational boundaries. Engineers deliver increasingly capable models. Somewhere in between, the Operational Design Domain (ODD) is supposed to translate reality into something certifiable. ...

January 30, 2026 · 5 min · Zelina

The Patient Is Not a Moving Document: Why Clinical AI Needs World Models

Opening — Why this matters now
Clinical AI has quietly hit a ceiling. Over the past five years, large language models trained on electronic health records (EHRs) have delivered impressive gains: better coding, stronger risk prediction, and even near‑physician exam performance. But beneath those wins lies an uncomfortable truth. Most clinical foundation models still treat patients as documents—static records to be summarized—rather than as systems evolving over time. ...

January 30, 2026 · 4 min · Zelina

When Rewards Learn to Think: Teaching Agents *How* They’re Wrong

Opening — Why this matters now
Agentic AI has a credibility problem. Not because agents can’t browse, code, or call tools—but because we still train them like they’re taking a final exam with no partial credit. Most agentic reinforcement learning (RL) systems reward outcomes, not process. Either the agent finishes the task correctly, or it doesn’t. For short problems, that’s tolerable. For long-horizon, tool-heavy reasoning tasks, it’s catastrophic. A single late-stage mistake erases an otherwise competent trajectory. ...

January 30, 2026 · 4 min · Zelina

World Models Meet the Office From Hell

Opening — Why this matters now
Enterprise AI has entered an awkward phase. On paper, frontier LLMs can reason, plan, call tools, and even complete multi-step tasks. In practice, they quietly break things. Not loudly. Not catastrophically. Just enough to violate a policy, invalidate a downstream record, or trigger a workflow no one notices until audit season. ...

January 30, 2026 · 4 min · Zelina