
GUI-Eyes: When Agents Learn Where to Look

GUI agents are getting smarter in all the wrong ways. Model sizes grow. Benchmarks inch upward. Training datasets balloon into the tens of millions of annotated clicks. Yet in real interfaces—dense IDEs, CAD tools, enterprise dashboards—agents still miss the obvious. Not because they cannot reason, but because they don’t know where to look. ...

January 17, 2026 · 4 min · Zelina

MatchTIR: Stop Paying Every Token the Same Salary

Tool-using agents are no longer a novelty. They are quietly becoming the default interface between LLMs and the real world: APIs, databases, search engines, execution environments. Yet most reinforcement learning pipelines still behave as if every step in a trajectory deserves the same bonus. That assumption was tolerable when tasks were short. It collapses when agents think, call tools, fail, retry, and recover over ten or more turns. ...

January 17, 2026 · 4 min · Zelina

Seeing Is Thinking: When Multimodal Reasoning Stops Talking and Starts Drawing

Multimodal AI has spent the last two years narrating its thoughts like a philosophy student with a whiteboard it refuses to use. Images go in, text comes out, and the actual visual reasoning—zooming, marking, tracing, predicting—happens offstage, if at all. Omni-R1 arrives with a blunt correction: reasoning that depends on vision should generate vision. ...

January 15, 2026 · 4 min · Zelina

When Agents Learn Without Learning: Test-Time Reinforcement Comes of Age

Multi-agent LLM systems are having a moment. From collaborative coding bots to diagnostic committees and AI tutors, orchestration is increasingly the default answer to hard reasoning problems. But there’s an inconvenient truth hiding behind the demos: training multi-agent systems with reinforcement learning is expensive, unstable, and often counterproductive. ...

January 15, 2026 · 4 min · Zelina

Scaling the Sandbox: When LLM Agents Need Better Worlds

LLM agents are no longer failing because they cannot reason. They fail because they are trained in worlds that are too small, too brittle, or too artificial to matter. As agents are pushed toward real-world tool use—databases, APIs, enterprise workflows—the limiting factor is no longer model size but environment quality. This paper introduces EnvScaler, a framework built on a simple premise: if you want general agentic intelligence, you must first scale the worlds agents inhabit. ...

January 14, 2026 · 3 min · Zelina

Click, Fail, Learn: Why BEPA Might Be the First GUI Agent That Actually Improves

Autonomous agents are very good at talking about tasks. They are far less competent at actually doing them—especially when “doing” involves clicking the right icon, interpreting a cluttered interface, or recovering gracefully from failure. GUI agents, in particular, suffer from a chronic problem: once they fail, they either repeat the same mistake or forget everything they once did right. ...

January 12, 2026 · 3 min · Zelina

STACKPLANNER: When Agents Learn to Forget

Multi-agent systems built on large language models are having a moment. From research copilots to autonomous report generators, the promise is seductive: split a complex task into pieces, let specialized agents work in parallel, and coordinate everything with a central planner. In practice, however, these systems tend to collapse under their own cognitive weight. ...

January 12, 2026 · 4 min · Zelina

TowerMind: When Language Models Learn That Towers Have Consequences

Large Language Models have become fluent planners. Ask them to outline a strategy, decompose a task, or explain why something should work, and they rarely hesitate. Yet when placed inside an environment where actions cost resources, mistakes compound, and time does not politely pause, that fluency often collapses. ...

January 12, 2026 · 4 min · Zelina

Stuck on Repeat: When Reinforcement Learning Fails to Notice the Rules Changed

Reinforcement learning has a credibility problem. Models ace their benchmarks, plots look reassuringly smooth, and yet the moment the environment changes in a subtle but meaningful way, performance falls off a cliff. This is usually dismissed as “out-of-distribution behavior” — a polite euphemism for “we don’t actually know what our agent learned.” ...

January 11, 2026 · 4 min · Zelina

From Tokens to Topology: Teaching LLMs to Think in Simulink

Large Language Models have become dangerously good at writing text—and conspicuously bad at respecting reality. Nowhere is this mismatch more obvious than in model-based engineering. Simulink, a cornerstone of safety-critical industries from automotive to aerospace, is not a playground for eloquence. It is a rigid, graphical, constraint-heavy environment where hallucinations are not amusing quirks but certification failures. ...

January 9, 2026 · 4 min · Zelina