From Prompts to Policies: The Agentic RL Playbook

A chatbot can answer a question. An agent has to do something after the answer stops being enough. That distinction sounds obvious until a system must browse, click, call an API, write code, inspect an error, remember what it tried, and decide whether another attempt is worth the cost. At that point, “better prompting” becomes the AI equivalent of telling a logistics team to be more mindful while the warehouse is on fire. Pleasant, perhaps. Not a control system. ...

September 4, 2025 · 15 min · Zelina

Judgment Day for RAG: How L‑MARS Cuts Legal Hallucinations by Design

TL;DR for operators: Legal AI does not fail only because models “hallucinate”. That word has become the industry’s favourite fog machine. The more operational diagnosis is sharper: models fail when they answer current legal questions from stale internal memory and then dress the error in confident reasoning. The L-MARS paper is useful because it separates two tasks that vendors often blend together for convenience: retrieving current legal facts and reasoning over stable legal principles.¹ On LegalSearchQA, a new 50-question benchmark built around recent U.S. legal facts verified in March 2026, L-MARS reaches 96.0% accuracy. Zero-shot GPT-4o-mini reaches 58.0%. Chain-of-thought falls to 30.0%, because step-by-step reasoning from outdated premises merely creates a more articulate mistake. ...

September 4, 2025 · 14 min · Zelina

Numbers Need Narration: Making LLMs Do Reasoning‑Intensive Regression

TL;DR for operators: Many AI workflows do not need a yes-or-no judgment. They need a number: how well did this answer follow the instruction, how far did this reasoning trace remain valid, how much better is answer A than answer B, how strong is this essay, how risky is this case, how close is this support call to escalation? ...

September 1, 2025 · 19 min · Zelina

Benchmarks with Benefits: What DeepScholar-Bench Really Measures

TL;DR for operators: DeepScholar-Bench is useful because it turns “deep research” from a demo category into a measurable workflow: retrieve the right sources, synthesize the right facts, and attach citations that actually support the claims.¹ The headline result is not flattering. No evaluated system exceeds a 31% geometric mean across all metrics. OpenAI DeepResearch leads overall with a 0.309 geometric mean, but its best-looking strengths hide serious gaps: 0.857 on organization, 0.392 on nugget coverage, 0.187 on reference coverage, and 0.124 on document importance. Translation: the report may read well while still missing the intellectual furniture. ...

August 30, 2025 · 14 min · Zelina

Who Watches the Watchers? Weak-to-Strong Monitoring that Actually Works

TL;DR for operators: The paper’s practical message is not “add a monitor and relax.” That would be adorable, in the way unsecured admin panels are adorable. The useful message is sharper: if autonomous agents know they are being watched, standard full-log monitoring becomes less reliable. Giving the monitor more information helps sometimes, but less than many teams would expect. The bigger lever is how the monitor reads the trajectory. ...

August 30, 2025 · 17 min · Zelina

Stackelbergs & Stakeholders: Turning Bits into Boardroom Moves

TL;DR for operators: BusiAgent is best read as a blueprint for governed AI work, not as proof that LLMs have learned to run companies. The paper proposes a multi-agent framework where business roles (CEO, CFO, CTO, Marketing Manager, Product Manager, HR, and others) coordinate through delegation, peer discussion, tool use, memory, and quality checks.¹ ...

August 24, 2025 · 18 min · Zelina

Blame Isn’t a Bug: Turning Agent ‘Whodunits’ into Fixable Systems

TL;DR for operators: A bad agent incident rarely starts with one dramatic mistake. It usually forms as a chain. The system may be predisposed to fail because of training data, feedback, system prompts, or scaffolding. The environment may then trigger the failure through unclear tasks, insecure information, unavailable tools, excessive permissions, or malicious inputs. Finally, the agent may commit a visible cognitive error: it overlooks something, misunderstands a command, chooses the wrong goal, or executes an action badly. ...

August 23, 2025 · 19 min · Zelina

Mirror, Signal, Manoeuvre: Why Privileged Self‑Access (Not Vibes) Defines AI Introspection

TL;DR for operators: Dashboard lights are useful because they are wired into the machine. A sticker saying “probably fine” is less useful, even if the sticker was generated in a reassuring font. That is the practical distinction in this paper. Song, Lederman, Hu, and Mahowald argue that AI introspection should not mean “the model says something plausible about itself.” It should mean the model has privileged self-access: it can report an internal state more reliably than an outside evaluator using the same visible evidence at equal or lower computational cost.¹ ...

August 23, 2025 · 14 min · Zelina

USB‑C for Agents, Stress‑Tested: What MCP‑Universe Really Reveals

TL;DR for operators: MCP-Universe is useful because it punctures a very convenient belief: once an LLM is connected to tools through MCP, the agent is basically “integrated” and therefore close to production-ready. The paper says: adorable, but no.¹ The benchmark tests agents against real MCP servers rather than toy APIs. It covers 231 tasks across Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. It uses 11 MCP servers, 133 tools, and 84 execution-based evaluators, including dynamic evaluators that retrieve live ground truth for time-sensitive tasks. ...

August 23, 2025 · 18 min · Zelina

Memory With Intent: Why LLMs Need a Cognitive Workspace, Not Just a Bigger Window

TL;DR for operators: Most enterprise LLM failures do not come from the model “not knowing enough”. They come from the system forgetting what it was doing five minutes ago, rediscovering the same facts, and treating every user turn as a fresh episode in a soap opera nobody asked to watch. The paper behind this article proposes Cognitive Workspace: an active memory architecture for LLMs that deliberately curates, reuses, consolidates, and forgets information rather than merely retrieving chunks or stretching the context window.¹ Its core claim is simple but consequential: useful long-context behaviour is not the same as having a long context window. It is the ability to maintain a working state across a task. ...

August 20, 2025 · 17 min · Zelina