
DRIFT-BENCH: When Agents Stop Asking and Start Breaking

Opening — Why this matters now
LLM agents are no longer just answering questions. They are executing SQL, calling APIs, modifying system state, and quietly making decisions that stick. Yet most evaluations still assume a fantasy user: precise, unambiguous, and cooperative. In real deployments, users are vague, wrong, impatient, or simply human. This gap is no longer academic. As agents enter finance, operations, and infrastructure, the cost of misunderstanding now rivals the cost of misreasoning. DRIFT‑BENCH arrives precisely at this fault line. ...

February 3, 2026 · 4 min · Zelina

Grading the Doctor: How Health-SCORE Scales Judgment in Medical AI

Opening — Why this matters now
Healthcare LLMs have a credibility problem. Not because they cannot answer medical questions—many now ace exam-style benchmarks—but because real medicine is not a multiple-choice test. It is open-ended, contextual, uncertain, and unforgiving. In that setting, how a model reasons, hedges, and escalates matters as much as what it says. ...

February 2, 2026 · 4 min · Zelina

When Models Start Remembering Too Much

Opening — Why this matters now
Large language models are no longer judged solely by what they can generate, but by what they remember. As models scale and datasets balloon, a quiet tension has emerged: memorization boosts fluency and benchmark scores, yet it also raises concerns around data leakage, reproducibility, and governance. The paper examined here steps directly into that tension, asking not whether memorization exists—that debate is settled—but where, how, and why it concentrates. ...

February 2, 2026 · 3 min · Zelina

Seeing Too Much: When Multimodal Models Forget Privacy

Opening — Why this matters now
Multimodal models have learned to see. Unfortunately, they have also learned to remember—and sometimes to reveal far more than they should. As vision-language models (VLMs) are deployed into search, assistants, surveillance-adjacent tools, and enterprise workflows, the question is no longer whether they can infer personal information from images, but how often they do so—and under what conditions they fail to hold back. ...

January 12, 2026 · 3 min · Zelina

Pulling the Thread: Why LLM Reasoning Often Unravels

Opening — Why this matters now
Large Language Model (LLM) agents have crossed an uncomfortable threshold. They are no longer just autocomplete engines or polite chat companions; they are being entrusted with financial decisions, scientific hypothesis generation, and multi-step autonomous actions. With that elevation comes a familiar demand: explain yourself. Chain-of-Thought (CoT) reasoning was supposed to be the answer. Let the model “think out loud,” and transparency follows—or so the story goes. The paper behind Project Ariadne argues, with unsettling rigor, that this story is largely fiction. Much of what we see as reasoning is closer to stagecraft: convincing, articulate, and causally irrelevant. ...

January 6, 2026 · 4 min · Zelina

Thinking Without Understanding: When AI Learns to Reason Anyway

Opening — Why this matters now
For years, debates about large language models (LLMs) have circled the same tired question: Do they really understand what they’re saying? The answer—still no—has been treated as a conversation stopper. But recent “reasoning models” have made that question increasingly irrelevant. A new generation of AI systems can now reason through problems step by step, critique their own intermediate outputs, and iteratively refine solutions. They do this without grounding, common sense, or symbolic understanding—yet they still solve tasks previously reserved for humans. That contradiction is not a bug in our theory of AI. It is a flaw in our theory of reasoning. ...

January 6, 2026 · 4 min · Zelina

Crossing the Line: Teaching Pedestrian Models to Reason, Not Memorize

Opening — Why this matters now
Pedestrian fatalities are rising, mid-block crossings dominate risk exposure, and yet most models tasked with predicting pedestrian behavior remain stubbornly local. They perform well—until they don’t. Move them to a new street, a wider arterial, or a different land-use mix, and accuracy quietly collapses. This is not a data problem. It’s a reasoning problem. ...

January 5, 2026 · 4 min · Zelina

Safety First, Reward Second — But Not Last

Opening — Why this matters now
Reinforcement learning has spent the last decade mastering games, simulations, and neatly bounded optimization problems. Reality, inconveniently, is none of those things. In robotics, autonomous vehicles, industrial automation, and any domain where mistakes have real-world consequences, almost safe is simply unsafe. Yet most “safe RL” methods quietly rely on a compromise: allow some violations, average them out, and hope the system behaves. This paper refuses that bargain. It treats safety as a hard constraint, not a tunable preference—and then asks an uncomfortable question: can we still learn anything useful? ...

January 4, 2026 · 4 min · Zelina

Alignment Isn’t Free: When Safety Objectives Start Competing

Opening — Why this matters now
Alignment used to be a comforting word. It suggested direction, purpose, and—most importantly—control. The paper examined here quietly dismantles that comfort. Its central argument is not that alignment is failing, but that alignment objectives increasingly interfere with each other as models scale and become more autonomous. This matters because the industry has moved from asking “Is the model aligned?” to “Which alignment goal are we willing to sacrifice today?” The paper shows that this trade‑off is no longer theoretical. It is structural. ...

December 28, 2025 · 3 min · Zelina

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

When we measure a marathon by who crosses the line, we ignore how they ran it. For LLM agents that operate through tool calls—editing a CRM, moving a robot arm, or filing a compliance report—the “how” is the difference between deployable and dangerous. Today’s paper introduces “CORE: Full‑Path Evaluation of LLM Agents Beyond Final State,” a framework that scores agents on the entire execution path rather than only the end state. Here’s why this matters for your roadmap. ...

October 2, 2025 · 4 min · Zelina