
Knowing Is Not Doing: When LLM Agents Pass the Task but Fail the World

Opening — Why this matters now
LLM agents are getting disturbingly good at finishing tasks. They click the right buttons, traverse web pages, solve text-based games, and close tickets. Benchmarks applaud. Dashboards glow green. Yet something feels off. Change the environment slightly, rotate the layout, tweak the constraints — and suddenly the same agent behaves like it woke up in a stranger’s apartment. The problem isn’t execution. It’s comprehension. ...

January 15, 2026 · 4 min · Zelina

TowerMind: When Language Models Learn That Towers Have Consequences

Opening — Why this matters now
Large Language Models have become fluent planners. Ask them to outline a strategy, decompose a task, or explain why something should work, and they rarely hesitate. Yet when placed inside an environment where actions cost resources, mistakes compound, and time does not politely pause, that fluency often collapses. ...

January 12, 2026 · 4 min · Zelina

NPCs With Short-Term Memory Loss: Benchmarking Agents That Actually Live in the World

Opening — Why this matters now
Agentic AI has entered its Minecraft phase again. Not because blocks are trendy, but because open-world games remain one of the few places where planning, memory, execution, and failure collide in real time. Yet most agent benchmarks still cheat. They rely on synthetic prompts, privileged world access, or oracle-style evaluation that quietly assumes the agent already knows where everything is. The result: impressive demos, fragile agents, and metrics that flatter models more than they inform builders. ...

January 10, 2026 · 4 min · Zelina

RxnBench: Reading Chemistry Like a Human (Turns Out That’s Hard)

Opening — Why this matters now
Multimodal Large Language Models (MLLMs) have become impressively fluent readers of the world. They can caption images, parse charts, and answer questions about documents that would once have required a human analyst and a strong coffee. Naturally, chemistry was next. But chemistry does not speak in sentences. It speaks in arrows, wedges, dashed bonds, cryptic tables, and reaction schemes buried three pages away from their explanations. If we want autonomous “AI chemists,” the real test is not trivia or SMILES strings — it is whether models can read actual chemical papers. ...

December 31, 2025 · 4 min · Zelina

Think Wide, Then Think Hard: Forcing LLMs to Be Creative (On Purpose)

Opening — Why this matters now
Large language models are prolific. Unfortunately, they are also boring in a very specific way. Give an LLM a constrained task—generate a programming problem, write a quiz, design an exercise—and it will reliably produce something correct, polite, and eerily similar to everything it has produced before. Change the temperature, swap the model, even rotate personas, and the output still clusters around the same conceptual center. ...

December 30, 2025 · 4 min · Zelina

When Models Learn to Forget: Why Memorization Isn’t the Same as Intelligence

Opening — Why this matters now
Large language models are getting better at everything—reasoning, coding, writing, even pretending to think. Yet beneath the polished surface lies an old, uncomfortable question: are these models learning, or are they remembering? The distinction used to be academic. It no longer is. As models scale, so does the risk that they silently memorize fragments of their training data—code snippets, proprietary text, personal information—then reproduce them when prompted. Recent research forces us to confront this problem directly, not with hand-waving assurances, but with careful isolation of where memorization lives inside a model. ...

December 26, 2025 · 3 min · Zelina

Personas, Panels, and the Illusion of Free A/B Tests

Opening — Why this matters now
Everyone wants cheaper A/B tests. Preferably ones that run overnight, don’t require legal approval, and don’t involve persuading an ops team that this experiment definitely won’t break production. LLM-based persona simulation looks like the answer. Replace humans with synthetic evaluators, aggregate their responses, and voilà—instant feedback loops. Faster iteration, lower cost, infinite scale. What could possibly go wrong? ...

December 25, 2025 · 5 min · Zelina

When 100% Sensitivity Isn’t Safety: How LLMs Fail in Real Clinical Work

Opening — Why this matters now
Healthcare AI has entered its most dangerous phase: the era where models look good enough to trust. Clinician‑level benchmark scores are routinely advertised, pilots are quietly expanding, and decision‑support tools are inching closer to unsupervised use. Yet beneath the reassuring metrics lies an uncomfortable truth — high accuracy does not equal safe reasoning. ...

December 25, 2025 · 5 min · Zelina

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Opening — Why this matters now
Clinical AI has entered an uncomfortable phase of maturity. Models are no longer failing loudly; they are failing quietly. They produce fluent answers, pass public benchmarks, and even outperform physicians on narrowly defined tasks — until you look closely at what those benchmarks are actually measuring. The paper at hand dissects one such case: MedCalc-Bench, the de facto evaluation standard for automated medical risk-score computation. The uncomfortable conclusion is simple: when benchmarks are treated as static truth, they slowly drift away from clinical reality — and when those same labels are reused as reinforcement-learning rewards, that drift actively teaches models the wrong thing. ...

December 23, 2025 · 4 min · Zelina

LLMs, Gotta Think ’Em All: When Pokémon Battles Become a Serious AI Benchmark

Opening — Why this matters now
For years, game AI has been split between two extremes: brittle rule-based scripts and opaque reinforcement learning behemoths. Both work—until the rules change, the content shifts, or players behave in ways the designers didn’t anticipate. Pokémon battles, deceptively simple on the surface, sit exactly at this fault line. They demand structured reasoning, probabilistic judgment, and tactical foresight, but also creativity when the meta evolves. ...

December 22, 2025 · 4 min · Zelina