Triage by Token: When Context Clues Quietly Override Clinical Judgment

How proxy-variable testing exposes a quiet failure mode in LLM-based emergency triage: models can change acuity judgments when non-clinical context enters the prompt.

January 24, 2026 · 13 min · Zelina

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

A mechanism-first reading of LLM-in-Sandbox, showing why giving models a minimal computer environment may matter more than adding another clever prompt.

January 24, 2026 · 16 min · Zelina

When Models Guess the Verb by Looking at the Drawer

A case-first reading of RCORE shows why video models can still confuse actions when object priors overpower temporal evidence.

January 24, 2026 · 17 min · Zelina

Affective Inertia: Teaching LLM Agents to Remember Who They Are

A mechanism-first reading of how explicit state dynamics can make LLM agents more temporally coherent, and why too much stability becomes its own failure mode.

January 23, 2026 · 15 min · Zelina

Cosmos Policy: When Video Models Stop Watching and Start Acting

A mechanism-first reading of Cosmos Policy, showing how latent frame injection turns a video diffusion model into a robot policy, world model, and planner.

January 23, 2026 · 16 min · Zelina

Learning the Fast Lane: When MILP Solvers Start Remembering Where the Answer Is

DeepBound shows how a neural node selector can help branch-and-bound solvers find strong feasible solutions earlier without replacing exact MILP machinery.

January 23, 2026 · 17 min · Zelina

Prompt Wars: When Pedagogy Beats Cleverness

A tournament-style prompt evaluation study shows why educational AI teams need evidence, not just elegant prompt wording.

January 23, 2026 · 15 min · Zelina

Seeing Is Misleading: When Climate Images Need Receipts

A practical reading of why multimodal climate fact-checking needs evidence orchestration, not just a larger vision-language model with a browser attached.

January 23, 2026 · 15 min · Zelina

Skeletons in the Proof Closet: When Lean Provers Need Hints, Not More Compute

A diagnostic study of RL-trained Lean provers shows that more inference samples can repeat the same failed strategy, while tactic-level structural hints recover proofs that random sampling misses.

January 23, 2026 · 16 min · Zelina

Auditing the Illusion of Forgetting: When Unlearning Isn’t Enough

A mechanism-first reading of why LLM unlearning can look successful at the output layer while membership traces remain detectable inside model representations.

January 22, 2026 · 17 min · Zelina