Grading the Doctor: How Health-SCORE Scales Judgment in Medical AI

Checklist is a boring word. That is why it is useful. In healthcare AI, the glamorous question is whether a model can “reason like a doctor.” The operational question is uglier: did it invent a lab value, miss an emergency referral, overstate certainty, ignore the requested format, recommend unsafe antibiotics, or fail to ask for missing context? ...

February 2, 2026 · 15 min · Zelina

MemCtrl: Teaching Small Models What *Not* to Remember

A robot assistant walks through a room. It sees a chair from the front. Then from the side. Then from a slightly worse angle. Then the same chair again, because the camera moved while the robot hesitated. In theory, all of this is “context.” In practice, it is mostly noise wearing a productivity badge. ...

January 31, 2026 · 14 min · Zelina

When Rewards Learn to Think: Teaching Agents *How* They’re Wrong

An agent fails a task. It searched the web twice, opened the wrong page, trusted a noisy snippet, wrote a plausible final answer, and lost the point. Traditional reinforcement learning sees one thing: wrong. That is brutally clean, and also rather unhelpful. The agent may have performed three useful steps before collapsing at the fourth. Or it may have wandered confidently through nonsense from the beginning. Sparse final-answer rewards flatten these cases into the same training signal. The scoreboard says “0.” Very educational, in the same way a fire alarm teaches architecture. ...

January 30, 2026 · 16 min · Zelina

Learning to Discover at Test Time: When Search Learns Back

A leaderboard usually treats an AI model like a very fast intern: give it a problem, let it try many times, keep the best answer, and politely ignore the graveyard of failed attempts. That is useful. It is also a little strange. A human engineer does not merely try 25,600 variations of a GPU kernel while keeping the same brain. After the first few failures, she learns which bottlenecks matter. After a lucky partial success, she changes how she thinks about the problem. After enough attempts, the search process is no longer just sampling. It has become learning. ...

January 24, 2026 · 18 min · Zelina

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Laptop. That is the deceptively simple object hiding inside this paper. Not a magic planner. Not a thousand-tool agent marketplace. Not a baroque workflow with seventeen orchestration layers and a dashboard that looks like a cockpit designed by consultants. A laptop. Or, more precisely, a minimal virtual computer: a sandbox with terminal access, file editing, code execution, persistent files, and the ability to install or fetch resources. In *Computer Environments Elicit General Agentic Intelligence in LLMs*, Cheng et al. ask a question that looks almost too obvious to be interesting until one remembers how much of the AI industry is still trying to squeeze “agency” out of longer prompts.[1] ...

January 24, 2026 · 16 min · Zelina

Skeletons in the Proof Closet: When Lean Provers Need Hints, Not More Compute

Compute is a very convenient alibi. When an AI system fails, the modern reflex is to ask for more of it: more samples, more tokens, more search, more GPUs, more patience from whoever is paying the invoice. This habit is not always wrong. Sometimes the model really does need another attempt. Sometimes the winning answer is hiding in sample number 47. ...

January 23, 2026 · 16 min · Zelina

Your Agent Remembers—But Can It Forget?

Memory is usually sold as a virtue. An AI agent with memory sounds safer, smarter, more personal, more autonomous. A warehouse robot remembers where boxes were placed. A navigation agent remembers which corridor led to the exit. A workflow agent remembers what the user asked yesterday and uses that context tomorrow. This is the comforting version of memory: the past as an asset. ...

January 22, 2026 · 16 min · Zelina

Deep GraphRAG: Teaching Retrieval to Think in Layers

Retrieval has a management problem. Not the motivational-poster kind of management problem. The operational kind. A company asks its AI system a question about a contract, a customer dispute, a policy exception, or a technical incident. The answer is not sitting in one paragraph. It is distributed across definitions, transactions, policies, exceptions, and historical context. A flat vector search grabs a few semantically similar chunks and hopes the model can stitch them together. A global summarizer reads widely, compresses aggressively, and occasionally smooths away the exact fact that mattered. A local graph search follows nearby entities and may become very confident inside the wrong neighborhood. ...

January 20, 2026 · 14 min · Zelina

GUI-Eyes: When Agents Learn Where to Look

Screenshots look simple until they are not. A human opening a dense professional application does not inspect every pixel with equal seriousness. We glance, zoom in mentally, ignore decorative clutter, search for the likely region, then focus. In other words, we do not merely “see” the interface. We decide where to look. ...

January 17, 2026 · 15 min · Zelina

MatchTIR: Stop Paying Every Token the Same Salary

Payroll is a useful metaphor for agent training because it makes the absurdity obvious. Imagine a project team where one employee finds the right database, another enters the correct query, a third repeatedly calls the wrong API, and a fourth finally writes the report. If the report is accepted, everyone receives the same bonus. If it fails, everyone receives the same blame. Very democratic. Also very stupid. ...

January 17, 2026 · 16 min · Zelina