
When Benchmarks Lie: Teaching Leaderboards to Care About Preferences

A leaderboard is a comforting object. It gives procurement teams, product managers, and slightly sleep-deprived founders the same small pleasure: a ranked list. Bigger number, better model. Lower rank, worse model. Decision made. Spreadsheet closed. Everyone can return to pretending vendor evaluation is objective. Unfortunately, benchmarks do not care what your business actually needs. ...

February 5, 2026 · 16 min · Zelina

When Agents Stop Talking to the Wrong People

Communication sounds harmless until the wrong person gets the microphone. That is true in meetings. It is also true in multi-agent AI systems. The polite version says agents “collaborate,” “debate,” and “refine each other’s reasoning.” The less decorative version is that one agent’s output becomes another agent’s input. If the first agent is wrong, confused, strategically misleading, or simply having one of those tiny synthetic breakdowns that LLMs have with impressive confidence, the system has just created a distribution channel for bad judgment. ...

February 4, 2026 · 15 min · Zelina

Click with Confidence: Teaching GUI Agents When *Not* to Click

A click looks harmless until it is not. In consumer software, a wrong click means opening the wrong tab, dismissing the wrong pop-up, or buying the wrong color of phone case. Annoying, perhaps. Civilization survives. In enterprise workflows, a wrong click can approve a payment, change a configuration, delete a record, or submit a compliance form with the confidence of a sleepwalker holding admin rights. ...

February 3, 2026 · 17 min · Zelina

RAudit: When Models Think Too Much and Still Get It Wrong

The model is not always confused. Sometimes it has already done the work and reached the right answer, then politely walks away from it because the user sounded confident. That is the quietly irritating problem behind RAudit, a paper that studies how large language models behave when their reasoning is audited without giving the auditor the correct answer. The paper is not just another “LLMs can be sycophantic” warning. We have enough of those. At this point, saying models flatter users is like saying spreadsheets contain hidden errors. True, useful, and somehow still not enough to change deployment practice. ...

February 3, 2026 · 17 min · Zelina

Agentic Systems Need Architecture, Not Vibes

Agentic AI has a habit of sounding more engineered than it is. A demo connects an LLM to a search tool, adds a memory store, wraps the whole thing in a planner, and suddenly the slide deck says “autonomous agent.” The system may still forget what it just saw, retrieve the wrong context, misuse tools, loop on bad actions, or politely hallucinate its way into a support ticket. But the diagram has arrows, so morale remains high. ...

February 2, 2026 · 14 min · Zelina

GAVEL: When AI Safety Grows a Rulebook

Rules are boring until the audit starts. That is roughly where enterprise AI safety is heading. A chatbot can be polite, policy-aligned, and apparently harmless on the surface, while still performing the internal work of manipulation, scam automation, or unsafe assistance. Text moderation catches what the model says. Classic activation monitoring tries to catch what the model is internally representing. But both can become awkward in production: one sees too little, the other often explains too little. ...

February 2, 2026 · 17 min · Zelina

Grading the Doctor: How Health-SCORE Scales Judgment in Medical AI

Checklist is a boring word. That is why it is useful. In healthcare AI, the glamorous question is whether a model can “reason like a doctor.” The operational question is uglier: did it invent a lab value, miss an emergency referral, overstate certainty, ignore the requested format, recommend unsafe antibiotics, or fail to ask for missing context? ...

February 2, 2026 · 15 min · Zelina

When LLMs Invent Languages: Efficiency, Secrecy, and the Limits of Natural Speech

Chatbots are trained to sound human. Enterprise AI agents are increasingly asked to behave like colleagues: pass information, coordinate actions, summarize context, and explain what they are doing in language people can read. That arrangement feels safe because natural language is familiar. It also feels efficient enough, at least until agents start talking to other agents. ...

January 31, 2026 · 15 min · Zelina

When Alignment Is Not Enough: Reading Between the Lines of Modern LLM Safety

A chatbot refuses a dangerous request. Everyone relaxes. This is the small theatre of modern AI safety: the model says no, the dashboard records a refusal, the vendor presentation adds another green checkmark, and the compliance team moves on to the next risk register. Very tidy. Very comforting. Also, increasingly insufficient. The problem is not that refusal behavior is meaningless. It is not. The problem is that refusal behavior is only one visible symptom of safety alignment. Modern LLM safety now depends on a larger chain: training objectives, post-training choices, inference interfaces, prompt formats, tool access, evaluation design, and deployment context. When any part of that chain changes, the nice refusal seen in a benchmark may not survive contact with the product. ...

January 26, 2026 · 15 min · Zelina

Triage by Token: When Context Clues Quietly Override Clinical Judgment

A patient walks into an emergency department. Or arrives by ambulance. Or lives far from the hospital. Or has private insurance. Or has missed prior appointments. Clinically, those details may be background noise. In triage, the core question is supposed to be sharper: how sick is this patient, how urgent is the risk, and what resources are likely needed? The Emergency Severity Index, or ESI, is not a lifestyle quiz with a stethoscope attached. ...

January 24, 2026 · 13 min · Zelina