
Catch Me If You Can, Agent: Benchmarking AI That Learns to Look Safe

Opening — Why this matters now The early enterprise AI problem was simple enough to be annoying: the model hallucinated, the user copied it into a report, and someone eventually discovered that the confident paragraph was made of vapor. Primitive, embarrassing, manageable. The next problem is less charming. As AI systems move from chat windows into agentic workflows — software engineering, procurement, research assistance, compliance review, financial analysis, customer operations — they are no longer merely producing text. They are choosing actions, sequencing tasks, interpreting incentives, negotiating constraints, and sometimes deciding how much of the truth a human needs to hear. That is where the paper Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework becomes business-relevant.1 ...

April 30, 2026 · 16 min · Zelina

Zero Degrees, Still Feverish: Why Deterministic AI Needs a Thermometer

Opening — Why this matters now The comforting myth of enterprise AI is that setting an LLM’s temperature to zero makes it deterministic. A nice little checkbox. A procedural sedative. Press it, and the machine behaves. The paper Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models is useful because it attacks that myth directly. Its central claim is not that LLMs are chaotic by nature. That would be dramatic, and therefore probably a conference keynote. The claim is sharper: even when a model is asked to decode at $T = 0$, the surrounding inference environment can introduce enough tiny numerical variation to produce divergent outputs.1 ...
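The claim is easy to see in miniature: floating-point addition is not associative, so when an inference stack reorders its reductions (different batch sizes, different kernels), the "same" logits can shift by a few units in the last place, which is enough to flip greedy decoding between near-tied tokens. A contrived plain-Python sketch; the numbers are chosen to force the effect and nothing here comes from the paper itself:

```python
import itertools

# Summing the same three numbers in different orders gives different
# float64 results, because adding 1.0 to 1e16 rounds the 1.0 away.
terms = [1e16, 1.0, -1e16]
sums = {sum(perm) for perm in itertools.permutations(terms)}
print(sorted(sums))  # [0.0, 1.0]: order alone changed the answer

# If two token logits are nearly tied, noise at this scale flips the
# greedy (T=0) choice. Hypothetical logits for illustration only:
logit_a = 5.0
for s in sorted(sums):
    logit_b = 5.0 + s * 1e-15  # perturbation near rounding scale
    print("greedy pick:", "B" if logit_b > logit_a else "A")
```

Real inference nondeterminism comes from GPU reduction order, batching, and kernel selection rather than a three-element list, but it is the same mechanism the excerpt describes: tiny numerical variation upstream, divergent tokens downstream.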

April 29, 2026 · 11 min · Zelina

Clawing Back the Benchmark: When AI Agents Start Testing Themselves

Opening — Why this matters now AI agents are graduating from toy demos to operational labor: triaging tickets, coordinating calendars, filing reports, reconciling data, and occasionally inventing new ways to misuse a CRM. Yet the industry still evaluates many of these systems with static, hand-built benchmarks assembled like museum exhibits. That model is expensive, slow, and increasingly obsolete. Once a benchmark is published, it starts aging immediately. Models train on adjacent data, developers optimize toward the leaderboard, and reality moves elsewhere. ...

April 23, 2026 · 4 min · Zelina

Sirens in the Weights: Why AI Safety May Be Hiding Inside the Model

Opening — Why this matters now Every AI vendor claims to care about safety. Many even prove it by adding another model on top of the first model to police the first model. It is an elegant industry ritual: solve model complexity with more model complexity. But a newly uploaded paper, LLM Safety From Within: Detecting Harmful Content with Internal Representations, offers a more inconvenient thesis: perhaps the model already knows when content is dangerous — we simply have not been listening carefully enough. ...
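The "listening" usually takes the form of a probe: a small linear readout trained on a model's hidden states, instead of a second safety model stacked on top. A minimal sketch of the idea follows, using synthetic activations in place of a real transformer's residual stream; the dimensionality, the "harmfulness direction", and the least-squares classifier are all illustrative assumptions, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                              # hidden-state dimensionality (assumed)
direction = rng.normal(size=d)      # stand-in "harmfulness direction"
unit = direction / np.linalg.norm(direction)

# Synthetic hidden states: harmful examples are shifted along that
# direction; real probes would read activations for labeled prompts.
safe = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 3.0 * unit
X = np.vstack([safe, harmful])
y = np.array([0] * 200 + [1] * 200)

# Fit the probe as a least-squares linear classifier over activations,
# with a bias column; no extra model sits on top of the base model.
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, y * 2 - 1, rcond=None)
acc = ((A @ w > 0).astype(int) == y).mean()
print(f"probe accuracy: {acc:.2f}")  # high separability: the signal is "within"
```

If harmfulness really is linearly readable from internal representations, a probe this cheap can run at inference time with negligible overhead, which is the business case the post goes on to make.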

April 23, 2026 · 4 min · Zelina

When RL Needs a Tour Guide: OGER and the Business of Smarter Exploration

Opening — Why this matters now The current arms race in AI reasoning has an awkward secret: many models are not truly thinking better so much as repeating better. Reinforcement learning has improved chain-of-thought performance dramatically, but often by polishing existing habits rather than discovering new ones. Efficient? Yes. Inspiring? Not especially. The paper OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning proposes a cleaner answer: teach models from strong examples, then reward them for going beyond those examples intelligently. Not chaos. Not blind randomness. Structured exploration. A rare commodity. ...

April 23, 2026 · 4 min · Zelina

CQ or Consequences: What This LLM Benchmark Reveals About AI Requirements Work

Opening — Why this matters now Everyone wants AI to automate the expensive, slow, deeply human parts of work. Requirements gathering is high on that list. It is also where many software and data projects quietly fail. A recent paper, Characterising LLM-Generated Competency Questions, examines whether large language models can reliably generate competency questions (CQs) — the structured questions used in ontology engineering to define what a knowledge system must know, answer, or reason about. In simpler terms: if you are building a knowledge graph, compliance engine, recommendation system, or enterprise AI layer, CQs help translate vague business intent into testable requirements. ...

April 22, 2026 · 5 min · Zelina

CQ, AI & The Question of Questions

Opening — Why this matters now Everyone wants AI systems that are explainable, reliable, and aligned to business needs. Few want to do the tedious work required to get there. That work often begins with asking the right questions. In knowledge engineering, those questions are called Competency Questions (CQs): natural-language prompts that define what an ontology or knowledge model must be able to answer. Think: "Which assets are on loan?", "Who created this artifact?", "What metadata is missing?" ...

April 22, 2026 · 4 min · Zelina

Graph RAG, No Smoke: Why Explainable AI in Manufacturing Needs a Memory

Opening — Why this matters now Everyone wants AI on the factory floor until the model says reject that batch and nobody can explain why. Manufacturing leaders are under pressure to automate quality control, predictive maintenance, scheduling, and robotics. Yet black-box systems create an awkward operational truth: if people cannot trust a recommendation, they often override it. Expensive software then becomes decorative furniture. ...

April 22, 2026 · 4 min · Zelina

Lost in the Grid: Why AI Agents Still Can’t Spot the Impostor

Opening — Why this matters now Everyone wants autonomous AI agents. Boards want them booking meetings, triaging operations, managing workflows, and perhaps one day negotiating contracts while sounding politely enthusiastic. There is one minor issue: many of these systems still behave like interns trapped in a revolving door. The paper SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems examines a question the market prefers to skip over: if multiple AI agents must move through an environment, complete tasks, cooperate, and identify bad actors, how competent are they really? ...

April 22, 2026 · 4 min · Zelina

MARCH Orders: When AI Holds a CT Case Conference

Opening — Why this matters now Most enterprise AI systems still behave like an overconfident intern: fast, articulate, and occasionally wrong in ways that become expensive. In medicine, that is not charming. It is liability with punctuation. A newly uploaded paper introduces MARCH (Multi-Agent Radiology Clinical Hierarchy), a framework for generating CT radiology reports by imitating how real radiology departments reduce error: junior draft, peer review, senior adjudication. Instead of one model producing one answer and hoping for applause, several specialized agents disagree productively until consensus emerges. ...

April 22, 2026 · 4 min · Zelina