Judge, Jury, and Calibration: Why AI Evaluation Needs Anchors

TL;DR for operators: AI is becoming very good at producing judgement-shaped output. That is not the same thing as judgement. Two recent papers make the same operational point from different sides: one shows how AI can estimate educational item difficulty before response data are available; the other shows how LLM-generated peer reviews can look serious while diverging from human reviewing behaviour.[1][2] ...

June 15, 2026 · 14 min · Zelina

Feedback, Not Freefall: Why LLM Writing Tools Need a Teacher in the Loop

Feedback is expensive. Anyone who has managed a classroom, a content team, a training programme, or a junior analyst cohort knows the pattern. The first draft is rarely the problem. The problem is the second draft, because the second draft requires specific feedback, delivered in language the learner can act on, without exhausting the person giving it. Multiply that by thirty students, ten assignments, uneven ability levels, and a calendar that refuses to become more generous. Suddenly “just give everyone personalised feedback” becomes one of those ideas beloved by people who do not have to do it. ...

June 14, 2026 · 17 min · Zelina

Synthesize, but Verify: The Data Flywheel Behind Useful AI Automation

Opening: Why this matters now. The easiest AI demo in the world is a model producing something plausible. A product description. A support reply. A defect image. A peer-review report. A compliance explanation. A benchmark answer. The output looks competent enough to be shown in a slide deck, which is often where corporate AI strategy goes to enjoy a short but well-lit life. ...

May 6, 2026 · 17 min · Zelina

From Gate Noise to Turnaround Intelligence: AI Agents for Airline Ground Operations

A regional airline or ground-handling team moved from scattered radio, chat, and checklist updates to a human-reviewed AI coordination layer that tracks turnaround state, detects exceptions, drafts delay explanations, and improves passenger communication.

April 30, 2026 · 9 min · Vox

When the Judge Needs Judging: LLM Evaluators Under Cross-Examination

The dashboard says the judge is fine. The document disagrees. Judge is an easy word to trust. It suggests robes, procedure, and someone in the room who is supposed to be less confused than everyone else. In AI evaluation, the word has become dangerously comfortable. Product teams now use LLMs to score summaries, rank chatbot answers, approve RAG outputs, compare model releases, and decide whether another model’s response is “good enough.” The attraction is obvious: human review is expensive, slow, and occasionally insists on context. An LLM judge is fast, scalable, and does not ask why the evaluation rubric was written five minutes before the sprint review. ...

April 20, 2026 · 14 min · Zelina

From Scattered Site Logs to Safety Intelligence: AI Mining Site Safety & Reporting Agent

A remote-site mining operator redesigned its safety reporting workflow from manual record chasing into an agent-assisted process that consolidates field evidence, surfaces risks, drafts reports, and preserves human approval for safety-critical decisions.

April 15, 2026 · 9 min · Vox

When AI Drives, Who’s in Control? — Reclaiming Determinism in Agentic Systems

A car does not care whether an AI answer is impressive. It cares whether the answer arrives before the intersection. That small timing problem is where a large part of today’s agentic AI discussion becomes unserious. We keep asking whether models are smart enough to act. In cyber-physical systems, the more painful question is whether the system around the model can make action repeatable, bounded, and recoverable when the model is late, vague, or simply wrong. ...

April 14, 2026 · 17 min · Zelina

The Ask Gap: Why AI Agents Fail Not Because They Can’t Think — But Because They Don’t Know When to Stop

A ticket lands in the queue. It looks ordinary: update a parser, answer a business question, patch a workflow, produce a SQL query. The agent opens the files, explores the schema, writes code, runs a few checks, and submits something plausible. The output is polished. The reasoning trace is confident. The dashboard marks the task as completed. ...

April 13, 2026 · 16 min · Zelina

The Stochastic Gap: Why Your AI Agent Fails Before It Starts

A procurement workflow looks boring until an AI agent touches it. Before that moment, the process is usually wrapped in the comforting machinery of enterprise software: approval rules, validation checks, role permissions, exception paths, and enough audit trails to make everyone feel governed. Then someone inserts an agent and asks it to “handle the workflow.” The agent may know the words. It may call the right tools. It may even produce the next step that looks plausible. ...

March 26, 2026 · 15 min · Zelina

From Copilots to Colleagues: The Organizational Leap to Agentic AI

Bookings are not glamorous. They arrive through email, booking platforms, supplier messages, customer updates, and last-minute changes that somehow always appear after the plan has already been “finalized.” Someone reads them. Someone reconciles them. Someone checks activity availability. Someone checks transport capacity. Someone updates the planning sheet. Someone notices that one family needs pickup from a different location. Someone quietly prevents tomorrow morning from becoming a small logistical circus. ...

March 7, 2026 · 18 min · Zelina