AI Governance

MARCH Orders: When AI Holds a CT Case Conference

The useful meeting, unfortunately, exists. Meetings are usually where productivity goes to file a complaint. But there is one kind of meeting that high-stakes work still needs: the review session where a first draft is challenged, evidence is checked, and a senior decision-maker signs off. Radiology has long understood this. A resident may draft the report. A fellow may question the interpretation. An attending radiologist resolves the remaining uncertainty. The point is not ceremony. The point is controlled disagreement. ...

Silent Errors, Loud Consequences: ASMR-Bench and the Coming Era of AI Auditors

Code review is supposed to be the sober adult in the room. A researcher writes code. A reviewer checks the code. A suspicious bug gets caught before it becomes a chart, a memo, a product decision, or—if everyone is having a particularly expensive week—a board presentation. That model works reasonably well when the failure is accidental and the reviewer has more patience than the author. It becomes less reassuring when the author is an AI research agent, the codebase is messy, the experiment is expensive to rerun, and the suspicious line looks less like a bug than a perfectly normal design choice. ...

When the Judge Needs Judging: LLM Evaluators Under Cross-Examination

The dashboard says the judge is fine. The document disagrees. Judge is an easy word to trust. It suggests robes, procedure, and someone in the room who is supposed to be less confused than everyone else. In AI evaluation, the word has become dangerously comfortable. Product teams now use LLMs to score summaries, rank chatbot answers, approve RAG outputs, compare model releases, and decide whether another model’s response is “good enough.” The attraction is obvious: human review is expensive, slow, and occasionally insists on context. An LLM judge is fast, scalable, and does not ask why the evaluation rubric was written five minutes before the sprint review. ...

When the Referee Wants to Be Nice: Hidden Bias in AI Judges

Audit. That is the word companies use when they want something to sound objective, disciplined, and preferably immune to politics. A model produces an answer. Another model evaluates it. The evaluator gives a verdict. Everyone gets a dashboard. The dashboard gets shown to management. Management nods, because dashboards have a calming effect on adults in conference rooms. ...

Memory Lane Meets Mainframe: Why Coding Agents Need Better Memories, Not Bigger Egos

Memory is a familiar word. That is exactly why it can mislead us. When people hear that coding agents need “memory,” the first image is often a giant scrapbook: past prompts, previous patches, command logs, successful code snippets, failed attempts, and whatever else the agent has dragged behind it like a very confident intern with a messy backpack. More memory sounds safer. More traces sound more useful. More remembered work sounds like less repeated work. ...

Reviewer, Reviewed: When AI Starts Grading the Graders

Review queue. That is where many serious organizations quietly lose time, quality, and patience. A technical team writes a proposal. A risk team checks a report. A grant committee reads applications. A legal or compliance group inspects a document for missing evidence, weak logic, and embarrassing errors. Everyone agrees that review matters. Everyone also knows the reviewers are tired. ...

Rewarding Bad Physics Habits: What VLMs Learn When You Pay Them to Reason

A factory camera sees a pressure gauge. The AI reads the image, explains the mechanism, applies the formula, and recommends an action. Everyone in the meeting relaxes, because the model has produced a neat chain of reasoning. That is usually the moment to become nervous. The dangerous part is not that a vision-language model can be wrong. We know that. The more interesting problem is that a model can become wrong in a very specific way because we trained it to chase the wrong reward. Pay it for clean formatting, and it learns to look organized. Pay it for final answers, and it may sacrifice the reasoning path. Pay it to stare at the image, and it may do better on spatial problems while forgetting that physics also contains formulas. Apparently, “look harder” is not a complete theory of mechanics. ...

Benchmarking the Benchmarks: When AI Safety Metrics Stop Meaning Anything

Safety used to sound like a simple procurement question. A vendor says its model is safe. The slide deck has benchmark scores. The scores have respectable names: accuracy, F1, safety score, refusal rate, attack success rate. Everyone nods, because familiar metric names create the soothing illusion that someone has already done the hard work. ...

Epistemic Infrastructure: Why Your AI Knows Less Than It Thinks

Documents are rarely wrong in the same way. A project proposal can be relevant but obsolete. A meeting note can be accurate but non-binding. A market-size estimate can be useful but contradicted by later due diligence. A regulatory question can be unanswered and still more important than a polished paragraph that sounds certain. This is the small, boring, expensive problem hiding inside many enterprise AI deployments: the system finds the right files, then treats unlike things as if they had the same authority. ...

The Ask Gap: Why AI Agents Fail Not Because They Can’t Think — But Because They Don’t Know When to Stop

A ticket lands in the queue. It looks ordinary: update a parser, answer a business question, patch a workflow, produce a SQL query. The agent opens the files, explores the schema, writes code, runs a few checks, and submits something plausible. The output is polished. The reasoning trace is confident. The dashboard marks the task as completed. ...