AI Governance

LemmaBench: When AI Finally Meets Real Mathematics

Most AI math benchmarks still feel like exam rooms. The model receives a problem. It produces an answer. We score the answer. Everyone argues about whether the problem was hard enough, whether the model saw something similar during training, and whether the leaderboard means anything outside the leaderboard. Very productive. Almost as peaceful as a faculty meeting. ...

Brains, Bias & Benchmarks: Why Multimodal AI Still Struggles with Tumor Truth

MRI is a useful reality check for multimodal AI. It looks like an image problem, behaves like a reasoning problem, and punishes lazy confidence with the quiet brutality of clinical ambiguity. That is why MM-NeuroOnco is more interesting than another “new benchmark” headline.¹ The paper introduces a multimodal instruction dataset and benchmark for MRI-based brain tumor diagnosis, but the dataset size is not the main story. Yes, the authors curate a 73,226-image pool, build 24,726 semantically attributed samples, generate more than 200,000 VQA pairs, and construct a 1,000-image benchmark with more than 3,000 questions. Fine. The spreadsheet is muscular. ...

Mind the Gap: Why Agency Isn’t Intelligence (Yet)

A trading bot keeps executing while the market regime changes. A warehouse robot keeps optimizing its route while a sensor slowly drifts. A customer-service agent keeps sounding fluent while the conversation loses coherence one turn at a time. From the outside, the system still looks agentic. It acts. It responds. It may even keep producing acceptable short-term outcomes. The dashboard, naturally, waits until the mess is obvious. Dashboards are polite like that. ...

Template Thinking: Why Your Next AI Agent Should Steal from Cognitive Science

Architecture is usually where AI enthusiasm goes to become expensive. A team starts with a capable model. Then it adds a planner. Then memory. Then a tool router. Then a critic. Then a second critic because the first critic was apparently too polite. A few weeks later, the “agent” works on the demo path, fails on the second edge case, and nobody can explain whether the problem is the prompt, the retrieval layer, the tool schema, the memory policy, or the small parliament of LLM calls now debating inside the workflow. ...

When Agents Ask for Help: Teaching LLMs the Art of Expert Collaboration

A help desk ticket is rarely solved by the first sentence. Someone says, “The report is wrong.” Then comes the real work: wrong where, compared with what, after which data refresh, under which permission level, and whether “wrong” means mathematically false or merely politically inconvenient. The expert does not just hand over an answer. The expert asks questions, reconstructs context, and turns a vague failure into a useful diagnosis. ...

From Lone LLMs to Living Systems: The Multi-Agent Orchestration Shift

Email is a fine place to see the problem. Ask a large language model to draft a reply, and it usually performs well. Ask it to clear a messy inbox, identify urgent client messages, compare them with your calendar, draft replies, escalate risks, update a CRM, and avoid accidentally sending confidential material to the wrong person, and the cheerful single-assistant fantasy begins to sweat. ...

Update or Revise? Turns Out It’s the Same Argument in a Better Suit

Memory is where many AI systems quietly lose their dignity. A user corrects an agent. A compliance rule changes. A contract clause is clarified. A retrieval system finds a newer document that contradicts an older one. The system must decide what to do with the new information. Should it update because the world has changed, or revise because its earlier belief was wrong? ...

When X-Rays Talk Back: Grounding AI Diagnosis in Evidence, Not Eloquence

Chest X-rays are not mysterious objects. They are images that radiologists interrogate through a disciplined sequence: find the anatomy, measure what matters, compare against criteria, and then make a diagnostic judgment. The modern vision-language model often skips the middle of that sequence. It looks at the image, produces a polished explanation, and hopes the reader will not ask too aggressively where the evidence came from. This is how medical AI becomes impressive in a demo and uncomfortable in a clinic. Fluency is cheap. Verifiability is expensive. ...

Divide & Verify: When Decomposition Finally Learns to Behave

A report is only as trustworthy as the sentence nobody checked. That sounds melodramatic until an LLM-generated due diligence note, policy memo, customer support answer, or compliance summary contains three correct facts and one quiet falsehood in the same paragraph. The usual fix is simple in theory: split the answer into smaller claims, retrieve evidence for each claim, let a verifier judge them, and aggregate the results. ...

From Reactive to Preemptive: Benchmarking the Rise of Proactive Mobile Agents

Phone assistants have one deeply underrated talent: they wait. They wait for the user to unlock the screen. They wait for a command. They wait for a nicely phrased instruction that explains the goal, the app, the constraints, and preferably the user’s hidden motivation. Then, if the demo gods are merciful, they execute. ...