AI Governance

Aligned or Just Agreeable? Why Accuracy Is a Terrible Proxy for AI–Human Alignment

Opening — Why this matters now As large language models quietly migrate from text generators to decision makers, the industry has developed an unhealthy obsession with the wrong question: Did the model choose the same option as a human? Accuracy, F1, and distributional overlap have become the default proxies for alignment. They are also deeply misleading. ...

Greedy, but Not Blind: Teaching Optimization to Listen

Opening — Why this matters now Public-sector AI has a credibility problem. Not because it cannot optimize—but because it optimizes too cleanly. In health system planning, decisions are rarely about pure efficiency. They are negotiated compromises shaped by terrain, politics, institutional memory, and hard-earned intuition. Classic optimization methods politely ignore all that. This paper tackles a question many planners quietly ask but rarely formalize: Can we let algorithms optimize without silencing human judgment—and still keep mathematical guarantees intact? ...

Houston, We Have a Benchmark: When Agentic AI Meets Orbital Reality

Opening — Why this matters now Agentic large language models are increasingly marketed as generalist planners: systems that can reason, act, and adapt across domains without bespoke algorithmic scaffolding. The pitch is seductive—why maintain a zoo of solvers when a single agent can plan everything from code refactors to satellite schedules? AstroReason-Bench arrives as a cold shower. ...

Think-with-Me: When LLMs Learn to Stop Thinking

Opening — Why this matters now The AI industry has developed an unhealthy obsession with thinking longer. More tokens, deeper chains, bigger context windows—surely that must mean better reasoning. Except, increasingly, it doesn’t. Large Reasoning Models (LRMs) often reason past the point of usefulness, slipping into self-validation loops or overwriting correct answers with unnecessary exploration. This paper proposes a heretical idea in the age of scaling: maybe the model doesn’t need to think more—it needs to know when to stop. ...

When LLMs Read the Room: Predictive Process Monitoring Without the Data Buffet

Opening — Why this matters now Predictive Process Monitoring (PPM) has always promised operational foresight: knowing how long a case will take, whether a costly activity will happen, or when things are about to go wrong. The catch has been brutally consistent — you need a lot of data. Thousands of traces. Clean logs. Stable processes. ...

One-Shot Brains, Fewer Mouths: When Multi-Agent Systems Learn to Stop Talking

Opening — Why this matters now Multi-agent LLM systems are having a moment. Software engineering agents argue with each other, math solvers debate proofs, and code reviewers nitpick outputs like caffeinated interns. The results are often impressive—and painfully expensive. Token budgets explode, latency compounds, and the coordination logic starts to look like an over-managed meeting that should have been an email. ...

Redundancy Overload Is Optional: Finding the FDs That Actually Matter

Opening — Why this matters now Functional dependency (FD) discovery has quietly become a victim of its own success. Modern algorithms can enumerate everything—and that is precisely the problem. On realistic schemas, exhaustive FD discovery produces hundreds of thousands of valid dependencies, most of which are technically correct and practically useless. Computationally expensive. Cognitively overwhelming. Operationally irrelevant. ...

$Cover image$

When the Right Answer Is No Answer: Teaching AI to Refuse Messy Math

Opening — Why this matters now Multimodal models have become unnervingly confident readers of documents. Hand them a PDF, a scanned exam paper, or a photographed worksheet, and they will happily extract text, diagrams, and even implied structure. The problem is not what they can read. It is what they refuse to unread. In real classrooms, mathematics exam papers are not pristine artifacts. They are scribbled on, folded, stained, partially photographed, and occasionally vandalized by enthusiastic graders. Yet most document benchmarks still assume a polite world where inputs are complete and legible. This gap matters. An AI system that confidently invents missing math questions is not merely wrong—it is operationally dangerous. ...

Explaining the Explainers: Why Faithful XAI for LLMs Finally Needs a Benchmark

Opening — Why this matters now Explainability for large language models has reached an uncomfortable stage of maturity. We have methods. We have surveys. We even have regulatory pressure. What we do not have—at least until now—is a reliable way to tell whether an explanation actually reflects how a model behaves, rather than how comforting it sounds. ...

MatchTIR: Stop Paying Every Token the Same Salary

Opening — Why this matters now Tool-using agents are no longer a novelty. They are quietly becoming the default interface between LLMs and the real world: APIs, databases, search engines, execution environments. Yet most reinforcement learning pipelines still behave as if every step in a trajectory deserves the same bonus. That assumption was tolerable when tasks were short. It collapses when agents think, call tools, fail, retry, and recover over ten or more turns. ...