
When Transformers Learn the Map: Why Geography Still Matters in Traffic AI

Opening — Why this matters now
Digital twins for transport are no longer futuristic demos. They are quietly becoming operational systems, expected to anticipate congestion, test control policies, and absorb shocks before drivers ever feel them. But a digital twin that only mirrors the present is reactive by definition. To be useful, it must predict. ...

February 6, 2026 · 3 min · Zelina

When VR Shooters Meet Discrete Events: Training Security Policies Without Endless Human Trials

Opening — Why this matters now
School security research lives in a permanent bind: the events we most need to understand are precisely the ones we cannot ethically or practically reproduce at scale. Real-world shooter data is sparse, incomplete, and morally costly. Virtual reality (VR) improves matters, but even VR-based human-subject experiments remain slow, expensive, and fundamentally non-iterative. ...

February 6, 2026 · 5 min · Zelina

Attention with Doubt: Teaching Transformers When *Not* to Trust Themselves

Opening — Why this matters now
Modern transformers are confident. Too confident. In high-stakes deployments—question answering, medical triage, compliance screening—this confidence routinely outruns correctness. The problem is not accuracy; it is miscalibration. Models say “I’m sure” when they shouldn’t. Most fixes arrive late in the pipeline: temperature scaling, Platt scaling, confidence rescaling after the model has already reasoned itself into a corner. What if uncertainty could intervene earlier—during reasoning rather than after the verdict? ...

February 5, 2026 · 4 min · Zelina

Perspective Without Rewards: When AI Develops a Point of View

Opening — Why this matters now
As AI systems grow more autonomous, the uncomfortable question keeps resurfacing: what does it even mean for a machine to have a perspective? Not intelligence, not planning, not goal pursuit—but a situated, history-sensitive way the world is given to the system itself. Most modern agent architectures quietly dodge this question. They optimize rewards, compress states, maximize returns—and call whatever internal structure emerges a perspective. But subjectivity, if it exists at all in machines, is unlikely to be a side effect of reward maximization. It is more plausibly a structural condition: something slow, global, and stubbornly resistant to momentary incentives. ...

February 5, 2026 · 4 min · Zelina

When LLMs Lose the Plot: Diagnosing Reasoning Instability at Inference Time

Opening — Why this matters now
If you work with large language models long enough, you start noticing a familiar failure mode. The model doesn’t just answer incorrectly—it loses the thread. Halfway through a chain-of-thought, something snaps. The reasoning drifts, doubles back, contradicts itself, and eventually lands somewhere implausible. Traditional evaluation misses this. Accuracy checks only look at the final answer, long after the damage is done. Confidence scores are static and blunt. Multi-sample techniques are expensive and retrospective. What’s missing is a process-level diagnostic—a way to tell, during inference, whether reasoning is stabilizing or quietly unraveling. ...

February 5, 2026 · 5 min · Zelina

Conducting the Agents: Why AORCHESTRA Treats Sub-Agents as Recipes, Not Roles

Opening — Why this matters now
Agentic systems are quietly hitting a ceiling. As tasks stretch across longer horizons—debugging real codebases, navigating terminals, or stitching together multi-hop web reasoning—the dominant design patterns start to fray. Fixed workflows ossify. Multi-agent chats drown in coordination overhead. Context windows bloat, then rot. AORCHESTRA enters this moment with a subtle but decisive shift: stop treating sub-agents as identities, and start treating them as configurations. ...

February 4, 2026 · 3 min · Zelina

Conformal Thinking: Teaching LLMs When to Stop Thinking

Opening — Why this matters now
Reasoning models have learned how to think longer. Unfortunately, they have not learned when to stop. Test-time scaling has become the industry’s favorite blunt instrument: allocate more tokens, get better answers—on average. But averages are a luxury in deployment. In production systems, every additional token is a cost, and every premature stop is a risk. The uncomfortable truth is that “adaptive reasoning” merely replaces one opaque knob (token limits) with another (confidence thresholds), without offering a principled way to tune either. ...

February 4, 2026 · 4 min · Zelina

More Isn’t Smarter: Why Agent Diversity Beats Agent Count

Opening — Why this matters now
Multi-agent LLM systems have quietly become the industry’s favorite way to brute-force intelligence. When one model struggles, the instinct is simple: add more agents. Vote harder. Debate longer. Spend more tokens. And yet, performance curves keep telling the same unflattering story: early gains, fast saturation, wasted compute. This paper asks the uncomfortable question most agent frameworks politely ignore: why does scaling stall so quickly—and what actually moves the needle once it does? The answer, it turns out, has less to do with how many agents you run, and more to do with how different they truly are. ...

February 4, 2026 · 4 min · Zelina

When Agents Stop Talking to the Wrong People

Opening — Why this matters now
Multi-agent LLM systems are no longer a novelty. They debate, plan, critique, simulate markets, and increasingly make decisions that look uncomfortably close to judgment. Yet as these systems scale, something quietly fragile sits underneath them: who talks to whom, and when. Most multi-agent frameworks still assume that communication is cheap, static, and benign. In practice, it is none of those. Agents drift, hallucinate, fatigue, or—worse—become adversarial while sounding perfectly reasonable. When that happens, fixed communication graphs turn from coordination tools into liability multipliers. ...

February 4, 2026 · 4 min · Zelina

When Your Agent Starts Copying Itself: Breaking Conversational Inertia

Opening — Why this matters now
Multi-turn agents are supposed to get better with experience. More context, more feedback, more opportunities to adapt. Yet in practice, the opposite often happens. Agents loop. They fixate. They repeat themselves with growing confidence and shrinking effectiveness. This paper puts a name—and a mechanism—on that failure mode: conversational inertia. And more importantly, it shows that the problem is not a lack of information, but too much of the wrong kind. ...

February 4, 2026 · 4 min · Zelina