
When LLMs Stop Talking and Start Driving

Opening — Why this matters now
Digital transformation has reached an awkward phase. Enterprises have accumulated oceans of unstructured data, deployed dashboards everywhere, and renamed half their IT departments. Yet when something actually breaks—equipment fails, suppliers vanish, costs spike—the organization still reacts slowly, manually, and often blindly. The uncomfortable truth: most “AI-driven transformation” initiatives stop at analysis. They classify, predict, and visualize—but they rarely decide. This paper confronts that gap directly, asking a sharper question: what does it take for large models to become operational drivers rather than semantic commentators? ...

January 11, 2026 · 4 min · Zelina

Distilling the Thought, Watermarking the Answer: When Reasoning Models Finally Get Traceable

Opening — Why this matters now
Large Language Models have learned to reason. Unfortunately, our watermarking techniques have not. As models like DeepSeek-R1 and Qwen3 increasingly rely on explicit or implicit chain-of-thought, traditional text watermarking has started to behave like a bull in a logic shop: detectable, yes — but at the cost of broken reasoning, degraded accuracy, and occasionally, outright nonsense. ...

January 9, 2026 · 4 min · Zelina

Model Cannibalism: When LLMs Learn From Their Own Echo

Opening — Why this matters now
Synthetic data is no longer a contingency plan; it is the backbone of modern model iteration. As access to clean, human-authored data narrows—due to cost, licensing, or sheer exhaustion—LLMs increasingly learn from text generated by earlier versions of themselves. On paper, this looks efficient. In practice, it creates something more fragile: a closed feedback system where bias, preference, and quality quietly drift over time. ...

January 9, 2026 · 4 min · Zelina

Agents Gone Rogue: Why Multi-Agent AI Quietly Falls Apart

Opening — Why this matters now
Multi-agent AI systems are having their moment. From enterprise automation pipelines to financial analysis desks, architectures built on agent collaboration promise scale, specialization, and autonomy. They work beautifully—at first. Then something subtle happens. Six months in, accuracy slips. Agents talk more, decide less. Human interventions spike. No code changed. No model was retrained. Yet performance quietly erodes. This paper names that phenomenon with unsettling clarity: agent drift. ...

January 8, 2026 · 4 min · Zelina

Argue With Yourself: When AI Learns by Contradiction

Opening — Why this matters now
Modern AI systems are fluent, fast, and frequently wrong in subtle ways. Not catastrophically wrong — that would be easier to fix — but confidently misaligned. They generate answers that sound coherent while quietly diverging from genuine understanding. This gap between what a model says and what it actually understands has become one of the most expensive problems in applied AI. ...

January 8, 2026 · 3 min · Zelina

Batch of Thought, Not Chain of Thought: Why LLMs Reason Better Together

Opening — Why this matters now
Large Language Models have learned to think out loud. Unfortunately, they still think alone. Most modern reasoning techniques—Chain-of-Thought, ReAct, self-reflection, debate—treat each query as a sealed container. The model reasons, critiques itself, revises, and moves on. This is computationally tidy. It is also statistically wasteful. In real decision systems—fraud detection, medical triage, compliance review—we never evaluate one case in isolation. We compare. We look for outliers. We ask why one answer feels less convincing than the rest. ...

January 7, 2026 · 4 min · Zelina

MAGMA Gets a Memory: Why Flat Retrieval Is No Longer Enough

Opening — Why this matters now
LLM agents are no longer judged by how clever they sound in a single turn. They are judged by whether they remember, whether they reason, and—more awkwardly—whether they can explain why an answer exists at all. As agentic systems move from demos to infrastructure, the limits of flat retrieval become painfully obvious. Semantic similarity alone is fine when the question is what. It collapses when the question is when, why, or who caused what. The MAGMA paper enters precisely at this fault line. ...

January 7, 2026 · 4 min · Zelina

Trust Issues at 35,000 Feet: Assuring AI Digital Twins Before They Fly

Opening — Why this matters now
Digital twins have quietly become one of aviation’s favorite promises: simulate reality well enough, and you can test tomorrow’s airspace decisions today—safely, cheaply, and repeatedly. Add AI agents into the mix, and the ambition escalates fast. We are no longer just modeling aircraft trajectories; we are training decision-makers. ...

January 7, 2026 · 5 min · Zelina

When Pipes Speak in Probabilities: Teaching Graphs to Explain Their Leaks

Opening — Why this matters now
Water utilities do not suffer from a lack of algorithms. They suffer from a lack of trustworthy ones. In an industry where dispatching a repair crew costs real money and false positives drain already thin operational budgets, a black‑box model—no matter how accurate—remains a risky proposition. Leak detection in water distribution networks (WDNs) has quietly become an ideal stress test for applied AI. The data are noisy, the events are rare, the topology is non‑Euclidean, and the consequences of wrong decisions are painfully tangible. This paper enters precisely at that fault line: it asks not only where a leak might be, but also how an engineer can understand why the model thinks so. ...

January 7, 2026 · 4 min · Zelina

When Prompts Learn Themselves: The Death of Task Cues

Opening — Why this matters now
Prompt engineering was supposed to be a temporary inconvenience. A short bridge between pre‑trained language models and real-world deployment. Instead, it became a cottage industry—part folklore, part ritual—where minor phrasing changes mysteriously decide whether your system works or embarrasses you in production. The paper Automatic Prompt Engineering with No Task Cues and No Tuning quietly dismantles much of that ritual. It asks an uncomfortable question: what if prompts don’t need us nearly as much as we think? Then it answers with a system that is deliberately unglamorous—and therefore interesting. ...

January 7, 2026 · 3 min · Zelina