
Trust Issues at 35,000 Feet: Assuring AI Digital Twins Before They Fly

Opening — Why this matters now
Digital twins have quietly become one of aviation’s favorite promises: simulate reality well enough, and you can test tomorrow’s airspace decisions today—safely, cheaply, and repeatedly. Add AI agents into the mix, and the ambition escalates fast. We are no longer just modeling aircraft trajectories; we are training decision-makers. ...

January 7, 2026 · 5 min · Zelina

When Pipes Speak in Probabilities: Teaching Graphs to Explain Their Leaks

Opening — Why this matters now
Water utilities do not suffer from a lack of algorithms. They suffer from a lack of trustworthy ones. In an industry where dispatching a repair crew costs real money and false positives drain already thin operational budgets, a black‑box model—no matter how accurate—remains a risky proposition. Leak detection in water distribution networks (WDNs) has quietly become an ideal stress test for applied AI. The data are noisy, the events are rare, the topology is non‑Euclidean, and the consequences of wrong decisions are painfully tangible. This paper enters precisely at that fault line: it asks not only where a leak might be, but also how an engineer can understand why the model thinks so. ...

January 7, 2026 · 4 min · Zelina

When Prompts Learn Themselves: The Death of Task Cues

Opening — Why this matters now
Prompt engineering was supposed to be a temporary inconvenience, a short bridge between pre‑trained language models and real-world deployment. Instead, it became a cottage industry—part folklore, part ritual—where minor phrasing changes mysteriously decide whether your system works or embarrasses you in production. The paper Automatic Prompt Engineering with No Task Cues and No Tuning quietly dismantles much of that ritual. It asks an uncomfortable question: what if prompts don’t need us nearly as much as we think? And then it answers it with a system that is deliberately unglamorous—and therefore interesting. ...

January 7, 2026 · 3 min · Zelina

EverMemOS: When Memory Stops Being a Junk Drawer

Opening — Why this matters now
Long-context models were supposed to solve memory. They didn’t. Despite six-figure token windows, modern LLM agents still forget, contradict themselves, and—worse—remember the wrong things at the wrong time. The failure mode is no longer missing information. It is unstructured accumulation. We’ve built agents that can recall fragments indefinitely but cannot reason over them coherently. ...

January 6, 2026 · 3 min · Zelina

FormuLLA: When LLMs Stop Talking and Start Formulating

Opening — Why this matters now
Pharmaceutical 3D printing has promised personalization for over a decade. In practice, it has mostly delivered spreadsheets, failed filaments, and a great deal of human patience. The bottleneck has never been imagination—it has been formulation. Every new drug–excipient combination still demands expensive trial-and-error, even as printers themselves have matured. ...

January 6, 2026 · 4 min · Zelina

Pulling the Thread: Why LLM Reasoning Often Unravels

Opening — Why this matters now
Large Language Model (LLM) agents have crossed an uncomfortable threshold. They are no longer just autocomplete engines or polite chat companions; they are being entrusted with financial decisions, scientific hypothesis generation, and multi-step autonomous actions. With that elevation comes a familiar demand: explain yourself. Chain-of-Thought (CoT) reasoning was supposed to be the answer. Let the model “think out loud,” and transparency follows—or so the story goes. The paper behind Project Ariadne argues, with unsettling rigor, that this story is largely fiction. Much of what we see as reasoning is closer to stagecraft: convincing, articulate, and causally irrelevant. ...

January 6, 2026 · 4 min · Zelina

Think Before You Sink: Streaming Hallucinations in Long Reasoning

Opening — Why this matters now
Large language models have learned to think out loud. Chain-of-thought (CoT) reasoning has become the default solution for math, planning, and multi-step decision tasks. The industry applauded: more transparency, better answers, apparent interpretability. Then reality intervened. Despite elegant reasoning traces, models still reach incorrect conclusions—sometimes confidently, sometimes catastrophically. Worse, the mistakes are no longer obvious. They creep in quietly, spread across steps, and survive superficial self-corrections. What we call “hallucination” has grown up. And our detection methods have not. ...

January 6, 2026 · 4 min · Zelina

Causality Remembers: Teaching Social Media Defenses to Learn from the Past

Opening — Why this matters now
Social media coordination detection is stuck in an awkward adolescence. Platforms know coordinated inauthentic behavior exists, regulators know it scales faster than moderation teams, and researchers know correlation-heavy detectors are brittle. Yet most deployed systems still behave as if yesterday’s parameters will work tomorrow. This paper introduces Adaptive Causal Coordination Detection (ACCD)—not as another accuracy tweak, but as a structural correction. Instead of freezing assumptions into static thresholds and embeddings, ACCD treats coordination detection as a learning system with memory. And that subtle shift matters more than the headline F1 score. ...

January 5, 2026 · 4 min · Zelina

Crossing the Line: Teaching Pedestrian Models to Reason, Not Memorize

Opening — Why this matters now
Pedestrian fatalities are rising, mid-block crossings dominate risk exposure, and yet most models tasked with predicting pedestrian behavior remain stubbornly local. They perform well—until they don’t. Move them to a new street, a wider arterial, or a different land-use mix, and accuracy quietly collapses. This is not a data problem. It’s a reasoning problem. ...

January 5, 2026 · 4 min · Zelina

When LLMs Stop Guessing and Start Complying: Agentic Neuro-Symbolic Programming

Opening — Why this matters now
Large Language Models are excellent improvisers. Unfortunately, software systems—especially those embedding logic, constraints, and guarantees—are not jazz clubs. They are factories. And factories care less about eloquence than about whether the machine does what it is supposed to do. Neuro-symbolic (NeSy) systems promise something enterprises quietly crave: models that reason, obey constraints, and fail predictably. Yet in practice, NeSy frameworks remain the domain of specialists fluent in obscure DSLs and brittle APIs. The result is familiar: powerful theory, low adoption. ...

January 5, 2026 · 4 min · Zelina