
Agents That Ship, Not Just Think: When LLM Self-Improvement Meets Release Engineering

Opening — Why this matters now LLM agents are no longer party tricks. They browse the web, patch production code, orchestrate APIs, and occasionally—quite creatively—break things that used to work. The industry’s instinctive response has been to make agents smarter by turning them inward: more reflection, more self-critique, more evolutionary prompt tinkering. Performance improves. Confidence does not. ...

January 11, 2026 · 4 min · Zelina

ResMAS: When Multi-Agent Systems Stop Falling Apart

Opening — Why this matters now Multi-agent systems (MAS) built on large language models have developed a bad habit: they work brilliantly—right up until the moment one agent goes off-script. A single failure, miscommunication, or noisy response can quietly poison the entire collaboration. In production environments, this isn’t a hypothetical risk; it’s the default operating condition. ...

January 11, 2026 · 4 min · Zelina

When LLMs Stop Talking and Start Driving

Opening — Why this matters now Digital transformation has reached an awkward phase. Enterprises have accumulated oceans of unstructured data, deployed dashboards everywhere, and renamed half their IT departments. Yet when something actually breaks—equipment fails, suppliers vanish, costs spike—the organization still reacts slowly, manually, and often blindly. The uncomfortable truth: most “AI-driven transformation” initiatives stop at analysis. They classify, predict, and visualize—but they rarely decide. This paper confronts that gap directly, asking a sharper question: what does it take for large models to become operational drivers rather than semantic commentators? ...

January 11, 2026 · 4 min · Zelina

When Solvers Guess Smarter: Teaching SMT to Think in Functions

Opening — Why this matters now Quantified SMT solving has always lived in an uncomfortable space between elegance and brute force. As models grew richer—mixing non-linear arithmetic, real-valued domains, and uninterpreted functions—the solvers stayed stubbornly syntactic. They match patterns. They enumerate. They hope. Meanwhile, large language models have quietly absorbed a century’s worth of mathematical intuition. AquaForte asks an obvious but previously taboo question: what if we let SMT solvers borrow that intuition—without surrendering formal guarantees? ...

January 11, 2026 · 3 min · Zelina

Distilling the Thought, Watermarking the Answer: When Reasoning Models Finally Get Traceable

Opening — Why this matters now Large Language Models have learned to reason. Unfortunately, our watermarking techniques have not. As models like DeepSeek-R1 and Qwen3 increasingly rely on explicit or implicit chain-of-thought, traditional text watermarking has started to behave like a bull in a logic shop: detectable, yes — but at the cost of broken reasoning, degraded accuracy, and occasionally, outright nonsense. ...

January 9, 2026 · 4 min · Zelina

Model Cannibalism: When LLMs Learn From Their Own Echo

Opening — Why this matters now Synthetic data is no longer a contingency plan; it is the backbone of modern model iteration. As access to clean, human-authored data narrows—due to cost, licensing, or sheer exhaustion—LLMs increasingly learn from text generated by earlier versions of themselves. On paper, this looks efficient. In practice, it creates something more fragile: a closed feedback system where bias, preference, and quality quietly drift over time. ...

January 9, 2026 · 4 min · Zelina

When Your Agent Knows It’s Lying: Detecting Tool-Calling Hallucinations from the Inside

Opening — Why this matters now LLM-powered agents are no longer a novelty. They calculate loans, file expenses, query databases, orchestrate workflows, and—when things go wrong—quietly fabricate tool calls that look correct but aren’t. Unlike textual hallucinations, tool-calling hallucinations don’t merely misinform users; they bypass security controls, corrupt data, and undermine auditability. In short: once agents touch real systems, hallucinations become operational risk. ...

January 9, 2026 · 4 min · Zelina

Agents Gone Rogue: Why Multi-Agent AI Quietly Falls Apart

Opening — Why this matters now Multi-agent AI systems are having their moment. From enterprise automation pipelines to financial analysis desks, architectures built on agent collaboration promise scale, specialization, and autonomy. They work beautifully—at first. Then something subtle happens. Six months in, accuracy slips. Agents talk more, decide less. Human interventions spike. No code changed. No model was retrained. Yet performance quietly erodes. This paper names that phenomenon with unsettling clarity: agent drift. ...

January 8, 2026 · 4 min · Zelina

Argue With Yourself: When AI Learns by Contradiction

Opening — Why this matters now Modern AI systems are fluent, fast, and frequently wrong in subtle ways. Not catastrophically wrong — that would be easier to fix — but confidently misaligned. They generate answers that sound coherent while quietly diverging from genuine understanding. This gap between what a model says and what it actually understands has become one of the most expensive problems in applied AI. ...

January 8, 2026 · 3 min · Zelina

Graph Before You Leap: How ComfySearch Makes AI Workflows Actually Work

Opening — Why this matters now AI generation has quietly shifted from models to systems. The real productivity gains no longer come from a single prompt hitting a single model, but from orchestrating dozens of components—samplers, encoders, adapters, validators—into reusable pipelines. Platforms like ComfyUI made this modular future visible. They also exposed its fragility. ...

January 8, 2026 · 3 min · Zelina