
Think, Then Do: Why ReAct Turned LLMs into Real Agents

Opening — Why this matters now: Autonomous agents are suddenly everywhere. From AI copilots executing workflows to research agents browsing the web, the idea that language models can act in the world has moved from academic curiosity to operational infrastructure. But early large language models had a problem: they were excellent at reasoning in text, yet terrible at interacting with environments. Tools, APIs, databases, search engines — these were outside the model’s internal narrative. ...

March 4, 2026 · 4 min · Zelina

When the Brain Becomes the Dataset: Teaching AI to Hear Music Like Humans

Opening — Why this matters now: Artificial intelligence has become remarkably good at recognizing patterns in sound. Music recommendation systems, audio search engines, and generative music models all rely on increasingly sophisticated neural networks. Yet one question remains oddly underexplored: what if the best teacher for AI listening is not labeled data—but the human brain itself? ...

March 4, 2026 · 5 min · Zelina

When the Model Knows but Doesn't Remember: The Hidden Blind Spot in LLM Contamination Detection

Opening — Why this matters now: AI benchmarking is quietly facing a credibility crisis. Every major language model claims progress on standardized benchmarks—math reasoning, coding, scientific problem-solving. But there is a persistent suspicion underneath many impressive results: what if the model has simply seen the answers before? This problem, known as data contamination, occurs when evaluation questions appear in the model’s training data. Once contamination happens, benchmark scores stop measuring reasoning ability and start measuring memorization. ...

March 4, 2026 · 6 min · Zelina

Cheap Signals, Expensive Insights: Rethinking AI Evaluation with Tensor Factorization

Opening — Why This Matters Now: AI models are improving faster than our ability to measure them. Leaderboards still compress performance into a single scalar. One number. Clean. Marketable. Comforting. And increasingly misleading. Modern generative models do not “perform” uniformly. They excel at certain prompts, fail quietly on others, and sometimes trade strengths across subdomains. Aggregate metrics flatten this landscape into a polite fiction. ...

March 3, 2026 · 5 min · Zelina

From Perception to Empathy: Why Small Models May Win the Emotional AI Race

Opening — Why This Matters Now: Everyone is building bigger models. Fewer are asking whether bigger models actually understand us. In emotional AI, scale has become shorthand for sophistication. Multimodal LLMs now detect sentiment, recognize facial expressions, infer intent, and even generate empathetic responses. But these capabilities are usually stitched together—isolated tasks, separate fine-tunings, and inconsistent reasoning layers. ...

March 3, 2026 · 5 min · Zelina

OpenRad or Open Chaos? Cleaning Up Radiology AI’s Model Mess

Opening — Why this matters now: Radiology AI is not short on models. It is short on structure. Over the past decade, thousands of deep learning systems for lesion detection, segmentation, report drafting and generative enhancement have appeared across journals, conferences and preprints. The problem is no longer innovation velocity — it is navigability. Models are scattered across supplementary PDFs, personal GitHub accounts, institutional pages and, occasionally, abandoned repositories. ...

March 3, 2026 · 4 min · Zelina

Trust Issues? Fixing Test-Time RL with Verified Votes

Opening — Why This Matters Now: Test-time scaling is the new parameter scaling. As model sizes plateau under economic and physical constraints, attention has shifted toward test-time computation and, even more aggressively, toward test-time learning. The idea is seductive: let models improve themselves on unlabeled data during inference. No human labels. No offline retraining. Just continuous self-evolution. ...

March 3, 2026 · 5 min · Zelina

When Agents Behave: Conformal Policy Control and the Business of Safe Autonomy

Opening — Why this matters now: Everyone wants autonomous agents. No one wants autonomous liability. As LLMs move from chat interfaces to decision-making systems—medical QA filters, active learning loops, black-box optimization for proteins or materials—the question shifts from “Can it perform?” to “Can we bound the damage?” Most current safety layers are either heuristic (prompt tuning, reward shaping) or asymptotic (guarantees that hold… eventually). Businesses, however, deploy systems today, under finite data, shifting distributions, and regulatory scrutiny. ...

March 3, 2026 · 5 min · Zelina

When Plans Talk Back: Conversational AI Meets Classical Planning

Opening — Why This Matters Now: Enterprises are discovering a quiet truth about AI planning systems: generating a plan is the easy part. Getting humans to trust it, refine it, and align it with real-world preferences? That’s the harder game. From supply chain orchestration to workforce scheduling and mission planning, organizations increasingly rely on automated planners. Yet most deployments still treat explanation as a static afterthought — a tooltip, a log file, perhaps a constraint violation message. In reality, planning is rarely a one-shot optimization problem. It is an iterative negotiation between human intent and computational feasibility. ...

March 3, 2026 · 5 min · Zelina

When Puzzles Become Process: Benchmarking the Agentic Mind

Opening — Why This Matters Now: For two years, the AI industry has been intoxicated by a single idea: more reasoning tokens equals more intelligence. Chain-of-thought prompting. Inference-time scaling. “Extended thinking” modes. Adjustable reasoning effort. The narrative is simple: give models more room to think, and they will think better. But here is the uncomfortable question: how do we know? ...

March 3, 2026 · 5 min · Zelina