Batch of Thought, Not Chain of Thought: Why LLMs Reason Better Together

Opening — Why this matters now Large Language Models have learned to think out loud. Unfortunately, they still think alone. Most modern reasoning techniques—Chain-of-Thought, ReAct, self-reflection, debate—treat each query as a sealed container. The model reasons, critiques itself, revises, and moves on. This is computationally tidy. It is also statistically wasteful. In real decision systems—fraud detection, medical triage, compliance review—we never evaluate one case in isolation. We compare. We look for outliers. We ask why one answer feels less convincing than the rest. ...

January 7, 2026 · 4 min · Zelina

MAGMA Gets a Memory: Why Flat Retrieval Is No Longer Enough

Opening — Why this matters now LLM agents are no longer judged by how clever they sound in a single turn. They are judged by whether they remember, whether they reason, and—more awkwardly—whether they can explain why an answer exists at all. As agentic systems move from demos to infrastructure, the limits of flat retrieval become painfully obvious. Semantic similarity alone is fine when the question is what. It collapses when the question is when, why, or who caused what. The MAGMA paper enters precisely at this fault line. ...

January 7, 2026 · 4 min · Zelina

Trust Issues at 35,000 Feet: Assuring AI Digital Twins Before They Fly

Opening — Why this matters now Digital twins have quietly become one of aviation’s favorite promises: simulate reality well enough, and you can test tomorrow’s airspace decisions today—safely, cheaply, and repeatedly. Add AI agents into the mix, and the ambition escalates fast. We are no longer just modeling aircraft trajectories; we are training decision-makers. ...

January 7, 2026 · 5 min · Zelina

When Pipes Speak in Probabilities: Teaching Graphs to Explain Their Leaks

Opening — Why this matters now Water utilities do not suffer from a lack of algorithms. They suffer from a lack of trustworthy ones. In an industry where dispatching a repair crew costs real money and false positives drain already thin operational budgets, a black‑box model—no matter how accurate—remains a risky proposition. Leak detection in water distribution networks (WDNs) has quietly become an ideal stress test for applied AI. The data are noisy, the events are rare, the topology is non‑Euclidean, and the consequences of wrong decisions are painfully tangible. This paper enters precisely at that fault line: it asks not only where a leak might be, but also how an engineer can understand why the model thinks so. ...

January 7, 2026 · 4 min · Zelina

When Prompts Learn Themselves: The Death of Task Cues

Opening — Why this matters now Prompt engineering was supposed to be a temporary inconvenience. A short bridge between pre‑trained language models and real-world deployment. Instead, it became a cottage industry—part folklore, part ritual—where minor phrasing changes mysteriously decide whether your system works or embarrasses you in production. The paper Automatic Prompt Engineering with No Task Cues and No Tuning quietly dismantles much of that ritual. It asks an uncomfortable question: what if prompts don’t need us nearly as much as we think? And then it answers it with a system that is deliberately unglamorous—and therefore interesting. ...

January 7, 2026 · 3 min · Zelina

EverMemOS: When Memory Stops Being a Junk Drawer

Opening — Why this matters now Long-context models were supposed to solve memory. They didn’t. Despite six-figure token windows, modern LLM agents still forget, contradict themselves, and—worse—remember the wrong things at the wrong time. The failure mode is no longer missing information. It is unstructured accumulation. We’ve built agents that can recall fragments indefinitely but cannot reason over them coherently. ...

January 6, 2026 · 3 min · Zelina

FormuLLA: When LLMs Stop Talking and Start Formulating

Opening — Why this matters now Pharmaceutical 3D printing has promised personalization for over a decade. In practice, it has mostly delivered spreadsheets, failed filaments, and a great deal of human patience. The bottleneck has never been imagination—it has been formulation. Every new drug–excipient combination still demands expensive trial-and-error, even as printers themselves have matured. ...

January 6, 2026 · 4 min · Zelina

Pulling the Thread: Why LLM Reasoning Often Unravels

Opening — Why this matters now Large Language Model (LLM) agents have crossed an uncomfortable threshold. They are no longer just autocomplete engines or polite chat companions; they are being entrusted with financial decisions, scientific hypothesis generation, and multi-step autonomous actions. With that elevation comes a familiar demand: explain yourself. Chain-of-Thought (CoT) reasoning was supposed to be the answer. Let the model “think out loud,” and transparency follows—or so the story goes. The paper behind Project Ariadne argues, with unsettling rigor, that this story is largely fiction. Much of what we see as reasoning is closer to stagecraft: convincing, articulate, and causally irrelevant. ...

January 6, 2026 · 4 min · Zelina

Think Before You Sink: Streaming Hallucinations in Long Reasoning

Opening — Why this matters now Large language models have learned to think out loud. Chain-of-thought (CoT) reasoning has become the default solution for math, planning, and multi-step decision tasks. The industry applauded: more transparency, better answers, apparent interpretability. Then reality intervened. Despite elegant reasoning traces, models still reach incorrect conclusions—sometimes confidently, sometimes catastrophically. Worse, the mistakes are no longer obvious. They creep in quietly, spread across steps, and survive superficial self-corrections. What we call “hallucination” has grown up. And our detection methods have not. ...

January 6, 2026 · 4 min · Zelina

Causality Remembers: Teaching Social Media Defenses to Learn from the Past

Opening — Why this matters now Social media coordination detection is stuck in an awkward adolescence. Platforms know coordinated inauthentic behavior exists, regulators know it scales faster than moderation teams, and researchers know correlation-heavy detectors are brittle. Yet most deployed systems still behave as if yesterday’s parameters will work tomorrow. This paper introduces Adaptive Causal Coordination Detection (ACCD)—not as another accuracy tweak, but as a structural correction. Instead of freezing assumptions into static thresholds and embeddings, ACCD treats coordination detection as a learning system with memory. And that subtle shift matters more than the headline F1 score. ...

January 5, 2026 · 4 min · Zelina