
Rebuttal Agents, Not Rebuttal Text: Why ‘Verify‑Then‑Write’ Is the Only Scalable Future

Opening — Why this matters now
Peer review rebuttals are one of the few moments in modern science where precision still beats fluency. Deadlines are tight, stakes are high, and every sentence is implicitly a legal statement about what the paper does—and does not—claim. Yet this is exactly where many researchers now lean on large language models. ...

January 21, 2026 · 3 min · Zelina

When Coders Prove Theorems: Agents, Lean, and the Quiet Death of the Specialist Prover

Opening — Why this matters now
Formal mathematics has quietly become one of the most revealing stress tests for modern AI. Not because theorems are commercially lucrative, but because they are unforgiving. Proof assistants do not care about vibes, rhetorical fluency, or confident hallucinations. Either the proof compiles, or it does not. Until recently, success in this domain required highly specialized models, intricate pipelines, and months of reinforcement learning. Numina-Lean-Agent proposes something more unsettling: maybe all of that specialization was unnecessary. ...

January 21, 2026 · 3 min · Zelina

When Retrieval Learns to Breathe: Teaching LLMs to Go Wide *and* Deep

Opening — Why this matters now
Large language models are no longer starved for text. They are starved for structure. As RAG systems mature, the bottleneck has shifted from whether we can retrieve information to how we decide where to look first, how far to go, and when to stop. Most retrieval stacks still force an early commitment: either search broadly and stay shallow, or traverse deeply and hope you picked the right starting point. ...

January 21, 2026 · 4 min · Zelina

Deep GraphRAG: Teaching Retrieval to Think in Layers

Opening — Why this matters now
Retrieval-Augmented Generation has reached an awkward adolescence. Vector search is fast, scalable, and confidently wrong when questions require structure, multi-hop reasoning, or global context. GraphRAG promised salvation by injecting topology into retrieval — and promptly ran into its own identity crisis: global search is thorough but slow, local search is precise but blind, and most systems oscillate between the two without ever resolving the tension. ...

January 20, 2026 · 4 min · Zelina

Aligned or Just Agreeable? Why Accuracy Is a Terrible Proxy for AI–Human Alignment

Opening — Why this matters now
As large language models quietly migrate from text generators to decision makers, the industry has developed an unhealthy obsession with the wrong question: Did the model choose the same option as a human? Accuracy, F1, and distributional overlap have become the default proxies for alignment. They are also deeply misleading. ...

January 19, 2026 · 4 min · Zelina

Greedy, but Not Blind: Teaching Optimization to Listen

Opening — Why this matters now
Public-sector AI has a credibility problem. Not because it cannot optimize—but because it optimizes too cleanly. In health system planning, decisions are rarely about pure efficiency. They are negotiated compromises shaped by terrain, politics, institutional memory, and hard-earned intuition. Classic optimization methods politely ignore all that. This paper tackles a question many planners quietly ask but rarely formalize: Can we let algorithms optimize without silencing human judgment—and still keep mathematical guarantees intact? ...

January 19, 2026 · 4 min · Zelina

Houston, We Have a Benchmark: When Agentic AI Meets Orbital Reality

Opening — Why this matters now
Agentic large language models are increasingly marketed as generalist planners: systems that can reason, act, and adapt across domains without bespoke algorithmic scaffolding. The pitch is seductive—why maintain a zoo of solvers when a single agent can plan everything from code refactors to satellite schedules? AstroReason-Bench arrives as a cold shower. ...

January 19, 2026 · 4 min · Zelina

Think-with-Me: When LLMs Learn to Stop Thinking

Opening — Why this matters now
The AI industry has developed an unhealthy obsession with thinking longer. More tokens, deeper chains, bigger context windows—surely that must mean better reasoning. Except, increasingly, it doesn’t. Large Reasoning Models (LRMs) often reason past the point of usefulness, slipping into self-validation loops or overwriting correct answers with unnecessary exploration. This paper proposes a heretical idea in the age of scaling: maybe the model doesn’t need to think more—it needs to know when to stop. ...

January 19, 2026 · 3 min · Zelina

When LLMs Read the Room: Predictive Process Monitoring Without the Data Buffet

Opening — Why this matters now
Predictive Process Monitoring (PPM) has always promised operational foresight: knowing how long a case will take, whether a costly activity will happen, or when things are about to go wrong. The catch has been brutally consistent — you need a lot of data. Thousands of traces. Clean logs. Stable processes. ...

January 19, 2026 · 5 min · Zelina

One-Shot Brains, Fewer Mouths: When Multi-Agent Systems Learn to Stop Talking

Opening — Why this matters now
Multi-agent LLM systems are having a moment. Software engineering agents argue with each other, math solvers debate proofs, and code reviewers nitpick outputs like caffeinated interns. The results are often impressive—and painfully expensive. Token budgets explode, latency compounds, and the coordination logic starts to look like an over-managed meeting that should have been an email. ...

January 18, 2026 · 4 min · Zelina