LLM Systems

CompactRAG: When Multi-Hop Reasoning Stops Burning Tokens

Opening — Why this matters now Multi-hop reasoning has quietly become one of the most expensive habits in modern AI systems. Every additional hop—every “and then what?”—typically triggers another retrieval, another prompt expansion, another LLM call. Accuracy improves, yes, but so does the bill. CompactRAG enters this conversation with a refreshingly unfashionable claim: most of this cost is structural, not inevitable. If you stop forcing LLMs to repeatedly reread the same knowledge, multi-hop reasoning does not have to scale linearly in tokens—or in money. ...

Speculate Smarter, Not Harder: Hierarchical Decoding Without Regret

Opening — Why this matters now LLM inference has quietly become the dominant cost center of modern AI systems. Training grabs headlines; inference drains budgets. As models scale into the tens of billions of parameters, every additional forward pass hurts — financially and operationally. Speculative decoding promised relief by letting small models run ahead and big models merely verify. But verification, ironically, became the bottleneck. ...

Vibe Coding a Theorem Prover: When LLMs Prove (and Break) Themselves

Opening — Why this matters now LLMs can write code, explain proofs, and occasionally hallucinate both with equal confidence. So the obvious next question—posed almost mischievously in this paper—is whether an LLM can code a theorem prover that itself relies on LLMs. Not as a demo. Not as a toy. But as a fully automatic, kernel-checked prover that runs on a laptop and outperforms Isabelle’s industrial-grade automation in at least some regimes. ...

SAGA, Not Sci‑Fi: When LLMs Start Doing Science

Opening — Why this matters now For years, we have asked large language models to explain science. The paper behind SAGA asks a more uncomfortable question: what happens when we ask them to do science instead? Scientific discovery has always been bottlenecked not by ideas, but by coordination — between hypothesis generation, experiment design, evaluation, and iteration. SAGA reframes this entire loop as an agentic system problem. Not a chatbot. Not a single model. A laboratory of cooperating AI agents. ...

When Bandits Get Priority: Learning Under Scarce, Tiered Capacity

Opening — Why this matters now Large Language Models, edge computing platforms, and cloud inference systems all share a quiet but inconvenient truth: resources are scarce, and not everyone is equal. Some tasks pay more. Some users matter more. Some workloads jump the queue. Yet much of the bandit literature still assumes a polite world—where arms dispense rewards independently, capacity is either infinite or fixed, and every pull is treated equally. That abstraction collapses the moment you introduce priorities, stochastic capacity, and multiple simultaneous plays. ...

Cheap Thrills, Hard Guarantees: BARGAINing with LLM Cascades

When teams push large text workloads through LLMs (contract triage, lead deduping, safety filtering), they face a brutal choice: pay for the “oracle” model (accurate but pricey) or accept quality drift with a cheaper “proxy”. Model cascades promise both—use the proxy when confident, escalate uncertain items to the oracle—but in practice they’ve been fragile. SUPG and similar heuristics often over‑ or under‑sample, rely on asymptotic CLT assumptions, and miss targets when sample sizes are small. The BARGAIN framework fixes this by combining task‑aware adaptive sampling with tighter finite‑sample tests to certify targets while maximizing utility (cost saved, recall, or precision). The authors report up to 86% more cost reduction vs. SUPG for accuracy‑target (AT) workloads, and similarly large gains for precision‑target (PT) and recall‑target (RT) settings—with rigorous guarantees. ...