LLM Systems

When Retrieval Learns to Breathe: Teaching LLMs to Go Wide and Deep

Retrieval has a breathing problem. Most enterprise RAG systems inhale once, grab the nearest chunks, and then hope the model can make the answer sound less fragile than the evidence actually is. That works tolerably well when the user asks for something sitting neatly inside a document paragraph. It works less well when the answer lives across entities, relations, aliases, product categories, authors, diseases, suppliers, regulations, or customer records. In other words, it works less well in the part of business where knowledge is not a pile of text but a network. ...

One Agent Is a Bottleneck: When Genomics QA Finally Went Multi-Agent

Databases are where elegant AI demos go to develop a limp. A model can sound fluent about biology, medicine, finance, or law. Then someone asks a question that requires the latest record from a specialized database, a second lookup from another source, a formatted API call, a large HTML response, and a final answer that does not forget the original question halfway through. Suddenly the “AI assistant” becomes a very expensive intern copying URLs into the wrong field. ...

Speculate Smarter, Not Harder: Hierarchical Decoding Without Regret

Speed is the polite word. Cost is the less polite one. Every production LLM system eventually meets the same boring villain: the target model must generate tokens one after another, and each forward pass is expensive. Speculative decoding was supposed to soften that problem. Let a cheaper draft model run ahead, ask the expensive model to verify the draft, and accept several tokens per target-model call when the draft is good enough. Simple. Elegant. Almost suspiciously useful. ...
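The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration, not the article's method: it uses the simplest greedy variant, where a drafted token is accepted only if it equals the target model's own greedy choice. The names `toy_draft`, `toy_target`, `speculative_step`, and the draft length `k` are hypothetical stand-ins, and a real system would verify all drafted positions in one batched target forward pass rather than position by position.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical expensive model: the next token depends only on prefix length."""
    return len(prefix) % 5


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical cheap model: agrees with the target except at some positions."""
    guess = toy_target(prefix)
    return (guess + 1) % 5 if len(prefix) % 7 == 6 else guess


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One round: draft k tokens, keep the longest prefix the target agrees with."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            # First disagreement: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens; also count verification rounds (one target pass each
    in a batched implementation)."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[:len(prefix) + n_tokens], rounds


def greedy_baseline(prefix: List[int], target: NextToken, n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target model only."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target(out))
    return out
```

Under this greedy acceptance rule the output is identical to decoding with the target model alone; the only thing that changes is how many verification rounds are needed, which depends on how often the draft agrees with the target.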

Vibe Coding a Theorem Prover: When LLMs Prove (and Break) Themselves

A theorem prover is a terrible place to let an LLM improvise. Code review is forgiving compared with theorem proving. In ordinary software, a language model can produce code that looks clean, passes a few tests, and still hides a slow-burning defect somewhere behind an edge case. Annoying, yes. Catastrophic, sometimes. But the social contract is familiar: tests catch some errors, humans catch others, production catches the rest. Very elegant. Very modern. Very expensive. ...

Infinite Tasks, Finite Minds: Why Agents Keep Forgetting—and How InfiAgent Cheats Time

A report is not finished because the model “understands” the assignment. It is finished because the system still knows, two hundred actions later, which documents were read, which notes were trustworthy, which sections remain unfinished, and which half-baked intermediate answer should not accidentally become the final one. That is the boring part of agentic AI. Naturally, it is also the part most systems quietly fail at. ...

When Bandits Get Priority: Learning Under Scarce, Tiered Capacity

Capacity looks simple until someone pays to jump the queue. That is the quiet problem behind a large amount of modern AI infrastructure. A platform may have many model instances, edge servers, or compute nodes. Tasks arrive with different business value. Enterprise traffic is more important than free-tier traffic. Some jobs have tighter latency targets. Some users, by contract or politics, are simply not equal. Lovely democratic fiction ends at the load balancer. ...

Cheap Thrills, Hard Guarantees: BARGAINing with LLM Cascades

A familiar enterprise AI story goes like this: the expensive model works, the cheap model almost works, and the finance team would very much like “almost” to become a procurement strategy. That is where the trouble starts. For large-scale document processing, classification, filtering, extraction, and review queues, teams rarely want to call the best available LLM on every record. It is too slow, too expensive, and occasionally a lovely way to convert a data pipeline into a billing incident. The obvious compromise is a model cascade: use a cheaper proxy model when it seems confident, and escalate the uncertain cases to a stronger oracle model. ...
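The basic cascade routing rule can be sketched as follows. This is only the naive fixed-threshold version, under stated assumptions: `toy_proxy`, `toy_oracle`, the labels, and the threshold value are hypothetical placeholders, and nothing here chooses the threshold in a way that yields a quality guarantee.

```python
from typing import Callable, List, Sequence, Tuple

Proxy = Callable[[str], Tuple[str, float]]   # returns (label, confidence in [0, 1])
Oracle = Callable[[str], str]                # returns a label, assumed more reliable


def run_cascade(records: Sequence[str], proxy: Proxy, oracle: Oracle,
                threshold: float) -> Tuple[List[str], int]:
    """Label every record, escalating to the oracle when the proxy is unsure.

    Returns the labels and the number of oracle calls, a rough cost proxy.
    """
    labels: List[str] = []
    oracle_calls = 0
    for record in records:
        label, confidence = proxy(record)
        if confidence < threshold:
            # Proxy is not confident enough: pay for the stronger model.
            label = oracle(record)
            oracle_calls += 1
        labels.append(label)
    return labels, oracle_calls


def toy_proxy(text: str) -> Tuple[str, float]:
    """Hypothetical cheap model: keyword rules with made-up confidences."""
    if "refund" in text:
        return "billing", 0.95
    if "crash" in text:
        return "bug", 0.90
    return "other", 0.40


def toy_oracle(text: str) -> str:
    """Hypothetical expensive model: knows a few more keywords."""
    if "refund" in text or "invoice" in text:
        return "billing"
    if "crash" in text:
        return "bug"
    return "other"


tickets = ["refund please", "app crash on login", "where is my invoice"]
labels, oracle_calls = run_cascade(tickets, toy_proxy, toy_oracle, threshold=0.8)
```

With these toy models, only the third ticket falls below the threshold and is escalated. The hard part, which this sketch leaves out, is picking a threshold that keeps accuracy acceptable while escalating as few records as possible.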

Beyond Search: RAG’s Awakening to Enterprise Spreadsheets

TL;DR for operators: Most enterprise RAG failures do not begin at the chatbot. They begin earlier, when the retrieval system slices policy manuals into arbitrary chunks, flattens tables into textual porridge, ignores metadata, retrieves semantically similar but operationally wrong passages, and then asks an LLM to look confident. Naturally, the LLM obliges. It has excellent manners. ...

Chunks, Units, Entities: RAG Rewired by CUE-RAG

TL;DR for operators: Enterprise RAG teams often treat retrieval quality as a graph-construction problem: extract more entities, more relationships, more summaries, and hope the answer appears somewhere in the resulting machinery. Clue-RAG suggests a more useful diagnosis: the failure is often not that the graph is too small, but that the system has chosen the wrong semantic unit for the job. [1] ...

When Streams Cross Wires: Can New AI Models Plug into Old Data Flows?

TL;DR for operators: Enterprise AI will not become useful merely because someone bolts a chatbot onto a database and calls the result an “agent”. That is theatre with API keys. The paper behind this article proposes something more sober: a blueprint architecture for compound AI systems in the enterprise, where LLMs are important but not sovereign. [1] The core idea is that enterprise AI should be built as a distributed system, not as a heroic model prompt. Streams carry data and control messages. Registries expose existing APIs, models, and datasets as searchable assets. Task planners convert user intent into executable workflows. Data planners work out which databases, documents, models, or transformations are needed. Coordinators execute plans while tracking cost, latency, and quality budgets. ...