LLM Efficiency

MoE Money, MoE Problems: Expert Capacity Finally Gets a Manager

TL;DR for operators: Mixture-of-Experts models are supposed to give businesses the best of both worlds: lots of parameters for capability, few active parameters for cost. Lovely on the slide. Messier in the server room. Two recent papers make the same larger point from opposite sides of the MoE machinery. SoftMoE attacks the compute-allocation problem: why should every token, in every layer, use the same fixed number of experts just because the architecture designer had to choose a value for top-$k$? [1] Tied Expert Layers attacks the memory problem: why should every layer store its own expert FFNs when many of those expert weights may be redundant across nearby layers? [2] ...
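For readers who want the fixed top-$k$ complaint in concrete form, here is a minimal sketch of conventional top-$k$ routing for a single token. It is the standard baseline the teaser describes, not SoftMoE's method; the function names, logits, and four-expert setup are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-probability experts for one token.

    Returns [(expert_index, weight), ...] with weights renormalized
    to sum to 1 over the chosen experts.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

# Two hypothetical tokens over four experts (illustrative logits, not real data):
confident = [8.0, 0.1, 0.0, -0.2]   # the router strongly prefers expert 0
uncertain = [1.0, 0.9, 1.1, 0.8]    # the router has no clear favourite

for name, logits in [("confident", confident), ("uncertain", uncertain)]:
    picks = route_top_k(logits, k=2)
    print(name, [(i, round(w, 3)) for i, w in picks])
```

Both tokens pay for two expert FFNs, even though the second expert contributes almost nothing to the confident token. That fixed bill, regardless of how sure the router is, is the allocation question the teaser raises.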

The One-Weird-Trick Era of LLM Efficiency Is Over

TL;DR for operators: The useful lesson from Unifying Data, Memory, and Compute Efficiency in LLM Training: A Survey is not that one efficiency method is about to save everyone’s GPU bill. That would be charming, in the same way procurement decks are charming. The paper’s real contribution is to show why LLM efficiency has become a coupled operating problem: what data you train on changes the compute you spend; how you fit training into memory changes the optimization path; and when you stop, refresh, or reallocate compute depends on both. [1] ...

Think Inside the Blocks: RiM and the Latency Price of Reasoning

Reasoning is expensive mostly because we make the model say it. That sounds almost too simple, which is usually where trouble begins. Chain-of-thought reasoning improved language-model performance by giving the model a written workspace: first solve, then answer. But the same trick also turns internal computation into external communication. Every intermediate step must be decoded, formatted, and passed forward one token at a time. The model is not just thinking; it is producing a small essay it may not need to show anyone. ...

CompactRAG: When Multi-Hop Reasoning Stops Burning Tokens

Ask a normal enterprise RAG system a simple factual question, and it behaves politely enough. Retrieve a few passages. Hand them to the model. Generate an answer. Fine. Ask it a question that requires two or three steps, and the machine starts developing expensive habits. It retrieves, reasons, retrieves again, expands the prompt, reasons again, rewrites a query, retrieves more evidence, and then asks the LLM to stitch the mess together. The architecture looks intellectually serious. The invoice looks even more serious. ...

Optimizing Agentic Workflows: When Agents Learn to Stop Thinking So Much

The most expensive sentence in agentic AI is “Let me think.” Every enterprise agent has a little theatre inside it. A user asks for something routine: find a customer record, check a document, submit a form, update a profile, send a message. The agent pauses, reasons, chooses a tool, receives an observation, reasons again, chooses another tool, receives another observation, and continues until the task is finished or the budget is quietly set on fire. ...

Trees That Think Faster: Adaptive Compression for the Long-Context Era

Long context is a lovely product promise until the invoice arrives. Every enterprise AI demo eventually wants the same magic trick: read the whole contract archive, remember every customer interaction, inspect every ticket, keep all meeting notes alive, and answer as if the model has a tidy brain instead of a very expensive attention matrix. The sales slide says “128K context.” The infrastructure team hears “latency, memory, and GPU burn.” Both are correct. One is merely dressed better. ...
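A back-of-envelope sketch of why the infrastructure team flinches at that number. It assumes "128K" means 128 × 1024 tokens, 2-byte (fp16) scores, and a naive implementation that materializes the full attention score matrix for one head in one layer. The function name is hypothetical, and memory-efficient attention kernels avoid storing this matrix, although the quadratic compute remains.

```python
def attention_score_bytes(context_tokens, bytes_per_score=2):
    """Bytes needed to hold one head's full n-by-n attention score matrix."""
    return context_tokens * context_tokens * bytes_per_score

short = 8 * 1024     # an 8K-token context, for comparison
long = 128 * 1024    # assumed meaning of "128K context"

print(attention_score_bytes(long) / 2**30)   # GiB per head, per layer: 32.0
print(attention_score_bytes(long) // attention_score_bytes(short))  # 256
```

Sixteen times the tokens means 256 times the score entries. The slide grows linearly; the matrix does not.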

Reasoning on a Sliding Scale: Why One Size Doesn't Fit All in CoT

TL;DR for operators: Ada-R1 is useful because it attacks the expensive part of reasoning models from the right angle: not “make every answer shorter,” but “decide which problems deserve long reasoning in the first place.” [1] The paper’s key evidence is uncomfortable for anyone buying premium reasoning capacity by default. Long Chain-of-Thought helps on harder mathematical problems, but nearly half of the analysed samples show no improvement from Long-CoT, and some perform worse. In other words, paying for the model to brood majestically over simple work is not intelligence. It is ceremony with a token meter attached. ...