Inference Cost

MoE Money, MoE Problems: Expert Capacity Finally Gets a Manager

TL;DR for operators. Mixture-of-Experts models are supposed to give businesses the best of both worlds: lots of parameters for capability, few active parameters for cost. Lovely on the slide. Messier in the server room. Two recent papers make the same larger point from opposite sides of the MoE machinery. SoftMoE attacks the compute-allocation problem: why should every token, in every layer, use the same fixed number of experts just because the architecture designer had to choose a value for top-$k$?¹ Tied Expert Layers attacks the memory problem: why should every layer store its own expert FFNs when many of those expert weights may be redundant across nearby layers?² ...

The Yap Trap: Why AI Reasoning Needs a Governor

Long reasoning has become the new luxury trim in AI products. The demo no longer just answers. It pauses, reflects, reconsiders, checks itself, writes a small philosophical memoir, and then hopefully solves the problem. This is not entirely theatrical. Chain-of-thought style reasoning and large reasoning models have improved performance on difficult tasks, especially in mathematics, coding, planning, and multi-step analysis. For business users, that matters. A model that can break down a problem is more useful than one that confidently blurts out the first plausible answer. Nobody wants a legal assistant, financial analyst, or production-support agent whose main cognitive strategy is “vibes, but fast.” ...

Rank and File: MatryoshkaLoRA Turns One Adapter into Many

The adapter budget problem is not just training cost. Budget is usually where fine-tuning conversations become less glamorous. A team wants a customized model. The engineer suggests LoRA because full fine-tuning is expensive. Everyone nods. Then the uncomfortable question arrives: which rank? A low rank is cheap but may underfit. A high rank may work better but costs more memory and inference compute. So the team trains several adapters, compares them, chooses one, and pretends the search process was a minor detail. It was not. It was the hidden invoice. ...

Think Less, Align Better: The New Economics of AI Reasoning

Opening: Why this matters now. Enterprise AI is entering its mildly awkward teenage phase: everyone wants intelligence, nobody wants the invoice. For the last two years, much of the AI conversation has revolved around more: more context, more reasoning tokens, more chain-of-thought, more human feedback, more evaluators, more synthetic data, more agents, more dashboards to explain why the agents broke the dashboards. The operating assumption was simple enough: if the model thinks more, explains more, or trains on more feedback, it should perform better. ...

Think Twice, Pay Once: The New Economics of Long-Horizon AI Reasoning

Opening: Why this matters now. AI reasoning has entered its awkward managerial phase. For the past two years, the dominant story has been simple enough for a conference keynote: make models reason longer, use reinforcement learning, scale inference-time computation, and let the model “think.” The story is not wrong. It is just incomplete in the same way that saying “hire more analysts” is an incomplete operating model for a research department. More thinking can help. It can also become expensive, slow, noisy, and occasionally theatrical. ...

Thinking Isn’t Free: Why Chain-of-Thought Hits a Hard Wall

Reasoning budgets look harmless until they become a line item. A user asks an AI system to reconcile a long contract, inspect a transaction trail, trace dependencies in a knowledge graph, or verify whether one operational event can lead to another. The model “thinks.” The answer improves. The invoice also improves, in the less charming direction. The usual response is to ask for shorter reasoning: compress the chain of thought, use fewer tokens, impose a budget, maybe add a prompt that says “be concise,” because apparently invoices can be negotiated with adjectives. ...

Conformal Thinking: Teaching LLMs When to Stop Thinking

Thinking is not free. That sentence should not need explaining to anyone who has paid an inference bill, waited for a reasoning model to finish its theatrical inner monologue, or watched an AI agent spend half its budget trying to solve a task it was never going to solve. Reasoning models have become better at using more tokens. They have not automatically become better at knowing when more tokens have stopped helping. ...

Batch of Thought, Not Chain of Thought: Why LLMs Reason Better Together

Fraud review is not a solo sport. A risk analyst looking at one suspicious seller can notice a strange product description, a vague company name, or a price range that feels wrong. But the real signal often appears only when several sellers are placed side by side. One shop looks unusual. Ten shops with the same naming pattern, same product mismatch, and same pricing behavior start to look less like noise and more like a system. ...

Tools of Habit: Why LLM Agents Benefit from a Little Inertia

Tools are where many agent demos quietly become invoices. A multi-step LLM agent may look intelligent because it reasons, acts, observes, and repeats. Under the hood, though, it often pays the model to decide every small next move: search here, load that node, look around, check valid actions, fill this argument, try again. Some of those decisions need judgement. Others are basically muscle memory wearing a lab coat. ...

Fast Minds, Cheap Thinking: How Predictive Routing Cuts LLM Reasoning Costs

A support ticket arrives. Then a compliance question. Then a spreadsheet formula request. Then a genuinely nasty piece of mathematical reasoning wearing the innocent expression of a homework problem. In too many AI systems, all four get sent to the same expensive reasoning model, because the architecture has the subtlety of a hotel buffet: everything goes through the same line. ...