AI Efficiency

Mind the Trigger: When AI Should Read the Room

TL;DR for operators: The fashionable question is whether an AI can infer what another agent believes, intends, or misunderstands. The more operational question is whether it should bother. Nikolos Gurney’s paper proposes a causal model for deciding when an artificial agent should engage theory-of-mind reasoning in conflict.[1] Rather than treating mentalizing as an always-on capability, the model activates it when three conditions create enough pressure: information is unevenly distributed, an analytical solution is inaccessible, or the agent believes there is a meaningful mismatch between its own sophistication and its opponent’s. ...

Don’t Train Harder—Train Smarter: The Hidden Economics of RL for LLMs

The GPU bill is not the strategy. The easiest way to make reinforcement learning for reasoning models sound impressive is to say: sample more responses, train longer, scale harder. It is also the easiest way to make the finance team develop a facial twitch. Modern reasoning-focused LLMs increasingly rely on reinforcement learning with verifiable rewards: generate multiple candidate answers, score them with a rule-based signal, and update the model toward better reasoning behavior. In mathematics and coding tasks, this has become one of the most important post-training recipes. But it has a small accounting problem, in the same way a leaking ship has a small moisture problem. ...

Compress, Then Confess: Why Order Beats Method in AI Model Efficiency

A deployment team has a large model, a smaller device, and a familiar problem: the model is too heavy for the place where the business actually wants to use it. So the team reaches for the standard efficiency drawer. Prune some weights. Quantize the remaining values. Maybe add a light adapter to recover accuracy. Push the result to edge hardware, a mobile app, or a cheaper inference server. Then explain to management why the model became faster but also slightly less intelligent. The usual ritual. ...

Tokens, Watts, and Waste: The Hidden Energy Bill of LLM Inference

Tokens are small. That is why they are dangerous. A developer asks an assistant to generate a function, explain a repository, or reason through a failing test. The screen fills with text. Some of it is useful. Some of it is decoration. Some of it is a polite little parade of examples, test cases, alternative implementations, or whitespace that will be thrown away by the next parser in the pipeline. ...

Routing the Lottery: When Pruning Learns to Choose

A model can be small and still be badly organized. That is the quiet problem behind a lot of model compression work. We often ask whether a neural network can be pruned without losing too much accuracy. Fair enough. Budgets are real. Memory is not decorative. But the question hides a stronger assumption: that one sparse structure should serve every input equally well. ...

Think-with-Me: When LLMs Learn to Stop Thinking

A model can be wrong because it did not think enough. That part is easy to understand. The more annoying failure is when the model already had the answer, kept going, second-guessed itself into a ditch, and then presented the ditch with confidence. This is the special comedy of large reasoning models: sometimes the expensive part is not the intelligence, but the hesitation after the intelligence has already done its job. ...

One-Shot Brains, Fewer Mouths: When Multi-Agent Systems Learn to Stop Talking

Meetings are expensive because people talk. Multi-agent AI systems have discovered the same problem, only with tokens instead of coffee. The standard promise sounds attractive: let several LLM agents play different roles, exchange views, debate mistakes, critique each other, and produce a better answer than one lonely model staring into the void. Sometimes this works. It also creates a very modern failure mode: a small committee of agents turns into a transcript factory. Every extra round adds context. Every context window invites more repetition. Every repetition costs money, latency, and occasionally correctness. Artificial intelligence, it turns out, can also suffer from over-management. ...

Decoding Intelligence: When Spikes Meet Hyperdimensions

Edge AI has a habit of turning every efficiency problem into a hardware problem. Buy a better chip. Quantise the model. Move the workload closer to the sensor. Reduce the precision until the accuracy team starts twitching. This paper takes a quieter route. It asks whether part of the energy problem comes not from the sensor, the chip, or even the whole network, but from the way the network is asked to speak. ...

Fast Minds, Cheap Thinking: How Predictive Routing Cuts LLM Reasoning Costs

A support ticket arrives. Then a compliance question. Then a spreadsheet formula request. Then a genuinely nasty piece of mathematical reasoning wearing the innocent expression of a homework problem. In too many AI systems, all four get sent to the same expensive reasoning model, because the architecture has the subtlety of a hotel buffet: everything goes through the same line. ...

Agents on the Clock: How TPS-Bench Exposes the Time Management Problem in AI

A competent assistant can make a list. A useful assistant knows what must happen first. That distinction sounds small until an AI agent is asked to do something ordinary and annoyingly realistic: check a calendar, search the web, compare options, use a map, assemble a recommendation, and perhaps create a document at the end. None of those steps is exotic. The difficulty is that some of them can run in parallel, some must wait for earlier results, and some become nonsense if executed too early. This is less “genius at work” than “junior operations manager with access to too many browser tabs.” Naturally, it is where things get interesting. ...