Model-Efficiency

When Data Decides What Matters: The Quiet Economics of LLM Data Selection

Budgets have a charming way of making AI strategy less philosophical. In the demo room, the question is usually whether a model can reason, code, summarize, plan, and sound pleasantly harmless while doing so. In the finance room, the question becomes simpler: how many tokens, how many GPUs, how many weeks, and why exactly are we paying to teach the model another version of the same web page? ...

Squeezing Time: How Dynamic Tokenization Could Reshape Time‑Series Foundation Models

Forecasting systems have a bad habit: they treat every moment in the past as if it deserves the same amount of attention. A quiet hour in an electricity-load curve. A sudden machine vibration spike. A slowly drifting weather signal. A crypto candle that does nothing for three hours and then ruins someone’s afternoon. To a standard point-wise time-series model, each timestamp is a token. To a fixed-patch model, every group of timestamps is compressed with the same ruler. Both choices are defensible. Both are also slightly lazy. ...

Small Model, Big Eyes: Why Microsoft’s Phi‑4 Vision Model Is a Warning Shot to Giant Multimodal AI

Screen. That is where many ambitious AI agents quietly embarrass themselves. Not in a grand philosophical test of intelligence. Not in a graduate-level theorem. Just on a screen: a small button, a chart label, a checkout field, a misread table cell, a tiny icon in a crowded interface. The model can explain strategy, summarize policy, and generate six polite versions of an apology email, but then it clicks the wrong thing because it did not really see the thing. ...

Beyond the Linear Ceiling: Why Non-Linearity Is the Next Frontier in PEFT

More Rank Is Not Always More Capacity. Fine-tuning teams love a simple knob. If the model underperforms, increase rank. If the adapter looks too small, increase rank. If the downstream task is hard, increase rank again and call it strategy. This is comforting because rank is measurable, budgetable, and easy to explain in a meeting. Unfortunately, reality has its usual habit of being less cooperative. ...

Motivation Is Something Your Models Need: When Curiosity Becomes a Training Strategy

Training budgets are where elegant architecture slogans go to be audited. The usual response to a model that needs better accuracy is painfully familiar: make it larger, train it longer, feed it more data, and then pretend the GPU bill is a philosophical problem. The paper Motivation Is Something You Need takes a more interesting route. It asks whether a model needs to be large all the time, or whether extra capacity can be activated only when training signals suggest the model is “getting somewhere.” ...

Gated Sparse Attention: Speed Without the Sink

Context is expensive. That sentence is now obvious to anyone building with long-context models. The awkward part is that “long context” sounds like a capability, while the invoice often treats it as a lifestyle choice. Feed a model a 100-page contract, a repository, or a week of customer-support logs, and the theoretical promise is straightforward: the model can inspect more evidence before answering. The operational reality is less romantic. Attention cost grows quickly, prefill becomes painful, memory pressure rises, and training large models over long sequences can become unpleasantly dramatic. ...

When Models Forget on Purpose: Why Data Selection Matters More Than Data Volume

Training data has become the AI industry’s favorite comfort blanket. When performance stalls, add more tokens. When a benchmark looks stubborn, add more tokens. When the model behaves badly, add more tokens and call it a roadmap. This worked well enough to become a reflex. Unfortunately, reflexes are not strategies. The uncomfortable question is no longer whether data matters. Of course it matters. The better question is whether every token deserves the same vote during training. ...

Replay the Losses, Win the Game: When Failed Instructions Become Your Best Training Data

Failure logs are usually treated as evidence for the prosecution. A model is asked to produce a concise compliance summary with three bullet points, mention two risks, avoid prohibited claims, and end with a recommendation. It produces three bullets, correctly identifies the risks, avoids the prohibited claims—and forgets the recommendation. Under a strict binary reward, the response receives a zero. Under a partial-credit reward, it might receive 0.75. The first signal says nothing useful happened. The second says something useful happened, but not precisely what. ...

Attention, But Make It Optional

Cost has a way of making architecture less romantic. In diagrams, a Transformer block looks clean: attention mixes tokens, the MLP transforms features, residual connections keep information flowing. In deployment, the same diagram becomes an invoice. Attention is especially expensive because its cost grows with sequence length. In the paper’s LLaMA-7B timing example, an attention layer has roughly half the parameters of an MLP layer, yet runs nearly twice as long at sequence length around 3,000 and about three times as long around 7,000. Attention is elegant. It is also very good at charging rent. ...

When Reasoning Meets Its Laws: Why More Thinking Isn’t Always Better

The expensive model that thinks less at the wrong moment. Tokens are not wisdom. They are rented time. Anyone who has paid for reasoning-model inference already understands the business version of this problem. A model spends hundreds or thousands of tokens circling a simple question, then compresses a genuinely compound task into a suspiciously neat answer. It looks thoughtful. It may even sound disciplined. But the bill arrives in one column and the error arrives in another. ...