Place Your Experts, Not Your Bets

Opening: Why this matters now. The fashionable version of AI strategy still sounds suspiciously like a gym membership pitch: bigger model, more parameters, more GPUs, more everything. The operational version is less glamorous and much more important: where the computation happens, which parts of the model are actually used, how predictable demand is, and whether the system can turn those facts into lower latency, lower cost, or better decisions. ...

May 7, 2026 · 13 min · Zelina

Queue Who’s Optimizing: Why LLM Serving Needs Math, Not More Vibes

Opening: Why this matters now. The first wave of enterprise AI adoption was obsessed with model choice. Which model is smarter? Which model writes better? Which model can reason, code, browse, call tools, summarize contracts, and politely pretend it enjoys quarterly planning? That was the easy part. The less glamorous question is now becoming more expensive: how do we serve all these model calls reliably, cheaply, and at scale? ...

May 6, 2026 · 18 min · Zelina

Rank and File: BoostLoRA’s Case for Smarter Fine-Tuning

Opening: Why this matters now. Enterprise AI is entering its less glamorous phase: not the demo, not the keynote, not the charming chatbot that answers three curated questions correctly, but the operational grind of making models behave reliably inside messy workflows. That grind usually runs into a familiar triangle. Full fine-tuning is powerful but expensive, operationally heavy, and often risky when the training set is narrow. Parameter-efficient fine-tuning, especially LoRA-style adaptation, is cheaper and easier to deploy, but the smallest adapters can hit a ceiling. Meanwhile, the business user does not care whether the adapter was elegant. They care whether the model stops making the same costly mistakes in invoicing, compliance review, customer support, code generation, or scientific triage. ...

May 4, 2026 · 13 min · Zelina

Rank and File: Why LoRA Adapters May Be Bigger Than They Need to Be

Opening: Why this matters now. Fine-tuning large models used to sound like a research luxury. Now it is a line item in the infrastructure budget. Enterprises do not want one general-purpose model behaving vaguely usefully for everyone. They want domain-specific behavior: a support adapter for insurance claims, a compliance adapter for legal review, a financial-document adapter for analyst workflows, perhaps a dozen regional variants, and then another dozen because someone discovered “brand tone” during a steering committee meeting. Naturally. ...

May 4, 2026 · 12 min · Zelina

Ctrl+Z Is Not a Strategy: When LLM Self-Correction Actually Works

Opening: Why this matters now. Agentic AI systems are currently being sold with a suspiciously comforting ritual: generate an answer, ask the same model to reflect, then ask it to improve the answer. Repeat until the dashboard looks busy. In demos, this feels intelligent. In production, it may simply be a very expensive way to turn correct answers into wrong ones. ...

April 30, 2026 · 12 min · Zelina

The Cost of Thinking Twice: Why Agentic AI Needs a CFO

Budget. That is the word agentic AI usually discovers after the demo is over. During the demo, the agent searches again. It verifies again. It calls another tool, adds another reasoning step, and produces an answer that feels satisfyingly deliberate. In production, the same behavior becomes less charming. Tokens accumulate, latency stretches, logs become harder to inspect, and nobody is entirely sure whether the last two tool calls were useful or just the machine equivalent of pacing around the room with a clipboard. ...

March 23, 2026 · 17 min · Zelina

Beyond the Pareto Frontier: Pricing LLM Mistakes in the Real World

TL;DR for operators: Most model-selection dashboards still ask the wrong question. They ask which LLM gives the best accuracy for the lowest inference cost. Zellinger and Thomson’s paper asks a more operationally honest one: how much does a wrong answer, a slow answer, or no answer cost in this specific workflow? [1] The paper’s useful move is to convert competing performance metrics into a single expected dollar reward. Inference cost stays in dollars. Latency gets priced in dollars per second or minute. Errors get priced by their business consequence. Abstention gets priced by the cost of failing to answer or escalating to a human. Once everything is in the same unit, the “best model” is no longer the one that looks attractive on a Pareto plot. It is the model with the highest expected reward under the actual economics of the task. ...

July 8, 2025 · 19 min · Zelina
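
For readers who want the expected-dollar-reward idea from the last teaser in concrete form, here is a minimal sketch. It is not the paper's formulation: the additive cost model, the names `ModelProfile`, `TaskEconomics`, `expected_reward`, and `best_model`, and every number below are illustrative assumptions. It only shows how pricing latency, errors, and abstention in dollars can flip which model wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-query statistics for one model (made-up fields)."""
    name: str
    cost_per_query: float  # dollars of inference spend per query
    latency_s: float       # mean seconds per query
    p_error: float         # probability of a wrong answer
    p_abstain: float       # probability of no answer / escalation


@dataclass(frozen=True)
class TaskEconomics:
    """Hypothetical dollar prices for one workflow (assumed, not from the paper)."""
    price_per_second: float  # dollars lost per second of waiting
    error_cost: float        # dollars lost per wrong answer
    abstain_cost: float      # dollars lost per abstention or human escalation


def expected_reward(m: ModelProfile, t: TaskEconomics) -> float:
    # Assumed additive model: reward is the negative of total expected dollar cost.
    return -(
        m.cost_per_query
        + t.price_per_second * m.latency_s
        + m.p_error * t.error_cost
        + m.p_abstain * t.abstain_cost
    )


def best_model(models: list[ModelProfile], t: TaskEconomics) -> ModelProfile:
    # Pick the model with the highest expected dollar reward for this task.
    return max(models, key=lambda m: expected_reward(m, t))


# Illustrative numbers only.
CHEAP = ModelProfile("cheap", cost_per_query=0.001, latency_s=1.0, p_error=0.10, p_abstain=0.0)
STRONG = ModelProfile("strong", cost_per_query=0.05, latency_s=4.0, p_error=0.02, p_abstain=0.0)
MODELS = [CHEAP, STRONG]

LOW_STAKES = TaskEconomics(price_per_second=0.01, error_cost=0.10, abstain_cost=0.0)
HIGH_STAKES = TaskEconomics(price_per_second=0.01, error_cost=100.0, abstain_cost=0.0)

if __name__ == "__main__":
    for label, task in [("low stakes", LOW_STAKES), ("high stakes", HIGH_STAKES)]:
        print(label, "->", best_model(MODELS, task).name)
```

Under these assumed prices the cheaper model wins when a mistake costs ten cents and the more accurate one wins when a mistake costs a hundred dollars, even though neither model's accuracy or inference cost changed.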