Skills to Pay the Agent Bills: Why LLMs Need Better Moves, Not Bigger Models

Runbooks are underrated. Not the glossy strategy kind. The real kind: “check this first, then open that system, then verify the thing that usually breaks, then escalate only if the next signal appears.” Most operational work is not heroic reasoning. It is structured repetition under partial information. This is exactly where many LLM agents still look strangely amateur. They can describe a process beautifully, then fail to follow it. They can hold a long context window, then ignore the one action that would move the task forward. They can retrieve prior examples, then drown themselves in irrelevant steps. Very impressive. Very expensive. Occasionally useful. ...

November 20, 2025 · 18 min · Zelina

Promptfolios: When Buffett Becomes a System Prompt

Investment firms love a house style. Conservative value. Quality growth. Distressed credit. Low-volatility income. The style is supposed to mean something more durable than a portfolio manager’s breakfast mood. The uncomfortable part is that many “styles” still live in a fog of analyst judgement, committee memory, spreadsheet folklore, and the occasional sacred quote from an investor whose annual letters have been read with the reverence normally reserved for scripture. Everyone claims discipline. Fewer can show exactly how that discipline becomes position weights. ...

October 9, 2025 · 13 min · Zelina

Graph and Circumstance: Maestro Conducts Reliable AI Agents

A broken AI agent often looks deceptively close to working. It answers most questions. It calls the right tool sometimes. It follows the instruction until the conversation gets long, the retrieval query gets vague, or the arithmetic becomes just difficult enough for the model to start doing spreadsheet theatre. The usual repair is prompt editing. Add a stern sentence. Add a role. Add an example. Add “think step by step,” because apparently the machine needed a motivational poster. ...

September 11, 2025 · 15 min · Zelina

Forecast: Mostly Context with a Chance of Routing

TL;DR for operators: Most forecasting teams already have decent numerical forecasters. Their problem is not that ARIMA, ETS, Lag-Llama, Chronos, or internal demand models suddenly forgot how Tuesdays work. The problem is that many important forecast shocks arrive as text: heat-wave notices, maintenance schedules, holiday effects, price caps, promotions, policy changes, store closures, one-off events, and all the other messy little business facts that refuse to fit politely into a clean covariate table. ...

August 16, 2025 · 17 min · Zelina

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

TL;DR for operators: Code review usually starts after code exists. FPBench argues that this is already too late. The paper behind FPBench tests whether large language models can detect faulty premises in code-generation requests before obediently producing code from them.[1] The answer is awkward. Many models can identify the flaw when explicitly told to check the question first, but most do not do so proactively. They behave less like careful engineers and more like very fast interns with a tragic respect for bad tickets. ...

August 6, 2025 · 14 min · Zelina

Fraud, Trimmed and Tagged: How Dual-Granularity Prompts Sharpen LLMs for Graph Detection

TL;DR for operators: Fraud teams already know the problem: the suspicious review, shop, seller, or account is rarely suspicious in isolation. The useful evidence is scattered across neighbours: same user, same product, same rating pattern, same time window, same commercial ecosystem. The less useful evidence is also scattered there. At scale, that second pile is larger. How inconvenient. ...

July 30, 2025 · 15 min · Zelina

Latent Brilliance: Turning LLMs into Creativity Engines

TL;DR for operators: Creative AI systems usually fail in a painfully familiar way: ask for ten ideas, and by idea four the model is politely repainting the same wall. Change the temperature, give it a persona, ask a panel of agents to “debate,” and the system may sound busier, but the semantic spread often remains narrow. The paper behind this article argues that this is not merely a prompt-design inconvenience. It is a structural limitation of how LLMs are conditioned. ...

July 21, 2025 · 18 min · Zelina

Inner Critics, Better Agents: The Rise of Introspective AI

TL;DR for operators: If your agent stack is becoming expensive because every “reflection” step means another model call, this paper is worth reading. Its proposal, Introspection of Thought (INoT), tries to compress an external multi-agent debate loop into one structured prompt. The LLM is not literally running multiple agents. It is being instructed, through a hybrid Python-and-natural-language prompt called PromptCode, to simulate two internal debaters that reason, critique, rebut, revise, and then return an answer.[1] ...

July 14, 2025 · 15 min · Zelina

From Prompting to Porting: Surviving the LLM Upgrade Cycle

TL;DR for operators: A model upgrade is not a software patch. It is closer to changing the interpreter under a production system while hoping every old script still means the same thing. Charming, in the way live wires are charming. The paper behind this article, Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models, studies that problem through Tursio, an enterprise search application that converts natural-language questions into structured operator trees for database querying.[1] Tursio’s old prompts were fully stable on GPT-4-32k. When the same prompts were run against GPT-4.1, tests passed at 98%. Against GPT-4.5-preview, they passed at 97.3%. That sounds minor until the application is generating SQL-like structures, where “almost correct” is not a governance model. ...

July 9, 2025 · 18 min · Zelina

Chains of Causality, Not Just Thought

TL;DR for operators: Causal Influence Prompting, or CIP, is a safety method for LLM agents that asks the model to build and consult a causal influence diagram before acting. Instead of telling the agent, “be safe,” it asks the agent to represent the task as a graph: what facts matter, what choices are available, what outcomes are useful, and what outcomes are harmful. This is a better shape for the problem, because agents do not merely answer questions. They click buttons, run code, forward messages, use tools, and occasionally behave as if “sure, why not?” were a compliance framework. ...

July 2, 2025 · 17 min · Zelina