Prompt Engineering

Prompted and Confused: When LLMs Forget the Assignment

A requirements document walks into a model. It says: assign resources, respect capacity, avoid conflicts, minimise waste. The model nods politely, emits a tidy block of MiniZinc, and everyone is briefly tempted to believe the future has arrived. Then someone changes the story from cars to knapsacks, or adds one stray sentence about maximising something, and the same system quietly forgets the assignment. ...

Skills to Pay the Agent Bills: Why LLMs Need Better Moves, Not Bigger Models

Runbooks are underrated. Not the glossy strategy kind. The real kind: “check this first, then open that system, then verify the thing that usually breaks, then escalate only if the next signal appears.” Most operational work is not heroic reasoning. It is structured repetition under partial information. This is exactly where many LLM agents still look strangely amateur. They can describe a process beautifully, then fail to follow it. They can hold a long context window, then ignore the one action that would move the task forward. They can retrieve prior examples, then drown themselves in irrelevant steps. Very impressive. Very expensive. Occasionally useful. ...

Promptfolios: When Buffett Becomes a System Prompt

Investment firms love a house style. Conservative value. Quality growth. Distressed credit. Low-volatility income. The style is supposed to mean something more durable than a portfolio manager’s breakfast mood. The uncomfortable part is that many “styles” still live in a fog of analyst judgement, committee memory, spreadsheet folklore, and the occasional sacred quote from an investor whose annual letters have been read with the reverence normally reserved for scripture. Everyone claims discipline. Fewer can show exactly how that discipline becomes position weights. ...

Graph and Circumstance: Maestro Conducts Reliable AI Agents

A broken AI agent often looks deceptively close to working. It answers most questions. It calls the right tool sometimes. It follows the instruction until the conversation gets long, the retrieval query gets vague, or the arithmetic becomes just difficult enough for the model to start doing spreadsheet theatre. The usual repair is prompt editing. Add a stern sentence. Add a role. Add an example. Add “think step by step,” because apparently the machine needed a motivational poster. ...

Forecast: Mostly Context with a Chance of Routing

TL;DR for operators: Most forecasting teams already have decent numerical forecasters. Their problem is not that ARIMA, ETS, Lag-Llama, Chronos, or internal demand models suddenly forgot how Tuesdays work. The problem is that many important forecast shocks arrive as text: heat-wave notices, maintenance schedules, holiday effects, price caps, promotions, policy changes, store closures, one-off events, and all the other messy little business facts that refuse to fit politely into a clean covariate table. ...

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

TL;DR for operators: Code review usually starts after code exists. FPBench argues that this is already too late. The paper behind FPBench tests whether large language models can detect faulty premises in code-generation requests before obediently producing code from them.[1] The answer is awkward. Many models can identify the flaw when explicitly told to check the question first, but most do not do so proactively. They behave less like careful engineers and more like very fast interns with a tragic respect for bad tickets. ...

Fraud, Trimmed and Tagged: How Dual-Granularity Prompts Sharpen LLMs for Graph Detection

TL;DR for operators: Fraud teams already know the problem: the suspicious review, shop, seller, or account is rarely suspicious in isolation. The useful evidence is scattered across neighbours: same user, same product, same rating pattern, same time window, same commercial ecosystem. The less useful evidence is also scattered there. At scale, that second pile is larger. How inconvenient. ...

Latent Brilliance: Turning LLMs into Creativity Engines

TL;DR for operators: Creative AI systems usually fail in a painfully familiar way: ask for ten ideas, and by idea four the model is politely repainting the same wall. Change the temperature, give it a persona, ask a panel of agents to “debate,” and the system may sound busier, but the semantic spread often remains narrow. The paper behind this article argues that this is not merely a prompt-design inconvenience. It is a structural limitation of how LLMs are conditioned. ...

Inner Critics, Better Agents: The Rise of Introspective AI

TL;DR for operators: If your agent stack is becoming expensive because every “reflection” step means another model call, this paper is worth reading. Its proposal, Introspection of Thought (INoT), tries to compress an external multi-agent debate loop into one structured prompt. The LLM is not literally running multiple agents. It is being instructed, through a hybrid Python-and-natural-language prompt called PromptCode, to simulate two internal debaters that reason, critique, rebut, revise, and then return an answer.[1] ...

From Prompting to Porting: Surviving the LLM Upgrade Cycle

TL;DR for operators: A model upgrade is not a software patch. It is closer to changing the interpreter under a production system while hoping every old script still means the same thing. Charming, in the way live wires are charming. The paper behind this article, Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models, studies that problem through Tursio, an enterprise search application that converts natural-language questions into structured operator trees for database querying.[1] Tursio’s old prompts were fully stable on GPT-4-32k. When the same prompts were run against GPT-4.1, tests passed at 98%. Against GPT-4.5-preview, they passed at 97.3%. That sounds minor until the application is generating SQL-like structures, where “almost correct” is not a governance model. ...