In-Context Learning

Time to Prefer: Why Binary RLHF Feedback Leaves Reward Models Guessing

Thumbs-up feedback looks efficient. It is clean, cheap, easy to store, and friendly to dashboards. One output wins, another output loses, and the reward model learns what humans supposedly want. A tidy little morality market, with all the nuance of a vending machine. ...

When the Muse Has a GPU: Teaching a Machine to Write Poetry

Poetry is a useful place to test the limits of AI, partly because the task is so easy to misunderstand. A bad poem can be fluent. A decent poem can be vague. A machine can produce both before breakfast, along with a motivational LinkedIn post and three flavors of executive summary. That is not the interesting part. ...

Rationales Before Results: Teaching Multimodal LLMs to Actually Reason About Time Series

Dashboard work has a familiar little ritual. Someone opens a chart, zooms into the last few points, notices a dip, a rebound, or a suspiciously clean trend line, and then says something that sounds analytical: “Looks like it will continue.” Sometimes that is wisdom. Sometimes it is just a human staring confidently at a squiggle. ...

Anchors Aweigh? Why Small LLMs Refuse to Flip Their Own Semantics

A label looks harmless until you ask it to lie. Tell a model that a glowing movie review should be labeled POS, and few-shot prompting behaves like a useful intern: it studies the examples, picks up the pattern, and usually gets better. Tell the same model that a glowing review should now be labeled NEG, and the intern becomes less useful. It does not smoothly learn your private code. It does not politely invert its semantic universe. It mostly produces a muddle. ...

Skills to Pay the Agent Bills: Why LLMs Need Better Moves, Not Bigger Models

Runbooks are underrated. Not the glossy strategy kind. The real kind: “check this first, then open that system, then verify the thing that usually breaks, then escalate only if the next signal appears.” Most operational work is not heroic reasoning. It is structured repetition under partial information. This is exactly where many LLM agents still look strangely amateur. They can describe a process beautifully, then fail to follow it. They can hold a long context window, then ignore the one action that would move the task forward. They can retrieve prior examples, then drown themselves in irrelevant steps. Very impressive. Very expensive. Occasionally useful. ...

Heads Up: Why Sensitivity Matters in Many‑Shot Multimodal ICL

Long prompts are easy to understand. They are also expensive, slow, and—in multimodal systems—very quickly ridiculous. That is the practical tension behind many-shot multimodal in-context learning. In principle, giving a vision-language model more examples should help it recognise the task. In practice, every image costs tokens, every additional demonstration adds latency, and open-source large multimodal models do not generally enjoy infinite context windows. The business version of the problem is familiar: you want a model to adapt to a specialised workflow, but you do not want to fine-tune it every week, pay for swollen prompts forever, or discover that the “cheap” approach now requires a larger GPU. ...

Privacy by Proximity: How Nearest Neighbors Made In-Context Learning Differentially Private

TL;DR for operators: Private examples are not harmless just because they sit inside a prompt rather than inside model weights. In-context learning lets teams adapt a general LLM by adding examples at inference time, which is convenient until those examples are medical notes, legal clauses, customer tickets, invoices, or internal decisions that should not be inferable from the model’s output. ...