Prompt Engineering

Tokens, Watts, and Waste: The Hidden Energy Bill of LLM Inference

Tokens are small. That is why they are dangerous. A developer asks an assistant to generate a function, explain a repository, or reason through a failing test. The screen fills with text. Some of it is useful. Some of it is decoration. Some of it is a polite little parade of examples, test cases, alternative implementations, or whitespace that will be thrown away by the next parser in the pipeline. ...

Prompt Wars: When Pedagogy Beats Cleverness

A prompt review meeting usually sounds more scientific than it is. One person likes the “coach” version. Another prefers the “Socratic” version because it sounds more educational. Someone says the prompt should mention metacognition. Someone else adds “be concise,” because apparently every prompt eventually becomes a corporate email with anxiety issues. Then the team ships the one that feels best. ...

When Models Read Too Much: Context Windows, Capacity, and the Illusion of Infinite Attention

The demo is familiar now. Someone drops a whole contract, a whole policy manual, a whole code repository, or a month of chat history into a model and asks one neat question. The model answers fluently. The room relaxes. The slide says “1M-token context.” Procurement starts smiling. This is where the trouble begins. ...

When Prompts Learn Themselves: The Death of Task Cues

A database column named CURRENT_BAL_AMT is annoying. A column named gbstk is worse. Somewhere inside an enterprise data warehouse, these names are perfectly normal. Somewhere outside the original engineering team, they are tiny locked doors. The usual solution is not glamorous. Someone asks a data engineer. The data engineer asks an older data engineer. A wiki page is found, partly wrong, last updated during an earlier economic cycle. Eventually, “current balance amount” or “overall processing status of sales document” appears in a data catalog, a semantic layer, a search index, or a text-to-SQL system. Humanity advances by one abbreviation. ...

Think Wide, Then Think Hard: Forcing LLMs to Be Creative (On Purpose)

Imagine a brainstorming meeting in which every new idea must immediately pass legal review, fit the quarterly budget, use the existing technology stack, satisfy six executives, and arrive formatted as a PowerPoint slide. The meeting will probably produce something feasible. It will also produce the same three ideas everyone proposed last quarter. ...

When Small Models Learn From Their Mistakes: Arithmetic Reasoning Without Fine-Tuning

Numbers are where language models usually stop sounding impressive. Ask a model to summarize a financial report and it may produce a fluent paragraph with just enough confidence to make everyone in the meeting relax. Ask it to calculate a percentage change from a table, preserve the correct scale, and return a verifiable number, and the poetry ends. Suddenly the model must select the right values, understand the wording, apply the right operation, avoid sign mistakes, avoid scale mistakes, and not hallucinate a formula because the word “change” appeared nearby. ...

When Agents Loop: Geometry, Drift, and the Hidden Physics of LLM Behavior

Agents are rarely dangerous because they answer once. They become interesting, and occasionally annoying, when they loop. A customer-support agent drafts a reply, critiques it, revises it, checks policy, rewrites the tone, and sends the result back into another reasoning step. A research agent summarizes papers, updates its plan, searches again, and revises its own assumptions. A coding agent edits a file, reads the error, patches the patch, and keeps going until either the tests pass or the repository looks like an archaeological site. ...

You Know It When You See It—But Can the Model?

Review queue. Someone has to decide whether an image is “unsafe,” “misleading,” “healthy,” “premium,” “clickbait,” “brand-safe,” or “not really our vibe.” The label sounds simple until the first borderline case appears. A salad with too much cream. A gaming ad that hints at easy money but never quite says it. A before-and-after photo where the “achievement” is visible only if one is feeling generous. ...

Anchors Aweigh? Why Small LLMs Refuse to Flip Their Own Semantics

A label looks harmless until you ask it to lie. Tell a model that a glowing movie review should be labeled POS, and few-shot prompting behaves like a useful intern: it studies the examples, picks up the pattern, and usually gets better. Tell the same model that a glowing review should now be labeled NEG, and the intern becomes less useful. It does not smoothly learn your private code. It does not politely invert its semantic universe. It mostly produces a muddle. ...

Tile by Tile: Why LLMs Still Can't Plan Their Way Out of a 3×3 Box

A board game should not embarrass a frontier model. That is the uncomfortable charm of the 8-puzzle. It has no hidden information, no vague user intent, no messy database schema, no ambiguous policy exception, and no client saying “just make it pop.” It is a 3×3 grid with eight tiles and one blank space. Slide adjacent tiles into the blank. Reach the goal state. Done. ...