Context Engineering

DeltaEvolve: When Evolution Learns Its Own Momentum

Memory is usually where agentic systems go to become expensive. That is not the glamorous failure mode. It is not the cinematic robot rebellion, nor the slightly more realistic spreadsheet full of hallucinated invoices. It is quieter: an LLM agent keeps improving a program, stores previous attempts, retrieves a few “good” ones, and then spends half its context window rereading code scaffolding that no longer explains anything useful. ...

When Your Agent Starts Copying Itself: Breaking Conversational Inertia

A support agent keeps asking the same diagnostic question after the customer has already answered it. A research agent revisits the same failed source path with slightly different wording. A workflow agent tries the same invalid action again because, apparently, the best evidence for what to do next is what it just did badly. ...

Bubble Trouble: Why Top‑K Retrieval Keeps Letting LLMs Down

The problem is not finding documents. It is spending the prompt budget badly. Ask an enterprise RAG system for “scope of work,” and the system may look confident for exactly the wrong reason. The query sounds simple. Somewhere in the document set, there is probably a sheet, paragraph, or clause literally called “Scope of Works.” A flat top-k retriever will happily grab the highest-scoring chunks from that section, stack them into the model context, and call the job done. Very tidy. Very wrong. ...

Browsing Without the Bloat: Teaching Agents to Think Before They Scroll

An analyst opens a promising webpage. It contains the answer somewhere between a navigation menu, several years of archived material, an interactive table, related articles, legal disclaimers, and enough decorative HTML to keep a language model occupied until lunch. A human scans, clicks, ignores, and moves on. A browser agent is more likely to ingest the entire page, append it to an already swollen context window, and then congratulate itself for having “conducted research.” ...

Prompting on Life Support: How Invasive Context Engineering Fights Long-Context Drift

The prompt was clear. Then the conversation kept going. A familiar enterprise AI story starts politely enough. The legal assistant is told to be conservative. The medical triage bot is told not to diagnose. The procurement agent is told never to approve a vendor without documented checks. Everyone nods. The system prompt is immaculate. Compliance is laminated. ...

Memory With a Pulse: Real-Time Feedback Loops for RAG Systems

Ask an enterprise chatbot the wrong question on the wrong day and the problem is rarely that the language model has forgotten how to write English. The problem is that it has been handed the wrong pile of evidence. That is the expensive little defect inside many retrieval-augmented generation systems. The model may be fluent. The corpus may be current. The vector database may be humming along like a well-funded filing cabinet. Yet the answer still disappoints because the system chose the wrong snippets, placed a useful document too low, missed a newly relevant runbook, or treated yesterday’s user intent as if it were carved into basalt. ...

Small Gains, Long Games: Why Tiny Accuracy Bumps Explode into Big Execution Wins

A workflow does not fail because the first step is hard. It fails because the seventeenth step is boring, the twenty-third step depends on a slightly wrong state, and by the thirty-first step the agent is confidently building on its own rubbish. Very enterprise. Very scalable. Very expensive. The paper behind this article, “The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs,” makes a deceptively simple point: judging LLM progress by short-task accuracy can badly understate the value of reliability gains over long workflows.[1] A model that improves only slightly on a single step may become dramatically better at completing long sequences without failure. That is not motivational poster mathematics. It is compounding. ...
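
The compounding point can be made concrete with a toy calculation. The sketch below is not taken from the paper; it assumes every step succeeds independently with the same accuracy, which real agents will not satisfy exactly, and the function names are illustrative only.

```python
import math


def completion_probability(step_accuracy: float, steps: int) -> float:
    """Chance of finishing `steps` consecutive steps without a single error.

    Assumes each step succeeds independently with the same probability.
    """
    return step_accuracy ** steps


def horizon_length(step_accuracy: float, target: float = 0.5) -> int:
    """Longest run of steps whose completion probability is still >= target."""
    return math.floor(math.log(target) / math.log(step_accuracy))


if __name__ == "__main__":
    for accuracy in (0.99, 0.995):
        print(
            f"step accuracy {accuracy}: "
            f"100-step completion {completion_probability(accuracy, 100):.3f}, "
            f"50% horizon {horizon_length(accuracy)} steps"
        )
```

Under these assumptions, moving per-step accuracy from 99% to 99.5% roughly doubles the number of steps the agent can chain before its odds of a clean run fall below one half.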

Brains Meet Brains: When LLMs Sit on Top of Supply Chain Optimizers

TL;DR for operators: The paper is useful because it gets the hierarchy right: the optimizer decides; the LLM explains, configures, contextualizes, and packages the decision for humans.[1] That is not a small distinction. It is the difference between a supply chain system that can be audited and a chatbot confidently waving at a warehouse. ...

Back to School for AGI: Memory, Skills, and Self‑Starter Instincts

TL;DR for operators: The paper is not really about whether a model can answer exam questions. Given the right context, the frontier models do very well. The hard part is whether an agent can notice what must be preserved, store it in a useful form, retrieve it at the right time, and act without being explicitly prodded. That is the difference between an assistant that sounds competent and an assistant that can actually carry operational state across days, weeks, and dependent workflows. ...