Long-Context

The Fine Print Is the Task: Why Long-Context AI Fails After Finding the Answer

TL;DR for operators: When an AI system reads a manual, policy, API specification, case file, or operating procedure, finding the relevant facts is only half the job. It must also discover the local rules that define what a valid answer looks like: required fields, exact labels, ordering constraints, exception handling, validation steps, prohibited actions, and completeness conditions. ...

Cache Me If You Can: Why Enterprise AI Needs Latent Working Memory

A codebase is not a paragraph. Neither is a litigation folder, a clinical case file, a customer-support history, a policy archive, or the slow-motion disaster known as “all meeting notes since March.” Yet many enterprise AI systems still treat long context as a heroic prompt-engineering problem: push more text into the model, pray the key detail survives attention, and call the bill “innovation.” ...

Flash Before the First Token: How FlashPrefill Rewrites the Economics of Long Context

Waiting is the least glamorous part of AI. A user uploads a contract, a codebase, a board pack, or a pile of research notes. The model does not answer immediately. First, it reads. Technically, it prefills: it processes the prompt, builds the internal key-value cache, and prepares the first generated token. In short prompts this feels invisible. In long-context systems, it becomes the awkward pause where the “agent” looks suspiciously like a very expensive loading spinner. ...

Memory Isn’t Personal: Why LLMs Still Forget What You Like

A customer tells your AI assistant that she dislikes crowded tourist attractions. Three weeks later, she asks for a weekend itinerary. A good assistant should not proudly recommend the busiest landmark in the city. A less good assistant will do exactly that, but in a warm tone. This is the quiet failure mode behind many “personal AI” demos. The interface remembers the conversation. The product claims continuity. The model may even have a giant context window large enough to swallow a small novel. Yet when the user asks a new question, the system behaves as if the earlier preference is just decorative text floating somewhere in the attic. ...

The Context Ceiling: When Long Context Stops Thinking

Documents are the easiest way to fool an AI system into looking serious. A procurement team uploads the full contract archive. A compliance team adds policy manuals, audit notes, and emails. A financial analyst stuffs transcripts, filings, and market commentary into one heroic prompt. The interface accepts it. The model answers fluently. Everyone relaxes. ...

When Your Agent Starts Copying Itself: Breaking Conversational Inertia

A support agent keeps asking the same diagnostic question after the customer has already answered it. A research agent revisits the same failed source path with slightly different wording. A workflow agent tries the same invalid action again because, apparently, the best evidence for what to do next is what it just did badly. ...

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Laptop. That is the deceptively simple object hiding inside this paper. Not a magic planner. Not a thousand-tool agent marketplace. Not a baroque workflow with seventeen orchestration layers and a dashboard that looks like a cockpit designed by consultants. A laptop. Or, more precisely, a minimal virtual computer: a sandbox with terminal access, file editing, code execution, persistent files, and the ability to install or fetch resources. In “Computer Environments Elicit General Agentic Intelligence in LLMs,” Cheng et al. ask a question that looks almost too obvious to be interesting until one remembers how much of the AI industry is still trying to squeeze “agency” out of longer prompts.[1] ...

Fish in the Ocean, Not Needles in the Haystack

Documents are where confident AI demos go to become slightly embarrassing. A model reads a long report. It gives the right answer. The room relaxes. Someone says “great, it understood the document,” and everyone pretends the word “understood” has not just been smuggled into the meeting without a passport. That is the exact mistake SIN-Bench is designed to catch.[1] The paper is not merely another benchmark asking whether multimodal large language models can answer questions about scientific literature. It asks a more operationally painful question: can the model show the evidence path that makes the answer legitimate? ...

When Models Read Too Much: Context Windows, Capacity, and the Illusion of Infinite Attention

The demo is familiar now. Someone drops a whole contract, a whole policy manual, a whole code repository, or a month of chat history into a model and asks one neat question. The model answers fluently. The room relaxes. The slide says “1M-token context.” Procurement starts smiling. This is where the trouble begins. ...

Long Thoughts, Short Bills: Distilling Mathematical Reasoning at Scale

The invoice arrives after the benchmark party. Math benchmarks are fun until the training bill arrives. A model can be taught to produce longer reasoning traces. It can be shown more olympiad problems. It can be given Python. It can be pushed into 128K-token contexts and told, heroically, to think harder. All of this sounds impressive in a benchmark table. Less impressive is the operational detail that most training samples do not need the full 128K window, yet a naive training setup can still make every step pay for it. ...