Long-Context

Picking Less to Know More: When RAG Stops Ranking and Starts Thinking

Search is not judgment. Search is easy to admire because it produces something visible. A ranked list. A bigger context window. A satisfying pile of passages that says, “Look, we retrieved evidence.” Very comforting. Also not the same as knowing what evidence is actually needed. That distinction is the core of Context-Picker: Dynamic Context Selection Using Multi-stage Reinforcement Learning.[1] The paper studies a familiar RAG problem: if a system retrieves too little, it misses the answer; if it retrieves too much, it drags in distractors, repeats, weakly related fragments, and the usual long-context swamp where useful evidence politely disappears in the middle. ...

Trees That Think Faster: Adaptive Compression for the Long-Context Era

Long context is a lovely product promise until the invoice arrives. Every enterprise AI demo eventually wants the same magic trick: read the whole contract archive, remember every customer interaction, inspect every ticket, keep all meeting notes alive, and answer as if the model has a tidy brain instead of a very expensive attention matrix. The sales slide says “128K context.” The infrastructure team hears “latency, memory, and GPU burn.” Both are correct. One is merely dressed better. ...

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

Contracts are not polite. They hide the important clause on page 83, define the crucial exception on page 17, and bury the fatal cross-reference in an appendix nobody wanted to read. Annual reports behave similarly. So do medical SOPs, litigation files, policy manuals, technical logs, and most documents produced by institutions that have discovered both Microsoft Word and committees. ...

Fast & Curious: How ‘Speed-First’ LLM Architectures Change the Build vs. Buy Math

TL;DR for operators: Efficient LLMs are not just “smaller Transformers with a haircut.” That is the comfortable misconception, and like many comfortable things in enterprise AI, it becomes expensive once real users arrive. The survey reviewed here maps the major architectural routes for making large language models faster, cheaper, and more deployable: linear sequence models, sparse attention, efficient full attention, sparse mixture-of-experts, hybrid architectures, diffusion LLMs, and multimodal extensions.[1] Its practical value is not that it declares a single winner. It does something more useful: it tells operators which bottleneck each family is trying to remove. ...

Remember Like an Elephant: Unlocking AI's Hippocampus for Long Conversations

TL;DR for operators: Long-context windows are useful. They are also an expensive way to pretend that memory is just a bigger clipboard. The HEMA paper argues for a more operationally realistic design: keep a compressed summary of the conversation always visible, store detailed past exchanges outside the prompt, and retrieve only the details that matter for the current turn.[1] That gives the model two different memory behaviours: continuity from Compact Memory and factual recall from Vector Memory. ...

How Ultra-Large Context Windows Challenge RAG

TL;DR for operators: Ultra-large context windows are not a ceremonial funeral for retrieval-augmented generation. They are a price renegotiation. If your task is to analyse a bounded, self-contained document set (a contract bundle, diligence folder, policy manual, code repository, or technical appendix), a long-context model may now be the cleaner first option. The main benefit is not that it “knows more”. It is that it can inspect more of the original evidence without depending on a retriever to guess which passages matter. ...