
When the Answer Matters More Than the Thinking

Answer. In most business systems, that is the part users actually care about. The approval decision. The risk label. The final invoice category. The recommended next action. The tidy little field that decides whether the workflow moves forward or someone opens a Slack thread titled “Why did the AI say this?” Yet much of modern LLM fine-tuning treats that answer as just another slice of text. Worse, when supervised examples include long chain-of-thought explanations, the final answer may become the shortest and least dominant part of the training objective. The model learns to produce a convincing trail of reasoning, but the tiny destination at the end receives comparatively little optimization pressure. Very elegant. Also slightly absurd. ...

December 26, 2025 · 2 min · Zelina

Echoes, Not Amnesia: Teaching GUI Agents to Remember What Worked

Memory is not a folder. A useful employee does not fill out the same form from scratch every morning as if yesterday never happened. They remember which menu hides the export button, which warning can be ignored, which field must be filled before the “Next” button wakes up, and which apparently harmless click sends the process into a small bureaucratic swamp. ...

December 23, 2025 · 17 min · Zelina

Adversaries, Slices, and the Art of Teaching LLMs to Think

A math tutor does not wait until the end of a two-page solution, circle the final answer, and say “wrong.” At least, not a good one. The useful tutor interrupts earlier. This line follows. That parity condition does not. This factorization is legal, but the conclusion you drew from it is not. The feedback is local, not theatrical. It tells the student where the reasoning began to rot, before the final answer becomes merely the visible corpse. ...

December 19, 2025 · 22 min · Zelina

Stepwise Think-Critique: Teaching LLMs to Doubt Themselves (Productively)

The useful part of doubt is timing. Doubt is not useful after the invoice is paid, the client report is sent, or the model has already produced a confident wrong answer with twelve decorative paragraphs of reasoning. At that point, “let us verify” becomes less like quality control and more like archaeology. ...

December 18, 2025 · 16 min · Zelina

When Tokens Remember: Graphing the Ghosts in LLM Reasoning

Audit is easy when the answer is a single lookup. A customer asks, “What is your refund policy?” The model quotes the policy paragraph. We check whether the quoted paragraph came from the right source. Very civilized. Everyone goes home early. But real enterprise LLM work is rarely that tidy. A compliance assistant reads a contract, extracts obligations, compares them with internal policy, reasons through exceptions, and writes a recommendation. A research assistant reads multiple sources, builds an intermediate summary, then answers a question from that summary. A support agent reads a user history, infers the likely issue, then proposes the next action. In these cases, the final sentence may depend on prompt evidence and on earlier generated text. ...

December 18, 2025 · 16 min · Zelina

Picking Less to Know More: When RAG Stops Ranking and Starts Thinking

Search is not judgment. Search is easy to admire because it produces something visible. A ranked list. A bigger context window. A satisfying pile of passages that says, “Look, we retrieved evidence.” Very comforting. Also not the same as knowing what evidence is actually needed. That distinction is the core of Context-Picker: Dynamic Context Selection Using Multi-stage Reinforcement Learning.1 The paper studies a familiar RAG problem: if a system retrieves too little, it misses the answer; if it retrieves too much, it drags in distractors, repeats, weakly related fragments, and the usual long-context swamp where useful evidence politely disappears in the middle. ...

December 17, 2025 · 14 min · Zelina

When Rewards Learn Back: Evolution, but With Gradients

Rewards are where many agent projects go to become expensive folklore. A team wants an AI agent to complete long workflows: search, reason, call tools, check constraints, recover from mistakes, and produce a useful answer. The model can talk. The tools work. The benchmark demo is acceptable. Then reinforcement learning enters the room, and someone has to decide what “good” means at every step. ...

December 16, 2025 · 17 min · Zelina

Replace, Don’t Expand: When RAG Learns to Throw Things Away

The inbox problem hiding inside RAG. Inbox. That is the easiest way to understand what goes wrong in many retrieval-augmented generation systems. A query arrives. The system retrieves a few documents. The answer is not obvious. So the system retrieves more. Then more. Then perhaps a web search result. Then a rewritten query. Then another bundle of passages. ...

December 12, 2025 · 20 min · Zelina

It Takes a Village (of Models): Why Multi-Agent Intelligence Won't Emerge by Accident

Agents are easy to multiply. That is the attractive part. Give one model a browser. Give another a code editor. Add a planner, a critic, a memory layer, a few tools, a dashboard, and suddenly the product demo looks like a small digital office. Everyone has a job title. Everyone talks. Nobody asks whether the “team” actually knows how to be a team. ...

December 10, 2025 · 14 min · Zelina

Trees That Think Faster: Adaptive Compression for the Long-Context Era

Long context is a lovely product promise until the invoice arrives. Every enterprise AI demo eventually wants the same magic trick: read the whole contract archive, remember every customer interaction, inspect every ticket, keep all meeting notes alive, and answer as if the model has a tidy brain instead of a very expensive attention matrix. The sales slide says “128K context.” The infrastructure team hears “latency, memory, and GPU burn.” Both are correct. One is merely dressed better. ...

December 7, 2025 · 17 min · Zelina