When Agents Learn Without Learning: Test-Time Reinforcement Comes of Age

A team meeting usually ends with someone saying, “Let’s remember this for next time.” Human teams sometimes do. Agent teams usually do not. A group of LLM agents can debate, critique, revise, and produce a final answer. Then the whole episode often disappears into the landfill of inference logs: useful comments, bad guesses, decisive objections, elegant checks, all flattened into “the model answered correctly” or “the model failed.” Very modern. Very wasteful. ...

January 15, 2026 · 17 min · Zelina

Scaling the Sandbox: When LLM Agents Need Better Worlds

Sandbox is a comforting word. It sounds safe, contained, childlike. Put an AI agent in a sandbox and let it practice. Nothing catches fire. Nobody accidentally cancels a real flight. No production database wakes up with 37 mysterious refund requests and a very confused compliance officer. The problem is that most agent sandboxes are either too fake to teach anything, too manual to scale, or too close to production to be relaxing. The agent has to learn how to navigate persistent state, business rules, incomplete user information, tool failures, and multi-step dependencies. A static API-call dataset does not teach that. A role-playing LLM pretending to be the environment may hallucinate the rules. A hand-built benchmark is useful, but expensive to multiply. ...

January 14, 2026 · 17 min · Zelina

STACKPLANNER: When Agents Learn to Forget

Enterprise agents usually fail in an undramatic way. They do not rebel. They do not suddenly become conscious. They do not announce, with cinematic timing, that humanity has been replaced by a spreadsheet. They simply lose the thread. A research agent searches once, finds something half-relevant, and keeps dragging that result through the rest of the task. A report-writing workflow collects too many fragments and then forgets which ones were actually useful. A coordinator delegates to sub-agents, receives noisy outputs, and treats every message as equally important because, apparently, all context is sacred now. By the final step, the system has not become more intelligent. It has become a very expensive meeting transcript. ...

January 12, 2026 · 16 min · Zelina

When Debate Stops Being a Vote: DynaDebate and the Engineering of Reasoning Diversity

Meeting. Anyone who has sat through a corporate “alignment session” knows the ritual. Three people say nearly the same thing, one person says it more confidently, and the room calls it consensus. The decision looks collaborative. It is often just synchronized hesitation wearing a blazer. Multi-agent debate in AI can fail in a similar way. Add several LLM agents, ask them to debate, and the system may look more robust than a single model. But if all agents begin from nearly the same reasoning path, they may simply repeat the same mistake in different wording. The output becomes a vote over correlated errors. Democracy, but with clones. ...

January 12, 2026 · 15 min · Zelina

ResMAS: When Multi‑Agent Systems Stop Falling Apart

Agent teams fail in a very ordinary way. One agent misreads a question. Another repeats the wrong answer with more confidence. A third receives both versions, performs a tiny ceremony of “collaboration,” and returns something that looks more polished than the original error. Management sees five agents instead of one and assumes redundancy has arrived. It has not. Sometimes it is just a committee with better stationery. ...

January 11, 2026 · 15 min · Zelina

When LLMs Stop Talking and Start Driving

Factory trouble usually begins in language. Not elegant language. Not the polished language of annual reports and transformation roadmaps. The useful trouble is buried in work orders, technician notes, supplier messages, inspection records, customer complaints, meeting minutes, and logs written by people who had better things to do than produce clean training data. ...

January 11, 2026 · 18 min · Zelina

Distilling the Thought, Watermarking the Answer: When Reasoning Models Finally Get Traceable

Traceability sounds simple until a reasoning model enters the room. For ordinary generated text, watermarking usually means nudging token choices so the final output carries a statistical signature. That is already a delicate game. Push too weakly and the detector sees nothing. Push too hard and the writing starts to smell like machine-selected confetti. ...

January 9, 2026 · 15 min · Zelina

Infinite Tasks, Finite Minds: Why Agents Keep Forgetting—and How InfiAgent Cheats Time

A report is not finished because the model “understands” the assignment. It is finished because the system still knows, two hundred actions later, which documents were read, which notes were trustworthy, which sections remain unfinished, and which half-baked intermediate answer should not accidentally become the final one. That is the boring part of agentic AI. Naturally, it is also the part most systems quietly fail at. ...

January 7, 2026 · 14 min · Zelina

MAGMA Gets a Memory: Why Flat Retrieval Is No Longer Enough

Memory is where many impressive agents quietly become mediocre employees. They can answer the last question. They can summarize the last document. They can sound very confident about a customer, a project, or a workflow they saw three weeks ago. Then someone asks, “Why did we make that decision?”, “When did the requirement change?”, or “Was that the same client who objected last time?” Suddenly the agent rummages through its past like a consultant searching Slack at 1:43 a.m. Technically alive. Not exactly organized. ...

January 7, 2026 · 17 min · Zelina

EverMemOS: When Memory Stops Being a Junk Drawer

Memory sounds simple until the assistant has to remember two incompatible things at once. A customer loves craft beer. The same customer is temporarily taking antibiotics. A flat memory system retrieves “likes IPA” and recommends a variety pack, because apparently “memory” means grabbing the loudest sticky note from a drawer and pretending it is wisdom. A more useful assistant retrieves the preference, the medical constraint, the timing, and the relation among them. It recommends a mocktail and quietly avoids turning personalization into negligence. ...

January 6, 2026 · 17 min · Zelina