Compile Once, Train Later: Offline RL Moves Code-Model Verification Upstream

Code assistants have a small accounting problem. Not the glamorous kind involving model capability, agentic workflows, or yet another dashboard with a glowing neural blob. The ordinary kind: every time a model proposes code during reinforcement learning, someone, or something, has to run it, test it, score it, and feed that score back into training. ...

June 3, 2026 · 14 min · Zelina

Memory Lane Meets Mainframe: Why Coding Agents Need Better Memories, Not Bigger Egos

Memory is a familiar word. That is exactly why it can mislead us. When people hear that coding agents need “memory,” the first image is often a giant scrapbook: past prompts, previous patches, command logs, successful code snippets, failed attempts, and whatever else the agent has dragged behind it like a very confident intern with a messy backpack. More memory sounds safer. More traces sound more useful. More remembered work sounds like less repeated work. ...

April 16, 2026 · 17 min · Zelina

Thinking Fast, Remembering Slow: Why SWE-AGILE Fixes the Memory Crisis of AI Agents

Memory sounds like a storage problem. Give the agent a longer context window, let it keep the full conversation, and the work should become easier. This is the kind of solution that looks obvious until it meets a real software repository, a failing test suite, a long terminal log, and a model that now has to find one important clue buried somewhere in the middle of its own autobiography. ...

April 14, 2026 · 18 min · Zelina

Proofs at Scale: When 30,000 Agents Replace the Referee

Mathematics has a management problem. That sounds less romantic than saying it has a reasoning problem, but romance is not usually where bottlenecks hide. A proof can be brilliant, a referee can be diligent, and still the verification system can fail for the boring reason that nobody has enough time to check everything line by line. The paper Automatic Textbook Formalization takes that bottleneck seriously and then does something unusually concrete: it reports a multi-agent system that formalized a 500-plus-page graduate algebraic combinatorics textbook into Lean, with all 340 target definitions and theorems proved, in about one week. ...

April 6, 2026 · 18 min · Zelina

Double Helix, Double Checks: Why Agentic AI Needs Governance Before It Writes Your Code

Code is where AI confidence goes to become expensive. A chatbot can produce a plausible function in ten seconds. An agent can now plan a refactor, split files, update interfaces, generate documentation, and politely leave behind a system that fails because one event payload forgot a required field. Very efficient. Very modern. Very annoying. ...

March 5, 2026 · 16 min · Zelina

Agents That Hire Themselves: Why OpenSage Signals the End of Hand-Crafted AI Workflows

Workflow diagrams age badly. A process that looked clean in January usually becomes a small archaeological site by March: one more exception, one more conditional branch, one more “temporary” manual approval that survives longer than the intern who added it. This is how many AI-agent projects quietly become ordinary software projects with a chatbot sitting on top, smiling politely while humans keep repairing the plumbing. ...

February 21, 2026 · 16 min · Zelina

Breaking Things on Purpose: How CLI-Gym Teaches AI to Fix the Real World

Broken environments are where coding agents stop looking magical. A model can write a neat Python function, patch a repository, and explain the bug with courtroom confidence. Then it enters a terminal, meets a missing shared library, a corrupted dependency, a bad environment variable, or a filesystem permission issue, and suddenly the “autonomous engineer” starts behaving like an intern trapped inside conda. Not a bad intern, perhaps. Just one who keeps running the same command and hoping Linux will become more emotionally cooperative. ...

February 13, 2026 · 15 min · Zelina

When Agents Believe Their Own Hype: The Hidden Cost of Agentic Overconfidence

Code review has a comforting ritual. A developer submits a patch. A reviewer inspects it. The reviewer says it looks good. Everyone feels slightly better, because at least someone checked. In AI-agent workflows, this ritual becomes even more tempting: let one agent write the patch, let another agent review it, then ask the reviewer how confident it is. ...

February 9, 2026 · 19 min · Zelina

Tokens, Watts, and Waste: The Hidden Energy Bill of LLM Inference

Tokens are small. That is why they are dangerous. A developer asks an assistant to generate a function, explain a repository, or reason through a failing test. The screen fills with text. Some of it is useful. Some of it is decoration. Some of it is a polite little parade of examples, test cases, alternative implementations, or whitespace that will be thrown away by the next parser in the pipeline. ...

February 8, 2026 · 14 min · Zelina

Vibe Coding a Theorem Prover: When LLMs Prove (and Break) Themselves

A theorem prover is a terrible place to let an LLM improvise. Code review is forgiving compared with theorem proving. In ordinary software, a language model can produce code that looks clean, passes a few tests, and still hides a slow-burning defect somewhere behind an edge case. Annoying, yes. Catastrophic, sometimes. But the social contract is familiar: tests catch some errors, humans catch others, production catches the rest. Very elegant. Very modern. Very expensive. ...

January 11, 2026 · 14 min · Zelina