When Solvers Guess Smarter: Teaching SMT to Think in Functions

Timeouts are where formal verification quietly loses its glamour. A team writes a specification. A solver receives the formula. Everyone expects the machine to answer a clean question: is this system safe, satisfiable, contradictory, or not? Then the solver thinks. And thinks. And returns nothing useful before the clock runs out. ...

January 11, 2026 · 15 min · Zelina

Many Arms, Fewer Bugs: Why Coding Agents Need to Stop Working Alone

Teams are supposed to divide work. Bad teams divide accountability. Anyone who has managed a complicated project has seen the pattern. One specialist produces an impressive-looking analysis. Another quietly repairs its mistakes. The project succeeds, everyone receives credit, and the least useful participant is invited back for the next assignment. Multi-agent AI systems have inherited this problem with admirable efficiency. ...

December 31, 2025 · 19 min · Zelina

Guardrails Over Gigabytes: Making LLM Coding Agents Behave

The coding agent did not fail quietly. That was the point. A coding agent writes a patch. The patch looks plausible. The imports are clean enough. The function names sound like they belong in the repository. The explanation is fluent, naturally. Fluency is what these systems do best. Then the build breaks. ...

December 27, 2025 · 16 min · Zelina

When Agents Compare Notes: How Shared Memory Quietly Rewires Software Development

Software teams already know the problem. One developer discovers the weird edge case. Another developer repeats the same mistake three weeks later. A third person writes a Slack explanation that disappears into the corporate sedimentary layer, next to the launch checklist from 2019 and that one blessed Docker command nobody can find anymore. ...

November 15, 2025 · 17 min · Zelina

The Esperanto of AI Agents: How the Agent Data Protocol Unifies a Fragmented Ecosystem

Every engineering team has met this problem: the useful data exists, but it lives in thirteen different shapes, three different tool conventions, two incompatible logs, and one heroic spreadsheet that nobody dares to open. AI agents have the same disease, only with more acronyms. The paper behind the Agent Data Protocol, or ADP, argues that large-scale supervised fine-tuning of AI agents has been held back less by a lack of data than by a lack of shared representation.¹ Agent datasets already exist for coding, software engineering, web browsing, API use, operating-system interaction, and general tool use. The difficulty is that each one tends to encode actions, observations, tool calls, web states, messages, and execution feedback in its own local dialect. Naturally, every dataset is special. How convenient for nobody. ...

November 2, 2025 · 12 min · Zelina

When Agents Learn to Test Themselves: TDFlow and the Future of Software Engineering

A bug report is not a specification. A bug report says something is wrong. A test says exactly how wrong must fail. That difference is the centre of TDFlow, a test-driven agentic workflow for repository-scale software repair.¹ The paper’s central move is not to make the coding agent more charismatic, more autonomous, or more burdened with inspirational tool access. Mercifully. It does almost the opposite: it narrows the agent’s world until the task becomes executable. ...

November 2, 2025 · 15 min · Zelina

Pipes by Prompt, DAGs by Design: Why Hybrid Beats Hero Prompts

The demo is easy. The DAG is not. Pipeline automation has a wonderfully deceptive user story. A business analyst writes: “Take this customer file, clean the locations, geocode the addresses, add weather data, then save the enriched output.” An LLM replies with a Python file. The file looks plausible. There are imports. There is an Airflow DAG. There are operators. There are dependencies. A demo audience nods approvingly. ...

October 1, 2025 · 14 min · Zelina

Guard Rails > Horsepower: Why Environment Scaffolding Beats Bigger Models

A demo is cheap. Ask an AI agent to build a web app, watch it spin up a cheerful interface, click a few buttons, and everyone briefly pretends software engineering has been solved. Then production begins. The app boots but stores nothing. The database schema exists but the handler quietly forgets foreign keys. The UI looks plausible until the first state transition. The test suite passes because it checked the page title, not the workflow. Somewhere, a dashboard reports “success.” Somewhere else, a user discovers the thing is an elegant cardboard storefront. ...

September 6, 2025 · 14 min · Zelina

Mask, Don’t Muse: When Simple Memory Beats Fancy Summaries

TL;DR for operators: A coding agent’s memory problem is not philosophical. It is a bill. The paper behind this article compares three ways to manage context in software-engineering agents: keep the full trajectory, summarize old turns with an LLM, or simply mask older environment observations while preserving the agent’s reasoning and actions.¹ Across five SWE-agent configurations on SWE-bench Verified, both context-management strategies usually cut cost sharply versus the Raw Agent. The awkward part is that the simple strategy, Observation Masking, is often just as good as LLM-Summary on solve rate and usually cheaper. ...

September 1, 2025 · 17 min · Zelina

Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures

TL;DR for operators: GitTaskBench is useful because it evaluates code agents where enterprise automation usually breaks: not in a clean coding puzzle, but inside an existing repository with dependencies, pretrained weights, fragile instructions, file formats, runtime constraints, and a user asking for a finished output.¹ The paper’s headline is not “agents can code”. We have enough confetti for that parade. The sharper finding is that agents are still inconsistent at the whole delivery chain. The best reported combination, OpenHands with Claude 3.7, reaches 72.22% execution completion but only 48.15% task pass rate. In other words, many runs produce something executable, but far fewer produce something good enough. ...

August 27, 2025 · 16 min · Zelina