Software Engineering

Tokens, Watts, and Waste: The Hidden Energy Bill of LLM Inference

Tokens are small. That is why they are dangerous. A developer asks an assistant to generate a function, explain a repository, or reason through a failing test. The screen fills with text. Some of it is useful. Some of it is decoration. Some of it is a polite little parade of examples, test cases, alternative implementations, or whitespace that will be thrown away by the next parser in the pipeline. ...

Vibe Coding a Theorem Prover: When LLMs Prove (and Break) Themselves

A theorem prover is a terrible place to let an LLM improvise. Code review is forgiving compared with theorem proving. In ordinary software, a language model can produce code that looks clean, passes a few tests, and still hides a slow-burning defect somewhere behind an edge case. Annoying, yes. Catastrophic, sometimes. But the social contract is familiar: tests catch some errors, humans catch others, production catches the rest. Very elegant. Very modern. Very expensive. ...

When Solvers Guess Smarter: Teaching SMT to Think in Functions

Timeouts are where formal verification quietly loses its glamour. A team writes a specification. A solver receives the formula. Everyone expects the machine to answer a clean question: is this system safe, satisfiable, contradictory, or not? Then the solver thinks. And thinks. And returns nothing useful before the clock runs out. ...

Many Arms, Fewer Bugs: Why Coding Agents Need to Stop Working Alone

Teams are supposed to divide work. Bad teams divide accountability. Anyone who has managed a complicated project has seen the pattern. One specialist produces an impressive-looking analysis. Another quietly repairs its mistakes. The project succeeds, everyone receives credit, and the least useful participant is invited back for the next assignment. Multi-agent AI systems have inherited this problem with admirable efficiency. ...

Guardrails Over Gigabytes: Making LLM Coding Agents Behave

The coding agent did not fail quietly. That was the point. A coding agent writes a patch. The patch looks plausible. The imports are clean enough. The function names sound like they belong in the repository. The explanation is fluent, naturally. Fluency is what these systems do best. Then the build breaks. ...

When Agents Compare Notes: How Shared Memory Quietly Rewires Software Development

Software teams already know the problem. One developer discovers the weird edge case. Another developer repeats the same mistake three weeks later. A third person writes a Slack explanation that disappears into the corporate sedimentary layer, next to the launch checklist from 2019 and that one blessed Docker command nobody can find anymore. ...

The Esperanto of AI Agents: How the Agent Data Protocol Unifies a Fragmented Ecosystem

Every engineering team has met this problem: the useful data exists, but it lives in thirteen different shapes, three different tool conventions, two incompatible logs, and one heroic spreadsheet that nobody dares to open. AI agents have the same disease, only with more acronyms. The paper behind the Agent Data Protocol, or ADP, argues that large-scale supervised fine-tuning of AI agents has been held back less by a lack of data than by a lack of shared representation.1 Agent datasets already exist for coding, software engineering, web browsing, API use, operating-system interaction, and general tool use. The difficulty is that each one tends to encode actions, observations, tool calls, web states, messages, and execution feedback in its own local dialect. Naturally, every dataset is special. How convenient for nobody. ...

When Agents Learn to Test Themselves: TDFlow and the Future of Software Engineering

A bug report is not a specification. A bug report says something is wrong. A test says exactly how the wrong behavior must fail. That difference is the centre of TDFlow, a test-driven agentic workflow for repository-scale software repair.1 The paper’s central move is not to make the coding agent more charismatic, more autonomous, or more burdened with inspirational tool access. Mercifully. It does almost the opposite: it narrows the agent’s world until the task becomes executable. ...

Pipes by Prompt, DAGs by Design: Why Hybrid Beats Hero Prompts

The demo is easy. The DAG is not. Pipeline automation has a wonderfully deceptive user story. A business analyst writes: “Take this customer file, clean the locations, geocode the addresses, add weather data, then save the enriched output.” An LLM replies with a Python file. The file looks plausible. There are imports. There is an Airflow DAG. There are operators. There are dependencies. A demo audience nods approvingly. ...

Guard Rails > Horsepower: Why Environment Scaffolding Beats Bigger Models

A demo is cheap. Ask an AI agent to build a web app, watch it spin up a cheerful interface, click a few buttons, and everyone briefly pretends software engineering has been solved. Then production begins. The app boots but stores nothing. The database schema exists but the handler quietly forgets foreign keys. The UI looks plausible until the first state transition. The test suite passes because it checked the page title, not the workflow. Somewhere, a dashboard reports “success.” Somewhere else, a user discovers the thing is an elegant cardboard storefront. ...