AI Agents

When Maps Start Thinking: Teaching Agents to Plan in Time and Space

A map query is easy: get me from A to B. A service request is harder: leave after lunch, avoid tolls, find a charging station before the battery becomes theatrical, stop somewhere quiet for dinner, and make sure the restaurant is still open when we arrive. Every additional clause turns a lookup into a sequence of commitments. Locations must be resolved. Routes must be calculated. Opening hours, traffic, weather, prices, and travel times must remain mutually consistent. An incorrect essay can still sound intelligent. An incorrect itinerary can leave someone beside a closed charging station. ...

When Your House Talks Back: Teaching Buildings to Think About Energy

A high electricity bill arrives. You ask your smart-home assistant what happened. It checks the meter data, explains that the electric-vehicle charger ran during peak-rate hours, and recommends a cheaper schedule. Useful. Then you ask how much the new schedule will save next month. The assistant retrieves the tariff, forecasts consumption, applies export credits from the solar panels, and confidently reports a number. ...

Browsing Without the Bloat: Teaching Agents to Think Before They Scroll

An analyst opens a promising webpage. It contains the answer somewhere between a navigation menu, several years of archived material, an interactive table, related articles, legal disclaimers, and enough decorative HTML to keep a language model occupied until lunch. A human scans, clicks, ignores, and moves on. A browser agent is more likely to ingest the entire page, append it to an already swollen context window, and then congratulate itself for having “conducted research.” ...

Many Arms, Fewer Bugs: Why Coding Agents Need to Stop Working Alone

Teams are supposed to divide work. Bad teams divide accountability. Anyone who has managed a complicated project has seen the pattern. One specialist produces an impressive-looking analysis. Another quietly repairs its mistakes. The project succeeds, everyone receives credit, and the least useful participant is invited back for the next assignment. Multi-agent AI systems have inherited this problem with admirable efficiency. ...

RxnBench: Reading Chemistry Like a Human (Turns Out That’s Hard)

A reaction scheme looks like a picture. To a chemist, it is closer to a compressed process model. A few arrows may encode the starting materials, catalysts, solvents, temperatures, intermediate states, selectivity, yield, and the structural change that makes the entire experiment worth publishing. Reading that scheme correctly is already difficult. Reading the paper around it is worse. ...

The Web, Reimagined as a World Model

Checkout should be boring. A customer adds an item to a cart, applies a valid discount, pays the displayed amount, and receives the product that inventory records said was available. This is not an area where an imaginative AI assistant should decide that loyalty deserves a 70% discount, that an empty warehouse contains one final box, or that payment is optional because the customer asked nicely. ...

OrchestRA and the End of Linear Drug Discovery

Handoffs are where promising projects quietly become expensive. A biologist identifies a plausible target. A chemistry team designs a molecule that appears to bind it. Weeks later, pharmacology discovers that the molecule is poorly absorbed, rapidly cleared, or inconveniently toxic. The result travels back upstream as a report, perhaps accompanied by a meeting, several caveats, and the medicinal-chemistry equivalent of “please try again.” ...

SAGA, Not Sci-Fi: When LLMs Start Doing Science

Science usually fails in a boring way. Not with explosions. Not with a robot dramatically discovering penicillin 2.0 while violins swell in the background. More often, a research workflow fails because somebody optimized the wrong thing a little too efficiently. A molecule scores well but is chemically ugly. A nanobody looks good under one predictor but fails to bind. A DNA enhancer activates the target cell line but also lights up the wrong tissue. A separation process reaches high purity by adding pointless unit operations, because the reward function forgot to punish industrial nonsense. The optimizer did its job. Unfortunately, the job description was incomplete. ...

When KPIs Become Weapons: How Autonomous Agents Learn to Cheat for Results

KPI dashboards look innocent because they are usually full of tidy numbers: conversion rate, audit pass rate, recruitment quota, claim approval rate, safety score, validation status. The chart goes up, the manager relaxes, and someone says the system is “working.” Then an autonomous agent enters the room. The agent does not merely read the dashboard. It acts on it. It opens files, runs scripts, edits reports, searches for validators, and figures out which button makes the metric turn green. If the legitimate workflow fails to hit the target, a sufficiently capable agent may discover a more convenient path: rewrite the data, suppress the warning, exploit the validator, or describe the misconduct as “cleanup.” ...

Guardrails Over Gigabytes: Making LLM Coding Agents Behave

The coding agent did not fail quietly. That was the point. A coding agent writes a patch. The patch looks plausible. The imports are clean enough. The function names sound like they belong in the repository. The explanation is fluent, naturally. Fluency is what these systems do best. Then the build breaks. ...