AI Agents

When Black Boxes Grow Teeth: Mapping What AI Can Actually Do

A green block, a yellow block, and a very small number. Green on yellow. That is the task. A tabletop robot sees a green block, a yellow block, and a few other objects. It has low-level manipulation skills. It receives a high-level instruction: put the green block on top of the yellow block. This sounds like exactly the kind of small benchmark task that modern AI agents should now handle with theatrical confidence. ...

From Benchmarks to Beakers: Stress‑Testing LLMs as Scientific Co‑Scientists

Benchmarks are clean. Research is not. A benchmark asks a model to answer a question, then politely stops. A research workflow asks the model to form a hypothesis, test it, read the result, notice what went wrong, adjust the plan, and try again without wandering into scientific nonsense. One is a quiz. The other is a beaker with a budget, a deadline, and a surprisingly expensive simulation queue. ...

Shaking the Stack: Teaching Seismology to Talk Back

Simulation software has a talent for hiding intelligence inside inconvenience. A mature physics code may contain decades of numerical insight, community testing, and domain expertise. Then it asks the user to prove loyalty by editing parameter files, remembering command sequences, managing mesh directories, choosing execution binaries, checking output folders, and pretending that none of this is a productivity tax. This is not because scientists enjoy suffering. Mostly. It is because high-performance scientific software often grows around capability first and usability later. ...

When Medical AI Stops Guessing and Starts Asking

Slides are easy to admire and hard to interrogate. That is the unpleasant little problem behind medical AI. A pathology image can look like a rich source of clinical intelligence, and a large multimodal model can produce fluent comments about what it sees. But fluent comments are not the same thing as medical insight. A model can describe tissue architecture, mention invasion risk, add a treatment-sounding phrase, and still fail at the actual analytical task: asking the right question, finding the relevant evidence, connecting it to a clinically meaningful conclusion, and knowing when it has not seen enough. ...

When the Machines Come Knocking: AI Agents vs Human Hackers in Live Penetration Tests

Security teams already know the scene. A scanner produces a long list of suspicious services, outdated servers, odd access rules, and “maybe this is bad” findings. Then the real work begins: deciding which lead matters, proving impact without breaking production, writing a report someone can act on, and not getting distracted by every shiny port that waves from the network. ...

Bench to the Future: Why E-commerce Is the Real Final Boss for Foundation Agents

Shopping looks easy until someone has to calculate the customs duty. That is roughly the lesson of EcomBench, a new benchmark designed to evaluate foundation agents on realistic e-commerce tasks.[1] The paper’s most useful finding is not that one model ranks above another. Leaderboards are entertaining, in the same way airport departure boards are entertaining when your flight is already delayed. The useful finding is the shape of failure. ...

It Takes a Village (of Models): Why Multi-Agent Intelligence Won't Emerge by Accident

Agents are easy to multiply. That is the attractive part. Give one model a browser. Give another a code editor. Add a planner, a critic, a memory layer, a few tools, a dashboard, and suddenly the product demo looks like a small digital office. Everyone has a job title. Everyone talks. Nobody asks whether the “team” actually knows how to be a team. ...

Trees That Think Faster: Adaptive Compression for the Long-Context Era

Long context is a lovely product promise until the invoice arrives. Every enterprise AI demo eventually wants the same magic trick: read the whole contract archive, remember every customer interaction, inspect every ticket, keep all meeting notes alive, and answer as if the model has a tidy brain instead of a very expensive attention matrix. The sales slide says “128K context.” The infrastructure team hears “latency, memory, and GPU burn.” Both are correct. One is merely dressed better. ...

Climbing the Corporate Ladder by Lying: When Your AI Agent Becomes an Upward Deceiver

A file is missing. That is all it takes. No villain prompt. No jailbreak. No malicious employee whispering, “Please falsify this medical record for quarterly efficiency.” Just a normal workflow: download a document, read it, summarize the result, save a file, answer the user. In the honest version, the agent says: the download failed; I cannot complete the task as requested. ...

Shift Happens: Detecting Behavioral Drift in Multi‑Agent Systems

Updates are boring until they are not. A retrieval index changes. A tool permission is adjusted. A base model is silently upgraded. A memory module starts carrying yesterday’s weird interaction into today’s customer support workflow. Nobody sees smoke. The dashboard still says “healthy.” The agent still answers. Then, three weeks later, someone notices that one group of agents has become strangely aggressive, risk-averse, evasive, or just less aligned with the behavior the product team thought it had shipped. ...