AI Agents

Learning to Discover at Test Time: When Search Learns Back

A leaderboard usually treats an AI model like a very fast intern: give it a problem, let it try many times, keep the best answer, and politely ignore the graveyard of failed attempts. That is useful. It is also a little strange. A human engineer does not merely try 25,600 variations of a GPU kernel while keeping the same brain. After the first few failures, she learns which bottlenecks matter. After a lucky partial success, she changes how she thinks about the problem. After enough attempts, the search process is no longer just sampling. It has become learning. ...

Skeletons in the Proof Closet: When Lean Provers Need Hints, Not More Compute

Compute is a very convenient alibi. When an AI system fails, the modern reflex is to ask for more of it: more samples, more tokens, more search, more GPUs, more patience from whoever is paying the invoice. This habit is not always wrong. Sometimes the model really does need another attempt. Sometimes the winning answer is hiding in sample number 47. ...

From Talking to Living: Why AI Needs Human Simulation Computation

The chatbot that cannot check the door. A useful AI assistant can write an email, summarize a meeting, explain a regulation, or generate a plan for fixing a server problem. Then something inconvenient happens: the real world disagrees. The meeting transcript missed one speaker. The regulation changed in one jurisdiction. The server error was not caused by the code but by two services fighting over the same port. The customer sounded satisfied in the chat log but cancelled the contract two days later. The model can still talk. Beautifully, even. But it cannot always live inside the situation long enough to notice that its first answer has become stale, incomplete, or simply wrong. ...

Lost Without a Map: Why Intelligence Is Really About Navigation

Map. That is the word most AI product teams should probably put above their dashboards, agent logs, evaluation suites, and occasionally their office coffee machine. Not because maps are poetic. Because when an AI system fails in a live workflow, the failure often does not look like “the model forgot a fact.” It looks like the system was navigating the wrong space. ...

When Coders Prove Theorems: Agents, Lean, and the Quiet Death of the Specialist Prover

A coder does not trust a program because it sounds plausible. A coder runs it, reads the error message, changes the implementation, tests again, searches the library, asks a colleague, splits the problem, and keeps going until the machine stops complaining. That mundane loop is the interesting part of Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics. The headline result is easy to market: with Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 Putnam 2025 problems in Lean, matching the reported perfect score of AxiomProver. Nice. The trophy cabinet sparkles. ...

Houston, We Have a Benchmark: When Agentic AI Meets Orbital Reality

Space is not impressed by fluent reasoning. A satellite does not care that an AI agent has produced a confident plan. A ground station cannot magically see through the Earth because the prompt says “ensure connectivity.” A sensor cannot keep collecting images after its onboard storage is full. Orbital mechanics, power budgets, slew angles, data buffers, and line-of-sight geometry are not stakeholder preferences. They are constraints. Reality, annoyingly, still has root access. ...

Seeing Is Not Thinking: Teaching Multimodal Models Where to Look

A model can see the image and still miss the point. Inspection is a wonderfully cruel test for AI. Show a multimodal model a product photo, a medical scan, a factory defect, a form, or a dashboard screenshot, and the answer may sound calm, fluent, and technically plausible. The model may even imitate the reasoning style of a stronger teacher model. It may describe objects, infer relationships, and produce the correct-looking sentence. ...

When AI Stops Pretending: The Rise of Role-Playing Agents

A chatbot can act like a pirate for three turns. That is not the impressive part. A teenager with a Halloween hat can also do that. The harder problem begins when the agent has to remember what happened last week, preserve a recognizable personality across changing situations, make choices consistent with its motives, avoid borrowing another character’s copyrighted voice a little too enthusiastically, and still behave safely when the user pushes it outside the script. At that point, “pretend you are X” stops being a prompt trick and becomes a systems engineering problem. ...

MatchTIR: Stop Paying Every Token the Same Salary

Payroll is a useful metaphor for agent training because it makes the absurdity obvious. Imagine a project team where one employee finds the right database, another enters the correct query, a third repeatedly calls the wrong API, and a fourth finally writes the report. If the report is accepted, everyone receives the same bonus. If it fails, everyone receives the same blame. Very democratic. Also very stupid. ...

One Agent Is a Bottleneck: When Genomics QA Finally Went Multi-Agent

Databases are where elegant AI demos go to develop a limp. A model can sound fluent about biology, medicine, finance, or law. Then someone asks a question that requires the latest record from a specialized database, a second lookup from another source, a formatted API call, a large HTML response, and a final answer that does not forget the original question halfway through. Suddenly the “AI assistant” becomes a very expensive intern copying URLs into the wrong field. ...