Autonomous Agents

Benchmarking the Benchmarks: When AI Safety Metrics Stop Meaning Anything

Opening — Why this matters now The AI industry has quietly entered a dangerous phase: we are measuring everything, and understanding very little. If you ask five vendors whether their model is “safe,” you will likely get five confident “yes” answers—each backed by benchmarks, metrics, and charts. The problem is not the lack of evaluation. It is that the evaluations no longer agree on what they are measuring. ...

Evolve or Die Trying: When LLMs Stop Writing Code and Start Designing Algorithms

Opening — Why this matters now The current generation of LLM-powered systems can write code, suggest optimizations, and even debug their own outputs. Impressive, yes—but fundamentally limited. Most of these systems are still operating at the function level, not the system level. That distinction matters more than people admit. In real-world optimization—logistics, routing, scheduling, portfolio construction—the performance edge rarely comes from a clever function. It comes from how the entire algorithm is structured, decomposed, and coordinated. And until recently, that remained stubbornly human territory. ...

From Words to Workflows: Why AI Still Struggles to Think Like an Operations Research Analyst

Opening — Why this matters now Everyone wants AI that can “just figure it out.” Describe a supply chain problem, a scheduling constraint, or a pricing objective—and expect the system to generate a mathematically sound optimization model. That’s the dream. And increasingly, it’s the pitch behind AI copilots in enterprise decision-making. The paper quietly dismantles that assumption. ...

Learning on Autopilot? Not Quite — How PAL Turns Passive Videos into Active Intelligence

Opening — Why this matters now For all the noise around “AI-powered education,” most platforms still behave like glorified video players with quizzes stapled on. Personalization, in practice, often means rearranging the same content for everyone—slightly faster for some, slightly slower for others. That model is reaching its limits. As AI systems become more capable in real-time decision-making, the expectation is shifting: learning systems should not just deliver content, but respond to learners as they evolve. Static personalization is no longer sufficient; adaptive intelligence is the new baseline. ...

Routing Without Running Out: How Bilevel Optimization Rewires EV Logistics

Opening — Why this matters now Electric vehicles are no longer a pilot project—they are infrastructure. And infrastructure, unlike PowerPoint, has a habit of exposing weak assumptions. The problem is not just where vehicles go, but whether they make it there without quietly dying mid-route. Routing for EV fleets introduces a constraint traditional logistics never had to respect: energy is no longer an afterthought—it is the system. ...

The Memory Isn’t Broken — It’s Flat: Why LLMs Need to ‘Draw’ to Remember

Opening — Why this matters now AI agents have quietly crossed a threshold: they no longer forget everything between conversations. And yet, they still behave like they do. Despite persistent memory layers—vector databases, RAG pipelines, archival stores—most agents fail at something deceptively simple: answering questions that require time, change, or context. Ask an agent what happened first, what changed, or how multiple events relate, and the system often collapses into guesswork. ...

The Search That Remembers: Training AI Without Answers

Opening — Why this matters now There’s a quiet bottleneck in agentic AI that most demos conveniently ignore: reward design. Search agents—those increasingly fashionable LLM-powered systems that browse, retrieve, and reason—are trained like obedient students. They are rewarded when they produce the correct answer. The catch? Someone needs to define that answer in advance. ...

Epistemic Infrastructure: Why Your AI Knows Less Than It Thinks

Opening — Why this matters now The enterprise AI stack has a favorite illusion: if you retrieve the right documents, you will get the right answer. It’s a comforting belief—engineer better embeddings, expand context windows, sprinkle some graph retrieval, and the system will eventually behave. Except it doesn’t. The paper “Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure” argues something quietly inconvenient: the bottleneck is no longer retrieval fidelity—it’s epistemic fidelity. ...

From Playbooks to Probabilities: When AI Starts Thinking Like a Football Manager

Opening — Why this matters now AI has spent the past decade predicting outcomes. Now it wants to simulate realities. That shift—from prediction to generation—is subtle but consequential. In markets, it means scenario analysis instead of point forecasts. In operations, it means stress-testing decisions rather than merely optimizing them. And, somewhat unexpectedly, one of the clearest demonstrations of this shift comes not from finance or logistics, but from football. ...

Meerkat or Mirage? When AI Safety Fails in Plain Sight (Across Traces)

Opening — Why this matters now If you’re still auditing AI systems one trace at a time, you’re not auditing—you’re sampling. Modern agent systems don’t fail loudly. They fail quietly, collectively, and often strategically. A single interaction may look benign. A hundred interactions may look routine. But somewhere in that haystack sits a coordinated failure—distributed, sparse, and occasionally intentional. ...