Enterprise AI

Sequential Beats Parallel: When Deep Research Agents Learn to Reflect

A research request usually begins with a deceptively harmless sentence: “Can you give me the full picture?” Then comes the usual enterprise ritual. Someone breaks the topic into pieces. One person checks competitors. Another checks regulation. Another reads technical reports. Another searches recent news. Everyone works quickly. Everyone returns with fragments. Then one unlucky analyst stitches the fragments into a report and pretends the seams are a design choice. ...

When LLMs Invent Languages: Efficiency, Secrecy, and the Limits of Natural Speech

Chatbots are trained to sound human. Enterprise AI agents are increasingly asked to behave like colleagues: pass information, coordinate actions, summarize context, and explain what they are doing in language people can read. That arrangement feels safe because natural language is familiar. It also feels efficient enough, at least until agents start talking to other agents. ...

When Rewards Learn to Think: Teaching Agents How They’re Wrong

An agent fails a task. It searched the web twice, opened the wrong page, trusted a noisy snippet, wrote a plausible final answer, and lost the point. Traditional reinforcement learning sees one thing: wrong. That is brutally clean, and also rather unhelpful. The agent may have performed three useful steps before collapsing at the fourth. Or it may have wandered confidently through nonsense from the beginning. Sparse final-answer rewards flatten these cases into the same training signal. The scoreboard says “0.” Very educational, in the same way a fire alarm teaches architecture. ...

World Models Meet the Office From Hell

Office software has a special talent: it says “success” at the exact moment something has gone wrong somewhere else. A ticket is updated. A role is assigned. An asset is transferred. The API returns a cheerful confirmation. The agent, bless its silicon heart, declares victory. Then a background workflow fires. A user’s clearance changes. Another workflow reacts to that clearance change. A different record is silently updated. A constraint is now violated. The agent does not notice, because the agent saw the office equivalent of a green checkmark and mistook it for reality. ...

When Alignment Is Not Enough: Reading Between the Lines of Modern LLM Safety

A chatbot refuses a dangerous request. Everyone relaxes. This is the small theatre of modern AI safety: the model says no, the dashboard records a refusal, the vendor presentation adds another green checkmark, and the compliance team moves on to the next risk register. Very tidy. Very comforting. Also, increasingly insufficient. The problem is not that refusal behavior is meaningless. It is not. The problem is that refusal behavior is only one visible symptom of safety alignment. Modern LLM safety now depends on a larger chain: training objectives, post-training choices, inference interfaces, prompt formats, tool access, evaluation design, and deployment context. When any part of that chain changes, the nice refusal seen in a benchmark may not survive contact with the product. ...

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Laptop. That is the deceptively simple object hiding inside this paper. Not a magic planner. Not a thousand-tool agent marketplace. Not a baroque workflow with seventeen orchestration layers and a dashboard that looks like a cockpit designed by consultants. A laptop. Or, more precisely, a minimal virtual computer: a sandbox with terminal access, file editing, code execution, persistent files, and the ability to install or fetch resources. In “Computer Environments Elicit General Agentic Intelligence in LLMs,” Cheng et al. ask a question that looks almost too obvious to be interesting until one remembers how much of the AI industry is still trying to squeeze “agency” out of longer prompts. ...

When Benchmarks Break: Why Bigger Models Keep Winning (and What That Costs You)

Budget. That is where the benchmark story usually becomes less elegant. A vendor shows a model card with better reasoning scores, stronger multi-task accuracy, and a leaderboard position polished to a mirror finish. Then someone in operations asks the rude question: what does this improvement cost per customer case, per analyst hour, per compliance review, or per failed escalation? ...

When Coders Prove Theorems: Agents, Lean, and the Quiet Death of the Specialist Prover

A coder does not trust a program because it sounds plausible. A coder runs it, reads the error message, changes the implementation, tests again, searches the library, asks a colleague, splits the problem, and keeps going until the machine stops complaining. That mundane loop is the interesting part of “Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics.” The headline result is easy to market: with Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 Putnam 2025 problems in Lean, matching the reported perfect score of AxiomProver. Nice. The trophy cabinet sparkles. ...

When Retrieval Learns to Breathe: Teaching LLMs to Go Wide and Deep

Retrieval has a breathing problem. Most enterprise RAG systems inhale once, grab the nearest chunks, and then hope the model can make the answer sound less fragile than the evidence actually is. That works tolerably well when the user asks for something sitting neatly inside a document paragraph. It works less well when the answer lives across entities, relations, aliases, product categories, authors, diseases, suppliers, regulations, or customer records. In other words, it works less well in the part of business where knowledge is not a pile of text but a network. ...

Deep GraphRAG: Teaching Retrieval to Think in Layers

Retrieval has a management problem. Not the motivational-poster kind of management problem. The operational kind. A company asks its AI system a question about a contract, a customer dispute, a policy exception, or a technical incident. The answer is not sitting in one paragraph. It is distributed across definitions, transactions, policies, exceptions, and historical context. A flat vector search grabs a few semantically similar chunks and hopes the model can stitch them together. A global summarizer reads widely, compresses aggressively, and occasionally smooths away the exact fact that mattered. A local graph search follows nearby entities and may become very confident inside the wrong neighborhood. ...