Enterprise AI

Mind the Middle: Why AI Reliability Lives Between the Data and the Answer

TL;DR for operators: AI systems rarely fail only at the final answer. They fail earlier, in the quiet machinery that decides which evidence is seen, which records are aligned, which identity is protected, and which previous model behaviour is worth reusing. Three recent papers make that point from very different technical worlds. One improves few-shot object detection by correcting the imbalance between base-class and novel-class region proposals. One builds anonymous two-party gradient-boosted decision tree training so parties can align records without exposing shared identifiers. One maps the behavioural geometry of LLMs so jailbreak risk and defences can be predicted or transferred across model populations. ...

Graph Work, Not Graph Worship: RAGA Turns RAG Into an Auditable Knowledge Operation

TL;DR for operators: RAGA is not another “add a graph and accuracy goes up” paper. That would be too convenient, and therefore suspicious. The useful idea is more operational: treat retrieval-augmented generation as a knowledge management process, not a pile of embeddings with a polite chatbot on top. The paper proposes RAGA, short for Reading-And-Graph-building-Agent, an autonomous system that reads documents, searches existing graph knowledge, verifies whether new entities or relations should be added, and then constructs or updates a knowledge graph with source-linked provenance.1 Its core loop is Read–Search–Verify–Construct, implemented as a ReAct-style tool-calling agent rather than a one-shot extraction pipeline. ...

Logs Are Not Lineage: The Accountability Layer AI Agents Are Missing

TL;DR for operators: The paper argues that trustworthy AI agents need more than accurate final answers. Once an agent can retrieve documents, call APIs, write memory, modify databases, send messages, or coordinate with other agents, trust depends on whether the organisation can reconstruct how the output or action happened. The useful mechanism is: ...

The Solver Isn’t the Strategy: FrontierOR’s Reality Check for AI Optimisation Agents

Scheduling a factory, routing a fleet, pricing airline seats, allocating scarce capacity: these are not “write me a Python script” problems with nicer stationery. In real operations research, the useful answer is not merely a correct mathematical model. It is a method that stays feasible, keeps solution quality high, and finishes before the business context has expired. ...

Memory Foam: When AI Stops Storing Everything and Starts Learning From It

Enterprise AI has developed a small obsession with memory. The promise is tidy: give the model more context, attach a vector database, retrieve relevant fragments, and suddenly the system becomes a persistent assistant rather than a forgetful autocomplete machine wearing a blazer. The problem is that storage is not memory. Retrieval is not understanding. And a larger context window is not the same thing as knowing what matters. ...

Judge, Jury, and Benchmark: Why LLM Evaluation Needs Fresh Cases, Not Bigger Leaderboards

The procurement meeting is where public leaderboards go to look useful. Benchmark scores are comforting because they compress chaos into a number. One model is 87.3, another is 84.9, and suddenly the procurement meeting has the emotional texture of financial discipline. Very mature. Very measurable. Also, very possibly irrelevant. The problem is simple. A company rarely wants “the best model on average”. It wants the best model for contract review, support triage, clinical note summarisation, SQL repair, claims handling, product search, or whatever unglamorous workflow actually pays the cloud bill. Public benchmarks are often too generic for that decision. Worse, the benchmark items may already be floating inside model training data, turning evaluation into a memory test with better typography. ...

Lie Detectors Are Late: Why AI Oversight Needs Commitment Tracing

Sales agents, investment advisors, negotiators, and procurement bots share one annoying trait: the dangerous moment often arrives before the final sentence. By the time the agent says, “This product is ideal for your risk profile,” or “We have a stronger competing offer,” the operational system has already lost the more interesting battle. The model did not become risky at the punctuation mark. It drifted, selected a path, rationalised a move, and only then produced the polished message that everyone pretends to audit. ...

Raw Is Not Ready: Why Reliable AI Needs Evidence Architecture

Production AI has entered its awkward teenage phase. It can speak fluently, see impressively, forecast usefully, and still fail in ways that make operators quietly reach for the manual override. The problem is not simply that models are too small, not enough tokens have been burned, or someone forgot to add “think step by step” to a prompt. The deeper problem is that many AI systems are being asked to reason directly from raw inputs that have not yet been converted into the right operational form. ...

Mind the Representation Gap: Why Enterprise AI Fails Before It Thinks

Enterprise AI has developed a charming habit: whenever a system fails, someone suggests using a larger model. The chatbot misread a customer complaint? Bigger model. The autonomous system struggled with a new sensor configuration? Bigger model. The video classifier understood the objects but missed the actual message? Bigger model, possibly with a more expensive logo. ...

Same Old Spark: Why AI Creativity Needs Metacognition, Not More Polish

A marketing team asks twenty people to draft campaign ideas with the same AI assistant. The results arrive quickly. They are fluent, structured, audience-aware, and unusually presentable for first drafts. Then someone reads them side by side. The problem is not that the ideas are bad. That would be easier. The problem is that they are good in the same way. Same rhythm. Same safe positioning. Same “unexpected” angle that everyone, apparently, discovered independently with a little help from the same machine. The team has not automated creativity. It has automated convergence with nicer formatting. ...