Evaluation

When Images Pretend to Be Interfaces: Stress‑Testing Generative Models as GUI Environments

Screenshots are easy to love. They sit still, look polished, and ask very little from the viewer. Interfaces are less polite. Click one wrong icon, place a menu twenty pixels away from where it belongs, blur one label, or forget what happened three screens ago, and the whole interaction becomes decorative theatre. ...

Benchmarks Lie, Rooms Don’t: Why Embodied AI Fails the Moment It Enters Your House

The room is not impressed by your leaderboard. A robot that performs well on a public benchmark has not necessarily learned how to operate in your house. It may recognize a chair in a dataset. It may answer a visual question about a tidy image. It may even produce a confident paragraph explaining where the coffee mug should be. Then it enters a real room (with mirrors, partial views, cluttered corners, awkward sightlines, and objects that are not positioned for benchmark convenience) and suddenly the “general intelligence” starts behaving like a tourist holding the map upside down. ...

FIRE-BENCH: Playing Back the Tape of Scientific Discovery

A demo can make an AI research agent look impressive in ten minutes. Give it a task, watch it create files, install packages, run experiments, generate tables, and write something that sounds like a conclusion. Productivity theater, now with terminal logs. The harder question is less cinematic: did it actually discover the right thing? ...

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

A model can generate a pretty sequence of images. Good. So can a slide deck. The harder question is whether those images actually help it think. That is the uncomfortable point behind MentisOculi: Revealing the Limits of Reasoning with Mental Imagery, a new benchmark paper that tests whether frontier multimodal models can do something closer to human mental imagery: form a visual state, keep it stable, transform it step by step, and use the transformed state to decide what to do next.[1] Not merely “look at an image and answer a question.” Not “draw a plausible intermediate picture.” Actual visual reasoning, with consequences. ...

When LLMs Meet Time: Why Time-Series Reasoning Is Still Hard

Dashboard numbers are seductive because they look obedient. Revenue goes up, traffic dips, latency spikes, inventory turns over, temperature drifts, volatility clusters. Put the sequence into a chart and the pattern seems almost polite. Then someone asks an LLM what happened. The model answers fluently. It may even sound like an analyst who has seen too many quarterly review decks and has developed a protective layer of confidence. But fluency is not temporal understanding. A model can describe a curve, name a trend, and still fail to understand which segment comes next, whether a transformation is correct, or whether a discontinuity is an error or a legitimate feature of the process. ...

When AI Stops Pretending: The Rise of Role-Playing Agents

A chatbot can act like a pirate for three turns. That is not the impressive part. A teenager with a Halloween hat can also do that. The harder problem begins when the agent has to remember what happened last week, preserve a recognizable personality across changing situations, make choices consistent with its motives, avoid borrowing another character’s copyrighted voice a little too enthusiastically, and still behave safely when the user pushes it outside the script. At that point, “pretend you are X” stops being a prompt trick and becomes a systems engineering problem. ...

Think Wide, Then Think Hard: Forcing LLMs to Be Creative (On Purpose)

Imagine a brainstorming meeting in which every new idea must immediately pass legal review, fit the quarterly budget, use the existing technology stack, satisfy six executives, and arrive formatted as a PowerPoint slide. The meeting will probably produce something feasible. It will also produce the same three ideas everyone proposed last quarter. ...

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Clinical AI has a paperwork problem. Not the usual paperwork problem, where doctors drown in documentation and everyone promises that software will save them. The more interesting problem sits one layer below: the paperwork used to judge the software may itself be wrong. That is the uncomfortable center of Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight, a paper that audits MedCalc-Bench, a benchmark for testing whether language models can compute medical risk scores from patient narratives.[1] The paper’s target is not a toy dataset. MedCalc-Bench covers 55 medical calculators and includes 10,053 training instances plus 1,047 test instances. Its labels were produced through an LLM-assisted pipeline: GPT-3.5 matched patient contexts to calculator questions, GPT-4 extracted clinical features, and Python scripts aggregated those features into final scores. ...

Memory, Bias, and the Mind of Machines: How Agentic LLMs Mislearn

TL;DR for operators: Memory is becoming the fashionable upgrade for AI agents: let the system remember past tasks, extract lessons, and improve without retraining the model. Sensible. Also slightly dangerous, in the same way giving a junior analyst a notebook is useful until they start rewriting the notebook after every meeting. The important result is not that memory sometimes contains bad facts. Everyone who has used software, people, or software made by people already knew that. The sharper point is that useful experience can become faulty during the act of consolidation. When an LLM agent compresses raw trajectories into reusable textual lessons, it may strip away conditions, merge unlike cases, or turn a narrow success into a general rule. The memory then looks cleaner while becoming less true. Very enterprise. ...

When Ambiguity Helps: Rethinking How AI Interprets Our Data Questions

A manager asks the analytics copilot, “Which regions are underperforming this quarter?” This sounds like a normal business question. It is also, technically, a small swamp. Which regions? Sales regions, operating regions, logistics regions, or customer billing regions? Underperforming against what: forecast, last quarter, budget, peers, margin, revenue, retention, or some executive’s private sense of disappointment? And “this quarter” may mean calendar quarter, fiscal quarter, quarter-to-date, or the latest complete quarter if the finance team has not closed the books yet. ...