LLM Evaluation

SokoBench: When Reasoning Models Lose the Plot

A corridor is not supposed to be hard. There is one player. One box. One goal. No maze. No clever trap. No branching strategy tree with a thousand tempting wrong turns. The player stands at one end, the goal sits at the other, and the box is between them. Push the box along the corridor until it reaches the goal. That is the task. ...

CAR-bench: When Agents Don’t Know What They Don’t Know

A car assistant sounds simple until it touches the car. “Turn on the fan.” “Open the sunroof.” “Change my destination to Barcelona.” “Send an email before I arrive.” None of these requests looks philosophically difficult. They are not graduate-level math problems. They do not require poetic reasoning, legal interpretation, or a 128k-token context window stuffed with PDFs. They require the assistant to do something much less glamorous: check the state of the world, follow a few policies, use the right tools, and avoid pretending when something is missing. ...

When Alignment Is Not Enough: Reading Between the Lines of Modern LLM Safety

A chatbot refuses a dangerous request. Everyone relaxes. This is the small theatre of modern AI safety: the model says no, the dashboard records a refusal, the vendor presentation adds another green checkmark, and the compliance team moves on to the next risk register. Very tidy. Very comforting. Also, increasingly insufficient. The problem is not that refusal behavior is meaningless. It is not. The problem is that refusal behavior is only one visible symptom of safety alignment. Modern LLM safety now depends on a larger chain: training objectives, post-training choices, inference interfaces, prompt formats, tool access, evaluation design, and deployment context. When any part of that chain changes, the nice refusal seen in a benchmark may not survive contact with the product. ...

Prompt Wars: When Pedagogy Beats Cleverness

A prompt review meeting usually sounds more scientific than it is. One person likes the “coach” version. Another prefers the “Socratic” version because it sounds more educational. Someone says the prompt should mention metacognition. Someone else adds “be concise,” because apparently every prompt eventually becomes a corporate email with anxiety issues. Then the team ships the one that feels best. ...

Deep GraphRAG: Teaching Retrieval to Think in Layers

Retrieval has a management problem. Not the motivational-poster kind of management problem. The operational kind. A company asks its AI system a question about a contract, a customer dispute, a policy exception, or a technical incident. The answer is not sitting in one paragraph. It is distributed across definitions, transactions, policies, exceptions, and historical context. A flat vector search grabs a few semantically similar chunks and hopes the model can stitch them together. A global summarizer reads widely, compresses aggressively, and occasionally smooths away the exact fact that mattered. A local graph search follows nearby entities and may become very confident inside the wrong neighborhood. ...

Aligned or Just Agreeable? Why Accuracy Is a Terrible Proxy for AI–Human Alignment

Accuracy is comforting because it gives us a number. The model predicted the right label. The chatbot chose the same option as the survey respondent. The simulated customer picked the same product. Everyone claps, someone updates a dashboard, and the alignment problem is declared mostly solved. Unfortunately, decision-making is where accuracy goes to look respectable while quietly doing very little. ...

TowerMind: When Language Models Learn That Towers Have Consequences

Tower placement is a small decision until it is wrong. In a tower-defense game, a bad tower is not merely an inelegant plan. It is money spent, coverage lost, enemies leaked, and time wasted. The game does not care that the explanation sounded strategic. It only asks whether the tower actually touches the road. ...

Judging the Judges: When AI Evaluation Becomes a Fingerprint

The evaluator is not the scale. Evaluation looks boring until it changes the winner. A product team compares three candidate responses. A benchmark ranks five model releases. A content workflow asks an LLM judge to score generated SEO packs. The spreadsheet fills itself politely: five rubric dimensions, an overall score, maybe a few quoted receipts. Everyone pretends the judge is just a thermometer. ...

NPCs With Short-Term Memory Loss: Benchmarking Agents That Actually Live in the World

Minecraft is not the point. That may sound rude to the blocks, but it is the cleanest way to read MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents.1 The paper does use Minecraft. It does study an AI companion agent inside a live game world. It does report that a GPT-4o-powered setup failed on 71 out of 216 attempted subtasks, or roughly one-third of all attempted subtasks. ...

MobileDreamer: When GUI Agents Stop Guessing and Start Imagining

A phone screen is not difficult because it is visually beautiful. It is difficult because it keeps changing. Tap the wrong button, and a form disappears. Scroll too far, and the useful item vanishes below the fold. Open the wrong menu, and the agent spends the next three steps politely recovering from its own confidence. Anyone who has watched a GUI agent operate a mobile app has seen the pattern: it often looks competent right until the interface asks for a small amount of foresight. ...