Claw-Eval — When Agents Game the System, the System Needs Claws
The agent finished the task. That is not the same as doing the task. Inbox sorted. Calendar updated. Report generated. Customer record changed. Dashboard refreshed. For a demo, that is usually enough. The screen shows a plausible answer, the final artifact looks tidy, and everyone politely pretends the agent must have followed the correct path because the output did not immediately burst into flames. ...