Cover image

Options = Power: Turning Empowerment into a KPI for AI Agents

Login. That is where many agent evaluations become strangely unserious. A benchmark asks whether the agent completed a task. A dashboard records whether the browser session ended successfully. A monitoring system checks whether the tool call returned an error. Then the agent enters valid credentials and suddenly gains access to a much larger part of the environment. ...

October 3, 2025 · 16 min · Zelina