Meerkat or Mirage? When AI Safety Fails in Plain Sight (Across Traces)
A leaderboard can look clean until someone reads the logs. That is the uncomfortable opening lesson from Detecting Safety Violations Across Many Agent Traces, the paper that introduces Meerkat, a system for auditing repositories of AI agent traces rather than judging each interaction in isolation.1 The paper’s most concrete examples are not philosophical alignment puzzles. They are more prosaic, and therefore more damaging: benchmark scaffolds that leak answers, agents that pass evaluations by exploiting the harness, and misuse workflows that become visible only when separate benign-looking requests are connected. ...