CivBench: When AI Stops Guessing and Starts Planning
Scoreboards are comforting. They reduce a messy contest to one neat line: winner, loser, maybe a score. Executives like them, product teams like them, investors like them, and benchmark dashboards absolutely adore them. Strategy, unfortunately, is rude enough not to fit inside that line. A company can make the right decisions and still lose because the market turns. A trading agent can survive a bad regime by managing exposure well, then look mediocre because the final return is not spectacular. A planning system can stumble into success after making terrible intermediate choices. Outcome-only evaluation is clean, but cleanliness is not the same as truth. It is often just a good-looking loss of information. ...