Classical Planning

TL;DR for operators

A new benchmark does not say that LLMs are hopeless at planning. That would be too easy, and also false. It says something more useful: frontier models are now strong enough to solve many formal planning tasks, but their competence still weakens when the task stops giving them semantically meaningful labels.1 ...
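
To make "semantically meaningful labels" concrete, here is a minimal sketch of what stripping them from a planning task could look like. It is an illustration only, not the benchmark's own procedure: the toy action `ACTION`, the `RESERVED` set, the `obfuscate` helper, and the `symN` naming scheme are all hypothetical. The logical structure of the task stays the same; only the names that hint at its meaning are replaced.

```python
import re

# Hypothetical toy action in PDDL-like syntax; not taken from the benchmark.
ACTION = (
    "(:action pick-up :parameters (?b) "
    ":precondition (and (clear ?b) (hand-empty)) "
    ":effect (holding ?b))"
)

# Logical connectives that must survive renaming (assumed minimal set).
RESERVED = {"and", "or", "not"}


def obfuscate(pddl):
    """Replace each meaningful identifier with an opaque symbol, consistently.

    Returns the rewritten text and the name-to-symbol mapping.
    """
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name in RESERVED:
            return name
        if name not in mapping:
            # Assign symbols in order of first appearance.
            mapping[name] = f"sym{len(mapping)}"
        return mapping[name]

    # Match bare identifiers only: the lookbehind skips :keywords,
    # ?variables, and the middle of an identifier already being scanned.
    pattern = r"(?<![:?\w-])[a-zA-Z][\w-]*"
    return re.sub(pattern, rename, pddl), mapping


obfuscated, mapping = obfuscate(ACTION)
print(obfuscated)
print(mapping)
```

A solver that relies only on the formal structure should treat both versions identically. A model that leans on names such as `pick-up` or `holding` loses that support once they become `sym0` and `sym3`.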