The Molecule Was Right. The Reasoning Was Not.
TL;DR for operators

Chemistry teams should stop treating a correct molecule, reaction product, or ranked option as proof that an AI system reasoned chemically. That is the comfortable interpretation. It is also, inconveniently, the one ChemCoTBench-V2 was built to dismantle. The paper introduces a benchmark that evaluates chemical language models at three separate levels: final-answer correctness, template adherence, and step-wise chemical validity. The important move is not “add more benchmark rows.” The move is to force the model to expose intermediate chemical commitments (rings, scaffolds, fragments, reaction types, edit plans, condition rankings, product constructions) and then check those commitments with deterministic chemistry rules or verified reference traces.[1] ...
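To make the three levels concrete, here is a minimal Python sketch of scoring one model trace on each of them separately. It is an illustration under stated assumptions, not the benchmark's actual harness: the field names in `REQUIRED_FIELDS`, the `Verdict` type, and the `evaluate` function are hypothetical, final answers are compared as plain strings, and the "deterministic chemistry rule" is a toy ring counter that tallies ring-closure labels in a SMILES string. A real pipeline would more plausibly use a cheminformatics toolkit for that check.

```python
from dataclasses import dataclass

# Hypothetical template: the fields a trace must expose to be checkable.
REQUIRED_FIELDS = ("input_smiles", "ring_count", "final_answer")


def count_ring_closures(smiles: str) -> int:
    """Toy deterministic rule: count ring-closure bonds in a SMILES string.

    Each ring-closure label appears twice (open and close), so the number of
    ring bonds is half the number of labels. Digits inside bracket atoms
    (isotopes, charges, H counts) are skipped.
    """
    labels = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip the whole bracket atom, e.g. [13CH4] or [NH4+].
            end = smiles.find("]", i)
            i = len(smiles) if end == -1 else end + 1
            continue
        if ch == "%":
            # Two-digit ring label such as %12 counts as one label.
            labels += 1
            i += 3
            continue
        if ch.isdigit():
            labels += 1
        i += 1
    return labels // 2


@dataclass
class Verdict:
    answer_correct: bool  # level 1: final-answer correctness
    template_ok: bool     # level 2: template adherence
    steps_valid: bool     # level 3: step-wise chemical validity


def evaluate(trace: dict, reference_answer: str) -> Verdict:
    """Score one trace on the three levels independently."""
    template_ok = all(field in trace for field in REQUIRED_FIELDS)
    answer_correct = trace.get("final_answer") == reference_answer
    # An intermediate claim can only be checked if the template exposed it.
    steps_valid = template_ok and (
        trace["ring_count"] == count_ring_closures(trace["input_smiles"])
    )
    return Verdict(answer_correct, template_ok, steps_valid)


# Naphthalene has two rings; this made-up trace claims one but still
# lands on the reference answer.
trace = {
    "input_smiles": "c1ccc2ccccc2c1",
    "ring_count": 1,
    "final_answer": "aromatic",
}
verdict = evaluate(trace, reference_answer="aromatic")
print(verdict)
```

The example trace passes the first two levels and fails the third, which is the failure mode a final-answer-only score cannot see.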