Judge Math, Not by Its Parser
Opening — Why this matters now

The AI industry has discovered a wonderfully pedestrian way to misread progress: build models that can solve harder math problems, then grade them with evaluators that panic when 2040 minutes is not written as 34 hours. That is not a joke. It is the central irritation behind “Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity”, an arXiv paper that examines how mathematical reasoning benchmarks can be distorted by rigid symbolic verification.[1] ...
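
To make the failure mode concrete, here is a minimal sketch, not the paper's framework, of how an exact-match grader rejects an answer that is numerically identical to the reference once units are normalized. The function names and the tiny unit table are invented for illustration only.

```python
# Illustrative sketch only: a hypothetical grader, not the paper's implementation.
# It contrasts a rigid exact-match check with a unit-aware comparison on the
# "2040 minutes" vs. "34 hours" example from the text.

def rigid_check(prediction: str, reference: str) -> bool:
    """Rigid symbolic check: accept only an exact string match."""
    return prediction.strip() == reference.strip()

# Hypothetical unit table for the sake of the example.
MINUTES_PER_UNIT = {"minutes": 1, "hours": 60}

def unit_aware_check(prediction: str, reference: str) -> bool:
    """Normalize both answers to minutes, then compare the numeric values."""
    def to_minutes(answer: str) -> float:
        value, unit = answer.strip().split()
        return float(value) * MINUTES_PER_UNIT[unit]
    return to_minutes(prediction) == to_minutes(reference)

if __name__ == "__main__":
    pred, ref = "2040 minutes", "34 hours"
    print(rigid_check(pred, ref))       # False: the rigid grader marks a correct answer wrong
    print(unit_aware_check(pred, ref))  # True: 2040 minutes and 34 hours are the same quantity
```

The point of the sketch is the gap between the two return values: the model's arithmetic is right, and only the grader's tolerance for surface form decides whether the benchmark records a success or a failure.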