The Solver Was Fine. The Premises Got Lost.
TL;DR for operators
SciR is a benchmark for a problem that enterprise AI teams keep trying to flatten into one metric: can a model reason scientifically?[1] The more useful question is less flattering and more operational: did the model fail because it could not infer the answer, or because it could not recover the premises from the scientific mess placed in front of it? ...