Despite not yet being a benchmark, the First Proof project is by far the best measure of model usefulness for science and math research available today, and I very much hope that frontier labs continue to take future rounds seriously.
https://www.daniellitt.com/blog/2026/2/20/mathematics-in-the-library-of-babel
