Epoch AI, a California-based research institute, unveiled a new benchmark for artificial intelligence (AI) last week. Named FrontierMath, the benchmark evaluates large language models (LLMs) on their reasoning and mathematical problem-solving skills. The organization asserts that many existing mathematics benchmarks are inadequate due to issues such as data contamination, which inflates AI models' scores, and claims that even top-performing LLMs scored below two percent on FrontierMath.
Epoch AI Introduces FrontierMath Benchmark
In a post shared on X (formerly Twitter), Epoch AI revealed that it worked with over 60 mathematicians to develop hundreds of original, unpublished math problems. According to the organization, some of these problems could take mathematicians several hours to solve. The initiative to create FrontierMath arose from the shortcomings of existing benchmarks such as GSM8K and MATH, on which AI models often attain high scores.
The organization explained that these elevated scores stem largely from data contamination: the models have previously been exposed to the questions, enabling them to solve the problems more easily.
FrontierMath addresses this concern by using novel problems that have not appeared elsewhere, reducing the risk of data contamination. The benchmark spans a diverse array of challenging questions in number theory, real analysis, algebraic geometry, and set theory, specifically Zermelo–Fraenkel set theory. Epoch AI emphasizes that all questions are designed to be “guess proof,” meaning they cannot be solved without sound reasoning.
The organization stressed the importance of benchmarks that test creative problem-solving, where AI must sustain reasoning across multiple steps. Many experts in the field share the view that current benchmarks do not adequately reflect the capabilities of advanced AI models.
In response to the launch of the new benchmark, Noam Brown, a researcher at OpenAI who contributed to the development of the company’s o1 model, expressed his enthusiasm on X, stating, “I love seeing a new eval with such low pass rates for frontier models.”