
OpenAI’s o3 AI Model Scores Just 10% on FrontierMath


OpenAI’s recently launched o3 artificial intelligence (AI) model is facing scrutiny for its performance on a key benchmark. Epoch AI, the developer of the FrontierMath evaluation, revealed that the publicly accessible version of the o3 model achieved a score of just 10 percent, significantly lower than the 25 percent claimed by the company during the model’s introduction. Mark Chen, OpenAI’s chief research officer, had touted the model as setting a new record with that score. It’s important to note, however, that this discrepancy does not imply any dishonesty on OpenAI’s part regarding the metrics presented.

OpenAI’s o3 AI Model Scores 10 Percent on FrontierMath

In December 2024, OpenAI conducted a livestream across various platforms to unveil the o3 AI model. During this event, the organization emphasized the enhanced capabilities of the large language model (LLM), particularly its improved performance in reasoning tasks.

OpenAI illustrated its claims by presenting benchmark scores from several well-known tests, including FrontierMath, a mathematical evaluation created by Epoch AI. The test, recognized for its difficulty and integrity, was crafted by a team of over 70 mathematicians who designed brand-new, unpublished problems. Before December, no AI model had reportedly solved more than nine percent of the questions in a single attempt.

At the launch, Chen asserted that the o3 model had achieved a remarkable score of 25 percent on this benchmark. However, external validation of this achievement was not feasible at the time, as the model had not been publicly released. Following the release of o3 and the o4-mini models last week, Epoch AI posted on X (formerly Twitter), asserting that the o3 model’s actual score was only 10 percent.

Despite being the highest score achieved on FrontierMath, the 10 percent result is less than half of what OpenAI had previously reported. This revelation has sparked discussions among AI enthusiasts regarding the credibility of benchmark scores.

The difference in scoring does not signify that OpenAI misrepresented its model’s capabilities. It is likely that the unreleased variant of the model used more computational power to achieve the claimed score, while the publicly released version may have been optimized for efficiency, which would explain the lower result.

In a related development, ARC Prize, the organization responsible for the ARC-AGI benchmark that assesses an AI model’s general intelligence, also addressed the discrepancies on X. They confirmed that “the released o3 is a different model from what we tested in December 2024.” It was noted that the compute tiers of the released version are smaller than those of the earlier model tested by ARC. Additionally, ARC Prize clarified that the released o3 was not trained on ARC-AGI data, even during the pre-training phase.

ARC Prize announced plans to conduct a re-evaluation of the o3 AI model and will publish the revised results. The organization will also reassess the o4-mini model and categorize earlier scores as “preview.” The outcome of the re-tests, particularly whether the released o3 model will underperform on the ARC-AGI assessment, remains uncertain.

