
OpenAI’s o3 Model Falls Short: Only 10% on Benchmark

OpenAI’s recently launched o3 artificial intelligence (AI) model is facing scrutiny over its performance on a key benchmark. Epoch AI, the organization behind the FrontierMath benchmark, revealed that the publicly available version of the o3 model scored only 10 percent on the assessment, a sharp drop from the 25 percent score OpenAI claimed at the model’s launch. Even so, the discrepancy does not necessarily mean that OpenAI misrepresented its performance metrics.

OpenAI’s o3 AI Model Scores 10 Percent on FrontierMath

In December 2024, OpenAI hosted a livestream on platforms like YouTube, where it introduced the o3 AI model. At this event, the company emphasized the model’s enhanced abilities, particularly its improved capacity for reasoning tasks.

To support its claims, OpenAI shared the model’s benchmark performances across various well-known tests, including FrontierMath, a test crafted by Epoch AI. FrontierMath is recognized for its difficulty and integrity, having been developed by over 70 mathematicians. Moreover, the problems included in the test are all new and unpublished, making it a rigorous challenge. Up until December, no AI model had managed to solve more than nine percent of the questions in a single attempt.

During the launch, OpenAI’s Chief Research Officer, Mark Chen, asserted that o3 had achieved a groundbreaking score of 25 percent on the FrontierMath test. This claim could not be verified at the time, as the model was not publicly accessible. Following the release of o3 and o4-mini last week, Epoch AI posted on X (formerly Twitter) that the released o3 model actually scored 10 percent on the test.

Although a score of 10 percent still ranks the AI model highest on the FrontierMath test, it is less than half of the initial claim. This revelation has ignited discussions among AI enthusiasts regarding the credibility of benchmark scores.

The difference in scores does not indicate deception on OpenAI’s part. It is plausible that the model used more computational resources to achieve the higher score, while the commercial version was optimized for power efficiency, which may have impacted its performance.

Separately, ARC Prize, the organization behind the ARC-AGI benchmark (a measure of an AI model’s general intelligence), also addressed the score discrepancy. The organization confirmed in a post on X, “The released o3 is a different model from what we tested in December 2024.” It noted that the compute tiers available for the released o3 model are smaller than those used during the earlier testing, and added that the o3 model had not been trained on ARC-AGI data, not even during pre-training.

ARC Prize announced plans to retest the released o3 AI model and publish updated findings. The organization will also reevaluate the o4-mini model, categorizing previous scores as “preview.” There remains uncertainty around whether the public version of o3 will also underperform in this new test.
