
OpenAI’s o3 Model Falls Short: Only 10% on Benchmark

OpenAI’s recently launched o3 artificial intelligence (AI) model is facing scrutiny over its performance on a key benchmark. Epoch AI, the organization behind the FrontierMath benchmark, revealed that the publicly available version of the o3 model scored only 10 percent on the assessment, a sharp drop from the 25 percent score OpenAI claimed at the model’s launch. Even so, the discrepancy does not necessarily mean that OpenAI misrepresented its performance metrics.

OpenAI’s o3 AI Model Scores 10 Percent on FrontierMath

In December 2024, OpenAI hosted a livestream on platforms like YouTube, where it introduced the o3 AI model. At this event, the company emphasized the model’s enhanced abilities, particularly its improved capacity for reasoning tasks.

To support its claims, OpenAI shared the model’s benchmark performances across various well-known tests, including FrontierMath, a test crafted by Epoch AI. FrontierMath is recognized for its difficulty and integrity, having been developed by over 70 mathematicians. Moreover, the problems included in the test are all new and unpublished, making it a rigorous challenge. Up until December, no AI model had managed to solve more than nine percent of the questions in a single attempt.

During the launch, OpenAI’s Chief Research Officer, Mark Chen, asserted that o3 had achieved a groundbreaking score of 25 percent on the FrontierMath test. This claim could not be verified at the time, as the model was not publicly accessible. Following the release of o3 and o4-mini last week, Epoch AI posted on X (formerly Twitter) that the released o3 model actually scored 10 percent on the test.

Although a score of 10 percent still ranks the AI model highest on the FrontierMath test, it is less than half of the initial claim. This revelation has ignited discussions among AI enthusiasts regarding the credibility of benchmark scores.

The difference in scores does not indicate deception on OpenAI’s part. It is plausible that the model used more computational resources to achieve the higher score, while the commercial version was optimized for power efficiency, which may have impacted its performance.

Separately, ARC Prize, the organization behind the ARC-AGI benchmark (a measure of an AI model’s general intelligence), also addressed the score discrepancy. The organization confirmed in a post on X, “The released o3 is a different model from what we tested in December 2024.” It noted that the compute tiers available for the released o3 model are smaller than those used during the earlier testing, and added that the o3 model had not been trained on ARC-AGI data, not even during pre-training.

ARC Prize announced plans to retest the released o3 AI model and publish updated findings. The organization will also reevaluate the o4-mini model, categorizing previous scores as “preview.” There remains uncertainty around whether the public version of o3 will also underperform in this new test.
