
OpenAI Unveils Enhanced Audio Models for Developers


On Thursday, OpenAI unveiled a set of new audio models within its application programming interface (API), promising improved accuracy and reliability. The San Francisco-based artificial intelligence company introduced three models designed for speech-to-text transcription and text-to-speech (TTS) tasks. The firm said the models will let developers build applications with more agentic workflows and help businesses automate customer support operations. All three are built on the company’s GPT-4o and GPT-4o mini AI models.

OpenAI Brings New Audio Models to Its API

A recent blog post provided further insights into the new API-specific models. OpenAI noted its history of launching various AI agents, including Operator, Deep Research, Computer-Using Agents, and the Responses API, all featuring integrated tools. However, the company emphasized that the true capabilities of these agents can be fully realized only when they operate intuitively and across different mediums beyond just text.

The newly introduced audio models include the GPT-4o-transcribe and GPT-4o-mini-transcribe for speech-to-text functions, and the GPT-4o-mini-tts for text-to-speech. OpenAI claims these models surpass the performance of its previous Whisper models released in 2022, though it’s worth noting that the new models are not open-source.
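The two speech-to-text models are reached through the API's existing transcription endpoint. A minimal sketch, assuming the OpenAI Python SDK (`pip install openai`) and an `OPENAI_API_KEY` in the environment; the model names are the ones reported above and the endpoint shape follows the SDK's `audio.transcriptions` interface:

```python
def pick_transcribe_model(mini: bool = False) -> str:
    """Choose between the two speech-to-text models named in the article."""
    return "gpt-4o-mini-transcribe" if mini else "gpt-4o-transcribe"


def transcribe(path: str, mini: bool = False) -> str:
    """Send an audio file for transcription and return the recognized text."""
    from openai import OpenAI  # deferred import: needs the openai package + API key

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model=pick_transcribe_model(mini), file=audio
        )
    return result.text
```

The cheaper mini model is a drop-in swap, so an application can route short or low-stakes clips to it and reserve the larger model for harder audio.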

Focusing on the GPT-4o-transcribe model, OpenAI highlighted its improved “word error rate” (WER) on the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) benchmark, which evaluates AI models on multilingual speech across 100 different languages. The gains stem from targeted training methodologies, including reinforcement learning (RL) and extensive mid-training with high-quality audio datasets.

The speech-to-text models are designed to accurately capture audio in various challenging conditions, such as heavy accents, noisy settings, and differing speech rates.

Meanwhile, the GPT-4o-mini-tts model boasts significant enhancements as well, allowing for customizable inflections, intonations, and emotional expressiveness. This feature will enable developers to create applications suitable for diverse tasks, such as customer support and creative storytelling. However, it is important to note that the model only provides artificial and preset voice options.
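The steerable delivery described above maps onto an instructions-style field in the speech request. A hedged sketch using the SDK's `audio.speech` endpoint: the `instructions` parameter and the preset voice name `"alloy"` are assumptions drawn from OpenAI's published examples, not confirmed by the article.

```python
def speech_args(text: str, style: str) -> dict:
    """Build kwargs for client.audio.speech.create()."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": "alloy",       # one of the preset voices; custom voices aren't offered
        "input": text,
        "instructions": style,  # e.g. "speak like a sympathetic support agent"
    }


def synthesize(text: str, style: str, out_path: str = "speech.mp3") -> None:
    """Stream synthesized speech to an audio file."""
    from openai import OpenAI  # deferred import: needs the openai package + API key

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        **speech_args(text, style)
    ) as response:
        response.stream_to_file(out_path)
```

Because the style is just a text field, the same model can serve both a calm support bot and an animated storytelling voice without switching models.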

According to OpenAI’s API pricing page, the GPT-4o-based audio model will be priced at $40 (approximately Rs. 3,440) per million input tokens and $80 (approximately Rs. 6,880) per million output tokens. In contrast, the GPT-4o mini-based audio models are set at $10 (about Rs. 860) per million input tokens and $20 (around Rs. 1,720) per million output tokens.
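The per-million-token rates quoted above translate into a simple cost calculation. A worked example in Python, using the USD figures from the article (the model-name keys are illustrative labels for the two pricing tiers):

```python
# USD per one million tokens, as quoted from OpenAI's API pricing page.
PRICES = {
    "gpt-4o-transcribe": {"input": 40.0, "output": 80.0},
    "gpt-4o-mini-transcribe": {"input": 10.0, "output": 20.0},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a request from its token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

For instance, 500,000 input tokens plus 250,000 output tokens on the mini tier costs (0.5 × $10) + (0.25 × $20) = $10, a quarter of what the same usage costs on the GPT-4o tier.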

All audio models are now accessible to developers through the API. Additionally, OpenAI is launching an integration with its Agents software development kit (SDK) to assist users in creating voice agents.
