OpenAI has unveiled new audio models within its application programming interface (API), promising improved accuracy and reliability. The San Francisco-based artificial intelligence company announced three new AI models covering speech-to-text transcription and text-to-speech (TTS). According to OpenAI, these models will let developers build applications around agent-centric workflows, and the company suggested the API could streamline automated customer support operations. The new models build on the company’s GPT-4o and GPT-4o mini AI models.
OpenAI Introduces New Audio Models in API
In a recent blog post, OpenAI elaborated on the new API models. The company noted that it has already shipped several agentic tools, such as Operator, Deep Research, the Computer-Using Agent, and the Responses API with built-in tools. However, OpenAI emphasized that the full potential of these agents is realized only when they can operate intuitively and interact across mediums beyond text.
The three new models are GPT-4o-transcribe and GPT-4o-mini-transcribe for speech-to-text, and GPT-4o-mini-tts for text-to-speech. OpenAI asserts that they outperform its Whisper models, released in 2022. Unlike those predecessors, however, the new models are not open-source.
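For developers, the call shape mirrors OpenAI's existing audio transcription endpoint. Below is a minimal sketch using the official openai Python library; the model name comes from the announcement, while the file name and surrounding setup are placeholder assumptions.

```python
# Minimal sketch: transcribing an audio file with the new speech-to-text
# models through the audio transcriptions endpoint. The file path is a
# placeholder; the client reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)
```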
Focusing on GPT-4o-transcribe, the company highlighted its improved (that is, lower) word error rate (WER) on the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) benchmark, which assesses models on multilingual speech across more than 100 languages. OpenAI attributes this improvement to focused training methods, including reinforcement learning (RL) and extensive mid-training on high-quality audio datasets.
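For context, WER measures the fraction of words a model gets wrong: the number of substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the reference's word count, with lower being better. A toy illustration of the metric, not tied to OpenAI's evaluation code:

```python
# Illustrative only: WER is the word-level edit distance between the
# hypothesis and reference transcripts, normalised by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```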
The speech-to-text models have been engineered to capture audio reliably even in difficult conditions, such as heavy accents, noisy environments, and varying speaking speeds.
The GPT-4o-mini-tts model also brings notable advances. OpenAI claims it can deliver speech with customizable inflection, intonation, and emotional expressiveness, letting developers build a wide range of applications, from customer service to storytelling. Notably, the model currently supports only artificial, preset voices.
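A minimal sketch of steering that delivery through the speech endpoint, assuming the instructions-based control the announcement describes; the voice name and instruction string here are illustrative:

```python
# Minimal sketch: generating steerable speech with gpt-4o-mini-tts.
# "coral" is one of OpenAI's preset voices; the instructions string
# is just an example of tone control.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling! Your order is on its way.",
    instructions="Speak like a warm, upbeat customer-support agent.",
) as response:
    response.stream_to_file("reply.mp3")
```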
Pricing details for OpenAI’s API indicate that access to the GPT-4o-based audio model costs $40 (approximately Rs. 3,440) per million input tokens and $80 (around Rs. 6,880) per million output tokens. Meanwhile, the GPT-4o mini-based audio models are priced at $10 (approximately Rs. 860) per million input tokens and $20 (around Rs. 1,720) per million output tokens.
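At those rates, per-request costs are straightforward to estimate. A quick back-of-the-envelope helper using the figures quoted above, with hypothetical token counts:

```python
# Cost check using the per-million-token rates quoted above
# (USD per 1M tokens, as reported here).
RATES = {
    "gpt-4o":      {"input": 40.0, "output": 80.0},
    "gpt-4o-mini": {"input": 10.0, "output": 20.0},
}

def audio_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# e.g. 50k tokens in, 20k tokens out on the mini model:
print(f"${audio_cost('gpt-4o-mini', 50_000, 20_000):.2f}")  # $0.90
```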
All three audio models are now available to developers via the API. OpenAI also plans to release an integration with its Agents software development kit (SDK) to ease the development of voice agents.