At Google I/O 2025, Google unveiled its latest audio generation features built on the Gemini 2.5 models. The Mountain View-based company is now letting both developers and users experiment with these capabilities on its platform. The two main additions are native audio dialog and controllable text-to-speech (TTS), both offered with the Gemini 2.5 Flash preview. Native audio dialog generates human-like spoken responses to user prompts, while the TTS feature converts scripts into conversational speech. However, these features are not yet available through application programming interfaces (APIs) for developers.
Google Showcases Gemini 2.5 Flash’s Audio Output Capabilities
In a recent blog post, Google described the audio generation modes and how developers might use them to build new user experiences. Currently, users can try the native audio dialog feature in the Stream tab of Google AI Studio, while the TTS capability is available for testing in the Generate Media tab.
Native audio dialog with the Gemini 2.5 Flash preview supports real-time conversation between users and the AI, accepting both typed and spoken prompts. Unlike traditional text-to-speech systems, which first produce text and then convert it to speech, this feature generates audio directly in response to user input.
This method of audio generation has several benefits, including support for affective dialogue. Gemini 2.5 Flash can recognize the tone and emotion in a user’s voice (for example, whether they sound frightened, angry, or astonished) and adjust its responses accordingly.
Additionally, the feature can express emotion in its own speech, vary accents and speaking styles, use tools such as Google Search, and support more than 24 languages.
The controllable TTS feature, meanwhile, can generate multi-speaker dialogues, apply emotional expression and accent variations when narrating a script, and give control over delivery speed and pronunciation. It supports the same set of more than 24 languages and can mix languages within a single output.
Google says it evaluated potential risks throughout the development of these capabilities. The company used internal evaluations alongside red teaming to identify and address vulnerabilities. It also emphasized that all audio output from the Gemini 2.5 models is embedded with SynthID, its watermarking technology, so that the audio can be identified as AI-generated.