Google launches Gemini 3.1 Flash TTS with 200+ inline audio tags for voice control

Google's DeepMind unit released Gemini 3.1 Flash TTS in public preview on April 15, 2026, introducing a new text-to-speech model that can be steered through natural-language commands embedded in the input text. The model is available via the Gemini API and Google AI Studio for developers, Vertex AI for enterprise customers, and Google Vids for Workspace users. Google described it as its most controllable and expressive speech model to date.

The headline feature is a library of more than 200 bracketed audio tags, including [whispers], [laughs], [excited], [slow] and [long pause], that developers can drop inline to shift vocal style, pacing and emphasis mid-sentence. Gemini 3.1 Flash TTS supports more than 70 languages, 30 prebuilt baseline voices and native multi-speaker dialogue, letting a single API call generate conversations without stitching separate voice outputs together. It also exposes fine-grained control over regional accents, including Brixton, RP, Southern, "Valley" and Transatlantic variants of English. Google's onboarding material ships with an unusually elaborate sample prompt that defines a fictional London radio DJ down to scene, posture and vocal disposition.
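To make the inline-tag idea concrete, here is a minimal sketch of how a developer might assemble a tagged, multi-speaker prompt as plain text before sending it to the model. It uses only the five tags quoted above. The helper names (tag, find_tags, dialogue) and the "Speaker: line" transcript layout are illustrative assumptions, not part of Google's API, and the sketch deliberately stops short of the network call.

```python
import re

# Tags quoted in the article; the full 200+ tag library is not listed there.
KNOWN_TAGS = {"whispers", "laughs", "excited", "slow", "long pause"}

# Matches one bracketed tag such as "[long pause]" and captures its name.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def tag(name: str) -> str:
    """Return a bracketed inline tag (hypothetical helper)."""
    if name not in KNOWN_TAGS:
        raise ValueError(f"unknown tag: {name!r}")
    return f"[{name}]"


def find_tags(text: str) -> list[str]:
    """List the tag names that appear in a prompt, in order."""
    return TAG_PATTERN.findall(text)


def dialogue(turns: list[tuple[str, str]]) -> str:
    """Join (speaker, line) pairs into one transcript, one turn per line.

    The "Speaker: line" layout is an assumption made for illustration.
    """
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)


prompt = dialogue([
    ("Host", f"{tag('excited')} Welcome back to the show!"),
    ("Guest", f"{tag('whispers')} Thanks. {tag('long pause')} Glad to be here."),
])
print(prompt)
```

Keeping the tags as ordinary text means the whole prompt can be logged, diffed and checked (for example with find_tags) before any audio is generated.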

On the Artificial Analysis TTS leaderboard, which aggregates thousands of blind human preference votes, the model earned an Elo score of 1,211 and ranked second overall. The same analysis placed it in the "most attractive quadrant" for its balance of output quality and cost. The release drops Gemini 3.1 Flash TTS into a market led by ElevenLabs and OpenAI's TTS API; Google has not published direct comparisons against either. It also follows Gemini 3.1 Flash Live, the company's real-time dialogue model announced three weeks earlier, giving Google two specialized voice models aimed at different workloads. Pricing for the new model has not been disclosed.

Every audio clip produced by Gemini 3.1 Flash TTS carries a SynthID watermark, an imperceptible signal Google says will help detect AI-generated audio and mitigate misinformation risks. The published model card lists an 8,192-token input window, a 16,384-token output window, and a January 2025 knowledge cutoff. Independent developer Simon Willison wrote that he used Gemini 3.1 Pro to build a small interface for experimenting with the model shortly after launch. Google is pitching the release at accessibility tools, audiobooks, voice-driven banking systems, game audio and localized customer-experience workflows.
