
Alibaba Cloud's Qwen team officially unveiled Qwen3.5-Omni on March 30, 2026, marking a significant leap forward in omnimodal artificial intelligence. As the successor to Qwen3-Omni, the new model natively processes text, image, audio, and video inputs while generating real-time streaming speech output. It ships in three variants (Plus, Flash, and Light) and supports a 256K-token context window, enough to handle over 10 hours of audio input or approximately 400 seconds of 720p audio-visual video sampled at 1 FPS.
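For developers, earlier Qwen-Omni releases were exposed through Alibaba Cloud Model Studio's OpenAI-compatible API, and Qwen3.5-Omni presumably follows suit. The sketch below shows what a streaming multimodal call could look like under that assumption; the endpoint URL, the model ID qwen3.5-omni-flash, and the voice name are placeholders carried over from Qwen3-Omni conventions, not confirmed values for this release.

```python
# A minimal sketch of a streaming omni call via an OpenAI-compatible endpoint.
# The base_url, model ID, and voice name below are assumptions, not confirmed
# values for Qwen3.5-Omni; check the Model Studio docs for the real ones.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

completion = client.chat.completions.create(
    model="qwen3.5-omni-flash",  # hypothetical ID for the Flash variant
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Describe this chart aloud."},
            ],
        }
    ],
    modalities=["text", "audio"],                # request transcript plus speech
    audio={"voice": "Cherry", "format": "wav"},  # assumed voice name
    stream=True,                                  # speech arrives as streaming chunks
)

for chunk in completion:
    delta = chunk.choices[0].delta
    if delta.content:  # text transcript deltas
        print(delta.content, end="", flush=True)
    audio = getattr(delta, "audio", None)  # provider-specific base64 audio chunks
    if audio:
        pass  # decode audio["data"] and feed it to an audio player
```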
The most striking upgrade lies in multilingual capabilities. Speech recognition has expanded from 19 to 113 languages and dialects, while speech generation has grown from 10 to 36 languages. Both the Thinker and Talker components now employ a Hybrid-Attention Mixture-of-Experts (MoE) architecture, and the model was natively pretrained on large volumes of text and visual data plus more than 100 million hours of audio-visual content. According to the team, this native multimodal pretraining delivers strong perception and generation across modalities without sacrificing performance in any single one.
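To make the Thinker-Talker division of labor concrete, the minimal Python sketch below imitates the handoff: the Thinker streams text tokens and the Talker converts them into audio frames as they arrive, which is what lets speech playback begin before the full reply exists. Every function and token here is an illustrative stand-in, not the released architecture code.

```python
# Illustrative sketch (not the actual implementation) of the Thinker-Talker
# streaming split: text tokens flow into speech frames incrementally.
from typing import Iterator

def thinker(prompt: str) -> Iterator[str]:
    """Stand-in for the text-reasoning MoE: yields response tokens one by one."""
    for token in ["Sure,", " the", " chart", " shows", " growth."]:
        yield token

def talker(tokens: Iterator[str]) -> Iterator[bytes]:
    """Stand-in for the speech MoE: turns incoming text tokens into audio frames."""
    for token in tokens:
        # A real Talker predicts discrete codec tokens decoded into waveforms;
        # dummy bytes here just demonstrate the streaming handoff.
        yield f"<audio:{token.strip()}>".encode()

def respond(prompt: str) -> None:
    # Audio frames become available as soon as the first tokens are produced.
    for frame in talker(thinker(prompt)):
        print(frame.decode())

respond("Describe this chart aloud.")
```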
On the interaction front, Qwen3.5-Omni introduces semantic interruption, a feature designed to distinguish genuine user interjections from ambient noise and background chatter. Voice cloning capabilities enable developers to build custom AI assistants with consistent voice identities. Perhaps most intriguing is what Alibaba calls "Audio-Visual Vibe Coding" — the model can watch a screen recording or video of a coding task and write functional code based purely on what it sees and hears, without requiring any text prompt. This represents a glimpse into a future where AI operates directly within human workflows rather than as an external tool.
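Semantic interruption, at its core, is a gating decision: pause speech output only when incoming audio is judged to be addressed to the assistant rather than background chatter. The toy sketch below illustrates that decision with a hypothetical directedness score and threshold; neither is a Qwen API, and the real model makes this judgment internally from the audio itself.

```python
# A hedged sketch of the gating logic semantic interruption implies: instead
# of barging in on any detected voice, playback pauses only for speech
# classified as directed at the assistant. The score and threshold below are
# hypothetical placeholders, not Qwen parameters.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    directed_score: float  # 0..1, probability the speech addresses the assistant

DIRECTED_THRESHOLD = 0.8  # assumed tuning value

def should_interrupt(utterance: Utterance) -> bool:
    """Interrupt TTS playback only for speech judged to be a genuine interjection."""
    return utterance.directed_score >= DIRECTED_THRESHOLD

# Background chatter scores low; a direct "wait, stop" scores high.
print(should_interrupt(Utterance("so anyway, as I was saying...", 0.12)))      # False
print(should_interrupt(Utterance("wait, stop, switch to Spanish", 0.93)))      # True
```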
On benchmarks, Qwen3.5-Omni-Plus achieved 215 state-of-the-art results across audio, audio-video understanding, reasoning, and interaction tasks. On multilingual voice-stability benchmarks, the Plus variant outperformed ElevenLabs, GPT-Audio, and MiniMax across 20 languages. With newly added real-time web search, the model can also answer questions about breaking news and live market data. Taken together, these capabilities position Qwen3.5-Omni not merely as a language model but as a comprehensive AI platform offering near-human interactive experiences across dozens of languages and modalities.
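If the web-search integration follows the pattern of other Qwen models served through DashScope, it would be toggled per request via the enable_search flag in the OpenAI-compatible API. The sketch below assumes exactly that, including the hypothetical qwen3.5-omni-plus model ID; whether the omni model uses this same flag is an assumption.

```python
# Sketch of enabling real-time web search on a single request, assuming the
# DashScope OpenAI-compatible mode and its enable_search flag apply to this
# model. The model ID is a hypothetical placeholder for the Plus variant.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # hypothetical model ID
    messages=[{"role": "user", "content": "What moved the Hang Seng Index today?"}],
    extra_body={"enable_search": True},  # DashScope-specific search toggle
)
print(resp.choices[0].message.content)
```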