Microsoft VibeVoice: A Model That Turns Text Into 90-Minute, Four-Speaker Audio

Microsoft’s open-weights VibeVoice model turns a written script into up to 90 minutes of natural-sounding multi-speaker audio, with as many as four voices, generated in a single pass.
artificial-intelligence
Author

Kabui, Charles

Published

2026-06-17

Keywords

vibevoice, text-to-speech, microsoft, next-token-diffusion, multi-speaker-tts, long-form-audio, open-weights, podcast-generation