VibeVoice: Next-Gen Open-Source Text-to-Speech Model
90-minute multi-speaker podcast generation, pushing the boundaries of speech synthesis. Developed by Microsoft Research Asia.
What is VibeVoice?
VibeVoice is a cutting-edge speech generation model from Microsoft Research Asia. It uses a next-token diffusion mechanism to generate up to 90 minutes of high-quality speech with up to 4 speakers in natural conversation. The generated speech includes natural breathing, appropriate pauses, and even occasional mouth sounds, as if someone were really talking in your ear.
90 minutes of continuous audio
Up to 4 speakers
7.5 Hz ultra-low frame rate
Core Capabilities
Key Capabilities of VibeVoice
Ultra-Long Duration
- Generate up to 90 minutes of continuous high-quality audio, breaking through the duration limits of traditional TTS. Ultra-low frame-rate compression (7.5 Hz) significantly reduces computational load.
Multi-Speaker Natural Interaction
- Supports up to 4 different speakers, each with a unique voice and speaking style that stay consistent throughout. Breathing and pause cues are added automatically at speaker transitions.
Realistic Details & Atmosphere
- Renders breathing, pauses, and other speech details naturally, and can add background music and other audio effects in appropriate scenes to enhance the atmosphere.
LatentLM Architecture
- Uses the LatentLM next-token diffusion framework, encoding speech into continuous latent vectors via a VAE for autoregressive continuous speech generation.
Open Source & Deployable
- Released under the MIT open-source license; free to use and deploy locally, lowering the barrier for developers and creators.
Bilingual Support
- Excellent performance in Mandarin Chinese and English, with support for mixed-language reading for international content creation.
How It Works
Next-Token Diffusion Mechanism
A VAE encodes speech waveforms into continuous latent vector sequences that preserve timbre, tone, and rhythm; a causal Transformer then predicts the next speech segment step by step.
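To make this concrete, here is a minimal conceptual sketch in PyTorch. All class names, layer sizes, and the denoising update are hypothetical simplifications: it only illustrates the idea of a causal Transformer summarizing the latent frames generated so far while a small diffusion head denoises the next continuous frame. It is not VibeVoice's actual implementation.

```python
# Conceptual sketch of next-token diffusion over continuous speech latents.
# Hypothetical names and sizes; illustrates the idea, not VibeVoice's code.
import torch
import torch.nn as nn

LATENT_DIM = 64      # dimensionality of each continuous latent frame (assumed)
MODEL_DIM = 512      # hidden size of the causal Transformer (assumed)

class DiffusionHead(nn.Module):
    """Tiny denoiser: predicts the noise in the next latent frame,
    conditioned on the Transformer's hidden state for the current step."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + MODEL_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, noisy_latent, cond, t):
        return self.net(torch.cat([noisy_latent, cond, t], dim=-1))

class NextTokenDiffusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(LATENT_DIM, MODEL_DIM)
        layer = nn.TransformerEncoderLayer(MODEL_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = DiffusionHead()

    @torch.no_grad()
    def generate_next(self, latents, steps=20):
        """Autoregressive step: given past latent frames (B, T, LATENT_DIM),
        denoise a new frame from pure noise to extend the sequence."""
        B, T, _ = latents.shape
        # Causal mask so each position only attends to earlier frames.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(latents), mask=mask)
        cond = h[:, -1, :]                       # context summary for the next frame
        x = torch.randn(B, LATENT_DIM)           # start from pure noise
        for i in reversed(range(steps)):         # crude, simplified denoising loop
            t = torch.full((B, 1), i / steps)
            eps = self.head(x, cond, t)
            x = x - eps / steps
        return x

# Usage: extend a sequence of latent frames by one step.
model = NextTokenDiffusion()
past = torch.randn(1, 10, LATENT_DIM)            # 10 already-generated frames
next_frame = model.generate_next(past)           # shape (1, LATENT_DIM)
print(next_frame.shape)
```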
Low Frame Rate Compression
Compresses speech to a 7.5 Hz frame rate, so a 90-minute generation fits in a context of roughly 64K tokens, dramatically reducing computation while maintaining audio quality.
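These numbers can be checked with quick arithmetic (the 24 kHz output sample rate below is an assumption for illustration, not stated on this page): at 7.5 frames per second, 90 minutes of audio corresponds to about 40,500 acoustic frames, which together with the script's text tokens stays within the ~64K-token figure quoted above.

```python
# Back-of-the-envelope check of the 7.5 Hz compression claim.
FRAME_RATE_HZ = 7.5
DURATION_S = 90 * 60                 # 90 minutes in seconds
SAMPLE_RATE = 24_000                 # assumed waveform sample rate

acoustic_frames = FRAME_RATE_HZ * DURATION_S
waveform_samples = SAMPLE_RATE * DURATION_S

print(f"acoustic frames : {acoustic_frames:,.0f}")   # 40,500
print(f"waveform samples: {waveform_samples:,}")     # 129,600,000
print(f"compression     : {waveform_samples / acoustic_frames:,.0f}x")  # 3,200x
```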
Multi-Speaker Coordination
Speakers are switched naturally via speaker labels, with breathing, pauses, and intonation shifts added automatically to soften transitions between speakers.
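For illustration, a speaker-labelled script and a minimal parser might look like the sketch below; the exact label syntax expected by the released VibeVoice demos may differ.

```python
# Hypothetical role-annotated script; the label format VibeVoice's demo
# scripts expect may differ from this sketch.
script = """\
Speaker 1: Welcome back to the show. Today we're talking about long-form TTS.
Speaker 2: Thanks for having me. Ninety minutes in one pass still amazes me.
Speaker 1: Let's start with how the model keeps each voice consistent.
"""

def parse_turns(text):
    """Split a labelled script into (speaker, utterance) pairs."""
    turns = []
    for line in text.splitlines():
        if ":" in line:
            speaker, utterance = line.split(":", 1)
            turns.append((speaker.strip(), utterance.strip()))
    return turns

for speaker, utterance in parse_turns(script):
    print(f"[{speaker}] {utterance}")
```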
Deep Context Understanding
VibeVoice's core strength is its deep context understanding, enabling human-like naturalness in tone control and spontaneous speech generation.
Use Cases
Podcast Creation
Simply provide a role-annotated script to generate multi-speaker podcast audio.
Audiobooks
Convert novels and articles into natural, flowing long-form audio content.
Video Dubbing
Quickly generate high-quality dubbing for video content with multi-speaker support.
Online Education
Create engaging educational audio with multi-language teaching support.
Advertising
Rapidly generate ad audio, saving voiceover costs and time.
Entertainment
Role-playing, drama dialogues, virtual character conversations.
FAQ
- Is VibeVoice free?
- Yes. VibeVoice is open-source under the MIT license and is free to use and deploy locally.
- What languages are supported?
- Currently supports Mandarin Chinese and English, with excellent mixed-language performance.
- How long can the audio be?
- Up to approximately 90 minutes of continuous high-quality audio.
- What hardware is required?
- An NVIDIA GPU (RTX 2080/3060/4060 or better) is recommended for faster generation; CPU inference is also supported but slower.
- How many speakers are supported?
- VibeVoice supports up to 4 different speakers, each maintaining a unique and consistent voice.