VibeVoice: Next-Gen Open Source Text-to-Speech Model

90-minute multi-speaker podcast generation, pushing the boundaries of speech synthesis. Developed by Microsoft Research Asia.

What is VibeVoice?

VibeVoice is a cutting-edge speech generation model from Microsoft Research Asia that uses a next-token diffusion mechanism to generate up to 90 minutes of high-quality speech with up to 4 natural-sounding conversational speakers. The generated speech includes natural breathing, appropriate pauses, and even occasional lip sounds, as if someone were really talking in your ear.

90 minutes of continuous audio

4 speakers supported

7.5 Hz ultra-low frame rate

Core Capabilities

Three Key Capabilities of VibeVoice

Ultra-Long Duration

Generate up to 90 minutes of continuous high-quality audio, far beyond the duration limits of traditional TTS systems. Ultra-low frame rate compression (7.5 Hz) significantly reduces the computational load.

Multi-Speaker Natural Interaction

Supports up to 4 different speakers, each with a unique voice and speaking style that stays consistent throughout. Breathing and pause cues are added automatically at speaker transitions.

Realistic Details & Atmosphere

Naturally renders breathing, pauses, and other speech details, and can add background music and other audio effects in appropriate scenes to enhance the atmosphere.

LatentLM Architecture

Built on the LatentLM next-token diffusion framework: a VAE encodes speech into continuous latent vectors, and generation proceeds autoregressively over those continuous latents.

Open Source & Deployable

Released under the MIT open-source license, free to use and deploy locally, lowering the barrier for developers and creators.

Bilingual Support

Performs strongly in both Mandarin Chinese and English, and supports mixed-language reading for international content creation.

How It Works

Next-Token Diffusion Mechanism

A VAE encodes speech waveforms into sequences of continuous latent vectors that preserve timbre, tone, and rhythm; a causal Transformer then predicts the next speech segment step by step.
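The causal next-frame loop can be sketched as below. This is a toy illustration only: the latent size, context length, and `predict_next` function are stand-ins, not VibeVoice's actual VAE or Transformer + diffusion head.

```python
import numpy as np

# Toy sketch of autoregressive generation over continuous latents.
# All shapes and the predictor are illustrative assumptions.
rng = np.random.default_rng(0)
LATENT_DIM = 64   # assumed latent size per speech frame
CONTEXT = 8       # frames of history the predictor sees

def predict_next(history: np.ndarray) -> np.ndarray:
    """Stand-in for the causal Transformer + diffusion head:
    returns one new continuous latent vector conditioned on history."""
    # A weighted average of recent frames plus a little noise, purely
    # to show the causal next-frame loop, not the real model.
    weights = np.linspace(0.1, 1.0, len(history))[:, None]
    mean = (history * weights).sum(axis=0) / weights.sum()
    return mean + 0.01 * rng.standard_normal(LATENT_DIM)

# Seed latents, e.g. from a VAE-encoded voice prompt.
latents = [rng.standard_normal(LATENT_DIM) for _ in range(CONTEXT)]

# Generate 10 more frames, one at a time (next-token style).
for _ in range(10):
    history = np.stack(latents[-CONTEXT:])
    latents.append(predict_next(history))

print(len(latents))  # 18 continuous latent frames
```

The key point is that each step emits a continuous vector rather than a discrete token, which is what lets the model preserve fine acoustic detail at a very low frame rate.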

Low Frame Rate Compression

Compresses audio to a 7.5 Hz frame rate, so a full 90-minute session fits within a roughly 64K-token context, dramatically reducing computation while maintaining audio quality.
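The token budget is easy to verify with back-of-the-envelope arithmetic: at 7.5 latent frames per second, 90 minutes of audio needs about 40,500 acoustic frames, leaving headroom in a ~64K context for the text script and speaker labels.

```python
# Token budget at a 7.5 Hz acoustic frame rate.
FRAME_RATE_HZ = 7.5
MINUTES = 90

acoustic_frames = FRAME_RATE_HZ * MINUTES * 60
print(acoustic_frames)  # 40500.0 latent frames for 90 minutes

# Remaining context (assuming a 64K-token window) available for text:
headroom = 64_000 - acoustic_frames
print(headroom)  # 23500.0 tokens to spare
```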

Multi-Speaker Coordination

Speaker labels in the script drive natural voice switching, with breathing, pauses, and accent changes added automatically so transitions between speakers don't sound abrupt.

Deep Context Understanding

VibeVoice's core strength is its deep context understanding, enabling human-like naturalness in tone control and spontaneous speech generation.

Use Cases

Podcast Creation

Simply provide a role-annotated script to generate multi-person podcast audio.

Audiobooks

Convert novels and articles into natural, flowing long-form audio content.

Video Dubbing

Quickly generate high-quality dubbing for video content with multi-speaker support.

Online Education

Create engaging educational audio with multi-language teaching support.

Advertising

Rapidly generate ad audio, saving voiceover costs and time.

Entertainment

Role-playing, drama dialogues, virtual character conversations.

FAQ

Is VibeVoice free?
Yes. VibeVoice is open-source under the MIT license, free to use and deploy locally.
What languages are supported?
Currently supports Mandarin Chinese and English, with excellent mixed-language performance.
How long can the audio be?
Up to approximately 90 minutes of continuous high-quality audio.
What hardware is required?
An NVIDIA GPU (RTX 2080 / 3060 / 4060 or better) is recommended for faster generation; CPU inference is also supported, but slower.
How many speakers are supported?
VibeVoice supports up to 4 different speakers, each maintaining a unique and consistent voice.