VoxCPM2: Tokenizer-Free TTS Generates 48kHz Speech in 30 Languages Without an Audio Codec

Tsinghua’s 2B-parameter VoxCPM2 skips discrete audio tokens entirely: it generates studio-quality 48kHz speech directly from text in 30 languages and can clone a voice from a short clip.
artificial-intelligence
Author

Kabui, Charles

Published

2026-06-06

Keywords

text-to-speech, tokenizer-free-tts, voice-cloning, voxcpm2, diffusion-autoregressive