Bilingual ASR for dialects, code-switching, and songs
MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi that transcribes Mandarin, English, eight Chinese dialects, code-switched speech, and song lyrics. Built for ML engineers, researchers, and developers building real-world voice applications.
Whisper changed what people expected from open-source ASR. Three years later, the leaderboard looks very different.
What it is: MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi MiMo, MIT-licensed and available on HuggingFace, built for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.
The problem: most ASR models are benchmarked on clean studio data and deployed into the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.
The solution: staged training combining large-scale mid-training, supervised fine-tuning, and a reinforcement learning algorithm specifically targeting the scenarios where conventional models break down. Native punctuation from prosody means transcripts arrive ready to use.
What makes it different: on the Open ASR Leaderboard, MiMo-V2.5-ASR posts 5.73% average WER on English, below Whisper large-v3 at 7.44%. On Wu dialect it scores 19.55% vs FunASR-1.5 at 29.08%. On lyrics, 3.95% on m4singer vs Gemini 2.5 Pro at 4.25%. These are not cherry-picked scenarios — they are the hard ones.
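The headline figures are word error rates (WER): the word-level edit distance between the model's transcript and a reference transcript, divided by the reference length. A minimal sketch of the metric behind the numbers above (production evaluations typically add text normalization before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (counts substitutions, insertions, and deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution against a six-word reference: WER = 1/6
print(round(wer("the cat sat on the mat", "the cat sit on the mat"), 4))
```

So a 5.73% average WER means roughly one word in seventeen is wrong; cutting Wu-dialect WER from 29.08% to 19.55% removes about a third of the errors.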
Key features:
Eight Chinese dialects natively supported, including Wu, Cantonese, Hokkien, Sichuanese
Chinese-English code-switching with no language tags
Lyrics transcription under accompaniment and pitch variation
Multi-speaker and noisy environment robustness
Native punctuation, no post-processing needed
MIT license, Python API, Gradio demo, self-hostable
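Because the weights are MIT-licensed and on HuggingFace, self-hosting can be as simple as a transformers pipeline. A minimal sketch, assuming the checkpoint is published as a standard automatic-speech-recognition pipeline (the repo id below is a placeholder; check the actual model card for the real name and recommended loading code):

```python
from transformers import pipeline

def load_asr(model_id: str = "XiaomiMiMo/MiMo-V2.5-ASR"):
    # model_id is hypothetical -- substitute the repo name from the model card.
    return pipeline("automatic-speech-recognition", model=model_id)

if __name__ == "__main__":
    asr = load_asr()
    # Punctuation is produced natively, so the text is ready for
    # downstream use without a post-processing pass.
    result = asr("meeting_recording.wav")
    print(result["text"])
```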
Benefits:
Production-grade accuracy on the audio conditions that actually exist in the field
One model replaces multiple regional or domain-specific ASR solutions
Self-hosting eliminates per-call API costs and keeps data on your infra
Ready-to-use punctuated output cuts one step from every downstream pipeline
Who it's for: ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines who need accuracy that holds up outside the lab.
Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is a data point suggesting the gap is now very small, and in some scenarios gone.
About MiMo-V2.5 Voice on Product Hunt
“Bilingual ASR for dialects, code-switching, and songs”
MiMo-V2.5 Voice launched on Product Hunt on April 25th, 2026 and earned 110 upvotes and 1 comment, placing #6 on the daily leaderboard. MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi that transcribes Mandarin, English, eight Chinese dialects, code-switched speech, and song lyrics. Built for ML engineers, researchers, and developers building real-world voice applications.
On the analytics side, MiMo-V2.5 Voice competes within API, Open Source, Artificial Intelligence and GitHub — topics that collectively have 674.4k followers on Product Hunt.
Who hunted MiMo-V2.5 Voice?
MiMo-V2.5 Voice was hunted by Rohan Chaubey and Kumar Abhishek. A “hunter” on Product Hunt is the community member who submits a product to the platform — uploading the images, the link, and tagging the makers behind it. Hunters typically write the first comment explaining why a product is worth attention, and their followers are notified the moment they post. Around 79% of featured launches on Product Hunt are self-hunted by their makers, but a well-known hunter still acts as a signal of quality to the rest of the community. See the full all-time top hunters leaderboard to discover who is shaping the Product Hunt ecosystem.