MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi that transcribes Mandarin, English, eight Chinese dialects, code-switched speech, and song lyrics. Built for ML engineers, researchers, and developers building real-world voice applications.
Whisper changed what people expected from open-source ASR. Three years later, the leaderboard looks very different.
What it is: MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi MiMo, MIT-licensed and available on HuggingFace, built for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.
The problem: most ASR models are benchmarked on clean studio data and deployed into the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.
The solution: staged training combining large-scale mid-training, supervised fine-tuning, and a reinforcement learning algorithm specifically targeting the scenarios where conventional models break down. Native punctuation from prosody means transcripts arrive ready to use.
What makes it different: on the Open ASR Leaderboard, MiMo-V2.5-ASR posts 5.73% average WER on English, below Whisper large-v3 at 7.44%. On Wu dialect it scores 19.55% vs FunASR-1.5 at 29.08%. On lyrics, 3.95% on m4singer vs Gemini 2.5 Pro at 4.25%. These are not cherry-picked scenarios — they are the hard ones.
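Those percentages are word error rate (WER), the standard ASR metric: the minimum number of word substitutions, insertions, and deletions needed to turn the model's transcript into the reference, divided by the reference word count. A minimal sketch of the computation (standard word-level Levenshtein distance, not code from the MiMo project):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words,
    computed as word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion against six reference words: 1/6 ≈ 0.167, i.e. 16.7% WER
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A 5.73% average WER therefore means roughly one word error per 17 reference words, aggregated across the leaderboard's English test sets.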
Key features:
Eight Chinese dialects natively supported, including Wu, Cantonese, Hokkien, Sichuanese
Chinese-English code-switching with no language tags
Lyrics transcription under accompaniment and pitch variation
Multi-speaker and noisy environment robustness
Native punctuation, no post-processing needed
MIT license, Python API, Gradio demo, self-hostable
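Since the model is self-hostable with a Python API, a typical integration wraps whatever ASR callable you load so downstream code only sees punctuated text. This is a hypothetical sketch, not the project's documented API: the model repo id and the use of the HuggingFace `transformers` ASR pipeline are assumptions to verify against the actual HuggingFace page.

```python
from typing import Callable

def make_transcriber(asr: Callable[[str], dict]) -> Callable[[str], str]:
    """Wrap an ASR pipeline (any callable returning {"text": ...})
    so callers get plain, trimmed, punctuated text back."""
    def transcribe(audio_path: str) -> str:
        result = asr(audio_path)
        return result["text"].strip()
    return transcribe

# Hypothetical real usage (model id and pipeline support are unverified assumptions):
#   from transformers import pipeline
#   asr = pipeline("automatic-speech-recognition", model="XiaomiMiMo/MiMo-V2.5-ASR")
#   transcribe = make_transcriber(asr)
#   print(transcribe("meeting.wav"))
```

Injecting the pipeline as a callable keeps the wrapper testable without downloading model weights, and lets you swap in a different ASR backend without touching downstream code.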
Benefits:
Production-grade accuracy on the audio conditions that actually exist in the field
One model replaces multiple regional or domain-specific ASR solutions
Self-hosting eliminates per-call API costs and keeps data on your infra
Ready-to-use punctuated output cuts one step from every downstream pipeline
Who it's for: ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines who need accuracy that holds up outside the lab.
Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is evidence that the gap is now very small, and in some scenarios closed entirely.
About MiMo-V2.5 Voice on Product Hunt
“Bilingual ASR for dialects, code-switching, and songs”
MiMo-V2.5 Voice launched on Product Hunt on April 25th, 2026 and earned 114 upvotes and 1 comment, placing #6 on the daily leaderboard. MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi that transcribes Mandarin, English, eight Chinese dialects, code-switched speech, and song lyrics. Built for ML engineers, researchers, and developers building real-world voice applications.
MiMo-V2.5 Voice was featured in API (98.1k followers), Open Source (68.3k followers), Artificial Intelligence (466.8k followers) and GitHub (41.2k followers) on Product Hunt. Together, these topics include over 128.8k products, making this a competitive space to launch in.
Who hunted MiMo-V2.5 Voice?
MiMo-V2.5 Voice was hunted by Rohan Chaubey and Kumar Abhishek. A “hunter” on Product Hunt is the community member who submits a product to the platform — uploading the images, the link, and tagging the makers behind it. Hunters typically write the first comment explaining why a product is worth attention, and their followers are notified the moment they post. Around 79% of featured launches on Product Hunt are self-hunted by their makers, but a well-known hunter still acts as a signal of quality to the rest of the community. See the full all-time top hunters leaderboard to discover who is shaping the Product Hunt ecosystem.
Want to see how MiMo-V2.5 Voice stacked up against nearby launches in real time? Check out the live launch dashboard for upvote speed charts, proximity comparisons, and more analytics.