Realtime TTS 1.5 is #1 on Artificial Analysis, voted best in blind tests by thousands of real users. TTS-2 builds on that with six major upgrades, including natural-language voice direction for tone, emotion, speed, and pitch; text-based voice design, where you describe a voice in words and generate it; cross-lingual synthesis across 100+ languages that preserves speaker identity; IPA phonetic control for brand names and rare words; and improved alphanumeric pronunciation. Try it free at inworld.ai/tts.
So I tried the speech-to-speech. It confuses itself and hallucinates very quickly with just basic questions and conversation. I asked both bots how are you, what are you doing today, and what are you doing for dinner, and each gave me a completely different spectrum of answers. They gave a lot of filler responses like hey, hmm, huh, which I can understand why those are there. But Jason started telling me how to increase the gain of my television set, and Sarah thought I was going to a party. Also, the vocal fidelity leaves a lot to be desired in speech-to-speech. Just my honest feedback so far. Keep at it.
It's designed to be the frontend of a voice-interfaced application of any kind and scale.
Beyond the naturalness and multilingual quality improvements, this iteration can't really be called yet another TTS. Like speech-to-speech models, Realtime TTS 2.0 was trained to be explicitly steered toward the most appropriate response, given the conversation context and the agent's goal.
Check it out!
The #1 TTS on Artificial Analysis just got a major capability upgrade. Most voice AI hears what you say. Realtime TTS-2 hears how you say it.
Had a great conversation with Myles. Highly recommend trying it yourself at realtime.ai. It's truly impressive.
I'm most excited about the improvements made in cross-lingual. It's so seamless to have an engaging conversation and switch between multiple languages like English, Hindi, then French and it's the same voice.
It sounds too much like audiobook narration. I guess it was trained on that input? Same thing that plagues every single ElevenLabs voice. The only voice out there that sounds human is the Alloy voice from OpenAI, and that's an old AI voice. It's so strange. This field should be wide open and competitive. What's going on? What am I missing?
Training on conversation instead of narration is the right call. Every voice agent I've tried sounds like an audiobook reading my support ticket back.
congrats team !!
The voice control seems to be crazy good: you can just describe the tone and it gets really close without all the tweaking. Feels more usable than most TTS tools I've tested. I'm gonna test it!
Pushing the frontier! Congrats to the team and thank you to all of the partners and customers whose feedback has helped shape TTS-2. Onwards and upwards!
Hey everyone, Andreas from the Inworld team! I've been pumped about this launch for weeks and I'm so excited that we finally get to share TTS-2 with you all. If you want to hear what it can do, jump into the playground at inworld.ai/tts and try voice design or steering for yourself, or play with our realtime demo at realtime.ai. Would love to hear your reactions!
Hi Product Hunt! We're back! I'm Kylan, CEO and co-founder of @Inworld.
Some of you might remember when we launched Inworld TTS here. It went on to become the #1 ranked voice AI on Artificial Analysis, voted best in blind listening tests by thousands of real users. That meant a lot to us, so we went back and rebuilt the model from the ground up.
Today we're launching Realtime TTS 2.0. Try the live speech-to-speech experience at realtime.ai.
Here's the thing we kept hearing from builders: voice AI was built for audiobooks and voiceovers. It sounds good, but it sounds like a human reading from a script. If you've ever talked to a voice agent and thought "something feels off," that's why. Realtime conversation is a completely different problem, and we decided to solve it.
What can you build with it?
Companion apps that adapt to your user's mood and tone in real time through natural language voice direction
Language tutors that switch languages mid-session with the same voice, no re-recording
Characters that sound exactly how you describe them with text-based voice design
Support agents that get every code, name, and number right with improved alphanumeric handling and International Phonetic Alphabet (IPA) support
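The IPA support mentioned above is the kind of thing that's easiest to see in markup. As a rough sketch, a brand-name pronunciation might be pinned with an SSML-style `<phoneme>` tag; whether Inworld accepts this exact syntax is an assumption on my part, and only the IPA transcription itself is standard:

```python
# Sketch: pinning the pronunciation of a brand name with IPA.
# The <phoneme> tag follows W3C SSML conventions; treating it as valid
# input here is an assumption, not Inworld's documented syntax.
ipa = "ˈɪnwɝld"  # one plausible General American reading of "Inworld"
text = f'Welcome to <phoneme alphabet="ipa" ph="{ipa}">Inworld</phoneme> TTS.'
print(text)
```

The same idea covers rare words and alphanumerics: spell out exactly what should be said rather than hoping the model guesses right.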
So what actually changed?
Natural conversationality. We trained the model on conversational speech instead of narration. You get natural rhythm, breath, micro-pauses, the cadence humans actually use when they talk to each other. Every voice you build on TTS 2.0 sounds like a person in conversation, not a narrator.
Conversational awareness. TTS 2.0 is informed by the full audio context of the multi-turn exchange. Not just the current sentence, the whole conversation. How it speaks adapts to how it was spoken to. A line delivered after a joke lands differently than the same line after bad news. The model knows the difference because it heard what came before.
Full voice direction. You steer the model with natural language the way you'd direct a voice actor. Not preset emotion tags, full descriptions: "act like you just got home from a long day, tired but warm." Combined with inline controls for specific moments ([whispering], [sigh], [excited]), the voice is as controllable as it is expressive.
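To make the two levels of control concrete, here is a minimal sketch of a request that combines a scene-level style prompt with inline controls. The field names and structure are hypothetical, not Inworld's actual API schema; only the style-prompt and inline-tag ideas come from the description above:

```python
import json

# Hypothetical request body: a natural-language style prompt directs the
# overall delivery, while inline tags steer specific moments.
# All field names here are illustrative assumptions.
payload = {
    "voice_id": "sarah",  # assumed voice identifier
    "style_prompt": "act like you just got home from a long day, tired but warm",
    "text": "[sigh] Long day. [whispering] But I'm really glad you're here.",
}
body = json.dumps(payload)
print(body)
```

The design point is that the prompt sets a baseline performance for the whole utterance, and the bracketed tags override it only where they appear.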
Text-based voice design. Describe a voice in plain text and generate it: "A posh British man, aged 30-40, speaking deliberately." Iterate on the prompt until it fits, save it, deploy it. No casting calls, no recording booth.
Crosslingual fluency. One voice across 100+ languages with on-the-fly switching inside a single generation. Your voice identity is preserved across every language. No re-recording, no managing separate voices per locale.
Realtime TTS 1.5 is still #1 on the leaderboard. TTS 2.0 takes that quality and adds everything that was missing to uplevel realtime conversation.
Learn more at inworld.ai/tts. Happy to answer any questions in the comments.
– Kylan
About Realtime TTS-2 on Product Hunt
“Voice AI that feels as good as it sounds”
Realtime TTS-2 launched on Product Hunt on May 6th, 2026 and earned 150 upvotes and 15 comments, placing #11 on the daily leaderboard.
Realtime TTS-2 was featured in API (98.1k followers), Developer Tools (512k followers) and Artificial Intelligence (467.7k followers) on Product Hunt. Together, these topics include over 168.7k products, making this a competitive space to launch in.
Who hunted Realtime TTS-2?
Realtime TTS-2 was hunted by Chris Messina. A “hunter” on Product Hunt is the community member who submits a product to the platform — uploading the images, the link, and tagging the makers behind it. Hunters typically write the first comment explaining why a product is worth attention, and their followers are notified the moment they post. Around 79% of featured launches on Product Hunt are self-hunted by their makers, but a well-known hunter still acts as a signal of quality to the rest of the community. See the full all-time top hunters leaderboard to discover who is shaping the Product Hunt ecosystem.
Want to see how Realtime TTS-2 stacked up against nearby launches in real time? Check out the live launch dashboard for upvote speed charts, proximity comparisons, and more analytics.