Sharing Sesame's Conversational Speech Model (CSM), a big step beyond typical text-to-speech. The goal is to achieve what Sesame calls "voice presence": making spoken interactions feel real, understood, and valued.
A PH-style version of this model's System Card :)
😃 Emotional Context: It tries to understand and respond to the emotion in the conversation.
⏱️ Conversational Dynamics: It aims for natural timing, pauses, and intonation.
🧠 Contextual Awareness: It adapts its tone and style to the situation.
👤 Consistent Personality: It maintains a coherent voice across a conversation.
👂 Multimodal: It understands both text and audio input.
🗣️ End-to-End: It generates speech directly, in a single stage, for greater efficiency (see the sketch after this list).
🔓 Open Source: Models will be released under the Apache 2.0 License.
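Since the weights are slated for an Apache 2.0 release, here's a hypothetical usage sketch of the multimodal, single-stage generation loop. Everything here (the `generator` module, `load_csm_1b`, `Segment`, and the `generate()` parameters) is an assumed API for illustration, not something confirmed in the post:

```python
# Hypothetical usage sketch -- load_csm_1b / Segment / generate() are
# assumed names, not a confirmed API.
import torchaudio
from generator import load_csm_1b, Segment  # assumed module from the release

generator = load_csm_1b(device="cuda")

# Multimodal context: prior turns carry both their text and their audio,
# so the model can match tone, timing, and intonation.
prev_audio, sr = torchaudio.load("previous_turn.wav")
context = [Segment(speaker=0, text="Rough day, huh?", audio=prev_audio)]

# End-to-end: one call goes straight from text + context to a waveform,
# with no separate acoustic-model / vocoder stages.
audio = generator.generate(
    text="Yeah, but it's looking up now.",
    speaker=1,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```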
They've built a custom evaluation suite to measure these conversational qualities, because traditional metrics (like Word Error Rate) don't really capture how natural the speech sounds.
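To make that concrete, here's a minimal, self-contained WER computation. Two renditions of the same sentence, one flat and robotic and one expressive, transcribe to the same words and therefore score identically, so WER literally cannot see prosody:

```python
# Minimal word-error-rate (WER) computation, to illustrate why it misses
# naturalness: it only compares word sequences.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

transcript = "well I guess that could work"
# A robotic and an expressive rendition transcribe to the same words:
print(wer(transcript, transcript))  # 0.0 -- WER rates both as perfect
```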
The model itself is based on the Llama architecture, but with a clever split-transformer design.
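As a rough illustration of that split, here's a toy PyTorch sketch assuming the design Sesame describes: a large backbone transformer predicts the first RVQ codebook of each audio frame from the full conversation context, and a much smaller decoder autoregressively fills in the remaining codebooks within the frame. The dimensions, layer counts, and vocabulary size below are made up, and this is the shape of the computation only, not Sesame's implementation:

```python
import torch
import torch.nn as nn

# Toy sketch of the split-transformer idea (the real model uses a Llama
# backbone plus an RVQ audio codec; all sizes here are invented).
N_CODEBOOKS, VOCAB, DIM = 8, 2051, 512

class SplitTransformerTTS(nn.Module):
    def __init__(self):
        super().__init__()
        # Large "backbone": runs once per audio frame over the whole
        # interleaved text+audio context and predicts codebook 0.
        big = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(big, num_layers=12)
        self.head0 = nn.Linear(DIM, VOCAB)
        # Much smaller "decoder": runs once per codebook within a frame,
        # conditioned on the backbone state and earlier codebooks.
        small = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(small, num_layers=2)
        self.embed = nn.Embedding(VOCAB, DIM)
        self.heads = nn.ModuleList(
            nn.Linear(DIM, VOCAB) for _ in range(N_CODEBOOKS - 1)
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        """context: (batch, seq, DIM) embedded text+audio history.
        Returns (batch, N_CODEBOOKS) token ids for the next audio frame."""
        h = self.backbone(context)[:, -1]        # last-position state
        tokens = [self.head0(h).argmax(-1)]      # codebook 0
        state = torch.stack([h, self.embed(tokens[0])], dim=1)
        for head in self.heads:                  # codebooks 1..N-1
            out = self.decoder(state)[:, -1]
            tok = head(out).argmax(-1)
            tokens.append(tok)
            state = torch.cat([state, self.embed(tok)[:, None]], dim=1)
        return torch.stack(tokens, dim=1)

model = SplitTransformerTTS()
frame = model(torch.randn(1, 20, DIM))  # -> 8 codebook ids for one frame
```

The apparent payoff of the split is efficiency: the expensive backbone runs once per frame, while only the cheap decoder loops over codebooks.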
You can try the demo to experience the conversational voice (it's magical, believe me).
Feels almost sentient. I talked to Miles and told it to tweak its sarcasm down to 10%. Let's just say he knows how to play along :)
A friend sent this to me, and I just used the demo to put Sesame in conversation with ChatGPT's voice agent, and the difference is huge. It's definitely much more humanlike, especially in its intonations and (micro?) expressions. The only hitch: I found it extra sensitive to external noises, which makes it pause in the middle of its speech. Barring that, it's the most emotionally mature voice model on the market, imo.
Current AI voices sound so robotic that the "uncanny valley" effect makes voice features feel artificial. Sesame could help make voice interactions more natural and engaging for users.