Marengo 3.0 is TwelveLabs' most significant model to date, delivering human-like video understanding at scale. A multimodal embedding model, Marengo fuses video, audio, and text into a single representation to power precise video search and retrieval.
Congratulations on the new release! We once made a similar service: we recognized text from videos, translated it, and generated videos with the translation. This way, YouTube bloggers could automatically create videos in 70+ languages. YouTube even officially recommended this service later.
@emilykurze Congrats on the launch! The video understanding improvements look solid. Curious about use cases beyond technical content.
You couldn't come up with a better name than TwelveLabs? Surely you realize it will sound like a knockoff of ElevenLabs to anyone who has heard of them.
Wow. I was working a couple of months ago with open-source VLMs to build database intelligence and was wondering how they would work on videos... your product seems to be smashing it. Can't wait to try it on tennis match clips.
The unified embedding space for video+audio+text is huge. Most tools treat these as separate streams, but real-world content is inherently multimodal.
Question for Emily and team: What's the typical use case where Marengo 3.0 outperforms separate video/audio/text models? Sports analysis? Content moderation?
Also curious about the multilingual capabilities - how many languages are supported? This could be game-changing for global content creators. Congrats on the launch!
Hey Product Hunt! 👋 This is Allie from @TwelveLabs!
If you’ve ever tried to build on top of models that say they understand video but collapse on long content, sports, or anything beyond short clips… M3 is built for you.
🚀 What’s M3?
M3 is a unified multimodal foundation model powering our Search API and Embed API. It understands video, audio, images, and text in a single space — fast, efficient, and built for production.
🔥 Highlights
⚡ Breakaway speed on long-form video processing — practical at massive scale
💾 512-d embeddings → up to 6× more storage-efficient with top-tier accuracy
🎥 True multimodality across video, audio, image, and text
🌍 Native multilingual support (English, Korean, Japanese, and more)
🏀 Elite sports intelligence: fine-grained action recognition, player tracking, and temporal reasoning
🧠 Handles hour-long videos, long queries, and composed queries (image + text); see the toy retrieval sketch below
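To make the unified-space idea concrete, here is a minimal retrieval sketch over 512-d embeddings, assuming the vectors come from the Embed API. The random vectors and the simple averaging of image and text vectors for a composed query are illustrative assumptions, not the model's actual fusion method.

```python
# Toy illustration of retrieval in a unified embedding space.
# The vectors here are random stand-ins; in practice they would come from
# the Embed API. Averaging image and text vectors as a "composed query"
# is an assumption for illustration, not the model's actual fusion method.
import numpy as np

DIM = 512  # Marengo 3.0 embedding dimensionality

rng = np.random.default_rng(0)

# Pretend library of video-clip embeddings (one 512-d vector per clip).
clip_embeddings = rng.normal(size=(10_000, DIM)).astype(np.float32)
clip_embeddings /= np.linalg.norm(clip_embeddings, axis=1, keepdims=True)

def search(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most cosine-similar clips."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = clip_embeddings @ q
    return np.argsort(scores)[::-1][:k]

# Text-only query.
text_vec = rng.normal(size=DIM).astype(np.float32)
print(search(text_vec))

# Composed query (image + text): combine both vectors, here by simple averaging.
image_vec = rng.normal(size=DIM).astype(np.float32)
composed_vec = (text_vec + image_vec) / 2.0
print(search(composed_vec))
```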
💡 What you can build
Search platforms, AI agents that watch content, sports analytics tools, compliance systems, media workflows — anything that needs real video understanding.
🛠️ Try Marengo 3.0
Available via:
TwelveLabs SaaS (Search API + Embed API)
AWS Bedrock
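For a feel of the developer surface, here is a minimal sketch using the TwelveLabs Python SDK. The method names, parameters, and the "marengo3.0" model identifier below are assumptions based on earlier SDK versions and may differ; the official docs are the source of truth.

```python
# Minimal sketch of indexing a video and searching it via the TwelveLabs SDK.
# NOTE: method names, parameters, and the model identifier below are
# assumptions based on earlier SDK versions; consult the official docs.
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="YOUR_API_KEY")

# Create an index backed by Marengo (model name is an assumption).
index = client.index.create(
    name="demo-index",
    models=[{"name": "marengo3.0", "options": ["visual", "audio"]}],
)

# Upload a video for indexing.
task = client.task.create(index_id=index.id, file="match_highlights.mp4")

# Search with a natural-language query once indexing completes.
results = client.search.query(
    index_id=index.id,
    query_text="left-handed player hits a backhand winner down the line",
    options=["visual", "audio"],
)
for clip in results.data:
    print(clip.video_id, clip.start, clip.end, clip.score)
```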
I’m so proud of the research-first team behind this release — and excited to see what you build with M3.
Ask me anything below 👇
Congratulations on the launch! Curious: when working with long videos, how fast and accurate is the search? Can it reliably find moments based on vague queries?
TwelveLabs is impressive in pushing the limits of video AI. It seems powerful and efficient. How does it handle complex scenes to ensure accurate context understanding across different video genres?
Congratulations to the team! This is a huge milestone for AI applied to real-world media.
I like the direction you’re taking with this. What kind of feedback from early users influenced this version?
Congratulations, guys!! Could you use TwelveLabs to review the final cut of a video edit before publishing, in the context of content creation and YouTube? It could be really interesting for final-cut reviews and catching missed keyframes.
It looks amazing! Does it handle action retrieval in fast-moving sports? I am starting a new product and kinda need something like that.
Only people with a movie historical background will understand the logo :) Love the idea behind it :)
it's insanely fast! do you think i can use this to detect if some animation is broken?
(it's hard to define what broken even means)
check out unfold to see what i mean (we just launched yesterday on PH), and all the best guys - you have #1 vibes!
Why we built Marengo 3.0: Modern multimodal models break down on the things that actually matter in production: long videos, fast-moving sports, mixed-modality queries, noisy real-world audio, and multilingual content. We built Marengo 3.0 to solve those exact pain points. Instead of optimizing for short clips or English-only benchmarks, we focused on understanding the world as it really is—messy, long-form, multilingual, and multimodal.
What’s new and unique: Marengo 3.0 introduces a more efficient unified embedding space that works across video, audio, text, images, and even composed queries (e.g., image + text together). That unlocks new capabilities like action-level sports retrieval, long descriptive queries, accurate speech and non-speech audio retrieval, and native multilingual search across 36 languages. And it does this while being 3–6× more storage-efficient than alternative models.
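For a rough sense of where the storage figure comes from, here is a back-of-the-envelope comparison assuming float32 precision; the 1,536-d and 3,072-d baselines are illustrative assumptions, not the specific models Marengo was benchmarked against.

```python
# Back-of-the-envelope storage comparison (float32 = 4 bytes per dimension).
# The 1536-d and 3072-d baselines are illustrative assumptions, not the
# specific alternatives TwelveLabs compared against.
BYTES_PER_FLOAT32 = 4

def embedding_kb(dims: int) -> float:
    return dims * BYTES_PER_FLOAT32 / 1024

for dims in (512, 1536, 3072):
    print(f"{dims:>5}-d embedding: {embedding_kb(dims):.1f} KB "
          f"({embedding_kb(dims) / embedding_kb(512):.0f}x vs 512-d)")
```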
What we’re most proud of: The biggest milestone: there’s no longer a trade-off between multimodality and performance. Marengo 3.0 hits state-of-the-art results across composed retrieval, sports, OCR, long-form understanding, audio, and multilingual tasks—while staying lightweight and production-friendly. Instead of chasing synthetic benchmarks, we designed a model that excels in real-world use.
Curious to hear what the Product Hunt community thinks! What would you build with access to multimodal video understanding that actually works at production scale?