Mercury 2 ditches sequential decoding for parallel refinement. As the first reasoning diffusion LLM, it generates tokens simultaneously to hit 1,000+ tokens/sec. This delivers reasoning-grade quality inside tight latency budgets for your agentic loops.
Diffusion models, or dLLMs, are currently one of the most promising paths outside the standard autoregressive route. Everyone is exploring this space right now, from @Seed Diffusion to @Dream 7B and even @Gemini Diffusion. But the standout player is definitely Inception with their Mercury series, and they just pushed their second generation live.
The architectural shift changes everything about latency. Mercury 2 abandons standard left-to-right sequential decoding. Parallel refinement drives the generation instead. Think of the model less like a typewriter printing one token at a time and more like an editor revising a full draft simultaneously.
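The typewriter-versus-editor analogy can be sketched as a toy loop: start from an all-masked draft and refine every position in parallel each step, committing the most "confident" positions. This is an illustrative simplification of diffusion-style decoding, not Inception's actual algorithm; the random "confidence" stands in for model logits.

```python
import random

def parallel_refine(target, steps=4, seed=0):
    """Toy sketch of diffusion-style decoding: begin with a fully masked
    draft and update all masked positions in parallel each step.
    Illustrative only -- not Mercury's real refinement procedure."""
    rng = random.Random(seed)
    draft = ["<mask>"] * len(target)
    for _ in range(steps):
        masked = [i for i, t in enumerate(draft) if t == "<mask>"]
        if not masked:
            break
        # Commit roughly half of the masked positions per step, standing
        # in for "accept the tokens the model is most confident about".
        k = max(1, len(masked) // 2)
        for i in rng.sample(masked, k):
            draft[i] = target[i]
    # Final pass: fill any positions still masked after the step budget.
    return [target[i] if t == "<mask>" else t for i, t in enumerate(draft)]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(parallel_refine(tokens))
```

The key contrast with autoregressive decoding is the inner loop: a left-to-right model would touch one position per step, while here every remaining masked position is a candidate on every step.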
This parallel approach makes the inference insanely fast. Hitting over 1,000 tokens per second gives you a 5x speedup over leading speed-optimized models. This fundamentally alters the equation for multi-step agentic loops or real-time voice apps where latency compounds across every single step.
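The compounding effect is simple arithmetic: per-step generation time multiplies across every step of an agent loop. A back-of-the-envelope comparison, using an assumed 200 tok/s autoregressive baseline against the 1,000 tok/s claim (illustrative numbers, not benchmarks):

```python
def loop_latency(tokens_per_step, steps, tok_per_sec, overhead_s=0.0):
    """Total wall-clock time for a multi-step agent loop where each step
    generates `tokens_per_step` tokens at `tok_per_sec`. Illustrative."""
    return steps * (tokens_per_step / tok_per_sec + overhead_s)

# A 10-step agent loop emitting 500 tokens per step:
baseline = loop_latency(500, 10, 200)   # assumed autoregressive baseline
fast = loop_latency(500, 10, 1000)      # at the claimed 1,000 tok/s
print(f"baseline: {baseline:.1f}s, fast: {fast:.1f}s")
```

Under these assumptions the loop drops from 25 seconds to 5, which is the difference between a coffee break and an interactive experience.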
The API is fully OpenAI-compatible, so you do not need to rewrite any code. You can apply for early access to the API or just chat with it right now to feel the raw speed of a next-gen diffusion model.
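OpenAI compatibility means an existing chat-completions request works unchanged; in practice only the base URL and model name differ. A minimal sketch of such a request payload — the endpoint URL and `"mercury-2"` model identifier below are placeholders I am assuming, not confirmed values:

```python
import json

# Hypothetical values: swap in the real base URL and model name from
# Inception's API documentation once you have access.
BASE_URL = "https://api.example-inception-endpoint.com/v1"
payload = {
    "model": "mercury-2",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Summarize diffusion LLMs in one line."}
    ],
}

# With the official openai SDK, only base_url and api_key would change:
#   client = OpenAI(base_url=BASE_URL, api_key="...")
#   client.chat.completions.create(**payload)
print(json.dumps(payload, indent=2))
```

Because the request shape is the standard chat-completions schema, migrating an existing OpenAI-based agent should be a configuration change rather than a rewrite.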
Parallel refinement instead of sequential decoding is a bold technical shift. 1,000+ tokens per second with reasoning-grade quality is not a small claim, especially for agentic loops where latency compounds fast.
From a positioning angle, though, the speed story is clear, but the practical transformation could be sharper. Is the real win lower infra cost, smoother agent chains, or enabling use cases that were previously too slow to ship?
You could test framing Mercury 2 around one concrete before and after scenario, like what becomes possible at 1,000 tokens per second that was painful before.
Curious, what is the first production use case where teams feel this speed difference most viscerally?
Congratulations! What do you mean by 'reasoning' for a diffusion LLM? Do you have a paper/blog post you could point me to?
About Mercury 2 on Product Hunt
“Fastest reasoning LLM built for instant production AI”
Mercury 2 launched on Product Hunt on February 25th, 2026 and earned 152 upvotes and 5 comments, placing #8 on the daily leaderboard.
Mercury 2 was featured in API (98k followers), Artificial Intelligence (466.2k followers) and Development (5.8k followers) on Product Hunt. Together, these topics include over 99.2k products, making this a competitive space to launch in.
Who hunted Mercury 2?
Mercury 2 was hunted by Zac Zuo. A “hunter” on Product Hunt is the community member who submits a product to the platform — uploading the images, the link, and tagging the makers behind it. Hunters typically write the first comment explaining why a product is worth attention, and their followers are notified the moment they post. Around 79% of featured launches on Product Hunt are self-hunted by their makers, but a well-known hunter still acts as a signal of quality to the rest of the community.