Fastest reasoning LLM built for instant production AI
Mercury 2 ditches sequential decoding for parallel refinement. As the first reasoning diffusion LLM, it generates tokens simultaneously to hit 1,000+ tokens/sec. This delivers reasoning-grade quality inside tight latency budgets for your agentic loops.
Diffusion language models, or dLLMs, are currently one of the most promising paths outside the standard autoregressive route. Everyone is exploring this space right now, from @Seed Diffusion to @Dream 7B and even @Gemini Diffusion. But the standout player is definitely Inception with their Mercury series, and they just pushed their second generation live.
The architectural shift changes everything about latency. Mercury 2 abandons standard left-to-right sequential decoding. Parallel refinement drives the generation instead. Think of the model less like a typewriter printing one token at a time and more like an editor revising a full draft simultaneously.
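To make the "editor revising a full draft" idea concrete, here is a minimal conceptual sketch of mask-based parallel refinement. This is a toy stand-in, not Mercury's actual algorithm: the `propose` function and the 0.5 confidence cutoff are illustrative assumptions, standing in for a transformer that scores all positions jointly.

```python
MASK = "<mask>"

def toy_refine(length, propose, steps):
    """Toy parallel refinement: start from a fully masked draft, then on
    each pass fill every still-masked slot at once, committing only
    high-confidence proposals (a stand-in for iterative denoising)."""
    seq = [MASK] * length
    for _ in range(steps):
        # All masked positions are proposed in the same pass -- this is
        # the "revise the whole draft" step, unlike left-to-right
        # decoding, which commits exactly one token per pass.
        for i, tok in enumerate(seq):
            if tok == MASK:
                candidate, confidence = propose(seq, i)
                if confidence >= 0.5:  # illustrative threshold
                    seq[i] = candidate
    return seq
```

In a real dLLM, `propose` is the model itself scoring every position in one forward pass, which is exactly why many tokens can land per step instead of one.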
This parallel approach makes the inference insanely fast. Hitting over 1,000 tokens per second gives you a 5x speedup over leading speed-optimized models. This fundamentally alters the equation for multi-step agentic loops or real-time voice apps where latency compounds across every single step.
The API is fully OpenAI-compatible, so you do not need to rewrite any code. You can apply for early access to the API or just chat with it right now to feel the raw speed of a next-gen diffusion model.
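OpenAI compatibility in practice means the standard chat-completions request shape works unchanged; only the base URL and model name differ. The values below are placeholders for illustration, so check Inception's docs for the real endpoint and model identifier:

```python
import json

# Assumed values for illustration only -- substitute the real Mercury
# endpoint and model name from Inception's documentation.
BASE_URL = "https://api.example.com/v1"  # assumed endpoint
payload = {
    "model": "mercury",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Summarize diffusion LLMs in one line."}
    ],
}

# This is the exact body an OpenAI-style client would POST to
# f"{BASE_URL}/chat/completions"; existing client code only needs its
# base_url pointed at the new endpoint.
body = json.dumps(payload)
```

Because the request and response schemas match, dropping Mercury into an existing agent framework is a configuration change, not a rewrite.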