Diffusion Models Are Coming for Your AI Assistant (That's a Good Thing)
Three seconds doesn't sound like a long time until you're staring at a loading spinner while your AI assistant figures out whether you have a conflict at 2pm. You asked a simple question. The model is thinking. You wait, and wait, and the little dots keep pulsing, and by the time the answer appears you've already opened your calendar and checked it yourself.
That delay isn't a bug. It's how the underlying technology works. And a new class of model is about to change it.
Why current models are slow by design
Every major language model you've used — ChatGPT, Claude, Gemini — generates text the same way: one token at a time, left to right, each word depending on the one before it. The technical term is autoregressive generation. The practical term is "standing in line at a very fast deli counter where only one person can order at a time."
This sequential bottleneck means that no matter how powerful the hardware gets, there's a ceiling on how fast these models can respond. You can make each step faster, but you can't skip the queue. A 500-token response means 500 sequential steps. For complex tasks where an AI assistant needs to reason through multiple steps, check your calendar, draft a message, and confirm the details, those steps multiply. The wait adds up.
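The queue is easy to see in code. Here's a minimal sketch of the autoregressive loop; `next_token` is a hypothetical stand-in for a full forward pass of the model, not any real API:

```python
def generate_autoregressive(prompt_tokens, n_new, next_token):
    """Generate n_new tokens one at a time.

    next_token stands in for a full model forward pass: it sees
    everything generated so far and returns exactly one token.
    """
    tokens = list(prompt_tokens)
    for _ in range(n_new):                 # n_new sequential steps; no skipping ahead
        tokens.append(next_token(tokens))  # each token depends on all previous ones
    return tokens[len(prompt_tokens):]

# Toy "model": returns the length of the context so far.
out = generate_autoregressive([1, 2, 3], 4, lambda ctx: len(ctx))
# out == [3, 4, 5, 6] -- four new tokens required four dependent calls
```

The loop body can't be parallelized: call number n needs the output of call number n-1, which is exactly the ceiling described above.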
It's tolerable for a chatbot you're poking at casually. It's a real problem for an assistant that's supposed to handle things in the background while you get on with your day.
Diffusion models: the alternative
You've probably seen diffusion models at work already, even if you didn't know the name. They're the technology behind AI image generators like Stable Diffusion and DALL-E. The process starts with pure noise and gradually refines it into a coherent image, all at once rather than pixel by pixel.
Now the same idea is being applied to text. Instead of generating one word, then the next, then the next, a diffusion language model works on the entire output simultaneously. It starts with a rough draft of noise and iterates toward clarity in parallel. Think of it like the difference between writing a sentence word by word versus sketching the whole paragraph and then sharpening it.
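To make the contrast concrete, here's a toy sketch of one common diffusion-for-text recipe (masked denoising). Everything here is illustrative: `denoise` is a hypothetical stand-in for the model, and the commit schedule is made up for the example:

```python
import random

MASK = "_"

def diffusion_generate(length, denoise, steps=4, seed=0):
    """Toy masked-diffusion text generation.

    Every position starts as noise (a mask token). Each step, the
    "model" (denoise) proposes a token for EVERY masked position at
    once; we commit a portion of them, so the whole sequence sharpens
    in parallel instead of growing left to right.
    """
    rng = random.Random(seed)
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # Propose tokens for all masked positions simultaneously.
        proposals = {i: denoise(seq, i) for i in masked}
        # Commit roughly half each step; commit everything on the last step.
        k = len(masked) if step == steps - 1 else max(1, len(masked) // 2)
        for i in rng.sample(masked, k):
            seq[i] = proposals[i]
    return seq
```

With a toy denoiser that just assigns a letter to each position, `diffusion_generate(8, lambda seq, i: chr(97 + i))` resolves all eight positions to the letters a through h in four refinement passes, not eight sequential ones. The key property is that the number of model calls scales with the step count, not the output length.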
The result is dramatically faster generation. Not 20% faster. More like 10x faster.
Mercury 2 makes this real
In February 2026, Inception Labs launched Mercury 2, a diffusion-based reasoning LLM that generates around 1,000 tokens per second on H100 hardware. For comparison, Claude 4.5 Haiku runs at roughly 89 tokens per second. GPT-5 Mini hits about 71.
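Those throughput figures translate directly into wall-clock wait. A back-of-envelope calculation for a 500-token response, just dividing token count by the rates quoted above:

```python
# Rough time to stream a 500-token response at the quoted throughputs.
# Ignores network and time-to-first-token; generation time only.
tokens = 500
rates = {"Mercury 2": 1000, "Claude 4.5 Haiku": 89, "GPT-5 Mini": 71}

for model, tps in rates.items():
    print(f"{model:>18}: {tokens / tps:.1f}s")
# Mercury 2: 0.5s, Claude 4.5 Haiku: ~5.6s, GPT-5 Mini: ~7.0s
```

Half a second versus five to seven: that's the difference between a pause you notice and one you don't.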
Inception Labs was founded by Stefano Ermon, a Stanford professor who co-invented many of the diffusion methods used in image and video generation. The company raised $50M from Menlo Ventures with angel backing from Andrew Ng and Andrej Karpathy. Mercury 2 is already available on AWS Bedrock and Azure Foundry.
These aren't lab benchmarks that evaporate in production. This is a commercially available model running at speeds that make real-time AI interaction feel different.
What 10x speed means for assistants
Speed isn't just a nice-to-have for AI assistants. It changes what's possible.
When a model responds in milliseconds instead of seconds, agent workflows stop stalling. An assistant that needs to check your email, cross-reference your calendar, draft a reply, and confirm with you can do that entire loop before you've finished your sip of coffee. Today, each step in that chain has a noticeable pause. At 1,000 tokens per second, the chain collapses into something that feels instant.
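The multiplication across chained steps is what really hurts. As a rough illustration (the step count and per-step token budget here are assumptions, not measurements):

```python
# Illustrative agent loop: four sequential model calls
# (read email, check calendar, draft reply, confirm),
# each generating ~200 tokens.
steps, tokens_per_step = 4, 200

for tps in (89, 1000):  # autoregressive-class vs diffusion-class throughput
    total = steps * tokens_per_step / tps
    print(f"{tps} tok/s -> {total:.1f}s of generation for the whole chain")
# 89 tok/s:   ~9.0s of generation alone
# 1000 tok/s:  0.8s -- the chain feels like one response
```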
Voice interfaces benefit even more. The awkward gap between saying something and hearing a response is what makes most voice AI feel robotic. Cut that latency by 10x and the conversation starts to flow like a real one.
Then there's cost. Faster inference on the same hardware means serving more users per GPU. That cost reduction trickles down to the products built on top of these models. Always-on assistants that monitor your apps throughout the day become economically viable at a scale that doesn't work with current token-per-second economics.
The tradeoffs (for now)
Diffusion LLMs aren't a free upgrade. The technology is still maturing. Current models can produce repetition or truncated outputs. They work with fixed output lengths, which adds constraints that autoregressive models don't have. The ecosystem of optimization techniques that makes traditional models practical (speculative decoding, prefix caching) doesn't have equivalents in diffusion land yet.
And the best output quality still comes when you slow diffusion models down to roughly one token per step, which erases the speed advantage. The sweet spot is somewhere in between: fast enough to feel instant, careful enough to be reliable.
These are engineering problems, not dead ends. The trajectory is clear.
Why we're paying attention
At clawww.ai, we build AI assistants that connect to your real tools and act on your behalf. Every millisecond of model latency shows up in your experience as a pause, a stutter, a moment where you wonder if it's working. Faster models mean fewer of those moments.
We're watching diffusion LLMs closely because the jump from "responds in a few seconds" to "responds before you notice" is the jump from a tool you check on to a tool that just works in the background. That's the experience we're building toward.
The models are getting faster. The interesting question isn't whether AI assistants will feel instant. It's what becomes possible when they do.