Xiaomi's MiMo Hits Over 1,000 Tokens per Second on Commodity GPUs, Outpacing GPT-5.5 and Claude
Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, a serving mode for its trillion-parameter MiMo model that generates more than 1,000 tokens per second and peaks near 1,200 in demonstrations. The benchmark ran on a single 8-GPU commodity node with standard hardware and no custom chips. According to Artificial Analysis figures cited in coverage, GPT-5.5, the model most ChatGPT users interact with, generates 68 tokens per second, Claude Opus 4.6 about 71, the smaller Haiku 98, and Gemini Flash 192. MiMo-V2.5-Pro-UltraSpeed, by contrast, delivers 1,000 tokens per second on a model that matches Opus on coding benchmarks.
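To make the gap concrete, the ratios implied by the figures quoted above can be worked out in a few lines of Python. This is a rough illustration only: the names in the snippet are invented for the sketch, the inputs are the numbers cited in this article, and MiMo's rate is taken as the round 1,000 tokens per second, not the peak near 1,200.

```python
# Tokens per second as quoted in this article (Artificial Analysis figures cited in coverage).
QUOTED_TOKENS_PER_SECOND = {
    "GPT-5.5": 68,
    "Claude Opus 4.6": 71,
    "Haiku": 98,
    "Gemini Flash": 192,
}

# Round figure from the article, not the ~1,200 peak seen in demonstrations.
MIMO_TOKENS_PER_SECOND = 1_000


def speedup_over(baseline_tps, fast_tps=MIMO_TOKENS_PER_SECOND):
    """Return how many times faster fast_tps is than baseline_tps."""
    return fast_tps / baseline_tps


def speedup_table():
    """Map each quoted model to MiMo's speedup over it, rounded to one decimal."""
    return {
        name: round(speedup_over(tps), 1)
        for name, tps in QUOTED_TOKENS_PER_SECOND.items()
    }


if __name__ == "__main__":
    for name, ratio in speedup_table().items():
        print(f"{name}: about {ratio}x")
```

On these numbers the gap ranges from roughly 5x over Gemini Flash to roughly 15x over GPT-5.5.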
The result puts Xiaomi ahead of dedicated inference hardware providers. Cerebras, which designed a wafer-scale chip the size of a dinner plate with 44GB of on-chip memory, hit 969 tokens per second on Meta's Llama 3.1 405B, a 405-billion-parameter model less than half the size of MiMo-V2.5-Pro. Groq's custom Language Processing Unit architecture tops out between 300 and 750 tokens per second depending on the model. Neither system runs on hardware available through standard cloud rentals, whereas Xiaomi achieved its result on commodity GPUs through software alone.
Two techniques drive the speed gain. The first, FP4 quantization, compresses the expert layers, which make up most of the 1 trillion parameters, down to 4-bit precision, reducing both memory footprint and bandwidth pressure. Xiaomi applies the compression only to the expert layers and keeps the rest of the model at full precision to limit quality degradation, which the company describes as near-zero. The second technique, DFlash speculative decoding, replaces the sequential drafting step of standard speculative decoding with a drafter that fills a whole block of masked positions in a single forward pass. The main model then verifies the block; in coding tasks, it accepts an average of 6.3 out of 8 proposed tokens per verification round.
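A back-of-the-envelope sketch in Python can illustrate why both techniques matter. It is not Xiaomi's implementation. The share of parameters held in expert layers (0.9) and the 16-bit baseline are assumptions chosen for illustration, since the article says only "most" and "full precision". The weight estimate ignores the scaling factors that real FP4 formats store. The verification function shows only the simple greedy case, where a drafted token is kept if it matches the main model's own choice; the DFlash drafter itself is not modeled. All function names are hypothetical.

```python
def weight_gigabytes(total_params, expert_fraction, expert_bits, other_bits):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring scales and metadata."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    total_bits = expert_params * expert_bits + other_params * other_bits
    return total_bits / 8 / 1e9


def accepted_prefix_length(draft_tokens, target_tokens):
    """Greedy verification: count leading draft tokens that match the main model's picks."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # everything after the first mismatch is discarded
        accepted += 1
    return accepted


if __name__ == "__main__":
    TOTAL_PARAMS = 1e12      # "trillion-parameter", per the article
    EXPERT_FRACTION = 0.9    # ASSUMPTION: the article only says "most"
    BASELINE_BITS = 16       # ASSUMPTION: 16-bit weights as the unquantized baseline

    before = weight_gigabytes(TOTAL_PARAMS, EXPERT_FRACTION, BASELINE_BITS, BASELINE_BITS)
    after = weight_gigabytes(TOTAL_PARAMS, EXPERT_FRACTION, 4, BASELINE_BITS)
    print(f"weights before: {before:.0f} GB, with 4-bit experts: {after:.0f} GB")

    # Made-up token IDs: the drafter proposes 8 tokens, the main model disagrees at position 7.
    draft = [11, 42, 7, 7, 19, 3, 88, 5]
    target = [11, 42, 7, 7, 19, 3, 90, 5]
    print("accepted this round:", accepted_prefix_length(draft, target))
```

Under these assumptions the weights shrink from about 2,000 GB to about 650 GB, and the made-up round accepts 6 of 8 drafted tokens, close to the 6.3 average reported for coding tasks. Each accepted token is one the main model did not have to generate in its own sequential step, which is where the speedup comes from.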
A purpose-built inference engine called TileRT ties the system together by keeping the entire compute pipeline continuously resident inside the GPU, eliminating per-operator launch overhead and execution gaps. The combination of FP4 quantization, DFlash, and TileRT allowed Xiaomi to achieve its reported throughput without specialized silicon.