
NVIDIA Sets MLPerf Inference v6.0 Records with Blackwell Ultra Platform


NVIDIA has published results for MLPerf Inference v6.0, highlighting system-level gains driven by tight co-design across hardware, software, and models. The company positions inference throughput and token economics as the primary metrics for AI factory performance, moving beyond peak accelerator specifications to measured output under real workloads.

In this round, systems built on NVIDIA Blackwell Ultra GPUs delivered the highest throughput across all submitted models and scenarios. The ecosystem around the platform also expanded, with 14 partners submitting results, including major OEMs, cloud providers, and integrators such as ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, HPE, Lenovo, Nebius, Netweb Technology, QCT, Red Hat, Supermicro, and Lambda.

Expanded Benchmark Coverage Reflects Emerging Workloads

MLPerf Inference v6.0 introduces several new benchmarks to better represent current AI deployments. NVIDIA was the only vendor to submit results across all of the new tests, which span large language models, multimodal systems, generative video, and recommendation engines.

Key additions include DeepSeek-R1 Interactive, which tests a more interactive serving profile with faster token delivery and a lower time to first token than the prior server scenario. The suite also adds Qwen3-VL-235B-A22B, the first multimodal vision-language model in MLPerf Inference, and GPT-OSS-120B, a mixture-of-experts reasoning model tested across offline, server, and interactive scenarios.

Scenario | DeepSeek-R1 | GPT-OSS-120B | Qwen3-VL | Wan 2.2 | DLRMv3
Offline | 2,494,310 tokens/sec* | 1,046,150 tokens/sec | 79 samples/sec | 0.059 samples/sec | 104,637 samples/sec
Server | 1,555,110 tokens/sec* | 1,096,770 tokens/sec | 68 queries/sec | 21 secs (Single Stream)** | 99,997 queries/sec
Interactive | 250,634 tokens/sec | 677,199 tokens/sec | *** | *** | ***

* Not a new scenario in MLPerf Inference v6.0
** Wan 2.2 features a single stream scenario, which measures end-to-end request latency, instead of a server scenario. Lower is better.
*** Not tested in MLPerf Inference v6.0

Generative media and recommendation workloads are now included. The Wan 2.2 text-to-video model features both latency-sensitive and throughput-focused tests, while DLRMv3 replaces previous recommendation benchmarks with a transformer-based architecture that increases compute intensity and model complexity.

Software Optimization Drives Measurable Gains

A notable aspect of this submission is the performance uplift achieved on existing hardware through software updates. NVIDIA reports up to 2.7x higher token throughput on the GB300 NVL72 platform for DeepSeek-R1 server scenarios compared to results from six months prior. This improvement translates to materially lower cost per token and higher utilization of deployed infrastructure.
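To make the cost-per-token claim concrete, here is a back-of-the-envelope sketch in Python. The hourly rack cost is a made-up placeholder; only the per-GPU throughput figures (from the speedup table later in this article) and the 72-GPU rack size come from the reported results.

```python
# Back-of-the-envelope cost-per-token arithmetic. RACK_COST is a hypothetical
# placeholder; the per-GPU throughput values and 72-GPU rack size come from
# the published results, not from NVIDIA pricing.
def usd_per_million_tokens(rack_cost_per_hour: float, tokens_per_sec: float) -> float:
    return rack_cost_per_hour / (tokens_per_sec * 3600) * 1_000_000

RACK_COST = 300.0                  # hypothetical $/hour for a GB300 NVL72 rack
v5_1 = 2_907 * 72                  # DeepSeek-R1 server, tokens/sec per rack, v5.1 software
v6_0 = 8_064 * 72                  # same hardware after the v6.0 software stack

print(f"v5.1 software: ${usd_per_million_tokens(RACK_COST, v5_1):.2f} per million tokens")
print(f"v6.0 software: ${usd_per_million_tokens(RACK_COST, v6_0):.2f} per million tokens")
```

At a fixed infrastructure cost, cost per token falls in direct proportion to the throughput gain, which is the point NVIDIA is making about software updates on already-deployed systems.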


These gains are attributed to updates in the TensorRT-LLM stack and associated frameworks. Kernel-level optimizations and fusion techniques reduce execution overhead, while improved attention data parallelism more effectively balances workloads across GPUs. Additional enhancements in the Dynamo distributed inference framework enable disaggregated serving, allowing independent optimization of prefill and decode phases.
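Disaggregated serving is easiest to see as a scheduling pattern: prefill (prompt processing) is compute-bound, while decode (token generation) is bandwidth-bound, so splitting them lets each pool be sized and tuned separately. The sketch below is a toy illustration of that idea, assuming invented queue names and handoff format; it is not the Dynamo API.

```python
# Toy illustration of disaggregated serving: prefill and decode run in separate
# worker pools so each phase can be scaled and tuned independently.
# Not the Dynamo API; fields, queues, and the KV-cache handoff are invented.
from dataclasses import dataclass
from queue import Queue

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int
    kv_cache: str | None = None      # produced by prefill, consumed by decode

prefill_queue: Queue = Queue()       # compute-bound phase: processes the whole prompt at once
decode_queue: Queue = Queue()        # bandwidth-bound phase: emits one token per step

def prefill_worker(req: Request) -> None:
    req.kv_cache = f"kv[{req.prompt_tokens}]"   # stand-in for the real KV-cache transfer
    decode_queue.put(req)                       # hand off to the decode pool

def decode_worker(req: Request) -> list[str]:
    assert req.kv_cache is not None
    return [f"token_{i}" for i in range(req.max_new_tokens)]  # one step per generated token

prefill_queue.put(Request(prompt_tokens=512, max_new_tokens=4))
while not prefill_queue.empty():
    prefill_worker(prefill_queue.get())
while not decode_queue.empty():
    print(decode_worker(decode_queue.get()))
```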

For mixture-of-experts models, techniques like Wide Expert Parallel distribute expert weights across GPUs to reduce memory bottlenecks. Multi-token prediction boosts compute efficiency in low-batch, latency-sensitive scenarios by generating and validating multiple tokens at once. KV-aware routing further enhances scheduling by directing inference requests based on estimated compute costs.
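Of these techniques, KV-aware routing is the simplest to sketch: send each request to the worker whose estimated cost is lowest, crediting workers that already hold a matching prompt prefix in their KV cache. The cost model and numbers below are illustrative assumptions, not NVIDIA's scheduler.

```python
# Toy KV-aware routing policy: estimate each worker's cost for a request and
# credit workers that can reuse a cached prompt prefix. The cost model,
# 0.8 reuse factor, and all values below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    queued_tokens: int               # work already waiting on this worker
    cached_prefixes: set[str]        # prompt prefixes resident in its KV cache

def estimated_cost(w: Worker, prompt: str, prompt_tokens: int) -> float:
    cost = w.queued_tokens + prompt_tokens
    if any(prompt.startswith(p) for p in w.cached_prefixes):
        cost -= 0.8 * prompt_tokens  # assumed prefill saving from KV-cache reuse
    return cost

def route(workers: list[Worker], prompt: str, prompt_tokens: int) -> Worker:
    return min(workers, key=lambda w: estimated_cost(w, prompt, prompt_tokens))

workers = [
    Worker("gpu-0", queued_tokens=500, cached_prefixes={"You are a support agent"}),
    Worker("gpu-1", queued_tokens=300, cached_prefixes=set()),
]
# gpu-0 wins despite its longer queue because it can reuse the cached prefix.
print(route(workers, "You are a support agent. Handle this ticket.", 400).name)
```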

Benchmark | GB300 NVL72 v5.1 | GB300 NVL72 v6.0 | Speedup
DeepSeek-R1 (Server) | 2,907 tokens/sec/GPU | 8,064 tokens/sec/GPU | 2.77x
DeepSeek-R1 (Offline) | 5,842 tokens/sec/GPU | 9,821 tokens/sec/GPU | 1.68x
Llama 3.1 405B (Server) | 170 tokens/sec/GPU | 259 tokens/sec/GPU | 1.52x
Llama 3.1 405B (Offline) | 224 tokens/sec/GPU | 271 tokens/sec/GPU | 1.21x

NVIDIA also demonstrated continued scaling on established models. On Llama 3.1 405B, the GB300 NVL72 platform achieved a 1.5x performance increase in server scenarios, indicating ongoing optimization for dense LLMs alongside newer architectures.
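The speedup column in the table above is simply the ratio of v6.0 to v5.1 per-GPU throughput, which is easy to verify against the reported figures:

```python
# Recomputing the speedup column from the per-GPU throughput values above.
v5_1 = {"DeepSeek-R1 (Server)": 2907, "DeepSeek-R1 (Offline)": 5842,
        "Llama 3.1 405B (Server)": 170, "Llama 3.1 405B (Offline)": 224}
v6_0 = {"DeepSeek-R1 (Server)": 8064, "DeepSeek-R1 (Offline)": 9821,
        "Llama 3.1 405B (Server)": 259, "Llama 3.1 405B (Offline)": 271}

for bench in v5_1:
    print(f"{bench}: {v6_0[bench] / v5_1[bench]:.2f}x")  # 2.77x, 1.68x, 1.52x, 1.21x
```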

Open Ecosystem and Framework Integration

Submissions across new workloads leveraged a mix of NVIDIA and open-source frameworks. The Qwen3-VL benchmark used the vLLM framework, reflecting the rapid development in multimodal inference optimization. The Wan 2.2 text-to-video results were powered by TensorRT-LLM VisualGen, targeting diffusion-based pipelines on GPUs.
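For readers unfamiliar with vLLM, a minimal text-only sketch is shown below. It is not the MLPerf harness configuration: the model identifier and tensor-parallel size are assumptions, and the actual Qwen3-VL benchmark also feeds image inputs, which are omitted here.

```python
# Minimal vLLM sketch. The model id and tensor_parallel_size are assumptions,
# and the real MLPerf run also supplies image inputs; this text-only call is
# only meant to show the basic vLLM offline-inference API.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-235B-A22B", tensor_parallel_size=8)
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Describe the chart shown in the attached image."], params)
print(outputs[0].outputs[0].text)
```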

For DLRMv3, NVIDIA combined its recsys-example framework with GPU-accelerated embedding lookup technologies to handle the increased demands of transformer-based recommendation models. These integrations underscore the role of the broader software ecosystem in extracting performance from the underlying hardware.
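As a rough picture of where embedding lookup sits in a DLRM-style recommender, the PyTorch snippet below gathers categorical IDs from embedding tables before the dense interaction layers (transformer-based in DLRMv3). It is a toy sketch; table sizes and feature names are made up, and it does not reflect NVIDIA's recsys-example framework.

```python
# Toy embedding-lookup stage of a DLRM-style recommender: sparse categorical
# IDs are gathered from embedding tables, then concatenated for the dense
# interaction layers. Sizes and feature names are illustrative assumptions.
import torch
import torch.nn as nn

tables = nn.ModuleList([
    nn.EmbeddingBag(num_embeddings=100_000, embedding_dim=128, mode="sum"),  # e.g. user IDs
    nn.EmbeddingBag(num_embeddings=50_000, embedding_dim=128, mode="sum"),   # e.g. item IDs
])

user_ids = torch.tensor([12, 57, 12])      # one categorical ID per example in the batch
item_ids = torch.tensor([401, 7, 999])
offsets = torch.tensor([0, 1, 2])          # one bag per example

sparse_features = [tables[0](user_ids, offsets), tables[1](item_ids, offsets)]
dense_input = torch.cat(sparse_features, dim=1)  # fed into the interaction/transformer layers
print(dense_input.shape)                         # torch.Size([3, 256])
```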

Scale-Out Performance with InfiniBand

NVIDIA also showcased large-scale inference performance using four GB300 NVL72 systems connected via Quantum-X800 InfiniBand. This setup, with a total of 288 Blackwell Ultra GPUs, marks the largest MLPerf Inference submission to date and achieved system-level throughput of millions of tokens per second on DeepSeek-R1.

DeepSeek-R1 on 4x GB300 NVL72 | Tokens/sec
Offline | 2,494,310
Server | 1,555,110
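For scale, a quick bit of arithmetic on these aggregate figures (all numbers taken from the results above):

```python
# Sanity check on the scale-out results: aggregate throughput divided across
# the 288 GPUs in the four-rack submission.
gpus = 4 * 72                    # four GB300 NVL72 racks, 72 Blackwell Ultra GPUs each
offline = 2_494_310              # aggregate tokens/sec, DeepSeek-R1 offline
server = 1_555_110               # aggregate tokens/sec, DeepSeek-R1 server

print(gpus)                      # 288 GPUs
print(round(offline / gpus))     # ~8,661 tokens/sec per GPU, offline scenario
print(round(server / gpus))      # ~5,400 tokens/sec per GPU, server scenario
```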

The results highlight the importance of high-performance interconnects in scaling inference workloads, particularly for distributed LLM serving and high-throughput batch processing.

Toward Service-Level Benchmarking

Looking ahead, NVIDIA is helping develop the MLPerf Endpoints benchmark within the MLCommons consortium. This upcoming benchmark aims to measure deployed inference services using real API traffic, providing insight into latency, throughput, and efficiency at the service level rather than only at the component level.
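To illustrate the kind of metric a service-level benchmark targets, the sketch below measures time to first token and rough generation throughput against an OpenAI-compatible endpoint using the openai Python client. It is not the MLPerf Endpoints harness; the base URL, API key, and model name are placeholders.

```python
# Illustrative service-level measurement against an OpenAI-compatible endpoint.
# Not the MLPerf Endpoints harness; base_url, api_key, and model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="deepseek-r1",                   # placeholder model name
    messages=[{"role": "user", "content": "Summarize MLPerf Inference v6.0."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1                        # counting content chunks as a rough token proxy

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.3f}s, ~{chunks / elapsed:.1f} tokens/sec")
```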

As AI workloads develop into agentic systems with longer context windows, benchmarks that measure end-to-end service performance are expected to become more important for both cloud providers and enterprise deployments.


Harold Fritts

I have been in the tech industry since IBM created the Selectric. My background, though, is in writing, so I decided to get out of the pre-sales business and return to my roots, doing a bit of writing while still being involved in technology.