AI Infra Summit highlights MLPerf Inference results from AMD and NVIDIA, as well as NVIDIA’s 2026 Vera Rubin roadmap, specifically Rubin CPX.
At the AI Infra Summit 2025, NVIDIA showcased momentum on two fronts: impressive new MLPerf Inference results from its Blackwell Ultra systems and, more significantly, a detailed roadmap for the 2026 Vera Rubin generation, including Rubin CPX, a new class of GPU purpose-built for massive-context inference.
Blackwell Ultra Sets New Performance Baselines
NVIDIA’s GB300 NVL72 rack-scale systems have already achieved remarkable results in MLPerf Inference v5.1, showcasing the architectural maturity of the Blackwell Ultra platform even as software continues to unlock its full potential. That strength is clearest on the Llama 2 70B benchmark, where the platform delivered an impressive 12,934 tokens per second per GPU in the offline scenario. Performance in the server scenario was nearly identical at 12,701 tokens per second per GPU, underscoring the architecture’s efficiency across serving modes.
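To put those per-GPU figures in rack-level terms, a quick back-of-envelope calculation helps. This is a sketch assuming the 72 GPUs implied by the NVL72 designation and idealized linear scaling, which real deployments only approximate:

```python
# Back-of-envelope: aggregate rack throughput from MLPerf per-GPU results.
# Assumes linear scaling across the 72 GPUs in a GB300 NVL72 rack,
# which is an idealization; real rack-level efficiency varies.

GPUS_PER_RACK = 72  # "NVL72" = 72 GPUs in one NVLink domain

results_tok_per_s_per_gpu = {
    "Llama 2 70B (offline)": 12_934,
    "Llama 2 70B (server)": 12_701,
    "Llama 2 70B (interactive)": 7_856,
    "DeepSeek-R1": 5_842,
}

for benchmark, per_gpu in results_tok_per_s_per_gpu.items():
    rack_total = per_gpu * GPUS_PER_RACK
    print(f"{benchmark}: {per_gpu:,} tok/s/GPU -> ~{rack_total:,} tok/s per rack")
```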
The platform’s readiness for real-world applications was further demonstrated in the newly introduced interactive category, which imposes substantially stricter latency constraints, including sub-500ms time-to-first-token requirements and a 33 tokens-per-second-per-user threshold. Even under these aggressive quality-of-service demands, Blackwell Ultra maintained high throughput, delivering 7,856 tokens per second per GPU. On the DeepSeek-R1 reasoning benchmark, the platform established another definitive baseline at 5,842 tokens per second per GPU.
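As a rough illustration of what those constraints imply, the 33 tokens-per-second-per-user floor translates to an inter-token budget of about 30 ms, and dividing the measured throughput by the per-user rate gives a theoretical ceiling on concurrent streams. This is illustrative arithmetic only; production schedulers reserve headroom:

```python
# Rough implications of the MLPerf interactive-scenario constraints.
# Illustrative arithmetic only; real serving systems leave headroom
# for batching, scheduling jitter, and prefill bursts.

ttft_limit_s = 0.500          # sub-500ms time-to-first-token requirement
min_user_rate_tok_s = 33      # per-user decode-rate floor
gpu_throughput_tok_s = 7_856  # Blackwell Ultra interactive result (per GPU)

print(f"TTFT requirement: <{ttft_limit_s * 1000:.0f} ms")

# The per-user rate floor implies a maximum time per output token.
max_inter_token_latency_ms = 1000 / min_user_rate_tok_s
print(f"Inter-token latency budget: ~{max_inter_token_latency_ms:.0f} ms/token")

# Upper bound on concurrent user streams one GPU could sustain at the floor.
max_concurrent_users = gpu_throughput_tok_s // min_user_rate_tok_s
print(f"Theoretical concurrency ceiling: ~{max_concurrent_users} users/GPU")
```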
Ultimately, these results suggest the hardware is running ahead of its software stack. Significant performance headroom remains to be unlocked as frameworks like TensorRT-LLM and NVIDIA Dynamo evolve to fully exploit Blackwell Ultra’s architectural advantages, such as its enhanced NVFP4 compute paths and the massive 288GB HBM3e capacity per GPU.
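A hedged sizing sketch shows why NVFP4 and the 288GB of HBM3e matter together: at 4 bits per weight, parameter storage halves relative to FP8, leaving more room for KV cache. The model sizes below are hypothetical, and the math ignores activations and runtime overhead:

```python
# Illustrative sizing: how NVFP4 weights interact with 288 GB of HBM3e.
# Ignores activation memory, KV-cache growth with context, and overhead.

HBM_PER_GPU_GB = 288
BYTES_PER_PARAM_NVFP4 = 0.5   # 4-bit weights
BYTES_PER_PARAM_FP8 = 1.0     # 8-bit weights, for comparison

for params_b in (70, 405):  # hypothetical model sizes, in billions
    fp8_gb = params_b * BYTES_PER_PARAM_FP8
    nvfp4_gb = params_b * BYTES_PER_PARAM_NVFP4
    kv_headroom = HBM_PER_GPU_GB - nvfp4_gb
    print(f"{params_b}B params: FP8 ~{fp8_gb:.0f} GB, NVFP4 ~{nvfp4_gb:.0f} GB, "
          f"leaving ~{kv_headroom:.0f} GB of {HBM_PER_GPU_GB} GB for KV cache")
```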
Accelerating Innovation Cadence: The Vera Rubin Platform
NVIDIA has adopted an annual architecture refresh cycle as a strategic response to the exponential growth in AI computational demands. Adhering to this aggressive timeline, NVIDIA revealed that the Vera Rubin generation is already taped out and slated for enterprise deployment in the second half of 2026.
The Vera Rubin architecture introduces a comprehensive platform refresh centered on the pairing of new Vera CPUs with Rubin GPUs. The Vera CPU represents a significant evolution from the Grace CPU used in the last three generations of NVIDIA systems, featuring 88 custom Arm cores that support 176 threads. It also doubles the chip-to-chip (C2C) link bandwidth to 1,800 GB/s, up from Grace’s 900 GB/s, accelerating the path between the CPU, the GPU, and their shared memory resources.
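One way to appreciate the doubled C2C link is to estimate how long it takes to stage a payload between CPU-attached memory and the GPU. The figures below are peak-rate idealizations (sustained bandwidth is always lower), and the 100GB payload is purely illustrative:

```python
# Time to move a payload across the CPU-GPU C2C link, at peak rates.
# Sustained throughput will be lower than these theoretical figures.

GRACE_C2C_GB_S = 900    # C2C bandwidth in current Grace-based systems
VERA_C2C_GB_S = 1_800   # doubled C2C bandwidth on Vera

payload_gb = 100  # e.g., weights or KV cache spilled to CPU memory

for name, bw in (("Grace", GRACE_C2C_GB_S), ("Vera", VERA_C2C_GB_S)):
    print(f"{name}: {payload_gb} GB / {bw} GB/s = "
          f"{payload_gb / bw * 1000:.0f} ms")
```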
At the interconnect layer, sixth-generation NVLink delivers 3,600 GB/s of bidirectional bandwidth per GPU, double the 1,800 GB/s of the current fifth-generation NVLink. This enhanced connectivity becomes particularly critical as models continue to scale beyond the memory capacity of individual devices, requiring sophisticated parallel execution strategies that demand minimal communication latency and maximum throughput between GPUs.
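The effect on multi-GPU parallelism can be sketched the same way: in an idealized ring all-reduce over N GPUs, each GPU moves roughly 2 x (N-1)/N times the message size, so doubling link bandwidth roughly halves per-step communication cost. The message size below is hypothetical, and using the bidirectional per-GPU figures makes this an optimistic bound:

```python
# Idealized ring all-reduce cost at NVLink 5 vs. NVLink 6 rates.
# Peak-bandwidth estimate; real collectives add latency and overhead.

def ring_allreduce_ms(message_gb: float, n_gpus: int, link_gb_s: float) -> float:
    # Each GPU sends/receives ~2 * (N-1)/N of the message in a ring all-reduce.
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * message_gb
    return traffic_gb / link_gb_s * 1000

MESSAGE_GB = 4   # hypothetical per-step activation/gradient message
N_GPUS = 72      # one NVL72-scale NVLink domain

for gen, bw in (("NVLink 5", 1_800), ("NVLink 6", 3_600)):
    print(f"{gen}: ~{ring_allreduce_ms(MESSAGE_GB, N_GPUS, bw):.2f} ms per all-reduce")
```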
Addressing the Context Processing Challenge

Rubin CPX: Purpose-Built Architecture for Context Processing

Flexible Deployment Architectures and Configuration Options


Gigawatt-Scale Infrastructure Blueprint

Conclusion
NVIDIA continues to lead the AI infrastructure space with a forward-looking, developer-centric approach that directly addresses the pain points faced by organizations and AI labs. The shift from general-purpose acceleration to workload-specific architectures, exemplified by Rubin CPX’s targeted approach to context processing, signals that future AI systems will increasingly combine heterogeneous computational resources, each optimized for a specific phase of the AI workflow. This architectural evolution demands that organizations planning multi-year AI infrastructure investments consider not just raw computational throughput but also the alignment between hardware capabilities and evolving model architectures.