AI Infra Summit highlights MLPerf Inference results from AMD and NVIDIA, as well as NVIDIA’s 2026 Vera Rubin roadmap, specifically Rubin CPX.
At the AI Infra Summit 2025, NVIDIA showcased momentum on two fronts: impressive new MLPerf Inference results from its Blackwell Ultra systems and, more significantly, a detailed roadmap for the 2026 Vera Rubin generation, including Rubin CPX, a new class of GPU purpose-built for massive-context inference.
Blackwell Ultra Sets New Performance Baselines
NVIDIA’s GB300 NVL72 rack-scale systems have already posted strong results in MLPerf Inference v5.1, demonstrating the architectural maturity of the Blackwell Ultra platform even as software continues to unlock its full potential. On the Llama 2 70B benchmark, the platform delivered 12,934 tokens per second per GPU in the offline scenario, while the online serving result was nearly identical at 12,701 tokens per second, evidence of consistent efficiency across serving conditions.
The platform’s readiness for real-world applications was further demonstrated in the newly introduced interactive category, which imposes substantially stricter latency constraints, including sub-500ms time-to-first-token requirements and a 33 tokens-per-second-per-user threshold. Even under these aggressive quality-of-service demands, Blackwell Ultra maintained high throughput, delivering 7,856 tokens per second per GPU. On the DeepSeek-R1 reasoning benchmark, the platform established another definitive baseline at 5,842 tokens per second per GPU.
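To put those figures in context, the short sketch below scales the per-GPU numbers to a full 72-GPU rack and estimates how many concurrent users each workload could serve at the interactive category’s 33 tokens-per-second floor. The per-GPU throughputs are taken from the results above; the linear scaling and the user-count division are simplifying assumptions for illustration, not MLPerf methodology.

```python
# Back-of-envelope scaling of the MLPerf per-GPU results to a GB300
# NVL72 rack (72 GPUs), plus a naive concurrency bound at the
# interactive category's per-user throughput floor.

GPUS_PER_RACK = 72
MIN_USER_TPS = 33  # interactive category: tokens/s each user must see

per_gpu_tps = {
    "Llama 2 70B offline":     12_934,
    "Llama 2 70B server":      12_701,
    "Llama 2 70B interactive":  7_856,
    "DeepSeek-R1":              5_842,
}

for scenario, tps in per_gpu_tps.items():
    rack_tps = tps * GPUS_PER_RACK            # assume linear scaling
    max_users = rack_tps // MIN_USER_TPS      # naive upper bound
    print(f"{scenario:24s} {rack_tps:>10,} tok/s per rack, "
          f"~{max_users:,} users at {MIN_USER_TPS} tok/s each")
```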
Ultimately, these results suggest that the hardware is running ahead of its software stack. Significant performance headroom remains to be unlocked as frameworks like TensorRT-LLM and NVIDIA Dynamo evolve to fully exploit Blackwell Ultra’s architectural advantages, such as its enhanced NVFP4 compute paths and 288GB of HBM3e capacity per GPU.
Accelerating Innovation Cadence: The Vera Rubin Platform
NVIDIA has adopted an annual architecture refresh cycle as a strategic response to the exponential growth in AI computational demands. Adhering to this aggressive timeline, NVIDIA revealed that the Vera Rubin generation is already taped out and slated for enterprise deployment in the second half of 2026.
The Vera Rubin architecture introduces a comprehensive platform refresh centered on the pairing of new Vera CPUs with Rubin GPUs. The Vera CPU represents a significant evolution from the Grace CPU that has anchored the last three generations of NVIDIA systems, featuring 88 Arm cores that support 176 threads. It also doubles the chip-to-chip (C2C) link bandwidth to 1,800 GB/s, providing a faster path between the CPU, GPU, and their shared memory resources.
At the interconnect layer, sixth-generation NVLink delivers 3,600 GB/s of bidirectional bandwidth, double that of the current fifth-generation NVLink switches. This enhanced connectivity becomes particularly critical as models continue to scale beyond the memory capacity of individual devices, requiring sophisticated parallel execution strategies that demand minimal communication latency and maximum throughput between nodes.
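A rough model shows why that bandwidth matters. The sketch below estimates the time for an idealized ring all-reduce of one layer’s activations under tensor parallelism; the hidden size, token count, GPU count, and the assumption that the full link bandwidth is achievable are illustrative choices, not measurements of any NVIDIA system.

```python
# Idealized ring all-reduce timing: each GPU sends and receives
# 2 * (N - 1) / N of the message across its NVLink connection.

def ring_allreduce_seconds(message_bytes: float, n_gpus: int,
                           link_gb_per_s: float) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic / (link_gb_per_s * 1e9)

HIDDEN = 16_384   # assumed model hidden dimension
TOKENS = 8_192    # assumed tokens in flight per batch
BYTES = 2         # FP16/BF16 activations
msg = HIDDEN * TOKENS * BYTES   # ~268 MB per layer all-reduce

for name, bw in [("fifth-gen NVLink, 1,800 GB/s", 1_800),
                 ("sixth-gen NVLink, 3,600 GB/s", 3_600)]:
    t = ring_allreduce_seconds(msg, n_gpus=8, link_gb_per_s=bw)
    print(f"{name}: {t * 1e6:.0f} us per layer")
```

Halving the per-layer communication time compounds across the dozens of layers and thousands of decode steps involved in serving a single long request.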
Complementing the NVLink advancement, the Spectrum-6 switch, incorporating co-packaged optics (CPO) technology, achieves 102 TB/s of switching capacity. The integration of optical components directly into the switch package eliminates traditional electrical-to-optical conversion bottlenecks, reducing latency while dramatically improving power efficiency—critical considerations as AI factories scale toward gigawatt power consumption levels.
The VR NVL144 systems will still utilize the proven Oberon rack platform that currently underpins Grace Hopper, Grace Blackwell, and Grace Blackwell Ultra deployments.
Architectural Nomenclature Evolution: From Packages to Dies
NVIDIA is shifting its naming convention from a package-based count to a die-based count. While the change may prove controversial, it is a forward-looking move that adds clarity, particularly ahead of the Rubin Ultra GPUs expected in 2027, which are slated to feature four reticle-sized dies per package.
With the Rubin generation, NVIDIA is adopting a die-count nomenclature that directly reflects the available computational resources. The NVL144 designation explicitly counts 144 GPU dies while retaining the 72-package physical configuration, giving a more precise measure of computational capacity. The silicon layout itself matches the current-generation GB200 and GB300 NVL72 systems, which likewise contain 72 GPU packages, each housing two GPU dies, for a total of 144 computational dies; only the counting convention has changed.
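The renaming is easy to sanity-check; a few lines of Python over the systems named above show that the physical layout is identical and only the advertised count changes.

```python
# Same rack-scale silicon, counted two ways: pre-Rubin names count GPU
# packages, Rubin-era names count GPU dies.
systems = [
    # (marketing name, GPU packages, dies per package)
    ("GB200 NVL72", 72, 2),   # named for 72 packages
    ("GB300 NVL72", 72, 2),   # named for 72 packages
    ("VR NVL144",   72, 2),   # same layout, now named for 144 dies
]

for name, packages, dies_per_pkg in systems:
    print(f"{name:12s} {packages} packages x {dies_per_pkg} dies "
          f"= {packages * dies_per_pkg} GPU dies")
```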
Addressing the Context Processing Challenge
The announcement of Rubin CPX, planned for availability in late 2026, is NVIDIA’s architectural response to one of the most pressing challenges in LLM inference: the fundamental mismatch between computational patterns during different phases of token generation. Inference proceeds in two phases with opposite resource profiles. The prefill (context) phase ingests the entire prompt at once and is compute-bound, while the decode (generation) phase emits one token at a time and is bound by memory bandwidth and capacity. Serving both phases on the same HBM-equipped GPU leaves expensive resources underutilized during one phase or the other, a mismatch that becomes acute at million-token context lengths.
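A back-of-envelope cost model makes the mismatch concrete. The sketch below uses standard first-order approximations (roughly two FLOPs per parameter per prompt token for prefill, and one full re-read of the weights per generated token for decode, ignoring KV-cache traffic); the 70B-parameter model and the request shape are illustrative assumptions.

```python
# First-order costs of the two inference phases for one long-context
# request. Model size and request shape are assumed for illustration.

PARAMS = 70e9          # assumed dense model, 70B parameters
BYTES_PER_PARAM = 2    # FP16/BF16 weights

def prefill_flops(prompt_tokens: int) -> float:
    # Compute-bound: ~2 FLOPs per parameter per prompt token, and the
    # whole prompt can be processed in parallel.
    return 2 * PARAMS * prompt_tokens

def decode_weight_bytes(new_tokens: int) -> float:
    # Bandwidth-bound: each generated token re-reads the full weight set.
    return PARAMS * BYTES_PER_PARAM * new_tokens

prompt, output = 1_000_000, 1_000   # 1M tokens in, 1K tokens out
print(f"prefill: {prefill_flops(prompt) / 1e15:,.0f} PFLOPs of raw compute")
print(f"decode:  {decode_weight_bytes(output) / 1e12:,.0f} TB of weight reads")
```

The first number is pure math throughput, which compute-dense silicon with cost-efficient memory can supply; the second is pure memory traffic, which is precisely what HBM-equipped GPUs are built to serve.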
Rubin CPX: Purpose-Built Architecture for Context Processing
Rubin CPX targets the compute-bound half of that split. Built as a monolithic die, it delivers 30 petaflops of NVFP4 compute paired with 128GB of cost-efficient GDDR7 memory rather than HBM, a deliberate trade-off given that the prefill phase stresses compute far more than memory bandwidth. NVIDIA also cites triple the attention-processing capability of GB300 NVL72 systems, along with integrated video encoders and decoders, positioning CPX for long-context workloads such as large-codebase software assistance and generative video.
Flexible Deployment Architectures and Configuration Options
NVIDIA plans to offer CPX both integrated and as an add-on. The flagship Vera Rubin NVL144 CPX rack combines 144 Rubin CPX GPUs, 144 Rubin GPUs, and 36 Vera CPUs, delivering 8 exaflops of NVFP4 compute alongside 100TB of fast memory and 1.7 PB/s of memory bandwidth in a single rack. Alternatively, a dedicated CPX expansion rack will pair with standard Vera Rubin NVL144 systems, letting operators scale context-processing capacity independently of generation capacity.
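In software terms, disaggregated serving reduces to a routing decision per request. The sketch below is a hypothetical illustration, not NVIDIA Dynamo’s actual interface: long-context requests are split so prefill runs on a CPX-class worker and decode on an HBM-equipped Rubin worker, with the KV cache handed off between them. The threshold and worker names are invented for the example.

```python
# Hypothetical request router for a disaggregated inference tier.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int

PREFILL_CUTOFF = 32_768  # assumed threshold for routing prefill to CPX

def route(req: Request) -> list[str]:
    """Return the (hypothetical) worker pipeline for a request."""
    if req.prompt_tokens >= PREFILL_CUTOFF:
        # Compute-bound prefill on CPX, bandwidth-bound decode on HBM.
        return ["rubin-cpx:prefill", "kv-cache-transfer", "rubin-hbm:decode"]
    # Short prompts aren't worth the handoff cost; keep them on one GPU.
    return ["rubin-hbm:prefill+decode"]

print(route(Request(prompt_tokens=1_000_000, max_new_tokens=2_048)))
print(route(Request(prompt_tokens=2_048, max_new_tokens=512)))
```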
Gigawatt-Scale Infrastructure Blueprint
Beyond individual system innovations, NVIDIA also unveiled reference architectures for gigawatt-scale AI factories. Developed in collaboration with infrastructure partners including Jacobs, Schneider Electric, Siemens Energy, and Vertiv, these blueprints address the complete infrastructure stack, from power generation to computational delivery. The reference designs acknowledge that next-generation AI deployments require holistic optimization that extends far beyond the computational components themselves.
These architectural blueprints utilize NVIDIA Omniverse digital twins to facilitate comprehensive facility simulation before physical deployment. Organizations can model power distribution, cooling systems, and computational workloads in unified simulations, identifying and addressing bottlenecks before committing to physical infrastructure.
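The planning math these blueprints formalize starts with a simple power budget, of the kind sketched below. The per-rack draw and PUE values are assumptions for illustration, not published specifications.

```python
# Capacity-planning arithmetic for a gigawatt-scale facility: how many
# racks fit once cooling and distribution overhead (PUE) is paid.

FACILITY_WATTS = 1e9   # a "gigawatt-scale" AI factory
RACK_KW = 140          # assumed per-rack draw for a rack-scale system
PUE = 1.2              # assumed power usage effectiveness

it_watts = FACILITY_WATTS / PUE          # power left for IT equipment
racks = int(it_watts / (RACK_KW * 1e3))
print(f"~{it_watts / 1e6:.0f} MW for compute -> roughly {racks:,} racks")
```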
Conclusion
NVIDIA continues to lead the AI infrastructure space with a forward-thinking, developer-centric approach that directly addresses the pain points faced by organizations and AI labs. The transition from general-purpose acceleration to workload-specific architectures, exemplified by Rubin CPX’s targeted approach to context processing, indicates that future AI systems will increasingly comprise heterogeneous computational resources optimized for every phase of AI workflows. This architectural evolution demands that organizations planning multi-year AI infrastructure investments consider not just raw computational throughput but also the alignment between hardware capabilities and evolving model architectures.
The accelerated innovation cadence, from Blackwell Ultra through Vera Rubin to Rubin CPX within a compressed timeline, is striking. Such a rapid pace requires organizations to design systems capable of integrating new architectural paradigms as they emerge, avoiding the lock-in that characterized previous generations of data center infrastructure. To address this challenge, NVIDIA’s AI Factory reference designs and Omniverse digital twins provide the blueprints and simulation tools for future-proofing these critical investments. As AI models continue their trajectory toward trillion-parameter scales and million-token contexts, the architectural innovations unveiled at the AI Infra Summit establish the frameworks and technologies that will define enterprise AI capabilities through the rest of the decade.