At the 2025 OCP Global Summit, Intel emphasized AI inference by unveiling two key advancements: a new data center GPU called “Crescent Island” and a rack-scale reference design for Gaudi 3. Both developments align with the growing transition from model training to real-time, widespread inference, where factors such as latency, memory bandwidth, efficiency, and operational simplicity are now crucial.
Intel’s CTO, Sachin Katti, summarized the transition by noting that as agentic AI becomes more widespread, inference turns continuous, context-rich, and increasingly system-intensive. To manage the growing token volumes, complex modality combinations, and strict SLA requirements, a heterogeneous infrastructure combining optimal silicon with an open, developer-centric software stack becomes necessary. In this environment, Intel’s Xe architecture data center GPUs offer the capacity and reliability needed as sequence lengths and token rates increase, while Gaudi 3 supports an open, scalable inference ecosystem with predictable total cost of ownership (TCO).
The key takeaway is not simply “faster chips.” Inference success hinges on systems integration: topology-aware memory, flexible interconnects, power and cooling that match utilization profiles, and orchestration that treats models, tokens, and streams as first-class citizens. Enterprises don’t want lock-in at precisely the moment they need to iterate across models, frameworks, and serving stacks.
Intel argues it’s uniquely positioned to deliver end-to-end breadth, with AI PCs, industrial edge, and data center racks grounded in Xeon 6 CPUs, Gaudi 3 accelerators, and Intel GPUs. The common themes include:
Partnering with OCP supports the company’s preference for open, referenceable designs, which are easier to procure, validate, and scale in operations.
Crescent Island is Intel’s upcoming data center GPU designed for air-cooled enterprise servers that need high token throughput without requiring advanced power or cooling solutions. Its focus is practical: prioritize cost efficiency and energy performance while delivering the memory capacity and bandwidth essential for modern inference tasks.
Highlights include:
Crescent Island is about inference economics at scale: it is air-cooled, capacity-forward, and focused on performance per watt. If your workloads are dominated by token-heavy serving with strict latency SLOs, the 160 GB of onboard memory and efficient data types should translate into more concurrent sessions per watt and fewer cluster surprises as sequence lengths creep up.
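To make the capacity argument concrete, here is a rough back-of-the-envelope sketch of how many concurrent sessions fit in 160 GB once the model weights are resident. Only the 160 GB figure comes from the article. The model shape (32 layers, 8 KV heads, head dimension 128), the 16-bit KV cache, and the 8 GB of weights are hypothetical values chosen for illustration, and the estimate ignores activations, fragmentation, and runtime overhead.

```python
GB = 10**9  # decimal gigabytes, the unit memory capacity is usually quoted in


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_sessions(capacity_bytes, weight_bytes, per_token_bytes, seq_len):
    """Whole sessions of length seq_len that fit after the weights are loaded."""
    free = capacity_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // (per_token_bytes * seq_len)


# Hypothetical model shape and weight footprint (assumptions, not vendor data).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)

for seq_len in (8_192, 32_768):
    sessions = max_sessions(160 * GB, 8 * GB, per_token, seq_len)
    print(f"{seq_len} tokens per session: {sessions} sessions")
```

Under these assumptions, quadrupling the sequence length cuts the session count to roughly a quarter, which is why capacity matters as sequence lengths creep up.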
Gaudi 3 extends from PCIe deployments to complete rack-scale configurations, offering a path to scale without forcing an all-at-once architectural bet. The new rack-scale reference design targets enterprises standardizing on large-model inference and latency-critical real-time systems.
Key elements:
In production inference, memory topology and consistent thermal management matter more than peak theoretical performance. The Gaudi 3 rack, which features 8.2 TB of HBM across 64 accelerators, addresses practical needs such as loading multiple large model variants, managing long prompts, and absorbing bursts without thrashing. Liquid cooling isn’t optional at these densities; it’s essential for sustaining QPS and latency under thermal load.
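The rack figures can be checked with simple arithmetic: 8.2 TB spread across 64 accelerators is about 128 GB each (64 × 128 GB = 8,192 GB). The sketch below derives that number and then estimates how many copies of a model fit in one rack. The 140 GB weight footprint and the 70% usable fraction (the rest reserved for KV cache and runtime overhead) are hypothetical assumptions, not figures from Intel.

```python
import math

RACK_HBM_GB = 8200   # 8.2 TB of HBM per rack, as stated in the article
ACCELERATORS = 64    # accelerators per rack, as stated in the article

per_accel_gb = RACK_HBM_GB / ACCELERATORS


def replicas_per_rack(weights_gb, usable_fraction):
    """Model copies that fit when each copy is sharded across whole accelerators.

    usable_fraction is the share of each accelerator's HBM available for
    weights (an assumption; the remainder is left for KV cache and overhead).
    """
    usable_gb = per_accel_gb * usable_fraction
    accels_per_replica = math.ceil(weights_gb / usable_gb)
    return ACCELERATORS // accels_per_replica


print(f"HBM per accelerator: {per_accel_gb} GB")
print(f"Replicas of a hypothetical 140 GB model: {replicas_per_rack(140, 0.7)}")
```

With these inputs each replica spans two accelerators, so the rack holds 32 copies. Changing the usable fraction or model size shifts that count in whole-accelerator steps.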