
Podcast #141: Graid Delivers Enterprise RAID Functionality to AI Workloads

by Harold Fritts

Gain insight into improving RAID performance for AI workloads with Graid SupremeRAID AE.

StorageReview has worked with Graid on numerous projects, including our most recent review, Storage at GPU Speed, and the company continues to impress. In this podcast, Brian chats with his friend, Kelley Osburn, Senior Director of OEM and Channel Business Development at Graid Technology.

Kelley is responsible for developing and executing the OEM and channel sales strategy, managing key partnerships and relationships, and expanding the company’s market presence and revenue. He leverages his expertise in flash memory, data storage, virtualization, cloud, DevOps, and containers/Kubernetes to deliver value-added solutions to Graid’s customers and partners.

Graid Technology delivers enterprise-grade RAID functionality for AI workloads with minimal infrastructure changes. Instead of requiring a dedicated hardware RAID card or custom appliance, SupremeRAID AE is offered as a software license and uses only a small share of an existing NVIDIA GPU. This lets organizations achieve enterprise-grade storage performance and reliability without consuming additional PCIe slots or significantly impacting GPU performance for training and inference workloads.

Brian and Kelley dive into the functionality of the Graid solution, why it’s critical in today’s AI landscape, and what the future holds.

If you are interested in improving RAID performance for AI workloads without adding more GPUs to your existing racks, this podcast delivers.

If you don’t have time to listen to the entire podcast, we have broken it down into five-minute segments so you can easily hop around to get the information you need.

Listener’s Guide

Why Graid? Problem Solved! [00–05]

  • Traditional RAID/MD-RAID hit a wall with NVMe and PCIe Gen4/5/6—hard x16-lane bottlenecks.
  • Unsafe stopgaps: RAID 0 + snapshots/checkpoints increase risk and admin overhead.
  • CPU-bound math in MD-RAID diverts cycles from apps; GPUs excel at parallel parity/EC math (see the sketch after this list).
  • Graid offloads to the GPU and uses peer‑to‑peer paths, so data bypasses the card (no x16 choke).
  • Anecdote: 16–44 drive servers magnify the physical lane mismatch of legacy HW RAID.
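That parity math deserves a quick illustration: it is pure elementwise work, which is exactly what GPUs chew through. Below is a minimal single-parity (RAID 5-style) sketch in NumPy; the stripe shape is arbitrary, and this shows the class of computation Kelley describes, not Graid's actual implementation.

```python
import numpy as np

# Illustrative stripe: 8 data chunks of 1 MiB each (shapes are arbitrary).
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(8, 1 << 20), dtype=np.uint8)

# Single parity (RAID 5-style) is a bytewise XOR across the stripe: an
# elementwise reduction with no branches or data dependencies, so it
# spreads trivially across CPU SIMD lanes or thousands of GPU threads.
parity = np.bitwise_xor.reduce(data, axis=0)

# RAID 6 adds a second syndrome built from Galois-field multiplies, and
# erasure coding generalizes to N syndromes; all of it stays elementwise.
```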

Outrunning “x16 limits” [05–10]

  • Affordable GPUs (A2000/A400) act as traffic cops, not data ferries—peer‑to‑peer NVMe routing.
  • Real builds: ~200 GB/s with ~24 drives; new paths push ~300 GB/s/server.
  • Design tradeoffs: ~24 x4 drives for peak perf; 40+ drives at x2 favors capacity/IOPS.
  • Small form factor: Low‑cost edition supports up to ~12 drives—ideal for AI desktops.
  • Banter: “How can it be faster than x16?” Because the data doesn’t traverse the GPU; the lane math below makes this concrete.
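That exchange is easy to verify with back-of-the-envelope lane math. The sketch below uses rounded per-lane bandwidth figures (approximate usable rates, not vendor specs) to compare the drives' aggregate lanes against the single x16 slot a hardware RAID card funnels everything through.

```python
# Approximate usable bandwidth per PCIe lane, in GB/s (rounded figures).
GBPS_PER_LANE = {"Gen4": 2.0, "Gen5": 4.0, "Gen6": 8.0}

def aggregate(gen: str, drives: int, lanes_per_drive: int = 4) -> float:
    """Total lane bandwidth of the drives themselves."""
    return GBPS_PER_LANE[gen] * lanes_per_drive * drives

def raid_card_ceiling(gen: str) -> float:
    """A hardware RAID card funnels everything through one x16 slot."""
    return GBPS_PER_LANE[gen] * 16

for gen in GBPS_PER_LANE:
    print(f"{gen}: 24 x4 drives = {aggregate(gen, 24):.0f} GB/s "
          f"vs x16 card ceiling = {raid_card_ceiling(gen):.0f} GB/s")

# Peer-to-peer RAID sidesteps the x16 ceiling because data never crosses
# the GPU; only parity math and control traffic touch it.
```

At Gen5, 24 x4 drives offer roughly 384 GB/s of raw lane bandwidth against a 64 GB/s card ceiling, which is why the ~200 GB/s real-world builds Kelley cites could never fit through a single x16 device.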

AI Edition: Use existing compute GPUs [12–15]

  • Run Graid on the H100 or Blackwell GPUs you already own; no extra slot is needed for a RAID GPU.
  • Tiny footprint: ~6/144 SMs on H100; “drop out” when idle to free resources.
  • GPUDirect Storage moves data from NVMe to GPU memory directly, with fewer hops and lower latency (see the sketch after this list).
  • Keep mid‑riser slots for 400/800G NICs and other accelerators.
  • Anecdote: “Feed the Beast.” Keep GPUs from starving on IO.
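For readers who haven't touched GPUDirect Storage, the programming model is small. Here is a rough sketch using RAPIDS KvikIO, one Python binding over NVIDIA's cuFile library; the path and buffer size are placeholders, and this illustrates the general GDS pattern rather than anything Graid ships.

```python
import cupy as cp
import kvikio

# Destination buffer allocated directly in GPU memory (size is arbitrary).
buf = cp.empty(1 << 20, dtype=cp.uint8)

# With GDS enabled, the read DMAs from NVMe straight into GPU memory,
# skipping the host bounce buffer: fewer hops, lower latency.
f = kvikio.CuFile("/data/shard-000.bin", "r")  # placeholder path
nbytes = f.read(buf)
f.close()

print(f"read {nbytes} bytes directly into device memory")
```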

Lab impact and edge patterns [15–20]

  • Modest tokens/sec impact during inference; storage and AI seldom peak at once.
  • Edge deployments: A single 2U with 2 GPUs handles capture, light training, and inference at scale.
  • Preserve IO slots by sharing existing GPUs with Graid.
  • Examples: YOLO-style wildlife, retail, and other ready-to-use inference models.
  • Tip: Size storage to avoid starvation under bursty pipelines; a quick sizing check follows this list.
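To make that tip concrete, here is a minimal sizing check with entirely hypothetical pipeline numbers; substitute measurements from your own workload.

```python
# Hypothetical pipeline numbers; substitute your own measurements.
samples_per_sec = 20_000         # peak GPU consumption rate in a burst
bytes_per_sample = 150 * 1024    # average preprocessed sample size
headroom = 1.5                   # margin for rebuilds, checkpoints, etc.

required_gb_s = samples_per_sec * bytes_per_sample * headroom / 1e9
print(f"storage must sustain ~{required_gb_s:.1f} GB/s at peak")
# If the protected pool's sustained read rate sits below this figure,
# the GPUs starve on IO: size to the bursts, not the average.
```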

Integrity, namespaces, and scale [20–25]

  • RAID is more relevant at the edge: Resilience keeps GPUs productive during faults.
  • Transient errors: Detect bad reads, rebuild from on-the-fly parity, and relocate data (sketched after this list).
  • Big namespaces: 61/122/256 TB-class drives make large protected pools valuable.
  • Scale via NVMe‑oF JBOFs (RoCE/RDMA); 2496+ raw devices still require software RAID.
  • Roadmap: Higher drive counts + erasure coding for flexible protection domains.
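The detect-rebuild-relocate sequence in the transient-errors bullet fits in a few lines. Below is a toy single-parity sketch, assuming per-chunk CRCs recorded at write time; production arrays use NVMe protection information and the RAID stack's own metadata, neither of which is modeled here.

```python
import zlib
import numpy as np

def read_chunk(stripe: np.ndarray, idx: int, crcs: list[int]) -> np.ndarray:
    """Return chunk idx, rebuilding from parity if its checksum fails."""
    chunk = stripe[idx]
    if zlib.crc32(chunk.tobytes()) == crcs[idx]:
        return chunk  # clean read: fast path, no RAID math involved
    # Bad read: XOR all surviving members (data + parity) to reconstruct,
    # then write the result back to a healthy location ("relocate").
    rebuilt = np.bitwise_xor.reduce(np.delete(stripe, idx, axis=0), axis=0)
    stripe[idx] = rebuilt  # stand-in for relocation
    return rebuilt

# Setup: 4 data chunks plus 1 parity chunk, CRCs captured at write time.
rng = np.random.default_rng(1)
data = rng.integers(0, 256, size=(4, 4096), dtype=np.uint8)
stripe = np.vstack([data, np.bitwise_xor.reduce(data, axis=0)[None, :]])
crcs = [zlib.crc32(c.tobytes()) for c in stripe]

stripe[2] ^= 0xFF  # simulate a corrupted read on one member
assert zlib.crc32(read_chunk(stripe, 2, crcs).tobytes()) == crcs[2]
```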

Beyond AI: Databases, streaming, analytics [26–30]

  • Databases/HFT: Redis, Aerospike, and Oracle benefit from fast, consistent writes.
  • Splunk/log ingest: Sustained throughput prevents drops and backpressure.
  • Media: Multi-8K stream servers can reduce the total server count for a workload.
  • Density/power constraints favor higher perf per RU in colocations.
  • Rebuilds: Parity flows through GPU; slight latency increase, rarely app‑visible.

Roadmap, PCIe Gen6, how to engage [30–35]

  • Software-first: moving from Gen4 to Gen5 boosts performance since Graid isn’t in the data path.
  • Gen6 flash (~28 GB/s x4) magnifies the need to remove chokepoints.
  • Peer-to-peer + GPU-offloaded math unlocks new PCIe generations without HW churn.
  • How to buy: Dell catalog, Supermicro OEM, channels (CDW), and Amazon.
  • Meet Graid: An NVIDIA Inception partner with booths at GTC Washington & San Jose.

Engage with StorageReview

Newsletter | YouTube | Podcast iTunes/Spotify | Instagram | Twitter | TikTok | RSS Feed