Hardly a week goes by that we don’t hear from an IT vendor about the impact their solutions have on organizations involved in artificial intelligence, deep learning, machine learning or edge intelligence. The problem, however, is that material insight into how these solutions affect the performance of those workloads is lacking. Recently we decided to see if we could do something about that by partnering with byteLAKE, an AI and HPC solutions builder based in Poland. The main objective: evaluate the impact of storage and GPU on AI workloads.
Impact of Storage on AI
Initially, we wanted to explore the popular notion that local storage impacts the performance of AI models. We took one of the Dell EMC PowerEdge R740xd servers in our lab, configured with two Intel Xeon Gold 6130 CPUs and 256GB of DRAM. We ran the byteLAKE AI test using three different local storage alternatives: a legacy KIOXIA PX04S SSD along with the much faster Samsung 983 ZET and Intel Optane 900P.
During the benchmark, we analyzed the performance of the AI learning process. In the tests, we ran the training process for a real-world scenario; the tests were part of the training procedure for one of byteLAKE's products, EWA Guard. It is based on the latest YOLO (You Only Look Once), a state-of-the-art real-time object detection model. The model consists of a single input layer, 22 convolution layers, 5 pooling layers, 2 router layers, a single reorg layer and a single detection layer.
As a basic performance metric, we used the execution time of training for 5000 epochs. The benchmarks were repeated three times for each storage configuration, and the average values are presented below.
- KIOXIA: 98h 24m
- Samsung: 98h 44m
- Intel: 98h 42m
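The measurement procedure described above can be sketched as a simple timing harness. This is a hedged illustration, not byteLAKE's actual test code; `train_epochs` is a hypothetical stand-in for the real training run.

```python
import time

def train_epochs(epochs):
    # Hypothetical placeholder for the real training run; here it just sleeps
    # briefly so the harness can be demonstrated end-to-end.
    time.sleep(0.01)

def benchmark(run_training, epochs=5000, repeats=3):
    """Time `repeats` full training runs and return the average wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_training(epochs)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

avg = benchmark(train_epochs)
print(f"average training time: {avg:.3f}s")
```

Wall-clock time over repeated full runs, averaged, is the same metric reported in the lists above, just at toy scale.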
As is clear in the data, local storage had no impact on performance. Testing ranged from a SATA SSD to the latest and greatest Optane, with no measurable difference. That said, storage may play a more important role when it comes to data ingress and egress, but computationally for AI, in this case there was no impact.
Impact of GPU and Storage on AI
With the storage data in-hand, we added a single NVIDIA T4 to the PowerEdge to gauge impact of a GPU on the AI. For this testing, we ran the same three storage configurations as well.
- KIOXIA: 4h 30m
- Samsung: 4h 28m
- Intel: 4h 27m
As expected, the GPU made an impact, and a dramatic one at that, driving a roughly 22x improvement. With the GPU accelerating the overall performance of the AI, there was some thought that faster storage might now make a difference. That was not the case, however, as the SATA drive was right in line with the high-speed NVMe drives.
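The speedup figure follows directly from the averaged timings, as a quick check shows:

```python
def to_hours(hours, minutes):
    """Convert an 'Xh Ym' timing into fractional hours."""
    return hours + minutes / 60

cpu_only = to_hours(98, 24)   # KIOXIA result, CPU-only run
with_gpu = to_hours(4, 30)    # KIOXIA result, with the NVIDIA T4
speedup = cpu_only / with_gpu
print(f"{speedup:.1f}x")      # → 21.9x
```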
In this testing, we found that faster storage devices did not improve training performance. The main reason is the complexity of the AI model: the time spent learning exceeds the time spent reading data. Said another way, training on the current batch of images takes longer than reading the next one, so the storage operations are hidden behind the AI computations.
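That overlap can be illustrated with a minimal double-buffered loader, a sketch rather than byteLAKE's actual pipeline: a background thread reads the next batch while the current one trains, so as long as a training step outlasts a read, the read time disappears from the total. The sleep durations are stand-ins for real I/O and compute times.

```python
import queue
import threading
import time

def read_batch(i):
    time.sleep(0.01)           # simulated storage read (fast relative to compute)
    return f"batch-{i}"

def train_on(batch):
    time.sleep(0.05)           # simulated training step (slower than the read)

def loader(q, n_batches):
    # Background thread: keep the buffer filled while training consumes it.
    for i in range(n_batches):
        q.put(read_batch(i))   # blocks when the buffer is full
    q.put(None)                # sentinel: no more data

n = 10
buf = queue.Queue(maxsize=2)   # small prefetch buffer
t = threading.Thread(target=loader, args=(buf, n))
start = time.perf_counter()
t.start()
while (batch := buf.get()) is not None:
    train_on(batch)
t.join()
elapsed = time.perf_counter() - start
# Serial would take n * (read + train) = 0.6s; overlapped, the reads hide
# behind compute and total time approaches n * train time.
print(f"{elapsed:.2f}s")
```

If the read time exceeded the training time, the pipeline would instead stall on `buf.get()` and faster storage would show up directly in the total, which is the scenario raised later with data-heavy workloads.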
When adding in the NVIDIA T4, there was some thought that the faster processing would finally let storage affect performance. This was not the case in this test; even with the T4, the model's training remained compute-bound and didn't require storage to be particularly speedy.
While more work needs to be done to further test the impact of specific components and systems on AI, we believe this initial data is useful and a good starting point for the conversation. We need application data to better understand where the right levers are from an IT standpoint and where budgetary spend can yield the most impactful results. This, of course, also depends in large part on where the activity takes place, be it in the data center or at the edge. For now, we welcome the engagement by byteLAKE and others at the tip of the AI spear to help provide useful data to answer these pressing questions.
This is our first AI test but not the last. Mariusz Kolanko, co-founder of byteLAKE, indicated they have been working on a product named CFD Suite (AI for Computational Fluid Dynamics, or CFD, to accelerate solvers) where the deep learning process needs a lot of data for every epoch of training. That workload may in fact place a higher load on storage during training and could see storage impact the performance of the deep learning process itself. Ultimately, as with any application, it's critical to understand the application's needs in order to assign the proper data center resources. AI is clearly not a one-size-fits-all application.