Rethinking your storage architecture might just be the key to efficiently unlocking the full potential of your accelerated computing.

How do I reduce the number of watts consumed for a given productive output? How do I make these very large, very critical production AI systems robust? Answering these questions is our mission at DDN, and I think about them a lot.

At the end of the day, an AI system comprises compute, network, middleware (containers, schedulers, monitoring), AI frameworks, and storage. Every one of these components is part of the answer when it comes to system efficiency.

Overcoming AI Storage Bottlenecks for Enhanced Productivity

This is why it is frustrating that shared storage within AI infrastructure remains an often-overlooked element of the stack. We come across this a lot, especially with customers starting from a smaller scale. But the important points that people often miss are:

  • The right storage simplifies everything and makes a system more robust
  • Storage can vastly improve the efficiency of the entire system

Ultimately, purchasing optimized storage for AI workloads might be a bit more costly, but if you get it right, you gain 2X that back in higher productivity from the whole infrastructure.

Streamlining Data Movement in AI Infrastructure

How does that work? Well, moving data is, of course, an inherent part of AI model training (and inference, for that matter). This includes moving datasets into GPUs, moving models into GPU memory, and saving model parameters and optimizer state to storage (checkpoints).
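To make the checkpoint overhead concrete, here is a minimal PyTorch sketch that times a synchronous checkpoint write. The toy model and output path are illustrative stand-ins; on a real foundation model, the stall grows with model size and shrinks with filesystem throughput:

```python
import time
import torch

# Stand-in model; real foundation-model checkpoints run to many gigabytes.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.Linear(4096, 4096),
)

start = time.perf_counter()
torch.save(model.state_dict(), "checkpoint.pt")  # blocking write: no training happens here
elapsed = time.perf_counter() - start

size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
print(f"Wrote {size_mb:.0f} MB in {elapsed:.2f}s (~{size_mb / max(elapsed, 1e-9):.0f} MB/s)")
```

Every second spent inside that blocking save is a second of GPU time purchased and not used.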

All of this data movement is overhead. It can be compared to waiting at the checkout at a supermarket. Standing there waiting adds no value and is just a step that needs to be completed before moving forward. If we could reduce our checkout wait time to zero, supermarkets would be more efficient, shoppers would be happier, and nobody would lose.

Just like a long checkout line can delay dinner, waiting for data can seriously stall AI projects. While GPUs sit idle waiting for data, a high percentage of the AI infrastructure is wasted, burning capital value along with electricity, cooling, and precious time.

Organizations are trying myriad approaches to optimize the training of foundation models, often by reducing the size of the computed problem without impacting its efficacy. Popular techniques include reducing the parameter search space for training, moving to asynchronous data movements, reducing model size, and reducing the effective size of the training data.
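As one example of asynchronous data movement, here is a minimal PyTorch sketch, with an illustrative in-memory dataset standing in for real training data. Background workers load and pin batches while the accelerator computes, and the host-to-device copies are issued without blocking:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in dataset; real training data would stream from storage.
dataset = TensorDataset(torch.randn(10_000, 1024), torch.randint(0, 10, (10_000,)))
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,    # worker processes load batches in the background
    pin_memory=True,  # page-locked buffers allow asynchronous host-to-device copies
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for x, y in loader:
    x = x.to(device, non_blocking=True)  # copy overlaps with in-flight GPU work
    y = y.to(device, non_blocking=True)
    # ... forward and backward passes would go here ...
```

Overlapping I/O with compute hides some latency, but it cannot hide a storage system that is fundamentally slower than the GPUs it feeds.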

Racing Towards Efficiency: The Need for Balanced Infrastructure

While we’ve seen these kinds of optimizations before in HPC, they are never able to completely negate the need for a balanced hardware and software architecture. To maximize efficiency, the system’s GPU performance, CPU performance and memory bandwidth, PCIe architecture, network bandwidth and latency, filesystem performance, and storage capabilities all need to be balanced against the workload. And that efficiency is measured in reduced power, greater capital returns, less operational spending, and better use of people’s time.

I’ll also note that things are not getting easier. With the huge race now for building the next generation of foundation models, and even moving towards AGI, the problem sizes are growing faster than the optimizations can reduce them. On top of that, GPU power, memory footprint and memory bandwidth are rapidly increasing, with NVIDIA doing an astounding job of maintaining momentum in hardware and software advances. The company is not showing any signs of slowing down.

From Jensen’s recent GTC keynote, I noted the following as key for us at DDN, as a contributing part of the NVIDIA-driven AI infrastructure:

“And that’s our goal—to continuously drive down the cost and energy associated with computing so that we can continue to expand and scale up the computation that we have to do to train the next generation of models.”

It was about efficiency five years ago when NVIDIA bought its first DDN solution to drive the data for its Selene supercomputer. It was the same one year ago when, again, NVIDIA turned to a DDN solution to drive the data for its Eos system, and it will be the same for future systems where driving down cost and energy while driving up productive output is the target.

From Hyperscalers to End Users: Prioritizing Efficiency

And it’s not just the hyperscalers who are seriously thinking about the efficiency question. End users are too. An interesting post from a couple of weeks ago highlighted this in light of the new cloud implementations that are helping organizations gain access to large-scale AI infrastructure quickly.

Yi Tay, Co-founder & Chief Scientist at Reka (previously Senior Research Scientist at Google Brain), wrote a blog post that I think is very important for the new wave of GPU cloud providers out there. Tay’s post highlights the challenges and variability faced when acquiring compute resources from different providers for training LLMs.

Tay notes that even clusters with stable nodes are not immune to efficiency problems:

“Meanwhile, even though some other clusters could have significantly more stable nodes, they might suffer from poor I/O and file system that even saving checkpoints could result in timeouts or ungodly amounts of time chipping away at cluster utilisation.”

“Did I mention you’ll also get a different Model Flop Utilisation (MFU) for different clusters!? This was a non-negligible amount of compute wasted if one is unlucky enough to find a provider with badly cabled nodes or some other issues. Systems with very sub-optimal file systems would have the MFU of training runs tank the moment a teammate starts transferring large amounts of data across clusters.”

(Source)
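For readers less familiar with the metric, MFU is the fraction of the hardware’s theoretical peak FLOPs that a training run actually achieves. Here is a back-of-the-envelope sketch using the common approximation of ~6 FLOPs per parameter per trained token; all the numbers below are illustrative assumptions, not measurements:

```python
# Illustrative figures only: parameter count, throughput, and peak are assumptions.
n_params = 7e9           # model parameters
tokens_per_sec = 8.0e4   # observed end-to-end training throughput
peak_flops = 8 * 989e12  # e.g., 8 GPUs at ~989 dense BF16 TFLOPs each

achieved_flops = 6 * n_params * tokens_per_sec  # ~6 FLOPs per parameter per token
mfu = achieved_flops / peak_flops
print(f"MFU: {mfu:.1%}")  # checkpoint timeouts and data stalls drag this number down
```

The point Tay makes is that two clusters with identical GPUs can report very different MFU, and slow filesystems are one of the culprits.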

Tay’s insights show that end users are definitely noticing the overall efficiency and output corresponding to their time and spend on cloud resources. For their GPU-hour dollars, they notice the useful productive output, and they are comparing the good, the bad, and the ugly out there.

Balanced architectures and integrated solutions are important to efficiency. NVIDIA has talked about accelerated computing being more than the sum of its parts: performance emerges from cross-stack integration as well as from the speed of the individual components.

DDN’s Integral Role in Streamlining AI Architecture

At DDN, our storage is not just storage. It is an integral piece of your AI architecture that simplifies and accelerates while keeping your data safe. Our A³I storage solutions are purpose-built for AI workloads, delivering unmatched performance, comprehensive enterprise features, and limitless scaling.

These offerings are just the beginning. DDN provides a comprehensive suite of products and solutions to streamline your entire AI data lifecycle. From our EXAScaler parallel file system for high-performance workloads to our DataFlow software for seamless data movement and management across diverse environments, our solutions enable you to harness the full potential of your data.

Explore our online TCO tool to see the tangible impact of DDN solutions on your bottom line and discover how our intelligent data solutions can transform your accelerated computing architecture.
