DDN & NVIDIA Collaborate To Leverage NVIDIA’s DGX SuperPOD™ Reference Architectures for Internal Development

NVIDIA has created the data center for the age of AI and provides solutions that deliver breakthrough performance for workloads at any scale, driving business decisions in real-time and resulting in faster time to value.

A pioneer in accelerated computing and the world leader in artificial intelligence (AI) computing, NVIDIA is setting new standards for computing innovation, deep learning, and data analytics. NVIDIA is engineering the most advanced chips, systems, and software for AI factories of the future by building new AI services that help companies create their own AI factories.

Built from the ground up for enterprise AI, the NVIDIA DGX™ platform combines the best of NVIDIA software, infrastructure, and expertise. By consolidating the power of an entire data center into a single platform, NVIDIA has revolutionized how complex machine learning workflows and AI models are developed and deployed in an enterprise.

The Challenge

With the explosive growth of AI applications, an entirely novel approach to the data center was necessary. NVIDIA required a high-capacity, reliable, and easy-to-integrate AI data storage and management solution to not only deliver supercomputing services to meet complex demands from its internal developers but also create the blueprint for deploying turnkey supercomputers for their new breed of AI customers.

Beginning with the initial supercomputer collaboration, Selene, NVIDIA wanted to build a system powerful enough to train the AI models their colleagues were building for autonomous vehicles and general-purpose enough to serve the needs of any deep learning researcher. As the size and complexity of AI models continued to grow, NVIDIA incorporated new technologies into subsequent systems to fulfill the ongoing goal of creating best-in-class infrastructure for all AI workloads, and they needed storage solutions that could keep up.

NVIDIA required a reliable data storage platform and provider partner that could handle large computational problems distributed across hundreds of systems operating in parallel using a standard set of scalable storage building blocks. To reduce complexity, these storage building blocks needed to supply excellent performance for both reads and writes and scale out without needing to re-architect to accommodate future growth.

“What’s needed is data center-scale computing, so AI models and datasets can be processed across many systems in parallel, enabling applications to train in hours instead of weeks,” said Tony Paikeday, senior director of Product Marketing at NVIDIA.

The Solution

Since 2018, DDN and NVIDIA have run extensive validation testing and collaborative development projects to create an optimal infrastructure architecture for AI workloads and applications. This has resulted in DDN storage being used for NVIDIA’s Selene, Cambridge-1 and Eos AI supercomputers, as well as the creation of reliable and repeatable reference architectures that scale with ease for enterprise AI customers.

Historically, most supercomputers were custom-built one-off designs, but the new breed of enterprise AI customers does not have the experience, expertise or time to build one this way. Drawing on the experience of building Selene with DDN’s A³I appliances, which was accomplished in 2020 in just three weeks, NVIDIA was able to create the blueprint for AI supercomputing that came to be known as NVIDIA DGX SuperPOD™. The DGX SuperPOD reference architectures (RAs) deliver reduced time-to-solution while minimizing the complexity of increasingly diverse AI models, including conversational AI, recommender systems, computer vision workloads, and autonomous vehicles. DDN’s A³I was certified as the first storage solution for this world-class, turnkey AI infrastructure.

“When we developed Selene, we had a design in mind, to grow from a smaller unit into the full-size supercomputer,” said Prethvi Kashinkunti, senior data center systems engineer, NVIDIA. “We wanted to be able to take on that effort of going through the pain of putting this together and figuring out where the gaps were so that joint customers of ours could go out and take the same architecture for whatever scale that they need. [We are giving them] the confidence of knowing that somebody has done this and that it works, and that expectations can be met.”

Over time, the increase in the size and complexity of AI models has driven NVIDIA and DDN to collaborate on additional systems to achieve unprecedented performance and predictable uptime, dramatically boosting utilization and productivity and increasing the ROI of NVIDIA’s internal systems and customer AI initiatives alike. Most recently, NVIDIA unveiled its Eos system, which comprises 576 NVIDIA DGX H100 systems and NVIDIA Quantum-2 InfiniBand networking, and uses DDN’s AI400X2 for its storage.

“There are many important considerations when designing the world’s most powerful AI systems. Storage is one that is often overlooked. As the data models get bigger and bigger, and the computation becomes bigger and bigger, more and more data is needed,” explained Marc Hamilton, VP Solutions Architecture and Engineering, NVIDIA. “It’s not just about moving that data; it’s about moving the data at the same time.”

By utilizing DDN’s A³I appliances, NVIDIA received a data platform well matched to its DGX systems, with high-performance networking, ample I/O capabilities, and a design that scaled well with its growing data needs and customers’ growing demands.

The Benefits

“DDN’s performance and scalability are essential to reducing total time to solution, which is king,” said Michael Houston, chief architect, AI Systems at NVIDIA.

DDN is proud to be integrated with many NVIDIA DGX SuperPODs sold around the world today for shared clouds, generative AI, sovereign AI, and other applications. The flexible and performance optimized solution has allowed customers to get faster ROI with more effective generative AI and LLM training across autonomous vehicles, genomics and biosciences, financial services, robotics, manufacturing, and countless other industries.

DDN’s solutions have kept up with NVIDIA’s advances in GPU technology. As GPUs get more powerful, storage must keep them busy, and DDN has increased the performance of its appliances by 50% in successive generations within the same power and rack space requirements.

“Having a partner who stands shoulder-to-shoulder with our engineers to solve the big challenges is where the true value comes from,” said Houston. “We’re definitely pushing the boundaries of what’s possible today while exploring new frontiers for the future.”

With a good balance of read and write performance, DDN maximizes GPU utilization by minimizing the time it takes to run I/O-intensive operations like data load, model load, and checkpoints. Checkpoints, a critical recurrent step in training workloads where models are saved to persistent storage for a variety of reasons, can be a significant bottleneck. Because of DDN’s efficient write performance, these checkpoints complete significantly faster than with alternative storage solutions, reducing wait time and making the entire system more productive.
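The effect of checkpoint write speed on productivity can be illustrated with rough arithmetic. The sketch below assumes a simple model in which training pauses completely while each checkpoint is written; the function names, checkpoint size, interval, and bandwidth figures are all hypothetical and are not DDN or NVIDIA measurements.

```python
def checkpoint_stall_seconds(checkpoint_gb: float, write_gb_per_s: float) -> float:
    """Seconds a job waits on one synchronous checkpoint write."""
    if write_gb_per_s <= 0:
        raise ValueError("write bandwidth must be positive")
    return checkpoint_gb / write_gb_per_s


def compute_fraction(interval_s: float, stall_s: float) -> float:
    """Share of wall-clock time spent computing rather than waiting on I/O.

    Assumes one checkpoint stall follows every `interval_s` seconds of compute.
    """
    return interval_s / (interval_s + stall_s)


if __name__ == "__main__":
    checkpoint_gb = 1000.0  # hypothetical checkpoint size
    interval_s = 1800.0     # hypothetical: checkpoint every 30 minutes of compute
    for bandwidth in (10.0, 50.0):  # hypothetical sustained write rates, GB/s
        stall = checkpoint_stall_seconds(checkpoint_gb, bandwidth)
        share = compute_fraction(interval_s, stall)
        print(f"{bandwidth:5.1f} GB/s: stall {stall:6.1f} s, compute share {share:.1%}")
```

Under these assumed numbers, a fivefold increase in write bandwidth cuts each stall from 100 seconds to 20 seconds. Real training jobs may overlap checkpoint writes with compute, so this simple model overstates the stall in those cases.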

“Having the storage technology that can provide the appropriate amount of bandwidth both for reads and writes is critical to ensure we maintain that level of efficiency,” explained Kashinkunti. “The DDN technology was the right fit for this type of application.”

DDN has delivered complete campus-wide, departmental and cloud storage solutions to hundreds of universities around the world, combining sophisticated technology with an in-depth understanding of the diverse requirements in academic research.

Looking Ahead

By consolidating the power of an entire data center into a single platform, NVIDIA is revolutionizing how complex machine learning workflows and AI models are developed and deployed in an enterprise. With the addition of DDN storage to the advanced AI infrastructure provided by NVIDIA, the two companies are providing world-class AI solutions for enterprise customers.

“I would say to anybody who is thinking about using DDN, that they would be getting an engineering partner and a team that knows how to support customers that have such a large scale like we do,” said Kashinkunti. “They have the ability to continue to innovate and provide new solutions for improving performance of future AI applications.”

NVIDIA is also making access to accelerated computing as easy, fast, and flexible as possible. Whether they are deploying in their own data center, as a hosted private solution, or in a public cloud, customers can be confident that providers following the standardized reference architectures will supply an efficient and well-proven solution.

“What I love about DDN is that they’re not new to high-performance. They’re the de facto name in high-performance computing storage. And now, by working with us on our DGX SuperPOD, they’re the de facto name for AI storage in high-performance environments,” Hamilton adds.
