Nvidia turns up the AI heat with 1,200W Blackwell GPUs

Five times the performance of the H100, but you'll need liquid cooling to tame the beast

For all the saber-rattling from AMD and Intel, Nvidia remains, without question, the dominant provider of AI infrastructure. With today's debut of the Blackwell GPU architecture during CEO Jensen Huang's GTC keynote, it aims to extend its technical lead – in both performance and power consumption.

Given Nvidia's rapid rise in the wake of the generative AI boom, the stakes couldn't be higher. But at least on paper, Blackwell – the successor to Nvidia's venerable Hopper generation – doesn't disappoint. In terms of raw FLOPS, the GPU giant's top-specced Blackwell chips are roughly 5x faster.

Of course, the devil is in the details, and getting this performance will depend heavily on a number of factors. While Nvidia claims the new chip will do 20 petaFLOPS, that's only when using its new 4-bit floating point data type and opting for liquid-cooled servers. Looking at gen-on-gen FP8 performance, the chip is only about 2.5x faster than the H100.
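For those keeping score, the napkin math behind those multipliers is straightforward. The sketch below assumes the H100's published sparse FP8 rating of roughly 4 petaFLOPS – a figure from Nvidia's existing spec sheets rather than today's announcement:

```python
# Rough gen-on-gen math behind the "roughly 5x" and "2.5x" figures.
# The ~3.96 petaFLOPS sparse FP8 number comes from Nvidia's H100 spec sheet.
H100_FP8_SPARSE_PFLOPS = 3.96
BLACKWELL_FP4_PFLOPS = 20.0   # top SKU, liquid cooled, sparse
BLACKWELL_FP8_PFLOPS = 10.0   # top SKU, sparse

print(f"FP4 vs H100 FP8: {BLACKWELL_FP4_PFLOPS / H100_FP8_SPARSE_PFLOPS:.1f}x")  # ~5x
print(f"FP8 vs H100 FP8: {BLACKWELL_FP8_PFLOPS / H100_FP8_SPARSE_PFLOPS:.1f}x")  # ~2.5x
```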

At the time of writing, Blackwell encompasses three parts: the B100, B200, and Grace-Blackwell Superchip (GB200). Presumably there will be other Blackwell GPUs at some point – like the previously teased B40, which uses a different die, or rather dies – but for now the three chips share the same silicon.

The Nvidia Blackwell GPU powering the B100, B200, and GB200 accelerators features a pair of reticle-limited compute dies which communicate with each other via a 10TB/sec NVLink-HBI interconnect

And it's this silicon which is at least partially responsible for Blackwell's performance gains this generation. Each GPU is actually two reticle-limited compute dies, tied together via a 10TB/sec NVLink-HBI (high-bandwidth interface) fabric, which allows them to function as a single accelerator. The two compute dies are flanked by a total of eight HBM3e memory stacks, with up to 192GB of capacity and 8TB/sec of bandwidth. And unlike H100 and H200, we're told the B100 and B200 have the same memory and GPU bandwidth.
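Divide those totals across the eight stacks and the per-stack numbers fall out directly – a quick sketch, assuming capacity and bandwidth are spread evenly:

```python
# Per-stack HBM3e figures implied by Blackwell's 192GB / 8TB/s totals,
# assuming an even split across the eight memory stacks.
STACKS = 8
TOTAL_CAPACITY_GB = 192
TOTAL_BANDWIDTH_TBS = 8.0

print(f"{TOTAL_CAPACITY_GB // STACKS}GB and {TOTAL_BANDWIDTH_TBS / STACKS:.1f}TB/s per stack")  # 24GB, 1.0TB/s
```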

Nvidia is hardly the first to take the chiplet – or in its preferred parlance "multi-die" – route. AMD's MI300-series accelerators – which we looked at in December – are objectively more complex and rely on both 2.5D and 3D packaging tech to stitch together as many as 13 chiplets into a single part. Then there's Intel's GPU Max parts, which use even more chiplets.

AI's power and thermal demands hit home

Even before Blackwell's debut, datacenter operators were already feeling the heat associated with supporting massive clusters of Nvidia's 700W H100.

With twice the silicon filling out Nvidia's latest GPU, it should come as no surprise that the part can run hotter – though how much hotter depends on how you cool it.

With the B100, B200, and GB200, the key differentiator comes down to power and performance rather than memory configuration. According to Nvidia, the silicon can actually operate between 700W and 1,200W, depending on the SKU and type of cooling used.

Within each of these regimes, the silicon understandably performs differently. According to Nvidia, air-cooled HGX B100 systems are able to squeeze 14 petaFLOPS of FP4 per GPU, while consuming the same 700W power target as the H100. This means if your datacenter can already handle Nvidia's DGX H100 systems, you shouldn't run into trouble adding a couple of B100 nodes to your cluster.

Where things get interesting is with the B200. In an air-cooled HGX or DGX configuration, each GPU can push 18 petaFLOPS of FP4 while sucking down a kilowatt. According to Nvidia, its DGX B200 chassis with eight B200 GPUs will consume roughly 14.3kW – something that's going to require roughly 60kW of rack power and thermal headroom to handle.
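That 60kW figure tracks if you assume something like four DGX B200 systems per rack – an assumption on our part, since Nvidia hasn't spelled out a reference rack layout alongside these numbers:

```python
# Back-of-the-envelope rack power for air-cooled DGX B200 systems.
# Four systems per rack is our assumption, not an Nvidia-published figure.
DGX_B200_KW = 14.3
SYSTEMS_PER_RACK = 4

print(f"~{DGX_B200_KW * SYSTEMS_PER_RACK:.0f} kW per rack, before networking and overhead")  # ~57 kW
```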

For newer datacenters built with AI clusters in mind, this shouldn't be an issue – but for existing facilities it may not be so easy.

Speaking of AI datacenters, reaching Blackwell's full potential will require switching over to liquid cooling. In a liquid-cooled configuration, Nvidia says the chip can output 1,200W of thermal energy when pumping out the full 20 petaFLOPS of FP4.

All of this is to say that liquid cooling isn't a must this generation but – if you want to get the most out of Nvidia's flagship silicon – you're going to need it.

Nvidia doubles up on GPU compute with second-gen Superchips

Nvidia's Grace-Blackwell Superchip, or GB200 for short, combines a 72-core Arm CPU with a pair of 1,200W GPUs

Nvidia's most powerful GPUs can be found in its GB200. Similar to Grace-Hopper, the Grace-Blackwell Superchip meshes together its existing 72-core Grace CPU with its Blackwell GPUs, using the NVLink-C2C interconnect.

But where Grace-Hopper paired the CPU with a single H100 GPU, the GB200 packs a pair of Blackwell accelerators – good for 40 petaFLOPS of FP4 and 384GB of HBM3e memory.

We asked Nvidia for clarification on the GB200's total power draw and were told "no further details provided at this time." However, the older GH200 was rated for 1,000W – split between a 700W GPU and a 300W Arm CPU. A back-of-the-envelope calculation therefore suggests that, under peak load, the Grace-Blackwell part – with its twin GPUs drawing 1,200W each and the same Arm CPU – is capable of sucking down somewhere in the neighborhood of 2,700W. It's no surprise, then, that Nvidia would skip straight to liquid cooling for this beast.
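The estimate itself is just addition – a minimal sketch, assuming the Grace CPU keeps the same 300W budget it had in the GH200:

```python
# Unconfirmed GB200 module power estimate, extrapolated from the GH200's
# 1,000W split (700W GPU + 300W Grace CPU). Nvidia hasn't provided a figure.
BLACKWELL_GPU_W = 1_200   # per GPU, liquid cooled, flat out
GRACE_CPU_W = 300         # assumption carried over from GH200
GPUS_PER_GB200 = 2

print(f"~{GPUS_PER_GB200 * BLACKWELL_GPU_W + GRACE_CPU_W:,} W")  # ~2,700 W
```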

Ditching the bulky heat spreaders for a couple of cold plates allowed Nvidia to cram two of these accelerators into a slim 1U chassis capable of pushing 80 petaFLOPS of FP4, or 40 petaFLOPS at FP8.

That means a single dual-GB200 node churns out more FP8 FLOPS than the previous generation's 8U, 10.2kW DGX H100 – 40 petaFLOPS versus 32 petaFLOPS – while occupying an eighth of the space.
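Put another way, the density jump is roughly an order of magnitude – a rough comparison using the FP8 figures above and counting only the rack units the systems themselves occupy:

```python
# Sparse FP8 compute density per rack unit, DGX H100 vs a dual-GB200 node.
systems = {
    "DGX H100 (8U)": (32, 8),     # (petaFLOPS FP8, rack units)
    "2x GB200 (1U)": (40, 1),
}
for name, (pflops, rack_units) in systems.items():
    print(f"{name}: {pflops / rack_units:.0f} PFLOPS per U")
# DGX H100: 4 PFLOPS per U; dual GB200: 40 PFLOPS per U
```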

Nvidia's rackscale systems get an NVLink boost

These dual GB200 systems form the backbone of Nvidia's NVL72 rackscale AI systems, which are designed to support large-scale training and inference deployments on models scaling to trillions of parameters.

Each rack comes equipped with 18 of these nodes, for a total of 36 Grace CPUs and 72 Blackwell GPUs. The nodes are then interconnected via a bank of nine NVLink switches, enabling them to behave like a single GPU node with 13.5TB of HBM3e.
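The headline figures roll up neatly from the node count – a quick sketch, assuming two GB200 Superchips per 1U compute node as described above:

```python
# How the NVL72's totals follow from 18 nodes of two GB200 Superchips each.
NODES = 18
SUPERCHIPS_PER_NODE = 2
GPUS_PER_SUPERCHIP = 2
HBM_PER_GPU_GB = 192

superchips = NODES * SUPERCHIPS_PER_NODE    # 36
gpus = superchips * GPUS_PER_SUPERCHIP      # 72
print(f"{superchips} Superchips, {gpus} GPUs, {gpus * HBM_PER_GPU_GB / 1024:.1f}TB of HBM3e")
# 36 Superchips, 72 GPUs, 13.5TB of HBM3e
```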

The GB200 NVL72 is a rackscale system that uses NVLink switch appliances to stitch together 36 Grace-Blackwell Superchips into a single system

This is actually the same technology employed in Nvidia's past DGX systems to make eight GPUs behave as one. The difference is that, using dedicated NVLink appliances, Nvidia is able to support many more GPUs.

According to Nvidia, the approach allows a single NVL72 rack system to support model sizes up to 27 trillion parameters – presumably when using FP4. In training, Nvidia claims the system is good for 720 petaFLOPS of sparse FP8. For inferencing workloads, meanwhile, we're told the system will do 1.44 exaFLOPS at FP4. If that's not enough horsepower, eight NVL72 racks can be networked to form Nvidia's DGX GB200 Superpod.
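Those rack-level numbers also scale linearly from the per-GPU figures quoted earlier – at least in the ideal case, which is what the sketch below assumes:

```python
# NVL72 and Superpod throughput, assuming perfect scaling across the NVLink fabric.
GPUS_PER_RACK = 72
FP8_SPARSE_PFLOPS = 10   # per GPU
FP4_SPARSE_PFLOPS = 20   # per GPU
RACKS_PER_SUPERPOD = 8

print(f"Training (FP8):  {GPUS_PER_RACK * FP8_SPARSE_PFLOPS} PFLOPS")             # 720
print(f"Inference (FP4): {GPUS_PER_RACK * FP4_SPARSE_PFLOPS / 1000:.2f} EFLOPS")  # 1.44
print(f"Superpod (FP4):  {RACKS_PER_SUPERPOD * GPUS_PER_RACK * FP4_SPARSE_PFLOPS / 1000:.1f} EFLOPS")  # ~11.5
```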

Eight DGX NVL72 racks can be strung together to form Nvidia's liquid-cooled DGX GB200 Superpod

If any of this sounds familiar, that's because this isn't the first time we've seen this rackscale architecture from Nvidia. Last fall, Nvidia showed off a nearly identical architecture called the NVL32, which it was deploying for AWS. That system used 16 dual-GH200 Superchip systems connected via NVLink switch appliances for a total of 128 petaFLOPS of sparse FP8 performance.

However, the NVL72 design doesn't just pack more GPUs – both the NVLink and network switches used to stitch the whole thing together have had an upgrade too.

The NVLink switches now feature a pair of Nvidia's fifth-gen 7.2Tbit/sec NVLink ASICs, each of which is capable of 1.8TB/sec of all-to-all bidirectional bandwidth – twice that of the last generation. 
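If the mix of bits and bytes looks confusing, the two figures describe the same links – our reading is that 7.2Tbit/sec is the one-way rate, while 1.8TB/sec counts traffic in both directions:

```python
# Reconciling 7.2 Tbit/s with 1.8 TB/s, assuming the former is unidirectional.
UNIDIRECTIONAL_TBIT_S = 7.2

one_way_tb_s = UNIDIRECTIONAL_TBIT_S / 8              # 0.9 TB/s per direction
print(f"{one_way_tb_s * 2:.1f} TB/s bidirectional")   # 1.8 TB/s
```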

Nvidia has also doubled the per-port bandwidth of its networking gear to 800Gbit/sec this time around – though reaching those speeds will likely require using either Nvidia's ConnectX-8 or BlueField-3 SuperNICs in conjunction with its Quantum 3 or Spectrum 4 switches.

Nvidia's rackscale architecture appears to be a hit with the major cloud providers too – Amazon, Microsoft, Google, and Oracle have all signed up to deploy instances based on the design. We've also learned that AWS's Project Ceiba has been upgraded to use 20,000 accelerators.

How Blackwell stacks up so far

While Nvidia may dominate the AI infrastructure market, it's hardly the only name out there. Heavy hitters like Intel and AMD are rolling out Gaudi and Instinct accelerators, cloud providers are pushing custom silicon, and AI startups like Cerebras and SambaNova are vying for a slice of the action.

And with demand for AI accelerators expected to far outstrip supply throughout 2024, winning share doesn't always mean having faster chips – just ones available to ship.

While we don't know much about Intel's upcoming Gaudi 3 chips just yet, we can make some comparisons to AMD's MI300X GPUs, launched back in December.

As we mentioned earlier, the MI300X is something of a silicon sandwich, using advanced packaging to stack eight CDNA 3 compute dies on top of four I/O dies, which handle high-speed communications between the GPU chiplets and 192GB of HBM3 memory.

|               | MI300X      | B100        | B200        | GB200 (per GPU) | GB200 (2x GPU)          |
|---------------|-------------|-------------|-------------|-----------------|-------------------------|
| FP4           | N/A         | 14 PFLOPS*  | 18 PFLOPS*  | 20 PFLOPS*      | 40 PFLOPS*              |
| FP8           | 5.2 PFLOPS* | 7 PFLOPS*   | 9 PFLOPS*   | 10 PFLOPS*      | 20 PFLOPS*              |
| INT8          | 5,200 TOPS* | 7,000 TOPS* | 9,000 TOPS* | 10,000 TOPS*    | 20,000 TOPS*            |
| FP16          | 2.6 PFLOPS* | 3.5 PFLOPS* | 4.5 PFLOPS* | 5 PFLOPS*       | 10 PFLOPS*              |
| TF32          | 1.3 PFLOPS* | 1.8 PFLOPS* | 2.2 PFLOPS* | 2.5 PFLOPS*     | 5 PFLOPS*               |
| FP64 (Matrix) | 163 TFLOPS  | 30 TFLOPS   | 40 TFLOPS   | 45 TFLOPS       | 90 TFLOPS               |
| FP64 (Vector) | 81.7 TFLOPS | ?           | ?           | ?               | ?                       |
| Memory        | 192GB HBM3  | 192GB HBM3e | 192GB HBM3e | 192GB HBM3e     | 384GB HBM3e             |
| HBM Bandwidth | 5.3TB/s     | 8TB/s       | 8TB/s       | 8TB/s           | 16TB/s                  |
| Power         | 750W        | 700W        | 1,000W      | 1,200W          | 2,700W (with Grace CPU) |

* With sparsity

In terms of performance, the MI300X promised a 30 percent performance advantage in FP8 floating point calculations and a nearly 2.5x lead in HPC-centric double precision workloads compared to Nvidia's H100.

Comparing the 750W MI300X against the 700W B100, Nvidia's chip comes out roughly 2.67x ahead in sparse compute – though that pits the B100's FP4 against the MI300X's FP8, the lowest precision each part supports. And while both chips pack 192GB of high-bandwidth memory, the Blackwell part's memory is 2.7TB/sec faster.
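Here's that comparison worked through, using the figures from the table above:

```python
# B100 vs MI300X at each chip's lowest supported precision, per the table above.
MI300X_FP8_SPARSE_PFLOPS = 5.2
MI300X_HBM_TBS = 5.3
B100_FP4_SPARSE_PFLOPS = 14.0
B100_HBM_TBS = 8.0

print(f"~{B100_FP4_SPARSE_PFLOPS / MI300X_FP8_SPARSE_PFLOPS:.1f}x compute advantage")  # ~2.7x
print(f"+{B100_HBM_TBS - MI300X_HBM_TBS:.1f} TB/s more memory bandwidth")              # +2.7 TB/s
```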

Memory bandwidth has already proven to be a major indicator of AI performance, particularly when it comes to inferencing. Nvidia's H200 is essentially a bandwidth-boosted H100. Yet, despite pushing the same FLOPS as the H100, Nvidia claims it's twice as fast in models like Meta's Llama 2 70B.

While Nvidia has a clear lead at lower precision, it may have come at the expense of double precision performance – an area where AMD has excelled in recent years, winning multiple high-profile supercomputer awards.

According to Nvidia, the Blackwell GPU is capable of delivering 45 teraFLOPS of FP64 tensor core performance. That's a bit of a step down from the 67 teraFLOPS of FP64 Matrix performance delivered by the H100, and puts it at a disadvantage against AMD's MI300X at either 81.7 teraFLOPS FP64 vector or 163 teraFLOPS FP64 matrix.

There's also Cerebras, which recently showed off its third-gen Waferscale AI accelerators. The monster 900,000-core processor is the size of a dinner plate and designed specifically for AI training.

Cerebras claims each of these chips can squeeze 125 petaFLOPS of highly sparse FP16 performance from 23kW of power. Compared to the H100, Cerebras claims the chip is about 62x faster at half precision.

However, pit the WSE-3 against Nvidia's flagship Blackwell parts and that lead shrinks considerably. From what we understand, Nvidia's top-specced chip should deliver about 5 petaFLOPS of sparse FP16 performance, which cuts Cerebras's lead down to 25x. But, as we pointed out at the time, all of this depends on your model being able to take advantage of sparsity.
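The shrinking lead is easy to reproduce from the figures already cited – a sketch assuming roughly 2 petaFLOPS of sparse FP16 for the H100 and the 5 petaFLOPS figure above for Blackwell:

```python
# Cerebras WSE-3 vs Nvidia at sparse FP16, using the figures cited in this piece.
WSE3_PFLOPS = 125
H100_PFLOPS = 2.0        # approx. sparse FP16
BLACKWELL_PFLOPS = 5.0   # top SKU, sparse FP16

print(f"vs H100:      ~{WSE3_PFLOPS / H100_PFLOPS:.0f}x")       # ~62x
print(f"vs Blackwell: ~{WSE3_PFLOPS / BLACKWELL_PFLOPS:.0f}x")  # 25x
```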

Don't cancel your H200 orders just yet

Before you get too excited about Nvidia's Blackwell parts, it's going to be a while before you can get your hands on them.

Nvidia told The Register the B100, B200, and GB200 will all ship in the second half of the year, but it's not clear exactly when or in what volume. It wouldn't surprise us if the B200 and GB200 didn't start ramping until sometime in early 2025.

The reason is simple: Nvidia hasn't shipped its HBM3e-equipped H200 chips yet. Those parts are due out in the second quarter of this year. ®

Need more analysis? Don't forget to check out Timothy Prickett Morgan's commentary on Nvidia's news right here on The Next Platform.
