NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder
by Ryan Smith on March 22, 2022 11:45 AM EST
Depending on your point of view, the last two years have either gone by very slowly, or very quickly. While the COVID pandemic never seemed to end – and technically still hasn’t – the last two years have whizzed by for the tech industry, and especially for NVIDIA. The company launched its Ampere GPU architecture just two years ago at GTC 2020, and after selling more of their chips than ever before, now in 2022 it’s already time to introduce the next architecture. So without further ado, let’s talk about the Hopper architecture, which will underpin the next generation of NVIDIA server GPUs.
As has become a ritual now for NVIDIA, the company is using its Spring GTC event to launch its next generation GPU architecture. Introduced just two years ago, Ampere has been NVIDIA’s most successful server GPU architecture to date, with over $10B in data center sales in just the last year. And yet NVIDIA has little time to rest on their laurels, as the growth and profitability of the server accelerator market means that there are more competitors than ever before aiming to take a piece of NVIDIA’s market for themselves. To that end, NVIDIA is ready (and eager) to use their biggest show of the year to talk about their next generation architecture, as well as the first products that will implement it.
Taking NVIDIA into the next generation of server GPUs is the Hopper architecture. Named after computer science pioneer Grace Hopper, the Hopper architecture is a very significant, but also very NVIDIA update to the company’s ongoing family of GPU architectures. With the company’s efforts now solidly bifurcated into server and consumer GPU configurations, Hopper is NVIDIA doubling down on everything the company does well, and then building it even bigger than ever before.
Hyperbole aside, over the last several years NVIDIA has developed a very solid playbook for how to tackle the server GPU industry. On the hardware side of matters that essentially boils down to correctly identifying current and future trends as well as customer needs in high performance accelerators, investing in the hardware needed to handle those workloads at great speeds, and then optimizing the heck out of all of it. And for NVIDIA, the last step may very well be the most important bit: NVIDIA puts a lot of work into getting out of doing work.
That mentality, in turn, is front and center for NVIDIA’s Hopper architecture. While NVIDIA has made investments across the board to improve performance, from memory bandwidth and I/O to machine learning and confidential computing, the biggest performance uplifts with Hopper are in the areas where NVIDIA has figured out how to do less work, making their processors all the faster.
Kicking things off for the Hopper generation is H100, NVIDIA’s flagship server accelerator. Based on the GH100 GPU, H100 is a traditional NVIDIA server-first launch, with the company starting at the high end to develop accelerator cards for their largest and deepest pocketed server and enterprise customers.
|NVIDIA Accelerator Specification Comparison|
|H100||A100||V100|
|FP32 CUDA Cores||16896||6912||5120|
|Memory Clock||4.8Gbps HBM3||3.2Gbps HBM2e||1.75Gbps HBM2|
|Memory Bus Width||5120-bit||5120-bit||4096-bit|
|FP32 Vector||60 TFLOPS||19.5 TFLOPS||15.7 TFLOPS|
|FP64 Vector||30 TFLOPS||9.7 TFLOPS (1/2 FP32 rate)||(1/2 FP32 rate)|
|INT8 Tensor||2000 TOPS||624 TOPS||N/A|
|FP16 Tensor||1000 TFLOPS||312 TFLOPS||125 TFLOPS|
|TF32 Tensor||500 TFLOPS||156 TFLOPS||N/A|
|FP64 Tensor||60 TFLOPS||19.5 TFLOPS||N/A|
|NVLink||18 Links (900GB/sec)||12 Links (600GB/sec)||6 Links (300GB/sec)|
|Manufacturing Process||TSMC 4N||TSMC 7N||TSMC 12nm FFN|
Ahead of today’s keynote presentation – which as this article goes up, is still going on – NVIDIA offered a press pre-briefing on Hopper. In traditional NVIDIA fashion, the company has been very selective about the details released thus far (lest they get leaked ahead of Jensen Huang’s keynote). So we can’t make a fully apples-to-apples comparison to A100 quite yet, as we don’t have the full specifications. But based on this pre-briefing, we can certainly tease out some interesting highlights about NVIDIA’s architecture.
First and foremost, NVIDIA is once again building big for their flagship GPU. The GH100 GPU is comprised of 80 billion transistors and is being built on what NVIDIA is calling a “custom” version of TSMC’s 4N process node, an updated version of TSMC’s N5 technology that offers better power/performance characteristics and a very modest improvement in density. So even at just two years after Ampere, NVIDIA is making a full node jump and then some for GH100. At this point NVIDIA is not disclosing die sizes, so we don’t have exact figures to share. But given the known density improvements of TSMC’s process nodes, GH100 should be close in size to the 826mm² GA100. And indeed, it is, at 814mm².
Like NVIDIA’s previous server accelerators, the H100 card isn’t shipping with a fully-enabled GPU. So the figures NVIDIA is providing are based on H100 as implemented, with however many functional units (and memory stacks) are enabled.
In regards to performance, NVIDIA isn’t quoting any figures for standard vector performance in advance. They are however quoting tensor performance, which depending on the format is either 3x or 6x faster than the A100 accelerator. We’ll see how this breaks down between clockspeed increases and either larger or additional tensor cores, but clearly NVIDIA is once again throwing even more hardware at tensor performance, a strategy that has worked out well for them so far.
Officially, NVIDIA likes to quote figures with sparsity enabled, but for the purposes of our spec sheet I’m using the non-sparse numbers for a more apples-to-apples comparison with previous NVIDIA hardware, as well as competing hardware. With sparsity enabled, TF32 performance and on down can be doubled.
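To make the comparison concrete, the following sketch computes the H100-to-A100 ratio for each dense tensor figure in the spec table, along with the sparsity-doubled H100 rate. It assumes that "TF32 and on down" means TF32, FP16, and INT8 (FP64 is treated as not benefiting); the function names are illustrative only. The 6x figure quoted earlier presumably compares H100's new FP8 format against A100's FP16, which is not a row in the table and so is left out here.

```python
# Dense (non-sparse) tensor throughput from the spec table.
# Units are TFLOPS, except INT8 which is in TOPS.
DENSE = {
    "INT8": {"H100": 2000, "A100": 624},
    "FP16": {"H100": 1000, "A100": 312},
    "TF32": {"H100": 500, "A100": 156},
    "FP64": {"H100": 60, "A100": 19.5},
}

# Assumption: these are the formats covered by "TF32 performance and on down".
SPARSITY_FORMATS = {"INT8", "FP16", "TF32"}


def generational_ratio(fmt):
    """H100 dense throughput divided by A100 dense throughput."""
    return DENSE[fmt]["H100"] / DENSE[fmt]["A100"]


def sparse_rate(fmt, gpu):
    """Peak rate with sparsity enabled: doubled where sparsity applies."""
    rate = DENSE[fmt][gpu]
    return rate * 2 if fmt in SPARSITY_FORMATS else rate


for fmt in DENSE:
    print(f"{fmt}: {generational_ratio(fmt):.2f}x dense, "
          f"H100 sparse peak {sparse_rate(fmt, 'H100')}")
```

Every dense ratio lands a little above 3x, which matches the lower of the two multipliers NVIDIA is quoting.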
Memory bandwidth is also improving significantly over the previous generation, with H100 offering 3TB/second of bandwidth there. The increase in bandwidth this time around comes thanks to the use of HBM3, with NVIDIA becoming the first accelerator vendor to use the latest-generation version of the high bandwidth memory. H100 will come with 6 16GB stacks of the memory, with 1 stack disabled. The net result is 80GB of HBM3 running at a data rate of 4.8Gbps/pin, and attached to a 5120-bit memory bus.
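The memory figures above follow from simple arithmetic, sketched below. The one number not taken from the text is the 1024-bit interface per HBM stack, which is the standard HBM stack width and is consistent with five active stacks giving a 5120-bit bus.

```python
# Figures from the text.
STACK_CAPACITY_GB = 16
STACKS_ON_PACKAGE = 6
STACKS_ENABLED = STACKS_ON_PACKAGE - 1   # one stack is disabled
DATA_RATE_GBPS_PER_PIN = 4.8

# Assumption: each HBM stack exposes a 1024-bit interface.
BUS_BITS_PER_STACK = 1024

capacity_gb = STACKS_ENABLED * STACK_CAPACITY_GB
bus_width_bits = STACKS_ENABLED * BUS_BITS_PER_STACK

# Gbit/s per pin times number of pins, divided by 8 bits per byte.
bandwidth_gb_per_s = bus_width_bits * DATA_RATE_GBPS_PER_PIN / 8

print(f"capacity:  {capacity_gb} GB")
print(f"bus width: {bus_width_bits} bits")
print(f"bandwidth: {bandwidth_gb_per_s:.0f} GB/sec")
```

The result is roughly 3072GB/sec, which is where the 3TB/second headline number comes from.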
NVIDIA will be offering H100 in their usual two form factors: SXM mezzanine for high performance servers, and a PCIe card for more mainstream servers. The power requirements for both form factors have gone up significantly over the previous generation. NVIDIA is quoting an eye-popping 700 Watt TDP for the SXM version of the card, 75% higher than the official 400W TDP of the A100. For better or worse, NVIDIA is holding nothing back here, though the ongoing decline in transistor power scaling is not doing NVIDIA any favors, either.
Cooling such a hot GPU will be an interesting task, though not beyond current technology. At these power levels we’re almost certainly looking at liquid cooling, something the SXM form factor is well-suited for. Still, it’s worth noting that the rival OAM form factor – essentially the Open Compute Project’s take on SXM for use in accelerators – is designed to top out at 700W. So NVIDIA is seemingly approaching the upper limits of what even a mezzanine style card can handle, assuming that server vendors don’t resort to exotic cooling methods.
Meanwhile the H100 PCIe card will see its TDP raised to 350W, from 300W today. Given that 300W is the traditional limit for PCIe cards, it will be interesting to see how NVIDIA and their partners keep those cards cool. Otherwise, with just half the TDP of the SXM card, we’re expecting the PCIe version to be clocked/configured noticeably slower in order to temper the card’s power consumption.
Hopper Tensor Cores: Now With Transformer Engines
Moving on to the big-ticket architectural features of the Hopper architecture, we’ll start with NVIDIA’s Transformer Engines. Living up to their name, the transformer engines are a new, highly specialized type of tensor core designed to further accelerate transformer ML models.
In keeping with NVIDIA’s focus on machine learning, for the Hopper architecture the company has taken a fresh look at the makeup of the ML market, and what workloads are popular and/or the most demanding on existing hardware. The winner, in this regard, has been transformers, a type of deep learning model that has risen in popularity rather quickly due to its utility in natural language processing and computer vision. Recent advancements in transformer technology, such as the GPT-3 model, along with demand from service operators for better natural language processing, have made transformers the latest big breakthrough in ML.
But at the same time, the processing requirements for transformers are also hampering the development of even better models. In short, better models require an ever-larger number of parameters, and at over 175 billion parameters for GPT-3 alone, training times for transformers are becoming unwieldy, even on large GPU clusters.
To that end, NVIDIA has developed a variant of the tensor core specifically for speeding up transformer training and inference, which they have dubbed the Transformer Engine. NVIDIA has optimized this new unit by stripping it down to just processing the lower precision data formats used by most transformers (FP16), and then scaling things down even more with the introduction of an FP8 format as well. The goal with the new units, in brief, is to use the minimum precision necessary at every step to train transformers without losing accuracy. In other words, to avoid doing more work than is necessary.
With that said, unlike more traditional neural network models which are trained at a fixed precision throughout, NVIDIA’s latest hack for transformers is to vary the precision, since FP8 cannot be used throughout a model. As a result, Hopper’s transformer engines can swap between FP16 and FP8 training on a layer by layer basis, utilizing NVIDIA-provided heuristics that work to select the lowest precision needed. The net benefit is that every layer that can be processed at FP8 can be processed twice as fast, as the transformer engines can pack and process FP8 data twice as quickly as FP16.
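As a rough illustration of why per-layer precision selection matters, the sketch below estimates the end-to-end speedup when only some layers can drop to FP8. NVIDIA has not described its heuristics, so the `fp8_safe` flag, the layer names, and the timings are all hypothetical placeholders; only the 2x FP8-over-FP16 rate comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    fp16_time_ms: float  # hypothetical time to process the layer at FP16
    fp8_safe: bool       # hypothetical stand-in for NVIDIA's heuristic


FP8_SPEEDUP = 2.0  # from the text: FP8 is processed twice as fast as FP16


def choose_precision(layer):
    """Pick the lowest precision the (placeholder) heuristic allows."""
    return "FP8" if layer.fp8_safe else "FP16"


def layer_time_ms(layer):
    """Time for one layer at the precision chosen for it."""
    if choose_precision(layer) == "FP8":
        return layer.fp16_time_ms / FP8_SPEEDUP
    return layer.fp16_time_ms


def overall_speedup(layers):
    """Speedup of the mixed FP8/FP16 run relative to running everything at FP16."""
    baseline = sum(layer.fp16_time_ms for layer in layers)
    mixed = sum(layer_time_ms(layer) for layer in layers)
    return baseline / mixed


layers = [
    Layer("attention", 4.0, True),
    Layer("layernorm", 1.0, False),
    Layer("mlp", 5.0, True),
]

for layer in layers:
    print(layer.name, choose_precision(layer))
print(f"overall speedup: {overall_speedup(layers):.2f}x")
```

The point of the toy numbers is that the overall gain stays below 2x whenever any layer has to remain at FP16, so the benefit depends on how much of the model the heuristics can move to FP8.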
Combined with the additional memory on H100 and the faster NVLink 4 I/O, NVIDIA claims that a large cluster of GPUs can train a transformer up to 9x faster, which would bring training times on today’s largest models down to a more reasonable period of time, and make even larger models more practical to tackle.
Meanwhile, on the inference side of matters, Hopper can also immediately consume its own FP8 trained models for inference use. This is an important distinction for Hopper, as it allows customers to otherwise skip converting and optimizing a trained transformer model down to INT8. NVIDIA isn’t claiming any specific performance benefits from sticking with FP8 over INT8, but it means developers can enjoy the same performance and memory usage benefits of running inference on an INT8 model without the previously-required conversion step.
Finally, NVIDIA is claiming anywhere between a 16x and 30x increase in transformer inference performance on H100 versus A100. Like their training claims, this is an H100 cluster versus an A100 cluster, so memory and I/O improvements are also playing a part here, but it nonetheless underscores that H100’s transformer engines aren’t just for speeding up training.
DPX Instructions: Dynamic Programming for GPUs
NVIDIA’s other big smart-and-lazy improvement for the Hopper architecture comes courtesy of the field of dynamic programming. For their latest generation of technology, NVIDIA is adding support for the programming model by adding a new set of instructions just for dynamic programming. The company is calling these DPX Instructions.
Dynamic programming, in a nutshell, is a way of breaking down complex problems into smaller, simpler problems in a recursive manner, and then solving those smaller problems first. The key feature of dynamic programming is that if some of these sub-problems are identical, then those redundancies can be identified and eliminated – meaning a sub-problem can be solved once, and its results saved for future use within the larger problem.
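A small example of the idea, in plain Python rather than anything Hopper-specific: counting the paths through a grid recursively revisits the same sub-grids over and over, and caching each sub-result means every distinct sub-problem is solved only once. The `grid_paths` function and the `calls` counter are illustrative names, not part of any NVIDIA API.

```python
from functools import lru_cache

calls = 0  # counts how many sub-problems are actually computed


@lru_cache(maxsize=None)
def grid_paths(rows, cols):
    """Count paths across a grid when moving only down or right."""
    global calls
    calls += 1
    if rows == 0 or cols == 0:
        return 1  # only one straight-line path remains
    # Each sub-problem is solved once; repeats are served from the cache.
    return grid_paths(rows - 1, cols) + grid_paths(rows, cols - 1)


print(grid_paths(10, 10))   # 184756 paths
print("sub-problems computed:", calls)
```

Without the cache, the same recursion would recompute overlapping sub-grids many thousands of times; with it, the work is bounded by the number of distinct `(rows, cols)` pairs.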
All of which is to say that, like Sparsity and Transformer Engines, NVIDIA is implementing dynamic programming to allow their GPUs to get out of doing more work. By eliminating the redundant parts of workloads that can be broken up per the rules of dynamic programming, it’s that much less work NVIDIA’s GPUs need to do, and that much faster they can produce results.
Though unlike Transformer Engines, adding dynamic programming support via the DPX Instructions is not so much about speeding up existing workloads on GPUs as it is enabling new workloads on GPUs. Hopper is the first NVIDIA architecture to support dynamic programming, so workloads that can be resolved with dynamic programming are normally run on CPUs and FPGAs. In that respect, this is NVIDIA finding one more workload they can steal from CPUs and run on a GPU instead.
Overall, NVIDIA is claiming a 7x improvement in dynamic programming algorithm performance on a single H100 versus naïve execution on an A100.
As for the real-world implications of DPX Instructions, NVIDIA is citing route planning, data science, robotics, and biology as all being potential beneficiaries of the new technology. These fields already use several well-known dynamic programming algorithms, such as Smith-Waterman and Floyd-Warshall, which score genetic sequence aligning and find the shortest distances between pairs of destinations respectively.
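For reference, here is a minimal CPU-side sketch of Floyd-Warshall in plain Python. It makes no use of DPX Instructions and is only meant to show the shape of the algorithm: each pass reuses the shortest distances found so far instead of re-deriving them. The example graph is made up for illustration.

```python
INF = float("inf")


def floyd_warshall(weights):
    """All-pairs shortest distances for a weighted adjacency matrix."""
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is not modified
    for k in range(n):          # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


# Hypothetical 4-node directed graph; INF means "no direct edge".
graph = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]

for row in floyd_warshall(graph):
    print(row)
```

The inner step is just an add followed by a minimum, repeated for every pair of nodes, which gives a sense of why this class of algorithm is a candidate for dedicated instructions.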
Overall, dynamic programming is one of the more niche fields among high performance workloads. But it’s one that NVIDIA believes can be a good fit for GPUs once the right hardware support is available.
Confidential Computing: Protecting GPU Data From Prying Eyes
Shifting away from performance-focused features, NVIDIA’s other big push with the Hopper architecture is on the security front. With the expansion of GPU usage in cloud computing environments – and especially shared VM environments – the company is taking a new focus on the security concerns that entails, and how to secure shared systems.
The end result of those efforts is that Hopper is introducing hardware support for trusted execution environments. Specifically, Hopper supports the creation of what NVIDIA is terming a confidential virtual machine, where all of the data within the VM environment is secure, and all of the data entering (and leaving) the environment is encrypted.
NVIDIA didn’t go over too many of the technical details underpinning their new security features in our pre-briefing, but according to the company it’s a product of a mix of new hardware and software features. Of particular note, data encryption/decryption when moving to and from the GPU is fast enough to be done at the PCIe line rate (64GB/sec), meaning there’s no slowdown in terms of practical host-to-GPU bandwidth when using this security feature.
This trusted execution environment, in turn, is designed to resist all forms of tampering. The memory contents within the GPU itself are secured by what NVIDIA is terming a “hardware firewall”, which prevents outside processes from touching them, and this same protection is extended to data in-flight in the SMs as well. The trusted environment is also said to be secured against the OS or the hypervisor accessing the contents of the GPU from above, restricting access to just the owner of the VM. Which is to say that, even with physical access to the GPU, it shouldn’t be possible to access the data within a secure VM on Hopper.
Ultimately, NVIDIA’s aim here appears to be making/keeping their customers comfortable using GPUs to process sensitive data by making them much harder to break into when they’re working in a secured mode. This, in turn, is not only to protect traditionally sensitive data, such as medical data, but also to protect the kind of high-value AI models that some of NVIDIA’s customers are now creating. Given all of the work that can go into creating and training a model, customers don’t want their models getting copied, be it in a shared cloud environment or being pulled out of a physically insecure edge device.
Multi-Instance GPU v2: Now With Isolation
As an extension of NVIDIA’s security efforts with confidential computing, the company has also extended these protections to their Multi-Instance GPU (MIG) environment. MIG instances can now be fully isolated, with I/O between the instance and the host fully virtualized and secured as well, essentially granting MIG instances the same security features as H100 overall. This moves MIG closer to CPU virtualization environments, where the various VMs are assumed not to trust each other and are kept isolated.
NVLink 4: Extending Chip I/O Bandwidth to 900GB/sec
With the Hopper architecture also comes a new rendition of NVIDIA’s NVLink high-bandwidth interconnect for wiring up GPUs (and soon, CPUs) together for better performance in workloads that can scale out over multiple GPUs. NVIDIA has iterated on NVLink with every generation of their flagship GPU, and this time is no different, with the introduction of NVLink 4.
While we’re awaiting a full disclosure of technical specifications from NVIDIA, the company has confirmed that NVLink bandwidth on a per-chip basis has been increased from 600GB/second on A100 to 900GB/second for H100. Note that this is the sum total of all upstream and downstream bandwidth across all of the individual links that NVLink supports, so cut these figures in half to get specific transmit/receive rates.
|NVLink Specification Comparison|
|NVLink 4||NVLink 3||NVLink 2|
|Signaling Rate||100 Gbps||50 Gbps||25 Gbps|
|Bandwidth/Direction/Link||25 GB/sec||25 GB/sec||25 GB/sec|
|Total Bandwidth/Link||50 GB/sec||50 GB/sec||50 GB/sec|
|Bandwidth/Chip||900 GB/sec||600 GB/sec||300 GB/sec|
900GB/sec represents a 50% increase in I/O bandwidth for H100. Which is not as great an increase as H100’s total processing throughput, but a realistic improvement given the ever-escalating complexities in implementing faster networking rates.
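The per-chip totals can be reproduced from the per-link figures with a few lines of arithmetic. The sketch below uses the 25GB/sec per direction per link value and the link counts quoted for each GPU generation (18, 12, and 6); the function and dictionary names are illustrative.

```python
LINK_GB_PER_S_PER_DIRECTION = 25  # per-link, per-direction bandwidth

# Link counts per GPU as quoted for each generation.
LINKS = {
    "NVLink 4 (H100)": 18,
    "NVLink 3 (A100)": 12,
    "NVLink 2 (V100)": 6,
}


def chip_bandwidth(links):
    """Return (per-direction, total) GB/sec for a chip with this many links."""
    per_direction = links * LINK_GB_PER_S_PER_DIRECTION
    return per_direction, per_direction * 2  # total counts both directions


for name, links in LINKS.items():
    per_direction, total = chip_bandwidth(links)
    print(f"{name}: {per_direction} GB/sec each way, {total} GB/sec total")

h100_total = chip_bandwidth(LINKS["NVLink 4 (H100)"])[1]
a100_total = chip_bandwidth(LINKS["NVLink 3 (A100)"])[1]
print(f"generational increase: {h100_total / a100_total - 1:.0%}")
```

This makes the halving rule explicit: H100's 900GB/sec headline figure corresponds to 450GB/sec in each direction.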
Given that NVLink 3 was already running at a 50 Gbit/sec signaling rate, it’s not clear if the additional bandwidth is courtesy of an even faster signaling rate, or if NVIDIA has once again adjusted the number of links coming from the GPU. NVIDIA previously altered the NVLink lane configuration for A100, when they halved the lane width and doubled the number of lanes, all while doubling the signaling rate. Adding lanes on top of that means not having to figure out how to improve the signaling rate by even more, but it also means a 50% increase in the number of pins needed for NVLink I/O.
Along those lines, it’s also worth noting that NVIDIA is adding PCIe 5.0 support with Hopper. As PCIe is still used for host-to-GPU communications (until Grace is ready, at least), this means NVIDIA has doubled their CPU-GPU bandwidth, letting them keep H100 that much better fed. Though putting PCIe 5.0 to good use is going to require a host CPU with PCIe 5.0 support, which isn’t something AMD or Intel are providing quite yet. Presumably, someone will have hardware ready and shipping by the time NVIDIA ships H100 in Q3, especially since NVIDIA is fond of homogenization for their DGX pre-built servers.
Finally, with the launch of H100/NVLink 4, NVIDIA is also using this time to announce a new, external NVLink switch. This external switch extends beyond NVIDIA’s current on-board NVSwitch functionality, which is used to help build more complex GPU topologies within a single node, and allows H100 GPUs to directly communicate with each other across multiple nodes. In essence, it’s a replacement for having NVIDIA GPUs go through Infiniband networks in order to communicate cross-node.
The external NVLink Switch allows for up to 256 GPUs to be connected together within a single domain, which works out to 32 8-way GPU nodes. According to NVIDIA, a single, 1U NVLink Switch offers 128 lanes of NVLink via 32 Octal SFP (OSFP) transceivers. The full Switch, in turn, offers a total bisection bandwidth of 70.4TB/second.
It’s worth noting, however, that the NVLink Switch is not a wholesale replacement for Infiniband – which of course, NVIDIA also sells through its networking hardware division. Infiniband connections between nodes are still needed for other types of communications (e.g. CPU to CPU), so external NVLink networks are a supplement to Infiniband, allowing H100 GPUs to directly chat amongst themselves.
NVIDIA HGX Rides Again: HGX For H100
Last, but not least, NVIDIA has confirmed that they’re updating their HGX baseboard ecosystem for H100 as well. A staple of NVIDIA’s multi-GPU designs since they first began using the SXM form factor for GPUs, HGX baseboards are NVIDIA-produced GPU baseboards for system builders to use in designing complete multi-GPU systems. The HGX boards provide the full connection and mounting environment for NVIDIA’s SXM form factor GPUs, and then server vendors can route power and PCIe data (among other things) from their motherboards to the HGX baseboard. For the current A100 generation, NVIDIA has been selling 4-way, 8-way, and 16-way designs.
Relative to the GPUs themselves, HGX is rather unexciting. But it’s an important part of NVIDIA’s ecosystem. Server partners can pick up an HGX board and GPUs, and then quickly integrate that into a server design, rather than having to design their own server from scratch. Which in the case of H100, means that status quo will (largely) reign, and that NVIDIA’s server partners will be able to assemble systems in the same manner as before.
Hopper H100 Accelerators: Shipping In Q3 2022
Wrapping things up, NVIDIA is planning on having H100-equipped systems available in Q3 of this year. This will include NVIDIA’s full suite of self-built systems, including DGX and DGX SuperPod servers, as well as servers from OEM partners using HGX baseboards and PCIe cards. Though in typical fashion, NVIDIA is not announcing individual H100 pricing, citing the fact that they sell this hardware through server partners. We’ll have a bit more insight once NVIDIA announces the prices of their own DGX systems, but suffice it to say, don’t expect H100 cards to come cheap.
Kevin G - Tuesday, March 22, 2022
"As PCIe is still used for host-to-GPU communications (until Grace is ready, at least)"
It is worth pointing out that GV100 did support native NVLink to POWER8 processors from IBM. nVidia has partnered with other vendors on the CPU front if high performance and bandwidth are necessary. Dunno why the IBM and nVidia relationship fell apart but IBM is going a significantly different direction in terms of system design with their flexible memory/IO topologies.
"The external NVLInk Switch allows for up to 256 GPUs to be connected together within a single domain, which works out to 32 8-way GPU nodes. "
How many switches are necessary to reach that 256 GPU figure? The current sixteen A100 topologies generally use eight switches, though half of them are mainly to propagate the signaling through the backplane.
A 300% increase in performance for 75% more power is still an improvement in performance/watt. This does make me wonder how configurable clocks and power consumption will be on the SXM5 modules as that'll be a huge increase in power consumption per system frame. Being able to run full racks of these systems at full load is going to require new infrastructure in many instances. At that density, liquid cooling is also going to become a requirement. I do see some demands in the HPC/AI sector for more 'drop in' replacements in terms of power consumption even if they're not as performant as the 700W versions.
I'm also surprised that there isn't a version with the entire 6144 bit wide memory bus version. Even for A100 I'm perplexed as to why this didn't happen for memory bandwidth and memory capacity reasons. Are packaging yields really that bad?
Ryan Smith - Tuesday, March 22, 2022
"How many switches are necessary to reach that 256 GPU figure? The current sixteen A100 topologies generally use eight switches, though half of them are mainly to propagate the signaling through the backplane."
NVIDIA's own documentation is less than clear on this point. The Hopper whitepaper says: "a total of 128 NVLink ports in a single 1 RU, 32-cage NVLink Switch"
But looking at their diagrams (which are admittedly mock-ups), it looks like a full 32 node configuration uses 18 NVLink Switches.
A single 1U NVLink Switch offers 32 ports.
Which is not to be confused with the NVSwitch chips for on-board switching. NVIDIA's suggested topology there is 4 NVSwitch chips for an 8-way GPU configuration.
spikebike - Wednesday, March 23, 2022
Makes sense. Each node has 8 GPUs, so 32 nodes have 256 GPUs. If each H100 has 18 links there's a switch (or bus) inside the node and then you run 1 cable to each of 18 switches. Such networks are pretty common with Infiniband networks, which Mellanox does and was purchased by Nvidia. The common Mellanox standard a few years ago was 200Gbit/HDR, which is 25GB/sec per link or 50GB/sec per link bidirectionally, the same numbers NVLink has.
Mike Bruzzone - Monday, March 28, 2022
switching, got it, thank you. mb
mode_13h - Thursday, March 24, 2022
> Dunno why the IBM and nVidia relationship fell apart
I don't think that was the issue. I think the reason POWER 8 supported NVLink was due to a couple big ticket supercomputer contracts. With subsequent machines opting instead to use AMD and Intel CPUs, there was no longer the impetus for IBM to add NVLink support in their newer CPUs.
If IBM had wanted to add NVLink to newer CPUs, I'm pretty sure Nvidia would've let them. They already paid a high price to port their entire CUDA stack to POWER.
> I'm also surprised that there isn't a version with the entire 6144 bit wide memory bus version.
Perhaps Nvidia saves those for really "special" customers with extra-deep pockets. Maybe there just aren't enough of those golden units to be worth publicly announcing.
Cooe - Tuesday, March 22, 2022
Outside of Nvidia's traditional tensor op & AI/ML stronghold this looks absolutely PATHETIC compared to AMD's MI-250X.... 700W for just 30TFLOPS standard FP64??? (One THIRD of the almost 100TFLOPS from the MI-250X, and w/ higher power draw to boot!). Are you KIDDING ME??? People doing serious machine learning will buy Nvidia like they always have, but basically ANYONE in the HPC market is going to take one look at Hopper and say "Yeah.... That's freaking stupid." It's like Nvidia never wants to be in a major supercomputer ever again...
Cooe - Tuesday, March 22, 2022
*That near 100TFLOPs figure for MI-250X is actually matrix FP64 so it's really 60TFLOPS vs 96TFLOPS, but that's still an absolutely MASSIVE gap for the Nvidia part pulling +200W more power!!! Basically half the performance for like +1/3rd more power...
mode_13h - Thursday, March 24, 2022
> Basically half the performance for like +1/3rd more power...
That's awfully fuzzy math, for someone banging on about HPC. MI250X offers about 59.7% more fp64 vector performance and 59.5% more fp64 matrix performance. Not *double*.
As for power, the H100's 700 W is 25% more than the MI250X's 560 W.
Now, if you want to talk efficiency, then we get 99.6% and 99.4% more perf/W at fp64 vector and matrix, respectively. However, that presumes customers will run either at their max rated speeds, which power-sensitive customers are unlikely to do.
cake_lover - Thursday, March 24, 2022
AMD's numbers are a fairytale. Their quoted 95 TFLOPS in fp64 cannot be maintained at 560 watts, and the GPU ramps down its clocks when you try. If you look at the small print on AMD's marketing you can see this for yourself: HPL efficiency is only at ~45% of peak.
Moreover, quoted max TDP does not equal max power. The only way to compare the power efficiency of two processors is to look at the consumed power when running a given workload.
A good vetted source of HPC fp64 processor efficiency is the Green500. If AMD's power efficiency is what they claim it to be in their slideware, then it will show up there.
mode_13h - Friday, March 25, 2022
> AMD's numbers are a fairytale. Their quoted 95 TFLOPS in fp64 cannot
> be maintained at 560 watts, and the GPU ramps down its clocks when you try.
I think both AMD and Nvidia are guilty of pushing specs based on boost clocks.
> the small print on AMD's marketing ...: HPL efficiency is only at ~45% of peak.
That's yet again different. The numbers on *both* AMD and Nvidia's spec sheets are theoretical. For actual benchmark results, there's always a gap with such theoretical numbers.
> Moreover, quoted max TDP does not equal max power.
If you're concerned about *sustained* performance, then TDP should be your number.
> The only way to compare the power efficiency of two processors is
> to look at the consumed power when running a given workload.
Well, yes. We'd like to actually benchmark these things, but most of us cannot. Especially when they haven't even started shipping, yet.