Floating point peak performance of Kaveri and other recent AMD and Intel chips
by Rahul Garg on January 22, 2014 8:30 AM EST
With the launch of Kaveri, some people have been wondering if the platform is suitable for HPC applications. Floating point peak performance of the CPU and GPU on both fp32 and fp64 datatypes is one of the considerations. At launch time, we were not clear on the fp64 performance of Kaveri's GPU but now we have official confirmation from AMD that it is 1/16th the rate of fp32 (similar to most GCN based GPUs except the flagships) and we have verified this on our 7850K by running FlopsCL.
I am taking this opportunity to summarize the info about Kaveri, Trinity, Llano and Intel's competing platforms Haswell and Ivy Bridge on both the CPU and GPU side. We provide a per-cycle estimate for the chips as well as peak calculated in gflops. The estimates are chip-wide, i.e. already take into account the number of cores or modules. Due to turbo boost, it was difficult to decide what frequency to use for peak calculations. For CPUs, we are using the base frequency and for GPUs we are using the boost frequency because in multithreaded and/or heterogeneous scenarios the CPU is less likely to turbo. In any case, we believe our readers are smart enough to calculate peaks at any frequency they want, given that we already supply per-cycle peaks :)
The peak CPU performance will depend on the SIMD ISA that your code was written and compiled for. We consider three cases: SSE, AVX (without FMA) and AVX with FMA (either FMA3 or FMA4).
Platform | Kaveri | Trinity | Llano | Haswell | Ivy Bridge |
---|---|---|---|---|---|
Chip | 7850K | 5800K | 3870K | 4770K | 3770K |
CPU frequency | 3.7 GHz | 3.8 GHz | 3.0 GHz | 3.5 GHz | 3.5 GHz |
SSE fp32 (/cycle) | 16 | 16 | 32 | 32 | 32 |
SSE fp64 (/cycle) | 8 | 8 | 16 | 16 | 16 |
AVX fp32 (/cycle) | 16 | 16 | - | 64 | 64 |
AVX fp64 (/cycle) | 8 | 8 | - | 32 | 32 |
AVX FMA fp32 (/cycle) | 32 | 32 | - | 128 | - |
AVX FMA fp64 (/cycle) | 16 | 16 | - | 64 | - |
SSE fp32 (gflops) | 59.2 | 60.8 | 96 | 112 | 112 |
SSE fp64 (gflops) | 29.6 | 30.4 | 48 | 56 | 56 |
AVX fp32 (gflops) | 59.2 | 60.8 | - | 224 | 224 |
AVX fp64 (gflops) | 29.6 | 30.4 | - | 112 | 112 |
AVX FMA fp32 (gflops) | 118.4 | 121.6 | - | 448 | - |
AVX FMA fp64 (gflops) | 59.2 | 60.8 | - | 224 | - |
It is no secret that AMD's Bulldozer family cores (Steamroller in Kaveri and Piledriver in Trinity) are no match for recent Intel cores in FP performance due to the shared FP unit in each module. As a comparison point, one core in Haswell has the same floating point performance per cycle as two modules (or four cores) in Steamroller.
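As a rough sketch of how the gflops rows follow from the per-cycle rows, the snippet below multiplies each chip-wide fp32 per-cycle figure by the base clock listed in the table. The function and dictionary names are ours, chosen for illustration, and the Haswell core count of four is an assumption that the table itself does not state.

```python
# Base clocks (GHz) and chip-wide fp32 flops/cycle, copied from the CPU table.
CPU_GHZ = {"Kaveri": 3.7, "Trinity": 3.8, "Llano": 3.0,
           "Haswell": 3.5, "Ivy Bridge": 3.5}
FP32_PER_CYCLE = {
    "SSE": {"Kaveri": 16, "Trinity": 16, "Llano": 32,
            "Haswell": 32, "Ivy Bridge": 32},
    "AVX": {"Kaveri": 16, "Trinity": 16, "Haswell": 64, "Ivy Bridge": 64},
    "AVX FMA": {"Kaveri": 32, "Trinity": 32, "Haswell": 128},
}
# Assumption: the 4770K is a quad-core part (not stated in the table).
HASWELL_CORES = 4


def peak_gflops(flops_per_cycle, ghz):
    """Peak gflops = chip-wide flops per cycle x clock in GHz."""
    return flops_per_cycle * ghz


def fp32_peaks(isa, clocks=CPU_GHZ):
    """Return {chip: peak fp32 gflops} for one ISA, rounded to one decimal."""
    return {chip: round(peak_gflops(n, clocks[chip]), 1)
            for chip, n in FP32_PER_CYCLE[isa].items()}


for isa in FP32_PER_CYCLE:
    print(isa, fp32_peaks(isa))

# One Haswell core versus the whole two-module Kaveri chip, with FMA.
haswell_per_core = FP32_PER_CYCLE["AVX FMA"]["Haswell"] / HASWELL_CORES
print(haswell_per_core, FP32_PER_CYCLE["AVX FMA"]["Kaveri"])
```

To evaluate peaks at a different frequency, such as a turbo clock, pass your own mapping as the `clocks` argument.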
Now onto GPU peaks. Here, for Haswell, we chose to include both GT2 and GT3e variants.
Platform | Kaveri | Trinity | Llano | Haswell GT3e | Haswell GT2 | Ivy Bridge |
---|---|---|---|---|---|---|
Chip | 7850K | 5800K | 3870K | 4770R | 4770K | 3770K |
GPU frequency | 720 MHz | 800 MHz | 600 MHz | 1.3 GHz | 1.25 GHz | 1.15 GHz |
fp32/cycle | 1024 | 768 | 800 | 640 | 320 | 256 |
fp64/cycle (OpenCL) | 64 | 48** | 0 | 0 | 0 | 0 |
fp64/cycle (Direct3D) | 64 | 0? | 0 | 160 | 80 | 64 |
fp32 gflops | 737.3 | 614.4 | 480 | 832 | 400 | 294.4 |
fp64 gflops (OpenCL) | 46.1 | 38.4** | 0 | 0 | 0 | 0 |
fp64 gflops (Direct3D) | 46.1 | 0? | 0 | 208 | 100 | 73.6 |
The fp64 support situation is a bit of a mess because some GPUs only support fp64 under some APIs. The fp64 rate of Intel's GPUs does not appear to be published, but David Kanter provides an estimate of 1/4 the fp32 rate. However, Intel enables fp64 only under DirectCompute and not under OpenCL for any of its GPUs.
The situation on AMD's Trinity/Richland is even more complicated. fp64 support under OpenCL is not standards-compliant and depends upon using a proprietary extension (cl_amd_fp64). From what I can tell, Trinity/Richland do not appear to support fp64 under DirectCompute (and Microsoft's C++ AMP implementation). From an API standpoint, Kaveri's GCN GPU should work fine for fp64 under all APIs.
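The fp64 rows in the GPU table can be reproduced from the fp32 per-cycle figures and an fp64:fp32 rate. The sketch below assumes 1/16 for Kaveri (the rate AMD confirmed), 1/16 for Trinity (inferred from the table's 48 versus 768 per cycle), and 1/4 for Intel (David Kanter's estimate, not a published figure). It computes raw rates only and does not model which API actually exposes fp64 on each part.

```python
# name: (fp32 flops/cycle, boost clock in GHz, assumed fp64:fp32 rate)
GPUS = {
    "Kaveri": (1024, 0.72, 1 / 16),        # rate confirmed by AMD
    "Trinity": (768, 0.80, 1 / 16),        # rate inferred from the table
    "Haswell GT3e": (640, 1.30, 1 / 4),    # rate is an estimate
    "Haswell GT2": (320, 1.25, 1 / 4),     # rate is an estimate
    "Ivy Bridge": (256, 1.15, 1 / 4),      # rate is an estimate
}


def fp64_peak(name):
    """Return (fp64 flops/cycle, fp64 gflops) for one GPU in GPUS."""
    fp32_per_cycle, ghz, rate = GPUS[name]
    fp64_per_cycle = fp32_per_cycle * rate
    return fp64_per_cycle, fp64_per_cycle * ghz


for name in GPUS:
    per_cycle, gflops = fp64_peak(name)
    print(f"{name}: {per_cycle:g} fp64/cycle, {gflops:.1f} gflops")
```

Llano is left out because the table lists no fp64 support for it under either API.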
Some of you might be wondering whether Kaveri is good for HPC applications. Applications that have already been ported to discrete GPUs and work well there will continue to run best on discrete GPUs. However, Kaveri and HSA will enable many more applications to be GPU accelerated.
Now we compare Kaveri against Haswell. For applications that depend on fp64 performance, conditions are generally not favorable to Kaveri. Kaveri's fp64 peak, including both the CPU and GPU, is only about 105 gflops (59.2 from the CPU with FMA plus 46.1 from the GPU). You will generally be better off first optimizing your code for AVX and FMA instructions and running on Haswell's CPU cores. If you are using Windows 8, you might also want to explore using Iris Pro through C++ AMP in conjunction with the CPU. Overall, I doubt we will see Kaveri being used for fp64 workloads.
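To make the fp64 comparison concrete, the sketch below adds Kaveri's CPU peak (AVX with FMA) to its GPU peak and sets the sum against the Haswell CPU alone, using only figures from the two tables above. Treating the CPU and GPU peaks as additive is an idealization on our part; real heterogeneous code rarely keeps both fully busy.

```python
def gflops(flops_per_cycle, ghz):
    """Peak gflops = chip-wide flops per cycle x clock in GHz."""
    return flops_per_cycle * ghz


# Kaveri 7850K: fp64 per-cycle figures and clocks from the tables.
kaveri_cpu = gflops(16, 3.7)    # AVX FMA fp64, base clock
kaveri_gpu = gflops(64, 0.72)   # GPU fp64, boost clock
kaveri_total = kaveri_cpu + kaveri_gpu

# Haswell 4770K: CPU cores only, AVX FMA fp64 at base clock.
haswell_cpu = gflops(64, 3.5)

print(f"Kaveri CPU + GPU fp64: {kaveri_total:.1f} gflops")
print(f"Haswell CPU fp64:      {haswell_cpu:.1f} gflops")
print(f"Haswell CPU / Kaveri:  {haswell_cpu / kaveri_total:.2f}x")
```

On these peak numbers the Haswell CPU cores alone come out at a little more than twice Kaveri's combined CPU and GPU figure.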
For heterogeneous fp32 applications, Kaveri should outperform Haswell GT2 and Ivy Bridge. Haswell GT3e will again be a strong contender on Windows given the extremely capable Haswell CPU cores and Iris Pro graphics. Intel's GPUs do not currently support OpenCL under Linux, but a driver is being worked on. Thus, on Linux, Kaveri will simply win out on fp32 heterogeneous applications. However, even on Windows, Haswell GT3e will get strong competition from Kaveri. While AMD has advantages such as the excellent GCN architecture and the HSA software stack (when ready), which will enable many more applications to take advantage of the GPU, Iris Pro will have the eDRAM to potentially provide much improved bandwidth, plus the backing of strong CPU cores.
I hope I have provided a fair overview of the FP capabilities of each platform. Application performance will of course depend on many more factors. Your questions and comments are welcome.
101 Comments
TheinsanegamerN - Wednesday, January 22, 2014 - link
ive noticed the same thing. gaming wise, the desktop a10 trinity creamed ivy bridge. on mobile, though, the performance difference was only 18% higher in favor of amd. with haswell, intel hits the same performance as mobile richland a10s in games, and gets better battery life to boot.
on the other hand, the performance of the 45 watt a8-7600 makes me hopeful that amd will give us another 45 watt mobile fusion apu that would be as fast as the desktop version.
YuLeven - Thursday, January 23, 2014 - link
I'm dreaming of that too. It would be a shame if mobile Kaveri took the same huge performance hit that its older brothers saw when moving from desktop to mobile.
If history repeats itself, I think Broadwell will hit Kaveri-M very hard, relegating it to the same shady spot on poor budget designs that Llano, Trinity and Richland were. I would love to see an AMD APU performing strong on a good laptop. If Kaveri-M ever threatens Broadwell, at least for gaming-focused folk, it would cause the healthy impact that competition causes on Intel. Lower prices, better parts.
toyotabedzrock - Friday, January 24, 2014 - link
If you want to know what is in store for Broadwell you have to watch the Linux kernel mailing lists or read a certain site that watches the video driver commits like a hawk.
Intel has already been adding support for the broadwell gpu for some time. For Linux 3.14 they started adding the framework for a new Cpu feature in skylake and broadwell audio support.
http://www.phoronix.com/scan.php?page=home
Bob Todd - Wednesday, January 22, 2014 - link
Hopefully, but that requires design wins which they have been sorely lacking compared to Intel. And AMD seems practically non-existent in the SFF space. Where is their NUC? Hell, where are their mITX boards? Newegg shows a whopping 3 FM2+ mITX boards and 2 FM2 boards. Intel has 24 just for Haswell, and another 19 for Sandy/Ivy.
MrSpadge - Wednesday, January 22, 2014 - link
You know with DP at 1/16th SP they're not even trying. They could easily go up to 1/4th, though.
wumpus - Saturday, February 8, 2014 - link
Maybe, maybe not. I suspect they aren't trying, but I wouldn't write any code that expected strict IEEE754 rounding in single (crypto, perhaps). Strict rounding needs close to 4 times the multiplies that you would need for an unrounded multiply, so they could be wimping out there.
Personally, I'd rather have more floats that are off by a bit than strict 754 rounding on my floats, but can't see doing it as long as there are claims of "IEEE754" compatibility. Violating 754 has a *long* history (there have been plenty of -754strict compiler flags that kill performance), and there are plenty of ways to weasel a datasheet, but violating a spec is something an engineer *does* *not* *do*. When a careful engineer sees something like this, he won't go near the edge conditions (and rounding and the other 754 nastiness is about as edge condition as you can get).
KenLuskin - Saturday, January 25, 2014 - link
Kaveri was NOT created to be high priced chip.
Kaveri was NOT even created to be a desktop chip.
Kaveri was designed for laptops, but with enough GPU to run AAA games.
Kaveri is designed to be AFFORDABLE for very low priced laptops.
Most people do NOT need any more speed out of their CPU.
But, they would like the ability to run AAA games.
Kaveri 45 Watt for $120 blows away an i3 at $130 in graphics!
And that is without MANTLE!
sanaris - Saturday, March 1, 2014 - link
No one among laptop users needs direct access from shader units to caches.
Any real laptop user will buy Intel with an NVidia discrete card, because they are supported on Linux.
toyotabedzrock - Wednesday, January 22, 2014 - link
I noticed both the companies gpu's are slower than the cpu for fp64.
tipoo - Thursday, January 23, 2014 - link
Probably because DP floating point calculations are crippled, so those who need it buy the full FirePro or Quadro parts.