The Bulldozer Aftermath: Delving Even Deeperby Johan De Gelas on May 30, 2012 1:15 AM EST
It has been months since AMD's Bulldozer architecture surprised the hardware enthusiast community with performance all over the place. The opinions vary wildly from “server benchmarks are here, and they're a catastrophe” to “Best Server Processor of 2011”. The least you can say is that the idiosyncrasies of AMD's latest CPU architecture have stirred up a lot of dust.
Now that the dust has settled, the Bulldozer chips now account for more than half of Opteron shipments and revenues. Since AMD's Financial Analyst Day (February 2, 2012), we have new code names: the improved Bulldozer architecture "Piledriver" will power the "Abu Dhabi" chip, a replacement for the current top server chip "Interlagos". AMD is clearly committed to the new "Bulldozer" direction: fitting as many cores as possible into a certain power envelope to improve thread throughput, while trying to "hold the line" on single-threaded performance.
In theory, the new 16-core Interlagos should have offered somewhere around a 33% boost in most highly-threaded applications. The reality is unfortunately not that rosy: in many highly-threaded server applications such as OLAP databases and virtualization, the new Opteron 6200 fails to impress and is only a few percent faster than it's older brother the 12-core Magny-Cours. There are even times where the older Opteron is faster.
Some, including sources inside AMD, have blamed Global Foundries for not delivering higher clocked SKUs. Sure, the clock speed targets for Interlagos were probably closer to 3GHz instead of 2.3GHz. But that does not explain why the extra integer cores do not deliver. We were promised up to 50% higher performance thanks to the 33% extra cores, but we got 20% at the most.
The combination of low single-threaded performance, the failure to really outperform the previous generation in highly-threaded applications, the relatively high power consumption at full load, and the fact that the CPU is designed for high clock speeds gives a lot of people a certain sense of Déjà vu: is this AMD's version of the Pentum 4 ?
One of our readers, "Iketh", spoke up and voiced the opinion of many of our readers:
" Unfortunately, the thought still in the back of my mind while reading was why did AMD reinvent the Pentium 4? I just don't get it."
Another reader nicknamed "Clagmaster" commented:
"A core this complex in my opinion has not been optimized to its fullest potential. Expect better performance when AMD introduces later steppings of this core with regard to power consumption and higher clock frequencies."
Although there have already been quite a few attempts to understand what Bulldozer is all about, we cannot help but not feel that many questions are still unanswered. Since this architecture is the foundation of AMD's server, workstation, and notebook future (Trinity is based on the improved Bulldozer core with the codename "Piledriver"), it is interesting enough to dig a little deeper. Did AMD take a wrong turn with this architecture? And if not, can the first implementation "Bulldozer" be fixed relatively easily?
We decided to delve deeper into the SAP and SPEC CPU2006 results, as well as profiling our own benchmarks. Using the profiling data and correlating it with what we know about AMD's Bulldozer and Intel's Sandy Bridge, we attempt to solve the puzzle.
Post Your CommentPlease log in or sign up to comment.
View All Comments
Spunjji - Wednesday, June 6, 2012 - linkAgreed. That will be nice!
haukionkannel - Wednesday, May 30, 2012 - linkVery nice article! Can we get more thorough explanation about µop cache? It seems to be important part of Sandy bridge and you predict that it would help bulldoser...
How complex it is to do and how heavily it has been lisensed?
JohanAnandtech - Thursday, May 31, 2012 - linkDon't think there is a license involved. AMD has their own "macro ops" so they can do a macro ops cache. Unfortunately I can not answer your question of the top of head on how easy it is to do, I would have to some research first.
name99 - Thursday, May 31, 2012 - linkOh for fsck's sake.
The stupid spam filter won't let me post a URL.
Do a google search for
sandy bridge Real World Technologies
and look at the main article that comes up.
SocketF - Friday, June 1, 2012 - linkIt is already planned, AMD has a patent for sth like that, google for "Redirect Recovery Cache". Dresdenboy found it already back in 2009:
The BIG Question is:
Why did AMD not implement it yet?
My guess is that they were already very busy with the whole CMT approach. Maybe Streamroller will bring it, there are some credible rumors in that direction.
yuri69 - Wednesday, May 30, 2012 - linkHowdy,
FOA thanks for the effort to investigate the shortcomings of this march :)
Quoting M. Butler (BD's chief architect): 'The pipeline within our latest "Bulldozer" microarchitecture is approximately 25 percent deeper than that of the previous generation architectures. ' This gives us 12 stages on K8/K10 => 12 * 1.25 = 15.
Btw all the major and significant architectural improvements & features for the upcoming BD successor line were set in stone long time ago. Remember, it takes 4-5 years for a general purpose CPU from the initial draft to mass availability. The stage when you can move and bend stuff seems to be around half of this period.
BenchPress - Wednesday, May 30, 2012 - link"This means that Bulldozer should be better at extracting ILP (Instruction Level Parallelism) out of code that has low IPC (Instructions Per Clock)."
This should be reversed. ILP is inherent to the code, and it's the hardware's job to extract it and achieve a high IPC.
Arnulf - Wednesday, May 30, 2012 - linkUgh, so much crap in a single article ... this should never have been posted on AT.
You weren't promised anything. You came across a website put up by some "fanboy" dumbass and you're actually using it as a reference. Why not quote some actual references (such as transcripts of the conference where T. Seifert clearly stated that gains are expected to be in line with core number increase, i.e. ~33%) instead of rehashing this Fruehe nonsense ?
erikvanvelzen - Wednesday, May 30, 2012 - linkYes AMD totally set out to make a completely new architecture with a massive increase in transistors per core but 0 gains in IPC.
Don't fool yourself.
Homeles - Wednesday, May 30, 2012 - linkIt's a more intelligent analysis than your sorry ass could ever produce. Getting hung up on one quote... really?