The Architecture

We'll start, logically, at the front end of a Bulldozer module. The fetch and decode logic in each module is shared by both integer cores. Its role is to fetch the next instruction in the thread being executed, decode the x86 instruction into AMD's own internal format, and pass the decoded instruction on to the scheduling hardware for execution.

AMD widened the front end with Bulldozer. Each module is now able to fetch and decode up to four x86 instructions from a single thread in parallel, and all four decoders are equally capable. Remember, though, that each Bulldozer module appears as two cores: the front end can only fetch and decode from one thread at a time. To compensate, a module can switch between its two threads as often as every clock.
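
Conceptually the arrangement works something like the toy model below (a minimal C sketch of the sharing policy as described, not AMD's actual selection logic): on any given clock all four decode slots belong to one thread, but ownership can flip every cycle.

```c
#include <stdio.h>

#define DECODE_WIDTH 4

int main(void) {
    int remaining[2] = {10, 7};   /* instructions left in each thread */
    int cycle = 0, thread = 0;

    while (remaining[0] > 0 || remaining[1] > 0) {
        /* Skip a thread that has nothing left to decode this cycle. */
        if (remaining[thread] == 0)
            thread ^= 1;

        /* All four decode slots go to the selected thread. */
        int issued = remaining[thread] < DECODE_WIDTH
                   ? remaining[thread] : DECODE_WIDTH;
        remaining[thread] -= issued;

        printf("cycle %d: decode %d instruction(s) from thread %d\n",
               cycle++, issued, thread);
        thread ^= 1;              /* switch threads every clock */
    }
    return 0;
}
```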

Decode hardware isn't very expensive on its own, but duplicating it in every core quickly adds up. And although decode width has increased for a single core, multi-core Bulldozer configurations can actually be at a disadvantage compared to previous AMD architectures. The table below shows why:

Front End Comparison

                                   AMD Phenom II          AMD FX                 Intel Core i7
Instruction Decode Width           3-wide                 4-wide                 4-wide
Single Core Peak Decode Rate       3 instructions         4 instructions         4 instructions
Dual Core Peak Decode Rate         6 instructions         4 instructions         8 instructions
Quad Core Peak Decode Rate         12 instructions        8 instructions         16 instructions
Six/Eight Core Peak Decode Rate    18 instructions (6C)   16 instructions (8C)   24 instructions (6C)
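
The peaks fall straight out of the arithmetic: decode width multiplied by the number of front ends on the chip, where Phenom II and Core i7 carry one front end per core and FX carries one per two-core module. A quick illustrative sketch in C (note the last table row pairs the 6-core Phenom II and Core i7 against the 8-core FX):

```c
#include <stdio.h>

/* Peak decode rate = decode width x number of front ends.
 * Phenom II: one 3-wide front end per core.
 * FX:        one 4-wide front end per two-core module.
 * Core i7:   one 4-wide front end per core. */
int main(void) {
    int counts[] = {1, 2, 4, 6, 8};
    int i;
    for (i = 0; i < 5; i++) {
        int cores = counts[i];
        printf("%d cores: Phenom II %2d, FX %2d, Core i7 %2d per clock\n",
               cores,
               3 * cores,              /* one 3-wide decoder per core    */
               4 * ((cores + 1) / 2),  /* one 4-wide decoder per module  */
               4 * cores);             /* one 4-wide decoder per core    */
    }
    return 0;
}
```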

For a single instruction thread, Bulldozer offers more front-end bandwidth than its predecessor. The front end is wider and just as capable, so this makes sense. But note what happens when we scale up core count.

Since fetch and decode hardware is shared per module, and AMD counts each module as two cores, at an equivalent core count the old Phenom II actually offers a higher peak instruction fetch/decode rate than the FX. The theory, obviously, is that fetch/decode-bound situations are infrequent enough to justify sharing the hardware, and AMD is correct for the most part. Many instructions take multiple cycles to decode, and by switching between threads each cycle the pipelined front end can be utilized more efficiently. It's only in unusually bursty situations that the front end becomes a limit.

Compared to Intel's Core architecture, however, AMD is at a disadvantage here. Against the high-end offerings where Intel enables Hyper Threading, AMD has no advantage at all, as Intel can weave in instructions from two threads every clock. It's only against the non-HT Core CPUs that the picture is less clear: Intel maintains a higher instantaneous decode bandwidth per clock, but overall decoder utilization could drop since each fetch queue can only be filled from a single thread.

After the decoders, AMD enables certain operations to be fused together and treated as a single operation down the rest of the pipeline. This is similar to Intel's fusion technologies, first introduced as micro-op fusion in the Banias CPU in 2003 and later extended to the macro-fusion of compare + branch pairs. Compare + branch, test + branch and some other combinations can be fused together after decode in Bulldozer, effectively widening the execution back end of the CPU. This wasn't possible in Phenom II and obviously helps increase IPC.
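
To make that concrete: a compare + branch pair like the loop condition below typically compiles to a CMP immediately followed by a conditional jump, which is exactly the kind of pair Bulldozer can fuse after decode. The C below is purely illustrative; whether fusion actually fires depends on the instruction pair the compiler emits.

```c
#include <stdio.h>

/* The loop exit test and the body's sign check each compile to a
 * compare + conditional-branch pair. With fusion, each pair moves
 * down the pipe as one operation instead of two. */
long sum_positive(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {  /* cmp i,n  + branch -> fusable pair */
        if (a[i] > 0)               /* cmp a[i],0 + branch -> fusable pair */
            sum += a[i];
    }
    return sum;
}

int main(void) {
    long v[] = {3, -1, 4, -1, 5};
    printf("%ld\n", sum_positive(v, 5));  /* prints 12 */
    return 0;
}
```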

A Decoupled Branch Predictor

AMD didn't disclose much about the configuration of the branch prediction hardware in Bulldozer, but it is quick to point out one significant improvement: the branch predictor is now largely decoupled from the processor's front end.

The role of the branch predictor is to intercept branch instructions and predict their target address, rather than letting cycles go to waste until the branch target is known for certain. Branches are predicted based on historical data. The more data you have, and the better your branch predictors are tuned to your workload, the more accurate your predictions will be. Accurate branch prediction is particularly important in architectures with deep pipelines, as a mispredict causes more instructions to be flushed out of the pipe. Bulldozer introduces a significantly deeper pipeline than its predecessor (more on this later), so branch prediction improvements are necessary.

In both Phenom II and Bulldozer, branches are predicted in the front end of the pipe alongside the fetch hardware. In Phenom II, however, any stall in the fetch pipeline (e.g. an instruction fetch that misses in the cache) would stop the entire pipeline, future branch predictions included. Bulldozer decouples the branch prediction hardware from the fetch pipeline by way of a prediction queue: if the fetch pipeline stalls, Bulldozer's branch prediction hardware can run ahead and continue making predictions until the queue is full.
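
In software terms the decoupling looks something like the sketch below (a toy model, nothing like the real hardware): the predictor pushes predicted fetch addresses into a fixed-depth queue, so a stall on the fetch side no longer stops prediction until that queue fills.

```c
#include <stdio.h>

#define PRED_Q_DEPTH 8

typedef struct {
    unsigned addr[PRED_Q_DEPTH];
    int head, tail, count;
} pred_queue;

static int q_full(pred_queue *q)  { return q->count == PRED_Q_DEPTH; }
static int q_empty(pred_queue *q) { return q->count == 0; }

static void q_push(pred_queue *q, unsigned a) {
    q->addr[q->tail] = a;
    q->tail = (q->tail + 1) % PRED_Q_DEPTH;
    q->count++;
}

static unsigned q_pop(pred_queue *q) {
    unsigned a = q->addr[q->head];
    q->head = (q->head + 1) % PRED_Q_DEPTH;
    q->count--;
    return a;
}

int main(void) {
    pred_queue q = {0};
    unsigned next_pc = 0x1000;
    for (int cycle = 0; cycle < 16; cycle++) {
        int fetch_stalled = (cycle >= 4 && cycle < 10); /* pretend I-cache miss */

        /* The predictor runs ahead regardless of the fetch stall,
         * until the prediction queue fills up. */
        if (!q_full(&q)) {
            q_push(&q, next_pc);
            next_pc += 16;  /* predicted next fetch block */
        }

        /* Fetch consumes predictions only when it isn't stalled. */
        if (!fetch_stalled && !q_empty(&q))
            printf("cycle %2d: fetch 0x%x\n", cycle, q_pop(&q));
        else
            printf("cycle %2d: fetch stalled, queue depth %d\n", cycle, q.count);
    }
    return 0;
}
```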

We'll get to the effectiveness of this approach shortly.

Scheduling and Execution Improvements

As Intel did with Sandy Bridge, AMD migrated to a physical register file architecture with Bulldozer. Data is now stored in only one location, the physical register file, and is tracked via pointers back into the PRF as operations make their way through the execution engine. This is a move to save power, as shuffling copies of data around a chip is hardly power efficient.
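
A toy illustration of the idea (far simpler than real rename hardware, and the structure names are ours): a value lives exactly once in the physical register file, and producing a new result means pointing the architectural register at a fresh physical entry rather than copying data around.

```c
#include <stdio.h>

#define ARCH_REGS 16
#define PHYS_REGS 64

static long prf[PHYS_REGS];        /* the only place values live       */
static int  rat[ARCH_REGS];        /* rename table: arch -> phys index */
static int  next_free = ARCH_REGS; /* trivial allocator; real hardware
                                      recycles entries via a free list */

/* Rename an op's destination: allocate a new physical register and point
 * the architectural register at it. Consumers of older values keep their
 * old pointers, which is what enables out-of-order execution. */
static int rename_dst(int arch_reg) {
    rat[arch_reg] = next_free++ % PHYS_REGS;
    return rat[arch_reg];
}

int main(void) {
    for (int i = 0; i < ARCH_REGS; i++) rat[i] = i;

    /* "add r0, r1, r2": read operands through pointers, write a new entry. */
    prf[rat[1]] = 20;
    prf[rat[2]] = 22;
    int dst = rename_dst(0);
    prf[dst] = prf[rat[1]] + prf[rat[2]];

    printf("arch r0 -> phys p%d = %ld\n", rat[0], prf[rat[0]]);
    return 0;
}
```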

The buffers and queues that feed the chip's execution engines are all larger in Bulldozer than they were in Phenom II. Larger structures allow the hardware to extract more instruction-level parallelism when executing operations out of order. In other words, the issue hardware in Bulldozer is beefier than its predecessor's.

Unfortunately, where AMD took a step forward in issue hardware, it does a bit of a shuffle when it comes to the execution resources themselves. Let's start with the positive: Bulldozer's integer execution cores.

Integer Execution

Each Bulldozer module features two fully independent integer cores. Each core has its own integer scheduler, register file and 16KB L1 data cache. The integer schedulers are both larger than their counterparts in the Phenom II.

The biggest change here is that each integer core now has two ALU/AGU port pairs instead of three. AMD claims the third ALU/AGU pair went mostly unused in Phenom II, and as a result it has been removed from Bulldozer.

With larger structures feeding the integer cores, AMD should have an easier time keeping the integer units busy than in previous designs. Phenom II could, in theory, execute more integer operations per core, but AMD claims the architecture was typically bound elsewhere.

The Shared FP Core

A single Bulldozer module has one FP core shared by up to two threads. If only a single FP thread is available, it's given full access to the FP execution hardware; otherwise the resources are shared between the two threads.

Compared to a quad-core Phenom II, AMD's eight-core (quad-module) FX sees no drop in floating point execution resources. AMD's architecture has always had independent scheduling for integer and floating point instructions, and we see the same number of execution ports between Phenom II cores and FX modules. Just as is the case with the integer cores, the shared FP core in a Bulldozer module has larger scheduling hardware in front of it than the FPU in Phenom II.

The problem is that AMD had to increase the functionality of its FPU with the move to Bulldozer. The Phenom II architecture lacked SSE4 and AVX support, both of which were added in Bulldozer. Furthermore, AMD chose Bulldozer as the architecture to introduce support for fused multiply-add (FMA) instructions. Enabling FMA support also increases the relative die area of the FPU. So while the throughput of Bulldozer's FPU hasn't increased over its predecessor's, its capabilities have. Unfortunately this means that peak FP throughput in x87/SSE2/3 workloads remains unchanged from the previous generation; Bulldozer will only be faster if newer SSE, AVX or FMA instructions are used, or if its clock speed is significantly higher than Phenom II's.
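
A fused multiply-add computes a*b + c as one operation with a single rounding step. The quickest way to see it from C is the standard fma() from <math.h>, which a compiler should lower to a hardware FMA instruction when the target has one (on GCC targeting Bulldozer that would mean building with -mfma4; the flag and lowering behavior are the toolchain's, stated here as an assumption). Without hardware support, the library falls back to a slower software routine that gives the same answer. The example also shows the single-rounding difference:

```c
/* Compile (GCC, assumed flags): cc -O2 -mfma4 fma_demo.c -lm */
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1e16, b = 1.0 + DBL_EPSILON, c = -1e16;

    double fused    = fma(a, b, c); /* one rounding: the tiny term survives */
    double separate = a * b + c;    /* two roundings: the tiny term is lost */

    printf("fma(a,b,c) = %.17g\n", fused);    /* ~2.2204460492503131 */
    printf("a*b + c    = %.17g\n", separate); /* 2                   */
    return 0;
}
```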

Looking at our Cinebench 11.5 multithreaded workload we see the perfect example of this performance shuffle:

Cinebench 11.5—Multi-Threaded

Despite a 9% higher base clock speed (more if you include Turbo Core), a 3.6GHz 8-core Bulldozer only outperforms a 3.3GHz 6-core Phenom II by less than 2%. Heavily threaded floating point workloads may not see big gains on Bulldozer compared to the 6-core predecessor.

There's another issue: at launch, Bulldozer won't simply have to outperform its quad-core predecessor; it will need to do better than a six-core Phenom II. In that comparison, unfortunately, the Phenom II has a definite throughput advantage. With six FPUs against the FX's four shared FP cores, the Phenom II X6 can execute 50% more SSE2/3 and x87 FP instructions per clock than a Bulldozer based FX.

Since the release of the Phenom II X6, AMD's major advantage has been heavily threaded workloads, particularly floating point workloads, thanks to the sheer number of resources available per chip. Bulldozer takes a step back in this regard, and as a result some of those same workloads will perform no better, and in some cases worse, than on the outgoing Phenom II X6.

Compared to Sandy Bridge, Bulldozer has only two advantages in FP performance: FMA support and higher 128-bit AVX throughput. There's very little code available today that uses AMD's FMA instructions, while the 128-bit AVX advantage is tangible.

Cache Hierarchy and Memory Subsystem

Each integer core features its own dedicated L1 data cache. The shared FP core issues loads/stores through either of the integer cores, similar to how things worked in Phenom II, although there are now two integer cores to deal with instead of one. Bulldozer enables fully out-of-order loads and stores, an improvement over Phenom II that puts it on par with current Intel architectures. The L1 instruction cache is shared by the entire Bulldozer module, as is the L2 cache.

The instruction cache is a large 64KB 2-way set associative cache, similar in size to Phenom II's L1 instruction cache but obviously shared by more "cores". A four-core Phenom II has 256KB of total L1 I-cache, while a four-core (two-module) Bulldozer has half that. The L1 data caches are also significantly smaller than those of Bulldozer's predecessor: while Phenom II offered a 64KB L1 D-cache per core, Bulldozer offers only 16KB per integer core.

The L2 cache, however, is much larger than what we saw in multi-core Phenom II designs: each Bulldozer module has a private 2MB L2 cache, versus 512KB per core in Phenom II.

There's a single 8MB L3 cache shared among all Bulldozer modules on the chip. For this first incarnation, AMD has no plans to offer a desktop part without an L3 cache. However, AMD indicated that the L3 is only really useful in server workloads, so we may see future Bulldozer derivatives (ahem, Trinity?) forgo the L3 cache entirely.

Cache accesses require more clocks in Bulldozer, due to a combination of size and AMD's desire to make Bulldozer a very high clock speed part...
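
Load-to-use latency at each level is easy to eyeball with a pointer-chasing loop, where every load depends on the one before it, so the chain runs at full memory latency for whatever footprint you choose (8KB fits in the 16KB L1, 1MB in the 2MB L2, 6MB in the 8MB L3). A rough sketch; proper tools randomize the chain to defeat hardware prefetchers:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    size_t bytes = (argc > 1) ? strtoul(argv[1], NULL, 0) : 8192;
    size_t n = bytes / sizeof(size_t), steps = 100000000;
    size_t *ring = malloc(n * sizeof(size_t));
    if (!ring) return 1;

    /* Stride by one 64-byte cache line (8 size_t) in a closed loop. */
    for (size_t i = 0; i < n; i++) ring[i] = (i + 8) % n;

    clock_t t0 = clock();
    volatile size_t idx = 0;
    for (size_t i = 0; i < steps; i++) idx = ring[idx]; /* dependent loads */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("%zu KB footprint: %.2f ns per dependent load\n",
           bytes / 1024, 1e9 * secs / steps);
    free(ring);
    return 0;
}
```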


430 Comments


  • THizzle7XU - Wednesday, October 12, 2011

    Well, why would you target the variable PC segment when you can program for a well established, large user-base platform with a single configuration and make a ton more money with probably far less QA work since there's only one set (two for multi-platform PS3 games) of hardware to test?

    And it's not like 360/PS3 games suddenly look like crap 5-6 years into their cycles. Think about how good PS2 games looked 7 years into that system's life cycle (God of War 2). Devs are just now getting the most out of the hardware. It's a great time to be playing games on 360/PS3 (and PC!).
  • GatorLord - Wednesday, October 12, 2011

    Consider what AMD is and what AMD isn't, and where computing is headed, and this chip really begins to make sense. While these benches seem frustrating to those of us on a desktop today, I think a slightly deeper dive shows that there is a whole world of hope here...with these chips, not something later.

    I dug into the deal with Cray and Oak Ridge, and Cray is selling ORNL massively powerful computers (think petaflops) using Bulldozer CPUs controlling Nvidia Tesla GPUs which perform the bulk of the processing. The GPUs do vastly more and faster FPU calculations and the CPU is vastly better at dishing out the grunt work and processing the results for use by humans or software or other hardware. This is the future of High Performance Computing, today, but on a government scale. OK, so what? I'm a client user.

    Here's what: AMD is actually best at making GPUs...no question. They have been in the GPGPU space as long as Nvidia...except the AMD engineers can collaborate on both CPU and GPU projects simultaneously without a bunch of awkward NDAs and antitrust BS getting in the way. That means that while they obviously can turn humble server chips into supercomputers by harnessing the many cores on a graphics card, how much more than we've seen is possible on our lowly desktops when this rebranded server chip enslaves the Ferraris on the PCI bus next door...the GPUs.

    I get it...it makes perfect sense now. Don't waste die real estate on FPUs when the ones next door are hundreds or thousands of times better and faster too. This is not the beginning of the end of AMD, but the end of the beginning (to shamelessly quote Churchill). Now all that cryptic talk about a supercomputer in your tablet makes sense...think Llano with a so-so CPU and a big GPU on the same die, with some code tweaks to schedule the GPU as a massive FPU, and the picture starts taking shape.

    Now imagine a full blown server chip (BD) harnessing full blown GPUs...Radeon 6XXX or 7XXX and we are talking about performance improvements in the orders of magnitude, not percentage points. Is AMD crazy? I'm thinking crazy like a fox.

    Oh..as a disclaimer, while I'm long AMD...I'm just an enthusiast like the rest of you and not a shill...I want both companies to make fast chips that I can use to do Monte Carlos and linear regressions...it just looks like AMD has figured out how to play the hand they're holding for a change...here's to the future for us all.
  • Menoetios - Wednesday, October 12, 2011

    I think you bring up a very good point here. This chip looks like it's designed to be very closely paired with a highly programmable GPU, which is where the GPU roadmaps are leading over the next year and a half. While the apples-to-apples nature of this review draws a disappointing picture, I'm very curious how AMD's "Fusion" products next year will look, as the various compute elements of the CPU and GPU become more tightly integrated. Bulldozer appears to fit perfectly into an ecosystem that we don't quite have yet.
  • GatorLord - Wednesday, October 12, 2011

    Exactly. Ecosystem...I like it. This is what it must feel like to pick up a flashlight at the entrance to the tunnel when all you're used to is clubs and torches. Until you find the switch, it just seems worse at either...then voila!
  • actionjksn - Wednesday, October 12, 2011

    Wow, I hope that made you feel better about the crappy chip also known as "Man With A Shovel".
    I was just hoping AMD would quit forcing Intel to keep crippling their chips just to keep from putting AMD out of business. AMD better fix this abortion quick; this is getting old.
  • GatorLord - Thursday, October 13, 2011

    Feeling fine. Not as good in the short run, but feeling better about the long run. Unfortunately, due to constraints, it takes AMD too long to get stuff dialed in, and by the time they do, Intel has already made an end run and beaten them to the punch.

    Intel can do that, they're 40x as big as AMD. Actually, and this may sound crazy until you digest it, the smartest thing Intel could do is spin off a couple of really good dev labs as competitors. Relying on AMD to drive your competition is risky in that AMD may not be able to innovate fast enough to push Intel where it could be if they had more and better sharks in the water nipping at their tails.

    You really need eight or more highly capable, highly aggressive competitors to create a fully functioning market free of monopolistic and oligopolistic sluggishness and BS hand signalling between them. This space is too capital intensive for that for the time being, with current chip-making technology what it is.
  • yankeeDDL - Wednesday, October 12, 2011

    Just to be the devil's advocate ...
    The launch event in London sported two PCs, side by side, running Cinebench.
    One had the Core i5-2500K, the other the FX-8150.
    Of course, these systems were prepared by AMD, so the results from Anand are clearly more reliable (at least all the conditions are documented).
    Nevertheless, it is clear that in AMD's demo the FX runs faster. Not by a lot, but it is clearly faster than the i5.
    Video: http://www.viddler.com/explore/engadget/videos/335...

    Even so, assuming that this was a valid datapoint, things won't change too much: the i5-2500k is cheaper and (would be) slightly slower than the FX8150 in the most heavily threaded benchmark. But it would be slightly better than Anand's results show.
  • KamikaZeeFu - Wednesday, October 12, 2011

    "Nevertheless, it is clear that in the demo from AMD, the FX runs faster. Not by a lot, but it is clearly faster than the i5."

    Check the review, cinebench r11.5 multithreaded chart.
    Anand's numbers mirror the ones from AMD. Multithreaded workloads are the only case where the 8150 will outperform an i5 2500K, because it can process twice the number of threads.

    Really disappointed in AMD here, but I expected subpar performance because it was eerily quiet about the FX line as far as performance went.

    Desktop BD is a full failure: they were aiming for high clock speeds and made sacrifices, but still missed their objective. By the time their process is mature and 4GHz 'dozers hit the channel, Ivy Bridge will be out.

    As far as server performance goes, not even sure they will succeed there.
    As seen in the review, clock-for-clock performance isn't up compared to the previous generation, and in some cases it's actually slower. Considering that servers run at lower clocks in the first place, I don't see BD being any threat to Intel's server lineup.

    4 years to develop this chip, and their motto seemed to be "we'll do netburst but in not-fail"
  • medi01 - Wednesday, October 12, 2011

    So CPU is a bottleneck in your games eh?
  • TekDemon - Wednesday, October 12, 2011

    It's not, but people don't buy CPUs for today's games; generally you want your system to be future proof, so the more extra headroom there is in these CPU benchmarks, the better it holds up over the long term. Look back at CPU benchmarks from 3-4 years ago and you'll see that the CPUs that barely passed muster back then easily bottleneck you, whereas CPUs that had extra headroom are still usable for gaming. For example, the Core 2 Duo E8400 or E8500 is still a very capable gaming CPU, especially when given a mild overclock, and frankly in games that only use a few threads (like Starcraft 2) it gives Bulldozer a run for its money.
    I'm not a fanboy either way, since I own that E8400 as well as a Phenom II (unlocked to X4, OC'ed to 3.9GHz) and an i5 2500K, but if I was building a new system I sure as heck would want extra headroom for future-proofing.
    That said? Of course these chips will be more than enough power for general use. They're just not going to be good for high-end systems. But in a general use situation the problem is that the power consumption is just crappy compared to the Intel solutions; even if you can argue that it's more than enough power for most people, why would you want to use more electricity?
