The AMD FX (Bulldozer) Scheduling Hotfixes Tested
by Anand Lal Shimpi on January 27, 2012 12:47 PM EST
The basic building block of Bulldozer is the dual-core module, pictured below. AMD wanted better performance than simple SMT (a la Hyper Threading) would allow, but without resorting to the full duplication of resources we get in a traditional dual-core CPU. The result is a duplication of integer execution resources and L1 caches, but a sharing of the front end and FPU. AMD still refers to this module as being dual-core, although it's a departure from the more traditional definition of the word. In the early days of multi-core x86 processors, dual-core designs were simply two single-core processors stuck on the same package. Today we still see simple duplication of identical cores in a single processor, but moving forward it's likely that we'll see more heterogeneous multi-core systems. AMD's Bulldozer architecture may be unusual, but it challenges the conventional definition of a core in a way that we're probably going to face one way or another in the not too distant future.
A four-module, eight-core Bulldozer
The bigger issue with Bulldozer isn't one of core semantics, but rather how threads get scheduled on those cores. Ideally, threads with shared data sets would get scheduled on the same module, while threads that share no data would be scheduled on separate modules. The former allows more efficient use of a module's L2 cache, while the latter guarantees each thread has access to all of a module's resources when there's no tangible benefit to sharing.
This ideal scenario isn't how threads are scheduled on Bulldozer today. Instead of intelligent core/module scheduling based on the memory addresses touched by a thread, Windows 7 currently just schedules threads on Bulldozer in order. Starting from core 0 and going up to core 7 in an eight-core FX-8150, Windows 7 will schedule two threads on the first module, then move to the next module, etc... If the threads happen to be working on the same data, then Windows 7's scheduling approach makes sense. If the threads scheduled are working on different data sets however, Windows 7's current treatment of Bulldozer is suboptimal.
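The default behavior can be modeled in a few lines. The sketch below is a toy illustration, not Microsoft's actual scheduler code, and it assumes a hypothetical core numbering in which cores 2n and 2n+1 share module n:

```python
# Toy model of Windows 7's default (pre-hotfix) scheduling on an
# eight-core FX-8150: threads are simply placed on cores in order,
# so the first two threads land on the same module.
# Assumption: cores 2n and 2n+1 share module n (hypothetical numbering).

def module_of(core: int) -> int:
    """Map a logical core ID to its Bulldozer module."""
    return core // 2

def naive_schedule(num_threads: int) -> list[int]:
    """Place threads on cores 0, 1, 2, ... in order."""
    return list(range(num_threads))

threads = naive_schedule(2)
print([module_of(c) for c in threads])  # [0, 0]: both threads share module 0
```

With independent working sets, those two threads would each be better off with a whole module to themselves.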
AMD and Microsoft have been working on a patch to Windows 7 that improves scheduling behavior on Bulldozer. The result is two hotfixes that should both be installed on Bulldozer systems. Both hotfixes require Windows 7 SP1 and will refuse to install on a pre-SP1 installation.
The first update simply tells Windows 7 to schedule all threads on empty modules first, then on shared cores. The second hotfix increases Windows 7's core parking latency if there are threads that need scheduling. There's a performance penalty you pay to sleep/wake a module, so if there are threads waiting to be scheduled they'll have a better chance to be scheduled on an unused module after this update.
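The first hotfix's policy can be sketched as a core-picking function. This is a simplified illustration of the "empty modules first" idea, not the actual Windows implementation, and it again assumes cores 2n and 2n+1 form module n:

```python
# Sketch of the hotfix policy (illustrative, not Microsoft's code):
# prefer a core on an empty module; fall back to a shared core only
# when every module already has one busy core.
# Assumption: cores 2n and 2n+1 form module n.

def pick_core(busy: set[int], num_cores: int = 8) -> int:
    # First pass: find a core whose module sibling is also idle.
    for core in range(num_cores):
        sibling = core ^ 1            # the other core in the same module
        if core not in busy and sibling not in busy:
            return core
    # Second pass: any idle core, even on a partially busy module.
    for core in range(num_cores):
        if core not in busy:
            return core
    raise RuntimeError("all cores busy")

busy = set()
for _ in range(2):
    busy.add(pick_core(busy))
print(sorted(busy))  # [0, 2]: the two threads land on different modules
```

Compare this with the naive in-order policy, which would have put both threads on cores 0 and 1 of module 0.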
Note that neither hotfix enables the most optimal scheduling on Bulldozer. Rather than being thread aware and scheduling dependent threads on the same module and independent threads across separate modules, the updates simply move to a better default of scheduling on empty modules first. This should improve performance in most cases, but there's a chance that some workloads will see a performance reduction. AMD tells me that it's still working with OS vendors (read: Microsoft) to better optimize for Bulldozer. If I had to guess, I'd say that we may see the next big step forward with Windows 8.
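For contrast, the "ideal" data-aware placement described above could be sketched like this. No shipping scheduler works this way; the function, its name, and the data-set tags are all hypothetical, used only to show what thread-aware module assignment would look like:

```python
# Hypothetical sketch of data-aware placement: threads tagged with the
# same data set share a module (to share its L2 cache); threads with
# different data sets get their own modules. Purely illustrative.

from collections import defaultdict

def data_aware_placement(thread_datasets: dict[str, str]) -> dict[str, int]:
    """Return a thread -> module assignment pairing data-sharing threads."""
    groups = defaultdict(list)
    for thread, dataset in thread_datasets.items():
        groups[dataset].append(thread)
    placement, next_module = {}, 0
    for members in groups.values():
        # Pack up to two data-sharing threads per module; spill over
        # to a fresh module when a group has more than two threads.
        for i, thread in enumerate(members):
            if i > 0 and i % 2 == 0:
                next_module += 1
            placement[thread] = next_module
        next_module += 1
    return placement

p = data_aware_placement({"a": "images", "b": "images", "c": "audio"})
# "a" and "b" share a module; "c" gets a module to itself.
```

Doing this for real would require the scheduler to track which memory each thread touches, which is exactly the hard part the hotfixes sidestep.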
AMD was pretty honest when it described the performance gains FX owners can expect to see from this update. In its own blog post on the topic AMD tells users to expect a 1 - 2% gain on average across most applications. Without any big promises I wasn't expecting the Bulldozer vs. Sandy Bridge standings to change post-update, but I wanted to run some tests just to be sure.
| Motherboard | ASUS P8Z68-V Pro (Intel Z68), ASUS Crosshair V Formula (AMD 990FX) |
| Hard Disk | Intel X25-M SSD (80GB), Crucial RealSSD C300 |
| Memory | 2 x 4GB G.Skill Ripjaws X DDR3-1600 9-9-9-20 |
| Video Card | ATI Radeon HD 5870 (Windows 7) |
| Video Drivers | AMD Catalyst 11.10 Beta (Windows 7) |
| Desktop Resolution | 1920 x 1200 |
| OS | Windows 7 x64 SP1 w/ BD Hotfixes |
wumpus - Friday, January 27, 2012 - link
I'd have to believe that any CPU with SMT enabled will benefit. That is, unless they already have this feature. Of course, Intel has been shipping SMT processors since the P4. I'd like to believe that Microsoft simply flipped whatever switch to treat Bulldozer cores as SMT cores, but I don't have enough faith in Microsoft's scheduling to believe they ever got it right.
hansmuff - Friday, January 27, 2012 - link
At least Windows 7 (haven't tested anything else) schedules threads properly on Sandy Bridge. HT only comes into play once all 4 cores are loaded.
tipoo - Friday, January 27, 2012 - link
Windows already has intelligent behaviour for Hyperthreading. I don't think this will change anything on the Intel side.
silet1911 - Wednesday, February 1, 2012 - link
Yes, a website called Jagatreview has reviewed a 2500 + patch and there is a small performance increase
tk11 - Friday, January 27, 2012 - link
Even if a scheduler did take the time to figure out when threads shared a significant number of recent memory accesses, would that be enough information to determine that the thread would perform optimally on the same module as a related thread rather than an unused module?
Also... Wouldn't running code that performed "intelligent core/module scheduling based on the memory addresses touched by a thread" negatively impact performance far more than any gains realized by scheduling threads on cores that are merely suspected to be more optimally suited to running each particular thread?
eastyy123 - Friday, January 27, 2012 - link
could someone explain the whole module/core thing to me please
i always assumed a core was basically like a whole processor shrunk onto a die, is that basically right?
and how do the amd modules differ?
KonradK - Friday, January 27, 2012 - link
Long story short:
Bulldozer's module consists of 2 integer cores and 1 floating point (FPU) core.
Ammaross - Friday, January 27, 2012 - link
"Bulldozer's module consists of 2 integer cores and 1 floating point (FPU) core."
However, the 1 FPU core can be used as two independent 128-bit floating point units or as a single combined 256-bit floating point unit, so it depends on the floating point data running through it.
KonradK - Friday, January 27, 2012 - link
Not sure what you are supposing.
Precision is the same regardless of whether one or two threads are executed by the FPU core. There are single and double precision FPU instructions, but each thread can use any of them.
However, if you mean single- or double-thread performance:
If two FPU-heavy threads run on the same module, each of them will have half the performance compared to the same two threads running on separate modules, simply because in the first case one FPU is shared by two threads.
And that is the whole point of the hotfixes: avoiding such a situation as long as possible.