An Impatient Prescott: Scheduler Improvements

Prescott can’t keep any more operations in flight (active in the pipeline) than Northwood, but because of its longer pipeline, Prescott must work even harder to keep that pipeline filled.

We just finished discussing branch predictors and their importance in determining how deep a pipeline you can have, but another contributor to the equation is a CPU’s scheduling windows.

Let’s say you’ve got a program that is 3 operations long:

1. D = B + 1
2. A = 3 + D
3. C = A + B

You’ll notice that the 2nd operation can’t execute until the first one is complete, as it depends on the outcome (D) of the first operation. The same is true for the 3rd operation: it can’t execute until it knows the value of A. Now let’s say our CPU has 3 ALUs and could, in theory, execute three adds simultaneously. If we just had this stream of operations going through the pipeline, we would only be using 1/3 of our total execution power – not the best situation. If we just upgraded from a CPU with 1 ALU, we would be getting the same throughput as our older CPU – and no one wants to hear that.
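To make the dependency chain concrete, here is a minimal C sketch (the variable names and the second, independent group of adds are invented for illustration, not taken from the article). The first three adds form the serial chain above; the last three don’t depend on each other and could, in principle, keep all three ALUs busy in the same cycle:

    #include <stdio.h>

    int main(void) {
        int b = 2, e = 5, f = 7;

        /* Dependent chain: each add needs the previous result,
           so only one ALU can be doing useful work at a time.  */
        int d = b + 1;   /* D = B + 1            */
        int a = 3 + d;   /* A = 3 + D (needs D)  */
        int c = a + b;   /* C = A + B (needs A)  */

        /* Independent adds: none of these needs another's result,
           so an out-of-order core with 3 ALUs could issue all
           three simultaneously.                                 */
        int x = b + 1;
        int y = e + 3;
        int z = f + 5;

        printf("%d %d %d %d %d %d\n", a, c, d, x, y, z);
        return 0;
    }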

Luckily, no program is only 3 operations long (even print "Hello World" is on the order of 100 operations), so our 3 ALUs should be able to stay busy, right? There is a unit in all modern-day CPUs whose job is to keep execution units, like ALUs, as busy as possible, as much of the time as possible. This is the job of the scheduler.

The scheduler looks at a number of operations being sent to the CPU’s execution cores and attempts to extract the maximum amount of parallelism possible from the operations. It does so by placing pending operations as soon as they make it to the scheduling stage(s) of the pipeline into a buffer or scheduling window. The size of the window determines the amount of parallelism that can be extracted, for example if our CPU’s scheduling window were only 3 operations large then using the above code example we would still only use 1/3 of our ALUs. If we could look at more operations, we could potentially find code that didn’t depend on the values of A, B or D and execute that in parallel while we’re waiting for other operations to complete. Make sense?
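As a rough sketch of what a scheduling window does (this is a toy model for illustration only; the window size, the one-cycle latency and the register numbering are assumptions, not Prescott’s actual design), the loop below scans the ops visible in the window each cycle and issues any whose inputs are ready, up to three per cycle:

    #include <stdbool.h>
    #include <stdio.h>

    #define WINDOW  8   /* ops visible to the scheduler at once (assumed) */
    #define NUM_ALU 3   /* adds that can issue per cycle                  */

    typedef struct {
        int  dst, src1, src2;   /* register numbers; -1 means "no source" */
        bool done;
    } Op;

    int main(void) {
        bool reg_ready[8] = { [1] = true };    /* r1 (our "B") starts ready */
        Op ops[] = {
            { 3, 1, -1, false },   /* D = B + 1       (r3 <- r1)      */
            { 0, 3, -1, false },   /* A = 3 + D       (r0 <- r3)      */
            { 2, 0,  1, false },   /* C = A + B       (r2 <- r0, r1)  */
            { 4, 1, -1, false },   /* unrelated add   (r4 <- r1)      */
            { 5, 1, -1, false },   /* unrelated add   (r5 <- r1)      */
        };
        int n = (int)(sizeof ops / sizeof ops[0]), remaining = n;

        for (int cycle = 0; remaining > 0; cycle++) {
            int wake[NUM_ALU], n_wake = 0;
            /* Scan the window for ops whose source registers are ready. */
            for (int i = 0; i < n && i < WINDOW && n_wake < NUM_ALU; i++) {
                Op *o = &ops[i];
                bool ready = !o->done &&
                             (o->src1 < 0 || reg_ready[o->src1]) &&
                             (o->src2 < 0 || reg_ready[o->src2]);
                if (ready) {
                    printf("cycle %d: issue op %d\n", cycle, i);
                    o->done = true;            /* assume 1-cycle latency */
                    wake[n_wake++] = o->dst;
                    remaining--;
                }
            }
            /* Results become visible to dependent ops on the next cycle. */
            for (int k = 0; k < n_wake; k++)
                reg_ready[wake[k]] = true;
        }
        return 0;
    }

In this toy run, the dependent chain still takes three cycles, but the two unrelated adds issue alongside it and fill ALUs that would otherwise sit idle.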

Because Intel increased the depth of Prescott’s pipeline by such a large amount, the scheduling windows had to be increased a bit as well. Unfortunately, present microarchitecture design techniques do not allow very large scheduling windows on high clock speed CPUs, so the improvements here were minimal.

Intel increased the size of the scheduling windows used to buffer operations going to the FP units to coincide with the increase in pipeline depth.

There is also parallelism that can be extracted out of load and store operations (getting data out of and into memory). Let’s say that you have the following:

A = 1 + 3
Store A at memory location X


Load A from memory location X

The store actually happens as two operations (a further bit of pipelining, splitting one store in two): a store address operation (where the data is going) and a store data operation (what the data actually is). The problem here is that the scheduler may try to parallelize the store operations and the load operation without realizing that the two are dependent on one another. Once this is discovered, the load will not execute, and a penalty is paid because the CPU’s scheduler just wasted time getting a load ready to execute only to have to throw it away. The load will eventually execute after the store operations have completed, but at a significant performance cost.
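In code, the hazard described above looks something like the following minimal C sketch (the names X and store_then_load are invented for illustration; in hardware this plays out at the level of individual load/store micro-operations rather than C statements):

    int X;   /* stands in for "memory location X" */

    int store_then_load(void) {
        int a = 1 + 3;
        X = a;        /* the store: internally a store-address op (&X)
                         and a store-data op (the value of a)          */
        /* ... other, unrelated work could be scheduled here ...       */
        return X;     /* the load: it needs the store's data; if the
                         scheduler issues it before that dependence is
                         discovered, the load must be thrown away and
                         re-executed after the store completes         */
    }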

If a situation like the one mentioned above does crop up, long pipeline designs suffer greatly – meaning that Prescott can afford it even less than Northwood. In Prescott, Intel included a small but very accurate predictor to guess whether a load is likely to require data from a soon-to-be-executed store, and to hold that load until the store has executed. Although the predictor isn’t perfect, it will reduce bubbles of no-execution in the pipeline – a killer for Prescott and all long-pipelined architectures.
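The concept behind such a predictor can be sketched as a small table of saturating counters indexed by a hash of the load’s instruction address; the table size, indexing and thresholds below are illustrative assumptions, not Intel’s actual mechanism. A load that has recently needed data from an older store is predicted to do so again and is held back until the store executes:

    #include <stdbool.h>
    #include <stdint.h>

    #define PRED_ENTRIES 64
    static uint8_t collide_ctr[PRED_ENTRIES];   /* 2-bit saturating counters */

    static unsigned pred_index(uintptr_t load_pc) {
        return (unsigned)(load_pc >> 2) % PRED_ENTRIES;
    }

    /* Consulted at schedule time: should this load wait for older stores? */
    bool predict_load_waits(uintptr_t load_pc) {
        return collide_ctr[pred_index(load_pc)] >= 2;
    }

    /* Trained after the load resolves: did it actually need a store's data? */
    void train_predictor(uintptr_t load_pc, bool collided) {
        uint8_t *c = &collide_ctr[pred_index(load_pc)];
        if (collided) { if (*c < 3) (*c)++; }
        else          { if (*c > 0) (*c)--; }
    }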

Don’t look at these enhancements as ways to improve performance, but as ways to help balance the lengthened pipeline. A lot of the improvements we’ll talk about may sound wonderful, but keep in mind that, at this point, Prescott needs these technologies just to equal the performance of Northwood – so don’t get too excited. It’s an uphill battle that must be fought.

104 Comments

  • mattsaccount - Sunday, February 1, 2004 - link

    From the HardOCP review: "Certainly moving to watercooling helped us out a great deal. In fact it is hard for us to recommend buying a Prescott and cooling it any other way."
  • eBauer - Sunday, February 1, 2004 - link

    I am curious as to why the UT2k3 botmatch scores dropped on all CPU's... Different map?
  • Pumpkinierre - Sunday, February 1, 2004 - link

    Sorry, errata on #20: it was the 3.0 Northwood result that is out of kilter with the other CPUs in the SYSmark 2004 data analysis chart.
  • Pumpkinierre - Sunday, February 1, 2004 - link

    JFK, Vietnam, Nixon, Monica, Bush/Gore, Iraq and now this! What is going on with the leader of the free world? I hope it overclocks well – that's all it has going for it. Maybe Intel should rethink their multiplier-locked policy. AMD must get in there and profit. I still don't understand why the caches are running at half the latency of Northwood's if they are the same speed and structure? Is it a result of a doubling in size for the same associativity?

    Good article – needs re-reading after digestion. The last chart in SYSmark 2004 (data analysis) has the 3.0 Prescott totally outperformed by the 2.8 Prescott and all other CPUs. Looks like a benchmark/typing glitch.
  • yak8998 - Sunday, February 1, 2004 - link

    first the error:
    pg 9 -
    The LDDQU instruction is one Intel is particularly proud of as it helps accelerate video encoding and it is implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel’s developer documentation here.

    No link?

    ===
    "What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended?? "

    I'm going to guess a clean-running ~350W or so should suffice for a regular system, but I'm not positive with these monster gfx cards out right now...

    "Any of you know what the cache size on the EE's will be?"

    If you're talking about the Northwood (the P4 Cs are still considered Northwoods, no?), it's 1MB, I believe.
    (Still finishing the article. Man, I love these in-depth technical articles.)
  • Tiorapatea - Sunday, February 1, 2004 - link

    I agree, some info on power consumption please.

    Thanks for the article, by the way.

    I guess we'll have to wait and see how Prescott ramps in speed versus 90nm A64.
  • AgaBooga - Sunday, February 1, 2004 - link

    Much better than the P4's original launch...

    All I want to know now is what AMD is going to do soon... They'll probably counteract Prescott with high clock speeds but when and by how much is what matters.

    Any of you know what the cache size on the EE's will be?

    Also, the final CPUs based on Northwood are kind of like a car and its power curve (or whatever it's called): after a certain point, revving any higher doesn't give you as much of an increase in speed as the same rpm increase would lower down the range.
  • Cygni - Sunday, February 1, 2004 - link

    AMD's roadmap shows a 4000+ Athlon 64 by the end of the year... which is the same as Intel's. They are aware, I'm sure.
  • Stlr22 - Sunday, February 1, 2004 - link

    What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended??
  • HammerFan - Sunday, February 1, 2004 - link

    Things are gonna get hairy in '04 and '05!!! My take is that AMD needs to get their marketing up to spec or the high-clocked Prescotts are gonna run the show.

    I have a question for Derek and Anand: What kind of temps does the Prescott run at? What type of cooler does it have? (There's nothing there to support or refute claims that the Prescott is one hot potato.)
