Intel Unveils Lunar Lake Architecture: New P and E cores, Xe2-LPG Graphics, New NPU 4 Brings More AI Performance
by Gavin Bonshor on June 3, 2024 11:00 PM EST
Intel Lunar Lake: New P-Core, Enter Lion Cove
Diving straight into the Performance core, or P-core as it is commonly known, Lion Cove has had major architectural updates to increase both power efficiency and performance.
Key among these improvements is a significant overhaul of Intel's traditional P-core cache hierarchy. The fresh design for Lion Cove uses a multi-tier data cache containing a 48KB L0D cache with 4-cycle load-to-use latency, a 192KB L1D cache with 9-cycle latency, and an extended L2 cache that gets up to 3MB with 17-cycle latency. In total, this puts 240KB of cache within 9 cycles' latency of the CPU cores, whereas Redwood Cove before it could only reach 48KB of cache in the same period of time.
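The 240KB figure is simple addition over the cache levels that respond within the latency budget. A minimal sketch of that arithmetic follows, using only the sizes and latencies quoted above; the names `LION_COVE_LEVELS` and `capacity_within` are illustrative, and the model assumes capacities can simply be summed, as the article does, without accounting for inclusive or exclusive cache policies.

```python
# Lion Cove data-cache levels as quoted in the article:
# (name, size in KB, load-to-use latency in cycles).
LION_COVE_LEVELS = [
    ("L0D", 48, 4),
    ("L1D", 192, 9),
    ("L2", 3 * 1024, 17),  # "up to 3MB"
]


def capacity_within(levels, max_cycles):
    """Sum the capacity (KB) of every level whose latency is at or below max_cycles."""
    return sum(size_kb for _, size_kb, latency in levels if latency <= max_cycles)


# 48 KB (L0D) + 192 KB (L1D) fall inside a 9-cycle budget.
print(capacity_within(LION_COVE_LEVELS, 9))  # 240
```

Lowering the budget to 4 cycles leaves only the 48KB L0D in range, which is the same amount of cache the article credits to Redwood Cove within 9 cycles.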
The data translation lookaside buffer (DTLB) has also been revised, increasing its depth from 96 to 128 pages to improve its hit rate.
Intel has also added a third Address Generation Unit (AGU)/Store Unit pair to further boost the performance of data write operations. As CPU complexity grows, so does the reliance on the cache subsystem to keep the core fed, and Intel has thrown more cache at the problem by reworking the core-level cache subsystem, adding an intermediate data cache (IDC) between the original 48 KB L1 and the L2. The original L1D cache is now called the L0 D-cache internally, and it is backed by the new 192 KB L1 D-cache.
The latest Lion Cove P-core design also includes a new front-end for handling instructions. The prediction block is 8x larger, fetch is wider, decode bandwidth is higher than on Raptor Cove, and there has been an enormous increase in uop cache capacity and read bandwidth. The uop queue capacity has also been revised, with the aim of improving overall throughput.
The out-of-order engine in Lion Cove is partitioned into separate Integer (INT) and Vector (VEC) execution domains, each with independent renaming and scheduling. This type of partitioning allows for future expandability, independent growth of each domain, and reduced power consumption for domain-specific workloads. The out-of-order engine has also been widened, going from 6 to 8-wide allocation/rename and from 8 to 12-wide retirement, with the deep instruction window increased from 512 to 576 entries and the number of execution ports rising from 12 to 18.
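To put those out-of-order engine changes on a common scale, the before and after figures can be converted into relative increases. This is only a sketch of the arithmetic on the numbers quoted above; the dictionary and function names are illustrative, and wider structures do not translate one-to-one into performance.

```python
# Before/after figures for the out-of-order engine, as quoted in the article.
OOO_CHANGES = {
    "allocation/rename width": (6, 8),
    "retirement width": (8, 12),
    "instruction window entries": (512, 576),
    "execution ports": (12, 18),
}


def percent_increase(before, after):
    """Relative growth from before to after, expressed as a percentage."""
    return (after - before) / before * 100.0


for name, (before, after) in OOO_CHANGES.items():
    print(f"{name}: {before} -> {after} (+{percent_increase(before, after):.1f}%)")
```

On these figures, retirement width and execution ports each grow by 50%, while the instruction window grows by a more modest 12.5%.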
Lion Cove's integer execution units have also been improved over Raptor Cove, with execution resources grown from 5 to 6 integer ALUs, 2 to 3 jump units, and 2 to 3 shift units. The 64x64-to-64-bit multiply units also scale from 1 to 3, giving even more compute power for the heavier parts of integer computation. Another significant development is the transformation of the P-core design database from a 'sea of fubs' (functional unit blocks) to a 'sea of cells.' Migrating the sub-organization of the P-core's structure from fubs to more organized cells essentially increases density.
Intel has removed Hyper-Threading (HT) from its Lunar Lake SoC, with one potential reason being to enhance power efficiency and single-thread performance. By eliminating HT, Intel reduces power consumption and simplifies thermal management, which should extend battery life in ultra-thin notebooks. Intel does make a couple of claims regarding the Lion Cove P-cores: without HT, they are set to offer approximately 15% better performance-to-power and performance-to-area ratios than cores with HT. Intel's hybrid architecture, which effectively utilizes E-cores for multi-threaded tasks, reduces the need for HT, allowing workloads to be distributed more efficiently by the Intel Thread Director.
Power management has also been refined by including AI self-tuning controllers to replace the static thermal guard bands. This lets the system respond dynamically to real-time operating conditions in an adaptive way to achieve higher sustained performance. Intel also implements Lion Cove P-Core clock speeds at tighter 16.67MHz intervals rather than the traditional 100MHz. This means more accurate power management and finer tuning to squeeze as much from the power budget as possible.
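The benefit of finer clock intervals can be illustrated by snapping a frequency ceiling down to the nearest available step. The sketch below is not Intel's algorithm; it assumes the quoted 16.67MHz step is exactly 100 MHz / 6, and the 4380 MHz ceiling is a hypothetical value chosen only for illustration.

```python
from fractions import Fraction

COARSE_STEP_MHZ = Fraction(100)   # traditional 100 MHz interval
FINE_STEP_MHZ = Fraction(100, 6)  # assumption: "16.67MHz" is 100 MHz / 6


def highest_bin(limit_mhz, step_mhz):
    """Highest multiple of step_mhz that does not exceed limit_mhz."""
    steps = Fraction(limit_mhz) // step_mhz  # whole number of steps that fit
    return steps * step_mhz


limit = 4380  # hypothetical ceiling (MHz) allowed by the power budget
coarse = highest_bin(limit, COARSE_STEP_MHZ)
fine = highest_bin(limit, FINE_STEP_MHZ)

# With 100 MHz steps the core settles at 4300 MHz; with the finer steps it
# reaches about 4366.67 MHz, leaving less of the budget unused.
print(float(coarse), round(float(fine), 2))  # 4300.0 4366.67
```

Exact fractions are used instead of floats so that multiples of 100/6 land cleanly on values such as 4400 MHz without rounding drift.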
Intel's Lion Cove P-Core microarchitecture looks like a nice upgrade over Golden Cove. Lion Cove incorporates improved memory and cache subsystems and better power management, rather than relying solely on faster P-core frequencies to boost performance.
91 Comments
Silver5urfer - Tuesday, June 4, 2024
Disaster for Intel. Finally they folded. Intel fabs are now not even used for their high volume BGA junk processors. Instead using TSMC.
Second thing is as everyone pointed out they are comparing LP-E to E cores lol to inflate the graphs. Also the IPC is meager at best, Raptor Cove is faster than Meteor one and they are using that figure.
ARL will lack HT on top of this reduced clockrate, interesting times ahead for Desktop battle.
Drumsticks - Tuesday, June 4, 2024
They aren’t comparing LP E-Cores to E-Cores. LNL E-cores are separated from the LLC, same as MTL island cores. It’s an apt comparison.
On the flip side, the comparison to Raptor cove is with E-cores connected to the LLC and ring bus, just as Raptor cove would be. It’s also an apt comparison. You’ll see island E-cores only on LNL (because of the power advantages) and ring bus connected E-cores on Arrow Lake (because of the performance advantages).
Kangal - Wednesday, June 5, 2024
I don't know, but I am pretty underwhelmed.
Intel is the least trusted tech giant, even Nvidia look better when it comes to honesty.
Here it seems like Intel took two steps forward, and three steps back. They are probably at a loss in either pricing, efficiency, or performance. Or more likely all three. That's why they use smoke and mirrors and try to trick the viewers/shareholders with the technicalities.
It's not like AMD didn't do the same, but they stand behind their technology, and actually showcased real products. And they also gave benchmarks. That's how you know they are confident.
It seems the CPU and GPU space is going to be a bloodbath for Intel. And we need all the competition we can get. But it is a little amusing to see Intel squirm. Ironically Intel is going the way of Bulldozer (shared cores) whilst AMD is sticking with Hyperthreading (extra bits per core) design. It's only amusing because Intel did unethical and illegal business practices that led to AMD's bankruptcy more than a decade ago. Microsoft is also complicit in that.
Terry_Craig - Wednesday, June 5, 2024
Sounds like an intel employee. People care about performance, not excuses, the problem with the comparison is that the LP-E cores are much inferior to the already deficient E-Cores.
https://chipsandcheese.com/2024/05/20/comparing-cr...
Drumsticks - Tuesday, June 11, 2024
Not sure if this was a reply to me because of page breaks, but if it was, what about what I said is untrue or biased?
From the (excellent, by the way) Chips article: "I wonder if Intel could give low power Crestmont a larger L2 cache, or even drop some blocks on Meteor Lake’s SoC tile to make room for a system level cache." - this is exactly what was done in Lunar Lake. The LNL E-Cores don't access the same L3 as the P-Cores, but there's an 8MB System level cache that they can access (that the rest of the chip also can I think, P-Cores, GPU, and NPU included). That probably is a big part of the giant 40-70% performance gain they show.
And E-Cores connected to the ring bus ARE much better, by Intel's own admission and by, again, the Chips article. Skymont E-Cores coming to ARL are (presumably) on the ring bus, and should punch much better than LNL E-Cores because of it.
None of this means that Intel's design is the best, or that it's not going to fall flat. That devil is still in the details, which Intel still needs to give to us. But I'm not sure how we can argue that the explicit details of the implementation are somehow biased or an excuse. That IS how Intel designed the chip; whether or not it is a good design remains to be seen. IMO, it seems like a pretty decent concept, but we'll have to see how much power the new P-Cores are really saving. With a 4P+4e design, they will need to be pretty efficient to match what Zen 5 will be up to, even in low power setups. (I assume 15W and above will get an arrow lake design that has more p cores and/or E cores on the ring bus).
Drumsticks - Tuesday, June 11, 2024
One other thought - based on the Chips and Cheese article, LP E-Cores seem to be anywhere from 10-30% slower without access to an L3 cache. That Intel is calling out a 40-70% gain in Skymont LPE core performance over Crestmont LP-E is pretty noteworthy if nothing else. Even at their 10% (which is nuts) margin of error, the LPE core Skymont cores (albeit at least with access to a system cache) are as fast as Crestmont cores with a full blown 24MB L3 cache.
Again, benchmarks are king, but assuming Skymont LP-E is bad because Crestmont LP-E was bad seems like a poor assumption given the underlying conditions are completely different.
GeoffreyA - Tuesday, June 4, 2024
On the P side, most interesting is Lion Cove's moving to a split-scheduler design, saying good-bye to their classic unified approach there since the P6. AMD, always thinking ahead, has been using the split scheduler since the Athlon.
Blastdoor - Tuesday, June 4, 2024
This really looks like a SOC made for a MacBook Air.
lmcd - Wednesday, June 12, 2024
Or intended to beat out Snapdragon Elite if its date didn't slip.
NextGen_Gamer - Tuesday, June 4, 2024
With confirmation that the entire compute tile is made on TSMC's N3B process, I guess we can take that to mean Intel was not super confident in mass yields on its own 20A process. Intel's 20A will be used in Arrow Lake, the desktop equivalent to Lunar Lake. Desktop shipments are a small fraction of laptop chips nowadays, so that makes sense. This does create a really interesting opportunity that I hope Anandtech will explore, where you could take a desktop Arrow Lake processor, disable enough P-cores and E-cores to make it equal to Lunar Lake, and see how they compare. Same architectures, but one on TSMC N3B versus Intel 20A.