Intel Unveils Lunar Lake Architecture: New P and E cores, Xe2-LPG Graphics, New NPU 4 Brings More AI Performance
by Gavin Bonshor on June 3, 2024 11:00 PM EST
Intel Lunar Lake: New P-Core, Enter Lion Cove
Diving straight in, the Performance core, or P-core as it is commonly known, has had major architectural updates to increase both power efficiency and performance.
Key among these improvements is a comprehensive overhaul of Intel's traditional P-core cache hierarchy. The fresh design for Lion Cove uses a multi-tier data cache containing a 48KB L0D cache with 4-cycle load-to-use latency, a 192KB L1D cache with 9-cycle latency, and an extended L2 cache that gets up to 3MB with 17-cycle latency. In total, this puts 240KB of cache within 9 cycles' latency of the CPU cores, whereas Redwood Cove before it could only reach 48KB of cache in the same period of time.
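The 240KB figure follows directly from the capacities and latencies quoted above. As a rough sketch (the `CacheLevel` type and `capacity_within` helper are illustrative names, not anything from Intel, and the simple sum ignores whether the levels duplicate each other's contents), adding up every data cache level that fits a given cycle budget looks like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLevel:
    name: str
    size_kb: int
    latency_cycles: int


# Figures as quoted in the article for Lion Cove.
# The L2 entry uses the "up to 3MB" maximum.
LION_COVE = [
    CacheLevel("L0D", 48, 4),
    CacheLevel("L1D", 192, 9),
    CacheLevel("L2", 3 * 1024, 17),
]


def capacity_within(levels, max_cycles):
    """Sum the size (KB) of every level whose latency is <= max_cycles."""
    return sum(lvl.size_kb for lvl in levels if lvl.latency_cycles <= max_cycles)


if __name__ == "__main__":
    for budget in (4, 9, 17):
        print(f"<= {budget} cycles: {capacity_within(LION_COVE, budget)} KB")
```

With a 9-cycle budget the sum is 48 + 192 = 240 KB, matching the figure in the text.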
The data translation lookaside buffer (DTLB) has also been revised, increasing its depth from 96 to 128 pages to improve its hit rate.
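To put the DTLB change in perspective, the amount of memory the buffer can map at once is simply entries times page size. The sketch below assumes standard 4 KiB x86 pages for every entry; that page size and the `tlb_reach_kb` helper are assumptions made for illustration, and real workloads using larger pages would see a different reach.

```python
PAGE_SIZE_KB = 4  # assumption: every entry maps a standard 4 KiB page


def tlb_reach_kb(entries, page_size_kb=PAGE_SIZE_KB):
    """Memory (KiB) covered when every TLB entry maps one page."""
    return entries * page_size_kb


if __name__ == "__main__":
    old_reach = tlb_reach_kb(96)
    new_reach = tlb_reach_kb(128)
    growth = (new_reach / old_reach - 1) * 100
    print(f"96 entries:  {old_reach} KiB")
    print(f"128 entries: {new_reach} KiB ({growth:.0f}% more)")
```

Under that assumption the reach grows from 384 KiB to 512 KiB, a one-third increase.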
Intel has also added a third Address Generation Unit (AGU)/Store Unit pair to further boost the performance of data write operations. As CPU complexity grows, so does the reliance on the cache subsystem to keep the cores fed, and Intel has thrown more cache at the problem. It has reworked the core-level cache subsystem by adding an intermediate data cache (IDC) between the 48 KB L1 and the L2 level. The original 48 KB L1D cache is now called the L0 D-cache internally, and the new 192 KB intermediate cache takes over the L1 D-cache name.
The latest Lion Cove P-core design also includes a new front-end for handling instructions. The prediction block is 8x larger, fetch is wider, decode bandwidth is higher than on Raptor Cove, and there has been an enormous increase in Uops cache capacity and read bandwidth. The change in Uop queue capacity is designed to enhance the overall performance throughput.
The out-of-order engine in Lion Cove is partitioned into Integer (INT) and Vector (VEC) execution domains, each with independent renaming and scheduling. This type of partitioning allows for expandability in the future, independent growth of each domain, and reduced power consumption for domain-specific workloads. The out-of-order engine is also wider, going from 6- to 8-wide allocation/rename and from 8- to 12-wide retirement, with the deep instruction window growing from 512 to 576 entries and the number of execution ports rising from 12 to 18.
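The relative size of each out-of-order change is easier to compare as a percentage. The snippet below is only arithmetic on the before/after figures quoted above; the dictionary keys and the `percent_growth` helper are labels chosen for this example.

```python
# (before, after) pairs as quoted in the article.
OOO_CHANGES = {
    "allocation/rename width": (6, 8),
    "retirement width": (8, 12),
    "instruction window entries": (512, 576),
    "execution ports": (12, 18),
}


def percent_growth(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    for name, (before, after) in OOO_CHANGES.items():
        print(f"{name}: {before} -> {after} (+{percent_growth(before, after):.1f}%)")
```

Retirement width and port count each grow by half, while the instruction window grows by a more modest 12.5%.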
Lion Cove's integer execution units have also been improved over Raptor Cove, with execution resources grown from 5 to 6 integer ALUs, 2 to 3 jump units, and 2 to 3 shift units. The number of 64x64-to-64-bit multiply units also scales from 1 to 3, giving even more compute power for the harder part of computation. Another significant development is transforming the P-core database from a 'sea of fubs' to a 'sea of cells.' Migrating the sub-organization of the P-core's structure from fubs (functional unit blocks) to more organized cells essentially increases density.
Intel has removed Hyper-Threading (HT) from their Lunar Lake SoC, with one potential reason being to enhance power efficiency and single-thread performance. By eliminating HT, Intel reduces power consumption and simplifies thermal management, which should extend battery life in ultra-thin notebooks. Intel does make a couple of claims regarding the Lion Cove P-cores, which are set to offer approximately 15% better performance-to-power and performance-to-area ratios than cores with HT. Intel's hybrid architecture, which effectively utilizes E-cores for multi-threaded tasks, reduces the need for HT, allowing workloads to be distributed more efficiently by the Intel Thread Director.
Power management has also been refined by including AI self-tuning controllers to replace the static thermal guard bands. This lets the system respond dynamically to real-time operating conditions in an adaptive way to achieve higher sustained performance. Intel also implements Lion Cove P-Core clock speeds at tighter 16.67MHz intervals rather than the traditional 100MHz. This means more accurate power management and finer tuning to squeeze as much from the power budget as possible.
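A quick worked example shows why finer clock steps matter. The sketch assumes the 16.67MHz step is exactly 100MHz / 6 and uses a made-up 4380MHz ceiling for what the power budget would allow at some instant; both are assumptions for illustration, as is the `highest_bin_mhz` helper.

```python
from fractions import Fraction

COARSE_STEP = Fraction(100)    # traditional 100 MHz bins
FINE_STEP = Fraction(100, 6)   # assumption: 16.67 MHz is exactly 100 MHz / 6


def highest_bin_mhz(limit_mhz, step_mhz):
    """Highest multiple of step_mhz that does not exceed limit_mhz."""
    step = Fraction(step_mhz)
    return (Fraction(limit_mhz) // step) * step


if __name__ == "__main__":
    limit = 4380  # hypothetical frequency ceiling in MHz, not an Intel figure
    for label, step in (("100 MHz", COARSE_STEP), ("16.67 MHz", FINE_STEP)):
        freq = highest_bin_mhz(limit, step)
        unused = limit - freq
        print(f"{label} steps: run at {float(freq):.2f} MHz, "
              f"{float(unused):.2f} MHz left unused")
```

In this hypothetical case the coarse steps leave 80MHz on the table, while the finer steps leave about 13MHz.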
Intel's Lion Cove P-Core microarchitecture looks like a nice upgrade over Golden Cove. Lion Cove incorporates improved memory and cache subsystems and better power management, rather than relying solely on faster P-core frequencies to boost performance.
91 Comments
BushLin - Wednesday, June 5, 2024 - link
Seconded, also...
"Gavin Bonshor - Tuesday, May 21, 2024 - link
Hey, thank you for saying that. They are coming as soon as I can get the data updated. I had to fly out to the USA last Monday evening, and the testing wasn't finished in time. I also don't typically work weekends, but I made an exception in this case. I'm catching up, but don't worry, it will be updated ASAP."
TheinsanegamerN - Monday, June 10, 2024 - link
Nope. We never got that Macbook review or the return of the GPU benchmarks.
jaj18 - Tuesday, June 4, 2024 - link
What's the improvement from on package memory🤔?
rgreen1983 - Tuesday, June 4, 2024 - link
Power savings. Trading upgradeability for unplugged battery life because publications put way too much emphasis on it for years trying to make arm seem better than x86.
The Hardcard - Wednesday, June 5, 2024 - link
There’s not way to emphasis on it. Battery life is a far more mainstream issue than upgradability.
rgreen1983 - Wednesday, June 5, 2024 - link
I disagree. Battery life beyond a certain point is silly in a laptop, they aren't phones or tablets, which are much better suited for unplugged use for media consumption. Who the heck is spending 20 hours unplugged browsing the web? And at power performance than they would have if they were plugged in. I have supported thousands of laptops, lots of them macs, and any that do real work are plugged in.
Battery life used to be measured in minutes and was a big deal but now that we are measuring near days it's getting silly.
The Hardcard - Thursday, June 6, 2024 - link
People work plugged in because they have to, not because they want to. as more powerful workload, capable all day and multi day, devices become available, they will be the choices for huge numbers of people who can afford the price.
Once all the players jump in, and there is more competition in price, extended battery life devices that can be worked on will dominate.
TheinsanegamerN - Monday, June 10, 2024 - link
Maybe you want a battery that can still do a 8 hour workday 5 years after you bought it? Battery degradation is a thing you know.
I could throw your question right back at you. Why does anyone need upgradeability on a modern laptop? CPUs last a LONG time, by the time the CPU is no longer fast enough, the whole generation will be unsupported anyway, and the device a relic of the past. Just buy enough memory to do what you need and use the machine.
See how easy that is?
shabby - Tuesday, June 4, 2024 - link
Thanks tsmc for saving Intel's butt, they couldn't do it themselves with 10nm+++++++
Nate_on_HW - Tuesday, June 4, 2024 - link
I found it interesting that they also talked about the INT8 OPS throughput of the GPU and CPU
Would find it interesting to get those numbers on AMDs & Qualcomms chip and maybe plot each module of the SoCs as "TOPS/Watt" (for comparison)
I wonder if the new windows11 "on-device ML-models" would use the whole chip for computing or only the NPU.