CPU ST Performance: Faster & More Efficient

Starting off with this year’s review of the A15, in order to have a deeper look at the CPU single-threaded performance and power efficiency, we’re migrating over to SPEC CPU 2017. While 2006 has served us well over the years and is still important and valid, 2017 is now better understood in terms of its microarchitectural aspects in its components, and becoming more relevant as we moved our desktop side coverage to the new suite some time ago.

One continuing issue with SPEC CPU 2017 is the Fortran subtests; due to a lacking compiler infrastructure both on iOS and Android, we’re skipping these components entirely for mobile devices. What this means also, is that the total aggregate scores presented here are not comparable to the full suite scores on other platforms, denoted by the (C/C++) subscript in the score descriptions.

As always, because we’re running completely custom harnesses and not submitting the scores officially to SPEC, we have to denote the results as “estimates”, although we have high confidence in the accuracy.

In terms of compiler settings, we’re continuing to employ simple -Ofast flags without further changes, to be able to get the best cross-platform comparisons possible. On the iOS side of things, we’re running on the newest XCode 13 build tools, while on Android we’re running the NDKr23 build tools.

In terms of performance and efficiency details, we’re swapping the graphs around a bit from now on – on the left axis we have the performance scores of the tests – larger bars here mean better performance. On the right-side axis, growing from right to left are the energy consumption figures for the platforms, the smaller the figure, the more energy efficient (less energy consumed) a workload was completed. Alongside the energy figure in Joules, we’re also showcasing the average power figure in Watts.

Starting off with the performance figures of the A15, we’re seeing increases across the board, with absolute performance going up from a low of 2.5% to a peak of +37%.

The lowest performance increases were found in 505.mcf_r, a more memory latency sensitive workload; given the increased L2 latency as well as slightly higher DRAM latency, it doesn’t come too unexpected to see a more minor performance increase. However, when looking at the power and efficiency metrics of the same workload, we see the A15 use up almost 900mW less than the A14, with energy efficiency improving by +22%. 520.omnetpp_r saw the biggest individual increase at +37% performance – power here went up a bit, but energy efficiency is also up 24%.

The smallest performance gains of the A15 are found in the most back-end execution bound workloads, 525.x264_r and 538.imagick_r improve by only 8.7%, resulting in an IPC increase of 0.6% - essentially within the realm of measurement noise. Still, even here in this worst performance case, Apple still managed to improve energy efficiency by +13%, as the new chip is using less absolute power even though clock frequencies have gone up.

The most power demanding workload, 519.lbm_r, is extremely bandwidth hungry and stresses the DRAM the most in the suite, with the A15 chip here eating a whopping 6.9W of power. Still, energy efficiency is generationally slightly improved as performance goes up by 17.9% - based on first teardown reports, the A15 is still only powered by LPDDR4X-class memory, so these improvements must be due to the chip’s new memory subsystem and new SLC.

Shifting things over to the efficiency cores, I wanted to make comparisons not only to the A14’s E-cores, but also put the Apple chips in context to the competition, a Snapdragon 888 in this context, comparing against a 2.41GHz Cortex-A78 mid-core, as well as a 1.8GHz Cortex-A55 little core.

The A15’s E-cores are extremely impressive when it comes to performance. The minimum improvement varies from +8.4 in the 531.deepsjeng_r, essentially flat up with clocks, to up to again +46% in 520.omnetpp_r, putting more evidence into some sort of large effective sparse memory access parallelism improvement for the chip.  The core has a median performance improvement of +23%, resulting in a median IPC increase of +11.6%. The cores here don’t showcase the same energy efficiency improvement as the new A15’s performance cores, as energy consumption is mostly flat due to performance increases coming at a cost of power increases, which are still very much low.

Compared to the Snapdragon 888, there’s quite a stark juxtaposition. First of all, Apple’s E-cores, although not quite as powerful as a middle core on Android SoCs, is still quite respectable and does somewhat come close to at least view them in a similar performance class. The comparison against the little Cortex-A55 cores is more absurd though, as the A15’s E-core is 3.5x faster on average, yet only consuming 32% more power, so energy efficiency is 60% better. Even for the middle cores, if we possibly were to down-clock them to match the A15’s E-core’s performance, the energy efficiency is multiple factors off what Apple is achieving.


In the overview graph, I’m also changing things a bit, and moving to bubble charts to better spatially represent the performance to energy efficiency positioning, as well as the performance to power positioning. In the energy axis graphs, which I personally find to be more representative of the comparative efficiency and resulting battery life experiences of the SoCs, we see the various SoCs at their peak CPU performance states versus the total energy consumed to complete the workloads. On the power axis graphs, we see the same data, only plotted against average power. Generally, I find distinction of efficiency here to be quite harder between the various data-points, however some readers have requested this view. The bubble size corresponds to the average power of the various CPUs, we’re measuring system active power, meaning total device workload power minus idle power, to compensate components such as the display.

Apple A15 performance cores are extremely impressive here – usually increases in performance always come with some sort of deficit in efficiency, or at least flat efficiency. Apple here instead has managed to reduce power whilst increasing performance, meaning energy efficiency is improved by 17% on the peak performance states versus the A14. If we had been able to measure both SoCs at the same performance level, this efficiency advantage of the A15 would grow even larger. In our initial coverage of Apple’s announcement, we theorised that the company might possibly invested into energy efficiency rather than performance increases this year, and I’m glad to see that seemingly this is exactly what has happened, explaining some of the more conservative (at least for Apple) performance improvements.

On an adjacent note, with a score of 7.28 in the integer suite, Apple’s A15 P-core is on equal footing with AMD’s Zen3-based Ryzen 5950X with a score of 7.29, and ahead of M1 with a score of 6.66.

The A15’s efficiency cores are also massively impressive – at peak performance, efficiency is flat, but they’re also +28% faster. Again, if we would be able to compare both SoCs at the same performance level, the efficiency advantage of the A15’s E-cores would be very obvious. The much better performance of the E-cores also massively helps avoiding the P-cores, further improving energy efficiency of the SoC.

Compared to the competition, the A15 isn’t +50 faster as Apple claims, but rather +62% faster. While Apple’s larger cores are more power hungry, they’re still a lot more energy efficient. Granted, we are seeing a process node disparity in favour of Apple. The performance and efficiency of the A15 E-cores also put to shame the rest of the pack. The extremely competent performance of the 4 efficiency cores alongside the leading performance of the 2 big cores explain the significantly better multi-threaded performance than the 1+3+4 setups of the competition.

Overall, the new A15 CPUs are substantial improvements, even though that’s not immediately noticeable to some. The efficiency gains are likely key to the new vastly longer battery longevity of the iPhone 13 series phones – more on that in a dedicated piece in a few days, and in our full device review.

The Apple A15 SoC: Focus on Efficiency GPU Performance - Great GPU, So-So Thermals Designs
Comments Locked

204 Comments

View All Comments

  • michael2k - Monday, October 4, 2021 - link

    There's been work to document and improve on out of order vs in order energy efficiency (roughly a 150% energy consumption with a CG-OoO, and 270% with normal OoO) :
    https://dl.acm.org/doi/pdf/10.1145/3151034

    So there really is an energy efficiency benefit to 'in order'; out of order gives you a 73% performance boost but a 270% increase in energy consumption:

    Performance:
    https://zilles.cs.illinois.edu/papers/mcfarlin_asp...

    Power:
    https://stacks.stanford.edu/file/druid:bp863pb8596...

    In other words, if Apple had an in-order design it would use even less power, but as I understand it they have never had an in-order design. Make lemonade out of lemons as it were.
  • eastcoast_pete - Monday, October 4, 2021 - link

    I will take a look at the links you posted, but this is the first time I read about out-of-order execution being referred to as "lemon" vs. in-order. The out-of-order efficiency cores that Apple's SoC have had for a while now are generally a lot better on perf/W than in-order designs like the A55. And yes, a (much slower) in-order small core might consume less energy in absolute terms, but the performance per Watt is still significantly worse. Lastly, why would ARM design its big performance cores (A76/77/78, X2) as out-of-order, if in-order would make them so much more efficient?
  • jospoortvliet - Tuesday, October 5, 2021 - link

    Because they would be slower in absolute terms. In theory, all other things being equal, an in-order core should be more efficient than an out of order core. In practice, not everything is ever equal so just because apples small cores are so extremely efficient doesn't mean the theory is wrong.
  • michael2k - Wednesday, October 6, 2021 - link

    ARM can't hit the same performance using in-order, so if you need the performance you need to use an out-of-order design. In theory you could clock the in-order design faster, but the bottleneck isn't CPU performance but stalls when waiting on memory; with the out-of-order design the CPU can start working on a different chunk of instructions while still waiting on memory for the first chunk.

    It's essentially like having a left turn, forward, and right turn lane available, so that drivers can join different queues, vs all drivers forced to use a single lane for left, forward, and right. If the car turning left cannot move because of oncoming traffic, cars moving forward or right are blocked.

    As for your question regarding perf/W, you can see four different A55s all have the same perf but different W:
    https://www.anandtech.com/show/16983/the-apple-a15...

    This tells us that there is more to CPU energy use than the CPU, if you also have to include memory, memory controllers, storage controllers, and storage into the equation since the CPU needs to access/power all those things just to operate. The D1200 has better p/W across all it's cores despite being otherwise similar to the E2100 at the low and mid end (both have A55 and A78 but the E2100 uses far more power for those two cores)
  • Ppietra - Tuesday, October 5, 2021 - link

    the thing is Apple’s efficiency cores are clearly far more efficient than ARM’s efficiency cores in most workloads...
    Just because someone is not able to make an OoO design more efficient is not proof that any OoO will inevitably be less efficient
  • Andrei Frumusanu - Tuesday, October 5, 2021 - link

    Discussions and papers like these about energy just on a core basis are basically irrelevant because you don't have a core in isolation, you have it within a SoC with power delivery and DRAM. The efficiency of the core here can be mostly overshadowed by the rest of the system; the Apple cores results here are extremely efficient not only because of the cores but because the whole system is incredibly efficient.

    Look at how Samsung massacres their A55 results, that's not because their implementation of the A55 is bad, it's because everything surrounding it is bad, and they're driving memory controllers and what not uselessly at higher power even though the active cores can't make use of any of it. It creates a gigantic overhead that massively overshadows the A55 cores.
  • name99 - Thursday, October 7, 2021 - link

    You are correct but as always the devil is in the details.
    (a) What EXACTLY is the goal? In-order works if the goal is minimal energy at some MINIMAL level of performance. But if you require performance above a certain level, in-order simple can't deliver (or, more precisely, the contortions required plus the frequency needed make this a silly exercise).
    For microcontroller levels of performance, in-order is fine. You can boost it to two-wide and still do well; you can augment it with some degree of speculation and still do OK, but that's about it. ARM imagined small cores as doing a certain minimal level of work; Apple imagined them doing a lot more, and Apple appears to (once again) have skated closer to where the puck was heading, not where it was.

    (b) Now microcontrollers are not nothing. Apple has their own controller core, Chinook, that's used all over the place (their are dozens of them on an M1) controlling things like GPU, NPU, ISP, ... Chinook is AArch64, at least v8.3 (maybe updated). It may be an in-order core, two or even one-wide, no-one knows much about [what we do know is mainly what we can get from looking at the binary blobs of the firmware that runs on it].

    Would it make sense for Apple to have a programmer visible core that was between Chinook and Blizzard? For example give Apple Watch two small cores (like today) and two "tiny" cores to handle everything but the UI? Maybe? Maybe not if the core power is simply not very much compared to everything else (always-on stuff, radios, display, sensors, ...)?
    Or maybe give something like Airpods a tiny core?

    (c) Take numbers for energy and performance for IO vs OoO with a massive grain of salt. There have been various traditional ways of doing things for years; but Apple has up-ended so much of that tradition, figuring out ways to get the performance value of OoO without paying nearly as much of the energy cost as people assumed.
    It's not that Apple invented all this stuff from scratch, more that there was a pool of, say, 200 good ideas out there, but every paper assumed "baseline+my one good idea", only Apple saw the tremendous *synergy* available in combining them all in a single design.
    We can see this payoff in the way that Apple's small cores get so much performance at such low energy cost compared to ARM's in-order orthodoxy cores. Apple just isn't paying much of an energy price for all its smarts.
    And yes, the cost of all this is more transistors and more area. To which the only logical response is "so what? area and transistors are cheap!"
  • Nicon0s - Tuesday, October 5, 2021 - link

    >I don't see how they can get even close to the efficiency cores in Apple's SoC.

    Very simple actually. By optimized/modifying a Cortex A76.
    The latest A510 is very very small and this limit's the potential of such a core.
  • techconc - Monday, October 18, 2021 - link

    >Very simple actually. By optimized/modifying a Cortex A76.

    LOL... That would take a hell of a lot of "optimization" and a bit of magic maybe. The A15 efficiency cores match the A76 in performance at about 1/4 the power.
  • Raqia - Monday, October 4, 2021 - link

    Interesting that the extra core on the 13 Pro GPU doesn't seem to do add much performance much over the 13's 4 cores even when unthrottled, certainly not 20%. Perhaps the bottleneck has to do with memory bandwidth.

Log in

Don't have an account? Sign up now