Cortex A76 µarch - Frontend

Starting off with a rough overview of the Cortex A76 microarchitectural diagram, we see the larger functional blocks. The A76 doesn’t look too different from other Arm processors in this regard, and the differences come only in the details that Arm is willing to divulge. To oversimplify it: this is a superscalar out-of-order core with a 4-wide decode front-end and 8 execution ports in the back-end, with a total implementation pipeline depth of 13 stages and the execution latencies of an 11-stage core.

In the front-end, Arm has created a new predict/fetch unit employing what it calls “predict-directed fetch”, meaning the branch prediction unit feeds into the instruction fetch unit. This is a divergence from past Arm µarches, and it allows for both higher performance and lower power consumption.

The branch prediction unit is what Arm calls an industry first in adopting a hybrid indirect predictor. The predictor is decoupled from the fetch unit, and its supporting large structures operate separately from the rest of the machine – likely what this means is that it will be easier to clock-gate during operation to save on power. The branch predictor is supported by a three-level branch target cache hierarchy: a 16-entry nanoBTB, a 64-entry microBTB, and a 6000-entry main BTB. Arm claimed that the branch predictors of the A73 and A75 generations were already able to predict nearly all taken branches, so this new unit in the A76 seems to be one level above that in capability.
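To make the hierarchy concrete, here is a minimal Python sketch of how a multi-level BTB lookup could work. The capacities are the disclosed figures, but the per-level latencies, the sequential lookup policy, and the FIFO eviction are illustrative assumptions, not details of the A76's actual design.

```python
# Illustrative model of a 3-level branch target buffer hierarchy.
# Capacities match Arm's disclosure; latencies and policies are assumed.

class BTBLevel:
    def __init__(self, name, capacity, latency_cycles):
        self.name = name
        self.capacity = capacity
        self.latency = latency_cycles
        self.table = {}                 # branch PC -> predicted target

    def lookup(self, pc):
        return self.table.get(pc)

    def insert(self, pc, target):
        if len(self.table) >= self.capacity:
            self.table.pop(next(iter(self.table)))  # naive FIFO eviction
        self.table[pc] = target

hierarchy = [
    BTBLevel("nanoBTB", 16, 1),         # smallest, fastest redirect
    BTBLevel("microBTB", 64, 2),
    BTBLevel("mainBTB", 6000, 3),       # largest, slowest redirect
]

def predict_target(pc):
    """Check each level in turn; a hit in a smaller level redirects sooner."""
    for level in hierarchy:
        target = level.lookup(pc)
        if target is not None:
            return target, level.latency
    return None, None                   # miss everywhere: no early redirect

hierarchy[2].insert(0x4000, 0x5000)
print(predict_target(0x4000))           # -> (20480, 3): hit in the main BTB
```

The design rationale this models is that the common case (a small set of hot branches) is served by the tiny, fast levels, while the 6000-entry structure only has to catch the long tail.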

The branch unit operates at double the bandwidth of the fetch unit – it operates on 32B/cycle, meaning up to eight 32-bit instructions per cycle. This feeds into a fetch queue, in front of the instruction fetch unit, consisting of 12 “blocks”. The fetch unit operates at 16B/cycle, meaning four 32-bit instructions. The branch unit operating at double the throughput makes it possible for it to get ahead of the fetch unit. The benefit is that, in the case of a mispredict, it can hide branch bubbles in the pipeline and avoid stalling the fetch unit and the rest of the core. The core is said to be able to cope with up to 8 misses on the I-side.
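A toy model can show why the 2:1 bandwidth ratio matters. In the sketch below the predictor fills the queue at 8 instructions/cycle while fetch drains it at 4; the 4-instruction block granularity is an assumption, as Arm didn't specify what a "block" holds. Once the run-ahead queue has built up occupancy, a short predictor stall no longer starves the fetch unit.

```python
# Toy model of the decoupled predict/fetch queue. Rates are from the
# article; the 4-instruction block size is an assumption.

QUEUE_BLOCKS = 12
BLOCK_INSTRS = 4                        # assumed instructions per "block"
CAPACITY = QUEUE_BLOCKS * BLOCK_INSTRS  # 48 instructions

PREDICT_RATE = 8                        # instructions/cycle from branch unit
FETCH_RATE = 4                          # instructions/cycle into decode

def simulate(cycles, predictor_stalls):
    """Predictor idles during `predictor_stalls` (e.g. resolving a redirect)
    while the fetch unit keeps draining whatever the queue holds."""
    occupancy = 0
    delivered = []
    for cycle in range(cycles):
        if cycle not in predictor_stalls:
            occupancy = min(CAPACITY, occupancy + PREDICT_RATE)
        drained = min(FETCH_RATE, occupancy)
        occupancy -= drained
        delivered.append(drained)
    return delivered

# A 2-cycle predictor stall at cycles 6-7 is fully hidden by the queue:
print(simulate(10, predictor_stalls={6, 7}))   # -> [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
```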

I mentioned at the beginning that the A76 is a 13-stage implementation with the latency of an 11-stage core. What happens is that in latency-critical paths the stages can be overlapped. One such overlap happens between the second cycle of the branch predict path and the first cycle of the fetch path. So effectively, while there are four (2+2) pipeline stages across branch predict and fetch, the core sees latencies of down to 3 cycles – and presumably it is two such overlapped boundaries that take the 13 implemented stages down to 11 stages of effective latency.

On the decode and rename stages we see a throughput of 4 instructions per cycle. The A73 and A75 were respectively 2- and 3-wide in their decode stages, so the A76 is 33% wider than the last generation in this aspect. It was curious to see the A73 go down in decode width from the 3-wide A72, but this was done to optimise for power efficiency and “leanness” of the pipeline, with the goal of improving the utilisation of the front-end units. In going 4-wide, the A76 is also Arm's widest microarchitecture to date – although it's still extremely lean when put into juxtaposition with competing µarches from Samsung or Apple.

The fetch unit feeds a decode queue of up to 16 32-bit instructions. The pipeline stages here consist of 2 cycles of instruction align and decode. It looks like Arm decided here to go back to a 2-cycle decode, as opposed to the 1-cycle unit found on the A73 and A75. As a reminder, the Sophia cores still required a secondary cycle on the decode stage when handling instructions utilising the ASIMD/FP pipelines, so Arm may have found other optimisation methods with the A76 µarch that warranted this design decision.

The decode stage takes in 4 instructions per cycle and outputs macro-ops at an average ratio of 1.06 macro-ops per instruction. Entering the register rename stage we see heavy power optimisation, as the rename units are separated and clock-gated for integer/ASIMD/flag operations. Rename and dispatch are a 1-cycle stage, a reduction from the 2-cycle rename/dispatch of the A73 and A75. Macro-ops are expanded into micro-ops at a ratio of 1.2 µops per instruction, and we see up to 8 µops dispatched per cycle, which is an increase from the advertised 6 µops/cycle on the A75 and 4 µops/cycle on the A73.
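Putting those ratios together gives a feel for the dispatch bandwidth. A quick back-of-the-envelope calculation, using the quoted averages (real instruction mixes will vary around these figures):

```python
# Average-case dispatch math from the quoted ratios.
decode_width   = 4      # instructions/cycle out of decode
mops_per_instr = 1.06   # macro-ops per instruction (quoted average)
uops_per_instr = 1.2    # micro-ops per instruction (quoted average)
dispatch_cap   = 8      # maximum µops dispatched per cycle

avg_mops = decode_width * mops_per_instr    # ~4.24 macro-ops/cycle
avg_uops = decode_width * uops_per_instr    # ~4.8 µops/cycle

print(f"average macro-ops/cycle: {avg_mops:.2f}")
print(f"average micro-ops/cycle: {avg_uops:.2f}")
print(f"dispatch headroom:       {dispatch_cap - avg_uops:.2f} µops/cycle")
```

In other words, the 8-µop dispatch cap leaves roughly 3 µops/cycle of headroom over the average case, so bursts of instructions that crack into multiple µops don't immediately bottleneck at dispatch.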

The out-of-order commit window of the A76 is 128 entries, and the buffer is separated into two structures responsible for instruction management and register reclaim, called a hybrid commit system. Arm made it clear that it wasn't focusing on increasing this aspect of the design, as it found it to be a poor return on investment when it comes to performance. It is said that the performance scaling is 1/7th – meaning a 7% increase of the reorder buffer only results in a 1% increase in performance. This stands in stark contrast to, for example, Samsung's M3 cores with their very large 224-entry ROB.
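To see why Arm considers that a poor trade, here is the quoted 1/7th scaling applied iteratively, treating it as a rough rule of thumb rather than a precise law:

```python
# The quoted scaling: growing the reorder buffer by 7% buys ~1% performance.
rob, perf = 128, 1.0
while rob < 224:                  # grow until reaching the M3's ROB size
    rob = round(rob * 1.07)
    perf *= 1.01
    print(f"ROB ~{rob:>3} entries -> ~{(perf - 1) * 100:.1f}% faster")
```

By this rule, going from the A76's 128 entries to the region of the M3's 224-entry ROB would be worth less than 10% performance, which explains Arm's reluctance to spend area and power there.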

As a last note on the front-end, Arm said it tried to optimise the front-end for the lowest possible latency for hypervisor activity and system calls, but didn't go into more detail.

Comments

  • techconc - Tuesday, June 5, 2018

    " A11 is no match in speed and performance versus SD845 and Exynos powered Android phones today"

    Huh? Benchmarks do not support your claim.
  • name99 - Friday, June 1, 2018

    This assumption makes two mistakes.

    The first is to assume that ONE metric (in this case a 4-wide front end) is the PRIMARY determinant of performance. Even Apple's (A11) IPC (over a wide range of code) is maybe about 2.7. This means on average fewer than 3 of those 6 execution units are being used per cycle. IF the other parts of the core/uncore could be PERFECTED on a 4-wide design so that EVERY cycle 4 instructions executed, it would clearly surpass the A11 in IPC.
    The problem, of course, is just how hard it is to prevent cycles where NOTHING executes and cycles where only a few (one or two) instructions execute. Reducing these is where most of the magic is --- and you won't see details of that in an article like this; rather it's in the painstaking rooting out of hundreds of small inefficiencies that the article talked about.

    To give just one example -- no-one is talking about the clustered page tables. This is a very cool idea which relies on the fact that most of the time the OS page allocator allocates a number of pages contiguously in virtual AND physical space, and with the same permissions. If that is so, the same page entry can correspond to multiple contiguous pages (in the academic literature, usually up to 8). This gives you a substantial increase in TLB reach at a very minor increase in TLB bits.
    (I can find no info as to whether Intel does this. I SUSPECT Apple used to do this in their earlier [and probably even A11] cores. There are very recent even better ideas in the academic literature that might, perhaps, have made it to the A12 core.)

    The second mistake you make is to ignore frequency. The A9 ran at 1.85 GHz, the A10 at 2.35 GHz. The A76 will likely run at 3GHz.
  • tipoo - Tuesday, September 4, 2018

    And yet, here we are with single-core results at around 60% of the A11's. Taking their own numbers at face value, a 56% increase over the A73 in GB4 results in 2800.

    Yes, I used a very simplistic one dimensional comparison, and there's a whole lot more to it. However, core complexity does go up almost exponentially with width, and so it does point to what ballpark they were aiming at. A76 was never going to beat the A11 per core because it was never aimed at it.
  • colinisation - Thursday, May 31, 2018

    Hi Andrei,

    Is this core the one referred to as Ares on roadmaps?

    Been waiting years for this one if it is.
  • Andrei Frumusanu - Thursday, May 31, 2018

    Yes in practical terms - no in actual terms. You'll likely hear more about this in the future.
  • tuxRoller - Friday, June 1, 2018

    Ok, now you've incepted the idea that ARM is going to announce a dedicated server-class chip (maybe even a tease of SVE.....)
  • name99 - Friday, June 1, 2018

    I agree with your point (ARM will release a server chip, essentially based on this core).
    Remember that GB4 is scaled so that 4000 represents an i7-6600U (Skylake, 3.4GHz, 4MiB L3, 15W).
    So A76 is essentially at that level (slightly worse FP, but many server tasks will not care).
    To the extent that that Skylake at 3.4GHz is an acceptable Server class core, ARM could dump some large number of A76 on a die and be in the same sort of space as dearly-departed Centriq and ThunderX2.

    They likely would have to beef up their NoC one way or another, and tweak the caching and memory systems, the usual server additions.
    But I assume they didn't put all that effort into "lowest possible latency for hypervisor activity" on the theory that hypervisor performance on smartphones is THE next big thing...
  • joe_85 - Thursday, May 31, 2018

    I am sure people will disagree with me because people love to argue. Andrei, I like your writing and overall thoroughness but a few critiques here. The charts you make are extremely unpleasant to look at and do not lend themselves to a quick assessment of the data.

    First of all, the color-coded stripes in the legend for the A76 projections are not even decipherable, and the actual bars on the chart are difficult to see. Secondly, why are you color-coding them at all? Just put processor names to the left of the bars and the benchmark name above the bars.

    Additionally, are the bars in any particular order? If they are, I certainly can't tell; they should be ordered relative to the performance OR the efficiency.

    Other constructive criticism would be that adding some additional subheadings within your articles would make it feel like a more solid piece.

    Keep up the good work.
  • jospoortvliet - Wednesday, June 6, 2018

    Loving to argue or not, the performance vs efficiency graphs are rather unique and very clever; I have not seen any design that so clearly shows how different SoCs compare at both. Yes, it takes a few minutes before you can read them; I am sorry that the world is so complicated. But they work very well if you just use that gray matter a bit.
  • syxbit - Thursday, May 31, 2018

    As an Android user, I continue to be disappointed with QCOMM, Arm (and Nvidia for dropping out) at how far ahead Apple is in single-threaded perf.
