Several years ago, at a local event detailing a new Arm microarchitecture core, I recall a conversation I had with a number of executives at the time: the goal was to get Arm into 25% of servers by 2020. A lofty goal, which hasn’t quite been reached, however after the initial run of Arm-based server designs the market is starting to hit its stride with Arm’s N1 core for data-centers getting its first outings. Out of those announcing an N1-based SoC, Ampere is leading the pack with a new 80-core design aimed at cloud providers and hyperscalers. The new Altra family of products is aiming to offer competitive performance, performance per watt, and scalability up to 210 W and tons of IO, for any enterprise positioning.

The Arm Server Market: 2010-2019 (Abridged)

We’ve seen companies such as Broadcom/Cavium/Marvell, Calxeda, Huawei, Fujitsu, Phytium, Annapurna/Amazon, AppliedMicro/Ampere, and even AMD put Arm-based IP into silicon and subsequently into the server market. Up until recently, most designs have been fairly lackluster – with companies either developing their own core on an Arm architecture license and not getting a performance lift, or using the standard Arm cores and not finding the right mix of performance, power, and software uptake needed to drive home the design. As a result, we’ve seen multiple companies fall by the wayside, be acquired, or limit their activities to specific customers and keep very hush-hush.


First Generation Ampere eMAG, Built on Applied Micro designs

A big example of the ‘be acquired’ type of company was Annapurna, whom Amazon acquired and eventually released its Graviton2 processor in recent months. This chip has 64 cores based on Arm’s N1 design, which is the leading microarchitecture layout for Arm server chips at this point. To that end, Ampere (who originally purchased Applied Micro) is now set to release its second generation product, with 80 of the N1-based cores, and it now has a name: Altra.

Ampere Altra

Ampere has already given a number of details away about Altra in an announcement late last year, however this time around we have concrete details and the company has performance projections. On the back of its first generation eMAG product, Ampere is looking to offer better-than-Graviton2 performance to any cloud provider or hyperscaler who isn’t called Amazon, given that Graviton2 is built by Amazon and only available to Amazon. In that regard, Ampere has taking Arm’s full recommendations for its N1 design, building a chip with the most number of cores that N1 is designed to support.

As with other N1-based products, Altra will be single threaded, ensuring that each thread has its own core, its own resources, and removing any potential core-sharing thread security issues that have occurred recently. The Altra SoC is built with containers in mind, ensuring high-levels of quality of service with multiple customers on the same chip, and additional RAS features to ensure consistent performance.

The N1 core is by design what we’ve covered when Arm detailed the microarchitecture design last year. There is a 4-cycle 64 KB L1I/L1D caches per core, along with a 9-11 cycle 1 MB of private L2 per core. This is partnered with 32 MB of system wide LLC distributed through the SoC mesh, and all these caches are ECC with SECDED operation. It’s worth noting that 32 MB across 80 cores is less per core than Amazon’s Graviton2, which has 32 MB for 64 cores. 32 MB is actually half of what Arm recommends, as in Arm’s presentation it stated that it would expect a 64-core design to have 64 MB.

On top of the 80 cores, the SoC will also have eight DDR4-3200 memory channels with ECC support, up to 4 TB per socket. There are also 128 PCIe 4.0 lanes, with which the CPU can use 32 of them to hook up to another CPU for dual socket operation. The dual socket system can then have a total of 192 PCIe 4.0 lanes between it, as well as support for up to 8 TB of memory. We are told that it’s actually the CCIX protocol that runs over these PCIe lanes, which means 25 GB/s per x16 linkup. That’s good for 50 GB/s in each direction.

Each of the PCIe lanes can be bifurcated down to x8/x4/x2, and every different variant of the Altra SoC will only be segmented on core count and frequency: all CPUs will have 4 TB support and 128 lanes of PCIe 4.0. Each CPU can also support up to four CCIX-based accelerators.

Altra is built on TSMC’s 7nm, and while is technically an Arm v8.2 design, it does borrow a couple of features from 8.3 and 8.5, namely hardware based mitigations for side channel attacks and a couple of other small micro-architectural features.

Each of the 80 cores is designed to run at 3.0 GHz all-core, and Ampere was consistent in its messaging in that the top SKU is designed to run at 3.0 GHz at all times, even when both 128-bit SIMD units per core are being used (thus an unlimited turbo at 3.0 GHz). The CPU range will vary from 45W to 210W, and vary in core count - we suspect these SKUs will be derived from the single silicon design, and it will depend on demand as well as binning as to what comes out of the fabs. Exact SKUs are going to be announced later this year.

Also on security, Ampere was keen to point out that its new SoC will have two control processors: an SM Pro and a PM Pro. These allow for server manageability, up to SBSA Level 4, as well as Secure Boot, RAS error reporting, and advanced power management/temperature control.

Ampere will be launching with two reference designs for Altra, one in single socket called Mt. Snow, and one in dual socket called Mt. Jade. Each design will be available in 1U and 2U form factors, with PCIe 4.0 and CCIX attach, and up to 16 memory modules per socket. We know that the partner for the single socket is the GIGABYTE Server team, however the dual socket partner has not be announced yet. We have been told that the CPUs are socketed, which makes mass scale production and testing (at least on our side) a little easier.

Projected Performance

Ampere has some performance numbers, which as always we take with a grain of salt. These include 2.23x the performance on SPEC2017_int rate over a single 28-core Intel Xeon Platinum 8280, and 1.04x over a single 64-core AMD EPYC 7742. This is obviously extended into a number of claims about improved TCO. Ampere didn’t provide similar numbers for SPEC2017_fp, because the company states that the SoC has been developed with INT workloads in mind. Exact power/performance numbers were not given, but based purely on TDP, which is somewhat of an unreliable metric at times. We’ll wait to run our own numbers in due course.

Developing a Roadmap: 2021, 2022

One of the key questions going into our briefings with Ampere is how closely they are working with Arm on the next generation enterprise server core designs for upcoming SoCs. They weren’t keen to position themselves as Arm’s key partner in this venture (which might be Amazon, given they were first), but did state that there is a lot of collaboration and feedback that goes into the future designs. As a result, Ampere is able to formally declare a long-term roadmap for its product portfolio.

In this instance, Ampere is stating that today it has the 80-core Altra design on 7nm. In 2021, it will launch its Mystique product, which is currently in development (and when asked, Ampere told us will share the same socket as Altra). In 2022, Ampere will launch Siryn, and at this time the product has been defined and requires development.

Having a sustained product cadence has been critical to a number of processor designs in the last couple of decades – it tells potential ODM partners and customers that the company is in for the long haul, and committed to future developments with targets to meet. Obviously with Ampere tying itself to Arm’s roadmap helps in those product definition stages. It’s a feature that has crippled previous Arm designs from coming to market – without a clear roadmap, customers are unwilling to invest in a one-generation wonder and provide long term support for it. There’s always the issue as to whether any investment funding might run out, so Ampere’s goal here with Altra is to be the obvious answer to Graviton2 for the other hyperscalers. With that large market on offer, the goal is to be profitable and self-sustaining as quickly as possible, which then in turn gives potential customers even more confidence.

Next Stage for Altra

At this point, Ampere has stated to us that Altra is currently sampling with its key customers who are looking to deploy the hardware. From previous experience, the key customers who are involved early tend to get priority for deployment, and in that respect Ampere has stated that an official SKU list will come to market mid-year, along with pricing, and with official SPEC submissions. Hopefully at that time we will also get instance pricing from the companies intending to deploy the new chip.

We’re currently in talks with Ampere in order to obtain Altra for in-house testing when they feel it is ready. We have a version of Ampere’s previous generation eMAG workstation that just arrived in the labs, which should help us provide a good base-line from the previous design to the new one. Stay tuned for our coverage of eMAG and Altra!

Related Reading

Gallery: Ampere Altra

POST A COMMENT

66 Comments

View All Comments

  • deltaFx2 - Thursday, March 5, 2020 - link

    So wall power draw includes things like PSU, voltage regulator, DRAM, etc. Stuff outside of the CPU, that is. Ampere will have the same things so TDP is a reasonable comparison point for both until we get real numbers off a real system. Remember that once you go past a certain point, you need to raise voltage a lot to get little freq increase. ARM designs for a lower fmax which makes them efficient at lower frequencies but inefficient at high frequency: AMD/Intel likely need lower voltage to hit 3.3 GHz because fmax is much higher but is inefficient at lower frequencies. Point being, I doubt arm can be power efficient past 3 GHz. Indeed it should not playing that game as it gives up its biggest strength.

    As for derate, measurements are easier. Both systems are easily available. It's no excuse... where did that derating factor come from? Is there an industry standard conversation factor? Of course not.

    At least for AMD, single threaded IPC for a system running only one process may be less than N1 in some cases as there is only 16mb L3 pet ccx so N1 has more cache. Nobody runs a server like that though. Intel has a large unified cache so perhaps it may come ahead. We need real silicon to know.

    This startup was AMCC not that long ago and they have simply cobbled together someone else's IP. It doesn't take that much technical chops, certainly not like doing your own. AWS is already doing the same thing. If MS wants, it can do the same. So what's the big deal here?
    Reply
  • Wilco1 - Thursday, March 5, 2020 - link

    Well clearly it no longer takes billions and 5-10 years to design a competitive server. Today one can license a server core from Arm, "cobble" something together on a budget and in 1-2 years beat the fastest servers on the market.

    "Cobbled" or not, AMD and Intel just lost the hyperscaler market. A big deal or not?
    Reply
  • deltaFx2 - Thursday, March 5, 2020 - link

    'beat' is a very loose way of putting it. It's hardly a beat when your overclocked part ( 10% over your advertised single core turbo) barely keeps up with a crippled competitors score. Moreover, this is a paper launch. Intel and AMD system are available today. The fact that there needed to be so much dishonesty in metrics suggests to me that x86 market is safe for this neoverse generation.

    You are being naive if you believe that just being able to tape something out = success. Servers need volume. Volume aids binning, brings costs way down so as to recover nre costs, etc. AMDs chiplet design serves exactly that purpose: yields, yes but mainly volume into the desktop market allows them to harvest the best server chiplet. Which in turn means they can always undercut on price. List price for Intel/amd is meaningless because nobody actually pays that. Graviton at least has a captive market. An oem like Ampere has to make money.

    X86 has not yet lost the hyperscalar market. Not even close. Arm is now at a point where it doesn't suck anymore. Not the same as saying it offers a superior solution because it didn't just yet.
    Reply
  • name99 - Tuesday, March 3, 2020 - link

    WHY?
    This SoC is designed for a particular job: massive throughput of mostly integer tasks. Benchmark it for that. It’s irrelevant how well or badly it handles single threaded or FP.

    If you really want, it’s mostly an A76, so look up the Kirin 980 numbers. Bottom line, it’s about 50% to 65% of an A13 single threaded, depending on the exact task.
    Reply
  • npz - Wednesday, March 4, 2020 - link

    you can't ignore FP completely because most things actually use FP even implicitly. For example awk at for its traditional unix implemention, the very common scripting language used everywhere in shell programming, uses FP for all calculations even integer ones. Furthermore, single threaded does matter since not everything can be parralellized and you could have parallel tasks blocked waiting for a single threaded task for example. Reply
  • CiccioB - Wednesday, March 4, 2020 - link

    Floating point are a fraction of the instruction used in code that is not meant to make calculations with them. In normal use you can even have not a HW FPU but use floating point emulations that takes a lot of cycles to make calculations but does not require an otherwise useless big die.

    However, having FP HW support or not is different than having a core that is studies to make fast and efficient FP operations. This requires a lot of die space (and power budget). These cores are simply not thought to make scientific simulations but you may be sure they can happily run all your WM with PHP or Python scripts with no performance issues due to the FP instructions used there.
    Reply
  • Wilco1 - Wednesday, March 4, 2020 - link

    Neoverse N1 has 2 128-bit pipes doing FADD in 2 cycles, FMUL in 3 and FMAC in 4 cycles (with 2-cycle back-to-back accumulation). No other microarchitecture can match these latencies.

    So FP hasn't been ignored, it's just not a key selling point for hyperscalers.
    Reply
  • Foeketijn - Tuesday, March 3, 2020 - link

    Thanks for making this kind of content! Reply
  • Scipio Africanus - Tuesday, March 3, 2020 - link

    STH has a pretty good breakdown of the announcement. https://www.servethehome.com/ampere-altra-80-arm-c... Reply
  • eastcoast_pete - Tuesday, March 3, 2020 - link

    Thanks Ian! Any word on which companies will deploy these? With it's own in-house design, I'd guess Amazon is out. Would be nice to know who the launch customers/partners are. Reply

Log in

Don't have an account? Sign up now