Original Link: https://www.anandtech.com/show/15733/ampere-emag-system-a-32core-arm64-workstation



Arm desktop systems are quite a rarity. In fact, it’s quite an issue for the general Arm software ecosystem in terms of having appropriate hardware for developers to actually start working in earnest on more optimised Arm software.

To date, the solution to this has mostly been using cloud instances of various Arm server hardware – it can be a legitimate option and new powerful cloud instances such as Amazon’s Graviton2 certainly offer the flexibility and performance you’d need to get things rolling.

However, if you actually wanted a private, local, physical system, you’d mostly be relegated to small, low-performing single-board computers, which most of the time had patchy software support. It’s only in the last year or two that Arm-based laptops with Qualcomm Snapdragon chips have become a viable developer platform, thanks to WSL on Windows.

For somebody who wants a bit more power, and in particular wants to make use of peripherals such as large amounts of storage or PCIe connectivity, there are options such as Avantek’s eMAG Workstation system.

The system is an interesting mish-mash of desktop and server hardware. At the centre of it all is Ampere’s “Raptor” motherboard containing the eMAG 8180 32-core chip. This is a server development board that doesn’t adhere to any standard form factor, but Avantek was able to make it fit into a BeQuiet tower chassis with some modifications.

Ian had published a more in-depth visual inspection of the machine a few weeks ago, so I recommend reading that in terms of the analysis of what’s physically present in the machine and its quirks.

Read: Arm Development For The Office: Unboxing an Ampere eMag Workstation

The notable characteristic of the system is that it was designed by a vendor that’s usually server-oriented – this is Avantek’s first foray into a desktop system.

As noted, because the motherboard isn’t adhering to an ATX standard, the biggest incompatibility lies on the part of the PCIe slots which don’t match up with the slots of the chassis. Avantek here had to resort to using a riser card and a custom backplate in order to fit the graphics card.

The graphics card provided in our sample was a Radeon Pro WX5100 – a lower-end unit meant for workstations.

The biggest advantage of the system, which we’ll address in more detail in a bit, is the fact that this is an SBSA (Server Base System Architecture) compliant system, which means it’ll be compatible with “most” PCIe hardware out there. For example, I had no issues replacing the graphics card with an older Radeon HD 7950 I had lying around, and the system booted with display output without any issues. This might sound extremely boring, and it is – but for the Arm ecosystem it’s been a decade-long journey to reach this point.

In terms of general form factor, Avantek’s choice to go with a desktop chassis works well. It’s a big motherboard, so it does require a bigger case, which leaves room for plenty of additional hardware inside.

I think one negative on the system from a practical hardware perspective is Avantek’s server pedigree. The CPU cooler in particular is the type you’d find in a server system, and the fan choice isn’t something you’d see in any traditional desktop system, as it's a more robust 90mm fan. Although the company has said that it tried to minimise the noise of the system by adjusting the fan curves as well as opting for a low-acoustics chassis, it’s still subjectively loud for a desktop system. I measured around 42dBA at idle, which is still a bit much, although that also depends on your expectations of a silent system. In the future, I hope Avantek employs a more consumer-grade CPU cooler and reduces the acoustics of the machine.



An Arm SBSA System

As noted, the one thing that sets the eMag Workstation apart from most other Arm-based embedded systems in the market, is the fact that it’s an SBSA (Server Base System Architecture) compliant system.

What SBSA mandates as a standard is that a vendor design the hardware so that the CPU, the system timers, interrupts and PCIe handling operate in a way that lets any SBSA-compliant operating system image boot on it.

That’s in stark contrast with most other Arm embedded systems in the market – take Nvidia’s Jetson Arm development kits: while these systems do have images for popular OSes such as Ubuntu, these are provided and maintained by Nvidia, with customised kernels and baked-in drivers. You’re reliant on the vendor to actually update the OS images, unless you go ahead and compile your kernels and OS yourself – if that’s possible at all.

An SBSA system on the other hand will be able to boot generic OS images – essentially the same way it would work on any x86 desktop or server system on the market.

In essence, that’s the one main advantage of the eMAG Workstation over any other Arm embedded system, and it’s a strong advantage from a software standpoint.

The system has a comprehensive BIOS with tons of configuration options. The options here match and exceed what you’d find in a typical x86 system – overclocking options aside of course.

One important aspect of the system is that this is a server motherboard with a BMC. The BMC is an ASPEED AST2500 chip that allows independent management of the system through the dedicated BMC Ethernet port on the back of the system, allowing for two BIOS images to be set and configured.

It also serves as the 2D driver allowing for (non-accelerated) VGA display output via the D-Sub connector on the back of the system. There’s also a serial connection for terminal access.

One big disadvantage of the BMC nature of the system and the SBSA architecture is that boot times are horrible. Booting and rebooting the system is a test of patience, with each cycle taking between 4 and 5 minutes, as demonstrated in the above video capture. The BMC bootup itself takes around 1 minute 20 seconds before we get to the BIOS screen, and then it’s another painful 3 minutes for Linux to actually boot up to the login screen. The ecosystem still has a lot of work ahead to optimise this aspect of Arm systems.



The eMAG 8180: AppliedMicro's Legacy Skylark Core

While you’re reading this in 2020, and the eMAG Workstation was released in 2019, the CPU powering the system is actually quite ancient, tracing its roots back to AppliedMicro, defunct since 2017. Originally meant to be called the X-Gene3, the chip had been planned for the second half of 2017, before AppliedMicro went through several changes of ownership and the IP and designs ended up with Ampere Computing.

In that sense, the eMAG 8180 is more of a legacy design and quite distantly related to Ampere’s newer Altra system processors.

The Skylark cores in the eMAG 8180 are a custom core design with the X-Gene processor pedigree. It’s a 4-wide OOO processor that’s relatively narrow by today’s standards, characterised by quite high operating frequencies of 3-3.3GHz and quite an unusual cache hierarchy, with each pair of cores sharing the same 256KB L2 cache.

On a chip level, the CPU is characterised by a large coherent network tying all the CPU modules, the memory controllers, and a large 32MB L3 cache together.

What’s surprising here is that the core-to-core latency across the whole chip isn’t bad at all, ranging from 68-73ns. While this certainly doesn’t keep up with more recent monolithic designs, this is an Arm v8.0 core lacking CAS atomic operations – so the above figures are done via regular sequential exclusive load / exclusive stores which aren’t as fast. The coherency here going over the 32MB L3 cache certainly helps the system punch above its weight for a design of its time.

The CPU cores have 32KB L1 instruction and data caches – the access latencies here are 5 cycles. The 256KB L2 caches have a 13-cycle access latency, while the 32MB L3 cache has some massive 45ns+ access latencies that are much slower than any other comparable design out there.
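To put the cycle counts and the nanosecond figure on the same scale, the conversion can be sketched as below. This assumes the cores run at the 3.3GHz quoted elsewhere in this review during the latency test; the function names are illustrative only.

```python
# Convert between core cycles and nanoseconds at a given clock.
# Assumption: the core clock is 3.3 GHz during the test (the review quotes
# 3-3.3 GHz, so the real figure may differ slightly).
CLOCK_GHZ = 3.3


def cycles_to_ns(cycles: float, ghz: float = CLOCK_GHZ) -> float:
    """One cycle lasts 1/ghz nanoseconds."""
    return cycles / ghz


def ns_to_cycles(ns: float, ghz: float = CLOCK_GHZ) -> float:
    """ghz cycles elapse every nanosecond."""
    return ns * ghz


print(f"L1, 5 cycles:  {cycles_to_ns(5):.2f} ns")
print(f"L2, 13 cycles: {cycles_to_ns(13):.2f} ns")
print(f"L3, 45 ns:     {ns_to_cycles(45):.1f} cycles")
```

Under that clock assumption, the L3 sits at roughly 150 core cycles, against 13 for the L2, which is why the gap between the two levels stands out.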

We note the core’s L1 TLB ends at 48 pages (192KB) and the L2 TLB at 1024 pages (4MB), after which page-miss access times increasingly result in worse latencies.
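The TLB reach figures above follow directly from the entry counts and the page size. A minimal sketch, assuming 4KB pages (which is what 48 pages = 192KB implies):

```python
# TLB reach = number of entries * page size.
# Assumption: 4 KB pages, as implied by the 48 pages = 192 KB figure.
PAGE_SIZE_KB = 4


def tlb_reach_kb(entries: int, page_kb: int = PAGE_SIZE_KB) -> int:
    """Amount of memory (in KB) the TLB can map without a miss."""
    return entries * page_kb


l1_reach = tlb_reach_kb(48)
l2_reach = tlb_reach_kb(1024)
print(f"L1 TLB reach: {l1_reach} KB")
print(f"L2 TLB reach: {l2_reach // 1024} MB")
```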

In contrast with the quite large cache access latencies, the DRAM access latency isn’t all that bad at around 137ns full random at 128MB depth.

Single-core bandwidth of the Skylark cores isn’t too pretty: load and store bandwidth into the L1 and L2 seem to be limited to 8B/cycle, and a combined 16B/cycle for concurrent loads and stores. The dip between the L2 and L3 is usually a showcase of a bandwidth bottleneck when evicting/replacing a cacheline, and the load bandwidth at the DRAM level is also quite disappointing.
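For a sense of scale, the per-cycle limits can be turned into GB/s. This is a rough sketch that assumes a 3.3GHz core clock, not a measured result:

```python
# Bytes per cycle * cycles per second = bytes per second.
# Assumption: 3.3 GHz core clock; the actual clock during the test may differ.
def bandwidth_gb_s(bytes_per_cycle: float, ghz: float = 3.3) -> float:
    """B/cycle multiplied by Gcycles/s gives GB/s."""
    return bytes_per_cycle * ghz


print(f"Load or store only: {bandwidth_gb_s(8):.1f} GB/s")
print(f"Concurrent load and store: {bandwidth_gb_s(16):.1f} GB/s")
```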

Overall, the performance here is only half of a more modern Arm core, but again, this is a 2015-2016 core design.



SPEC2017: Weak ST Performance 

Single-threaded performance of the system is going to be interesting, but given the age of the CPUs we shouldn’t be expecting any miracles.

As comparison points, I’m adding the new Neoverse-N1 based Graviton2 results, which should serve as an indicator of what a contemporary Arm core should be able to achieve, as well as Intel’s i7-10700K (equivalent to a 9900K) – a common mainstream consumer-grade CPU that should represent your higher-end x86 desktop machine.

SPECint2017 Rate-1 Estimated Scores

Things aren’t looking too good for the Skylark cores, as performance isn’t really up to par with more recent generation hardware. There’s no specific workload in which the eMAG does particularly badly, but we do see that small memory footprint workloads such as 548.exchange2 aren’t faring all that badly – suggesting that for the other workloads the system must be cache and memory bottlenecked.

SPECfp2017 Rate-1 Estimated Scores

In the floating point suite, things are again not too great and performance further craters in some tests.

SPEC2017 Rate-1 Estimated Total

Overall, the eMAG 8180 is extremely disappointing in its single-threaded performance. It’s actually quite intriguing to see the results. Even though the Skylark cores are operating at 3.3GHz, the end performance isn’t any better than a 2.1GHz Cortex-A72 core such as found in the first-generation Graviton chip. That’s quite the massive IPC disadvantage even between those two older CPU microarchitectures, and reminds us of the reason AppliedMicro hadn’t really seen much success with its design.



SPEC2006 & 2017: MT Performance

Of course, while the individual cores might not be all that performant, the chip employs 32 of them. Together with the 32MB L3 cache and the 8-channel DDR4-2666 memory interface, the system should certainly be able to showcase better multi-core results.
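As a rough yardstick for that memory interface, theoretical peak bandwidth can be worked out from the channel count and transfer rate. This sketch assumes the standard 64-bit (8-byte) DDR4 data bus per channel and ignores ECC bits and real-world efficiency losses:

```python
# Theoretical peak DRAM bandwidth = channels * transfers/s * bytes/transfer.
# Assumption: each DDR4 channel carries 64 data bits (8 bytes) per transfer.
def peak_dram_gb_s(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Returns GB/s using decimal units (1 GB = 1e9 bytes)."""
    return channels * mega_transfers * 1e6 * bus_bytes / 1e9


peak = peak_dram_gb_s(channels=8, mega_transfers=2666)
print(f"8-channel DDR4-2666 peak: {peak:.1f} GB/s")
```

Achieved bandwidth in practice will be lower than this ceiling.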

SPECint2017 Rate Estimated Scores (Max CPU)

SPECfp2017 Rate Estimated Scores (Max CPU)

Indeed, the chip does better in the tests, at least more often than not being able to beat the consumer-grade Intel CPU. The performance scaling of the eMAG system also isn’t bad at all – scaling from 1 core to 32 cores sees the performance scale with an average factor of 0.73x per core, and a median per SPEC2006 and 2017 test of 0.78x; that’s much better scaling than Amazon’s Graviton2 when scaling ST performance to its full 64 cores.
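The per-core scaling factor translates into an overall multi-core speedup as sketched below. The definition used here (speedup divided by core count) is my reading of the figure, so treat it as an assumption:

```python
# Assumed definition: scaling factor = (MT score / ST score) / core count.
# Rearranged: overall speedup = scaling factor * core count.
def speedup_from_scaling(scaling_per_core: float, cores: int) -> float:
    return scaling_per_core * cores


CORES = 32
average = speedup_from_scaling(0.73, CORES)
median = speedup_from_scaling(0.78, CORES)
print(f"Average speedup over 1 core: {average:.2f}x")
print(f"Median speedup over 1 core:  {median:.2f}x")
```

In other words, the 32 cores together behave like roughly 23 to 25 perfectly scaling cores.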

SPEC2017 Rate-N Estimated Total

Still, the MT results, while beating the Intel system, don’t look all that great when considering the fact that we’re talking about 8 cores vs 32 cores. That’s four times the cores for only a 30% performance advantage.
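The core-count comparison works out as follows. The only inputs are the 8 and 32 core counts and the roughly 30% advantage quoted above; the per-core figure is derived, not measured:

```python
# Compare the core-count ratio against the multi-threaded performance ratio.
intel_cores, emag_cores = 8, 32
perf_ratio = 1.30  # eMAG MT performance relative to the Intel system (~30% ahead)

core_ratio = emag_cores / intel_cores
per_core_relative = perf_ratio / core_ratio

print(f"Core ratio: {core_ratio:.0f}x, i.e. {100 * (core_ratio - 1):.0f}% more cores")
print(f"Per-core throughput relative to Intel: {per_core_relative:.3f}")
```

Each eMAG core therefore contributes about a third of what each Intel core does in the multi-threaded result.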

The Graviton2 naturally eclipses both comparisons here, and the only reason to consider this chip’s figures is that Ampere’s upcoming Altra processor with 80 cores and 3GHz should be notably faster than the Amazon chip.



General Code Compile - Who's it For?

As I had mentioned, one big advantage of having an Arm system like this is the fact that it enables native software development, without having to worry about cross-compiling code and all of the kerfuffle that entails.

For me personally, one issue had been the fact that we need to compile our SPEC test suite for Arm architectures, which isn’t all that evident when you don’t have a native machine to test things on.

Still, I had been able to configure a proper cross-compile setup: in this case GCC9 on an x86 machine (AMD Ryzen 3700X) cross-compiling for an AArch64 target, versus natively compiling the same target with the same compiler on the eMAG workstation.
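To illustrate the difference between the two setups, here is a minimal sketch of how the compiler invocations might differ. It is not the actual SPEC build configuration: the toolchain name assumes Debian/Ubuntu-style packaging, and the source file, output name and flags are hypothetical.

```python
import shlex
from typing import List

# Assumptions: the cross compiler is installed under the Debian/Ubuntu-style
# name "aarch64-linux-gnu-gcc"; "bench.c" is a hypothetical source file; the
# flags are illustrative and not taken from the review's SPEC setup.
CROSS_CC = "aarch64-linux-gnu-gcc"  # runs on x86, emits AArch64 code
NATIVE_CC = "gcc"                   # runs on the Arm machine itself


def compile_command(cc: str, source: str, output: str) -> List[str]:
    """Build the argument list for a simple optimised AArch64 build."""
    return [cc, "-O2", "-march=armv8-a", source, "-o", output]


cross = compile_command(CROSS_CC, "bench.c", "bench.aarch64")
native = compile_command(NATIVE_CC, "bench.c", "bench.aarch64")

# Only the compiler binary differs; the target and flags stay the same.
print(shlex.join(cross))
print(shlex.join(native))
```

The commands are only printed here, not executed, so the sketch runs without either toolchain installed.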

Native vs Cross-Compile (Andrei's SPEC setup)

The first thing I noted when compiling things on the eMAG system, is that it took quite an enormous amount of time for linking together the libraries and executables. Unfortunately, this isn’t something that can be parallelised, and it ends up being a mostly single-thread performance bottlenecked part of software development.

Although the eMAG system does have more processing power than your average consumer system, it’s unlikely for this to actually materialise in the average development environment due to the massive single-threaded performance disadvantage of the system.

In my case, my personal desktop machine outperformed the eMAG system in this one use-case by a factor of >2x.
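Why a serial link step hurts a many-slow-cores machine can be sketched with a simple Amdahl-style model. All inputs here are hypothetical: the 20% serial fraction and the 0.3x per-core speed are illustrative values, not measurements from this review; only the core counts (8 for the Ryzen 3700X, 32 for the eMAG) are real.

```python
# Amdahl-style build-time model: the serial part (e.g. linking) runs on one
# core, the parallel part (compiling) is spread across all cores.
def build_time(serial_fraction: float, core_speed: float, cores: int) -> float:
    """Relative build time; 1.0 = one core of speed 1.0 doing everything."""
    serial = serial_fraction / core_speed
    parallel = (1.0 - serial_fraction) / (core_speed * cores)
    return serial + parallel


# Hypothetical inputs: 20% of the work is serial, and an eMAG core is assumed
# to run at 0.3x the speed of a desktop core.
desktop = build_time(serial_fraction=0.2, core_speed=1.0, cores=8)
emag = build_time(serial_fraction=0.2, core_speed=0.3, cores=32)

print(f"Desktop relative build time: {desktop:.3f}")
print(f"eMAG relative build time:    {emag:.3f}")
print(f"Desktop advantage:           {emag / desktop:.2f}x")
```

With these made-up inputs the serial portion dominates the eMAG's total time, which is the qualitative effect described above; the exact ratio depends entirely on the assumed values.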

My personal view on this is that if I were to be trying to port a piece of software to Arm, assuming that the act of cross-compiling itself isn’t an inherent issue, then I would probably prefer to simply cross-compile things on my regular x86 machine and deploy and test it on a lower-end embedded Arm board.

The only real audience who could rationalise the system performance deficit for its architectural flexibility would be developers who actually work on hardware enablement – using the SBSA system to its fullest for things like Arm driver development and making full use of the peripherals and PCIe capabilities of the system.



Conclusion - All Eyes on an Altra System

Overall, my expectations of the eMAG Workstation at the beginning were in my view quite realistic, but I can’t help but still feel a bit underwhelmed by the actual experience of the system.

Yes, it’s incredibly important to have an SBSA system that boots generic OS images – and this probably remains the single biggest advantage of the eMAG to date. The problem is that even accounting for all those advantages, the aging CPU’s lacklustre performance just doesn’t justify the extremely high cost of the system.

      
Ampere eMAG Workstation vs HoneyComb LX2K Pricing

If you’re a developer who needs to work on hardware enablement and make use of the SBSA system for software development, then you probably don’t need to read this piece to rationalise the eMAG Workstation; you probably already have one.

For the general populace, there are better-value Arm alternatives out there, even if SBSA is to be compromised.

Ampere Altra: 80 High-Performance Cores

What’s actually more important than the current generation eMAG Workstation is the possibility of Avantek and Ampere creating an updated successor based on the new Altra processor.

Amazon has already proven that Arm’s Neoverse-N1 CPU cores perform extremely well, and Ampere’s implementation with 80 cores and higher clock speeds of up to 3GHz should pretty much outperform the cloud provider’s chip.

From a software perspective, the eMAG Workstation is great. By iterating on that aspect, updating the hardware with the newest Ampere chip, and improving the cooling solution to something that’s quieter in an office environment, Avantek could see a ton of success with such a system, turbo-charging the Arm software ecosystem and finally giving developers the machines they’ve been demanding for years.

Avantek told us they’re willing to build such a system as long as there’s sufficient demand for it. I think the demand is there; we just need more awareness and for the hardware to deliver on its performance.
