Qualcomm Demos 48-Core Centriq 2400 Server SoC in Action, Begins Sampling

by Anton Shilov on December 16, 2016 6:00 PM EST

Qualcomm this month demonstrated its 48-core Centriq 2400 SoC in action and announced that it had started to sample its first server processor with select customers. The live showcase is an important milestone for the SoC because it proves that the part is functional and is on track for commercialization in the second half of next year.

Qualcomm announced plans to enter the server market more than two years ago, in November 2014, but the first rumors about the company’s intentions to develop server CPUs emerged long before that. In fact, being one of the largest designers of ARM-based SoCs for mobile devices, Qualcomm was well prepared to move beyond smartphones and tablets. However, while it is not easy to develop a custom ARMv8 processor core and build a server-grade SoC, building an ecosystem around such a chip is even more complicated in a world where ARM-based servers are typically used in isolated cases. From the very start, Qualcomm has been rather serious not only about the processors themselves but also about the ecosystem and support by third parties (Facebook was one of the first companies to support Qualcomm’s server efforts). In 2015, Qualcomm teamed up with Xilinx and Mellanox to ensure that its server SoCs are compatible with FPGA-based accelerators and data-center connectivity solutions (the fruits of this partnership will likely emerge in 2018 at the earliest). Then it released a development platform featuring its custom 24-core ARMv8 SoC that it made available to customers and various partners among ISVs, IHVs and so on. Earlier this year the company co-founded the CCIX consortium to standardize various special-purpose accelerators for data-centers and make certain that its processors can support them. Taking into account all the evangelization and preparation work that Qualcomm has disclosed so far, it is evident that the company is very serious about its server business.

From the hardware standpoint, Qualcomm’s initial server platform will rely on the company’s Centriq 2400-series family of microprocessors that will be made using a 10 nm FinFET fabrication process in the second half of next year. Qualcomm does not name the exact manufacturing technology, but the timeframe points to either Samsung’s performance-optimized 10LPP or TSMC’s CLN10FF (keep in mind that TSMC has a lot of experience fabbing large chips and a 48-core SoC is not going to be small). The key element of the Centriq 2400 will be Qualcomm’s custom ARMv8-compliant 64-bit core code-named Falkor. Qualcomm has yet to disclose more information about Falkor, but the important thing here is that this core was purpose-built for data-center applications, which means that it will likely be faster than the company’s cores used inside mobile SoCs when running appropriate workloads. Qualcomm currently keeps the particulars of its cores under wraps, but it is logical to expect the developer to increase the frequency potential of the Falkor cores (vs mobile ones), add support for an L3 cache and make other tweaks to maximize their performance. The SoCs do not support any multi-threading or SMP technologies, hence boxes based on the Centriq 2400-series will be single-socket machines able to handle up to 48 threads. The core count is an obvious promotional point that Qualcomm is going to use over competing offerings and it is naturally going to capitalize on the fact that it takes two Intel multi-core CPUs to offer the same number of physical cores. Another advantage of the Qualcomm Centriq over rivals could be the integration of various I/O components (storage, network, basic graphics, etc.) that are now supported by PCH or other chips, but that is something that the company has yet to confirm.

From the platform point of view, Qualcomm follows ARM’s guidelines for servers, which is why machines running the Centriq 2400-series SoC will be compliant with ARM’s Server Base System Architecture (SBSA) and Server Base Boot Requirements (SBBR). The former is not a mandatory specification, but it defines an architecture that developers of OSes, hypervisors, software and firmware can rely on. As a result, servers compliant with the SBSA promise to support more software and hardware components out-of-the-box, an important thing for high-volume products. Apart from giant cloud companies like Amazon, Facebook, Google and Microsoft that develop their own software (and who are evaluating Centriq CPUs), Qualcomm targets traditional server OEMs like Quanta or Wiwynn (a subsidiary of Wistron) with the Centriq, and for these companies software compatibility matters a lot. On the other hand, Qualcomm’s primary server targets are large cloud companies, whereas server makers do not have their samples of Centriq yet.

During the presentation, Qualcomm demonstrated Centriq 2400-based 1U 1P servers running Apache Spark, Hadoop on Linux, and Java: a typical set of server software. No performance numbers were shared and the company did not open up the boxes so as not to disclose any further information about the CPUs (such as the number of DDR memory channels, type of cooling, or supported storage options).

Qualcomm intends to start selling its Centriq 2400-series processors in the second half of next year. Typically it takes developers of server platforms a year to polish off their designs before they can ship them, so normally it would make sense to expect Centriq 2400-based machines to emerge in the latter half of 2H 2017. But since Qualcomm wants to address operators of cloud data-centers first and companies like Facebook and Google develop and build their own servers, they do not have to extensively test them in different applications, but just make sure that the chips can run their software stack.

As for the server world outside of cloud companies, it remains to be seen whether the server industry is going to bite on Qualcomm’s server platform given the lukewarm welcome for ARMv8 servers in general. For these markets, performance, compatibility, and longevity are all critical factors in adopting a new platform.

88 Comments

deltaFx2 - Thursday, December 22, 2016 - link
Thanks for the ARM vs x86 data. I suppose a fairer comparison would be to compute dynamic instruction count (i.e. sum of (instruction size * execution frequency)), but it's probably best to leave it at that.
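
To make the size-weighted sum described above concrete, here is a minimal Python sketch. The two profiles are made-up numbers, not measurements of any real ARM or x86 binary, and `dynamic_bytes` is a hypothetical helper name used only for illustration.

```python
def dynamic_bytes(profile):
    """Sum of (instruction size in bytes * times executed) over a profile.

    `profile` is a list of (size_bytes, execution_count) pairs.
    """
    return sum(size * count for size, count in profile)


# Hypothetical profiles for the same loop on two ISAs (illustrative only).
fixed_length = [(4, 1000), (4, 1000), (4, 10)]     # every instruction is 4 bytes
variable_length = [(2, 1000), (7, 1000), (3, 10)]  # sizes vary per instruction

print(dynamic_bytes(fixed_length))     # 8040
print(dynamic_bytes(variable_length))  # 9030
```

A static comparison would only add up the sizes once per instruction; weighting by execution count is what makes hot loops dominate the result.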

The trouble with instructions that crack into multiple ops is, before decode, you have no idea how many entries you need to hold the decoded uops. So you can't allow an instruction to expand into an arbitrary number of uops inline because you may not have enough slots plus you have to align the results of the parallel decodes (dispatch is in-order). Pure RISC with 1:1 decode is clearly simple. For ops that are not 1:1 you may need to break the decode packet, and invoke a sequencer the next cycle to inject 2+ ops. Intel kinda does this with their 1 complex decode + 3 simple decoders. ld2/ld3/ld4 can be a stream of ops that are pretty much microcode even if you implement it as a hardware state machine instead of a ROM lookup table. The moment you have even one instruction that cracks into multiple uops, you need to build all the plumbing that is unavoidable in CISC, and what RISC was seeking to avoid. At this point, it's not an argument of principle but degree. CISC has a lot of microcode, ARM has a little microcode (or equivalent).
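
A toy model of the packet-breaking behaviour described above, as a rough Python sketch. It assumes a 4-wide decoder in which 1:1 instructions share a packet, while any instruction that cracks into 2+ uops closes the current packet and is emitted by a sequencer on its own cycle(s). The width, the policy and the name `decode_cycles` are assumptions for illustration and do not describe any specific core.

```python
import math


def decode_cycles(uop_counts, width=4):
    """Count decode cycles for an in-order stream of instructions.

    `uop_counts[i]` is the number of uops instruction i cracks into.
    """
    cycles = 0
    slots_used = 0  # slots already filled in the current packet
    for uops in uop_counts:
        if uops == 1:
            if slots_used == width:  # packet is full, start a new one
                cycles += 1
                slots_used = 0
            slots_used += 1
        else:
            if slots_used > 0:  # break the packet before the cracked op
                cycles += 1
                slots_used = 0
            # The sequencer injects the uops, at most `width` per cycle.
            cycles += math.ceil(uops / width)
    if slots_used > 0:  # flush the last partial packet
        cycles += 1
    return cycles


print(decode_cycles([1, 1, 1, 1, 1]))  # 2
print(decode_cycles([1, 1, 3, 1, 1]))  # 3
```

In this model a single cracked instruction in the middle of the stream costs an extra cycle even though the total uop count would still fit in two packets, which is the plumbing cost the comment is pointing at.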

"keeping things as simple as possible and only add complexity when there is a quantifiable benefit that outweighs the cost" -> Well, that is the bedrock of computer architecture well before RISC, and it says nothing. Intel might argue that AVX-512 is worth the complexity. Fixed instruction length is a good idea. Relying on software to implement complexity is reasonable. Other than that, RISC designs have become more CISCy to a point where the distinction is blurred, and largely irrelevant. IBM Power has ucode but is fixed length. SPARC implements plenty of CISCy instructions/features.

BTW, it's going to be very very inefficient to implement ld-pair as one op. I doubt anyone would put in the effort.
deltaFx2 - Thursday, December 22, 2016 - link
The point being, ARM's ISA has more in common with x86 than Alpha (IMHO the cleanest true RISC ISA). ARM has carried forward some warts from A32. Not that there's anything wrong with it, but ARM decode itself has significant complexity. As noted earlier, x86's primary decode complexity comes from not knowing where the instructions are in the fetched packet of bytes (variable length). Sure, the extra instructions in x86 need validation (much of it legacy), but I don't believe it is a significant fraction of the overall verif effort on a (say) 20 core part, or a notebook part with GPU+accelerators+memory controllers+coherence fabric etc. Similarly, the power advantage of a simpler ISA is overstated given all of the above IP residing on the chip. If ARM wins in the data center, it will not be because it had a purer ISA, but because its ecosystem is superior (s/w and h/w). Like C/C++, html, etc, x86 survived and thrived not because it was pure and efficient, but because it got the job done (cheaper, better, faster).
azazel1024 - Tuesday, December 20, 2016 - link
Not always. For example, at work we are looking to build some database servers running MS SQL on them. Since it is licensed per 2 cores (as is a lot of server software these days), the fewer cores we run, the cheaper. A couple of dual socket Xeon E7-8893v4 servers is significantly cheaper to set up than a couple of single socket E7-8867v4. Yes, the ultimate performance is a fair amount less, but it is something approaching 75% of the performance, and in exchange it ends up costing about $50,000 less per server on the software side of things.
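
The per-2-core licensing arithmetic can be sketched in a few lines of Python. Everything numeric here is an assumption: the core counts (4 per socket for the low-core part, 18 for the high-core part) are not stated in the comment, the pack price is a hypothetical round figure, and real licensing terms such as per-processor minimums are ignored.

```python
import math


def license_cost(cores_per_socket, sockets, price_per_pack):
    """Software cost for one server when licenses are sold per 2 cores."""
    total_cores = cores_per_socket * sockets
    packs = math.ceil(total_cores / 2)  # one pack covers two cores
    return packs * price_per_pack


PRICE_PER_PACK = 10_000  # hypothetical price in dollars, for illustration

few_fast_cores = license_cost(4, 2, PRICE_PER_PACK)    # assumed 2 x 4 cores
many_slow_cores = license_cost(18, 1, PRICE_PER_PACK)  # assumed 1 x 18 cores

print(few_fast_cores)                    # 40000
print(many_slow_cores)                   # 90000
print(many_slow_cores - few_fast_cores)  # 50000
```

The point of the sketch is only that the software bill scales with core count, so a box with fewer, faster cores can come out ahead even when the hardware itself costs more.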
deltaFx2 - Saturday, December 17, 2016 - link
With SMT on, Skylake is the equivalent of 64 "cores" (no SMT on the Qualcomm cores). If Skylake's thread is just as powerful as a Qualcomm core, why would one switch? Also, there's AMD Naples/Zen due in mid 2017, also 32C64T. To top it, from the description above, QC appears to be a 1P system only whereas the x86 systems will likely also support 2P (so up to 64C128T per rack).

So really, the QC core has two competitors. You might argue that AMD and QC are the real competitors (Intel being deeply entrenched) but the barrier for switching to QC is higher. Unless QC has some fancy killer app/accelerators that neither x86 vendors provide. Will be interesting to see how it shapes up.
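
The thread-count comparison above works out as follows. This is a small Python sketch using only the figures quoted in the comment (32 cores with 2-way SMT for the x86 parts, 48 cores without SMT for the Centriq); the function name is made up.

```python
def hardware_threads(cores_per_socket, threads_per_core, sockets):
    """Hardware threads exposed by one server."""
    return cores_per_socket * threads_per_core * sockets


x86_1p = hardware_threads(32, 2, 1)      # 32C64T, one socket
x86_2p = hardware_threads(32, 2, 2)      # two sockets: 64C128T
centriq_1p = hardware_threads(48, 1, 1)  # 48 cores, no SMT, 1P only

print(x86_1p, x86_2p, centriq_1p)  # 64 128 48
```

Thread counts alone say nothing about per-thread throughput, which is the open question the comment raises.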
Antony Newman - Saturday, December 17, 2016 - link
In 2017 H2, we may find Qualcomm cores are 30% slower (IPC) than Apple's A10, and Apple is 15% lower IPC than Intel Xeon. A 48-core Qualcomm will, if it does not melt in its single socket, perform comparably to a 24-32 core Xeon, where no special AVX 'hardware acceleration' is invoked.
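
For what it is worth, the back-of-the-envelope estimate above can be checked with a short Python sketch. It takes the two speculative IPC deficits from this comment at face value and additionally assumes equal clocks and perfect scaling across cores, neither of which is claimed anywhere in the article.

```python
def xeon_equivalent_cores(cores, deficit_vs_a10, a10_deficit_vs_xeon):
    """Scale a core count by two chained per-core IPC deficits."""
    per_core = (1 - deficit_vs_a10) * (1 - a10_deficit_vs_xeon)
    return cores * per_core


# 30% slower than A10, A10 15% slower than Xeon (speculative figures).
estimate = xeon_equivalent_cores(48, 0.30, 0.15)
print(round(estimate, 2))  # 28.56
```

Under those assumptions the result lands inside the quoted 24-32 core range.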

If, at that point, Intel does not reduce its prices and maintains its ~70% profit margin, Qualcomm will - if the software ecosystem is sound - find acceptance in the server world.

If Qualcomm add hardware acceleration that can offload more computational work than Intel, their 48 core chip will be received even more favourably; delegating to the ARM cores what they are more efficient at handling.

When CCIX eventually matures, those 'bolt on accelerators' are - in my opinion - going to drive their uptake in large scale systems.

At TSMC 10nm / Intel 14nm - Qualcomm will be able to get a foothold.

When TSMC 7nm is available, Qualcomm will no doubt close the gap on architectural IPC and may only be 30% slower than Intel for the CPU core - but they will now have enough silicon area to have a 64 core ARM (perhaps with SVE extensions), and Mellanox et al. ready to help them have accelerated offerings that target desktop to hyperscaler systems.

(Dreaming) Who knows - maybe Apple will use them in a future iMacPro? ;-)

AJ
MrSpadge - Monday, December 19, 2016 - link
I think it's rather going to compete with Xeon-D than Skylake-EP.
iwod - Friday, December 16, 2016 - link
Let's hope it will offer decent single thread performance first. Otherwise I am much more looking forward to AMD Zen.
boeush - Saturday, December 17, 2016 - link
Many years ago, at a point along the Sun's galactic orbit far, far away, there used to exist a company called Sun Microsystems, which tried to push the idea of giant chips full of a myriad tiny, weak cores.

That company no longer exists. One of the reasons: it turns out most software is not easily parallelizable and people would rather run web sites (or other server workloads) with snappy response and an option to scale out through more hardware, than being able to support more simultaneous clients out of the box but with each client experiencing invariably ssssllllloooooowwwwww response - no matter how much money you throw at the hardware...
patrickjp93 - Saturday, December 17, 2016 - link
It got bought by Oracle, and btw, SPARC processors are still made by Oracle and Fujitsu and are in use in some workstations and supercomputers, and they have many "weak" cores.
Ariknowsbest - Saturday, December 17, 2016 - link
Sun Microsystems are thriving under Oracle. And the latest SPARC chips have up to 256 threads at 20nm, perfect for business applications.
