Hot Chips 31 Live Blogs: IBM's Next Generation POWER
by Dr. Ian Cutress on August 19, 2019 12:00 PM EST
12:34PM EDT - We're here at Hot Chips 31 / 2019, and the first talk to be live blogged is IBM's newest variant of its POWER CPUs.
12:37PM EDT - Quite possibly the biggest Hot Chips crowd I can remember.
12:45PM EDT - The Arm talk is set to finish here in a bit, then IBM will start
12:45PM EDT - We already covered Arm's Neoverse N1 strategy earlier in the year: https://www.anandtech.com/show/13959/arm-announces-neoverse-n1-platform
12:55PM EDT - Just finishing up the previous talk
12:57PM EDT - Hopefully this is about POWER10 :)
12:57PM EDT - It could be the Power9 IO chip
12:58PM EDT - 2018 talk was about Power9 SU core
12:58PM EDT - IBM now has a family of processors. It starts with one up front, then builds out the rest of the family
12:58PM EDT - Scale out first, then scale up
12:58PM EDT - One optimized for dual socket, one optimized for 16 sockets
12:59PM EDT - Power9 AIO does things they wanted to do before Power10
12:59PM EDT - new accelerator technology deployed on Power9
12:59PM EDT - Today is about Power9
12:59PM EDT - Power10 for 2021
12:59PM EDT - New core on Power10 and new transistor technology in 2021
01:00PM EDT - Accessing heterogeneous systems
01:00PM EDT - Need to focus on diverse acceleration devices and diverse memory devices beyond CPUs
01:01PM EDT - Need to focus on heterogeneous systems, not just GHz
01:01PM EDT - Need to deploy different types of heterogeneous systems
01:01PM EDT - Trying to remove the different types of SerDes on a chip. Want to consolidate these down to a single design
01:02PM EDT - On Power9, there are now only two types of SerDes: PCIe, and everything else built on 25G SerDes
01:02PM EDT - SerDes can make something area and power efficient when fixed to 25G, then just scale the number of links
01:02PM EDT - Take all the 25G signals from the chip and deploy composable systems across different accelerator technologies
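(Aside: a quick back-of-envelope sketch of what fixing the SerDes design at 25G and just scaling the link count buys; the lane counts below are illustrative, not from the slides.)

```python
# Back-of-envelope numbers (mine, not IBM's): with one fixed-rate 25G
# SerDes design, off-chip capacity is simply lane count times line rate.
# These are raw signalling rates, before any protocol/encoding overhead.

LANE_GBPS = 25  # Gbit/s per 25G SerDes lane, as described in the talk

def raw_bandwidth_gbs(lanes: int) -> float:
    """Raw unidirectional bandwidth in GB/s for a given lane count."""
    return lanes * LANE_GBPS / 8  # convert bits to bytes

for lanes in (8, 16, 48):
    print(f"x{lanes:<2d} link -> {raw_bandwidth_gbs(lanes):5.1f} GB/s raw, per direction")
```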
01:03PM EDT - NVLink, OpenCAPI, and OMI
01:03PM EDT - OMI is the memory interface to connect memory across SerDes
01:04PM EDT - On-chip Gzip accelerator
01:04PM EDT - IBM has delivered the #1 and #2 supercomputers on the Top500 list
01:04PM EDT - Built for the AI era
01:05PM EDT - Now OpenCAPI, IBM sees it as being very important in future accelerator systems
01:05PM EDT - Minimizing overhead and latency that PCIe has
01:05PM EDT - Accelerators not only GPU, but SmartNICs, networking, FPGAs, AI accel
01:06PM EDT - Want software to take data from anywhere in the system on any device
01:06PM EDT - (some of the images here look low quality - click through to see full quality)
01:06PM EDT - Power9 has direct attached memory
01:07PM EDT - Some of the former secret sauce technologies are in the new open memory standard
01:07PM EDT - Can deal with asymmetry
01:08PM EDT - Having this connectivity allows for independent development of accelerators rather than focusing on the CPU
01:09PM EDT - Don't want programmers to worry about host-to-device connectivity
01:09PM EDT - Also OpenCAPI helps with security
01:09PM EDT - Prevents an accelerator crashing a whole system
01:10PM EDT - Need to make sure accelerators can't add in potential cache coherent bugs
01:11PM EDT - Aligned all packets with the deserialized interface
01:11PM EDT - Accelerators always see aligned data to help make assumptions for performance
01:11PM EDT - Can start processing the command before checking the CRC
01:12PM EDT - Separately pipelined control/tag vs data
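(Aside: a toy software sketch of the start-before-CRC idea above; the packet framing and function names are invented for illustration, and the real mechanism is fixed-function hardware, not Python.)

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Toy illustration of the 'start before CRC' idea: because packets arrive
# aligned, the receiver can begin decoding a command speculatively while
# the CRC check runs in parallel, and only commit the result if the CRC
# passes. All names here are invented for the example.

def check_crc(payload: bytes, expected: int) -> bool:
    return zlib.crc32(payload) == expected

def process_command(payload: bytes) -> str:
    # Stand-in for speculative command decode/execution.
    return f"decoded {len(payload)}-byte command"

def receive(payload: bytes, crc: int):
    with ThreadPoolExecutor(max_workers=2) as pool:
        result_f = pool.submit(process_command, payload)  # speculative work
        crc_ok_f = pool.submit(check_crc, payload, crc)   # runs in parallel
        result, crc_ok = result_f.result(), crc_ok_f.result()
    return result if crc_ok else None  # discard speculative work on bad CRC

pkt = b"\x01" * 64
print(receive(pkt, zlib.crc32(pkt)))  # CRC passes: result committed
print(receive(pkt, 0xDEADBEEF))       # CRC mismatch: result dropped
```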
01:13PM EDT - (coherence over switching is not supported in OpenCAPI due to complexity)
01:14PM EDT - 1/6th the cost in die area to put OMI instead of DDR
01:14PM EDT - So memory is easier to support
01:14PM EDT - Can enable more bandwidth in smaller ASICs with OMI
01:15PM EDT - Differential buffer attach is now agnostic - the buffer is on the memory
01:15PM EDT - Can put buffered DDR or GDDR, rather than one or the other
01:16PM EDT - OMI is lighter weight and open to enable more ecosystem support
01:17PM EDT - With OMI memory, based on OpenCAPI SerDes, can use multiple DDR4 and DDR5 on the same system with the same connector
01:18PM EDT - e.g. if enabled on AMD sIOD, would decouple memory technology from host silicon development
01:19PM EDT - Power9 Advanced IO chip = P9 AIO
01:19PM EDT - 728mm2, 8B transistors
01:19PM EDT - 24 SMT4 cores, 120 MB eDRAM L3
01:19PM EDT - Built on 14FF (GF?)
01:19PM EDT - 17 layer metal stack
01:19PM EDT - 16 channels of x8 OMI, 650 GB/s peak r/w bandwidth
01:20PM EDT - 48 lanes of PCIe 4.0
01:20PM EDT - Up to x16 CAPI 2.0
01:20PM EDT - Up to x48 NVLink attach
01:20PM EDT - Shows 2S replacement, but can scale to 16 socket
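(Aside: a quick sanity check on the 650 GB/s figure a few lines up, assuming the commonly quoted ~25.6 Gbit/s OMI lane rate; the overhead interpretation in the final comment is my inference, not something IBM stated.)

```python
# Sanity check on the quoted OMI bandwidth (my arithmetic, not IBM's):
# 16 channels x 8 lanes, assuming ~25.6 Gbit/s per lane.

channels, lanes, gbit_per_lane = 16, 8, 25.6

per_direction = channels * lanes * gbit_per_lane / 8  # GB/s, one direction
bidirectional = 2 * per_direction                     # raw read + write

print(f"raw per direction: {per_direction:.0f} GB/s")  # ~410 GB/s
print(f"raw bidirectional: {bidirectional:.0f} GB/s")  # ~819 GB/s
# The quoted 650 GB/s peak r/w lands between the two, consistent with a
# combined read+write figure at roughly 80% of the raw line rate.
```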
01:21PM EDT - OpenCAPI 4.0
01:21PM EDT - support for 64/128/256B cache lines
01:21PM EDT - supports 128B messages for low latency
01:22PM EDT - Supports virtual address cache for system memory
01:22PM EDT - Host manages the higher level cache coherency
01:23PM EDT - P9 SO supports 4x DDR4, P9 SU supports 4x Centaur, P9 AIO supports 8x OMI
01:23PM EDT - On each side
01:24PM EDT - OMI DDIMM looks very different
01:24PM EDT - Will see if I can get a better photo
01:25PM EDT - Microchip SMC1000 chip used on the OMI DDIMM
01:25PM EDT - effective bandwidth and latency equivalent to LRDIMM
01:26PM EDT - Q: energy per bit on memory vs DDR?
01:27PM EDT - A: Don't have numbers here. We shifted power from the DDR PHY onto the memory DIMM, which helps with cooling. The 8-lane memory device can move to 2 lanes or 4 lanes depending on use, and it shifts dynamically based on utilization. Better than DDR anyway
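(Aside: a hypothetical sketch of the utilization-driven lane scaling described in that answer; the thresholds and policy here are invented for illustration, the real logic lives in the memory buffer/controller.)

```python
# Hypothetical sketch of utilization-driven lane-width scaling: the x8
# memory link narrows to x4 or x2 when traffic is light, trading peak
# bandwidth for power. Thresholds are invented for this example.

def pick_width(utilization: float) -> int:
    """Choose a lane width for the observed link utilization (0.0 to 1.0)."""
    if utilization > 0.50:
        return 8   # heavy traffic: stay at full width
    if utilization > 0.20:
        return 4   # moderate traffic: half width saves PHY power
    return 2       # near idle: minimum width

for u in (0.90, 0.35, 0.05):
    print(f"utilization {u:.0%} -> x{pick_width(u)} link")
```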
01:28PM EDT - Q: Does the OMI DDIMM have a cache? A: No, it's a slimmer device with write buffering, no caching
01:29PM EDT - Q: Is OMI like CXL? A: We view CXL as focused more on accelerators. OMI is available today, ahead of the competition, and has been in development a long time. I'd be surprised if other buffered memory solutions get as low latency as ours. I'd be surprised if CXL has such low latency to memory
01:30PM EDT - That's it for this talk. Small break now, next talk for live blogging is MLPerf
17 Comments
nyoungman - Monday, August 19, 2019
The OpenPOWER Summit is happening at the same time, with a livestream: https://www.youtube.com/watch?v=bpAv91NszoQ
The roadmap has PCIe Gen4 and a new memory subsystem for POWER9 in 2020, and POWER10 with PCIe Gen5 coming in 2021.
mode_13h - Monday, August 19, 2019
They already had PCIe 4.0 for a couple of years now. POWER was the first kid on the block to have it.
Threska - Monday, August 19, 2019
Someone might want to run the images through a program for a little tweaking to improve clarity.
Ian Cutress - Monday, August 19, 2019
Click through, you get the full quality. (I mentioned this during the talk)
aryonoco - Monday, August 19, 2019
728mm2... wow. Talk about a behemoth.
Would love to know the price of a 16 socket server... yeah I know it's not for us mere mortals.
A 2S EPYC 7742 gets you 128 cores and 256 threads for $14,000. This thing would get you 3 times the number of cores and 6 times the number of threads for 50x the price probably? And that might be lowballing it.
On the other hand, lots of people in the HPC crowd, the DoD, the DoE, and various other agencies who buy these don't care about hardware cost anyway.
mode_13h - Monday, August 19, 2019
14 nm? *yawn* I guess that's due to some GloFo contract?
These guys are treading down the same path towards technical irrelevance as SPARC. "Remember POWER?" That's going to be the new refrain, to replace "Remember Alpha?"
aryonoco - Monday, August 19, 2019
10nm is not suitable for a high powered chip, and 7nm is nowhere near mature enough to fab a 728mm2 chip.