Apple Announces M1 Ultra: Combining Two M1 Maxes For Workstation Performance

Author: Ryan Smith

by Ryan Smith on March 8, 2022 6:00 PM EST

As part of Apple’s spring “Peek Performance” product event this morning, Apple unveiled the fourth and final member of the M1 family of Apple Silicon SoCs, the M1 Ultra. Aimed squarely at desktops – specifically, Apple’s new Mac Studio – the M1 Ultra finds Apple once again upping the ante in terms of SoC performance for both CPU and GPU workloads. And in the process, Apple has thrown the industry a fresh curveball by not just combining two M1 Max dies into a single chip package, but by making the two dies present themselves as a single, monolithic GPU, marking yet another first for the chipmaking industry.

Back when Apple announced the M1 Pro and the ridiculously powerful M1 Max last fall, we figured Apple was done with M1 chips. After all, how would you even top a single 432mm² chip that’s already pushing the limits of manufacturability on TSMC’s N5 process? Well, as it turns out, Apple can do one better. Or perhaps it would be more accurate to say twice as better. For the company’s final and ultimate M1 chip design, the M1 Ultra, Apple has bonded two M1 Max dies together onto a single chip, with all of the performance benefits that doubling their hardware would entail.

The net result is a chip that, without a doubt, manages to be one of the most interesting designs I’ve ever seen for a consumer SoC. As we’ll touch upon in our analysis, the M1 Ultra is not quite like any other consumer chip currently on the market. And while this double-die strategy benefits sprawling multi-threaded CPU and GPU workloads far more than it does more single-threaded tasks – an area where Apple is already starting to fall behind – in the process they’re breaking new ground on the GPU front. By enabling the M1 Ultra’s two dies to transparently present themselves as a single GPU, Apple has kicked off a new technology race for placing multi-die GPUs in high-end consumer and workstation hardware.

M1 Max + M1 Max = M1 Ultra

At the heart of the new M1 Ultra is something a bit older: the M1 Max. Specifically, Apple is using two M1 Max dies here, and then bonding them together to form a massive amalgamation of 114B transistors.

As M1 Max itself has been shipping for the last 5 months, the basic architecture of the chip (and its underlying blocks) is at this point a known quantity. M1 Ultra isn’t introducing anything new in terms of end-user features in that respect, and instead the chip is all about scaling up Apple’s M1 architecture one step further by placing a second silicon die on a single chip.

Starting with speeds and feeds, by placing two M1 Max dies on a single package, Apple has doubled the amount of hardware at their disposal in virtually every fashion. This means twice as many CPU cores, twice as many GPU cores, twice as many neural engine cores, twice as many LPDDR5 memory channels, and twice as much I/O for peripherals.

On the CPU front, this means Apple now offers a total of 20 CPU cores. This is comprised of 16 of their performance-focused Firestorm cores, and 4 of their efficiency-focused Icestorm cores. Given that M1 Ultra is aimed solely at desktops (unlike M1 Max) the efficiency cores don’t have quite as big of a role to play here since Apple doesn’t need to conserve energy down to the last joule. Still, as we’ve seen they’re fairly potent cores on their own, and will help add to the CPU throughput of the chip in heavily threaded scenarios.

As is typical for an Apple product announcement, the company isn’t disclosing clockspeeds here. The desktop-focused nature of the chip means that, if they desire, Apple can push clockspeeds a bit higher than they did on the M1 Max, but they would need to leave their energy efficiency sweet spot to do it.

In practice, I will be surprised if the M1 Ultra CPU cores are clocked much higher than on the M1 Max. Which for Apple’s CPU performance is a mixed blessing. For multithreaded workloads, 16 Firestorm cores are going to provide enough throughput to top some performance charts. But for single/lightly-threaded workloads, Firestorm has already been outpaced by newer architectures such as Intel’s Golden Cove CPU architecture. So don’t expect to see Apple recover the lead for single-threaded performance here; instead it’s all about MT and especially energy efficiency.

Meanwhile, doubling the number of M1 Max dies on the chip means that Apple is able to double the number of memory channels on the chip, and thus their overall memory bandwidth. Whereas M1 Max had 16 LPDDR5-6400 channels for a total of 408GB/second of memory bandwidth, M1 Ultra doubles that to 32 LPDDR5 channels and 800GB/second of memory bandwidth. And as with the M1 Max, this is accomplished by soldering the LPDDR5 chips directly to the chip package, for a total of 8 chips on M1 Ultra.
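
As a rough check on those bandwidth figures, the arithmetic can be sketched in a few lines of Python. The 32-bit channel width is an assumption, chosen because it makes the 16 channels quoted above add up to a 512-bit interface; the sketch also assumes M1 Ultra runs its memory at the same LPDDR5-6400 data rate as M1 Max. The function name is purely illustrative.

```python
def peak_bandwidth_gb_s(channels: int, bits_per_channel: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = channels * bits_per_channel / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000


# Assumption: 32 bits per channel (not stated in the article).
ASSUMED_BITS_PER_CHANNEL = 32
# LPDDR5-6400 means 6400 mega-transfers per second per pin.
DATA_RATE_MT_S = 6400

m1_max = peak_bandwidth_gb_s(16, ASSUMED_BITS_PER_CHANNEL, DATA_RATE_MT_S)
m1_ultra = peak_bandwidth_gb_s(32, ASSUMED_BITS_PER_CHANNEL, DATA_RATE_MT_S)

print(f"M1 Max:   {m1_max:.1f} GB/s")
print(f"M1 Ultra: {m1_ultra:.1f} GB/s")
```

Under those assumptions the theoretical peaks come out slightly above the rounded 408GB/second and 800GB/second figures, which is what you would expect from marketing-friendly rounding.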

The doubled memory chips also allow Apple to double the total amount of memory available in their hardware. Whereas M1 Max topped out at 64GB, M1 Ultra tops out at 128GB. This is still less memory than could be found on a true high-end workstation (such as a Mac Pro), but it puts Apple ahead of all but the highest-end PC desktops, and should be plenty sufficient for their content creator crowd.

As we saw with the launch of the M1 Max, Apple already provides more bandwidth to their SoCs than the CPU cores alone can consume, so the doubled bandwidth isn’t likely to have much of an impact there, other than ensuring that the CPU cores are just as well fed as they are on the M1 Max. Instead, all of this extra memory bandwidth is meant to keep pace with the growing number of GPU cores.

Which brings us to the most interesting aspect of the M1 Ultra: the GPU. With 32 GPU cores, M1 Max was already setting records for a monolithic, integrated GPU. And now Apple has doubled things to 64 GPU cores on a single chip.
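
To keep the doubling straight, the headline resources discussed so far can be tallied in a short Python sketch. The `SocSpec` class and its field names are hypothetical, made up for this illustration; the per-die CPU core counts are inferred by halving the M1 Ultra totals given above, while the GPU, memory channel and memory capacity figures are the M1 Max numbers quoted in the text.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class SocSpec:
    """Hypothetical container for the headline resources of one package."""
    performance_cores: int
    efficiency_cores: int
    gpu_cores: int
    memory_channels: int
    max_memory_gb: int

    def scaled(self, dies: int) -> "SocSpec":
        """Spec of a package built from `dies` identical copies of this die."""
        return SocSpec(*(value * dies for value in astuple(self)))


# M1 Max: 8 Firestorm + 2 Icestorm cores (inferred), 32 GPU cores,
# 16 LPDDR5 channels, 64GB maximum memory.
M1_MAX = SocSpec(8, 2, 32, 16, 64)
m1_ultra = M1_MAX.scaled(2)

print(m1_ultra)
print("Total CPU cores:", m1_ultra.performance_cores + m1_ultra.efficiency_cores)
```

Treating the Ultra as a plain multiple of the Max only describes the hardware on the package; whether performance scales the same way is a separate question.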

Unlike multi-die/multi-chip CPU configurations, which have been commonplace in workstations for decades, multi-die GPU configurations are a far different beast. The amount of internal bandwidth GPUs consume, which for high-end parts is well over 1TB/second, has always made linking them up technologically prohibitive. As a result, in a traditional multi-GPU system (such as the Mac Pro), each GPU is presented as a separate device to the system, and it’s up to software vendors to find innovative ways to use them together. In practice, this has meant having multiple GPUs work on different tasks, as the lack of bandwidth meant they can’t effectively work together on a single graphics task.

But, if you could somehow link up multiple GPUs with a ridiculous amount of die-to-die bandwidth – enough to replicate their internal bandwidth – then you might just be able to use them together in a single task. This has made combining multiple GPUs in a transparent fashion something of a holy grail of multi-GPU design. It’s a problem that multiple companies have been working on for over a decade, and it would seem that Apple is charting new ground by being the first company to pull it off.

UltraFusion: Apple’s Take On 2.5D Chip Packaging

The secret ingredient that makes this all possible – and which Apple has been keeping under wraps until today – is that M1 Max has a very high speed interface along one of its edges. An interface that, with the help of a silicon interposer, allows two M1 Max dies to be linked up.

Apple calls this packaging architecture UltraFusion, and it’s the latest example in the industry of 2.5D chip packaging. While the details vary from implementation to implementation, the fundamentals of the technology are the same. In all cases, some kind of silicon interposer is put beneath two chips, and then signals between the two chips are routed through the interposer. The ultra-fine manufacturing capabilities of silicon mean that an enormous number of traces can be routed between the two chips – in Apple’s case, over 10,000 – which allows for an ultra-wide, ultra-high bandwidth connection between the two chips.

Officially, Apple only states they’re using a silicon interposer here, which is the generic term for this technology. But, going by Apple’s promotional videos and mockup animations, it looks like they’re using a small silicon bridge of some sort. Which would make this similar in implementation to Intel’s EMIB technology or AMD’s Elevated Fanout Bridge (EFB) technology. Both of these are already on the market and have been used for years, so Apple is far from the first vendor to use the technology. But what they’re using it for is quite interesting.

With UltraFusion, Apple is able to offer an incredible 2.5TB/second of bandwidth between the two M1 Max dies. Even if we assume that this is an aggregate figure – adding up both directions at once – that would still mean that they have 1.25TB/second of bandwidth in each direction. All of which is approaching how much internal bandwidth some chips use, and exceeds Apple’s aggregate DRAM bandwidth of 800GB/second.
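
For a sense of scale, the figures above can be run through some quick arithmetic in Python. This is only a sketch: it assumes, as hypothesized above, that 2.5TB/second is an aggregate of both directions, and it further assumes that every one of the 10,000 traces carries data, which is almost certainly too generous since some will be used for clocking and control. The variable names are illustrative.

```python
ULTRAFUSION_AGGREGATE_TB_S = 2.5  # Apple's quoted die-to-die bandwidth
DRAM_BANDWIDTH_GB_S = 800         # M1 Ultra aggregate DRAM bandwidth
TRACE_COUNT = 10_000              # "over 10,000", so treat this as a lower bound

aggregate_gb_s = ULTRAFUSION_AGGREGATE_TB_S * 1000

# Assumption: the aggregate figure splits evenly between the two directions.
per_direction_gb_s = aggregate_gb_s / 2
ratio_to_dram = per_direction_gb_s / DRAM_BANDWIDTH_GB_S

# Assumption: every trace carries data. Because the trace count is a lower
# bound, this per-trace rate is an upper bound.
per_trace_gbit_s = aggregate_gb_s * 8 / TRACE_COUNT

print(f"Per direction:   {per_direction_gb_s:.0f} GB/s")
print(f"Versus DRAM:     {ratio_to_dram:.2f}x")
print(f"Per-trace rate:  {per_trace_gbit_s:.1f} Gbit/s at most")
```

The per-trace figure is the interesting one: the link gets its bandwidth from sheer width rather than from pushing each wire especially fast, which is the whole point of routing through silicon.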

We’ll go more into this in the obligatory follow-up article, but the important point to take away here is that Apple has become the first vendor to bond two GPUs together with such a massive amount of bandwidth. This is what’s enabling them to take a stab at presenting the two GPUs as a single device to the OS and applications, as it allows them to quickly shuffle data between the GPUs as necessary.

But it should also be noted that there are plenty of details that can make or break the usefulness of this approach. For example, is 2.5TB/second enough, given the high performance of the GPUs? And what is the performance impact of the additional latency in going from GPU to GPU? Just because Apple has doubled the number of GPU cores by gluing them together doesn’t mean Apple has doubled their GPU performance. But at the end of the day, if it works even remotely well, then the implications for GPU designs going forward are going to be immense.

GPU Performance: Exceeding GeForce RTX 3090

Thanks to UltraFusion, Apple has become the first vendor to ship a chip that transparently combines two otherwise separate GPUs. And while we’ll have to wait for reviews to find out just how well this works in the real world, Apple is understandably excited about their accomplishment, and the performance implications thereof.

In particular, the company is touting that the M1 Ultra’s GPU performance exceeds that of NVIDIA’s GeForce RTX 3090, which at the moment is the single fastest video card on the market. And furthermore, that they’re able to do so while consuming a bit over 100 Watts, or 200 Watts less than the RTX 3090.

From a performance standpoint, Apple’s claims look reasonable, assuming their multi-GPU technology works as advertised. For as fast as the RTX 3090 is, it can’t be overstated just how many more transistors Apple is throwing at the matter than NVIDIA is; the GA102 GPU used by NVIDIA has 28.3 billion transistors, while the combined M1 Ultra is 114 billion. Not all of which are being used for graphics on the M1 Ultra, of course, but with so many transistors, Apple doesn’t have to be shy about throwing more silicon at the problem.
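
The size of that transistor gap is easy to put a number on. The Python below is a back-of-the-envelope sketch using only the two counts quoted above; the even split between the two M1 Max dies is an assumption based on the dies being identical.

```python
M1_ULTRA_TRANSISTORS_BILLIONS = 114.0
GA102_TRANSISTORS_BILLIONS = 28.3

# Assumption: the two M1 Max dies are identical, so each holds half the total.
per_die_billions = M1_ULTRA_TRANSISTORS_BILLIONS / 2

ultra_vs_ga102 = M1_ULTRA_TRANSISTORS_BILLIONS / GA102_TRANSISTORS_BILLIONS
die_vs_ga102 = per_die_billions / GA102_TRANSISTORS_BILLIONS

print(f"Per M1 Max die:        {per_die_billions:.0f}B transistors")
print(f"M1 Ultra vs GA102:     {ultra_vs_ga102:.1f}x")
print(f"One M1 Max die vs GA102: {die_vs_ga102:.1f}x")
```

Keep in mind that these ratios compare an entire SoC (CPU cores, caches, media blocks and all) against a dedicated GPU, so they overstate the graphics-specific advantage.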

The amount of silicon Apple has at their disposal is also one of the keys to their low power consumption. As we’ve already seen with the M1 Max, Apple has built a wide enough GPU that they can keep clockspeeds nice and low on the voltage/frequency curve, which keeps overall power consumption down. The RTX 3090, by contrast, is designed to chase performance with no regard to power consumption, allowing NVIDIA to get great performance out of it, but only by riding high on the voltage/frequency curve. And of course, Apple enjoys a huge manufacturing process advantage here, using TSMC’s N5 process versus Samsung’s 8nm process.
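
Why a wide, slow GPU saves power can be illustrated with the usual first-order approximation for CMOS dynamic power, which scales with switched capacitance times voltage squared times frequency. The Python below is a toy model, not a description of either chip: the unit counts, voltages and frequencies are made-up relative values, the function names are hypothetical, and the model ignores leakage and assumes throughput scales perfectly with both width and clockspeed.

```python
def relative_dynamic_power(units: float, voltage: float, frequency: float) -> float:
    """First-order dynamic power: proportional to units * V^2 * f."""
    return units * voltage ** 2 * frequency


def relative_throughput(units: float, frequency: float) -> float:
    """Idealized throughput: proportional to units * f (perfect scaling assumed)."""
    return units * frequency


# Hypothetical "narrow and fast" design: the baseline, normalized to 1.0.
narrow = {"units": 1.0, "voltage": 1.0, "frequency": 1.0}
# Hypothetical "wide and slow" design: twice the units at half the clock,
# which (by assumption) lets the voltage drop to 75% of the baseline.
wide = {"units": 2.0, "voltage": 0.75, "frequency": 0.5}

narrow_power = relative_dynamic_power(**narrow)
wide_power = relative_dynamic_power(**wide)
narrow_perf = relative_throughput(narrow["units"], narrow["frequency"])
wide_perf = relative_throughput(wide["units"], wide["frequency"])

print(f"Throughput, wide vs narrow: {wide_perf / narrow_perf:.2f}x")
print(f"Power, wide vs narrow:      {wide_power / narrow_power:.2f}x")
```

The savings in this toy model come entirely from the squared voltage term, and the price is silicon area, which is exactly the resource Apple has in abundance here.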

Still, given the ground-breaking nature of what Apple is trying to pull off with their transparent multi-GPU design, it has to be emphasized that Apple’s performance claims should be taken with a grain of salt, at least for now. Apple typically doesn’t do things half-baked, but as combining two GPUs in this fashion is yet unproven, a bit of skepticism is healthy here.

First Thoughts

While Apple has telegraphed their intention to scale up their chip designs since the first days of their Apple Silicon-powered Macs, I believe it’s safe to say that the M1 Ultra exceeds most expectations. Having reached the practical limits of how big they can make a single die, Apple has taken the logical next step and started placing multiple dies on a single chip in order to build a workstation-class processor. A step that is necessary, given the constraints, but also a step that is historically more cutting edge than is typical even for Apple.

The net result is that Apple has announced a SoC that has no peer in the industry across multiple levels. Going multi-die/multi-chip in a workstation is a tried and true strategy for CPUs, but to do so with GPUs will potentially put Apple on a level all of their own. If their transparent multi-GPU technology works as well as the company claims, then Apple is going to be even farther ahead of their competitors both in performance and in developing the cutting-edge technologies needed to build such a chip. In that respect, while Apple is trailing the industry a bit with their UltraFusion 2.5D chip packaging technology, what they’re attempting to do with it is more than making up for lost time.

All of which is to say that we’re very eager to see how M1 Ultra performs in the real world. Apple has already set a rather high bar with the M1 Max, and now they’re aiming to exceed it with the M1 Ultra. And if they can deliver on those goals, then they will have twice set a new high point for SoC design in the span of just 6 months. These are exciting times, indeed.

219 Comments

Oxford Guy - Wednesday, March 9, 2022 - link
120 FPS is irrelevant for some genres of games that benefit from rich graphics, such as turn-based RPG. Those require few frames but benefit from graphical richness. Various simulation games do not require a high frame rate.

I find it humorous and unfortunate that the mentality of PC gamers continues to be locked into FPS-style games. It’s an entire genre of art; there is a lot more possible.
blppt - Wednesday, March 9, 2022 - link
While I agree that so far the overall RT performance has not been up to snuff even with this generation of cards (particularly the 6900XT), there are always a portion of gamers who prefer eye candy to throwing 200fps at the screen. If there weren't, Nvidia and AMD wouldn't have bothered supporting that feature.
mattbe - Wednesday, March 9, 2022 - link
Maybe in certain games, but not in any meta analysis.

I am convinced that you are a troll at this point. This is pretty sad.
Alistair - Tuesday, March 8, 2022 - link
also don't forget the 3090 is almost 2 years old now, Apple is ahead now, wait for the 4090 to beat Apple, that is the cycle, don't claim ignorance
OreoCookie - Tuesday, March 8, 2022 - link
You seem stuck on the association integrated graphics = slow graphics. Current-gen consoles already proved this wrong, and finally with the M1 Pro and M1 Max we know that this also works for other platforms. Since Anandtech has already benchmarked the M1 Max, we know what to expect for the Ultra. In gaming benchmarks Apple's chips aren't doing so great, but that is down to drivers. In productivity applications the M1 Max (not the Ultra) already got into the same spheres as fast discrete desktop graphics. Plus, the RTX3090 is getting old and is built on an older process, so Apple's claims aren't surprising.
robotManThingy - Wednesday, March 9, 2022 - link
Being "integrated" is now an advantage. I fail to see how GPUs separated from the CPUs and memory are somehow going to be faster than when they are all bundled together using a common memory space.
name99 - Wednesday, March 9, 2022 - link
Spoken like a true gamer.
You do realize that the world is larger than games, right?
A HUGE aspect of these chips is that they provide a much larger memory capacity for the GPU than either the 3090 (24G) or even a maxed out 48GB Quadro. Which is of immense importance to, for example, people playing with large neural nets...

Yes yes, we get it, Apple is not the preferred platform for playing games. This is not news, is not interesting, and will not change soon. What matters is that it's the preferred for people engaged in tasks other than playing games...
Alistair - Tuesday, March 8, 2022 - link
no, their performance claims have been spot on so far, just remember most games are not programmed well for Mac

the M1 Max already beat the RTX 3080 in Shadow of the Tomb Raider, so this new chip will beat the 3090 by a large margin in some games, and be much less in others
Alistair - Wednesday, March 9, 2022 - link
(just to be clear I meant it beats the 3080 mobile at a limited TDP like 100W or less, the 3090 is a lot faster, so the M1 Ultra being twice as fast as the M1 Max is necessary to equal the 3090)
halo37253 - Wednesday, March 9, 2022 - link
I'm sorry but the M1 Max Struggles to even compete with a RTX 3060 mobile in games.

There has been plenty of tests done on this. The M1 Max GPU is only impressive in synthetic benchmarks. Doubling the cores is not going to help, it will still be slower than a 3080 mobile in actual workloads.

People are quick to point at drivers or API calls as the reason for low gaming performance. But this is just BS, these are not gaming focused GPUs. Back in the day AMD's GCN arch was capable of a good deal higher synthetic benchmarks than nvidia chips while performing around the same game wise.

The M1 is not impressive from a GPU standpoint, and couldn't even compete with something like a steamdeck on performance per watt even given the massive node advantage. The only area M1 is impressive is CPU and Hardware offloading. With the Level of integration, nearly everything being on the package allows for some pretty awesome performance per watt. Add on the massive Node Advantage, AMD nor Intel has any plans to go down that path.

Currently the money is on the Data Center and having access to many PCI-E Lanes for fast SSD Storage. Who knows when we'll get the Mac Pro M1 refresh, as the PCI-E Lanes just aren't there. Maybe Apple will address this with Future M2 Chips, but who knows. Seems like Mac Pro may again go away in favor of Mac Studio. And again back to the Thunderbolt addons.