Examining Soft Machines' Architecture: An Element of VISC to Improving IPC

Name: Examining Soft Machines' Architecture: An Element of VISC to Improving IPC
Item: Examining Soft Machines' Architecture: An Element of VISC to Improving IPC
Author: Dr. Ian Cutress

by Ian Cutress on February 12, 2016 8:00 AM EST

97 Comments | Add A Comment

97 Comments

Last week, Soft Machines announced that their 'VISC' architecture was available for licensing, following the announcement of the original concepts over a year ago. VISC, in a nutshell, is designed as a solution to improving the number of instructions per clock a single thread can process in a given time, which potentially makes it a very interesting design in an era where IPC gains are harder and harder to realize.

The concepts behind their new ‘VISC’ architecture, which splits the workload of a single linear thread across multiple cores, are intriguing and exciting. But as with any new fundamental change in computer processing, subject to a large barrage of questions. We were invited to a presentation and call with the President and Chief Technical Officer Mohammed Abdallah and the VP Marketing and Business Mark Casey, and I put a number of questions on the lips of analysts to them.

Identifying Single Thread Performance Bottlenecks

Any discussion about processor performance over the last couple of decades has involved several factors, including getting better performance through an increased power budget, a higher frequency, extracting instruction level parallelism (ILP), getting better at minimizing delays through better branch prediction, or adding more cores and improving thread level parallelism (TLP). Each of these methods have varying degrees of success at increasing performance – long-time readers will remember the Pentium 4 days of hitting a frequency and power wall which then switched the focus to efficiency. Some tasks, like graphics, are inherently parallel and can take advantage of multiple hundreds or thousands of cores, or the software can be optimized. However, the nature of most software code and instructions is that they are single threaded by nature, and their performance relies on how fast the instructions can be processed within a single thread.

The main way of increasing performance, or in this case the instructions per unit frequency (instructions per clock, or IPC), is to expand the CPU architecture to allow more commands to be processed at once. Moving from a 3-wide out-of-order architecture to a 5-wide out-of-order architecture theoretically allows for a 66% increase in instruction throughput if (and only if) the code is sufficiently dense enough to extract those operations, and the other features in the architecture can ensure all the operations are fed every clock cycle.

The problem with moving to a wider architecture is typically power and design complexity. As shown by various chip designs over the years, the wider the architecture the more silicon has to be set aside for assets like buffers, re-order windows and caching. If there is a silicon budget and enough power headroom, we see designs like the six-wide Intel Skylake cores or the seven wide NVIDIA Denver cores able to extract peak performance when code is written that matches the hardware. However the potential downside of a wide architecture is that it remains inefficient for sets of instructions that only need a 2-wide or a 3-wide architecture. Alternatively, if multiple programs or threads want to use the hardware, then a single core is inaccessible to additional threads while the first thread is still in use (though this can be avoided somewhat by simultaneous multithreading or SMT which will let another thread have access when the first has encountered a stall such as waiting for L1/L2 memory).

As a result, modern designs also include a number of cores to handle the multile thread/multiple program scenario. Generally speaking this works well, especially with high-performance cores, but it becomes a bit of an issue itself when much of the world’s hardware is actually composed of many cores that have poor single threaded performance. Older Core 2 / Conroe systems, basic Bulldozer, or ARM Cortex-A7 designs are (still) widely used and often ship with multiple cores to allow for multiple programs at once. And while they can scale up with additional threads to the number of cores they offer, if any single or lightly-threaded software needs more performance, those extra cores are not used or are only minimally beneficial overall.

This brings us to Soft Machines, whose VISC architecture aims to change this.

Meet VISC

I should start by saying that despite the similarities to other architectural names, VISC is not an acronym. I asked directly and it is merely a noun for the purposes of trademarking. People can interpret it as a ‘virtual instruction set computing’ or something similar, but the company doesn’t apply any acronym to the letters.

But a virtual instruction set is a good description here. For the most part, processor architectures were traditionally built around either CISC (complex) or RISC (reduced) instruction sets and execution models, while more modern designs (e.g. Intel Core) are increasingly a mix, or so-called ‘CRISC’ design. The difference between CISC and RISC boils down to the fact that simpler designs can be more power efficient, but complex designs can do more complicated things in fewer cycles, all the while CRISC essentially meets the two paradigms in the middle in an attempt to gain the benefits of both, though not without inheriting some of the drawbacks as well. VISC, for lack of a better description, is a RISC design using a custom instruction set over a translation layer which allows a single thread of operations to be dispatched over multiple physical cores. The base diagram looks something like this:

Here is an example of a VISC design with four physical cores. The design can handle four ‘virtual cores’ or threads as well, but what makes the VISC design different is that when the virtual core has a thread of instructions, it can use the resources of any physical core. Thus, if each physical core is a 4-wide out-of-order design, if a thread running on a virtual core can utilize the resources of all four cores essentially making a giant 16-wide design, then under VISC can do so.

This should instantly throw up a number of questions on ‘What!? How?! Why?! Power? Frequency? Performance? Efficiency? Complexity?’ and as well as many others in the industry, we had the same questions.

The VISC ISA and Core Pipeline

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

97 Comments

View All Comments

kgardas - Saturday, February 13, 2016 - link
Market is not dominated by excellent technology, but by average or mediocre in fact. "Good enough" is enemy of any excellence.
Also in comparison with AMD64 which is just pile of hacks to prolong x86 architecture life, IA-64 was clean design on green field and its really a pity Intel can't push that further -- also due to AMD64 existence in the market.
Alexvrb - Monday, February 15, 2016 - link
That kind of thinking is what lead to the Itanic, the unsinkable chip. Too bad they ran afoul of a giant costberg.

But no, you're right, everyone should switch to a new ISA because Intel says it's better, even though it means switching to an Intel-exclusive ISA that will cost you dearly now, and even more dearly later when you become dependent on a product that only Intel can produce.

If Intel's only desire was a better design for everyone, they would have worked with AMD and freely extended licensing agreements to IA-64 to them so they could both produce IA-64 chips. The outcome could have very well been different in that scenario. But that is not what they did, and they paid for it. Of course, Intel is such a giant that they can afford to take such failures in stride. AMD can not afford another flop - Zen, Polaris, and eventually a ZenPolaris APU have to achieve at least a significant degree of success.
FunBunny2 - Monday, February 15, 2016 - link
-- Of course, Intel is such a giant that they can afford to take such failures in stride.

the irony, of course, is that Intel got where it is just because, in ~1980, Intel had one foot in chapter 7 and one foot on a banana peel. IOW, an easy controlled peon for IBM to abuse, thus the 8088 came to be. if IBM had secured the BIOS, life for both would have turned out rather differently, I suspect.
Alexvrb - Monday, February 15, 2016 - link
Yeah that is ironic. Intel was trying to avoid making a similar mistake, and in doing so they screwed up - but the SIZE of the failure was tiny in comparison. One thing about Intel is that they have better foresight and planning than IBM. IBM always was caught up in their own world. Intel probably was working on backup plans for Itanium failure before it even launched, regardless of how high they thought its chances were.
diediealldie - Friday, February 12, 2016 - link
Thanks for great article.

Anyway, I'd wait for actual working silicon with high frequency(Not .5Ghz) to figure out if it's real or not.

Since they're making abstraction layer with real silicon, demonstrating it on slow chip will not enough to convince industry experts(Hardware will be complicated so making it work on very high frequency is also big challenge).
High freqency chips requires more pipeline thus latencies and cache efficiency gets worse, hardware blocks not working...etc. so high frequeny chip is what Shasta really have to demonstrate.
dcbronco - Friday, February 12, 2016 - link
Interesting that AMD are switching to SMT with Zen and are one of the big financers of Soft Machines and VISC works well with SMT. I also wonder if an OS written for VISC would give a boost to APUs or would the bottlenecks kill any advantage.
zodiacfml - Saturday, February 13, 2016 - link
Interesting as numerous patents created can be beneficial to other CPU makers but for them creating a compelling chip that could sell would be miraculous.
haplo602 - Saturday, February 13, 2016 - link
Lots of marketing and even more estimates and projections. Looks like a very long road ahead to actual working chip with peripherals. And a lot of obstacles still to clear.

While the idea is interesting, it is highly impractical. The higher frequency they'll go, the more the final compositing overhead will bite them. There's a cost traversing the layers of the design from thread to CPU and back with results. I do not see that explained anywhere.

Another point is, I think this is not a new idea. It is fairly obvious an extension so expect Intel already went that route and met a dead-end.
vladx - Sunday, February 14, 2016 - link
Intel are too conservative to come with such ideas, they'd rather milk the cow as long as possible.
hMunster - Sunday, February 14, 2016 - link
There's one important question I don't see addressed, how do you run at higher IPCs when you have a conditional branch every few instructions? A 16-wide virtual CPU using 4 physical cores is all nice and dandy, but 16 instructions will, in normal x85 code, contain at least 2 branches. I can't see them doing a lot of speculative execution because that drives up the power consumption. So how do they solve this?

Examining Soft Machines' Architecture: An Element of VISC to Improving IPC

Identifying Single Thread Performance Bottlenecks

Meet VISC

Post Your Comment

97 Comments

View All Comments

kgardas - Saturday, February 13, 2016 - link

Alexvrb - Monday, February 15, 2016 - link

FunBunny2 - Monday, February 15, 2016 - link

Alexvrb - Monday, February 15, 2016 - link

diediealldie - Friday, February 12, 2016 - link

dcbronco - Friday, February 12, 2016 - link

zodiacfml - Saturday, February 13, 2016 - link

haplo602 - Saturday, February 13, 2016 - link

vladx - Sunday, February 14, 2016 - link

hMunster - Sunday, February 14, 2016 - link

Log in

Don't have an account? Sign up now