Turing RT Cores: Hybrid Rendering and Real Time Raytracing

As it presents itself in Turing, real-time raytracing doesn’t completely replace traditional rasterization-based rendering, instead existing as part of Turing’s ‘hybrid rendering’ model. In other words, rasterization is used for most rendering, while raytracing techniques are used for select graphical effects. Meanwhile, the ‘real-time’ performance is generally achieved with a very small number of rays (e.g. 1 or 2) per pixel, and a very large amount of denoising.
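To put the "1 or 2 rays per pixel" figure in perspective, here is a back-of-the-envelope sketch of how many rays per second such a budget implies. The resolutions and the 60 fps target are illustrative assumptions rather than figures from NVIDIA, and the count covers only the rays cast per pixel, not any further bounces.

```python
def rays_per_second(width, height, fps, rays_per_pixel):
    """Rays that must be traced each second to sustain the given frame rate."""
    return width * height * fps * rays_per_pixel


# Illustrative targets (assumed for this example, not taken from NVIDIA).
budget_1080p = rays_per_second(1920, 1080, 60, 1)
budget_4k = rays_per_second(3840, 2160, 60, 2)

print(f"1080p, 60 fps, 1 ray/pixel: {budget_1080p / 1e6:.1f} Mrays/s")  # 124.4
print(f"4K, 60 fps, 2 rays/pixel:   {budget_4k / 1e6:.1f} Mrays/s")    # 995.3
```

Even at these minimal per-pixel counts the totals run to hundreds of millions of rays per second, which is why the ray budget is kept so small and denoising does so much of the work.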

The specific implementation is ultimately in the hands of developers, and NVIDIA naturally has their raytracing development ecosystem, which we’ll go over in a later section. But because of the computational intensity, it simply isn’t possible to use real-time raytracing for the complete rendering workload. And higher resolutions, more complex scenes, and numerous graphical effects also compound the difficulty. So for performance reasons, developers will be utilizing raytracing in a deliberate and targeted manner for specific effects, such as global illumination, ambient occlusion, realistic shadows, reflections, and refractions. Likewise, raytracing may be limited to specific objects in a scene, and rasterization and z-buffering may replace primary ray casting while only secondary rays are raytraced. Thus, the goal of developers is to use raytracing for the most noticeable and realistic effects that rasterization cannot accomplish.
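As a rough illustration of rasterization standing in for primary rays while only a secondary ray is traced, the sketch below shades a single pixel. The position and normal are assumed to come from a rasterized G-buffer, and one shadow ray is then traced towards a point light. The function names, the sphere occluders, and the scene values are all hypothetical; this is a toy CPU model, not how any particular engine or the DXR API structures the work.

```python
import math


def ray_hits_sphere(origin, direction, center, radius, t_max):
    """True if origin + t*direction meets the sphere for some 0 < t < t_max.

    `direction` is assumed to be unit length.
    """
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return False
    root = math.sqrt(disc)
    eps = 1e-4  # ignore hits right at the surface the ray starts from
    return (eps < -b - root < t_max) or (eps < -b + root < t_max)


def shade_pixel(gbuffer_sample, light_pos, occluders):
    """Lambert term from rasterized data, visibility from one traced shadow ray."""
    position, normal = gbuffer_sample  # produced by the rasterizer, not by a ray
    to_light = [l - p for l, p in zip(light_pos, position)]
    dist = math.sqrt(sum(x * x for x in to_light))
    direction = [x / dist for x in to_light]
    n_dot_l = max(0.0, sum(n * d for n, d in zip(normal, direction)))
    # The only ray in this sketch: a secondary (shadow) ray towards the light.
    in_shadow = any(
        ray_hits_sphere(position, direction, center, radius, dist)
        for center, radius in occluders
    )
    return 0.0 if in_shadow else n_dot_l


light = (0.0, 10.0, 0.0)
spheres = [((0.0, 5.0, 0.0), 1.0)]  # one occluder hovering above the floor
up = (0.0, 1.0, 0.0)

shadowed = shade_pixel(((0.0, 0.0, 0.0), up), light, spheres)
lit = shade_pixel(((5.0, 0.0, 0.0), up), light, spheres)
print(shadowed, round(lit, 4))
```

The floor point directly under the sphere comes back fully shadowed, while the point off to the side keeps its ordinary diffuse term. No primary ray is ever cast in either case.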

Essentially, this style of ‘hybrid rendering’ is a lot less raytracing than one might imagine from the marketing material. Perhaps a blunt way to generalize might be: real time raytracing in Turing typically means only certain objects are being rendered with certain raytraced graphical effects, using a minimal number of rays per pixel and/or only raytracing secondary rays, and using a lot of denoising; anything more would affect performance too much. Interestingly, explaining all the caveats this way both undersells and oversells the technology, because therein lies the paradox. Even in this very circumscribed way, GPU performance is significantly affected, but image quality is enhanced with a realism that cannot be provided by a higher resolution or better anti-aliasing. Except ‘real time’ interactivity in gaming essentially means a minimum of 30 to 45 fps, and lowering the render resolution to achieve those framerates hurts image quality. What complicates this is that real time raytracing is indeed considered the ‘holy grail’ of computer graphics, and so managing the feat at all is a big deal, but there are equally valid professional and consumer perspectives on how that translates into a compelling product.

On that note, then, NVIDIA accomplished what the industry was not expecting to be possible for at least a few more years, and certainly not at this scale or with this kind of development ecosystem. Real time raytracing is the culmination of a decade or so of work, and the Turing RT Cores are the lynchpin. But in building up to it, NVIDIA summarizes the achievement as a result of:

  • Hybrid rendering pipeline
  • Efficient denoising algorithms
  • Efficient BVH algorithms

By themselves, these developments were unable to improve raytracing efficiency, but set the stage for RT Cores. By virtue of raytracing’s importance in the world of computer graphics, NVIDIA Research has been looking into various BVH implementations for quite some time, as well as exploring architectural concerns for raytracing acceleration, something easily noted from their patents and publications. Likewise with denoising, though the latest trend has veered towards using AI and by extension Tensor Cores. When BVH became a standard of sorts, NVIDIA was able to design a corresponding fixed function hardware accelerator.
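To show why denoising matters so much at one sample per pixel, here is a deliberately simple sketch. Each pixel takes a single random visibility sample of a surface that is truly 50% lit, and a plain box filter then averages neighbouring pixels. The box filter, the image size, and the 50% value are assumptions chosen for illustration; production denoisers, including the AI-based ones mentioned above, are far more sophisticated.

```python
import random


def box_filter(image, radius):
    """Average each pixel with its neighbours inside a (2*radius + 1) square."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[ny][nx]
                    count += 1
            out[y][x] = total / count
    return out


def rmse(image, truth):
    """Root-mean-square error of every pixel against a constant true value."""
    squared = [(p - truth) ** 2 for row in image for p in row]
    return (sum(squared) / len(squared)) ** 0.5


rng = random.Random(0)   # fixed seed so the example is repeatable
TRUE_VISIBILITY = 0.5    # assumed ground truth for a uniformly half-lit surface

# One sample per pixel: each pixel is either fully lit (1.0) or fully dark (0.0).
noisy = [
    [1.0 if rng.random() < TRUE_VISIBILITY else 0.0 for _ in range(32)]
    for _ in range(32)
]
denoised = box_filter(noisy, radius=2)

print("error before filtering:", rmse(noisy, TRUE_VISIBILITY))
print("error after filtering: ", rmse(denoised, TRUE_VISIBILITY))
```

Because every raw pixel is either 0 or 1, the unfiltered error is exactly 0.5, and averaging over a neighbourhood brings it down substantially. The price is blur across real edges, which is the problem that smarter denoisers try to avoid.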

Being so crucial to their achievement, NVIDIA is not disclosing many details about the RT Cores or their BVH implementation. Of the details given, much is somewhat generic. To reiterate, BVH is a rather general category, and all modern raytracing acceleration structures are typically BVH or kd-tree based.

Unlike Tensor Cores, which are better seen as an FMA array alongside the FP and INT cores, the RT Cores are more like a classic offloading IP block. Treated very similarly to texture units by the sub-cores, instructions bound for RT Cores are routed out of the sub-core, which is later notified on completion. Upon receiving a ray probe from the SM, the RT Core proceeds to autonomously traverse the BVH and perform ray-intersection tests. This type of ‘traversal and intersection’ fixed function raytracing accelerator is a well-known concept and has had quite a few implementations over the years, as traversal and intersection testing are two of the most computationally intensive tasks involved. In comparison, traversing the BVH in shaders would require thousands of instruction slots per ray cast, all for testing against bounding box intersections in the BVH.
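For readers unfamiliar with what ‘traversal and intersection’ involves, the following is a minimal software sketch of the general technique: a ray-versus-box "slab" test plus a stack-based walk of a small hand-built BVH. It is a textbook-style illustration under stated assumptions, not a description of the RT Core's internals, which NVIDIA has not disclosed. The `Node` layout and function names are hypothetical, and each leaf simply carries a primitive index where a real tracer would go on to test triangles.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec3                        # minimum corner of the bounding box
    hi: Vec3                        # maximum corner of the bounding box
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prim: Optional[int] = None      # set only on leaves


def ray_hits_box(origin, direction, lo, hi):
    """Slab test: intersect the ray's parameter range with each axis interval."""
    t_enter, t_exit = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:      # parallel to this slab and outside it
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter, t_exit = max(t_enter, t0), min(t_exit, t1)
        if t_enter > t_exit:
            return False
    return True


def traverse(root, origin, direction):
    """Return (leaf primitives whose boxes the ray enters, number of box tests)."""
    hits, box_tests = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        box_tests += 1
        if not ray_hits_box(origin, direction, node.lo, node.hi):
            continue                # prune this whole subtree
        if node.prim is not None:
            hits.append(node.prim)  # a real tracer would test triangles here
        else:
            stack.append(node.right)
            stack.append(node.left)
    return hits, box_tests


# Four unit boxes spaced along the x axis, grouped pairwise under a root.
leaves = [Node((2.0 * i, 0.0, 0.0), (2.0 * i + 1.0, 1.0, 1.0), prim=i) for i in range(4)]
near = Node((0.0, 0.0, 0.0), (3.0, 1.0, 1.0), leaves[0], leaves[1])
far = Node((4.0, 0.0, 0.0), (7.0, 1.0, 1.0), leaves[2], leaves[3])
root = Node((0.0, 0.0, 0.0), (7.0, 1.0, 1.0), near, far)

hits, box_tests = traverse(root, origin=(4.5, 0.5, -5.0), direction=(0.0, 0.0, 1.0))
print(hits, box_tests)  # [2] 5
```

With only four primitives the pruning saves nothing, but the pattern is the point: each ray performs a data-dependent chain of box tests and pointer chases. Multiplied across deep trees and millions of rays, that is the work that becomes expensive in shader code and that a fixed-function unit can take over.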

Returning to the RT Core, it then returns any hits and lets the shaders act on the result. The RT Core also handles some grouping and scheduling of memory operations for maximizing memory throughput across multiple rays. And given the workload, there is presumably some amount of memory and/or a ray buffer within the SIP block as well. Like in many other workloads, memory bandwidth is a common bottleneck in raytracing, and has been the focus of several NVIDIA Research papers. And in general, raytracing workloads result in very irregular and random memory accesses, mainly due to incoherent rays, that prove especially problematic for how GPUs typically utilize their memory.
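One common software approach to the incoherence problem is to bin rays so that rays heading in similar directions are processed together and tend to touch the same BVH nodes. The sketch below groups rays by the sign of each direction component (the direction "octant"). This is a generic illustration with hypothetical names; whether or how the RT Core does anything comparable is not something NVIDIA has disclosed.

```python
from collections import defaultdict


def direction_octant(direction):
    """3-bit key: one bit per axis, set when that component is negative."""
    dx, dy, dz = direction
    return int(dx < 0) | (int(dy < 0) << 1) | (int(dz < 0) << 2)


def group_rays(rays):
    """Map each octant key to the ids of the rays travelling in that octant."""
    batches = defaultdict(list)
    for ray_id, (_origin, direction) in enumerate(rays):
        batches[direction_octant(direction)].append(ray_id)
    return dict(batches)


origin = (0.0, 0.0, 0.0)
rays = [
    (origin, (1.0, 1.0, 1.0)),
    (origin, (-1.0, 1.0, 1.0)),
    (origin, (0.5, 2.0, 3.0)),
    (origin, (-2.0, 0.1, 0.3)),
    (origin, (1.0, -1.0, -1.0)),
]

batches = group_rays(rays)
print(batches)  # {0: [0, 2], 1: [1, 3], 6: [4]}
```

Rays within a batch share a rough heading, so tracing one batch at a time makes it more likely that the nodes fetched for one ray are reused by the next.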

But otherwise, everything else is at a high level governed by the API (i.e. DXR) and the application; construction and update of the BVH is done on CUDA cores, governed by the particular IHV – in this case, NVIDIA – in their DXR implementation.
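To make "construction and update of the BVH" more concrete, here is a small sketch of one common approach: a top-down build that splits primitives at the median along the widest axis, and a bottom-up "refit" that recomputes bounds after primitives move without changing the tree's shape. This runs on the CPU in plain Python purely for illustration. The median-split heuristic, the dictionary node layout, and the function names are assumptions, not NVIDIA's undisclosed DXR implementation, which runs on the GPU.

```python
def bounds_of(boxes):
    """Smallest axis-aligned box enclosing every (lo, hi) box in the list."""
    los, his = zip(*boxes)
    lo = tuple(min(axis_values) for axis_values in zip(*los))
    hi = tuple(max(axis_values) for axis_values in zip(*his))
    return lo, hi


def build_bvh(boxes, indices=None):
    """Top-down build: split at the median centroid along the widest axis."""
    if indices is None:
        indices = list(range(len(boxes)))
    lo, hi = bounds_of([boxes[i] for i in indices])
    if len(indices) == 1:
        return {"lo": lo, "hi": hi, "prim": indices[0]}
    axis = max(range(3), key=lambda a: hi[a] - lo[a])
    ordered = sorted(indices, key=lambda i: boxes[i][0][axis] + boxes[i][1][axis])
    mid = len(ordered) // 2
    return {
        "lo": lo,
        "hi": hi,
        "left": build_bvh(boxes, ordered[:mid]),
        "right": build_bvh(boxes, ordered[mid:]),
    }


def refit(node, boxes):
    """Update: recompute bounds bottom-up, keeping the existing tree topology."""
    if "prim" in node:
        node["lo"], node["hi"] = boxes[node["prim"]]
    else:
        left, right = refit(node["left"], boxes), refit(node["right"], boxes)
        node["lo"], node["hi"] = bounds_of(
            [(left["lo"], left["hi"]), (right["lo"], right["hi"])]
        )
    return node


# Four primitive bounding boxes, listed out of spatial order on purpose.
boxes = [
    ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ((6.0, 0.0, 0.0), (7.0, 1.0, 1.0)),
    ((2.0, 0.0, 0.0), (3.0, 1.0, 1.0)),
    ((4.0, 0.0, 0.0), (5.0, 1.0, 1.0)),
]
tree = build_bvh(boxes)
built_bounds = (tree["lo"], tree["hi"])

# Animate one primitive, then refit instead of rebuilding from scratch.
boxes[1] = ((6.0, 0.0, 0.0), (9.0, 2.0, 1.0))
refit(tree, boxes)
refit_bounds = (tree["lo"], tree["hi"])
print(built_bounds, refit_bounds)
```

Refitting is much cheaper than a rebuild, but the tree quality degrades as objects drift far from where they were at build time. That trade-off is why an API needs to expose both construction and update.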

All-in-all, there’s clearly more involved, and we’ll be looking to run some microbenchmarks in the future. NVIDIA’s custom BVH algorithms are clearly in play, but right now we can’t say what the optimizations might be, such as compression, wide BVHs, or node subdivision into treelets. The way the RT Cores are integrated into the SM and into the architecture is likely crucial to how well they operate. Internally, the RT Core might just be a basic traversal and intersection unit, but it might also have other bits inside; one of NVIDIA’s recent patents provides a representation, albeit dated, of what else might be present. I, for one, would not be surprised to see it closely tied with the MIO blocks, and perhaps doing more with coherency gathering by manipulating memory traffic for higher efficiency. It would need to coordinate well with the other workloads in the SMs without strangling memory access with unmitigated incoherent rays.

Nevertheless, details like performance impact are as yet unspecified.

111 Comments

  • StormyParis - Friday, September 14, 2018 - link

    Fascinating subject and excellent treatment. I feel informed and intelligent, so thank you.
  • Gc - Friday, September 14, 2018 - link

    Nice introductory article. I wonder if the ray tracing hardware might have other uses, such as path finding in space, or collision detection in explosions.

    The copy editing was a let down.

    Copy editor: please review the "amount vs. number" categorical distinction in English grammar. Parts of this article, that incorrectly use "amount", such as "amount of rays" instead of "number of rays", are comprehensible but jarring to read, in the way that a computer translation can be comprehensible but annoying to read.

    (yes: "amount of noise". no: "amount of rays, usually 1 or 2 per pixel". yes: "number of rays, usually 1 or 2 per pixel".) (Recall that "number" is for countable items, that can be singular or plural, such as 1 ray or 2 rays. "Amount" is for an unspecified quantity such as liquid or money, "amount of water in the tank" or "amount of money in the bank". But if pluralizable units are specified, then those units are countable, so "number of liters in the tank", or "number of dollars in the bank". [In this article, "amount of noise" does not refer to an event as in 1 noise, 2 noises, but rather to an unspecified quantity or ratio.] A web search for "amount vs. number" will turn up other explanations.)
  • Gc - Friday, September 14, 2018 - link

    (Hope you're all staying dry if you're in Florence's storm path.)
  • edzieba - Saturday, September 15, 2018 - link

    " I wonder if the ray tracing hardware might have other uses, such as path finding in space, or collision detection in explosions."

    Yes, these were called out (as well as gun hitscan and AI direct visibility checks) in their developer focused GDC presentation.
  • edzieba - Saturday, September 15, 2018 - link

    One thing that might be worth highlighting (or exploring further) is that raytraced reflections and lighting/shadowing are necessary for VR, where screen-space reflections produce very obviously incorrect results
  • Achaios - Saturday, September 15, 2018 - link

    This is epic. It should be taught as a special lesson in Marketing classes. NVIDIA is selling fanboys technology for which there is presently no practical use for, and the cards are already sold out. Might as well give NVIDIA license to print money.
  • iwod - Saturday, September 15, 2018 - link

    Aren't we fast running to Memory Bandwidth bottleneck?

    Assuming we get 7nm next year at 8192 CUDA Core, that will need at least 80% more bandwidth, or 1TB/s. Neither 512bit memory nor HBM2 could offer that.
  • HStewart - Saturday, September 15, 2018 - link

    I'm wondering when professional rendering packages will support RTX - I personally have Lightwave 3D 2018 and because of Newtek's excellent upgrade process - I could see supporting it in future. I could see this technology do wonders for Movie and Game creations - reducing the dependency on CPU cores
  • YaleZhang - Saturday, September 15, 2018 - link

    Increased power use is disappointing. Is the 225W TDP for 2080 the power used or the heat dissipated? If it's power used, then that would include the 27W power used by VirtualLink. So then the real power usage would be 198 W.
  • willis936 - Sunday, September 16, 2018 - link

    I've been in signal integrity for five years. I write automation scripts for half million dollar oscilloscopes. I love it. It's my jam. Why on god's green earth does nvidia think their audience cares about eye diagrams? They mean literally nothing to the target audience. They're not talking to system integrators or chip manufacturers. Even if they were a single eye diagram with an eye width measurement means next to nothing beyond demonstrating that they have an image of what a signal at a given baud rate should look like (it's unclear if it's simulated or taken from one of their test monkeys). If they really wanted to blow us away they could say something like they've verified 97% confidence that their memory interface/channel BER <= 1E-15 when the spec commands BER <= 1E-12 or something. It's just a jargon image to show off how much they must really know their stuff. It just strikes me as tacky.
