Zedsdeadbaby
Jun 14, 2008

You have been called out, in the ways of old.
It's leak bait

Dr. Video Games 0031
Jul 17, 2004

ConanTheLibrarian posted:

The memory speeds don't look right at all. They're worse than the equivalent current-gen cards, so how are they supposed to keep a greater number of faster cores fed? Surely just bumping the cache size isn't enough to make up for reduced memory bandwidth.

It will have a lot more cache (16MB of L2 per 64 bits on the IMC, as opposed to 1MB in Ampere), and bear in mind that Ampere arguably had more memory bandwidth than it needed. I think the 4080's bandwidth situation could still get sketchy if it's released like that, but this is probably still up in the air. 21Gbps at least would be nice.
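To put rough numbers on that cache claim, here's a back-of-the-envelope sketch. The 160-bit bus is the rumored 4070 config discussed later in the thread; the 256-bit row is just a hypothetical for comparison, not a confirmed spec.

code:
# Back-of-the-envelope L2 totals from the per-controller figures above:
# rumored Ada carries 16 MB of L2 per 64 bits of memory bus, vs ~1 MB per
# 64 bits on Ampere.
def total_l2_mb(bus_width_bits, mb_per_64bit):
    return bus_width_bits / 64 * mb_per_64bit

for label, bus in [("rumored 160-bit", 160), ("hypothetical 256-bit", 256)]:
    print(f"{label}: {total_l2_mb(bus, 16):.1f} MB L2 "
          f"(vs {total_l2_mb(bus, 1):.1f} MB at Ampere's ratio)")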

hobbesmaster
Jan 28, 2008

ConanTheLibrarian posted:

The memory speeds don't look right at all. They're worse than the equivalent current-gen cards, so how are they supposed to keep a greater number of faster cores fed? Surely just bumping the cache size isn't enough to make up for reduced memory bandwidth.

The leaked speeds would be nominal (i.e. GDDR6X chip data sheet) figures, and those numbers are identical to the current cards.

repiv
Aug 13, 2009

someone's made a proof-of-concept DLSS DLL substitute based on FSR2 that works in cyberpunk

https://www.nexusmods.com/cyberpunk2077/mods/3001?tab=description

it's rough around the edges but pretty cool that it works at all

Rinkles
Oct 24, 2010

What I'm getting at is...
Do you feel the same way?

repiv posted:

that's an issue with the engine integration, rather than the actual scaler

AMD is doing an official UE4 FSR2 plugin though, so if gearbox used that then any issues seen there might also show up in other unreal games until AMD fixes it

Do you mean the cropped rendering thing, or that it samples interface elements it shouldn't?



It's pretty disorienting when it happens.

CaptainSarcastic
Jul 6, 2013



Well, looks like my Best Buy Open Box deal was too good to be true. It arrived today, I tried installing it tonight, and it's DOA. The RGB lights up and fans spin on boot, but the OS wouldn't see it. Did all the necessary troubleshooting, and determined the card is non-working. Now I get to deal with returning it to Best Buy on the one day I don't have a car available to me. :mad:

Bloodplay it again
Aug 25, 2003

Oh, Dee, you card. :-*

CaptainSarcastic posted:

Well, looks like my Best Buy Open Box deal was too good to be true. It arrived today, I tried installing it tonight, and it's DOA. The RGB lights up and fans spin on boot, but the OS wouldn't see it. Did all the necessary troubleshooting, and determined the card is non-working. Now I get to deal with returning it to Best Buy on the one day I don't have a car available to me. :mad:

Do you have one of the weird motherboard/CPU combos that hates PCIe 4.0? Try going into your BIOS, setting it to 3.0, and giving it a shot before sending it back if you haven't already done so. Do it for science.

CaptainSarcastic
Jul 6, 2013



Bloodplay it again posted:

Do you have one of the weird motherboard/CPU combos that hates PCIe 4.0? Try going into your BIOS, setting it to 3.0, and giving it a shot before sending it back if you haven't already done so. Do it for science.

Ugh, that's the one thing I didn't check. I'll dig around in my BIOS the next time I reboot, and if it does look like a possible issue I'll try installing the 3080 one more time. This board does have PCIe 4.0, so I should've thought to check on that. That can wait for tomorrow, though - I've already slung my desktop around enough for one day. At least it is nicely dusted now.

ConanTheLibrarian
Aug 13, 2004


dis buch is late
Fallen Rib

hobbesmaster posted:

The leaked speeds would be nominal (i.e. GDDR6X chip data sheet) figures, and those numbers are identical to the current cards.

The 4070 has a 160-bit interface; that's less than a 3060. Even using faster memory only brings the 4070 up to the same bandwidth as the 3060.

Zedsdeadbaby
Jun 14, 2008

You have been called out, in the ways of old.

Rinkles posted:

Do you mean the cropped rendering thing, or that it samples interface elements it shouldn't?



It's pretty disorienting when it happens.

Never seen that happen in the other two 2.0 games I've tried; it's probably just a bad implementation by Gearbox.

hobbesmaster
Jan 28, 2008

ConanTheLibrarian posted:

The 4070 has a 160-bit interface; that's less than a 3060. Even using faster memory only brings the 4070 up to the same bandwidth as the 3060.

Ok, let's do a deep dive into this rumor. 160 bits of GDDR6(X) is 5 modules. That'd be 80Gbit total from 16Gbit modules, i.e. 10GB, so the capacity tracks. GDDR6 (from Micron/Samsung) traditionally came in either 14 or 16Gbps, but Samsung now has one 18Gbps SKU and is sampling 20 and 24Gbps; here's the 18Gbps part, to follow the rumor:
https://semiconductor.samsung.com/dram/gddr/gddr6/k4zaf325bm-hc18/

Nominal memory bandwidths of the ampere cards would be:
3060 - 360 GBps
3060 ti - 448 GBps
3070 - 448 GBps
3070 ti - 608 GBps
3080 - 760 GBps

Using Samsung’s gddr6 SKUs the options for a 4070 with 5 16Gb chips would be:
14Gbps - 280 GBps
16Gbps - 320 GBps
18Gbps - 360 GBps

Ok, yes, that's a 3060. HOWEVER, remember that rumors are always wrong, and recall how I said that Samsung was sampling other speeds? That was first announced in December:
20Gbps - 400 GBps
24Gbps - 480 GBps
Here we go - 480 GBps is a number that would make a lot of sense!

Interestingly, Micron's announced GDDR6X also only goes up to 24Gbps. Is there any other info on that out there?
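For anyone who wants to double-check the arithmetic, the whole table above reduces to one formula. This is just a quick sketch; the bus width and speed grades are the rumored/nominal figures from the post, not confirmed specs.

code:
# Bandwidth in GB/s = bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte
def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    return bus_width_bits * data_rate_gbps / 8

# Rumored 160-bit card at the Samsung GDDR6 speed grades listed above
for rate in (14, 16, 18, 20, 24):
    print(f"160-bit @ {rate} Gbps -> {bandwidth_gb_s(160, rate):.0f} GB/s")

# Sanity check against a known Ampere config: the 3060 is 192-bit @ 15 Gbps
print(f"3060: {bandwidth_gb_s(192, 15):.0f} GB/s")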

Struensee
Nov 9, 2011
Someone tell me which PSU I need if I'm only getting the next generation's 3060 Ti equivalent! I'm using a 650W EVGA GQ atm.

Struensee fucked around with this message at 23:18 on Jun 25, 2022

Dr. Video Games 0031
Jul 17, 2004

Struensee posted:

Someone tell me which PSU I need if I'm only getting the next generation's 3060 Ti equivalent! I'm using a 650W EVGA GQ atm.

Impossible to say since we know nothing about the next generation's midrange cards and what their power draw characteristics will be.

Cygni
Nov 12, 2005

raring to post

Honestly a 650W is probably gonna be enough if you are looking for 3060 Ti performance in the next gen. I would probably wait on buying a PSU until then, because it's still a pretty long way away.

Struensee
Nov 9, 2011
Thank you.

Zedsdeadbaby
Jun 14, 2008

You have been called out, in the ways of old.
They are not going to massively increase PSU requirements like some are fearing here & elsewhere; it would be too damaging to consumer confidence, and nobody would buy the drat things

hobbesmaster
Jan 28, 2008

Zedsdeadbaby posted:

They are not going to massively increase PSU requirements like some are fearing here & elsewhere; it would be too damaging to consumer confidence, and nobody would buy the drat things

I’m not so sure. The broader base of PC gaming enthusiasts seems to have taken the bad old “double your CPU+GPU power to size the PSU” advice and set that as a floor. There are people out there buying 800W+ power supplies for midrange GPUs and 1-CCD Ryzen/Intel i5 based PCs.
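As a rough sanity check on the sizing question, here's a sketch using assumed spec-sheet figures (board power for the GPU, TDP for the CPU), not measurements - real transient behavior varies by card, and the 30% headroom number is just a rule of thumb.

code:
# Rough PSU sizing sketch: steady-state estimate plus headroom.
def psu_estimate_w(gpu_board_power_w, cpu_tdp_w, rest_of_system_w=75, headroom=1.3):
    """Returns (steady-state estimate, estimate with ~30% headroom)."""
    steady = gpu_board_power_w + cpu_tdp_w + rest_of_system_w
    return steady, steady * headroom

# Example: a 3070-class card (~220W board power) with a 65W midrange CPU
steady, with_headroom = psu_estimate_w(220, 65)
print(f"~{steady:.0f} W steady-state, ~{with_headroom:.0f} W with headroom")  # ~360 W / ~468 W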

Cygni
Nov 12, 2005

raring to post

This is all conjecture, but imo performance per watt will definitely increase with the die shrink. So if you want the performance of current Ampere part X in the next gen, that level of performance is very likely to have lower power requirements with Ada.

That said, those types of mid-low parts (something like a 4050 or 4050 Ti or whatever Nvidia chooses to brand that as) are likely pretty far away, as Nvidia tends to release those parts last in the cycle. Could easily be a year away from right now.

Collateral
Feb 17, 2010
Apparently my 550w is good for a 10400f and 3070 FE. Is this correct?

LRADIKAL
Jun 10, 2001

Fun Shoe
MAYBE!

The Kenosha Kid
Sep 15, 2007

You never did.
Probably; I'm running a 12400 and 3070 off a 550W PSU with no issues.

Begall
Jul 28, 2008
A 750W Corsair CXF has been happily supplying my 12600KF/3080Ti FE build that pcpartpicker puts at 660W with no issues at all (since December 21) :shrug:

BrassRoots
Jan 9, 2012

You can play a shoestring if you're sincere - John Coltrane
Asus TUF 6900 XT now AU$999 here in Aus. Includes tax and free shipping. Starting to see some good value in the Southern Hemisphere.

Aware
Nov 18, 2003
My 5900x and 3080 ran off a 550w for a while just fine. I upgraded to an 850w to be safe but it worked.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

v1ld posted:

Is the advantage of BVH traversal happening on GPU simply that it's traversing structures in GPU memory or are there actual computational advantages being exploited by NVidia's dedicated hardware? That's a naive question, I know, driven by my ignorance as is clear when I try to google the diffs between CPU and GPU BVH traversal (and run into papers from 5-10 years ago before commodity RT gear even existed).

What I'd really like to understand is how much the difference in hardware support for RT primitives between vendors will hold back the more general use of RT in games.

https://developer.nvidia.com/blog/thinking-parallel-part-i-collision-detection-gpu/

https://developer.nvidia.com/blog/thinking-parallel-part-ii-tree-traversal-gpu/

https://developer.nvidia.com/blog/thinking-parallel-part-iii-tree-construction-gpu/

https://pbr-book.org/3ed-2018/Primitives_and_Intersection_Acceleration/Bounding_Volume_Hierarchies

https://pbr-book.org/3ed-2018/Primitives_and_Intersection_Acceleration/Kd-Tree_Accelerator

so, intersection is one part of raytracing - the question "does the ray intersect this primitive". But it's computationally infeasible (and inefficient) to just brute-force test every primitive in a scene - that's every facet of every object. You can't even z-cull because otherwise reflections wouldn't show the backside of objects. BVH (Bounding Volume Hierarchy) is the canonical way you approach this algorithmically - like a bounding hull but in 3D instead of 2D, built recursively around groups of primitives, and then you can answer the question "does anywhere in this hull intersect? ok next smaller one down in the hierarchy, does anywhere in this hull intersect"? And that saves you a massive amount of computational power because you can algorithmically cull huge numbers of primitives that definitely don't intersect.

This is not the only possible way to do it - the last article discusses Kd-trees as another option - and there are advantages and disadvantages either way. Generally BVH seems to be the preferred structure for realtime graphics because it's faster to set up (and realtime graphics rebuild it every frame - which is why raytracing increases CPU requirements). But you generally need some sort of hierarchical structure to avoid fully enumerating every possible primitive.

So ray intersection is what tells you whether the ray collides, and BVH traversal is what lets you apply that efficiently across a scene without doing useless computation.
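To make the hierarchical-culling idea concrete, here's a minimal Python sketch of iterative BVH traversal. It's purely illustrative - not NVIDIA's or AMD's actual implementation - and intersect_triangle is a placeholder callback supplied by the caller.

code:
# Test the ray against a node's bounding box and only descend into children
# whose boxes it hits, so most primitives in the scene are never tested at all.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    box_min: tuple                                          # AABB corners
    box_max: tuple
    children: List["Node"] = field(default_factory=list)    # empty for leaves
    triangles: list = field(default_factory=list)           # primitives stored in leaves

def ray_hits_box(origin, inv_dir, box_min, box_max):
    """Slab test: does the ray intersect this axis-aligned box?
    (Edge cases for axis-parallel rays are glossed over.)"""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        t1 = (box_min[axis] - origin[axis]) * inv_dir[axis]
        t2 = (box_max[axis] - origin[axis]) * inv_dir[axis]
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def traverse(root, origin, direction, intersect_triangle):
    """Iterative traversal with an explicit stack - conceptually what the RT
    cores (or AMD's shader loop) are doing."""
    inv_dir = tuple(1.0 / d if d != 0 else float("inf") for d in direction)
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        if not ray_hits_box(origin, inv_dir, node.box_min, node.box_max):
            continue  # the whole subtree is culled by one box test
        if node.children:
            stack.extend(node.children)  # descend only into boxes the ray touches
        else:
            hits.extend(t for t in node.triangles if intersect_triangle(origin, direction, t))
    return hits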

AMD still has to do BVH traversal - they just do it iteratively in software in their GPU-side compute kernels rather than with a dedicated unit. AMD specifically doesn't really give a poo poo about raytracing in the current architectural generation; it's a box they felt like they had to check. They basically threw some transistors at making the intersection math faster (it's a bunch of trig math, and trig is super slow compared to addition/subtraction/multiplication) since that's most likely not going to change, and punted on worrying about traversal. You can do it in software, it's just more shader-intensive.

The possible exception is something like Unreal's Nanite, but I'm not really familiar with the technical details of what's going on there. On the flip side, if BVH falls out of favor, NVIDIA's BVH units do nothing for them either, of course. If Nanite does things really differently and BVH isn't what's used... sucks to be NVIDIA (and Intel). They'd be in the same boat AMD is in now.

What's the advantage of doing it on GPU rather than CPU? Because these are massively parallel tasks that benefit from GPU-style acceleration - in principle 1 ray is 1 CUDA/stream processor lane, and you shoot millions of them per frame. And because it is in the middle of the pipeline - there are more pipeline stages that come after RT, so doing it on CPU would imply copying a ton of intermediate computation over to the CPU and then copying a bunch back, which is a huge waste of bandwidth, a huge amount of latency, and a huge sacrifice of cache locality. Even doing it on a standalone "RT accelerator" (as suggested by MLID) is not really feasible because of this; it's not just PCIe that's the problem, it's just not possible to pull a task out of the middle of the graphics pipeline and do it somewhere else without giving up a huge amount of latency, cache locality, and performance as a whole.

(Don’t be deceived by the early diagrams - RT units are really part of the Streaming Multiprocessor, they aren’t freestanding units; they couldn’t be, since the SM controls work allocation and scheduling… the stuff about “AMD put them in the texture unit and NVIDIA has freestanding execution units” that people were talking about with RDNA2 isn’t really important, that’s an implementation detail. The SM is the “core” (working on 32 lanes at once) and the RT units are entirely “part of the core” so to speak. It has to be, given how tightly integrated it is into the graphics pipeline.)

Intel doing ray binning/sorting potentially goes further down the "tile-based" sort of thing - since most rays don't hit, and most don't hit the same things, these tend to be highly "divergent" gpu kernels that don't get good occupancy without some sort of assistance. You can of course do this in software, somewhat - for example kernels can sort the active threads to be within a single warp, or use dynamic parallelism to launch kernels for the "successful" rays and just iterate down the hierarchy - but this is why BVH traversal units are a good optimization. But again, like tile-based rendering - it would be even better if instead of working against random parts of the tree (and different textures, etc) you could guarantee that rays that are hitting the same thing would be executed together - that would give you much better data locality for cache etc. And I'm guessing that's what Intel is doing with "ray binning", that's the concept that leaps to mind. They also are going with a much smaller warp size (AMD used to be 64-wide, RDNA is 32, NVIDIA has always been 32, Intel is 8-wide) and that will help with the divergence problem too - it will be interesting to see how RT performance compares to raster across these architectures.
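A toy CPU-side sketch of that binning idea, as I read it - this is not Intel's actual hardware or API, and shade_batch/shade_miss are placeholder callbacks.

code:
# Instead of shading rays in launch order, bucket them by what they hit so
# rays touching the same material/BVH region are processed together, keeping
# a batch's memory accesses coherent.
from collections import defaultdict

def bin_rays_by_hit(ray_hits):
    """ray_hits: iterable of (ray_id, hit_key), where hit_key identifies the
    material / BVH leaf the ray landed on (None for a miss)."""
    bins = defaultdict(list)
    for ray_id, hit_key in ray_hits:
        bins[hit_key].append(ray_id)
    return bins

def shade_binned(ray_hits, shade_batch, shade_miss):
    for hit_key, ray_ids in bin_rays_by_hit(ray_hits).items():
        if hit_key is None:
            shade_miss(ray_ids)            # misses all take the sky/environment path together
        else:
            shade_batch(hit_key, ray_ids)  # one coherent batch per surface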

Paul MaudDib fucked around with this message at 01:31 on Jun 26, 2022

repiv
Aug 13, 2009

Paul MaudDib posted:

The possible exception is something like Unreal's Nanite, but I'm not really familiar with the technical details of what's going on there. On the flip side, if BVH falls out of favor, NVIDIA's BVH units do nothing for them either, of course. If Nanite does things really differently and BVH isn't what's used... sucks to be NVIDIA (and Intel).

nanite has nothing to do with raytracing; that's a solution for rasterizing arbitrarily dense meshes, which is novel for mostly relying on software rasterization running on compute rather than the fixed-function raster pipeline

lumen is the raytracing side of UE5, and there are two sides to that: it has the option to trace against signed distance fields (a sort of volumetric representation) rather than triangle BVHes, which trades off precision for potentially higher performance (especially on AMD), or it can leverage hardware support for high-precision triangle BVH tracing instead. which path gets used will depend on the needs of the project - the matrix demo, for example, uses the HW RT path even on the consoles, because the sharp reflections in the city scene would look pretty bad if you saw a blobby distance field representation in them.
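For anyone unfamiliar with what "tracing against a signed distance field" means in practice, here's a minimal sphere-tracing sketch. It's illustrative only, not Lumen's implementation; the single-sphere SDF is just a stand-in for a real scene distance field.

code:
# Step along the ray by the distance-field value until you're close enough
# to the surface to call it a hit.
import math

def sphere_sdf(p, center=(0.0, 0.0, 5.0), radius=1.0):
    """Signed distance from point p to an example sphere (stand-in for a scene SDF)."""
    return math.dist(p, center) - radius

def sphere_trace(origin, direction, sdf=sphere_sdf, max_steps=128, eps=1e-3, t_max=100.0):
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        d = sdf(p)
        if d < eps:
            return t  # close enough to the surface: report a hit at distance t
        t += d        # safe to advance by the distance to the nearest surface
        if t > t_max:
            break
    return None       # miss

print(sphere_trace((0, 0, 0), (0, 0, 1)))  # ~4.0 for the example sphere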

Dr. Video Games 0031
Jul 17, 2004

Well, it's always possible that RDNA3 won't support hardware BVH traversal either...

https://twitter.com/0x22h/status/1530005215625195521

repiv
Aug 13, 2009

Paul MaudDib posted:

Intel doing ray binning/sorting potentially goes further down the "tile-based" sort of thing - since most rays don't hit, and most don't hit the same things, these tend to be highly "divergent" gpu kernels that don't get good occupancy without some sort of assistance. You can of course do this in software, somewhat - for example kernels can sort the active threads to be within a single warp, or use dynamic parallelism to launch kernels for the "successful" rays and just iterate down the hierarchy - but this is why BVH traversal units are a good optimization. But again, like tile-based rendering - it would be even better if instead of working against random parts of the tree (and different textures, etc) you could guarantee that rays that are hitting the same thing would be executed together - that would give you much better data locality for cache etc. And I'm guessing that's what Intel is doing with "ray binning", that's the concept that leaps to mind. They also are going with a much smaller warp size (AMD used to be 64-wide, RDNA is 32, NVIDIA has always been 32, Intel is 8-wide) and that will help with the divergence problem too - it will be interesting to see how RT performance compares to raster across these architectures.

also intel has done a presentation on the nitty gritty of their raytracing architecture if you're interested

https://www.youtube.com/watch?v=SA1yvWs3lHU

on paper it sounds very powerful, though it's going to be awkward on the software side as the optimal API usage is very different between AMD/NV (manually sort/bin to increase occupancy) and intel (just YOLO it and let the hardware scheduler sort it out)

repiv fucked around with this message at 01:29 on Jun 26, 2022

CaptainSarcastic
Jul 6, 2013



Bloodplay it again posted:

Do you have one of the weird motherboard/CPU combos that hates PCIe 4.0? Try going into your BIOS, setting it to 3.0, and giving it a shot before sending it back if you haven't already done so. Do it for science.

Following up to say this was the loving problem. Set the PCIe slot to 3.0 and the card was recognized. Of course the Linux half of my dual-boot now gets no display, so I'm having to troubleshoot that. But the 3080 is running great in Windows.

MarcusSA
Sep 23, 2007

CaptainSarcastic posted:

Following up to say this was the loving problem. Set the PCIe slot to 3.0 and the card was recognized. Of course the Linux half of my dual-boot now gets no display, so I'm having to troubleshoot that. But the 3080 is running great in Windows.

Nice! Really glad it wasn't a busted card.

CaptainSarcastic
Jul 6, 2013



MarcusSA posted:

Nice! Really glad it wasn't a busted card.

Thank you! I feel stupid that I didn't check the PCIe gen setting in amongst the frenzy of troubleshooting I did. Posting this from my Linux install, so if it survives a bunch of updates then I should have it all ironed out.

Until I update the BIOS for a CPU upgrade in the near future. :sigh:

Wistful of Dollars
Aug 25, 2009

Bondematt posted:

They already exist, but yeah, if you plan on getting a 4080/4090 you would want to get one now as they will be gone by launch.

The only one I see available right now is Gigabyte and well...

Gigabyte. :yikes:

Taima
Dec 31, 2006

tfw you're peeing next to someone in the lineup and they don't know
Wait sorry- do you NEED one of these new psus for a 4090?

I plan on getting a 4090 asap and have the following PSU, am I ok or do I need to gently caress with some new standard?

https://www.newegg.com/antec-signature-1000-titanium-st1000-1000w/p/N82E16817371133?Item=N82E16817371133

hobbesmaster
Jan 28, 2008

You’ll just need adapters that will come in the box. Be careful with wire gauge and amperage though.

MarcusSA
Sep 23, 2007

hobbesmaster posted:

You’ll just need adapters that will come in the box. Be careful with wire gauge and amperage though.

Don't some of the MFG's sell the wires directly so you don't have to deal with the adapters?

Bloodplay it again
Aug 25, 2003

Oh, Dee, you card. :-*

CaptainSarcastic posted:

Following up to say this was the loving problem. Set the PCIe slot to 3.0 and the card was recognized. Of course the Linux half of my dual-boot now gets no display, so I'm having to troubleshoot that. But the 3080 is running great in Windows.

If you hadn't said the fans and RGB were lit up, I wouldn't even have suggested it. While not guaranteed, I would expect a dead or malfunctioning card to not light up or spin the fans during POST. I'm glad it works. Now be sure to undervolt it with MSI Afterburner to shave off ~40-50W under load with little-to-no degradation in performance. For what it's worth, I genuinely have no idea why this is an issue with some Ryzen CPU/mobo combos, and if others hadn't run into the same issue in this thread, I wouldn't have been able to suggest it. Beats having to return a card that works, though, that's for sure.

CaptainSarcastic
Jul 6, 2013



Bloodplay it again posted:

If you hadn't said the fans and RGB were lit up, I wouldn't even have suggested it. While not guaranteed, I would expect a dead or malfunctioning card to not light up or spin the fans during POST. I'm glad it works. Now be sure to undervolt it with MSI Afterburner to shave off ~40-50W under load with little-to-no degradation in performance. For what it's worth, I genuinely have no idea why this is an issue with some Ryzen CPU/mobo combos, and if others hadn't run into the same issue in this thread, I wouldn't have been able to suggest it. Beats having to return a card that works, though, that's for sure.

I really appreciate you throwing that out there - I've been hanging around in this thread for years and I knew that was a thing, but it simply didn't occur to me to check until you said it. This is a Gigabyte X570 Aorus Elite board, and somehow I naively thought it wouldn't have that issue. I'll play with Afterburner later - I spent most of the day having to get my Linux install working again after it apparently freaked out over the 3080 I installed. At least it spurred me to update my distro to the current release, and everything seems to be clicking along fine now.

Thank you again!

hobbesmaster
Jan 28, 2008

MarcusSA posted:

Don't some of the MFG's sell the wires directly so you don't have to deal with the adapters?

Some did that briefly at 3xxx launch. I’m not sure if that continued?

Dr. Video Games 0031
Jul 17, 2004

Wistful of Dollars posted:

The only one I see available right now is Gigabyte and well...

Gigabyte. :yikes:

For what it's worth, Aris Mpitziopoulos (the best PSU reviewer still active) reviewed it and found that it didn't explode: https://www.techpowerup.com/review/gigabyte-ud1000gm-pg5-1000-w/

It wasn't particularly great or anything, though, and it still failed some of the tougher transient tests: it shut down when faced with 1600-1800W spikes. But not even the 4090 will manage spikes like that. Seems like a noisy unit, too.

Dr. Video Games 0031 fucked around with this message at 04:49 on Jun 26, 2022

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

repiv posted:

also intel has done a presentation on the nitty gritty of their raytracing architecture if you're interested

https://www.youtube.com/watch?v=SA1yvWs3lHU

on paper it sounds very powerful, though it's going to be awkward on the software side as the optimal API usage is very different between AMD/NV (manually sort/bin to increase occupancy) and intel (just YOLO it and let the hardware scheduler sort it out)

this is fantastic, thanks - I hadn't seen this, but yeah, pretty much what I expected. GPUs really aren't good with small granular launches of kernels onto a few elements - it's not a "promise/future" model at all, you want the "chunk" you launch onto to be as large as possible. They benefit from large-scale aligned operations and minimal divergence (in launch/access patterns, both data and code). A lot of the obvious optimizations don't work on GPUs. They don't do well with communication outside the warp, and they officially don't communicate between kernels/grids at all - it's undefined behavior, and other grids or threads in that grid are never guaranteed to be scheduled. And every grid launch has parameters that eat into cache and overhead; you really don't want to launch tiny little grids, it's not promise/future or async continuations.

it'd be really interesting to see this exposed as a general-purpose unit too. I get that right now it's tied into the whole RT thing, but "my problem is highly divergent and diffuse, and I want to launch more granular 'promises' and have them be aligned with some level of data locality" (as defined by some cost function) is such a common problem in GPGPU programming. GPUs, again, really are not good with small work items; they don't map cleanly into the kernel/data grid model of launching across a big dataset. A "mixmaster" call which moves a granular async promise to the best locality, executes it there, and returns to a fixed stack point with a couple of parameters would be super super useful. "gosub_aligned" basically.

I've thought of a pretty similar pattern as an "RTOS": break large function invocations into "work tokens" (basically an instruction code and a stack frame), sort those into a grid-wide structure, and then the __device__ code has a big "switch(myInstructionCode):" that calls all possible functions (it would have to be hardcoded at compile-time). Sorting actually is exceedingly cheap on GPUs - the "write a sort index/comparator for my object, then radix sort into workgroups and use prefix-sum to find the index for a given workgroup" strategy is incredibly performant, practically a drop-in replacement for an object-oriented "collection" model at times. Radix sort is lockless and massively parallel and runs super fast on GPUs (there is even an in-place radix sort to let you push a bit larger), and prefix-sum is also massively parallel and super fast. So you have a firmly fixed token capacity, you always know how many tokens will be emitted at state N+1 (because it's a compiled RTOS and you can see all the code), and you define some stack frame for each token. Work units emitting more work units is fine as long as you keep the "stack frame" reasonable per work unit and keep recursion reasonable - and in a near-OOM you could spill to system memory, etc. It would be an interesting toy computer architecture.
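Here's a tiny CPU-side Python toy of that work-token pattern. It's an illustrative sketch, not real GPU code: a plain sort stands in for the radix-sort/prefix-sum pass, and the opcodes and handlers are made up.

code:
# Each token is (opcode, small stack frame); sort tokens by opcode so each
# batch runs one branch of the big switch together, the way a radix sort +
# prefix-sum pass would group them into workgroups.
from itertools import groupby

HANDLERS = {}  # opcode -> handler, the "switch(myInstructionCode)" table

def handler(opcode):
    def register(fn):
        HANDLERS[opcode] = fn
        return fn
    return register

@handler(0)
def op_scale(frame):  # example "instruction": scale a value
    return frame["x"] * frame["k"]

@handler(1)
def op_add(frame):    # example "instruction": add two values
    return frame["a"] + frame["b"]

def run_tokens(tokens):
    # Sorting by opcode stands in for the radix-sort-into-workgroups step;
    # each group is then dispatched as one coherent batch.
    results = []
    for opcode, batch in groupby(sorted(tokens, key=lambda t: t[0]), key=lambda t: t[0]):
        fn = HANDLERS[opcode]
        results.extend(fn(frame) for _, frame in batch)
    return results

print(run_tokens([(1, {"a": 1, "b": 2}), (0, {"x": 3, "k": 4}), (1, {"a": 5, "b": 6})]))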

Wide, aligned full-memory scans are not very expensive on GPU - a 3090 has what, close to a TB/s, to read 24GB of memory? And each read can be broadcast to 32 threads... nor is computation expensive. Random123 is a really cool RNG library because it's completely stateless, so there's no storage needed for each RNG instance, no memory traffic, etc - and pure arithmetic cycles are super super cheap on GPUs, so it's basically RNG on cycles you weren't going to use anyway. In contrast though, most "lock" sorts of models have to be done very carefully! You wouldn't want to "take a lock" below the level of a decent-sized block of threads - because that's built on atomic swap, which is an O(1) lock, and that scales poorly when you are launching 100k threads. Recomputing things is also often cheaper than having to maintain state - state eats (expensive) memory, and in GPUs cycles are cheap. The GPU programming paradigm is just totally different: some things that are expensive on CPU are cheap, some things that are cheap on CPU are expensive. But if your task fits into that "aligned read, or sort and seek" box then yeah, it's pretty fantastic. Divergence kills performance, and this is a cool end-run around that in general, not just for raytracing.
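To show what "completely stateless" means there, here's a toy counter-based generator in the spirit of Random123. This is a splitmix64-style hash mixer, not the actual Philox/Threefry algorithms the library ships.

code:
# The output depends only on (key, counter), so no per-thread RNG state has
# to live in memory: any thread can regenerate "its" stream on demand.
def counter_rng(key: int, counter: int) -> float:
    """Map (key, counter) -> a float in [0, 1) with a splitmix64-style mixer."""
    x = (key * 0x9E3779B97F4A7C15 + counter) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 30; x = (x * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 27; x = (x * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 31
    return x / 2**64

# Same inputs always give the same outputs - no stored state, no memory traffic.
print([counter_rng(key=42, counter=i) for i in range(3)])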

Paul MaudDib fucked around with this message at 05:42 on Jun 26, 2022
