Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

A SWEATY FATBEARD posted:

Many years ago I became an AMD fanboy after witnessing a dualcore Prescott P4 in action. It was the noisiest, hottest running turd I've ever seen, and it was obliterated by ANY Athlon x2. Few years later, the company that gave us A64 architecture has stooped to making eight-core prescotts; noisy and hot FX turds. AMD, I am so disappoint :(

My family has been long-time AMD diehards. My desktops have pretty much always been AMD based - first a K6-2 400, then an Athlon XP 1800+, then an A64 X2 3800+, and most recently a Phenom II X4. I did a lot of video encoding on a Thunderbird and had a media PC running an XP Mobile chip long before TV PCs were common.

At this point I'm trying to figure out a reasonable upgrade path from the Phenom and the math on Bulldozer/Piledriver just doesn't work. The traditional advantage of AMD was power usage, and at this point AMD has become Pentium 4-level bad on power. The per-core performance is solidly better on the i7 and even on highly threaded workloads the AMD processors still underperform. The AMD processors usually have been better value even when they underperformed, but that's no longer true either.

In particular on price, the AMD 970 chipset really really hosed things up. I could see scrounging up some components on the cheap, but all the cheap AM3+ mobos are 970-based. The 970 is advertised as supporting 125W TDP (i.e. Piledriver chips) and some of the mobos specifically advertise support but if you actually try to pull that much power in a sustained fashion (encoding, etc) then the motherboard chipset burns up. So as an upgrade path, that equipment is totally throwaway. You need the new 990 chipset which is just as expensive as buying an Intel motherboard. gently caress that noise.

Paul MaudDib fucked around with this message at 00:02 on Apr 15, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

A SWEATY FATBEARD posted:

On the other hand, FX seems to be doomed to become a novelty part down the line through sheer audacity. What are the sales figures for Bulldozer and Piledriver? In the past AMD had trouble meeting the demand for XP and A64 processors, while FX seems like a pretty hard part to move. :(

Numbers here: http://www.extremetech.com/gaming/175190-amd-beats-earnings-estimates-thanks-to-console-sales-but-apu-outlook-is-bleak

Basically AMD is being driven out of the desktop market. Desktops are declining overall but Intel has suffered a glancing blow while AMD took it on the chin. They're riding pretty heavily on console sales (both the PS4 and the Xbox One use AMD CPUs) as well as their GPU business, which is thriving thanks to Bitcoin and altcoins. They're still in a somewhat unstable position but they're not bleeding out anymore at least.

It's worth noting that both the PS4 and the Xbox One are built around Jaguar, AMD's low-power architecture, rather than Piledriver.

Paul MaudDib fucked around with this message at 03:31 on Apr 15, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Install Windows posted:

I still don't understand what APUs do that are different from any modern CPU with GPU built in. Can someone explain?

It's their term for a CPU with GPU compute capability built in, like Nvidia's Tegra chips with CUDA or something like that.

Theoretically there are advantages to doing it this way; if I remember right, they were working on a fully unified address space, so you avoid the bottleneck of having to offload data to the coprocessor.

In practical terms, I don't think it's much different for most desktop users, since most of the intensive tasks like video decoding are handled by any decent processor with an integrated GPU. It's probably a selling point if you're (say) Sony and you're going to build a software ecosystem around the feature set.

Paul MaudDib fucked around with this message at 04:24 on Apr 20, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Rastor posted:

They are developing a 64-bit ARM core, called K12.

They are also developing a new x86 core which is "a new design built from the ground up". Rumor has it that for this new core AMD is giving up on the CMT design used in Bulldozer/Piledriver/Steamroller/Excavator and will instead go back to something more like what Intel has been using.

With the shift to ARM-architecture APUs (which this presumably is) AMD is now in direct competition with Nvidia designs like the Tegra series. Nvidia is ahead on the 64-bit ARM architecture, but if I remember right they still haven't quite made it to a fully unified memory architecture. I'm really liking the trend of heterogeneous processors working independently inside a unified memory architecture - that really seems like a logical design choice to me. You have your general-purpose processors for random access stuff and the ability to deploy bigger guns on computationally-intensive tasks. The unified memory architecture means there's little penalty for switching processors; you avoid things like the time spent copying data down to the coprocessor.

The Bulldozer/Steamroller/etc architecture was just a bad idea and it looks like even AMD is giving up on fixing it. I'm reading this as AMD capitulating and trying to figure out where to go from here. At least they're aware of how hosed they are.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Sir Unimaginative posted:

If anything, power usage on processors of any mode is going down, and rapidly.

Yeah, the long-term direction has been set by the mobile computing segment, where the trend is towards low-power processors and offloading the heavy-duty work to the GPU cores. That's even AMD's approach in the low-end market with their APU processors.

Obviously that approach doesn't work for everything; brawny cores are still better at some things, but the problem there is that AMD's processors are just too weak for how much power they draw.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Palladium posted:

They are probably way too busy comparing their FXs with i5s to notice they are also probably spending the equivalent of an i7 4790K on motherboards and cooling just to make sure that 200W+ chip doesn't burn up the mobo or melt the chip. What you mean by "opportunity costs?" gently caress that poo poo man!

Yeah, that was what I ran into when I was looking at parts. AMD is just outclassed in pure performance; the remaining niche is "I can get a complete mobo/processor set for less than an Intel mobo alone" ultra-bargain pricing, and the cooling requirements make even that infeasible.

I still think there are enough fanboys and cheapasses out there to move low-end processors and motherboards in the $100-150 range. Basically go after the people who would otherwise buy an Intel NUC but have the space for a bigger machine if there's a performance reason to use one. But you'd basically need to be handing out free watercooling kits to move any of the mid- or high-range parts in preference to the Intel equivalents.

Paul MaudDib fucked around with this message at 22:35 on Aug 10, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Alereon posted:

We're getting kind of far afield here, but keep in mind that the Source engine undergoes pretty regular overhauls as Valve releases new products. You can read about some past engine enhancements on the Valve publications page. These new engine revisions are periodically back-ported to active games, so for example TF2 is currently running Valve's very latest version of the Source engine.

Yeah, and over the years TF2 has gotten worse and worse with the particle effects. They used to be fairly rare; nowadays everyone is a walking particle fountain. Cosmetic models have also gone up in complexity.

Source does a pretty good job of scaling to the available processing resources. When you turn the graphics down TF2 is heavily bottlenecked by the CPU, and most of that happens on a single core. I used to play it on a Compaq CQ56-115DX with a single-core 2.3 GHz AMD V140 CPU and a Mobility Radeon HD 4250 graphics chipset. It couldn't do much more than minimum-spec graphics and it could chug a bit during intense combat, but it was tolerably playable.

Paul MaudDib fucked around with this message at 17:46 on Aug 18, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Lord Windy posted:

Well, I was saying Apple just because they have piles of money. I think they make almost as much in operating profit as Intel does in revenue. Also, wouldn't AMD come with ATI, which isn't terrible?

Apple was one of the big movers behind OpenCL, which seems like it's one of AMD's few significant selling points right now (Jaguar and GPUs). The question is what Apple would get out of it; they haven't really coordinated on designing chips since the PowerPC days, from what I remember.

Paul MaudDib fucked around with this message at 05:42 on Aug 21, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
The problem with neural networks has always been in applying them. Simple nets don't seem to be any more powerful than a linear regression, and complex networks are hard to train to do anything useful.

A brain chip is only as useful as the brain programs it runs.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

1gnoirents posted:

GPUs I guess. My real point was it sounds like it's pretty doomed, they are providing almost negative competition, and I'd rather see that money be funneled into some moon project like what IBM is doing.

Putting GPU cores on the CPU is a much more plausible long-term shot, actually. AMD was the first to market with a fully unified memory architecture, and they're still the only ones who offer that.

Despite NVIDIA's advertising they haven't actually done it. Their CPU/GPU SoC, Tegra K1, still forces you to declare memory as either CPU or GPU for the purposes of caching, and with a discrete GPU there's fundamentally no way to avoid copying data across the bus; at best you can let the programmer pretend that you can and make the copies happen behind the scenes, out of their control.
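
To make the "pretend" concrete, here's a minimal sketch of what that looks like with CUDA 6's managed memory (the kernel name and sizes are made up for illustration): you get one pointer usable from both the CPU and the GPU, but on a discrete card the runtime is still shipping the data over the bus behind your back.

code:
#include <cuda_runtime.h>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* x = nullptr;
    cudaMallocManaged((void**)&x, n * sizeof(float)); // one pointer, "shared" by CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;          // CPU writes: data lives host-side
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);      // GPU touches it: runtime moves the data across the bus
    cudaDeviceSynchronize();                          // must finish before the CPU touches x again
    float check = x[0];                               // data comes back for this read
    cudaFree(x);
    return (check == 2.0f) ? 0 : 1;
}

On an APU with genuinely shared memory there's nothing to ship back and forth, which is the whole point being made here.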

Intel's CPU-based heterogeneous compute has been a total joke up until recently. So far the Phi coprocessor hasn't really had much penetration (probably not a coincidence), far less than CUDA or even OpenCL. I just read an interesting slide deck on the Knights Landing AVX-512 instruction set; it seems like they'll catch up eventually.

In comparison AMD has a product on the market right now that will let you dispatch CPU and GPU cores to the same memory space without having to copy data around. Honestly that's where I see the long-term performance growth being as we go forward - per-processor performance will continue to double every 18 months, but you can get an 8-100x performance bump right now in a lot of applications (not every application) with heterogeneous computing. The fact that you don't need to copy data around makes AMD's product really advantageous in desktop or server-type applications where latency or power matter, or where copy overhead would eat up the performance gains on small problem sizes.

Seems doubtful that AMD can really manage to sell it, though. Technical capability doesn't mean poo poo unless someone's written programs to exploit it and people have processors to run it. Jaguar is doing OK in the mobile market but CUDA's got the momentum so far in technical computing markets. Intel can muscle their way in if they need to; AMD doesn't really have that luxury.

Paul MaudDib fucked around with this message at 02:48 on Aug 22, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Lord Windy posted:

Oh, I meant the drivers. Nvidia does have OpenCL and it works well, but their drivers only work for their video cards and not CPUs. Intel's drivers only work on Intel CPUs and integrated graphics. But AMD Drivers work on everything except for Nvidia cards.

OpenCL is just a specification for some APIs; it's up to the hardware company to implement a compiler and drivers and so on. NVIDIA doesn't implement OpenCL drivers for CPUs because they don't produce CPUs. Intel's drivers only work on CPUs and integrated graphics because they don't produce discrete graphics cards.

It's not even tied to a specific type of heterogeneous computing device; you can also use it to dispatch work to things like DSPs or FPGAs or CPU cores. That's actually one of the core challenges with OpenCL - it's really kind of a vague standard and it's up to the hardware manufacturer to do the legwork on it.
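
As a quick illustration of the "it's just an API spec" point, this is roughly what enumerating the available devices looks like (buffer sizes here are arbitrary); the host code doesn't care whether a vendor's driver exposes CPUs, GPUs, or something else entirely - you just get whatever platforms the installed runtimes advertise.

code:
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);           // one platform per installed vendor runtime
    for (cl_uint p = 0; p < nplat; ++p) {
        char pname[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);
        cl_device_id devices[16];
        cl_uint ndev = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &ndev);
        for (cl_uint d = 0; d < ndev; ++d) {
            char dname[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
            printf("%s: %s\n", pname, dname);         // e.g. AMD's platform listing both a CPU and a GPU device
        }
    }
    return 0;
}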

Paul MaudDib fucked around with this message at 00:24 on Aug 22, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

JawnV6 posted:

General question, but how's it do with true/false sharing scenarios split over CPU/GPU? Disallowed, allowed-but-not-guaranteed-coherent, allowed-but-slow? Cursory googling didn't bring anything up, even after assuring the search engine I didn't mean "and."

I was curious and did some brief searching on this earlier, here's what I came up with on a Kaveri chip:

quote:

CPU-GPU coherence: So far we have discussed how the GPU can read/write from the CPU address space without any data copies. However, that is not the full story. CPU and the GPU may want to work together on a given problem. For some types of problems, it is critical that the CPU/GPU be able to see each other's writes during the computation. This is non-trivial because of issues such as caches. HSA memory model provides optional coherence between the CPU and the GPU through what is called acquire-release type memory instructions. However, this coherence comes at a performance cost and thus HSA provides mechanisms for the programmer to express when CPU/GPU coherence is not required. Apart from coherent memory instructions, HSA also provides atomic instructions that allow the CPU and the GPU to read/write atomically from a given memory location. These ‘platform atomics’ are designed to work as regular atomics do, i.e. provide a read-modify-write operation in a single instruction without developing custom fences or locks around the data element or data set.
http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/6

There's more there; it's not a bad overview of the current "unified memory architecture" situation for NVIDIA/Intel/etc too. AMD has clearly put a lot of work into making it possible to mix CPU and GPU computation in more mundane desktop/server tasks. Intel's moving in that direction too with Knights Landing.

e: Forgot the link :downs:

Paul MaudDib fucked around with this message at 02:43 on Aug 22, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Menacer posted:

You can choose to put data in a region of system memory dedicated for GPU's framebuffer and use the non-coherent "Garlic" bus to get full-speed memory accesses from the GPU. You can also choose to put data in regular system memory and access it over the coherent "Onion" bus, which is currently slower. Source. Note that the throughput numbers on slide 10 are likely per-channel; I've measured an A10-7850K with dual-channel DDR3-2133 hitting about 30 GB/s over Garlic and 18 GB/s over Onion.

That's about the bandwidth a GDDR3 media PC card gets to its own GPU memory. A GT 640 (GDDR3, 384 Kepler cores) is 28.5 GB/s. A GTX 650 (same thing with GDDR5 memory) is about 80 GB/s. A high-end GDDR5 card like the K40 (2880 cores and tweaked for compute) does 288 GB/s.

Pretty good given that system memory is only DDR3 and you can use as much of it as you want, potentially intertwined with CPU segments.

e: The A10-7850K has 512 stream processor cores. Any idea how a Kaveri core compares to a Kepler core?

Paul MaudDib fucked around with this message at 02:59 on Aug 22, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
e: This is the wrong way to look at it, but I contend that over the 6 years since CUDA launched, the ability to manage more cores doing more useful work (thanks to improved architectures/frameworks) has probably kept increasing performance relative to well-established CPU computation. Server CPUs haven't gotten much more parallel since 2008, CUDA has gone from 128 cores to 2880 per die, the architecture (instructions/bandwidth balance/featuresets/etc) is more optimized for compute work instead of graphics rendering, and there's been more time to work on optimizing algorithms for GPU processors. And working with zero-copy on die is inherently more powerful than a co-processor on a bus in terms of latency and flexibility.

Stuff like database access seems to be relatively amenable to GPU acceleration.

Paul MaudDib fucked around with this message at 05:05 on Aug 22, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Professor Science posted:

if you port a naive (non-SSE/AVX, single-threaded, not really optimized at all) C++ app to CUDA and optimize the hell out of it, yeah, you might get 10 or 20x. if you actually optimize the CPU code and run it on anything approaching 250W worth of CPUs, yeah, you might get 10 or 20x out of it there too. GPUs offer advantages, but it's ~2.5x versus well-optimized CPU code on apps that are especially suitable for GPUs.

You should be comparing a multi-threaded or GPU implementation against a single-threaded implementation. That's literally the definition of speedup. Saying that "GPUs are only 2.5x faster than a 20x multithreaded implementation" is really abusing the concept of speedup, and that's not the measurement I was using. The correct way to look at that is "20x multithreaded or 50x GPU speedup".

There's something to be said for an architecture that can maintain performance when scaled up to 250W per socket, too. Any decent CPU maxes out at around 150W per socket and most motherboards only have 2 sockets tops, whereas you can reasonably stack 2x 250W GPUs in a single box. Even that's being a bit gracious and comparing a server motherboard for the CPUs to a standard gaming motherboard for the GPUs; there are off-the-shelf motherboards supporting up to 4x 16-lane PCI-e slots, whereas I don't know of a way to get 1000W of CPUs onto a single motherboard. Maybe up to 4x CPU sockets, so something like 600W is probably the limit.

A pair of sockets is a fundamentally less powerful approach than one big processor, and a pair of machines is fundamentally less powerful than one machine with more processors, due to the limits/overhead of cache/memory coherency, interconnection latency/bandwidth, etc. It's not reasonable to compare a real-world GPU to a fantasy 250-500W CPU that provides exactly proportional performance to a 150W CPU. The wattage comparison is sorta reasonable at 1x GPU vs 2x CPU; it starts breaking down beyond that.

Professor Science posted:

database access for sets that fits entirely within GPU memory works well because GPUs have very high memory bandwidth, and if you're doing lots of queries simultaneously and can hide memory latency, yeah, you get 6x over a single Intel CPU because of GDDR5 vs DDR3 bandwidth alone.

This is really kind of taking a lot of engineering work for granted. Any single-processor device that can process as fast as a GPU is going to have to be attached to a seriously fast memory system to keep it fed, that's inherently true, but at the same time you can't just handwave and say that the entire performance is the result of having some fast memory. It's more than having fast processors and more than having a fast memory subsystem; it takes system engineering and a program tailored to the architecture.

Anyway, as a general observation, the claim that GPUs are faster just because they have fast memory doesn't really stand up - CPUs can actually access more memory per floating point operation than a GPU can. Obviously there aren't a lot of floating point operations in data access, but in terms of general program operation GPUs can actually address much less memory per core per cycle. That's why there's so much work put into memory coalescing/broadcasting and so on - they actually need tricks to keep the system fed. Even in something like database searching, some of the gain is not just from memory bandwidth; it's from algorithms that let the hardware coalesce memory requests. As a simplification, think of binary searching - if you have a bunch of threads searching the upper parts of the tree, they're accessing the same data, and the hardware lets you combine a significant fraction of the memory accesses into single requests which are then broadcast to the threads in a warp.

http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
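
A tiny sketch of what the coalescing point means in CUDA terms (the kernel names are made up): adjacent threads touching adjacent addresses get their loads combined into a few wide transactions per warp, while a strided pattern turns into many separate transactions and wastes most of that bandwidth.

code:
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // thread k reads element k: one wide transaction per warp
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];  // neighboring threads hit far-apart addresses: many transactions
}

Same arithmetic in both kernels, very different fraction of the theoretical bandwidth actually delivered.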

Obviously global memory is limited in size - that's one of the big limitations of GPUs - but so is L3 cache, or any other high-performance, low-capacity memory. For a sense of perspective, that's about the same bandwidth as Haswell can deliver from its L2 cache. Haswell offers 4x 256KB at that bandwidth; a K40 offers 12GB total, or 4.16MB per core. It does so at a higher latency, of course, so you need sufficient in-flight requests to cover the latency.

There's also a ton of work devoted to getting around the memory size limitation. Multiple GPUs in a system can do DMAs to each other's memory, so you can get up to 48GB of GDDR5 memory per system with 4x K40s. In theory you could also do DMA to something like a PCI-e SSD, which might offer better latency (but lower bandwidth) than main system memory.

Professor Science posted:

in general, I think we have decent tools and programming models for writing kernels on specific GPUs. what we don't have is any sort of way to write applications with large parallel sections that run on the appropriate compute device. until the latter gets solved, GPU compute will remain a novelty limited to HPC, embedded platforms, and the occasional developer relations stunt. I don't think coherent memory does the latter on its own, although it's a nice step--it's more of an improvement for the "writing kernels" part.

We actually do have tools that let you write large parallel sections easily - OpenMP and OpenACC frameworks for exploiting loop-level parallelism in particular. On a finer grain there's the option of marking calls as tasks and then synchronizing, etc.

The problem is that tools are much less of an effective solution when there's big overhead to start/stop a co-processor. That's my problem with the OpenACC approach to GPUs - it doesn't make sense to copy data down to the coprocessor, invoke and synchronize the kernel, and then copy the results back up just to parallelize a single loop; GPU programs really should be running mostly on the GPU rather than shuttling data back and forth all the time. It makes sense for intensive portions that are reasonably separate from the rest of the program, but a generic "send this part of the program to the GPU" isn't really going to be the best solution in a lot of cases. The same applies to the CUDA 6.0 "unified memory" feature - hiding the copies from the programmer is a low-performance solution when there are high overhead costs.
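
For anyone who hasn't seen it, the OpenACC pattern I'm complaining about looks roughly like this (the function and clause choices are mine, not from any particular codebase); on a discrete GPU the copyin/copy clauses imply a transfer down and back around this one loop, which is exactly the shuttling an APU would make cheap.

code:
// saxpy offloaded with a single OpenACC directive
void saxpy(int n, float a, const float* x, float* y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}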

In comparison I think the OpenACC approach could be much more appropriate on a platform like APUs, because there's very little overhead involved with invoking the stream processors. The OpenMP model is a workable approach to parallelizing existing code, and extending it to support different types of compute elements seems like a rational step.

Of course, all such tools do have the problem that their design goal is to make it easy to write/adapt generic code. Writing applications using architecture-specific features can make a pretty big impact in performance. One reasonable approach to this would be to let you compile in OpenCL or CUDA code directly - i.e. you should be able to call __device__ functions that can use the architecture-specific features, perhaps with multiple versions to target different architectures.

Paul MaudDib fucked around with this message at 22:29 on Aug 27, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
What I'm getting from your earlier post is that most multithreaded applications are written with some type of "task-based" parallelization (you said TBB). For those reading along, this basically means that when a work item is received, the processing work is broken into a series of discrete tasks which can be accomplished in parallel. These tasks can then spawn further tasks of their own. There is often a requirement that all child tasks be finished before processing is allowed to continue. This sets up a dependency tree of tasks. So you might have:

Work Item: spawn A, B, wait until A and B are completed, launch C, D
Task A: no children
Task B: spawn D, E, wait until completion
Task C: no children
Task D: no children
Task E: no children
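
A minimal sketch of that dependency tree, using std::async as a stand-in for a TBB-style task system (all the task bodies are empty placeholders):

code:
#include <future>

void taskA() {}
void taskC() {}
void taskD() {}
void taskE() {}

void taskB() {                        // B spawns D and E and waits for them
    auto d = std::async(std::launch::async, taskD);
    auto e = std::async(std::launch::async, taskE);
    d.wait(); e.wait();
}

void work_item() {
    auto a = std::async(std::launch::async, taskA);
    auto b = std::async(std::launch::async, taskB);
    a.wait(); b.wait();               // children must finish before processing continues
    auto c = std::async(std::launch::async, taskC);
    auto d = std::async(std::launch::async, taskD);
    c.wait(); d.wait();
}

int main() { work_item(); return 0; }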

In contrast there's also loop-based parallelism. In some types of work, you have some big for loop that iterates and processes data items. Loop-based parallelism approaches this by spreading the loop iterations between threads, to turn a for loop into data parallelism. So if you have 4 threads and 8 work items, you might get

Thread0: data0, data1
Thread1: data2, data3
Thread2: data4, data5
Thread3: data6, data7

OpenMP and OpenACC are frameworks that handle distributing the iterations between threads.
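
For comparison, the GPU flavor of the same loop-splitting idea is the standard CUDA grid-stride loop (hand-written here, not OpenMP/OpenACC output): each thread takes iterations i, i+stride, i+2*stride, and so on.

code:
__global__ void process_all(float* data, int n) {
    int stride = blockDim.x * gridDim.x;              // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= 2.0f;                              // stand-in for "process data[i]"
}

// launched as e.g. process_all<<<64, 256>>>(dev_ptr, n); the exact grid/block
// sizes are a tuning knob, which comes up again further down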

Professor Science posted:

it's pretty straightforward. based on hardware alone, GPUs offer a pretty good value proposition for HPC (lots of BW, lots of FLOPs), but they are hamstrung by two big software issues.

1. it takes a lot of parallel work to fill up a GPU, and GPUs can't timeslice between processes or anything like that when one task isn't offering enough work to completely fill the GPU. this limitation generally doesn't apply to CPU apps, so the problem with porting becomes more about restructuring the rest of your app to get enough work in a single place at a time than porting some kernels.

GPUs are fundamentally data-parallel processors; they operate by lining up a bunch of pieces of data and performing the same sequence of operations on multiple pieces of data at once (a "warp" of threads). Task-based parallelism is not a good fit for SIMD processors, because all threads in a warp must follow all code paths. If you have 4 different code paths for your different tasks, every thread in a warp has to step through all 4 code paths, with the non-participating threads in the warp masked off on each path. Or with dynamic parallelism, you need to invoke 4 more kernels (4 32-thread kernels will launch, of which 31/32 threads will idle) and then synchronize.
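
Here's a hedged sketch of why that hurts (the "tasks" are made-up math): all 32 threads of a warp step through every branch that any of them takes, with the non-participating lanes masked off, so four mixed task types get executed one after another.

code:
__global__ void divergent_tasks(const int* task_type, float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    switch (task_type[i] & 3) {                     // 4 different "tasks" mixed within one warp
        case 0: data[i] = sinf(data[i]);    break;  // the warp runs this path...
        case 1: data[i] = cosf(data[i]);    break;  // ...then this one...
        case 2: data[i] = sqrtf(data[i]);   break;  // ...then this one...
        case 3: data[i] = data[i] * data[i]; break; // ...then this one, serially
    }
}

If the items were sorted so each warp only sees one task type, the four paths wouldn't be serialized - which is exactly the "generate your own data parallelism" idea below.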

I think the number of programs that really, truly can only express a low degree of parallelism is pretty low, though. As an example, consider something like video compression - you need to base frame n on the encoding of the prior frame n-1, so in a naive approach you can only process one frame at a time. Even so there are ways to artificially boost that - for example, maybe you can generate some estimate of what the previous frame is going to be (assume that it was encoded perfectly and there's no artifacting, etc) and then propagate an "error frame" forward that represents the difference from your estimate. Or do a search for keyframes, which mark the beginning of independent sequences of the video, and then process the independent sequences in parallel. That's what a lot of the "rewriting algorithms" work looks like - finding ways to expose greater degrees of parallelism, instead of just having a single thread that chugs through a serial algorithm.

I think one strategy is to try and generate your own data parallelism as you go. Latency already sucks compared to CPUs, so just embrace it and let a runtime batch up data items/tasks and then process them in parallel when possible. In terms of implementation strategies, I think that would end up looking a lot like the Erlang runtime. You have something that looks like lightweight threads which a runtime or the scheduler batches and then dispatches when a sufficient number of data items are ready (or the processor is under-utilized enough that processing partial batches doesn't matter).

"Lightweight threads on GPU" isn't really an insignificant task, but I don't think it's insurmountable either. GPUs are built to launch vastly more threads than the multiprocessors can actually handle at once, to cover latency. Until the data is ready, the threads block and other warps can execute, Only here the launch condition is "32 successful stores from other threads' shared memory" instead of the current "32 successful reads from global memory". That looks a lot like Erlang actors to me - threads sitting idle until they're needed. I don't think you can make idling threads work with the current scheduler, but it doesn't seem like too much of a stretch, and you could potentially get the same effect using dynamic parallelism and just launching new kernels instead of waking up sleeping threads. That has overhead too of course, but you're amortizing it across more than 1 data item per launch, and kernel launch overhead is much, much lower from the kernel (onboard the GPU) than from the CPU.

And it should be pretty easy on APUs, which is again why I think they're such an interesting tech - the overhead to invoke the GPU is really low and the CPU can wrangle things into batches that are worthwhile to apply SIMD processing to. Instead of "make this entire program work on GPU" you're now talking about a more manageable target of "wrangle up a workable degree of parallelism with the CPU and then use the SIMD cores". It'll never be worth doing for a half-dozen lines of code, but it should be worth using on stuff that is intensive enough to be worth explicitly marking as a task for parallelization.

quote:

2. when everything about your platform (GPU type, GPU count, whether GPUs are shared or exclusive, CPU count, PCIe perf, ...) isn't known a priori, writing an application that uses the right processor at the right time becomes really hard. it's partially a language issue and partially a runtime issue. language-wise, the CUDA execution model (later used by OpenCL and DirectCompute) exposes some extremely low-level hardware details and in fact requires users to specify things at that level in order to get good performance. these include exact details of the memory hierarchy, warp widths and warp-synchronous programming, how a kernel traverses over data, etc--the developer has to control all of this to get something that runs fast. since those details change from GPU generation to GPU generation, there's no one ideal GPU kernel. runtime wise, there's no way to figure out how busy a GPU is (other than "is anyone else potentially using this GPU at all"), so picking a particular GPU to use in a multi-GPU system is also really hard. couple that with GPU performance variance between different models or different vendors, and the question rapidly becomes "should I use a GPU at all right now or stick to the CPU."

Well some of this is real and other stuff isn't.

I generally agree that CUDA and OpenCL expose a lot of low-level mechanics to the programmer, but you don't really need to hand-tune your program to every single device to get good performance. If you're writing low-level C code for your entire program, yeah, your life is going to suck, but the higher-productivity way here is to write your program in terms of template library calls and then let the library authors handle tuning the operations to the various architectures. That's Thrust, CUB (CUDA Unbound), CUDPP, and so on, which handle warp-, block-, and device-wide collective operations. All of those are designed to be drop-in solutions that will work on any architecture or block size. As on CPUs, not all portions of your program are really critical, and libraries often cover most of the critical parts fairly well.
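
For example, the Thrust version of "sort a big array and sum it" is a couple of library calls, and the kernels and block sizes underneath are chosen per architecture by the library (the array size here is arbitrary):

code:
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    thrust::device_vector<float> v(1 << 20, 1.0f);            // 1M floats on the GPU
    thrust::sort(v.begin(), v.end());                         // architecture-tuned sort
    float total = thrust::reduce(v.begin(), v.end(), 0.0f);   // architecture-tuned reduction
    return (total > 0.0f) ? 0 : 1;
}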

As for grid topology, you can write some math which does some rough occupancy calculations at runtime and get a reasonable guesstimate. Here's Thrust's implementation. A wild-rear end guess of 32/64/128/256 threads and enough blocks to saturate the processor usually doesn't produce awful results, and there's a profiler that'll give you real-world tuning on this. The only real hard rule is that you should use a round multiple of your warp size; if you launch a 63-thread block you're cruising for trouble. Powers of 2 are also useful for exponential reduction patterns.

Not quite sure what you mean by "warp synchronization". If you mean thread-fences around shared memory operations, that's roughly equivalent to the trouble caused on CPUs by forgetting to wait for task completion. It's a thing you need to remember, but one you'll notice right away when your 2880-core processor spews obvious race conditions at you. That previous example uses them (__syncthreads()); they're pretty basic.
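
To show what those fences look like in practice, here's a bog-standard block-wide sum in shared memory (a simplified sketch that assumes a 256-thread block); drop either __syncthreads() and you get exactly the kind of race condition I'm talking about.

code:
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[256];                    // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                              // every thread's value is in place before anyone reads
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();                          // fence between reduction steps
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];  // one partial sum per block
}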

Other stuff here is not real at all. Warp size has been 32 threads for every CUDA Compute Capability spec so far. Traversing the kernel's grid over data is pretty straightforward, and the fact that blocks can't communicate basically eliminates the possibility that you're doing something not straightforward. If threads need to communicate, they go in a block together. There is definitely an API that lets you get GPU utilization.

quote:

the reason why GPUs have been pretty successful in HPC thus far is because #2 doesn't really apply--if you're buying 10-40k GPUs, you generally buy the same kind of GPUs and write software for that machine. most of the software improvements in recent years have been focused on #1 (GPU work creation, improved support for concurrent kernels, multi-process support, even interaction with InfiniBand and other NICs), and the rest of the porting problem can get brute forced by investing in enough software engineers. meanwhile, #2 is as critical an issue in desktop and mobile as #1, and there's been very little work to solve that. OpenACC and Thrust come to mind to solve the language issue, but there's still very little research on good runtime schedulers that I'm aware of (the last big thing was StarPU, and that was what, four or five years ago?). over time, #2 will become more important for HPC as there's less of a CUDA monoculture, but I don't have any idea as to what direction that will take right now.

Yeah, GPUs are not general-purpose computers, yet. They're obviously headed in that direction though, given CPU/GPU SoCs, Dynamic Parallelism, APUs, and so on.

(sorry AMD guys, my experience here is mostly NVIDIA/CUDA :shobon: I know there's an equivalent library to Thrust for OpenCL, it's called Bolt)

Paul MaudDib fucked around with this message at 02:09 on Aug 29, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

eggyolk posted:

I love the discussion but only made it this far into your post before not understanding anything anymore.

It's a question of CPU-bound programs versus memory-bound programs.

GPUs have a shitload of memory bandwidth because of their GDDR5 memory. A K40 has 288 GB/s of bandwidth, compared to <30 GB/s for DDR3 RAM. Memory is accessed through "ports"; GDDR5 is a special type which (pretends to) allow two processors to access the same port at once, and the bandwidth is obviously huge. On top of that there's a shitload of engineering to get more out of the bandwidth that's there - for example, threads in the same warp (32 threads) can combine accesses to sequential addresses into a single request which is broadcast to all of them, and so on.

Because they were originally designed to process graphics, they also have various special features which are designed to help with that. For example, most memory operates with what's called 1d locality - memory is a flat space divided into pages, meaning accesses near a previously requested address are likely cached and happen much faster. By various tricks, GPUs also allow 2d and 3d locality, which caches data that is close in 2d/3d space even if that doesn't translate into a nearby 1d address, and that helps some problems. On top of that there are special access modes that can perform extra floating-point calculations (interpolation, etc) for "free" - the memory access and some calculations are combined into a single operation which doesn't use core processor cycles, but does have some extra latency.

This is all necessary because GPUs put out a ton of floating point calculations (FLOPs). An i7-4770K puts out around 32 GFLOPs per core (x4); a K40 puts out about 4,290 GFLOPs. It's not a trivial task to feed that much data to the processor, even with that much bandwidth. An implication of this is that GPUs can run much more compute-intensive programs - from what I remember Kepler is up around 64 floating point operations per float of memory access, versus ~8 for a Haswell. On the other hand that's also problematic - it's easy for bandwidth to be the limiting factor, and memory bandwidth hasn't kept up with processor power. There's also not that much memory per card, and the workarounds aren't great; you're down to ~3-6 GB/s when accessing anything off-card.

Another factor here is latency. It does take a while to service all the FP calculations, memory requests, and possible code paths 2880 cores can throw out, even with bandwidth and tricks. So unlike a CPU, where you usually have ~1-2 threads per core, GPU programs often operate on the premise of "fucktons of threads, most of which are blocked". At peak a K40 can keep track of up to about 31k resident threads even if it can only execute 2880 at a time. In most cases multiprocessor shared memory or register pressure will limit that number, though.

Most of the architecture and programming paradigm is designed around making that manageable, and you need enough independent data items to keep it busy, so it's not universally applicable. Often it gets used to accelerate specific parts of programs with most of it remaining on the CPU, which seems pretty obviously better addressed with an APU. Or more likely, a Xeon Phi, which is the same-ish thing in that role.

Paul MaudDib fucked around with this message at 01:46 on Aug 29, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Professor Science posted:

(I've written a _lot_ of CUDA. also blocks can communicate, I'll leave that as an exercise to the reader)

Spinlocked infinite-loop kernels? I was under the impression that falls into "seriously undefined behavior".

Professor Science posted:

Let me blow your mind: for a single kernel launch, this is absolutely false. GDDR5 latency sucks a lot. Like a whole, whole lot. GPUs may have 5-6x the memory bandwidth of CPUs, but they do so by having ~15x the memory latency. Reading across PCIe after fiddling a register is better than running a scheduler on a GPU on GDDR5. (If you can batch N kernel launches together at the same time, then yeah, it'll perform better from the GPU.)

For global memory, sure, which is why I specified "shared memory". On-multiprocessor memory has much lower latency. Dynamic parallelism should then avoid this, right? Is there huge latency to dispatch a kernel to another multiprocessor?

I heard you can also play games with sharing local memory?

It would probably be manageable/feasible to run a single runtime engine/task scheduler per device, at least, if there's such a need for tasks and sufficient memory capacity to run multiple nodes per machine. The function calls should be the same, just different parameters (memory coherency would be reduced). I guess that should go in at the hardware/firmware level.

Paul MaudDib fucked around with this message at 03:34 on Aug 29, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Professor Science posted:

edit: wait, do you mean launching out of the equivalent of __shared__?

quote:

A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size
http://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/

So - basically you have some program with an update loop, with high thread divergence - say you have like 4/32 threads per warp that need to process something (making 32/32 threads sit through long latency). Or some other task/request/query that sporadically or asynchronously occurs but takes some effort to process.

First you do a prefix-sum+atomicAdd+write or something to write the parameters for each task found (for each multiprocessor) into shared memory, with return pointers, etc. Iterate one update loop for all nodes of the problem, or until you get enough tasks to be worth dispatching early, then process all the tasks. Then process the next update loop and so on. Assuming this task selection process is free, we've increased our active thread count during this intensive task from 4/32 to 32/32 threads per warp, which is a significant increase in processing power, and we're not making every data item sit through the latency for the 12% that are processed.

If you launch a kernel from there to pick off the parameters and do the tasks, it shouldn't add much additional access latency over the update loop calling the function directly (just a couple round-trips from shared memory, which is quick), but it should decrease divergence because most of the threads are doing things (as opposed to being masked off as disabled), right?

Implement that with a lightweight threading system to process the tasks. Do warp scheduling that allows cross-block/cross-grid processing of reasonably large tasks based on function call scheduling, or instruction address, or something like that. That's basically the "task parallel" approach on GPU; you're not going to get single data items to process efficiently.
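
A rough sketch of the compaction step being described, using a plain atomicAdd'd counter in shared memory instead of a prefix sum (the names and the 256-thread block size are my own choices, and the "expensive task" is a placeholder):

code:
__global__ void compact_and_process(const int* needs_work, const float* params,
                                    float* out, int n) {
    __shared__ float queue[256];                  // assumes blockDim.x <= 256
    __shared__ int   count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && needs_work[i]) {
        int slot = atomicAdd(&count, 1);          // grab a slot in the block's task queue
        queue[slot] = params[i];
    }
    __syncthreads();                              // queue is now densely packed

    // the whole block chews through the dense list: no more 4/32 warps
    for (int t = threadIdx.x; t < count; t += blockDim.x)
        out[blockIdx.x * blockDim.x + t] = expf(queue[t]);   // placeholder for the real task
}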

Paul MaudDib fucked around with this message at 04:46 on Aug 29, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Professor Science posted:

maybe I'm not following you, but either you can't assume that the same SM is going to run the new kernel (%smid in PTX is effectively volatile, after all) or you don't need dynamic parallelism.

Yeah I guess that's wrong, dynamic parallelism kernels can't be passed shared or local memory. You could do block-wide task processing without DP I guess.

The ability of a kernel to be instantiated on a single SM probably would be really useful though. Like being able to lock a stream's processor affinity or something like that. That plus the ability for the hardware scheduler to block until incoming memory writes are completed (as opposed to outgoing memory requests) would basically be a lightweight task/threading system, which is what I've been getting at.

And again this is all way easier assuming you have a real CPU to handle data wrangling on, so APUs should be better at tasks.

quote:

also you should get plat or some other easy means of contact.

AIM's good. In the profile. Or

Paul MaudDib fucked around with this message at 04:49 on Aug 29, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Menacer posted:

You can replace both the write and the spin-read with atomics, which should push those writes and reads out to the coherence point (the shared L2 in AMD GPUs). You'll also need to be careful with atomicity here, because atomic operations are not ordering fences in a workgroup -- you would need to put explicit global memory barrier in place to make sure all of the threads in a workgroup are done with these atomic writes before you move on.

It's worth noting that atomics completely bypass the caching system in CUDA (obv not AMD). They are explicitly operations that are guaranteed to have hit their target memory before the call unblocks, with all the costs and/or implied failure modes that entails.

Paul MaudDib fucked around with this message at 05:32 on Aug 29, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Rastor posted:

AMD just announced earnings. Profits down 65%, they are going to lay off 7% of staff (about 700 people).

In particular their GPU revenue is really starting to decline. That's one of their lifelines at this point, so that's not a good sign.

They really need to get their new GPU architecture out ASAP. Supposedly their fab partner has been struggling with the 20nm process node.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

El Scotch posted:

When the day comes to replace my 3570k AMD will have something worth buying, right guys?

Right?

:smith:

The Kabini can be worth buying, in certain circumstances. If you want a low-power processor that'll do AES-NI, you don't have a ton of options. The Athlon 5350 is an OK laptop-level processor.

High-power stuff (>50W)? Nah, buy a Pentium AE or an i5/i7.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

thebigcow posted:

They hired a bunch of guys from DEC that worked on the Alpha right around the time Compaq bought what was left of DEC. Those were the people that made the Athlon.


That's what My Father From DEC has always claimed. "The Athlon XP/64 was just the practical commercialization of the Alpha architecture". Good to hear it's not just DadTales.

a survivor who works for HP now

Paul MaudDib fucked around with this message at 09:36 on Dec 25, 2014

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Lord Dudeguy posted:

Yeah a 10% bump for $250 (A10-7800 & FM2+ miniATX mobo) probably won't fit the bill. I'll wait for the 8000s. Hopefully they still make a 65w TDP version.

Thanks for the links!

You could also pick up a Pentium G3258 and a motherboard bundled together for around $100. That buys you an enormous bump in single-core performance (~63%) and it will likely tie even on highly threaded workloads. Toss in an overclock of up to 50% and the numbers get even more stark. Plus, after this $100 you've got an upgrade path with a manufacturer that isn't circling the drain. At some future point you could pick up a Haswell i5 or i7 pretty easily.

A $100 G3258 destroys anything AMD has to offer in single-core performance, and you have to start talking about a very hefty, very expensive, high-TDP processor on a very hefty, very expensive motherboard (990 chipset to handle the TDP) with lots of cooling before AMD can compete even on threading-friendly workloads, in which case you're looking at like $300 to even compete with the "budget" processor. And I'm talking about AM3+ here - FM2 is just outclassed period. Step up to an equivalent $300 worth of Intel kit and AMD is trashed again.

People have slung around phrases like "hating AMD" - I ran their processors for something like 20 years, starting with a K6-2 450, my second system runs a Phenom II, I literally just built a Kabini system as a low-power always-on server - but right now you are literally wasting money buying an AMD except for extremely narrow use-cases. If Intel's J1900 did AES-NI I probably wouldn't have bought the Kabini.

Paul MaudDib fucked around with this message at 03:05 on Jan 4, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Lord Dudeguy posted:

How does it compare on IGP performance? That's the key to this whole puzzle, and why I haven't switched to Intel. I'm running my APU in a low-clearance, 150w chassis and 720p gaming is a must.

The Pentium G3258 has an Intel HD 4600; the A10 5700 has a Radeon HD 7660D. According to this site, the topline "conslusion" [sic] is that the Radeon is ~10% faster.

Define "low clearance" - do you mean "not possible to mount a GPU at all", or just "fullsize cards won't fit"? I think gaming at 720p is probably problematic, your current rig has got to be running low settings and not hitting 60fps. Even a cheap media PC GPU is probably going to outperform onboard, and your money is way better off sitting in a discrete card. $150 of gpu buys you a R9 280, which is going to be capable of doing high settings at 1080p
.
Could you swing building a larger PC (capable of mounting a GPU) and doing in-home streaming to a media PC, possibly? You don't need a lot of power to render a display...

Paul MaudDib fucked around with this message at 04:05 on Jan 4, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Lord Dudeguy posted:

The case is only 4" tall. Only half-height cards will fit. Then there's the wattage limit.

Half-height is doable. I have one of these in one of my PCs, picked it up for $10 AR. A modern equivalent might be something like a GTX 750 (NVIDIA products are usually more power efficient than their AMD equivalent). Could you swing another 65 watts (bearing in mind that a good chunk of your CPU TDP is invested in the GPU and will go inactive once you have a discrete card)?

e: Radeon 7750 is down around 45W TDP, at ~50% performance loss.

Paul MaudDib fucked around with this message at 04:13 on Jan 4, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Lord Dudeguy posted:

I think the PSU is custom to the case. I'm not mechanically savvy enough to start modding a bigger PSU into it. The PicoPSUs don't look to meet the 400w requirement of the 750.

GPU recommendations are usually system totals, and generous ones at that. The actual measurement is (CPU idle, GPU full load) - (CPU idle, GPU idle), and the 750 works out around 65 watts. The question is how much the rest of your system draws.

Also you can buy PicoPSUs up to 160W continuous (200W peak).

When you're dealing with a tiny power supply and you want to play games at HD resolution with playable FPS and can't stream from a better PC, there's not going to be 50% extra wattage as a safety margin; you're going to have to load things up to the rated limits.

Paul MaudDib fucked around with this message at 04:25 on Jan 4, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

SwissArmyDruid posted:

Now (and I realize this is a pipedream yet) the only thing that remains to be done is a processor that can look at a given task and figure out what would be done best with the CPU and what would be done best with the GPU, then assign appropriately without the need for OpenCL.

So, OpenCL is just an API specification for dispatching work. As far as the processor automatically identifying data-parallel sections goes, that already happens in modern processors and compilers; the low-hanging fruit has been more or less picked there. You might be able to get some sorta-OK auto-parallelization using a compiler with additional annotations. Running CPU algorithms directly on the GPU usually isn't an efficient way to make use of data-parallel processors, though; for real gains you are probably going to need a programmer to rewrite whole methods if not whole sections of the program. It's not something that can really be done automagically.

OpenCL actually doesn't even provide a compiler or runtime - that's up to the hardware manufacturer. Which is why uptake has been so slow. It doesn't currently have any sort of auto-benchmarking system to determine whether deploying heterogeneous compute resources would be advantageous, even if you have the binary right there. Assuming you have equivalent methods for the GPU, you could probably make an auto-tuning suite to decide whether to turn them on. There could potentially be some issues with linking different code segments together, and even in the best case you'd have some seriously bloated binaries, since they'd have code compiled for AMD processors, AMD GPUs, Intel processors, Intel GPUs, NVIDIA GPUs, and so on. I don't even know how you would handle more than one runtime "owning" a process, e.g. what if a Xeon Phi running Intel's OpenCL dispatches to NVIDIA's GPU?

I will say that it is an interesting pipe dream I've thought about too. Getting programmers to write OpenCL seems to be the real bottleneck, but it's a chicken and egg situation since it's hard to deploy and hit more than a handful of special-interest users.

At minimum to make the auto-tuning idea work, I think you'd need library-style programs that could be linked at install time, plus a common runtime. So you download .obj or .dll files for x64, APU, etc and try to find an optimal mix for a simulated task load, then link it and install.
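
A toy sketch of the "decide at runtime" half of that idea, with a made-up size threshold standing in for a real benchmarked profile (everything here - names, cutoff, the trivial kernel - is illustrative): small problems stay on the CPU, big ones go to whatever CUDA device is present.

code:
#include <cuda_runtime.h>
#include <vector>

__global__ void scale_kernel(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void scale(std::vector<float>& x, float a) {
    const size_t gpu_threshold = 1 << 20;     // made-up cutoff; a real system would tune this per machine
    int devices = 0;
    if (x.size() < gpu_threshold || cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) {
        for (float& v : x) v *= a;            // CPU path: not worth the offload overhead
        return;
    }
    float* d = nullptr;
    size_t bytes = x.size() * sizeof(float);
    cudaMalloc((void**)&d, bytes);
    cudaMemcpy(d, x.data(), bytes, cudaMemcpyHostToDevice);
    scale_kernel<<<(int)((x.size() + 255) / 256), 256>>>(d, (int)x.size(), a);
    cudaMemcpy(x.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}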

Paul MaudDib fucked around with this message at 02:08 on Feb 7, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
Samsung buying AMD makes very little sense, and I doubt it's real. There's not a lot worth having at AMD: the embedded systems contracts, the GPU business, low-power x86, some IP, and I guess maybe a fixer-upper chip design to jump into the desktop market. None of those comes without a significant caveat; they'd all take some serious elbow grease to utilize successfully.

The dreamer in me wishes it were true; it would be great if there were a competitor to keep Intel moving forward. But I don't think it really makes sense for Samsung to buy out a client who is barely holding on to profitability with very specific niches, especially when that client is locked into using their fabs to produce high-performance chips.

OK, fanboy time is over. Gotta go drag my new 4690K into the den. :unsmigghh:

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

JawnV6 posted:

In what sense?

AFAIK there's no x86 equivalent to something like the 4+1 core ARM chips. Being able to sleep your powerful cores but still respond if needed is really great.

The only thing that comes to mind is that tech Intel is working on that lets the CPU go to a full sleep but still respond to network connections. Damned if I can remember what they called that though. I think it was an evolution of the C6/C7 processor state. They weren't looking at putting it on their super low power stuff either, just server stuff from what I remember.

Paul MaudDib fucked around with this message at 05:33 on Apr 18, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Nintendo Kid posted:

Powergating all your cores except one and running the remaining one at low speed. It's really simple and has been available in dual core form since the mid 2000s and more cores since slightly later.

The idea on 4+1 is that you build the companion core using slower, lower-power processes than the big cores.

Power usage doesn't follow a perfect linear scale. To take an extreme example, you'll never get a 4790K to consume 0.1W like a microcontroller given a similar computational load. At the end of the day you can never gate off and clock down a performance core far enough to match the power consumption of a core specifically designed to consume nothing.

The tradeoff is that your software has to be smart enough to take advantage of it. If your kernel treats the battery-saver core like a normal core you're going to have issues.

Paul MaudDib fucked around with this message at 05:52 on Apr 18, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Professor Science posted:

No software is smart enough for this. Also, 4+1 is dead in favor of 4+4 A53+A57 (or A53+A53 if you're Huawei lol), which software is even less equipped to handle.

I would think it's a pretty straightforward fix - you tweak your kernel scheduler and power manager to prefer the battery-saver core when load is below some threshold. I guess I shouldn't have said "software" - that's a kernel thing. Userland software shouldn't handle processor management.

Guess I'm behind the times on that. In terms of being "equipped to handle that", my intuition would be that it's a lot simpler to write a rule for handling one low-power core when load drops below some threshold (let's say sysload < ~0.1) than for handling 4 of them, for various reasons.
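
Purely as an illustration of how simple that one rule is (this is hypothetical pseudocode, not any real kernel's API):

code:
enum core_choice { LOW_POWER_CORE, BIG_CORE };

// hypothetical policy: below the load threshold, park work on the companion
// core; above it, wake a big core
enum core_choice pick_core(double sysload) {
    const double threshold = 0.1;             // the "sysload < ~0.1" figure from above
    return (sysload < threshold) ? LOW_POWER_CORE : BIG_CORE;
}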

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
Don't forget the HP-EDS merger too. Now it's three garbage trucks.

HP is a microcosm of tech company mismanagement and a godawful place to work. It survived because it's a giant but its inertia is finally running out. In no way should it be viewed as a successful anything.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

The_Franz posted:

No. Vulkan, Mantle and DX12 basically get rid of the massive, slow hack-filled black-boxes that make up current video drivers and give control over most of the minutiae to developers. Automatically threading anything would run counter to their design. It's up to the application developer to decide if they want to build command buffers across multiple threads or not. There is no magic number of threads for maximum performance since it will all depend on what the application is doing, CPU speed and the quirks of the particular GPU where ideal workloads can not only vary between manufacturers, but between different generations of chips from the same vendor.

For context here, the way game engines and GPU drivers currently work is basically a massive game of second-guessing. The engine writers build the engine in the style they think will work best with the drivers, and then the driver guys special-case their drivers to make each engine actually work. It turns into an endless exercise of toggling individual rendering settings on and off to get stability and performance. This is a major reason why most games are buggy messes on release, why you need custom driver profiles for SLI/Crossfire for every game, etc.

The goal of the frameworks is to get rid of that, and shunt the workload onto engine developers to handle writing and optimizing their own rendering.
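
To make the "build command buffers across multiple threads" part concrete, here's a bare-bones sketch of the pattern the new APIs allow. The CommandBuffer type and the record/submit bits are made-up stand-ins for illustration, not the actual Vulkan/DX12 calls:

code:
// Sketch of the threading pattern Vulkan/DX12 permit: one command buffer
// per worker thread, recorded in parallel, submitted in a fixed order.
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

struct CommandBuffer {
    std::vector<std::string> commands;  // stand-in for recorded GPU commands
    void draw(int object_id) {
        commands.push_back("draw object " + std::to_string(object_id));
    }
};

int main() {
    const int num_threads = 4;
    const int num_objects = 16;
    std::vector<CommandBuffer> buffers(num_threads);
    std::vector<std::thread> workers;

    // Each thread records draw commands for its own slice of the scene.
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int obj = t; obj < num_objects; obj += num_threads)
                buffers[t].draw(obj);
        });
    }
    for (auto& w : workers) w.join();

    // "Submit" the per-thread buffers in a deterministic order. The engine,
    // not the driver, decides how the work was split and how it gets replayed.
    for (int t = 0; t < num_threads; ++t)
        for (const auto& cmd : buffers[t].commands)
            std::printf("queue <- [thread %d] %s\n", t, cmd.c_str());
}

With GL/D3D11 that recording mostly ended up serialized inside the driver no matter how the app was structured; here the engine decides how to split the work and what order to submit it in.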

Paul MaudDib fucked around with this message at 06:48 on Apr 23, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

cat doter posted:

That I did know, so the point of Vulkan and DX12 is that game developers can essentially write their own driver profile? That and the fixing of the draw call bottleneck.

It's about the same outcome as if they could write their own driver profile, but that's not how they're getting there. Previously the goal has been to hide the complexity of computer graphics from the game programmer and present as simple and accommodating an interface to the graphics system as possible. What Vulkan/Mantle/DX12 are doing is exposing a lower-level, more complex, more stringent API. The theory here is that you're handling an enormously complex task that needs to happen extremely efficiently, and trying to do that through Babby's First Graphics API is a losing battle. At the end of the day you can only hide so much complexity, so the idea is to give everyone a fixed target and then let engine and GPU devs each handle their own half of the task. The GPU guys implement the API and make their hardware run it quickly, the game guys write their engine and tell the API exactly how they want things done.

Think of this as being somewhat like Java/C# vs C - Java shields you from some of the complexity of the task and insulates you from your own bad behavior, and it performs OK. Maybe even well if you take heroic measures. It's a lot easier to write fast code in C, and it can potentially go much faster than even great Java, but the "heroic measures" there are in the complexity of managing everything yourself. There's no handholding, it dumps everything on you, and if you mess it up you crash and burn. There's no soft landing for your mistakes, just a big old black hole we call UNDEFINED BEHAVIOR. Not intended to be a detailed metaphor, don't tear into me too hard here. :shobon:

That's the theory at least. I'm sure in the real world there will still be tons of patching happening behind the scenes and stuff, because game devs are under huge pressure to ship before their holiday deadline and GPU devs don't want to be the brand whose cards don't work with a AAA title on release day. I'd agree that handling the complexity is best done by the commercial engine people, and anyone less than a AAA game studio probably doesn't want to wade into the intricacies of how to get Enviro Bear 2000's graphics to render properly on a Crossfire setup.

Paul MaudDib fucked around with this message at 01:14 on Apr 24, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

evol262 posted:

A rendering pipeline is like any other pipeline in a general sense. You should, again, learn how processors work. And how heap and stack works. You may also want to look into the specific language VMs you're familiar with (maybe .NET).

There's no real "explain like I'm 5" explanation from where you're starting from. Learn how processors work. Refresh your memory on discrete mathematics and linear algebra. Then pick up a book on shaders.

To echo this, there really isn't. I'll do my best but this is a drastic simplification. You really can't approach this without at least a passing knowledge of linear algebra, matrix math, and coordinate systems.

Basically you start with geometry. Take a 3D model and store it as a list of vertices, which form surfaces. You pick some origin, take the points relative to that origin, and pack them into a matrix (one column per vertex). We can then multiply that by certain "transformation matrices" to perform various operations on the whole model at once. For example, the matrix

code:
2 0 0 0
0 2 0 0
0 0 2 0
0 0 0 1
will double the size of a model - the distance from the origin to each vertex gets multiplied by 2 in every dimension, while the final 1 leaves the fourth (homogeneous) coordinate alone. Other transformation matrices can do things like rotate or translate a model. We can chain together multiple operations - if I have matrix S that scales, R that rotates, and T that translates, then the matrix TSR will first (!) rotate the model, then scale it, then translate it. Order of operations is very significant here.

So basically you start with a model, then you perform some operation to embed it inside another coordinate space. For instance, maybe you take a hat model and translate it to the top of your character's head. Then you take the gun and rotate and translate it into his hand. Then you take this combined model and translate it into a larger coordinate space - e.g. inside a building, which has other objects, which in turn sits inside a larger world space. So we have a list of models, their vertices, and a transformation matrix for each model that takes its vertices from model space to their final placement in world space. We then assemble those vertices into surfaces (primitive polygons, usually triangles).
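
If it helps to see that as code, here's a bare-bones version of the matrix chaining - the Mat4/Vec4 types and helpers are hand-rolled purely for illustration (column-vector convention), not from any real engine:

code:
// Hand-rolled 4x4 transforms in homogeneous coordinates (column vectors).
#include <array>
#include <cmath>
#include <cstdio>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

Mat4 identity() {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0;
    return m;
}

// Scale x/y/z by s; the homogeneous w stays untouched.
Mat4 scale(double s) {
    Mat4 m = identity();
    m[0][0] = m[1][1] = m[2][2] = s;
    return m;
}

// Rotate by `angle` radians around the z axis.
Mat4 rotate_z(double angle) {
    Mat4 m = identity();
    m[0][0] = std::cos(angle); m[0][1] = -std::sin(angle);
    m[1][0] = std::sin(angle); m[1][1] =  std::cos(angle);
    return m;
}

// Translate by (tx, ty, tz) - this is what the fourth column is for.
Mat4 translate(double tx, double ty, double tz) {
    Mat4 m = identity();
    m[0][3] = tx; m[1][3] = ty; m[2][3] = tz;
    return m;
}

Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

Vec4 mul(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
        for (int k = 0; k < 4; ++k)
            r[i] += m[i][k] * v[k];
    return r;
}

int main() {
    const double pi = 3.141592653589793;
    // T * S * R: rotate first, then scale, then translate (column-vector order).
    Mat4 model_to_world = mul(translate(10, 0, 0), mul(scale(2.0), rotate_z(pi / 2)));

    Vec4 vertex = {1, 0, 0, 1};                // model-space vertex, w = 1
    Vec4 world = mul(model_to_world, vertex);  // lands at roughly (10, 2, 0, 1)
    std::printf("world: (%.2f, %.2f, %.2f, %.2f)\n", world[0], world[1], world[2], world[3]);
}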

Then you define a camera (a viewpoint, a direction, and a field of view that together define the viewport). Once you have that you need to figure out which of these polygons are actually visible from the camera - which ones fall outside the field of view, face away, or are obstructed by other surfaces - and project the rest according to perspective. Then you convert this projected view into a flat 2D image (rasterization), and apply lighting and textures to each pixel, etc.
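
The perspective part boils down to a divide-by-depth. Here's a minimal pinhole-camera sketch mapping a camera-space point to a pixel; the field of view, resolution, and struct names are just illustrative choices:

code:
// Minimal pinhole projection: a point already in camera space (camera at
// the origin looking down -z) gets turned into pixel coordinates.
#include <cmath>
#include <cstdio>

struct Pixel { double x, y; };

Pixel project(double x, double y, double z,
              double fov_y_deg, int width, int height) {
    const double pi = 3.141592653589793;
    // Perspective divide: things twice as far away land half as far from center.
    const double f = 1.0 / std::tan(fov_y_deg * pi / 180.0 / 2.0);
    const double aspect = double(width) / height;
    double ndc_x = (f / aspect) * x / -z;   // normalized device coords, -1..1
    double ndc_y = f * y / -z;
    // Map NDC onto the pixel grid (y flipped so +y points up on screen).
    return { (ndc_x * 0.5 + 0.5) * width,
             (1.0 - (ndc_y * 0.5 + 0.5)) * height };
}

int main() {
    // A point 1 unit right, 1 up, 5 in front of the camera, on a 1920x1080 target.
    Pixel p = project(1.0, 1.0, -5.0, 60.0, 1920, 1080);
    std::printf("lands at pixel (%.1f, %.1f)\n", p.x, p.y);
}

Rasterization proper is then walking each projected triangle and deciding which of those pixels it covers.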

The parts of the algorithm that run in parallel on the GPU are the shaders. There are shaders that handle geometry work (transforming the vertices into their final place, emitting primitives, etc.) and shaders that handle per-pixel lighting/shading work. There are also tessellation shaders in DX11, but I only did OpenGL so I can't go into much detail there.

This is an enormous simplification because it is a really complex topic. As a starting point you might want to look at the classic OpenGL fixed-function pipeline and then work forward from there. The FF pipeline is deprecated and not included in the newer OpenGL standards anymore but it'll make a lot more sense if you see the historical context and the decisions that were made to move forward from there. Maybe this article?



There's some other really fun stuff in Games Programming 101 too. For example you may wonder why I used a 4D coordinate space for 3D points. Part of it is that the extra homogeneous coordinate is what lets a single matrix multiply express translation and perspective, not just scaling and rotation. The other part shows up with rotations: an orientation specified in terms of 3 angles can hit Gimbal Lock - two of the rotation axes line up and you lose a degree of freedom, which shits all these mathematics up bigtime. The fix there is also an extra dimension: store the orientation as a 4-component quaternion, which never degenerates like that, to bail your rear end out.

The invention of this led to one of the greatest moments of nerd-ery in history.

quote:

Quaternion algebra was introduced by Hamilton in 1843.[6] Important precursors to this work included Euler's four-square identity (1748) and Olinde Rodrigues' parameterization of general rotations by four parameters (1840), but neither of these writers treated the four-parameter rotations as an algebra.[7][8] Carl Friedrich Gauss had also discovered quaternions in 1819, but this work was not published until 1900.[9][10]

Hamilton knew that the complex numbers could be interpreted as points in a plane, and he was looking for a way to do the same for points in three-dimensional space. Points in space can be represented by their coordinates, which are triples of numbers, and for many years he had known how to add and subtract triples of numbers. However, Hamilton had been stuck on the problem of multiplication and division for a long time. He could not figure out how to calculate the quotient of the coordinates of two points in space.

The great breakthrough in quaternions finally came on Monday 16 October 1843 in Dublin, when Hamilton was on his way to the Royal Irish Academy where he was going to preside at a council meeting. As he walked along the towpath of the Royal Canal with his wife, the concepts behind quaternions were taking shape in his mind. When the answer dawned on him, Hamilton could not resist the urge to carve the formula for the quaternions, i² = j² = k² = ijk = −1, into the stone of Brougham Bridge as he paused on it.
http://en.wikipedia.org/wiki/Quaternions#History

Disregard the constable.
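
If you want to see what i² = j² = k² = ijk = −1 actually buys you, here's a bare-bones sketch of rotating a vector with a unit quaternion via q·v·q⁻¹. The Quat type is hand-rolled purely for illustration:

code:
// Rotating a 3D vector with a unit quaternion: v' = q * v * conjugate(q).
#include <cmath>
#include <cstdio>

struct Quat { double w, x, y, z; };

// Hamilton product, i.e. the i^2 = j^2 = k^2 = ijk = -1 rules written out.
Quat mul(const Quat& a, const Quat& b) {
    return {
        a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
        a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
        a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
        a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w
    };
}

Quat conjugate(const Quat& q) { return { q.w, -q.x, -q.y, -q.z }; }

// Unit quaternion for a rotation of `angle` radians around a unit axis.
Quat from_axis_angle(double ax, double ay, double az, double angle) {
    double s = std::sin(angle / 2.0);
    return { std::cos(angle / 2.0), ax * s, ay * s, az * s };
}

int main() {
    const double pi = 3.141592653589793;
    Quat q = from_axis_angle(0, 0, 1, pi / 2);  // 90 degrees around z
    Quat v = { 0, 1, 0, 0 };                    // the vector (1,0,0) as a "pure" quaternion
    Quat r = mul(mul(q, v), conjugate(q));
    std::printf("(1,0,0) rotated 90 degrees about z -> (%.2f, %.2f, %.2f)\n", r.x, r.y, r.z);
    // Prints roughly (0.00, 1.00, 0.00) - no Euler angles, no gimbal to lock.
}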

Paul MaudDib fucked around with this message at 02:26 on Apr 24, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

El Scotch posted:

Haswell performance with 6/8 core chips for cheaper money could be viable in the DX12/Vulcan world.

Fewer cores for the same total performance is always better. Games that don't thread well will always run like garbage on weak cores, and even games that do thread heavily will perform better on fewer/stronger cores thanks to things like less lock contention.

DX12/Vulkan is going to push most people toward big-budget engines with better-skilled developers, so we'll probably see games that thread better in the DX12/Vulkan future, but there's only so far you can go with that. Even highly threaded games tend to lean on one core real hard - some of that is due to the difficulties of threading with current-gen graphics APIs, but some of it is inherent serial bottlenecks in the game loop.
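
To put a rough number on "only so far you can go," Amdahl's law is the usual back-of-envelope. The 70% parallel fraction below is purely an assumption for illustration, but it shows how quickly extra weak cores stop paying off:

code:
// Amdahl's law: with a fraction p of the frame parallelizable, the speedup
// on n cores is 1 / ((1 - p) + p / n). The p = 0.7 figure is illustrative.
#include <cstdio>

double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    const double p = 0.7;  // assume 70% of the game loop threads cleanly
    const int counts[] = {2, 4, 6, 8, 16};
    for (int cores : counts)
        std::printf("%2d cores -> %.2fx speedup\n", cores, amdahl(p, cores));
    // 4 -> 2.11x, 8 -> 2.58x, 16 -> 2.91x: the serial part of the loop
    // caps the gains long before the core count does.
}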

You can pick up an i5-4690K and an MSI Z97 PC Mate for $265 out the door at Microcenter. To be appealing, given the power usage and the weaker per-core performance, an AMD chip would probably have to be down in the $175 range or lower. Factor in ~$75 of the platform price for the motherboard and that doesn't leave much margin for AMD. Let alone if you still need to bundle it with a water cooler.

Paul MaudDib fucked around with this message at 01:53 on May 7, 2015

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

El Scotch posted:

We'll see when DX12/Vulcan games start hitting the market. The early synthetic benchmarks indicate a significant boost going from 4 to 6/8. However, they're only synthetic and we won't know until the real thing is running around.

Synthetic benchmarks also show perfect scaling from 4-way Crossfire/SLI, which doesn't show up in the real world either. Having a game that responds to player interaction (and maybe even talks to stuff on the internet) is way different than running FireStrike.

Like I said above, you can pick up an i5-4690K and an MSI Z97 PC Mate for $265 out the door at Microcenter. To be appealing, given the power usage and the weaker per-core performance, an AMD chip would probably have to be down in the $175 range or lower. Factor in ~$75 of the platform price for the motherboard and that doesn't leave much margin for AMD - let alone if you still need to bundle it with a water cooler. If DX12 games scale really well it might be worth it at $200 or so.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Killer robot posted:

I imagine a generation of consoles that have a large number of weak x86 cores is going to continue pushing development there, isn't it?

I'd think that if AMD was going to get a boost from the fact that their processors are in both current-gen consoles, it would already be showing.

Your average console runs on a 1.6 GHz APU and it hasn't made PC games perform vastly better. I think the way this usually tends to play out is that console devs take a PC gaming engine and start stripping features and detail and tuning to the hardware until they can get it to run, rather than engineering something better from the ground up. That poo poo takes too long and if the game isn't in stores by Christmas then your publisher is going to cut you off. It would be sweet to spend a couple years writing an engine that could push 144 FPS on a 16-core potato but the priority tends to be writing games.

Paul MaudDib fucked around with this message at 02:15 on May 7, 2015
