Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Alereon posted:

ARM-based servers are also a thing that will happen, which is going to seriously impact both Intel and AMD.
how much of an impact there will be is debatable. the flat cost of x86 instruction decode is pretty minimal these days--otherwise Intel wouldn't have a chance in mobile, yet its devices thus far have been OK power-consumption-wise. in servers, ARM isn't going to have a massive power consumption advantage. the advantage ARM will have is different (and disruptive) kinds of servers that compete with virtualization, which might hurt Intel's margins going forward if people are buying commoditized ARM instead of Super Fancy Xeon 9000.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Factory Factory posted:

The article addresses CPU choice a bit: that's just what the original Jaguar system used, since they were the better choice at its inception in 2005, and the Cray XK line has kept AMD offerings since.

Nvidia K20s are used instead of Southern Islands for roughly the same reason: the decision to use GPGPU acceleration was made three years ago, and only Nvidia had real GPGPU chops and roadmaps at the time, so that's who Cray contracted with.
CPU: 100% incorrect (or at least misses the point entirely, though the article doesn't talk about it at all either, so I can't get too grumpy about it). Cray cannot use non-AMD parts until their next-generation interconnect (Aries, which was recently purchased by Intel) comes online, because Gemini, their current proprietary interconnect, attaches to the machine via HyperTransport, not PCIe. Considering the interconnect is Cray's primary value-add as a system builder, they're stuck on AMD for now.

GPU: AMD never had the software strategy to actually go after this market. It's a lot harder than "build a driver, call it a day," (support for third party debuggers and profilers, out-of-band monitoring, dealing nicely with exotic interconnects, etc--lots of stuff that never shows up at all in the non-HPC side of things) and they simply never staffed OpenCL/compute up enough to ever show up in these bids. Xeon Phi is the first real competing accelerator that NV has had in the HPC space.

Professor Science fucked around with this message at 04:30 on Nov 1, 2012

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

necrobobsledder posted:

Yeah, it needs to be stressed that the primary reason for going AMD years ago was because of Hypertransport beating the pants off Intel's interconnect since that's such a big factor in large-scale HPC workloads (the joke is that supercomputers turn cpu-bound tasks into I/O bound ones)
no no, this isn't a QPI or FSB versus HT performance/multi-socket thing (although I'm sure that didn't hurt at the time)--Gemini is actually a device that hangs off the HyperTransport bus. besides the occasional Torrenza FPGA, Gemini is the only device I know of on HT like this.

the only equivalent device that I know of that sits on QPI is the interconnect on the SGI UV machines.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Factory Factory posted:

Meh, it's not like AMD has much else going on right now, and this is definitely relevant with respect to AMD's HSA, what with the (dedicated?) GPGPU block.
Er, dedicated? As far as I've seen, PS4 seems to have a GPU with (at least relatively) standard GCN units. The only interesting thing that HSA would provide is if the GPU and the CPU have some level of cache coherence.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Factory Factory posted:

One of the rumors going around, not sure if it was for PS4 or XboxNumbers, was that there would be a CU set aside for GPGPU programming to take over some of the highly parallel functions in the current CPUs of the PS3 and Xbox360.
Unlikely to be static, but I wouldn't be surprised if they expose something similar to OpenCL device fission to allow devs to partition the GPU semi-dynamically.

edit: the reason it won't be static is that, since there's a single platform, you could pretty easily either timeslice the entire GPU if you have bulk processing to do or drain a limited number of CUs (known a priori thanks to the aforementioned single platform) at specific times if you're doing more latency-sensitive work.
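for reference, the closest public analogue is OpenCL 1.2's clCreateSubDevices. a minimal sketch of partitioning by compute-unit counts--the counts here are made up, and in practice most GPU drivers don't support CL_DEVICE_PARTITION_BY_COUNTS today, so treat this as what a console SDK might look like rather than something you can run on shipping hardware:

code:
#include <CL/cl.h>

// split one device into a small piece (latency-sensitive work) and the rest
void partition(cl_device_id dev, cl_device_id out[2]) {
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS,
        4, 14,                                  // CU counts are arbitrary for the sketch
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
        0
    };
    cl_uint returned = 0;
    clCreateSubDevices(dev, props, 2, out, &returned);  // each sub-device gets its own queues
}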

Professor Science fucked around with this message at 06:15 on Feb 21, 2013

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

roadhead posted:

Main memory is shared and I bet all the L2 cache (and L3 if there is any) is shared as well.
Shared main memory != cache coherence. This kind of coherence (between CPU and integrated GPU) is one of the big promises of HSA. If I had to make a bet, I'd say the GPU can snoop the CPU L1/L2 but not vice-versa.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Cybernetic Vermin posted:

I think you are vastly underestimating the number of extremely pessimistic global locks in software today. There are bound to be a lot of software that go from being just "multi-threaded" to basically scaling linearly in some operations with cheap memory transactions.
transactional memory is a cool bandaid on top of bad programming models. if you're writing pthreads today (and you're not writing a language runtime or something similarly low level) you are probably doing your job really badly. learn how to track dependencies and avoid shared state, programming models!

edit: you're basically arguing that there are a lot of applications that are inherently parallel, CPU bound, and extremely impacted by locking overhead on some shared object. those are the only cases where transactional memory could theoretically make a performance difference. what applications would those be? also, please note that Haswell TSX won't solve those problems due to the inability of any app that relies on a monolithic global lock to fit all of its shared state in L1.
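for the curious, the RTM pattern in question looks roughly like this--a sketch assuming gcc with -mrtm, with a crude spinlock standing in for the "real" global lock. the fallback path is the whole point: the transaction aborts the moment the write set spills out of L1 or another core touches it.

code:
#include <immintrin.h>

static volatile int lock_taken = 0;
static long counter = 0;

void increment(void) {
    unsigned status = _xbegin();                         // start a hardware transaction
    if (status == _XBEGIN_STARTED) {
        if (lock_taken) _xabort(0xff);                   // someone's on the fallback path, bail
        counter++;                                       // tracked in L1; conflict/capacity -> abort
        _xend();                                         // commit
    } else {
        while (__sync_lock_test_and_set(&lock_taken, 1)) // abort path: take the boring lock
            ;
        counter++;
        __sync_lock_release(&lock_taken);
    }
}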

the transactional memory people have been tooting this horn for ten years; if it were actually as amazing as all the academic papers claim, Rock probably wouldn't have killed Sun

Professor Science fucked around with this message at 00:36 on Feb 24, 2013

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Factory Factory posted:

Actually threading isn't the limit of parallelism in FPUs. It's far older tech than multi-threading to pass multiple numbers in a single instruction - it's called vector processing, and it's been part of x86 since Intel's SSE on the Pentium III and part of computing for far longer than that.
actually preceded by MMX and 3DNow, but that's neither here nor there...
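(for anyone who hasn't seen it, "multiple numbers per instruction" looks like this with SSE intrinsics--a sketch that assumes any x86 chip since the Pentium III:)

code:
#include <xmmintrin.h>   // SSE intrinsics

// add four floats with one instruction instead of four
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);               // load 4 packed floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));    // 4 adds in a single instruction
}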

if you look at the perf advantage of GPU versus CPU for scientific compute applications, a lot of the time it scales linearly with bandwidth. graphics often performs similarly. as Intel's fabs get better and their ability to slap down huge amounts of eDRAM improves, it's going to be bad for AMD.

also, it sounds like HSA for discrete GPU is dead. can't say I'm shocked, that was never a good value proposition for anyone (cache coherence over PCIe... no).

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Factory Factory posted:

I imagine the big deal here is just plugging into the HSA stack and letting the scheduler do the job rather than having to write up an entire OpenCL workload, but honestly I have no Goddamn clue.
nah, I think it's using CL, it's just lower overhead because it's shared memory. whether this is just "faster because more FLOPs because GPU" or "faster because HSA itself is actually useful" is not addressed.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Factory Factory posted:

That... would be a significant disappointment compared to how they're selling this. The whole subtext behind HSA and HUMA was that GPGPU programming would become as easy as sufficiently threading your workload. How does this make OpenCL programming any easier?
you don't have to copy memory from one pool to another, and in theory you can reuse complex data structures because it's cache-coherent. the copying certainly sucks when dealing with standard GPU programming, but you still have to write kernels that can handle craploads of parallelism in the first place
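as a rough illustration of what goes away, here's a sketch using CUDA unified memory as a stand-in for HSA-style shared memory (the kernel and sizes are made up, and this says nothing about whether the coherence is actually fast):

code:
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;                     // operates on the same pointer the CPU filled
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // one allocation, visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes, no explicit copy down
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                     // CPU can read the results, no copy back
    cudaFree(data);
}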

(HSA is mostly hype? from a collection of hardware vendors? THEY WOULD NEVER)

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
deleted: I'm stupid

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
there are some rumors that Kaveri was going to depend on GDDR5M for better perf, but since Elpida imploded, that's disappeared.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
edit: nah, delete me

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

e: The A10-7850k has 512 stream processor cores. Any idea how a Kaveri core compares to a Kepler core?
don't ever try to compare GPUs this way, the "core counts" are so far beyond meaningless it's unbelievable

also good lord, people are still buying the "GPUs are 8-100x faster than CPUs if used properly" tripe? that's NVIDIA PR from 2008 or 2009, and it has no basis in reality outside of really specific special cases that generally boil down to "the texture unit offers free FLOPs if you need interpolation and the texture fetch offers enough accuracy." if you port a naive (non-SSE/AVX, single-threaded, not really optimized at all) C++ app to CUDA and optimize the hell out of it, yeah, you might get 10 or 20x. if you actually optimize the CPU code and run it on anything approaching 250W worth of CPUs, yeah, you might get 10 or 20x out of it there too. GPUs offer advantages, but it's more like ~2.5x versus well-optimized CPU code on apps that are especially suitable for GPUs.

(maybe it seems like I poo poo on GPUs a lot--I only do because I know a lot about them, and the claims people make are usually totally unrealistic. this is one of them.)

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

e: This is the wrong way to look at it but I contend that the ability to manage more cores doing more useful work thanks to improved architectures/frameworks has probably increased performance relative to well-established CPU computation since day 1 of CUDA's launch over the last 6 years. Server CPUs haven't gotten much more parallel since 2008, CUDA has gone from 128 cores to 2880 per die, the architecture (instructions/bandwidth balance/featuresets/etc) is more optimized for compute work instead of graphics rendering, and there's been more time to work on optimizing algorithms for GPU processors. And working with zero-copy on die is inherently more powerful than a co-processor on a bus in terms of latency and flexibility.

Stuff like database access seems to be relatively amenable to GPU acceleration.
database access for sets that fit entirely within GPU memory works well because GPUs have very high memory bandwidth, and if you're doing lots of queries simultaneously and can hide memory latency, yeah, you get 6x over a single Intel CPU because of GDDR5 vs DDR3 bandwidth alone. those are all big ifs, and they certainly do not mean that general database access is amenable to GPUs.
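(the 6x is just bandwidth arithmetic; round numbers that I'm assuming rather than quoting from anywhere:)

code:
#include <cstdio>

int main() {
    const double gpu_bw = 208e9;  // K20-class GDDR5, ~208 GB/s
    const double cpu_bw = 35e9;   // one socket of DDR3, ~35 GB/s
    // a scan-heavy, memory-bound query can't beat the bandwidth ratio
    std::printf("bandwidth-bound upper bound: %.1fx\n", gpu_bw / cpu_bw);  // ~6x
}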

the only case I know of where GPUs took over an industry was reverse time migration for seismic processing.

in general, I think we have decent tools and programming models for writing kernels on specific GPUs. what we don't have is any sort of way to write applications with large parallel sections that run on the appropriate compute device. until the latter gets solved, GPU compute will remain a novelty limited to HPC, embedded platforms, and the occasional developer relations stunt. I don't think coherent memory does the latter on its own, although it's a nice step--it's more of an improvement for the "writing kernels" part.

edit: whoops, this is the AMD thread and I'm pretty far afield. just PM me if you want to talk more about this.

Professor Science fucked around with this message at 07:02 on Aug 22, 2014

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

GrizzlyCow posted:

Well, AMD is trying to make inroads on that HSA promise. They partnered with Microsoft to create a new C++ compiler.

So there's that. AMD APU related stuff.
this doesn't seem to be upstreamed in Clang or LLVM, so it's probably not going to fare well in the long term.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
I don't have time/effort to respond to all of this (just compare benchmark results--I don't know of any GPU app in HPC that is actually flops bound instead of bandwidth bound), but a few things:

Paul MaudDib posted:

There's also a ton of work devoted to getting around the memory size limitation. Multiple GPUs in a system can do DMAs to each other's memory, so you can get up to 48GB of GDDR5 memory per system with 4x K40s. In theory you could also do DMA to something like a PCI-e SSD, which might offer better latency (but lower bandwidth) than main system memory.
PCIe is 12GB/s per direction versus 200GB/s+ onboard, and last time I checked you still need PCIe switches between all of your GPUs in order for general purpose peer-to-peer transfers to actually work. however, lots of things don't work over PCIe (atomics come to mind), and good luck if you start hitting remote GPU memory in anything resembling a performance critical path.

also, peer to peer latency is ~identical to system memory latency, although I haven't measured that in a long time.
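for reference, the peer-to-peer path being described is just this (a sketch assuming two GPUs that can actually reach each other, i.e. same PCIe root complex or switch):

code:
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { std::printf("no peer access\n"); return 1; }

    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    const size_t bytes = 64 << 20;
    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    // still limited to PCIe bandwidth, nowhere near onboard GDDR5 speeds
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaSetDevice(0); cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
}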

quote:

We actually do have tools that let you write large parallel sections easily - OpenMP and OpenACC frameworks for exploiting loop-level parallelism in particular. On a finer grain there's marking calls as tasks and then synchronizing, etc.
oh my god, no. speaking from experience, both of these tools are irrelevant to the problem I'm describing. first of all, OpenMP pretty much sucks for getting anything near peak CPU performance; it's just easy (and it's usually off by a lot vs a good multithreaded implementation, otherwise Intel wouldn't have TBB). are they vaguely decent when you know that an arbitrary section is exceedingly parallel and has sufficient work to fill a given GPU? I guess, they'll generate something not completely awful (I hope). the problem is that such hot spots are basically nonexistent at this point. if you can find a section of code where you know you're spending a significant portion of your time and it has lots of parallel work, actually porting such a piece to CUDA or OpenCL or whatever you want to run on KNC/KNL isn't that hard. it may take some effort, sure, but the problem isn't writing the kernel. the problem is that no such sections really exist in most applications. the amount of work available to these functions is usually too small to actually fill a GPU and be fast compared to PCIe overheads, so the porting problem becomes one of restructuring the entire codebase to get more parallel work available to your compute kernels at once, in addition to porting the compute kernels themselves. in other words, for a lot of HPC apps, now you're touching 10M lines of code written by generations of grad students instead of your 1000-line kernel.
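to make the "porting the kernel isn't the hard part" bit concrete, this is the kind of isolated hot loop that ports trivially (names made up); the problem is that real codebases rarely hand you one this clean with enough work behind it:

code:
// CPU version: the OpenMP pragma is the entire porting effort
void saxpy_cpu(int n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) y[i] += a * x[i];
}

// GPU version: same loop body, one thread per element
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}
// launch: saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);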

quote:

The problem is that tools are much less of an effective solution when there's big overhead to start/stop a co-processor. That's my problem with the OpenACC approach to GPUs - it doesn't make sense to copy it down to the processor, invoke and synchronize the kernel, and then copy it back up to parallelize a single loop, GPU programs really should be running mostly on the GPU rather than shuttling data back and forth all the time. It makes sense for intensive portions that are reasonably separate from the rest of the program, but a generic "send this part of the program to the GPU" isn't really going to be the best solution in a lot of cases.
ahha, now you're getting closer to the root of the problem. last time I measured, the baseline empty kernel latency in CUDA was something like 15 microseconds. in other words, to do absolutely nothing except to send a message to the GPU and wait for a response back that the message was received, you had 45,000 cycles per 3GHz CPU core. this makes getting wins on GPUs difficult, because that quantum determines how much restructuring of your code you have to do--even if your code is twice as fast on a GPU vs a CPU, if your function is less than 30us on a CPU then the GPU isn't faster overall. and that's ignoring PCIe! that latency is even slower than RDMA over InfiniBand or better interconnects, too.
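if you want to reproduce that quantum yourself, here's a minimal measurement sketch (the number will vary a lot with driver, GPU, and OS):

code:
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void empty() {}  // does nothing; we only pay submit + completion cost

int main() {
    empty<<<1, 1>>>();
    cudaDeviceSynchronize();              // warm up the context and driver paths

    const int iters = 10000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        empty<<<1, 1>>>();
        cudaDeviceSynchronize();          // full round trip: submit, then wait for completion
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    std::printf("round-trip null kernel latency: %.1f us\n", us);
}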

this is the interesting part of APUs or KNL; by skipping PCIe, the quantum required for a speedup should be a lot lower. once it gets low enough to the point that arbitrary function dispatch can be profitable even if the CPU can just wait for the results to come back, then that will be transformative. however, that's probably going to look more like AVX512 or AVX1024 than a dedicated GPU.

also, OpenACC actually tries to keep memory on the GPU as much as possible and isn't doing a naive copy-to/copy-back every time.

quote:

Of course, all such tools do have the problem that their design goal is to make it easy to write/adapt generic code. Writing applications using architecture-specific features can make a pretty big impact in performance. One reasonable approach to this would be to let you compile in OpenCL or CUDA code directly - i.e. you should be able to call __device__ functions that can use the architecture-specific features, perhaps with multiple versions to target different architectures.
I don't know what you mean here--IIRC you can do this in CUDA by checking defines for particular SM versions. The problem is that the architecture-specific stuff that impacts performance is more meta than that; it's usually not a matter of using shuffle on Kepler versus not on Fermi and getting a speedup from that. Look at Tesla to Fermi to Kepler. Tesla's access to shared memory was almost as fast as registers (see the Volkov papers if you're not familiar with that), but on Fermi and the integration of shared memory with L1, shared memory latency became tens of cycles. That led to all sorts of tradeoffs, and Fermi's register-per-thread count was extremely low compared to GT200 and GK110, so that led to a whole new set of tradeoffs and code rewrites for GK110 that further moved toward register blocking and away from shared memory, etc. And these are just NV GPUs. Having something that works reasonably for CPUs (or GPUs with integrated memory) is very different than something that works well for the big GDDR5-equipped GPUs, as the former are going to depend a lot more on cache friendly approaches than the streaming approach favored by large GPUs.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Factory Factory posted:

Anyone else feelin' pretty dumb after the last few posts? Heck, I'm almost ready to buy an AMD CPU.
it's pretty straightforward. based on hardware alone, GPUs offer a pretty good value proposition for HPC (lots of BW, lots of FLOPs), but they are hamstrung by two big software issues.

1. it takes a lot of parallel work to fill up a GPU, and GPUs can't timeslice between processes or anything like that when one task isn't offering enough work to completely fill the GPU. this limitation generally doesn't apply to CPU apps, so the problem with porting becomes more about restructuring the rest of your app to get enough work in a single place at a time than porting some kernels.

2. when everything about your platform (GPU type, GPU count, whether GPUs are shared or exclusive, CPU count, PCIe perf, ...) isn't known a priori, writing an application that uses the right processor at the right time becomes really hard. it's partially a language issue and partially a runtime issue. language-wise, the CUDA execution model (later used by OpenCL and DirectCompute) exposes some extremely low-level hardware details and in fact requires users to specify things at that level in order to get good performance. these include exact details of the memory hierarchy, warp widths and warp-synchronous programming, how a kernel traverses over data, etc--the developer has to control all of this to get something that runs fast. since those details change from GPU generation to GPU generation, there's no one ideal GPU kernel. runtime wise, there's no way to figure out how busy a GPU is (other than "is anyone else potentially using this GPU at all"), so picking a particular GPU to use in a multi-GPU system is also really hard. couple that with GPU performance variance between different models or different vendors, and the question rapidly becomes "should I use a GPU at all right now or stick to the CPU."

the reason why GPUs have been pretty successful in HPC thus far is because #2 doesn't really apply--if you're buying 10-40k GPUs, you generally buy the same kind of GPUs and write software for that machine. most of the software improvements in recent years have been focused on #1 (GPU work creation, improved support for concurrent kernels, multi-process support, even interaction with InfiniBand and other NICs), and the rest of the porting problem can get brute forced by investing in enough software engineers. meanwhile, #2 is as critical an issue in desktop and mobile as #1, and there's been very little work to solve that. OpenACC and Thrust come to mind to solve the language issue, but there's still very little research on good runtime schedulers that I'm aware of (the last big thing was StarPU, and that was what, four or five years ago?). over time, #2 will become more important for HPC as there's less of a CUDA monoculture, but I don't have any idea as to what direction that will take right now.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

What I'm getting from your earlier post is that most multithreaded applications are written with some type of "task-based" parallelization (you said TBB).
I was just using TBB as an example of something that is better able to get near-peak multithreaded perf than OpenMP (although not as good for CUDA-style workloads as ISPC). anyway, task parallelism isn't really how HPC apps work. HPC apps are overwhelmingly MPI-only; they use MPI for both inter- and intra-node work distribution. there's no work creation or task-based stuff. generally, such MPI schemes are exceedingly data parallel, but when you're running 8/16/24/more MPI ranks (aka individual processes) per node that has 1/2/4 GPUs, it's hard to get enough work per MPI rank to fill a GPU (and multiple processes can't generally dispatch to a single GPU cleanly, hence the multi-process stuff in CUDA). you can try to smush all of the GPU work from all of the MPI ranks in a node to a single MPI rank and have that basically be your GPU scheduler/runtime, but the latency of doing so is really painful. it also means that every GPU-enabled app has to effectively include its own scheduler and runtime; that's crazy complicated.
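the usual rank-to-GPU mapping looks something like this (a sketch assuming MPI-3 for the local-rank trick); it's exactly how you end up with 8+ ranks each feeding a fraction of one GPU:

code:
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // figure out this rank's index within its node
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(local_rank % ndev);  // 16 ranks, 2 GPUs -> 8 ranks per GPU

    // ... each rank now independently submits (not enough) work to its GPU ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
}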

quote:

I think one strategy is to try and generate your own data parallelism as you go. Latency already sucks compared to CPUs, so just embrace it and let a runtime batch up data items/tasks and then process them in parallel when possible. In terms of implementation strategies, I think that would end up looking a lot like the Erlang runtime. You have something that looks like lightweight threads which a runtime or the scheduler batches and then dispatches when a sufficient number of data items are ready (or the processor is under-utilized enough that processing partial batches doesn't matter).
Doesn't really work over PCIe, though. I wouldn't be surprised if something vaguely like this ends up a viable strategy for coherent heterogeneous non-power-constrained devices (as in, not mobile).

quote:

That has overhead too of course, but you're amortizing it across more data items per launch, and kernel launch overhead is much, much lower from the kernel (onboard the GPU) than from the CPU.
Let me blow your mind: for a single kernel launch, this is absolutely false. GDDR5 latency sucks a lot. Like a whole, whole lot. GPUs may have 5-6x the memory bandwidth of CPUs, but they do so by having ~15x the memory latency. Reading across PCIe after fiddling a register is better than running a scheduler on a GPU on GDDR5. (If you can batch N kernel launches together at the same time, then yeah, it'll perform better from the GPU.)
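for reference, device-side launch (dynamic parallelism, CC 3.5+, built with -rdc=true) looks like this; the catch is that whatever the on-GPU "scheduler" reads to decide what to launch is sitting behind that GDDR5 latency:

code:
__global__ void child(float *chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1.0f;
}

__global__ void parent(float *data, int chunk, int batches) {
    // one thread plays scheduler and launches one child grid per chunk
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (int b = 0; b < batches; ++b)
            child<<<(chunk + 255) / 256, 256>>>(data + b * chunk, chunk);
    }
}
// host side: parent<<<1, 1>>>(d_data, chunk, batches); cudaDeviceSynchronize();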

quote:

I generally agree that CUDA and OpenCL expose a lot of low-level mechanics to the programmer, but you don't really need to hand-tune your program to every single device to get good performance. If you're writing low-level C code for your entire program, yeah, your life is going to suck, but the higher-productivity way here is to write your program in terms of template library calls and then let the library authors handle tuning the operations to the various architectures. That's Thrust, Cuda Unbound, CUDPP, and so on, which handle warp-, block-, and device-wide collective operations. All of those are designed to be drop-in solutions that will work on any architecture or block size. Like CPUs, not all portions of your program are really critical, and libraries often cover most of the critical parts fairly well.
Libraries don't really cover the critical stuff for large apps; the critical parts aren't just simple BLAS or LAPACK calls or sorts or simple reductions. They may use a bunch of BLAS and LAPACK calls, but they'll have large matrix operations that are not straightforward applications of Thrust/CUDPP/etc. Also, all these apps are still FORTRAN, not C++ :v:

quote:

As for grid topology, you can write some math which does some rough occupancy calculations at runtime and get a reasonable guesstimate.
occupancy is generally an irrelevant metric for performance. that's what I meant about register blocking on GK110 versus Fermi--there are these larger scale implicit tradeoffs that are not obvious at all from detecting individual device properties.

quote:

Not quite sure what you mean by "warp synchronization", if you mean thread-fences around shared memory operations, that's roughly equivalent to the trouble caused on CPUs by forgetting to wait for task completion. It's a thing you need to remember, but one you'll notice right away when your 2880-core processor spews obvious race conditions at you. That previous example uses them (__syncthreads()), they're pretty basic.
gently caress, I was hoping I wouldn't have to explain this because I hate it. so warp-synchronous programming is this thing that you'll see in a lot of CUDA kernels that are designed to go very fast. it is based on two ideas:

1. it's good for warps to be out of phase with each other within a block, as that makes it more likely that your various execution units will be in use while memory loads/stores are happening. in other words, __syncthreads() eats performance.
2. if you know the warp size, you know that all of your operations will be completed at the same time, meaning there's no such thing as an intra-warp dependency across an instruction boundary.

so a simple example of this is the same thing you'd use warp shuffle for:

code:
volatile __shared__ double butts[32]; // THIS IS NOT WHAT VOLATILE MEANS, NVIDIA
uint32_t id = threadIdx.x % 32; // get our index within the warp
butts[id] = 1024 - id; // stupid arbitrary value
double newVal = butts[31 - id]; // read what the mirror lane just wrote, no barrier
no __syncthreads() required, so long as you know the warp is 32 threads or larger. if the warp were ever smaller, then your code blows up. CUDPP makes heavy use of this, I think Thrust does too, basically every fast sort/reduction/etc relies on this. you can get arbitrarily complicated with this same pattern, too.
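the shuffle version of the same trick, written against the later __shfl_sync form (the original __shfl took no mask argument); it assumes the block size is a multiple of 32:

code:
__global__ void reverse_in_warp(int *vals) {
    unsigned id = threadIdx.x % 32;                     // lane index within the warp
    int mine = 1024 - id;                               // same arbitrary value as above
    // read lane (31 - id)'s register directly, no shared memory involved
    int theirs = __shfl_sync(0xffffffffu, mine, 31 - id);
    vals[threadIdx.x] = theirs;
}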

quote:

Other stuff here is not real at all. Warp size has been 32 threads for every CUDA Compute Capability spec so far. Traversing the kernel's grid over data is pretty straightforward, and the fact that blocks can't communicate basically eliminates the possibility that you're doing something not straightforward. If threads need to communicate, they go in a block together. There is definitely an API that lets you get GPU utilization.
Warp size has been 32 on NVIDIA GPUs. It's not necessarily 32 on Intel GPUs (which have variable warp width, to boot), and it's not 32 on AMD GPUs IIRC (GCN wavefronts are 64 wide). So that's fine for CUDA, but if you want to have a generic language that can target NV, AMD, Intel, GPUs and CPUs, then you can't expose anything like that. Grids/blocks/explicit memory hierarchy are (I'd argue) entirely the wrong primitives for such a generic language. Also, NVML's utilization metric isn't useful for the problem I'm describing; it is not fine-grained enough.
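(for reference, that metric is roughly this--a sketch, link against -lnvidia-ml; it's a percentage averaged over the last sampling period for the whole GPU, which is why it can't tell you whether your next kernel would actually have to wait:)

code:
#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlUtilization_t util;                    // percentages over the last sample period
    nvmlDeviceGetUtilizationRates(dev, &util);
    std::printf("gpu %u%%, memory %u%%\n", util.gpu, util.memory);

    nvmlShutdown();
}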

(I've written a _lot_ of CUDA. also blocks can communicate, I'll leave that as an exercise to the reader)

Professor Science fucked around with this message at 02:52 on Aug 29, 2014

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

Spinlocked infinite-loop kernels? I was under the impression that falls into "seriously undefined behavior".
Eh, you can do more than that. For example, you can build your own work distributor if you really want to. Getting an actual maximal launch to initialize things isn't straightforward, though.

quote:

For global memory, sure, which is why I specified "shared memory" On-multiprocessor memory has much lower latency. Dynamic parallelism should then avoid this, right?

I heard you can also play games with sharing local memory?
If you're sharing the same memory with no PCIe, then CPU->GPU kernel launch latency and GPU->GPU launch latency should be ~identical. Shouldn't be any faster, since you're still fundamentally fiddling the same bits.

Not sure what you mean by sharing local memory.

edit: wait, do you mean launching out of the equivalent of __shared__?

Professor Science fucked around with this message at 03:37 on Aug 29, 2014

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
maybe I'm not following you, but either you can't assume that the same SM is going to run the new kernel (%smid in PTX is effectively volatile, after all) or you don't need dynamic parallelism.

also you should get plat or some other easy means of contact.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
afaik Power is not a meaningful player in high-performance computing (Top500 and the like) anymore, for lots of reasons that nobody probably cares about but me. glancing through Top500, the only new IBM-based machine at all in the past two years in the top 25 is an internal IBM Power8 machine.

also this conversation about ISAs is silly because for various reasons the ISA has been proven to pretty much not matter so long as it's not completely idiotic (eg, x86 is kind of idiotic, x86-64 is not, ARMv8 is not, I don't know enough about MIPS64 to say either way). if you really wanted to build a <1W Power8 CPU with good perf/W, you probably could. if you wanted to build a 300W ARMv8 CPU with big iron style perf/W, you could do that too. the argument people always make is that x86 in particular has huge overhead for instruction decode, but things like Silvermont/Knights Corner/Knights Landing have shown that instruction decode is such a minuscule part of a modern CPU it doesn't matter. the thing that matters once you have equivalent processors with different ISAs: toolchain. x86 dominates (especially for HPC), ARM is closing fast in the not-FORTRAN oriented space, and everything else is significantly further behind.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Rastor posted:

POWER architecture has been around for some time and claims to be gearing up to face down Intel from the high end, while ARM continues to challenge on the low end and even MIPS making some noise as they are expecting to be supported in the next Android release.
MIPS has been supported for a while, MIPS64 is the new hotness.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Panty Saluter posted:

The Nvidia 9xx series might be the final nail in the coffin. Far as I can tell there just isn't a reason to buy anything else right now unless all you need is a (low margin) budget card. I hope AMD pulls through but it looks really bad.
If they're really dependent on 20nm, it bodes poorly; 28nm will have better yields for a long, long time, to the extent that when the AMD 20nm GPU does ship it probably won't be available anywhere in the short term anyway.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
people should probably take comparisons of CPU perf and power figures between phone/tablet and laptop/desktop parts with a huge grain of salt, because measuring actual power consumption and accounting for process differences is Hard; NOP; NOP; NOP

(it is an ia64 post)

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

ehnus posted:

There can be bugs on machines with lower numbers of cores that can sometimes manifest. For example if you have two threads of high priority busy-waiting for work to be finished by threads of lower priority the system can stop making forward progress as the operating system will not pre-empt the higher priority threads. On a four core system this situation wouldn't happen.
uh that is not how thread priority works, these aren't realtime OSes--a high priority thread does not keep a low priority thread from ever running

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

SwissArmyDruid posted:

Motherfucker, seriously?!
did you seriously not see this coming? microservers never really made sense: http://www.realworldtech.com/microservers/

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

The tradeoff is that your software has to be smart enough to take advantage of it. If your kernel treats the battery-saver core like a normal core you're going to have issues.
No software is smart enough for this. Also, 4+1 is dead in favor of 4+4 A53+A57 (or A53+A53 if you're Huawei lol), which software is even less equipped to handle.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

I would think it's a pretty straightforward fix - you tweak your kernel scheduler and power manager to prefer the battery-saver core when load is below some threshold. I guess I shouldn't have said "software" - that's a kernel thing. Userland software shouldn't handle processor management.
everybody seems to think this (including _lots_ of people in industry) and it's magical thinking, that you can somehow know the load at an exact point in time on the small core and instantly migrate. migration is not free power-wise, it's not free latency-wise, and there's no way to predict the need to migrate from one to the other (you only know load in the past, and past load is not an indicator of future load, especially when you consider that most workloads are bursty). as a result, you're going to have up to N ms (whatever your scheduler interval is) of being totally overloaded on the low power core before the scheduler load balances. and that means you still have to clock up the larger cores, which is not free latency-wise either. so either you're right and you still have some period of time where the system performed badly (which may be okay or it may cause dropped frames or stuttering or other bad things), or you powered up the big core unnecessarily, migrated, kept it on for as long as it takes the scheduler to decay the big core's load, and then migrated everything back to the little core (so much for saving power).

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

JawnV6 posted:

This is pretty much happening. Your phrasing acts like it isn't, which is sorta confusing? Intel's doing their best to jam it into the mobile segment but everyone else is counting down the days until ARM scales up to servers.
either that or Power stuff starts coming out. I think it's much more likely that the EU funds ARM enough to be a thing in servers rather than IBM somehow making money on Power to continue meaningful development, though.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

FaustianQ posted:

I mean fair enough and everything, but why care about AMD64 at all then? It seems pretty drat irrelevant since Intel currently has defacto control over x86-64 in the server market, and is kind of irrelevant in the mobile market. So Intel grabs AMD64, no one cares as ARM marches on and Intel fiddles with an increasingly dead technology, woop, or moves to ARM as well.
cause Windows desktops. if Apple wanted OSX to be ARM64 or Power or homegrown ISA, they could (new Xcode, it compiles to the new thing, they're a big enough market that people will generally do whatever). same thing with Linux, since the vast majority is open source. the massive amount of legacy Windows software that will never be compiled to target a new ISA is the only reason why anyone really cares about x86 as an ISA (vs Intel or AMD processors as generic CPUs at a given price/perf point) at this point.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

ohgodwhat posted:

Is Power dying too?
well, they have some big open consortium now to popularize Power, including NVIDIA and Google, and they've won two giant DOE supercomputers in 2017 IIRC, so that's good. but IBM as a whole is doing badly enough that anything that isn't making a lot of money is going to die. given that the rest of IBM is moving toward services, HPC in general is not a huge business, and producing chips is increasingly expensive, I'm not bullish about Power's prospects overall. I can't see ISA licensing being a great prospect either, considering there will be orders of magnitude more investment in the ARM ecosystem, which means ARM will have significantly better compilers/debuggers/etc.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
would take 4 to 5 years and $1.5B+ in R&D to ship something like that, since you're basically talking about coupling two vastly different processors at a level far beyond anything that exists today (actual shared MMU which is a new level of synchronization hell that no one is prepared to deal with), an entirely new interconnect, far better ECC/reliability functionality than exists on any GPU today, and complete rewrites of every software package that these exascale sites care about (plus all of the required driver/OS/networking stack bits). it's also something of a niche product--I don't think this is a great design for anything that doesn't require strong scaling, as your node throughput is going to be fixed and so you have a fixed ratio of nodes to networking gear, even in problem domains where you want the fattest nodes possible. (you know, deep learning)

fake edit: I actually glanced over the AMD paper after writing the above, and they briefly touch on all of these concerns. the SW concerns are blown off with "eh, HSA will handle it! it's open, everybody will be using it by then!" like every other AMD document ever produced, the RAS concerns are "oh yeah, we'll figure that out, huh," and there's no mention of an MPI replacement or something that works better than MPI for so many nodes, and so on. in other words, don't get your hopes up, pals, this is a fluff piece.

it's exactly as fluffy as the NVIDIA SC10 preso that laid out "Echelon," their exascale processor. and shocking! it basically says exactly the same stuff.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

wipeout posted:

Much more interested in K12 being a killer chip these days - at least they might stand a chance long term in the ARM market.
what market? I'm not being flippant, this is a serious statement--everybody keeps making claims about how server ARM will absolutely positively be a thing, and I always go back to this, and AMD has no mobile presence.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Nintendo Kid posted:

In transitioning minor applications over sure, but for servers people will port poo poo to anything for business line applications, including Itanium, so long as the chip works well for a server.

And having a slow x86 core on top of a decent ARM core doesn't really help for anything, you'd need a radically modified OS to handle some of the stuff going on on x86 and some of the things going on on ARM. There are some mainframe OSes which can handle splitting tasks over diverse microarcitectures, but it's useless for say a Windows application.
nah, you'd just do binary translation like Denver or x86 Android. Houdini on x86 runs ARM code on x86, after all. The problem is that the perf hit sucks in a close market.

The real problem, of course, is that ARM isn't easier to make a chip for than x86 so who even cares if you're building ARM or x86 anyway. Larrabee/the Knights line disproved the whole "instruction decode on x86 will completely dominate the power for a small chip" thing (also Edison).

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
and Jim Keller is out at AMD

so much for Zen, I guess!

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

FaustianQ posted:

Sounds like 16nm is better for K12, Zen mobile, and mobile GPU. 14nm sounds better for Zen desktop, desktop iGPU and dGPU.

I mean there has to be a reason AMD is willing to tap into both TSMC and GoFlo for upcoming products. Maybe GoFlo 14nm will give Zen the overclocking headroom to make up IPC shortfall.
it's probably just to avoid being supply constrained, same as Apple

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

re CPU + CPU, i.e. big.LITTLE: It's a promising combination on paper, but my understanding is that it's hard to get a chip that does a good job of balancing real-world loads. It takes a lot of energy to switch between the cores, and it's easy to get "thrashing" back and forth between the processor types.
I think it's fairly unlikely that we'll ever see heterogeneous cores outside of ARM64 and probably never on any chips that expect to run above a 10W TDP.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
FWIW, you'll learn more about GPU compute from reading about C* and MasPar than from any other historical artifacts I've seen.

Professor Science fucked around with this message at 06:46 on Jan 26, 2016

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

FaustianQ posted:

I'm going to OD on all this salt

This likely means nothing but it'd be pretty cool if the EHP was coming along nicely, although this should indicate functional engineering samples of Zen exist and are currently in the late stages of sampling or towards first revisions if AMD wants to meet Q3 target with enough volume.
uh the very next patch indicates that this is for Carrizo?
