|
Alereon posted:ARM-based servers are also a thing that will happen, which is going to seriously impact both Intel and AMD.
|
# ¿ Oct 15, 2012 05:13 |
|
Factory Factory posted:The article addresses CPU choice a bit: that's just what the original Jaguar system used, since they were the better choice at its inception in 2005, and the Cray XK line has kept AMD offerings since.

GPU: AMD never had the software strategy to actually go after this market. It's a lot harder than "build a driver, call it a day" (support for third-party debuggers and profilers, out-of-band monitoring, dealing nicely with exotic interconnects, etc.--lots of stuff that never shows up at all on the non-HPC side of things), and they simply never staffed OpenCL/compute up enough to ever show up in these bids. Xeon Phi is the first real competing accelerator that NV has had in the HPC space.

Professor Science fucked around with this message at 04:30 on Nov 1, 2012 |
# ¿ Nov 1, 2012 04:24 |
|
necrobobsledder posted:Yeah, it needs to be stressed that the primary reason for going AMD years ago was because of HyperTransport beating the pants off Intel's interconnect, since that's such a big factor in large-scale HPC workloads (the joke is that supercomputers turn CPU-bound tasks into I/O-bound ones).

the only equivalent device that I know of that sits on QPI is the interconnect on the SGI UV machines.
|
# ¿ Nov 1, 2012 06:51 |
|
Factory Factory posted:Meh, it's not like AMD has much else going on right now, and this is definitely relevant with respect to AMD's HSA, what with the (dedicated?) GPGPU block.
|
# ¿ Feb 21, 2013 05:58 |
|
Factory Factory posted:One of the rumors going around, not sure if it was for PS4 or XboxNumbers, was that there would be a CU set aside for GPGPU programming to take over some of the highly parallel functions in the current CPUs of the PS3 and Xbox 360.

edit: the reason it won't be static is that since there's a single platform, you could pretty easily either timeslice the entire GPU if you have bulk processing to do, or drain a limited number of CUs (known a priori thanks to the aforementioned single platform) at specific times if you're doing more latency-sensitive work.

Professor Science fucked around with this message at 06:15 on Feb 21, 2013 |
# ¿ Feb 21, 2013 06:11 |
|
roadhead posted:Main memory is shared and I bet all the L2 cache (and L3 if there is any) is shared as well.
|
# ¿ Feb 22, 2013 03:38 |
|
Cybernetic Vermin posted:I think you are vastly underestimating the number of extremely pessimistic global locks in software today. There is bound to be a lot of software that goes from being just "multi-threaded" to basically scaling linearly in some operations with cheap memory transactions.

edit: you're basically arguing that there are a lot of applications that are inherently parallel, CPU-bound, and extremely impacted by locking overhead on some shared object. those are the only cases where transactional memory could theoretically make a performance difference. what applications would those be? also, please note that Haswell TSX won't solve those problems, due to the inability of any app that relies on a monolithic global lock to fit all of its shared state in L1 (rough sketch of what I mean at the end of this post). the transactional memory people have been tooting this horn for ten years; if it were actually as amazing as all the academic papers claim, Rock probably wouldn't have killed Sun.

Professor Science fucked around with this message at 00:36 on Feb 24, 2013 |
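the sketch (mine, nothing from a real code base): with Haswell RTM, everything a transaction touches has to be tracked in L1, so a monolithic lock guarding more state than that aborts every time and ends up taking the real lock anyway.
code:
// hedged sketch, not production-grade lock elision: a real scheme would
// also read big_lock inside the transaction so that a fallback holder
// aborts us. compile with -mrtm on a Haswell or later CPU.
#include <immintrin.h>   // _xbegin / _xend / _XBEGIN_STARTED
#include <cstddef>
#include <mutex>

std::mutex big_lock;                      // the monolithic global lock

void update(int* state, std::size_t n)    // n * sizeof(int) must fit in L1...
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        for (std::size_t i = 0; i < n; ++i)  // too much state -> capacity abort
            state[i] += 1;
        _xend();
    } else {                              // aborted: take the real lock
        std::lock_guard<std::mutex> g(big_lock);
        for (std::size_t i = 0; i < n; ++i)
            state[i] += 1;
    }
}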
# ¿ Feb 24, 2013 00:24 |
|
Factory Factory posted:Actually threading isn't the limit of parallelism in FPUs. It's far older tech than multi-threading to pass multiple numbers in a single instruction - it's called vector processing, and it's been part of x86 since Intel's SSE on the Pentium III and part of computing for far longer than that.

if you look at the perf advantage of GPU versus CPU for scientific compute applications, a lot of the time it scales linearly with bandwidth (sketch of what I mean at the end of this post). graphics often performs similarly. as Intel's fabs get better and their ability to slap down huge amounts of eDRAM improves, it's going to be bad for AMD.

also, it sounds like HSA for discrete GPUs is dead. can't say I'm shocked, that was never a good value proposition for anyone (cache coherence over PCIe... no).
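the sketch (made-up, not from any real app): a STREAM-style triad, which is roughly what a lot of HPC kernels reduce to--the flop count is so low that runtime is basically bytes moved divided by bandwidth, on either a GPU or a CPU.
code:
// hedged illustration with made-up names: ~24 bytes of traffic per element
// for 2 flops, so time ~= 24 * n / memory_bandwidth and peak FLOPs never
// enter into it. the GPU "win" here tracks GDDR5-vs-DDR3 bandwidth.
__global__ void triad(double* a, const double* b, const double* c,
                      double scalar, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = b[i] + scalar * c[i];   // one FMA per three global accesses
}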
|
# ¿ Nov 18, 2013 05:12 |
|
Factory Factory posted:I imagine the big deal here is just plugging into the HSA stack and letting the scheduler do the job rather than having to write up an entire OpenCL workload, but honestly I have no Goddamn clue.
|
# ¿ Jan 7, 2014 03:22 |
|
Factory Factory posted:That... would be a significant disappointment compared to how they're selling this. The whole subtext behind HSA and HUMA was that GPGPU programming would become as easy as sufficiently threading your workload. How does this make OpenCL programming any easier?

(HSA is mostly hype? from a collection of hardware vendors? THEY WOULD NEVER)
|
# ¿ Jan 7, 2014 03:53 |
|
deleted: I'm stupid
|
# ¿ Jan 16, 2014 03:43 |
|
there were some rumors that Kaveri was going to depend on GDDR5M for better perf, but since Elpida imploded, those have disappeared.
|
# ¿ Jan 18, 2014 04:47 |
|
edit: nah, delete me
|
# ¿ May 7, 2014 19:41 |
|
Paul MaudDib posted:e: The A10-7850k has 512 stream processor cores. Any idea how a Kaveri core compares to a Kepler core?

also good lord, people are still buying the "GPUs are 8-100x faster than CPUs if used properly" tripe? that's NVIDIA PR from 2008 or 2009, and it has no basis in reality outside of really specific special cases that generally boil down to "the texture unit offers free FLOPs if you need interpolation and the texture fetch offers enough accuracy." if you port a naive (non-SSE/AVX, single-threaded, not really optimized at all) C++ app to CUDA and optimize the hell out of it, yeah, you might get 10 or 20x. if you actually optimize the CPU code and run it on anything approaching 250W worth of CPUs, yeah, you might get 10 or 20x out of it there too. GPUs offer advantages, but it's ~2.5x versus well-optimized CPU code on apps that are especially suitable for GPUs.

(maybe it seems like I poo poo on GPUs a lot--I only do because I know a lot about them, and the claims people make are usually totally unrealistic. this is one of them.)
|
# ¿ Aug 22, 2014 03:27 |
|
Paul MaudDib posted:e: This is the wrong way to look at it but I contend that the ability to manage more cores doing more useful work thanks to improved architectures/frameworks has probably increased performance relative to well-established CPU computation since day 1 of CUDA's launch over the last 6 years. Server CPUs haven't gotten much more parallel since 2008, CUDA has gone from 128 cores to 2880 per die, the architecture (instructions/bandwidth balance/featuresets/etc) is more optimized for compute work instead of graphics rendering, and there's been more time to work on optimizing algorithms for GPU processors. And working with zero-copy on die is inherently more powerful than a co-processor on a bus in terms of latency and flexibility.

the only case I know of where GPUs took over an industry was reverse time migration for seismic processing. in general, I think we have decent tools and programming models for writing kernels on specific GPUs. what we don't have is any sort of way to write applications with large parallel sections that run on the appropriate compute device. until the latter gets solved, GPU compute will remain a novelty limited to HPC, embedded platforms, and the occasional developer relations stunt. I don't think coherent memory does the latter on its own, although it's a nice step--it's more of an improvement for the "writing kernels" part.

edit: whoops, this is the AMD thread and I'm pretty far afield. just PM me if you want to talk more about this.

Professor Science fucked around with this message at 07:02 on Aug 22, 2014 |
# ¿ Aug 22, 2014 06:59 |
|
GrizzlyCow posted:Well, AMD is trying to make inroads on that HSA promise. They partnered with Microsoft to create a new C++ compiler.
|
# ¿ Aug 27, 2014 04:29 |
|
I don't have time/effort to respond to all of this (just compare benchmark results--I don't know of any GPU app in HPC that is actually flops-bound instead of bandwidth-bound), but a few things:

Paul MaudDib posted:There's also a ton of work devoted to getting around the memory size limitation. Multiple GPUs in a system can do DMAs to each other's memory, so you can get up to 48GB of GDDR5 memory per system with 4x K40s. In theory you could also do DMA to something like a PCI-e SSD, which might offer better latency (but lower bandwidth) than main system memory.

also, peer to peer latency is ~identical to system memory latency, although I haven't measured that in a long time.

quote:We actually do have tools that let you write large parallel sections easily - OpenMP and OpenACC frameworks for exploiting loop-level parallelism in particular. On a finer grain there's marking calls as tasks and then synchronizing, etc.

quote:The problem is that tools are much less of an effective solution when there's big overhead to start/stop a co-processor. That's my problem with the OpenACC approach to GPUs - it doesn't make sense to copy it down to the processor, invoke and synchronize the kernel, and then copy it back up to parallelize a single loop; GPU programs really should be running mostly on the GPU rather than shuttling data back and forth all the time. It makes sense for intensive portions that are reasonably separate from the rest of the program, but a generic "send this part of the program to the GPU" isn't really going to be the best solution in a lot of cases.

this is the interesting part of APUs or KNL; by skipping PCIe, the quantum required for a speedup should be a lot lower. once it gets low enough to the point that arbitrary function dispatch can be profitable even if the CPU can just wait for the results to come back, then that will be transformative. however, that's probably going to look more like AVX512 or AVX1024 than a dedicated GPU. also, OpenACC actually tries to keep memory on the GPU as much as possible and isn't doing a naive copy-to/copy-back every time (rough sketch of what I mean at the end of this post).

quote:Of course, all such tools do have the problem that their design goal is to make it easy to write/adapt generic code. Writing applications using architecture-specific features can make a pretty big impact in performance. One reasonable approach to this would be to let you compile in OpenCL or CUDA code directly - i.e. you should be able to call __device__ functions that can use the architecture-specific features, perhaps with multiple versions to target different architectures.
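the OpenACC sketch from above (made-up names and loop, nothing from a real code base): with a data region, the arrays live on the GPU for the whole solve and the per-loop launches don't copy anything.
code:
// hedged OpenACC sketch. the data region puts a[] and b[] on the GPU once;
// every parallel loop inside reuses that device copy, so there is no
// copy-down/copy-back per loop like the naive offload model implies.
void relax(float* a, const float* b, int n, int nsteps)
{
    #pragma acc data copy(a[0:n]) copyin(b[0:n])
    {
        for (int step = 0; step < nsteps; ++step) {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] += 0.5f * b[i];   // stays resident on the GPU
        }
    }   // a[] comes back to the host exactly once, here
}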
|
# ¿ Aug 28, 2014 03:56 |
|
Factory Factory posted:Anyone else feelin' pretty dumb after the last few posts? Heck, I'm almost ready to buy an AMD CPU.

1. it takes a lot of parallel work to fill up a GPU, and GPUs can't timeslice between processes or anything like that when one task isn't offering enough work to completely fill the GPU. this limitation generally doesn't apply to CPU apps, so the problem with porting becomes more about restructuring the rest of your app to get enough work in a single place at a time than porting some kernels.

2. when everything about your platform (GPU type, GPU count, whether GPUs are shared or exclusive, CPU count, PCIe perf, ...) isn't known a priori, writing an application that uses the right processor at the right time becomes really hard. it's partially a language issue and partially a runtime issue. language-wise, the CUDA execution model (later used by OpenCL and DirectCompute) exposes some extremely low-level hardware details and in fact requires users to specify things at that level in order to get good performance. these include exact details of the memory hierarchy, warp widths and warp-synchronous programming, how a kernel traverses over data, etc--the developer has to control all of this to get something that runs fast (small sketch of what that tuning looks like at the end of this post). since those details change from GPU generation to GPU generation, there's no one ideal GPU kernel. runtime-wise, there's no way to figure out how busy a GPU is (other than "is anyone else potentially using this GPU at all"), so picking a particular GPU to use in a multi-GPU system is also really hard. couple that with GPU performance variance between different models or different vendors, and the question rapidly becomes "should I use a GPU at all right now or stick to the CPU."

the reason why GPUs have been pretty successful in HPC thus far is because #2 doesn't really apply--if you're buying 10-40k GPUs, you generally buy the same kind of GPUs and write software for that machine. most of the software improvements in recent years have been focused on #1 (GPU work creation, improved support for concurrent kernels, multi-process support, even interaction with InfiniBand and other NICs), and the rest of the porting problem can get brute-forced by investing in enough software engineers. meanwhile, #2 is as critical an issue in desktop and mobile as #1, and there's been very little work to solve that. OpenACC and Thrust come to mind to solve the language issue, but there's still very little research on good runtime schedulers that I'm aware of (the last big thing was StarPU, and that was what, four or five years ago?). over time, #2 will become more important for HPC as there's less of a CUDA monoculture, but I don't have any idea as to what direction that will take right now.
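the tuning sketch (made-up kernel, not from any real app): even a trivial CUDA kernel bakes hardware details into the source--block size, how the grid walks the data, how many blocks it takes to fill the SMs--and all of those get re-tuned per GPU generation.
code:
// hedged illustration of how much hardware detail leaks into trivial CUDA.
__global__ void scale(float* out, const float* in, float s, int n)
{
    // grid-stride loop: the kernel, not the runtime, decides how the grid
    // traverses the data.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        out[i] = s * in[i];
}

void launch_scale(float* out, const float* in, float s, int n)
{
    int block = 256;                      // tuned per architecture
    int grid  = (n + block - 1) / block;  // or sized to just fill the SMs
    scale<<<grid, block>>>(out, in, s, n);
}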
|
# ¿ Aug 28, 2014 18:30 |
|
Paul MaudDib posted:What I'm getting from your earlier post is that most multithreaded applications are written with some type of "task-based" parallelization (you said TBB).

quote:I think one strategy is to try and generate your own data parallelism as you go. Latency already sucks compared to CPUs, so just embrace it and let a runtime batch up data items/tasks and then process them in parallel when possible. In terms of implementation strategies, I think that would end up looking a lot like the Erlang runtime. You have something that looks like lightweight threads which a runtime or the scheduler batches and then dispatches when a sufficient number of data items are ready (or the processor is under-utilized enough that processing partial batches doesn't matter).

quote:That has overhead too of course, but you're amortizing it across more data items per launch, and kernel launch overhead is much, much lower from the kernel (onboard the GPU) than from the CPU.

quote:I generally agree that CUDA and OpenCL expose a lot of low-level mechanics to the programmer, but you don't really need to hand-tune your program to every single device to get good performance. If you're writing low-level C code for your entire program, yeah, your life is going to suck, but the higher-productivity way here is to write your program in terms of template library calls and then let the library authors handle tuning the operations to the various architectures. That's Thrust, Cuda Unbound, CUDPP, and so on, which handle warp-, block-, and device-wide collective operations. All of those are designed to be drop-in solutions that will work on any architecture or block size. Like CPUs, not all portions of your program are really critical, and libraries often cover most of the critical parts fairly well.

quote:As for grid topology, you can write some math which does some rough occupancy calculations at runtime and get a reasonable guesstimate.

quote:Not quite sure what you mean by "warp synchronization"; if you mean thread-fences around shared memory operations, that's roughly equivalent to the trouble caused on CPUs by forgetting to wait for task completion. It's a thing you need to remember, but one you'll notice right away when your 2880-core processor spews obvious race conditions at you. That previous example uses them (__syncthreads()), they're pretty basic.

1. it's good for warps to be out of phase with each other within a block, as that makes it more likely that your various execution units will be in use while memory loads/stores are happening. in other words, __syncthreads() eats performance.

2. if you know the warp size, you know that all of your operations will be completed at the same time, meaning there's no such thing as an intra-warp dependency across an instruction boundary. so a simple example of this is the same thing you'd use warp shuffle for:
code:
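// the original snippet didn't survive the archive; this is a hedged
// reconstruction of the usual warp-level sum it was describing. within a
// warp all lanes execute together, so there is no __syncthreads() anywhere:
// each lane just grabs its neighbor's partial with a shuffle.
// (2014-era CUDA spelled this __shfl_down(); current CUDA wants the _sync
// variant with an explicit lane mask.)
__device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)        // 32-wide warp
        val += __shfl_down_sync(0xffffffff, val, offset);  // no barrier needed
    return val;                                            // lane 0 has the sum
}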
quote:Other stuff here is not real at all. Warp size has been 32 threads for every CUDA Compute Capability spec so far. Traversing the kernel's grid over data is pretty straightforward, and the fact that blocks can't communicate basically eliminates the possibility that you're doing something not straightforward. If threads need to communicate, they go in a block together. There is definitely an API that lets you get GPU utilization.

(I've written a _lot_ of CUDA. also blocks can communicate, I'll leave that as an exercise to the reader)

Professor Science fucked around with this message at 02:52 on Aug 29, 2014 |
# ¿ Aug 29, 2014 02:41 |
|
Paul MaudDib posted:Spinlocked infinite-loop kernels? I was under the impression that falls into "seriously undefined behavior".

quote:For global memory, sure, which is why I specified "shared memory". On-multiprocessor memory has much lower latency. Dynamic parallelism should then avoid this, right?

Not sure what you mean by sharing local memory.

edit: wait, do you mean launching out of the equivalent of __shared__?

Professor Science fucked around with this message at 03:37 on Aug 29, 2014 |
# ¿ Aug 29, 2014 03:28 |
|
maybe I'm not following you, but either you can't assume that the same SM is going to run the new kernel (%smid in PTX is effectively volatile, after all) or you don't need dynamic parallelism. also you should get plat or some other easy means of contact.
|
# ¿ Aug 29, 2014 04:14 |
|
afaik Power is not a meaningful player in high-performance computing (Top500 and the like) anymore, for lots of reasons that nobody probably cares about but me. glancing through Top500, the only new IBM-based machine at all in the past two years in the top 25 is an internal IBM Power8 machine.

also this conversation about ISAs is silly, because for various reasons the ISA has been proven to pretty much not matter so long as it's not completely idiotic (e.g., x86 is kind of idiotic, x86-64 is not, ARMv8 is not, I don't know enough about MIPS64 to say either way). if you really wanted to build a <1W Power8 CPU with good perf/W, you probably could. if you wanted to build a 300W ARMv8 CPU with big-iron-style perf/W, you could do that too. the argument people always make is that x86 in particular has huge overhead for instruction decode, but things like Silvermont/Knights Corner/Knights Landing have shown that instruction decode is such a minuscule part of a modern CPU that it doesn't matter.

the thing that matters once you have equivalent processors with different ISAs: toolchain. x86 dominates (especially for HPC), ARM is closing fast in the not-FORTRAN-oriented space, and everything else is significantly further behind.
|
# ¿ Oct 7, 2014 06:16 |
|
Rastor posted:POWER architecture has been around for some time and claims to be gearing up to face down Intel from the high end, while ARM continues to challenge on the low end and even MIPS making some noise as they are expecting to be supported in the next Android release.
|
# ¿ Oct 10, 2014 05:17 |
|
Panty Saluter posted:The Nvidia 9xx series might be the final nail in the coffin. Far as I can tell there just isn't a reason to buy anything else right now unless all you need is a (low margin) budget card. I hope AMD pulls through but it looks really bad.
|
# ¿ Oct 18, 2014 04:19 |
|
people should probably take comparisons of CPU perf and power figures between phone/tablet and laptop/desktop parts with a huge grain of salt, because measuring actual power consumption and accounting for process differences is Hard; NOP; NOP; NOP (it is an ia64 post)
|
# ¿ Dec 30, 2014 00:00 |
|
ehnus posted:There can be bugs on machines with lower numbers of cores that can sometimes manifest. For example if you have two threads of high priority busy-waiting for work to be finished by threads of lower priority the system can stop making forward progress as the operating system will not pre-empt the higher priority threads. On a four core system this situation wouldn't happen.
|
# ¿ Apr 3, 2015 06:45 |
|
SwissArmyDruid posted:Motherfucker, seriously?!
|
# ¿ Apr 17, 2015 01:34 |
|
Paul MaudDib posted:The tradeoff is that your software has to be smart enough to take advantage of it. If your kernel treats the battery-saver core like a normal core you're going to have issues.
|
# ¿ Apr 18, 2015 06:31 |
|
Paul MaudDib posted:I would think it's a pretty straightforward fix - you tweak your kernel scheduler and power manager to prefer the battery-saver core when load is below some threshold. I guess I shouldn't have said "software" - that's a kernel thing. Userland software shouldn't handle processor management.
|
# ¿ Apr 18, 2015 19:11 |
|
JawnV6 posted:This is pretty much happening. Your phrasing acts like it isn't, which is sorta confusing?

Intel's doing their best to jam it into the mobile segment, but everyone else is counting down the days until ARM scales up to servers.
|
# ¿ Apr 19, 2015 21:06 |
|
FaustianQ posted:I mean fair enough and everything, but why care about AMD64 at all then? It seems pretty drat irrelevant since Intel currently has defacto control over x86-64 in the server market, and is kind of irrelevant in the mobile market. So Intel grabs AMD64, no one cares as ARM marches on and Intel fiddles with an increasingly dead technology, woop, or moves to ARM as well.
|
# ¿ Apr 19, 2015 21:24 |
|
ohgodwhat posted:Is Power dying too?
|
# ¿ Apr 20, 2015 00:26 |
|
would take 4 to 5 years and $1.5B+ in R&D to ship something like that, since you're basically talking about coupling two vastly different processors at a level far beyond anything that exists today (an actual shared MMU, which is a new level of synchronization hell that no one is prepared to deal with), an entirely new interconnect, far better ECC/reliability functionality than exists on any GPU today, and complete rewrites of every software package that these exascale sites care about (plus all of the required driver/OS/networking stack bits).

it's also something of a niche product--I don't think this is a great design for anything that doesn't require strong scaling, as your node throughput is going to be fixed and so you have a fixed ratio of nodes to networking gear, even in problem domains where you want the fattest nodes possible. (you know, deep learning)

fake edit: I actually glanced over the AMD paper after writing the above, and they briefly touch on all of these concerns. the SW concerns are blown off with "eh, HSA will handle it! it's open, everybody will be using it by then!" like every other AMD document ever produced, the RAS concerns are "oh yeah, we'll figure that out, huh," no mention of an MPI replacement or something that works better than MPI for so many nodes, et al. in other words, don't get your hopes up, pals, this is a fluff piece. it's exactly as fluffy as the NVIDIA SC10 preso that laid out "Echelon," their exascale processor. and shocking! it basically says exactly the same stuff.
|
# ¿ Aug 4, 2015 07:56 |
|
wipeout posted:Much more interested in K12 being a killer chip these days - at least they might stand a chance long term in the ARM market.
|
# ¿ Aug 15, 2015 19:30 |
|
Nintendo Kid posted:In transitioning minor applications over sure, but for servers people will port poo poo to anything for business line applications, including Itanium, so long as the chip works well for a server.

The real problem, of course, is that ARM isn't easier to make a chip for than x86, so who even cares if you're building ARM or x86 anyway. Larrabee/the Knights line disproved the whole "instruction decode on x86 will completely dominate the power for a small chip" thing (also Edison).
|
# ¿ Aug 16, 2015 01:55 |
|
and Jim Keller is out at AMD. so much for Zen, I guess!
|
# ¿ Sep 18, 2015 19:23 |
|
FaustianQ posted:Sounds like 16nm is better for K12, Zen mobile, and mobile GPU. 14nm sounds better for Zen desktop, desktop iGPU and dGPU.
|
# ¿ Nov 6, 2015 05:20 |
|
Paul MaudDib posted:re CPU + CPU, i.e. big.LITTLE: It's a promising combination on paper, but my understanding is that it's hard to get a chip that does a good job of balancing real-world loads. It takes a lot of energy to switch between the cores, and it's easy to get "thrashing" back and forth between the processor types.
|
# ¿ Jan 23, 2016 23:40 |
|
FWIW, you'll learn more about GPU compute from reading about C* and MasPar than from any other historical artifacts I've seen.
Professor Science fucked around with this message at 06:46 on Jan 26, 2016 |
# ¿ Jan 26, 2016 06:02 |
|
FaustianQ posted:I'm going to OD on all this salt
|
# ¿ Feb 3, 2016 04:02 |