1gnoirents
Jun 28, 2014

hello :)

JawnV6 posted:

You're beyond bleak if the savior is IBM. I used to be concerned that their teambuilding software (put in project parameters, it scans db of employees/contractors and spits out project team options) would be the core of our dystopian future. But they were bitten by the financialization bug and don't look likely to pull out.

Further reading: http://www.forbes.com/sites/stevedenning/2014/05/30/why-ibm-is-in-decline/
http://www.forbes.com/sites/stevedenning/2014/06/03/why-financialization-has-run-amok/

I'm talking about the brain they just made. From all accounts, it's actually real now. They're currently fitting them together like Legos to make one as capable as the human brain, but at computer speeds.

:skynet:

thegreatcodfish
Aug 2, 2004

1gnoirents posted:

I'm talking about the brain they just made. From all accounts, it's actually real now. They're currently fitting them together like Legos to make one as capable as the human brain, but at computer speeds.

:skynet:

Last week's Science had a couple interesting articles on it if you have access:

http://www.sciencemag.org/content/345/6197/614.summary
http://www.sciencemag.org/content/345/6197/668.summary

1gnoirents
Jun 28, 2014

hello :)
I don't have access, but I downloaded some white papers from IBM and attempted to read them. I'm highly interested, and I quickly found out exactly how far I am from being smart enough to understand how it really works. I'm stoked about the key selling points, though. I was wondering if, in the meantime, they could simply simulate a traditional computer within a synapse chip and run it on 1 watt. Of course this is the wrong thread now, but: AMD should sell up and invest in that so I can have my supercomputer phone smell for me.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
The problem with neural networks has always been in applying them. Simple nets don't seem to be any more powerful than a linear regression, while complex networks are hard to train to do anything useful.

A brain chip is only as useful as the brain programs it runs.

JawnV6
Jul 4, 2004

So hot ...
Seems like the same problem raytracing hits: if you can do it effectively, you've got the hardware to do more triangles better. If you're able to get a 5.4M-transistor chip (which is really, really close to the 4.4M logic cells that Xilinx is offering anyway) that takes a few dozen PhDs to effectively "program," it'll make more sense to buy commodity hardware that's accessible to mere mortals with master's degrees.

Regardless, going back to the original post I quoted, if AMD "sells out" their CPU side... what's left, exactly?

1gnoirents
Jun 28, 2014

hello :)

JawnV6 posted:

Seems like the same problem raytracing hits: if you can do it effectively, you've got the hardware to do more triangles better. If you're able to get a 5.4M-transistor chip (which is really, really close to the 4.4M logic cells that Xilinx is offering anyway) that takes a few dozen PhDs to effectively "program," it'll make more sense to buy commodity hardware that's accessible to mere mortals with master's degrees.

Regardless, going back to the original post I quoted, if AMD "sells out" their CPU side... what's left, exactly?

GPUs, I guess. My real point was that it sounds pretty doomed, they're providing almost negative competition, and I'd rather see that money funneled into some moonshot project like what IBM is doing.

Factory Factory
Mar 19, 2010

This is what
Arcane Velocity was like.
It'd be tough to say. The CPU side would include their ARM uarch license and their SoCs, so it'd leave... Well, something that isn't ATI any more, but the way it shambles kind of resembles it from one angle.

Graphics is basically the only thing they do now that isn't entirely about putting CPUs in things, and a lot of the graphics stuff is about putting them in the CPUs that get put in things.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

1gnoirents posted:

GPUs, I guess. My real point was that it sounds pretty doomed, they're providing almost negative competition, and I'd rather see that money funneled into some moonshot project like what IBM is doing.

Putting GPU cores on the CPU is a much more plausible long-term shot, actually. AMD was the first to market with a fully unified memory architecture, and they're still the only ones who offer that.

Despite NVIDIA's advertising, they haven't actually done it. Their CPU/GPU SoC, Tegra K1, still forces you to declare memory as either CPU or GPU for caching purposes, and with a discrete GPU there's fundamentally no way to avoid copying data across the bus; at best you can let the programmer pretend there's no copy and make it happen behind the scenes, out of their control.

Intel's CPU-based heterogeneous compute has been a total joke up until recently. So far the Phi coprocessor hasn't really had much penetration (probably not a coincidence), far less than CUDA or even OpenCL. I just read an interesting slide deck on the Knights Landing AVX-512 instruction set; it seems like they'll catch up eventually.

In comparison, AMD has a product on the market right now that will let you dispatch CPU and GPU cores to the same memory space without having to copy data around. Honestly, that's where I see the long-term performance growth as we go forward - per-processor performance will continue to double every 18 months, but you can get an 8-100x performance bump right now in a lot of applications (not every application) with heterogeneous computing. The fact that you don't need to copy data around makes AMD's product really advantageous in desktop or server-type applications where latency or power matter, or where copy overhead would eat up the performance gains on small problem sizes.

Seems doubtful that AMD can really manage to sell it, though. Technical capability doesn't mean poo poo unless someone's written programs to exploit it and people have processors to run them. Jaguar is doing OK in the mobile market, but CUDA's got the momentum so far in technical computing markets. Intel can muscle their way in if they need to; AMD doesn't really have that luxury.

Paul MaudDib fucked around with this message at 02:48 on Aug 22, 2014

Lord Windy
Mar 26, 2010

Oh, I meant the drivers. Nvidia does have OpenCL and it works well, but their drivers only work for their video cards and not for CPUs. Intel's drivers only work on Intel CPUs and integrated graphics. But AMD's drivers work on everything except Nvidia cards.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Lord Windy posted:

Oh, I meant the drivers. Nvidia does have OpenCL and it works well, but their drivers only work for their video cards and not for CPUs. Intel's drivers only work on Intel CPUs and integrated graphics. But AMD's drivers work on everything except Nvidia cards.

OpenCL is just a specification for some APIs; it's up to the hardware company to implement a compiler and drivers and so on. NVIDIA doesn't implement OpenCL drivers for CPUs because they don't produce CPUs. Intel's drivers only work on CPUs and integrated graphics because they don't produce discrete graphics cards.

It's not even tied to a specific type of heterogeneous computing device; you can also use it to dispatch work to things like DSPs or FPGAs or CPU cores. That's actually one of the core challenges with OpenCL - it's really kind of a vague standard, and it's up to the hardware manufacturer to do the legwork on it.
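
To make "it's just a specification" concrete, here's a rough host-side sketch (illustrative only, I haven't run it against any particular vendor's runtime): it just asks whatever OpenCL implementations are installed which devices they expose. CPU, GPU, DSP, FPGA - same API, it's entirely down to whose driver is present.

code:
// Enumerate OpenCL platforms and count the devices each vendor's driver
// exposes. C++ using the standard OpenCL C API; build with -lOpenCL.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    if (num_platforms == 0) {
        std::printf("no OpenCL runtime installed\n");
        return 0;
    }
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        char name[256] = {};
        clGetPlatformInfo(p, CL_PLATFORM_NAME, sizeof(name), name, nullptr);

        // CL_DEVICE_TYPE_ALL matches CPUs, GPUs, and accelerators alike --
        // whatever the installed runtime bothered to implement.
        cl_uint num_devices = 0;
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        std::printf("%s: %u device(s)\n", name, num_devices);
    }
}

NVIDIA's runtime will only ever list GPUs here, Intel's lists CPUs and integrated graphics, and AMD's lists both their GPUs and x86 CPUs - which is exactly the situation Lord Windy described.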

Paul MaudDib fucked around with this message at 00:24 on Aug 22, 2014

JawnV6
Jul 4, 2004

So hot ...

Paul MaudDib posted:

In comparison AMD has a product on the market right now that will let you dispatch CPU and GPU cores to the same memory space without having to copy data around. Honestly that's where I see the long-term performance growth being as we go forward - CPU processing power will continue to double every 18 months but you can get an 8-100x performance bump right now in a lot of applications with heterogeneous computing. The fact that you don't need to copy data around makes AMD's product really advantageous in desktop or server-type applications where latency matters.

General question, but how's it do with true/false sharing scenarios split over CPU/GPU? Disallowed, allowed-but-not-guaranteed-coherent, allowed-but-slow? Cursory googling didn't bring anything up, even after assuring the search engine I didn't mean "and."

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

JawnV6 posted:

General question, but how's it do with true/false sharing scenarios split over CPU/GPU? Disallowed, allowed-but-not-guaranteed-coherent, allowed-but-slow? Cursory googling didn't bring anything up, even after assuring the search engine I didn't mean "and."

I was curious and did some brief searching on this earlier; here's what I came up with for a Kaveri chip:

quote:

CPU-GPU coherence: So far we have discussed how the GPU can read/write from the CPU address space without any data copies. However, that is not the full story. CPU and the GPU may want to work together on a given problem. For some types of problems, it is critical that the CPU/GPU be able to see each other's writes during the computation. This is non-trivial because of issues such as caches. HSA memory model provides optional coherence between the CPU and the GPU through what is called acquire-release type memory instructions. However, this coherence comes at a performance cost and thus HSA provides mechanisms for the programmer to express when CPU/GPU coherence is not required. Apart from coherent memory instructions, HSA also provides atomic instructions that allow the CPU and the GPU to read/write atomically from a given memory location. These ‘platform atomics’ are designed to work like regular atomics, i.e. provide a read-modify-write operation in a single instruction without developing custom fences or locks around the data element or data set.
http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/6

There's more there; it's not a bad overview of the current "unified memory architecture" situation for NVIDIA/Intel/etc. too. AMD has clearly put a lot of work into making it possible to mix CPU and GPU computation in more mundane desktop/server tasks. Intel's moving in that direction too with Knights Landing.

e: Forgot the link :downs:
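
e2: For anyone who hasn't run into acquire/release before, it's the same memory-ordering idea C++11 atomics expose, just extended across the CPU/GPU boundary. Here's a rough CPU-only sketch of the semantics (not HSA code - I don't have a Kaveri box to try the real thing on):

code:
// Plain C++11 illustration of acquire/release ordering -- the CPU-side
// analogue of the "optional coherence" described in the quote above.
// The release store publishes everything written before it; once the
// acquire load observes the flag, those earlier writes are visible too.
#include <atomic>
#include <cstdio>
#include <thread>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // synchronization flag

void producer() {
    payload = 42;                                  // write the data...
    ready.store(true, std::memory_order_release);  // ...then publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // spin until published
        ;
    std::printf("payload = %d\n", payload);        // guaranteed to print 42
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}

The HSA twist is that the other side can be a GPU wavefront, and that you can skip the ordering (and its cost) when you don't need the coherence.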

Paul MaudDib fucked around with this message at 02:43 on Aug 22, 2014

Menacer
Nov 25, 2000
Failed Sega Accessory Ahoy!

JawnV6 posted:

General question, but how's it do with true/false sharing scenarios split over CPU/GPU? Disallowed, allowed-but-not-guaranteed-coherent, allowed-but-slow? Cursory googling didn't bring anything up, even after assuring the search engine I didn't mean "and."
You can choose to put data in a region of system memory dedicated to the GPU's framebuffer and use the non-coherent "Garlic" bus to get full-speed memory accesses from the GPU. You can also choose to put data in regular system memory and access it over the coherent "Onion" bus, which is currently slower. Source. Note that the throughput numbers on slide 10 are likely per-channel; I've measured an A10-7850K with dual-channel DDR3-2133 hitting about 30 GB/s over Garlic and 18 GB/s over Onion.

False sharing will, as you would expect in any parallel system, cause performance slowdowns as you ping-pong data between the caches.
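
If you want to see that ping-pong on an ordinary CPU, here's a quick C++ sketch (illustrative only, the numbers vary wildly by machine): two threads hammer counters that share a cache line, then the same thing with the counters padded onto separate lines.

code:
// False-sharing demo: two threads increment separate counters.
// In Packed the counters sit on the same cache line, so the line
// ping-pongs between the cores' caches; in Padded each counter gets
// its own 64-byte line and the threads stop interfering.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};               // same cache line as a
};

struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};   // forced onto its own line
};

template <typename Counters>
double run(Counters& c, long iters) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    const long iters = 50000000;
    Packed packed;
    Padded padded;
    std::printf("same cache line: %.2f s\n", run(packed, iters));
    std::printf("padded:          %.2f s\n", run(padded, iters));  // typically several times faster
}

That's the CPU-only version of the effect; split the sharers across the CPU and GPU over the coherent bus and the same ping-pong presumably just gets more expensive.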

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Menacer posted:

You can choose to put data in a region of system memory dedicated to the GPU's framebuffer and use the non-coherent "Garlic" bus to get full-speed memory accesses from the GPU. You can also choose to put data in regular system memory and access it over the coherent "Onion" bus, which is currently slower. Source. Note that the throughput numbers on slide 10 are likely per-channel; I've measured an A10-7850K with dual-channel DDR3-2133 hitting about 30 GB/s over Garlic and 18 GB/s over Onion.

That's about what a GDDR3 media PC card could do to GPU memory. A GT 640 (GDDR3, 384 Kepler cores) is 28.5 GB/s. A GTX 650 (the same thing with GDDR5 memory) is about 80 GB/s. A high-end GDDR5 card like the K40 (2880 cores, tweaked for compute) does 288 GB/s.

Pretty good given that system memory is only DDR3 and you can use as much of it as you want, potentially intertwined with CPU segments.

e: The A10-7850k has 512 stream processor cores. Any idea how a Kaveri core compares to a Kepler core?

Paul MaudDib fucked around with this message at 02:59 on Aug 22, 2014

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

e: The A10-7850k has 512 stream processor cores. Any idea how a Kaveri core compares to a Kepler core?
don't ever try to compare GPUs this way, the "core counts" are so far beyond meaningless it's unbelievable

also, good lord, people are still buying the "GPUs are 8-100x faster than CPUs if used properly" tripe? that's NVIDIA PR from 2008 or 2009, and it has no basis in reality outside really specific special cases that generally boil down to "the texture unit offers free FLOPs if you need interpolation and the texture fetch offers enough accuracy." if you port a naive (non-SSE/AVX, single-threaded, not really optimized at all) C++ app to CUDA and optimize the hell out of it, yeah, you might get 10 or 20x. if you actually optimize the CPU code and run it on anything approaching 250W worth of CPUs, yeah, you might get 10 or 20x out of it there too. GPUs offer advantages, but it's ~2.5x versus well-optimized CPU code on apps that are especially suitable for GPUs.

(maybe it seems like I poo poo on GPUs a lot--I only do because I know a lot about them, and the claims people make are usually totally unrealistic. this is one of them.)

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
e: This is the wrong way to look at it, but I contend that the ability to manage more cores doing more useful work, thanks to improved architectures/frameworks, has probably increased GPU performance relative to well-established CPU computation steadily over the 6 years since CUDA launched. Server CPUs haven't gotten much more parallel since 2008, while CUDA has gone from 128 cores to 2880 per die, the architecture (instructions/bandwidth balance/feature sets/etc.) is more optimized for compute work instead of graphics rendering, and there's been more time to work on optimizing algorithms for GPU processors. And working with zero-copy on-die is inherently more powerful than a co-processor on a bus in terms of latency and flexibility.

Stuff like database access seems to be relatively amenable to GPU acceleration.

Paul MaudDib fucked around with this message at 05:05 on Aug 22, 2014

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

e: This is the wrong way to look at it, but I contend that the ability to manage more cores doing more useful work, thanks to improved architectures/frameworks, has probably increased GPU performance relative to well-established CPU computation steadily over the 6 years since CUDA launched. Server CPUs haven't gotten much more parallel since 2008, while CUDA has gone from 128 cores to 2880 per die, the architecture (instructions/bandwidth balance/feature sets/etc.) is more optimized for compute work instead of graphics rendering, and there's been more time to work on optimizing algorithms for GPU processors. And working with zero-copy on-die is inherently more powerful than a co-processor on a bus in terms of latency and flexibility.

Stuff like database access seems to be relatively amenable to GPU acceleration.
database access for sets that fit entirely within GPU memory works well because GPUs have very high memory bandwidth, and if you're doing lots of queries simultaneously and can hide memory latency, yeah, you get 6x over a single Intel CPU because of GDDR5 vs DDR3 bandwidth alone. those are all big ifs, and they certainly do not mean that general database access is amenable to GPUs.

the only case I know of where GPUs took over an industry was reverse time migration for seismic processing.

in general, I think we have decent tools and programming models for writing kernels on specific GPUs. what we don't have is any sort of way to write applications with large parallel sections that run on the appropriate compute device. until the latter gets solved, GPU compute will remain a novelty limited to HPC, embedded platforms, and the occasional developer relations stunt. I don't think coherent memory does the latter on its own, although it's a nice step--it's more of an improvement for the "writing kernels" part.

edit: whoops, this is the AMD thread and I'm pretty far afield. just PM me if you want to talk more about this.

Professor Science fucked around with this message at 07:02 on Aug 22, 2014

Alereon
Feb 6, 2004

Dehumanize yourself and face to Trumpshed
College Slice
I think GPU compute and how it relates to HSA marketing is more interesting and relevant to this thread than why AMD has never managed to not suck :)

JawnV6
Jul 4, 2004

So hot ...

Professor Science posted:

(maybe it seems like I poo poo on GPUs a lot--I only do because I know a lot about them, and the claims people make are usually totally unrealistic. this is one of them.)
Naaah, I recall a time I was making GBS threads on GPUs and you corrected me. You're a force for truth :)

Menacer posted:

False sharing will, as you would expect in any parallel system, cause performance slowdowns as you ping-pong data between the caches.
With false sharing the interesting facet is the granularity. It's *probably* one cache line. Deviations from that reveal a lot about the system underneath.

I like the approach of marking regions of memory or execution for don't-cares.

Palladium
May 8, 2012

Very Good
✔️✔️✔️✔️
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/62166-amd-fx-9590-review-piledriver-5ghz-17.html

quote:

The FX-9590 is a hot running processor and we don’t mean hot in any conventional meaning of the word either. This thing is like having a miniature nuclear reactor strapped to your motherboard; it will thoroughly overwhelm mid-tier heatsinks and AIO water coolers alike. Since it doesn’t come with an included heatsink we’re told that retailers will endeavor to bundle the FX-9590 with high end Corsair, Cooler Master or NZXT water cooling units in an effort to ensure customers won’t damage their new processors with sub-par cooling solutions.

With the potential for astronomical heat output, one would hope for an adequate way to measure temperatures. That just didn’t happen. RealTemp and CoreTemp routinely showed overly low readings and even AMD’s vaunted Overdrive utility was completely out to lunch. It claimed the chip idled at 19°C (ambient temperature was 23°) while load temperatures supposedly hit 46.7°C under load even though our Noctua NH-U14S was hot to the touch.

Only ASUS’ AI Suite II (which takes its temperature readings directly from the BIOS) was somewhat accurate with its reading of 65°C under load but we had reasons to doubt this too since, as you see in the screenshot above, our FX-9590 began throttling some cores down to the 4.515GHz mark after 20 minutes or so of continual full-load testing. Another possibility is that AMD has set Turbo Core 3.0 to begin throttling downwards when core temperature hits that 65°C mark in an effort to cap thermals and power consumption.

A chip that barely beats a 4770K (never mind the 4790K) in best-case benchmarks, throttles at stock with an NH-U12S, and also comes with a broken temp sensor? On a ~$230 motherboard? ...What a steal at $380!

The Lord Bude
May 23, 2007

ASK ME ABOUT MY SHITTY, BOUGIE INTERIOR DECORATING ADVICE

Palladium posted:

http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/62166-amd-fx-9590-review-piledriver-5ghz-17.html


A chip that barely beats a 4770K (never mind the 4790K) in best-case benchmarks, throttles at stock with an NH-U12S, and also comes with a broken temp sensor? On a ~$230 motherboard? ...What a steal at $380!

They were using a U14S, so it's even worse. But hey, at least they aren't trying to sell it for $1k anymore.

Knifegrab
Jul 30, 2014

Gadzooks! I'm terrified of this little child who is going to stab me with a knife. I must wrest the knife away from his control and therefore gain the upperhand.
So are there any new iterations of killer CPUs coming out from Intel any time soon? I'm looking to build a new ridiculous system (because I'm a fat nerd), but I'm gating myself on CPU and GPU releases.

Don Lapre
Mar 28, 2001

If you're having problems you're either holding the phone wrong or you have tiny girl hands.

Knifegrab posted:

So are there any new iterations of killer CPUs coming out from Intel any time soon? I'm looking to build a new ridiculous system (because I'm a fat nerd), but I'm gating myself on CPU and GPU releases.

Broadwell comes out next year. Haswell-E comes out really soon.

Proud Christian Mom
Dec 20, 2006
READING COMPREHENSION IS HARD
There's always something awesome. The question is whether whatever you want to do with it can take advantage of it. For gaming, the answer is usually 'no'

The Lord Bude
May 23, 2007

ASK ME ABOUT MY SHITTY, BOUGIE INTERIOR DECORATING ADVICE
AMD CPUs are so bad we've started discussing Intel CPUs in the thread instead.

Agreed
Dec 30, 2003

The price of meat has just gone up, and your old lady has just gone down

The Lord Bude posted:

AMD CPUs are so bad we've started discussing Intel CPUs in the thread instead.

Be the change you want to see, Mr. Bude!

Though I do wonder as well if the thread still has a raison d'être, now that it's no longer AMD vs. Intel in the consumer desktop space (at least not in any way we care about). The GPU thread hosts plenty of talk about both; is AMD enough of a going concern in SH/SC mindspace that we couldn't just collapse the two tentpoles and put up one umbrella CPU thread to cover all CPU stuff? ASIC devs, FPGA programmers, chip designers of all kinds, software devs - this is by no means a poor environment when it comes to the potential for discussion. It just seems like, at the end of the day, people are going to come back to the bottom line, because comparatively few members are as interested in what makes these companies' stock tickers move as in what kind of part they should put in their machine. This and the Intel thread both get some good periods, but it all feels weirdly nonspecific considering the names. I actually think we could be more topical and talk about stuff that matters more if we just had a single CPU-stuff thread - as it is, we discuss brands more than product lines at times, which is important from the perspective of the market gestalt, but also weirdly irrelevant now that it's "computers? Intel!" and that's pretty much that for folks who aren't involved in more sophisticated stuff.

The parts picking thread twists that knife appropriately hard - don't buy AMD unless you're a sucker, we all know that, and every thread title update has been some flavor of badly wearing hopefulness as we sort of look on aghast as AMD's whole CPU side of things apparently goes to poo poo and there is gently caress all that they can do about it.

The Lord Bude
May 23, 2007

ASK ME ABOUT MY SHITTY, BOUGIE INTERIOR DECORATING ADVICE

Agreed posted:

Be the change you want to see, Mr. Bude!

Though I do wonder as well if the thread still has a raison d'être, now that it's no longer AMD vs. Intel in the consumer desktop space (at least not in any way we care about). The GPU thread hosts plenty of talk about both; is AMD enough of a going concern in SH/SC mindspace that we couldn't just collapse the two tentpoles and put up one umbrella CPU thread to cover all CPU stuff? ASIC devs, FPGA programmers, chip designers of all kinds, software devs - this is by no means a poor environment when it comes to the potential for discussion. It just seems like, at the end of the day, people are going to come back to the bottom line, because comparatively few members are as interested in what makes these companies' stock tickers move as in what kind of part they should put in their machine. This and the Intel thread both get some good periods, but it all feels weirdly nonspecific considering the names. I actually think we could be more topical and talk about stuff that matters more if we just had a single CPU-stuff thread - as it is, we discuss brands more than product lines at times, which is important from the perspective of the market gestalt, but also weirdly irrelevant now that it's "computers? Intel!" and that's pretty much that for folks who aren't involved in more sophisticated stuff.

The parts picking thread twists that knife appropriately hard - don't buy AMD unless you're a sucker, we all know that, and every thread title update has been some flavor of badly wearing hopefulness as we sort of look on aghast as AMD's whole CPU side of things apparently goes to poo poo and there is gently caress all that they can do about it.

You misunderstand me. I'm not complaining about Intel talk, I just thought it was a hilarious opportunity for a quip at AMD's expense.

Don Lapre
Mar 28, 2001

If you're having problems you're either holding the phone wrong or you have tiny girl hands.
The problem is AMD hasn't released anything more than a stick-it-in-a-Walmart-HP-computer CPU in almost two years.

Factory Factory
Mar 19, 2010

This is what
Arcane Velocity was like.
I'd like a general CPU thread. Alereon talked about an SoC thread last year or something, but while that's neat, it's also less distinct from a CPU thread now that Haswell and its successors have SoC versions, and AMD is now doing SoCs too, e.g. the Kabini AM1 chips and the ARM Cortex-A57-based Seattle server SoC. In the meantime, there's no place to talk about ARM stuff like Nvidia Denver or Apple Cyclone or whatever Qualcomm is doing lately.

Speaking of Qualcomm, fun fact that I had forgotten about its Adreno GPUs: those are the result of AMD selling off ATI's Imageon cores.

1gnoirents
Jun 28, 2014

hello :)
Since SH/SC seems to exist solely on megathreads (which I don't mind, it's just more traffic for the same topics), I would also vote for a single CPU thread, if my vote mattered at all. Also, in the context of an AMD CPU thread, I'd say it's more than a reasonable idea. Plus I'd like a place to discuss the Coming of our New Masters (TrueNorth).

teagone
Jun 10, 2003

That was pretty intense, huh?

Factory Factory posted:

I'd like a general CPU thread. Alereon talked about an SoC thread last year or something, but while that's neat, it's also less distinct from a CPU thread now that Haswell and its successors have SoC versions, and AMD is now doing SoCs too, e.g. the Kabini AM1 chips and the ARM Cortex-A57-based Seattle server SoC. In the meantime, there's no place to talk about ARM stuff like Nvidia Denver or Apple Cyclone or whatever Qualcomm is doing lately.

Speaking of Qualcomm, fun fact that I had forgotten about its Adreno GPUs: those are the result of AMD selling off ATI's Imageon cores.

Why not just merge the two topics into a CPU/SoC megathread?

Agreed
Dec 30, 2003

The price of meat has just gone up, and your old lady has just gone down

teagone posted:

Why not just merge the two topics into a CPU/SoC megathread?

I believe that was his suggestion, actually - that we collapse CPU-related discussion into a potential omnibus CPU crap thread :) SoCs definitely count, and it'd give people a place to talk specifically about things that are out of place when referring to just Intel or just AMD, and leaving out pretty much everyone else except nVidia. And only them - in the CPU/SoC context - because they're trying to bite into the SoC market like they haven't seen food in five years, and gearing up for a major showdown with Intel in HPC that could be a solid existential challenge given the comparative markets we're talking about, so it comes up now and then.

GrizzlyCow
May 30, 2011
Well, AMD is trying to make inroads on that HSA promise. They partnered with Microsoft to create a new C++ compiler.

So there's that. AMD APU related stuff.

Agreed
Dec 30, 2003

The price of meat has just gone up, and your old lady has just gone down

GrizzlyCow posted:

Well, AMD is trying to make inroads on that HSA promise. They partnered with Microsoft to create a new C++ compiler.

So there's that. AMD APU related stuff.

Anyone else get a kind of Cell round two vibe going on here? Remember the "Octopiler," the focus on peak FLOP throughput over compute efficiency? I wish I could just be excited for AMD, but it seems like damned near everything they do on the CPU side of things is cursed. And they have a history of getting creamed when they try to let open source do their work for them. OpenCL got their HPC coprocessors into very few machines because other companies invest more heavily in software development directly; now they're going for a different open-source approach, and it's great that they're following through, but unless they can afford to work directly at the level of their competitors, I don't see this as a killer app for them. I dunno, could be overly pessimistic here. Anyone feel like helping me gain some perspective on this development if you feel my read is unnecessarily bleak?

SwissArmyDruid
Feb 14, 2014

by sebmojo

Agreed posted:

Anyone feel like helping me gain some perspective on this development if you feel my read is unnecessarily bleak?

A little bit of column A, a little bit of column B. It should also be noted that AMD just recently gave Khronos the entire Mantle spec for free and said, "here, use whatever you like for OpenGL."

movax
Aug 30, 2008

Combined CPU/SoC thread sounds like a good plan! I am actually not sure how much mobile SoC chat happens in IYG, or to what depth. Someone with time to spare should do some OP writing...

GrizzlyCow
May 30, 2011
This is just another step towards the Heterogeneous System Architecture that the HSA Foundation has been working on. If it helps, just remember AMD is not pushing for this alone. Even if AMD screws up, you still have ARM Holdings, Samsung, Texas Instruments, and a few others pushing HSA forward.

But hell, maybe it is doomed to fail on the desktop. I'm not a programmer; I don't know. I'm just asking questions.

Alereon
Feb 6, 2004

Dehumanize yourself and face to Trumpshed
College Slice
I think it's important not to mix CPU and SoC discussion into one thread, because those are very, very different topics, and I feel like if we want good discussion of SoCs and low-power products we're going to need to separate it from discussion of high-power products. I've been dragging my feet on the SoC thread I mused about; maybe I'll put some work into that tonight.

For the CPU discussion, it might make sense to merge both the Intel and AMD threads into a single "CPU and Platform Discussion", if we don't think there's value in keeping them separate anymore. I don't think there would be much point in having three threads (general CPU, AMD Platform, Intel Platform), as I suspect the Intel and AMD threads would just fall into the archives.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

GrizzlyCow posted:

Well, AMD is trying to make inroads on that HSA promise. They partnered with Microsoft to create a new C++ compiler.

So there's that. AMD APU related stuff.
this doesn't seem to be upstreamed in Clang or LLVM, so it's probably not going to fare well in the long term.
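
for reference, assuming this is the open-source C++ AMP compiler work (which is what it looks like to me), the programming model is roughly this - untested sketch, needs a C++ AMP-capable compiler:

code:
// Rough sketch of the C++ AMP programming model: a lambda marked
// restrict(amp) runs on the accelerator, and array_view handles the data
// movement (or the lack of it, on an APU with shared memory).
#include <amp.h>
using namespace concurrency;

void vector_add(const float* a, const float* b, float* c, int n) {
    array_view<const float, 1> av(n, a);
    array_view<const float, 1> bv(n, b);
    array_view<float, 1> cv(n, c);
    cv.discard_data();  // we only write c, so skip copying it to the device

    parallel_for_each(cv.extent, [=](index<1> i) restrict(amp) {
        cv[i] = av[i] + bv[i];
    });
    cv.synchronize();   // copy (or flush) the result back to host memory
}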

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Professor Science posted:

if you port a naive (non-SSE/AVX, single-threaded, not really optimized at all) C++ app to CUDA and optimize the hell out of it, yeah, you might get 10 or 20x. if you actually optimize the CPU code and run it on anything approaching 250W worth of CPUs, yeah, you might get 10 or 20x out of it there too. GPUs offer advantages, but it's ~2.5x versus well-optimized CPU code on apps that are especially suitable for GPUs.

You should be comparing a multi-threaded or GPU implementation against a single-threaded implementation. That's literally the definition of speedup. Saying "GPUs are only 2.5x faster than a 20x multithreaded implementation" is really abusing the concept of speedup, and that's not the measurement I was using. The correct way to look at it is "20x multithreaded or 50x GPU speedup".
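
To make the arithmetic concrete (made-up numbers, purely illustrative):

code:
// Speedup is measured against the single-threaded baseline, not against
// some other parallel implementation. Illustrative numbers only.
#include <cstdio>

int main() {
    const double t_serial = 100.0;  // single-threaded baseline, seconds
    const double t_cpu    = 5.0;    // well-optimized multithreaded CPU run
    const double t_gpu    = 2.0;    // GPU port

    std::printf("CPU speedup: %.1fx\n", t_serial / t_cpu);  // 20.0x
    std::printf("GPU speedup: %.1fx\n", t_serial / t_gpu);  // 50.0x

    // "The GPU is only 2.5x faster" compares the two parallel runs to each
    // other, which is a different ratio from either speedup figure:
    std::printf("GPU vs. multithreaded CPU: %.1fx\n", t_cpu / t_gpu);  // 2.5x
}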

There's also something to be said for an architecture that can maintain performance when scaled up to 250W per socket. Any decent CPU maxes out at around 150W per socket, and most motherboards only have 2 sockets tops, whereas you can reasonably stack 2x 250W GPUs in a single box. Even that's being a bit gracious, comparing a server motherboard for the CPUs to a standard gaming motherboard for the GPUs: there are off-the-shelf motherboards supporting up to 4x 16-lane PCI-e slots, whereas I don't know of a way to get 1000W of CPUs onto a single motherboard. Maybe 4x CPU sockets, so something like 600W is probably the limit.

A pair of sockets is a fundamentally less powerful approach than one big processor, and a pair of machines is fundamentally less powerful than one machine with more processors, due to the limits/overhead of cache/memory coherency, interconnect latency/bandwidth, etc. It's not reasonable to compare a real-world GPU to a fantasy 250-500W CPU that provides exactly proportional performance to a 150W CPU. The wattage comparison is sorta reasonable at 1x GPU vs 2x CPUs; it starts breaking down beyond that.

Professor Science posted:

database access for sets that fit entirely within GPU memory works well because GPUs have very high memory bandwidth, and if you're doing lots of queries simultaneously and can hide memory latency, yeah, you get 6x over a single Intel CPU because of GDDR5 vs DDR3 bandwidth alone.

This is really taking a lot of engineering work for granted. Any single-processor device that can process as fast as a GPU is going to have to be attached to a seriously fast memory system to keep it fed - that's inherently true - but at the same time you can't just handwave and say that the entire performance is the result of having some fast memory. It's more than having fast processors, and it's more than having a fast memory subsystem; it takes system engineering and a program tailored to the architecture.

Anyway, as a general observation, the idea that GPUs are faster just because they have fast memory doesn't really stand up - CPUs can actually access more memory per floating-point operation than a GPU can. Obviously there aren't a lot of floating-point operations in data access, but in terms of general program operation GPUs can actually address much less memory per core per cycle. That's why there's so much work put into memory coalescing/broadcasting and so on - they actually need tricks to keep the system fed. Even in something like database searching, some of the gain comes not just from memory bandwidth but from algorithms that let the hardware coalesce memory requests. As a simplification, think of binary searching - if you have a bunch of threads searching the upper parts of the tree, they're accessing the same data, and the hardware lets you combine a significant fraction of the memory accesses into single requests which are then broadcast to the warps.

http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Obviously global memory is limited in size - that's one of the big limitations of GPUs - but so is L3 cache, or any other high-performance, small-capacity memory. For a sense of perspective, that's about the same bandwidth as Haswell can deliver from its L2 cache. Haswell offers 4x 256KB at that bandwidth; a K40 offers 12GB total, or 4.16MB per core. It does so at a higher latency, of course, so you need sufficient in-flight requests to cover the latency.

There's also a ton of work devoted to getting around the memory size limitation. Multiple GPUs in a system can do DMAs to each other's memory, so you can get up to 48GB of GDDR5 memory per system with 4x K40s. In theory you could also do DMA to something like a PCI-e SSD, which might offer better latency (but lower bandwidth) than main system memory.

Professor Science posted:

in general, I think we have decent tools and programming models for writing kernels on specific GPUs. what we don't have is any sort of way to write applications with large parallel sections that run on the appropriate compute device. until the latter gets solved, GPU compute will remain a novelty limited to HPC, embedded platforms, and the occasional developer relations stunt. I don't think coherent memory does the latter on its own, although it's a nice step--it's more of an improvement for the "writing kernels" part.

We actually do have tools that let you write large parallel sections easily - OpenMP and OpenACC in particular give you frameworks for exploiting loop-level parallelism. On a finer grain there's marking calls as tasks and then synchronizing, etc.
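
For example, here's the kind of thing I mean - a rough sketch, not production code; the OpenACC version needs a compiler that supports it (e.g. PGI's):

code:
// The same saxpy loop annotated two ways: OpenMP for CPU threads, OpenACC
// for offload to an accelerator. Build the first with -fopenmp and the
// second with an OpenACC-capable compiler.
#include <cstddef>
#include <vector>

void saxpy_openmp(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

void saxpy_openacc(float a, const float* x, float* y, int n) {
    // The copyin/copy clauses spell out the host<->device traffic. On a
    // discrete GPU that transfer is exactly the overhead I complain about
    // below; on an APU with shared memory it can be close to free.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}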

The problem is that tools are much less effective when there's big overhead to start and stop a co-processor. That's my problem with the OpenACC approach to GPUs - it doesn't make sense to copy data down to the device, invoke and synchronize a kernel, and then copy the results back just to parallelize a single loop; GPU programs really should run mostly on the GPU rather than shuttling data back and forth all the time. It makes sense for intensive portions that are reasonably separate from the rest of the program, but a generic "send this part of the program to the GPU" isn't really going to be the best solution in a lot of cases. The same thing applies to the CUDA 6.0 "unified memory" thing - hiding the copies from the programmer is a low-performance solution when the overhead costs are high.

In comparison I think the OpenACC approach could be much more appropriate on a platform like APUs, because there's very little overhead involved with invoking the stream processors. The OpenMP model is a workable approach to parallelizing existing code, and extending it to support different types of compute elements seems like a rational step.

Of course, all such tools do have the problem that their design goal is to make it easy to write/adapt generic code. Writing applications using architecture-specific features can make a pretty big impact in performance. One reasonable approach to this would be to let you compile in OpenCL or CUDA code directly - i.e. you should be able to call __device__ functions that can use the architecture-specific features, perhaps with multiple versions to target different architectures.

Paul MaudDib fucked around with this message at 22:29 on Aug 27, 2014
