|
Cygni posted:I imagine Nvidia is as fully aware of the manufacturing constraints and future performance needs as AMD is. Bingo. The general problem is that all of this adds latency; it's still more performant to have one giant monolithic die if you can, and the more slices you cut your chip into, the worse the issues get. I also think AMD is going to have even more trouble because their uarch has a lot more inter-engine interaction (and cache-coherency needs, although that might be software-configurable), whereas NVIDIA is using relatively "dumb" hardware. For NVIDIA, having SMX engines living on different dies isn't much of an issue. I think AMD has some pretty fundamental scalability issues with GCN: every time they've tried going past Hawaii size so far it's been a trainwreck, and I keep reading that the uarch just isn't designed for more than 4 shader engines (can't evaluate that without more detail). NVIDIA is going with 4-die packages in their paper. With a 100mm^2 die that only leaves you with Polaris 10-plus-a-bit; you still need some fairly big dies if you want high-end gaming performance. Paul MaudDib fucked around with this message at 23:12 on Jul 7, 2017 |
# ? Jul 7, 2017 23:04 |
|
AMD's GPU division needs a blank sheet for Navi and to forget Hawaii ever existed.
|
# ? Jul 7, 2017 23:15 |
|
NewFatMike posted:Yeah, I mean a TR/Epyc chip is really just big in general, but most of the cost from those is still silicon, right? The metal for the pins, package, and substrate are all still relatively cheap, right? A big rear end package that's still cheap and performs really well is still a desirable thing. I wonder how repurposable a TR4 socket is, like could it in theory accept a socketable GPU? Imagine AMD selling 4-socket EPYC boards with Zen2, where the sockets accept either a Navi cluster or a Zen2 cluster. I could see the PCB being insanely thick and thus expensive, but you could get a lot of flexibility out of something like that, especially if AMD could deliver future socket compatibility. It makes buying into the AMD GPU ecosystem easier if you've bought into AMD's CPUs at all, and they could still provide PCIe slots for backwards compatibility or even further scalability. Paul MaudDib posted:Bingo. Honestly this seems to be a thing for AMD, they've never had a big die that did well. I don't really see Hawaii as a big die, it's cutting it close, but I tend to think of "big" as around 480-600mm². AMD's successes in GPUs were also predicated on Nvidia resting on the Tesla uarch too long and Fermi being a disaster (oh god, why does that sound like GCN and Vega). Guess this makes AMD's Navi their Kepler, and maybe Navi2 their Maxwell. See ya guys in 4 years!
|
# ? Jul 7, 2017 23:33 |
|
NewFatMike posted:Don't you think they might target APU GPU dies and move up from that like the Zen CCX? Cygni posted:GPUs are theoretically more suited to multiple die packages too due to the parallel workload. Cygni posted:Epyc's latency for going between the dies is fairly atrocious, and AMDs way of addressing that was essentially working around that reality (just like Intel did with the Pentium D). My understanding with a ccNUMA setup like AMD appears to be using is that so long as latency on each inter core/die/package/whatever bus hovers around what you'd expect from main system RAM, then you'd be OK. Then it just comes down to bandwidth and the number of hops necessary between dies, and it seems like they've done an OK job on bandwidth and minimizing hops on Epyc with the sheer number of buses going on there. Proof will be in the pudding, so to speak, but nothing about it looks actually bad so far. Cygni posted:I mean heck, the concept you're describing for Navi is basically VSA-100. Cygni posted:The problem Navi faces is the same one AMD has faced for years: Nvidia isn't just sitting around waiting for them to catch up. Volta is already at the reticle limit of a 12nm process, so I imagine Nvidia is as fully aware of the manufacturing constraints and future performance needs as AMD is.
|
# ? Jul 8, 2017 04:32 |
|
If RTG don't sort their poo poo out and nVidia launch an MCM GPU before Navi, it'll just be cosmically sad.
|
# ? Jul 8, 2017 04:44 |
|
AMD's execution comes down to a sharp disconnect between marketing and engineering departments, lovely marketing, and not enough money for marketing and engineering combined. If Ryzen and Epyc do well you'll first see the changes in marketing, and then maybe in 2019-2021 for products. Also apparently AMD is going on a hiring bonanza for marketing and engineering positions, so uh, guess things are in fact going well?
|
# ? Jul 8, 2017 05:07 |
|
FaustianQ posted:AMD's execution comes down to a sharp disconnect between marketing and engineering departments, lovely marketing, and not enough money for marketing and engineering combined. If Ryzen and Epyc do well you'll first see the changes in marketing, and then maybe in 2019-2021 for products. With stock prices up and Ryzen generating enough cash to keep the creditors off their back, they can turn the rest of the cash into much-needed company investments.
|
# ? Jul 8, 2017 05:15 |
|
GRINDCORE MEGGIDO posted:If RTG don't sort their poo poo out and nVidia launch an MCM GPU before Navi, it'll just be cosmically sad. FaustianQ posted:not enough money for marketing and engineering combined. I really don't know why AMD keeps having these issues; execution has been a problem there for years and years, but whenever it comes down to do or die they usually seem able to pull something off. To me that is strongly suggestive of a management issue and not an engineering or marketing one. Upper management has seen a whole lot of turnover there, so maybe we won't see a repeat of past behaviors. Have to wait n' see....
|
# ? Jul 8, 2017 05:42 |
|
PC LOAD LETTER posted:I don't think marketing has any real say in what gets developed there or how it gets done either. Even back in the K8 days, when AMD was doing well financially and all, there were numerous rumors of problems at AMD. I think they ended up scrapping entirely whatever K9 or K10 was supposed to be originally, and that is why they ended up doing what amounted to revisions of K8 for longer than they should've. Maybe I was misunderstanding your argument, but I was thinking about marketing signing checks the engineers can't cash. Stuff like 1.5GHz 4096SP Vega 10XTX @ 225W vs current reality, or the over-promises on Barcelona and Bulldozer (performance, release date, etc.). Sometimes it's a legit fab problem (glares at GloFo's 45nm and 32nm processes) that's out of AMD's hands though. But yeah, I can definitely see how past management was a problem, esp. w/r/t Bulldozer for instance. As soon as they got back the internals on that, they should have kept it exclusively as a server option and pushed forward on iterations of K8 until Zen was ready (based on release dates, it's likely that as soon as they got the stuff back on Bulldozer, plans for Zen were in motion), because holy lol Bulldozer was godawful until like, Steamroller, and that was just getting back to parity.
|
# ? Jul 8, 2017 06:08 |
|
FaustianQ posted:but I was thinking about marketing signing checks the engineers can't cash. Stuff like, 1.5Ghz 4096SP Vega 10XTX @ 225W, vs current reality, or the over promises on Barcelona and Bulldozer (performance, release date, etc.). I don't work at AMD, so I have no firsthand knowledge of how they do things there, but every place I've ever worked at, or ever heard of, management had firm control over the marketing stooges, so that is why I'm looking at things that way. FaustianQ posted:But yea I can definitely see how past management was a problem, esp. w/r/t Bulldozer for instance.
|
# ? Jul 8, 2017 06:54 |
|
Geekbench seems ridiculously useless. I've been looking up my current CPU on it to compare it to these supposed TR 1950X results, and within the results there are huge disparities, like a presumably stock-clocked 5820K returning higher results than severely overclocked ones. (Or at least the app seems to be royally stupid about recording clock speeds.)
|
# ? Jul 8, 2017 16:56 |
|
wargames posted:With stock prices up and Ryzen generating enough cash to keep the creditors off their back, they can turn the rest of the cash into much-needed company investments. Or, ya know, pay a dividend. Ever.
|
# ? Jul 8, 2017 20:45 |
|
incoherent posted:Or, ya know, pay a dividend. ever. Dividends are for companies in good financial situations. That's not AMD.
|
# ? Jul 8, 2017 21:07 |
|
SourKraut posted:Dividends are for companies in good financial situations. That's not AMD. AMD doesn't appear to have ever paid a dividend, at least not since January 13, 1978, when Google starts tracking their stock. Surely you wouldn't count them as never being in a good financial situation since 1978?
|
# ? Jul 8, 2017 21:49 |
|
fishmech posted:AMD hasn't appeared to have paid a dividend, ever, at least since January 13, 1978 when google starts tracking their stock. Surely you wouldn't count them as never being in a good financial situation since 1978? We'd have to know what their cash and liquidity situation has been during that time. Companies shouldn't pay dividends whenever they happen to be in the black. The closest I'd imagine they came to being in a good position to do so was in the mid-2000s, and instead they bought ATI, which is worth a whole separate discussion. There are plenty of much more stable companies who've never paid dividends, and I think most would rather see AMD survive than try to appease some armchair stock investors.
|
# ? Jul 8, 2017 22:00 |
|
FaustianQ posted:I'm now wondering if it'd be possible to MCM two Raven Ridge dies for the TR4 platform. 8C/16T, 1408SP iGPU, 150W, Quad memory.
|
# ? Jul 8, 2017 23:23 |
|
Anime Schoolgirl posted:mini-STX FaustianQ posted:TR4 platform A whole department of Engineers just leaped from a building, I hope you're happy. EmpyreanFlux fucked around with this message at 02:45 on Jul 9, 2017 |
# ? Jul 9, 2017 01:00 |
|
ASRock didn't want them anyways, if they weren't licking their chops at the idea of that, they clearly weren't cut out for ASRock's mad science to begin with.
|
# ? Jul 9, 2017 02:28 |
|
So about the MCM latencies of TR/EPYC, I suppose it would have been better to have the memory controller as a separate entity on the IF and have the CCX groups be autonomous?
|
# ? Jul 9, 2017 12:45 |
|
So I got myself an R7 1700, loosened the timings a bit and the ram is running @ 3200 (wasn't even on the QVL), system is rock solid everything is fiiiiiiine. GF got my i5 4570 system to finally replace her E8200 that was beyond inadequate.
|
# ? Jul 9, 2017 20:25 |
|
I am also extremely happy with my R7 1700. It is A Good Chip. Fingers crossed Raven Ridge is also good.
|
# ? Jul 9, 2017 21:06 |
|
Combat Pretzel posted:So about MCM latencies of TR/EPYC, I suppose it would have been better to have a memory controller as separate entity on the IF and have the CCX groups be autonomous? I'm not a CPU designer but I think the approach you're talking about makes more sense if there are more hops between dies OR if there are scaling issues with high(er) core counts with their current approach (I have no idea if there are).
|
# ? Jul 9, 2017 22:31 |
|
Combat Pretzel posted:So about MCM latencies of TR/EPYC, I suppose it would have been better to have a memory controller as separate entity on the IF and have the CCX groups be autonomous? My understanding from the ServeTheHome coverage is that hitting Infinity Fabric is a bad thing, so adding that to every single memory call probably wouldn't be great for latency, I imagine. As it is, 2-socket Epyc has four options when stuff is in memory:
- Local to the die (great latency, great bandwidth)
- One hop on IF away, through another die on the package (worse latency, same bandwidth)
- One hop on IF away, through a die on the other package (worse latency, worse bandwidth)
- Two hops on IF away, through a die on the other package (worst latency, worse bandwidth)
A separate on-package memory controller would basically cut the top and bottom cases off and make all the calls the middle two. If NUMA-aware OSes didn't exist, that might be worth it. But with NUMA, and most calls being the first two options, the current solution on Epyc is probably much more efficient. But I'm just an armchair dork, that could all be wrong.
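To make the bookkeeping concrete, here's a toy Python sketch of those four cases. The hop counts follow the post; the relative latency/bandwidth values are invented placeholders purely to show the ordering, not measured EPYC figures:

```python
# The four memory-access cases on a 2-socket EPYC, as described above.
# Latency/bandwidth numbers are made-up relative weights, for illustration only.
CASES = [
    # (description, IF hops, relative latency, relative bandwidth penalty)
    ("local to the die",        0, 1.0, 1.0),
    ("one hop, same package",   1, 1.6, 1.0),
    ("one hop, other package",  1, 1.6, 0.7),
    ("two hops, other package", 2, 2.2, 0.7),
]

def preferred(cases):
    """What a NUMA-aware OS tries to arrange: fewest IF hops,
    then lowest latency, i.e. keep allocations local to the die."""
    return min(cases, key=lambda c: (c[1], c[2]))

print(preferred(CASES)[0])  # local to the die
```

The point of the sketch is just that a NUMA-aware scheduler biases allocations toward the top of that list, which is why the current EPYC layout mostly dodges the worst case.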
|
# ? Jul 9, 2017 23:10 |
|
From all I've read, it sounds like cross-CCX L3 cache accesses are the royal pain in the rear end, and probably one cause of the funny latencies. With a memory controller, maybe with some of its own cache, as a separate entity on the IF, and cross-CCX L3 accesses disabled, things might be less bad? It just sounds sorta idiotic to access L3 cache on another CCX at memory speeds. Might as well just hit the memory directly. I'm also just armchairing. Cygni posted:If NUMA aware OS didn't exist, that might be worth it. But with NUMA, and most calls being the first two options, the current solution on Epyc is probably much more efficient. I suppose it doesn't matter so much for the current bunch of Ryzens, since as you say, the bandwidth is high between a pair of CCXs, but I'm eyeing TR, and it sounds like I don't want it if gaming is part of the workload for it. Combat Pretzel fucked around with this message at 00:19 on Jul 10, 2017 |
# ? Jul 10, 2017 00:14 |
|
I'm curious whether the current version of the Infinity Fabric PHY can support PCIe Gen4 speeds (8GHz, 16GT/s), because it would be pretty impressive if they could jump right into the Gen4 ecosystem. There are a few new features they'd have to implement to get to the Gen4 spec, of course. Have there been any blurbs on how fast the IF links can go? I've seen things like "512GB/s for GPUs" but it doesn't say the width.
|
# ? Jul 10, 2017 05:22 |
|
priznat posted:I'm curious if the current version of InfinityFabric phy can support PCIe Gen4 speeds (8GHz, 16GT/s), because it would be pretty impressive if they could jump right into the Gen4 ecosystem. There's a few new features they'd have to implement to get to the Gen4 spec of course. It doesn't need to; remember, AM4 is supposed to be supported for 5 years.
|
# ? Jul 10, 2017 05:23 |
|
priznat posted:I'm curious if the current version of InfinityFabric phy can support PCIe Gen4 speeds (8GHz, 16GT/s), because it would be pretty impressive if they could jump right into the Gen4 ecosystem. There's a few new features they'd have to implement to get to the Gen4 spec of course. It's 32 bytes wide, so it's exactly that fast. 16GT/s * 32 bytes/T = 512 GB/s.
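That arithmetic checks out on units alone; a one-line sanity check (nothing here is AMD-specific, just the numbers from the post):

```python
# 16 GT/s (PCIe Gen4 transfer rate) on a 32-byte-wide link, per the post.
transfers_per_second = 16e9   # 16 GT/s
width_bytes = 32              # bytes moved per transfer
bandwidth_gb_s = transfers_per_second * width_bytes / 1e9
print(bandwidth_gb_s)  # 512.0
```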
|
# ? Jul 10, 2017 06:19 |
|
Combat Pretzel posted:NUMA is nice for workloads that can wait a while. But on things like gaming where you need to get frames out as fast as possible, if for some reason there's no thread scheduling capacities for the CCX that handles the memory region most of the thread's poo poo is allocated in, there's a problem. Yeah, honestly, I don't expect TR single-client gaming performance to be that notable. There are barely any games that even use more than 4 threads, let alone 32 threads, hence the R5 1600X and R7 1800X being basically identical in games. So I imagine a NUMA-aware OS will just stick all the processes/data on one die+memory bank and call it a day... so it's all back to clockspeeds as per usual. The rumored top-of-the-line TR 1950X is 3.4GHz base, so I imagine gaming will be basically identical to other much cheaper mainstream Ryzen chips in that clock range. Multi-client or virtualized stuff, and of course rendering/content production, will probably be TR's strong suit. For just straight gaming though, the Coffee Lake stuff coming next month will probably be the hot ticket until Cannonlake/Zen2.
|
# ? Jul 10, 2017 06:38 |
|
Oh, I have things to occupy all 16 cores once in a while. What I'm more concerned about is when there will be situations where NUMA is going to be a problem, i.e. a mismatch between the core a thread is running on and where its memory allocations are. At a certain point, depending on memory pressure, allocations will start to cross NUMA memory regions. If the IF bandwidth between CCX pairs is indeed as much worse as speculated, this is going to be noticeable in some form. On the other hand, if IF is 32 bytes wide and can run at PCIe speeds, as said a couple of posts above, why the hell are there even bandwidth issues that depend on RAM speed? Combat Pretzel fucked around with this message at 11:23 on Jul 10, 2017 |
# ? Jul 10, 2017 11:21 |
|
It's a latency issue, not a bandwidth issue, with inter-CCX data transfers.
|
# ? Jul 10, 2017 16:21 |
|
Combat Pretzel posted:On the other hand, if IF is 32 bytes wide and can run an PCIe speeds, as said a couple of posts above, why the hell are there even bandwidth issues that depend on RAM speed? I'm sure I'm out of my depth here, but as I understand it, the Infinity Fabric memory-speed bottleneck is more latency than throughput.
|
# ? Jul 10, 2017 16:22 |
|
So uh, if it's super high bandwidth, why does it get hosed over by the memory clock? Sounds like its data transfers are gated by the memory controller? If so, why?! I guess I kind of fail to see why it gets punished so hard versus Intel's ring bus.
Combat Pretzel fucked around with this message at 16:50 on Jul 10, 2017 |
# ? Jul 10, 2017 16:48 |
|
Combat Pretzel posted:So uh, if it's super high bandwidth, why does it get hosed over by the memory clock? Sounds like its data transfers are gated by the memory controller? If so, why?! I guess I kind of fail to see why it gets punished so hard versus Intel's ring bus. Because IF runs at half the RAM transfer rate; this is a hard setting in Zen 1, so if you increase RAM speed you increase IF speed. Combat Pretzel posted:Oh, I have things to occupy all 16 cores once a while. What I'm more concerned about is when there will be situations when NUMA is going to be a problem, i.e. mismatch between core a thread is running on and where memory allocations are. At a certain point depending on memory pressure, allocations will start to cross NUMA memory regions. If the IF bandwidth between CCX pairs is indeed so much worse as speculated, this is going to be noticeable in some form. Also, Zen 1 doesn't use NUMA for CCX talk; it uses a new scheduler that Microsoft and AMD came up with to account for the slightly higher cross-CCX latency.
|
# ? Jul 10, 2017 17:40 |
|
Combat Pretzel posted:So uh, if it's super high bandwidth, why does it get hosed over by the memory clock? Sounds like its data transfers are gated by the memory controller? If so, why?! I guess I kind of fail to see why it gets punished so hard versus Intel's ring bus. I suspect this was a design choice to enable them to easily scale it up and down (multi-die Threadripper/EPYC packages, and symmetrically disabling cores for lower-end Ryzen parts) and PROBABLY to make it easier to maintain coherence while doing so, or some crap, but tbh I really have no idea what I'm talking about at this point.
|
# ? Jul 10, 2017 17:44 |
|
wargames posted:It doesn't need to, remember AM4 is supposed to be supported for 5 years. What does "supported" mean? Warranty? BIOS updates? Some of their chips will use it? All of their chips will use it?
|
# ? Jul 10, 2017 18:16 |
|
wargames posted:Also Zen1 doesn't use Numa for CCX talk it uses a new scheduler that microsoft and AMD came up with to account for the slight higher CCX talk. The NUMA stuff is speculation about Threadripper and EPYC.
|
# ? Jul 10, 2017 18:18 |
|
Subjunctive posted:What does "supported" mean? Warranty? BIOS updates? Some of their chips will use it? All of their chips will use it? That over those 5 years, Zen2 or Zen3/whatever will come out on AM4/AM4+. Combat Pretzel posted:The NUMA stuff is speculation about Threadripper and EPYC. But Threadripper is single socket, so NUMA doesn't come into play; EPYC will require NUMA and the special AMD scheduler because of the two sockets and CCX talk.
|
# ? Jul 10, 2017 18:25 |
|
TR will likely be dealing with NUMA, because we have 2 separate memory controllers, each with a dual-channel connection to its local DIMM banks, and a higher-latency connection via IF between them. It would be advantageous for the OS to know that. Someone asked AT's editor about it: https://twitter.com/RyanSmithAT/status/870598439993720832
|
# ? Jul 10, 2017 18:52 |
|
It's a single socket, but things look awfully like it being a Frankensocket (two smashed together). So yeah, NUMA.
|
# ? Jul 10, 2017 19:22 |
|
The issue with IF as implemented in Zen 1 is that the memory controller also handles generating the clock for the bus, and that clock is at the same rate as the one that goes out to the DDR4 PHY. So in the most pessimistic case of DDR4-2133, it's limited to just 68GB/s, and in the perfect case of DDR4-3200 you're still only getting 102GB/s. Hopefully future revisions will fix that... frankly inexplicable design choice.
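Those figures can be reproduced under the post's assumptions (IF clock = MEMCLK, i.e. half the DDR4 transfer rate; a 32-byte-wide link; the quoted numbers counting both directions). A quick sketch of that model, nothing more:

```python
# Zen 1 IF link bandwidth as a function of DDR4 speed, per the post's model:
# the IF clock equals MEMCLK (half the DDR4 transfer rate), the link is
# 32 bytes wide, and the quoted totals count both directions.
def if_bandwidth_gb_s(ddr_rate_mt_s, width_bytes=32, directions=2):
    memclk_mhz = ddr_rate_mt_s / 2   # DDR makes two transfers per clock
    return memclk_mhz * 1e6 * width_bytes * directions / 1e9

print(round(if_bandwidth_gb_s(2133)))  # 68  (pessimistic case, DDR4-2133)
print(round(if_bandwidth_gb_s(3200)))  # 102 (best case, DDR4-3200)
```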
|
# ? Jul 10, 2017 20:03 |