Paul MaudDib
May 2, 2006

"Tell me of your home world, Usul"


SwissArmyDruid posted:

Paul, you're more familiar with Intel than I am, are you aware of any functional benefits to die-stacking a la Foveros with regards to power consumption? Because I think there's still an AMD patent from a handful of years back when we were still speculating that the IO die was going to be an interposer that Zen chiplets and an HBM bump were stacked onto.

foveros is a big mystery to me as well, Intel hasn't said a lot in public to extrapolate from, and the problems with stacking multiple compute dies are pretty obvious in terms of thermals / etc.

obviously a lot of the power consumption from infinity fabric isn't inherent to the protocol itself, AMD uses monolithic dies with infinity fabric attaching various parts and that is fine. the power consumption comes from having to run a beefier PHY to overpower the parasitic inductance/capacitance of the bigger+longer wires that go off the die, through the interposer, and back on. I'm unclear what exactly Foveros has to offer here vs AMD's interposer technology but I think that's going to be the relevant metric - how Foveros decreases those parasitics, because that is directly related to how much power you need to drive them.

it may be that the innovation here is that because Foveros is an "active interposer" technology that you need to drive it a lot less hard - because it's not driving a big giant wire with lots of parasitics, it's jumping the microbump (which, granted, will still be a lot more parasitics than just a trace inside a monolithic die) and then right into another transistor inside the active interposer, so the only "trace" involved is crossing the microbump. I would raise a speculative guess that the active-interposer stuff that AMD has been talking about is functionally equivalent to Foveros from a design perspective here.
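The parasitics argument can be made concrete with the standard dynamic-power relation P = α·C·V²·f. Every capacitance, voltage, and frequency figure below is an illustrative assumption, not an Intel or AMD number; the point is only that driver power scales linearly with the capacitance being driven.

```python
# Back-of-envelope dynamic power to drive one link wire: P = alpha * C * V^2 * f
# All figures here are illustrative assumptions, not vendor data.

def link_power_mw(c_farads, v=0.8, f_hz=2e9, alpha=0.25):
    """Dynamic switching power of a single wire, in milliwatts."""
    return alpha * c_farads * v**2 * f_hz * 1e3

cases = {
    "on-die trace":            50e-15,   # ~50 fF: short wire inside a monolithic die
    "off-die via interposer":  1000e-15, # ~1 pF: off the die, across, and back on
    "microbump only (active)": 150e-15,  # ~150 fF: just jumping the bump
}

for name, c in cases.items():
    print(f"{name:25s} {link_power_mw(c):6.3f} mW per wire")
```

Under these made-up numbers the passive-interposer hop costs roughly 20x the on-die wire per lane, which is exactly the gap an active interposer would be trying to close.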

Cygni posted:

That's a good point, I hadn't considered the little cores being off on their own ~Shame chiplet~. There is gonna be a lot of complexity in this next wave of chiplet/tile, big-little, fighting-ARM, fully-SoC tomfoolery we're gonna be entering.

"shame chiplet"

it's just a guess and I suppose that's maybe not as definite as I think it is. I just think servers will definitely want "all-big" configurations and if that chiplet exists then it becomes trivial to offer an enthusiast package with all-big as well. that would be consistent with how AMD has utilized the enthusiast lineup as a binning pressure relief for the server lineup so far.

but at the same time, there will be a power penalty to big chiplet+little chiplet. I know the ideal is that a lot of time the "big" chiplet would be gated off but who knows how possible that will really be. I guess it really depends on the performance, supposedly the new Tremont cores are pushing Skylake-esque performance already and maybe if you have something like that you only power up the big chiplets on really big sustained tasks and just let the little cores handle the day-to-day.

mixed chiplets let you avoid that penalty, you could power down one chiplet entirely unless it's a really big workload, while also still being able to gate the big cores on the mixed chiplet if there's not a whole lot to work on. But I doubt servers are going to bite heavily on that since they don't care about idle power, servers are specced at some reasonable approximation of full load.

I guess talking it through, perhaps one all-big + one mixed chiplet would be an ideal configuration here. Maybe they don't do an all-little chiplet at all and just do an all-big chiplet and a mixed chiplet (and then APUs). That gives you better increments as far as powering the thing up, you have "one chiplet up, little only", "one chiplet up, all cores up", and "mixed chiplet up + big chiplet up".

edit: I think "all-big" and "all-small" (or mixed chiplets with disabled big cores) is also going to be important going forward for segregating caches to prevent timing side-channels, because it seems obvious at this point that shared cache in a speculative architecture is a bottomless pit of vulnerabilities. the fix is going to be segregating "secure tasks" where you would be concerned with data leakage onto a slower, secure chiplet that does much less speculation and ideally shares as little cache as possible between threads, likely with no SMT/hyperthreading (since that is also a bottomless pit of vulnerabilities), while letting CPU-intensive tasks that don't need security run on faster cores that do speculation/etc. this maps precisely onto the "big.LITTLE" model that both companies are now embracing - perhaps minus the OoO/speculation that Intel has adopted lately in Silvermont/Goldmont/Tremont. You just need to do the work of whitelisting which tasks are probably fine to run on a faster, insecure core.
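The "whitelisting" step at the end is expressible today as plain CPU affinity. A minimal Linux sketch, assuming a hypothetical topology where cores 0-1 are the secure little cluster and cores 2-5 the fast speculative cluster (real code would discover the mapping from sysfs topology rather than hard-code it):

```python
# Policy sketch: route untrusted tasks to "secure" little cores (minimal
# speculation, no SMT) and whitelisted CPU-hungry tasks to the big cores.
# Core IDs and task names are hypothetical.
import os

SECURE_CORES = {0, 1}        # assumed little cluster
FAST_CORES = {2, 3, 4, 5}    # assumed big, speculative cluster

FAST_WHITELIST = {"renderer", "encoder"}  # tasks judged safe to run insecurely

def choose_cores(task_name):
    """Pick the core set a task is allowed to run on."""
    return FAST_CORES if task_name in FAST_WHITELIST else SECURE_CORES

def place_task(pid, task_name):
    # Linux-only: bind the process to its cluster.
    os.sched_setaffinity(pid, choose_cores(task_name))

print(choose_cores("renderer"), choose_cores("browser-tab"))
```

`os.sched_setaffinity` is Linux-only; on a big.LITTLE part this same call is how a scheduler daemon would keep untrusted code off the speculating cores.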

AMD's model where you have multiple basically independent chiplets, with their own caches, that just happen to share a memory controller (but no cache on the memory controller itself) fits this concept quite well. And they can also mix architectures as well, since the only thing the IO die or CCDs care about is talking Infinity Fabric to the other side, all the caching is completely self-contained on the chiplet.

Paul MaudDib fucked around with this message at 02:57 on May 8, 2021

taqueso
Mar 8, 2004

I want to go to the adidasamd.com customization page and be able to select n x m chiplets to fill up the package.

Anime Schoolgirl
Nov 28, 2002

please don't bring back jaguar for the little cores

Icept
Jul 11, 2001



What's the theoretical use case for the shame chiplet? Running all the Windows / OS / services / background stuff on it and devoting the big boy package to the foreground application?

Truga
May 4, 2014

Lipstick Apathy

Icept posted:

What's the theoretical use case for the shame chiplet? Running all the Windows / OS / services / background stuff on it and devoting the big boy package to the foreground application?

theoretically that's possible, sure. but:
1) that requires placing trust in an OS scheduler to do its job properly. windows sometimes already doesn't do a terribly good job on a relatively homogeneous chiplet design like ryzen and requires manual intervention with thread/core affinity
2) you get 12-16 threads on a mid-range ryzen and they're all good threads lol

Icept
Jul 11, 2001



Agreed... but why are AMD/Intel pursuing it for desktop CPUs? It makes sense for mobile because 80% of the time you're just texting or whatever so there's no reason to burn battery on the big cores.

Or is it just because the desktop CPUs are derived from a common stack with an APU/laptop focus so the smol cores got to happen just by association?

Paul MaudDib
May 2, 2006

"Tell me of your home world, Usul"


Anime Schoolgirl posted:

please don't bring back jaguar for the little cores

your vile wishes can't blot out my pure love for the $50 Kabini AM1 combo

Paul MaudDib
May 2, 2006

"Tell me of your home world, Usul"


zen4 is also supposed to introduce AVX-512 support

Pablo Bluth
Sep 7, 2007

I've made a huge mistake.


Paul MaudDib posted:

zen4 is also supposed to introduce AVX-512 support
Linus Torvalds will be happy...

BobHoward
Feb 13, 2012

Icept posted:

What's the theoretical use case for the shame chiplet? Running all the Windows / OS / services / background stuff on it and devoting the big boy package to the foreground application?

Yes. Consider these approximations for Apple's M1 small cores relative to M1 big cores:

Area: ~0.25x
Power @ max freq: ~0.1x
Perf @ max freq: ~0.33x

The small cores have about 3.3x perf/W and 1.3x perf/area. You wouldn't want a chip with nothing but the small cores since high ST performance is quite important for general purpose computing, but having some small cores is awesome. Using less energy to run all those lightweight system threads frees up power to run the threads you want to go fast on the big cores.
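The quoted perf/W and perf/area figures fall straight out of those three ratios:

```python
# BobHoward's M1 ratios: one small core relative to one big core (big = 1.0)
area, power, perf = 0.25, 0.10, 0.33

perf_per_watt = perf / power   # ~3.3x a big core
perf_per_area = perf / area    # ~1.3x a big core
print(f"perf/W: {perf_per_watt:.1f}x, perf/area: {perf_per_area:.2f}x")
```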

That said, will AMD and Intel have small cores as good as Apple's? Seems very doubtful! Small cores are where you expect the advantages of a clean RISC architecture to be greatest, and Apple's been putting a lot of effort into their small core designs for a long time, while AMD and Intel have not.

And will Microsoft have a scheduler as good at using small cores as Apple's? Also doubtful.

ConanTheLibrarian
Aug 13, 2004


dis buch is late

Fallen Rib

Shame chiplet has to become the accepted nomenclature.

BobHoward posted:

The small cores have about 3.3x perf/W and 1.3x perf/area.
I think this is the key to their thinking. When I first heard AMD/Intel were looking at big/little designs, I was very sceptical. However you can fit a lot of little cores in the space of a few big ones. Just taking games as an example, an engine may be better off with say 4 high speed cores for critical path logic and 16 little ones for worker threads than 8 big cores. Plus when someone is just browsing the web or writing word docs, the big cores can be powered down.
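The 4-fast/16-little split described here maps naturally onto two differently sized worker pools. A toy sketch (pool sizes and task names are invented for illustration; a real engine would also pin each pool to its core cluster):

```python
# Two pools mirroring a hypothetical 4 big + 16 little core layout:
# a small pool for latency-critical logic, a wide one for throughput work.
from concurrent.futures import ThreadPoolExecutor

critical = ThreadPoolExecutor(max_workers=4)   # critical-path game logic
workers = ThreadPoolExecutor(max_workers=16)   # background/worker threads

def frame_logic(n):       # stand-in for latency-sensitive work
    return n * 2

def decompress_asset(n):  # stand-in for bulk background work
    return n + 1

fast = critical.submit(frame_logic, 21)
bulk = [workers.submit(decompress_asset, i) for i in range(4)]
print(fast.result(), [f.result() for f in bulk])  # 42 [1, 2, 3, 4]
```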

quote:

And will Microsoft have a scheduler as good at using small cores as Apple's? Also doubtful.
It's not like Apple are the only company who use big/little designs. MS can just rip off whatever Android does.

ConanTheLibrarian fucked around with this message at 11:36 on May 8, 2021

Bofast
Feb 21, 2011



Grimey Drawer

Anime Schoolgirl posted:

please don't bring back jaguar for the little cores

I rather like my old E-350 bobcat based server that is still running in my living room, but I agree

karoshi
Nov 4, 2008

BobHoward posted:

Yes. Consider these approximations for Apple's M1 small cores relative to M1 big cores:

Area: ~0.25x
Power @ max freq: ~0.1x
Perf @ max freq: ~0.33x

The small cores have about 3.3x perf/W and 1.3x perf/area. You wouldn't want a chip with nothing but the small cores since high ST performance is quite important for general purpose computing, but having some small cores is awesome. Using less energy to run all those lightweight system threads frees up power to run the threads you want to go fast on the big cores.

A smol-core EPYC with AVX-512 at half the area would mean 128 cores on 7nm, with 2048 SP ALUs (CUDA-core equivalents). At 3+ GHz that's 6 teraflops SP (fp32), 12 teraflops fp16. Double it for 5nm. Double it again for FMA if your marketing department is watching. OFC without the texture samplers graphics are out of the question, and without a coherency/sorting engine so is ray tracing.
6 teraflops would need up to 2 reads and 1 write per fp32 op, demanding peak bandwidths of 48 terabytes/s for reading and 24 terabytes/s for writing.
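The napkin math checks out:

```python
# karoshi's estimate: 128 cores x 16 fp32 lanes (one 512-bit unit) x 3 GHz
cores, lanes, ghz = 128, 16, 3.0

tflops_sp = cores * lanes * ghz / 1000   # ~6.1 TFLOPS fp32, no FMA counted
read_tb_s = tflops_sp * 2 * 4            # 2 reads/op x 4 bytes -> ~49 TB/s
write_tb_s = tflops_sp * 1 * 4           # 1 write/op x 4 bytes -> ~25 TB/s
print(tflops_sp, read_tb_s, write_tb_s)
```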

Would such a machine be a good ML training workhorse?

Arzachel
May 12, 2012


I feel that until AMD can get Fabric/IO power down, small-core and mixed chiplets just don't sound very appealing. Maybe if you add a small-core cluster to the IO die and power down the IF links and chiplets, but then you probably have to fab it on an advanced node.

PC LOAD LETTER
May 23, 2005
WTF?!

Slippery Tilde

Speculation: They could put the small cores on the IO chiplet and for low power operation just turn off the big cores/chiplet + IF bus and save power that way. Would give the small cores slightly better main system RAM latency too for a percent or 2 more performance.

Since the small cores are supposed to be actually small + low power, and the IO die is already a fairly decent size and required for their chiplet approach, squeezing them on shouldn't be too onerous.

Yeah, the process for the IO die is different ("12nm" GF process) and not nearly as good as TSMC's 7nm or 6nm processes, but it'll be good enough for some low-power-optimized, lower-priority cores at ~2-3 GHz, which is likely all that's necessary.

Seamonster
Apr 30, 2007

IMMER SIEGREICH


Wait, if the IMC is inside the IO die, will moving to DDR5 and its higher bandwidth increase power usage? That might necessitate dropping it down to 7nm.

PC LOAD LETTER
May 23, 2005
WTF?!

Slippery Tilde

Maybe? edit: The IOD uses ~15W. I believe the IF bus power use is a bigger issue than the memory controller, especially on Epyc. We don't have any numbers for a DDR5 memory controller so all we could do is guess. From what I recall they're using GF's 12nm for the IOD right now because memory controller scaling is so abysmal with smaller nodes they get nearly no benefit while paying much higher costs and dealing with more supply constraint issues.

It's always possible they could use TSMC's 10nm process instead if they really do have to use a more power-efficient + smaller-feature process.

PC LOAD LETTER fucked around with this message at 07:27 on May 9, 2021

Paul MaudDib
May 2, 2006

"Tell me of your home world, Usul"


Supposedly AMD is moving to a 6nm IO die at some point here.

ConanTheLibrarian
Aug 13, 2004


dis buch is late

Fallen Rib

PC LOAD LETTER posted:

Speculation: They could put the small cores on the IO chiplet and for low power operation just turn off the big cores/chiplet + IF bus and save power that way. Would give the small cores slightly better main system RAM latency too for a percent or 2 more performance.

Since the small cores are supposed to be actually small + low power, and the IO die is already a fairly decent size and required for their chiplet approach, squeezing them on shouldn't be too onerous.

Yeah the process for the IO die is different ("12"nm GF process) and not nearly as good as TSMC's 7nm or 6nm processes but it'll be good enough for some low power optimized lower priority cores at ~2-3Ghz which is likely all that is necessary.

I don't think this will happen. Cores (especially small ones) occupy a surprisingly low proportion of the CPU's area. With small cores, cache would be the dominant feature. The IO die would have to be substantially larger to fit the compute elements.

For reference, here's a Zen 2 die. Purple is L3, orange is L2, green is the core.

PC LOAD LETTER
May 23, 2005
WTF?!

Slippery Tilde

ConanTheLibrarian posted:

I don't think this will happen.

Why would the small low-power cores have to have the same or a greater amount of cache, though? Would an L3 even make much sense if they don't have to hop over the IF bus to system RAM?

If it's mostly doing background or light-duty stuff anyway, won't the cache requirements be reduced a fair amount too?

If the cache requirements are as high as the main CPU's AND you need like 4 or 8 of them, then yeah, it starts to make less sense to put it on the IOD and it becomes more sensible to put them on the main die with the higher-power CPUs. edit: Or do a dedicated low-power CPU die too, of course. Either could work.

PC LOAD LETTER fucked around with this message at 12:51 on May 9, 2021

mdxi
Mar 13, 2006

PC LOAD LETTER posted:

Why would the small low power cores have to have the same or more amount of cache though? Would a L3 even make much sense

Cache and cache architectures are overwhelmingly important to the performance of modern CPUs. You don't dedicate 60%+ of your silicon to something if it ain't worth having.

It might be true that a small/light-duty CPU can get away with less L3$ (compared to, say, a core which is tasked with HPC workloads) and still feel performant, but I genuinely don't think you'd enjoy using a CPU with none.

And that's before we get to the part where (I think) you'd need to re-architect the core fetching/scheduling logic.

But I feel that there's a common issue with all suggestions that core types should be blended (in any combination). AMD's stated reason for doing things the way they have with their current chiplet designs is that decoupling the compute cores from the ancillary functions of the CPU reduces the unit size for lithography purposes -- you're now just fabbing repeating tiles of core/cache which can be sliced up for maximum yield. As soon as you start blending non-compute functionality back into compute dies, or blending core types on a die, or blending compute into the IO die, you've undone that advantage. If you're gonna go big.LITTLE and use chiplets, it makes the most sense to fab the little cores on their own wafers, for even higher yield.

But IANAACPCE (I Am Not An AMD Capacity Planner/Computer Engineer) and plans do change, so this is all just guesswork based on AMD's previous statements.

gradenko_2000
Oct 5, 2010



Lipstick Apathy

mdxi posted:

Cache and cache architectures are overwhelmingly important to the performance of modern CPUs. You don't dedicate 60%+ of your silicon to something if it ain't worth having.

IIRC Intel's Broadwell desktop parts from 2015 were still competitive with 10th-gen Comet Lake not just because they had 6 MB of L3 cache but because they also had 128 MB of L4 cache

PC LOAD LETTER
May 23, 2005
WTF?!

Slippery Tilde

My understanding was that for Zen the large L3s were there to help make up for deficiencies in the memory controller + the small added latency from the IOD + to mitigate latency from moving things over the IF bus. All of that is important for performance, of course, but for a low-power/low-performance CPU would any of that be a priority? Particularly if 2 of those 3 issues could be eliminated by moving the little CPU cores onto the IOD itself?

Yeah more cache is going to be better but I don't see what makes the L3 so much more worth it vs say more L1 or L2 which I would assume would be more valuable performance-wise even if you were much more limited in how much you could cram in vs L3.

I know Intel has an L3 with Tremont, which is their 10nm low-power chip... but it's a much smaller one (4MB), it's shared across all 4 cores, and it seems more relevant for its use in a SoC (to help coordinate things with the iGPU and chipset) than for straight CPU performance alone.

Anyways, I'm not a chip designer either, but going by that example, at worst, if an L3 really is necessary to get reasonable performance from the little CPU cores, it appears to be so to a significantly lesser degree than with Zen, so they wouldn't necessarily be stuck blowing over half the die space on cache for low-power/low-performance use.

PC LOAD LETTER fucked around with this message at 17:06 on May 9, 2021

LRADIKAL
Jun 10, 2001
$10


Fun Shoe

Since we're just making poo poo up, why not replace one of the 8 BIG cores on a CCX with 4 small cores, and give them the exact same amount of cache to share?

Arzachel
May 12, 2012


gradenko_2000 posted:

IIRC Intel's Broadwell desktop parts from 2015 were still competitive with 10th gen Comet Lake not just because they had 6 MB of L3 cache, they even still had 128 MB of L4 cache

I feel like this meme has been perpetuated by Anandtech running their benches on JEDEC memory.

LRADIKAL posted:

Since we're just making poo poo up, why not replace one of the 8 BIG cores on a CCX with 4 small cores, and give them the exact same amount of cache to share?

The IO die pulls ~15W to drive the IF links which is probably more than you'd reasonably save using the small cores.

Kazinsal
Dec 13, 2011

For comparison, dual channel DDR4-3200 averages a ~51 GB/s transfer rate and ~60 ns latency. Zen 3's L3 cache, the largest and slowest of them, has a 600 GB/s transfer rate. I can't remember what L2 speeds are like but I know they're north of 2 TB/s, and L1 read is something like 4 TB/s with sub-nanosecond latency.

Cache is *really* goddamn important.
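Put as ratios against dual-channel DDR4-3200, using the rough figures above:

```python
# Rough bandwidth figures from the post, in GB/s
dram = 51  # dual-channel DDR4-3200
caches = {"L3": 600, "L2": 2000, "L1 read": 4000}

for name, bw in caches.items():
    print(f"{name}: ~{bw / dram:.0f}x DRAM bandwidth")
```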

Truga
May 4, 2014

Lipstick Apathy

Arzachel posted:

The IO die pulls ~15W to drive the IF links which is probably more than you'd reasonably save using the small cores.

yeah, ryzen master tells me my zen2 idles at sub-1W, or runs a bunch of firefox tabs or a game at sub-10W, but the SoC draw itself just sits there at a 15+ W baseline constantly.

there's a ton of savings to be made there, and next to no savings in the CPUs themselves. it's the price for having 1600+ MHz RAM, a dozen pcie4 devices, etc, though.

Nomyth
Mar 15, 2013

Arzachel posted:

I feel like this meme has been perpetuated by Anandtech running their benches on JEDEC memory.

As a former Broadwell and current Coffee Lake owner, I agree it wasn't a particularly revolutionary turn, especially with it being a victim cache

Quaint Quail Quilt
Jun 19, 2006



SourKraut posted:

200mm fans are still not ideal though.
How so? What's all this about negative pressure?

I've had a silverstone 90° rotated case with 3 180mm fans on bottom and one 120mm on top for 8 years and my temps are better than anyone I've ever talked to.

Almost nothing makes it through the dust shields, I clean inside like every 3 years.

SourKraut
Nov 20, 2005

POST QUALITY UNDER CONSTRUCTION

Quaint Quail Quilt posted:

How so? What's all this about negative pressure?

I've had a silverstone 90° rotated case with 3 180mm fans on bottom and one 120mm on top for 8 years and my temps are better than anyone I've ever talked to.

Almost nothing makes it through the dust shields, I clean inside like every 3 years.

200mm fans are great for quietly moving a good amount of air at a low static pressure. So if your case setup supports the hardware configuration you want and you're happy with temp and noise, then great!

In a lot of situations though, their low static pressure will end up hurting someone's use case, such as if they have a radiator on the fan mount, the type of dust filters being used, etc. The 180/200 mm opening size is also less favorable for noise attenuation, so depending on the GPU and other components in the case, and where you place the case, you may end up hearing quite a bit more than if it were a 120/140 mm fan.

Also, a lot of the cases I've seen that support 180 or 200mm fans usually support two 120 or 140mm fans in the same spot. Two 120mm fans won't give you the same level of airflow-vs-noise performance, but two 140mm fans will probably exceed a single 180/200mm fan as long as you get PWM fans and don't mind spending some time adjusting the fan curve profile.

It sounds like your case is just about the perfect setup for using 180/200mm fans, though, if it truly supports three of them. That's also the issue with most cases that do support them: usually just one spot within the case can fit that fan size. You're then stuck either using it as the sole discharge fan (quiet, moves a lot of air, doesn't need a filter, but may not be in the best location in terms of airflow/thermodynamics) or using it as intake and losing airflow performance once you put a filter in front of it, probably while still needing at least one more fan somewhere to help maintain positive pressure.

VorpalFish
Mar 22, 2007
reasonably awesometm



SourKraut posted:

200mm fans are great for quietly moving a good amount of air at a low static pressure. So if your case setup supports the hardware configuration you want and you're happy with temp and noise, then great!

In a lot of situations though, their low static pressure will end up hurting someone's use case, such as if they have a radiator on the fan mount, the type of dust filters being used, etc. The 180/200 mm opening size is also less favorable for noise attenuation, so depending on the GPU and other components in the case, and where you place the case, you may end up hearing quite a bit more than if it were a 120/140 mm fan.

Also, a lot of the cases I've seen that support 180 or 200mm fans, usually supported two 120 or 140mm in the same spot. Two 120mm fans won't give you the same level of airflow vs noise performance, but two 140mm fans will probably exceed a single 180/200mm fan as long as you get a PWM fan and don't mind spending some time to adjust the fan curve profile.

It sounds like your case is just about the perfect setup for using 180/200mm fans though, if it truly supports three of them, because that's also the issue with most cases that do support them: It's usually just one spot within the case that can fit the fan size, so then you're stuck with either trying to use it as the sole discharge fan that is quiet and will move a lot of air and doesn't need a filter, but may not be in the best location in terms of airflow/thermodynamics, or otherwise using it as intake but losing airflow performance once you put a filter in front of it, and probably still needing at least one more fan somewhere to help maintain positive pressure.

I'm guessing he's talking about the... RV05? The FT05 also uses 180s with the rotated layout, but it has 2 instead of 3. Either way, I believe both cases make extremely effective use of the 180mm fans. They do bottom-to-top airflow, probably with low-impedance nylon filters (afaik silverstone's filters are very good), and there's not much in the airflow path between the fans and the CPU/GPU. I believe both cases test very well for both acoustic efficiency and absolute cooling, even by more modern standards.

Kibner
Oct 21, 2008

#1 Pelican Fan


Yeah, the RV05 and FT02 use that layout. They are not as good for GPU cooling as some modern cases but are still among the best for CPU cooling, iirc. I'm using an FT02 right now but am wanting to switch to an O11 Mini once I can actually get GPUs again.


Gwaihir
Dec 8, 2009



Hair Elf

The FT02 is still probably my favorite case of all time; I REALLY wish silverstone had kept refreshing that style.
