Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.
Question born of lurid curiosity: I have a Core i9-7940x and it's been great. But outside of fixing the AVX clock-scaling issue, what additional AVX-512 improvements has Rocket Lake's new implementation brought?

edit: Also, will AVX-512 optimizations targeted at Rocket Lake at least partially apply to my Skylake-X chip?

Hasturtium fucked around with this message at 15:49 on Apr 10, 2021

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Kazinsal posted:

e: Sarcasm aside, did anyone ever do an OP for a non-x86 CPU architectures thread? Or is there even really enough interest in having one?

As someone who has been lusting after one of Raptor Computing's POWER9 setups for years, that would be welcome. If anybody wants to start talking about the ghosts of SPARC and MIPS, SMT4 and SMT8, and what server-targeted ARM is like, I'd be there.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Kazinsal posted:

Awesome. I started working on an OP a while ago but lost what I had in a power blip so I’ll probably start a new thread this evening and write about the architectures I know about and let other people contribute primers for the ones I don’t.

Thank you, link it here. I'd love to find a happy medium between "set up a cluster of Raspberry Pis" and "drop three grand for a POWER setup" that doesn't involve raiding eBay for a closeout server from over a decade ago, or somebody's microATX Amiga non-starter.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.
Thanks! I chipped in a quick blurb about Power to get the ball rolling.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.
I’d love to hear someone compare the M1 to Power9, both in terms of relative performance and in how each embraces instruction-level versus thread-level parallelism. They are built on wildly different processes and for different markets, but it would still be illuminating.

Hasturtium fucked around with this message at 03:56 on Aug 15, 2021

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Fantastic Foreskin posted:

without accounting for the business reasons driving the design process it's an apples to oranges kind of deal.

:D

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Arivia posted:

Speaking of x86/x64 being hemmed in by ancient poo poo, I saw an offhand reference the other day that you could still run code from like 8088s on today's processors. I know there was something like the x86 chips these days basically popping themselves through the various modes really quickly at startup so you're out of real mode and protected mode and into running actual x64 code. But I figured that at some point Intel or AMD must have gone "seriously we can just get rid of the poo poo for running programs from like 1990" to clean things up - I know you can't run 16bit executables any more, but still having a bunch of legacy mode support in the CPU feels like a big waste of time. Am I mixing things up/totally wrong?

I don’t think there’s anything nominally preventing you from running 16-bit apps in real mode, though Intel cut gate A20 support with Haswell so most DOS memory extenders don’t work any more in protected mode. You can also (sorta) run Win9x, barring the lack of drivers for just about anything made since 2006. Windows imposing limits on 16-bit protected mode code in newer versions (like for legacy program installers) doesn’t necessarily speak to what the CPUs themselves can do. By and large they really can run a TON of old code.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

BurritoJustice posted:

Multi-core enhancement on Rocket Lake is absurd - my friend's 11900k is reporting 283W package power out of the box in P95. It's a Z590 ROG HERO SUPER GAMER (etc), but with completely unmodified BIOS settings other than XMP. I know it's a power virus, but it's absurd that it's allowed to run that high. It was only managing 4.5GHz too (though I believe with AVX-512).

I haven't seen much in the way of apples-to-apples comparisons, but do the high-end Rocket Lake chips actually draw comparable power to the Skylake HEDT platform on average? Or more? I'm having trouble imagining an eight-core chip pumping out more heat than my 7940x running at alleged stock clocks.

Hasturtium fucked around with this message at 18:25 on Aug 18, 2021

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Twerk from Home posted:

Rocket Lake has AVX512, right? So Intel technically wins that.

Rocket Lake does have AVX-512 - it's one of the only areas where its performance cheerfully zoomed away from Comet Lake's. Prior to that, an earlier, more limited subset of the instructions was available in Skylake-X for the Xeon and HEDT markets. If AMD does successfully entrench AVX-512 by bringing it to the mass market, it'll be a surprise.
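For anyone who wants to poke at it, here's a minimal sketch (mine, not from any review) of how you'd gate an AVX-512 code path behind a runtime check with GCC or Clang, so the same binary still runs cleanly on chips without it:

code:
#include <immintrin.h>
#include <stdio.h>
#include <stddef.h>

/* AVX-512 path: compiled for the ISA, but only called after a runtime check. */
__attribute__((target("avx512f")))
static void add_avx512(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {                  /* 16 floats per 512-bit register */
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));
    }
    for (; i < n; i++)
        out[i] = a[i] + b[i];                       /* scalar tail */
}

int main(void)
{
    enum { N = 64 };
    float a[N], b[N], out[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    if (__builtin_cpu_supports("avx512f")) {        /* Skylake-X and Rocket Lake say yes; Comet Lake won't */
        add_avx512(a, b, out, N);
        puts("took the AVX-512 path");
    } else {
        for (int i = 0; i < N; i++)
            out[i] = a[i] + b[i];
        puts("took the scalar fallback");
    }
    printf("out[%d] = %g\n", N - 1, out[N - 1]);
    return 0;
}
Build it with plain gcc -O2 on any x86-64 machine; the AVX-512 path only gets taken if the CPU actually reports the feature, which is what the __builtin_cpu_supports() call checks.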

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

BlankSystemDaemon posted:

Too bad the SMT implementation was a loving disaster, to the point that even Intel admitted it and removed it, and only reintroduced SMT in Nehalem when they'd finally managed to do it properly.

Do you have info on the Netburst SMT implementations? I remember it being a mixed bag under the best of circumstances but would like a refresher with knowledge of better practices this far out.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Dr. Video Games 0031 posted:

It really depends on the game and the review setup. I've seen reviews that put the 5600X slightly ahead of the 12400 in both gaming performance and power efficiency, like these: https://www.techspot.com/review/2392-intel-core-i5-12400/, https://www.techpowerup.com/review/intel-core-i5-12400f/

edit: The 5600X has a small edge in gaming performance in the Club386 and GamersNexus reviews too, and in the TomsHardware review it's basically a wash. The 5600X's advantage (or really, the thing that keeps it from getting blown out) is its cache size. If the 12400 had more cache, it would win handily against the 5600X, but as it is, it's not the better gaming CPU. So as long as you got a decent price, I don't think you have anything to worry about.

I have a perfectly dumb question: for the purpose of gaming, would a 12400 be a better general choice than a 7940x? Just wondering if the per-core grunt and lower latency between cores would elevate it above fourteen cores of Skylake-X justice.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.
Intel still comes across as scattered and reactive when it pulls poo poo like this, and I’m glad they’re getting some flak for it. Is there any kind of decent write-up on the differences between that earlier subset of the spec and what made it to Rocket Lake before being snuffed out piecemeal with Alder?

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Boat Stuck posted:

Is it normal for Alder Lake to run, like, really hot?

My mildly OC'ed, undervolted 12700k is pulling around 180W under load and running around 80°C with spikes into the high 80s on my Asetek 280mm AIO. Does that sound about right, or should I re-paste?

I think the 180W is probably a bigger contributing factor to heat than the 280mm rad, but I’m also managing temps in the mid-60s on a stock 7940x with a 240mm AIO, so something is amiss. Definitely start with a repaste, then see if you need to go UEFI spelunking to lower the power target.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

RME posted:

Mostly just a technical curiosity but: how does the scheduler(? Or whatever is responsible) figure out what to throw at the e cores anyways
Googling mostly returns a bunch about people running into workloads where it misuses them (expected, I can’t imagine any heuristic is perfect) and talk about windows 10 vs 11

That falls under the purview of the Alder Lake Thread Director, an integrated microcontroller whose entire responsibility is making sure the right workloads land on the right cores. It requires integration with the OS scheduler - to my knowledge it’s only supported by Windows 11 and the Linux 5.18 kernel. Anandtech has a pretty good write-up on Alder Lake that goes into more detail.
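And if you don't want to trust the Thread Director/scheduler combination for a particular workload, you can still pin things by hand. A rough Linux-only sketch - the CPU numbers here are an assumption for the example, since the actual P-core/E-core numbering varies by SKU (on recent kernels, /sys/devices/cpu_core/cpus and /sys/devices/cpu_atom/cpus will usually tell you which logical CPUs are which):

code:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Assumption for the example: logical CPUs 0-7 are the P-core threads.
     * The real mapping varies by SKU and firmware, so check sysfs first. */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);

    /* Pin the calling process (pid 0 = self); children inherit the mask. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    puts("pinned to CPUs 0-7; the scheduler won't migrate this onto the E cores now");
    return 0;
}
It's the same effect as launching the program with taskset -c 0-7, just done from inside the process.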

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

mdxi posted:

5.18 might be when Alder Lake support went in, but heterogeneous multiprocessing has been a well-solved problem in the mainline Linux kernel -- and iOS -- for years. As usual, it's just Windows that needs to play catch-up.

Oh, sure. Intel is late to this party, and so far AMD's been a no-show. Agner Fog went into some detail regarding why Intel's implementation needs special consideration - between the lack of SMT on the little cores and the overall scope of architectural difference beyond "bigger chip make go faster than little," Intel's made itself a strange bed to lie in this round. The big-core-only Alder models feel like a solid generational improvement that's juiced with enough power to undercut its effective efficiency despite a nice process improvement. The big.LITTLE ones feel like a weirdly reactionary response to emerging trends - disabling AVX-512 with no option to re-enable it is going to bite their efforts to spur adoption hard. Even beyond the lack of support in the E cores, maybe they're afraid that throwing power at the architecture and then factoring in AVX-512 would blow way past the bigger power envelopes they're already struggling with?

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

VorpalFish posted:

The power envelopes seem almost entirely based on getting bigger number in cinebench. For the 240w cpus you lose something like 8% performance dropping to 125w.

And it's not like avx512 is gonna make it ignore power limits, although it will likely drop clocks to obey.

Yeah, it's nuts. AMD's doing this too - the 5800x eats half again as much power as the 5700x and wins out by somewhere between 5 and 8% in benchmarks. Unlocking that last ten percent of performance from modern silicon is very expensive.

Your second point is true - the chip won't push beyond the power envelope, but clocks will sag to accommodate that. I recently sold a 7940x where I got to test that quite a bit, and for what I was getting out of the chip the heat was positively oppressive. After living with it for four years in north Texas I cried uncle, sold it, and replaced it with a 5700x.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

JawnV6 posted:

gosh, i knew "hide it behind ACPI" was kinda silly and while it's understandable that certain problems aren't amenable to a pure HW solution? that strikes me as overengineered

It’s a problem of the differences between the P cores and E cores - the former support SMT, the latter don’t, and they have substantially different performance characteristics besides that, so an OS scheduler alone is unlikely to make good decisions about distributing workloads across them. It’s over-engineered because the chip’s been lashed together from two disparate CPU families rather than the matched core designs of a traditional big.LITTLE arrangement. And the worst problem in my eyes is that the E cores have been nudged to excessive clock speeds to goose performance, which defeats the ostensible power-saving purpose of their inclusion beyond providing extra threads for benchmarks. I’d genuinely like to see how an underclocked, undervolted 12600K would do compared to most configurations in the wild.

Hasturtium fucked around with this message at 00:14 on Jun 22, 2022

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Winifred Madgers posted:

Step 1: Give it less volts.

To OP wondering about CPU undervolting: it's... kinda tricky. In days of yore, when I was undervolting an AMD FX-8320, it literally involved stepping down the core voltage by small increments and testing until I found a point where it became unstable, then nudging the voltage up slightly, testing again to see if it was stable, and then hopefully calling it a day. This approach may still work on a modern CPU - hop into the BIOS, see where your core voltages lie, and start gently adjusting voltages and seeing what happens.
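If it helps to see the procedure written out, here's a toy C sketch of that loop. The two helpers are made-up stand-ins for the manual steps (changing the offset in the BIOS, running an hour of Prime95 or whatever) - there's no real API here:

code:
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for manually setting a negative Vcore offset in the BIOS/UEFI. */
static void apply_offset_mv(int offset_mv)
{
    printf("set Vcore offset to %d mV, reboot, carry on\n", offset_mv);
}

/* Stand-in for your stability test of choice (Prime95, a Cinebench loop, etc.). */
static bool passes_stress_test(void)
{
    return true;    /* pretend it always passes so the sketch runs to the floor */
}

int main(void)
{
    int offset = 0;             /* start from stock */
    const int step = 10;        /* mV per attempt */
    const int floor_mv = -150;  /* don't bother going lower than this */

    /* Step the offset down until the stress test fails or we hit the floor. */
    while (offset - step >= floor_mv) {
        apply_offset_mv(offset - step);
        if (!passes_stress_test())
            break;              /* last attempt was unstable, stop here */
        offset -= step;
    }

    /* Settle one step above the failure point and confirm with a longer run. */
    apply_offset_mv(offset);
    printf("settled on a %d mV offset; run an overnight test before trusting it\n", offset);
    return 0;
}
Offset undervolting (rather than forcing a fixed Vcore) tends to be the safer starting point on anything recent, since boost behavior stays intact.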

redeyes posted:

Went from 3900x to 12700k. Intel is FAR better for various reasons. I am using a DAW studio type computer and Intel is raping AMD in latency out of the box. There are many variables but still, Intel rules the roost for this.

Jesus, goon, I get you're excited, but dial it back.

Hasturtium fucked around with this message at 03:49 on Jun 29, 2022

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

mobby_6kl posted:

I'm curious about the power draw, considering how much Alder Lake needs to hit the last few hundred MHz. Maybe they'd at least run the E-cores at an appropriate voltage, as someone mentioned before.

I think someone here (Paul?) confirmed that the little cores are running on the same voltage rail as the big ones. Since the little cores aren’t operating at an efficient voltage, Intel decided to juice the clocks to get what performance they could out of them, and with Raptor Lake they’re literally doubling down on the strategy. In the future - and especially for mobile-targeted chips - I’d expect Intel to give each cluster its own rail at voltages suited to its strengths, but that will likely require a socket change to accomplish.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.
What’s the per-clock difference between Rocket and Alder Lake outside of AVX-512, where Intel apparently decided that because little Alder cores couldn’t run it, no Alder cores should run it?

Just kills me, is all. I’m wondering if it’d be worthwhile for a machine chiefly concerned with crunching video to skip Alder this time and save a little money with Rocket Lake.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

PBCrunch posted:

MMX instructions were introduced on the fourth (!) process node version of the Pentium.

P5: 60-66 MHz, Socket 4 (5V), 3.1M transistors, 0.8μm (800 nm)
P54C: 75-100 MHz, Socket 5 (3.3V), 3.2M transistors, 0.5 or 0.6μm depending on who you ask
P54CQS: 120 MHz, Socket 5 (3.3V), 3.3M transistors, 0.35μm
P54CS: 133-200 MHz, Socket 7 (3.3V), 3.3M transistors, 0.35μm
*
P55C: 120-233 MHz, Socket 7 (2.8V), 4.5M transistors, 0.28μm
Tillamook: 166-300 MHz, different formats for mobile and embedded applications, 4.5M transistors (0.25μm)

The * is for the weirdo P24T Pentium Overdrive chips that ran on a kind of parallel riser board with a built-in VRM to let the 3.3V chips run on 5V 486 boards.
P24T: 63-83 MHz, Socket 2 or Socket 3 (3.3V), 0.6μm

You said it all better than I could have. One Weird Trick I remember reading, years after I could have used that information, was that with beefier cooling P55C still supported 3.3V signaling and could work on older motherboards, especially since they used a remapped multiplier. I wish I'd known that when I stumbled on a dual socket Digital Pentium 90 on a curb back around 2002 - a pair of Pentium MMXes would have made it a decently spry Shoutcast server in a corner of my room.

It is weird how the Pentium name lost its luster with the Pentium 4 and was then kept around as the rung above their Celeron-named chips for a decade and a half. I think deprecating those two product lines is going to bite them in the rear end - "drat it, why did I buy this low end desktop with an INTEL PROCESSOR?"

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Rinkles posted:

Are the Quicksync (QSV) presets not available because I'm using my dGPU? This is Handbrake.



I have an 11600K.

That appears to be the case - if your setup disables the IGP while a dGPU is in use, Handbrake won't see the QSV presets. Some motherboards and configurations are more accommodating than others about keeping the IGP active alongside a discrete card.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Twerk from Home posted:

A little fiddling with PBO curves and you can reduce power usage by 50W without giving up a lick of performance. If you're willing to sacrifice 3-5% you can probably cut power usage in half.

I'm absolutely planning to turn a high-end part down a bit to make less heat.
https://www.youtube.com/watch?v=FaOYYHNGlLs

Amen. I’m pretty well computationally set for a while on my desktops, but if I decide to get ambitious again I would never run one of these chips at fireball stock settings. Cutting the stock power limit in half, still getting 90% of the performance, and not outrageously heating up my Texas house is a no-brainer for me.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Palladium posted:

its funny when intel NICs used to be regarded as 100% bulletproof and realtek was the crappy alternative, when I never ever had any problems with realtek based stuff from audio to USB wifi

even my intel AX200 m2 wifi card still occasionally disconnects despite having a constant full signal; the previous 8265 was even worse

Seems like a reflection of Intel stagnating for too long before working to turn the tide, and of Realtek becoming so ubiquitous that they were pressured into a level of baseline competence. Their kit 20-ish years ago was genuinely awful, as evidenced by a legendary bitching session in the comments of the FreeBSD driver for their NICs.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

HalloKitty posted:

Maybe I can conjure up some bad feelings:
ECS
VIA

And some nostalgia for past quality brands:
Abit
DFI

On the bad: I remember the ECS K7S5A, a motherboard so awful it went through at least five revisions. A friend of mine got a special on one at Fry’s which literally wouldn’t work with one class of CPUs or another because the wrong kind of resistors were soldered onto part of the board. Even a “good” one was flaky, and then most of them died of capacitor plague. VIA was just half a rung above SiS, and we only put up with them because there wasn’t another high profile supporter of AMD chipsets for Slot/Socket A for years. Those were fun times with poo poo hardware.

On the good: I managed to snag a DFI SB600-C motherboard on eBay for like $20 a while back. It was a Sandy/Ivy Bridge board with five vanilla PCI slots and PCI Express x16, and led to me falling into a wormhole of trying to make various jacked up MS-DOS versions work with PCI audio and a preposterously overpowered CPU for the purpose. Ended up giving it to a friend, but the build quality was basically perfect for what it was, outside of never getting mini-PCIe storage working. DFI is still out there doing good work.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Cygni posted:

I've definitely built more systems with K7S5A's than any other single motherboard. :haw: Maybe a few hundred, all told? I would take small contracts in college and go in and out of Frys repeatedly for the "1 per household" insanely cheap sub $80 Duron+K7S5A combos to fill them. The worst part is that while the K7S5A was flaky, like you pointed out, it was actually BETTER than most other Socket A boards out at the time. It feels like there was always some major issue or other until the later Nforce 2 boards became mature.

The worst part about the K7S5A (and their PC Chips cheapo KT266A stablemates) was the insanely thin traces on the top layer of the board around the socket. One slip with the awful old screwdriver-clamp coolers and an accidental scrape of the board, and you could end up cutting multiple traces.

It really seems like the Slot A boards with the Irongate chipset (AMD’s own 750, if I remember right) were more predictable than the slew of Socket A boards that came later. They weren’t necessarily better - AGP was so awful and conditional on the platform that I stuck with PCI graphics on my Athlon 500, forever ago - but there was a consistency to the experience. The KT series by VIA in all its permutations and inconsistencies was like an old war wound that only faded after I’d spent a blessed decade-plus not worrying over them. Remember how badly they got along with Creative sound cards? I haven’t forgotten.

Nforce2 had quirks, but for running a regular Windows XP box with an Athlon XP 2400+ and a 6600GT, my Gigabyte board was stable and pretty hassle-free. Compared to what came before it felt like I was sitting pretty.

But hey! In terms of brain-melting boards that were everywhere for a while as an indictment of capitalism, at least we aren’t talking about the FIC VA-503+ back on Super7.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.
Yeah, the very first PC my family bought was a Compaq Pentium 90, and it sported a whopping 150W power supply. Considering the demands of that 1995 configuration, it was pretty generously over spec.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Cygni posted:

https://twitter.com/BIOSTAR_Global/status/1589596025349554177

a ton of serial headers and a ton of PCI(non-e) slots, now thats what peak performance looks like

If you’ve got a factory floor that needs legacy PCI controller cards and serial connectors, this very much would be peak performance. I’d be interested to know how they implemented the legacy PCI bus here; from what I understood playing around with an industrial DFI SB600-C (itself a Sandy/Ivy board replete with PCI slots), chipsets more or less stopped offering native PCI starting around Haswell on the Intel side and Bulldozer on AMD, and consumer boards that kept PCI slots did it with bridge chips.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

WhyteRyce posted:

it has to be a bridge chip because Intel stopped putting native PCI into the chipset some time ago

Hopefully the bridge chip isn't garbage though. Last time I used one, it had some read-around-write pcie bug and the only solution was to hard strap it to pcie gen1/x1 (i.e. please don't flood our bridge chip with much traffic)

I figured it had to be a bridge chip, yeah. It would be nice to think some enterprising firm built a reliable bridge that could handle the peak 666MB/second five vanilla PCI slots could drive. That’d amount to, what, PCIe 3.0 x1 with some generous headroom?
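Back of the envelope, assuming each slot sits on its own bridge-created segment: conventional 32-bit/33MHz PCI tops out around 133MB/s per bus, so five independent segments is roughly 5 x 133 ≈ 667MB/s, and a PCIe 3.0 x1 link is good for roughly 985MB/s after encoding overhead. If all five slots share one bus behind the bridge, you're back down to that single 133MB/s.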

Bigger question is whether it would accurately handle some PCI weirdnesses like port addressing or DMA behavior; there was some lamenting over on VOGONS that more recent PCIe bridge solutions weren’t allowing access to Yamaha OPL3 chips on elderly sound cards and the like. It makes me wonder about the reality of the economics of this sector and how much any of that comes into play for industrial purposes. Lord knows something like this would have been welcome at a lab where I used to work - virtualization and device passthrough for electron microscope controller boards that never got driver support past Windows 2000 was a concern all the way back in 2009…

Hasturtium fucked around with this message at 19:30 on Nov 7, 2022

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

hobbesmaster posted:

Skylake to Coffee Lake are still very common for embedded and extended availability applications. Jokes aside, Intel has a lot of experience with and fab space for 14nm.

This is 100% correct. That’s not a board intended for high CPU performance in absolute terms anyway - it looks like a 4+2 VRM setup. Cutting edge matters a lot less for these roles than “set up and configure once, then run without incident indefinitely.”

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

SourKraut posted:

Apple should have just updated the MP to use these instead of the burning dumpster fire that is the ASi Mac Pro.

They were eager to move on to in-house solutions for PR, but the bet on an M2 Extreme - four M2 Max chips all working in tandem - fell apart due to manufacturing problems. Thus, the 2023 Mac Pro is the fallback: literally an Ultra Studio with PCIe 4 support enabled by switching, no external GPU support, and 64GB of non-upgradable RAM, starting at seven thousand dollars. I know you know how insane this is, but I’m still aghast.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Twerk from Home posted:

When was this? Both Athlon XPs and 64s looked really efficient against Pentium 4s. Hell, K6es weren't that hot unless you were trying to overclock the poo poo out of them.

K6 also had the problem of a pipeline all of four stages deep. It just didn’t scale well to higher clocks, and I’m a little impressed they got up to 550MHz with the K6-III+ and its sizable cache. But the efficiency advantage has bounced between Intel and AMD repeatedly - Netburst was power-drinking trash versus K7, and K8 positively pantsed it; Conroe/Core 2 put Intel back on top until things leveled out again with Phenom II; and then Intel made a huge jump on efficiency with Sandy Bridge that they maintained over AMD’s construction cores until Ryzen came back around and started competing on efficiency while Intel was stuck on 14nm for half a decade plus.

My growing suspicion is that the don’t-call-it-Atom-descended E cores will eventually be grown to supersede the P cores in future Intel designs, as the latter are grossly less efficient in real terms for everything but SIMD-heavy work. Zen 4c was an interesting recent development - I’ll be interested in seeing where things shake out in the next five years.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

DoombatINC posted:

That's our Intel babyyy, right as brands like Minisforum and Beelink are popularizing the ultra small form factor for consumers they pull up stakes and leave the market

It’s amazing - they’ve been the flagbearer for the sector, to the point that a recent knockoff low-end brand is called ATOPNUC, and they’re kiboshing the line. The flop sweat is alarming.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

BobHoward posted:

The 'construction machine' Family 15h chips - Bulldozer and its successors Piledriver, Steamroller, and Excavator - were all pretty bad.

With Bulldozer, AMD bet the farm on a novel idea for making highly threaded server chips. Many server workloads don't need the FPU much, so they came up with an idea for a 2-thread 'module' where there are two independent cache/integer complexes sharing one FPU. Sort of like Intel hyperthreading, except much less resource sharing between the two threads. The idea was that one module was substantially smaller than two fully independent cores, but would perform like independent cores on integer workloads. Due to the size reduction, AMD could pack more cores into a big server chip.

Since the integer execution width for each thread in a module was somewhat narrow, AMD needed to target very high clock frequencies. However, as I understand it, AMD didn't hit their power targets and was forced to back frequencies way off. Even with the clock speed reduction, the chips still used lots of power, and performance was not great.

Yes - clustered multithreading (CMT) didn’t pan out for AMD. The siloed integer/cache complexes only shared a prefetch unit, a decode unit (which was re-duplicated for Steamroller to improve performance, then rolled back to one for Excavator for power savings), and the FPU. The FPU was dual-issue and capable of handling two 128-bit values at once, and at least for Excavator those could be ganged together for 256-bit work, as those last chips finally added support for AVX2.

With lower IPC, and clocks failing to scale to the levels needed for competitive performance (partly due to AMD's process disadvantage at GlobalFoundries), the chips were only competitive in well-threaded and integer-heavy niches. I knew a few people who kept them around for munching DVD rips and cheap build servers, and I swore by my FX-8320 as a quirky but decent workhorse, but Intel's chips outclassed them for general workloads. Simple as that.

As a weird side note/epilogue, there’s a dirt cheap mini PC made by ATOPNUC available now, the MA90, featuring an AMD A9 9400 - a single module/dual thread Excavator from 2016 with a Radeon R5. I obtained it for the princely sum of $86 before taxes, as it is cheaper than a Pi-alike for running Pi-Hole, and in my initial testing it will not surprise you to learn, dear goons, that it is slow.

Hasturtium fucked around with this message at 02:53 on Jul 12, 2023

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Klyith posted:

I still use an A9-something craptop. And get this, most of the time I limit the CPU speed to 70%. :sickos:

(Because it sits next to my bed and I don't want the fan to turn on.)

If I want to watch a youtube at higher than 720p I download it with yt-dlp, because a browser decoder drops frames.



It sucks, but the two things I do with it are watch video or type plain text and given that it's actually ok. The GPU has better video decode than whatever celeron was in $300 laptops in 2018. TBQH I don't hate it!

The later FXes got through the worst deficits of the Construction arch through clockspeed and extravagant TDP.

Funny how that pattern keeps repeating!

I absolutely believe it. Last night I got YouTube running on Ubuntu and it’s amazing how hard those two cores work to just sit on a webpage and play 720p video. If you haven’t installed h264ify, that will probably help on the IGP, but I haven’t exhaustively verified that.

It skins my nose a little that this thing is running single-channel memory, but I happen to have two 4GB SO-DIMMs left over from an old project. Later today I’ll swap the RAM and see what difference it makes. I can verify that it took at least 20 minutes to compile GZDoom, stock.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Kazinsal posted:

:stare: I would love to see how long it takes to compile binutils and gcc on this thing

It’s pokey on vanilla Ubuntu, though it picked up a smidge switching to Xubuntu + lightdm. I fear how pokey this would be trying to lug Windows 10 around - just having a few Firefox tabs open makes it chug. I am… tempted by your proposal. Let me kick it around a bit first. Also, genuinely impressed that dual channel memory doesn’t seem to have made much difference.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Kazinsal posted:

If you want something fairly reproducible for comparison then a buddy of mine has a script set that can produce gcc cross toolchains easily: https://github.com/travisg/toolchains

Each arch passed to -a causes a whole new compile/link cycle for the whole suite so you’d want to time eg. just x86_64.

Y’know, this would be riotously funny to run and compare on my eight core Power9 versus this thing. 32 threads of screaming ppc64le versus… this.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Kazinsal posted:

That’d be brilliant. The script can only guess at 2-way SMT so for 4-way SMT you’ll have to pass -j32 as well.

Just FYI: poking through the script, the command it uses to determine -j’s default value correctly returned 32 on the Power9. I’ll let you know the numbers after the grinding has concluded, but the A9 basically feels like a Core 2 Duo, and the Power9 is closer to a 3950x.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Kazinsal posted:

That’d be brilliant. The script can only guess at 2-way SMT so for 4-way SMT you’ll have to pass -j32 as well.

All right, numbers have been run. The time needed to compile the x86_64 version of the GCC 13.1.0 + binutils toolchain with the helpful link you posted:

IBM Power9, 8 cores 32 threads (-j 32), 32GB DDR4-2666 dual-channel, Samsung 870 256GB SATA SSD
real 9m45.233s
user 124m55.872s
sys 3m58.117s

AMD A9 9400, 2 cores(ish) 2 threads (-j 4 - don't ask), 8GB DDR4-1600 dual-channel, 128GB generic SATA M.2
real 80m30.175s
user 127m45.480s
sys 17m51.778s

If anything, I'm a little surprised the A9 didn't take even longer.
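(For the curious, that's 4,830 seconds versus 585 seconds of wall-clock time, so the Power9 finished roughly 8.3x faster - and while the total CPU time burned was nearly identical, the A9 only had two threads to chew through it with instead of 32.)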

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

Paul MaudDib posted:

well, getting away from that is what keller did with the royal core series. "rentable units" are supposedly the replacement for hyperthreading in the royal core series, starting with arrow lake. it's not quite clear what "rentable unit" means, but presumably some kind of execution resource that can be allocated to a thread, or a shared unit between multiple threads that is shared in a module (like FPUs in CMT)? But either way they are thinking about managing that balance of registers/visibility complexity/etc bloating area per core etc.

Someone recently asked me where the term "scoreboarding" comes from, and the answer is the CDC 6600. Which uses barrel processing as well, where threads get scheduled onto shared physical cores as they're ready to execute. And the Peripheral Processor concept is an interesting one too. The idea is you have 12 big processors you can do an async launch onto and they'll go load some file from disk or do some processing subtask etc. And that's not too dissimilar from the idea of modern async promise/await coding. So the central vector processor can just blast along doing its thing and let the Peripheral Processors worry about the administrivia.

Maybe rentable units are something like the idea of thread-level barrel processing. Like you have a bank of thread contexts for a 4c or 8c cluster, and the cores themselves just grab a thread in the ready-to-run state, run it, and put it back into the bank when it needs to wait. How do you really describe the core count in that situation? If it's 8 p-cores and only one thread is executing at a time per core... it's an 8/8 processor. But there are a lot of other ones in a close-to-running state. GPUs do the same thing with warp scheduling to cover for the latency of memory access... if a thread wants to go out to memory, sleep it until it comes back and process some other thread in the meantime.

(And that's an interesting development in light of the Thread Director - it totally seems like overkill for big/little but maybe it makes sense to have this facility that knows about process state and scheduling and priority, living on the die.)

In general I still def feel that at least SMT2 is worth it, let some other thread have a crack at any spare execution resources. Even if you are barrel-processing it's still units that are getting filled that weren't before. But I think it's really an increasing nightmare for speculative correctness etc. If a core has to track two threads at once, it has to track two sets of state and speculative visibility etc. And the overhead of that may be more than it appears at first crack, on top of just being more complex to do correctly. On top of that if it produces even a marginal reduction in area/complexity that can be plowed back into having more of them. But you've taken a crack at execution unit waste - if it's stalled, move it back to the bank, or move it to a slower core, or whatever. So hopefully that would translate into better occupancy on the units such that there's less waste for SMT to exploit.

the other weird one is IBM's thing where there is no L3 but you have superfast L2s that can be read by any other socket in the machine at essentially line speed, and they build a tagged cache thing where evictions from one machine are pushed into the others, so everyone lives in one space and you have private caches built on top of a shared virtual cache infra

This is all really interesting, thank you. Can you link to some resources so I can read more about this?
