No Gravitas
Jun 12, 2013

by FactsAreUseless
Why is Atom lovely?

No Gravitas
Jun 12, 2013

by FactsAreUseless
What is that thing anyway? A GPU that isn't a GPU?

No Gravitas
Jun 12, 2013

by FactsAreUseless

Lowen SoDium posted:

It's an add in card with 57 Intel Atom cores on it. It's Intel's version of Nvidia's Tesla co-compute cards.

poo poo, sounds pretty cool actually. I don't suppose I can boot Linux on that?

No Gravitas
Jun 12, 2013

by FactsAreUseless

PCjr sidecar posted:

As could be expected for a 270W card with no fan, cooling is *very* important for these. If you're interested, look at the Supermicro 4U passively cooled GPU/Phi chassis to see what they do.


It runs Linux. You can SSH to it.

They aren't Atom cores, they're Original Pentium (P54C) cores without out-of-order execution, 4-way hyperthreading, and an wide-rear end vector unit bolted on. Next-gen will have Atom.

So it will run GNU Octave 20+ times at once?

If so, this is exactly what I need.

Except cooling. And power supply.

poo poo... Hmm...

No Gravitas
Jun 12, 2013

by FactsAreUseless
Oh, yes. Oh, yes. I want this.

What kind of cooling does it need?

Does it just plug into PCIe or something?

atomicthumbs posted:

They're on sale everywhere; apparently Intel's about to announce the next generation or something.

Can you give me some links? I think I want one, but I don't want to buy from that company. Some system with a checkout cart and all that, maybe?

This is almost exactly what I need for my work.

No Gravitas
Jun 12, 2013

by FactsAreUseless
Are there any alternatives for people who want a ton of integer cores?

No Gravitas
Jun 12, 2013

by FactsAreUseless

~Coxy posted:

I doubt they ship overseas, but got a link?

That ship sailed. They did not ship to :canada: while it was up, otherwise I would have gotten one.

EDIT: Current best is 140$. They also don't ship to :canada:

http://www.amazon.com/gp/product/B00OMCB4JI/ref=gno_cart_title_0?ie=UTF8&psc=1&smid=A183W8CNLFPZLY

No Gravitas fucked around with this message at 06:12 on Oct 30, 2014

No Gravitas
Jun 12, 2013

by FactsAreUseless
The Amazon Phi deal popped up for a brief while at 92$ shipped. I caved in. Helps that my job will pay for part of it too.

I got a forwarding service box and got it sent to there. No idea how (and for how much) I will get it out of the USA, but one step at a time, I guess. Might be something worth investigating for other non-USA people out there.

Work is about to get awesome. Once I figure out exactly how to cool and feed the thing.

Thanks, thread!

No Gravitas
Jun 12, 2013

by FactsAreUseless
I spent the last couple of days researching the Xeon Phi that is on sale. Man, what a beast.

You have those 57 fairly wimpy cores, each running at 1.1GHz. Those cores are very close to Pentium 1. Not even Pentium MMX. Pentium 1. We are talking 1994 here, 20 years ago. Of course they have been tweaked a little bit. We have 64-bit support, some improvements to instruction prefix decode handling, each core has a beefy 512-bit vector unit and they adjusted the pipeline a lot.

As a clock speed improvement measure they made it into something like a barrel processor. There are 4 execution threads in every core, for a total of 228. Each clock cycle one of the threads is picked and one or two instructions are issued from that thread. There is a rule that you cannot pick a thread twice in a row, otherwise anything goes. Just because a thread is not picked in a specific clock cycle does not mean it does not make progress. The clock cycle still counts for cache misses, long-latency operations, etc... When some threads are stalled, others can continue. Due to the fact that a thread cannot be picked twice in a row this is like Hyper-threading on steroids. You likely want to have at least two threads per core at all times, if not three or even four.
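To make the "never the same thread twice in a row" rule concrete, here is a minimal sketch of a single core's issue slots. It is a toy model only: it assumes every thread always has an instruction ready (no cache misses or stalls), and the function name and the round-robin pick are made up for illustration, not taken from any Intel documentation.

```python
def issue_slot_utilization(num_threads: int, cycles: int = 1000) -> float:
    """Fraction of cycles in which the core can issue from some thread.

    Toy model of the rule described above: the thread picked in one cycle
    may not be picked again in the very next cycle. Stalls are ignored.
    """
    if num_threads < 1:
        raise ValueError("need at least one thread")
    last_picked = None
    issued = 0
    for cycle in range(cycles):
        # Every thread except the one picked last cycle is eligible.
        eligible = [t for t in range(num_threads) if t != last_picked]
        if eligible:
            last_picked = eligible[cycle % len(eligible)]
            issued += 1
        else:
            # Only one thread exists and it just issued: the slot is wasted.
            last_picked = None
    return issued / cycles


for n in (1, 2, 4):
    print(n, "thread(s):", issue_slot_utilization(n))
```

In this model one thread per core can use only half of the issue slots, while two or more threads fill all of them. That is the reason for wanting at least two threads per core; the third and fourth thread only start to matter once stalls are taken into account, which this sketch does not do.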

Caches are fun. Each core has the standard 32 kilobytes of data cache and the same amount of instruction cache. 3 cycle latency in practice, sometimes less. There is an L2 cache, 512 kilobytes per core. A better way to think of it is as a roughly 30 megabyte chunk that is shared between the cores. When your data isn't shared it takes 24 cycles to get data from L2. With shared data that is being modified you can end up waiting 250 cycles for a remote part of L2. Don't share non-readonly data. Also, only the L2 cache prefetches via hardware, so taking some L1 misses is a given. You can try to do software prefetching, of course. Hitting up the main memory? 300+ cycles. Those are the numbers seen in practice. On paper, Intel's numbers are about half of that, and it probably varies between devices.

You have 8 gigabytes of GDDR5 RAM at 5GT/s. Want ECC? You get ECC that you can turn off and on... Except it will eat up 1/30th of your RAM and a quarter of your bandwidth if you turn it on. Not quite like ECC on the desktop.

Which leads me to storage. There isn't any. How do you boot that thing, then? Well, you get a staged bootloader design. Code in ROM boots code in the tiny flash. Flash bootloader signals that the Phi is ready to the host. The host then uploads a Linux distribution to the Phi's RAM and off it all goes. You can store files via NFS or upload them from the host as needed. You can open a serial connection or ssh into your Phi.

Something had to go and it is binary compatibility. Sorry, you don't get to run Dune 2 on this. You need to recompile everything in order for it to run. Sure, you can use the bundled qemu if you cannot recompile, but it won't run fast. This recompiling is the first big pain of the Phi. For performance you need to use Intel's C compiler. Sure, there is a GCC port, but it won't use the beefy vector units, won't do the software prefetching and won't be as optimized. This leaves grad students like me in a bit of a bind. I cannot legally run ICC for free, as I do get paid for this work. However, spending 700$ on the barebones compiler set is a bit rich for me when I earn 3000$ a term. There isn't a free Fortran compiler for the Phi at all. There also isn't a simulator, so you cannot test your code once you get it compiled unless you have the actual card.

It will be a fun experiment for me. Everyone loves floating point, but I only require some nice integer performance on the cheap. (Yes, I know I'm doing it wrong, shut up!) The Phi should be roughly equivalent in throughput to 57*2/(7.5*(3.3/1.1)*1.15) ≈ 4.4, so call it 4 Haswell cores at 3.3GHz, provided that it is fed properly. Even for 150$ this is amazing, considering you get the RAM too. And even if it isn't... Well, gotta take fun from something, right? And if running a 57 core, 300W monster won't do it, I don't know what will.
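The back-of-the-envelope estimate above, written out so the individual factors are visible. The variable names are my own labels: the post only gives the raw numbers, so reading the 2 as "instructions issued per cycle" is a guess, and the 7.5 and 1.15 factors are reproduced as-is without knowing exactly what they stand for.

```python
# All numbers come straight from the estimate in the post; the names are guesses.
phi_cores = 57
phi_issue_per_cycle = 2          # assumed meaning of the "2"
haswell_per_clock_factor = 7.5   # unexplained in the post, kept as given
clock_ratio = 3.3 / 1.1          # Haswell clock over Phi clock
fudge_factor = 1.15              # unexplained in the post, kept as given

haswell_core_equivalents = (phi_cores * phi_issue_per_cycle) / (
    haswell_per_clock_factor * clock_ratio * fudge_factor
)

print(f"{haswell_core_equivalents:.2f} Haswell cores at 3.3GHz")
```

This prints a value of about 4.4, which is where the "roughly 4 Haswell cores" figure comes from.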

Now if only it would get here already, so I could dhrystone, coremark and STREAM it... I will post results here as I get them.

No Gravitas fucked around with this message at 21:10 on Nov 1, 2014

No Gravitas
Jun 12, 2013

by FactsAreUseless

PCjr sidecar posted:

1) RAM is a bit more than 5 GT/s; ~200 GB/s on the 3xxx, but you have to spread that out across all of the controllers to get anywhere near that.
2) Talk to your local Intel sales rep and see if he can help you with a license; see if he can get you a VTune license also. If you have a local/regional HPC center that has access to recent Intel compilers they may have the cross-compilation tools; you may be able to compile there and copy binaries to your desktop. If you're in the US, try to get on XSEDE; your campus champion can get you a startup allocation fairly easily, which will get you access to TACC's Stampede cluster (and Phi tools.)
3) If you haven't seen it yet, https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss
4) The STREAM benchmark results site has specific tuning/compilation examples for the Phi.

2) Nah. I don't like to involve Team Blue any more than I need to. What I'm planning to do is a terrible abuse of the Phi's capabilities. I'm basically Aperture Science in real life. I can't help it, I just like trying crazy things and making them work. I can imagine how the conversation with Intel would go.


:science: Hi, I'm a poor grad student doing work at a university. Can I have a compiler for your 2000$ device that was on sale? I will earn some money with it by doing integer-intensive computations totally unsuited to the Phi!
:ohdear: Please talk to your datacenter manager and ask him to buy you a licence.
:science: I don't have a datacenter. I'm running it all by myself. It's a blast.
:ohdear: How the hell are you running it without a datacenter?
:science: In my desktop, I have a server motherboard.
:ohdear: How in the world are you cooling it?
:science: Oh, I have a quiet fan, duct tape and some ductwork made from card stock.
:ohdear: :stonk: :stonk: :stonk:
:ohdear: Can we hire you? We need enthusiastic/excited madmen like you who push the envelope of sanity in the name of being cheap. We can put you on our circuit testing. If you cannot break something, nothing will.

Yeah, I think I will just run with the GPL stuff they released. Which reminds me: No gprof either, but you do get gdb.


3) Yup. Already using it. I have hello world ready and waiting to run. The Phi will get here in a couple of weeks at the earliest. For now I'm working on cutting down Octave to build without Fortran... Loooong story.


Ika posted:

The only annoying thing is that most boards don't (explicitly) support it unless you already are running something like an E5.

I'm running on a Xeon E3-1226 v3, and a server motherboard that has all the right BIOS thingies to support it. Fingers crossed! If it does not work, I can always sell it or keep as a cool doorstop.

No Gravitas
Jun 12, 2013

by FactsAreUseless
Heh. If only it were true that they'd send me hardware too. I mean, Knights Landing is coming out soon, right? Atom is surely better isoclock than Pentium, right? Maybe it even takes unmodified binaries too? Please?

As much as I'd love to get stuff, I won't be asking for it. 1) I will earn money on this. 2) I'm abusing the Phi. 3) It isn't even needed to use the Phi. 4) I'm not even gaining that much by using the Phi, if my estimates hold. I'm doubling my performance at most.

Aaaaand... It shipped. 2+ kg package. Lovely. I think the heatsink alone must be 1kg. Assuming I manage to get it out of the USA somehow it will be here by late November. I cannot imagine how much the forwarding of that lovely brick will be. By the time I get it I hope to have Octave cut down to make it all work.

What have I done. This is insane.

No Gravitas
Jun 12, 2013

by FactsAreUseless

Professor Science posted:

you have piqued my interest, why are you doing this

My boss is too cheap to use MATLAB, has me using Octave instead.

We don't really need MATLAB/Octave. No math functions are actually used. Pure scripting.

poo poo runs too slowly, no surprise. The code is effectively written as if it were C, just happens to be in Octave. This will never run fast. We have 2-3 kloc of this. I wrote a fair chunk of it too. It is the best documented code I have ever worked with in my life. It is beautiful. All variables have good names. There are regression tests. Power-on self-tests even. Abstractions are great. Consistent style... and beyond horrid choice of programming language for the task.

As you can see, the project is quite absurd. I gave in and embraced the insanity. I thought that since the Phi is on sale, maybe I should just play along and propose something absurd right back. I love hardware and they won't hear about anything other than Octave. I proposed a 50-50 split on the costs of the Phi to (maybe / at most) double the total processing speed. I get to keep the Phi once we are done. I pay for the Phi, they pay for the electricity and the power supply. To my horror I got a yes.

Nevermind that for that money you could probably buy MATLAB, but well... I don't even care anymore. So cold.

There is no free fortran compiler and since we are only scripting... I'm slowly filing away all the fortran bits that will never get used...

:stonk:

EDIT: I forgot the best part. The abuse of the Phi that I keep talking about? Not running a single application with 100 threads, oh no. I will be running ~100 single-threaded instances of Octave on the Phi. The boss is always right. The boss is always right. The boss is always right. Why does it hurt so much that the boss is always right. I begged and pleaded. I asked for MATLAB. For C. For anything. The boss is always right. I just mention that we could buy the Phi for a tiny boost and she just home ran with it to death.

JawnV6 posted:

You're making up a world that doesn't exist and crippling your project because of it.

You know what... You have a point. I will try. Can't hurt too much.

No Gravitas fucked around with this message at 23:07 on Nov 2, 2014

No Gravitas
Jun 12, 2013

by FactsAreUseless

Ika posted:

Matlab is something like 1k or 2k for the base suite, and if you want to do anything other than multiply vectors / matrices and maybe basic functions on matrices you need to buy the correct addin toolbox for another 500.

Well, throwing the 100$ Phi at the problem makes a bit more sense then.

Still not as much sense as just porting over to Python, C, Julia, Java, etc...

No Gravitas
Jun 12, 2013

by FactsAreUseless

GokieKS posted:

Ugh, seems like Dragon Age: Inquisition just flat out refuses to work on a dual-core machine regardless of how fast those cores may be.

From what little I looked at, it seems something like DRM takes up an entire core or something.

Ugh.

You don't want X99, Haswell-E or (for now) a 5XXX CPU. Those are for people who think big model numbers/big price = better. For games they are worse than Z97. Don't fall for lovely misleading Intel marketing.

No Gravitas
Jun 12, 2013

by FactsAreUseless

movax posted:

I've got a problem that I'm pretty sure I'm over-complicating because of my background. I have a system that operates on a 10Hz SYNC signal that is distributed throughout the system and to various nodes. I have a x86 box running Linux that acts as a tester that needs to consume the 10Hz SYNC as interrupts to Linux to synchronize timestamps, etc.

I'm so broken that my easiest solution is to throw in a PCIe FPGA devkit and have it issue MSIs at a 10Hz rate to the kernel, since that's pretty simple. The machines are new enough that any legacy I/O doesn't even exist on the mobo as a header from the SuperIO, it's PCIe add-in card or bust. Am I forgetting any other braindead simple ways to wire a signal to the 8259-esque interrupt controller in the PCH?

I think most USB devices that expose GPIOs would have to poll on the interface. I was also entertaining the thought of having the 10Hz signal cycle the SMBus ALERT pin.

e: asking here because this is sort of generic x86 chat and there are plenty of Intel lurkers

Maybe I'm an idiot, but if you have a USB device that can tell you how many SYNCs hit since your last check and that can tell you how far back in time each of those was... then you can figure out the timestamps x86 side? Maybe?

EDIT for clarity: Any microcontroller with a USB-UART should do this.
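A rough sketch of the host-side arithmetic, assuming a hypothetical microcontroller that answers each poll with the age of every SYNC edge it has seen since the last poll. The function name, the report format and the optional latency correction are all made up for illustration; nothing here is a real device protocol.

```python
def sync_timestamps(poll_time_s, edge_ages_s, transfer_latency_s=0.0):
    """Reconstruct host-side timestamps for SYNC edges.

    poll_time_s        -- host clock reading when the reply arrived
    edge_ages_s        -- per edge: how long before the reply it happened,
                          as measured by the (hypothetical) microcontroller
    transfer_latency_s -- measured one-way transfer delay, if known
    """
    reply_sent_at = poll_time_s - transfer_latency_s
    return sorted(reply_sent_at - age for age in edge_ages_s)


# Reply arrives at t=10.0 s reporting two edges, 0.25 s and 0.5 s old.
print(sync_timestamps(10.0, [0.25, 0.5]))
```

The accuracy is then limited by how well the transfer latency is known and how stable it is, not by how often the host polls.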

Ugh. I am an idiot, probably best to ignore me, but I just could not refuse a stab at this riddle.

No Gravitas fucked around with this message at 05:32 on Nov 20, 2014

No Gravitas
Jun 12, 2013

by FactsAreUseless

JawnV6 posted:

Yeah, true. Still, you're essentially counting off 100ms chunks, so even with a poll interval of 16ms you're not too far off? 100ms seems like eons, how much jitter can you tolerate there?

C# has spoiled me, I have nice DataReceived events that act enough like interrupts that I wasn't thinking about the USB device not having that capability.

100ms is eons.

USB does have an interrupt mode for quick responses, seems pretty speedy. Again, a microcontroller can do this for you, although not as neatly as going via USB-UART. You can measure the latencies you get and do a poll after 100ms - latency. This should get you there, repeat the poll if you are too early. Maybe poll such that you should be 2ms late.

Could also plug your SYNC into the microphone jack and try to do stuff with that, I guess. Then there is PS2, if you have the luxury...

No Gravitas
Jun 12, 2013

by FactsAreUseless
My Xeon Phi finally arrived.

Gotta love how they packaged it. The packaging is for 4 Xeons, but only one slot was filled. I guess they really expect you to buy in bulk. No instructions of any kind included, just the bare unit in an antistatic bag.

I'm busy today. Tomorrow I will try to run it, my mighty Noctua NF-R8 PWM fan providing the cooling.

No Gravitas
Jun 12, 2013

by FactsAreUseless

Chuu posted:

Is this the 31S1P? I've been meaning to get on that promo, but I couldn't figure out how to keep it cool. How exactly do you have your cooling setup?

Yup, 31S1P.

I will have the fan strapped directly to the card.

I looked at the datasheets. You don't need that much airflow if your air is room temperature, as opposed to having datacenter quality air to work with. This requirement also applies for cooling a device going at full-clip, which I won't be doing. Integer code only most of the time.

I have just a tiny bit of headroom with the fan I have, even though the fan is twice as tall as the card and thus I'm only counting on getting half of the airflow that I'd be getting otherwise. I went with a radiator/CPU cooler fan. Those have pretty decent air pressures, something that should help push the air through the Phi.

I will keep you guys posted on how that works out, keeping in mind that I'm not touching the vector units.

No Gravitas
Jun 12, 2013

by FactsAreUseless

Chuu posted:

I'd love a picture if possible. I thought you'd need a blower setup considering max TDP is 220W.

That being said, the first thing I was going to do was build the Intel MKL BLAS library and see what performance gains I could get in R. That would probably get it near max TDP.

270W, actually. 300W has been cited in some places too.

Considering I only have GCC to work with here, I won't be building any BLAS stuff. Coremark, dhrystone, etc... If you have any binaries to send me, I'm happy to run them and watch the temperatures. I'm not the usual Xeon Phi customer, you see.

For me it was a choice: Buy a second computer for 600-700$ or the Phi, a fan and a new power supply for 300$ total, with my work paying for part of it. (I also get to play with the Phi and get experience rigging up crazy cooling systems.) The performance of both of those options is about equal on my integer-only load without any vector instructions.

There will be pictures.

No Gravitas
Jun 12, 2013

by FactsAreUseless

Chuu posted:

Since you mentioned GCC, can I assume linux?

I don't have any binaries handy, for me step 1 was to get the card, step 2 was to figure out how to build the BLAS libraries. That being said, linking R vs. a custom BLAS library is really trivial in Linux -- there are a lot of tutorials out there on how to do it, and most linux distros have several BLAS libraries in the default repo to download. Given my experience with Intel libs I was hoping it would be easy to build MKL targeting a Xeon Phi, and then just use the generated object files.

I'm running Arch Linux, but the host computer is irrelevant. The co-processor runs Linux on an architecture that is somewhat like the original Pentium, except 64-bit and with a seriously insanely great vector unit bolted to it. Oh, and it is a barrel processor too, so you need 2 threads per hardware core for full utilization. Binaries have to be custom compiled for this, you cannot just throw x86 code onto the hardware. Both GCC and the Intel compiler can do this. Buuuut...

You need the Intel compiler for the vector instructions in the Phi. I don't have one, and I don't intend to get one as my use case does not call for vector instructions. With GCC you can only compile x87 floating point code and integer code. I really doubt this will light up the Phi.

I'm doing this because it is a way-outside-of-the-box solution to my problem which just happened to be cheaper and more interesting. It also is insane, which fits my usual way of life.

If you get me ready-made binaries, I will gladly run them and let you know how my cooling solution holds up.

No Gravitas fucked around with this message at 04:34 on Dec 10, 2014

No Gravitas
Jun 12, 2013

by FactsAreUseless
Well, it did not work! Even mad science has failures somehow.

I figured out why too: I fell into the trap of marketing.

My chosen fan does 33+ CFM. I need 18. My fan gives me 1.5 mmH2O of pressure, more than enough according to the datasheet. I'm fine then, yes?

No.

This fan can do 33CFM and 1.5 mmH2O in ideal conditions. However those are two very different circumstances which will never happen at the same time.

See, pushing air through a 30cm radiator block isn't ideal. CFM drops waaaay, waaaay down when it meets opposing pressure. Fans are advertised with CFM which they can get at no pressure loss and at pressure which prevents all air flow. You aren't getting both the CFM and the pressure. The actual performance will have less than ideal CFM at less than ideal pressure. What is advertised is just two points along a curve specific to each fan. You need that curve to know what the hell is going on at your pressure drop.
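To put a number on "two points along a curve", here is a crude sketch that pretends the fan curve is a straight line between the two advertised points (1.5 mmH2O at zero flow, 33 CFM at zero pressure). Real fan curves are not straight lines, so this is only a ballpark, and the function name is made up for illustration.

```python
def linear_fan_pressure(flow_cfm, max_flow_cfm, max_pressure_mmh2o):
    """Static pressure a fan can hold at a given flow, assuming (crudely)
    a straight line between its two advertised points."""
    if not 0 <= flow_cfm <= max_flow_cfm:
        raise ValueError("flow must be between 0 and the free-air maximum")
    return max_pressure_mmh2o * (1 - flow_cfm / max_flow_cfm)


# Advertised: 33 CFM free air, 1.5 mmH2O at zero flow. Needed: 18 CFM.
available = linear_fan_pressure(18, 33, 1.5)
print(f"about {available:.2f} mmH2O left at 18 CFM")
```

Under that straight-line assumption the fan has well under half of its advertised pressure left at the 18 CFM the card needs. Whether that is enough depends on the pressure drop the datasheet lists for that airflow.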

The Xeon Phi's pressure and airflow requirements are in the datasheet. Different airflows have different pressure drops too. It really is easier to keep the beast cool with cold air than with 45C air, but a single Noctua won't do it. In my case only half of the fan actually blows into the card too, so that does not help.

It does make a difference to have a fan or not, even when the Phi is on idle. 3 minutes to a thermal shutdown vs 12 is a big step forward.

Tomorrow I duct my 75CFM, 2000RPM, 120mm case fan straight into the Phi as a push fan, with the Noctua doing pull. Considering even the Noctua at half-fan is an improvement, 2000RPM ducted right through this should do better. And I'm keeping the Noctua around to help, it won't hurt. I don't need the case fan anyway, it is very airy.

And when it comes to motherboards: Mine supported the Phi without any issues. Supermicro X10SLM-F, I believe.

I'm enjoying this adventure so far. Wonder where it will take me...

No Gravitas
Jun 12, 2013

by FactsAreUseless

redstormpopcorn posted:

Find someone to 3D-print a 120mm fan duct set and I will send you two slightly-used Ultra Kazes to push-pull on that fucker.

Two words: Card stock.

Krailor posted:

I'm with Chuu, instead of trying to duct a case fan in there you should look at trying to attach a blower to the front. Something like this: http://www.xoxide.com/evercool-fox2.html

Just take the PCI bracket off and figure out some way to attach it.

Good backup position. I'm trying with the fan because I already have it and I'm not in the mood to go to the store until I go through the less sane options first.

No Gravitas
Jun 12, 2013

by FactsAreUseless
Given a great exhaust fan and a lovely intake fan with a duct, I have the drat beast stable at 88C on idle, at least for now. I might flip the fans around a bit, see if that makes a difference.

Oh, yeah. It idles at 120W. Yeah... There are low power modes, but not on by default. Next goal for me, I think.

Clearly, I need a more powerful fan on the intake side, but things are looking mighty good considering where I was a few days back.

Now to start having fun with it...

EDIT: Low power mode on. Yup, now it idles at 68C and 60W. Yeah, don't want to use it full-throttle like this without a great intake fan. Only the CPU is hot, the rest of the circuit board idles at 50C at most, usually around 40C.

pre:
mic0 (info):
   Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
   Device ID: ............... 0x225e
   Number of Cores: ......... 57
   OS Version: .............. 2.6.38.8+mpss3.4.1
   Flash Version: ........... 2.1.02.0381
   Driver Version: .......... 3.4.1-1
   Stepping: ................ 0x3
   Substepping: ............. 0x0

mic0 (temp):
   Cpu Temp: ................ 69.00 C
   Memory Temp: ............. 44.00 C
   Fan-In Temp: ............. 31.00 C
   Fan-Out Temp: ............ 45.00 C
   Core Rail Temp: .......... 42.00 C
   Uncore Rail Temp: ........ 39.00 C
   Memory Rail Temp: ........ 39.00 C

mic0 (freq):
   Core Frequency: .......... 1.10 GHz
   Total Power: ............. 58.00 Watts
   Low Power Limit: ......... 283.00 Watts
   High Power Limit: ........ 337.00 Watts
   Physical Power Limit: .... 357.00 Watts

EDIT2: The very first "benchmark" is in!
bogomips : 2206.63
(Yes, I know this means nothing.)

Next up dhrystone and coremark. But that is tomorrow, I guess.

EDIT3:
Dhrystone is here! Done with gcc, latest on Arch Linux for the host and the one that Intel gives you for the Phi.

Xeon Phi 31S1P:
Dhrystones per Second: 714285.7

Xeon E3 1226 v3:
Dhrystones per Second: 14925373.0

Take whatever integer-based program result you have, divide by 21, get the Phi speed. (Yes, I made sure it isn't throttling.)

Now to turn on compiler optimizations.

Xeon Phi 31S1P:
Dhrystones per Second: 746268.7

Xeon E3 1226 v3:
Dhrystones per Second: 41666668.0

Ummm... WHAT?

I'm so lucky that I have 57 CPUs, because things now run 55 times slower. Looks like a single core of my E3 1226 and the entire Phi are about the same power on dhrystone when you take optimizations into account. Then the host has 3 more cores and all the Phi has to offer at that point is multi-threading. I expect the Phi to maybe give me 3/4 of the power that the host CPU gives me. Maybe a bit more.
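For anyone checking the ratios, here they are computed from the four Dhrystone numbers posted above. Nothing is measured here, it is just division; the dictionary layout and names are my own.

```python
# Dhrystones per second, copied from the results above.
dhrystone = {
    "unoptimized": {"phi": 714285.7, "host": 14925373.0},
    "optimized": {"phi": 746268.7, "host": 41666668.0},
}

# How many times slower one Phi core is than one host core.
slowdown = {
    build: scores["host"] / scores["phi"] for build, scores in dhrystone.items()
}

for build, factor in slowdown.items():
    print(f"{build}: one Phi core is about {factor:.1f}x slower than one host core")
```

That gives roughly 21x without optimizations and roughly 56x with them, so turning on optimizations helped the host far more than it helped the Phi.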

This is poo poo of the highest caliber. I'm not sure if the version of gcc Intel gives you cannot optimize for poo poo (this isn't about just the extra instructions, but also about instruction scheduling, evidence of which I don't see in the disassembly [but I'm pretty tired and it is 2am here]) or if the Phi is just hard to optimize for. Either way: Meh.

One fan, some card stock... About 200$ cost to me, total, one future extra fan purchase included. Considering the Phi will likely perform about as well as a second Xeon E3 1226 v3... Eh, fair enough of a deal. I saved on a motherboard, RAM, case, my job paid for the new power supply... Not bad. I'll take it.

You cannot beat the fun factor of setting it up and playing with it though.


EDIT4: Memory bound benchmark: Stream. Optimizations on.

pre:
Phi:
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1352.9     0.118420     0.118266     0.118489
Scale:            990.9     0.161569     0.161463     0.161766
Add:             1264.2     0.189998     0.189849     0.190197
Triad:           1107.1     0.216889     0.216785     0.217044

Host:
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           13263.9     0.012128     0.012063     0.012372
Scale:          13400.3     0.011981     0.011940     0.012097
Add:            14874.5     0.016169     0.016135     0.016326
Triad:          14729.3     0.016395     0.016294     0.016647

Not bad. Wonder why Scale ran so much slower, but eh... Time to sleep.

No Gravitas fucked around with this message at 11:29 on Dec 15, 2014

No Gravitas
Jun 12, 2013

by FactsAreUseless
Switched the fans around, the good fan being the push fan now with the wimpy fan on exhaust. Much better. A better fan will arrive in a week, but for now I guess I'm set for some trial calculations.

Also ran a C50 classification, just to get a feel for real-world performance. 20 times slower on the Phi. Perfectly acceptable if this is the case on my software.

Yay!

...

Well, I'm sure no one cares, so I will shut up about my Phi now. Yay!

No Gravitas
Jun 12, 2013

by FactsAreUseless

r0ck0 posted:

Why are you happy with 20 times slower? What is the advantage?

The scaling.

My host has 4 cores, one thread each. If the Phi was going on a single core it would be 80x slower than the host, assuming the host scales perfectly.

But the Phi has 57 cores, each with 4 hardware threads. My problem scales wonderfully and with 200 instances in parallel I get about 180x speedup over a single instance. Suddenly I'm 2.25 times as fast as the host, and I can still use the host. And I still have a few cores left on the Phi in case I want to do something else...
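The same arithmetic spelled out, using only the numbers above: 20x slower per instance, 4 host cores, and about 180x speedup from 200 parallel instances. Throughput is expressed in units of "one host core", and perfect scaling on the host is assumed, as stated. The variable names are mine.

```python
per_instance_slowdown = 20   # one Phi instance vs one host core (the C50 run)
host_cores = 4               # assumed to scale perfectly
phi_parallel_speedup = 180   # observed with ~200 instances in parallel

host_throughput = host_cores * 1.0
phi_throughput = phi_parallel_speedup / per_instance_slowdown

print("Phi vs whole host:", phi_throughput / host_throughput)
print("Single Phi instance vs whole host:", per_instance_slowdown * host_cores, "x slower")
```

So the fully loaded Phi does the work of 9 host cores, which is 2.25 times the whole 4-core host.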

All this for the cost of two fans and the Phi, which I got on a severe discount. Truly a steal compared to buying two computers and a bit.

Oh, and the fun of running the beast. And the joy of being able to put it on my resume (and I honestly do desperately want to work for Intel! I love them!). And the DIY factor.

FormatAmerica posted:

It might burn your house/apartment/workplace down when the index cards fan ducts catch fire :laugh:

It won't. The CPU runs hot, sure. Nothing else is even remotely hot in there though. The paper bit touches the intake end, which is kinda a long rail of heatsinks without any components that generate heat. The temperatures there don't exceed 40C. They might reach more once I get better airflow, but I'm not worried about paper combusting at those temperatures. (EDIT: I'm running the air in reverse, so the Phi's intake is my exhaust and vice versa. It works better that way.)

cycleback posted:

Have you read of anyone using the Phi with an i7-5820k on an X99 motherboard? I have a problem that is trivially parallelizable that might work well with the Phi, though I am wary of getting bogged down with it.

The Phi is finicky. You need great airflow or at least good and cold airflow, a motherboard that can support it, a good power supply, and some means of physically supporting the weight if you have a tower computer. Some people say you need a proper CPU too, although I think it should run fine with any. You also probably want to run Linux on the host and will be running Linux on the Phi. I hope your host Linux is one of the two supported ones, or you are in for a treat trying to get your Phi to work. I had to do some trivial kernel-module hacking to get the mic module to work. Your computer will likely require some modding to provide cooling, I had to cut some holes for the fan to blow through nicely. A hammer was needed to dent some things to make a fan fit into a place which is not supposed to have a fan. Then you need to recompile everything you want to run, and if you aren't running vectorized code then you aren't getting the most from your Phi. For vectorized code you need to get the Intel compiler. Then your instances must all fit under 8GB both in RAM and on "disk" because the Phi has only a ramdisk and no swap.

Not for the faint of heart. I was worried even having a Xeon CPU and a server motherboard. I'm still not out of the woods on the cooling, although I'm pretty close. I can run full-scale experiments now. A single batch jumps me from 60C to 97C, just barely before the beast throttles. My load is minimally impacted by throttling, but it is a mental barrier I don't want to cross.

The sale also is not as good as it used to be. No more 80$ Phis around that I can see.

Also an educational note.
The Phi has a "ready" state, which might make you think it enters a low power mode. Nope. It spins at pretty high power. For low power you want to boot it up into an OS and make sure you have enabled sleep modes. And then leave it alone. Any action will push it up to 110+W immediately (this includes micsmc commands!). Leave the beast be and it will only take 60W.

No Gravitas fucked around with this message at 08:07 on Dec 16, 2014

No Gravitas
Jun 12, 2013

by FactsAreUseless

Chuu posted:

No Gravitas: I've been super busy and will see if I can get you some sort of binary that natively targets the Phi, but it's not looking very likely. I'm actually really surprised Intel doesn't have a LINPACK binary that you can just download off their site.

Just tell me what to point curl/wget at, and I gladly will. I'm lazy.


Yes, NFS does work too, if you can get away with the speed hit.

Stream: I was only comparing with the host and only using the compiler that I have, not trying to get a great result. Sure, nice to see what the Phi can do with a good setup and with a problem that fits. For me the only thing that could make a difference is maybe better instruction scheduling to use both the U and V pipes whenever possible. I keep meaning to look at disassemblies to see if GCC does this, but... :effort:

For 80$ the Phi is worth it for pretty much anything that runs a whole bunch of threads, if you don't mind getting it running as a fun project/adventure. Cheaper than getting a second computer. For 200$? If you have a good use case for it and can use the vector units, sure. For more $ than that it is indeed mediocre. I do count the electricity in, but I'm in :canada:, so at least that isn't much of a worry.

I'm kinda low energy lately. I need that daylight lamp to light up my eyes... So lazy without it. So :effort:.

Let me get those pictures taken, prepare to :barf:...

No Gravitas
Jun 12, 2013

by FactsAreUseless
Pictures taken. Gotta import. Sheesh. I'm really having a poo poo effort day though and probably won't get it done before tomorrow.

Mr Chips posted:

Any chance you could run MrBayes in MPI mode on the Xeon and the Phi for comparison?
I can help out with setting up a test workload.

With pleasure. EDIT: I'm on Arch Linux, 64-bit, Xeon E3-1226 v3, 16GB ECC RAM at 1600 CL11. Also 1TB of normal disk, no SSD.

Keep in mind, as it is right now we are surely going to hit throttling in about three minutes unless you are using only half the chip or something. I need more fan. If this does not work, I will get still more fan. There can never be enough and the Phi is just too... eh... cool... not to use.

No Gravitas
Jun 12, 2013

by FactsAreUseless

Mr Chips posted:

cool, i'll try and come up with a recipe for you. Presumably the Intel compiler kit for the Phi includes mpicc and mpirun? I've only ever used the openMPI toolkit for this, to build x86_64 binaries.

edit: going by this: https://software.intel.com/en-us/articles/how-to-run-intel-mpi-on-xeon-phi, it doesn't look like too much of a deviation from what I've done in the past..

I do not have the Intel compiler available. I only have the MPSS GCC installed, in addition to the host's GCC compiler.

No Gravitas
Jun 12, 2013

by FactsAreUseless

Mr Chips posted:

Ahhh...it might be a bit of a wild goose chase if we don't have the Intel MPI dev tools for Phi.

Yup, I think so. Sadly, those don't grow on trees and I already tried to source them fruitlessly twice in the past few weeks. (I will earn money on the Phi, so I cannot just get a free/student license. My school was not helpful either.)

If you do get a binary generated, I will be more than happy to take it for a spin and see how well it runs.

No Gravitas
Jun 12, 2013

by FactsAreUseless

Mr Chips posted:

no worries, will see if I can get something from Intel, but I'm not working with the HPC team any more

If/when you have stuff for me to run, I'm firefly@gmx.ca.

EDIT: And here is my lovely, lovely setup. :barf: territory has been reached.

http://imgur.com/a/ieOfU

Ugh, that carpet alone is vomit-inducing.

EDIT2: I forgot to add to the pictures: The side of the computer is actually closed when I run it. Only the front is popped off because it would really restrict the airflow otherwise. Instant 2C difference and a bit more over time.

No Gravitas fucked around with this message at 05:20 on Dec 19, 2014

No Gravitas
Jun 12, 2013

by FactsAreUseless
The second Noctua together with a better duct did it. I'm running very hot, but no longer throttling at full load (no Intel compiler, so no vector units, however much of a difference that would make).

I'm still going to buy a 80mm PWM fan for massive cooling in case some other load will need it, but for now I'm running fine.

No Gravitas
Jun 12, 2013

by FactsAreUseless
Would it run on a virtual machine with 4 virtual cores backed by 2 physical cores? I mean, it won't run well, but does it run in any capacity?

No Gravitas
Jun 12, 2013

by FactsAreUseless

horriblePencilist posted:

My Intel i5 is seriously underperforming. I did some benchmarks, and this was the result:



Not sure what's wrong, since I've made sure I have the latest drivers.

Maybe thermal throttling?

No Gravitas
Jun 12, 2013

by FactsAreUseless

go3 posted:

basically anything to do with power is 'you get what you paid for'

At this point I have to post this delightful review of a 20$, 500W power supply.

http://www.jonnyguru.com/modules.php?name=NDReviews&op=Story&reid=324

No Gravitas
Jun 12, 2013

by FactsAreUseless

Mr Chips posted:

what do you mean? A 'desktop' DIMM is an unbuffered DIMM

Maybe unbuffered memory which is ECC?

No Gravitas
Jun 12, 2013

by FactsAreUseless

WobblySausage posted:

Are these temperatures safe?



Are you trying to boil water?

If not, then it seems a bit high.

No Gravitas
Jun 12, 2013

by FactsAreUseless
May I ask, what bugs exactly?

No Gravitas
Jun 12, 2013

by FactsAreUseless

Grundulum posted:

I have a dumb question about the MIC architecture (Xeon Phi).

Are these devices like a single 60-odd core CPU, in that each core can operate independently, or are they closer to a GPU in that all cores execute the same instruction, just on different data? I see Phis called SIMD, which suggests the latter, but in that case I can't understand why they're different from GPGPUs.

When I try to search for this on Google, all I get are benchmark tests.

All the cores are independent.
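A toy Python illustration of the distinction being asked about, not a model of the actual hardware. In the lockstep (GPU-style) case one operation is applied to every lane; in the independent case every worker runs its own function with its own branches. As noted earlier in the thread, each Phi core also has a wide vector unit bolted on, which is presumably where the "SIMD" label comes from. The function names are made up, and Python threads stand in for cores only to show separate control flow.

```python
from concurrent.futures import ThreadPoolExecutor


def lockstep(op, lanes):
    """SIMD-style: a single instruction stream applied to every lane's data."""
    return [op(x) for x in lanes]


def collatz_steps(n):
    """Branchy work: number of Collatz steps until n reaches 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def digit_sum(n):
    """Unrelated work: sum of the decimal digits of n."""
    return sum(int(d) for d in str(n))


def independent(jobs):
    """MIMD-style: each worker runs its own function on its own argument."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(fn, arg) for fn, arg in jobs]
        return [f.result() for f in futures]


print(lockstep(lambda x: x * x, [1, 2, 3, 4]))
print(independent([(collatz_steps, 27), (digit_sum, 2014), (abs, -5)]))
```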

No Gravitas
Jun 12, 2013

by FactsAreUseless

BurritoJustice posted:

That was "No Gravitas" if I remember correctly. It was a cool series of posts.

Yup, it was me. Give me something insane to do, and I will.

Durinia posted:

Yeah, you can "compile and go", but you'll get complete rear end-level performance.

KNF and KNC were mostly experiments. KNL is being pushed by Intel as the first real focused HPC implementation as a product.

Of course, they also said that about KNC, so...

The performance sucked with gcc, as only the Intel compiler uses the wide execution units. About a 20? 30? 50? times slowdown compared to a Haswell Xeon processor, if I recall (single core against single core). If I booked up the whole Phi (and remember you need twice as many threads as cores!) I would have ended up on par with my server processor. Not a bad deal for 100$, or however much it cost me to get it, just not what I dreamed of.
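The "on par" estimate above can be played with using a very crude throughput model, sketched here in Python. Everything in it is an assumption for illustration: the model itself, the 4 host cores, the 57 Phi cores mentioned earlier in the thread, and the `threads_gain` factor standing in for whatever extra throughput a core gets from running more than one thread. The slowdown values are just the three guesses from the post, not measurements.

```python
def phi_vs_host(phi_cores, host_cores, per_core_slowdown, threads_gain=1.0):
    """Aggregate Phi throughput relative to the host; 1.0 means parity.

    per_core_slowdown: how many times slower one Phi thread is than one host core.
    threads_gain: hypothetical per-core speedup from running several threads per core.
    """
    if per_core_slowdown <= 0 or host_cores <= 0:
        raise ValueError("per_core_slowdown and host_cores must be positive")
    return (phi_cores * threads_gain) / (per_core_slowdown * host_cores)


# Hypothetical inputs: 57 Phi cores, 4 host cores, 2x gain from multithreading.
for slowdown in (20, 30, 50):
    ratio = phi_vs_host(57, 4, slowdown, threads_gain=2.0)
    print(f"slowdown {slowdown}x -> Phi/host throughput ratio {ratio:.2f}")
```

A ratio near 1.0 matches the "on par" recollection; how close you get depends entirely on the guessed slowdown and threading gain.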

Still an amazingly cool device, but you need the Intel compiler for doing anything serious on it, period.

Maybe use it as sorta kind of a swap, those 8GB of ram could come in handy for something...


No Gravitas
Jun 12, 2013

by FactsAreUseless

mobby_6kl posted:

Is that the performance with ICC or were you not able to test that?

GCC, never bothered with ICC in the end. Had other projects to do than to go looking for ICC. I imagine ICC would be better, yeah. My workload wouldn't be able to use the wide execution units anyway, being basically 224 instances of GNU Octave running highly branchy code with no matrix processing at all...
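Running 224 independent instances mostly comes down to dealing the work out evenly. A minimal Python sketch of that bookkeeping follows; the job list is a stand-in, `partition` is a hypothetical helper, and reading 224 as 56 cores times 4 hardware threads is an assumption rather than something stated above.

```python
def partition(jobs, workers):
    """Deal jobs round-robin into `workers` lists; sizes differ by at most one."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    buckets = [[] for _ in range(workers)]
    for i, job in enumerate(jobs):
        buckets[i % workers].append(job)
    return buckets


CORES_USED = 56        # assumption: 224 instances / 4 hardware threads per core
THREADS_PER_CORE = 4   # 4-way hyperthreading, as mentioned earlier in the thread
INSTANCES = CORES_USED * THREADS_PER_CORE

jobs = list(range(1000))  # stand-in for 1000 independent parameter sets
buckets = partition(jobs, INSTANCES)
sizes = [len(b) for b in buckets]
print(INSTANCES, min(sizes), max(sizes))
```

Each bucket would then be handed to one instance as its own input file or argument list; how you launch the instances on the card is left out here.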

Yay for academic legacy issues!
