BobHoward
Feb 13, 2012


Stanley Pain posted:

Funny anecdotal story here. I bought 8 3TB Seagate drives (0.95% defect/return rate). Fast forward about 6 months and I had 6 of the drives fail. Bad luck and all that :(

Hard drives have a common failure mode unrelated to manufacturing quality control, though, and it's often responsible for failure clusters like yours: shipping damage. Did someone drop the shipping crate? (Related: were the drives repackaged improperly by a reseller or shipper somewhere along the chain from factory to you?)

Note that the drive isn't necessarily OK if it appears to function immediately after being dropped, passes SMART tests, etc. It can take a few months for total failure to develop. The initial impact event generates small debris particles. These eventually get sucked under a flying head, damaging head and platter and generating more debris, which then goes on to do the same thing.

They put particle filters inside the drives in an effort to prevent debris failure cascades, but they're not perfect.

BobHoward
Feb 13, 2012


Alereon posted:

That's a members-only post, but most Toshiba SSDs are pretty bad. They were the only major manufacturer (without a reputation for low quality) to ship Sandforce drives with error correction disabled, they shipped them in Apple systems and had to be recalled due to failures.

Error correction disabled? Nah. Whatever source said that is exaggerating or speculating from a position of considerable ignorance. Such an SSD would corrupt data all the time, even when it was brand new with no wear. Error correction is an absolute requirement, so much so that NAND media has extra storage dedicated to it. I found a circa 2004 Micron 2Gb NAND datasheet showing 64 bytes per 2K byte page, but I'm sure the overhead is much higher on modern flash process nodes.

This extra space is distinct from the extra storage used for wear leveling and overprovisioning. Technically you could use it to store whatever you like, but if you want the NAND to be a non-lossy storage medium you'd better use at least some of the extra bytes for ECC parity. Standard SSD controllers like SandForce are designed to use most of it for ECC and the remainder for metadata (wear tracking, ID, pointers, and so forth). Toshiba could not have gained anything by turning SandForce error correction off, which is why that idea has to be a misinformed rumor.
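
To put rough numbers on that, here's a quick back-of-envelope sketch in Python. The 64-bytes-per-2KB-page figure comes from the Micron datasheet mentioned above; the ECC/metadata split is a made-up illustrative assumption, not any real controller's layout.

code:
PAGE_DATA_BYTES = 2048   # per-page data area from the circa-2004 Micron 2Gb SLC datasheet
PAGE_SPARE_BYTES = 64    # per-page spare area from the same datasheet

# Hypothetical split of the spare area (illustrative only, not a real controller's layout):
ecc_parity_bytes = 52                                  # bytes reserved for ECC parity
metadata_bytes = PAGE_SPARE_BYTES - ecc_parity_bytes   # wear tracking, LBA tags, etc.

overhead_pct = 100.0 * PAGE_SPARE_BYTES / PAGE_DATA_BYTES
print(f"spare area is {overhead_pct:.1f}% of the data payload")            # -> 3.1%
print(f"{ecc_parity_bytes} bytes for ECC parity, {metadata_bytes} for metadata per page")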

If SandForce provides the ability to tweak their ECC algorithms, I could buy Toshiba doing that. After all Toshiba is a NAND manufacturer and could customize based on intimate knowledge of their own NAND. However, IMO, the problems those Apple/Toshiba/SandForce drives have are unrelated to error correction. They're consistent with firmware bugs: the drive eventually locks up after a long period of use. SSD internal metadata structures necessarily get more complex over time as you write and overwrite data, and as the drive moves data around to level out media wear. It's possible to think you've caught all the bugs in testing environments, only to find new ones in fielded drives after they've seen a year or two of real-world usage patterns.

BobHoward
Feb 13, 2012


Alereon posted:

I'm talking about the Sandforce RAISE (Redundant Array of Independent Storage Elements) error-correction. (...) Only the brands that are typically associated with low-end discount RAM produced drives with RAISE disabled...and Toshiba, whom later recalled the drives they shipped in Macs. I think their choice with RAISE on this drive is likely to be only one of a series of choices they made that impact drive reliability.

Okay, now I get where you're coming from. But RAISE is kinda an enterprise-level reliability feature, the equivalent of RAID-5 across all the flash die in the SSD. It's supposed to save your data in the face of physical failures -- pages, blocks, or even an entire flash die crapping out. You don't really need it for ordinary error correction on a drive where the flash media isn't failing, and if Toshiba shipped lots of dodgy flash die in Apple OEM SSDs they're kinda crazy. (Apple is, last I heard, the largest single buyer of flash memory that doesn't own its own flash fab, and Toshiba has long been one of their two main suppliers. If you're Toshiba, pissing Apple off is the last thing you want to do.)

BobHoward
Feb 13, 2012

I believe Samsung just announced 840 series M.2 drives.

BobHoward
Feb 13, 2012

ReRAM isn't the same thing as the "RAM" in your computer today. It's a new type of nonvolatile memory (retains contents when power is removed).

If you go straight to the source:

http://techon.nikkeibp.co.jp/english/NEWS_EN/20120614/223032/?P=1

this hybrid MLC flash + ReRAM SSD is not as immediate or as exciting as the bit-tech article implies. It's not a real product yet; this is just a research group doing theoretical performance calculations for a proposed design.

It sounds like they're more focused on enterprise SSDs since their main concern is random write acceleration and reduction of MLC media wear. They're proposing use of a comparatively small amount of ReRAM as a buffer to absorb and group small writes so the main MLC media doesn't see as much small-block churn. That isn't far removed from what Samsung is doing in the 840 EVO today, using SLC flash as its write buffer rather than ReRAM, but ReRAM would likely result in better performance (at a higher cost).

Footnote - don't count ReRAM chickens before they've hatched. Unfortunately there have been a ton of failed revolutionary silicon memory technologies. An example related to ReRAM is chalcogenide phase change memory, aka PCM or PCRAM. Much like ReRAM it potentially solved almost all of flash memory's problems. Lots of money was thrown at R&D and even pilot production by industry giants. Unfortunately it's been looking pretty dead the last couple years, and the industry giants have gone silent about future plans. :( Hopefully ReRAM will work out better, just don't count on it.

BobHoward
Feb 13, 2012


Craptacular! posted:

The thing is, he doesn't boot into Windows, so if it's necessary to use Windows to use the included software it's not ideal (I can set it up for him and then hand the drive to him, but it would suck to have to pry open the laptop and then remove the drive every time there was a firmware update or something like that).

If the drive has Linux firmware updates available you can always put a live version of Ubuntu on a USB stick or SD card and boot from that to do an update. Ubuntu's website has instructions about what to download and how to copy it to media, the process is a little obtuse but not too bad. I recently did this to erase an Apple SSD (there's no way to issue the ATA Secure Erase command from OS X).

BobHoward
Feb 13, 2012

Just checked and Samsung's firmware page provides Mac updaters. They're actually ISO images which you must burn to a disc and then boot from (the updater is a DOS program). This obviously is not ideal since you're going to remove the DVD drive.

You might be able to run it anyway by "burning" the ISO to a flash stick or whatever instead of a CD. Using Disk Utility on a Mac the process is simple: just use the "Restore" function to write an image to a device. It might not work, but it's free to try before you buy anything; just download one of Samsung's images and see if you can get it to boot from USB. As long as it can actually boot, there's no obvious reason why the firmware update won't work once you have the SSD installed.

BobHoward
Feb 13, 2012


phongn posted:

That's a really bad idea! Windows absolutely operates under the assumption that it has a real pagefile separate from RAM. If you have enough RAM, writes should be relatively limited anyways.

Yeah, I was going to comment on that, but I know Windows paging behavior has at least historically been a bit crazy (aggressive preemptive paging when tons of memory is free, because Inscrutable Microsoft Reasons). Assuming they've fixed that in modern Windows, the 16GB RAM with a 3GB page file in a 4GB RAM disk setup should behave normally for 0 to 12GB of real memory used, degrade in performance up to 15GB, and then fall over and die thanks to running out of memory plus page file. A normal setup would have normal performance up to 16GB used and then start hitting a page file on disk. If you turned paging off it would be mostly okay up to 16GB used and then fall over and die. So there's just no point to the RAM disk thing, even if you're trying to avoid paging to an SSD. Turning paging off altogether is a better idea (which is still not a very good idea on almost all modern desktop operating systems).
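
To make the arithmetic explicit, here's a toy calculation of those three setups. It's a minimal sketch that assumes total committable memory is roughly usable RAM plus page file size, which glosses over a lot of real Windows memory-manager behavior.

code:
# Toy commit-limit arithmetic for the three setups above. All sizes in GB.
# Assumption: committable memory ~= usable RAM + page file size.

def commit_limit_gb(total_ram_gb, ramdisk_gb=0, pagefile_gb=0):
    usable_ram = total_ram_gb - ramdisk_gb   # a RAM disk carves memory away from the OS
    return usable_ram + pagefile_gb

print(commit_limit_gb(16, ramdisk_gb=4, pagefile_gb=3))  # 15 -- RAM disk setup, worse than doing nothing
print(commit_limit_gb(16, pagefile_gb=8))                # 24 -- normal page file on disk
print(commit_limit_gb(16))                               # 16 -- paging turned off entirely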

BobHoward
Feb 13, 2012


incoherent posted:

Toshiba sourced SSD for macbooks have all had serious firmware updates needed to prevent data corruption.

"Have all"? No, that's not even close to true. The Toshiba SSDs used in one model year (2012) of one Apple product line (MacBook Air) have needed one firmware update to prevent data corruption and/or failure. Apple's been using a mix of Samsung and Toshiba SSDs almost the entire time they've been shipping SSDs, and have only recently branched out to throw something else in the mix (SandForce controllers). They'd be nuts to keep going back to Toshiba if literally every time they had to issue critical firmware updates.

(Also, Apple likes to get a bit involved in requesting firmware customization and doing extensive qualification testing, for both HDDs and SSDs. I don't think they get into writing firmware themselves, but there are more than cosmetic reasons why their Samsung and Toshiba SSDs identify themselves via SMART as "APPLE SSD SMxxx" or "APPLE SSD TSxxx" instead of retaining the full Samsung or Toshiba brand name. That particular bug reflects badly on Apple, not just Toshiba -- it's something they probably ought to have caught with their own in-house qualification testing.)

BobHoward
Feb 13, 2012


Alereon posted:

Keep in mind that power demands from videocards have increased every generation, Crossfired R9 290(X) cards can easily use most of a 1200W power supply before overclocking. It's also important not to max out your power supply for power quality, noise, efficiency, and reliability reasons, so if you're buying less than a 750W power supply for a high-end system you are making a Bad Choice(tm).

This is true, but my impression is that the enthusiast PSU market is actually in trouble thanks to reduced demand for fuckoff ridiculous CF / SLI gaming systems. Same is true of most other specialty "enthusiast" components. The money's in console games and mainstream PCs, and fancier graphics in games cost more to create, so fewer and fewer game developers deliberately target their games at running well only on cutting edge gaming PCs. For quite a while CF/SLI demand was driven mostly by people wanting to play on large and/or multiple monitors, but that's melting away as single GPUs get better and better. Maybe we'll see a return to that pattern as 4K+ monitors get cheap enough to appeal to gamers.

That's why OCZ was desperate to reinvent itself by moving into the then-new SSD market. SSDs have mainstream appeal, it was a chance to break out of the enthusiast niche. Too bad they were apparently run by incompetent jerks who thought the path to glory was doing any shady thing to grab market share, and that lovely products would surely not come back to bite them in the rear end.

(of course, that formula kinda sorta worked for them on RAM and so forth, but I think people get more pissed off by losing data than anything else, and also the mainstream has increased expectations for reliability compared to the overclock-till-it-smokes crowd)

BobHoward
Feb 13, 2012

I remember OCZ's earliest SSDs as being egregious examples of JMicron controller garbage. To be fair, everyone who tried to make drives with those controllers got burned. The JMicron-supplied firmware had both performance and data integrity problems. As a RAM supplier, OCZ didn't start out being equipped to address that kind of issue in house; the problem is that they never really got good at it, either.

BobHoward
Feb 13, 2012


Alereon posted:

In something of a bombshell, The SSD Review is reporting that this and other Intel enterprise drives are based on LSI Sandforce controllers with extensive customization for Intel. The article doesn't mention Sandforce, but the SSD 730 features RAISE-like error correction, so I think it's likely that these drives are using Sandforce-based controllers.

Bonus Edit: I wonder how different these controllers are hardware-wise, I bet you could free up a lot of CPU cycles for performance consistency by disabling compression, deduplication, and some of the other Sandforce technologies. The drive is outfitted with DRAM to hold a flatter page table while traditional Sandforce controllers used SRAM, it'll be interesting to see if the Sandforce 3 controller does this as well.

I think you're jumping to conclusions on this one. Having a feature similar to RAISE doesn't mean much; the basic idea has been around at least as long as RAID 5, and even its application to silicon memory in a form similar to RAISE is at least 15 years old (IBM's ChipKill tech).

But there are more direct reasons to doubt a Sandforce link. SF-37xx controllers don't have a DRAM interface at all, and the chips themselves are in a fairly large FCBGA package with an organic substrate and exposed die, while the SSD 730 controller is a smaller, fully encapsulated plastic BGA package of some kind (may be wirebond).

e: also, I should explain the more plausible option here. Intel's main business is big expensive high performance CPUs, and their own fabs have traditionally been exquisitely tailored for that business -- which ends up making them less than economically ideal for smaller low performance chips like SSD controllers. And on the other side, LSI doesn't just make products for themselves, they also act as a manufacturer (reselling TSMC wafers instead of actually doing it in house) and offer an ASIC IP portfolio and design services. Assuming the story is true, it probably just means that Intel chose LSI as its manufacturing partner for this chip.

BobHoward
Feb 13, 2012


WattsvilleBlues posted:

Is the issue that OS X doesn't support TRIM commands? Is there a manual way to do this? No point in me Googling this, I haven't a clue about OS X.

OS X supports TRIM, but out of the box it will only use TRIM commands on Apple's OEM'ed SSDs. There's a free utility called TRIM Enabler which turns it on for any SSD.

Some of Apple's factory SSDs are 840 Pros with an Apple OEM firmware build, so an 840 series plus TRIM Enabler is a good option. That said, so is that Intel 530, and it would give the option of running without TRIM Enabler.

If your friend wants to use FileVault 2, the OS X full disk encryption system, SandForce SF-22xx controller drives like the 530 are not ideal. SF-22xx relies on being able to compress data for full performance. FileVault encrypts everything before it's sent to the SSD, and encrypted data is pretty much incompressible, so it doesn't play well with SandForce. (Though this hasn't stopped Apple from shipping some SandForce controller SSDs.)
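
If you want to see the effect for yourself, here's a minimal sketch: zlib stands in for whatever proprietary compressor SandForce actually uses (an assumption on my part), run on repetitive data versus random bytes, which look statistically the same as FileVault's output.

code:
import os
import zlib

block = b"A" * 4096                # highly compressible, like zeroed or boilerplate data
encrypted_like = os.urandom(4096)  # random bytes, statistically similar to FileVault output

for name, data in (("repetitive", block), ("encrypted-like", encrypted_like)):
    compressed = zlib.compress(data, 6)
    print(f"{name:>15}: {len(data)} -> {len(compressed)} bytes")

# Typical result: the repetitive block shrinks to a few dozen bytes, while the random
# block doesn't shrink at all -- nothing for a compressing controller to exploit.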

BobHoward
Feb 13, 2012


Seems fine to me.

By the way, firmware update chat reminds me that Intel and Samsung are good brands for Macs because these companies make at least some effort to provide firmware updater tools that can run on Macs. Samsung's are provided as downloadable ISO CD images here and a 2011 15" MacBook Pro has an internal optical drive, so that angle's covered.

BobHoward
Feb 13, 2012

Any 840 series drive including EVO should be fine

BobHoward
Feb 13, 2012


WattsvilleBlues posted:

Does the OP not recommend against the EVO for Macs? Is there a way to enable TRIM for OS X?

I feel like we're going in circles here, I posted about how to enable TRIM in OS X just above.

BobHoward
Feb 13, 2012

That SATA Express motherboard connector is even worse bullshit than usual for connectors that come out of ATA standards committees. I guess it's really really important to be able to reuse 5 dollar SATA cables that one time you need to plug an ancient drive into a SATAe port on a motherboard.

The device end got compromised by this stupidity too; the connector there has power but the motherboard end doesn't so you'll need a stupid power pigtail.

JFC guys, nobody's gonna care if you make a clean break new connector design that can only talk with legacy drives via an adapter cable.

BobHoward
Feb 13, 2012


Chuu posted:

Knowing nothing about cable design, what's the problem with it?

EDIT: The old SATA connector, I agree the new SATAe connector is an abomination.

Google image search "broken SATA connector" and be enlightened. The drive end connector isn't fully enclosed (and cannot be!) so there's no mechanical support when the cable is pulled downwards. All the torque goes right into that pathetically thin wafer of plastic which supports the metal contacts, and it's very easy to snap off.

There are other problems, but that's the worst one IMO. It's not just that it breaks easily, it's that the side which breaks is the one soldered to a piece of equipment that people tend to really care about, instead of the easily replaced cheapshit cable.

BobHoward
Feb 13, 2012


big mean giraffe posted:

What kind of idiot is leaving his case open and pulling down on cables? Doesn't excuse a weak point that catastrophic but I can't really imagine how you'd run into that problem.

This is not a good argument against robust connector design.

BobHoward
Feb 13, 2012


DrDork posted:

The whole "flimsy exposed plastic bit" wasn't a problem for the first half of SATA's life, because they were mounted on 3.5" drives where the port was fully enclosed by a supporting plastic frame. Only recently--in the quest to shove ever smaller drives into laptops, mostly--have SATA ports started to go "naked" which obviously increases the possibility that they'll break.

2.5" sata disks have been a thing as long as there's been SATA, and check it again, the port isn't fully enclosed even on 3.5" drives. To support backplane style systems where you slide drives into bays, the connector must always be located the same place relative to bottom and side mounting holes. Whoever drew up the standards for SATA drives (both 2.5" and 3.5") put the connector in a location where fully shrouding it would put shroud plastic below the mounting plane, which is not possible so nobody does it. (Without violating the spec by moving the connector, that is.)

Broken by design, always has been. SATA is a loving hack, just one which happens to mostly work reasonably well. Did you know that SATA retains literal register-level compatibility with the mid-1980s IBM PC/AT hard drive controller, just so that vendors too loving lazy to update BIOSes could keep booting ancient operating systems on modern SATA HW?

BobHoward
Feb 13, 2012


Geemer posted:

Woop, looks like you're right. I should've checked before posting.

Some operating systems delay TRIMming deleted file contents to batch up the TRIM work, which might give you a small time window in which it's still possible to recover real data. But I wouldn't expect this to work reliably beyond a day or two.

BobHoward
Feb 13, 2012


xpander posted:

I've been doing some reading on upgrading Macbook Pros with SSDs, and on the Apple forums, nearly everyone seems to recommend the Crucial M500 drives, saying Sandforce-based ones are performing terribly(notably, the EVO).

The discussions.apple.com forums are generally awful. You can sometimes get nuggets of information there but it's buried in a ton of noise. This is some of the noise. Samsung EVO series uses a Samsung controller, not Sandforce, and besides that Sandforce is actually a good choice for Macs under some circumstances.

The key question is whether you're comfortable installing TRIM Enabler to turn on TRIM for non-Apple SSDs. If you are, just go with a Samsung 840 Pro or 840 EVO. They're solid performers, reliable, and not super expensive. If you're going to run without TRIM Enabler, you want a controller known to do OK without TRIM. And guess what, that's SandForce! In that case, choose a high quality SandForce drive like the Intel 530 series.

Apple itself ships Sandforce-based SSDs. The only reason to avoid SandForce on the Mac is if you're using FileVault 2: FV2 encryption makes all data incompressible, while SF-2200 controllers rely on being able to compress data to hit their full performance potential.

There were some MBP model years which used NVidia chipsets whose SATA has issues with some SSDs, but I'm pretty sure a 2012 is not one so you shouldn't have to worry about that aspect.

BobHoward
Feb 13, 2012


Xenomorph posted:

I just figured the overall sustained transfer speeds help with gaming when loading large maps.

The IOPS stuff helps with loading smaller files, like when Windows boots or a program starts up. Lots of quick file accessing.

Some games benefit a lot from IOPS. All depends on whether the game dev put in the effort to optimize loading.

To expand on that, it's common for games in development to load data from thousands of individual files. Think one file per texture or 3D object. In order to reduce the disk footprint and number of files for distribution, the game engine will also typically support loading the exact same file hierarchy after it's been packed into an archive, which is usually just something standard like zip or Unix TAR. So the level might look like a single file, but internally it isn't and reading it can create lots of random accesses.

The games which load levels fast are the ones where the game developer (a) chose an archive format which allows rearranging the blocks inside the archive in any order and (b) used tools to do so based on the access pattern observed during loading. Doing this right transforms the level load into a sequential read, which is much faster. If they don't do this, level load can be greatly accelerated by a SSD with high IOPS.
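
Here's a minimal sketch of that reordering step, assuming a toy archive represented as a path-to-bytes dict and a recorded access log; read_archive/write_archive are hypothetical stand-ins, not any real engine's API.

code:
def reorder_archive(entries, observed_access_order):
    """Lay out archive entries in the order a profiling run saw them requested.

    entries: dict of {path: bytes}; observed_access_order: list of paths from the log.
    Returns a list of (path, bytes) to be written out sequentially as the new archive.
    """
    ordered, seen = [], set()
    for path in observed_access_order:        # assets in the order the level load asks for them
        if path in entries and path not in seen:
            ordered.append((path, entries[path]))
            seen.add(path)
    for path, blob in entries.items():        # anything never touched during the run goes last
        if path not in seen:
            ordered.append((path, blob))
    return ordered

# Usage sketch (hypothetical helpers):
# entries = read_archive("level01.pak")
# order = [line.strip() for line in open("level01_access.log")]
# write_archive("level01_optimized.pak", reorder_archive(entries, order))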

Traditionally, console games tend to be a lot better about including this optimization since it is super painful to seek a lot when loading from optical media. It sucks less when loading from HDD, so lots of games written for the PC first don't bother, but it still sucks and it's always annoying to see games thrash the HDD for 30 seconds or more when you know it could probably load in less than 5 seconds with a bit of extra effort.

BobHoward
Feb 13, 2012


Xenomorph posted:

I think many of the "native" PATA SSDs that are sold (like from OWC) are just 1.8" SATA drives with a PATA bridge chip, inside a 2.5" enclosure.

Possible, but I wouldn't be surprised if it's a single board. OWC does custom SSD printed circuit board designs (or contracts them out, dunno which). If you already have that capability it's not a big deal to create a derivative. If you're careful, with some EDA software you should be able to reuse most of the board layout work by treating a PATA 2.5" design as an ECO (engineering change order) to a 2.5" SATA design.

Whichever way they did it will have been the cheapest way once you account for one-time engineering costs and economies of scale. Just saying, alternate PCB layouts aren't a very high barrier, especially not when the carrier card option requires a PCB layout too. (And extra connectors, which aren't cheap.)

BobHoward
Feb 13, 2012

It's just a dumb Windows driver limitation.

To the host computer, this type of SATA Express SSD looks like an ordinary PCIe AHCI SATA controller with an ordinary SATA drive attached -- the whole point of this branch of SATA Express is that drivers literally don't have to change. There is no good reason why TRIM doesn't work already; it's just the standard overly conservative Windows device driver mentality at work. It's the Windows design philosophy which comically insists on always trying to find and "install" special drivers every time Windows encounters a different generic USB flash key brand, or generic USB mouse, etc. A generic thing can't possibly work the same as all other generic things which came before it!

This Samsung controller has been shipping in Apple OEM SSDs for about a year now and TRIM was enabled in OS X for these drives all along, as you'd expect.

BobHoward
Feb 13, 2012


Shaocaholica posted:

Is this some sort of anti hackintosh thing? Guh.

The most likely motivation (IMO) is improved security.

I would expect them to still support KEXTs signed by third party devs, since they can't just give up on third parties providing drivers. Think of it as being much the same reason why they've provided Gatekeeper for userland application software ever since Mountain Lion. Mechanisms which let Apple revoke signed software give users much better protection from malware, only with KEXTs there's a much stronger case for completely locking out unsigned code with no easy path for end users to bypass or disable signing (unlike Gatekeeper).

If anything I'm surprised it hasn't happened already. The signing mechanism is already in place, it's just not enforced yet.

The big question is whether they'll drop the hammer on someone willing to redistribute a modified Apple KEXT re-signed with that developer's key. Past behavior suggests they'll turn a blind eye to hobbyist activity, even Hackintosh stuff, so if required signing happens that's what we have to hope for (that and someone willing to risk $100/yr on a dev account so they can sign things).

BobHoward
Feb 13, 2012


Hadlock posted:

This is the whole Apple branded SCSI drive debacle all over again :allears: (sweet jesus how old am i??)

All over again? Apple's mass storage group has been super anal about HDD/SSD firmware the entire time.

And not entirely without reason. As is typical for the storage industry, a lot of early TRIM SSDs had hosed up buggy implementations, assisted by (as I understand it) the ATA committee not doing a super great job of specifying and clarifying TRIM's exact semantics. Apple's response was typical of their mass storage group: develop comprehensive testing to use in qualifying vendors and specific products they're going to ship, do not lift a finger applying this to anything else, and maybe just lock out problematic features on non qualified hardware if you're not sure.

It seems like the wave of total poo poo is behind the industry and it would be nice if Apple's professional buttclenchers loosened up, but that's where they're still at.

BobHoward
Feb 13, 2012

It's a silly comparison because PCIe is a packetized channel with a SERDES at each end, and a bunch of buffering for the PCIe equivalent of dealing with devices that have different MTUs and so on. It will never have latency as good as ancient async DRAM, let alone one of the newer synchronous standards.

BobHoward
Feb 13, 2012


Sir Unimaginative posted:

While Linux basically doesn't care where the OS is as long as it can do its job, Windows ~~and MacOS are~~ is bad at running off externals even if you find some way to force them to

Fixed that for you. OS X does not give a gently caress whether it's on an internal or external. You don't have to "force" anything, either. It's just like installing on an internal, and if you use Thunderbolt, USB3, or Firewire as the interface to your external, it performs quite well too. (You can get native SATA 6G SSD performance off a Thunderbolt SSD.)

BobHoward
Feb 13, 2012


Harik posted:

Speaking of 5k, was there ever a postmortem on what the gently caress that bug was? It's not a power-of-two overflow that I can see unless they were counting time in 4.2 millisecond increments or something equally "creative". So some kind of periodic sanity check that went horribly wrong because it was never tested?

Most likely, but who knows? Crucial's never going to say in public.

It doesn't have to be a power of 2 overflow, but it could be anyways because the counter could be operating off any random reference frequency rather than a nice round 1s interval or whatever.

BobHoward
Feb 13, 2012


Geemer posted:

If anything I'd be less nervous about using the drive, it lasted well beyond its rated life.

I'm used to platter drives failing completely out of the blue or with minimal warning. But with SSDs you can see it coming by looking at the SMART data as it slowly accumulates reallocated sectors.

You know that you can look at the SMART data on platter drives and see failures coming as they slowly accumulate reallocated sectors, right? Lots of HDD failures are preceded by the drive accumulating bad sectors.

BobHoward
Feb 13, 2012


Alereon posted:

Keep in mind that ECC protects against hardware failures and defects, you do not get random bit flips on a properly functioning system. The idea that you need ECC because cosmic rays flip bits randomly is an urban legend.

Urban legend?! In the chip business we frequently design in ECC protection even for on-chip SRAM memories. I assure you the reason we do so is explicitly to protect against single event upsets (aka cosmic rays). There are even crazy nutters out there who use ECC on virtually everything inside a chip, not just medium size or larger SRAMs. Building a state machine? Slap ECC on the state storage, even if it's only ten bits. (That level of SEU paranoia is mostly reserved for space hardware, where it is fully justified, or for safety-critical systems such as reactor control and medical devices.)

I realize you're talking about ECC in the context of DRAM, but even there the SECDED (Single Error Correct, Double Error Detect) ECC code used with commodity 64b data + 8b parity ECC DIMMs isn't designed to protect against hardware failures or defects. It can do so; it's just that actual failures are likely to corrupt multiple bits. SECDED is not guaranteed to detect errors if there happen to be more than 2 in a word. (Worse, it may even falsely interpret these as single-bit errors, and then perform an incorrect correction.) These limitations make it a lousy choice for anything other than SEU protection.
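
To make that failure mode concrete, here's a small self-contained sketch of an extended-Hamming SECDED code -- a toy word size, not the actual (72,64) code DIMMs use. Single-bit errors get corrected, double-bit errors get detected, and triple-bit errors tend to be misread as correctable singles, which is exactly the miscorrection problem described above.

code:
import random

def encode(data_bits):
    """Extended Hamming (SECDED) encode: parity at power-of-two positions, plus overall parity."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:
        r += 1
    n = m + r
    code = [0] * (n + 1)                  # positions 1..n used; parity bits at powers of two
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # not a power of two -> data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if (pos & p) and pos != p:
                parity ^= code[pos]       # make each parity group have even parity
        code[p] = parity
    overall = 0
    for pos in range(1, n + 1):
        overall ^= code[pos]
    return [overall] + code[1:]           # slot 0 holds the overall (SECDED) parity bit

def classify(word):
    """What a SECDED decoder would conclude about this (possibly corrupted) codeword."""
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos
    overall_ok = (sum(word) % 2 == 0)
    if syndrome == 0 and overall_ok:
        return "no error"
    if not overall_ok:                    # odd number of flips is assumed to be exactly one
        return "single-bit error (corrected)"
    return "double-bit error (detected, uncorrectable)"

word = encode([random.randint(0, 1) for _ in range(8)])
for nflips in (1, 2, 3):
    corrupted = list(word)
    for pos in random.sample(range(len(corrupted)), nflips):
        corrupted[pos] ^= 1
    print(f"{nflips} flipped bit(s): {classify(corrupted)}")
# With 3 flips the overall parity is odd again, so the decoder claims a correctable
# single-bit error and "fixes" the wrong bit -- silent corruption.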

BobHoward
Feb 13, 2012


Factory Factory posted:

Yeah, no.

1) Empirically, only about 50% of hard drive failures show SMART warnings at all, never mind predictive errors.

2) Statistically, a single reallocated sector is a failure. A single reallocated sector changes the drive's likelihood to survive 6 months from about 95% to about 60%. When SMART was designed, it was assumed that bad sectors accumulated naturally over a drive's lifetime, but that has turned out not to be the case. As such, a reallocated sector is not "predictive" of a failure, but rather an indication that the drive has already suffered a mortal wound and you're just waiting for it to stop spinning.

Yeah, a rare drive will pick up one or two reallocated sectors and then last forever, but that's the exception by far. The greatest statistical predictor of having two or more reallocated sectors is having one in the first place.

re #1 - Not talking about SMART warnings. I'm saying that you can look at the reallocated sector count reported by SMART (and also the unrecoverable error count) to get an idea whether the drive is truly healthy, regardless of whether it's generating warnings. Same principle as looking at these numbers for a SSD.

In my experience, sick HDDs never actually generate formal SMART warnings. However, they usually have a significant number of reallocations.

re #2 - I have no idea where that 95% to 60% figure comes from, but you're not really disagreeing with me!

Also I question your claim that even 1 reallocation is necessarily a sign of bad poo poo. Every HDD has a hidden (and large) number of sectors mapped out by the factory's media defect scan. Error rates are often data pattern dependent (you'll have to take my word on that), so ideally you'd want to do a defect scan by running many passes with different patterns. However, even a single-pass scan takes a long time, factory test fixture time is expensive, and HDD profit margins are small. They're not likely to want to run more than 1 pass. So I can easily believe that out of 244 million user-visible 4K sectors on a 1TB drive, you will normally have a few marginal ones which might eventually need to be remapped by the drive.

What I watch for is sudden large increases in the reallocated count, or more than ~20 reallocations period, or just 1 unrecoverable read error. Every time I've seen an unrecoverable read error, I've been able to make that drive fail completely by exercising it with a heavy I/O load.

My most recent HDD failure was a Seagate which reallocated 1 sector at about 1 month of service life, lasted 4.5 years that way, then suddenly rocketed up to over 100 reallocations and some unrecoverable errors. It took about 3 weeks of hammering it 24/7 to make it die the true death, during which it reallocated another 500 sectors or so (and experienced thousands of unrecoverable read errors).
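
For anyone who wants to automate that rule of thumb, here's a minimal sketch that runs smartctl (from smartmontools) and applies the thresholds from this post (attribute 5 = Reallocated Sector Count, 187 = Reported Uncorrectable Errors). The output parsing is deliberately simplistic and assumes the usual attribute-table layout; treat it as an illustration, not a robust tool.

code:
import subprocess

REALLOCATED_SECTORS = 5
REPORTED_UNCORRECTABLE = 187

def check_drive(device):
    """Apply the rule-of-thumb thresholds above to `smartctl -A` output for one drive."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    raw = {}
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            try:
                raw[int(fields[0])] = int(fields[-1])   # last column is the raw value
            except ValueError:
                pass                                    # some raw values aren't plain integers
    realloc = raw.get(REALLOCATED_SECTORS, 0)
    uncorrectable = raw.get(REPORTED_UNCORRECTABLE, 0)
    if uncorrectable > 0:
        return f"{device}: {uncorrectable} uncorrectable read error(s) -- back up and replace"
    if realloc > 20:
        return f"{device}: {realloc} reallocated sectors -- watch it closely"
    return f"{device}: {realloc} reallocated sectors, no uncorrectable errors -- probably fine"

print(check_drive("/dev/sda"))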

BobHoward
Feb 13, 2012


Klyith posted:

I'll agree that these quibbles are fairly pointless when most people would need to use the thing for a decade to see that failure in action. But putting a self-destruct on a product is a lovely thing to do unless you are using it for the best of intentions. Putting a drive into permanent read-only to protect data as a last resort: fine. Bricking a drive to punish someone who didn't buy your expensive enterprise drive: bad.

You're making a lot of unwarranted assumptions here.

SSDs use a complicated data structure to track which user LBA is stored where on the flash media. This data structure is itself also stored on the flash media, and is typically cached in DRAM attached to the SSD's controller (for performance). I can almost guarantee that this is what happened:

1. Firmware detects flash media wearout, takes drive into read-only mode.

2. But it was actually too late with that decision! The copy of the mapping data in flash is corrupted.

3. The drive works while still powered and still holding cached mapping structures in DRAM, but a power cycle or reset forces the SSD to attempt its normal bootstrap process, where it reads mapping data from the flash media into DRAM. Partway through, it runs into poo poo that is so hopelessly corrupted that it either crashes, or it detects that something's not right, gives up, and bricks itself.

4. Now, you have a brick. Note that if it "deliberately" bricked itself, you actually have an improved chance of getting data out of it, via a data recovery service which can dump the raw contents of flash and try to interpret them. Some of the scenarios where firmware doesn't detect bad mapping data and therefore doesn't brick itself will result in it attempting to write to media, which is a really bad idea in that state.
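
As a minimal sketch of that sequence (a toy logical-to-physical map, nothing like a real FTL's actual data structures):

code:
CORRUPT = object()   # stand-in for a mapping record that fails its checksum / doesn't parse

# The flash-resident copy of the map, as found at power-on after the wearout event:
flash_copy_of_map = {0: 1042, 1: 88, 2: CORRUPT, 3: 3301}   # LBA -> physical flash page

def boot_rebuild(stored_map):
    """Step 3: rebuild the DRAM-cached mapping table from the copy stored in flash."""
    dram_cache = {}
    for lba, phys_page in stored_map.items():
        if phys_page is CORRUPT:
            # Step 4: the drive can't trust its own bookkeeping anymore. Refusing to come
            # up ("bricking") is the conservative choice -- writing anything in this state
            # could make professional data recovery harder.
            raise RuntimeError("mapping data corrupted -- drive refuses to come up")
        dram_cache[lba] = phys_page
    return dram_cache

try:
    boot_rebuild(flash_copy_of_map)
except RuntimeError as err:
    print(err)
# Before the power cycle, the intact DRAM copy kept the drive limping along; the reboot
# forces the rebuild above, and that's the moment it turns into a brick.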

BobHoward
Feb 13, 2012


Alereon posted:

With all due respect, this simply isn't true. While HDDs do experience soft errors (such as read errors that work upon retry) at a low rate as part of normal operation, hard errors (such as bad sectors or uncorrectable read errors) do not occur on drives that are functioning normally. Even a single error such as this means the drive is failing. Failing drives can take highly variable amounts of time to die completely, but it's usually sooner rather than later.

Soft and hard errors are both events where there were too many bit errors to successfully correct a sector on the first read attempt. The thing which turns some of these into "hard" errors is merely that HDD firmware shouldn't retry indefinitely. Instead it gives up after a semi-arbitrary number of failed retries, logs an uncorrectable read error (hard error) and returns a failure code to the host.

Anticipating the rejoinder to that, yes there are hard errors which can never be recovered with any number of retries. Just pointing out that these things aren't in different universes. There are "hard" errors which could have been soft errors if the firmware engineer typed in a slightly larger constant, and soft errors which might be hard 99% of the time but you got super lucky.

More to the point, normal, healthy HDDs can and do decide to map out sectors that have experienced soft errors. All it takes is the "should I remap this" threshold being a harsher standard than the "should I give up" threshold. Arguably, it always should be. Also note that the remap decision may not be solely based on retry count, unlike the hard error decision.
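
A minimal sketch of those two separate decisions (the retry counts are invented constants, not anything out of real drive firmware):

code:
MAX_RETRIES = 200        # give up after this many re-reads -> logged as a hard error
REMAP_THRESHOLD = 20     # needing even this many retries marks the sector as marginal

def read_sector(try_read, lba):
    """try_read(lba) returns data or None. Returns (data, hard_error, remap_requested)."""
    for attempt in range(1, MAX_RETRIES + 1):
        data = try_read(lba)
        if data is not None:
            # Soft-error territory: the read eventually worked, but a sector that needed
            # lots of retries gets remapped anyway -- a harsher standard than "give up".
            return data, False, attempt > REMAP_THRESHOLD
    # Hard error: log it (this is where SMART attribute 187 ticks up), remap, tell the host.
    return None, True, True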

So, to me, the fact that remapping has taken place is not absolute proof that a failure is in progress. As I talked about before, there are plausible reasons why some marginal sectors may pass factory media scans, yet trigger remaps later on. So I just don't get your absolutism about this.

For clarity, the two SMART attributes I'm usually attentive to are 187 (Reported Uncorrectable Errors) and 5 (Reallocated Sector Count). Nonzero values of 187 are always very bad news in my experience. Small nonzero counts in 5 are often OK if 187 is still zero. (At the very least, so long as 187 is 0 you know the drive has never actually failed to read data correctly, within the limits of its BCH algorithm anyways.)

BobHoward
Feb 13, 2012


Alereon posted:

While I understand that your viewpoint is based on the same assumptions as SMART failure predictions, that's not how hard drives actually work in the real world.

If that's what you understand my viewpoint to be, you have not understood it and/or I haven't communicated it well. Please understand that absolutely none of my posts were about SMART failure predictions. I am trying to talk about how human beings can interpret some of the raw event counts reported through SMART, which is not the same thing.

(But speaking of the actual formal SMART predictions, I don't think it's possible to identify a single set of assumptions behind them. I know there was some ancient theory of how to predict HDD failures behind the original work on SMART, 20ish years ago, but the SMART reporting interface doesn't rigidly hew to it. Instead it provides this super generic way of reporting raw attribute values, cooked attribute values, and a pass/fail threshold for the cooked attribute values. Drives are free to use more or less any algorithm for cooking the raw values, and any threshold too, so the warnings are not dictated by 20 year old theory at all. Cynically speaking, manufacturers probably do not trigger SMART failure warnings as aggressively as they should, as they would prefer to not deal with RMA until the drive actually fails.)

quote:

The 2007 Google Labs paper Failure Trends in a Large Disk Drive Population [PDF] revolutionized our understanding of how hard drives fail by correlating SMART error logs with observed failure rates in a population large enough to provide statistical validity. The important takeaway is that logging even one hard error is associated with such a massive spike in failure rates over all timeframes that we can say that the logging of a hard error means the drive has begun its failure sequence.

Except it doesn't really say that? I mean yeah, that's pretty much the gist of the paper's conclusions, but even when all the data is sound, papers are not by definition perfect at interpreting it.

The middle graph of fig. 11 is particularly interesting. It shows that drives 0 to 5 months old when they log their first reallocation survive at a rate of over 90% over the time window they examined, and drives 5 to 10 months old at the time of first reallocation survive at a rate over 95%. Over 10 months old at first reallocation, survival rate plummets.

This is consistent with the idea that media flaws which the factory failed to map out are a source of reallocation events (note: not necessarily hard errors) which do not reflect a physical problem. In a high use environment like Google's, it makes sense that the first year or so of operation would probably find most of these latent bad sectors, removing that source of reallocations from the population. Reallocations after that window of time are more reliably a sign of something bad happening. (In private use, you probably aren't reading and writing as much as Google, so you might not find latent bad sectors as quickly)

BobHoward
Feb 13, 2012


Klyith posted:

Aslo, holy poo poo. Does anyone else have comparable tech on the way? Because if samsung is really going to be able to stack layers as well as they predict, they will loving own flash storage. I expected 3d nand to have some kind of downside in the same way that 3d SRAM can have, but nope. It's pure gravy. Everything about the stuff is just better. :getin:

Whoa there, don't get too excited. I read the Anandtech article and it didn't go into the obvious penalty: the layer stack requires more process steps to create, which costs money. I read the much more technical industry analyst article AT linked to, and Samsung's using a clever design of the vertical bit string to avoid scaling the number of lithography steps (generally the most expensive kind of process step) with the number of bit layers, so there is a real cost advantage. But it's not 100% delicious gravy.

BTW, that industry analyst article links to a paper which tries to analyze the future of V-NAND scaling, and after skimming it I have to say you should take Samsung PR's predictions of The Glorious Future with a grain of salt. Not saying scaling won't happen, but there's some interesting and difficult problems to address along the way. (If there weren't, they'd probably be shipping with more layers already!)

Finally, yes, others have comparable tech on the way. Toshiba apparently came up with the idea first, and all the usual suspects have plans to go 3D. Samsung's just being aggressive about being first to market -- it seems different companies came up with different strategies based on perceived risks of pushing for 3D NAND quickly (it hasn't been a super easy tech to develop) versus slow-rolling it while pushing planar as far as it can go. Samsung is probably enjoying the benefits of being the industry volume leader here, which lets them push new tech into production sooner in much the same way that Intel does for high performance logic processes. Toshiba/SanDisk apparently don't plan to hit volume production until 2016.

BobHoward
Feb 13, 2012


Factory Factory posted:

Yeah - Micron is doing it with DRAM. Intel's sticking some on its next-gen Xeon Phi many-core processor. Here's a slide:

Not the same thing fyi. Micron HMC (hybrid memory cube) is a stackup of multiple silicon die, one 2D plane of DRAM per die. This new NAND tech is noteworthy because it's a single piece of silicon with multiple memory planes built up vertically.

Multiple NAND flash die stacked in one package is a thing that has been around basically forever, and is in your SSD already, using considerably less advanced tech for connecting the stacked die than HMC. With NAND, die stacking is done to increase the amount of memory in one package, so wirebond works ok. HMC uses through-silicon vias, and also puts a dedicated high performance controller / interface logic die at the bottom of the stack. Hence the "hybrid" -- it's not purely DRAM.

BobHoward
Feb 13, 2012


Klyith posted:

Some of the things that RAPID does, like keeping a copy of the drive's block map in RAM and using that info to cache & write full blocks, are specific to the hardware.

I think you are overstating what RAPID does. What you describe would require massive layering violations, such as nonstandard SATA command extensions. It's also unnecessary because any halfway competent SSD firmware must be capable of writing full blocks on its own, so long as it's given enough write data to fill them. If a SSD can't do this optimization on its own, it will suffer from excessive write amplification under heavy write loads, which reduces write lifespan and performance. If you're writing operating system code, all you should need to do is write data in batches sized to fill integral numbers of SSD blocks, and let the SSD worry about the details.
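
Here's a minimal, hardware-neutral sketch of that batching idea -- accumulate small writes until there's a full block's worth, then issue one big write. This is my illustration of the generic optimization, not Samsung's RAPID code; the block size is an assumption.

code:
SSD_BLOCK_BYTES = 2 * 1024 * 1024   # illustrative erase-block size (assumption)

class WriteBatcher:
    """Coalesce small writes into block-sized chunks before handing them to the device."""

    def __init__(self, submit):
        self.submit = submit          # callback that issues the actual large write
        self.pending = bytearray()

    def write(self, data: bytes):
        self.pending += data
        while len(self.pending) >= SSD_BLOCK_BYTES:
            # A full block's worth: the SSD can place this without extra write amplification.
            self.submit(bytes(self.pending[:SSD_BLOCK_BYTES]))
            del self.pending[:SSD_BLOCK_BYTES]

    def flush(self):
        if self.pending:              # partial block at sync/shutdown; unavoidable remainder
            self.submit(bytes(self.pending))
            self.pending.clear()

# Usage sketch: batcher = WriteBatcher(submit=device_write_fn); batcher.write(b"...")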

I just looked up Samsung's RAPID white paper. To the extent you can tell from white-paperese, it sounds like that's exactly what they're doing. There's no mention of any SSD-specific caching technologies for the read cache side of RAPID, it's just a more content aware caching scheme (cynical reading: may be designed to game benchmarks). The RAPID write cache is the SSD-specific part, and it's described as batching together small, low queue-depth random writes, especially those caused by background activity such as system log writes.

The read cache is clearly hardware-neutral. The write cache might have some small amount of tuning for Samsung's own SSDs, but it sounds fairly hardware neutral too. If you could wave a wand and force Samsung to open the code, optimizing for any vendor's SSD would probably be a matter of plugging different constants in.

BobHoward
Feb 13, 2012


Alereon posted:

Why don't you think TLC will stick around? They are migrating V-NAND to TLC for consumer drives.

In fact, V-NAND seems likely to be a better match for MLC and TLC than 2x or 1x nm planar flash cell designs.
