Dell & HPE Issue Updates to Fix 40K Hour Runtime Flaw in Enterprise SSDs
by Anton Shilov on March 27, 2020 4:00 PM EST
In a second SSD snafu in as many years, Dell and HPE have revealed that the two vendors shipped enterprise drives with a critical firmware bug, one that will eventually cause data loss. The bug, seemingly related to an internal runtime counter in the SSDs, causes them to fail once they reach 40,000 hours of runtime, losing all data in the process. As a result, both companies have needed to issue firmware updates for their respective drives, as customers who have been running them 24/7 (or nearly as much) are starting to trigger the bug.
Ultimately, both issues, while announced/documented separately, seem to stem from the same basic flaw. HPE and Dell both used the same upstream supplier (believed to be SanDisk) for SSD controllers and firmware for certain, now-legacy, SSDs that the two computer makers sold. And with the oldest of these drives having reached 40,000 hours of runtime (4 years, 206 days, and 16 hours), this has led to the discovery of the firmware bug and the need to quickly patch it. To that end, both companies have begun rolling out firmware updates.
As reported by Blocks & Files, the actual firmware bug seems to be a relatively simple off-by-one error that nonetheless has significant repercussions.
The fault fixed by the Dell EMC firmware concerns an Assert function which had a bad check to validate a circular buffer's index value. Instead of checking the maximum value as N, it checked for N-1. The fix corrects the assert check to use the maximum value as N.
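To illustrate the class of bug being described, here is a minimal sketch in C. This is a hypothetical reconstruction, not SanDisk's actual firmware code (which has not been published); the buffer size, variable names, and the use of a standard assert are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BUF_SLOTS 10U   /* hypothetical "N": number of slots in the circular buffer */

/* Illustrative log of runtime-counter records; all names here are invented. */
static uint32_t runtime_log[BUF_SLOTS];

/* Buggy check: uses N-1 as the limit, so the last valid slot (index N-1)
 * trips the assert even though it is a perfectly legal index. */
static void log_hours_buggy(uint32_t slot, uint32_t hours)
{
    assert(slot < BUF_SLOTS - 1U);   /* wrong: should allow indices 0..N-1 */
    runtime_log[slot] = hours;
}

/* Fixed check: the limit is N, so every index in [0, N-1] is accepted. */
static void log_hours_fixed(uint32_t slot, uint32_t hours)
{
    assert(slot < BUF_SLOTS);        /* correct */
    runtime_log[slot] = hours;
}

int main(void)
{
    /* The fixed variant handles every slot, including the last one. */
    for (uint32_t slot = 0; slot < BUF_SLOTS; slot++)
        log_hours_fixed(slot, slot * 4000U);

    printf("fixed check accepted all %u slots\n", (unsigned)BUF_SLOTS);

    /* The buggy variant works for slots 0..N-2, then aborts on the final slot,
     * analogous to the drive failing the moment its counter reaches the bad value. */
    log_hours_buggy(BUF_SLOTS - 1U, 40000U);
    return 0;
}
```

In firmware, a tripped assert typically halts or resets the controller rather than printing a message, so a check like this rejecting a perfectly legitimate value once the runtime counter rolls into its final slot would, from the outside, look like a dead drive.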
Overall, Dell EMC shipped a number of the faulty SAS-12Gbps enterprise drives over the years, ranging in capacity from 200 GB to 1.6 TB, all of which will require the new D417 firmware update to avoid an untimely death at 40,000 hours.
Meanwhile, HPE shipped 800 GB and 1.6 TB drives using the faulty firmware. These drives, in turn, were used in numerous server and storage products, including HPE ProLiant, Synergy, Apollo 4200, Synergy Storage Modules, D3000 Storage Enclosure, and StoreEasy 1000 Storage, and require HPE's firmware update to secure their stability.
As for the supplier of the faulty SSDs, while HPE declined to name its vendor, Dell EMC did reveal that the affected drives were made by SanDisk (now a part of Western Digital). Furthermore, based on an image of HPE's MO1600JVYPR SSDs published by Blocks & Files, it would appear that HPE's drives were also made by SanDisk. Consequently, it is highly likely that the affected Dell EMC and HPE SSDs are essentially the same drives from the same maker.
Overall, this is the second time in less than a year that a major SSD runtime bug has been revealed. Late last year HPE ran into a similar issue at 32,768 hours with a different series of drives. As SSDs are now reliable enough to remain in service for several years at a time, we're going to start seeing the long-term impact of such long service lives.
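The exact mechanism of that earlier 32,768-hour bug was not spelled out in the same detail, but 32,768 is exactly 2^15, the point at which a signed 16-bit counter overflows. The sketch below is purely illustrative of that suspected failure mode; the counter name and its width are assumptions, not a confirmed root cause.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical signed 16-bit power-on-hours counter. This is an assumption
     * for illustration only; the real cause of the 32,768-hour failure was
     * never published in this level of detail. */
    int16_t power_on_hours = 32767;   /* INT16_MAX: the largest value it can hold */

    /* One more hour of runtime. The increment is computed as an int, but the
     * narrowing conversion back to int16_t is implementation-defined; on
     * typical two's-complement targets it wraps to -32768. */
    power_on_hours++;

    /* Firmware code expecting a non-negative hour count is now handed a
     * nonsense value. */
    printf("hour 32768 reads back as: %d\n", power_on_hours);
    return 0;
}
```

Whatever the actual mechanism in either case, the lesson is the same: internal firmware counters have hard limits, and nothing exercises those limits until drives have been powered on long enough to reach them.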
Related Reading:
- Western Digital Introduces WD Gold Enterprise SSDs
- Western Digital Starts Sales of WD_Black P50 USB 3.2 Gen 2x2 SSDs
- Western Digital Ultrastar DC SS540 SAS SSDs: Up to 15.36 TB, Up to 3 DWPD
Sources: Blocks & Files, ZDNet
51 Comments
Samus - Monday, March 30, 2020 - link
Re-read my statement. The two companies that are seemingly the only enterprise equipment suppliers affected by these SSDs running this particular firmware are CONVENIENTLY the only two enterprise suppliers that strongarm their partners into maintenance agreements beyond the warranty period to receive what are otherwise free updates from virtually any other supplier. The crime here is that it still isn't clear if EMC and HPe are providing these updates for out-of-warranty equipment. Everything else is, as I admitted, speculation, not conspiracy.
Gigaplex - Sunday, March 29, 2020 - link
"But this just doesn’t add up when you consider such a ridiculous flaw in such a mission critical scenario"Such a ridiculous flaw in such a mission critical scenario makes even LESS sense if that flaw was intentional.
leexgx - Wednesday, July 8, 2020 - link
The bug was due to a coding error (it should have been N but was N-1 in the code, which had something to do with what happens once 40k hours pass). RAID is never a backup; you should have a secondary array on another server that's using completely different drives for server-to-server mirroring (real-time if needed, or every hour or day; it really depends on your requirements, though for most a 2am backup every day is enough).
oRAirwolf - Saturday, March 28, 2020 - link
Hanlon's razor, my dude.
rrinker - Monday, March 30, 2020 - link
It's entirely accidental - caused by the very common fault of programmers who don't understand the limits of various data types. All sorts of unintended consequences have happened because of these types of errors - including deaths, in the case of the 737 Max.
It makes absolutely no sense for a company to purposely brick a device which is STILL UNDER WARRANTY - that's a recipe for killing the company if every single one of a product line fails before the warranty is up, leaving them on the hook for supplying replacements.
FunBunny2 - Monday, March 30, 2020 - link
"It's entirely accidental - caused by the very common fault of programmers who don't understand the limits of various data types."there was a time when most commercial programs (COBOL, almost always) were written by HS graduates (or GEDs) who got a 'certificate' from some store-front 'programming school'. you can guess the result. in these days, the C/java/PHP crowd are largely as ignorant.
leexgx - Wednesday, July 8, 2020 - link
This was a coding error: they used N-1 instead of just N, so when the drive hits 40k hours it throws some sort of internal hard error; every time it tries to read the 40k-hour value, the firmware hard-errors on boot up (this is why you should try not to use disks that all have the same uptime; as nearly impossibly rare as it is, it could happen).
InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link
If by saying "planned obsolescence" you mean such blunder potentially making the company or the brand(s) the company sells obsolete because almost nobody wants to buy their data-killing products anymore, then i agree. If you rather meant the commonly agreed-upon meaning of "planned obsolescence", well, please don't let me stop you wallowing in absurd theories.Also, i am quite curious about the physical law or whatever it is that allows building planned obsolescence into SSD firmwares, yet seemingly makes it impossible to build such into firmwares of HDDs. Please tell me more! (...goes to redirect response output to /dev/nul)
FunBunny2 - Saturday, March 28, 2020 - link
"yet seemingly makes it impossible to build such into firmwares of HDDs. "HDD vis-a-vis SSD has virtually no logic used in data R/W. it's just a bit of magnetism going back and forth. now, HDD manufacturers could well build the platter hub ball bearings with leftover BB gun shot, and the voice coils from $10 transistor radio speakers, of course.
InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link
And that would stop a manufacturer from building planned obsolescence measures into an HDD? Because it is so much simpler than an SSD, therefore SSDs have planned obsolescence measures built-in, and HDDs do not? You know what is even simpler than an HDD? Good old traditional light bulbs. According to the logic of your argument, those light bulbs must have been immune from planned obsolescence. Dude, I have a bridge in Brooklyn to sell you...