In the second SSD snafu in as many years, Dell and HPE have revealed that they shipped enterprise drives with a critical firmware bug, one that will eventually cause data loss. The bug, seemingly related to an internal runtime counter in the SSDs, causes them to fail once they reach 40,000 hours of runtime, losing all data in the process. As a result, both companies have had to issue firmware updates for their respective drives, as customers who have been running them 24/7 (or nearly so) are starting to trigger the bug.

Ultimately, both issues, while announced and documented separately, seem to stem from the same basic flaw. HPE and Dell both used the same upstream supplier (believed to be SanDisk) for the SSD controllers and firmware in certain now-legacy SSDs that the two computer makers sold. With the oldest of these drives having reached 40,000 hours of runtime (4 years, 206 days, and 16 hours), the firmware bug has come to light, along with the need to patch it quickly. To that end, both companies have begun rolling out firmware updates.
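For reference, the 40,000-hour figure can be converted to calendar time with a few lines of Python. This is only a sketch of the arithmetic; the helper name `split_runtime` is made up here, and it assumes 365-day years (no leap days), which is what the figure quoted above implies.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days are ignored


def split_runtime(total_hours):
    """Split a runtime in hours into (years, days, hours)."""
    days, hours = divmod(total_hours, HOURS_PER_DAY)
    years, days = divmod(days, DAYS_PER_YEAR)
    return years, days, hours


print(split_runtime(40_000))  # (4, 206, 16)
```

A drive powered on continuously would therefore hit the limit a little over four and a half years after entering service.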

As reported by Blocks & Files, the actual firmware bug seems to be a relatively simple off-by-one error that nonetheless has significant repercussions.

The fault fixed by the Dell EMC firmware concerns an Assert function which had a bad check to validate the value of a circular buffer’s index value. Instead of checking the maximum value as N, it checked for N-1. The fix corrects the assert check to use the maximum value as N.
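To make the description above concrete, here is a toy Python sketch of that kind of off-by-one range check. It is not the actual firmware, which has not been published: the value of `N`, the function names, and the assumption that the index legitimately runs from 0 through N are all illustrative.

```python
N = 8  # hypothetical largest legal index; the real value is not public


def check_index_buggy(index):
    # Bug as described: the check treats N - 1 as the maximum.
    if not 0 <= index <= N - 1:
        raise AssertionError(f"index {index} out of range")


def check_index_fixed(index):
    # Fix as described: the check uses N as the maximum.
    if not 0 <= index <= N:
        raise AssertionError(f"index {index} out of range")


def first_failure(check):
    """Walk the index from 0 through N; return the first rejected index, or None."""
    for index in range(N + 1):
        try:
            check(index)
        except AssertionError:
            return index
    return None


print(first_failure(check_index_buggy))  # 8
print(first_failure(check_index_fixed))  # None
```

In this toy model the bad check passes for every index except the very last one, so the fault stays hidden until the index first reaches N.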

Overall, Dell EMC shipped a number of the faulty 12 Gbps SAS enterprise drives over the years, ranging in capacity from 200 GB to 1.6 TB, all of which will require the new D417 firmware update to avoid an untimely death at 40,000 hours.

Meanwhile, HPE shipped 800 GB and 1.6 TB drives using the faulty firmware. These drives were, in turn, used in numerous server and storage products, including HPE ProLiant, Synergy, Apollo 4200, Synergy Storage Modules, D3000 Storage Enclosure, and StoreEasy 1000 Storage, and they require HPE's firmware update to remain stable.

As for the supplier of the faulty SSDs, while HPE declined to name its vendor, Dell EMC did reveal that the affected drives were made by SanDisk (now a part of Western Digital). Furthermore, based on an image of HPE’s MO1600JVYPR SSDs published by Blocks & Files, it would appear that HPE’s drives were also made by SanDisk. To that end, it is highly likely that the affected Dell EMC and HPE SSDs are essentially the same drives from the same maker.

Overall, this is the second time in less than a year that a major SSD runtime bug has been revealed. Late last year, HPE ran into a similar issue at 32,768 hours with a different series of drives. Now that SSDs are reliable enough to stay in service for several years, we're going to start seeing the consequences of such long service lives.
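As an aside on the 32,768-hour figure: it equals 2^15, one more than the largest value a signed 16-bit integer can hold. Nothing above states what caused that earlier bug, so the Python sketch below only illustrates why such a threshold looks suspicious; it assumes a hypothetical hour counter stored as a signed 16-bit value, and `to_int16` is a made-up helper.

```python
def to_int16(value):
    """Interpret the low 16 bits of value as a signed 16-bit integer."""
    value &= 0xFFFF
    return value - 0x10000 if value >= 0x8000 else value


print(2 ** 15)           # 32768
print(to_int16(32_767))  # 32767
print(to_int16(32_768))  # -32768
```

Under that assumption, the counter would turn negative on the very hour it reached 32,768.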

Sources: Blocks & Files, ZDNet

51 Comments

  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020

    Correction of phrasing in my last comment: "Because it is so much simpler than a SSD, therefore SSDs have planned obsolescence measures built-in, and HDDs have not?" should be rather "Because it is so much simpler than a SSD, therefore SSDs can have planned obsolescence measures built-in, and HDDs would not allow that?"

    I am not trying to argue about whether SSDs or HDDs have actual planned obsolescence measures built in or not. I am (haphazardly, I guess) trying to dispel this ridiculous notion that SSDs are not trustworthy because they are seen as affected by planned obsolescence whereas HDDs are seen to be safe/unable to be affected by planned obsolescence.
  • edzieba - Monday, March 30, 2020

    "HDD vis-a-vis SSD has virtually no logic used in data R/W. it's just a bit of magnetism going back and forth."

    I would advise looking inside an HDD made in the last 3 or so decades. You may be surprised to find that a copious amount of electronic processing is required to turn magnetic domains into addressable blocks.
  • StrangerGuy - Friday, March 27, 2020

    How did this escape QA to begin with?
  • ABR - Saturday, March 28, 2020

    That's what I'm wondering? Where is their HALT (Highly Accelerated Life Testing)?
  • shabby - Saturday, March 28, 2020

    How do you accelerate time?
  • PreacherEddie - Saturday, March 28, 2020

    It is zero sum. Every person who uses a time machine to go back in time allows a company to test products for MTBF.
  • FunBunny2 - Saturday, March 28, 2020

    I believe it's called WARP drive. In a practical sense, many (hundreds, thousands?) are run 24/7 for some time period, and the total uptime hours across all devices are algorithmically massaged to MTBF. but you knew that, right?
  • shabby - Saturday, March 28, 2020

    Yes I did, but this drive specifically dies after 40,000 hours; MTBF testing won't find this flaw until a drive actually reaches that number of hours.
  • FunBunny2 - Saturday, March 28, 2020

    "Yes I did"

    Yes I did, too. I was answering the different question: "How do you accelerate time?" That's how it's done, in general.
  • Kvaern1 - Sunday, March 29, 2020

    "How do you accelerate time?"

    You record something and watch it on FF.
