The advent of modern storage technology (such as SSDs and RAID hardware) has further complicated the storage systems used in data centers and raised reliability concerns about the new storage components built on this technology. Because power faults are a common occurrence in such environments, we need to examine the behavior and consistency of these devices after power failures.
The key benefit offered by flash-based block devices (such as SSDs) is a performance gain. But this benefit, along with others such as lower power draw, does not outweigh the reliability edge that spinning disk drives hold over these relatively new SSDs. We have more than 50 years of working knowledge of Hard Disk Drive (HDD) technology, which suggests that SSDs have a long way to go before they become as reliable as today’s average enterprise HDD.
Researchers at The Ohio State University and HP Labs ran tests to study the behavior of block devices in the event of power failures. Their testing framework comprised custom hardware that could inject power faults into block devices and carefully designed software that stressed these devices while allowing post-fault consistency checks. They tested fifteen SSDs from five different vendors and two hard drives (one low-end and one high-end). The results were recorded over more than three thousand fault injection cycles.
When power faults were injected while workloads ran against these block devices, the results were mixed. Of the fifteen SSDs tested, only two behaved as expected after the power failure. Of the two spinning-platter hard drives examined for comparison, only one exhibited expected behavior under power fault. The remaining thirteen SSDs and one HDD showed inconsistent behavior. The unexpected failure types brought to light by these experiments include shorn writes, bit corruption, metadata corruption, unserializable writes, and dead devices.
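To make some of these failure types concrete, here is a minimal Python sketch of how a post-fault checker might tell them apart. It is not the researchers' actual tool. It assumes the checker knows both the old and the new contents of every record, that records are a whole number of 512-byte sectors, and that writes were issued one at a time in a known order. The names classify_record and find_unserializable are hypothetical. Metadata corruption and dead devices are outside the scope of this sketch.

```python
from __future__ import annotations

SECTOR_SIZE = 512  # assumed sector size in bytes


def classify_record(old: bytes, new: bytes, observed: bytes) -> str:
    """Classify how one record looks when read back after a power fault."""
    if not (len(old) == len(new) == len(observed)):
        raise ValueError("old, new and observed must have the same length")
    if observed == new:
        return "written"
    if observed == old:
        return "not written"
    # Compare sector by sector. A sector that matches neither the old nor
    # the new data is treated as corrupted.
    for off in range(0, len(new), SECTOR_SIZE):
        sector = observed[off:off + SECTOR_SIZE]
        if sector != new[off:off + SECTOR_SIZE] and sector != old[off:off + SECTOR_SIZE]:
            return "bit corruption"
    # Every sector is wholly old or wholly new, but the record is a mix.
    return "shorn write"


def find_unserializable(states: list[str]) -> list[int]:
    """Return indices of records that are missing although a later one persisted.

    `states` holds the classify_record result for each write, in issue order.
    This is a simplified notion of an unserializable write.
    """
    last_written = max(
        (i for i, s in enumerate(states) if s == "written"), default=-1
    )
    return [
        i for i, s in enumerate(states[:max(last_written, 0)])
        if s == "not written"
    ]


if __name__ == "__main__":
    old = bytes(4096)            # record contents before the write
    new = b"\xab" * 4096         # record contents the workload wrote
    shorn = new[:1024] + old[1024:]   # only the first two sectors landed
    flipped = bytearray(new)
    flipped[10] ^= 0x01          # a single flipped bit
    print(classify_record(old, new, shorn))
    print(classify_record(old, new, bytes(flipped)))
    print(find_unserializable(["written", "not written", "written"]))
```

The checker treats a record made up entirely of whole old and whole new sectors as a shorn write, and anything that matches neither as bit corruption. A real tool would more likely embed checksums and sequence numbers in each record rather than keep full copies of the data.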
These observations suggest that SSDs are not yet an apt choice for reliability-critical workloads, given how these drives behave under even the simplest of faults, i.e. a power outage. Therefore, it is incumbent on us to diligently examine the reliability properties of each block device before it is used on critical servers and desktops.