
Not long ago a very interesting study, "A Large-Scale Study of Flash Memory Failures in the Field," was published by Qiang Wu and Sanjeev Kumar of Facebook together with Justin Meza and Onur Mutlu of Carnegie Mellon University. The main conclusions of the paper, with brief comments, are given below.

Now that flash drives are widely used as a high-performance replacement for hard drives, their reliability plays an increasingly important role: chip failures can lead to downtime and even data loss. The study discussed here was conducted to better understand how the reliability of flash memory changes in the real conditions of a heavily loaded production environment.

The authors collected extensive statistics over four years of flash-drive operation in Facebook data centers.

As many surely know, Facebook was for a long time the largest (and primary) customer of Fusion-io (since acquired by SanDisk), one of the first companies to ship PCIe flash drives.

The analysis of the collected data led to a number of interesting conclusions:

• The probability of SSD failure does not change linearly over time. One might expect the failure probability to grow linearly with the number of write cycles; instead, distinct peaks of increased failure probability are observed, and these peaks are driven by factors other than natural wear.
• Read errors are rare in practice and are not the dominant failure mode.
• How data is distributed across the SSD's address space can significantly affect the probability of failure.
• Higher temperatures increase the probability of failure, but thanks to throttling the negative temperature effect is considerably reduced.
• The amount of data written as reported by the operating system does not always accurately reflect the wear of the drive, because the SSD controller runs internal optimization algorithms and the system software buffers I/O.

The systems studied.
The authors obtained statistics from drives of three types (different generations) in six different hardware configurations: one or two 720 GB PCIe v1 x4 drives, one or two 1.2 TB PCIe v2 x4 drives, and one or two 3.2 TB PCIe v2 x4 drives. Since all measurements were taken from live production systems, the drives' operating time (and the amount of data written and read) varies considerably. Nevertheless, the collected data proved sufficient to obtain statistically significant results after averaging within the individual groups. The main reliability metric the authors work with is the uncorrectable bit error rate, UBER = uncorrectable errors / bits accessed: errors that occur during reads or writes and cannot be corrected by the SSD controller. Interestingly, for some systems the UBER values agree to within an order of magnitude with the bit error rates (BER) measured by other researchers at the level of individual chips in synthetic tests (L. M. Grupp, J. D. Davis, and S. Swanson. The Bleak Future of NAND Flash Memory. In FAST, 2012). However, this similarity was seen only for first-generation drives and only in the configuration with two cards per system; in all other cases the difference is several orders of magnitude, which looks quite logical. Most likely a combination of internal and external factors (temperature, power supply) is responsible, so no firm conclusions can be drawn from this observation.
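To make the metric concrete, here is a minimal sketch of the UBER calculation as the paper defines it; the counter names are hypothetical, and in practice the numbers would come from the drive's own telemetry.

def uber(uncorrectable_errors, bytes_read, bytes_written):
    """Uncorrectable bit error rate: uncorrectable errors / bits accessed."""
    bits_accessed = 8 * (bytes_read + bytes_written)
    return uncorrectable_errors / bits_accessed if bits_accessed else 0.0

# Example: 3 uncorrectable errors after roughly 200 TB read and 50 TB written.
print(f"UBER = {uber(3, 200e12, 50e12):.2e}")  # on the order of 1e-15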

Error distribution.
Interestingly, the number of observed errors depends strongly on the specific drive: the authors note that only 10% of the SSDs account for 95% of all uncorrectable errors. Moreover, the probability of an error depends substantially on the drive's history: if at least one error was observed within a given week, then with probability 99.8% an error can also be expected on that drive in the following week. The authors also note a correlation between the probability of errors and the number of SSD cards in the system: configurations with two drives had a higher failure probability. Here, however, other external factors have to be taken into account, first of all the nature of the load and the way load is redistributed when a drive fails. So one cannot speak of drives directly influencing each other, but when planning complex systems it matters how the load is distributed not only in the normal state but also when individual components fail. The system should be designed so that the failure of one component does not lead to an avalanche-like increase in the probability of failures in other components.
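For illustration, here is a minimal sketch (not the authors' code) of how the two observations above could be computed from hypothetical weekly per-drive error counts, stored as a dictionary {drive_id: [errors_in_week_1, errors_in_week_2, ...]}.

def share_of_drives_for_error_fraction(weekly_errors, fraction=0.95):
    """Fraction of drives that together account for `fraction` of all errors."""
    totals = sorted((sum(weeks) for weeks in weekly_errors.values()), reverse=True)
    grand_total = sum(totals)
    running, drives = 0, 0
    for t in totals:
        running += t
        drives += 1
        if running >= fraction * grand_total:
            break
    return drives / len(totals)

def prob_error_next_week_given_error(weekly_errors):
    """P(error in week i+1 | error in week i), pooled over all drives."""
    cond, total = 0, 0
    for weeks in weekly_errors.values():
        for prev, nxt in zip(weeks, weeks[1:]):
            if prev > 0:
                total += 1
                cond += 1 if nxt > 0 else 0
    return cond / total if total else float("nan")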

Dependence of the number of errors on operating time (number of write cycles).
It is well known that an SSD's lifetime depends on the number of write cycles, which in turn is quite strictly limited by the underlying technology. It is logical to expect the number of observed errors to grow in proportion to the amount of data written to the SSD. The experimental data show that in reality the picture is a bit more complicated. It is known that ordinary hard drives typically follow a U-shaped (bathtub) failure-probability curve.
[Figure: the classic U-shaped failure-rate curve for hard drives. Source: Jimmy Yang and Feng-Bin Sun, "A comprehensive review of hard-disk drive reliability," Annual Reliability and Maintainability Symposium, 1999.]
At the initial stage of operation the failure probability is relatively high, then it decreases and starts to rise again after long operation. For SSDs we also see an elevated number of failures at the early stage, but not immediately: at first the number of errors grows gradually.
[Figure: SSD error rate as a function of the amount of data written, showing an early peak rather than monotonic growth.]
The authors hypothesize that the reason for this nonlinear behavior is the presence of "weak links": cells that wear out much faster than the rest. These cells generate uncorrectable errors early in the drive's life, and the controller, in turn, retires them. The remaining "reliable" cells function normally throughout the life cycle and begin to cause errors only after a long period of operation (as expected from the limited number of write cycles). This is quite a logical assumption: infant failures are observed both for hard drives and for SSDs. The difference in behavior between HDDs and SSDs is explained by the fact that a physical error on a hard drive usually knocks the disk out of its RAID array, whereas the SSD controller can usually correct the error and move the data to reserve capacity. The probability of failures at the early stage of operation can be reduced by preliminary stress testing ("burn-in"), which some vendors practice on dedicated test stands.
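A toy Monte Carlo sketch of this "weak link" hypothesis (all parameters below are invented for illustration and are not taken from the paper): a small fraction of cells with much lower endurance produces an early bump in failures, well before the main wear-out phase.

import random

# Toy model: most cells endure ~10,000 program/erase cycles, but a small
# "weak" fraction fails after only a few hundred. We count how many cells
# fail in each cycle bucket.
random.seed(0)
N_CELLS = 100_000
WEAK_FRACTION = 0.01

def endurance():
    if random.random() < WEAK_FRACTION:
        return random.gauss(300, 100)      # weak cells: fail early
    return random.gauss(10_000, 1_500)     # normal cells: fail late

BUCKET = 500  # group failures into 500-cycle buckets
failures_per_bucket = {}
for _ in range(N_CELLS):
    b = max(0, int(endurance())) // BUCKET
    failures_per_bucket[b] = failures_per_bucket.get(b, 0) + 1

for b in sorted(failures_per_bucket):
    print(f"{b * BUCKET:>6}-{(b + 1) * BUCKET - 1:>6} cycles: "
          f"{failures_per_bucket[b]} cell failures")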

Dependence of the number of errors on the amount of data read.
The authors separately examined the assumption that the amount of data read could also affect UBER. It turned out, however, that for SSDs whose read volume differs considerably (at a similar write volume), the uncorrectable error rate differs only slightly. The authors therefore conclude that read operations have no significant effect on drive reliability.

Influence of data fragmentation within the SSD on failures.
Another aspect worth noting is the link between the error rate and the load on the buffer. Of course, the error rate is not directly tied to the buffer itself (which is normally a DRAM chip). However, the more the written blocks are "spread out" across the SSD's address space (i.e., the more fragmented the data), the more heavily the buffer that stores metadata is used. The analysis of the collected data showed, for a number of configurations, an explicit dependence of the error rate on how the written data is distributed across the SSD. This suggests considerable potential for technologies that optimize write operations by placing data on the drive more compactly, which in turn would make the drives more reliable.
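A rough illustration of why scattered writes stress the metadata buffer more than compact ones (a deliberately simplified model, not the actual FTL of these drives): the more distinct logical regions a workload touches, the more mapping entries have to be kept in DRAM.

import random

# Simplified model (illustrative only): assume the FTL keeps one mapping-table
# page per 1 MB logical region, and regions touched by a write must be resident
# in the DRAM buffer. Compare a contiguous and a scattered workload of equal size.
REGION = 1 << 20          # 1 MB of logical space per mapping page (assumed)
DRIVE_SIZE = 720 << 30    # ~720 GB drive, as in the first-generation systems
WRITE_SIZE = 4096         # 4 KB writes
N_WRITES = 250_000        # ~1 GB written in total

random.seed(1)

def regions_touched(offsets):
    return len({off // REGION for off in offsets})

contiguous = [i * WRITE_SIZE for i in range(N_WRITES)]
scattered = [random.randrange(0, DRIVE_SIZE - WRITE_SIZE) for _ in range(N_WRITES)]

print("mapping pages touched, contiguous:", regions_touched(contiguous))
print("mapping pages touched, scattered: ", regions_touched(scattered))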

Temperature effects.
Among external factors with a potential impact on drive reliability, temperature comes first. Like any semiconductor, flash chips degrade at high temperatures, so one could expect rising system temperature to lead to a rising error rate. In reality such behavior is observed only for some configurations. The influence of temperature is most noticeable for first-generation drives, and also for systems with two second-generation drives. In the other cases the temperature effect was rather small, and sometimes even negative. This behavior is easily explained by throttling support (skipping cycles) in the SSD.
[Figure: failure rate versus operating temperature for the different drive configurations.]
Presumably, in the earlier models this technology either was not supported or was not implemented properly. Newer drives tolerate temperature increases calmly, but the price is reduced performance. So if SSD performance in a system suddenly drops, it is worth checking the thermal conditions. The temperature effect is especially interesting given that in recent years facilities teams have tried to raise data-center temperatures as much as possible to cut cooling costs. The documents published by ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers) contain recommendations for systems with SSD drives; for example, the document "Data Center Storage Equipment – Thermal Guidelines, Issues, and Best Practices" can be quite useful. When planning serious computing systems, the ASHRAE recommendations should certainly be taken into account, and the characteristics of the drives you plan to use should be studied carefully, so as not to end up in a thermal regime where performance is already being sacrificed to preserve the drive's reliability.
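As a practical aside, here is a minimal sketch of checking drive temperatures on a Linux host; it assumes the kernel exposes the device through the standard hwmon sysfs interface (temp*_input values in millidegrees Celsius), whereas PCIe flash cards of the Fusion-io era typically required vendor tools.

import glob
from pathlib import Path

ALERT_CELSIUS = 70  # arbitrary example threshold, not taken from the paper

# Walk all hwmon devices and print their temperature sensors.
for hwmon in glob.glob("/sys/class/hwmon/hwmon*"):
    name = Path(hwmon, "name").read_text().strip()
    for sensor in glob.glob(f"{hwmon}/temp*_input"):
        celsius = int(Path(sensor).read_text()) / 1000
        flag = "  <-- check cooling/throttling" if celsius >= ALERT_CELSIUS else ""
        print(f"{name:12s} {Path(sensor).name}: {celsius:5.1f} C{flag}")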

Reliability of the statistics reported by system software.
Another interesting observation by the authors: in some cases, even though the operating-system metrics showed a large amount of data written, the error rate was lower than on systems where the reported write volume was smaller.
[Figure: error rate versus the amount of data written as reported by the operating system.]
As it turned out, the metrics from the operating system and from the SSD controller itself often differed considerably. This is due to optimizations inside the SSD controller, as well as to I/O buffering both in the operating system and in the drive. As a result, you should not rely 100% on operating-system metrics: they may not be entirely accurate, and further optimization of the I/O subsystem may make this gap even more noticeable.
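For reference, a small sketch of the host-side half of such a comparison on Linux: the amount of data the OS believes it has written to a device, taken from /proc/diskstats (sectors there are always counted in 512-byte units). The device-side figure, for example "Data Units Written" from an NVMe drive's SMART log obtained with nvme-cli or smartctl, has to be read separately and will typically differ because of controller-internal optimizations and buffering; the device name below is just an example.

def os_reported_writes_gib(device="nvme0n1"):
    """Data written to `device` according to the OS, in GiB, from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                sectors_written = int(fields[9])   # field 10: sectors written
                return sectors_written * 512 / 2**30
    raise ValueError(f"device {device!r} not found in /proc/diskstats")

if __name__ == "__main__":
    print(f"OS-reported writes: {os_reported_writes_gib():.1f} GiB")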

Practical conclusions.
So, what practical conclusions can be drawn on the basis of this research?
1. When designing serious SSD-based solutions, pay close attention to the thermal conditions in the data center; otherwise you risk either performance degradation or an increased probability of failure.
2. Before putting a system into production, it is worth "warming it up" to identify the "weak links" (a toy sketch of such a warm-up pass is given below). This advice applies equally well to any component, be it SSDs, hard drives, or memory modules: load testing reveals many "problem" components that could otherwise cause a lot of grief in production infrastructure.
3. If a drive has started producing errors, think ahead about having spare parts on hand.
4. It is better to collect statistics from all available sources, but for a number of metrics it is better to rely on the low-level data from the drives themselves.
5. New generations of SSDs are usually better than the old ones :)
None of this advice is unexpected, but simple things often escape attention.
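As promised in point 2, here is a toy sketch of such a warm-up pass (illustration only; real burn-in is normally done with dedicated tools such as fio with verification enabled, and the mount point below is hypothetical).

import os
import hashlib

SCRATCH_FILE = "/mnt/ssd_under_test/burnin.dat"  # hypothetical mount point
BLOCK_SIZE = 1 << 20     # 1 MiB blocks
N_BLOCKS = 1024          # ~1 GiB per pass; scale up for a real warm-up

def one_pass(seed):
    """Write deterministic pseudo-random blocks, then read them back and verify.
    Note: without O_DIRECT, reads may be served from the page cache; real
    burn-in tools bypass the cache."""
    digests = []
    with open(SCRATCH_FILE, "wb") as f:
        for i in range(N_BLOCKS):
            block = hashlib.sha256(f"{seed}:{i}".encode()).digest() * (BLOCK_SIZE // 32)
            digests.append(hashlib.sha256(block).hexdigest())
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    with open(SCRATCH_FILE, "rb") as f:
        for i in range(N_BLOCKS):
            block = f.read(BLOCK_SIZE)
            if hashlib.sha256(block).hexdigest() != digests[i]:
                raise RuntimeError(f"data mismatch in block {i} (pass {seed})")

for p in range(3):   # a few passes; a real warm-up would run much longer
    one_pass(p)
    print(f"pass {p} completed without mismatches")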

I have covered here the points that seemed most interesting to me; if you want to study the question in detail, it is worth reading the original paper and going through all of the authors' analysis carefully. A truly titanic amount of work went into collecting and analyzing the data. If the team continues its research, in a couple of years we can expect an even more extensive and comprehensive study.

P.S. The term "reliability" in this text is used simply as a stand-in for "number of errors".

Other articles by Triniti can be found in Triniti's hub. Subscribe!

This article is a translation of the original post at habrahabr.ru/post/264463/

