
1 year, 7 months ago

About a year ago I did some rather interesting work developing an embedded computer for a company in the electronics business. The computer itself was nothing special: a Cortex-A8 processor running at sub-gigahertz frequencies, 512 MB of DDR3, 1 GB of NAND, and a lightweight Linux build. However, the device it was built into, and therefore the computer too, had to work in fairly harsh conditions: a wide temperature range (from -40 to +85 degrees Celsius), moisture resistance, immunity to electromagnetic interference, kilovolt surges on the power supply, 4 kV electrostatic discharge protection, and many other interesting things that are well described in the various state standards for special-purpose equipment. One of the customer's main requirements was a time to failure of at least 10 years. On top of that, the vendor provides warranty repair for five years, so the question is not rhetorical but a serious financial one. A suitable component base was designed into the product. The device passed its tests with honor and received the required certificates, but that is not what this story is about. The problems began when the pilot batch was produced and the devices were handed out to departments and design bureaus for application software development. Returns started coming in with the description: "Something is not booting".

It was a FAIL


During the investigation it became clear that in 100% of the failures the NAND partition with the file system (rootfs) was corrupted, while all other partitions were intact, mounted normally, and could be read. Questioning the witnesses showed that the device refused to start after a hard emergency power-off. The direction of the research was clear: a file system can be corrupted when power is cut during a write to the medium. We built a test stand whose job was to apply power to the device, wait until Linux booted and started a test script (which generates files and writes them to flash), and then cut the power. And so on in a loop. One cycle took slightly more than a minute. We put several devices on the stand. On average, after about 2000 iterations each device refused to boot: the rootfs partition had failed. It looked like we had found the cause.
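The on-device half of such a stand can be very simple. Below is a minimal sketch of a load script of the kind described above, not the original one: the function name stress_write, the file naming, and the default 64 KB file size are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical load generator: keeps writing files so that a power cut
# is likely to land in the middle of a flash write.
# Usage: stress_write DIR COUNT [SIZE_KB]
# DIR should live on the partition under test.
stress_write() {
    dir=$1
    count=$2
    size_kb=${3:-64}
    mkdir -p "$dir" || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # Random data does not compress, so every file really reaches flash pages.
        dd if=/dev/urandom of="$dir/file_$i.bin" bs=1024 count="$size_kb" 2>/dev/null || return 1
        sync    # flush the page cache to the medium
        i=$((i + 1))
    done
}
```

On the stand it would be started from an init script with a large count, for example stress_write /data/stress 100000 (the path is hypothetical), and simply left running until the power disappears.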

For durability and reliability our device uses SLC NAND as its non-volatile storage. Options with eMMC (embedded MultiMediaCard) were rejected straight away because of the small number of rewrite cycles. Today eMMC is not a standard for industrial applications, which is probably why so few such chips are offered with a lower operating temperature limit of -40 °C. The main limitation for use in industrial systems is the short guaranteed data retention time: about 10 years for SLC NAND versus about a year for eMMC.

Unlike a solution based on eMMC (or an ordinary SD, "Secure Digital", card), where the software layer that deals with the physical medium (FTL, Flash Translation Layer) is run by a controller built into the memory, with raw NAND the FTL has to be implemented on the central processor. Although this makes the implementation more complex, it gives noticeable gains in the flexibility of the final system's configuration, and the ability to use dedicated wear-leveling algorithms for the physical memory cells increases the lifetime of the medium. (In fact, the FTL built into eMMC implements wear leveling too, but it is a "black box".)

In Linux, several file systems are used to work with raw NAND: JFFS2 and its evolutionary successor UBI/UBIFS (thanks to Nokia for that), and also a competitor, LogFS. Based on the full set of characteristics, we chose the UBI/UBIFS pair. UBI/UBIFS is two software layers: UBI (Unsorted Block Images) works directly with the physical medium, and UBIFS (UBI File System) is the file system proper.

Main features of UBI:

  • works with partitions (volumes): lets you create, delete, and resize them;
  • spreads writes evenly across the whole medium (wear leveling);
  • handles bad blocks;
  • minimizes the probability of data loss on an emergency power-off or other failures.

UBIFS, among other things, maintains a journal.

Although UBI and UBIFS as a whole were designed with tolerance to power interruptions in mind, practice has shown that under certain conditions a partition gets corrupted after an abnormal shutdown (in other words, a power cut). If it is the partition with rootfs, the device stops working altogether. The probability of this event is low: a device can run stably for several months or even several years and safely survive any number of power failures. However, this factor has to be taken into account if the device is meant to work in a hard-to-reach place with limited human access, or if its failure can have fatal consequences.

The cause of the failures lies in the physical structure of NAND. Data is written page by page, and before that the page has to be erased, that is, all ones are written to the area. Erasing is done in blocks; such a block is called a PEB (physical eraseblock). To erase a page, the whole block has to be erased. One block can contain many pages, for example a 4 KB page and a 256 KB block. The developers of UBI/UBIFS are aware of the problem and blame it all on so-called "unstable bits". They point to four main situations in which data on the medium can be lost.
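To get a feel for the numbers, here is a small arithmetic sketch. The 4 KB page and 256 KB block are the example figures from the paragraph above and the 1 GiB chip size is the one used in our device; the function names are made up for illustration, and a real chip's geometry should be taken from its datasheet.

```shell
#!/bin/sh
# pages_per_block BLOCK_BYTES PAGE_BYTES: how many pages share one erase operation
pages_per_block() { echo $(( $1 / $2 )); }

# blocks_on_chip CHIP_BYTES BLOCK_BYTES: how many PEBs the chip has
blocks_on_chip() { echo $(( $1 / $2 )); }

PAGE=$((4 * 1024))              # 4 KB page (example value)
PEB=$((256 * 1024))             # 256 KB physical eraseblock (example value)
CHIP=$((1024 * 1024 * 1024))    # 1 GiB of NAND

echo "pages per PEB: $(pages_per_block "$PEB" "$PAGE")"   # pages per PEB: 64
echo "PEBs per chip: $(blocks_on_chip "$CHIP" "$PEB")"    # PEBs per chip: 4096
```

In other words, with this geometry a single erase touches 64 pages at once, so an interrupted erase can leave far more than one page in a doubtful state.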

Causes of failures and data loss in NAND
  1. Power is cut just before the write of a memory page is completed. After a reboot the page may read correctly, but a repeated read may return an ECC error. This happens because the page contains a number of unstable bits, which may read either correctly or incorrectly.
  2. Power is cut at the very start of a NAND page write. After a reboot the page may read correctly as all ones (0xFF), but sometimes some bits in this area read as zeros. Moreover, if the page is then written again, an ECC error can sometimes occur. The cause is, again, unstable bits.
  3. Power is cut while a block is being erased. After a reboot unstable bits can again appear, and data in the block becomes corrupted.
  4. Power is cut right after a block erase operation has started. Here too, after a reboot the block contains unstable bits: it either returns zeros on reads or corrupts the data when information is written to it.


In all these cases the memory area may read correctly after an emergency power-off, so the journaling system will not notice the catch. But on a later access to this area the data may turn out to be corrupted. The number of such "unstable bits" can exceed what the ECC algorithm is able to correct. As a result, pages that were read successfully earlier become unreadable, or, conversely, a previously unreadable page can suddenly become readable. The problem is made worse by the fact that unstable bits can appear in the file system journal, since statistically that area of NAND is modified most often.

Rescuing the system


To make the file system more survivable we decided to introduce redundancy into the architecture of the root file system (RFS). The idea is as follows: we create a "virtual" partition out of two physical partitions on the medium. One partition contains a read-only rootfs image, and while the operating system is running all changes go to the second partition, which is available for reading and writing. Since writes go only to the second partition, only it can be damaged by an emergency power-off. The first partition stays in its original state. This technique is known as union mounting.

In addition, we decided to put the system software (meaning rootfs; the kernel was on a separate read-only partition from the start) and the application software on different physical partitions. Because of the specifics of our device (it works with large databases), we also set aside a partition for backups. At this point we are glad that enough memory (1 GiB) was designed into the device.

image

Union mounting of the partitions is done with the auxiliary file system aufs. As mentioned above, two physical partitions are combined. The first partition, into which the working RFS image is written initially, is read-only (RO); the second, initially empty, stores the changes and is therefore available for both reading and writing (RW). In aufs terms the first and second partitions are called branches. The branches are combined at mount time, and as a result the operating system sees the mounted area as a single whole. Data access is handled by a kernel driver. For a file read the driver first queries the RW branch; if the data is there it is returned, and if not, the file is read from the RO branch. On a write, the data goes to the RW branch. When a file is deleted, a marker saying that the file was deleted is added to the RW branch (an empty hidden file with a special prefix in its name is created). Physically the file remains intact in the RO branch. This approach avoids write operations on the partition with critical information. Moreover, since the RO branch is read-only, it is possible in principle to add extra data integrity checks. This can be done by means of UBI, by making the volume static: a static volume is read-only, and its data is protected by a checksum (CRC-32).
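The lookup order is easy to model with two ordinary directories. The sketch below is only an illustration of the idea, not how the aufs driver is implemented: union_cat is a made-up helper, and the only aufs-specific detail it borrows is the .wh. prefix that aufs gives to its deletion markers (whiteouts).

```shell
#!/bin/sh
# union_cat RW_DIR RO_DIR RELATIVE_PATH
# Models a read from a two-branch union using plain directories:
#   1. a whiteout marker in the RW branch hides the file;
#   2. otherwise the copy in the RW branch wins;
#   3. otherwise the file is taken from the RO branch.
union_cat() {
    rw=$1
    ro=$2
    rel=$3
    dir=$(dirname "$rel")
    base=$(basename "$rel")
    if [ -e "$rw/$dir/.wh.$base" ]; then
        return 1                # the file was "deleted" in the union
    elif [ -f "$rw/$rel" ]; then
        cat "$rw/$rel"          # changed or new file
    elif [ -f "$ro/$rel" ]; then
        cat "$ro/$rel"          # untouched file from the read-only image
    else
        return 1                # no such file in either branch
    fi
}
```

For example, with etc/hostname present only in the RO directory the helper prints the factory value; after a modified copy is placed in the RW directory it prints that copy; and after an empty etc/.wh.hostname is created in the RW directory it reports the file as missing, although the RO copy has not been touched.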

In total, we want the RFS architecture to look like this:

image

The "rootfs_" partitions contain the system part of the RFS, which keeps the Linux operating system running; the "data_" partitions are intended for application software, settings files, and databases. The "backup" partition is intended for periodic backups of the current system settings and databases. Backups are made by the application software.

Cooking aufs


At the moment aufs is not part of the mainline Linux kernel, so in addition to building the utilities for it you have to apply the patches to the kernel sources yourself. To deploy aufs on a target Linux platform you need to:

  1. Apply the patches to the kernel. All patches and how-tos can be found on the project website.
  2. Enable aufs in the kernel configuration.
  3. Build the kernel.
  4. Build the aufs utilities.
  5. Transfer the kernel and the utilities to the target.
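Two quick sanity checks help along the way. This is a hedged sketch: the helper names are made up, CONFIG_AUFS_FS is the option the aufs patches add to the kernel configuration, and /proc/filesystems is where a running kernel lists the file systems it supports.

```shell
#!/bin/sh
# has_aufs_config CONFIG_FILE
# True if aufs is enabled in a kernel .config, built in (=y) or as a module (=m).
has_aufs_config() {
    grep -Eq '^CONFIG_AUFS_FS=[ym]$' "$1"
}

# has_aufs_runtime [FILE]
# True if the running kernel lists aufs among its file systems.
# FILE defaults to /proc/filesystems; the argument exists only to ease testing.
has_aufs_runtime() {
    grep -qw aufs "${1:-/proc/filesystems}"
}
```

The first check fits after step 2 (has_aufs_config .config in the kernel tree), the second after step 5 on the target. If aufs was built as a module, it has to be loaded before it shows up in /proc/filesystems.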

You can check that it works on the target by running:

mount -t aufs -o br=/tmp/rw=rw:${HOME}=ro none /tmp/aufs

Command format
mount [-fnrsvw] [-t fs_type] [-o options]           device  directory
mount            -t aufs      -o br=/tmp/rw:${HOME} none    /tmp/aufs


As a result, the contents of the home directory appear in /tmp/aufs; you can write and delete files there, and the contents of ${HOME} will not change.

Cool! aufs is hooked up; now for the most interesting part: how do we make the system boot from it? By default we cannot point the kernel at an aufs rootfs partition through cmdline at boot. When the kernel starts there is no such partition yet; it still has to be created. This means that during system startup, before the init process is launched (the process with PID 1, systemd in my case), we have to mount the auxiliary aufs partition, chroot into it, and only then start /sbin/init. There is a pre-initialization mechanism for such tasks: in cmdline we specify the path to a script that has to run before the init daemon starts. We add this parameter to cmdline:

init=/sbin/preinit

The script is written in shell, so at the time it runs all the utilities it needs must already be present in the system. In other words, to run the script the rootfs partition must already be mounted! For this you can use a rootfs on a RAM disk, or boot initially from the production rootfs partition but in read-only mode, which is what we chose. We edit cmdline accordingly and add the parameters (9 is the number of the MTD partition that holds my rootfs_ro):

root=ubi0:rootfs_ro ro ubi.mtd=9
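Put together, the relevant part of cmdline can be generated like this. It is only a sketch: make_bootargs is a made-up helper, and it emits nothing but the parameters named in this article. Depending on the kernel configuration, a real command line may also need things like rootfstype=ubifs and a console= setting, which are assumptions on top of the text and are left out here.

```shell
#!/bin/sh
# make_bootargs MTD_NUM ROOT_VOLUME PREINIT_PATH
# Builds the rootfs-related kernel parameters:
#   ubi.mtd=N        attach MTD partition N to UBI at boot (it becomes ubi0)
#   root=ubi0:NAME   use the UBI volume NAME as the root file system
#   ro               mount it read-only
#   init=PATH        run PATH instead of the default init
make_bootargs() {
    echo "ubi.mtd=$1 root=ubi0:$2 ro init=$3"
}

make_bootargs 9 rootfs_ro /sbin/preinit
# prints: ubi.mtd=9 root=ubi0:rootfs_ro ro init=/sbin/preinit
```

How the resulting string reaches the kernel (bootloader environment, device tree, built-in command line) depends on the platform.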

preinit script


We mount the system file systems (they are needed for the shell to work):

mount -t proc none /proc
mount -t tmpfs tmpfs /tmp
mount -t sysfs sys /sys

The rootfs_ro partition is already mounted, since we booted from it; we mount rootfs_rw in a temporary directory:

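mkdir -p /tmp/aufs/rootfs_rw   # mount point on the freshly mounted /tmp
ubiattach -m 10 -d 1 > /dev/null   # attach MTD partition 10 as UBI device 1
mount -t ubifs ubi1:rootfs_rw /tmp/aufs/rootfs_rw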

If something goes wrong while mounting, we boldly format rootfs_rw, and if that does not help, we delete the partition and create it again. Then we try to mount once more. I will not give the code, since it contains too many "magic numbers" determined by the NAND architecture. I will only say that you will need the UBI utilities.
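Purely as an illustration of the fallback order, and not the production code, the logic might look like the sketch below. Everything specific in it is an assumption: MTD partition 10 attached as UBI device 1 is taken from the commands above, the volume is assumed to have ID 0 and the name rootfs_rw, the device nodes are assumed to exist under /dev, and the helper names are invented. The commands themselves (ubiupdatevol, ubirmvol, ubimkvol) come from mtd-utils. With DRY_RUN=1 the script only prints what it would do.

```shell
#!/bin/sh
# Assumed layout: MTD partition 10 is UBI device 1 with one volume (ID 0).
UBI_NUM=1
VOL_NAME=rootfs_rw
MNT=/tmp/aufs/rootfs_rw

# run CMD...: execute the command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

mount_rw() {
    run mount -t ubifs "ubi$UBI_NUM:$VOL_NAME" "$MNT"
}

# Stage 1: wipe the volume contents; UBIFS creates a fresh
# file system when it is mounted on an empty volume.
wipe_rw() {
    run ubiupdatevol "/dev/ubi${UBI_NUM}_0" -t
}

# Stage 2: delete the volume and create it again, using all free space.
recreate_rw() {
    run ubirmvol "/dev/ubi$UBI_NUM" -N "$VOL_NAME" &&
    run ubimkvol "/dev/ubi$UBI_NUM" -N "$VOL_NAME" -m
}

# Try a plain mount first, then escalate.
mount_or_recover_rw() {
    mount_rw && return 0
    wipe_rw && mount_rw && return 0
    recreate_rw && mount_rw
}
```

Note that both recovery stages throw away everything accumulated in rootfs_rw, which is acceptable here only because the factory image in rootfs_ro stays intact.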

We bind-mount the current rootfs into a temporary directory:

mkdir -p /tmp/aufs/rootfs_ro 
mount --bind / /tmp/aufs/rootfs_ro

We assemble the layer cake by mounting the aufs partition:

mount -t aufs -o br:/tmp/aufs/rootfs_rw:/tmp/aufs/rootfs_ro=ro none /aufs

After that, the new rootfs partition is available in /aufs.

Now for a neat trick: we move the rootfs_ro and rootfs_rw mount points into the new partition:

mount --move /tmp/aufs/rootfs_ro /aufs/aufs/rootfs_ro
mount --move /tmp/aufs/rootfs_rw /aufs/aufs/rootfs_rw

And while we are at it, we move /dev:

mount --move /dev /aufs/dev

Obviously, the directories the mount points are moved to have to be created in advance.
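One way to guarantee this is to create the directories when the rootfs image is assembled, since the image itself is mounted read-only at boot. The sketch below assumes a hypothetical staging directory from which the image is built; prepare_mountpoints is a made-up name. The paths follow from the commands above: /aufs/aufs/rootfs_ro inside the union is simply /aufs/rootfs_ro of the image.

```shell
#!/bin/sh
# prepare_mountpoints STAGING_ROOT
# Creates, inside the rootfs staging tree, every directory that the
# preinit script later uses as a mount point or as a move target.
prepare_mountpoints() {
    root=$1
    mkdir -p "$root/aufs" \
             "$root/aufs/rootfs_ro" \
             "$root/aufs/rootfs_rw" \
             "$root/dev" "$root/proc" "$root/sys" "$root/tmp"
}
```

It would be called once at image build time, for example prepare_mountpoints ./rootfs_staging, before the UBIFS image is generated. Creating the two inner directories at boot with mkdir -p right after the aufs mount should work as well, in which case they end up in the RW branch.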

We clean up after ourselves and unmount the system file systems:

umount -l /proc
umount -l /tmp
umount -l /sys

We change the root file system and start init:

exec /usr/sbin/chroot /aufs /sbin/init

In the production script we assemble a similar "cake" for /appl by the same principle and mount /backup. The figure below shows the resulting architecture of the final RFS.

image

For extra reliability, exclusive access to the /backup partition is given to exactly one utility, which is responsible for backup and recovery. The utility resides in the "data_ro" partition.

Conclusion


As a result, the overall survivability of the system on emergency power-off increased sharply. Although union mounting of the RFS is shown here using NAND as an example, the principle is not tied to the physical type of storage medium and is easily carried over to others: eMMC, SD. If the system does not accumulate data during operation and only runs a fixed algorithm (for example, an ordinary router), then it is reasonable to use a RAM disk as the RW branch when mounting the aufs partition.

And in place of a P.S.: nobody has cancelled backup power yet.


This article is a translation of the original post at habrahabr.ru/post/273425/