Continuing the topic of backup and recovery on storage systems with a new architecture, this article looks at the nuances of working with deduplicated data in a disaster recovery scenario where the protected storage systems perform their own deduplication. Specifically, it looks at how this space-efficient storage technology can help or hinder data recovery.
The previous article is here: Backup reefs in hybrid storage systems.
Since deduplicated data takes up less space on disk, it is logical to assume that backup and recovery should take less time. Indeed, why not back up and restore deduplicated data directly in its compact, deduplicated form? In that case:
- Only unique data goes into the backup.
- There is no need to rehydrate (reduplicate) data on the production system.
- There is no need to deduplicate the data again on the backup system.
- Conversely, only the unique data needed for reconstruction has to be restored. Nothing superfluous.
But a closer look shows that things are not so simple, and the direct path is not always the most efficient one, if only because general-purpose storage systems and backup storage systems use different kinds of deduplication.
Deduplication in general-purpose storage systems
Deduplication, as a method of eliminating redundant data and increasing storage efficiency, has been and remains one of the key directions of development in the storage industry.
The principle of deduplication.
For production data, deduplication is intended not so much to save disk space as to speed up data access by packing data more densely onto fast media. In addition, deduplicated data is convenient to cache.
A single deduplicated block in cache memory, on the top tier of multi-tier storage, or simply on flash can correspond to tens or even hundreds of identical user data blocks that previously occupied space on physical disks at completely different addresses and therefore could not be cached effectively.
Today deduplication on general-purpose storage systems is very effective and worthwhile. For example:
- A flash system (All-Flash Array) can hold significantly more logical data than its raw capacity would normally allow.
- In hybrid systems, deduplication helps to identify "hot" data blocks, because only unique data remains. The higher the deduplication ratio, the more accesses land on the same blocks, and so the more efficient multi-tier storage becomes.
Efficiency of solving a storage problem by combining deduplication and tiering. Each option achieves equal performance and capacity.
Deduplication in backup storage systems
Deduplication first became widespread in these systems. Because the same data blocks are copied to the backup system tens or even hundreds of times, eliminating the redundancy saves a substantial amount of space. In its time this drove the "offensive" of deduplicating disk libraries for backup against tape systems. Disk pushed tape back hard because the cost of storing backup copies on disk became very competitive.
The benefit of deduplicated backup on disk.
As a result, even such tape adherents as Quantum began to develop their own disk libraries with deduplication.
Which deduplication is better?
Thus, the storage world currently has two different deduplication methods: one in backup systems and one in general-purpose systems. They use different technologies, with variable and fixed blocks respectively.
The difference between the two deduplication methods.
Fixed-block deduplication is simpler to implement. It suits data that needs regular access, which is why it is used more often in general-purpose storage systems. Its main drawback is a weaker ability to recognize identical sequences in the overall data stream: two identical streams with a small offset between them will be treated as completely different and will not be deduplicated.
Variable-block deduplication recognizes repetitions in a data stream better, but it needs more processor resources to do so. It is also poorly suited to block or multithreaded data access. This is due to how the deduplicated information is stored: put simply, it is also stored in variable-size blocks.
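The offset problem can be illustrated with a small sketch. It is not how any particular storage system implements deduplication. The block size, the window size, the boundary mask, and the use of CRC32 over a sliding window are all assumptions chosen for brevity. Production systems typically use a true rolling hash and enforce minimum and maximum chunk sizes.

```python
import hashlib
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Content-defined chunking: cut after any position where a checksum of
    the preceding `window` bytes has its low bits equal to zero."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # tail after the last boundary
    return chunks


def shared_blocks(a: list, b: list) -> int:
    """Count distinct blocks (by SHA-256 digest) present in both lists."""
    digests_a = {hashlib.sha256(c).digest() for c in a}
    digests_b = {hashlib.sha256(c).digest() for c in b}
    return len(digests_a & digests_b)


rng = random.Random(42)  # fixed seed so the run is repeatable
stream = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"X" + stream  # the same stream, offset by one byte

fixed_shared = shared_blocks(fixed_chunks(stream), fixed_chunks(shifted))
variable_shared = shared_blocks(variable_chunks(stream), variable_chunks(shifted))
print("fixed-block matches:   ", fixed_shared)
print("variable-block matches:", variable_shared)
```

Because the variable boundaries depend only on the content of the last few bytes, they move together with the data, so every chunk after the first boundary is found again in the shifted stream. The fixed-block boundaries stay at the same absolute offsets, so none of the blocks line up.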
Both methods cope perfectly well with their own tasks, but with tasks they were not designed for, things are much worse.
Let's consider the situation that arises where these two technologies meet.
Problems of backing up deduplicated data
Because the two approaches do not coordinate with each other, backing up a storage system that already holds deduplicated data to a deduplicating backup system means the data is "reduplicated" every time and then deduplicated again as it is saved on the backup system.
For example, suppose 10 TB of deduplicated production data is physically stored with an overall ratio of 5:1. Then the following happens during backup:
- Not 10 TB but the full 50 TB is copied.
- The production system that holds the source data has to do the work of rehydrating ("reduplicating") the data. At the same time it has to keep the production applications running and sustain the backup data stream. That is three simultaneous heavy processes loading the I/O buses, cache memory, and processor cores of both storage systems.
- The target backup system has to deduplicate the data again.
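The arithmetic behind this example can be written out as a short sketch. The 10 TB and 5:1 figures come from the example above. The throughput of 2 TB per hour is purely an illustrative assumption, as is the function name.

```python
def rehydrated_backup(physical_tb: float, dedup_ratio: float,
                      throughput_tb_per_hour: float) -> dict:
    """Estimate how much data a full backup moves when the source array
    rehydrates deduplicated data before sending it."""
    logical_tb = physical_tb * dedup_ratio  # size after rehydration
    return {
        "logical_tb": logical_tb,
        "extra_tb": logical_tb - physical_tb,  # traffic caused by rehydration
        "hours_rehydrated": logical_tb / throughput_tb_per_hour,
        "hours_if_deduplicated": physical_tb / throughput_tb_per_hour,
    }


# 10 TB physically stored at 5:1; throughput is an assumed value.
result = rehydrated_backup(physical_tb=10, dedup_ratio=5, throughput_tb_per_hour=2)
print(result)
```

With these assumed numbers the backup stream carries 40 TB more than is physically stored, and the transfer alone takes five times longer than it would if the data could travel in deduplicated form.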
In terms of processor resources, this is like pressing the gas and the brake at the same time. The question arises: can this be optimized somehow?
The problem of restoring deduplicated data
When restoring data to volumes with deduplication enabled, the whole process has to be repeated in the opposite direction. Not all storage systems do this "on the fly"; many solutions use the "post-process" principle. That is, data is first written to physical disks (or flash) as is, then the data blocks are analyzed and compared, duplicates are identified, and only then is the cleanup done.
Comparison of in-line and post-process deduplication.
This means that on the first pass the storage system may not have enough space for a full restore of all the non-deduplicated data. The restore then has to be done in several passes, each of which can take a long time: the restore time plus the time for deduplication to free up space on the general-purpose storage system.
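A rough model shows how the number of passes grows when capacity is tight. It is a simplification under stated assumptions: the array starts empty, each pass fills all currently free space with rehydrated data, post-process deduplication finishes completely before the next pass starts, and the deduplication ratio is uniform across the data. The function name and the sample figures are hypothetical.

```python
def restore_passes(logical_tb: float, dedup_ratio: float,
                   raw_capacity_tb: float) -> int:
    """Count restore passes on an array with post-process deduplication."""
    if dedup_ratio < 1:
        raise ValueError("dedup_ratio must be at least 1")
    if logical_tb / dedup_ratio >= raw_capacity_tb:
        # Without headroom above the deduplicated size the restore never completes.
        raise ValueError("array too small even for the deduplicated data")

    used_tb = 0.0            # deduplicated data already on the array
    remaining_tb = logical_tb
    passes = 0
    while remaining_tb > 0:
        free_tb = raw_capacity_tb - used_tb
        batch_tb = min(free_tb, remaining_tb)  # written in rehydrated form
        used_tb += batch_tb / dedup_ratio      # space left after deduplication
        remaining_tb -= batch_tb
        passes += 1
    return passes


# 50 TB logical at 5:1 restored to a 20 TB array, then to a 50 TB array.
print(restore_passes(50, 5, 20))
print(restore_passes(50, 5, 50))
```

In this model an array that would comfortably hold the 10 TB of deduplicated data still needs several restore-then-deduplicate cycles, while an array with room for the full logical volume finishes in one pass.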
This scenario relates not so much to restoring data from a backup (data recovery), which minimizes risks of the data loss class, as to recovery after a catastrophically large data loss (classified as a disaster). Such disaster recovery is, to put it mildly, far from optimal.
Besides, after a catastrophic failure it is not at all necessary to restore all the data at once. It is enough to start with only the most essential.
As a result, backup, which is supposed to be the last resort that people turn to when nothing else has worked, does not work optimally with a deduplicating general-purpose storage system.
Why, then, have a backup at all if in a disaster it can only be restored with enormous effort and almost certainly not completely? After all, there are replication tools (mirroring, snapshots) built into the production storage system that have no significant effect on performance (for example, VNX Snapshots, XtremIO Snapshots). The answer to this question is still the same. Still, any normal engineer would try to optimize and improve this situation somehow.
How to combine the two worlds?
The old way of handling data during backup and recovery looks strange, to say the least. So there have been many attempts to optimize backup and recovery of deduplicated data, and a number of problems have been solved.
Here are some examples:
- Windows 2012 Deduplication with Networker
- Windows Server 2012 Deduplication and Backup Exec
- Backup and Restore Considerations for Deduplicated Volumes
- Backup and Restore of Data Deduplication-Enabled Volumes
But these are only "patches" at the level of operating systems and individual, isolated servers. They do not solve the problem at the general hardware level in storage systems, where it is genuinely hard to do.
The point is that general-purpose storage systems and backup systems use different, purpose-built deduplication algorithms, with fixed and variable blocks.
On the other hand, a full backup is not always required, and a full restore even more rarely. Nor is it necessary to deduplicate and compress all production data. Nevertheless, the nuances have to be kept in mind, because catastrophic data loss has not gone away, and standard industry solutions, which regulations require to be in place, are designed to guard against it. So if data cannot be restored from a backup in a reasonable time, it can cost the people responsible dearly.
Let's consider how best to prepare for such a situation and avoid unpleasant surprises.
- Use incremental backups and synthetic full copies whenever possible. In Networker, for example, this capability is available starting with version 8.
- Allow more time for a full backup to account for data rehydration. Choose a time when the system's processors are least utilized. During backups it is best to monitor processor utilization on the production storage system; ideally it should not exceed 70%, at least on average over the backup period.
- Apply deduplication thoughtfully. If data does not deduplicate or compress, why spend processor power on it during backup? If the system always deduplicates, it has to be powerful enough to cope with all the work.
- Take into account the processor power allocated to deduplication in the storage system. This feature is found even in entry-level systems, which cannot always cope with running all tasks simultaneously.
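Choosing the backup window from utilization history can be sketched as follows. The hourly samples are invented for illustration, the function name is hypothetical, and the window is assumed not to wrap around the end of the sample list. The 70% limit is the guideline from the list above.

```python
def quietest_window(hourly_cpu_percent: list, hours: int) -> tuple:
    """Return (start_index, average_utilization) of the contiguous window of
    `hours` samples with the lowest average processor utilization."""
    if not 0 < hours <= len(hourly_cpu_percent):
        raise ValueError("window must fit inside the sample list")
    best_start, best_avg = 0, float("inf")
    for start in range(len(hourly_cpu_percent) - hours + 1):
        avg = sum(hourly_cpu_percent[start:start + hours]) / hours
        if avg < best_avg:
            best_start, best_avg = start, avg
    return best_start, best_avg


# Invented utilization samples for eight consecutive hours, in percent.
samples = [80, 75, 40, 30, 35, 60, 85, 90]
start, average = quietest_window(samples, hours=3)
within_limit = average <= 70  # guideline: at most 70% on average
print(start, average, within_limit)
```

The samples here reflect load without the backup running, so the chosen window still has to be checked against the 70% guideline once the backup and rehydration load is added on top.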
Full data recovery, Disaster Recovery
- Prepare a sound Disaster Recovery or Business Continuity Plan that accounts for the behavior of storage systems with deduplication. Many vendors, including EMC, as well as system integrators, offer such planning services, because every organization has a unique combination of factors that affect how application operation is restored.
- If the general-purpose storage system uses a post-process deduplication mechanism, I would recommend reserving a buffer of free capacity in it in case of a restore from backup. For example, the buffer can be sized at 20% of the logical capacity of the deduplicated data. Try to maintain this level at least on average.
- Look for opportunities to archive old data so that it does not get in the way of a fast restore. Even if deduplication works well and efficiently, do not wait for a failure after which you have to restore from backup and fully deduplicate volumes of many tens of TB. It is better to move all non-operational, historical data to an online archive (for example, one based on Infoarchive).
- In-line ("on the fly") deduplication in general-purpose storage systems has a speed advantage over post-process. It can be especially important when recovering from a catastrophic loss.
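The 20% buffer guideline for post-process systems is easy to check against the current free capacity. The sketch below uses hypothetical function names and sample figures; only the 20% default comes from the recommendation above.

```python
def restore_buffer_tb(logical_tb: float, buffer_percent: float = 20) -> float:
    """Free capacity to reserve, as a percentage of the logical capacity
    of the deduplicated data."""
    return logical_tb * buffer_percent / 100


def buffer_shortfall_tb(free_tb: float, logical_tb: float,
                        buffer_percent: float = 20) -> float:
    """How much free capacity is missing; 0.0 means the buffer is in place."""
    return max(0.0, restore_buffer_tb(logical_tb, buffer_percent) - free_tb)


# 50 TB of logical data: the guideline asks for a 10 TB buffer.
print(restore_buffer_tb(50))
print(buffer_shortfall_tb(free_tb=6, logical_tb=50))
print(buffer_shortfall_tb(free_tb=12, logical_tb=50))
```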
These are some of my thoughts on backup and recovery of deduplicated data. I will be glad to hear your feedback and opinions on the matter.
I should add that one interesting special case, which deserves separate consideration, has not been covered here yet. So, to be continued.
This article is a translation of the original post at habrahabr.ru/post/270907/