Today almost every company that actively uses IT to support its business has a data center. Rising requirements for data center reliability are one of the market's trends. Because the data center is often a key element of the company's business, specialists have long been looking for economical ways to make it more reliable. Sooner or later the need arises to ensure not only the hardware reliability of the data center but also its disaster resilience.
According to EMC research, 82% of organizations worldwide are not fully confident that they will be able to recover their systems and data. Unplanned downtime and data loss cost organizations worldwide more than 1.7 billion dollars a year. According to Acronis research, only 2% of the companies surveyed in Russia are absolutely sure that their IT infrastructure will withstand any test, and 49% of Russian specialists expect long interruptions in its operation in the event of a natural disaster or accident.
According to foreign statistics, the most frequent causes of accidents are hardware failures (24%), power supply failures (16%), hurricanes (16%), and floods (15%).
In any technically complex system accidents are inevitable, but they can be made non-critical for the business. To that end, disaster-resilient systems are built: backup capacity is deployed at a geographically remote site. But first, let us define the terms that describe high-reliability systems.
Disaster resilience (Disaster Recovery, DR) is the ability to recover after an accident, that is, resistance to the impact of natural disasters and acts of terrorism. Fault tolerance (FT) is the property of a system to keep working after one or several of its components fail. High availability (HA) means that systems are able to perform the required function under specified conditions at a given moment or over a given interval of time. Business continuity (BC) covers the processes, methods, and equipment for the uninterrupted execution of critical business functions. And, finally, three objectives: RTO (Recovery Time Objective) is the time within which the IT system can be recovered; RPO (Recovery Point Objective) is how much data will be lost during disaster recovery; RCO (Recovery Capacity Objective) is the share of the load that the standby system must be able to carry.
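To make RPO and RTO concrete, here is a minimal sketch that checks a single incident against both objectives. The function name `check_recovery`, the timestamps, and the objective values are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def check_recovery(last_copy_at, failed_at, restored_at, rpo, rto):
    """Return (data_loss_window, downtime, rpo_met, rto_met) for one incident."""
    # Everything written after the last usable copy is lost.
    data_loss_window = failed_at - last_copy_at
    # Downtime runs from the failure until service is restored.
    downtime = restored_at - failed_at
    return data_loss_window, downtime, data_loss_window <= rpo, downtime <= rto


# Hypothetical incident: nightly copy at 02:00, failure at 09:30, service back at 11:00.
incident = check_recovery(
    last_copy_at=datetime(2016, 1, 10, 2, 0),
    failed_at=datetime(2016, 1, 10, 9, 30),
    restored_at=datetime(2016, 1, 10, 11, 0),
    rpo=timedelta(hours=4),   # assumed objective: lose at most 4 hours of data
    rto=timedelta(hours=2),   # assumed objective: be back within 2 hours
)
print(incident)
```

In this made-up case the RTO is met (1.5 hours of downtime), but the RPO is not: 7.5 hours of changes are lost, so copies would have to be taken more often or replaced by replication.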
Protection against natural and man-made disasters or acts of terrorism, together with the continuity of business processes, requires redundancy of the main data storage and processing systems. An accident can damage the data center building itself, so a geographically remote site, a backup data center, is necessary. When Tier III reliability is not enough, a geographically distributed, disaster-resilient data center infrastructure can deliver four or even five nines of availability.
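As a rough illustration of what "four or five nines" means, the sketch below converts an availability percentage into allowed downtime per year. The second helper assumes that two sites fail independently and that failover between them is instantaneous; both are simplifying assumptions, not properties of any real deployment.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def combined_availability(a, b):
    """Availability (as fractions) of two sites when either one can serve alone.

    Assumes independent failures and instant, flawless failover.
    """
    return 1 - (1 - a) * (1 - b)


for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes_per_year(nines):.2f} min/year")

# Two hypothetical sites, each available 99% of the time.
print(f"{combined_availability(0.99, 0.99):.4%}")
```

Four nines leave roughly 53 minutes of downtime per year and five nines about 5 minutes, which is why a single site is rarely enough for such targets.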
Distributing a data center across several sites requires redundant communication channels, data replication between storage systems, and planning of backup and system recovery. A data synchronization mechanism is needed to keep data current if one of the nodes fails and to support those information systems that require such synchronization. Besides bandwidth, the critical parameter is data transfer latency.
Two main strategies are possible for distributed data centers: "active/active", in which infrastructure applications and services are distributed between sites and users work with the nearest data center, and "active/passive", in which applications are centralized and users work with the main node. If systems fail, the load switches automatically to the backup data center. Which strategy can be applied depends on the application.
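The difference between the two strategies can be sketched as a simple routing rule. The `Site` class, the `route` function, and the site names below are hypothetical; a real deployment would rely on cluster software or load balancers rather than code like this.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool
    distance_km: float  # distance from the user, used only for active/active


def route(sites, strategy):
    """Pick the site that should serve a user.

    `sites` is listed in priority order: main site first, backup after it.
    """
    healthy = [s for s in sites if s.healthy]
    if not healthy:
        raise RuntimeError("no healthy site available")
    if strategy == "active/active":
        # All sites serve traffic; the user goes to the nearest healthy one.
        return min(healthy, key=lambda s: s.distance_km)
    if strategy == "active/passive":
        # Only the highest-priority healthy site serves traffic.
        return healthy[0]
    raise ValueError(f"unknown strategy: {strategy}")


main = Site("main", healthy=True, distance_km=350)
backup = Site("backup", healthy=True, distance_km=20)

print(route([main, backup], "active/active").name)   # nearest site: backup
print(route([main, backup], "active/passive").name)  # main site
main.healthy = False
print(route([main, backup], "active/passive").name)  # failover: backup
```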
A disaster-resilient data center is often built around a geographically distributed server cluster connected to a shared storage area network (SAN). The nodes of the stretched cluster are placed at the main and backup sites and form a single system. This keeps services continuously available even if the main data center is lost. Clustering makes it possible to switch the load automatically between the sites of the distributed data center in the event of an accident. A more economical variant is also possible, in which the remote data center runs in standby mode and, if the main data center fails, supports a limited set of services.
Depending on the distance and the solution architecture, Ethernet, MPLS, or IP can be used for communication between sites. With synchronous replication the distance between data centers can be up to 80-100 km; it is limited by the network delays that applications can tolerate. With synchronous replication the application receives confirmation that an I/O operation has completed only after it has been executed on both sides. Using FCIP through separate switches, it is also possible to organize asynchronous interaction between data centers thousands of kilometers apart and to use hardware traffic compression. At distances over 100 km FCIP works faster than the Fibre Channel (FC) protocol. With FCIP, Fibre Channel packets are encapsulated in TCP/IP and then transmitted through an IP tunnel. FCIP is the main practical way of connecting data centers when carrying FC over dark fiber or xWDM is impossible or impractical. Both direct connection of FCIP devices to each other and connection through a WAN are supported.
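The distance limit for synchronous replication follows from signal propagation time. The sketch below estimates the round-trip delay that the link adds to every acknowledged write, assuming light travels through optical fiber at roughly 200,000 km/s. It ignores switch, protocol, and storage overhead, and real cable routes are longer than the straight-line distance, so actual figures will be higher.

```python
FIBER_KM_PER_MS = 200.0  # assumed propagation speed in fiber: ~200,000 km/s


def round_trip_ms(distance_km):
    """Propagation delay there and back over a fiber link of the given length."""
    return 2 * distance_km / FIBER_KM_PER_MS


def sync_write_latency_ms(local_write_ms, distance_km):
    """Latency of a synchronously replicated write.

    The write is acknowledged only after the remote side confirms it,
    so one full round trip is added to the local write time.
    """
    return local_write_ms + round_trip_ms(distance_km)


for km in (10, 100, 400, 3000):
    print(f"{km:>5} km: +{round_trip_ms(km):.1f} ms per write")
```

Under these assumptions 100 km adds about 1 ms to each write, while several thousand kilometers add tens of milliseconds, which is one reason asynchronous replication is used over long distances.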
The key element of a disaster-resilient solution is a geographically distributed data storage system. Modern storage systems provide built-in tools for building disaster-resilient solutions. For example, the storage systems at the two sites can fully duplicate each other while the sites are connected by redundant high-speed links, which makes it possible to implement projects with the highest requirements for data transmission reliability and data availability, including synchronous data replication. Alternatively, data can be replicated asynchronously.
Some storage systems can "stretch" volumes between sites by means of the disk array itself. The result is an inexpensive disaster-resilient solution that does not require rebuilding the storage architecture. Another option is to use Microsoft Azure cloud infrastructure for data backup.
The typical scenario is a backup data center in another city within the same region, 300-400 km away. IP, MPLS/VPLS, or DWDM is used for LAN connectivity, and FCIP or DWDM for SAN connectivity. In this case a number of "metrocluster" technologies can be applied, along with asynchronous replication. Synchronous replication over such a distance comes with restrictions and requires additional tools. When sites are thousands of kilometers apart, the term used is "geocluster".
Clustering methods are offered by suppliers of operating systems and virtualization environments, application developers, and vendors of IT systems and network equipment. For example, a metrocluster based on VMware vSphere rests on duplicating storage systems at two geographically separated sites, with optional load balancing at the data center network level. If one of the data centers becomes unavailable, its virtual machines are started automatically at the second site. The recovery time of the virtual environment (RTO) in this case is usually several minutes.
Keep in mind that implementing a DR strategy requires serious investment. Such a project usually involves large financial costs, and justifying and approving the creation of a system of this class is very difficult. It is also quite likely that you will never use the disaster recovery plan. However, in an emergency a good recovery plan will save time and money and help minimize losses from downtime. A serious accident can lead to the loss of the data center, and that is a serious problem for the business. According to world statistics, 93% of companies that lost their data center for just 10 days go bankrupt within a year.
A balance has to be found between the cost of maintaining disaster resilience and the losses the business would suffer in an accident, taking into account the time needed to fully restore all business processes. The illustrations given below will help to get some idea of the cost of implementing a distributed data center in a company, to estimate the unavoidable costs more precisely, and to avoid possible misunderstanding on the part of management. In general, the dependence is as follows: the shorter the required recovery time, the more expensive the data protection methods (according to Gartner).
When the subject is not just backup and recovery of systems and data but disaster resilience, the choice of a solution that is optimal in one parameter or another is also always a compromise (according to Compulink).
Only a high-availability system has zero RTO/RPO. Naturally, it is the most expensive option (according to Cisco).
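The balance between the cost of a solution and the losses from downtime can be sketched as a simple comparison. All figures below (solution costs, recovery times, loss per hour, incident frequency) are invented for illustration and do not come from the Gartner, Compulink, or Cisco materials; the model also ignores the cost of lost data (RPO).

```python
def expected_annual_cost(solution_cost, rto_hours, loss_per_hour, incidents_per_year):
    """Yearly cost of a DR option: what it costs plus expected downtime losses."""
    return solution_cost + incidents_per_year * rto_hours * loss_per_hour


# Hypothetical options: (annual solution cost, RTO in hours).
options = {
    "backup and restore": (10_000, 48),
    "asynchronous replication": (60_000, 4),
    "active/active": (200_000, 0),
}

LOSS_PER_HOUR = 20_000     # assumed business loss per hour of downtime
INCIDENTS_PER_YEAR = 0.2   # assumed: one serious incident every five years

costs = {
    name: expected_annual_cost(cost, rto, LOSS_PER_HOUR, INCIDENTS_PER_YEAR)
    for name, (cost, rto) in options.items()
}
best = min(costs, key=costs.get)

for name, total in costs.items():
    print(f"{name}: {total:,.0f}")
print("lowest expected cost:", best)
```

With these made-up inputs the middle option wins; a higher loss per hour or more frequent incidents would shift the balance toward the more expensive, faster-recovering solutions.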
Ensuring disaster resilience has always required substantial costs in both time and money. It takes two geographically separated sites, a fast communication channel between them, a data transmission network, storage systems with replication support, computing capacity, and engineering equipment for uninterrupted power supply and cooling of the data center. A staff of highly skilled IT specialists who can configure and support all of this is also required. Attention must be paid to system design, implementation, and testing. However, this task can also be solved without large capital investments.
Virtualization, clouds, and disaster resilience
With the spread of virtualization and cloud computing, new methods of protection against accidents have appeared:
- Replication to the cloud. Private and public cloud technologies have simplified replication between sites. Replication can cover all virtual machines, specific databases, or data snapshots. In addition, cloud computing helps organizations choose the DR option that best fits their finances: the acceptable downtime has become a flexible choice. In other words, it is often possible to choose an acceptable downtime and still stay within budget.
- Virtualization as a backup and recovery mechanism. The idea is simple: it is much easier to restore a virtual machine than a physical server. VM snapshots can be stored in the backup data center, or virtual machines can be mirrored. In the latter case the result is an "active/active" configuration: if a VM running critical tasks fails at the main site, the workload switches to the same VM at the backup site.
- Software-defined (SD) technologies. In essence, this is an evolution of virtualization. Software-defined platforms (network equipment, storage systems, security controls, load balancers, and so on) provide a flexible fault-tolerant environment in which "virtual appliances" with various functions run as virtual machines on standard servers. For example, if Global Server Load Balancing (GSLB) is used for DR, users can be switched automatically to the backup site when the main one fails. The process is transparent to users.
- IaaS (infrastructure on demand). Cloud platforms and virtualization environments make it possible to allocate the necessary IT resources quickly. Fast recovery of virtual machines and data is important for DR, and cloud computing and virtualization suit this purpose very well. Very economical IaaS solutions can be built, either "active/active" or "active/passive". For example, regular backup of VMs and data to the provider's data center is set up. In the event of an accident a new environment is deployed: the VMs are started with their backed-up data. The process is not instant, but it is fairly fast. The main thing in IaaS is flexibility. When a DR strategy is being implemented, the provider will help the customer get the most out of this flexibility.
According to an EMC survey of Russian companies conducted in 2014, only 6% of respondents rely on the "active/active" mode of operation. These companies experience data loss less often than those that rely on backup: 13% versus 24%.
If a company cannot build its own DR infrastructure, outsourcing can be a reasonable solution. Such services are readily available today. Interestingly, according to the recently published Computer Economics study IT Outsourcing Statistics 2015/2016, disaster recovery has the highest cost-saving potential for clients among the most widespread types of outsourcing. This was noted by 92% of the organizations surveyed.
Backup data center as a service
Instead of building its own backup site, an organization can use a provider's virtual (cloud) data center, or give up its own data center altogether and move to a cloud model. This option suits many organizations. This modern approach to disaster resilience is called "backup data center as a service" (Disaster Recovery as a Service, DRaaS). DRaaS shields business processes from the impact of accidents, ensures uninterrupted operation, and relieves the client of many material and organizational concerns.
The reliability of the provider's services is ensured by two (or more) geographically remote data centers housed in purpose-built, highly reliable buildings. With a fully virtual data center, each of them runs a copy of the client's virtual data center: one main and one backup. All changes in the main copy are reflected in the backup copy in real time. The failure of either copy does not affect the organization's work in any way. When an accident occurs, the backup data center is connected instantly in place of the main one, and all employees and clients continue to work as usual. According to an OSP Data survey, more than half (54%) of Russian respondents consider it important that a service provider has several geographically remote data centers to ensure disaster resilience.
One example of the approach described above is BDC (Backup Data Center), the "backup data center service package" from the SAFEDATA company. In fact, it goes beyond DRaaS: it is a full set of services for designing and building a backup site for the customer's main data center. The customer's main data center can be either physical infrastructure hosted at the customer's site or virtual IT infrastructure.
SAFEDATA has its own distributed network of data centers in Moscow, connected by its own fiber-optic lines. This allows it to offer customers not only the creation and hosting of a backup site but also hosting of a main data center distributed between two remote sites.
BDC services can include hosting of equipment and virtual computing resources, provision of fiber-optic lines and L2 channels, data synchronization between the two sites, Internet channels with guaranteed bandwidth, protection against DDoS attacks, and backup.
By choosing a SAFEDATA data center as the site for a backup data center, the customer gets access to expertise in data center design, construction, and maintenance, as well as a dedicated round-the-clock technical support service. Office and warehouse space can also be leased.
Many disaster-resilient solutions and services are on offer today. Please share in the comments what you use and how these solutions have helped you.
This article is a translation of the original post at habrahabr.ru/post/273947/