Developers Club geek daily blog

2 years, 7 months ago
Data center downtime is an extremely expensive affair: an outage of even a few seconds can lead to serious financial and reputational losses. Two recent accidents proved it once again. Two large data centers were affected, one in Great Britain and the other in Azerbaijan.

Accidents at data centers in Azerbaijan and Great Britain

Almost the entire population of Azerbaijan lost Internet access

A fire broke out in one of the data centers of the Delta Telecom company. The downtime lasted eight hours. After the incident, Internet services could be reached only through the channels of the local mobile operators Bakcell and Azerfon.

The outage was caused by a fire at a Delta Telecom data center in Baku. According to the company's official statement, several cables caught fire in an old data center. Fire and emergency services took part in putting out the fire. Because of the incident, banking was almost paralyzed: transactions were not processed, and ATMs and payment terminals stopped working. Mobile communication was unavailable in many regions.

The outage began on November 16 at 16:00 local time. It was the first Internet-related accident of this scale in Azerbaijan. Eliminating its effects took 5 hours, and service to users was restored only close to midnight local time.

According to Renesys, a company that tracks Internet connectivity, 78% of Azerbaijan's networks went down, which amounts to more than six hundred networks. These networks relied on the key connection between Delta Telecom and Telecom Italia Sparkle. Renesys experts say that Azerbaijan is one of the states at high risk of an Internet shutdown, because only a small number of networks connect the country to external traffic exchange points. The same is currently true of many states in the region, such as Iran, Georgia, Armenia and Saudi Arabia.

That said, in recent years Azerbaijan has been actively developing its telecommunication infrastructure thanks to revenue from oil and gas sales, and it also takes part in building the Trans-Eurasian Information Highway (TASIM).

The data center of colocation provider Telecity and the UPS problem

According to a number of studies, 65 to 85% of unplanned data center downtime is caused by faults in UPS systems. For this reason, special attention should be paid to periodic monitoring of these elements of data center infrastructure, as well as to timely maintenance and replacement of the batteries.

Perhaps the engineers of the European colocation provider Telecity Group are not attentive enough to their uninterruptible power supplies. Nearly two weeks ago the company "upset" the clients renting space in the machine halls of its commercial data center in London twice. Two power failures in a row at the Sovereign House data center caused discontent among numerous tenants, including the London Internet Exchange and AWS Direct Connect (a service that lets third-party companies connect to the Amazon cloud over private network connections).

The utilities were to blame. The problems at the data center, which is located in the Docklands area east of central London and handles about 10% of Great Britain's Internet traffic, began through their fault. After the first failure of the mains supply, the data center infrastructure could not switch to the backup generators automatically. Mains power was later restored for a while, and on Wednesday morning repair of the UPS system began. But then the electricity went out again, and once again the infrastructure did not switch to the diesel generators automatically. The problems did not go unnoticed by British businesses and ordinary users, who complained about trouble with VoIP services and web hosting, as well as with the AWS platform.

About accidents in data centers

Many mid-level specialists are willing to talk off the record, but data center management, as a rule, strictly forbids any discussion of incidents.

A data center that has been on the market for three to five years without a single accident is most likely unique. Accidents happen everywhere; the only difference is in the consequences. In the Western market, a data center manager who has lived through an accident becomes more valuable, because that person already has experience of overcoming difficulties and will be more careful and more motivated to prevent accidents in the future. In our market, managers are most often ready to fight to the bitter end to keep information about incidents from becoming public, even though the consequences can be severe and a service outage cannot be hidden from clients in any way. Some international organizations maintain databases of data center incidents. However, access to them is granted only through membership in closed clubs, and their members are not particularly willing to share this invaluable information.

An analysis of the main causes of data center accidents puts two types of errors "in the lead": those related to the human factor and failures of equipment components. Even a project designed to high reliability requirements, with redundant equipment or engineering systems, is not insured against an accident caused by human error at the design stage or during operation with failed equipment. The slightest error, a short outage or an accident can cost a company billions of dollars. That is why many companies that respect themselves and their clients commission independent engineering reviews of the documentation even before construction, in order to find critical points of failure and work out ways to eliminate them in advance. There is also a stage of comprehensive pre-commissioning tests.

I. Schwartz, head of the system integration department at the Triniti group of companies, described the causes of accidents using an existing data center as an example (from the article by I. Schwartz, "Safety of infrastructure of data centers", journal "Algorithm of Safety", No. 3, 2015):
In 80% of cases I hear complaints that a server room is hard to cool, that something overheats, or that something happens to the power supply. Here is one such case:

A data center at a restricted-access enterprise, with a capacity of more than 1 MW, houses a computing cluster; the project cost more than $10 million. In-row cooling is used, and the components of the power supply, cooling and gas fire suppression systems are redundant, with N+1 and 2N reliability. Triniti was invited as an independent expert to analyze the causes of an accident at the data center.

The external nature of the equipment damage (considerable deformation of plastic elements, boiling and swelling of battery cells) indicates prolonged exposure to elevated temperature, lasting from tens of hours to several days.

[Figure: external nature of the damage]

Given the duration of the thermal exposure, the unambiguous conclusion is that the complex kept working actively while the cooling subsystem was stopped. Analysis of the logs of the UPSes, in-row air conditioners, chillers and the external power stabilizer showed the following. Before and during the accident there were no interruptions of external power, and there were no power interruptions on the clean lines (fed from the UPS), despite the shutdown of the battery pack and numerous transfers to bypass mode (without stabilization). When the air temperature rose above 50 °C, the valve pressure threshold was exceeded and the extinguishing agent was released from the cylinders of the automatic gas fire suppression system, which left the fire suppression system inoperable while the temperature kept rising. As it turned out, the accident was preceded by 20 hours of simultaneous operation of the two chillers; in normal mode such operation lasts no more than 25 seconds, during chiller rotation. Prolonged simultaneous operation of the two outdoor units of the cooling system overcooled the coolant, so the units shut down on the "Protection against Frosting-up Threat" fault, which also stopped the main circulation pumps. The additional circulation pump located in the machine hall cannot circulate the coolant on its own.

The lack of circulation caused an emergency stop of the in-row air conditioners and, as a result, a sharp rise of temperature in the hot aisle. Examination of all available logging systems established that the root cause of the accident was a problem with the power automation board. Incorrect operation of the first chiller, caused by the loss of the first phase on the power supply of chiller No. 1, led to the simultaneous start and operation of the second chiller.
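The abnormal condition in this case, two chillers running together far longer than a rotation takes, is something a simple watchdog over status samples could flag. The sketch below is only an illustration: the `ChillerSample` type, the `find_overlap_alarms` function and the sampling format are hypothetical, and only the 25-second figure comes from the case description.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# The case description says normal overlap during rotation is at most 25 seconds.
MAX_OVERLAP_SECONDS = 25.0


@dataclass(frozen=True)
class ChillerSample:
    """One status reading (hypothetical format, not taken from the real system)."""
    timestamp: float          # seconds since an arbitrary start
    chiller1_running: bool
    chiller2_running: bool


def find_overlap_alarms(
    samples: Iterable[ChillerSample],
    max_overlap: float = MAX_OVERLAP_SECONDS,
) -> List[Tuple[float, float]]:
    """Return (start, duration) for each period when both chillers ran
    together for longer than max_overlap seconds."""
    alarms: List[Tuple[float, float]] = []
    overlap_start = None
    last_timestamp = None
    for sample in samples:
        both = sample.chiller1_running and sample.chiller2_running
        if both and overlap_start is None:
            overlap_start = sample.timestamp
        elif not both and overlap_start is not None:
            duration = sample.timestamp - overlap_start
            if duration > max_overlap:
                alarms.append((overlap_start, duration))
            overlap_start = None
        last_timestamp = sample.timestamp
    # An overlap still in progress at the last sample counts as well.
    if overlap_start is not None and last_timestamp is not None:
        duration = last_timestamp - overlap_start
        if duration > max_overlap:
            alarms.append((overlap_start, duration))
    return alarms
```

A 20-second rotation produces no alarm, while an overlap that is still going after 100 seconds is reported with its start time and duration. In a real installation the alarm would have to reach a person or trigger an automatic action, which is the subject of the findings that follow.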

The reasons that allowed events to develop further and for so long were:

1) The design specification contained no requirements for a monitoring and notification system that took the security regime of the facility into account. Specifically, the "automatic shutdown and notification system (SAOO)" was designed to work with a duty operator through notifications over two channels: SMS (text notifications over public GSM networks) and email (electronic notifications over public Internet networks). Neither channel was connected because of the security regime of the facility.

2) When SAOO was put into operation, it was not switched to automatic operation for the case where no accident notification channels are available.

3) The "accident" signal line that the vendor (APC) provides as standard between the NetBotz environmental monitoring appliance and the UPS was disconnected.

4) An additional circuit for monitoring environmental parameters, with a signal output to the guard post, was not designed or installed.

5) The accident was discovered only when the volumetric motion sensors of the security alarm, which is wired to the guard post, were triggered: they registered the fall of melted air blanking panels and cabinet side panels.
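Findings 1, 2 and 4 come down to one rule: if no notification channel can reach an operator, the system has to act on its own, and a local signal to the guard post should not depend on external channels at all. A minimal sketch of that decision follows. It is not the logic of the real SAOO; the `Action` names, the `plan_response` function and the temperature threshold used in the example are assumptions made for illustration.

```python
from enum import Enum
from typing import List, Mapping


class Action(Enum):
    """Hypothetical responses of a monitoring system."""
    GUARD_POST_SIGNAL = "guard_post_signal"   # local signal, no external network needed
    NOTIFY_OPERATOR = "notify_operator"       # SMS or email to the duty operator
    AUTO_SHUTDOWN = "auto_shutdown"           # controlled shutdown without a human


def plan_response(
    temperature_c: float,
    alarm_threshold_c: float,
    channels: Mapping[str, bool],
) -> List[Action]:
    """Decide what to do for one temperature reading.

    channels maps a channel name (for example "sms", "email") to whether
    it is actually connected and usable at this facility.
    """
    if temperature_c < alarm_threshold_c:
        return []
    # The local signal is raised regardless of external channels.
    actions = [Action.GUARD_POST_SIGNAL]
    if any(channels.values()):
        actions.append(Action.NOTIFY_OPERATOR)
    else:
        # No way to reach a person: fall back to automatic operation.
        actions.append(Action.AUTO_SHUTDOWN)
    return actions
```

With both channels disabled, as they were at this facility, `plan_response(45.0, 35.0, {"sms": False, "email": False})` yields the guard post signal plus an automatic shutdown. The 35 °C threshold here is an arbitrary illustrative value.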

The conclusion from this case applies to the construction of any server room. The design specification must set requirements for a system that monitors environmental and power parameters, for a signal output to the guard post, for the communication channels used for notification, and for the independence of the circuit that monitors the key parameters critical to the operation of the LAN, servers, telephone exchange and other monitored equipment. The project must include a detailed test program and procedure for the commissioning stage that covers as many combinations of abnormal events as possible. As-built documentation must contain instructions for action in emergencies. Operating personnel must be trained. Phase monitoring relays must be used on the power supply of three-phase equipment.
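The last recommendation, phase monitoring on three-phase equipment, addresses the root cause of the case directly: the loss of one phase on chiller No. 1. A phase monitoring relay is a hardware device, but its basic decision can be sketched in a few lines. Everything in the sketch is an assumption for illustration: the function names, the 230 V nominal phase voltage and both thresholds. Real relays typically also check phase sequence, which is left out here.

```python
from typing import List, Sequence


def phase_faults(
    voltages: Sequence[float],
    nominal: float = 230.0,        # assumed nominal phase-to-neutral voltage
    min_ratio: float = 0.8,        # illustrative undervoltage threshold
    max_asymmetry: float = 0.1,    # illustrative asymmetry threshold
) -> List[str]:
    """Return a list of fault descriptions for three measured phase voltages."""
    if len(voltages) != 3:
        raise ValueError("expected exactly three phase voltages")
    faults: List[str] = []
    for index, voltage in enumerate(voltages, start=1):
        if voltage < nominal * min_ratio:
            faults.append(f"phase {index} undervoltage")
    if max(voltages) - min(voltages) > nominal * max_asymmetry:
        faults.append("phase asymmetry")
    return faults


def chiller_start_allowed(voltages: Sequence[float]) -> bool:
    """Permit a start only when no phase fault is detected."""
    return not phase_faults(voltages)
```

A lost first phase, for example `(0.0, 230.0, 230.0)`, is reported as an undervoltage on phase 1 together with an asymmetry, and `chiller_start_allowed` returns `False` for it.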

There are also errors of the "you can't believe your eyes" kind:
simple ignorance or inattention of personnel: two power supply units or power distribution units connected to the same power line instead of two independent lines; a server mounted in the rack back to front, so that its fans draw air from the hot aisle instead of the cold one; an emergency power-off button without proper labeling and protection, which led to a power cut by a new employee who thought he was just turning off the light... These errors would raise a smile if they were not so costly and did not take so much time.
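The first of these mistakes, both power supplies of a device on the same line, can be caught by checking an inventory of connections before it causes an outage. The sketch below assumes a hypothetical inventory format (device name mapped to the list of power lines its supplies are plugged into); the function name and the device and line names are made up for the example.

```python
from typing import Dict, List


def find_single_feed_devices(psu_feeds: Dict[str, List[str]]) -> List[str]:
    """Return sorted names of devices whose power supplies do not span
    at least two distinct power lines."""
    return sorted(
        name for name, feeds in psu_feeds.items() if len(set(feeds)) < 2
    )


# Hypothetical inventory: device -> power line of each of its supplies.
inventory = {
    "srv-01": ["line-A", "line-B"],   # correct: two independent lines
    "srv-02": ["line-A", "line-A"],   # both supplies on the same line
    "srv-03": ["line-B"],             # only one supply connected
}
at_risk = find_single_feed_devices(inventory)
```

Here `at_risk` contains `srv-02` and `srv-03`. Such a check is only as good as the inventory behind it, so it complements physical inspection rather than replacing it.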

This article is a translation of the original post at habrahabr.ru/post/272131/
