By a twist of fate, the accident described below happened just three months after the customer and the general designer received a warning letter from our company recommending that an external maintenance bypass be installed on the uninterruptible power supply (UPS). The warning drew no response.
While the data center was being built, our company supplied and installed a flagship Trinergy modular UPS system rated at 1 MW. Although this UPS has a built-in maintenance bypass, we recommended that the general contractor add an external, system-wide maintenance bypass, so that if an accident occurred and events unfolded badly, the UPS could be serviced completely without interrupting power to the load. The general contractor's specialists objected that the UPS already had a maintenance bypass that would allow its internal components to be serviced in any situation. Nothing foreshadowed trouble, and nobody expected that a brand-new system could be knocked out entirely.
Uptime Institute's approach to fault tolerance under Tier III requirements implies an external bypass, so that the internal bypass itself can be serviced. In this case the principle was ignored. The general contractor's specialists left the external bypass out of the system's bill of materials, either because the budget was tight or because they wanted a bigger margin.
Meanwhile, the facility was designed and built inside an existing building that was being adapted to the needs of a data center. And, as always, the conversion was rushed. The waterproofing in the old building was poor, but nobody redid it. Three months after the equipment was installed, spring meltwater began to pour onto the UPS, and the flooding came not from below but from above, practically from under the ceiling. A lot of water seeped in, a short circuit occurred inside the UPS, and the unit, "going off with a loud bang" (according to eyewitnesses), burned out.
Only then did it become clear that neither repairing the UPS nor redoing the waterproofing was possible without shutting down the entire data center: the UPS fed the data center centrally, through its built-in bypass. As a result, despite the high level of redundancy (a modular system with N+2 redundancy), once two power modules had failed the data center's power supply was no longer uninterruptible, and everyone became a hostage of the situation.
It should be noted that the UPS system showed itself at its best. It held on and did not drop the load. Only the power modules that took the most water burned out; the three remaining power modules, which took less water, stayed operational. But because the UPS carried the entire data center centrally, that is, all of the facility's power passed through it, and there was no external maintenance bypass, the UPS had to be de-energized to repair the failed power modules.
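The redundancy arithmetic of this incident can be sketched in a few lines. This is only an illustration, not a vendor tool: the function name `remaining_redundancy` is hypothetical, and the value N = 3 is an assumption inferred from the fact that three surviving modules kept carrying the load.

```python
def remaining_redundancy(required: int, spare: int, failed: int) -> int:
    """Return how many spare modules are left after `failed` modules are lost.

    required -- modules needed to carry the load (the N in N+2; assumed 3 here)
    spare    -- redundant modules installed on top of N (2 in this story)
    failed   -- modules that have burned out

    A negative result means the healthy modules can no longer carry the load.
    """
    installed = required + spare
    healthy = installed - failed
    return healthy - required


# Before the flood: two spare modules. After two modules burn out: none.
before = remaining_redundancy(required=3, spare=2, failed=0)
after = remaining_redundancy(required=3, spare=2, failed=2)
print(before, after)
```

With zero spare modules left, the load is still carried, but any further failure drops it, and without an external bypass the failed modules cannot be replaced while the load stays powered.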
As a result, painful as it was for the customer, a time had to be chosen to stop the data center, after which the UPS was fully restored and fitted with an external maintenance bypass. For the company that owns the data center, the shutdown was highly critical and painful.
This accident had several causes. The first is haste during construction and poor waterproofing. The batteries stood on archive shelving with plywood shelves; in other words, the data center was thoroughly eclectic: the most advanced equipment sat side by side with "artifacts" from the end of the last century.
In Uptime Institute terms, a system designed to Tier II requirements does not provide for servicing any element without shutting down the load, which this case demonstrated perfectly. This accident belongs to the class of incidents that cannot be resolved without stopping the data center.
This is a textbook case: the customer is warned about possible risks but prefers to wave them away, and then exactly the situation he was warned about occurs! Meanwhile, the cost of a maintenance bypass unit for a 1 MW UPS is incomparably small next to the losses from stopping the data center.
As a result, for a long time (more than half a year), while a moment for the data center shutdown was being chosen, all IT systems ran with no protection at all! So much for risk management. It must also be understood that, as with any machine that has been "drowned", the mean time between failures of the UPS system dropped sharply after the accident: its components began to fail more often than would be expected of a system that had not been through such stress.
A story like the next one might be expected from companies that service ventilation and air-conditioning systems. Coming from electricians it sounds improbable. Unfortunately, it is entirely true: an example of how, during data center construction, equipment is ruined at the construction stage, before it has ever been put into operation.
The data center is being built on the outskirts of the city. The contractor, trying to shorten the construction schedule, forces suppliers to deliver equipment to the site even though the site is still far from ready. At the same time, official reports go "upward" (to the customer) that the equipment is on site. Meanwhile, the most surprising things can happen to that equipment.
For example, at one such construction site a UPS was delivered well ahead of schedule. The unit stood unclaimed and unconnected for several months, and rodents (judging by the traces they left) did not fail to take advantage: they built a nest inside and settled in to live happily ever after. Workers ate their meals in the same room, and the rodents did not turn up their noses at the leftovers. The animals divided their apartment into zones: on one "floor" was the nest; on another they ate; and on the third, exactly where the printed circuit boards were located, they set up a toilet.
When the time came to connect the equipment, the customer's service staff, overcoming their disgust, climbed into the formally still-new equipment wearing respirators and rubber gloves. The UPS could not be started, of course: the printed circuit boards had been destroyed by the caustic liquid and needed to be replaced.
In the end it turned out that once the equipment has been delivered to the site, it no longer belongs to the supplier. The general contractor had accepted the equipment and had not yet begun to operate it, yet it was already out of order. What happened is the price of the customer's short-sighted and unreasonable demand to bring all of the specified equipment to the site at once, even though it was not yet needed. For those six months the equipment would have been stored far more safely in a warehouse than on site.
Authors: Sergey Ermakov, Stanislav Ilyenko
Read about more than 20 incidents that happened in Russian data centers in the new issue of the magazine ЦОДы.РФ, No. 13, devoted to the topic "Accidents in Data Centers".
This article is a translation of the original post at habrahabr.ru/post/272057/