Data centers are rated for fault tolerance on levels from I to IV. The ratings come from TIA (no demanding check, it is essentially self-declared) and from Uptime Institute (with strict certification). TIER III assumes the facility keeps working when any single node fails anywhere in the infrastructure. If it is a coolant pipe, there has to be a second identical one. If it is a fuel tank, there has to be a spare. If it is cooling, the chillers need N+1 redundancy, and so on.
First, compliance with TIER III was established at the design stage. We defended the documentation: roughly speaking, Uptime's engineers "crossed out" any node and checked whether the rest could keep working. Many pass this quest.
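The "cross out any node" review can be pictured with a small sketch. This is only an illustration, not Uptime Institute's actual method: the `Subsystem` type, the `single_points_of_failure` function and all unit counts are hypothetical, and a real review also looks at distribution paths and dependencies between systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subsystem:
    name: str
    required: int   # units needed to carry the full load (the "N")
    installed: int  # units actually present on site


def single_points_of_failure(subsystems):
    """Return names of subsystems that cannot lose one unit and still carry the load."""
    return [s.name for s in subsystems if s.installed - 1 < s.required]


# Hypothetical plant: the counts are made up for illustration.
plant = [
    Subsystem("chillers", required=3, installed=4),            # N+1
    Subsystem("coolant pipe loops", required=1, installed=2),  # second identical pipe
    Subsystem("fuel tanks", required=1, installed=1),          # no spare tank
]

weak = single_points_of_failure(plant)
print(weak)  # ['fuel tanks']
```

Here the single fuel tank is flagged: crossing it out leaves nothing to feed the generators, which is exactly the kind of finding the design review is after.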
The next step is to certify the finished facility, that is, to confirm that what was actually built matches the documentation and the fault-tolerance principles. This is the hardest part in Russia, because declaring something in a plan and building it are two very different things. Customers who had already brought their production systems on site added a special charm to the process. So passing that check is a big deal.
The third step: we obtained the operations certification, which confirms that the team and all processes comply with Uptime's principles. There are only 2 such data centers in Russia.
What else you need to know about these certificates
TIA TIER 3 is obtained "just like that", on the strength of a statement that "our project follows TIA recommendations". So from here on we set that type aside and talk only about TIER III from Uptime Institute.
There are three types of certificates. The design certificate is issued once for the project and expires in two years. The facility certificate is issued for the constructed facility and confirms that what was built is still TIER III and not, for example, TIER II. The facility certificate never expires. The third type is the operations certificate, under which the level of the data center is checked regularly.
Rechecks come every 1-3 years, depending on the readiness level shown at the last check. This frequency follows from the general rule that on average 70-90% of downtime is caused by the human factor. In other words, a 10-year-old data center without a freshly confirmed operations certificate can spring any kind of surprise. Operations certificates come in three levels: Gold, Silver and Bronze. If you pass the quest without a hitch, you get Gold, which requires a recheck in 3 years. If you pass with remarks, a "B" so to speak, you get Silver with a recheck in 2 years. Worst of all is Bronze: a bare pass, with the certificate expiring in 1 year.
We have received Gold.
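The recheck intervals described above reduce to a small lookup. The sketch below is only an illustration for keeping track of the next recheck date: the `RECHECK_YEARS` table restates the intervals from the text, while the `next_recheck` function and the issue date in the example are hypothetical.

```python
from datetime import date

# Recheck interval in years for each operations certificate level, as described above.
RECHECK_YEARS = {"Gold": 3, "Silver": 2, "Bronze": 1}


def next_recheck(level, issued):
    """Return the date by which the operations certificate must be reconfirmed."""
    years = RECHECK_YEARS[level]
    try:
        return issued.replace(year=issued.year + years)
    except ValueError:
        # Issued on Feb 29 and the target year is not a leap year.
        return issued.replace(year=issued.year + years, day=28)


# Hypothetical issue date, for illustration only.
print(next_recheck("Gold", date(2015, 10, 1)))  # 2018-10-01
```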
How the check went
The guys from Uptime first came to certify our facility (after we had built it from the certified design). At that point it was too early to go for the third certificate, the operations one: by my estimate, it takes about a year after a data center is launched to settle all the processes and fully train the operations team.
A bit later we invited them again for an audit ahead of the certification. The point of the audit is to check what is wrong and what needs finishing, and to give a lot of recommendations on improving operations. That is exactly how it went in our case.
Ten months later they came once more, for three days. For the first few hours they simply walked around the facility, got their bearings, looked into various corners, ran their fingers over hard-to-reach spots, and were pleased in every way. Then the whole crowd sat down in our administrators' room (a warm office with a kitchen) and buried themselves in documentation. For two days they did nothing but check the paperwork against itself, plus how well people knew it.
Another activity: they would call in particular engineers (dispatchers, for example) and ask, "Such-and-such an accident has happened, what will you do?". The engineer answered according to the response procedures and was let go.
What the certification checks in general
- Operational load on the staff. For example, they spent quite a long time combing through our shift managers' work schedules to make sure none of them worked more than labor law allows for the position. They verified every shift and the log entries (who exactly was on shift), and then totaled the monthly hours worked.
- Knowledge of emergency procedures (who does what).
- Whether formal certificates, diplomas and so on match the positions held. Who is responsible for fire safety, for first aid, etc., and whether their knowledge is current.
- Duty regulations and whether they are up to date, descriptions of all processes and procedures, instructions for each case.
- Equipment inspection and maintenance procedures in general: that all instructions are followed exactly and cover the necessary processes for the specific facility. In our case, that all instructions matched the actual arrangement of the units and covered all situations. Procedures for opening and closing shifts, entering data on the equipment, testing procedures, etc.
- How staff training is organized and how regular failure drills are carried out.
- How the internal library of "operating experience" is kept up to date, how the processes for expanding power supply and cooling are arranged, how equipment is moved in and out, etc.
In our case they dug most of all into the personnel data and the shift logs. This certification touches the equipment only minimally: it is assumed that everything was covered when the Facility certificate was obtained.
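The shift-log check described in the list above, totaling each person's monthly hours and comparing them with a limit, can be sketched roughly as follows. This is only an assumption about how one might reproduce it in-house: the log format, the `monthly_hours` and `over_limit` functions, the names and the `limit_hours` value are all hypothetical, and each shift is simply counted in the month in which it starts.

```python
from collections import defaultdict
from datetime import datetime


def monthly_hours(shift_log):
    """Sum hours per (person, year, month); a shift counts toward the month it starts in."""
    totals = defaultdict(float)
    for person, start, end in shift_log:
        hours = (end - start).total_seconds() / 3600
        totals[(person, start.year, start.month)] += hours
    return dict(totals)


def over_limit(shift_log, limit_hours):
    """Return sorted (person, year, month) keys whose total exceeds the limit."""
    return sorted(k for k, h in monthly_hours(shift_log).items() if h > limit_hours)


# Hypothetical log entries: (person, shift start, shift end).
log = [
    ("engineer_a", datetime(2015, 9, 1, 8), datetime(2015, 9, 1, 20)),
    ("engineer_a", datetime(2015, 9, 3, 8), datetime(2015, 9, 3, 20)),
    ("engineer_b", datetime(2015, 9, 2, 8), datetime(2015, 9, 2, 20)),
]

# The limit is a made-up number; the real one depends on the position and labor law.
print(over_limit(log, limit_hours=20))  # [('engineer_a', 2015, 9)]
```

Keeping the limit as a parameter is deliberate, since the auditors compared hours against whatever applies to each position.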
As I said, it is best to go through it about a year after a new team starts operations, because one part of the check is how people found flaws in the design (or developed the data center beyond the design), how they learned the equipment, and what they fixed "live" on the running data center.
Among the shortcomings: for example, during certification it became clear that the instructions have to be as detailed as possible. We have, for example, 6 identical subsystems. The first had a detailed instruction for switching over in case of an accident. The second said "do the same as for 1". That had to be changed: the same instruction is written out in full for each subsystem, so that nothing gets mixed up on the spot.
It is also important to properly document all improvements, including an upgrade log. Keep in mind that some changes can actually lower the overall reliability level of the data center.
There were no particular surprises for us during the check. There is a list of requirements that should be studied carefully, imagining that three paranoids at once will pick at every item. They dig into the paperwork very hard, which is, on the whole, right: in ordinary inspections nobody cross-checks different documents against each other, and here they do, and quite deeply.
For example, after the tour they asked us to export an accurate map of how and where they had walked around the facility. This is done from the access control system and the video surveillance.
A few more links about our data center:
- A tour of our "Compressor" facility, where trains used to pull in
- How one TIER differs from another, and TIA from UI
- Stages of data center construction
- How to operate a high-criticality data center
- And my email for questions: AAshavskiy@croc.ru.
So, if you are preparing for a check like this, I will gladly answer questions in the comments.
This article is a translation of the original post at habrahabr.ru/post/268419/