Developers Club geek daily blog

2 years, 7 months ago
Network monitoring: how we keep every node working for large companies
Judging by the look of this fiber running through the woods to the cable duct, the installer did not quite follow the procedure. The fastening in the photo also suggests he is probably a sailor: that is a sailor's knot.

I am from the team that keeps the network physically operational; in other words, the technical support responsible for making sure the lights on the routers blink the way they should. We look after various large companies with infrastructure all over the country. We do not get into their business: our job is to keep the network working at the physical level and the traffic flowing as it should.

The general idea of the work is constant polling of nodes, collecting telemetry, running tests (for example, checking settings for vulnerabilities), keeping things operational, and monitoring applications and traffic. Sometimes there are inventories and other perversions.

I will explain how this is organized and tell a couple of stories from field visits.

How it usually goes

Our team sits in an office in Moscow and collects network telemetry. Essentially, that means constantly pinging nodes, plus collecting monitoring data if the hardware is smart enough. The most common situation is that a ping fails several times in a row. For a retail network, for example, in 80% of cases this turns out to be a power outage, so when we see this picture we do the following:
  1. First we call the provider about outages.
  2. Then we call the power company about a shutdown.
  3. Then we try to reach someone on site (this does not always work, for example at 2 a.m.).
  4. Finally, if none of the above has helped within 5-10 minutes, we go out ourselves or send an "avatar": a contract engineer based somewhere like Izhevsk or Vladivostok, if that is where the problem is.
  5. We stay in constant contact with the "avatar" and guide him through the infrastructure: we have the sensors and service manuals, he has the pliers.
  6. Afterwards the engineer sends us a report with photos of what actually happened.
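
The trigger for this whole procedure, a ping failing several times in a row, can be sketched roughly as follows. This is only an illustration: the function name and the threshold of 3 are assumptions, since the text says only "several times in a row".

```python
from typing import Iterable


def should_raise_incident(ping_results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive pings have failed.

    `ping_results` is the chronological sequence of poll outcomes
    (True = reply received, False = no reply). The default threshold
    is an assumption made for this sketch.
    """
    consecutive_failures = 0
    for ok in ping_results:
        if ok:
            consecutive_failures = 0  # any reply resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return True
    return False
```

A single lost packet does not start the phone calls; only an unbroken streak of failures does.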

The conversations sometimes go like this:
- So, the link drops between buildings 4 and 5. Check the router in the fifth.
- It's fine, it's powered on. There is no connection.
- OK, follow the cable to building 4, there is another node there.
- … Oops!
- What happened?
- Building 4 has been demolished.
- What??
- I'll attach a photo to the report. I won't be able to rebuild the building within the SLA.

[Image]

More often, though, we do manage to find the break and restore the channel.

About 60% of call-outs are wasted trips, because either the power was cut (a shovel, a foreman, intruders), or the provider does not know about its own failure, or a short-lived problem fixes itself before the installer arrives. However, there are cases when we learn about a problem before the users and before the customer's IT services, and report the fix before they even realize something happened. This mostly happens at night, when activity at the customers' companies is low.

Who needs this and why

As a rule, any large company has its own IT department that clearly understands its specifics and tasks. In medium and large businesses, the work of general-purpose help-desk technicians ("enikey" staff) and network engineers is often outsourced. It is simply cost-effective and convenient. For example, one retailer has very strong IT specialists of its own, but replacing routers and tracing cables is not what they do.

What we do

  1. We work on requests: tickets and panic calls.
  2. We do preventive maintenance.
  3. We follow hardware vendors' recommendations, for example on maintenance intervals.
  4. We connect to the customer's monitoring and pull data from it so that we can respond to incidents.

The usual story with monitoring is that there isn't any. Or it was set up 5 years ago and is no longer really current. In the simplest case, if there truly is no monitoring, we offer the customer plain open-source Zabbix free of charge: it is good for them and easier for us.

The first method, simple checks, is just a machine that pings every node on the network and makes sure they respond correctly. Such a setup requires no changes at all, or only minimal cosmetic changes, to the customer's network. In the very simplest case we install Zabbix on our own side, in one of our data centers (luckily we have two of them at the KROK office on Volochayevskaya). In more complex cases, for example if the network is protected, it runs on one of the machines in the customer's data center:

[Image]

Zabbix can also be used in more sophisticated ways: for example, it has agents that are installed on *nix and Windows nodes and report system metrics, as well as an external check mode (with SNMP support). Still, if a business needs something like that, either it already has its own monitoring, or a more functional and feature-rich solution gets chosen. Of course, that is no longer open-source software and it costs money, but even a plain accurate inventory recovers roughly a third of the cost.

We do that too, but that is my colleagues' story. They sent over a couple of Infosim screenshots:

[Screenshot]

[Screenshot]

I am an "avatar" operator, so from here on I will talk about my own work.

What a standard incident looks like

In front of us are screens with an overall status like this:

[Screenshot]

For this site Zabbix collects a great deal of information for us: part number, serial number, CPU load, device description, interface availability, and so on. All the necessary information is available from this interface.
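
We read this data off the screen, but the same values could in principle be pulled programmatically. Below is a minimal sketch of a request body for the Zabbix JSON-RPC API method `item.get`. The host ID and the key pattern are hypothetical placeholders, and authentication and the HTTP call itself are left out on purpose, because they depend on the Zabbix version and the installation.

```python
import json


def build_item_get_request(host_id: str, key_pattern: str, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 body for the Zabbix API method `item.get`.

    Only the request body is built here. Authentication is omitted:
    how the token is passed depends on the Zabbix version in use.
    """
    payload = {
        "jsonrpc": "2.0",
        "method": "item.get",
        "params": {
            "hostids": host_id,                      # which device to ask about
            "search": {"key_": key_pattern},         # substring match on the item key
            "output": ["name", "key_", "lastvalue"],  # fields to return
        },
        "id": request_id,
    }
    return json.dumps(payload)


# Hypothetical host ID and key pattern, for illustration only.
body = build_item_get_request("10084", "system.cpu")
```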

A typical incident usually starts when one of the channels leading to, say, a customer's store goes down (and the customer has 200-300 of them across the country). Retail today is far savvier than it was about seven years ago, so the cash registers keep working: there are two channels.

We pick up the phones and make at least three calls: to the provider, to the power company and to the people on site ("Yes, we were loading rebar here and snagged someone's cable… Oh, it's yours? Well, good thing you found it").

As a rule, without monitoring it would take hours or days before anything was escalated, since those backup channels are not always checked. We know right away and head out right away. If there is additional information beyond the ping (for example, the model of the faulty device), we immediately equip the field engineer with the necessary parts. The rest is handled on site.

The second most common routine call is a failure of one of the users' endpoint devices, for example a DECT phone or a Wi-Fi router serving the office. Here we learn about the problem from monitoring and almost immediately get a call with details. Sometimes the call adds nothing new ("I pick up the handset and it doesn't ring"), sometimes it is very useful ("We knocked it off the desk"). Clearly, in the second case it is not a break in the trunk line.

Equipment in Moscow is taken from our hot-spare warehouses; we have several of them, like this one:

[Image]

Customers usually keep stocks of parts that fail often: office handsets, power supplies, fans and so on. If something that is not available on site has to be delivered outside Moscow, we usually go ourselves (because it has to be installed). For example, I once had a night trip to Nizhny Tagil.

If the customer has its own monitoring, they can export the data to us. Sometimes we deploy Zabbix in polling mode just to provide transparency and SLA control (this is also free for the customer). We do not install additional sensors (our colleagues who ensure production continuity do that), but we can connect to those as well, as long as the protocols are not exotic.
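
As a rough illustration of what SLA control from polling data can mean, here is a sketch that turns a series of poll results into an availability percentage and compares it with a target. The function names and the 99.5% target in the usage line are assumptions for the example, not figures from our contracts.

```python
def availability_percent(poll_results: list[bool]) -> float:
    """Share of successful polls, in percent (True = node answered)."""
    if not poll_results:
        raise ValueError("no polls recorded")
    return 100.0 * sum(poll_results) / len(poll_results)


def meets_sla(poll_results: list[bool], target_percent: float) -> bool:
    """Check the measured availability against an agreed target."""
    return availability_percent(poll_results) >= target_percent


# 99 successful polls and 1 failed one, checked against a hypothetical 99.5% target.
polls = [True] * 99 + [False]
within_sla = meets_sla(polls, target_percent=99.5)
```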

In general, we do not touch the customer's infrastructure; we simply maintain it in the state it is in.

From experience, I can say that our last ten customers switched to external support because we are very predictable in terms of cost. Clear budgeting, good case management, a report on every request, SLA, equipment reports, preventive maintenance. Ideally, of course, for the customer's CIO we are like cleaners: we come in and do the job, everything is clean, and we do not distract anyone.

Another thing worth noting: in some large companies inventory becomes a real problem, and sometimes we are brought in purely to carry it out. We also store and manage configurations, which is convenient during various moves and reconnections. But again, in complex cases that is not me: we have a special team that relocates data centers.

And one more important point: our department does not handle critical infrastructure. Everything inside data centers and everything in banking, insurance and telecom operators, plus retail core systems, belongs to the X-team. Those guys.

More from practice

Many modern devices can report a lot of service information. For example, the toner level in a network printer's cartridge is very easy to monitor. You can estimate the replacement date in advance, plus set an alert at 5-10% (in case the office suddenly starts printing furiously, off its usual schedule), and send the help-desk technician right away, before panic breaks out in accounting.
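
A simple way to picture the toner forecast is a linear extrapolation from two readings. The sketch below is an assumption about how such an estimate could be made, not a description of what Zabbix or Infosim do internally; the 10% default threshold is taken from the upper end of the 5-10% range mentioned above.

```python
from typing import Optional


def days_until_empty(level_then: float, level_now: float, days_between: float) -> Optional[float]:
    """Linearly extrapolate how many days remain until the toner reaches 0%.

    Levels are percentages. Returns None if the level is not decreasing,
    because no meaningful forecast exists in that case.
    """
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    used_per_day = (level_then - level_now) / days_between
    if used_per_day <= 0:
        return None
    return level_now / used_per_day


def needs_alert(level_now: float, alert_threshold: float = 10.0) -> bool:
    """True once the level drops to the alert threshold or below."""
    return level_now <= alert_threshold
```

With readings of 80% and then 60% ten days later, the cartridge loses 2 points a day, so roughly 30 days remain.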

Customers very often take the annual statistics from us, which are produced by that same monitoring system plus our own work. With Zabbix that means simple cost planning and an understanding of what went where; with Infosim it is also material for calculating scaling for the year, administrator workload and all sorts of other things. The statistics include power consumption: over the past year almost everyone started asking for it, probably to split internal costs between departments.

Sometimes it turns into a real heroic rescue. Such situations are very rare, but from what I remember this year: at about 3 a.m. we saw the temperature on a Cisco switch rise to 55 degrees. The remote server room had "dumb" air conditioners without monitoring, and they had failed. We immediately called the cooling engineer (not ours) and phoned the customer's on-duty administrator. He shut down some non-critical services and kept the server room from thermal shutdown until the guy with a mobile air conditioner arrived, and then until the regular units were repaired.

On Polycom units and other expensive videoconferencing equipment, the battery charge before conferences is monitored very well, which also matters.

Everyone needs monitoring and diagnostics. As a rule, implementing it without experience is long and hard: systems are either extremely simple and preconfigured, or the size of an aircraft carrier with a heap of standard reports. Filing the system down to fit the company, working out how to cover the internal IT department's tasks and surface the information they need most, plus keeping the whole thing up to date, is a path full of rakes to step on if you have no implementation experience. When working with monitoring systems, we choose a golden mean between free and top-tier solutions: as a rule, not the most popular and "heavyweight" vendors, but ones that clearly solve the problem.

Once we had a rather unusual request. The customer needed to hand a router over to a separate division, exactly as listed in the inventory. The router was supposed to contain a module with a specified serial number. When they began preparing the router for shipping, it turned out that the module was missing. And nobody could find it. The problem was slightly aggravated by the fact that the engineer who had worked with that branch the year before had already retired and moved to another city to be with his grandchildren. They contacted us and asked us to look. Fortunately, the hardware reported serial numbers and Infosim kept the inventory, so within a couple of minutes we found the module in the infrastructure and mapped the topology. The fugitive was tracked down by its cabling: it was in another server room, in a cabinet. The movement history showed that it had ended up there after a similar module failed.
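
The search itself is conceptually just a lookup by serial number in the collected inventory. The sketch below shows the idea with a hypothetical record structure and made-up data; it does not reflect Infosim's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InventoryRecord:
    serial: str    # serial number reported by the hardware
    device: str    # chassis the module currently sits in
    location: str  # where that chassis is installed


def find_module(inventory: list[InventoryRecord], serial: str) -> Optional[InventoryRecord]:
    """Return the inventory record with the given serial number, or None."""
    for record in inventory:
        if record.serial == serial:
            return record
    return None


# Made-up inventory, for illustration only.
inventory = [
    InventoryRecord("SN-0001", "router-branch-1", "server room A"),
    InventoryRecord("SN-0002", "router-spare-7", "server room B, cabinet 3"),
]
missing = find_module(inventory, "SN-0002")
```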

[Image]
A still from the feature film about Hottabych that accurately captures how the public treats cameras

There are a lot of incidents with cameras. Once, 3 cameras failed at the same time. A cable break on one of the segments. The installer blew a new cable into the corrugated conduit, and after some shamanism two of the three cameras came back up. The third did not. Moreover, it was unclear where it even was. I pull up the video stream, the last frames right before it went down: 4 a.m., three men with scarves over their faces, something bright approaches from below, the camera shakes violently and falls.

Once we were setting up a camera that was supposed to be aimed at the trespassers climbing over a fence. On the way there we wondered how we would mark the spot where an intruder should appear. We did not need to: in the 15 minutes we were there, about 30 people got onto the site, and only at the exact spot we needed. A ready-made test pattern.

As in the example I gave above, the story about the demolished building is not a joke. Once a link to some equipment disappeared. On site, the pavilion the copper ran through was gone. The pavilion had been demolished and the cable went with it. We saw that the router had died. The installer arrived and started looking, and the distance between the nodes there is a couple of kilometers. His kit includes a "vipnetovsky" tester, the standard routine: tone out from one connector, tone out from the other, then go looking. Usually the problem is visible right away.

[Image]
Cable tracing: this is fiber in a corrugated conduit, a continuation of the story from the top of the post about the sailor's knot. In the end, apart from the truly astonishing installation, the problem turned out to be that the cable had come off its fastenings. Anyone who feels like it climbs around here and loosens the metalwork. Roughly the five-thousandth representative of the proletariat tore the fiber.

At one site all the nodes went down about once a week, all at the same time. We spent quite a while looking for a pattern. The installer found the following:
  • The problem always occurs on the same person's shift.
  • He differs from the others in that he wears a very heavy coat.
  • A circuit breaker is mounted behind the coat hanger.
  • Someone took the breaker cover away long ago, in prehistoric times.
  • When this comrade arrives on site, he hangs up his coat, and it trips the breakers.
  • He immediately switches them back on.

At another site the equipment kept shutting down at the same time every night. It turned out that local handymen had tapped into our power supply, run out an extension cord, and plugged a kettle and a hot plate into it. When those appliances run at the same time, the whole pavilion trips.

In one store in our vast homeland, the entire network went down every time the shift closed. The installer saw that all the power was wired to the lighting circuit. As soon as the store switches off the overhead lighting on the sales floor (which consumes a lot of energy), all the network equipment goes off too.

There was a case where a janitor cut a cable with a shovel.

We often see copper simply lying there with its corrugated conduit torn. Once, local handymen simply ran twisted pair between two workshops with no protection at all.

Far from civilization, employees often complain that "our" equipment is irradiating them. At some remote sites the switches can be in the same room as the person on duty. Accordingly, a few times we came across cantankerous grannies who disconnected them by hook or by crook at the start of their shift.

In one remote town someone hung a mop on the fiber. They pried the conduit off the wall and started using it as a mount for equipment.

[Image]
In this case there are obviously problems with the power supply.

What "big" monitoring can do

I will also briefly describe what more serious systems can do, using Infosim installations as the example. There, 4 solutions are combined in one platform:
  • Fault management: failure monitoring and event correlation.
  • Performance management.
  • Inventory and automatic topology discovery.
  • Configuration management.

Importantly, Infosim supports a lot of equipment out of the box, meaning it easily parses their internal exchange and gets access to all their technical data. Here is the list of vendors: Cisco Systems, Huawei, HP, AVAYA, Redback Networks, F5 Networks, Extreme Networks, Juniper, Alcatel-Lucent, Fujitsu Technology Solutions, ZyXEL, 3Com, Ericsson, ZTE, ADVA Optical Networking, Nortel Networks, Nokia Siemens Networks, Allied Telesis, RADCOM, Allot Communications, Enterasys Networks, Telco Systems, etc.

A separate note on inventory. The module does not just show a list, it also builds the topology itself (at least, in 95% of cases it tries and gets it right). It lets you keep an up-to-date database of IT equipment in use and idle (network, server hardware, etc.) at hand, and replace obsolete equipment (EOS/EOL) in time. In general this is convenient for big business, while in small companies much of it is done by hand.

Examples of reports:
  • Reports by OS type, firmware, model and hardware manufacturer;
  • A report on the number of free ports on each switch in the network / for a selected vendor / by model / by subnet, etc.;
  • A report on newly added devices over a given period;
  • Low toner alerts for printers;
  • Assessment of a communication channel's suitability for delay- and loss-sensitive traffic, using active and passive methods;
  • Tracking the quality and availability of communication channels (SLA): generating reports on channel quality broken down by telecom operator;
  • Fault monitoring and event correlation are implemented through the Root-Cause Analysis mechanism (without the administrator having to write rules) and the Alarm States Machine mechanism. Root-Cause Analysis is the analysis of the root cause of a failure based on the following procedures: 1. automatic detection and localization of the fault; 2. reduction of the number of alarm events to a single key one; 3. identification of the failure's impact: who and what was affected.
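
To make the Root-Cause Analysis idea more concrete, here is a deliberately simplified sketch on a tree-shaped topology. It is not Infosim's algorithm; the node names and the `uplink` mapping are assumptions for illustration. A node counts as a root cause if it is unreachable while the node it is reached through still answers; everything behind it is treated as a symptom.

```python
from typing import Optional


def root_causes(uplink: dict[str, Optional[str]], unreachable: set[str]) -> set[str]:
    """Reduce a set of unreachable nodes to the likely root causes.

    `uplink` maps each node to the node it is reached through
    (None for the node closest to the monitoring server).
    """
    return {
        node
        for node in unreachable
        if uplink.get(node) is None or uplink[node] not in unreachable
    }


def affected_by(uplink: dict[str, Optional[str]], root: str) -> set[str]:
    """All nodes whose path to the monitoring server passes through `root`."""
    affected = set()
    for node in uplink:
        seen = set()
        current = uplink.get(node)
        # Walk upstream; the `seen` set guards against loops in bad data.
        while current is not None and current not in seen:
            if current == root:
                affected.add(node)
                break
            seen.add(current)
            current = uplink.get(current)
    return affected


# Hypothetical topology: core -> dist -> (sw1 -> cam, sw2)
topology = {"core": None, "dist": "core", "sw1": "dist", "sw2": "dist", "cam": "sw1"}
down = {"dist", "sw1", "sw2", "cam"}
causes = root_causes(topology, down)
impact = affected_by(topology, "dist")
```

Four alarms collapse into one key event on `dist`, and the impact list names the three nodes behind it.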

You can also place devices like this on the network, and they integrate into the monitoring right away:

[Image]
StableNet Embedded Agent (SNEA): a computer slightly larger than a pack of cigarettes.

They are installed in ATMs or in dedicated network segments where availability checks are required. They are also used to run load tests.

Cloud monitoring

Another deployment model is SaaS in the cloud. We did this for one global customer (a continuous-production company with a distribution geography stretching from Europe to Siberia).

Dozens of sites, including plants and finished-goods warehouses. If their channels went down while support was provided from offices abroad, shipment delays began, which rippled onward into losses. All work was done on request, and a lot of time was spent investigating each incident.

We configured the monitoring specifically for them, then refined it on a number of segments to fit the specifics of their routing and hardware. All of this was done in the KROK cloud. We built and delivered the project very quickly.

The result:
  • By partially handing over management of the network infrastructure, the customer managed to optimize it by at least 50%. Equipment unavailability, channel load, exceeding vendor-recommended parameters: all of this is detected within 5-10 minutes, then diagnosed and fixed within an hour.
  • When the service is delivered from the cloud, the customer converts the capital expense of deploying its own network monitoring system into an operating expense: a subscription fee for our service, which can be cancelled at any time.

The advantage of the cloud is that with this setup we sit, as it were, above their network and can look at everything that happens more objectively. If we were inside the network, we would see the picture only up to the failed node, and whatever happens behind it would be unknown to us.

A couple of pictures to finish

This is the "morning puzzle":

[Image]

And here we found a treasure:

[Image]

This is what was in the chest:

[Image]

And finally, the most entertaining call-out. I once went out to a retail site.

Here is what happened: first, water began dripping from the roof onto the suspended ceiling. Then a lake formed in the suspended ceiling, which soaked and pushed through one of the tiles. As a result, all of it poured onto the electrics. I do not know exactly what happened next, but somewhere in the next room something shorted and a fire started. First the powder extinguishers went off, and then the firefighters arrived and covered everything in foam. I arrived after them for the cleanup. I have to say that the Cisco 2960 came back up after all of this: I was able to pull the config and send the device for repair.

Another time, when the powder fire-suppression system discharged, a Cisco 3745 in one bank was almost completely filled with powder. All the interfaces were clogged: 2 modules of 48 ports each. It had to be powered on on site. We remembered the previous case and decided to try to pull the configs "hot"; we shook it out and cleaned it as best we could. We switched it on: at first the device said "pff" and sneezed a big stream of powder at us. And then it purred and came up.

This article is a translation of the original post at