Statistics is the tool for detecting unexpected failures — hence the title of this article. A statistical anomaly detector tracks characteristic changes that go beyond acceptable bounds and sends notifications. The main advantage of the statistical approach to anomaly detection is that it can surface problems that have never been seen before. Its main drawback is that the root cause still has to be determined by hand. All this may sound too abstract, so let me give a concrete example.
To avoid confusion, the main terms are defined below:
- Monitored signals: the quantities tracked continuously in order to identify failures.
- Failure: a pattern of change indicating that the website is not working properly. Human intervention is required to determine the cause.
- Anomaly: an unusual pattern of change in the monitored signals, indicating a possible failure.
- Notification: an automatic signal indicating the presence of an anomaly, usually sent when an anomaly is detected.
A particular obstacle in building an effective statistical detector is the possibility of false positives: notifications about failures that do not exist. If an anomaly detector produces too many false positives, users will disable its notifications, mark them as spam, or simply ignore them — and such a detector is clearly not very effective.
In this article I will describe the statistical anomaly detector used at eBay. A similar scheme is also used in the t.onthe.io analytics tool. For the search engine under consideration, the monitored signals are based on the search results returned by a set of frequently repeated reference queries. The statistical anomaly detector needs the results in numerical form, so in our case we apply metrics to them.
Each reference query is processed by about 50 metrics, each of which aggregates the items the query returns. Two example metrics: the number of items returned and the average price of the returned items. In total there are 3000 reference queries and 50 metrics, i.e. 150,000 values. The reference queries currently run every 4 hours, 6 times a day, which gives 900,000 values per day. In the era of terabyte databases this is a laughably small amount — but analyzing these values to detect anomalies, with a low false-positive rate, is a rather hard problem.
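To make the idea of metrics concrete, here is a minimal sketch of two of the roughly 50 metrics applied to the items a reference query returns. All names and the item representation are hypothetical; the article does not show the actual eBay code.

```python
from statistics import mean

# Hypothetical model: a query result is a list of items, each a dict
# with a "price" field. Each metric collapses the list to one number,
# producing one monitored value per (query, metric) pair.

def metric_count(items):
    """Number of items returned by the query."""
    return len(items)

def metric_avg_price(items):
    """Average price of the returned items (0 for an empty result set)."""
    return mean(item["price"] for item in items) if items else 0.0

items = [{"price": 10.0}, {"price": 20.0}, {"price": 30.0}]
print(metric_count(items))      # 3
print(metric_avg_price(items))  # 20.0
```

Running all 50 such functions over the results of all 3000 queries yields the 150,000 values per collection cycle mentioned above.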
I will try to explain the approach clearly with diagrams. The first image shows the monitored signals, i.e. the aggregated metric values:
Each pair (query, metric) is a value that can be tracked over time — in effect, a time series. In total there are 150,000 time series, so it is reasonable to expect that in every 4-hour monitoring window at least one of them will look anomalous. It follows that alerting on each anomalous time series individually makes little sense, since that would produce a large number of false positives.
In our approach we accumulate data, and processing begins with a very simple step: for each time series, compute the deviation between the latest value and the expected value obtained by extrapolating the previous ones. I call this the "surprise" — the larger the deviation, the bigger the surprise. The image below shows the surprise for each triplet (query, metric, time).
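The surprise computation might be sketched as follows. The article does not specify the extrapolation method, so a simple linear extrapolation from the two preceding points is assumed here purely for illustration:

```python
def surprise(series):
    """Deviation between the latest observed value and the value
    extrapolated from earlier ones. ASSUMPTION: the article does not
    say how to extrapolate; a linear trend from the two preceding
    points is used here as one plausible choice."""
    if len(series) < 3:
        return 0.0  # not enough history to extrapolate
    expected = series[-2] + (series[-2] - series[-3])  # continue the trend
    return abs(series[-1] - expected)

history = [100, 102, 104, 130]   # steady trend, then a jump
print(surprise(history))         # expected 106, observed 130 -> 24
```

Any reasonable forecaster (moving average, exponential smoothing, etc.) could replace the linear extrapolation without changing the overall scheme.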
The essence of our anomaly detector is as follows: at every data-collection time T we expect a high surprise value for a few triplets (query, metric, T). If an unusually large number of triplets shows a high surprise value, an anomaly notification is sent. To make this quantitative, for each metric we compute the 90th percentile of the surprise at time T across all 3000 queries.
This yields a new time series over T for each metric. Approximate time series for the first two metrics are shown below.
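The percentile aggregation can be sketched like this. The toy data, variable names, and the nearest-rank percentile convention are all assumptions made for illustration:

```python
import math

def percentile_90(values):
    """90th percentile using the nearest-rank method (one common
    convention; the article does not specify which is used)."""
    ordered = sorted(values)
    k = math.ceil(0.9 * len(ordered))  # 1-indexed rank
    return ordered[k - 1]

# Surprise values at one collection time T, per metric, across queries
# (10 toy queries here; the real pipeline has 3000).
surprises_by_metric = {
    "item_count": [0.5, 1.2, 0.1, 9.0, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8],
    "avg_price":  [0.2, 0.1, 0.3, 0.2, 0.5, 0.1, 0.4, 0.2, 0.3, 7.5],
}

aggregated = {m: percentile_90(v) for m, v in surprises_by_metric.items()}
print(aggregated)  # {'item_count': 1.2, 'avg_price': 0.5}
```

Note how the single outlier (9.0 or 7.5) barely moves the 90th percentile: only when many queries are surprised at once does the aggregated value rise, which is exactly the behavior the detector relies on.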
Thus 150,000 time series have been reduced to just 50. Aggregating values in this way is a very convenient technique for anomaly detection.
Finally, the anomaly detector applies a simple check to the aggregated time series: an anomaly is declared when the current value of any of the 50 series deviates from the mean of that series by more than three standard deviations. Here is an example using eBay data with the average sale price metric. The anomaly at time point T = 88 is clearly visible.
- Statistics is a tool for detecting anomalies.
- A statistical anomaly detector tracks characteristic changes that go beyond acceptable bounds and sends notifications.
- At every data-collection time T, a high surprise value is expected for a few (query, metric, T) triplets.
- Aggregating time-series values is a convenient technique for anomaly detection.
This article is a translation of the original post at habrahabr.ru/post/265571/