Clients often ask us about p99 metrics (the 99th percentile).

It is a perfectly reasonable request, and we are going to add similar functionality to VividCortex (more on that later). But when clients ask about it, they usually have something quite specific in mind, and that something can be a problem. They are not asking for the 99th percentile of a metric; they are asking for a metric of the 99th percentile. This is commonplace in systems such as Graphite, but it does not produce the results people expect from such systems. This post will show you that your ideas about percentiles may be wrong, exactly how wrong they are, and what you can actually do right in this situation.

(This is a translation, via habrahabr.ru/post/274303, of an article originally written by Baron Schwartz.)



Ditching averages



In the last few years a great many people have started talking about the problems with monitoring based on averages. It is good that this subject is finally being actively discussed, because for a long time averaged values were generated and accepted in monitoring without any real scrutiny.

Averages are a problem and are of almost no help where monitoring is concerned. If you only look at averages, you most likely miss the data that has the greatest impact on your system: when hunting for problems, the events that matter most to you will be the outliers. There are two problems with averages when outliers are present:

  • Averages hide outliers, so you do not see them.
  • Outliers skew averages, so in a system that has outliers the average no longer reflects the system's normal state.


So when you average any metric in a system with errors, you combine the worst of both worlds: you no longer observe the truly normal state of the system, yet at the same time you see nothing unusual.

And, by the way, the behavior of most software systems positively teems with extreme outliers.

Looking at the outliers in the long tail of the frequency distribution is very important, because it shows you exactly how badly you handle requests in certain exceptional cases. You will never see this if you work only with averages.

As Werner Vogels of Amazon said at the opening of re:Invent: the only thing an average can tell you is that you are serving half of your customers even worse. (Although this statement is entirely right in spirit, it does not quite match reality: it would be more correct to say this about the median (the 50th percentile), which is the statistic that actually has this property.)

The Optimizely company published a post on this a few years ago. It explains perfectly why averages can lead to unexpected conclusions:
"Although averages are very easy to understand, they can also be deeply misleading. Why? Because watching your average response time is like measuring the average temperature of a hospital. What you really care about is the temperature of each individual patient, and especially which patients need your help first."


Brendan Gregg also explained it well:
"As a statistic, averages (including the arithmetic mean) have many advantages in practice. The ability to describe a distribution of values, however, is not one of them."


On to percentiles



Percentiles (quantiles, in the broader sense) are often extolled as the cure for this fundamental shortcoming of averages. The idea of the 99th percentile is to take the entire data set (in other words, the whole collection of measurements from the system), sort it, throw away the top 1% of values, and take the largest value that remains. The resulting value has two important properties (a minimal code sketch follows the list):

  1. It is the largest of the values that occur in 99% of cases. If the value is, for example, a web page load time, it reflects the worst case of service experienced on at least 99% of visits to your site.
  2. It is robust against truly extreme outliers, which occur for a multitude of reasons, including measurement errors.
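To make that definition concrete, here is a minimal sketch in Python of an exact "nearest-rank" percentile; it is purely illustrative (the sample values are invented), not any particular product's implementation:

```python
import math

def percentile(samples, p):
    # Exact "nearest-rank" percentile: sort the full sample and
    # take the value at rank ceil(p/100 * N), counting from 1.
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

page_load_ms = [112, 105, 98, 2300, 120, 101, 99, 115, 108, 110]
print(percentile(page_load_ms, 90))  # 120: worst case for 90% of visits
print(percentile(page_load_ms, 50))  # 108: the median-like middle value
```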


Of course, you are not obliged to pick 99%. Widely used choices are the 90th, 95th and 99.9th percentiles (or even more nines).

So now you may be thinking: averages are bad and percentiles are great, so let's compute percentiles of our metrics and save them in our time-series database (TSDB)? But it is not that simple.

How TSDBs store and process metrics



There is a big problem with percentiles in time-series data. The problem is that most TSDBs almost always store metrics aggregated over time intervals, not the entire sample of measured events. And then, in a number of cases, TSDBs average those metrics over time. The most important cases:

  • They average metrics when the time resolution of your query differs from the time resolution that was used when the data was aggregated for storage. If you want to draw a chart of a metric over one day at, say, 600px wide, each pixel will represent 144 seconds of data (see the arithmetic sketch after this list). This averaging is implicit, and users have no idea it is happening. These systems really ought to display a warning!
  • TSDBs average data when they downsample it to a lower resolution for long-term storage, as most TSDBs in fact do.
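For the record, the arithmetic behind that first, implicit case is simply:

```python
seconds_per_day = 24 * 60 * 60           # 86,400 one-second data points
chart_width_px = 600
print(seconds_per_day / chart_width_px)  # 144.0 seconds averaged per pixel
```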


And here the problem appears: you are dealing with averaging, in some form, all over again. Averaging percentiles does not work, because to compute a percentile at a new time resolution you need the complete sample of events. All such calculations are simply incorrect. Averaging percentiles makes no sense. (The consequences can be arbitrary; I will return to this later.)
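A minimal demonstration of the problem, using invented latencies and the nearest-rank definition from the sketch above: the average of two stored per-minute p99 values can be far from the true p99 of the combined two minutes.

```python
from math import ceil

def p99(samples):
    ordered = sorted(samples)
    return ordered[ceil(0.99 * len(ordered)) - 1]

# Two adjacent one-minute windows of hypothetical latencies, in ms.
minute_1 = [10] * 90 + [500] * 10   # p99 = 500
minute_2 = [10] * 100               # p99 = 10

stored = [p99(minute_1), p99(minute_2)]   # what a TSDB would keep
print(sum(stored) / len(stored))          # 255.0: the "averaged p99"
print(p99(minute_1 + minute_2))           # 500: the real p99 of both minutes
```

Here the averaged value understates the true combined p99 by nearly half; with other data it can err in either direction.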


Unfortunately, some popular open-source monitoring products encourage the use of percentile metrics that will in fact be resampled later, when stored. StatsD, for example, lets you compute whatever percentile you want, then generates a metric with a name like foo.upper_99 and periodically flushes it to Graphite for storage. Everything is fine as long as the time resolution never changes when the data is viewed, but we already know that it changes anyway.
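For reference, feeding StatsD looks roughly like this. The wire format ("name:value|ms" over UDP, default port 8125) is standard StatsD; the metric name foo is just an example:

```python
import socket

# Send one timing sample to a local StatsD daemon over UDP.
# StatsD aggregates samples per flush interval and emits derived
# metrics such as foo.upper_99 to Graphite.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"foo:320|ms", ("127.0.0.1", 8125))
```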

Misunderstanding of how all these calculations happen is extremely widespread. Reading the comment thread on that StatsD GitHub ticket illustrates it perfectly. Some folks there say things that have nothing in common with reality.

(Comic:
"Susie, what's 12 + 7?"
"A billion!"
"Thanks!"
"...uh, but that can't be right, can it?"
"That's what she said about 3 + 4, too.")


Probably the shortest way to state the problem is this: percentiles are computed from a collection of measurements and must be recomputed from scratch every time that collection changes. TSDBs periodically average data over various time periods, yet they do not keep the original sample of measurements.

Other ways to compute percentiles



But if computing percentiles really requires the complete sample of the original events (for example, every single load time of every web page view), then we have a big problem. A "Big Data" problem, to be precise. That is why exact percentile computation is extremely expensive.

There are several methods of computing *approximate* percentiles that are almost as good as keeping the complete sample of measurements and then sorting it and computing the result. You can find a wealth of research in several different directions, including:
  • histograms, which divide the whole collection of events into ranges (or buckets) and then count how many events fall into each bucket
  • approximate streaming data structures and algorithms ("sketches")
  • storage systems that sample the collection of events in order to give approximate answers (one such technique is sketched after this list)
  • solutions with bounds on time, on count, or on both at once
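As one example of the sampling approach from the list above, here is a sketch of classic reservoir sampling (Algorithm R); percentiles computed over the reservoir approximate those of the full stream:

```python
import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of size k from a stream of
    # unknown length; every element ends up in the reservoir with
    # equal probability k/n.
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = x
    return reservoir
```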


The essence of most of these solutions is to approximate the distribution of the collection in one way or another. From information about the distribution you can compute an approximate percentile, as well as a number of other interesting statistics. Again from the Optimizely company blog, here is an interesting example of a response-time distribution together with the mean and the 99th percentile:

(Figure: a response-time distribution, with the mean and the 99th percentile marked.)

There are many methods of computing and storing approximate distributions, but histograms are especially popular because of their relative simplicity. Some monitoring solutions support histograms; Circonus, for example, is one of them. Theo Schlossnagle, the CEO of Circonus, often writes about the benefits of histograms.

Ultimately, having the distribution of the original collection of events at hand is useful not only for computing percentiles; it also reveals things that a percentile cannot tell you. In the end, a percentile is just a single number that merely tries to summarize a large amount of information about the data. I will not go as far as Theo did when he tweeted that "the 99th is not better than an average at all", because here I side with the percentile fans: percentiles are far more informative than averages at conveying certain important characteristics of the original sample. Nevertheless, a percentile will never tell you as much about your data as a more detailed histogram. The Optimizely illustration above contains ten times more information than any single number ever could.

Even better percentiles in a TSDB



The best way to compute percentiles in a TSDB is to collect metrics per bucket. I make this suggestion because in practice most TSDBs are merely key-value collections ordered by timestamp, with no ability to store histograms.

Banded metrics provide the same capabilities as a sequence of histograms over time. All you need to do is choose the boundaries that divide the values into ranges, and then compute every metric separately for each range. The metric is the same as for a histogram: namely, the number of events whose values fell into that range.

In general, though, choosing the bucket boundaries is a hard problem. Usually a good choice is buckets whose sizes grow logarithmically, or buckets that store quantized values to speed up computation (at the price of giving up smoothly growing counters). Buckets of identical size, however, are rarely a good choice. There is more on this subject in a note by Brendan Gregg.
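A minimal sketch of banded collection; the logarithmic bucket edges here are invented for illustration:

```python
import bisect
from collections import Counter

# Hypothetical logarithmic bucket upper edges for latency, in ms.
BOUNDS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]

def bucket_counts(samples):
    # One counter per bucket: this small vector is what gets stored
    # in the TSDB for each time interval, instead of the raw samples.
    counts = Counter()
    for value in samples:
        counts[bisect.bisect_left(BOUNDS, value)] += 1  # first edge >= value
    return counts

print(bucket_counts([0.4, 3, 3, 42, 980, 5000]))
# Counter({2: 2, 0: 1, 5: 1, 9: 1, 10: 1}); index 10 is the overflow bucket
```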

There is a fundamental trade-off between the amount of data stored and its accuracy. Yet even a rough division into buckets gives a better representation of the data than an average does. For example, Phusion Passenger Union Station displays banded latency metrics across 11 buckets. (The illustration does not strike me as particularly clear; the values on the y-axis are a bit confusing, since it is really a 3D chart projected into 2D in a nonlinear way. Nevertheless, it still conveys more information than an average value could.)

(Figure: banded latency metrics across 11 buckets, from Phusion Passenger Union Station.)

How can this be implemented with popular open-source products? You have to define the buckets and build stacked charts like the one in the figure above.

But now computing a percentile from this data becomes much harder. You have to walk over all the buckets from the top down, from largest to smallest, summing the event counters along the way. As soon as the running sum exceeds 1% of the total event count, that bucket holds the 99th percentile (a sketch of this walk follows below). There are many nuances: strict versus non-strict inequalities; exactly how to handle the boundary cases; which value to pick for the percentile (the top of the bucket, or the bottom? perhaps the middle? or a value weighted across the whole bucket?).
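A sketch of that top-down walk, reusing the invented buckets from the previous snippet; returning the bucket's upper edge is just one of the defensible choices listed above:

```python
def percentile_from_buckets(counts, bounds, p):
    # Walk buckets from the top down, accumulating event counts;
    # once we have covered more than the top (100 - p)% of events,
    # the current bucket contains the percentile.
    total = sum(counts.values())
    threshold = total * (100 - p) / 100
    seen = 0
    for i in sorted(counts, reverse=True):
        seen += counts[i]
        if seen > threshold:
            # The overflow bucket has no finite upper edge.
            return bounds[i] if i < len(bounds) else float("inf")
    return bounds[0]
```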

And such calculations can be quite confusing in general. For example, you might assume you need 100 buckets to compute the 99th percentile, but in fact it can work out differently. If you have only two buckets and 1% of all values falls into the upper one, you can still obtain the 99th percentile, and so on. (If this seems strange to you, think about quantiles in general; I believe that understanding the essence of quantiles is very valuable.)

So not everything is simple here. It is possible in theory, but in practice it depends heavily on whether your storage supports the kinds of queries needed to obtain approximate percentile values from banded metrics. If you know of storage systems where this is possible, write in the comments (on the author's website; translator's note).

The good news is that in systems like Graphite (that is, systems that expect all metrics to be freely averaged and resampled), banded metrics are completely robust against these kinds of transformations. You get correct values, because all the calculations are commutative with respect to time.
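It is easy to see why: downsampling banded metrics is nothing more than adding counters, so any percentile later derived from the merged buckets is still correct. A tiny sketch with invented per-minute counts:

```python
from collections import Counter

# Per-minute bucket counters, as a TSDB might store them (made up).
minute_1 = Counter({"0-100ms": 9500, "100-500ms": 480, "500ms+": 20})
minute_2 = Counter({"0-100ms": 9700, "100-500ms": 290, "500ms+": 10})

# Rolling two 1-minute points into one 2-minute point is plain addition.
print(minute_1 + minute_2)
# Counter({'0-100ms': 19200, '100-500ms': 770, '500ms+': 30})
```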

Beyond percentiles: heat maps



A percentile is just a number, exactly like an average. The average shows the center of mass of the sample; a percentile marks the upper bound of a given share of that sample. Think of percentiles as the marks waves leave on a beach. But although a percentile shows the upper levels, not just the central tendency like an average, it is still nowhere near as informative and detailed as the distribution, which describes the entire sample.

Meet heat maps, which are in effect 3D charts in which histograms are turned on their side and stacked together along the time axis, with values rendered as color. Once again, the Circonus company provides an excellent example of heat map visualization.

(Figure: a latency heat map from Circonus.)
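A minimal sketch of that idea with matplotlib and synthetic bucket counts (random numbers, purely to show the mechanics):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: one 10-bucket latency histogram per minute, 60 minutes.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=np.linspace(50, 5, 10)[:, None], size=(10, 60))

# Each column is one histogram turned sideways; color encodes the count.
plt.imshow(counts, aspect="auto", origin="lower", cmap="viridis")
plt.xlabel("time (minutes)")
plt.ylabel("latency bucket")
plt.colorbar(label="events per bucket")
plt.show()
```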

On the other hand, as far as I know, Graphite does not yet offer a way to build heat maps from banded metrics. If I am wrong and it can be done with some trick, let me know (tell the author of the article; translator's note).

Heat maps are also perfectly suited for showing the shape and density of latencies in particular. Another example of a latency heat map is the streaming delivery report from the Fastly company.

(Figure: a latency heat map from Fastly's streaming delivery report.)

Even some ancient tools that may now seem primitive to you can produce heat maps. Smokeping, for example, uses shading to display ranges of values, with bright green marking the average:

(Figure: a Smokeping chart; shading shows the range of values, bright green the average.)

But is it really so bad to store percentile metrics?



Well, after all the difficulties and nuances mentioned above that would have to be dealt with, perhaps the good old StatsD upper_99 metric for tracking percentiles does not look so bad to you. After all, it is very simple, convenient, and ready to use. Is this metric really that bad?

It all depends on the circumstances. Such metrics are perfectly suitable for plenty of use cases. I mean that, either way, you are limiting yourself, because percentiles do not always describe the data well. But if that does not matter to you either, then your biggest problem is the resampling of these metrics, which means you will eventually be looking at incorrect data.

(There is an excellent article about how the load average is calculated; translator's note.) Many systems compress and display their various performance metrics in a similar way. Many of Cassandra's metrics are produced by Coda Hale's Metrics library and are actually exponentially weighted moving averages, for which a great many people harbor a lasting distaste.

But back to percentile metrics. If you store a p99 metric and then downsample it and view the averaged version over a wide time interval, you can do that, and although it will not be "correct", and the chart may even turn out very different from the real value of the 99th percentile, the fact that it is wrong does not necessarily mean the chart is useless for its intended purpose: understanding the worst cases in users' interactions with your application.

So sometimes it all depends. If you understand how percentiles work, and that averaging percentiles is incorrect, and that suits you, then storing percentiles can be acceptable and even useful. But here you run into a moral dilemma: with this approach you can badly mislead unsuspecting people (perhaps even your colleagues). Look at the comments on that StatsD ticket once more: the misunderstanding of what is really going on is palpable.

If you will permit me a not-so-great analogy: I sometimes eat and drink such delicacies from my refrigerator that serving them to other people would be simply criminal. (Just ask my wife about it. The author's wife; translator's note.) If you hand people a bottle labeled "alcohol" and it contains methanol, they will simply go blind. Others will ask: "what kind of alcohol is in this bottle?" You would do well to hold yourself to the same standard of responsibility on questions like these.

What does VividCortex do?



At the moment our TSDB does not support histograms, and we do not support computing and storing percentiles (although you can simply send us any metric you like, if necessary).

For the future we plan to support storing banded metrics at high resolution, that is, metrics with a large number of buckets. We will be able to implement something like this because most of the buckets will most likely be empty, and our TSDB can process sparse data efficiently (though it is also likely that after averaging over time the data will no longer be so sparse; translator's note). This will let us produce per-second histograms (all our data is stored at 1-second resolution). Banded metrics will be downsampled to 1-minute resolution after a configurable period, which defaults to 3 days. And banded metrics can be downsampled to 1-minute resolution without any mathematical problems.

As a result, from these banded metrics we will gain the ability to obtain any desired percentile, show an error estimate, display a heat map, and plot the distribution curve.

It will not be quick to implement and will demand a great deal of engineering effort, but the work has begun and the system has already been designed with all of this in mind. I cannot promise when it will be finished, but I felt it necessary to share our long-term plans.

Conclusions



This post turned out somewhat longer than I had originally intended, but I touched on many subjects.

  • If you compute a percentile over some interval and then save the result as a time series (as some existing storage systems do), you may get something quite different from what you expect.
  • Exact computation of percentiles is computationally expensive.
  • Approximate percentile values can be computed from histograms, banded metrics, and other useful data structures.
  • Such data also lets you plot distributions and heat maps, which are even more informative than a bare percentile.
  • If none of this is available to you right now, or you cannot afford it, then fine, use percentile metrics, but remember the consequences.


I hope all this was useful to you.

P.S.



  • Someone on Twitter mentioned an effect along these lines: "oops, I see, it turns out I have been doing this all wrong. But I have switched to computing the percentage of requests that complete faster/slower than a given threshold, and I save that metric instead of the old one." That does not work either. An approach based on shares (and a percentage is a share) still breaks under averaging. Instead, save a metric of the number of requests that did not complete within the desired time. That will work (a tiny worked example follows this list).
  • It took me a while to find Theo's excellent post on this subject. Here it is: http://www.circonus.com/problem-math/
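A tiny worked example of that first point, with invented numbers:

```python
# Two one-minute windows: (total requests, requests slower than 100 ms).
w1 = (1000, 10)   #  1.0% slow
w2 = (100, 10)    # 10.0% slow

# Averaging the stored percentages is wrong:
print((w1[1] / w1[0] + w2[1] / w2[0]) / 2 * 100)   # 5.5: incorrect
# Summing the stored counts and dividing is right:
print((w1[1] + w2[1]) / (w1[0] + w2[0]) * 100)     # ~1.82: correct
```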
