
At some point I calculated that one minute of hh.ru downtime on a weekday afternoon affects about 30,000 users. We constantly work on reducing the number of incidents and their duration. The number of problems can be reduced with the right infrastructure and application architecture, but that is a separate topic, so we will leave it aside for now. Let's talk instead about how to quickly understand what is happening in our infrastructure. This is exactly where monitoring helps us.

In this article, using hh.ru as an example, I will show how to cover every layer of the infrastructure with monitoring:
  • client-side metrics
  • metrics from the frontends (nginx logs)
  • network (what can be extracted from TCP)
  • application (logs)
  • database metrics (PostgreSQL in our case)
  • operating system (CPU usage can be useful too)


Formulating the task


I formulated for myself the tasks that monitoring has to solve:
  • find out that something broke
  • quickly find out where it broke
  • see that it has been fixed
  • capacity planning: do we have enough resources for growth
  • optimization planning: pick where to optimize so that the effort actually has an effect
  • optimization control: if a released optimization shows no effect in production, the release should be rolled back and the code thrown away (or the task reformulated)

First, a few words about the hh.ru architecture: all the logic is spread across several dozen Java/Python services, pages for users are assembled by our page assembler service frontik, the main DB is PostgreSQL, and nginx + haproxy are used on the frontends and internal balancers.

image

Now let's start covering all the layers with monitoring. In the end we want:
  • to see how the system works from the user's point of view
  • to see what is going on in each subsystem
  • to know what the resources are being spent on
  • to look at all of this over time: how things were a minute ago, yesterday, last Monday

In other words, we will be talking about graphs. Alerting, workflow, and KPIs are covered in my slides from RootConf 2015 (a screen recording can also be found).

Client-side metrics



image
The most reliable information about how a user sees the website lives in the browser. It can be obtained through the Navigation Timing API: our JS snippet sends the metrics to the server as a GET request with parameters, which nginx then accepts:
location = /stat {
    return 204;
    access_log /var/log/nginx/clientstat.access.log;
}

Then we simply parse the resulting log and get a graph like this:

image

These are stacked 95th percentiles of the timings of the main page rendering stages. "Processing" can be defined in different ways; in this case the user actually sees the page much earlier, but our frontend developers decided this is what they need to see.
We can see that the channels are fine (the "transfer" stage) and there are no spikes in the "server" or "tcp" stages.
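
For illustration, here is a minimal sketch (not our production code) of turning such a clientstat log into per-stage 95th percentiles. The query parameter names are hypothetical and depend on what your JS snippet actually sends:

# The stage names below (tcp, server, transfer, processing) are assumptions;
# substitute the parameters your snippet really reports.
import re
from urllib.parse import urlparse, parse_qs

STAGES = ["tcp", "server", "transfer", "processing"]
LINE_RE = re.compile(r'"GET (?P<uri>/stat\?\S+) HTTP')

def percentile(values, p=95):
    values = sorted(values)
    return values[min(int(len(values) * p / 100), len(values) - 1)]

samples = {stage: [] for stage in STAGES}
with open("/var/log/nginx/clientstat.access.log") as log:
    for line in log:
        match = LINE_RE.search(line)
        if not match:
            continue
        params = parse_qs(urlparse(match.group("uri")).query)
        for stage in STAGES:
            if stage in params:
                samples[stage].append(float(params[stage][0]))

print({stage: percentile(values) for stage, values in samples.items() if values})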

What to collect from the frontends


Our main monitoring is built on the metrics we take from the frontends. On this layer we want to find out:
  • Are there errors? How many?
  • Is it fast or slow? For a single user or for everyone?
  • How many requests are there? How much of that is normal traffic / failures / bots?
  • Is the channel to the clients holding up?

All of this is in the nginx log; we only need to extend its default format a little (a possible log_format is sketched after this list):
  • $request_time — time from receiving the first byte of the request to writing the last byte of the response to the socket (so we can assume it also covers transferring the response over the network)
  • $upstream_response_time — how long the backend was thinking
  • optionally: $upstream_addr, $upstream_status, $upstream_cache_status
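
For example, an extended format might look roughly like this (the format name and the field labels here are arbitrary, pick your own):

log_format timed '$remote_addr [$time_local] "$request" $status $body_bytes_sent '
                 'rt=$request_time urt=$upstream_response_time '
                 'ua=$upstream_addr us=$upstream_status ucs=$upstream_cache_status';

access_log /var/log/nginx/access.log timed;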

A histogram works well for visualizing the whole stream of requests.
We defined in advance what counts as a fast request (under 500ms), an average one (500ms-1s) and a slow one (1s+).
On the y axis we plot requests per second, the response time is reflected in the color, and we also added errors (HTTP 5xx).
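
The bucketing itself is trivial; a rough sketch (assuming the hypothetical "timed" log format above) could look like this:

import re
from collections import Counter, defaultdict

LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) \d+ rt=(?P<rt>[\d.]+)')

def bucket(status, request_time):
    if status.startswith("5"):
        return "error"
    if request_time < 0.5:
        return "fast"
    if request_time < 1.0:
        return "average"
    return "slow"

# requests per second split into fast / average / slow / error
# ($time_local has one-second resolution, which is all the graph needs)
per_second = defaultdict(Counter)
with open("/var/log/nginx/access.log") as log:
    for line in log:
        match = LINE_RE.search(line)
        if match:
            second = match.group("ts")
            per_second[second][bucket(match.group("status"), float(match.group("rt")))] += 1
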
Here is an example of our "traffic light" based on $request_time:

image

We can see that at peak we have a bit more than 2krps, most of the requests are fast (the exact legend is in the tooltip), and on this day there were two bursts of errors affecting 10-15% of the requests.
Let's see how the graph changes if we take $upstream_response_time as the basis:

image

It is visible that fewer requests reach the backends (part of them is served from the nginx cache) and that there are few "slow" requests; in other words, the red on the previous graph is mostly the contribution of transferring the response over the network to the user, not waiting for the backend.
On either graph we can zoom in on the number of errors:

image

It is immediately clear that there were two bursts: the first one short, ~500rps, the second one longer, ~100rps.
Sometimes the errors need to be broken down by URL:
image

In this case the errors are not tied to any single URL but are spread across all the main pages.
On the main dashboard we also have a graph of HTTP 4xx errors, and we separately watch the number of HTTP 429 responses (we return this status when the limit on the number of requests from a single IP is exceeded).

image

Here it is clear that some bot has shown up. It is easier to look at detailed information about bots in the log with your own eyes; in monitoring, the mere fact that something has changed is enough for us. All the errors can be broken down by specific status codes, and this is done quite simply and quickly.

Sometimes it is useful to look at the traffic breakdown: which URLs generate how much traffic.

image

This graph shows the breakdown of static content traffic; the spike is the rollout of a new release, when the users' css and js caches got invalidated.

Breaking metrics down by URL deserves a separate mention:
  • manual rules for parsing the main URLs become outdated
  • you can try to normalize the URLs somehow (see the sketch after this list):
    throw away the request arguments, ids and other parameters in the URL
    /vacancy/123?arg=value -> /vacancy/$id
  • after all the normalization, a dynamic top of URLs works well (by rps or by the sum of $upstream_response_time): you detail only the most significant requests and keep aggregate metrics for the rest
  • you can also build a separate top by errors, but with a cut-off from below so that it does not fill up with garbage
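
A minimal normalization sketch (real rules will likely need more cases, e.g. hashes or uuids):

import re

def normalize_url(url):
    path = url.split("?", 1)[0]                  # throw away the request arguments
    path = re.sub(r"/\d+(?=/|$)", "/$id", path)  # replace numeric segments with $id
    return path

assert normalize_url("/vacancy/123?arg=value") == "/vacancy/$id"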

The log is parsed on each server; on the graphs we usually look at the overall picture, but we can also break it down by server. For example, this is convenient for assessing balancing:

image

Here it is clearly visible how front2 was taken out of the cluster; during that time the requests were handled by the two remaining servers. The total number of requests did not dip, so this maintenance did not affect the users.
Graphs like these, built from the nginx log metrics, let us see well how our website works, but they do not show the cause of the problems.

The page assembler layer


Instant monitoring of any program answers the question: where is the control flow right now (computation, waiting for data from other subsystems, disk, memory)? Looking at this over time, we need to understand which stages (actions) the time was spent on.
In our page assembler (frontik), all significant stages of each request are logged, for example:

2015-10-14 15:12:00,904 INFO: timings for xhh.pages.vacancy.Page: prepare=0.83 session=4.99 page=123.84 xsl=36.63 postprocess=13.21 finish=0.23 flush=0.49 total=180.21 code=200

While handling the vacancy page (handler="xhh.pages.vacancy.Page") we spent:
  • ~5ms on handling the user session
  • ~124ms waiting for all the requests to the services for the page data
  • ~37ms on templating
  • ~13ms on localization

From this log we take the 95th percentile of each stage for each handler (I will show how we parse the log a bit further down, with a clearer example) and get graphs for all the pages, for example for the vacancy page:

image

Here the spikes within specific stages are clearly visible.
Percentiles have a number of shortcomings (for example, it is hard to combine data from several measurements: servers / files / etc.), but in this case the picture is much more readable than a histogram.
If we need a more or less exact ratio between the stages, we can look at the sum of the times. By summing up the response times broken down by handler, we can estimate what our whole cluster is busy with.

image

On the y axis we have so-called "resource" seconds spent per second. It is very convenient that this automatically takes the request rate into account: if one very heavy handler is called rarely, it will not make it into the top. Ignoring the parallelization of some stages, we can say that at peak we spend 200 cores on all the requests.
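
The idea itself fits in a few lines; a toy sketch of "resource seconds per second" per handler (the sample numbers below are made up):

from collections import defaultdict

# (wall-clock second, handler, seconds spent on the request) -- made-up samples
requests = [
    ("15:12:00", "xhh.pages.vacancy.Page", 0.180),
    ("15:12:00", "xhh.pages.vacancy.Page", 0.095),
    ("15:12:00", "xhh.pages.search.Page", 0.410),
]

resource_seconds = defaultdict(float)
for second, handler, spent in requests:
    resource_seconds[(second, handler)] += spent

# a handler that accumulates ~200 such seconds per second keeps ~200 cores busy
for (second, handler), spent in sorted(resource_seconds.items()):
    print(second, handler, round(spent, 3))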

We applied the same approach to the problem of profiling templating.
The assembler logs which template was applied and how long it took for each request:
2015-10-14 15:12:00,865 INFO: applied XSL ambient/pages/vacancy/view.xsl in 22.50ms

The agent of the monitoring service we use can parse logs; for that we need to write a config roughly like this:
plugin: logparser
config:
  file: /var/log/hh-xhh.log
  regex: '(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ [^:]+: applied XSL (?P<xsl_file>\S+) in (?P<xsl_time>\d+\.\d+)ms'
  time_field: datetime
  time_field_format: 2006-01-02 15:04:05
  metrics:
    - type: rate
      name: xhh.xsl.calls.rate
      labels:
        xsl_file: =xsl_file

    - type: rate
      name: xhh.xsl.cpu_time.rate
      value: ms2sec:xsl_time
      labels:
        xsl_file: =xsl_file

    - type: percentiles
      name: xhh.xsl.time.percentiles
      value: ms2sec:xsl_time
      args: [50, 75, 95, 99]
      labels:
        xsl_file: =xsl_file

From the resulting metrics we draw a graph:

image

If after a release the rendering time of one template (or of all of them at once) grows, we will see it. We can also pick what makes sense to optimize…

image

… and check that the optimization brought results. In this example, a shared chunk included in all the other templates was optimized.
We now understand rather well what is going on in the page assemblers, so let's move on.

Internal balancer



image

During outages we usually face the question: is it one specific service (and if so, which one) or everything at once (for example, because of problems with the databases, other storage, or the network)? Each service sits behind balancers running nginx, and the log of each service is written to a separate access log.
A simple graph of the top 5 services by the number of errors made our life much easier:

image

Here it is visible that most of the errors come from the session service log. If a new service/log appears, it is parsed automatically and the new data shows up in this top right away. We also have histograms of response times for all services, but most often we look at this graph.
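
The gist of such a top is simple; a hedged sketch (the /var/log/nginx/<service>.access.log layout is an assumption, adjust to your naming scheme):

import glob
import os
import re
from collections import Counter

STATUS_RE = re.compile(r'" (\d{3}) ')  # status code right after the quoted request line

errors = Counter()
for path in glob.glob("/var/log/nginx/*.access.log"):
    service = os.path.basename(path).replace(".access.log", "")
    with open(path) as log:
        for line in log:
            match = STATUS_RE.search(line)
            if match and match.group(1).startswith("5"):
                errors[service] += 1

print(errors.most_common(5))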

Services



image

The logs of the services also contain the request processing stages:
2015-10-21 10:47:10.532 [Grizzly-worker(14)] INFO r.hh.health.monitoring.TimingsLogger: response: 200; context: GET /hh-session/full; total time 10 ms; jersey#beforeHandle=+0; HHid#BeforeClient=+0; HHid#Client=+6; DB#getHhSession=+3; pbMappers=+0; makeHeaderSession=+1; currentSessionInBothBodyAndHeader=+0; jersey#afterHandle=+0;

We visualize:

image

It is visible that the DB accounts for a constant share of the time, while the hhid service slows down a little at traffic peaks.

Database


image

All our main databases run on PostgreSQL. We outsource database administration to postgresql-consulting, but it is still important for us to understand ourselves what is happening to the DB, since it is not reasonable to pull in the DBAs for every problem.
The most important question that worried us: is everything fine with the DB right now? Based on pg_stat_statements we draw a graph of the average query execution time:

image

From it we can tell whether anything unusual is going on in the DB right now (unfortunately, nothing better than an average can be built from the available data). Next we want to know what the database is busy with at the moment; first of all we are interested in which queries load the CPU and the disks:

image

This is the top 2 queries by CPU usage (we collect top50 + other). A spike of a specific query to the hh_tests database is clearly visible here. If needed, the query can be copied and examined in full.
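
For reference, a sketch of how such a top can be pulled from pg_stat_statements (psycopg2 assumed; in the PostgreSQL versions of that time the column is total_time, newer versions call it total_exec_time; the real graphs are built from periodically sampled deltas of these counters, not from a single snapshot):

import psycopg2

conn = psycopg2.connect("dbname=postgres user=monitoring")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT query,
               calls,
               total_time,                 -- cumulative ms spent in this statement
               total_time / calls AS avg_ms
        FROM pg_stat_statements
        ORDER BY total_time DESC
        LIMIT 5
    """)
    for query, calls, total_ms, avg_ms in cur.fetchall():
        print(round(avg_ms, 2), "ms avg,", calls, "calls:", query[:80])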

PostgreSQL also keeps statistics on how long queries wait for disk operations. For example, we want to find out what caused this spike on /dev/sdc1:
image

Having built a graph of the top queries by I/O, we easily get the answer to our question:

image

Network



image

The network is a rather unstable environment. Often, when we do not understand what is going on, it is tempting to blame service slowdowns on network problems. For example:
  • the hhsession service waited 150ms for a response from hhid
  • the hhid service believes it answered that request in 5ms (all our requests carry a $request_id, so we can find specific requests in the logs)
  • between them there is only the network

You have no ping results for yesterday between these two hosts. How do you rule out the network?

TCP RTT


TCP Round-Trip Time is the time from sending a packet to receiving the ACK for it. The monitoring agent on each server collects RTT for all current connections. For IPs from our network we aggregate the times separately and get metrics roughly like this:
{
	"name": "netstat.connections.inbound.rtt",
	"plugin": "netstat",
	"source_hostname": "db17",
	"listen_ip": "0.0.0.0",
	"listen_port": "6503",
	"remote_ip": "192.168.1.48",
	"percentile": "95"
}

Based on such metrics, we can look back and see how the network between any two hosts in our network behaved.
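
Our metrics come from the okmeter agent, but for illustration here is a rough sketch of how per-connection RTT could be sampled on Linux by parsing `ss -ti` output (the parsing is deliberately simplified, and a real agent would aggregate percentiles rather than averages):

import re
import subprocess
from collections import defaultdict

# `ss -t -i -n` prints one line per connection ("ESTAB ... local:port peer:port")
# followed by an indented line with TCP details, including "rtt:<rtt>/<rttvar>" in ms
output = subprocess.check_output(["ss", "-t", "-i", "-n"], text=True)

rtt_by_peer = defaultdict(list)
peer = None
for line in output.splitlines():
    match = re.search(r"\brtt:([\d.]+)/[\d.]+", line)
    if match and peer:
        rtt_by_peer[peer].append(float(match.group(1)))
        continue
    columns = line.split()
    if columns and not line[0].isspace() and columns[0] != "State" and ":" in columns[-1]:
        peer = columns[-1].rsplit(":", 1)[0]    # strip the port

for peer, samples in rtt_by_peer.items():
    print(peer, "avg rtt, ms:", round(sum(samples) / len(samples), 2))
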
But TCP RTT values are nowhere near ICMP RTT (what ping shows). The thing is that the TCP protocol is rather complex and uses many different optimizations (SACK, etc.); this is well illustrated by a picture from this presentation:

image

Each TCP connection has a single RTT counter. We can see that RTT#1 is more or less fair; in the second case we sent 3 segments during the measurement, and in the third we sent 1 segment and received several ACKs.
In practice, TCP RTT is fairly stable between a given pair of hosts.

image

This is the TCP RTT between the master DB and a replica in another data center. It is clearly visible how connectivity disappeared during the maintenance on the link and how the network slowed down after recovery. And that is with TCP RTT of ~7ms versus a ping of ~0.8ms.

OS metrics


Like everyone else, we look at CPU usage, memory, disks, disk utilization by IO, traffic and packet rate on the network interfaces, and swap usage. Swap I/O deserves a separate mention: it is a very useful metric that lets you understand whether swap was actually being used or whether unused pages are simply sitting in it.

But these metrics are not enough if you want to understand in hindsight what was happening on a server. For example, let's look at the CPU graph of the DB master server:

image
What was that?

To answer such questions we have metrics for every running process (a sketch of reading such per-process metrics from /proc follows the list):
  • CPU per process+user
  • Memory per process+user
  • Disk I/O per process+user
  • Swap per process+user
  • Swap I/O per process+user
  • FD usage per process+user
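
A simplified illustration of where such numbers come from on Linux (a real agent also tracks deltas over time and groups by process name and user):

import os

def read_kv(path):
    """Read a /proc file of "Key: value" lines into a dict; unreadable files are skipped."""
    values = {}
    try:
        with open(path) as f:
            for line in f:
                key, _, value = line.partition(":")
                values[key.strip()] = value.strip()
    except (FileNotFoundError, PermissionError, ProcessLookupError):
        pass
    return values

for pid in filter(str.isdigit, os.listdir("/proc")):
    status = read_kv(f"/proc/{pid}/status")   # process name, VmSwap, ...
    io = read_kv(f"/proc/{pid}/io")           # read_bytes / write_bytes (needs privileges)
    if not status:
        continue
    print(status.get("Name"),
          "swap:", status.get("VmSwap", "0 kB"),
          "read_bytes:", io.get("read_bytes", "0"),
          "write_bytes:", io.get("write_bytes", "0"))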

In an interface roughly like this we can "play" with the metrics in real time:

image

In this case a backup was running on the server: pg_dump used almost no CPU, while the contribution of pbzip2 is clearly visible.

Summing up


Many different tools can pass for monitoring. For example, you can check the availability of the main page from an external service once a minute, and that way you will learn about major outages. That was not enough for us: we wanted to learn to see the full picture of how well the website works. With the current solution we see every significant deviation from the normal picture.

The second part of the task is to reduce the duration of incidents. Here the "carpet" coverage of all components with monitoring helped us a lot. We had to put separate effort into simple visualization of a large number of metrics, but once we built good dashboards, life became much easier.


This article is based on my talk at Highload++ 2015; the slides are available here.
I cannot fail to mention the okmeter.io monitoring service, which we have been using for more than a year.

This article is a translation of the original post at habrahabr.ru/post/272263/
