Machine learning searches for hidden patterns in data. The surge of interest in this subject within the IT community is due to the remarkable results achieved with it. Speech recognition, scanned-document processing, search engines — all of these are built using machine learning. In this article I will describe our company's current project: how to apply machine learning methods to improve DBMS performance.
The first part of this article covers the existing PostgreSQL planner mechanism; the second part discusses opportunities for improving it with machine learning.
What is the execution plan of an SQL query?
Recall that SQL is a declarative language: the user specifies only what operations should be performed on the data, and the DBMS is responsible for choosing how to execute those operations. For example, the query
SELECT name FROM users WHERE age > 25;
can be executed in two ways: read all the records from the users table and check the condition age > 25 for each of them, or use an index on the age field. In the second case we do not scan unnecessary records, but we spend more time per record because of the index operations.
Let's consider a more complex query
SELECT messages.text FROM users, messages WHERE users.id = messages.sender_id;
This JOIN can be executed in three ways:
- Nested loop join (NestedLoopJoin) scans all possible pairs of records from the two tables and checks the join condition for each pair.
- Merge join (MergeJoin) sorts both tables by the id and sender_id fields respectively, and then uses a two-pointer technique to find all pairs of records satisfying the condition. This method is similar to merge sort (MergeSort).
- Hash join (HashJoin) builds a hash table on the join field of the smaller table (in our case, users.id). The hash table then allows us to quickly find, for each record from messages, the record with users.id = messages.sender_id.
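The difference between the nested loop and hash join strategies can be sketched in a few lines of Python. The table data and function names below are invented for illustration; a real executor works on disk pages and tuples, not Python lists:

```python
# Toy tables as lists of dicts (hypothetical sample data).
users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
messages = [{"sender_id": 1, "text": "hi"},
            {"sender_id": 2, "text": "yo"},
            {"sender_id": 1, "text": "bye"}]

def nested_loop_join(users, messages):
    """Check every possible pair of records: O(len(users) * len(messages))."""
    return [(u, m) for u in users for m in messages
            if u["id"] == m["sender_id"]]

def hash_join(users, messages):
    """Build a hash table on the smaller table, then probe it once per
    record of the larger table: roughly O(len(users) + len(messages))."""
    by_id = {u["id"]: u for u in users}   # assumes users.id is unique
    return [(by_id[m["sender_id"]], m) for m in messages
            if m["sender_id"] in by_id]
```

Both functions return the same set of joined pairs; they differ only in how much work they do to find them.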
If the query requires more than one Join, the joins can also be executed in different orders, for example, in the query
SELECT u1.name, u2.name, m.text FROM users as u1, messages as m, users as u2 WHERE u1.id = m.sender_id AND u2.id = m.reciever_id;
Such a tree of operations is called the query execution plan.
You can see which plan the DBMS chooses for a particular query using the command
EXPLAIN SELECT u1.name, u2.name, m.text FROM users as u1, messages as m, users as u2 WHERE u1.id = m.sender_id AND u2.id = m.reciever_id;
To execute the query and see the plan that was chosen for it, you can use the command
EXPLAIN ANALYSE SELECT u1.name, u2.name, m.text FROM users as u1, messages as m, users as u2 WHERE u1.id = m.sender_id AND u2.id = m.reciever_id;
The runtimes of different plans for the same query can differ by many orders of magnitude. The right choice of execution plan therefore has a serious impact on DBMS performance. Let's look in more detail at how PostgreSQL currently chooses a plan.
How does the DBMS search for the optimal execution plan?
The search for the optimal plan can be divided into two parts.
First, we must be able to estimate the cost of any plan — the amount of resources required to execute it. When no other tasks or queries are running on the server, the expected runtime of a query is directly proportional to the resources spent on it. So we can treat the cost of a plan as its runtime in some abstract units.
Second, we must select the plan with the minimal cost estimate. It is easy to show that the number of plans grows exponentially with query complexity, so we cannot simply enumerate all plans, estimate the cost of each, and pick the cheapest. More sophisticated discrete optimization algorithms are used to find the optimal plan: dynamic programming over subsets for simple queries, and a genetic algorithm for complex ones.
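To make the dynamic programming over subsets concrete, here is a minimal sketch of join-order search. The cost model, table names, cardinalities, and selectivity numbers are all invented for illustration; the real PostgreSQL planner uses a far richer cost model and also chooses join methods, not just the order:

```python
from itertools import combinations

# Hypothetical table cardinalities and pairwise join selectivities.
cards = {"u1": 1000, "m": 5000, "u2": 1000}
sel = {frozenset(p): s for p, s in [
    (("u1", "m"), 0.001), (("m", "u2"), 0.001), (("u1", "u2"), 1.0)]}

def join_card(subset):
    """Estimated cardinality of joining every table in `subset`: the
    product of the table cardinalities and all pairwise selectivities."""
    card = 1.0
    for t in subset:
        card *= cards[t]
    for pair in combinations(subset, 2):
        card *= sel[frozenset(pair)]
    return card

def best_plan(tables):
    """Dynamic programming over subsets: best[S] = (cost, plan tree)."""
    best = {frozenset([t]): (0.0, t) for t in tables}
    for size in range(2, len(tables) + 1):
        for sub in map(frozenset, combinations(tables, size)):
            for right_size in range(1, size):
                for right in map(frozenset, combinations(sub, right_size)):
                    left = sub - right
                    # Toy cost model: cost of both inputs plus the size
                    # of the join result produced at this node.
                    cost = best[left][0] + best[right][0] + join_card(sub)
                    if sub not in best or cost < best[sub][0]:
                        best[sub] = (cost, (best[left][1], best[right][1]))
    return best[frozenset(tables)]

cost, plan = best_plan(["u1", "m", "u2"])
print(cost, plan)
```

With these numbers, the algorithm avoids starting with the u1-u2 pair (a cross product of a million rows) and joins messages with one of the users tables first.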
In our project we focused on the first task: given a plan, predict its cost. How can this be done without actually executing the plan?
This task, in turn, splits into two subtasks. First, for each node of the plan (plan node), we predict how many tuples it will produce. Then, based on this information, we estimate the cost of executing each node and, accordingly, the whole plan.
We conducted a small study to determine which of the two subtasks PostgreSQL solves worse.
Each point in the figures below corresponds to one plan node. For each node, we predicted the number of tuples it produces and the cost of its execution, and then measured the actual number of tuples and the actual runtime. The right-hand figure shows only those nodes for which the number of tuples was predicted correctly, so it can be used to judge the quality of the cost estimation alone.
[Figure: true number of tuples vs. the predicted number (left); plan runtime vs. estimated cost when the number of tuples is predicted correctly (right)]
The first figure shows that the solution of the first subtask can differ from the truth by several orders of magnitude. The second figure shows that when the first subtask is solved correctly, the PostgreSQL model estimates the cost of a plan quite adequately: a strong correlation with runtime is visible. We concluded that DBMS performance suffers from inaccurate solutions of both subtasks, but suffers more from incorrectly estimated numbers of tuples at each node.
Let's consider how the first subtask is solved in PostgreSQL.
How does the DBMS estimate the number of tuples at each node?
To start, let's try to predict the number of tuples returned by the simple query
SELECT name FROM users WHERE age < 25;
To have any chance of doing this, we need some information about the data — statistics on it. In PostgreSQL, histograms serve as this statistical information.
Using a histogram, we can easily recover the fraction of users younger than 25. For each plan node, the fraction of tuples it returns relative to all tuples it processes is called its selectivity. In this example, the selectivity of the SeqScan node is about 0.3. To obtain the number of tuples returned by a node, it is enough to multiply the node's selectivity by the number of tuples it processes (for a SeqScan, that is the number of records in the table).
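A histogram-based selectivity estimate can be sketched as follows. The column data, bucket count, and function name are invented for illustration; PostgreSQL actually keeps equal-frequency bucket bounds in pg_stats rather than equal-width counts:

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 70, size=100_000)  # hypothetical users.age column

# Equi-width histogram as a stand-in for the planner's statistics.
counts, edges = np.histogram(ages, bins=10)

def selectivity_lt(value):
    """Estimate the fraction of rows with age < value from the histogram,
    interpolating linearly inside the bucket containing `value`."""
    acc = 0.0
    for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
        if value >= hi:
            acc += count                              # bucket fully below
        elif value > lo:
            acc += count * (value - lo) / (hi - lo)   # partial bucket
    return acc / counts.sum()

print(f"estimated {selectivity_lt(25):.3f}, actual {(ages < 25).mean():.3f}")
```

Multiplying the estimated selectivity by the table's row count gives the predicted number of tuples the SeqScan node returns.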
Let's consider a more complex query
SELECT name FROM users WHERE age < 25 AND city = 'Moscow';
In this case, histograms on age and city let us obtain only the marginal selectivities, that is, the fraction of users younger than 25 and the fraction of Muscovites among the users. In the PostgreSQL model, all conditions (except for pairs of conditions of the form
5 < a AND a < 7, which are automatically converted into the condition
5 < a < 7) are considered independent. Mathematicians call two conditions A and B independent if the probability that both hold simultaneously equals the product of their probabilities: P(A and B) = P(A)P(B). In practical terms, the independence of two values means that the distribution of one value does not depend on the value of the other.
What is the problem?
In some cases the independence assumption does not hold, and in those cases the PostgreSQL model performs poorly. There are two ways to deal with this problem.
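The scale of the error can be seen on synthetic correlated data. The columns and probabilities below are invented for illustration; the point is only that multiplying marginal selectivities badly underestimates the joint selectivity when the conditions are dependent:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical correlated columns: younger users are much more likely
# to be in Moscow, so the two conditions are far from independent.
age = rng.integers(18, 70, size=n)
city_is_moscow = rng.random(n) < np.where(age < 25, 0.8, 0.1)

p_age = (age < 25).mean()            # marginal selectivity of age < 25
p_city = city_is_moscow.mean()       # marginal selectivity of city = 'Moscow'
independent_est = p_age * p_city     # what the independence model assumes
true_joint = ((age < 25) & city_is_moscow).mean()

print(f"independence estimate {independent_est:.3f}, actual {true_joint:.3f}")
```

On this data the independence estimate is several times smaller than the true joint selectivity, which is exactly the kind of error that leads the planner to a bad plan.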
The first way is to build multidimensional histograms. The problem with this approach is that, as the dimensionality grows, a multidimensional histogram requires an exponentially growing amount of resources to maintain the same accuracy. We therefore have to restrict ourselves to histograms of small dimension (2-8 dimensions). This leads to the second problem of the approach: we somehow need to decide for which pairs (or triples, or quadruples...) of columns it makes sense to build multidimensional histograms, and for which it does not.
Solving this requires either a good administrator, who will study the plans of resource-intensive queries, identify correlations between columns, and manually specify which histograms to build, or software that uses statistical tests to find columns that depend on each other. However, it does not make sense to build histograms for every pair of dependent columns, so the software would also have to analyze the co-occurrence of columns in queries. There are currently patches that make multidimensional histograms available in PostgreSQL, but they require the administrator to specify manually which columns the histograms should be built for.
Using machine learning for selectivity estimation
This article, however, is devoted to an alternative approach: applying machine learning to find the joint selectivity of several conditions. As mentioned above, machine learning searches for patterns in data, and data is a set of objects. In our case, an object is the set of conditions in one plan node. From these conditions and their marginal selectivities, we need to predict the joint selectivity.
The observed features of a plan node are the marginal selectivities of all its conditions. We consider all conditions that differ only in their constants to be equivalent. This assumption can be viewed as a typical machine learning technique — the hashing trick — applied to reduce the dimensionality of the feature space. But there is a stronger motivation behind it: we assume that all the information about a condition's constants that is needed for the prediction is contained in its marginal selectivity. For simple conditions of the form a < const this can be shown rigorously: from the selectivity of the condition we can recover the value of the constant, so no information is lost.
The resulting machine learning problem looks as shown in the figure.
We must predict the leftmost column from the known values in all the other columns. A task that requires predicting some real number is called a regression problem in machine learning, and a method that solves it is called, accordingly, a regressor.
Let's take logarithms in all the columns. Note that if we now use linear regression, we obtain the current PostgreSQL model as a special case.
When all the tunable parameters are equal to 1, we obtain the standard PostgreSQL selectivity model:
log sel(C1 ∧ ... ∧ Cn) = w1 · log sel(C1) + ... + wn · log sel(Cn), which for all wi = 1 gives sel(C1 ∧ ... ∧ Cn) = sel(C1) · ... · sel(Cn).
The standard ridge regression method finds the parameters by minimizing the following functional:
Q(w) = Σj (⟨w, xj⟩ − yj)² + λ‖w‖²
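A minimal sketch of this model, assuming synthetic data: the features are marginal log-selectivities, the target is the joint log-selectivity, and the weights and noise level are invented for illustration. The analytical ridge solution below is the standard one, w = (XᵀX + λI)⁻¹Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 3

# Marginal log-selectivities of d conditions per plan node
# (selectivities lie in (0, 1], so the logs are <= 0).
X = np.log(rng.uniform(0.01, 1.0, size=(n, d)))
# Synthetic joint log-selectivity: correlated columns make the true
# weights differ from the all-ones vector of the independence model.
true_w = np.array([1.0, 0.5, 0.8])
y = X @ true_w + rng.normal(0, 0.05, size=n)

# Ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

independence_pred = X.sum(axis=1)   # PostgreSQL model: all weights = 1
learned_pred = X @ w
print("learned weights:", np.round(w, 2))
```

The learned weights recover the true ones, and the learned predictor fits the data far better than the fixed all-ones independence model.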
To test different approaches, we used the TPC-H benchmark.
The following methods were tried as the regressor:
- Ridge linear regression + stochastic gradient descent. This method is attractive in that it supports online learning and therefore does not require storing any observed objects.
- A set of ridge linear regressions + stochastic gradient descent. Here a separate ridge linear regressor is created for each set of conditions. Like the previous method, this one supports online learning and does not require storing observed objects, but it is somewhat more accurate because it has significantly more tunable parameters.
- A set of ridge linear regressions + analytical solution via Gaussian elimination. This method requires storing all observed objects but, unlike the previous two, adapts to the data much faster. This, however, is also its drawback: it behaves rather unstably.
Let's explain the nature of the instability that arises with the analytical solution. The answers of our regressor are inputs to the optimizer, which searches for the optimal plan, while the objects we observe (the executed plans) are outputs of the optimizer. The objects we observe therefore depend on the answers of the regressor. Such feedback systems are much harder to learn in than systems where the regressor does not influence its environment. In these terms, the analytical Gaussian solution is unstable: it learns quickly but proposes riskier solutions, so overall the system performs worse.
After studying the linear model in detail, we found that it does not describe the data fully enough. The best results among the methods we tested were therefore shown by kNN.
- kNN. The essential drawback of this method is the need to keep all objects in memory and to organize fast search over them. This can be substantially improved with an object selection algorithm; the idea of a naive object selection algorithm is that if the prediction for an object is already good enough, there is no need to remember that object.
This method is also more stable than linear regression: as the figure above shows, convergence on the TPC-H benchmark requires only 2 training cycles.
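A kNN regressor for this task can be sketched in a few lines. The data-generating function here is invented for illustration: the joint log-selectivity is taken to be the minimum of the marginals (fully correlated conditions), a non-linear dependence that no linear model can capture but that kNN handles naturally:

```python
import numpy as np

rng = np.random.default_rng(3)

def knn_predict(X_train, y_train, x, k=5):
    """Predict the joint log-selectivity as the mean over the k nearest
    observed plan nodes (Euclidean distance in feature space)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

# Hypothetical non-linear dependence: joint log-selectivity equals the
# minimum of the marginal log-selectivities (fully correlated conditions),
# not their sum (independent conditions).
X = np.log(rng.uniform(0.01, 1.0, size=(2000, 2)))
y = X.min(axis=1)

x_new = np.log(np.array([0.2, 0.5]))
print(f"kNN prediction {knn_predict(X, y, x_new):.2f}, true {x_new.min():.2f}")
```

A production version would replace the brute-force distance scan with a spatial index and add the object selection described above to bound memory use.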
What does using machine learning lead to?
Here are the results obtained with the kNN algorithm.
[Figure: query runtimes before and after machine learning]
As you can see, the proposed approach really does speed up the DBMS. On one type of benchmark query the speedup is 30-45%; on another, 2-4 times.
What are the ways forward?
There are many more directions for further improvement of the current prototype.
- The plan search problem. The current algorithm guarantees that the selectivity predictions will be correct in the plans it encounters, but it does not guarantee global optimality of the chosen plans. Searching for globally optimal plans, or at least for better local optima, is a separate task.
- Interrupting the execution of an unsuccessful plan. In the standard PostgreSQL model it makes no sense to interrupt plan execution, since there is only one best plan and it does not change. With machine learning in place, we can interrupt the execution of a plan in which severe selectivity prediction errors were made, take the newly acquired information into account, and choose a new best plan to execute. In most cases the new plan will differ significantly from the previous one.
- Expiring stale information. As the DBMS operates, the data and the typical queries change, so information gathered in the past may no longer be relevant. Our company is currently working on a good system for determining the relevance of information and, accordingly, "forgetting" outdated information.
What was this all about?
In this article we:
- examined how the PostgreSQL planner works;
- noted problems in the current selectivity estimation algorithm;
- showed how machine learning methods can be used for selectivity estimation;
- established experimentally that using machine learning improves the planner and, accordingly, speeds up the DBMS.
Thank you for your attention!
- About the PostgreSQL planner
- Explaining the Postgres Query Optimizer, Bruce Momjian
- Materials for PostgreSQL developers
- Talks on PostgreSQL internals, Neil Conway
- On machine learning (from K. V. Vorontsov's lecture course)
This article is a translation of the original post at habrahabr.ru/post/273199/