Developers Club geek daily blog

1 year, 5 months ago
Unfortunately, table partitioning in PostgreSQL is still not widely used. In my opinion, Hubert Lubaczewski (depesz.com) covers the topic very well in his writing, so here is one more translation of his article!
Partitioning in PostgreSQL – What? Why? How?
Recently I have noticed that I run into cases where partitioning could be used more and more often. And although, in theory, most people know it exists, in practice the feature is not understood all that well, and some are even rather afraid of it.

So I will try, to the best of my knowledge, to explain what it is, why you would use it, and how to do it.

As you most likely know, PostgreSQL has tables, and tables hold data. Sometimes just a few rows, sometimes billions.

Partitioning is a method of splitting large tables (large in terms of row count, not column count) into many small ones. Preferably in a way that is transparent to the application.

One of the rarely used features of PostgreSQL is the fact that it is an object-relational database. And "object" is the key word here, because objects (or rather classes) have a thing called "inheritance". And that is what partitioning uses.

Let's see what this is all about.

I will create an ordinary users table:

$ create table users (
    id             serial primary key,
    username       text not null unique,
    password       text,
    created_on     timestamptz not null,
    last_logged_on timestamptz not null
);

Now, to complete the picture, let's add a few rows and an extra index (random_string() is a helper function that returns a random string of the given length; it is not built into PostgreSQL, and its definition is not shown here):

$ insert into users (username, password, created_on, last_logged_on)
    select
        random_string( (random() * 4 + 5)::int4),
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    from
        generate_series(1, 10000);
$ create index newest_users on users (created_on);

So now we have our test table:

$ \d
                                      Table "public.users"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_username_key" UNIQUE CONSTRAINT, btree (username)
    "newest_users" btree (created_on)

With some random data:

$ select * from users limit 10;
 id | username |       password       |          created_on           |        last_logged_on         
----+----------+----------------------+-------------------------------+-------------------------------
  1 | ityfce3  | 2ukgbflj_l2ndo3vilt2 | 2015-01-02 16:56:41.346113+01 | 2015-04-15 12:34:58.318913+02
  2 | _xg_pv   | u8hy20aifyblg9f3_rf2 | 2014-09-27 05:41:05.317313+02 | 2014-08-07 14:46:14.197313+02
  3 | uvi1wo   | h09ae85v_f_cx0gf6_8r | 2013-06-17 18:48:44.389313+02 | 2014-06-03 06:53:49.640513+02
  4 | o6rgs    | vzbrkwhnsucxco5pjep0 | 2015-01-30 11:33:25.150913+01 | 2013-11-05 07:18:47.730113+01
  5 | nk61jw77 | lidk_mnpe_olffmod7ed | 2014-06-15 07:18:34.597313+02 | 2014-03-21 17:42:44.763713+01
  6 | 3w326_2u | pyoqg87feemojhql7jrn | 2015-01-20 05:41:54.133313+01 | 2014-09-07 20:33:23.682113+02
  7 | m9rk9mnx | 6pvt94s6ol46kn0yl62b | 2013-07-17 15:13:36.315713+02 | 2013-11-12 10:53:06.123713+01
  8 | adk6c    | egfp8re0z492e6ri8urz | 2014-07-23 11:41:11.883713+02 | 2013-10-22 07:19:36.200513+02
  9 | rsyaedw  | ond0tie9er92oqhmdj39 | 2015-05-11 16:45:40.472513+02 | 2013-08-31 17:29:18.910913+02
 10 | prlobe46 | _3br5v97t2xngcd7xz4n | 2015-01-10 20:13:29.461313+01 | 2014-05-04 06:25:56.072513+02
(10 rows)

Now that the table is ready, I can create partitions, which simply means inherited tables:

$ create table users_1 () inherits (users);
 
$ \d users_1
                                     Table "public.users_1"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Inherits: users

This gives us a new table with some interesting properties:

  • it uses the same sequence as the main table for the id column;
  • all columns have identical definitions, including the not null constraints;
  • there is no primary key, no unique constraint on username, and no index on created_on.
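Inheritance also means that rows stored in the child are visible through the parent. A quick sketch to see it (the row inserted here is just throwaway test data):

```sql
-- A row inserted into the child shows up when querying the parent,
-- but not when the parent is queried with ONLY:
insert into users_1 (username, password, created_on, last_logged_on)
values ('inherit_test', 'x', now(), now());

select count(*) from users where username = 'inherit_test';      -- 1
select count(*) from only users where username = 'inherit_test'; -- 0
```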

Let's try that again, but this time with more of a bang:

$ drop table users_1;
$ create table users_1 ( like users including all );
$ \d users_1
                                     Table "public.users_1"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_1_pkey" PRIMARY KEY, btree (id)
    "users_1_username_key" UNIQUE CONSTRAINT, btree (username)
    "users_1_created_on_idx" btree (created_on)

Now we have all the indexes and constraints, but we have lost the inheritance information. Fortunately, we can add it back later with:

$ alter table users_1 inherit users;
$ \d users_1
                                     Table "public.users_1"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_1_pkey" PRIMARY KEY, btree (id)
    "users_1_username_key" UNIQUE CONSTRAINT, btree (username)
    "users_1_created_on_idx" btree (created_on)
Inherits: users

We could have done it in a single step, but then some unpleasant notices appear:

$ drop table users_1;
 
$ create table users_1 ( like users including all ) inherits (users);
NOTICE:  merging column "id" with inherited definition
NOTICE:  merging column "username" with inherited definition
NOTICE:  merging column "password" with inherited definition
NOTICE:  merging column "created_on" with inherited definition
NOTICE:  merging column "last_logged_on" with inherited definition
 
$ \d users_1
                                     Table "public.users_1"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_1_pkey" PRIMARY KEY, btree (id)
    "users_1_username_key" UNIQUE CONSTRAINT, btree (username)
    "users_1_created_on_idx" btree (created_on)
Inherits: users

Either way, we now have two tables: the main one and the first partition.

If I run any operation – select/update/delete – on users, both tables will be scanned:

$ explain analyze select * from users where id = 123;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..16.47 rows=2 width=66) (actual time=0.008..0.009 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.008..0.008 rows=1 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
 Planning time: 0.327 ms
 Execution time: 0.031 ms
(7 rows)

But if I address a partition directly, the query runs only against it:

$ explain analyze select * from users_1 where id = 123;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.002..0.002 rows=0 loops=1)
   Index Cond: (id = 123)
 Planning time: 0.162 ms
 Execution time: 0.022 ms
(4 rows)

And if we wanted to query only the users table itself, without its partitions, we could use the ONLY keyword:

$ explain analyze select * from only users where id = 123;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.008..0.008 rows=1 loops=1)
   Index Cond: (id = 123)
 Planning time: 0.229 ms
 Execution time: 0.031 ms
(4 rows)

You may have noticed that I said select/update/delete work on all partitions. What about inserts? An insert has to put its data somewhere, so it always behaves as if ONLY had been used. So if I need to add a row to users_1, I have to write:

INSERT INTO users_1 ...

Not too pretty, but don't worry, there are ways around it.

Let's try to set up this partitioning. First we need to decide on the partitioning key – in other words, the rule by which a partition is chosen.

There are a couple of obvious options:

  • partitioning by date – for example, picking the partition based on the year in which the user was created;
  • partitioning by id range – for example, the first million users, the second million users, and so on;
  • partitioning by something else – for example, by the first letter of the username.

There are a few more, less commonly used options, such as "partitioning by a hash of the username".

Why choose one scheme over another? Let's go over their pros and cons:

  • partitioning by date:
    • advantages:
      • easy to understand;
      • the number of rows in a given partition stays fairly stable;

    • disadvantages:
      • requires maintenance – from time to time we have to add new partitions;
      • searching by username or id requires scanning all partitions;


  • partitioning by id:
    • advantages:
      • easy to understand;
      • the number of rows in a partition is 100% stable;

    • disadvantages:
      • requires maintenance – from time to time we have to add new partitions;
      • searching by username requires scanning all partitions;


  • partitioning by the first letter of the username:
    • advantages:
      • easy to understand;
      • no maintenance – the set of partitions is fixed, and we never have to add new ones;

    • disadvantages:
      • the number of rows in each partition keeps growing;
      • some partitions will hold significantly more rows than others (there are more people whose nicknames start with "t" than with "y");
      • searching by id requires scanning all partitions;


  • partitioning by a hash of the username:
    • advantages:
      • no maintenance – the set of partitions is fixed, and we never have to add new ones;
      • rows are distributed evenly across the partitions;

    • disadvantages:
      • the number of rows in each partition keeps growing;
      • searching by id requires scanning all partitions;
      • searching by username scans only one partition, but only with an additional condition in the query.



That last drawback of the hashed-username approach is quite interesting. Let's take a look at what happens there.

First, I need to create more partitions:

$ create table users_2 ( like users including all );
$ alter table users_2 inherit users;
...
$ create table users_10 ( like users including all );
$ alter table users_10 inherit users;

Now the users table has 10 partitions:

$ \d users
                                      Table "public.users"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_username_key" UNIQUE CONSTRAINT, btree (username)
    "newest_users" btree (created_on)
Number of child tables: 10 (Use \d+ to list them.)

PostgreSQL has an option called constraint_exclusion. If it is set to "on" or "partition", PostgreSQL will skip partitions that cannot contain matching rows.

My Pg has it at the default setting:

$ show constraint_exclusion;
 constraint_exclusion 
----------------------
 partition
(1 row)
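If your installation shows something else, the setting can be changed per session or in postgresql.conf; "partition" is the default and applies the exclusion check only to queries over inheritance trees:

```sql
-- The default: consider CHECK constraints only for partitioned queries.
set constraint_exclusion = partition;
-- Apply the check to every query (rarely worth the extra planning cost):
set constraint_exclusion = on;
```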

Since none of my partitions (nor the base table) have any meaningful constraints yet, any query will scan all 11 tables at once (the main one plus 10 partitions):

$ explain analyze select * from users where id = 123;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..89.98 rows=11 width=81) (actual time=0.009..0.013 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.007..0.007 rows=1 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_2_pkey on users_2  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_3_pkey on users_3  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_4_pkey on users_4  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_5_pkey on users_5  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_6_pkey on users_6  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_7_pkey on users_7  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_8_pkey on users_8  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_9_pkey on users_9  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_10_pkey on users_10  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
 Planning time: 1.321 ms
 Execution time: 0.087 ms
(25 rows)

That is not very efficient, but we can add constraints.

Let's say our partitions were created by partitioning on id, with 100,000 ids stored in each partition.

We can add the matching constraints:

$ alter table users_1 add constraint partition_check check (id >= 0 and id < 100000);
$ alter table users_2 add constraint partition_check check (id >= 100000 and id < 200000);
...
$ alter table users_10 add constraint partition_check check (id >= 900000 and id < 1000000);
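Typing ten nearly identical ALTER statements by hand gets tedious; a small DO block could generate them instead (a sketch, assuming the users_1 .. users_10 naming and the 100,000-id ranges used above):

```sql
-- Generate the ten range constraints in a loop:
do $$
begin
    for i in 1..10 loop
        execute format(
            'alter table users_%s add constraint partition_check check (id >= %s and id < %s)',
            i, (i - 1) * 100000, i * 100000
        );
    end loop;
end;
$$;
```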

Now let's repeat the previous query:

$ explain analyze select * from users where id = 123;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..16.47 rows=2 width=66) (actual time=0.008..0.009 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.008..0.009 rows=1 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
 Planning time: 1.104 ms
 Execution time: 0.031 ms
(7 rows)

It scans only 2 tables: the main one (which currently holds all the data and has no constraints, so it cannot be excluded) and the matching partition.

Pretty cool, huh?

We could just as easily add similar partitioning conditions on username or created_on. But look what happens when the partitioning key is more complex:

$ alter table users_1 drop constraint partition_check, add constraint partition_check check (abs( hashtext(username) ) % 10 = 0);
$ alter table users_2 drop constraint partition_check, add constraint partition_check check (abs( hashtext(username) ) % 10 = 1);
...
$ alter table users_10 drop constraint partition_check, add constraint partition_check check (abs( hashtext(username) ) % 10 = 9);

In case you are not familiar with it, hashtext() takes a string and returns an integer in the range -2147483648 to 2147483647.
Thanks to simple arithmetic we know that abs(hashtext(string)) % 10 always yields a value in the range 0..9, and that it is easy to compute for any argument.
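You can check the arithmetic right in psql; whatever hashtext() returns for a given string, the expression always lands in 0..9:

```sql
-- The bucket is deterministic for a given string, and always 0..9:
select abs(hashtext('depesz')) % 10;
select min(abs(hashtext(username)) % 10),
       max(abs(hashtext(username)) % 10)
from users;
```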

Does PostgreSQL know about this?

$ explain analyze select * from users where username = 'depesz';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..89.98 rows=11 width=81) (actual time=0.023..0.023 rows=0 loops=1)
   ->  Index Scan using users_username_key on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.016..0.016 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_1_username_key on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_2_username_key on users_2  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_3_username_key on users_3  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_4_username_key on users_4  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_5_username_key on users_5  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_6_username_key on users_6  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_7_username_key on users_7  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_8_username_key on users_8  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_9_username_key on users_9  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_10_username_key on users_10  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
 Planning time: 1.092 ms
 Execution time: 0.095 ms
(25 rows)

No. It does not. In fact, PostgreSQL can automatically exclude partitions only for checks based on ranges (or equality). Nothing based on functions. Even a simple modulo on a number is already too much:

$ alter table users_1 drop constraint partition_check, add constraint partition_check check ( id % 10 = 0);
$ alter table users_2 drop constraint partition_check, add constraint partition_check check ( id % 10 = 1);
...
$ alter table users_10 drop constraint partition_check, add constraint partition_check check ( id % 10 = 9);
$ explain analyze select * from users where id = 123;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..89.98 rows=11 width=81) (actual time=0.009..0.016 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.009..0.009 rows=1 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_2_pkey on users_2  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_3_pkey on users_3  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_4_pkey on users_4  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_5_pkey on users_5  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_6_pkey on users_6  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_7_pkey on users_7  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_8_pkey on users_8  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_9_pkey on users_9  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_10_pkey on users_10  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
 Planning time: 0.973 ms
 Execution time: 0.086 ms
(25 rows)

That is sad, because partitioning keys based on modulo arithmetic have one huge (in my opinion) benefit: a stable number of partitions. You never have to create new ones, unless you decide to re-partition once you reach some larger data volume.

Does that mean you cannot use complex (function- or modulo-based) partitioning keys? No. You can use them, but your queries become more complicated:

$ explain analyze select * from users where id = 123 and id % 10 = 123 % 10;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..16.48 rows=2 width=66) (actual time=0.010..0.011 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.31 rows=1 width=48) (actual time=0.010..0.010 rows=1 loops=1)
         Index Cond: (id = 123)
         Filter: ((id % 10) = 3)
   ->  Index Scan using users_4_pkey on users_4  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
         Filter: ((id % 10) = 3)
 Planning time: 1.018 ms
 Execution time: 0.033 ms
(9 rows)

Here I added one more condition, namely:

id % 10 = 123 % 10

PostgreSQL can rewrite this while analyzing the expression into:

id % 10 = 3

because it knows that the % operator for integers is immutable. And now, as part of the query, I have an exact partitioning key – id % 10 = 3. Thus Pg can use only those partitions that either have no partitioning key (that is, the base table) or whose key matches the query.

Whether the extra complication is worth it is for you to decide.
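The same trick works for the hashed-username scheme from earlier: repeat the partitioning expression with a constant so the planner can match it against the CHECK constraints. A sketch (with the abs(hashtext(...)) % 10 checks in place, only the base table and the matching partition should be scanned):

```sql
-- The second condition looks redundant, but it is what lets the
-- planner exclude the nine non-matching partitions:
select *
from users
where username = 'depesz'
  and abs(hashtext(username)) % 10 = abs(hashtext('depesz')) % 10;
```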

If you prefer not to change your queries, and adding new partitions from time to time is no trouble, you should take a look at pg_partman (PG Partition Manager), written by my former colleague Keith Fiske. It is a set of functions: some you run manually to define the partitioning, and one more that you run from cron, which takes care of creating new partitions for future data.

I already mentioned inserts, but did not explain how to get around the problem of inserts having to be routed to the partitions.

In general, this is a job for a trigger. Keith's pg_partman creates such triggers for you, but I want you to understand what is going on, so that you use pg_partman not as a "black box" but rather as a helper tool that does the tedious work for you.

My current partitioning scheme is based on modulo arithmetic (which, as far as I know, partman cannot do), so let's write a suitable trigger function. It will be called on every insert into the users table and must transparently redirect the insert into the appropriate partition. So, we write:

$ create function partition_for_users() returns trigger as $$
DECLARE
    v_partition_name text;
BEGIN
    -- users_1 holds id % 10 = 0, hence the "1 +" offset
    v_partition_name := format( 'users_%s', 1 + NEW.id % 10 );
    execute 'INSERT INTO ' || v_partition_name || ' VALUES ( ($1).* )' USING NEW;
    return NULL;
END;
$$ language plpgsql;

And now the trigger definition:

$ create trigger partition_users before insert on users for each row execute procedure partition_for_users();

Let's try to add a row:

$ insert into users (username, password, created_on, last_logged_on)
    values (
        'depesz',
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    );
$ select currval('users_id_seq');
 currval 
---------
   10003
(1 row)

Let's see whether the data is visible:

$ select * from users where username = 'depesz';
  id   | username |       password       |          created_on           |        last_logged_on         
-------+----------+----------------------+-------------------------------+-------------------------------
 10003 | depesz   | bp7zwy8k3t3a37chf1hf | 2014-10-24 02:45:51.398824+02 | 2015-02-05 18:24:57.072424+01
(1 row)

Looks good, but where is it? In the main table?

$ select * from only users where username = 'depesz';
 id | username | password | created_on | last_logged_on 
----+----------+----------+------------+----------------
(0 rows)

No. In the right partition, maybe?

$ select * from users_4 where username = 'depesz';
  id   | username |       password       |          created_on           |        last_logged_on         
-------+----------+----------------------+-------------------------------+-------------------------------
 10003 | depesz   | bp7zwy8k3t3a37chf1hf | 2014-10-24 02:45:51.398824+02 | 2015-02-05 18:24:57.072424+01

Yes. The trigger worked. But this method has one drawback: RETURNING does not work:

$ insert into users (username, password, created_on, last_logged_on)
    values (
        'test',
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    )
    returning *;
 id | username | password | created_on | last_logged_on 
----+----------+----------+------------+----------------
(0 rows)

This happens because, from the executor's point of view, the insert returned nothing – the trigger returned NULL.

I have not yet managed to find a good solution to this problem. In such cases I simply prefer to fetch the key value up front with nextval(), and then insert the ready-made value – that way it is already known after the insert:

$ select nextval('users_id_seq');
 nextval 
---------
   10005
(1 row)
 
$ insert into users (id, username, password, created_on, last_logged_on)
    values (
        10005,
        'test',
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    );
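Knowing the id up front also tells you which partition the row will land in, so as an alternative you could insert straight into the partition and get RETURNING back, since no redirect trigger fires on the child table. A sketch, following the 1 + id % 10 naming from the trigger above (the id 10015 is hypothetical):

```sql
-- Suppose nextval() returned 10015; 10015 % 10 = 5, so under the
-- trigger's naming the row belongs in users_6. Inserting there
-- directly bypasses the trigger, and RETURNING works again:
insert into users_6 (id, username, password, created_on, last_logged_on)
values (10015, 'test2', random_string(20), now(), now())
returning id;
```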

One more caveat: routing all inserts through the trigger slows them down, since for every row PG has to execute one extra "insert".

For heavy bulk inserts, the best solution will be to make them work directly with the partitions. So, for example, instead of

COPY users FROM stdin;
....
\.

you first find out how many ids you will need, for example like this:

select nextval('users_id_seq') from generate_series(1, 100);

And then you issue the appropriate:

COPY users_1 FROM stdin;
....
\.
COPY users_2 FROM stdin;
....
\.
...

Not the most convenient method, but it can be useful if you import large volumes of data into partitioned tables.

So, by now you should understand what partitioning is and how it works. The next question in the title was: why?

That one is fairly easy to answer: for performance, or for easier maintenance.

As a simple example, take a users table holding 1 billion rows (1,000,000,000).

Searching it gets progressively more expensive even with indexes, simply because the index depth grows.
You can see this even in my small test table.

Let's drop all the partitions and the partitioning trigger:

$ drop table users_1;
$ drop table users_2;
...
$ drop table users_10;
$ drop trigger partition_users on users;

Now the users table has 10,000 rows. A simple search by username takes 0.020 ms – the best time out of three attempts.

If I add more rows:

$ insert into users (username, password, created_on, last_logged_on)
    select
        random_string( (random() * 4 + 5)::int4),
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    from
        generate_series(1, 100000);

the same search takes 0.025 ms. An increase of 0.005 ms may look small, but we still have only 110,000 rows, and there are no other tables in the system, so the whole table with its indexes fits in memory.

Of course, your partitioning has to be intelligent. For example, if you usually search by username, it makes no sense to partition on id – Pg would have to search all the partitions (it may get smarter about this in the future, but I will talk about that at the very end of the article).

That is, you need to decide what you usually query – do you search by some key, or perhaps you usually only look at fresh data? Then partition in a way that limits the number of partitions Pg needs to scan.
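As an illustration of the "fresh data" case, the partitions can be cut by date instead of id. A sketch following the same LIKE/INHERIT/CHECK pattern used later in this article (the table name and monthly range are made up):

```sql
-- One partition per month of user creation
CREATE TABLE users_2015_12 ( LIKE users INCLUDING ALL );
ALTER TABLE users_2015_12 INHERIT users;
ALTER TABLE users_2015_12 ADD CONSTRAINT partitioning_check
    CHECK ( created_on >= '2015-12-01' AND created_on < '2016-01-01' );
```

Queries that filter on created_on then only need to touch the few most recent partitions.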

What is important is that partitioning makes your life easier, especially if you are more a database administrator than a programmer. Any maintenance task (index creation, vacuum, pg_reorg/pg_repack, pg_dump) can effectively be broken into as many subtasks as you have partitions. So instead of one hours-long transaction to repack the big table, you get 20 much faster transactions that use less disk space, and the result is essentially the same!
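For example, vacuuming partition by partition turns one long operation into many short ones; each statement below is a separate, quick operation (names as in the users example):

```sql
VACUUM ANALYZE users_1;
VACUUM ANALYZE users_2;
-- ...and so on, one short operation per partition.
-- In newer psql (9.6+) the list can even be generated and run with \gexec:
-- SELECT format('VACUUM ANALYZE %I', tablename)
--   FROM pg_tables WHERE tablename LIKE 'users\_%'
-- \gexec
```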

Of course, it is not all good news. Partitioning has one big shortcoming: you cannot have foreign keys pointing to a partitioned table.

It simply does not work. You could have foreign keys pointing directly at a partition, but that (usually) makes no sense.

Whether this is a big problem for you depends on your use case. It seems to me that in most cases, by the time tables are big enough to justify partitioning, the application has been tested well enough that we can live without the foreign key. Besides, we can always add a cron job that checks for "bad" values.
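Such a check for "bad" values is just an anti-join that a scheduled job can run periodically. A sketch, with a hypothetical orders table whose user_id column would otherwise have been a foreign key to users:

```sql
-- Report rows whose user_id points at a user that does not exist
SELECT o.id, o.user_id
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL;
```

If this ever returns rows, you have found exactly the values a real foreign key would have prevented.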

Now we know what partitioning is, how it works, and why it is used. The last question remains: how to convert a table into a partitioned one. Usually an application is not built with partitioned tables – at the beginning that makes no sense. But sooner or later you end up with a table with lots of rows and think: "I should have partitioned it right from the start."

But can we still partition it while the application is already running? With a minimum of problems?
Let's see. For the test I created a 97 GB pgbench database. Most of it, 83 GB, sits in the pgbench_accounts table, which contains 666,600,000 records.

The schema of this table looks like this:

   Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers 
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       | 
 abalance | integer       | 
 filler   | character(84) | 
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)

All queries against it are based on the aid column, which contains values from 1 to 666,600,000.

So let's partition it based on ranges of aid values.

Say I put 10 million rows in each partition; then I will need 67 partitions.

But how can I check that my actions will not break anything? Very simply: I will run pgbench in a loop. Exact speed figures do not interest me – it is enough to see how strongly my work affects what pgbench is doing.

With these thoughts in mind, I started:

$ while true
do
    date
    pgbench -T 10 -c 2 bench
done 2>&1 | tee pgbench.log

It runs 10-second tests and saves the statistics to a file, so that later I can correlate the results with my partitioning work.

When everything is ready, I create the partitions with checks in the right places:

do $$
declare
    i int4;
    aid_min INT4;
    aid_max INT4;
begin
    for i in 1..67
    loop
        aid_min := (i - 1) * 10000000 + 1;
        aid_max := i * 10000000;
        execute format('CREATE TABLE pgbench_accounts_p_%s ( like pgbench_accounts including all )', i );
        execute format('ALTER TABLE pgbench_accounts_p_%s inherit pgbench_accounts', i);
        execute format('ALTER TABLE pgbench_accounts_p_%s add constraint partitioning_check check ( aid >= %s AND aid <= %s )', i, aid_min, aid_max );
    end loop;
end;
$$;

The partitions are ready, and I can verify that the checks are used:

$ explain analyze select * from pgbench_accounts where aid = 123;
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.57..16.75 rows=2 width=224) (actual time=6.468..6.473 rows=1 loops=1)
   ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.57..8.59 rows=1 width=97) (actual time=6.468..6.469 rows=1 loops=1)
         Index Cond: (aid = 123)
   ->  Index Scan using pgbench_accounts_p_1_pkey on pgbench_accounts_p_1  (cost=0.14..8.16 rows=1 width=352) (actual time=0.004..0.004 rows=0 loops=1)
         Index Cond: (aid = 123)
 Planning time: 3.475 ms
 Execution time: 6.497 ms
(7 rows)
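This pruning relies on the constraint_exclusion setting; its default value, partition, already makes the planner consider the CHECK constraints of child tables for queries that go through the parent, which is exactly what the plan above shows. You can confirm the current value with:

```sql
SHOW constraint_exclusion;  -- 'partition' (the default) is what inheritance-based setups need
```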

Now I need to add the "router" trigger:

$ create function partition_for_accounts() returns trigger as $$
DECLARE
    v_partition_name text;
BEGIN
    v_partition_name := format( 'pgbench_accounts_p_%s', 1 + ( NEW.aid - 1 ) / 10000000 );
    execute 'INSERT INTO ' || v_partition_name || ' VALUES ( ($1).* )' USING NEW;
    return NULL;
END;
$$ language plpgsql;
 
$ create trigger partition_users before insert on pgbench_accounts for each row execute procedure partition_for_accounts();
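A quick sanity check of the router: insert one row through the parent and ask Pg which table it actually landed in (the aid value here is arbitrary, chosen above the existing data):

```sql
-- 1 + (666700001 - 1) / 10000000 = 67, so this should land in pgbench_accounts_p_67
INSERT INTO pgbench_accounts (aid, bid, abalance, filler)
    VALUES (666700001, 1, 0, '');
SELECT tableoid::regclass FROM pgbench_accounts WHERE aid = 666700001;
-- clean up the test row afterwards
DELETE FROM pgbench_accounts WHERE aid = 666700001;
```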

This is all great, but it only works for freshly inserted rows, and I already have 666 million rows in the original table. What do I do with them?

I need to move them. That is rather simple in theory, but there are a couple of pitfalls:

  1. At no point should any transaction see a row in both places at once (that is, in the main table and in a partition).
  2. I cannot simply delete all the rows and insert them into the partitions, because that would lock the whole base table for the duration of the move.

The second problem can be mitigated by working with batches of data. But we cannot use a single SQL function for this.
Every now and then somebody asks how to split a big operation into portions by calling one SQL function that iterates over the data portion by portion. This approach has one fundamental problem: a function call is a transaction. Everything that function does happens within one transaction, so the locking problem is not solved.

But we can use psql for this (or Ruby, Perl, Python – it does not matter), moving only a small number of rows per batch and thus locking the main table only for a short moment at a time.

In general, a single query will look like this:

with x as (delete from only pgbench_accounts where aid between .. and .. returning *)
insert into appropriate_partition select * from x;

I chose a batch size of 1000 – small enough that each step does not drag on, and large enough that the total number of batches (666 thousand) is not excessive.

Now let's generate the batch file:

\pset format unaligned
\pset tuples_only true
\o /tmp/run.batch.migration.sql
SELECT
    format(
        'with x as (DELETE FROM ONLY pgbench_accounts WHERE aid >= %s AND aid <= %s returning *) INSERT INTO pgbench_accounts_p_%s SELECT * FROM x;',
        i,
        i + 999,
        ( i - 1 ) / 10000000 + 1
    )
FROM
    generate_series( 1, 666600000, 1000 ) i;
\o

When I ran this in psql, it created the file /tmp/run.batch.migration.sql, which is rather large (97 MB), as it contains 666,600 queries like these:

with x as (DELETE FROM ONLY pgbench_accounts WHERE aid >= 1 AND aid <= 1000 returning *) INSERT INTO pgbench_accounts_p_1 SELECT * FROM x;
with x as (DELETE FROM ONLY pgbench_accounts WHERE aid >= 1001 AND aid <= 2000 returning *) INSERT INTO pgbench_accounts_p_1 SELECT * FROM x;
with x as (DELETE FROM ONLY pgbench_accounts WHERE aid >= 2001 AND aid <= 3000 returning *) INSERT INTO pgbench_accounts_p_1 SELECT * FROM x;

Now that everything is prepared, I can start the process (inside "screen" or "tmux", of course, so that nothing is lost if the ssh connection to the server drops):

$ psql -d bench -f /tmp/run.batch.migration.sql

It will take some time. With my test database, the average batch is processed in ~92 ms, which means about 17 hours of data movement lie ahead.

In reality it took only 7 hours. Not bad.

When it finished, the pgbench_accounts table still weighed ~83 GB (I suspect my disk is simply not fast enough to cope with pgbench, the migration, and vacuum all at once).

But I checked, and it seems all the rows have moved to the partitions:

$ select count(*) from only pgbench_accounts;
 count 
-------
     0
(1 row)

And what about pgbench speed during the migration?

There were 4 phases:

  1. Before any migration work.
  2. After the partitions were created.
  3. After the trigger was created.
  4. During the migration.

Results?

  phase  |    min    |       avg        |    max    
---------+-----------+------------------+-----------
 Phase 1 | 28.662223 | 64.0359512839506 | 87.219148
 Phase 2 | 21.147816 | 56.2721418360656 | 75.967217
 Phase 3 | 23.868018 | 58.6375074477612 | 75.335558
 Phase 4 |  5.222364 | 23.6086916565574 | 65.770852
(4 rows)

Yes, the migration slowed everything down. But please note that this is an ordinary personal computer with SATA disks, not SSDs, and it was under heavy load – pgbench was firing queries as fast as it could.

Besides, some of the slowdown happened because vacuum does not cope too well with deletions. In my opinion, the result is perfectly acceptable.

When it was all over, I could run (note the ONLY – without it, TRUNCATE would also empty all the partitions):

$ truncate only pgbench_accounts;

And then check that everything is OK:

$ select count(*) from pgbench_accounts;
   count   
-----------
 666600000
(1 row)

All of this was done without any errors and without interrupting the work of "this application".

To finish, I will add that partitioning will soon become (relatively speaking) even cooler. For quite some time now we have been able to store partitions on different servers. And work is underway (though this change will hardly appear before version 9.6) to make parallel scans possible, which will significantly improve the whole process.

I hope this text will be useful to you.

What other aspects of table partitioning in PostgreSQL would you like to discuss? We will be glad to add the topics that interest you most to the program of the PG Day'16 Russia conference! Early bird ticket sales are already open – hurry and register at the lowest price!

This article is a translation of the original post at habrahabr.ru/post/273933/
