Document storage in Postgres has become a little simpler: we now have solid save routines, the ability to run full-text search, and some simple search and filter routines.

That's only half the story, of course. Rudimentary searches may cover the application's needs, but they will never hold up in the long run, when we need to ask deeper questions.

Initial document


Document storage is a very big subject. How to store a document (and what to store) breaks down, for me, into three areas:

  • Document/domain model. This is how the developer sees things; if you're a fan of DDD (Domain Driven Design), it plays a role.
  • The real world. Invoices, purchases, orders — businesses run on these things, so let's think in those terms.
  • Transactions, process results, event sourcing. When "something happens" in the application, you track everything that happened along with it and store it.

I lean heavily toward the last one. I'm an information junkie, and when something happens I want to know what/why/where, with no limits.

Here's what I used to do to record information about people buying something at Tekpub. It's the document format I was planning to put into use but never got to (because of the sale to Pluralsight).

{
  "id": 1,
  "items": [
    {
      "sku": "ALBUM-108",
      "grams": "0",
      "price": 1317,
      "taxes": [],
      "vendor": "Iron Maiden",
      "taxable": true,
      "quantity": 1,
      "discounts": [],
      "gift_card": false,
      "fulfillment": "download",
      "requires_shipping": false
    }
  ],
  "notes": [],
  "source": "Web",
  "status": "complete",
  "payment": {
    //...
  },
  "customer": {
    //...
  },
  "referral": {
    //...
  },
  "discounts": [],
  "started_at": "2015-02-18T03:07:33.037Z",
  "completed_at": "2015-02-18T03:07:33.037Z",
  "billing_address": {
    //...
  },
  "shipping_address": {
    //...
  },
  "processor_response": {
    //...
  }
}

It's a big document. I love big documents! This document is the exact result of all the information flows involved in building the order:

  • The customer's addresses (billing, shipping)
  • Payment information and what was purchased
  • How they got here, and a summary of what happened along the way (in the form of notes)
  • The exact response from the payment processor (which is itself a big document)

I want this document to be an autonomous, self-contained object that needs no other documents to be complete. In other words, I'd like to be able to:

  • Fulfill the order
  • Run some reports
  • Notify the customer about changes, fulfillment, etc.
  • Take further action if needed (refunds, cancellations)

This document is complete in itself, and that's great!
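
For context: the queries further down pull these documents from a sales table with a jsonb body column, set up in the earlier parts of this series. A minimal sketch of that table (the table and column names are taken from the queries below; the real save routine does more) might look like this:

-- minimal sketch of the storage table assumed by the queries below;
-- the setup from the earlier articles also adds search fields, timestamps, etc.
create table if not exists sales (
  id serial primary key,
  body jsonb not null
);

-- store an order document as-is (abridged here)
insert into sales (body)
values ('{"id": 1, "status": "complete", "items": []}'::jsonb);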

OK, enough of that — let's write some reports.

Shaping the data: the fact table


When moving into analytics, it's important to remember two things:

  • Never run it on your production system
  • Denormalization is the norm

Running huge queries over joined-up tables takes forever, and in the end it gets you nowhere. You want to build reports on historical data that doesn't change (or changes very little) over time. Denormalization helps with speed, and speed is your friend when building reports.

With that in mind, we can use the goodness of PostgreSQL to shape our data into a sales fact table. A "fact" table is simply a denormalized data set representing an event in your system — the smallest grain of captured information about that fact.

For us that fact is a sale, and we want the event to look like this:

[image: sample rows of the sales fact table — invoice_id, quarter, year, month, day, sku, vendor, price, quantity]

I'm using the Chinook sample database with some random sales data generated with Faker.

Each of these records is a single event that I want to aggregate, and all of the dimension information I want to group it by (time, vendor) is already included. I could add more (category, etc.), but this will do for now.

This data is in tabular form, which means we have to pull it out of the document shown above. A daunting task, but much simpler because we're using PostgreSQL:

with items as (
  -- one row per line item, carrying the invoice id and completion date along
  select body -> 'id' as invoice_id,
    (body ->> 'completed_at')::timestamptz as date,
    jsonb_array_elements(body -> 'items') as sale_items
  from sales
), fact as (
  -- split each item into typed columns and break the date into dimensions
  select invoice_id,
    date_part('quarter', date) as quarter,
    date_part('year', date) as year,
    date_part('month', date) as month,
    date_part('day', date) as day,
    x.*
  from items, jsonb_to_record(sale_items) as x(
    sku varchar(50),
    vendor varchar(255),
    price int,
    quantity int
  )
)

select * from fact;

This is a set of Common Table Expressions (CTEs) chained together functionally (more on that below). If you've never used CTEs, they can look a little unusual… until you get used to them and realize you're simply chaining things together by name.

In the first query above, I pull out the sale's id and call it invoice_id, then I pull out the timestamp and cast it to timestamptz. Simple stuff, really.
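
As a quick aside (a standalone sketch, not part of the article's query): -> returns jsonb, while ->> returns text, which is why the timestamp is pulled out with ->> before being cast:

-- -> keeps the value as jsonb; ->> returns text, which we can cast to timestamptz
select body -> 'id' as id_as_jsonb,
  (body ->> 'completed_at')::timestamptz as completed
from (select '{"id": 1, "completed_at": "2015-02-18T03:07:33.037Z"}'::jsonb as body) t;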

Where it gets more interesting is jsonb_array_elements, which pulls the array of objects out of the document and creates a row for each one. That is, if we had a single document in the database with three items and ran the following query:

select body -> 'id' as invoice_id,
(body ->> 'completed_at')::timestamptz as date,
jsonb_array_elements(body -> 'items') as sale_items
from sales

Instead of one record representing the sale, we would get three:

[image: three result rows, one per item in the array]

Now that we've pulled out the items, we need to split them into separate columns. This is where the next trick, jsonb_to_record, comes in. We can use this function directly, declaring the column types on the fly:

select * from jsonb_to_record(
  '{"name" : "Rob", "occupation": "Hazard"}'
) as (
  name varchar(50),
  occupation varchar(255)
)

In this simple example I'm converting jsonb into a table — all I have to do is tell PostgreSQL how. That's exactly what we do in the second CTE ("fact") above. We also use date_part to break the date into its parts.
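
If you haven't used date_part before, it simply extracts one named field from a date or timestamp — a quick standalone example:

-- date_part pulls a single named field out of a timestamp;
-- with timestamptz the parts are computed in the session time zone
select date_part('year', ts) as year,
  date_part('quarter', ts) as quarter,
  date_part('month', ts) as month,
  date_part('day', ts) as day
from (select timestamptz '2015-02-18T03:07:33.037Z' as ts) t;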

This gives us a fact table that we can save as a view:

create view sales_fact as 
-- the query above
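
Spelled out in full, that's nothing more than the query from above wrapped in create view:

create view sales_fact as
with items as (
  select body -> 'id' as invoice_id,
    (body ->> 'completed_at')::timestamptz as date,
    jsonb_array_elements(body -> 'items') as sale_items
  from sales
), fact as (
  select invoice_id,
    date_part('quarter', date) as quarter,
    date_part('year', date) as year,
    date_part('month', date) as month,
    date_part('day', date) as day,
    x.*
  from items, jsonb_to_record(sale_items) as x(
    sku varchar(50),
    vendor varchar(255),
    price int,
    quantity int
  )
)
select * from fact;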

You might think this query would be awfully slow. In fact, it's quite fast. This isn't a benchmark or anything like it — just a relative result to show you that the query is, in fact, fast. I have 1,000 test documents in the database, and running this query over all of them returns in roughly a tenth of a second:

[image: query timing — roughly a tenth of a second over 1,000 documents]

PostgreSQL. First-class stuff.

Now we're ready for some aggregation!

Sales report


From here on it only gets simpler. You just aggregate the data you want, and if you forgot something, you simply add it to the view — no need to worry about table joins. It's a straightforward data transformation that happens to be fast.

Let's look at the top five sellers:

select sku,
  sum(quantity) as sales_count,
  sum((price * quantity)/100)::money as sales_total
from sales_fact
group by sku
order by sales_count desc
limit 5

This query returns in 0.12 seconds. Fast enough for 1,000 records.
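
The same goes for any other slice you might want. For example, here's a sketch of sales by vendor per quarter, using only columns already in the view:

-- totals per vendor and quarter, built from the dimensions already in sales_fact
select vendor,
  year,
  quarter,
  sum(quantity) as units_sold,
  sum((price * quantity)/100)::money as sales_total
from sales_fact
group by vendor, year, quarter
order by year, quarter, sales_total desc;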

CTEs and functional queries


One of the things I really like about RethinkDB is its query language, ReQL. It's inspired by Haskell (according to the team) and it's all about composition (at least for me):

To understand ReQL, it helps to understand functional programming. Functional programming falls within the declarative paradigm, in which the programmer aims to describe the value he wants to compute rather than the steps required to compute it. Database query languages generally strive for this declarative ideal, since it gives the query processor the greatest freedom to choose an optimal execution plan. But where SQL achieves this with special keywords and specific declarative syntax, ReQL can express arbitrarily complex operations through functional composition.


As you can see above, we can approximate this with CTEs chained together, each one transforming the data in its own specific way.
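
Here's that compositional idea boiled down to a trivial standalone sketch in plain SQL: each CTE is one small transformation, and the next one refers to its output by name.

-- each CTE is a step in a pipeline; the next step refers to the previous one by name
with raw as (
  select generate_series(1, 5) as n
), doubled as (
  select n, n * 2 as n2 from raw
), filtered as (
  select * from doubled where n2 > 4
)
select * from filtered;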

Conclusion


There is a lot more I could write, but let's just sum it all up by saying that you can do everything other document-oriented systems can do, and then some. Postgres's query capabilities are huge — the list of things you can't do is very short, and as you've seen, the ability to transform your document into a tabular structure helps a great deal.

And that's the end of this little series of articles.
