Developers Club geek daily blog

2 years, 10 months ago
The HV data storage format: an attempt to solve the problem of readable storage of text fields

Not long ago I needed to store data in text form so that not only a program could work with it, but a person could also read and edit it (and create it from scratch in a text editor). Plenty of good, convenient formats already exist for this purpose, for example JSON, YAML and XML. Still, each of the formats I looked at had something I did not quite like.

I want to focus on what I consider a glaring inconvenience of most such formats, including very powerful and popular ones: storing text. How do you write a text field that may contain arbitrary characters so that its contents do not have to be changed and do not affect parsing, even though it may contain substrings that match control sequences as well as unusual indentation? In XML, for example, the text must not contain the characters "<" and ">"; they have to be replaced with "&lt;" and "&gt;" respectively. Many other formats require text to be quoted. But what if the text already contains quotes? Use another kind of quote? Escape them? All of this means modifying the text, and there is no guarantee that the result will be comfortable to read and edit in an ordinary text editor such as Notepad or a browser input field (textarea). Then there is YAML, where text does not have to be quoted, but where correct indentation is essential, which does not seem very convenient for multiline, multi-level data. Indentation also increases the share of characters that are not data: several service spaces at the start of every line add noticeable weight.

In addition to text, I needed to store two more basic data types, integers and fractional numbers, as well as aggregates: data structures (blocks) and arrays. That makes 5 types in total: integer, floating-point number, text, structure, array. I had no need for macros, expressions or other extensions; I only needed numbers and texts distributed across blocks and arrays. Given such simple requirements, I wanted the most trivial format possible that could, among other things, store text fields with the caveats described above. I also wanted the data to carry as few control characters as possible, so that the syntax would be easier to understand and remember.

In short, I reinvented the wheel, with a twist: the HV format (the original internal name was "human values"). I will use it to show a practical solution to the problem as I see it. The format turned out to be uncomplicated, which is exactly what was required. As already mentioned, it supports only three simple data types (integer, floating-point number and text) and two composite types (a structure and an array, which can contain both simple and composite types). There are only 3 main control characters. There are 3 additional control characters, used for special cases of text formatting and for comments. The text-formatting cases relate to the question raised in this article (convenient storage of text fields) and are covered below with examples.

Data fields can be single-line (an integer, a floating-point number, single-line text) or multiline (a structure, an array, multiline text). First comes the field name, then a control character that indicates whether the value occupies one line or several, and then the field value. If the value spans several lines, a terminating line follows it. That, in brief, is the essence of the format.

The features of the HV format are easiest to show with examples.
I will start with a general description to make the syntax clear, and then gradually move on to my approach to the problem raised in this article.

a: 1
b: 2.2
c: abcd


Here 3 simple data types are shown:
a is the integer 1
b is the floating-point number 2.2
c is a single-line text field equal to "abcd"
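As a rough illustration (not the author's processor), the one-line fields above could be read with a few lines of Python. The function names are hypothetical, and guessing the type from the shape of the value is an assumption made for brevity; as the article notes further on, the real processor knows the expected types in advance.

```python
def parse_scalar(raw):
    """Guess the type of a one-line value: try int, then float, else keep text."""
    for convert in (int, float):
        try:
            return convert(raw)
        except ValueError:
            pass
    return raw


def parse_one_line_fields(source):
    """Parse lines of the form 'name: value' into a dict (hypothetical helper)."""
    fields = {}
    for line in source.splitlines():
        line = line.strip()  # indentation is not significant
        if not line:
            continue
        name, colon, raw = line.partition(":")
        if not colon:
            raise ValueError("expected 'name: value', got %r" % line)
        fields[name.strip()] = parse_scalar(raw.strip())
    return fields
```

Note that type guessing has obvious pitfalls: a text value that happens to look like a number would be converted, which is one reason a schema-driven processor is the safer design.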

The next example shows a data structure and an array nested inside it:

xxq+
  a: 12.33
  b: -15
  x+
    : ab
    : cd
    : ef
  ^
^


Here the structure contains two simple fields and an array:
a is the floating-point number 12.33
b is the integer -15
x is an array of text fields equal to "ab", "cd" and "ef"
No field name is written for array elements.

Note right away that indentation carries no meaning, so the data in the following example is exactly identical to the data in the previous one:

xxq+
a:    12.33
b:       -15
x+
: ab
:      cd
:ef
            ^
^


And here is the same data again, this time with no spaces at all:

xxq+
a:12.33
b:-15
x+
:ab
:cd
:ef
^
^


So, the most important control characters are ":" (the value occupies one line) and "+" (the value occupies several lines).
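The following Python sketch shows one way a reader could handle nested structures and arrays regardless of indentation. It is an illustration under stated assumptions, not the author's implementation: the names split_header and parse_block are hypothetical, values are left as strings, multiline text fields and comments are not handled, and a block is treated as an array when its entries have no names.

```python
def split_header(line):
    """Split a line at its first control character (':' or '+')."""
    for index, char in enumerate(line):
        if char in ":+":
            return line[:index].strip(), char, line[index + 1:].strip()
    raise ValueError("no control character in line: %r" % line)


def parse_block(lines, pos=0):
    """Parse fields from lines[pos] up to a '^' line or the end of input.

    Returns (value, next_pos). The value is a list if the entries were
    unnamed (an array), otherwise a dict (a structure).
    """
    struct, array = {}, []
    while pos < len(lines):
        line = lines[pos].strip()  # indentation carries no meaning
        pos += 1
        if not line:
            continue
        if line == "^":
            break
        name, control, rest = split_header(line)
        if control == "+":
            value, pos = parse_block(lines, pos)  # nested structure or array
        else:
            value = rest
        if name:
            struct[name] = value
        else:
            array.append(value)
    return (array if array else struct), pos
```

Because every line is stripped before it is examined, the indented, misaligned and space-free variants of the example all produce the same result.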

And now to the point: my approach to representing multiline text that contains arbitrary characters:

t+
  ABCD
  EFGH<12>@@
  ijklmnopq
  "ABC" + "DEF" = "ABCDEF"
  "A('a')" =//= "B"('''')\
  abcd
^


In this example the resulting text is:

ABCD
EFGH<12>@@
ijklmnopq
"ABC" + "DEF" = "ABCDEF"
"A('a')" =//= "B"('''')\
abcd


The quotes, slashes and other characters in the text are neither replaced nor escaped; there is no need. The text stays completely original and requires no further conversion.

The text is delimited by a terminating line. By default the terminating line is the control character "^". The same line ends all multiline fields, including structures and arrays (as shown in the examples above). The value is read line by line, without indentation, until the terminating line is encountered. The match is against the entire line, not a substring (indentation, as already mentioned, is ignored and can be anything).
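The reading rule just described can be sketched in Python as follows. This is an illustration, not the author's code: read_text_block is a hypothetical name, and the sketch assumes the header line (such as "t+") has already been consumed by the caller.

```python
def read_text_block(lines, start, terminator="^"):
    """Collect text lines from lines[start] until a line equal to terminator.

    Each line is stripped of surrounding whitespace, and the terminator must
    match the whole stripped line, not a substring.
    Returns (text, next_pos).
    """
    collected = []
    for pos in range(start, len(lines)):
        stripped = lines[pos].strip()
        if stripped == terminator:
            return "\n".join(collected), pos + 1
        collected.append(stripped)
    raise ValueError("terminating line %r not found" % terminator)
```

A line that merely contains "^" somewhere in the middle is kept as ordinary text, which is why quotes, brackets and slashes need no escaping.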

Writing text fields raises two quite reasonable questions:

1) What if the source text contains a line equal to the terminating line, that is, "^"?
2) What if indentation in the text matters and cannot be ignored?

For the first case, the HV format lets you redefine the terminating line. It is specified before the field value and, accordingly, again after it:

eee+ END
  hello
  ^
  ^
  ^
  ^
  abcd
END


The text contained in the eee field is:

hello
^
^
^
^
abcd


An important nuance: the terminating line can be redefined only for text fields. The other multiline values (structures and arrays) always end with the control character "^".
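A reader that honors a redefined terminating line might look like the Python sketch below. It is only an illustration: read_text_field is a hypothetical name, and the sketch assumes that whatever follows "+" on the header line (trimmed of whitespace) is the terminator, with "^" used when nothing follows.

```python
def read_text_field(lines, start=0):
    """Read a multiline text field whose header is at lines[start].

    The header has the form 'name+' or 'name+ TERMINATOR'.
    Returns (name, text, next_pos).
    """
    header = lines[start].strip()
    name, _, rest = header.partition("+")
    terminator = rest.strip() or "^"  # default terminating line
    body = []
    for pos in range(start + 1, len(lines)):
        stripped = lines[pos].strip()
        if stripped == terminator:
            return name.strip(), "\n".join(body), pos + 1
        body.append(stripped)
    raise ValueError("terminating line %r not found" % terminator)
```

With the header "eee+ END", lines consisting of "^" are read as ordinary text and only the line "END" closes the field.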

For the second case (indentation matters), HV offers two options.
Option A. Preserve all whitespace to the left and right of the text in every line:

text@
  This is a line with a paragraph indent.
And this is an ordinary line.
All indentation from the start of the line will be preserved
   in the text.
                      Like this.
^


Option B. Preserve whitespace starting from the first non-whitespace character in every line; that first character itself is not included in the text:

text%
  -А
   *Б
    =В гдеёжзи
^


The resulting text is:

А
Б
В гдеёжзи
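The difference between the three text modes can be summarized in a small Python sketch. It is an illustration only: decode_text_line is a hypothetical name, and keeping trailing whitespace in the "%" mode is an assumption based on the description above, not something the examples demonstrate.

```python
def decode_text_line(line, mode):
    """Turn one raw line of a multiline text field into its stored form.

    mode is the control character from the field header:
      '+'  plain text: surrounding whitespace is dropped
      '@'  formatted text: the line is kept exactly as written
      '%'  marked text: everything up to and including the first
           non-whitespace character (the marker) is dropped
    """
    if mode == "@":
        return line
    if mode == "%":
        return line.lstrip()[1:]
    return line.strip()
```

In the "%" mode the marker character can be anything, so "-", "*" and "=" in the example above all play the same role: they show where the significant part of the line begins.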


An interesting feature: encapsulating serialized data as text
I want to draw attention to one more feature that is interesting, useful and almost unique, although it is rarely needed. It comes for free from the ability to replace the terminating line, which leaves the original text unchanged. The idea is that data in the HV format can be embedded as a text field inside other HV data without causing any syntax errors during parsing. This is useful when several text processors sit at different levels and do not know which format each of the others works with; they simply pass the text on to the next level.
For example, suppose two arrays in the HV format have to be delivered to the first level:

a+
  : 1
  : 2
^
b+
  : 3
  : 4
^


But they have to be passed as text through the second level:

level_2+
  for_level_1+ &
    a+
      : 1
      : 2
    ^
    b+
      : 3
      : 4
    ^    
  &
^


The for_level_1 field is a text field. Its terminating line is simply replaced with "&".
Under the terms of the example, the data intended for the first level cannot be parsed at the second level, because the second level does not know how this text should be processed: it might be HV, it might be JSON, or it might be text not meant for parsing at all. That decision is made by the first level.

In other words, any serialized data can be passed in an HV text field, be it HV itself, JSON, XML, YAML and so on. I have not come across safe encapsulation without modifying the text in any of the formats I looked at. This feature is rarely needed, but it is there.
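On the writing side, encapsulation amounts to choosing a terminating line that does not occur in the payload. The Python sketch below illustrates the idea; wrap_as_text_field and its list of candidate terminators are hypothetical and not part of the author's processor.

```python
def wrap_as_text_field(name, payload, candidates=("^", "&", "END")):
    """Return an HV text field that holds payload without modifying it.

    The first candidate terminator that does not occur as a whole
    (stripped) line of the payload is used.
    """
    lines = payload.splitlines()
    used = {line.strip() for line in lines}
    for terminator in candidates:
        if terminator not in used:
            break
    else:
        raise ValueError("every candidate terminator occurs in the payload")
    # The default terminator "^" does not need to be named in the header.
    header = name + "+" if terminator == "^" else name + "+ " + terminator
    return "\n".join([header] + lines + [terminator])
```

One caveat: a plain "+" field ignores indentation when it is read back. That is harmless for an HV payload, where indentation has no meaning, but an indentation-sensitive payload such as YAML would need one of the indentation-preserving modes; the article does not say whether those can be combined with a redefined terminating line.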


So there are 3 main control characters:

: — value in one line
+ — value in several lines
^ — the end of multiline value

And 3 additional ones:

@ — the formatted multiline text
% — the marked multiline text
# — the comment

There are no mandatory brackets, quotes or explicit type annotations. Typing and all validity checks are performed by the HV processor, which knows in advance which field names may occur and what type and format their values must have. This extreme simplicity makes the format portable to practically any programming language.
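Schema-driven typing of this kind can be sketched in Python as follows. The schema layout (a dict from field name to a conversion function) and the name apply_schema are assumptions for illustration; the article does not describe how the real processor declares expected fields.

```python
def apply_schema(raw_fields, schema):
    """Convert raw string values to the types the schema expects.

    raw_fields maps field names to the strings read from the file;
    schema maps each allowed field name to a conversion callable.
    Unknown fields raise KeyError, bad values raise ValueError.
    """
    typed = {}
    for name, raw in raw_fields.items():
        if name not in schema:
            raise KeyError("unexpected field: %s" % name)
        typed[name] = schema[name](raw)
    return typed
```

For example, with the schema {"a": float, "b": int, "c": str}, the raw value "12.33" for a becomes a floating-point number, while "2.2" supplied for b is rejected because it is not an integer.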

At first glance HV may seem similar to YAML: it is also minimalist, and its text is also unquoted. But since HV was created from scratch rather than derived from an existing format, it has more differences from YAML than similarities. HV does not care about indentation. The overall share of service text in HV is smaller, because YAML requires indentation and often uses combinations of two or more characters, for example "---", ": | -", ":>", while HV always uses single control characters. And I have not seen the mechanism of delimiting text with a redefinable terminating line in any of the formats I looked at. It seems to me a rather convenient and clear mechanism.

All in all, the result is a concise format for storing simple data that is, in my opinion, easy for a person to read. Of course, it cannot store functions, associative arrays, macros, preprocessors, closures, arithmetic expressions and the other fancy things many other formats can boast of. But these frills are not needed, since the HV format solves the problems it was designed for, as described above: it does not require quoting text or using brackets, does not require escaping characters, does not require explicit data types, looks quite simple, supports the most basic set of types, uses few control characters, and so on.

I hope I have managed to explain the reasons for creating the HV format and its features. If I have left something unexplained, I will be glad to answer reasonable questions.

For those who want to get to know the HV format better, a more detailed description and a pile of examples covering all aspects are available at http://vaomark.com/z23F0Cz.

The current source code of the HV processor and a test module for Python 2.7 can be downloaded there as well. By the way, the processor is soon going to be ported to C++, Java, PHP and other languages; everything will be available at the same link.

P.S.: The HV format is built on my view of how to solve the problem of storing text fields in serialized form, so that the values stay in their original, unmodified form and can be conveniently read and changed in any simple editor. Some will consider the solution successful, others the opposite; maybe someone will propose their own. Some will think that the issue raised in the article is not a problem at all and that everything is convenient enough already. I would like to hear your opinion.

This article is a translation of the original post at habrahabr.ru/post/271501/
