
1 year, 1 month ago
War, Peace and ABBYY Compreno: our affair with Tolstoy continues

We recently described here how the All Tolstoy in One Click project was carried out. With the help of 3249 (three thousand two hundred forty-nine) volunteers and 1 (one) good OCR technology, we digitized 46820 pages of the writer's 90-volume collected works, carefully proofread them and made them publicly available.

But if you thought our "affair with Tolstoy" ended there, you were mistaken: having digitized the writer's texts, we began to study them with the ABBYY Compreno information extraction technology, so that such rich material would not go to waste. Read on to find out what "Tolstoy text mining" gave us and where the results are used now.

Introduction


The main goal of the All Tolstoy in One Click project was to make Tolstoy's work truly public property, so that every text that came from his pen would be available in one click anywhere on Earth. That, by the way, is what the author himself bequeathed: he renounced all rights to his texts during his lifetime (yes, anonymous, Lev Tolstoy knew about copyleft and open data long before this Internet of yours and Richard Stallman).

However, being able to load a book in a convenient format onto an e-reader or tablet is not the only advantage of digitization. Tolstoy's texts can now be not only read but also "measured", that is, studied with various quantitative methods using the whole arsenal of automatic text processing (AOT in Russian, i.e. NLP). If you have all of a writer's texts in electronic form, even one or two well-chosen search queries can yield curious data that would once have cost a literary scholar weeks or months of persistent work. And if you also have an advanced natural language analysis technology, you have a chance of making a serious philological discovery (even without being a philologist). Below I describe what we managed to measure and learn, but first a few words about who does automatic processing of literary texts, how and why, and what interesting results it can produce.

Lyrical digression: Distant Reading and "computational philology"


In 2010 Google counted 130 million books in the world, adding that the figure held "at least until Sunday". Today there are certainly several million more. That in itself is not a problem: it is clear enough that reading "everything about everything" is a bad idea, unless you are a 12-year-old greedily devouring an encyclopedia a week. What is worse is that, from a certain point, even the list of books on one narrow subject or, say, in one literary movement becomes unmanageable. Victorian England alone, for example, produced more than 60000 works of fiction. Even among scholars who study the literature of that era, there is hardly anyone who has read even one percent of this collection.

One of the first to propose a possible (though debatable) solution to this problem was the controversial critic and former neo-Marxist Franco Moretti, who now heads the Stanford Literary Lab. He declared that literary scholars should "stop reading books and start counting, mapping and visualizing them". Moretti contrasts ordinary reading (close reading) with "distant reading", that is, automatic analysis of a text corpus, computing statistics, building graphs and so on. In his opinion, only in this way can literary studies become "objective" and avoid conclusions drawn from "ridiculously small" samples. The results of the Stanford Literary Lab studies carried out in the spirit of distant reading can be found here.

"Remote reading" by means of Compreno


The Stanford researchers mostly use the simplest statistics, such as word and N-gram frequencies and their distribution in the text. From the very beginning we decided to study aspects of a literary text that a simple Ctrl+F cannot pull out. Take the speech activity of characters: try counting offhand how many times Natasha Rostova (or any other character) says something. You will quickly realize that for this you need, first, to resolve pronominal anaphora automatically (for examples like "Natasha began to put on her dress. 'Coming, coming, don't go, papa,' she shouted to her father"); second, to delimit somehow the set of verbs that can express the fact of "speaking" (and they are quite varied); and third, to have at least automatic morphology, and preferably syntax as well (since word order in Russian is free, and it is not so easy to find the speech act in examples like "He never blessed his children and only, offering her his bristly, as yet unshaven cheek, said, looking her over sternly and at the same time with attentive tenderness: 'Are you well? ... Well, sit down then!'").
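To make the limitation concrete, here is a minimal sketch of the purely surface-level approach, not of anything Compreno does. The verb list, the function name `naive_speech_counts` and the English sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of speech verbs; a real inventory is far larger.
SPEECH_VERBS = {"said", "shouted", "whispered", "exclaimed", "asked", "added"}

def naive_speech_counts(text, names):
    """Count places where a character name directly adjoins a speech verb."""
    counts = Counter()
    for name in names:
        n = re.escape(name)
        for verb in SPEECH_VERBS:
            # Matches "Natasha shouted" or "shouted Natasha" and nothing more.
            pattern = rf"\b(?:{n}\s+{verb}|{verb}\s+{n})\b"
            counts[name] += len(re.findall(pattern, text))
    return counts

sample = (
    "Natasha began to put on her dress. "
    "'Coming, coming, don't go, papa,' she shouted to her father. "
    "'Really?' Natasha exclaimed."
)
counts = naive_speech_counts(sample, ["Natasha"])
print(counts["Natasha"])  # 1: the utterance introduced by "she shouted" is missed
```

The sample contains two utterances by the same character, but the one attributed through a pronoun is invisible to string matching, which is exactly the gap that anaphora resolution is meant to close.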

Fortunately, all of this is already "built into" Compreno. The syntactic-semantic trees produced by the parser contain all the necessary information about who said what and how; syntactic and lexical homonymy is already removed in them and pronominal anaphora is resolved. For example, in the fragment "'Really?' exclaimed Anna Mikhailovna. 'Oh, how awful! It is dreadful to think... This is my son,' she added, pointing to Boris. 'He wanted to thank you himself'" one has to understand who "she" is and correctly determine the semantic class of the polysemous verb "to add". Compreno copes with both tasks; the subtree for "she added, pointing to Boris" looks like this:

[Image: Compreno subtree for "she added, pointing to Boris"]

Mentions of characters and the necessary information about them are extracted from such trees by our information extraction mechanism, which we have already described here more than once from different angles. Because it relies on deep syntax and the semantic hierarchy, we can cover a large class of cases with one or two tree templates. For example, a rule that looks for a structure like this:

[Image: tree template used by the extraction rule]

will fire on examples as different as:

"And would you like to kiss me?" she whispered barely audibly, looking at him from under her brows, smiling and almost crying with agitation.
"Denisov, don't joke about this," shouted Rostov, "it is such a lofty, such a beautiful feeling, such..."
"Quieter, quieter, can't it be done more quietly?" said the sovereign, apparently suffering more than the dying soldier, and rode away.
The aunt cleared her throat, swallowed and said in French that she was very glad to see Hélène;
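The idea of a single tree template covering very different surface forms can be sketched on a toy data structure. This is not Compreno's rule language or tree format: the `Node` class, the `SPEECH_VERB` label and the slot names below are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lemma: str
    sem_class: str          # hypothetical semantic-class label
    slot: str = "Root"      # hypothetical deep-position label relative to the parent
    children: list = field(default_factory=list)

def find_speakers(node):
    """Yield (speaker, verb) for every speech predicate that has an Agent child."""
    if node.sem_class == "SPEECH_VERB":
        for child in node.children:
            if child.slot == "Agent":
                yield child.lemma, node.lemma
    for child in node.children:
        yield from find_speakers(child)

# Two of the examples above reduced to toy trees; the pronoun in the first
# example is assumed to be already resolved to its referent by the parser.
tree1 = Node("shout", "SPEECH_VERB", children=[Node("Rostov", "PERSON", "Agent")])
tree2 = Node("whisper", "SPEECH_VERB", children=[
    Node("Natasha", "PERSON", "Agent"),
    Node("barely audibly", "MANNER", "Manner"),
])
found = list(find_speakers(tree1)) + list(find_speakers(tree2))
print(found)  # [('Rostov', 'shout'), ('Natasha', 'whisper')]
```

Because the template tests semantic class and deep position rather than word order or a particular verb, the same few lines match both sentences.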


Besides speech activity, we also studied some other aspects of the behavior of Tolstoy's characters. Below I describe what we found out.

Impulsive Natasha Rostova and imperturbable Andrey Bolkonsky: what Compreno helped us understand


To begin with, we simply counted how many times each character of "War and Peace" says something and compiled a top list of the most "talkative" characters in absolute numbers. It will hardly surprise anyone familiar with the novel:

[Image: top of the most "talkative" characters, absolute counts]

Here frequency is apparently no more than an indicator of a character's "centrality".

If we normalize these numbers by the total number of mentions of each character in the text (having first removed characters that are too infrequent), the top changes considerably:

[Image: top of the most "talkative" characters, normalized by number of mentions]
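The normalization step can be sketched in a few lines. The function name, the `min_mentions` threshold and all the numbers are assumptions for illustration, not figures from our data.

```python
def talkativeness(utterances, mentions, min_mentions=50):
    """Return {character: utterances / mentions}, skipping rarely mentioned characters."""
    return {
        name: utterances.get(name, 0) / total
        for name, total in mentions.items()
        if total >= min_mentions
    }

# Made-up counts for three hypothetical characters A, B and C.
utterances = {"A": 300, "B": 90, "C": 5}
mentions = {"A": 1500, "B": 200, "C": 10}

rates = talkativeness(utterances, mentions)
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)  # ['B', 'A']: C is dropped as too infrequent, B overtakes A
```

Character A speaks most often in absolute terms, but B speaks in a larger share of the places where B is mentioned, which is the effect that reshuffles the top.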

Now Petya Rostov is at the top: an emotional and talkative child in the first volume and a young, enthusiastic, romantic teenager in the fourth (up to his death). Then come three female characters: Princess Marya, quiet, modest and worn down by her strict father, whom we get to know mainly through her conversations with other characters and her inner monologue; Natasha Rostova, the spontaneous and lively heroine whose remarks the reader hears throughout the novel (in the first volume she is 13, in the epilogue 29); and Anna Drubetskaya, an active schemer able to talk anyone into doing what she needs.

It must be said here that Tolstoy considered it important to give each character an individual manner of speech; this was part of his creative method. He even explained his famous dislike of Shakespeare ("the works of Shakespeare, recognized by the whole world as works of genius, <…> were repulsive to me") by the claim that "Shakespeare lacks the main, if not the only, means of portraying characters, 'language', that is, having each person speak in a language proper to his character". So at the next stage we tried to single out some meaningful parameters in which the speech of characters can consistently differ.

The first parameter that comes to mind is the number of exclamatory and interrogative sentences. The ratio of questions, exclamations and all other (conventionally neutral) speech already tells a great deal about a character. Let us compare the three young Rostovs, Andrey Bolkonsky and Pierre Bezukhov. The predictable champion in exclamations is the youngest of the Rostovs, Petya:

[Image: shares of questions, exclamations and neutral speech for Petya Rostov]

Natasha is older than Petya and shows a little more restraint, but she is still very emotional: only a third of her speech is "conventionally neutral":

[Image: shares of questions, exclamations and neutral speech for Natasha Rostova]

Nikolay, the elder brother of Petya and Natasha, exclaims and asks even less, yet the share of neutral speech falls short of half: like all the Rostovs, he is very emotional too:

[Image: shares of questions, exclamations and neutral speech for Nikolay Rostov]

Prince Andrey Bolkonsky is another matter: perfectly self-possessed, proud, treating high society with cold contempt and showing emotion only among those close to him (not for nothing was he played in the Oscar-winning Soviet film adaptation by the strong-willed, handsome Vyacheslav "Stierlitz" Tikhonov). Bolkonsky exclaims very little and asks few questions:

[Image: shares of questions, exclamations and neutral speech for Andrey Bolkonsky]

Pierre Bezukhov is perhaps the most reflective character of the novel. He is clearly more emotional than Andrey Bolkonsky, but not in the direction of "exclamations" like the whole Rostov family. Pierre rarely exclaims, but he asks questions almost as often as the childishly spontaneous Petya and Natasha:

[Image: shares of questions, exclamations and neutral speech for Pierre Bezukhov]
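The question/exclamation/neutral split behind the charts above can be approximated, once the utterances of a character are already collected, by looking at final punctuation. This is a simplified sketch under that assumption; the function name and the sample utterances are illustrative only.

```python
from collections import Counter

def utterance_profile(utterances):
    """Share of questions, exclamations and neutral utterances, by final punctuation."""
    kinds = Counter()
    for u in utterances:
        # Ignore trailing spaces and closing quotes before looking at the last mark.
        last = u.rstrip(" \"'")[-1:]
        if last == "?":
            kinds["question"] += 1
        elif last == "!":
            kinds["exclamation"] += 1
        else:
            kinds["neutral"] += 1
    total = sum(kinds.values()) or 1  # avoid division by zero for an empty list
    return {k: kinds[k] / total for k in ("question", "exclamation", "neutral")}

profile = utterance_profile(
    ["Really?", "Oh, how awful!", "This is my son.", "Are you well?"]
)
print(profile)  # {'question': 0.5, 'exclamation': 0.25, 'neutral': 0.25}
```

The hard part, attributing each utterance to the right speaker, is left to the parser; the profile itself is plain counting.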

Compreno also makes it easy to extract the description Tolstoy gives to the act of speaking itself, which can serve as a parameter of its own. Most often such a description is expressed by an adverbial participle phrase attached to the verb of speech ("Pierre cried, striking the table with a resolute and drunken gesture") or by a noun phrase in the instrumental case with the preposition "with" ("Prince Vasily asked, with his cheeks twitching more than before"). For example, the speech of the rich, important and self-interested Prince Vasily Kuragin is accompanied, more often than that of other characters, by adverbial participle phrases describing either his appearance ("rubbing his bald head", "straightening his jabot") or his hidden intentions, traits of character and movements of the soul ("saying things he did not even want to be believed", "angrily pulling towards himself the little table that had been moved away"). Anna Mikhailovna Drubetskaya, who is forever fawning on those from whom she needs something, often speaks "smiling" or "with a smile". With the phlegmatic, perpetually sleepy Kutuzov, speech is often accompanied by a movement of the head: he nods it or lowers it.

Sensitive Marya Bolkonskaya and the intrigues around Pierre's inheritance: the deep syntax of "War and Peace"


In our next micro-study we decided not to limit ourselves to speech activity and to consider all situations in which a character is "active" in the text. To do this, we collected statistics on the deep positions that characters occupy under different predicates. Deep positions in Compreno trees are similar to semantic roles: for example, if a character performs an active action (speaks, walks, shoots, hits), he fills the Agent position; if he is a passive object of external influence (he is scolded, carried, beaten, praised, loved), he fills the Object position; if he perceives, sees, hears, feels or, say, loves something, he becomes an Experiencer; and if he is the addressee of a message ("she told Pierre"), he fills the Addressee position. There are other positions as well (about 500 in our model in total), but here we use a few of the most common ones that can occur under a predicate.

It is important that deep positions reflect the semantic roles of the participants in the situation and do not depend directly on how the sentence is worded. Thus, in the sentences "Pierre loved Natasha" and "Natasha was loved by Pierre", Pierre is the Experiencer and Natasha the Object regardless of voice.

It turned out that statistics on deep positions reveal some differences between characters and give "objective" confirmation of the impressions a reader forms while reading the novel. Let us look at a chart showing the shares of the deep positions we selected for the main characters in the first volume of "War and Peace":

[Image: shares of deep positions for the main characters in the first volume of "War and Peace"]

Overall the distributions look similar, and, predictably enough, the most frequent position for all characters turned out to be Agent. But the spread is rather large: from 40.7% for Princess Marya and 44.6% for Boris Drubetskoy to 68.3% for Anna Drubetskaya. These three "extreme" characters are the ones of interest.
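Given a list of (character, deep position) pairs, the shares shown in such a chart reduce to simple counting. The sketch below assumes that the pairs have already been extracted from the parse trees; the function name and the toy input are hypothetical, and only the role labels come from the text above.

```python
from collections import Counter, defaultdict

def role_shares(observations):
    """observations: iterable of (character, role) pairs -> {character: {role: share}}."""
    by_char = defaultdict(Counter)
    for character, role in observations:
        by_char[character][role] += 1
    return {
        character: {role: n / sum(roles.values()) for role, n in roles.items()}
        for character, roles in by_char.items()
    }

# Invented observations for two hypothetical characters X and Y.
obs = [
    ("X", "Agent"), ("X", "Agent"), ("X", "Experiencer"), ("X", "Object"),
    ("Y", "Agent"), ("Y", "Experiencer"),
]
shares = role_shares(obs)
print(shares["X"]["Agent"])        # 0.5
print(shares["Y"]["Experiencer"])  # 0.5
```

Comparing shares rather than raw counts keeps frequently and rarely mentioned characters on the same scale.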

Princess Marya is remarkable, first of all, for an abnormally high frequency in the Experiencer position. Combined with a low share of agentive uses, this gives us a portrait of a character who feels a lot but acts little, which is entirely true of the first volume. Andrey Bolkonsky's sister lives "without leaving the place" on the Bald Hills estate together with her father, an old general of Catherine's time, pedantic and strict to the point of petty tyranny. She spends her time corresponding with her brother and her friend Julie, talking to pilgrims and studying algebra and geometry, in which the old prince instructs her. She comes into the reader's view only when other characters arrive at Bald Hills. Literary scholars believe that Tolstoy created the image of Princess Marya under the strong influence of 18th-century sentimentalism.

Anna Drubetskaya's lead in the share of agentive uses is also easily explained by the plot of the first volume. At the beginning of the novel this elderly lady of a noble family but very modest means launches a flurry of activity whose ultimate goal is the welfare and advancement of her only son Boris. She is described as "one of those women, especially mothers, who, once they have taken something into their heads, will not leave off until their wish is granted, and otherwise are prepared for daily, minute-by-minute pestering and even scenes". First Anna Mikhailovna besieges the rich and influential Prince Vasily, seeking her son's transfer to the Guards; then she successfully schemes against him over Count Bezukhov's inheritance, at the same time obtaining money from the Rostovs "to outfit Boris".

Boris has not yet become as cynical, adroit and self-seeking as his mother; that happens in the later volumes. He does not want to swallow his pride, so he resists Anna Mikhailovna's requests to be "nice", "affectionate" and "attentive" during visits to important people and takes part in her efforts with extreme reluctance, acting as a passive object. Boris's passivity shows in our chart as a large share of the Object deep position.

"Thick neck" of Natasha in your smartphone: we recover "War and peace"


Attempts to "count" literature often draw criticism to the effect that the authors are trying to measure the immeasurable and thereby vulgarize and emasculate the classic's immortal work. Interestingly, such accusations were already heard 100 years ago, when there was no trace of any distant reading. "It was believed that to study a work means to anatomize it, and for that, as we know, one must first kill the living being. We were constantly reproached with this crime," wrote Boris Eikhenbaum, one of the leading representatives of the formal method in literary studies, in 1921 (the formalists of that time were, in a way, people who came up with distant reading in theory long before the invention of the computer, without any opportunity to try it in practice).

So that we, too, would not be accused of "murdering" the novel, we decided to do the opposite and bring it to life. To this end, together with colleagues from the Higher School of Economics, we joined the development of Samsung's mobile application "Live Pages", which now uses the results of the information extraction system based on ABBYY Compreno.

The Live Pages application implements several non-standard ways of getting to know works of fiction and their characters: timelines of events and fates, character cards and "quote collections", and interactive maps that tie places to episodes of the novel.

[Images: screenshots of the Live Pages application]

All this relies on infographics, is done in a game-like style and, it seems to us, has a better chance of hooking a gadget-addicted tenth-grader with ADD than the thick volume the school librarian would hand over.

[Images: screenshots of the Live Pages application]

Besides the characters' speech for the quote collection, Compreno was used to extract dates for the timelines, locations for the maps, and epithets, the various characteristics that Tolstoy so liked to give his characters. Everyone remembers, of course, the little moustache of the little princess, Bolkonsky's wife, but how many have noticed that the brilliant and handsome Andrey had "small plump hands" (combined with short stature), and that graceful, slender Natasha Rostova has "a thick neck" and "a big mouth"?

[Images: screenshots of the Live Pages application]

Anyone can download the application and make many discoveries of the same kind. Meanwhile, we will return to our studies and continue to "anatomize" texts with Compreno, look for new unexpected things in them and uncover the mysterious "Tolstoy code" that made his works immortal.

This article is a translation of the original post at habrahabr.ru/post/273301/