Some modern approaches in the field of natural language processing

1 year ago
Research results from recent years in speech recognition [1], machine translation [2], sentence sentiment analysis [3], and part-of-speech tagging [4] have shown the promise of deep learning neural network algorithms compared with classical methods of natural language processing (NLP). However, many problems in question answering and dialogue systems remain unsolved [5, 6]. This article gives an overview of results from applying modern algorithms to natural language processing and understanding tasks. It describes several different approaches and does not claim to be a complete survey of the research.

Human: how many legs does a cat have?
Machine: four, i think.
Human: What do you think about messi?
Machine: he ’s a great player.
Human: where are you now?
Machine: i ’m in the middle of nowhere.

(From the article A Neural Conversational Model. Lead image from the movie Ex Machina.)

War, Peace, and ABBYY Compreno: our affair with Tolstoy continues

1 year ago
We recently wrote here about how the All Tolstoy in One Click project was carried out. With the help of 3249 (three thousand two hundred forty-nine) volunteers and 1 (one) good OCR technology, we digitized 46820 pages of the writer's 90-volume collected works, carefully proofread them, and made them publicly available.

But if you thought our "affair with Tolstoy" ended there, you were mistaken: having digitized the writer's texts, we began to study them with the ABBYY Compreno information extraction technology, so that such rich material would not go to waste. Read on to learn what "Tolstoy text mining" gave us and where the results are used now.


The main goal of the All Tolstoy in One Click project was to make Tolstoy's work truly public property, so that every text that came from his pen would be available in one click from anywhere on Earth. That, by the way, is what the author himself wished: during his lifetime he renounced all rights to his texts (yes, anonymous, Lev Tolstoy knew about copyleft and open data long before this Internet of yours and Richard Stallman).

However, the ability to load a book in a convenient format onto an e-reader or tablet is not the only benefit of digitization. Tolstoy's texts can now be not only read but also "measured", that is, studied with various quantitative methods using the whole arsenal of automatic text processing (AOT, that is, NLP). If you have all of a writer's texts in electronic form, even one or two well-crafted search queries can yield curious data that in earlier times might have cost a literary scholar weeks or months of persistent work. And if you also have an advanced natural language analysis technology, you have a chance to make a serious philological discovery (even without being a philologist). Below I will describe what we managed to measure and learn, but first a few words about who works on automatic processing of literary texts, how and why, and what interesting things can come of it.

"Book of problems" for ABBYY Compreno

1 year, 1 month ago
Hi! Last time we described how ABBYY Compreno, our technology for understanding and analyzing natural language texts, works. Many people ask us how long a technology can stay in development and where, at last, the products based on Compreno are. As promised, today's post is about the products and the business problems they already solve.

Our technology can serve as the basis for a number of solutions for different types of tasks. But our focus is the corporate market: companies that need to extract meaningful information from large volumes of data quickly. This direction is promising for us both in terms of client demand for such technologies and in terms of the fastest return on our investment in the technology.

We should note right away that solutions based on the Compreno technology are application or technology modules that are built into other solutions, adding features to them.

Practical aspects of automatic generation of unique texts for SEO

1 year, 1 month ago
The scariest horror story for anyone who wants to post computer-written content on a website is search engine sanctions. We too were once frightened that a site with non-unique and/or generated texts would be indexed poorly or banned altogether. At the same time, nobody could tell us the exact requirements for texts. In general, the topic of unique content and its role in website promotion resembles occult knowledge. Every new "specialist" promises to reveal the terrible truth on his page, but the truth is never revealed, and many forum discussions boil down to the claim that, say, Yandex will recognize generated content by magic. Those are not the exact words, but that is the gist.

Since customers recently asked us to create product descriptions for a website, we decided to study this question in more detail. What algorithms exist for detecting automatically written texts, what properties must a text have so as not to be recognized as web spam, and what tools can generate it?

An intensive course in German: how ABBYY Compreno learns modern languages

1 year, 1 month ago
As you know, ABBYY is developing Compreno, a technology for analyzing natural languages. The system currently works with English and Russian and is actively used in many projects. However, the technology was conceived as multilingual from the start, so we also pay a lot of attention to "teaching" it other languages. An analogy with people applies here: after learning one foreign language, the others come more easily. In particular, we are now adding German to the technology and, in parallel, exploring the market to see whether there is interest in this direction. We should note right away that there is no talk yet of products supporting German; we are at the very beginning of the road.

The Google TensorFlow machine learning library: first impressions and comparison with our own implementation

1 year, 2 months ago
Quite recently Google made its machine learning library, TensorFlow, available to everyone. What made it especially interesting for us is that it includes the most modern neural network models for text processing, in particular sequence-to-sequence learning. Since we have several projects connected with this technology, we decided that this was an excellent opportunity to stop reinventing the wheel (it is probably about time) and quickly improve our results. Picturing the happy faces of our clients, we got to work. And here is what came of it…

Natural language concepts versus formal classifications in OpenStreetMap

1 year, 2 months ago
Anyone even slightly familiar with the OpenStreetMap project has probably heard of a couple of the principles underlying it: "any tags you like", and the fact that what comes first in this project is filling the cartographic database, not how the Standard style displays the contents of that database. But is everything really so good and rosy with the semantic structure of this database, given the first principle? Reading the Russian-language section of the OSM forum, I decided to look into the situation and describe it here.

The information extraction algorithm in ABBYY Compreno. Part 2

1 year, 2 months ago
Hello again!

I hope you enjoyed yesterday's post about the ABBYY Compreno information extraction system, in which we described the system architecture, the semantic-syntactic parser and its role and, most importantly, information objects.

In the second part of the article you will learn how the information extraction engine works.

The information extraction algorithm in ABBYY Compreno. Part 1

1 year, 3 months ago
Hi, Habr!

My name is Ilya Bulgakov, and I am a programmer in the information extraction department at ABBYY. In a series of two posts I will tell you our main secret: how the information extraction technology in ABBYY Compreno works.

Earlier, my colleague Danya Skorinkin (DSkorinkin) described the system from an ontology engineer's point of view, touching on the following subjects:

This time we will go deeper into the ABBYY Compreno technology and talk about the system architecture as a whole, the basic principles of its operation, and the information extraction algorithm!

How many tweets does it take to learn your character?

1 year, 4 months ago
The extensive growth in the amount of unstructured data (tweets, posts, comments, photos and video) generated by humanity is both a fantastic opportunity and a headache for many old and new industries.

The other day we already reported figures on the number of messages humanity produces per day. It is clear that billions of utterances demand entirely different solutions and technologies. "Old" approaches (horror: only 3-5 years have passed, and they are already old) and the people developing them are fighting for a place in the sun. But…


As a classic example, here is a translation of recent material from the IBM Watson division:

