Developers Club geek daily blog

2 years, 10 months ago
Exploring the graph-oriented DBMS Neo4j using the lexical Wordnet database

Neo4j is a NoSQL DBMS designed for storing graphs. The highlight of the product is Cypher, its declarative query language.

Cypher borrowed keywords such as WHERE and ORDER BY from SQL, and syntax from languages as different as Python, Haskell, and SPARQL. The result is a language that lets you write graph queries in a visual form resembling ASCII art. For example, I would draw the title of this article as the graph (Neo4j)-[we study]->(Wordnet). And that is almost a ready-made database query!



To study a graph-oriented database, we need some graph. It could be a social network, a Wikipedia dump, or a railway map. We will take the simple route and use the huge public graph of the lexical Wordnet database. Linguists at Princeton did an enormous amount of work systematizing the vocabulary of English, and enthusiasts have translated the database into many languages, including Russian. For example, the database contains over 80 thousand nouns connected to one another by lexical relations such as "synonym", "part of", "material for", etc. This database is a natural graph, and we will import it into Neo4j.

Neo4j installation


The installation process for various operating systems is described on the website. Everything described here is platform-independent, but for definiteness all instructions are given for Debian/Ubuntu.

1. Add the repository


wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
echo 'deb http://debian.neo4j.org/repo stable/' >/tmp/neo4j.list
sudo mv /tmp/neo4j.list /etc/apt/sources.list.d
sudo apt-get update

2. Install Neo4j (community edition)


sudo apt-get install neo4j

This command installs the software into its own home directory and starts a service that runs as the neo4j user.

3. Allow remote access


If you installed Neo4j on your own computer, skip this step. If the server must be reachable from other computers on the local network, edit the file /var/lib/neo4j/conf/neo4j-server.properties

To allow access from any computer on the local network, set these parameters:

org.neo4j.server.webserver.address=0.0.0.0
dbms.security.auth_enabled=false

Port 7474 is used by default. You can change the port by adding a line to the same file:

org.neo4j.server.webserver.port=7474

Note that we have not configured DBMS security! For details, read the documentation.

You can check the installation by entering the server's address and port in a browser. Neo4j provides an excellent graphical console in the browser. The same port serves REST requests to the database from the client software, which we will install in the next step.
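To make the REST remark concrete, here is a minimal sketch of how a client could wrap a Cypher statement for the server. It assumes the transactional HTTP endpoint /db/data/transaction/commit of the Neo4j 2.x series used in this article (later major versions moved it), and build_cypher_request is a hypothetical helper name, not part of py2neo or of my script. The request is only built, not sent.

```python
import json
import urllib.request


def build_cypher_request(base_url, statement, parameters=None):
    """Build (but do not send) a POST request carrying one Cypher statement."""
    payload = {
        "statements": [
            {"statement": statement, "parameters": parameters or {}}
        ]
    }
    return urllib.request.Request(
        # Assumption: transactional endpoint path of Neo4j 2.x.
        url=base_url.rstrip("/") + "/db/data/transaction/commit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json; charset=UTF-8"},
        method="POST",
    )


request = build_cypher_request("http://127.0.0.1:7474",
                               "MATCH (n) RETURN count(n)")
print(request.full_url)

# To run it against a live server, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

In practice a client library such as py2neo hides these details, so this is only meant to show what travels over port 7474.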

Installing the client (Python)


To import the Wordnet database into Neo4j, we will use a Python script.

1. First, install the py2neo library


pip install py2neo

2. Download my script from GitHub


mkdir habrawordnet2neo4j
cd habrawordnet2neo4j
git clone https://github.com/sergey-zarealye-com/wordnet2neo4j.git

The script hardly claims to be production-quality code, but if you want to experiment with Neo4j from Python, look through the code: it will help you start programming faster.

Obtaining the lexical Wordnet database


The Download page of the Wordnet project offers the database bundled with software for browsing it. But we want to use Neo4j for browsing! So it is enough to download only the data files:

  • The latest version of the English Wordnet database is available at the link
  • Earlier versions (for example, for compatibility with ImageNet)
  • I suggest downloading the Russian version from the website wordnet.ru

Extract the files to a convenient location.

Importing data into Neo4j


Lexical data in Wordnet is stored in files by part of speech. For example, nouns are in the data.noun file and verbs are in data.verb; I have not tried the other parts of speech.
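To give a feel for what these files contain, here is a rough sketch of parsing one record. It assumes the standard Wordnet data-file layout (synset offset, file number, part of speech, hexadecimal word count, word/lex_id pairs, decimal pointer count, four fields per pointer, then the gloss after "|"). The function names and the sample record are made up for illustration, and this is not the code of my import script.

```python
def is_record(line):
    """License header lines in Wordnet data files start with two spaces."""
    return bool(line.strip()) and not line.startswith("  ")


def parse_data_line(line):
    """Parse one synset record from a Wordnet data.* file."""
    fields, _, gloss = line.partition("|")
    tokens = fields.split()
    offset, ss_type = tokens[0], tokens[2]      # tokens[1] is the file number
    word_count = int(tokens[3], 16)             # w_cnt is hexadecimal
    pos = 4
    words = []
    for _ in range(word_count):
        words.append(tokens[pos])               # tokens[pos + 1] is lex_id
        pos += 2
    pointer_count = int(tokens[pos])            # p_cnt is decimal
    pos += 1
    pointers = []
    for _ in range(pointer_count):
        symbol, target_offset, target_pos, _source_target = tokens[pos:pos + 4]
        pointers.append((symbol, target_offset, target_pos))
        pos += 4
    return {"offset": offset, "pos": ss_type, "words": words,
            "pointers": pointers, "gloss": gloss.strip()}


# Made-up record in the data-file layout; the offsets are not real.
sample = ("00000002 05 n 02 lion 0 king_of_beasts 0 002 "
          "@ 00000001 n 0000 #p 00000003 n 0000 | large cat")
record = parse_data_line(sample)
print(record["words"], record["pointers"])
```

Each word of a record is a natural candidate for a node, and each pointer for an edge between two records.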

1. Importing nouns


To import nouns, change to the directory that contains my scripts (we named it simply habrawordnet2neo4j) and run this command in the console:

python wordnet2neo4j.py -i rwn3/data.noun --neo4j http://127.0.0.1:7474 --nodelabel Ruswordnet --reltype Pointer --encoding cp1251 --limit 1000

Let's look at the parameters in more detail.

-i		Path to the Wordnet data file
--neo4j		URL of the Neo4j database server
--nodelabel	Label of the nodes that correspond to Wordnet words
		in the created graph (in Neo4j, graph nodes carry
		text labels; this is just an identifier)
--reltype	Type of the graph edges that correspond to Wordnet
		pointers (in Neo4j, graph edges can have a type; this is
		just an identifier)
--encoding	Encoding of the data file; the Russian database is stored
		in cp1251; for English files this parameter
		is not needed
--limit		Maximum number of file lines to process; my script
		runs rather slowly, so for a trial run you can limit
		the amount of imported data, for example to the first
		1000 lines of the file; to import the whole file,
		omit this parameter and be prepared to wait
		an hour to an hour and a half.
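For readers who want to build a similar tool, the options above could be declared with argparse roughly as follows. This is a hypothetical reconstruction from the parameter list, not the code of wordnet2neo4j.py; in particular the default values are my assumptions for the sketch.

```python
import argparse


def build_parser():
    """Declare the command-line options described in the list above."""
    parser = argparse.ArgumentParser(
        description="Import a Wordnet data file into Neo4j")
    parser.add_argument("-i", dest="input", required=True,
                        help="path to the Wordnet data file")
    parser.add_argument("--neo4j", required=True,
                        help="URL of the Neo4j database server")
    # The defaults below are assumptions made for this sketch.
    parser.add_argument("--nodelabel", default="Wordnet",
                        help="label for the created nodes")
    parser.add_argument("--reltype", default="Pointer",
                        help="type for the created edges")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the data file")
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum number of lines to process")
    return parser


command = ("-i rwn3/data.noun --neo4j http://127.0.0.1:7474 "
           "--nodelabel Ruswordnet --reltype Pointer "
           "--encoding cp1251 --limit 1000")
args = build_parser().parse_args(command.split())
print(args.input, args.encoding, args.limit)
```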


2. Importing verbs


To import verbs, run this command in the console:

python wordnet2neo4j.py -i rwn3/data.verb --neo4j http://127.0.0.1:7474 --nodelabel Ruswordnet --reltype Pointer --encoding cp1251 --limit 1000

Importing verbs is optional, although some of them are linked to nouns, and that is interesting to explore.

3. Make sure the data has been imported


To do this, open the Neo4j console in a browser (enter the address and port of the DBMS server) and enter the following query:

MATCH (node)-[relation]-() RETURN node, relation LIMIT 100

If a picture of the graph appears on the screen, everything went well.

Running simple queries


We will perform all further operations in the browser, in the Neo4j console. I will assume that you used Ruswordnet as the node label and Pointer as the edge type (as shown in the previous section), and that you imported the Russian Wordnet database in full.

1. Hello World


According to the website of the Russian Wordnet database, about half of the units of meaning (synsets) have been translated, namely those containing the most commonly used words. So let's try to find in the database the first thing that comes to mind:

MATCH (n:Ruswordnet {name: "выкапывание_трупа"}) RETURN n

Run the query and check that this concept (literally "digging up a corpse") is found; so, according to the Russian linguists, it is among the most commonly used. Let's break down this simple query.

The MATCH keyword means roughly the same as SELECT in SQL. Roughly speaking: "find the elements of the graph that match the pattern".

Parentheses denote graph nodes. The pattern (n:Ruswordnet) would mean that we want to find all nodes labeled "Ruswordnet". Here n is an identifier or, if you like, a variable.


Graph nodes (and edges too) can carry arbitrary attributes. To find a specific node, the query sets a condition on the attributes in a JSON-like format: {name: "выкапывание_трупа"}. Thus, the phrase

MATCH (n:Ruswordnet {name: "выкапывание_трупа"})

means that, out of the whole graph, all nodes with the Ruswordnet label and a name attribute equal to the given concept will be selected.
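As a rough analogy in plain Python, such a pattern behaves like a filter over all nodes. The toy data and the match_nodes helper below are invented for illustration and say nothing about how Neo4j actually executes the query.

```python
# Toy in-memory nodes; the ids and the "Other" label are invented.
nodes = [
    {"id": 1, "labels": {"Ruswordnet"}, "name": "лев"},
    {"id": 2, "labels": {"Ruswordnet"}, "name": "лев"},
    {"id": 3, "labels": {"Ruswordnet"}, "name": "овца"},
    {"id": 4, "labels": {"Other"}, "name": "лев"},
]


def match_nodes(nodes, label, **properties):
    """Return every node that has the label and all given property values."""
    return [
        node for node in nodes
        if label in node["labels"]
        and all(node.get(key) == value for key, value in properties.items())
    ]


# Analogue of: MATCH (n:Ruswordnet {name: "лев"})
n = match_nodes(nodes, "Ruswordnet", name="лев")
print([node["id"] for node in n])  # [1, 2]
```

Note that the result is a list: several nodes can share the same label and the same name.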

The RETURN keyword says which variables we are interested in. In this case we just wanted to see the node (or nodes) matching the given conditions, so we write RETURN n. It is important to understand that n is a collection of the nodes that satisfy the query. To see this, just replace the concept in the query:

MATCH (n:Ruswordnet {name: "лев"}) RETURN n

If you imported the Wordnet database in full, you will see six nodes for the concept "лев" (lion). Let's figure out why.

2. Variables are collections


Let's run this query:

match (n:Ruswordnet {name: "лев"})--(m) return n,m

Here we specify a more complex search pattern. We want to find all nodes (n) corresponding to the concept "лев" and also all nodes (m) connected to them. A connection, i.e. an edge of the graph, is denoted by two hyphens. The direction we are interested in can be given explicitly with the -> symbol (this is what I called ASCII art).


If the names of the units of meaning are not displayed, click the Ruswordnet (23) button in the upper left corner of the graph, and in the status bar at the bottom of the console select "name" in the Caption field. The picture becomes much clearer that way.

Now we see that "лев" turns out to be not only the Bulgarian currency (bulgarian_money), whose small coin is the stotinka, but also a big cat, a constellation, an astrological sign, and something connected with pride.
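The difference between the undirected pattern (n)--(m) and the directed pattern (n)-->(m) can be sketched on a toy edge list. The words and the neighbors helper are made up for illustration; they are not taken from the real database.

```python
# Toy directed edges (source, target); the words are illustrative only.
edges = [
    ("lion", "big_cat"),
    ("lion", "zodiac"),
    ("pride", "lion"),
]


def neighbors(edges, node, directed=False):
    """Nodes matched by (node)--(m); with directed=True, by (node)-->(m)."""
    found = [target for source, target in edges if source == node]
    if not directed:
        # An undirected pattern also follows edges that point at the node.
        found += [source for source, target in edges if target == node]
    return found


print(neighbors(edges, "lion"))                 # ['big_cat', 'zodiac', 'pride']
print(neighbors(edges, "lion", directed=True))  # ['big_cat', 'zodiac']
```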

3. We connect edges


In the Wordnet database, edges are called pointers (Pointer), and a large number of linguistic pointer types are used. They are denoted by symbols, some of which are listed in the table:

Symbol  English name                 Linguistic relation
!       Antonym                      Antonym
@       Hypernym                     Generalization
@i      Instance Hypernym            Generalization (of an instance)
~       Hyponym                      Refinement
~i      Instance Hyponym             Refinement (an instance)
#m      Member holonym               Group that includes this concept as a member
#s      Substance holonym            Object that is made of this substance
#p      Part holonym                 Object that includes this object as a part
%m      Member meronym               Member of this group
%s      Substance meronym            Substance of which the object consists
%p      Part meronym                 Part of the object
=       Attribute                    Attribute
+       Derivationally related form  Derived form

During import we gave the edges of the graph the pointer_symbol attribute, so now we can write queries that take edge attributes into account. Let's see what a generalization (hypernym) is:

MATCH (n:Ruswordnet {name: "лев"})-[p:Pointer {pointer_symbol: "@"}]->(m) 
RETURN n,m

Square brackets hold the specification of edges. In this query we want to find edges of type Pointer whose pointer_symbol attribute equals "@", i.e. the generalization symbol. By the way, the refinement symbol, the opposite of generalization, is "~".


Now it is clear that the generalization of a lion is a cat, and also a person. Of course, these are different units of meaning: the lion (cat) is one node of the graph, and the lion (person) is another node, corresponding to the zodiac sign. "Lion (popularity)" is the result of a poor translation into Russian; what is meant is lion (celebrity), i.e. a celebrity, a social lion.

Let's see what a part holonym is:

MATCH (n:Ruswordnet {name: "лев"})-[p:Pointer {pointer_symbol: "#p"}]->(m) 
RETURN n,m

Ah, now it is clear: the lion is a component of the zodiac, so the zodiac is the part holonym of the lion.
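Filtering edges by the pointer_symbol attribute, and the fact that pointer symbols come in opposite pairs ("@" and "~", "#p" and "%p", and so on), can be sketched on toy data like this. The edge list and the helper names are invented for illustration, and the sketch does not check whether the imported graph actually stores both directions.

```python
# Opposite pairs of pointer symbols from the table above.
INVERSE = {
    "@": "~", "~": "@", "@i": "~i", "~i": "@i",
    "#m": "%m", "%m": "#m", "#s": "%s", "%s": "#s",
    "#p": "%p", "%p": "#p",
}

# Toy edges (source, pointer_symbol, target); illustrative only.
edges = [("lion", "@", "big_cat"), ("lion", "#p", "zodiac")]


def follow(edges, node, symbol):
    """Targets of edges that leave `node` and carry the given symbol."""
    return [target for source, sym, target in edges
            if source == node and sym == symbol]


def with_inverses(edges):
    """Add the reverse edge for every pointer that has an opposite symbol."""
    reverse = [(target, INVERSE[sym], source)
               for source, sym, target in edges if sym in INVERSE]
    return edges + reverse


print(follow(edges, "lion", "@"))                   # ['big_cat']
print(follow(with_inverses(edges), "zodiac", "%p"))  # ['lion']
```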

The table shows that Wordnet contains many interesting relations, for example, which substances things are made of. Unfortunately, it has no information that a lion is made of meat, so we will pose the question differently: find nodes of the graph that are connected by the relation "what substance it is made of".

MATCH (n)-[p:Pointer {pointer_symbol: "#s"}]->(m) 
RETURN n,m LIMIT 10

In this query we impose no conditions on the nodes (n) and (m). We only require that they be connected by edges whose pointer_symbol attribute is "#s". Note the LIMIT keyword, familiar from SQL. Without it the server would return a great many results, and our browser would struggle.

From the query results we learned that cigarettes consist of marijuana, and oxtail soup consists of oxtails.
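The effect of LIMIT can be imitated in Python by taking only the first few items from a lazy stream of matches. The edge data below is generated and purely illustrative, and matching_pairs is a hypothetical helper.

```python
from itertools import islice


def matching_pairs(edges, symbol):
    """Lazily yield (source, target) for every edge with the given symbol."""
    for source, sym, target in edges:
        if sym == symbol:
            yield source, target


# Toy data: 1000 made-up "#s" edges standing in for a large result set.
edges = [("substance_%d" % i, "#s", "object_%d" % i) for i in range(1000)]

# Analogue of LIMIT 10: stop after the first ten matches.
first_ten = list(islice(matching_pairs(edges, "#s"), 10))
print(len(first_ten))  # 10
```

Because the generator is lazy, the remaining 990 matches are never produced, which is the point of limiting a query that could return a very large result.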

4. Chains of arbitrary length


As children we all played this game: turn a fly into an elephant. You had to change one letter at a time until the word "муха" (fly) became the word "слон" (elephant). Let's find out from the lexical graph whether "лев" (lion) and "овца" (sheep) are connected to each other.

MATCH (n:Ruswordnet {name: "лев"})-[p:Pointer*1..3]-(m:Ruswordnet {name: "овца"})
RETURN n,m,p

The construct [p:Pointer*1..3] says that we need to find a chain of edges of type Pointer, from one to three edges long, connecting the lion node with the sheep node.
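The idea of a bounded chain can be sketched as a depth-limited search over an undirected toy graph. This is a simplification: it forbids revisiting nodes, whereas Cypher forbids reusing edges, and both the edge list and the paths_between helper are invented for illustration.

```python
def paths_between(edges, start, goal, min_len=1, max_len=3):
    """All simple paths from start to goal using min_len..max_len edges."""
    adjacency = {}
    for source, target in edges:
        # Treat every edge as undirected, like the pattern -[...]-.
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, []).append(source)

    results = []

    def extend(path):
        hops = len(path) - 1
        if path[-1] == goal and hops >= min_len:
            results.append(list(path))
            return
        if hops == max_len:
            return
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in path:
                path.append(neighbor)
                extend(path)
                path.pop()

    extend([start])
    return results


# Toy edges; illustrative only.
edges = [("sheep", "simpleton"), ("simpleton", "person"),
         ("person", "lion"), ("lion", "big_cat")]

print(paths_between(edges, "lion", "sheep"))
# [['lion', 'person', 'simpleton', 'sheep']]
print(paths_between(edges, "lion", "sheep", max_len=2))  # []
```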


This differs from the classic children's game, but it is interesting too: sheep -> simpleton -> person -> lion... it has a proud ring to it. By the way, you can also try to find a connection between the fly and the elephant; you only need to raise the maximum chain length a little. I used the value 6. Do not try to put 100 right away, though: the search will most likely fail, because the number of candidate paths in the graph becomes too large. So, here is how the elephant and the fly are lexically connected:

[Image: the chain of nodes connecting the elephant and the fly]
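A back-of-the-envelope estimate shows why a limit of 100 is hopeless. Assume, purely for illustration, that every node has 5 neighbors (the real degree in Wordnet varies and is not measured here); then the number of candidate chains grows as a sum of powers of 5.

```python
def candidate_paths(degree, max_len):
    """Upper bound on chains of length 1..max_len with `degree` neighbors per node."""
    return sum(degree ** length for length in range(1, max_len + 1))


print(candidate_paths(5, 3))    # 155
print(candidate_paths(5, 6))    # 19530
print(candidate_paths(5, 100))  # a number with dozens of digits
```

Going from 3 to 6 multiplies the work by more than a hundred, and at 100 the bound is far beyond anything a server can enumerate.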

I think that by now you have understood a lot about the Neo4j database, can discover many interesting things in the Wordnet database on your own, and can apply Neo4j in your projects. We use Neo4j together with Wordnet in a search system for film archives. If you want to do research in machine learning, I invite you to an internship or a permanent job at NIKFI, the research institute for film and photography.

This article is a translation of the original post at habrahabr.ru/post/273241/