In this article I'll briefly introduce such networks and show a couple of great tools for home experiments that let anyone, even a school student, build neural networks of arbitrary complexity in a few lines of code. Welcome under the cut.
What is an RNN?
The main difference between recurrent neural networks (RNNs) and traditional ones lies in how the network operates: each neuron interacts with itself. The input to such a network is usually a sequence. Each element of the sequence is fed in turn to the same neurons, which pass their prediction back to themselves together with the next element, until the sequence ends. Such networks are typically used for sequential data, mostly texts and audio/video signals. Units of a recurrent network are drawn as ordinary neurons with an extra looping arrow, indicating that besides the input signal the neuron also uses an additional hidden state. If you "unroll" this picture, you get a whole chain of identical neurons, each of which receives one element of the sequence, produces a prediction and passes it down the chain like a kind of memory cell. Keep in mind that this is an abstraction: it is the same neuron firing several times in a row.
This architecture allows a neural network to solve problems such as predicting the last word of a sentence, for example the word "sun" in the phrase "the sun shines in the clear sky".
Modeling memory in a neural network this way introduces a new dimension into the description of its operation: time. Suppose the network receives a sequence of data as input, for example a text word by word, or a word letter by letter. Then each successive element of the sequence arrives at the neuron at a new, conditional moment in time. By that moment the neuron has already accumulated experience from everything received since the input began. In the example with the sun, x0 would be a vector representing the preposition "in", x1 the word "sky", and so on. As a result, ht should come out as a vector close to the word "sun".
The main difference between types of recurrent units lies in how the memory cell inside them is processed. The traditional approach adds two vectors (the signal and the memory) and then computes an activation of the sum, for example a hyperbolic tangent. The result is an ordinary network with one hidden layer. A scheme like this is drawn as follows:
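A minimal sketch of this classic recurrent cell in plain numpy (all names and sizes here are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8          # illustrative sizes

# weights of a single vanilla recurrent cell
W_x = rng.normal(0, 0.1, (n_hidden, n_in))     # input -> hidden
W_h = rng.normal(0, 0.1, (n_hidden, n_hidden)) # hidden -> hidden (the loop arrow)
b = np.zeros(n_hidden)

def rnn_step(x, h_prev):
    """One timestep: add the new signal to the memory and squash with tanh."""
    return np.tanh(W_x @ x + W_h @ h_prev + b)

# run the same cell over a toy sequence, carrying the hidden state along
h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):
    h = rnn_step(x, h)
```

The same `rnn_step` is applied at every timestep; only the hidden state `h` changes, which is exactly the "unrolled chain of identical neurons" from the description above.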
But memory implemented this way turns out to be very short. Because the information in memory is mixed with the information in each new signal, after 5-7 iterations it is completely overwritten. Returning to the task of predicting the last word of a sentence: within a single sentence such a network works reasonably well, but on a longer text the patterns at its beginning no longer contribute to the network's decisions near the end, just as, during training, the error on the first elements of the sequences stops contributing to the network's overall error. This is a very loose description of the phenomenon; in fact it is a fundamental problem of neural networks called the vanishing gradient problem, and it is what brought on the third "winter" of deep learning at the end of the 20th century, when for a decade and a half neural networks ceded the lead to support vector machines and boosting algorithms.
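The vanishing gradient is easy to reproduce numerically: backpropagating through many tanh steps multiplies the gradient by Jacobians whose entries are well below 1, so its norm shrinks geometrically. A toy illustration (all sizes and weights are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
W = rng.normal(0, 0.1, (n, n))   # small recurrent weights
h = rng.normal(size=n)
grad = np.ones(n)                # gradient arriving from the loss

norms = []
for _ in range(50):              # backprop through 50 timesteps
    h_new = np.tanh(W @ h)
    # Jacobian of tanh(W h) w.r.t. h is diag(1 - tanh^2) @ W
    J = (1.0 - h_new**2)[:, None] * W
    grad = J.T @ grad
    norms.append(np.linalg.norm(grad))
    h = h_new

# norms[-1] is many orders of magnitude below norms[0]:
# the first elements of a long sequence stop influencing training
```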
To overcome this shortcoming, the LSTM-RNN (Long Short-Term Memory Recurrent Neural Network) was invented, adding internal transformations that handle memory more carefully. Here is its scheme:
Let's walk through each of the layers in more detail:
- The first layer computes how much of the previous information should be forgotten at this step; in effect, per-component multipliers for the memory vector.
- The second layer computes how much of the new information arriving with the signal is interesting; the same kind of multiplier, but for the observation.
- The third layer computes a linear combination of memory and observation, with the freshly computed weights for each component. This yields the new memory state, which is passed further down the chain as is.
- It remains to compute the output. Since part of the input signal is already in memory, there is no need to compute the activation over the whole signal. First the signal passes through a sigmoid that decides which part of it matters for further decisions, then a hyperbolic tangent squashes the memory vector onto the interval from -1 to 1, and finally these two vectors are multiplied.
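The four steps above can be sketched as a single LSTM forward step in plain numpy (the weight names and sizes are illustrative; real implementations fuse the four matrices into one):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
# one weight matrix per gate, each acting on [h_prev, x] concatenated
Wf, Wi, Wc, Wo = (rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for _ in range(4))
b = np.zeros(n_hid)

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + b)          # 1) how much of the old memory to forget
    i = sigmoid(Wi @ z + b)          # 2) how much of the new signal to admit
    c_tilde = np.tanh(Wc @ z + b)    #    candidate memory from the signal
    c = f * c_prev + i * c_tilde     # 3) blend old memory with the candidate
    o = sigmoid(Wo @ z + b)          # 4) which part of memory to expose
    h = o * np.tanh(c)               #    squash memory to (-1, 1), then gate it
    return h, c

h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):  # a toy sequence of 5 vectors
    h, c = lstm_step(x, h, c)
```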
The ht and Ct obtained this way are passed further along the chain. There are, of course, many variations of which activation functions each layer uses, slight modifications of the scheme, and so on, but the essence stays the same: first forget part of the memory, then memorize part of the new signal, and only then compute the result from this data. I took the pictures from here; there you can also find several examples of more complex LSTM schemes.
I won't go into detail here about how such networks are trained; I'll only say that the training uses the BPTT algorithm (Backpropagation Through Time), a generalization of the standard algorithm to the case where the network has a time dimension. You can read about this algorithm here or here.
Using LSTM-RNNs
Recurrent neural networks built on these principles are very popular; here are a few examples of such projects:
There are also successful examples of using LSTM networks as one of the layers in hybrid systems. Here is an example of a hybrid network that answers questions about a picture, of the sort "how many books are shown?":
Here the LSTM works together with an image recognition module. A comparison of different hybrid architectures for solving this task is available here.
Theano and keras
There are quite a few very powerful Python libraries for building neural networks. Without aiming to give any comprehensive overview of them, I'd like to introduce you to the Theano library. Generally speaking, out of the box it is a very efficient toolkit for working with multidimensional tensors and graphs. Implementations of most algebraic operations over them are available, including finding extrema of tensor functions, computing derivatives, and more. And all of this can be parallelized and executed efficiently using CUDA on graphics cards.
Sounds great, if not for the fact that Theano itself generates and compiles C++ code. Maybe it's my prejudice, but I am deeply distrustful of systems of this kind: as a rule they are riddled with an improbable number of bugs that are very hard to find, which is probably why for a long time I didn't give this library the attention it deserves. But Theano was developed at the Canadian institute MILA under the direction of Yoshua Bengio, one of the most renowned deep learning specialists of our time, and in my admittedly short experience with it I haven't found a single error.
Nevertheless, Theano is only a library for efficient computation; backpropagation, neurons and everything else have to be implemented on top of it yourself. For example, here is the code, using only Theano, for the same LSTM network I described above, and it runs to about 650 lines, which hardly lives up to the title of this article. I might never have tried working with Theano at all, if not for the amazing keras library. Being essentially just sugar over the Theano interface, it solves exactly the problem stated in the title.
At the core of any keras code is the model object, which describes which layers your network consists of and in what order. For example, the model we used to estimate the sentiment of tweets about Star Wars took a sequence of words as input, so its type was
model = Sequential()
After the model is declared, layers are added to it one by one; for example, an LSTM layer can be added with a command like this:
model.add(LSTM(64))
After all the layers have been added, the model needs to be compiled, optionally specifying the loss function, the optimization algorithm and a few more settings:
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")
Compilation takes a couple of minutes; after that the model has the self-explanatory methods fit(), predict(), predict_proba() and evaluate(). It really is that simple; in my opinion it's the ideal way to start diving into the depths of deep learning. When the capabilities of keras are no longer enough and you want, for example, your own loss functions, you can drop one level down and write part of the code in Theano. By the way, if programs that generate other programs scare you too, you can plug in Google's fresh TensorFlow as the keras backend instead, but for now it works considerably slower.
Sentiment analysis of tweets
Let's return to our original task: determining whether the Russian audience liked Star Wars or not. I used the unpretentious TwitterSearch library as a convenient tool for iterating over Twitter search results. Like all open APIs of large systems, Twitter's has certain restrictions. The library allows registering a callback after each request, which makes it very convenient to insert pauses. In this way about 50,000 tweets in Russian were pulled for the following hashtags:
- #star #wars
- #звездные #войны
- #пробуждение #силы
While they were downloading, I went looking for a training set. There are several annotated corpora of tweets in English freely available, the largest of them being the Stanford sentiment140 training set mentioned at the beginning; there is also a list of smaller datasets. But they are all in English, and our task is in Russian. In this regard I want to express special gratitude to Yulia Rubtsova, a PhD student (perhaps a former one by now?) at the A. P. Ershov Institute of Informatics Systems of the Siberian Branch of the Russian Academy of Sciences, who published a corpus of nearly 230,000 labeled tweets (with accuracy above 82%). Our country needs more people like this, who support the community free of charge. In short, we worked with this dataset; you can read about it and download it at the link.
I cleaned the tweets of everything superfluous, keeping only contiguous sequences of Cyrillic characters and digits, which I ran through PyStemmer. Then I replaced identical words with identical numeric codes, ending up with a dictionary of about 100,000 words; the tweets, now represented as sequences of numbers, were ready for classification. I didn't bother removing low-frequency garbage, since the network is smart and will figure out what is superfluous on its own.
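That preprocessing can be sketched roughly like this (a simplified illustration: the function names and sample tweets are made up, and the real pipeline also ran the words through PyStemmer before assigning codes):

```python
import re

def tokenize(tweet):
    """Keep only contiguous runs of Cyrillic letters and digits, lowercased."""
    return re.findall(r'[а-яё0-9]+', tweet.lower())

def build_vocab(tweets):
    """Assign every distinct word a numeric code (0 is reserved for padding)."""
    vocab = {}
    for t in tweets:
        for w in tokenize(t):
            vocab.setdefault(w, len(vocab) + 1)
    return vocab

def encode(tweet, vocab, maxlen=100):
    """Turn a tweet into a fixed-length sequence of word codes, left-padded."""
    ids = [vocab[w] for w in tokenize(tweet) if w in vocab][:maxlen]
    return [0] * (maxlen - len(ids)) + ids

tweets = ["Звездные войны 2015 отличное кино!", "не понравились звездные войны"]
vocab = build_vocab(tweets)
seq = encode(tweets[0], vocab)
```

In the real code keras's `sequence` utilities handle the padding; the point is only that each tweet becomes a fixed-length vector of integer codes.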
Here is the code of our neural network in keras:
from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

max_features = 100000
maxlen = 100
batch_size = 32

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")

model.fit(
    X_train, y_train,
    batch_size=batch_size,
    nb_epoch=1,
    show_accuracy=True
)

result = model.predict_proba(X)
Excluding imports and variable declarations, that's exactly 10 lines, and it could have been written in one. Let's run through the code. The network has 6 layers:
- The settings tell the Embedding layer, which prepares the features, that the dictionary contains 100,000 distinct features and that the network should expect sequences of no more than 100 words.
- Next come two LSTM layers. The first outputs a tensor of dimensions batch_size / sequence length / units in the LSTM, and the second outputs a matrix of batch_size / units in the LSTM. For the second to understand the first, the return_sequences=True flag is set.
- The Dropout layer is there to fight overfitting. It zeroes out a random half of the features and prevents co-adaptation of weights in the layers (we take the Canadians' word for it).
- The Dense layer is an ordinary linear unit that computes a weighted sum of the components of its input vector.
- The final Activation layer squeezes this value into the interval from 0 to 1 so that it becomes a probability. In effect, Dense and Activation in this order are logistic regression.
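In other words, the last two layers compute sigmoid(w·x + b) over the LSTM output vector, which can be written out directly (the weights and input below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Dense(1): a single linear unit over the LSTM output vector
w = np.array([0.5, -1.0, 2.0])   # illustrative weights
b = 0.1

def dense_sigmoid(x):
    """Dense(1) followed by Activation('sigmoid'): exactly logistic regression."""
    return sigmoid(w @ x + b)

p = dense_sigmoid(np.array([1.0, 0.0, 0.5]))   # a probability in (0, 1)
```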
For training to run on the GPU, the corresponding flag must be set when executing this code, for example like so:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python myscript.py
On the GPU this model trained almost 20 times faster than on the CPU: about 500 seconds on a dataset of 160,000 tweets (a third of the tweets went to validation).
There are no firm rules for shaping the network topology for such tasks. We honestly spent half a day experimenting with different configurations, and this one showed the best accuracy: 75%. We compared the network's predictions with ordinary logistic regression, which showed 71% accuracy on the same dataset when the text was vectorized with tf-idf, and roughly the same 75% when tf-idf was applied to bigrams. The most likely reason the network barely beat logistic regression is that the training set was, after all, small (frankly, such a network wants at least a million training tweets) and noisy. Training ran for only 1 epoch, since beyond that we observed severe overfitting.
The model predicted the probability that a tweet is positive; we considered a tweet positive when this probability was above 0.65, negative when below 0.45, and neutral in between. Broken down by day, the dynamics look as follows:
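The thresholding can be written as a tiny helper; the 0.65 and 0.45 cut-offs are the ones from the text, while the treatment of the exact boundary values is my guess:

```python
def label(p):
    """Map the network's positive-class probability to a sentiment label."""
    if p >= 0.65:
        return "positive"
    if p <= 0.45:
        return "negative"
    return "neutral"
```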
Overall, it is clear that people rather liked the movie. Though personally I didn't, not really :)
Examples of the network's output
I picked 5 example tweets from each group (the number shown is the probability that the tweet is positive):
Positive
You can breathe out: the new Star Wars is old-school excellent. Abrams is as cool as ever. The script, the music, the actors and the camerawork are perfect. — snowdenny (@maximlupashko) December 17, 2015
I advise everyone to go see Star Wars, a super movie — Nikolay (@shans9494) December 22, 2015
THE FORCE HAS AWAKENED! MAY THE FORCE BE WITH YOU TODAY AT THE PREMIERE OF THE MIRACLE YOU'VE BEEN WAITING 10 YEARS FOR! #TheForceAwakens #StarWars — Vladislav Ivanov (@Mrrrrrr_J) December 16, 2015
Though I'm not a #StarWars fan, this rendition is wonderful! #StarWarsForceAwakens https://t.co/1hHKdy0WhB — Oksana Storozhuk (@atn_Oksanasova) December 16, 2015
Who saw Star Wars today? Me me me :)) — Anastasiya Ananich (@NastyaAnanich) December 19, 2015
Mixed sentiment
The new Star Wars is better than the first episode, but worse than all the others — Igor Larionov (@Larionovll1013) December 19, 2015
Han Solo is going to die. Enjoy the movie. #звездныевойны — Nick Silicone (@nicksilicone) December 16, 2015
Everyone around me has Star Wars. Am I the only one not in on it? :/ — Olga (@dlfkjskdhn) December 19, 2015
To go or not to go see Star Wars, that is the question — annet_p (@anitamaksova) December 17, 2015
Star Wars left mixed impressions. Both good and not quite. In places it didn't feel like the real thing… something alien slipped in — Kolot Evgeny (@KOLOT1991) December 21, 2015
Negative
There's so much talk around, am I really the only one who isn't a Star Wars fan? #StarWars #StarWarsTheForceAwakens — modern mind (@modernmind3) December 17, 2015
they tore my poor heart out of my chest and smashed it into millions and millions of splinters #StarWars — Remi Evans (@Remi_Evans) December 22, 2015
I hate my classmates, they spoiled Star Wars for me — pizhamka naila (@harryteaxxx) December 17, 2015
Woke up and realized the new Star Wars is a disappointment. — Tim Frost (@Tim_Fowl) December 20, 2015
I am disappointed by #пробуждениесилы — Eugenjkee; Star Wars (@eugenjkeee) December 20, 2015
P.S. After the research was done, I came across an article praising convolutional networks for this task. Next time we'll try them; keras supports them too. If any reader decides to check for themselves, write about your results in the comments, it would be very interesting. And may the Force of big data be with you!
This article is a translation of the original post at habrahabr.ru/post/274027/