This post is based on the lecture "Deep Learning: Theoretical Motivations" by Dr. Yoshua Bengio, delivered at the 2015 Deep Learning Summer School in Montreal. We recommend watching the lecture itself – it will help you understand the material below.

Deep learning is a family of machine learning algorithms based on learning multiple levels of representation, where each level of representation corresponds to a level of abstraction. This post examines the problem of generalization – the ability to recognize complex objects built from successfully learned levels of representation.

The diagram above shows how system components interact in different subfields of artificial intelligence. Blocks that can be trained are shown in gray.

**Expert systems**

An expert system is an artificial intelligence program whose entire behavior is written out by hand. The required knowledge is supplied by experts in the relevant domain. To answer questions, such systems combine facts using logical rules.

**Classical machine learning**

In classical machine learning, the important expert knowledge is still entered manually, but the system is then trained to produce its output from the resulting features. This type of machine learning is widely used for simple object recognition tasks. When designing such systems, most of the time is spent selecting the right training data. Once the expert knowledge has been formalized, an ordinary classifier produces the output.

**Representation learning**

Representation learning takes a step beyond classical machine learning and removes the need to formalize expert knowledge. The system discovers all the important patterns on its own from the input data (as, for example, in neural networks).

**Deep learning**

Deep learning belongs to this broader family of machine learning methods – representation learning – but arranges the feature vectors across many levels at once. The features are discovered automatically and combined with one another to produce the output. Each level provides abstract features built from the features of the previous level, so the deeper we go, the higher the level of abstraction. In a neural network, the stack of layers is exactly such a stack of levels of feature vectors that produce the output.
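This stacking of levels can be sketched in a few lines of NumPy. This is only a toy illustration with made-up layer sizes and random, untrained weights – in practice the weights would be learned from data:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Toy 3-layer network: each layer builds new features from the
# previous layer's features. Weights are random placeholders.
rng = np.random.default_rng(0)
layer_sizes = [8, 16, 16, 4]          # input -> hidden -> hidden -> output
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    h = x
    for w in weights:                 # each matmul + ReLU is one level of abstraction
        h = relu(h @ w)
    return h

x = rng.standard_normal(8)
print(forward(x).shape)               # (4,)
```

Each pass through the loop transforms the previous level's feature vector into the next, more abstract one.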

#### Way to artificial intelligence

To create artificial intelligence, we need three main ingredients:

*1. Large volumes of data*

Building an artificial intelligence system requires an enormous amount of knowledge. This knowledge is either formalized by people or derived from other data (as in machine learning), and it is what makes the right decisions possible. Today such systems can process video, images, sound, and so on.

*2. Very flexible models*

Data alone is not enough. Important decisions must be made on the basis of the collected data, and all of that information has to be stored somewhere, so the models must be sufficiently large and flexible.

*3. Prior knowledge*

Prior knowledge lets us "lift" the curse of dimensionality by giving the system enough knowledge about the world.

Classical nonparametric algorithms can process huge volumes of data and have flexible models, but they rely on a smoothness assumption. Most of this post is devoted to this third ingredient.

**What do we need?**

*Knowledge.* The world is a very complicated place, and artificial intelligence must learn to understand it. For a system to understand the world the way we do, it will need far more knowledge than machine learning systems currently have.

*Learning.* Learning is the crucial process that lets artificial intelligence acquire complex knowledge on its own. Learning algorithms involve two ingredients: prior knowledge and optimization methods.

*Generalization.* This is the most important aspect of machine learning. Generalization is an attempt to guess which outcome is most probable; geometrically, it is an attempt to guess where the probability mass is concentrated.

*Ways to fight the curse of dimensionality.* The curse arises from high-dimensional variables that increase the complexity of the function. Even with only two dimensions, the number of possible configurations is huge; with many dimensions, covering all of the available configurations is practically impossible.

*Disentangling the explanatory factors.* Artificial intelligence must understand how and why the data were generated. This is exactly what science does: it runs experiments and tests explanations of the world. Deep learning is a step toward artificial intelligence.

#### Why not classical nonparametric algorithms?

The term "nonparametric" has many definitions. We will say that a learning algorithm is nonparametric if the complexity of the functions it can learn grows with the amount of training data. In other words, its parameter vector is not fixed.

Depending on the data available to us, we can choose a family of functions that is more or less flexible. With a linear classifier, increasing the amount of data does not change the model; in a neural network, by contrast, we can choose a larger number of hidden units.

"Nonparametric" does not mean "has no parameters" – it means "has no fixed set of parameters": we can choose the number of parameters based on how much data we have.
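A toy comparison makes the distinction concrete. A 1-nearest-neighbor "model" effectively stores its training set as its parameters, so its size grows with the data, while a linear model keeps a fixed number of parameters (this sketch just counts stored numbers; the model names and sizes are illustrative):

```python
import numpy as np

# 1-NN is nonparametric: its effective parameters are the stored
# training points, so model size grows with the dataset.
# A linear model keeps d + 1 numbers no matter how much data arrives.
rng = np.random.default_rng(0)
d = 5
for n in (10, 1000):
    X = rng.standard_normal((n, d))
    knn_params = X.size            # n * d stored values
    linear_params = d + 1          # weights + bias, fixed
    print(n, knn_params, linear_params)
```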

**The curse of dimensionality**

The curse of dimensionality is a consequence of the abundance of configurations that arises with a large number of dimensions: the number of possible configurations grows exponentially with the number of dimensions. The machine learning problem is then to handle configurations we have never seen.
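The exponential growth is easy to see by counting the cells of a grid over the input space:

```python
# With k bins per axis, a d-dimensional grid has k**d cells.
# Even a modest grid quickly becomes impossible to fill with examples.
k = 10
for d in (1, 2, 5, 10):
    print(d, k ** d)
# 10 bins in 10 dimensions already give 10,000,000,000 cells.
```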

The classical approach in nonparametric statistics relies on "smoothness". This works well in low dimensions, but in high dimensions a local neighborhood may contain either all of the examples or none of them, which is useless. To generalize locally, we would need representative examples of every relevant outcome; local averaging alone cannot produce anything worthwhile.

Looking at the mathematics, what matters is the number of ways the function we are learning can vary. Here "smoothness" refers to the number of ups and downs of the curve.

A straight line is a very smooth "curve". A curve with a few ups and downs is still smooth, but less so.

The functions we want to learn are not smooth. In domains such as computer vision or natural language processing, the target function is very complicated.

Many nonparametric statistical methods rely on a Gaussian influence function that averages values over a small neighborhood. But a kernel machine with Gaussian kernels needs at least k examples to learn a function with 2k zero crossings. The number of ups and downs grows exponentially with the number of dimensions, and even in one dimension a function can be very wiggly.

Approaching the problem geometrically, we should place probability mass where structure is most plausible. With the empirical distribution, the mass falls exactly on the training examples. Consider the illustration above, which shows two-dimensional data: if we assume smoothness, the probability mass is spread evenly around the examples.

The round shapes are Gaussian kernels around each example; many nonparametric statistical methods take this approach. In two dimensions this looks simple and workable, but as the number of dimensions grows, the circles (spheres) either swell to cover all of the space or leave empty gaps where the probability should be highest. So we should not count on smoothness alone; we need something more effective – some structure.

In the image above, that structure is a one-dimensional manifold on which the probability mass concentrates. If we can learn a representation of the probability, we solve our problems. The representation may have lower dimension, or lie along different axes in the same space. We take a complicated nonlinear function and embed it in Euclidean space by changing the representation; this makes it easier to predict, interpolate, and estimate the density.

**Lifting the curse**

Smoothness was the main assumption of most nonparametric methods, but it clearly cannot overcome the curse of dimensionality on its own. We want the flexibility of the function family to grow as the amount of data grows; in neural networks, we vary the number of hidden units with the volume of data.

Our machine learning models must be compositional. Natural languages use composition to express ever more complex ideas. In deep learning we use:

- distributed representations;
- deep architectures.

Suppose a description is built not all at once but piece by piece, with the pieces combined either in parallel or in sequence. Parallel composition corresponds to a distributed representation (representation learning); sequential composition corresponds to representation learning with several levels. Composition gives us a way to describe the world around us efficiently.

#### The power of distributed representations

*Non-distributed representations*

Methods that do not use distributed representations include clustering, n-grams, nearest neighbors, radial basis functions, support vector machines, and decision trees. These algorithms take a space as input and produce a partition into regions as output. Some produce hard partitions, others soft ones that allow smooth interpolation between neighboring regions. Each region has its own set of parameters.

The output associated with each region, and the region's location, are determined by the data. The notion of complexity is tied to the number of regions. From the standpoint of learning theory, generalization depends on the ratio between the number of examples needed and the complexity: a richer function requires more regions and more data. The number of distinguishable regions is linear in the number of parameters, and also linear in the number of training examples.

*Why distributed representations?*

There is another option. Using distributed representations, a linear number of parameters can carve out an exponential number of regions. The magic of a distributed representation is that it can learn a very complicated function (with many ups and downs) from a small number of examples.

In non-distributed representations, the number of parameters is linear in the number of regions. With distributed representations, by contrast, the number of regions can grow exponentially in the number of parameters and examples. In a distributed representation, individual features are meaningful on their own, independently of the other features. Correlations can exist, but most features are learned independently of one another, and we do not need to see every configuration to make the right decision. Independent features create a large combinatorial set of configurations. The benefit shows up even with a single layer: the number of required examples can be very small.
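A standard counting argument from hyperplane-arrangement theory illustrates the gap: n prototypes (as in clustering or nearest neighbors) give n regions, while n hyperplanes in d dimensions in general position – n binary features – cut the space into sum over i from 0 to d of C(n, i) regions, reaching 2**n when d >= n:

```python
from math import comb

# Max number of regions that n hyperplanes in general position
# carve out of d-dimensional space: sum_{i=0}^{d} C(n, i).
# Compare with the n regions that n prototypes would give.
def regions(n, d):
    return sum(comb(n, i) for i in range(d + 1))

for n, d in [(10, 2), (10, 5), (10, 10)]:
    print(n, d, regions(n, d))
# for d >= n the count reaches 2**10 = 1024 regions from 10 features
```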

In practice the full benefit is not observed – it is large, but not exponential. If the representations are good, they unfold the manifold into a new, flat coordinate system. Neural networks have succeeded in learning representations that capture semantic aspects, and generalization is built on such representations. If an example lies in a part of the space with no data, a nonparametric system can say nothing about it; with distributed representations, we can draw conclusions about things we have never seen. That is the essence of generalization.

*Classical symbolic AI versus representation learning*

Distributed representations are the cornerstone of connectionism, the approach that emerged in the 1980s. The classical approach is built on symbols: in symbolic processing of things like language, logic, or rules, each concept is associated with a distinct entity – a symbol. A symbol either exists or it does not.

Nothing defines the relationships between symbols. Take a dog and a cat: in symbolic AI, these are two different symbols with no relationship between them. In a distributed representation they share attributes – for example, both are pets, both have four paws, and so on. These concepts can be thought of as patterns of features, or patterns of neuron activations in the brain.

*Distributed representations in natural language processing*

Distributed representations have produced very interesting results in natural language processing. I recommend the article "Deep Learning, NLP, and Representations".

#### The power of deep representations

The word "deep" is widely misunderstood. For a long time, deep neural networks were not studied because people believed they were unnecessary: a shallow neural network with a single hidden layer can represent any function to a given accuracy.

This property is called universal approximation. The catch is that we do not know how many hidden units are needed. A deep neural network can drastically reduce the number of hidden units – reduce the cost of representing the function. If the function we want to learn is itself deep (composed of many levels of operations), the neural network needs more layers.
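A classic one-dimensional illustration of this saving: composing a piecewise-linear "tent" function with itself m times produces a sawtooth with 2**m linear pieces using only about 2m ReLU-style units, whereas a shallow network would need on the order of 2**m units to produce the same number of pieces. A small numerical check (the grid size here is an arbitrary choice):

```python
import numpy as np

def hat(x):
    # piecewise-linear "tent" on [0, 1]: representable with 2 ReLU units
    return np.minimum(2 * x, 2 - 2 * x)

xs = np.linspace(0, 1, 10001)
y = xs
m = 4
for _ in range(m):          # depth-m composition: 2 units per layer
    y = hat(y)

# count linear pieces by counting slope sign changes on the grid
slopes = np.sign(np.diff(y))
pieces = 1 + np.count_nonzero(np.diff(slopes))
print(pieces)               # 16 = 2**4 pieces from only 4 composed layers
```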

Depth is not a necessity: even without deep networks, we can obtain a family of functions with sufficient flexibility. Deeper networks do not have greater capacity – deeper does not mean we can represent more functions. A deep neural network is worth using when the function being learned has a particular structure: it is a composition of several operations.

*A "shallow" and a "deep" computer program*

*A "shallow" program*

*A "deep" program*

When we write computer programs, we do not lay out all the lines of code one after another – we usually use subroutines. Likewise, the hidden units act as subroutines for the larger program – the final layer. You can think of each line of the program as transforming the machine's state and passing its output on: each line takes the machine state as input and produces a new state as output.

This resembles a Turing machine, where the number of steps the machine executes corresponds to the depth of the computation. In principle, any function can be represented in two steps (a lookup table), but that is not always efficient. A kernel support vector machine, or a shallow neural network, can be viewed as a lookup table. We need deeper programs.

*Shared components*

Polynomials are often represented as sums of products. Another representation is a computation graph in which each node performs an addition or a multiplication. This lets us express deep computations, and the number of operations drops because intermediate results can be reused.
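The simplest example of this reuse is computing a power. Written flat, x**8 takes seven multiplications; a deeper graph that reuses each intermediate result (repeated squaring) takes three:

```python
# "Flat" evaluation of x**8: seven multiplications in a row.
def pow8_flat(x):
    ops = 0
    r = x
    for _ in range(7):
        r *= x
        ops += 1
    return r, ops

# "Deep" evaluation reusing intermediate results:
# x -> x**2 -> x**4 -> x**8, only three multiplications.
def pow8_deep(x):
    ops = 0
    r = x
    for _ in range(3):
        r *= r
        ops += 1
    return r, ops

print(pow8_flat(3), pow8_deep(3))   # both give 6561, with 7 vs 3 multiplies
```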

Moreover, deep networks with gated units (such as rectifiers) are far more expressive than elements of shallow networks, since they can split the input space into a much larger number of linear regions (conditional branches).

#### The non-convexity illusion

The non-convexity of the optimization problem was one of the reasons neural networks were rejected in the 1990s. Since the late 1980s and early 1990s we have known that neural networks have an exponential number of local minima. This knowledge, together with the success of kernel machines in the 1990s, greatly reduced many researchers' interest in neural networks.

The reasoning was that since the optimization is non-convex, there is no guarantee of finding the optimal solution; worse, the network could get stuck in bad, suboptimal solutions. Researchers changed their minds only recently: theoretical and empirical evidence has shown that non-convexity is not really a problem at all, which changed our whole picture of optimization in neural networks.

*Saddle points*

Consider optimization in low and high dimensions. In low dimensions there are many local minima. In high dimensions, however, the critical points are mostly not local minima: when we optimize a neural network, or any other function of many variables, along most trajectories the critical points (points where the derivative is zero or close to zero) are saddle points. Saddle points are unstable.

The image above shows a saddle point. If a point is a local or global minimum, the function increases as we move away from it in every direction (moving away from a local maximum, it decreases). With any randomness in how the function is defined, or with directions chosen independently, it is extremely unlikely that the function increases in all directions at a point other than the global minimum.

Intuitively, if we have found a minimum close to the global one, the function increases in all directions – there is nowhere lower to go. Results from statistical physics and random matrix theory suggest that for certain (fairly large) families of functions, the probability concentrates along a relationship between the index of a critical point and the value of the objective function.

The index is the fraction of directions in which the function decreases. If the index is zero, the point is a local minimum; if it is one, a local maximum; if it is strictly between zero and one, it is a saddle point. A local minimum is thus the special case of a critical point with index zero, and saddle points are by far the most common. Empirical results confirm that there really is a tight relationship between the index and the objective value.
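Concretely, the index of a critical point can be computed as the fraction of negative eigenvalues of the Hessian there. For the textbook saddle f(x, y) = x² − y² at the origin:

```python
import numpy as np

# Hessian of f(x, y) = x**2 - y**2 at the critical point (0, 0).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eig = np.linalg.eigvalsh(H)          # eigenvalues: one positive, one negative
index = np.mean(eig < 0)             # fraction of descent directions
print(index)                         # 0.5 -> strictly between 0 and 1: a saddle
```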

This is only an empirical estimate, and there is no proof that these results carry over to the optimization of neural networks, but the observed behavior matches the theory. In practice, stochastic gradient descent almost always "escapes" from surfaces that are not a local minimum.

*Other methods that work with distributed representations*

**The human way**

People can draw conclusions from very small numbers of examples. Children usually learn something new from just a few examples – sometimes from a single one, which is statistically impossible. The only explanation is that the child uses knowledge acquired earlier. Prior knowledge can be used to build representations in which, in the new space, a conclusion can be drawn from a single example. Humans rely more heavily on prior knowledge.

**Semi-supervised learning**

Semi-supervised learning sits between supervised and unsupervised learning. In supervised learning we use only labeled examples; in semi-supervised learning we additionally use unlabeled ones. The image below shows how semi-supervised learning can find a better decision boundary by using the unlabeled examples.
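The idea can be sketched with a minimal toy label-propagation scheme (a hand-rolled sketch, not a specific library's algorithm; the cluster positions and distance threshold are arbitrary choices): labels spread from one labeled point per class to nearby unlabeled points, so the unlabeled data shape the final labeling.

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs of 50 points each
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)

y = np.full(100, -1)               # -1 marks "unlabeled"
y[0], y[50] = 0, 1                 # only one labeled example per class

for _ in range(100):               # propagate labels through the clusters
    for i in np.where(y == -1)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[y == -1] = np.inf        # only consider already-labeled points
        j = np.argmin(d)
        if d[j] < 1.0:             # adopt the label of a close neighbor
            y[i] = y[j]
    if (y != -1).all():
        break

print((y == y_true).mean())        # accuracy over all 100 points
```

Because the blobs are far apart relative to the threshold, labels spread within each cluster but never across, so almost every point is labeled correctly from just two labeled examples.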

**Multi-task learning**

Adapting to new kinds of tasks is a very important step in the development of artificial intelligence. Prior knowledge is connecting knowledge. Deep architectures learn intermediate representations that can be shared across tasks. Good representations, which disentangle the factors of variation, are useful for solving many tasks, because each task depends on a subset of the features.
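Structurally, this usually means a shared trunk with task-specific heads. A minimal sketch with made-up shapes and random, untrained weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
W_shared = rng.standard_normal((10, 32)) * 0.1   # shared representation
W_task_a = rng.standard_normal((32, 3)) * 0.1    # head for task A (3 classes)
W_task_b = rng.standard_normal((32, 1)) * 0.1    # head for task B (regression)

x = rng.standard_normal(10)
h = relu(x @ W_shared)        # intermediate features shared by both tasks
out_a = h @ W_task_a          # task A scores
out_b = h @ W_task_b          # task B prediction
print(out_a.shape, out_b.shape)
```

During training, gradients from both tasks would flow into `W_shared`, so each task benefits from what the other has learned.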

The following diagram illustrates multi-task learning with different inputs:

*Learning multiple levels of abstraction*

Deep learning makes it possible to learn many levels of abstraction that disentangle the factors of variation, which simplifies generalization.

#### Conclusion

- Distributed representations and deep composition substantially improve the ability to generalize;
- Distributed representations and deep composition yield non-local generalization;
- Local minima are not a problem, because most critical points are saddle points;
- Other methods, such as semi-supervised and multi-task learning, should be used to help deep distributed representations generalize even better.

This article is a translation of the original post at habrahabr.ru/post/271027/
