### Developers Club geek daily blog


In the first part of this story we got acquainted with information theory through a concrete example; today we continue our exploration. Information theory gives us a precise language for powerful ideas with many areas of application: from data compression to quantum physics and machine learning.

Recall that the cost of a message of length L is 1/2^L. We can invert this expression to get the length a message of a given cost should have: log2(1/cost). Since we spend p(x) on the codeword for x, its length is log2(1/p(x)).

These are the optimal message lengths.

Earlier we spoke about a fundamental limit on how short the average message can be made, given a probability distribution p. This limit — the average message length when using the best possible code — is called the entropy of p and is written H(p). Knowing the optimal codeword lengths, we can calculate it:

H(p) = Σx p(x) log2(1/p(x))

People often write entropy as H(p) = −Σx p(x) log2(p(x)), using the identity log(1/a) = −log(a). I think the first form is more intuitive, and we will keep using it.

No matter what I do, on average I have to send at least that number of bits. The average amount of information needed to communicate something puts a hard limit on compression. Is there anything else to it? Yes! It also describes my degree of uncertainty and gives us a way to quantify information.

If I knew for certain what was going to happen, I wouldn't need to send a message at all. If there are two events that can each happen with 50% probability, I only need to send 1 bit. If there are 64 different events, all equally likely, I have to send 6 bits. The more concentrated the probability, the shorter the average message can be made. The more "spread out" the probability is, the longer the average message becomes. The less certain the outcome, the more information I learn, on average, when I find out what happened.
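As a sanity check, the numbers above can be reproduced in a few lines of Python (a minimal sketch; the helper name `entropy` is mine, not from the article):

```python
from math import log2

def entropy(p):
    """Average optimal message length: H(p) = sum of p(x) * log2(1/p(x))."""
    return sum(px * log2(1 / px) for px in p if px > 0)

# Two equally likely events need 1 bit on average.
print(entropy([0.5, 0.5]))          # 1.0

# 64 equally likely events need 6 bits.
print(entropy([1 / 64] * 64))       # 6.0

# A concentrated distribution needs less than a spread-out one.
print(entropy([0.9, 0.05, 0.05]))   # about 0.57 bits
```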

#### Cross entropy

Shortly before moving to Australia, Bob married Alice (another figment of my imagination). To my surprise (and that of the other characters in my head), Alice did not love dogs. She loved cats. Nevertheless, the two of them found a common language, despite their limited vocabulary.

They say the same words, only with different frequencies. Bob talks about dogs all the time, and Alice about cats.

At first Alice sent me messages using Bob's code. Unfortunately, her messages were longer than they needed to be. Bob's code is optimized for his probability distribution, and Alice has a different one. Bob's average message length when he uses his own code is 1.75 bits, but when Alice uses his code, her average message length rises to 2.25 bits. It would be even worse if their distributions resembled each other less.

The average message length needed to communicate an event from one probability distribution, when the coding scheme is based on a different distribution, is called cross entropy. Cross entropy is calculated by the formula [4]:

Hp(q) = Σx q(x) log2(1/p(x))

In this case, cross entropy is the word frequency of Alice the cat lover weighted by the codeword lengths derived from the word frequency of Bob the dog lover.

To keep the cost of our communication low, I asked Alice to use her own code. To my relief, that reduced her average message length. But a new problem appeared: sometimes Bob accidentally used Alice's code. Surprisingly, when Bob uses Alice's code, the result is worse than when Alice uses his.

• Bob using his own code: H(p) = 1.75 bits;
• Alice using Bob's code: Hp(q) = 2.25 bits;
• Alice using her own code: H(q) = 1.75 bits;
• Bob using Alice's code: Hq(p) = 2.375 bits.
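These four averages can be checked numerically. The word probabilities below are an assumption on my part — one simple pair of distributions consistent with the averages quoted in the text — and `cross_entropy` is a hypothetical helper:

```python
from math import log2

def cross_entropy(code_dist, message_dist):
    """Average length when events from message_dist are sent
    with a code optimized for code_dist."""
    return sum(message_dist[w] * log2(1 / code_dist[w]) for w in message_dist)

# Assumed distributions, chosen to match the averages in the text.
bob   = {"dog": 1/2, "cat": 1/4, "fish": 1/8, "bird": 1/8}   # p
alice = {"cat": 1/2, "fish": 1/4, "dog": 1/8, "bird": 1/8}   # q

print(cross_entropy(bob, bob))      # H(p)   = 1.75
print(cross_entropy(bob, alice))    # Hp(q)  = 2.25
print(cross_entropy(alice, alice))  # H(q)   = 1.75
print(cross_entropy(alice, bob))    # Hq(p)  = 2.375
```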

None of this is obvious at first. For example, we see that Hp(q) ≠ Hq(p). Is there some way to see how these four values relate to each other?

The image below contains four charts, each representing one of the four cases. Each chart shows the average message length. Messages from the same distribution are in the same row, and messages using the same code are in the same column. This lets you inspect the distributions and codes visually.

Why did Hq(p) turn out so big? Because one event (blue) happens often under the distribution p but gets a long codeword because it is rare under q. On the other hand, the frequent events of q are rarer under p, but the difference is less extreme, so Hp(q) is not as large.

Cross entropy is not symmetric.

So why should we care about cross entropy? Cross entropy captures how different two probability distributions are. The more p and q differ, the more the cross entropy of p with respect to q exceeds the entropy of p.

Similarly, the more p differs from q, the more the cross entropy of q with respect to p exceeds the entropy of q.

The difference between entropy and cross entropy is quite a curious thing: it shows how much longer our messages become because we used a code optimized for a different distribution. If the two distributions match, the difference is zero. As the distributions diverge, the difference grows.

This difference is called the Kullback–Leibler divergence. The Kullback–Leibler divergence of p with respect to q is written Dq(p) [5] and is defined by the formula [6]:

Dq(p) = Hq(p) − H(p) = Σx p(x) log2(p(x)/q(x))

The Kullback–Leibler divergence behaves like a distance between two distributions. It measures how different they are (if you want to take this idea seriously, you end up in information geometry).

Cross entropy and the Kullback–Leibler divergence are incredibly useful in machine learning. We often want one distribution to be close to another, and KL divergence gives us a natural way to measure how close they are, so it comes up again and again.
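Here is a quick sketch of KL divergence as the gap between cross entropy and entropy (the example distributions are mine, not from the article):

```python
from math import log2

def entropy(p):
    return sum(px * log2(1 / px) for px in p if px > 0)

def cross_entropy(q, p):
    """Average length when events from p are sent with a code built for q."""
    return sum(px * log2(1 / qx) for px, qx in zip(p, q))

def kl_divergence(p, q):
    """Dq(p) = Hq(p) - H(p): the extra bits paid for using the wrong code."""
    return cross_entropy(q, p) - entropy(p)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.25, 0.25]

print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # positive, and a different value: not symmetric
print(kl_divergence(p, p))  # 0.0 when the distributions match
```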

#### Entropy of several variables

Like many parents, my mother sometimes worries that I don't dress for the weather (she has every reason to think so, since I rarely wear a coat in winter). So she wants to know what the weather outside is like and what clothes I have on. How many bits do I have to send her to tell her?

The easy way is to flatten the joint probability distribution:

Now we can find the optimal codewords for events with these probabilities and calculate the average message length:

This is called the joint entropy of X and Y, calculated by the formula:

H(X, Y) = Σx,y p(x, y) log2(1/p(x, y))

This formula is just like the entropy formula, only with two variables instead of one. However, it is better not to flatten the distribution, but to add a third axis — code length. Now entropy is volume!

But suppose my mother already knows today's weather — she saw the forecast. How much information do I need to send now?

It might seem that I need to send however much information it takes to say what clothes I'm wearing, but in fact I need to send slightly less, because the weather influences what I put on. Let's consider two cases separately: when the weather is rainy and when it is sunny.

In both cases I don't need to send much information, because the weather gives a good guess at the right answer. I can use one code for sunny weather and another for rainy weather. In both cases I send less information than if I used a single code for both events.

To calculate the average amount of information I need to send my mother, I combine both cases on one diagram:

We call this conditional entropy. Conditional entropy is calculated by the formula:

H(X|Y) = Σy p(y) Σx p(x|y) log2(1/p(x|y)) = Σx,y p(x, y) log2(1/p(x|y))
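These quantities can be sketched from a joint distribution table. The weather/clothing probabilities below are made up for illustration; only the formulas come from the text:

```python
from math import log2

# Hypothetical joint distribution p(weather, clothing).
joint = {
    ("rain", "coat"):    0.30,
    ("rain", "t-shirt"): 0.05,
    ("sun",  "coat"):    0.10,
    ("sun",  "t-shirt"): 0.55,
}

def H_joint(j):
    """H(X, Y) = sum of p(x, y) * log2(1/p(x, y))."""
    return sum(p * log2(1 / p) for p in j.values())

def H_marginal(j, axis):
    """Entropy of one variable: axis=0 for weather, axis=1 for clothing."""
    marg = {}
    for key, p in j.items():
        marg[key[axis]] = marg.get(key[axis], 0) + p
    return sum(p * log2(1 / p) for p in marg.values())

def H_conditional(j):
    """H(clothing | weather) = sum of p(x, y) * log2(p(y)/p(x, y))."""
    marg = {}
    for (w, _), p in j.items():
        marg[w] = marg.get(w, 0) + p
    return sum(p * log2(marg[w] / p) for (w, _), p in j.items())

# Chain rule: H(X, Y) = H(Y) + H(X|Y).
print(H_joint(joint))
print(H_marginal(joint, 0) + H_conditional(joint))  # same value
```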

In the previous section we found that if the receiver knows the value of one variable, a smaller amount of information suffices to communicate the second.

It is possible to picture amounts of information as bars. If two bars share information, they overlap. For example, X and Y contain shared information, so the bars H(X) and H(Y) overlap. H(X, Y) contains the information of both, so it encompasses H(X) and H(Y) [7].

Once you look at things from this angle, a lot becomes easier to understand.

We already noted that transmitting X and Y at the same time requires more information (the joint entropy H(X, Y)) than transmitting X alone (the marginal entropy H(X)). However, if you already know Y, the amount of data needed to communicate X (the conditional entropy H(X|Y)) decreases.

This may sound confusing, but with the help of the bars it becomes simple again. H(X|Y) is the information we have to send to communicate X to someone who already knows Y (the information that is in X but not in Y). This means H(X|Y) is the part of the H(X) bar that does not overlap with H(Y).

The inequality H(X, Y) ≥ H(X) ≥ H(X|Y) can be read off the following diagram.

Another identity is H(X, Y) = H(Y) + H(X|Y): the information in X and Y together is the information in Y plus the information in X that is not in Y.

Again, the equation may look difficult, but the "information bars" make it clear.

We have now sliced up the information in X and Y in several ways. We have the information in each variable, H(X) and H(Y). We have the information in both events together, H(X, Y). We have the information carried by one event but not the other, H(X|Y) and H(Y|X). A lot of this revolves around the information the two events share. It is called the mutual information, I(X, Y), and is calculated by the formula [8]:

I(X, Y) = H(X) + H(Y) − H(X, Y)

This equation holds because the sum H(X) + H(Y) contains two copies of the mutual information that is in both X and Y, while H(X, Y) contains only one (look at the previous bar diagram).

Closely related to mutual information is the variation of information. The variation of information is the information that is not shared between the variables. We can define it as:

V(X, Y) = H(X, Y) − I(X, Y)

The variation of information is interesting because it gives us a metric — a notion of distance between different variables. It equals zero if, having learned the value of one variable, you can determine the value of the other, and it grows as the variables become more independent.
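Both quantities can be sketched on the same joint table (the tables and helper names are mine):

```python
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

def info_measures(joint):
    """Return (I(X, Y), V(X, Y)) from a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    h_x, h_y = entropy(px.values()), entropy(py.values())
    h_xy = entropy(joint.values())
    mutual = h_x + h_y - h_xy      # I(X, Y)
    variation = h_xy - mutual      # V(X, Y)
    return mutual, variation

# X determines Y exactly: variation of information is zero.
locked = {(0, "a"): 0.5, (1, "b"): 0.5}
print(info_measures(locked))       # (1.0, 0.0)

# X and Y independent: mutual information is zero.
indep = {(x, y): 0.25 for x in (0, 1) for y in ("a", "b")}
print(info_measures(indep))        # (0.0, 2.0)
```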

How are the variation of information and the Kullback–Leibler divergence, which also gives us a notion of distance, related? KL divergence measures the distance between two probability distributions over the same variable or set of variables. The variation of information, by contrast, measures the distance between two variables within one distribution. KL divergence is between distributions; the variation of information is within a distribution.

We can represent the relationships between these kinds of information on one diagram:

#### Fractional bits

Another hard-to-grasp thing in information theory is fractional bits. It is quite strange: what does half a bit mean?

There is a simple answer: often we care about the average message length. If half the time one bit is sent, and the other half two bits, the average message length is one and a half bits. There is nothing strange about an average being fractional.

But that answer dodges the question. Often the optimal codeword lengths themselves are fractional. What does that mean?

Let's consider a probability distribution where event a has probability 71% and event b has probability 29%.

The optimal code would use 0.5 bits to represent a and 1.7 bits to represent b. If we want to send one codeword at a time, that is simply impossible. We are forced to round to a whole number of bits and send 1 bit on average.

But we can improve the code if we send several messages at once. Let's consider two events from this distribution. If we send them separately, we have to send two bits. Can we do better?

We send aa in 50% of cases, ab or ba in 21% of cases each, and bb in 8% of cases. We see that the ideal code contains fractional bits.

If we round the codeword lengths, we get something like this:

These codes give us an average message length of 1.8 bits. That is less than when we sent the events independently. Think of it as spending 0.9 bits per event. If we increase the number of events sent together, the average per-event length becomes smaller still. As n tends to infinity, the overhead due to rounding our code vanishes, and the number of bits per codeword approaches the entropy.
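The block-coding arithmetic above can be reproduced with a tiny Huffman coder (a minimal sketch; the helper is mine, not from the article):

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword length per symbol for an optimal prefix code."""
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique counter so the heap never compares dicts
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: length + 1 for s, length in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

p = {"a": 0.71, "b": 0.29}
entropy = sum(q * log2(1 / q) for q in p.values())   # about 0.87 bits/event

# Sending one event at a time costs 1 bit.  Sending pairs:
pairs = {x + y: p[x] * p[y] for x in p for y in p}
lengths = huffman_lengths(pairs)
avg = sum(pairs[s] * lengths[s] for s in pairs)
print(avg / 2)   # about 0.89 bits per event: between the entropy and 1
```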

Also, notice that the ideal codeword length for a was 0.5 bits, while the ideal codeword length for aa is 1 bit. Ideal codeword lengths add, even when they are fractional. Thus, if we send many events at once, their lengths add up.

We see that fractional bits are not meaningless.

In practice, people use particular coding schemes that are efficient to different degrees. Huffman coding — essentially the kind of code we described here — cannot handle fractional bits (you have to group symbols, as we did above, or use more elaborate tricks to approach the entropy limit). Arithmetic coding is a little different: it can handle fractional bits and is asymptotically optimal.

#### Conclusion

The ideas described in this article are widely applied to transmitting compressed information. Information theory covers the core techniques of data compression, giving us an understanding of the principles behind the algorithms we described. But does all of this have other applications?

Ideas from information theory appear in many areas: machine learning, quantum physics, genetics, thermodynamics, and even gambling. As a rule, specialists in those fields don't care about information theory because of compression — they care because it bears on their own specialty. Quantum entanglement can be described with entropy [9]. Many results in statistical mechanics and thermodynamics can be derived by the method of maximum entropy [10]. Wins and losses in gambling are directly connected (in certain cases) to the Kullback–Leibler divergence [11].

Information theory shows up in all these areas because it offers concrete, formal descriptions of many things we need to express. It gives us ways to measure uncertainty and the difference between two sets of beliefs, and to answer how related two questions are: how spread out a probability distribution is, what the distance between probability distributions is, and how dependent two variables are. Are there alternatives? Of course. But the ideas from information theory are clean, they have nice properties and a principled origin. Sometimes they are exactly what you need, and sometimes they are simply convenient.

I understand machine learning best of all, so let's talk about it a little. A very common type of task in machine learning is classification. For example, we want to look at an image and decide: is this a cat or a dog? Our model might say something like: "the probability that this is an image of a dog is 80%, and the probability that it is a cat is 20%." Suppose the correct answer is "dog". Is it good or bad that the probability assigned to "dog" was only 80%? Would 85% have been better?

This is an important question, because we need some measure of how good our model is in order to optimize it. What exactly should we optimize? The right answer depends on what we need the model for: do we care only about the top answer, or also about the confidence in that answer? How bad is it to be wrong? There is no single answer to these questions, and very often we cannot know the right one, because we don't know how the model will be used precisely enough to formalize it. The result is situations where we end up optimizing cross entropy [12].
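In practice this becomes the cross-entropy loss: the model is charged log2(1/p) bits for the probability p it assigned to the true label. A minimal sketch, with numbers echoing the dog/cat example:

```python
from math import log2

def cross_entropy_loss(predicted, true_label):
    """Bits the model 'pays' for the probability it gave the true label."""
    return log2(1 / predicted[true_label])

print(cross_entropy_loss({"dog": 0.80, "cat": 0.20}, "dog"))  # about 0.32 bits
print(cross_entropy_loss({"dog": 0.85, "cat": 0.15}, "dog"))  # about 0.23 bits

# More confidence in the right answer means lower loss, but being
# confidently wrong is punished very hard:
print(cross_entropy_loss({"dog": 0.01, "cat": 0.99}, "dog"))  # about 6.6 bits
```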

Information gives us a new framework for thinking about the world. Sometimes it solves a problem outright, and sometimes it is just incredibly useful. In this article I have only scratched the surface of information theory. We did not discuss topics such as error-correcting codes, but I hope I have managed to interest you in this remarkable branch of mathematics. There is nothing to be afraid of.

Notes
4. Note that this is a non-standard notation for cross entropy. The commonly accepted notation is H(p, q). It is awful for two reasons. First, exactly the same notation is used for joint entropy. Second, it makes cross entropy look symmetric. That is absurd, so I will write Hp(q).

5. Also a non-standard notation.

6. If you expand the definition of the Kullback–Leibler divergence, you get:

Dq(p) = Σx p(x) (log2(1/q(x)) − log2(1/p(x)))

That looks a little strange. How should we read it? The difference log2(1/q(x)) − log2(1/p(x)) is just the difference between the number of bits that a code optimized for q and a code optimized for p spend on x. The expression as a whole is the expected difference in how many bits the two codes require.

7. The idea of this interpretation is borrowed from Raymond W. Yeung's paper "A New Outlook on Shannon's Information Measures".

8. If you expand the definition of mutual information, you get:

I(X, Y) = Σx,y p(x, y) log2(p(x, y) / (p(x) p(y)))

That looks exactly like a Kullback–Leibler divergence! What is going on? It is one: it is the KL divergence between P(X, Y) and its naive approximation P(X)P(Y). In other words, it is the number of bits saved when representing X and Y jointly rather than approximating them as independent.

A visualization of the relationship between a distribution and its approximation:

9. This is an entire field — quantum information theory. I understand little of it, but am willing to bet with extremely high confidence (given how magnificent Michael's other work is) that "Quantum Computation and Quantum Information" by Michael Nielsen and Isaac Chuang is an excellent place to start.

10. As someone who understands nothing about statistical physics, I get very nervous when I try to state its connection with information theory.

After Shannon created information theory, many people noticed suspicious similarities between the equations of thermodynamics and the equations of information theory. E.T. Jaynes found a very deep, fundamental connection. Suppose you have a system, and you take some measurements of its pressure and temperature. What is the probability of a particular state of the system? Jaynes proposed that we should take the probability distribution which, subject to the constraints of our measurements, has maximum entropy.

(Note that the "maximum entropy principle" applies well beyond physics.) That is, we should choose the possibility with the greatest amount of uncertainty. From this, a whole host of results can be derived.

(Read the first few sections of Jaynes's papers (part 1, part 2). I was impressed by the language they are written in — simple and clear.)

If this interests you but you don't feel like digging into the original papers, the book by Cover and Thomas has a section where they derive the statistical version of the second law of thermodynamics from Markov chains.

11. The connection between information theory and gambling was first described by John Kelly in his paper "A New Interpretation of Information Rate". It is written in surprisingly accessible language, though it requires understanding a few concepts I have not covered in this article.

Kelly noticed that entropy showed up in many cost functions that had no relation to data coding, and wanted to find the concrete reason for it. While writing this article, I was bothered by the same problem, so I appreciated Kelly's work. Nevertheless, I don't find Kelly entirely convincing: he arrives at entropy only because he reinvests all his winnings at each new stage. Other schemes do not lead to entropy.

The connection Kelly drew between betting and information theory can also be found in Cover and Thomas's "Elements of Information Theory".

12. This doesn't settle the question, but I can't resist defending the Kullback–Leibler divergence.

There is a result which Cover and Thomas call Stein's lemma, although it has no direct relation to the actual lemma. Without going deep, here is roughly what it says.

Suppose you have data from one of two probability distributions. How do you determine which one it came from? The more data you obtain, the higher your confidence can be. For example, on average your confidence might grow by a factor of 1.5 for each data point received.

How strongly your confidence grows depends on how different the distributions are. If the differences are large, determining the right distribution won't take long. But if the differences are slight, you may have to look through a lot of data.

In general, Stein's lemma says that the growth rate of your confidence is controlled by the Kullback–Leibler divergence (there is a subtlety involving false positives and false negatives). That seems like a good reason to study the Kullback–Leibler divergence.
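A quick way to see this claim (a sketch with made-up distributions): each observation multiplies your odds by a likelihood ratio, and the expected log-likelihood ratio per observation is exactly the KL divergence:

```python
import random
from math import log2

p = {"heads": 0.8, "tails": 0.2}   # the true distribution
q = {"heads": 0.5, "tails": 0.5}   # the competing hypothesis

# Dq(p): expected bits of evidence per observation in favor of p.
kl = sum(px * log2(px / q[x]) for x, px in p.items())

random.seed(0)
draws = random.choices(list(p), weights=list(p.values()), k=200_000)
avg_evidence = sum(log2(p[x] / q[x]) for x in draws) / len(draws)

print(kl)            # about 0.28 bits of evidence per draw
print(avg_evidence)  # the simulation agrees with the analytic value
```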