*I like to open for myself new methods of thinking. Especially it is pleasant to observe how the vague idea is transformed to the specific concept. A striking example of it is the information theory.*

The information theory gives us exact language for the description of many things. What degree of uncertainty? How to answer a question B, knowing the answer to a question A? Than one set of beliefs is similar to another?

When I was a child, I had some non-standard thoughts about it, but the information theory created them in specific, strong ideas which have a set of scopes of application: from data compression to quantum physics and machine learning.

The information theory looks frightening, but I think that it not so. Actually, many main ideas can be explained clearly.

The information theory gives us exact language for the description of many things. What degree of uncertainty? How to answer a question B, knowing the answer to a question A? Than one set of beliefs is similar to another?

When I was a child, I had some non-standard thoughts about it, but the information theory created them in specific, strong ideas which have a set of scopes of application: from data compression to quantum physics and machine learning.

The information theory looks frightening, but I think that it not so. Actually, many main ideas can be explained clearly.

The most part of time I wear a t-shirt, but sometimes I put on a coat. Let's assume that I wear a coat of 38% of time.

We represent it on the chart:

Now I want to integrate both charts. It is easy if they do not interact with each other in any way, that is are independent. For example, I will put on a t-shirt today or the coat, actually, does not depend on weather next week.

Let's note the first variable on a x axis, and the second – on a y axis:

Pay attention to direct lines: vertical and horizontal. So independence of events looks. The probability that I put on a coat does not influence the fact of loss of rainfall this week. In other words, probability that I will put on a coat and next week rain will expect to fall, there is a work of probabilities that I will put on a coat and that there will be a rain. These probabilities do not influence at each other.

At interaction of variables, for one couples the probability increases, and for others decreases. Probability that I will put on a coat when it is raining much higher because variables correlate. Probability that I will put on a coat in rainy day above, than probability that I will put on a coat in a sunny day.

Visually it looks so: one areas increase due to additional probability, and others decrease because this couple of events is improbable.

Impresses, the truth? But such scheme is not really convenient for understanding.

Let's concentrate on one variable – weather. We know probability of what will be day: solar or rainy. In both cases it is possible to consider conditional probabilities. What probability that I will put on a t-shirt if on the street it is solar? What probability that I will put on a coat if it is raining?

The probability that the rain will go makes 25%; that I will put on a coat in rainy weather – 75%. Thus, probability that it is raining, and I in a coat am 25% increased by 75% that makes about 19%. Probability that it is raining, and I in a coat, am equal to probability that it is raining increased by probability that I will put on a coat in rainy weather.

It is one of chances of fundamental identity of probability theory.

We decompose function on work of two multipliers. At first we consider probability that one variable (weather) will accept a certain value. Then we consider probability that other variable (clothes) will accept a certain value, depending on the first variable.

For a start randomly we select a variable. Let's begin with clothes, and then we will consider the weather caused by clothes.

It sounds a little strange as we understand that, from the point of view of relationship of cause and effect, weather influences that I carry and not vice versa … but now it not essentially.

Let's review an example. If we consider accidental day, then probability that I will put on a coat, 38% equal. What probability that the rain will go if I put on a coat?

Most likely, I will put on a coat during a rain, than in a sunny weather, but a rain – an unusual occurrence in California (therefore we will assume that the probability of loss of rainfall is equal to 50%). So, probability that it is raining, and I in a coat, am equal to work of probabilities of the fact that I will put on a coat (38%) and that there will be a rain if I in a coat (50%). These are about 19%.

It is second method to visualize the same distribution of probabilities.

Pay attention that designations make a bit different sense, than on the previous scheme: now the t-shirt and a coat are absolute probabilities (probability to wear certain clothes without weather conditions). Also we see that two designations of probabilities of sunny and rainy weather appeared, depending on that, I put on a t-shirt or a coat.

Perhaps, you heard about a Bayes' theorem. It is possible to use it for transition from one of these methods of display of distribution of probabilities to another.

#### Retreat: Simpson's paradox

Whether these acceptances are useful to visualization of distribution of probabilities? I think and. After a while we will apply them in the information theory for now I like to digress and investigate Simpson's paradox a little. Simpson's paradox – extremely difficult statistical phenomenon which is difficult for understanding at the intuitive level.

Michael Nielsen (Michael Nielsen) wrote the fine article "Reinventing Explanation" ("Reconsideration of explanations") in which considered different methods of an explanation of this paradox. I will try to analyze article, using the acceptances worked by us in the previous section.

Two methods of treatment of stones in kidneys are tested. To a half of patients treatment is assigned And, and to other half V. U treatment patients to whom treatment of B was assigned is assigned, chances to survive were higher, the at those, the assigned treatment And.

Nevertheless, patients with small stones in kidneys have more chances to survive if by it to assign treatment And. Patients with big stones in kidneys have more chances to survive if by it to assign … treatment And! How it is possible?

The essence of a problem is that research was not properly randomizirovano. Most likely, the patients who received treatment And, had big stones in kidneys while the patients receiving treatment of B had small stones in kidneys.

As it became clear, patients with small stones in kidneys have much more chances to survive.

That to understand it better, let's integrate two previous charts. As a result we receive the three-dimensional circuit of probability of a survival which is broken into areas by the size of stones in kidneys.

Now it is visible that treatment And works better, than treatment of B, in cases both with big stones in kidneys, and with small. Treatment of B seemed better only because he was assigned the patient at whom the probability to survive was higher.

#### Codes

Now, when we have methods of visualization of probability, we can plunge into the information theory.

I will tell you about my imagined friend Bob. Bob very much loves animals. He constantly speaks about animals. Actually, he tells only four words: dog, cat, fish and bird.

In spite of the fact that Bob is a fruit of my imagination, few weeks ago he moved to Australia. Besides, he decided that he wants to communicate with me only with the help of binary codes. All (imagined) messages from Bob look as follows:

Bean and I have to create a code for communication and a method to transform words to bit sequence.

To send the message, Bob replaces each character (word) with the corresponding code combination, and then integrates them together to create the coded line.

##### Variable length code

Unfortunately, communication services in the imagined Australia very expensive. I have to pay $5 for each bit of the message from Bob. I mentioned that he likes to chat? That I was not ruined, we decided to find a method to reduce the average length of the message.

It appears, Bob tells all words with a different frequency. Bob very much loves dogs. He tells all the time about dogs. Sometimes he speaks about other animals (especially about a cat whom his dog likes to chase), but generally he speaks about dogs.

Here diagram of frequency of use of words:

Looks it is promising. Code words in our old code occupy 2 bits everyone irrespective of the fact how often they are used.

There is an excellent method to visualize it. On the following chart we will postpone on a vertical axis of p (x) – probability of use of each word, and on horizontal axis L (h) – lengths of the corresponding code combination. Pay attention that the chart occupies the area equal to average length of the code word sent by us, that is 2 bits.

Perhaps, we could create a code of variable length where code combinations for often used words would be short. The problem is that when we do one words shorter, others become longer. Ideally, we need to minimize length of all code words of the message (at least often used).

Thus, often used words (for example, "dog") have short code combinations, and words which are used less often – long code combinations.

Let's represent the received codes on the scheme. Pay attention that often used code word became shorter while others became longer. On the diagram the size of the minimum area decreased to 1 bit, and now the average length of the code word is equal 1,75 bits.

You can ask a question: why not to use unit as the code word? Unfortunately, it would result in uncertainty when decoding. We will talk about it soon.

This code is the best as at the set probabilities it is impossible to receive the average length of a code less than 1,75 bits.

This fundamental restriction. Does not matter what word was sent and what code was selected – the average length of the message will be always equal 1,75 bits.

Irrespective of, our code is thought how well over, it is impossible to make the average length of messages shorter. It is a limit of entropy of distribution, we will discuss this question in more detail soon.

To understand an essence of this limit, it is necessary to understand an optimum ratio of short and long words. When we understand it, we will be able to define the best code.

##### Space of code words

There are two codes 1 bits long: 0 and 1, and four codes 2 bits long: 00, 01, 10 and 11. Each new bit doubles quantity of possible codes.

We are interested in a code of variable length in which one code words are longer than others. In the simplest case we will have eight code words 3 bits long everyone. It is also possible to make more difficult combinations of code words of two words 2 bits long and four words 3 bits long. How to define how many code words of different length we have to take?

I will remind that Bob transfers the messages to the coded lines, consistently replacing each word with the code word.

During creation of a code of variable length it is necessary to be extremely attentive. How to break the coded line back into code words? When all words have identical length, it is possible just to divide a line through each couple of steps, but if length of code words differs, we have to pay attention to its contents.

We want that our code was decoded, and besides only by one method. We do not need uncertainty when decoding a line. It would be simpler if we had the special character "the end of the code word". (But it is awfully inefficient! If we have an additional character in a code which costs at the end of code words, then it is terrible expenditure of resources).

But we use only 0 and 1. We have to have an opportunity to tell where each code word comes to an end.

There is a probability to create a code which is not unambiguously decoded. Let's say that are 0 and 01 two code words. Then it would be unclear what first word is coded in line 0100111. There is a rule which says that one code word should not be the expanded version of other code word. One more rule: the code word should not be a prefix of other code word. It is called property of a prefix, and the code which submits to this property is called a prefix code.

It is necessary to understand that because of one word it is necessary to endow some other combinations from space of code words. If we take the code floor 01, then we will lose an opportunity to use any code words with this prefix. We cannot use more combinations 010 or 011010110 because of uncertainty – they are lost for us.

We lose a quarter of all possible code words as this quarter begins with 01. It is the price which we pay to receive one code word of 2 bits. In turn, it means that all other code words have to be a little longer.

Always there is a choice between length and a variety of code words. Selecting the short code word, we lose a large number of others therefore we should code the remained words longer codes. We have to find an optimum ratio of length and a variety of words.

##### Optimum coding

To receive the short code word, we need to pay, however our means are limited. We pay for one code word of shares of other possible code words.

The cost of purchase of the code word 0 long equals 1 (all possible code words), that is if you want to have the code word 0 long, then you will have no other code words. The cost of the code word 1 long (for example, 0) equals 1/2 because a half of possible code words begins with 0. The cost of the code word 2 long (for example, 01) equals 1/4 because a quarter of all possible code words begins with 01. In general, the cost of code words exponential decreases with increase in length of the code word.

Let's note that if cost decreases on an exponent (with e basis), then it is both height, and the area [figures on the diagram].

(Here I used cunning a little. I used an exponent with the basis 2 (but it is wrong, it is necessary to pass to an exponent on e basis). It simply saves to us time when calculating log(2) in our proof and does it visually more clear).

We need short code words because we want to reduce the average length of messages. Each code word increases the average length of the message as we have to increase its probability by length of the code word.

For example, if we send the code word 4 bits long to 50% of cases, then the average length of the message will grow by 2 bits. It can be presented in the rectangle form.

These two values are connected by length of the code word. Length of the code word is determined by the number of the offered words. Increase in average length of the message depends on length of the added code word.

Together it looks so:

Short code words reduce the average length of the message, but the road. Wide code words increase the average length of the message, but cost cheaper.

How it is better to use our limited budget? How many we have to spend for code words for each event?

Someone invests more in tools which he regularly uses, we will invest in often used code words.

There is a logical method to make it: to distribute our budget according to the frequency of events. Thus, if the event takes place in 50% of cases, we spend 50% of our budget for acquisition of the short code word for it. If the event takes place only in 1% of cases, we will spend 1% of our budget because it in this case is unimportant what length there will be a word.

It is quite logical method, but whether here optimum? Yes, and I will prove it.

*The following proof is evident and has to be available, but is the most difficult part of article for understanding. You as readers can take it on trust and pass to the following section.*

Let's represent a specific example where we need to understand what of two probabilistic events occurred. The event

**of a**takes place in

**p(a)**of cases, the event

**of b**takes place in

**p(b)**of cases. We distribute our budget by the natural method described above, spending

**p(a)**of our budget for receipt of the short code

**word a**, and

**p(b)**on receipt of the short code

**word b**.

Borders of cost and increase in word length ideally matched. Whether it makes some sense?

Let's consider what will happen to the cost and length if we a little change length of the code word. If to increase length of the code word, then length of the message will begin to increase while cost will decrease.

To make the code word an is shorter, it is necessary to spend

**p(a)**. On the other hand, to us length of each specific code word in itself is not so important, we consider it how often we use these or those code words. In case of

**the word a**

**p(a)**will be frequency. Our prize from reduction

**of the word a**will make

**p(a)**.

It is interesting that the adjoining borders on the diagram are equal. It means that our initial budget has interesting property: if there is opportunity, then investment into reduction of length of any code word will yield positive result.

Eventually, we aim at an optimum ratio of benefit and costs – it defines what we have to invest more in. In this case the relation is equal to unit and does not depend on

**the value p(a)**. It is fair also for other events.

There is a sense to invest in any word as the ratio of benefit and costs is always equal to unit.

It is possible to change the budget infinitely therefore we need to find optimum. To make it, let's make investments of another in one code word. If we enclose in

**an**on

**ϵ**more, due to reduction

**of b**, then it does the code

**word a**shorter, and the code

**word b**is longer.

Now we received costs for the short code word for

**an**and costs for

**b**. Now, if we use the same formula, then we will receive as the relation of benefit and costs for

**a**, and for

**b**.

Costs are not balanced any more –

**b**more profitable

**than a**. Investors shout: "To purchase b! To sell a!" If we make it, then as a result we will return to our initial plan – it is so possible to improve all non-optimal budgets.

Our very first option of distribution of means (to invest in each code word it is proportional to the frequency of its use) not only is logical, but also is optimum. Though this proof is provided only for two code words, it is easy to apply it to their bigger quantity.

The attentive reader could notice that for our optimum budget it is possible to consider codes where code words have fractional lengths. It guards. What does it mean? In practice, sending one code word, you should round its length. However, as we will see later, it is possible to send fractional code words if to send a little for once.

*In the second part of the Theory of information visualization in the history with Bob one more character – Alice will appear, and on the example of their "interaction" with the author we will consider questions of calculation of entropy, the phenomenon of cross entropy and entropy of several variables.*

This article is a translation of the original post at habrahabr.ru/post/273631/

If you have any questions regarding the material covered in the article above, please, contact the original author of the post.

If you have any complaints about this article or you want this article to be deleted, please, drop an email here: sysmagazine.com@gmail.com.We believe that the knowledge, which is available at the most popular Russian IT blog habrahabr.ru, should be accessed by everyone, even though it is poorly translated.

Shared knowledge makes the world better.

Best wishes.