#### Theory mixing alpha

Most likely you know this theory therefore I will not paint it in detail, I will note only highlights.

So, we have 2 pixels — initial and pixel of assignment. It is necessary to mix and receive them new pixel of assignment. Each pixel is provided by 4 bytes of A,R,G,B where A — value (not) of transparency of pixel (0 — completely transparent, 255 — completely opaque), RGB — color components. Classical formula of mixing is as follows:

```
TGT_COLOR = TGT_COLOR * (1 - SRC_ALPHA) + SRC_COLOR * SRC_ALPHA
```

Important point! Unit is in a formula. In life at us unit is supported by value 255. I.e. to apply a formula, we have to separate previously value of each byte on 255. As it is not difficult to notice, 255 and 256 — quite close values, and division on 256 is only the right shift on 8 bits. Therefore such simplification often meets: instead of operation

`(X) * (A/255.0)`

do the following:```
(X * A) >> 8
```

It works not bad (and, above all, much quicker than fair division), but, in a case a mixing alpha result not quite correct, namely, the resulting pixel turns out a little more darkly. Further, I will show how it is possible to execute calculations precisely and without loss in work speed.

Other important point! Look at a formula. In the second part there is SRC_COLOR * SRC_ALPHA. 3D accelerators execute such multiplication in millions and even in billions, without batting an eyelid. But we that try to solve a problem, using the central processor, and excess multiplication (more precisely than 4 excess multiplication) on each pixel — it is not really good. Why superfluous? Yes because this multiplication can be made in advance, having transformed the initial picture. Such images even have a name:

*premultiplied*. I do not know the term in Russian, but having translated literally we will receive "preincreased". And precisely, the AlphaBlend function demands GDI as the source image strictly of premultiplied. It is reasonable.

Well, with the theory finished. In practice we will work with 32-bit color. One pixel is provided by 32-bit number in which 4 bytes, since younger, mean: B(lue), G(reen), R(ed), A(lpha). Went.

#### First implementation

My first implementation was such:

```
uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
{
uint8 ba = ALPHA(src); // функция ALPHA возвращает старший байт 32-битного числа
if (ba == 0) return dst; // альфа == 0, значит другие компоненты тоже == 0, значит делать ничего не надо
float a = (float)((double)(ba)* (1.0 / 255.0)); // просто я люблю все делать в даблах :)
float not_a = 1.0f - a; // но иногда признаю: флоатов достаточно
uint B = lround(float(BLUE(dst)) * not_a) + BLUE(src); // функция BLUE возвращает 0-й байт 32-битного числа
uint G = lround(float(GREEN(dst)) * not_a) + GREEN(src); // функция GREEN возвращает 1-й байт 32-битного числа
uint R = lround(float(RED(dst)) * not_a) + RED(src); // функция RED возвращает 2-й байт 32-битного числа
uint A = lround(float(ALPHA(dst)) * not_a) + ALPHA(src);
return B | (G << 8) | (R << 16) | (A << 24); // собрали 32-битный пиксель из запчастей
}
```

He agrees, looks not really. 4 material (more precisely 5) multiplication and 4 roundings on each pixel are too. It is no wonder that on speed this monster lost Alphablend'U approximately by 7 times.

Let's try to improve. Let's get rid of material umnozheniye.

```
uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
{
uint not_a = 256 - ALPHA(src);
return = src + (((not_a * BLUEx256(dst))>>16) |
(((not_a * GREENx256(dst))>>8) &0xff00) |
(((not_a * REDx256(dst))) &0xff0000) |
(((not_a * ALPHAx256(dst))<<8) &0xff000000));
}
```

There are functions BLUEx256, GREENx256, etc. return corresponding to a component, shifted to the left on 8 bits, i.e. increased on 256.

This function is remarkable the fact that in it there is compensation of replacement of division into the 255th with shift on 8 bits to the right. Noticed? If is not present, undergo, below I will describe this moment in more detail.

On speed this implementation concedes to Alphablend'U approximately by 3 times. Already best of all, but it is still very far from an ideal.

#### Unexpected result

How it is possible to improve the previous function? It seems, we made everything that is possible. However, at me it turned out to improve this function by method which became for me a surprise. I also tried it just to be convinced that nothing will turn out. However, it turned out.

That if to take out operation of multiplication of byte by byte in the table. It will turn out not really much — only 65536 bytes. Kopeks.

We get such plate:

```
uint8 __declspec(align(256)) multbl[256][256];
```

We fill:

```
for (int i = 0; i < 256; ++i)
for (int j = 0; j < 256; ++j) {
int k = i * j / 255;
multbl[i][j] = (uint8)k;
}
```

We try:

```
uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
{
uint8 not_a = 255 - ALPHA(src);
return src + ((multbl[not_a][dst &0xff]) |
(((uint)multbl[not_a][(dst >> 8) &0xff]) << 8) |
(((uint)multbl[not_a][(dst >> 16) &0xff]) << 16) |
(((uint)multbl[not_a][(dst >> 24) &0xff]) << 24));
}
```

Surprisingly, but this function worked one and a half times quicker than the previous implementation. The truth is one subtlety — the compiler (in my case it was msvc 2013) very competently worked in operations of work with memory. When I tried to write this function on the naked assembler, having made as it seemed to me, everything is much better than the optimizer, I received function which worked twice more slowly than this. It was the failure. I did not begin to understand what specifically I was mistaken in — obviously I did not manage to parallelize competently all operations — I just left this function on a payoff to the optimizer.

So. There is nothing to optimize more. Nothing comes to my mind more. But AlphaBlend still quicker, time in two. How they achieved it? It seems, it is time on pension?

#### About compensation of replacement of division into the 255th with shift

There are many methods of fast division on 255. To me such met:

```
X/255 == (X+1+(X>>8)) >> 8
```

It is quite good. It quicker than fair division on 255. But all this is still too bulky. I long thought how quickly to separate on 255 and not to lose neither in quality nor in speed. How to compensate degradation of color when using shift?

Let's say at us is color a component, equal

**0xff**(255) and we have another of a component, equal

**0xff**(255) too. Multiplying them, we receive:

**0xff * 0xff = 0xfe01**. Having shifted on 8 bits to the right, we will receive

**0xfe**— brightness components is reduced. Badly.

And what if we increase one of a component on 1 before multiplication?

**0xff * 0x100 = 0xff00**. Hm, it seems it. Let's check a case when one of a component is equal to 0:

**0xff * 1 = 0x00ff**, we shift to the right on 8 bits, we receive 0. Voila! At others znacheny a component the result will also be correct.

Now it is easy to find the place of compensation in the second function: uint not_a = 256 — ALPHA(src);

Not 255 — A, but 256 — A, i.e. +1 to a component before multiplication. For a tabular method of multiplication compensation is not required since in the table all values and are so counted as it is necessary.

#### Heavy artillery — instructions of SSSE3

It is time to think of optimization with use of simd. Speak, the compiler Intel is able to do it and without participation of the person. Perhaps. But gnaws me doubts that Intel will master Alphablend'om. Well at most — it will be made even to it. But I need to make that quicker. We open the reference book and forward.

The first question by which it is necessary to be set — to do under what instructions to optimization? I have a suspicion that AlphaBlend is optimized under MMX, otherwise I cannot explain it superiority over pure x86 with implementation. MMX — it is good, but it is the last century. Now it is difficult to find the computer where there would be no support of SSE4. And under SSE in general it is possible to optimize, even without troubling itself check on existence of support of these instructions — probability that your program will be started on something below Pentium 3 it is close to zero. I, of course, talk about desktop applications. Exotic outside this article.

I stopped the choice on SSSE3. This set of instructions is rather widespread to bother optimization under it, considering existence in it very much of convenient instructions.

The most useful instruction which will form the basis of all optimization is

**pshufb**(intrinsik

**_mm_shuffle_epi8**). For the sake of it SSSE3 is also selected. In what its force? That this instruction allows to scatter bytes of the initial 16-byte register in any random order or in general to throw out these bytes as superfluous. I.e. I can by means of this instruction in one movement prepare all necessary for the necessary calculations. Other important instruction —

**pmulhuw**(intrinsik

**_mm_mulhi_epu16**) is 8 umnozheniye and 8 right shifts on 16 bits. As though especially for operation a mixing alpha. I.e. one this command I actually calculate 2 pixels at once.

Well well, went:

**Code asm sheet**

```
lddqu xmm5, [eax] ; загрузим в xmm5 16 байт, или 4 пикселя накладываемой premultiplied картинки
movdqa xmm6, xmm5 ; сохраним в xmm6 для работы со старшими 2-мя пикселями
; первый шаг: поготовка
; просто расширяем все компоненты входных пикселей до 16 бит на компоненту
pshufb xmm5, preparesrcs_1
pshufb xmm6, preparesrcs_2
; готово
; xmm5 содержит первые 2 пикселя, где каждой компоненте отведено 16 бит
; xmm6 содержит тоже для следующих 2 пикселей
; второй шаг: обработаем 2 из 4 пикселей
; но сначала нам нужно получить 8 16-битных значений (256-A)
; получим их в xmm7
movdqa xmm2, xmm5 ; скопируем первые 2 пикселя в xmm2
pshufb xmm2, preparealphas ; этой командой мы поставим альфа компоненты в нужные позиции: A0 A0 A0 A0 A1 A1 A1 A1
movdqa xmm7, sub256 ; в xmm7 загрузим 8 16-битных значений 256
psubw xmm7, xmm2 ; вычтем альфа компоненты
movdqu xmm0, [edx] ; 4 пикселя назначения
movdqa xmm1, xmm0 ; скопируем в xmm1 для шага 3
pshufb xmm0, preparetgtc_1 ; тут получаем 2 пикселя с 16-битными компонентами, сдвинутыми влево на 8
pmulhuw xmm0, xmm7 ; умножаем и получаем готовые 2 пикселя с 16-битными компонентами
paddw xmm0, xmm5 ; осталось добавить к ним предумноженный цвет исходного пикселя
pshufb xmm0, packcback_1 ; и упаковать в младшие 8 байт xmm0
; шаг три - тоже самое, что и шаг 2, но для следующих двух пикселей
movdqa xmm2, xmm6
pshufb xmm2, xmm3
movdqa xmm7, xmm4
psubw xmm7, xmm2
pshufb xmm1, preparetgtc_2
pmulhuw xmm1, xmm7
paddw xmm1, xmm6
pshufb xmm1, packcback_2
por xmm0, xmm1 ; итого в xmm0 4 готовых пикселя
movdqu [edx], xmm0 ; которые мы просто кладем обратно в память
```

Apparently, simd implementation mixes 4 initial pixels with 4 pixels of assignment at once. Well on that it and simd. A framework of this article I will leave

#### Results

As a result this ssse3 implementation works nearly 4 times quicker (at 3.78 at my iron), than the AlphaBlend function. It is very quite good result. Many programmers (and I including) are skeptical about similar "bicycles". As a rule, the result turns out obviously worse, than work of group of high quality specialists. I undertook writing of the implementation of AlphaBlend of function without believing that I will be able to win against children from Microsoft. It was just sports interest which, nevertheless, yielded result.

But also it not all. The matter is that I provided a code of a simple case in this article — when the initial picture mixes up from a resultant as is. But if you read documentation to the AlphaBlend function, then could notice that this function is able to do additional multiplication by a constant alpha (it is transmitted through parameters). I wrote to ssse3 implementation and for this case. The result is interesting: AlphaBlend works almost twice more slowly if the constant alpha is not equal 255, i.e. additional multiplication of color is required. My implementation degrades in speed for only 4% that too profitable distinguishes it from creation of Microsoft.

#### Links

The code is given in article only for acquaintance with the principle of ssse3 of optimization. I did not begin to give value of the used constants here. If you want to use the optimized AlphaBlend in the project, you should get a working code directly from source texts of Isotoxin'a (my development so is called).

Isotoxin'a repository on a gitkhaba.

Directly the file in which there is the necessary function here.

I apologize that did not prepare working examples and did not take out everything in separate library. If you really need this function, and you experience difficulties in independently to get it from my source codes, write me the personal message and I will in detail tell you as to make it.

This article is a translation of the original post at habrahabr.ru/post/273109/

If you have any questions regarding the material covered in the article above, please, contact the original author of the post.

If you have any complaints about this article or you want this article to be deleted, please, drop an email here: sysmagazine.com@gmail.com.We believe that the knowledge, which is available at the most popular Russian IT blog habrahabr.ru, should be accessed by everyone, even though it is poorly translated.

Shared knowledge makes the world better.

Best wishes.