Developers Club geek daily blog

2 years, 11 months ago
We mix colors correctly or we optimize AlphaBlendI write multilegal (but not multiplatform, alas, now only windows) a messenger which meanwhile supports only the TOX protocol. But the speech not about a messenger, and about its interface, and if more precisely, about its basic function — AlphaBlend. Yes, I decided to write the GUI bicycle. And what modern GUI without translucent elements and smooth curves? Therefore sharply there was a need to mix images taking into account translucency, i.e. alpha mixing or alpha blending. Fortunately, in the Windows GDI such function is available — AlphaBlend. Works as it is necessary, does what is necessary. But I that still the builder of bicycles, and it became interesting to me whether I will be able to write the same function, but faster. Result of my works under a cat.

Theory mixing alpha

Most likely you know this theory therefore I will not paint it in detail, I will note only highlights.

So, we have 2 pixels — initial and pixel of assignment. It is necessary to mix and receive them new pixel of assignment. Each pixel is provided by 4 bytes of A,R,G,B where A — value (not) of transparency of pixel (0 — completely transparent, 255 — completely opaque), RGB — color components. Classical formula of mixing is as follows:


Important point! Unit is in a formula. In life at us unit is supported by value 255. I.e. to apply a formula, we have to separate previously value of each byte on 255. As it is not difficult to notice, 255 and 256 — quite close values, and division on 256 is only the right shift on 8 bits. Therefore such simplification often meets: instead of operation

(X) * (A/255.0)
do the following:
 (X * A) >> 8

It works not bad (and, above all, much quicker than fair division), but, in a case a mixing alpha result not quite correct, namely, the resulting pixel turns out a little more darkly. Further, I will show how it is possible to execute calculations precisely and without loss in work speed.

Other important point! Look at a formula. In the second part there is SRC_COLOR * SRC_ALPHA. 3D accelerators execute such multiplication in millions and even in billions, without batting an eyelid. But we that try to solve a problem, using the central processor, and excess multiplication (more precisely than 4 excess multiplication) on each pixel — it is not really good. Why superfluous? Yes because this multiplication can be made in advance, having transformed the initial picture. Such images even have a name: premultiplied. I do not know the term in Russian, but having translated literally we will receive "preincreased". And precisely, the AlphaBlend function demands GDI as the source image strictly of premultiplied. It is reasonable.

Well, with the theory finished. In practice we will work with 32-bit color. One pixel is provided by 32-bit number in which 4 bytes, since younger, mean: B(lue), G(reen), R(ed), A(lpha). Went.

First implementation

My first implementation was such:

uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
    uint8 ba = ALPHA(src);				// функция ALPHA возвращает старший байт 32-битного числа
    if (ba == 0) return dst;				// альфа == 0, значит другие компоненты тоже == 0, значит делать ничего не надо
    float a = (float)((double)(ba)* (1.0 / 255.0));	// просто я люблю все делать в даблах :)
    float not_a = 1.0f - a;				// но иногда признаю: флоатов достаточно

    uint B = lround(float(BLUE(dst)) * not_a) + BLUE(src);	// функция BLUE возвращает 0-й байт 32-битного числа
    uint G = lround(float(GREEN(dst)) * not_a) + GREEN(src);	// функция GREEN возвращает 1-й байт 32-битного числа
    uint R = lround(float(RED(dst)) * not_a) + RED(src);	// функция RED возвращает 2-й байт 32-битного числа
    uint A = lround(float(ALPHA(dst)) * not_a) + ALPHA(src);

    return B | (G << 8) | (R << 16) | (A << 24);	// собрали 32-битный пиксель из запчастей

He agrees, looks not really. 4 material (more precisely 5) multiplication and 4 roundings on each pixel are too. It is no wonder that on speed this monster lost Alphablend'U approximately by 7 times.

Let's try to improve. Let's get rid of material umnozheniye.

uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
    uint not_a = 256 - ALPHA(src);
    return = src + (((not_a * BLUEx256(dst))>>16) |
                   (((not_a * GREENx256(dst))>>8) &0xff00) |
                   (((not_a * REDx256(dst))) &0xff0000) |
                   (((not_a * ALPHAx256(dst))<<8) &0xff000000));

There are functions BLUEx256, GREENx256, etc. return corresponding to a component, shifted to the left on 8 bits, i.e. increased on 256.

This function is remarkable the fact that in it there is compensation of replacement of division into the 255th with shift on 8 bits to the right. Noticed? If is not present, undergo, below I will describe this moment in more detail.

On speed this implementation concedes to Alphablend'U approximately by 3 times. Already best of all, but it is still very far from an ideal.

Unexpected result

How it is possible to improve the previous function? It seems, we made everything that is possible. However, at me it turned out to improve this function by method which became for me a surprise. I also tried it just to be convinced that nothing will turn out. However, it turned out.
That if to take out operation of multiplication of byte by byte in the table. It will turn out not really much — only 65536 bytes. Kopeks.

We get such plate:

uint8 __declspec(align(256)) multbl[256][256];

We fill:

for (int i = 0; i < 256; ++i)
    for (int j = 0; j < 256; ++j) {
        int k = i * j / 255;
        multbl[i][j] = (uint8)k;

We try:

uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
    uint8 not_a = 255 - ALPHA(src);
    return src + ((multbl[not_a][dst &0xff]) |
                (((uint)multbl[not_a][(dst >> 8) &0xff]) << 8) |
                (((uint)multbl[not_a][(dst >> 16) &0xff]) << 16) |
                (((uint)multbl[not_a][(dst >> 24) &0xff]) << 24));

Surprisingly, but this function worked one and a half times quicker than the previous implementation. The truth is one subtlety — the compiler (in my case it was msvc 2013) very competently worked in operations of work with memory. When I tried to write this function on the naked assembler, having made as it seemed to me, everything is much better than the optimizer, I received function which worked twice more slowly than this. It was the failure. I did not begin to understand what specifically I was mistaken in — obviously I did not manage to parallelize competently all operations — I just left this function on a payoff to the optimizer.

So. There is nothing to optimize more. Nothing comes to my mind more. But AlphaBlend still quicker, time in two. How they achieved it? It seems, it is time on pension?

About compensation of replacement of division into the 255th with shift

There are many methods of fast division on 255. To me such met:

X/255 == (X+1+(X>>8)) >> 8

It is quite good. It quicker than fair division on 255. But all this is still too bulky. I long thought how quickly to separate on 255 and not to lose neither in quality nor in speed. How to compensate degradation of color when using shift?

Let's say at us is color a component, equal 0xff (255) and we have another of a component, equal 0xff (255) too. Multiplying them, we receive:

0xff * 0xff = 0xfe01. Having shifted on 8 bits to the right, we will receive 0xfe — brightness components is reduced. Badly.
And what if we increase one of a component on 1 before multiplication?
0xff * 0x100 = 0xff00. Hm, it seems it. Let's check a case when one of a component is equal to 0:
0xff * 1 = 0x00ff, we shift to the right on 8 bits, we receive 0. Voila! At others znacheny a component the result will also be correct.
Now it is easy to find the place of compensation in the second function: uint not_a = 256 — ALPHA(src);
Not 255 — A, but 256 — A, i.e. +1 to a component before multiplication. For a tabular method of multiplication compensation is not required since in the table all values and are so counted as it is necessary.

Heavy artillery — instructions of SSSE3

It is time to think of optimization with use of simd. Speak, the compiler Intel is able to do it and without participation of the person. Perhaps. But gnaws me doubts that Intel will master Alphablend'om. Well at most — it will be made even to it. But I need to make that quicker. We open the reference book and forward.

The first question by which it is necessary to be set — to do under what instructions to optimization? I have a suspicion that AlphaBlend is optimized under MMX, otherwise I cannot explain it superiority over pure x86 with implementation. MMX — it is good, but it is the last century. Now it is difficult to find the computer where there would be no support of SSE4. And under SSE in general it is possible to optimize, even without troubling itself check on existence of support of these instructions — probability that your program will be started on something below Pentium 3 it is close to zero. I, of course, talk about desktop applications. Exotic outside this article.

I stopped the choice on SSSE3. This set of instructions is rather widespread to bother optimization under it, considering existence in it very much of convenient instructions.

The most useful instruction which will form the basis of all optimization is pshufb (intrinsik _mm_shuffle_epi8). For the sake of it SSSE3 is also selected. In what its force? That this instruction allows to scatter bytes of the initial 16-byte register in any random order or in general to throw out these bytes as superfluous. I.e. I can by means of this instruction in one movement prepare all necessary for the necessary calculations. Other important instruction — pmulhuw (intrinsik _mm_mulhi_epu16) is 8 umnozheniye and 8 right shifts on 16 bits. As though especially for operation a mixing alpha. I.e. one this command I actually calculate 2 pixels at once.

Well well, went:

Code asm sheet
	lddqu	xmm5, [eax]	; загрузим в xmm5 16 байт, или 4 пикселя накладываемой premultiplied картинки
	movdqa	xmm6, xmm5	; сохраним в xmm6 для работы со старшими 2-мя пикселями

	; первый шаг: поготовка
	; просто расширяем все компоненты входных пикселей до 16 бит на компоненту

	pshufb	xmm5, preparesrcs_1
	pshufb	xmm6, preparesrcs_2

	; готово
	; xmm5 содержит первые 2 пикселя, где каждой компоненте отведено 16 бит
	; xmm6 содержит тоже для следующих 2 пикселей

	; второй шаг: обработаем 2 из 4 пикселей
        ; но сначала нам нужно получить 8 16-битных значений (256-A)
        ; получим их в xmm7

	movdqa	xmm2, xmm5		; скопируем первые 2 пикселя в xmm2
	pshufb	xmm2, preparealphas	; этой командой мы поставим альфа компоненты в нужные позиции: A0 A0 A0 A0 A1 A1 A1 A1
	movdqa	xmm7, sub256	; в xmm7 загрузим 8 16-битных значений 256
	psubw	xmm7, xmm2	; вычтем альфа компоненты

	movdqu	xmm0, [edx]	; 4 пикселя назначения
	movdqa	xmm1, xmm0	; скопируем в xmm1 для шага 3
	pshufb	xmm0, preparetgtc_1	; тут получаем 2 пикселя с 16-битными компонентами, сдвинутыми влево на 8
	pmulhuw	xmm0, xmm7	; умножаем и получаем готовые 2 пикселя с 16-битными компонентами
	paddw	xmm0, xmm5	; осталось добавить к ним предумноженный цвет исходного пикселя
	pshufb	xmm0, packcback_1		; и упаковать в младшие 8 байт xmm0

	; шаг три - тоже самое, что и шаг 2, но для следующих двух пикселей

	movdqa	xmm2, xmm6
	pshufb	xmm2, xmm3
	movdqa	xmm7, xmm4
	psubw	xmm7, xmm2

	pshufb	xmm1, preparetgtc_2
	pmulhuw	xmm1, xmm7
	paddw	xmm1, xmm6
	pshufb	xmm1, packcback_2

	por	xmm0, xmm1	; итого в xmm0 4 готовых пикселя

	movdqu	[edx], xmm0	; которые мы просто кладем обратно в память

Apparently, simd implementation mixes 4 initial pixels with 4 pixels of assignment at once. Well on that it and simd. A framework of this article I will leave off-screen the description of a solution when it is required to mix not multiple the 4th the number of pixels. Personally I apply for this purpose "one-pixel" challenges of c ++ implementations.


As a result this ssse3 implementation works nearly 4 times quicker (at 3.78 at my iron), than the AlphaBlend function. It is very quite good result. Many programmers (and I including) are skeptical about similar "bicycles". As a rule, the result turns out obviously worse, than work of group of high quality specialists. I undertook writing of the implementation of AlphaBlend of function without believing that I will be able to win against children from Microsoft. It was just sports interest which, nevertheless, yielded result.

But also it not all. The matter is that I provided a code of a simple case in this article — when the initial picture mixes up from a resultant as is. But if you read documentation to the AlphaBlend function, then could notice that this function is able to do additional multiplication by a constant alpha (it is transmitted through parameters). I wrote to ssse3 implementation and for this case. The result is interesting: AlphaBlend works almost twice more slowly if the constant alpha is not equal 255, i.e. additional multiplication of color is required. My implementation degrades in speed for only 4% that too profitable distinguishes it from creation of Microsoft.


The code is given in article only for acquaintance with the principle of ssse3 of optimization. I did not begin to give value of the used constants here. If you want to use the optimized AlphaBlend in the project, you should get a working code directly from source texts of Isotoxin'a (my development so is called).

Isotoxin'a repository on a gitkhaba.
Directly the file in which there is the necessary function here.

I apologize that did not prepare working examples and did not take out everything in separate library. If you really need this function, and you experience difficulties in independently to get it from my source codes, write me the personal message and I will in detail tell you as to make it.

This article is a translation of the original post at
If you have any questions regarding the material covered in the article above, please, contact the original author of the post.
If you have any complaints about this article or you want this article to be deleted, please, drop an email here:

We believe that the knowledge, which is available at the most popular Russian IT blog, should be accessed by everyone, even though it is poorly translated.
Shared knowledge makes the world better.
Best wishes.

comments powered by Disqus