Developers Club geek daily blog

3 years, 2 months ago
Fa-a-all in, attention! Aligning data

In modern compilers, loop vectorization is an important and necessary task. In most cases, successful vectorization can significantly increase application performance. There are many ways to achieve it, and even more subtleties connected with actually getting the expected speedup of our application.

Today we will talk about data alignment, its influence on performance and vectorization, and how the compiler deals with it. The concept is covered in great detail in this article, along with a number of other nuances, but here we are interested in the effect of alignment on vectorization. So, if you have read that article or simply know how memory access works, the news that data is read in blocks will not surprise you.

When we operate on array elements (and not only on them), we are in fact constantly working with cache lines of 64 bytes in size. SSE and AVX vectors always fit into one cache line if they are aligned on 16 and 32 bytes, respectively. If our data is not aligned, we will quite possibly have to load one more "additional" cache line. This process strongly affects performance, and if on top of that we access array cells, and hence memory, non-sequentially, things can be even worse.
Besides, instructions themselves can use aligned or unaligned data access. If we see the letter u (unaligned) in an instruction name, it is most likely an unaligned read or write instruction, for example vmovupd. It should be noted that since the Nehalem architecture the speed of these instructions has become comparable to the aligned ones, provided the data is actually aligned. On older generations that is not the case.

The compiler can actively help us in the fight for performance. For example, it can try to split a 128-bit unaligned load into two 64-bit ones, which will be better, but still slow. Another good solution the compiler can implement is to generate different versions for the aligned and unaligned cases. At runtime it is determined what kind of data we have, and execution follows the appropriate version. The only problem is that the overhead of such checks may be too great, and the compiler will give up on this idea. Even better, the compiler may be able to align the data for us itself. By the way, if at vectorization the data is not aligned, or the compiler knows nothing about its alignment, the initial loop is split into three parts:
  • a number of iterations (always less than the vector length) before the main "kernel" (the peel loop), which the compiler can use to align the starting address. Peeling can be disabled with the option mP2OPT_vec_alignment=6.
  • the main body of the loop (the kernel loop), for which aligned vector instructions are generated
  • the "tail" (remainder loop), which remains because the number of iterations is not divisible by the vector length; it can also be vectorized, but not as efficiently as the main loop. If we want to disable vectorization of the loop remainder, we use the directive #pragma vector novecremainder in C/C++ or !DIR$ vector noremainder in Fortran.

So, alignment of the starting address can be achieved at the cost of some speed: we have to "tread through" the peel loop, executing a certain number of iterations before the main kernel of the loop. But this can be avoided by aligning the data and telling the compiler about it.

Developers should make it a rule to align data as required: on 16 bytes for SSE, 32 for AVX and 64 for MIC and AVX-512. How can this be done?

To allocate aligned memory on the heap in C/C++, the following function is used:

void* _mm_malloc(size_t size, size_t align)

On Linux there is also the function:

int posix_memalign(void **p, size_t alignment, size_t size)

For variables on the stack, the __declspec attribute is used:

__declspec(align(base)) <var>

Or the Linux-specific form:

<var> __attribute__((aligned(base)))

The problem is that __declspec is unknown to gcc, so portability suffers; therefore it is worth using the preprocessor:

#ifdef __GNUC__
#define _ALIGN(N)  __attribute__((aligned(N)))
#else
#define _ALIGN(N)  __declspec(align(N))
#endif

_ALIGN(16) int foo[4];  

Interestingly, the Fortran compiler from Intel (version 13.0 and above) has a special option -align, with the help of which data can be aligned (at declaration). For example, with -align array32byte we tell the compiler that all arrays should be aligned on 32 bytes. There is also a directive:

 !DIR$ ATTRIBUTES ALIGN: base :: variable

Now about the instructions. When working with unaligned data, unaligned read and write instructions are very slow, except for vector SSE operations on Sandy Bridge and newer. There, they can be no slower than instructions with aligned access, provided a number of conditions are met. Unaligned AVX vector instructions working with unaligned data are slower than similar instructions working with aligned data, even on the latest generations of processors.

Therefore the compiler prefers to generate unaligned instructions for AVX: if the data is aligned they will work just as fast, and if the data is not aligned, execution will be slower, but it will work. If aligned instructions are generated and the data turns out not to be aligned, everything will simply crash.

We can suggest to the compiler which instruction set to use through the directive #pragma vector unaligned/aligned.

For example, consider this code:

void mult(double* a, double* b, double* c) {
  int i;
#pragma vector unaligned
  for (i = 0; i < N; i++)
    c[i] = a[i] * b[i];
}

For it, when using AVX instructions, we get the following assembly code:

  vmovupd   (%rdi,%rax,8), %xmm0
  vmovupd   (%rsi,%rax,8), %xmm1
  vinsertf128 $1, 16(%rsi,%rax,8), %ymm1, %ymm3
  vinsertf128 $1, 16(%rdi,%rax,8), %ymm0, %ymm2
  vmulpd    %ymm3, %ymm2, %ymm4
  vmovupd   %xmm4, (%rdx,%rax,8)
  vextractf128 $1, %ymm4, 16(%rdx,%rax,8)
  addq      $4, %rax
  cmpq      $1000000, %rax
  jb        ..B2.2

It should be noted that in this case there will be no peel loop, because we used the directive.
If we replace unaligned with aligned, thereby guaranteeing to the compiler that the data is aligned and that it is safe to generate the corresponding aligned instructions, we get the following:

  vmovupd   (%rdi,%rax,8), %ymm0
  vmulpd    (%rsi,%rax,8), %ymm0, %ymm1
  vmovntpd  %ymm1, (%rdx,%rax,8)
  addq      $4, %rax
  cmpq      $1000000, %rax
  jb        ..B2.2

The last case will work faster provided a, b and c are aligned. If not, everything will be bad. In the first case we get a slower implementation even with aligned data, because the compiler had no opportunity to use vmovntpd, and there was an additional vextractf128 instruction.

One more important point is the distinction between alignment of the starting address and relative alignment. Let's review the following example:

void matvec(double a[][COLWIDTH], double b[], double c[]) {
  int i, j;
  for (i = 0; i < size1; i++) {
    b[i] = 0;
#pragma vector aligned
    for (j = 0; j < size2; j++)
      b[i] += a[i][j] * c[j];
  }
}

The only question here is: will this code work, provided that a, b and c are aligned on 16 bytes and we build our code using SSE? The answer depends on the value of COLWIDTH. If the row length is odd (SSE register length / size of double = 2, which means COLWIDTH has to be divisible by 2), our application will finish execution much earlier than expected (crashing after the pass over the first row of the array). The reason is that the first data element of the second row is unaligned. For such cases, dummy elements ("holes") need to be added at the end of each row so that the next row is aligned, doing so-called padding. Here we can do it by choosing COLWIDTH according to the vector instruction set and the data type we are going to use. As was already said, for SSE it has to be an even number, and for AVX it has to be divisible by 4.
If we know that only the starting address is aligned, we can give this information to the compiler through the attribute:

__assume_aligned(<array>, base)

The analog for Fortran:
!DIR$ ASSUME_ALIGNED address1:base [, address2:base] ...

I played a little with a simple matrix multiplication example on Haswell, to compare the application's speed with AVX instructions on Windows depending on the directives in the code:

  for (j = 0;j < size2; j++) {
    b[i] += a[i][j] * x[j];

Data aligned on 32 bytes:
__declspec(align(32)) FTYPE a[ROW][COLWIDTH];
__declspec(align(32)) FTYPE b[ROW];
__declspec(align(32)) FTYPE x[COLWIDTH];

This little sample ships with the Intel compiler's samples, and the full code can be found there. So: with the directive #pragma vector aligned before the loop, the loop runtime was 2.531 seconds. Without it, the runtime increased to 3.466 seconds and a peel loop appeared; apparently the compiler did not learn that the data was aligned. Having disabled peel generation with mP2OPT_vec_alignment=6, the loop executed in nearly 4 seconds. Interestingly, it turned out to be very hard to "deceive" the compiler in this example, because it persistently generated a runtime check of the data and several versions of the loop, so the speed with unaligned data was only slightly worse.

The bottom line: by aligning your data, you almost always get rid of potential problems, in particular with performance. But aligning the data by itself is not enough; you need to tell the compiler what you know, and then you can get the most efficient application as output. The main thing is not to forget the small tricks!

This article is a translation of the original post at
If you have any questions regarding the material covered in the article above, please, contact the original author of the post.
If you have any complaints about this article or you want this article to be deleted, please, drop an email here:

We believe that the knowledge, which is available at the most popular Russian IT blog, should be accessed by everyone, even though it is poorly translated.
Shared knowledge makes the world better.
Best wishes.
