Developers Club geek daily blog

2 years ago

Now the huge number of tasks demands the big performance of systems. Infinitely physical restrictions do not allow to increase the number of transistors on a processor crystal. The geometrical sizes of transistors cannot be reduced physically as when exceeding perhaps admissible sizes the phenomena which are not noticeable at the big sizes of active elements begin to be shown — quantum size effects begin to affect strongly. Transistors begin to work not as transistors.
And Moore's law here at anything. It was and remains the law of cost, and increase in number of transistors on a crystal is rather consequence from the law. Thus, to increase the power of computer systems it is necessary to look for other methods. This use of multiprocessors, multicomputers. Such approach is characterized by a large number of processor elements that execution of subtasks on each computing device brings to independent.

Parallel methods of processing:

Parallelism source Acceleration Effort of the programmer Popularity
Set of kernels 2kh-128kh Moderate High
Set of machines 1kh-Beskonechnost Moderate and High High
Vectorization 2kh-8kh Moderate Low
Graphics adapters 128kh-2048kh High Low
Coprocessor 40kh-80kh Moderate and high Extremely - low

There is a lot of methods for increase of efficiency of systems and all are quite different. One of such methods is use of vector processors which many times increase the speed of calculations. Unlike scalar processors which process one data item for one instruction (SISD) vector processors are capable to process several data items (SIMD) for one instruction. The majority of modern processors are scalar. But many problems which they solve demand large volume of calculations: processing of video, sound, work with graphics, scientific computations and many other things. For acceleration of process of calculations vendors of processors began to build in additional stream SIMD expansions the devices.
Respectively at a certain approach of programming use of vector data handling in the processor became possible. The existing expansions: MMX, SSE and AVX. They allow to use additional opportunities of the processor for the accelerated processing of data bulks. At the same time vectorization allows to achieve acceleration without explicit parallelism. I.e. it is from the point of view of data handling, but from the point of view of the programmer it does not demand any costs for development of special algorithms for prevention of a status of race or synchronization, and style of development does not differ from synchronous. We receive acceleration with little effort, almost absolutely free of charge. And in it there is no magic.

What is SSE?

SSE (English Streaming SIMD Extensions, stream SIMD expansion of the processor) is SIMD (English Single Instruction, Multiple Data, One instruction — a set of data) a set of instructions. SSE includes eight 128-bit registers and a set of instructions in architecture of the processor. The SSE technology was introduced in Pentium III in 1999 for the first time. Over time, this set of instructions was improved by adding of more difficult operations. Eight (to x86-64 — sixteen) 128-bit registers were added to the processor: from xmm0 to xmm7.
Initially these registers could be used only for single precision of calculations (i.e. for the float type). But after SSE2 output, these registers can be used for any primitive data type. Considering the standard 32-bit machine thus, we can store and process in parallel:
  • 2 double
  • 2 long
  • 4 float
  • 4 int
  • 8 short
  • 16 char

If to use the AVX technology, then you will manipulate already 256-bit registers, respectively there are more numbers for one instruction. So there are already also 512-bit registers.
At first on an example With ++ (to whom it is not interesting, you can pass) we will write the program which will sum up two arrays from 8 float elements.

Example of vectorization on With ++

The SSE technology in With ++ is implemented by low-level instructions, presented in the pseudo-code form which reflect commands in the assembler. So, for example, command __ m128 _mm_add_ps (__ m128 a, __ m128 b); it will be transformed to the instruction of the ADDPS assembler operand1, operand2. Respectively command __ m128 _mm_add_ss (__ m128 a, __ m128 b); it will be transformed to the instruction of ADDSS operand1, operand2. These two commands do almost identical operations: put array cells, but it is a little differently. _mm_add_ps puts completely the register with the register so:
  • r0: = a0 + b0
  • r1: = a1 + b1
  • r2: = a2 + b2
  • r3: = a3 + b3

At the same time all register __ m128 is also the r0-r3 set. And here the _mm_add_ss command puts only part of the register so:
  • r0: = a0 + b0
  • r1: = a1
  • r2: = a2
  • r3: = a3

By the same principle also other commands, such as subtraction, division, the square root, at least, at most and other operations are arranged.
For writing of the program it is possible to manipulate 128-bit registers like __ m128 for float, __ m128d for double and __ m128i for int, short, char. At the same time it is possible not to use arrays like __ m128, and to use the provided pointers of an array of float to type __ m128*.
At the same time it is necessary to consider several conditions for work:
  • These float loaded and which are stored in __ m128 object have to have 16-byte alignment
  • Some embedded functions demand that their argument was like constants of integral numbers, in an instruction force of nature
  • The result of the arithmetic operations operating on two NAN of arguments is not defined

Such here small digression to the theory. However, we will review the sample program with use of SSE:
#include "iostream"
#include "xmmintrin.h"	

int main()
	const auto N = 8;

	alignas(16) float a[] = { 41982.0,  81.5091, 3.14, 42.666, 54776.45, 342.4556, 6756.2344, 4563.789 };
	alignas(16) float b[] = { 85989.111,  156.5091, 3.14, 42.666, 1006.45, 9999.4546, 0.2344, 7893.789 };
	__m128* a_simd = reinterpret_cast<__m128*>(a);
	__m128* b_simd = reinterpret_cast<__m128*>(b);

        auto size = sizeof(float);
        void *ptr = _aligned_malloc(N * size, 32);
	float* c = reinterpret_cast<float*>(ptr);
        for (size_t i = 0; i < N/2; i++, a_simd++, b_simd++, c += 4)
		_mm_store_ps(c, _mm_add_ps(*a_simd, *b_simd));
	c -= N;

	for (size_t i = 0; i < N; i++)
		std::cout << c[i] << std::endl;


	return 0;

  • alignas (#) — standard for With ++ a transferable method of a task of the configured alignment of variable and user types. It is used in With ++ 11 and the Visual Studio of 2015 is supported. It is possible to use also other option — __ declspec (align (#)) declarator. These means for management of alignment at static memory allocation. If alignment with dynamic selection is necessary, it is necessary to use void * _aligned_malloc (size_t size, size_t alignment);
  • Then we will transform the pointer on an array of an and b to type _m128 * by means of reinterpret_cast which allows to transform any pointer to the pointer of any other type.
  • After we will dynamically select the aligned memory by means of the _aligned_malloc function which is already mentioned above (N*sizeof(float), 16);
  • We select the number of necessary bytes proceeding from quantity of elements taking into account dimension of type, and the 16th this value of alignment which has to be a power of two. And then the pointer on this section of memory we lead to other pointer type that with it it would be possible to work taking into account dimension like float as with an array.

Thus all preparations for work of SSE are executed. Further in the scraper we sum up elements of arrays. Approach is based on arithmetics of pointers. As a_simd, b_simd and with are pointers, their increase leads to shift on sizeof(T) on memory. If to take for example a dynamic array with, then with [0] and * with will show identical value since with points to the first array cell. The increment with will lead to pointer shift on 4 bytes forward and now the pointer will indicate the 2nd array cell. Thus it is possible to move forward on an array and increasing back and reducing the pointer. But at the same time it is necessary to consider dimension of an array as it is easy to go beyond its limits and to address others section of memory. Work with the pointers a_simd and b_simd is similar, only the increment of the pointer will lead to movement of 128 bits forward and at the same time from the point of view of the float type 4 variable arrays of an and b will be passed. In principle the pointer a_simd and a, as well as b_simd and b indicate respectively one section in memory, behind that exception that they are processed differently taking into account dimension of pointer type:

	for (int i = 0; i < N/2; i++, a_simd++, b_simd++, c += 4)
		_mm_store_ps(c, _mm_add_ps(*a_simd, *b_simd));

Now it is clear why in this cycle such changes of pointers. On each iteration of a cycle there is an addition of 4 elements and saving of the received result on pointer addresses with from the register xmm0 (for this program). I.e. as we see such approach does not change basic data, and stores the sum in the register and as necessary transfers it to an operand necessary to us. It allows to increase program performance when it is necessary to reuse operands.
Let's consider a code which generates the assembler for the _mm_add_ps method:
mov         eax,dword ptr [b_simd]  ;// поместить адрес b_simd в регистр eax(базовая команда пересылки данных, источник не изменяется)
mov         ecx,dword ptr [a_simd]  ;// поместить адрес a_simd в регистр ecx 
movups      xmm0,xmmword ptr [ecx]  ;// поместить 4 переменные с плавающей точкой по адресу ecx в регистр xmm0;            xmm0 = {a[i], a[i+1], a[i+2], a[i+3]}
addps       xmm0,xmmword ptr [eax]  ;// сложить переменные: xmm0 = xmm0 + b_simd
                                    ;// xmm0[0] = xmm[0] + b_simd[0]
                                    ;// xmm0[1] = xmm[1] + b_simd[1]
                                    ;// xmm0[2] = xmm[2] + b_simd[2]
                                    ;// xmm0[3] = xmm[3] + b_simd[3]

movaps      xmmword ptr [ebp-190h],xmm0 ;// поместить значение регистра по адресу в стеке со смещением
movaps      xmm0,xmmword ptr [ebp-190h] ;// поместить в регистр
mov         edx,dword ptr [c] ;// поместить в регистр ecx адрес переменной с
movaps      xmmword ptr [edx],xmm0 ;// поместить значение регистра в регистр ecx или сохранить сумму по адресу памяти, куда указывает (ecx) или с. При этом xmmword представляет собой один и тот же тип, что и _m128 - 128-битовое слово, в   котором 4 переменные с плавающей точкой 

Apparently from a code, one instruction of addps processes at once 4 variables which is hardwired and is supported at the hardware level. The system does not take any part when processing these variables that gives a good gain of performance without excess costs from outside.
At the same time would like to note one feature that in this example and the compiler the instruction of movups for which operands which have to be aligned on 16-byte border are not required is used. What follows from that it would be possible not to align an array. However, the array of b needs to be aligned, otherwise in the operation addps there will be an error of reading memory, the register develops with a 128-bit arrangement in memory. In other compiler or Wednesday there can be other instructions therefore it is better to do anyway for all operands which are taking part in similar operations alignment on border. In any case in order to avoid problems with memory.
One more reason to do alignment, so it when we operate with elements of arrays (and not only with them), actually constantly we work about a cache lines of 64 bytes in size. SSE and AVX vectors always get to one cache the line if they are aligned on 16 and 32 bytes, respectively. And here if our data are not aligned, then, very possibly, we should load one more "additional" a cache line. This process rather strongly affects performance and if we at the same time and to array cells, so, and to memory, address inconsistently, then everything can be even worse.

Support of SIMD in .NET

For the first time mentioning of o to support of JIT of the SIMD technology was declared in the blog .NET in April, 2014. Then developers announced new a preview version of RyuJIT which provided to SIMD functionality. Quite high popularity of request for support of C# and SIMD became the reason of adding. The initial set of the supported types was not big and there were restrictions on functionality. The SSE set was initially supported, and AVX promised to add in release. Later updates were released and new types with support SIMD and new methods for work with them are added that in the latests version represents extensive and convenient library for hardware data handling.
Such approach facilitates life of the developer who should not write a CPU dependent code. Instead of this CLR abstracts hardware, providing the virtual performing environment which transfers the ​​ a code to machine instructions or at the runtime (JIT), or during installation (NGEN). Leaving code generation of CLR, you can use the same MSIL code on different computers with different processors, without refusing the optimization specific to this specific CPU.
At the moment support of this technology in .NET is provided in namespace of System.Numerics.Vectors and represents library of vector types which can use benefits of hardware acceleration of SIMD. Hardware acceleration can lead to substantial increase of performance at mathematical and scientific programming, and also when programming graphics. It contains the following types:
  • Vector — a collection of static convenient methods for work with universal vectors
  • Matrix3x2 — represents a matrix 3х2
  • Matrix4x4 — represents a matrix 4х4
  • Plane — represents the three-dimensional plane
  • Quaternion — represents the vector used for coding of three-dimensional physical turns
  • Vector <(<(<'T>)>)> represents to Of a vector of the specified numerical type which is suitable for low-level optimization of parallel algorithms
  • Vector2 — represents a vector with two values of single precision from a floating comma
  • Vector3 — represents a vector with three values of single precision from a floating comma
  • Vector4 — represents a vector with four values of single precision from a floating comma

The class Vector provides methods for addition, comparison, search of a minimum and a maximum and many other conversions over vectors. At the same time operations work with use of the SIMD technology. Other types also support hardware acceleration and contain conversions, specific to them. For matrixes it can be multiplication, for vectors Euclidean distance between points, etc.

The sample program on C#

So, that is necessary to use this technology? It is necessary to have first of all RyuJIT the compiler and the .NET 4.6 version. System.Numerics.Vectors through NuGet is not put if the version is lower. However, already at the set library I lowered the version and everything worked as it is necessary. Then assembly under x64 is necessary, for this purpose it is necessary to clean project properties "to prefer a 32-bit platform" and it is possible to collect under Any CPU.
using System;
using System.Numerics;

    class Program
        static void Main(string[] args)
            const Int32 N = 8;
            Single[] a = { 41982.0F, 81.5091F, 3.14F, 42.666F, 54776.45F, 342.4556F, 6756.2344F, 4563.789F };
            Single[] b = { 85989.111F, 156.5091F, 3.14F, 42.666F, 1006.45F, 9999.4546F, 0.2344F, 7893.789F };
            Single[] c = new Single[N];

            for (int i = 0; i < N; i += Vector<Single>.Count) // Count возвращает 16 для char, 4 для float, 2 для      double и т.п.
                var aSimd = new Vector<Single>(a, i); // создать экземпляр со смещением i
                var bSimd = new Vector<Single>(b, i);
                Vector<Single> cSimd = aSimd + bSimd; // или так Vector<Single> c_simd = Vector.Add(b_simd, a_simd);
                cSimd.CopyTo(c, i); //копировать в массив со смещением

            for (int i = 0; i < a.Length; i++)

From the general point of view that With ++ approach that .NET are quite similar. Conversion/copying of basic data is necessary, to execute copying in a final array. However, approach with C# is much simpler, many things are made for you and you only need to use and enjoy. There is no need to think of data smoothing, to be engaged in memory allocation and to do it statically, or dynamically with certain operators. On the other hand you have a bigger control over the events with use of pointers, but also it is more than responsibility for the events.
And in the scraper there is everything as and in the scraper in With ++. And I not about pointers. Algorithm of calculation same. On the first iteration we bring the first 4 elements of initial arrays in structure of aSimd and bSimd, then we sum up and we save in a final array. Then on the following iteration we bring the following 4 elements by means of shift and we sum up them. Here so everything simply and quickly becomes. Let's consider a code which generates the compiler for this var cSimd command = aSimd + bSimd:
addps       xmm0,xmm1  

Difference from With ++ versions only that there are both registers while there was a folding of the register with memory section. The room in registers occurs at initialization of aSimd and bSimd. In general this approach if to compare a code of compilers C ++ and .NET not especially differs and gives approximately equal performance. Though the option will work with pointers all the same quicker. It would be desirable to note that SIMD instructions are generated at the included code optimization. I.e. to see them in a disassembler in Debug it will not turn out: it is implemented in the form of function call. However in Release where optimization is included, you receive these instructions in the explicit (built-in) type.


That we have:
  • In many cases vectorization gives 4-8× increase in performance
  • Difficult algorithms will demand an ingenuity, but without it anywhere
  • System.Numerics.Vectors clasps only part of simd-instructions now. For more serious approach it will be required With ++
  • There is a set of other methods in addition to vectorization: the correct use of a cache, a multithreading, heterogeneous calculations, competent work with memory (that the garbage collector did not sweat), etc.

During short twitter correspondence with Sashoy Goldstein (one of authors of the book "Optimization of Applications on the .NET Platform") who considered hardware acceleration in .NET I took an interest that as support of SIMD in .NET and what it in comparison with With ++ is in his opinion implemented. What he answered: "Undoubtedly, you can make more on With ++, than on C#. But you really get cross-processor support on С#. For example, the automatic choice between SSE4 and AVX". In general, it cannot but please. At the price of small efforts we can receive from system as it is possible for bigger performance, involving all possible hardware resources.
For me it is very good opportunity to develop productive programs. At least in the thesis, on modeling of physical processes, I generally tried to obtain efficiency by creation of a quantity of flows from myself, and also by means of heterogeneous calculations. I use both CUDA, and C ++ AMP. Development is conducted on a universal platform under Windows 10 where I am very much attracted by WinRT which allows to write the program both on C#, and on C++/CX. Generally on pluses I write a kernel for big calculations (Boost), and on C# already I manipulate data and I develop the interface. Naturally, the stage of two languages given via the binary ABI interface for interaction has the price (though not really big) that demands more reasonable development of library on With ++. However at me data are sent only in case of need and quite seldom, only for display of results.
In case of need of data manipulation in C#, I will transform them to the .NET types not to work with the WinRT types, thereby increasing processing performance already on C#. So for example, when it is necessary to process several thousands or tens of thousands of elements or the requirement to processing have no special specifications, data can be calculated in C# without involvement of library (in it from 3 to 10 million copies of structures, sometimes only for one iteration are considered). So approach with hardware acceleration will simplify a task and will make it quicker.

The list of sources when writing article

Well and separate thanks to Sasha Goldstein for the provided help and information.

This article is a translation of the original post at
If you have any questions regarding the material covered in the article above, please, contact the original author of the post.
If you have any complaints about this article or you want this article to be deleted, please, drop an email here:

We believe that the knowledge, which is available at the most popular Russian IT blog, should be accessed by everyone, even though it is poorly translated.
Shared knowledge makes the world better.
Best wishes.

comments powered by Disqus