==========================
Auto-Vectorization in LLVM
==========================

LLVM has two vectorizers: the *Loop Vectorizer*, which operates on loops,
and the *Basic Block Vectorizer*, which optimizes straight-line code. These
vectorizers focus on different optimization opportunities and use different
techniques. The Basic Block Vectorizer merges multiple scalars found in the
code into vectors, while the Loop Vectorizer widens instructions in the
original loop to operate on multiple consecutive loop iterations.

The Loop Vectorizer
===================

LLVM's Loop Vectorizer is now available. It is not enabled by default, but it
can be enabled through clang using the command line flag:

.. code-block:: console

   $ clang -fvectorize -O3 file.c

If the ``-fvectorize`` flag is used, the Loop Vectorizer is enabled when
running at ``-O2`` or ``-O3``. When ``-Os`` is used, the Loop Vectorizer
only vectorizes loops that do not require a major increase in code size.

We plan to enable the Loop Vectorizer by default as part of the LLVM 3.3 release.

Features
^^^^^^^^

The LLVM Loop Vectorizer has a number of features that allow it to vectorize
complex loops.

Loops with unknown trip count
-----------------------------

The Loop Vectorizer supports loops with an unknown trip count.
In the loop below, the iteration ``start`` and ``end`` points are unknown,
and the Loop Vectorizer has a mechanism to vectorize loops that do not start
at zero. In this example, the number of iterations (``end - start``) may not
be a multiple of the vector width, so the vectorizer has to execute the last
few iterations as scalar code. Keeping a scalar copy of the loop increases
the code size.

.. code-block:: c++

  void bar(float *A, float *B, float K, int start, int end) {
    for (int i = start; i < end; ++i)
      A[i] *= B[i] + K;
  }
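
The loop structure described above can be pictured with a hand-written
analogue. This is only a sketch under stated assumptions: the name
``bar_strip_mined`` and the width ``kVF`` are hypothetical, the inner ``lane``
loop stands in for a single vector instruction, and the code the vectorizer
actually emits may look different.

.. code-block:: c++

  // Hypothetical vector width; a real compiler would choose it per target.
  constexpr int kVF = 4;

  // Hand-written analogue of a vectorized loop with a scalar remainder.
  void bar_strip_mined(float *A, const float *B, float K, int start, int end) {
    int i = start;
    // Main loop: runs only while at least kVF iterations remain.
    for (; end - i >= kVF; i += kVF) {
      // Stand-in for one vector multiply-add over kVF consecutive elements.
      for (int lane = 0; lane < kVF; ++lane)
        A[i + lane] *= B[i + lane] + K;
    }
    // Scalar remainder: the last (end - start) % kVF iterations.
    for (; i < end; ++i)
      A[i] *= B[i] + K;
  }

The remainder loop is the scalar copy that accounts for the code size increase
mentioned in the text.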

Runtime Checks of Pointers
--------------------------

In the example below, if the pointers ``A`` and ``B`` point to consecutive
addresses, then it is illegal to vectorize the code because some elements of
``A`` will be written before they are read from array ``B``.

Some programmers use the ``restrict`` keyword to notify the compiler that the
pointers are disjoint, but in our example, the Loop Vectorizer has no way of
knowing that the pointers ``A`` and ``B`` are unique. The Loop Vectorizer
handles this loop by placing code that checks, at runtime, if the arrays ``A``
and ``B`` point to disjoint memory locations. If arrays ``A`` and ``B``
overlap, then the scalar version of the loop is executed.

.. code-block:: c++

  void bar(float *A, float *B, float K, int n) {
    for (int i = 0; i < n; ++i)
      A[i] *= B[i] + K;
  }
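
A source-level sketch of such a guard is given here for illustration. The
names ``ranges_disjoint`` and ``bar_checked`` are hypothetical, and the width
of 4 is an assumption. The check that the vectorizer really inserts operates
on the intermediate representation and may be formulated differently.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical helper: true if [A, A+n) and [B, B+n) share no bytes.
  // Addresses are compared as integers, because relational comparison of
  // pointers into unrelated arrays is not well defined.
  bool ranges_disjoint(const float *A, const float *B, int n) {
    auto a = reinterpret_cast<std::uintptr_t>(A);
    auto b = reinterpret_cast<std::uintptr_t>(B);
    auto bytes = static_cast<std::uintptr_t>(n) * sizeof(float);
    return a + bytes <= b || b + bytes <= a;
  }

  void bar_checked(float *A, float *B, float K, int n) {
    if (n > 0 && ranges_disjoint(A, B, n)) {
      // Fast path: four iterations at a time, modelling a vector load of B.
      int i = 0;
      for (; n - i >= 4; i += 4) {
        float b[4] = {B[i], B[i + 1], B[i + 2], B[i + 3]};
        for (int lane = 0; lane < 4; ++lane)
          A[i + lane] *= b[lane] + K;
      }
      for (; i < n; ++i)
        A[i] *= B[i] + K;
    } else {
      // Fallback: the original scalar loop, correct even when A and B overlap.
      for (int i = 0; i < n; ++i)
        A[i] *= B[i] + K;
    }
  }

With ``A == B + 1`` the ranges overlap, so the scalar fallback runs and each
iteration observes the value written by the previous one.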

Reductions
----------

In this example the ``sum`` variable is used by consecutive iterations of
the loop. Normally, this would prevent vectorization, but the vectorizer can
detect that ``sum`` is a reduction variable. The variable ``sum`` becomes a
vector of integers, and at the end of the loop the elements of the vector are
added together to create the correct result. We support a number of different
reduction operations, such as addition, multiplication, XOR, AND and OR.

.. code-block:: c++

  int foo(int *A, int *B, int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; ++i)
      sum += A[i] + 5;
    return sum;
  }
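
The transformation can be mimicked by hand with one partial sum per vector
lane. This is a sketch, not the vectorizer's output: ``foo_partial_sums`` and
``kVF`` are hypothetical names, the unused ``B`` parameter is dropped, and an
array of four ``unsigned`` values stands in for the vector register.

.. code-block:: c++

  int foo_partial_sums(const int *A, int n) {
    constexpr int kVF = 4;               // hypothetical vector width
    unsigned lanes[kVF] = {0, 0, 0, 0};  // "sum" widened to a vector
    int i = 0;
    // Each lane accumulates every kVF-th element independently.
    for (; n - i >= kVF; i += kVF)
      for (int lane = 0; lane < kVF; ++lane)
        lanes[lane] += A[i + lane] + 5;
    // After the loop, the lanes are added together into a single scalar.
    unsigned sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    // Scalar remainder for the last n % kVF elements.
    for (; i < n; ++i)
      sum += A[i] + 5;
    return sum;
  }

Reordering the additions is safe here because unsigned integer addition is
associative. The same reordering can change the result of a floating-point
sum.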

Inductions
----------

In this example the value of the induction variable ``i`` is saved into an
array. The Loop Vectorizer knows to vectorize induction variables.

.. code-block:: c++

  void bar(float *A, float *B, float K, int n) {
    for (int i = 0; i < n; ++i)
      A[i] = i;
  }
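
One way to picture a vectorized induction variable is as a small vector that
holds the values of ``i`` for several consecutive iterations and advances by
the vector width each step. The sketch below assumes a width of 4 and uses the
hypothetical names ``bar_vector_iv`` and ``iv``. It illustrates the idea only
and is not taken from the vectorizer's output.

.. code-block:: c++

  void bar_vector_iv(float *A, int n) {
    constexpr int kVF = 4;        // hypothetical vector width
    int iv[kVF] = {0, 1, 2, 3};   // vector form of the induction variable
    int i = 0;
    for (; n - i >= kVF; i += kVF) {
      for (int lane = 0; lane < kVF; ++lane) {
        A[i + lane] = static_cast<float>(iv[lane]);
        iv[lane] += kVF;          // every lane steps by the vector width
      }
    }
    // Scalar remainder.
    for (; i < n; ++i)
      A[i] = static_cast<float>(i);
  }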

If Conversion
-------------

The Loop Vectorizer is able to "flatten" the IF statement in the code and
generate a single stream of instructions. The Loop Vectorizer supports any
control flow in the innermost loop. The innermost loop may contain complex
nesting of IFs, ELSEs and even GOTOs.

.. code-block:: c++

  int foo(int *A, int *B, int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; ++i)
      if (A[i] > B[i])
        sum += A[i] + 5;
    return sum;
  }
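
Written by hand, a flattened form of this loop replaces the branch with a
condition value and a select. The function name ``foo_flattened`` and the
local names ``mask`` and ``addend`` are hypothetical. This is a source-level
approximation of the idea, not the exact code the vectorizer produces.

.. code-block:: c++

  int foo_flattened(const int *A, const int *B, int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; ++i) {
      // The condition becomes a value (a mask) instead of a branch.
      bool mask = A[i] > B[i];
      // Select: contribute A[i] + 5 where the mask is set, 0 otherwise.
      unsigned addend = mask ? A[i] + 5 : 0;
      sum += addend;
    }
    return sum;
  }

Once the loop body is a single stream of instructions, each operation can be
widened to work on several iterations at a time.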

Pointer Induction Variables
---------------------------

This example uses the ``std::accumulate`` function of the C++ standard
library. This loop uses C++ iterators, which are pointers, and not integer
indices. The Loop Vectorizer detects pointer induction variables and can
vectorize this loop. This feature is important because many C++ programs use
iterators.

.. code-block:: c++

  #include <numeric>

  int baz(int *A, int n) {
    return std::accumulate(A, A + n, 0);
  }
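
For illustration, the loop inside ``std::accumulate`` is roughly equivalent to
the hand-written version below once it is inlined. The name ``baz_expanded``
is hypothetical and the exact library implementation may differ. The point is
that the loop counter is the pointer ``p`` itself rather than an integer
index.

.. code-block:: c++

  int baz_expanded(const int *A, int n) {
    int sum = 0;
    // "p" is a pointer induction variable: it advances by one element
    // per iteration and is compared against the end pointer.
    for (const int *p = A; p != A + n; ++p)
      sum += *p;
    return sum;
  }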

Reverse Iterators
-----------------

The Loop Vectorizer can vectorize loops that count backwards.

.. code-block:: c++

  void foo(int *A, int *B, int n) {
    for (int i = n; i > 0; --i)
      A[i] += 1;
  }

Scatter / Gather
----------------

The Loop Vectorizer can vectorize code that becomes scatter/gather
memory accesses.

.. code-block:: c++

  void foo(int *A, int *B, int n, int k) {
    for (int i = 0; i < n; ++i)
      A[i*7] += B[i*k];
  }

Vectorization of Mixed Types
----------------------------

The Loop Vectorizer can vectorize programs with mixed types. The vectorizer's
cost model can estimate the cost of the type conversion and decide if
vectorization is profitable.

.. code-block:: c++

  void foo(int *A, char *B, int n, int k) {
    for (int i = 0; i < n; ++i)
      A[i] += 4 * B[i];
  }

Vectorization of function calls
-------------------------------

The Loop Vectorizer can vectorize intrinsic math functions.
See the table below for a list of these functions.

+-----+-----+---------+
| pow | exp |  exp2   |
+-----+-----+---------+
| sin | cos |  sqrt   |
+-----+-----+---------+
| log |log2 |  log10  |
+-----+-----+---------+
|fabs |floor|  ceil   |
+-----+-----+---------+
|fma  |trunc|nearbyint|
+-----+-----+---------+
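
A loop of the following shape is a candidate for this feature, since it calls
one of the functions listed in the table on every element. The function name
``apply_floor`` is hypothetical. Whether a particular call is actually
vectorized is assumed to depend on the target and on the compiler flags in
use.

.. code-block:: c++

  #include <cmath>

  void apply_floor(float *A, int n) {
    // Each iteration calls floor on one element; a vectorized version
    // would apply the operation to several elements at once.
    for (int i = 0; i < n; ++i)
      A[i] = std::floor(A[i]);
  }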

Performance
^^^^^^^^^^^

This section shows the execution time of Clang on a simple benchmark:
`gcc-loops <http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/UnitTests/Vectorizer/>`_.
This benchmark is a collection of loops from the GCC autovectorization
`page <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>`_ by Dorit Nuzman.

The chart below compares GCC-4.7, ICC-13, and Clang-SVN at ``-O3``, running on
a Sandy Bridge. The Y-axis shows time in msec. Lower is better.

.. image:: gcc-loops.png

The Basic Block Vectorizer
==========================

The Basic Block Vectorizer is not enabled by default, but it can be enabled
through clang using the command line flag:

.. code-block:: console

   $ clang -fslp-vectorize file.c

The goal of basic-block vectorization (a.k.a. superword-level parallelism) is
to combine similar independent instructions within simple control-flow regions
into vector instructions. Memory accesses, arithmetic operations, comparison
operations and some math functions can all be vectorized using this technique
(subject to the capabilities of the target architecture).

For example, the following function performs very similar operations on its
inputs ``(a1, b1)`` and ``(a2, b2)``. The basic-block vectorizer may combine
these into vector operations.

.. code-block:: c++

  int foo(int a1, int a2, int b1, int b2) {
    int r1 = a1*(a1 + b1)/b1 + 50*b1/a1;
    int r2 = a2*(a2 + b2)/b2 + 50*b2/a2;
    return r1 + r2;
  }
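
The pairing can be sketched by hand by packing the two sets of inputs into
two-element arrays and writing the computation once per lane. The name
``foo_paired`` is hypothetical and the arrays stand in for vector registers.
This shows the idea only, not the code the vectorizer generates.

.. code-block:: c++

  int foo_paired(int a1, int a2, int b1, int b2) {
    int a[2] = {a1, a2};  // lane 0 holds the (a1, b1) computation,
    int b[2] = {b1, b2};  // lane 1 holds the (a2, b2) computation
    int r[2];
    // Each operation in this body is applied to both lanes, which is what
    // a single two-wide vector instruction would do.
    for (int lane = 0; lane < 2; ++lane)
      r[lane] = a[lane] * (a[lane] + b[lane]) / b[lane] + 50 * b[lane] / a[lane];
    return r[0] + r[1];
  }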