A loop vectorizer that maximizes vectorization benefit through exploiting mixed SIMD parallelism

* Artifacts evaluated by CGO'2016 AEC

  • Authors:
  • Affiliation:
  • Contacts:

About Loop-Mix

Loop-Mix is a new loop vectorizer introduced to exploit both intra- and inter-iteration SIMD parallelismin simulataneously, resulting in reduced data reorganization overhead and thus faster vectorized code. Loop-Mix has been implemented on top of the original loop-level vectorizer in LLVM 3.5.0. For more details about the design and implementation of Loop-Mix, please see the reference below and the released code.


Loop-Mix is introduced in our CGO'16 paper titled Exploiting Mixed SIMD Parallelism By Reducing Data Reorganisation Overhead. There are two other recent papers on SIMD optimization from our research group: (1) Hao Zhou and Jingling Xue. A Compiler Approach for Exploiting Partial SIMD Parallelism, ACM Transactions on Architecture and Code Optimization, 13(1):11:1--11:26, 2016, and (2) Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue. Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization, LCTES'16.


Our implementation on Loop-Mix can be found here : Download.

The package includes the source code of Loop-Mix (i.e., LoopVectorize.cpp) and three kernels (i.e., kernel.c, kernel178.c and kernel200.c) that can be used for testing purposes. An example script, kernel_build.sh, for compiling the kernel is also included.

How to use

To build it

Replace the source file named LoopVectorize.cpp under the path $YOUR_LLVM_SOURCE_LOCATION/lib/transforms/vectorize/ with the file LoopVectorize.cpp extracted from the downloaded package. Then, build LLVM as usual. (Note: for other versions of LLVM, a successful build may require some modifications in LoopVectorize.cpp, such as its included head files.)

To enable it

The loop vectorizer Loop-Mix is enabled by default in clang. Please use the command line flag -loop-vectorize to enable it in opt. The command line flag -vectorize-cost-threshold (defaulted to 6) can be used to set the threshold for profitability analysis. The command line flag -max-vector-width-in-bit can be used to specify the largest vector width of the underlying hardware. The default value is set as 256.

To disable it

Please use the command line flag -fno-vectorize to disable the loop vectorizer Loop-Mix in clang. If the command line flag -loop-vectorize is not specified, Loop-Mix is disabled in opt.

How it works

Loop-Mix vectorizes a loop in 4 phases: (1) the Legality Checking phase examines whether the loop is vectorizable or not; (2) the Statement Grouping phase determines which statements in the loop should be combined together in order to exploit intra/inter-iteration SIMD parallelism and reduce data reorganization overhead; (3) the Profitability Analysis phase decides whether vectorizing the loop is beneficial or not by comparing with the scalar code for the loop; and (4) the Code Generation phase emits the vectorized code.

Loop-Mix differs from existing loop vectorizers mainly in its second stage, i.e., Statement Grouping, as illustrated by an example below. If we only exploit intra-iteration SIMD parallelism and thus group S2 and S3 together, the SIMD resources will be underutilized as S1 is left unvectorized. To maximize the utilization of the SIMD resources, existing loop vectorization techniques choose to exploit inter-iteration SIMD parallelism alone, by grouping together the instances of the same statement from consecutive iterations, due to the dependences between S1 and the other two statements. Therefore, data reorganization is performed on the data accessed on array b in both S2 and S3.

Loop-Mix exploits both intra- and inter-iteration SIMD parallelism, i.e., inter-iteration parallelism on S1 and intra-iteration parallelism on S2 and S3. As a result, data reorganization is only performed on data t. By reducing the data reorganization overhead, more vectorization benefit will be reaped. For the example loop, Loop-Mix generates the most efficient vectorized code (compared with existing loop vectorization techniques).

/////////example loop///////////

for(int i = 0 ; i != n; i++)


S1: t = a[i];

S2: b[2i] = b[2i] + t;

S3: b[2i+1] = b[2i+1] + t;


Find a bug?!

We have tried our best to evaluate our implementation, and now release it to benefit others. If you find a bug, please contact us. Thanks!