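Computer science
Parallel computing
Vectorization (mathematics)
Interleaving
x86
Compiler
Benchmark (surveying)
Benchmarking
Programming language
Software
Operating system
Geodesy
Business
Marketing
Geography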
Authors
Andrew Anderson,Avinash Malik,David Gregg
Source
Journal: ACM Transactions on Architecture and Code Optimization
[Association for Computing Machinery]
Date: 2015-12-08
Volume (issue), pages: 12 (4): 1-25
Citations: 22
Abstract
Automatically exploiting short vector instruction sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous access, including interleaved data access. An existing approach used by GCC generates extremely efficient code for loops with power-of-2 interleaving factors (strides). In this paper we propose a generalization of this approach that produces similar code for any compile-time constant interleaving factor. In addition, we propose several novel program transformations, which were made possible by our generalized representation of the problem. Experiments show that our approach achieves significant speedups for both power-of-2 and non-power-of-2 interleaving factors. Our vectorization approach results in mean speedups over scalar code of 1.77x on Intel SSE and 2.53x on Intel AVX2 in real-world benchmarking on a selection of BLAS Level 1 routines. On the same benchmark programs, GCC 5.0 achieves mean improvements of 1.43x on Intel SSE and 1.30x on Intel AVX2. In synthetic benchmarking on Intel SSE, our maximum improvement on data movement is over 4x for gathering operations and over 6x for scattering operations versus scalar code.
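
To make the term "interleaved data access" concrete, the following is a minimal scalar sketch in Python of the access pattern the abstract describes, with a non-power-of-2 interleaving factor of 3 (for example, x/y/z or R/G/B values stored side by side). It is not the paper's algorithm and not what GCC emits; the function names `gather`, `scatter`, and `scale_channels` are hypothetical, chosen only for illustration. A vectorizing compiler would aim to perform the same gather and scatter steps with vector shuffle instructions rather than element by element.

```python
def gather(data, factor):
    """De-interleave `data` into `factor` contiguous streams.

    Stream k holds elements k, k + factor, k + 2*factor, ...
    (illustrative helper, not taken from the paper).
    """
    if len(data) % factor != 0:
        raise ValueError("length must be a multiple of the interleaving factor")
    return [data[k::factor] for k in range(factor)]


def scatter(streams):
    """Re-interleave equal-length streams; the inverse of gather()."""
    factor = len(streams)
    n = len(streams[0])
    out = [0] * (factor * n)
    for k, stream in enumerate(streams):
        out[k::factor] = stream  # write stream k back at stride `factor`
    return out


def scale_channels(data, weights):
    """Multiply each interleaved channel by its own weight.

    The interleaving factor is len(weights). Once the channels are
    gathered, each per-channel loop runs over contiguous data, which
    is the shape that vector instructions handle well.
    """
    streams = gather(data, len(weights))
    scaled = [[w * x for x in s] for s, w in zip(streams, weights)]
    return scatter(scaled)
```

For instance, `gather([1, 2, 3, 4, 5, 6], 3)` separates the three channels into `[1, 4]`, `[2, 5]`, and `[3, 6]`, and `scatter` restores the original layout.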