Computer science
Vectorization (mathematics)
Parallel computing
Supercomputer
Multiplication
Matrix multiplication
Interface (computing)
Linear algebra
Coding
Double-precision floating-point format
Performance improvement
Computer architecture
Computational science
Programming language
Floating point
Maximum bubble pressure method
Geometry
Set (abstract data type)
Operations management
Economy
Bubble
Physics
Quantum
Acoustics
Quantum mechanics
Mathematics
Authors
Jack Dongarra, Sven Hammarling, Nicholas J. Higham, Samuel D. Relton, Pedro Valero-Lara, Mawussi Zounon
Identifier
DOI:10.1016/j.procs.2017.05.138
Abstract
A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on a general matrix-matrix multiplication (GEMM), to explore their effect on performance. In particular, we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double-precision GEMM operations using matrices of size 2 × 2 to 20 × 20.
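To make the interleaving idea concrete, here is a minimal sketch in C of a batched GEMM over an interleaved layout. It is not the authors' code, and the function name `gemm_batch_interleaved` is hypothetical; it only assumes the layout described in the abstract, where element (i, j) of matrix k in the batch is stored at offset (i*n + j)*batch + k, so the innermost loop runs over the batch index with unit stride and can be auto-vectorized.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch (not the paper's implementation): batched C += A*B for a batch
 * of n-by-n double-precision matrices in an interleaved layout, where
 * element (i,j) of matrix k lives at offset (i*n + j)*batch + k.
 * The innermost loop runs over the batch index with unit stride,
 * which lets the compiler vectorize across matrices. */
static void gemm_batch_interleaved(int n, int batch,
                                   const double *A, const double *B,
                                   double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < n; ++p) {
                const double *a = &A[(size_t)(i*n + p) * batch];
                const double *b = &B[(size_t)(p*n + j) * batch];
                double       *c = &C[(size_t)(i*n + j) * batch];
                for (int k = 0; k < batch; ++k)  /* unit stride: vectorizable */
                    c[k] += a[k] * b[k];
            }
}

int main(void)
{
    const int n = 2, batch = 4;            /* tiny sizes for illustration */
    size_t len = (size_t)n * n * batch;
    double *A = malloc(len * sizeof *A);
    double *B = malloc(len * sizeof *B);
    double *C = calloc(len, sizeof *C);    /* C starts at zero */
    for (size_t t = 0; t < len; ++t) { A[t] = 1.0; B[t] = 2.0; }
    gemm_batch_interleaved(n, batch, A, B, C);
    printf("C(0,0) of matrix 0 = %g\n", C[0]);  /* 2x2 all-ones * all-twos: 4 */
    free(A); free(B); free(C);
    return 0;
}
```

By contrast, a conventional batched interface would pass an array of pointers to separately stored matrices; the interleaved layout trades that flexibility for contiguous, stride-one access across the batch, which is what aids the vectorization and prefetching mentioned in the abstract.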