Implementation of the CPU/GPU hybrid parallel method of characteristics neutron transport calculation using the heterogeneous cluster with dynamic workload assignment

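Computer science; parallel computing; CUDA; GPU cluster; multi-core processor; neutron transport; supercomputer; Message Passing Interface; acceleration; domain decomposition method; central processing unit; symmetric multiprocessing system; general-purpose computing on graphics processing units; message passing; graphics; operating system; neutron; finite element method; physics; thermodynamics; quantum mechanics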

Authors

Peitao Song, Zhijian Zhang, Qian Zhang, Liang Liang, Qiang Zhao

Source

Journal: Annals of Nuclear Energy [Elsevier BV]
Date: 2020-01-01 Volume/Issue: 135: 106957-106957 Citations: 9

Identifier

DOI: 10.1016/j.anucene.2019.106957

Abstract

• A heterogeneous parallel MOC algorithm is implemented with the MPI + OpenMP/CUDA model.
• A dynamic workload assignment scheme is applied to ensure workload balance.
• A performance analysis model is applied to evaluate the parallel algorithm.

In recent years, graphics processing units (GPUs) have been adopted in many High-Performance Computing (HPC) systems because of their massive computational power and superior energy efficiency. Accelerating CPU-based computational codes on heterogeneous clusters with multi-core CPUs and GPUs has attracted a lot of attention. One focus of heterogeneous computing is to make efficient use of all computational resources available on a cluster, including both CPUs and GPUs. In this paper, a heterogeneous MPI + OpenMP/CUDA parallel algorithm for solving the 2D neutron transport equation with the method of characteristics (MOC) is implemented. In this algorithm, spatial domain decomposition provides coarse-grained parallelism through MPI, while fine-grained parallelism is exploited through OpenMP (in CPU-calculated domains) and CUDA (in GPU-calculated domains) based on ray parallelization. To leverage the computing power of heterogeneous clusters efficiently, a dynamic workload assignment scheme is proposed that distributes the workload according to the runtime performance of the CPUs and GPUs in the cluster. Moreover, the strong scaling performance of the MPI + CUDA parallelization is studied through a performance analysis model, which quantifies the impact of the degradation of the iteration scheme, the load imbalance, the data copies between CPUs and GPUs, and the MPI communication in the MPI + CUDA parallel algorithm. The corresponding conclusions also hold for the MPI + OpenMP/CUDA parallelization.
The C5G7 2D benchmark and an extended 2D whole-core problem are calculated with the MPI + CUDA parallelization, the MPI + OpenMP/CUDA parallelization, and, for comparison, the MPI parallelization. Numerical results demonstrate that the heterogeneous parallel algorithm maintains the desired accuracy. The dynamic workload assignment scheme provides the optimal workload assignment, which matches the experimental results well. In addition, an improvement of over 11% is observed for the MPI + OpenMP/CUDA parallelization compared with the MPI + CUDA parallelization. Moreover, the CPU/GPU heterogeneous clusters significantly outperform the CPU-only clusters, and one heterogeneous node is roughly five times faster than one CPU node.
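
The abstract does not spell out how the dynamic workload assignment scheme is computed, so the following is only a minimal sketch of the general idea it describes: split the work across devices in proportion to their measured runtime performance, then re-measure and re-split. The function names `assign_workload` and `update_rates`, the notion of a generic "work unit", and the sample rates are all hypothetical and are not taken from the paper.

```python
def assign_workload(total_units, measured_rates):
    """Split total_units across devices in proportion to measured throughput.

    measured_rates maps a device name to work units per second, as timed in
    an earlier iteration. This proportional rule is an assumption made for
    illustration, not the scheme from the paper.
    """
    total_rate = sum(measured_rates.values())
    if total_rate <= 0:
        raise ValueError("at least one device must have a positive rate")

    # Ideal (fractional) share of each device.
    ideal = {dev: total_units * rate / total_rate
             for dev, rate in measured_rates.items()}

    # Round down, then hand the leftover units to the devices with the
    # largest fractional remainders so the shares sum to total_units.
    shares = {dev: int(value) for dev, value in ideal.items()}
    leftover = total_units - sum(shares.values())
    by_remainder = sorted(ideal, key=lambda dev: ideal[dev] - shares[dev],
                          reverse=True)
    for dev in by_remainder[:leftover]:
        shares[dev] += 1
    return shares


def update_rates(shares, elapsed_seconds):
    """Recompute throughput from the units each device just processed."""
    return {dev: shares[dev] / elapsed_seconds[dev] for dev in shares}


if __name__ == "__main__":
    # Hypothetical starting guess: the GPU is four times faster than the CPU.
    shares = assign_workload(100, {"cpu": 1.0, "gpu": 4.0})
    print(shares)

    # Hypothetical timings from one sweep; the next split uses the new rates.
    rates = update_rates(shares, {"cpu": 4.0, "gpu": 2.0})
    print(assign_workload(110, rates))
```

Rounding with the largest-remainder rule keeps every work unit assigned to exactly one device, which matters when the units are indivisible items such as subdomains or rays.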
