High-Performance Cholesky Factorization of Band Matrices Using GPU and CPU Parallel Processing
Solving linear systems with larger matrices requires more memory and longer processing time. Hence, applying parallel computing to the solution of linear systems has received considerable interest over the last decade. Most previous studies of parallel solvers have focused on iterative algorithms running on distributed parallel computing platforms. However, iterative algorithms realize their performance advantage only for very large linear systems on supercomputers. This study therefore focuses on developing more sophisticated direct parallel algorithms for multi-core CPU and GPU parallel computing platforms. The study has three stages. First, direct algorithms for solving linear systems were parallelized and implemented on a multi-core platform, and their computing time and solution precision were measured and compared to assess the performance of the different algorithms. Next, the blocked Cholesky algorithm was adapted and optimized to develop a novel parallel algorithm. Finally, the optimized blocked Cholesky algorithm was implemented on multi-core CPU and GPU parallel computing platforms. The results show a speedup of 2.3 for band matrices with bandwidth greater than 100 on a four-core platform compared with a single-core platform, and the speedup rose to 3.3 when the bandwidth exceeded 1000. Notably, a tenfold speedup was reached when the novel algorithm was implemented on a GPU using CUDA. The results also show that the larger the bandwidth of the matrix, the higher the performance achieved on the GPU platform.
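The abstract does not include the author's blocked Cholesky code. As a rough illustration only, the sketch below shows a textbook right-looking blocked Cholesky factorization restricted to a band of half-bandwidth `p`, written in serial pure Python. It is not the thesis's optimized algorithm and contains no multi-core or GPU parallelism; the function name `band_cholesky_blocked` and the block size `nb` are hypothetical. Dense list-of-lists storage is assumed for readability, whereas production codes such as LAPACK's `dpbtrf` use packed band storage.

```python
import math

def band_cholesky_blocked(A, p, nb):
    """Hypothetical sketch: right-looking blocked Cholesky of an SPD band matrix.

    A  : n x n symmetric positive-definite matrix (list of lists) with
         A[i][j] == 0 whenever |i - j| > p.
    p  : half-bandwidth.
    nb : block size (assumed tuning parameter).
    Returns a lower-triangular L with A = L L^T; L keeps half-bandwidth p.
    """
    n = len(A)
    L = [row[:] for row in A]          # work on a copy; only the lower triangle is read
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)           # current block column is [k0, k1)
        r1 = min(n, k1 + p)            # rows >= r1 are zero in this block column
        # 1. Unblocked Cholesky of the diagonal block L[k0:k1, k0:k1].
        for j in range(k0, k1):
            d = L[j][j] - sum(L[j][t] ** 2 for t in range(k0, j))
            if d <= 0.0:
                raise ValueError("matrix is not positive definite")
            L[j][j] = math.sqrt(d)
            for i in range(j + 1, k1):
                s = L[i][j] - sum(L[i][t] * L[j][t] for t in range(k0, j))
                L[i][j] = s / L[j][j]
        # 2. Panel solve L21 = A21 * L11^{-T}, limited to rows inside the band.
        for i in range(k1, r1):
            for j in range(k0, k1):
                s = L[i][j] - sum(L[i][t] * L[j][t] for t in range(k0, j))
                L[i][j] = s / L[j][j]
        # 3. Trailing update A22 -= L21 * L21^T (lower triangle, band rows only).
        for i in range(k1, r1):
            for j in range(k1, i + 1):
                L[i][j] -= sum(L[i][t] * L[j][t] for t in range(k0, k1))
    # Clear the unused strictly upper triangle copied from A.
    for i in range(n):
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L


if __name__ == "__main__":
    # Example: diagonally dominant (hence SPD) band matrix with p = 2.
    n, p = 8, 2
    A = [[10.0 if i == j else -1.0 if abs(i - j) == 1 else 0.5 if abs(i - j) == 2 else 0.0
          for j in range(n)] for i in range(n)]
    L = band_cholesky_blocked(A, p, nb=3)
    residual = max(abs(sum(L[i][k] * L[j][k] for k in range(n)) - A[i][j])
                   for i in range(n) for j in range(n))
    print("max |L L^T - A| =", residual)
```

In a blocked scheme, most of the arithmetic falls in step 3, a matrix-matrix update whose size grows with about `nb + p` rows, and that update is the part usually distributed across CPU cores or GPU threads. One plausible reading of the reported trend is that a wider band gives each step more parallel work, which suits the GPU better, but this sketch does not measure that.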
Appears in Collections: Thesis