Title: 高效能Cholesky分解法使GPU與CPU平行處理應用於帶狀矩陣
High-performance Cholesky Factorization using the GPU and CPU parallel processing for band matrix
Authors: 連江祥
Lain, Jiang-Siang
洪士林
Hung, Shih-Lin
Department of Civil Engineering
Keywords: Cholesky; CUDA; OpenMP; solver; band matrix; parallel
Issue Date: 2011
Abstract: As the matrix of a linear system grows, the memory and processing time needed to solve it grow as well, so applying parallel computing to linear system solvers has long been an active research topic. Most previous work on parallel solvers has focused on iterative algorithms running on distributed parallel computing platforms. Iterative methods parallelize well, but they apply to a narrower range of problems and are better suited to large computers. This study instead focuses on direct algorithms, which are harder to parallelize, and uses newer multi-core CPU and GPU parallel computing technology, with the aim of improving computing performance while reducing the memory and time required. The study has three stages. First, the basic direct algorithms are parallelized on a multi-core platform and compared for precision and efficiency in order to find the most suitable linear solver. Second, the Cholesky algorithm, identified as the most suitable, is taken as the basis and is optimized and improved with reference to the blocked (tiled) Cholesky partitioning described in the literature. Finally, the optimized blocked Cholesky algorithm is implemented on multi-core CPU and GPU parallel computing platforms. On a four-core system, the multi-core implementation achieves a speed-up of more than 2.3 over a single core for band matrices with bandwidth greater than 100, and more than 3.3 for bandwidth greater than 1000. On the GPU with CUDA, performance improves by more than a factor of 10, and the GPU's floating-point performance improves further as the matrix bandwidth increases.
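For readers unfamiliar with the factorization named in the abstract, the following is a minimal serial sketch of Cholesky factorization restricted to a band. It is not the thesis's blocked, parallel implementation: the function name `band_cholesky`, the dense list-of-lists storage, and the half-bandwidth parameter `bw` are assumptions made for illustration only. It shows why a band structure saves work: entries farther than `bw` from the diagonal are never read or written.

```python
import math


def band_cholesky(a, bw):
    """Return lower-triangular L with A = L * L^T.

    Assumes `a` is a symmetric positive definite matrix (list of lists)
    whose entries are zero whenever |i - j| > bw. Hypothetical helper,
    written for illustration; dense storage is used only for clarity.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        lo = max(0, j - bw)  # first column of row j that lies inside the band
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(lo, j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Only rows j+1 .. j+bw have a nonzero entry in column j.
        # These rows do not depend on each other for a fixed j.
        for i in range(j + 1, min(n, j + bw + 1)):
            k0 = max(0, i - bw)  # row i has no entries left of column i - bw
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(k0, j))
            L[i][j] = s / L[j][j]
    return L


# Tridiagonal example: half-bandwidth 1.
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
L = band_cholesky(A, 1)
```

For a fixed column `j`, the rows updated in the inner loop are independent of one another, which is the kind of loop a multi-core or GPU implementation could distribute. How the thesis actually partitions the work into blocks is not described in this record.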
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079816585
http://hdl.handle.net/11536/47337
Appears in Collections:Thesis


Files in This Item:

  1. 658502.pdf
  2. 658503.pdf