Title: Multi-iteration Dataflow Execution for Loops
Authors: Jie-Shiou Liu
Jean Jyh-Jiun Shann
Institute of Computer Science and Engineering
Keywords: dataflow; loops; multi-iteration
Issue Date: 2000
Abstract: Superscalar is the dominant architecture of contemporary high-performance processors. Fetching many instructions in a single cycle is essential to such designs, but in a CISC architecture such as x86 the variable length of instructions makes multi-instruction fetch difficult, and the limited fetch rate can bound the performance of a CISC superscalar processor. Exploiting the regular behavior of loops, many techniques have been proposed to collect loop instructions and thereby improve fetch efficiency or save fetch energy, such as the Loop Cache and the Loop Buffer; these techniques, however, do not directly exploit instruction-level parallelism. Conversely, loop unrolling is a compiler technique that exploits instruction-level parallelism by rescheduling the instructions inside a loop, but it cannot increase the fetch rate. This thesis therefore proposes a new mechanism for handling loops, called Multi-iteration Dataflow Execution for Loops, which applies dataflow concepts to execute multiple loop iterations in parallel. By dynamically collecting the instructions of a loop, it eases the difficulty of x86 instruction fetch and exploits the parallelism among loop instructions. The design dynamically separates the loop-control information from the loop body and translates the loop body into a dataflow graph; the control information is then used to schedule multiple iterations for parallel execution in dataflow fashion. According to our simulation results, with three iterations executing in parallel, the instruction-level parallelism obtained for simple loops already reaches the saturation value achievable by an ideal superscalar processor.
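To make the idea concrete, the following is a minimal, hypothetical sketch (not the thesis's actual design) of dataflow-style execution across loop iterations. It models a three-instruction loop body (load, multiply, add with a loop-carried accumulation) in which each instruction fires as soon as its operands are ready, assuming unit latency and unlimited issue width; all function and variable names are illustrative assumptions.

```python
def schedule(iterations):
    """Return the completion cycle of each instruction of each iteration,
    under dataflow firing rules: an instruction issues as soon as all of
    its operands are ready (unit latency, unlimited issue width).

    Modeled loop body (per iteration i):
        t1 = a[i]        # load: no dependences, fires immediately
        t2 = t1 * 2      # mul:  depends on this iteration's load
        s  = s + t2      # add:  depends on mul AND the previous add
                         #       (loop-carried dependence)
    """
    finish = {}
    prev_add = 0  # completion cycle of the previous iteration's add
    for i in range(iterations):
        load = 1                       # independent across iterations
        mul = load + 1                 # intra-iteration dependence
        add = max(mul, prev_add) + 1   # loop-carried dependence on 's'
        finish[i] = (load, mul, add)
        prev_add = add
    return finish

sched = schedule(3)
total_cycles = max(add for _, _, add in sched.values())
ipc = (3 * 3) / total_cycles  # 9 instructions over the total cycles
print(total_cycles, ipc)      # 5 cycles, IPC 1.8
```

Because the loads and multiplies of different iterations are independent, they overlap across iterations; only the loop-carried addition serializes, so three iterations finish in 5 cycles instead of the 9 a strictly sequential execution would need, in the spirit of the multi-iteration parallelism the abstract describes.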
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT890392091
http://hdl.handle.net/11536/66883
Appears in Collections:Thesis