Title: Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
Authors: Shun-Chieh Chang (張順傑); Chung-Ping Chung (鍾崇斌)
Institute of Computer Science and Engineering
Keywords: Embedded System; Load-to-Use Latency; Deep Pipeline Processor; Superscalar; Load Instruction
Issue Date: 2008
Abstract: To achieve high instruction throughput, high-performance processors tend toward wider issue widths and deeper pipelines. As the pipeline gets deeper and wider, instruction execution latency grows, and in an in-order processor this longer latency induces more pipeline stall cycles. The conventional solution is out-of-order instruction issue and execution, but it is too expensive for some applications, such as embedded processors. A more economical solution for low-cost designs is to execute only certain critical instructions out of order. This thesis focuses on load instructions, because of their frequent occurrence and long execution latency in a deep pipeline. If a subsequent instruction depends on a load, it must stall in the pipeline until the load completes; the maximum possible number of stall cycles is called the load-to-use latency. In this thesis, we propose a hardware method, called early load, that hides load-to-use latency by executing load instructions early whenever the load/store unit is idle. Early load identifies load instructions early and allows them to fetch their data from memory before their in-order turn to execute. In addition, we propose an error-detection method that stops or invalidates incorrect early loads, ensuring the correctness of early-loaded data without extra performance loss. Early load thus both hides load-to-use latency and reduces load/store unit contention, at only a small hardware cost. Our experiments on a 12-stage in-order dual-issue processor show that early load achieves an 11.64% performance improvement on the Dhrystone benchmark, and on the MiBench benchmark suite a maximum improvement of 18.60% with an overall average improvement of 5.15%. Meanwhile, early load adds 24.08% more memory accesses relative to all load instructions. The hardware cost is about ten thousand transistors plus the corresponding control circuits.
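The early-load idea described in the abstract can be illustrated with a minimal behavioral sketch (not the thesis's RTL design): a load is speculatively issued to memory while the load/store unit is idle, and any intervening store to the same address invalidates the early result, forcing a re-execution at the load's in-order slot. The class name `EarlyLoadUnit` and its methods are illustrative assumptions, not taken from the thesis.

```python
# Behavioral sketch of early load with store-based error detection.
# Assumed model (not the thesis's actual hardware): memory is a simple
# address -> value dict, and validity is tracked per pending early load.

class EarlyLoadUnit:
    def __init__(self, memory):
        self.memory = memory      # address -> value
        self.pending = {}         # load id -> [addr, early value, valid]

    def issue_early(self, load_id, addr):
        """Speculatively read memory while the load/store unit is idle."""
        self.pending[load_id] = [addr, self.memory.get(addr, 0), True]

    def observe_store(self, addr):
        """Error detection: a store to a pending load's address means the
        early value may be stale, so mark it invalid."""
        for entry in self.pending.values():
            if entry[0] == addr:
                entry[2] = False

    def commit(self, load_id):
        """At the load's in-order position: use the early value if still
        valid (stall cycles hidden); otherwise re-execute the load."""
        addr, value, valid = self.pending.pop(load_id)
        if valid:
            return value, True                    # early load succeeded
        return self.memory.get(addr, 0), False    # squashed, re-loaded

mem = {0x100: 7}
elu = EarlyLoadUnit(mem)
elu.issue_early("ld1", 0x100)   # load/store unit idle: load early
mem[0x100] = 9                  # an intervening store...
elu.observe_store(0x100)        # ...is detected; early value invalidated
value, used_early = elu.commit("ld1")
print(value, used_early)        # 9 False: re-executed, correctness kept
```

When no store intervenes, `commit` returns the early value immediately, which is how the stall cycles of the load-to-use latency are hidden; the re-execution path corresponds to the abstract's claim that incorrect early loads are invalidated without extra performance degradation beyond the normal in-order load.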
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009555605
http://hdl.handle.net/11536/39556
Appears in Collections: Thesis