Title: 管線化及叢集化之超長指令集數位信號處理器之高效能資料路徑設計
Efficient Datapath Design for Clustered & Pipelined VLIW DSP Processors
Authors: 蕭丕承
Pi-Chen Hsiao
Chih-Wei Liu
Keywords: 前饋;叢集化;暫存器組;超長指令集;數位信號處理器;管線化;forwarding;clustering;register file;very long instruction word (VLIW);digital signal processor (DSP);pipelining
Issue Date: 2005
Abstract: Most DSP applications feature a high degree of data-level and instruction-level parallelism, which makes clustering and deep pipelining effective ways to improve datapath efficiency. However, the ad-hoc data forwarding and inter-cluster communication networks found in most processors largely offset these advantages. This thesis presents analytical formulae, based on a cell-based implementation with flip-flops and multiplexers, for analyzing the complexity of forwarding units and inter-cluster communication mechanisms. Building on this analysis, we propose a complexity-aware data forwarding architecture and a simple inter-cluster communication mechanism based on load/store instruction pairs. We also introduce the distributed & ping-pong register file to further reduce the complexity of the register file inside each cluster. In experiments with UMC 0.13 µm 1P8M CMOS technology, the proposed forwarding architecture improves cycle time by 13.2%, while the distributed & ping-pong register file combined with the proposed inter-cluster communication mechanism reduces the area and access time of the register file by 76.8% and 46.9%, respectively. For portable applications, we propose a folded datapath that remains binary compatible with existing applications; relative to the original design, it saves 55.33% of the area and increases clock speed by 26.3%. Finally, we implement the proposed forwarding unit and the proposed inter-cluster communication mechanism, together with the distributed & ping-pong register file organization, in a complete 4-way VLIW DSP processor. Implementation and simulation results in the same UMC 0.13 µm 1P8M CMOS process show that it operates at up to 333 MHz and delivers performance comparable to state-of-the-art DSPs.
