Title: Study on Improving Utilization for Low-Power Multithreaded Datapath with Composite Functional Units
Authors: Yi Cho (卓毅)
Chih-Wei Liu (劉志尉)
Institute of Electronics
Keywords: low-power; multithreaded; datapath; functional unit; hardware utilization
Issue Date: 2007
Abstract: Observing the evolution of processors in recent years, we find that Reduced Instruction Set Computer (RISC) processors have become a mainstream design. Their simple, regular instruction sets lend themselves to pipelining, which raises performance. However, hardware utilization is low because each issued instruction performs only one operation. Multi-issue processors, that is, Very Long Instruction Word (VLIW) processors, exploit instruction-level parallelism (ILP) to raise hardware utilization, but their register file (RF) area grows sharply as the number of functional units (FUs) increases, which is a heavy hardware cost. In this thesis, we propose a datapath with composite FUs. A composite FU cascades several functional units in a customized order so that a single instruction performs several consecutive primitive operations, which raises hardware utilization. The number of RF read and write ports increases by at most one or stays unchanged, which relieves the RF area pressure seen in VLIW. In addition, because a composite FU performs several operations between fetching its operands and writing back the result, the total number of register accesses falls, which lowers power. We also integrate pipelining into the design flow to raise the operating frequency, and combine it with an interleaved multithreaded (IMT) architecture to fully hide the instruction latency that pipelining introduces. We further propose a recursive composite FU generator that automatically produces an optimized composite FU by analyzing the data-flow graph (DFG) of the user's application. Across several typical DSP kernels, the MSA composite FU (a multiplier M cascaded with a shifter S and then an adder A) achieves a hardware utilization of 1.35 operations per cycle, compared with 1.00 for RISC. In synthesis with the TSMC 0.13 um process at equal performance, the register file area of the composite FU design is about 10% larger than that of RISC and about 50% smaller than that of VLIW. The composite FU design consumes 16.6% to 31.6% less power than RISC and VLIW.
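The register-access saving described in the abstract can be illustrated with a minimal sketch. The expression `y = ((a * b) >> 2) + c`, the instruction tuples, and the helper `register_accesses` are all hypothetical choices made for illustration and are not taken from the thesis; the shift amount is assumed to be an immediate, so it costs no register read.

```python
# Hypothetical model: each instruction is (destination, [source registers]).
def register_accesses(instructions):
    """Return (reads, writes) to the register file for an instruction list."""
    reads = sum(len(sources) for _, sources in instructions)
    writes = len(instructions)  # every instruction writes one destination
    return reads, writes

# y = ((a * b) >> 2) + c as three RISC-style instructions (assumed encoding).
risc = [
    ("t1", ["a", "b"]),   # mul t1, a, b
    ("t2", ["t1"]),       # shr t2, t1, #2  (immediate shift amount)
    ("y",  ["t2", "c"]),  # add y, t2, c
]

# The same computation as one MSA composite instruction (assumed encoding):
# intermediate results stay inside the cascaded units and never touch the RF.
composite = [("y", ["a", "b", "c"])]

risc_reads, risc_writes = register_accesses(risc)
comp_reads, comp_writes = register_accesses(composite)
print("RISC:     ", risc_reads, "reads,", risc_writes, "writes")
print("Composite:", comp_reads, "reads,", comp_writes, "writes")
```

In this toy case the composite instruction needs three source operands instead of two, which matches the abstract's statement that the port count grows by at most one, while the intermediate values `t1` and `t2` are never written to or read from the register file.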
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009411619
http://hdl.handle.net/11536/80532
Appears in Collections: Thesis


Files in This Item:

  1. 161901.pdf