Title: 多處理器系統與連結網路之局部偵錯診斷研究
System Level Local Diagnosability of Multiprocessor Systems and Interconnection Networks
Authors: 譚建民
Keywords: 偵錯診斷;PMC模型;診斷性;可靠性;診斷演算法;Fault diagnosis;PMC model;diagnosability;reliability;diagnosis algorithm.
Issue Date: 2011
Abstract: 我的研究重心著重在多處理器系統以及連結網路上的偵錯診斷。過去幾年間,敝實驗室發展出許多相關研究主題,並培養出多位博士人才,尤其於近五年內發表有數十多篇相關主題之國際期刊論文。 我們計畫延續之前的研究,繼續針對多處理器系統以及連結網路上的偵錯能力作深入的探討,包括了近些年來我們定義出來的新概念:強診斷系統、條件式診斷能力、以及局部診斷能力等等。就在今年和去年,我和我所指導的博士班學生也以這幾項新概念發表了兩篇論文刊登在IEEE Transaction on Computer期刊以及IEEE Transaction on Parallel and Distributed Systems期刊之上,另外這五年間的研究,也共有5篇論文刊登在上述兩個著名的IEEE等級期刊上。 在多處理器系統上為了達成系統自我偵錯診斷,過去的文獻提出了一些診斷模式。Preparata, Metze, and Chien 針對多處理器系統上的系統層診斷狀態,提出一個現在名為PMC的診斷模型(PMC model),其診斷模型是以系統中每兩個處理器之間相互作測試為主要的基礎。另外,Maeng and Malek也提出了一個實用的診斷模型,名為比較模型(comparison model) 或MM模型;系統由某一個處理器向相連的處理器散發出測試訊息,再將返傳回來的測試訊息兩兩作比較。在這個狀態之下,這個處理器可稱為另外兩個被測試處理器之訊息比較者(comparator);若訊息比較者本身是好的,回傳的測試訊息就能指出被測試處理器的好壞狀態。以此概念為基礎,另一個延伸的MM*模型則將系統中所有訊息完全蒐集,再進而偵測所有處理器的好壞狀態。之後,Sengupta and Dahbura研究了MM模型和MM*模型,討論有關比較模型之下的系統診斷狀態,並提出一個能在多項式時間O(n5)內完成的演算法來偵測多處理器系統或連結網路中的所有損壞狀況。 在這個提案中,我們將會以最近剛研發出來的新概念–相對於傳統的全域診斷概念–來探討多處理器系統中的局部診斷能力。除此之外,以這新概念為基礎,我們還提出一個更有效率O(n*logn)的演算法來作偵錯診斷。在此同時,我們也有一些新研究成果是關於容錯迴圈(fault-tolerant cycle)和容錯路徑(fault-tolerant path)與連結網路之間之嵌入性。這些結果也正在修正整理當中,並且預計在近期內再次投稿於國際期刊IEEE Transaction on Computers以及IEEE Transaction on Parallel and Distributed Systems之上。
In the recent years, my research focuses on the system diagnosis of multiprocessor systems and interconnection networks. In the past few years, we have developed several research topics and published many international journal papers. Besides, there are more than 10 PhD students graduated from my laboratory these years. We plan to continue our studies on diagnosis problem of multiprocessor systems and interconnection networks. We have defined several new concepts, such as strongly t-diagnosable systems, conditional diagnosability, and local diagnosability. I, supervising my PhD students, have already published 5 journal papers on these topics, all in IEEE Transaction on Computers and IEEE Transaction on Parallel and Distributed Systems in the past five years. Recently, several different models have been proposed in literatures for system diagnosis. Preparata, Metze, and Chien first introduced a model, so called the PMC model, for system level diagnosis in multiprocessor systems. In this model, it is assumed that a processor can test the faulty or fault-free status of another processor. In addition to this model, Maeng and Malek proposed another one, called the MM model, which is another practical approach for fault diagnosis in multiprocessor systems. In this approach, the diagnosis is carried out by sending the same testing task to a pair {u, v} of processors and comparing their responses. The comparison is performed by a third processor w that has direct communication links to both processors u and v. The third processor w is called a comparator of u and v. If the comparator is fault-free, a disagreement between the two responses is an indication of the existence of a faulty processor. To gain as much knowledge as possible about the faulty status of the system, it was assumed that a comparison is performed by each processor for each pair of distinct neighbors with which it can communicate directly. This special case of MM-model is referred to as the MM*-model. Based on the MM* model, Sengupta and Dahbura gave a characterization of diagnosable systems under the comparison approach, and proposed a polynomial time O(n5) algorithm to determine faulty processors under MM*-model. In this proposal, we study the diagnosis problem of multiprocessor systems using the concept of local diagnosis newly defined: local diagnosability of a node as oppose to the traditional global diagnosability. We obtain an O(n*logn) algorithm to determine all the faulty processors. In the mean time, we also have some new results on fault-tolerant cycle and path embedding of interconnection networks. We are currently writing papers on local diagnosability and plan to submit our new result to IEEE Transaction on Computers and IEEE Transaction on Parallel and Distributed Systems.
Gov't Doc #: NSC99-2221-E009-083-MY3
URI: http://hdl.handle.net/11536/99275
Appears in Collections:Research Plans