Title: 臉部偵測應用在共享記憶體多核心系統中的平行度分析與資料地域性之研究
Parallelism and Data Locality Analysis of Face Detection on a Shared Memory Multi-Core System
Author: 江志軒
Chiang, Chih-Hsuan
賴伯承
Lai, Bo-Cheng
Institute of Electronics
Keywords: face detection; parallel execution; data locality; work-stealing
Publication Date: 2011
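Abstract: Face detection is an important technology for future smart devices. However, its high computational cost makes it difficult to implement on embedded systems. Meanwhile, parallel processing and multi-core architectures have become the mainstream of future high-performance computing systems, so running face detection on a multi-core system is a feasible way to bring it to embedded systems. Before face detection can take full advantage of a parallel system, the parallelism of the application must be exposed. The first part of this thesis analyzes the parallelism of face detection at different levels of the algorithm. It briefly describes the parallelism available at each level, its scalability, and the factors that degrade performance, and from this analysis identifies ways to improve overall system performance. Based on the analysis results and design experience, it proposes a hybrid multi-level parallelization scheme that preserves the scalability of the parallelism while avoiding the limiting factors.

Once the parallelism of face detection has been exposed, a multi-core system becomes a suitable platform, so that the intensive computation of face detection is no longer prohibitively expensive on resource-constrained embedded systems. However, the large volume of memory accesses becomes the performance bottleneck. It limits the scalability of the application and may prevent the system from using the resources of each core effectively. Improving the data locality of cache accesses therefore becomes an important design problem. The second part of this thesis analyzes the memory behavior of the Viola-Jones algorithm on a parallel system and proposes a scheme that improves the data locality of cache accesses, reduces unnecessary data accesses, lightens the load on the inter-processor data network, and reduces contention among processors for memory resources.

A parallel system runs most efficiently when all of its processors carry the same workload, and fine-grained threads make it easier to balance the workload across processors. Coarse-grained threads also have many advantages, but striking a balance between fine-grained and coarse-grained threads places an additional burden on the programmer. The third part of this thesis proposes a work-stealing technique for coarse-grained threads that helps programmers balance the workload across processors more easily, making the system more efficient without placing much extra burden on the programmer.

In summary, the hybrid multi-level parallelization scheme of the first part achieves a 37.5x speedup on a 64-core system when memory access is not a limiting factor. However, memory access has always been a bottleneck in multiprocessor systems, so the second part optimizes data locality, which reduces computation time by 58% relative to the original hybrid multi-level parallelization scheme on a regular 16-core platform. The work-stealing technique proposed in the last part lets the system relieve the load imbalance of an unoptimized program. Its results approach those of an optimized program, so programmers do not need to spend much extra effort on optimization.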
Face detection is one of the fundamental technologies for future smart devices. However, its high computational cost makes it difficult to apply on embedded devices. Parallel processing and many-core architectures have become the mainstream way to achieve high performance in future computing systems, but the parallelism of an application must be exposed before a parallel architecture can exploit it. The first part of this thesis performs a comprehensive analysis of the parallelism of a face detection algorithm at different algorithmic levels. It demonstrates that each parallelism level has its own potential to enhance performance but also imposes some limitations. Based on the results and design experience, this thesis proposes a multi-stage mixed-level parallelization scheme that maintains performance scalability while avoiding the limiting factors.

Intensive computation requirements make object detection an expensive application to run on resource-constrained embedded devices. With the parallelism exposed in the first part of this thesis, parallel processing on multi-core systems boosts overall system performance. However, the memory bottleneck limits performance scalability, so improving the data locality of the on-chip cache becomes a critical design concern. The second part of this thesis analyzes the memory behavior of a parallel Viola-Jones algorithm and proposes a scheme to enhance the data locality of the on-chip cache. The scheme reduces unnecessary data accesses and the communication between processors and main memory.

Balancing the workload among the processors of a parallel system enhances execution efficiency. Fine-grained threads make it easier to achieve load balance between processors, but coarse-grained threads also offer many advantages, so striking a balance between the two parallelization schemes becomes an additional burden on programmers. The third part of this thesis proposes a work-stealing design that lowers the programming effort and also improves system efficiency.

In summary, the first part of this thesis discusses the multi-stage hybrid parallelization scheme, which achieves a 37.5x speedup on a 64-core system. However, memory access has long been a bottleneck in multi-core systems. The second part therefore focuses on optimizing data locality, which brings a 62% reduction in computation time on a regular 16-core system. The final part proposes the work-stealing technique to alleviate the load imbalance of an unoptimized program. This mechanism attains performance similar to that of an optimized program and saves programmers the effort of optimizing it.
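
The load-balancing effect of work stealing on coarse-grained tasks can be illustrated with a small simulation. This is a minimal sketch and not the thesis's implementation: the function name `makespan`, the task costs, the policy of stealing from the longest queue, and the assumption that a steal costs no time are all hypothetical choices made for illustration.

```python
import heapq
from collections import deque


def makespan(queues, steal):
    """Simulate workers draining per-worker task queues and return the finish time.

    queues: one list of task costs per worker (hypothetical coarse-grained tasks).
    steal:  if True, an idle worker takes a task from the longest remaining queue.
    """
    queues = [deque(q) for q in queues]  # copy so the caller's lists are untouched
    # Event heap of (time at which the worker becomes idle, worker id).
    idle = [(0, w) for w in range(len(queues))]
    heapq.heapify(idle)
    finish = 0
    while idle:
        now, w = heapq.heappop(idle)
        if queues[w]:
            cost = queues[w].popleft()   # owner takes from the front of its own queue
        elif steal:
            victim = max(range(len(queues)), key=lambda v: len(queues[v]))
            if not queues[victim]:
                continue                 # no work left anywhere: this worker retires
            cost = queues[victim].pop()  # thief takes from the back of the victim's queue
        else:
            continue                     # static partitioning: this worker retires
        finish = max(finish, now + cost)
        heapq.heappush(idle, (now + cost, w))
    return finish


# Hypothetical imbalance: worker 0 holds all the expensive tasks.
queues = [[4, 4, 4, 4], [1], [1], [1]]
static_time = makespan(queues, steal=False)
stolen_time = makespan(queues, steal=True)
print(static_time, stolen_time)  # prints: 16 5
```

With these assumed costs, static partitioning finishes at time 16 because one worker holds all the expensive tasks, while stealing finishes at time 5, close to the ideal of 19/4 = 4.75 for four workers.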
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079811684
http://hdl.handle.net/11536/46849
Appears in Collections: Thesis


Files in This Item:

  1. 168403.pdf