Title: 以模糊理論與高頻項目集為基礎之文件分群研究
Fuzzy Frequent Itemset-based Textual Document Clustering
Authors: 陳淳齡
Chen, Chun-Ling
梁 婷
曾守正
Liang, Tyne
Tseng, Frank S.C.
Institute of Computer Science and Engineering
Keywords: Document Clustering; Text Mining; Association Rule Mining; Frequent Itemsets; Fuzzy Set Theory; WordNet
Issue Date: 2009
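Abstract: As the volume of textual documents grows rapidly, document clustering can be used to manage these large collections effectively and to ease later retrieval and browsing. To improve clustering quality, researchers have in recent years applied the frequent itemsets produced by association rule mining to document clustering, addressing several common problems: high-dimensional vocabularies, execution efficiency, clustering accuracy, and the automatic generation of meaningful cluster labels. However, association rule mining tends to overlook key terms that are important but occur infrequently; moreover, when items are highly correlated, it produces too many frequent itemsets, which makes clustering take too long. This study therefore proposes three document clustering methods based on fuzzy theory and frequent itemsets. They use the fuzzy frequent itemsets produced by fuzzy association rule mining to reduce term dimensionality effectively, and they classify each term as a high-, mid-, or low-frequency term according to its distribution and frequency of occurrence in the document collection. First, we propose the Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F2IHC) method, which uses fuzzy association rule mining to find associations among key terms, generates candidate clusters from the fuzzy frequent itemsets, and clusters documents by computing the similarity between each document and the candidate clusters. The clustering result is presented as a hierarchical cluster tree so that the resulting clusters are easy to browse. Second, to label clusters automatically with conceptual terms, we propose the Fuzzy Frequent Itemset-based Document Clustering (F2IDC) method, which uses WordNet to explore the semantic relations among key terms and adds the hypernyms obtained from WordNet to the documents, so that conceptual cluster labels can be extracted to represent cluster topics. Third, we propose the Fuzzy Frequent Itemset-based Soft Clustering (F2ISC) method, which extends F2IDC and adopts the α-cut of fuzzy theory so that a document can be assigned to one or more clusters. Because the fuzzy frequent itemsets reduce term dimensionality, and the fuzzy frequent itemsets generated do not increase with the number of documents, the methods can be applied effectively to large document collections. Compared with traditional clustering methods, experimental results show that the proposed methods effectively improve the accuracy and efficiency of document clustering.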
With the rapid growth of text documents, document clustering techniques are emerging for efficient document retrieval and better document browsing. Recently, some methods have been proposed that resolve the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels by using frequent itemsets derived from association rule mining to cluster documents. However, two problems remain when association rule mining is used: (1) important but sparse key terms may be obscured; (2) too many itemsets are produced, especially when items in the dataset are highly correlated. Moreover, frequent itemset-based clustering methods usually need a lot of time to generate the large number of itemsets. Considering these two issues, we present three fuzzy frequent itemset-based document clustering approaches, which use fuzzy association rule mining to provide significant dimensionality reduction over interesting fuzzy frequent itemsets. By applying fuzzy association rule mining, each term in the document dataset is labeled with a linguistic term, such as Low, Mid, or High. First, we propose the Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F2IHC) approach, which employs fuzzy set theory for document representation to find suitable fuzzy frequent itemsets for clustering documents. In addition, F2IHC constructs a hierarchical cluster tree to provide flexible browsing. Second, in order to label clusters with conceptual terms, we present the Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach, which uses WordNet as background knowledge to explore better ways of representing documents semantically for clustering. F2IDC presents a means of dynamically deriving a hierarchical organization of hypernymy from WordNet based on the content of each document, without the use of training data or standard clustering techniques.
Third, we propose the Fuzzy Frequent Itemset-based Soft Clustering (F2ISC) approach, which extends F2IDC to allow overlapping clusters. F2ISC provides an accurate measure of confidence and adopts the α-cut concept to assign each document to one or more clusters. In all three approaches, the interesting fuzzy frequent itemsets are used to reduce the dimensionality of the term vectors. In addition, the number of these itemsets does not increase as the number of documents grows. Hence, our approaches scale well to large document collections. Our experimental results show that the proposed F2IHC, F2IDC, and F2ISC approaches provide more accurate clustering results than prior influential clustering methods presented in recent literature.
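Two of the fuzzy ingredients named in the abstract, labeling a term with the linguistic terms Low, Mid, and High and using an α-cut to place a document in one or more clusters, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `fuzzy_labels` and `alpha_cut_assign`, the membership breakpoints (2, 5, 8), the fallback to the single best cluster, and the example scores are all assumptions made for illustration.

```python
def fuzzy_labels(freq, low_max=2.0, mid_peak=5.0, high_min=8.0):
    """Membership degrees of a term frequency in Low, Mid, and High.

    The breakpoints are hypothetical; the abstract does not state them.
    """
    # Low: full membership up to low_max, falling linearly to 0 at mid_peak.
    if freq <= low_max:
        low = 1.0
    elif freq < mid_peak:
        low = (mid_peak - freq) / (mid_peak - low_max)
    else:
        low = 0.0

    # Mid: triangle rising from low_max to mid_peak, falling to 0 at high_min.
    if low_max < freq <= mid_peak:
        mid = (freq - low_max) / (mid_peak - low_max)
    elif mid_peak < freq < high_min:
        mid = (high_min - freq) / (high_min - mid_peak)
    else:
        mid = 0.0

    # High: 0 up to mid_peak, rising linearly to full membership at high_min.
    if freq >= high_min:
        high = 1.0
    elif freq > mid_peak:
        high = (freq - mid_peak) / (high_min - mid_peak)
    else:
        high = 0.0

    return {"Low": low, "Mid": mid, "High": high}


def alpha_cut_assign(scores, alpha):
    """Return every cluster whose membership score is at least alpha.

    Assumption: if no cluster reaches alpha, the document still goes to
    its single best-scoring cluster, so it always lands in at least one.
    """
    chosen = [cluster for cluster, score in scores.items() if score >= alpha]
    if not chosen:
        chosen = [max(scores, key=scores.get)]
    return sorted(chosen)


if __name__ == "__main__":
    print(fuzzy_labels(3.5))
    # Hypothetical document-to-cluster membership scores.
    doc_scores = {"sports": 0.8, "finance": 0.6, "travel": 0.1}
    print(alpha_cut_assign(doc_scores, alpha=0.5))
    print(alpha_cut_assign(doc_scores, alpha=0.9))
```

Raising `alpha` makes the assignment stricter (closer to hard clustering), while lowering it lets a document overlap more clusters.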
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079323804
http://hdl.handle.net/11536/40583
Appears in Collections: Thesis


Files in This Item:

  1. 380401.pdf