標題: 時序資料庫中高效率頻繁樣式探勘演算法之研究
Efficient Algorithms for Mining Frequent Patterns in Temporal Databases
作者: 朱俊榮
梁婷
曾新穆
資訊科學與工程研究所
關鍵字: 頻繁樣式;關聯法則;時序性資料庫;利益探勘;Frequent patterns mining;Association rules;Temporal databases;Utility mining
公開日期: 2007
摘要: 在資料探勘的領域中,從大型資料庫中尋找資料項目間的頻繁樣式是相當重要之議題。因為資料項目間的頻繁樣式在現實生活中可應用在許多領域裡,譬如超級市場裡商品之間的販售關係。然而頻繁樣式所涵蓋的項目廣泛,其中包含了高頻率項目集、新興高頻率項目集以及高利益項目集等。近年來,由於經濟的關係,市場獲利的需求增加,新興高頻率項目集與高利益項目集成為主要探討的兩個議題。所謂的新興高頻率項目集是指項目集在舊有的資料庫中是低頻率項目集,然而在資料庫新增資料後,這些項目集變成高頻率項目集。至於高利益項目集的探勘,則針對項目集所能得到利益的高低來探討。過去的研究多著重於如何在大型資料庫中快速且正確地尋找這些項目集,減少候選項目集的產生、搜尋資料庫的次數以及記憶體的使用等。然而,傳統的演算法無法直接找尋在時序性資料庫中的頻繁樣式。因此,本論文主旨在研發從時序性資料庫中挖掘出新興高頻率的項目集以及高利益的項目集的高效率探勘方法。 在新興高頻率項目集的探勘上,我們提出了一個有效率的演算法EFI (Emerging Frequent Itemsets)-Mine。此演算法利用移動視窗交集法來預測高頻率項目集,因此不需要像以往的演算法須找出所有的高頻率項目集,以至減少高頻率項目集搜尋所需花費的時間。實驗的結果,我們所提出的演算法較Apriori 演算法在以IBM資料產生器中所產生的資料情形下,減少90%的執行時間。 此外,在本論文我們也提出一個高利益項目集的演算法THUI (Temporal High Utility Itemsets)-Mine。此演算法利用分割移動視窗法與交易權重項目集來減少候選項目集的產生,因此不需要像以往的演算法須找出所有的候選項目集,以至減少搜尋高利益項目集與資料庫次數所需花費的時間。在以IBM資料產生器中所產生的資料下,我們所提出的演算法較Two-Phase演算法,執行時間的改善率平均高達67%。 從利益探勘的特性中,我們可以觀察到資料庫存有許多未滿足門檻值的利益項目集,有些是值得參考的。我們提出了兩個珍貴的利益項目集之演算法TP-RUI (Two-Phase Rare Utility Itemsets) -Mine 和 TRUI (Temporal Rare Utility Itemsets) –Mine。尤其是TRUI-Mine,此演算法利用相對性的門檻值來尋找珍貴的利益項目集,並利用分割移動視窗法與交易權重項目集來減少候選項目集的產生,以至減少尋找珍貴的利益項目集所需花費的時間。 由於過去所探討的高利益項目集都是針對正利益的項目來研究,於是我們提出ㄧ個從大型的資料庫中尋找包含負利益的高利益項目集的演算法HUINIV (High Utility Itemsets with Negative Item Values) -Mine。此演算法利用去除負利益項目的交易權重項目集方法來減少候選項目集產生。實驗的結果,我們所提出的演算法較MEU 演算法在以IBM資料產生器中所產生的資料情形下,減少99%的候選項目集。
Mining frequent patterns for discovering the relationship among data items from large databases is an important topic in data mining since frequent patterns can be applied to wide applications. There exist various kinds of frequent patterns like frequent itemsets, emerging frequent itemsets, high utility itemsets and so on. Recently, emerging pattern mining and utility mining become very popular topics because of the rising of economic applications. Emerging frequent itemsets are those who are infrequent in the current database and then become frequent in the new database temporally added with new data transactions. High utility itemsets reveal the utility of an itemset, which can be measured in terms of cost, profit or other expressions of user preferences. In the past, most studies focus on developing efficient and effective methods for finding frequent itemsets from large database by reducing candidate itemsets, database scans and memory space. However, most methods designed for the traditional databases cannot be directly applied for mining temporal patterns in temporal databases because of the high complexity. Hence, we investigate efficient methods for mining emerging frequent itemsets and high utility itemsets in temporal databases in this dissertation. In emerging pattern mining, a novel approach named EFI (Emerging Frequent Itemsets)-Mine is presented to effectively identify the potential emerging itemsets by crossing sliding windows to predict frequent itemsets such that the execution time can be reduced substantially in mining all frequent itemsets in temporal databases. The experimental results show that EFI delivers 90.6% of improvement over the well-known method Apriori in terms of execution time on various kinds of synthetic datasets. Besides, we also propose a novel method, namely THUI (Temporal High Utility Itemsets)-Mine, for mining temporal high utility itemsets from temporal databases efficiently and effectively. THUI-Mine can effectively identify the temporal high utility itemsets by partitioning sliding window and using transaction-weighted utilization itemsets to generate fewer candidate itemsets such that the execution time and number of database scans can be reduced substantially in mining high utility itemsets from temporal databases. The experimental results show that THUI-Mine delivers 67% of improvement over the well-known method Two-Phase in terms of execution time on various kinds of synthetic datasets. According to the characters of utility mining, we could obverse some utility itemsets are those itemsets which appear infrequently in the current time window of large databases but are highly associated with specific data. Hence, we proposed two novel algorithms, namely TP-RUI (Two-Phase Rare Utility Itemsets) -Mine and TRUI (Temporal Rare Utility Itemsets) –Mine, which can effectively identify the temporal rare utility itemsets by using relative utility threshold. The temporal rare utility itemsets are discovered by partitioning sliding window and using transaction-weighted utilization itemsets to generate fewer candidate itemsets such that the execution time and database scan can be reduced substantially in mining all high and rare utility itemsets in temporal databases. The past studies on high utility itemsets mining considered only positive item values and ignored the cases of negative item values that may occur in real-life applications. Therefore, we propose a novel method, namely HUINIV (High Utility Itemsets with Negative Item Values) -Mine, for mining high utility itemsets from large databases with consideration of negative item values. HUINIV-Mine can effectively identify high utility itemsets by using transaction utility without negative item values to generate fewer candidate itemsets. The experimental results show that HUINIV-Mine delivers 99% of improvement over the well-known method MEU in terms of candidate itemsets on various kinds of synthetic datasets.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009223820
http://hdl.handle.net/11536/76690
Appears in Collections:Thesis


Files in This Item:

  1. 382001.pdf