標題: 現代網路搜尋引擎中轉置檔快取系統之設計
A Cache Mechanism for the Inverted Files in Modern Search Engines
作者: 王俊程
Chung Cheng Wang
單智君
Jean Jyh-Jiun Shann
資訊科學與工程研究所
關鍵字: 轉置檔;搜尋引擎;雜湊函數;衝突;轉置檔快取;inverted file;search engine;hashing function;hash function;collision;inverted file cache
公開日期: 1999
摘要: 由於目前網際網路中之資料庫搜尋系統越趨龐大,因此普遍採用轉置檔 (inverted file) 搜尋系統來加速文件的搜尋速度。然而,轉置檔的大小通常是原本資料量的1至3倍,所以直接搜尋轉置檔將會導致頻繁的磁碟讀取運作。因此,設計一個轉置檔的快取系統,將可以有效減少磁碟讀取運作,增加搜尋系統的效能。 在本篇論文中,我們首先討論適合用在轉置檔資料結構上的傳統索引方法,然後提出一個適用於轉置檔快取系統的新的索引方法稱為具虛鏈之增強型雜湊函數 (enhanced hashing function with pseudo link)。此方法乃根據傳統雜湊函數的觀念,並在其處理衝突 (collision)與管理記憶體的方式上作改良,使得新的方法可以減少發生衝突時搜尋索引表的次數並增加轉置檔快取系統的使用率與隨機存取的效能。實驗結果顯示,轉置檔快取系統大約有65% 的命中率,與沒有轉置檔快取系統的搜尋引擎比較,使用傳統索引方法的轉置檔快取系統可以使搜尋引擎系統效能增加大約59.2%,使用新的索引方法的轉置檔快取系統可以使搜尋引擎系統效能增加大約81%。
The size of the database of the modern search engine in the Internet becomes larger and larger today. The inverted file is commonly used in the modern search engine in order to improve the performance of search engine. However, the size of the inverted file is very large. If we access the inverted file directly, frequent disk I/O operations will occur and become the bottleneck in an information retrieval system. Therefore, we add a cache inside the search engine, called inverted file cache, to reduce the number of the disk I/O operations. In this thesis, first, we discuss the traditional data structure that can be used for indexing the inverted file. And then we propose a new method, called enhanced hashing function with pseudo link, which is suitable for indexing inverted file cache. Comparing with the traditional hashing function, this new method improves the processes of the collision handling and the memory management to increase the utilization of the inverted file cache and the performance of random accesses in the cache. The simulation results show that the hit rate of the inverted file cache is about 65%. Comparing to the search engine without the inverted file cache, the performance of search engine with the inverted file cache in traditional method will increase about 59.2% and that in new method will increase up to 81%.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880392047
http://hdl.handle.net/11536/65444
Appears in Collections:Thesis