Title: Building a Two-level Chinese Full-text Retrieval System
(original title: 雙層式中文全文檢索系統之建立)
Authors: 鍾問星
Chung, Wen-Shin
李錫堅
Hsi-Jian Lee
Institute of Computer Science and Engineering
Keywords: Chinese retrieval; Retrieval; Inverted-file
Issue Date: 1997
Abstract: This thesis proposes a Chinese full-text retrieval system that covers both index building and query searching. Taking the properties of Chinese text into account, we use the inverted file method to design a two-level retrieval system that lowers the memory requirement and offers a workable mechanism for Chinese retrieval.

The system has two main phases: the index building phase and the pattern searching phase. In the index building phase, both levels of the index are built with the inverted file method. The first-level index maps each pair of Chinese characters (bi-character) in the database to the names of the files in which it occurs. The second level is a conventional inverted file built per text file: it maps each Chinese character in a file to its relative positions within that file. In the pattern searching phase, a query string entered through the user interface is first divided into bi-characters. The first-level index is consulted to find the files in which these bi-characters may occur, the second-level indices of those files are loaded into memory, and the search is run on them to locate every occurrence of the query string.

Experimental results show that the two-level scheme relieves the heavy system load of a plain inverted-file system and reduces the required memory by about 43%. The system also provides similarity searching, error-tolerant searching, and Boolean searching, and it offers database managers a document checking tool for use before documents are indexed.
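The two-level scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `build_index` and `search`, the in-memory dictionaries, and the use of overlapping adjacent character pairs are all assumptions made for illustration, and the on-demand loading of second-level tables from disk is only indicated by a comment.

```python
from collections import defaultdict


def build_index(docs):
    """Build both index levels from docs, a dict of file name -> text.

    Hypothetical layout (an assumption, not taken from the thesis):
      first_level:  character pair -> set of file names containing it
      second_level: file name -> {character -> list of positions}
    """
    first_level = defaultdict(set)
    second_level = {}
    for name, text in docs.items():
        positions = defaultdict(list)
        for i, ch in enumerate(text):
            positions[ch].append(i)
            if i + 1 < len(text):
                # Overlapping adjacent pairs are assumed here.
                first_level[text[i:i + 2]].add(name)
        second_level[name] = dict(positions)
    return dict(first_level), second_level


def search(query, first_level, second_level):
    """Return {file name: [start offsets]} for exact matches of query."""
    if not query:
        return {}
    pairs = [query[i:i + 2] for i in range(len(query) - 1)]
    if pairs:
        # Level 1: keep only files that contain every pair of the query.
        candidates = set.intersection(
            *(first_level.get(p, set()) for p in pairs)
        )
    else:
        # A one-character query has no pairs, so every file is a candidate.
        candidates = set(second_level)
    hits = {}
    for name in sorted(candidates):
        # Level 2: a real system would load this table from disk on demand.
        table = second_level[name]
        pos_sets = {c: set(table.get(c, ())) for c in set(query)}
        starts = [
            p for p in table.get(query[0], [])
            if all(p + k in pos_sets[c] for k, c in enumerate(query))
        ]
        if starts:
            hits[name] = starts
    return hits
```

In this sketch the first level only narrows the set of candidate files. A file can contain every pair of the query without containing the query itself, so the position check against the second-level table is what confirms an actual occurrence.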
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT860392039
http://hdl.handle.net/11536/62770
Appears in Collections: Thesis