標題: 最長連續共同子序列與相關字搜尋演算法
Longest Contiguous Common Subsequences and Resemblance Words Finding Algorithms
作者: 周冠宇
Kuan-Yu Chou
杜敏文
Min-Wen Du
資訊科學與工程研究所
關鍵字: 最長連續共同子序列;演算法;相關字;最長共同子序列;字串比對;特徵;絡;longest contiguous common subsequence;algorithm;resemblance word;longest common subsequence;string matching;signature;lattice
公開日期: 1999
摘要: 本論文提出一個相關字搜尋的問題,此問題與傳統近似字比對問題類似。傳統近似字比對問題的字距定義通常為『編輯距離』,因此常用於拼字或打字上錯誤的校正;而我們提出的相關字搜尋問題則是在一個很大的字集中尋找可能具有相關涵義的字。在此篇論文中,我們以兩個字的最長連續共同子序列之長度當作它們之間相關度的衡量標準。文中提出一個『連續子序列索引絡』的結構,藉由分析此結構的特性,我們設計找出最長連續共同子序列的方法。同時我們也提出一種求得連續子序列特徵的方法,並用此特徵做索引,提出一個有效率解決相關字搜尋問題的演算法。最後我們以實驗證明:時下的一般電腦能夠以此演算法即時地搜尋出我們所定義最相關的字。
In this thesis, we formulated a Resemblance Words Finding (RWF) problem. It is very similar to the traditional Approximate String Matching (ASM) problem. In a traditional ASM problem, the similarity measure between two words is usually defined by the "editing distance", making it to be more suitable for spelling or typing error correction. The RWF problem, however, is to find words containing components that most resemble a component of a given word where the similarity measure is defined as the length of the Longest Contiguous Common Subsequences (LCCSs) measure between two words. A "Contiguous Substring Index Lattice" (CISL) structure is presented. By analyzing properties of the CISL structure, we have designed algorithms for finding LCCSs of two character strings. Based on the same CISL structure, we have also defined signatures of contiguous subsequences to construct an index scheme and developed efficient RWF algorithms. Our experimental results show that the final RWF algorithm developed in this thesis can find the words with LCCS in real-time by an ordinary computer.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880394037
http://hdl.handle.net/11536/65532
Appears in Collections:Thesis