Title: Nearest Neighbor Algorithm for Symbolic Data Set Classification and Its Application in Bioinformatics
Authors: Yeong-Cheng Chen (陳勇成)
Jyh-Yeong Chang (張志永)
Institute of Electrical and Control Engineering
Keywords: symbolic attributes; symbolic prototype; nearest mean classifier; bioinformatics; machine learning; fuzzy; k-NN; SNM; FSNM; Promoter; Splice; cross-validation
Issue Date: 2003
Abstract: Nearest neighbor algorithms for learning from examples have traditionally worked very well in domains where all features are numeric. In such domains, examples can be treated as points, and distances between them can be computed with standard metrics such as Euclidean distance. Symbolic domains require a more sophisticated treatment of the feature space. A nearest neighbor algorithm for a symbolic feature space computes distance tables that yield real-valued distances between instances, and attaches weights to effective or reliable instances to further modify the structure of the feature space. In this thesis, we present an empirical analysis of symbolic prototype learners for discrete domains. Our symbolic prototype learner is derived by modifying the minimum distance classifier to handle symbolic attributes and attribute weighting, and it learns one prototype for each class. Classification is then performed by the symbolic nearest mean classifier. Beyond a single prototype per class, we can also consider the contributions of the prototype components of all samples in each class. This lets us design a fuzzy prototype for each class and classify with the symbolic nearest mean classifier with fuzzy prototype. We validate the proposed algorithms on three data sets that have been studied by machine learning researchers, two of which are bioinformatics problems: Lenses recognition, identification of DNA promoter sequences, and splice-junction determination. All three show very good classification accuracy. In comparisons with other learning algorithms under several test evaluation methods, our algorithms are superior or comparable in classification accuracy on all three data sets. In addition, our algorithms have advantages in training speed, simplicity, and perspicuity. The experimental evidence is a promising sign for the continued development of nearest neighbor algorithms for symbolic data domains.
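The abstract describes the pipeline only in outline, so the following Python sketch is an illustration under stated assumptions, not the thesis's actual procedure. It assumes a value-difference style distance table (the distance between two symbols of one attribute is the summed difference of their class-conditional frequencies), uniform attribute weights, a crisp prototype chosen per attribute as the symbol with the smallest summed distance to the class members, and a fuzzy prototype whose memberships are simply the relative frequencies of each symbol within the class. All function names, the `unseen` default, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def value_distance_tables(X, y):
    """One table per attribute: table[(u, v)] = sum over classes of |P(c|u) - P(c|v)|.
    Assumption: a value-difference style metric; the thesis may define its tables differently."""
    classes = sorted(set(y))
    tables = []
    for a in range(len(X[0])):
        counts = defaultdict(Counter)  # symbol -> class counts for attribute a
        for row, label in zip(X, y):
            counts[row[a]][label] += 1
        table = {}
        for u, cu in counts.items():
            nu = sum(cu.values())
            for v, cv in counts.items():
                nv = sum(cv.values())
                table[(u, v)] = sum(abs(cu[c] / nu - cv[c] / nv) for c in classes)
        tables.append(table)
    return tables

def learn_prototypes(X, y, tables):
    """Crisp prototype per class: for each attribute, the symbol (among those seen in the
    class) with the smallest summed distance to the class members. Assumed rule."""
    prototypes = {}
    for c in sorted(set(y)):
        members = [row for row, label in zip(X, y) if label == c]
        proto = []
        for a, table in enumerate(tables):
            candidates = sorted({row[a] for row in members})
            proto.append(min(candidates,
                             key=lambda v: sum(table[(v, row[a])] for row in members)))
        prototypes[c] = tuple(proto)
    return prototypes

def learn_fuzzy_prototypes(X, y):
    """Fuzzy prototype per class: for each attribute, a symbol -> membership map.
    Assumption: membership is the relative frequency of the symbol within the class."""
    fuzzy = {}
    for c in sorted(set(y)):
        members = [row for row, label in zip(X, y) if label == c]
        fuzzy[c] = [
            {v: n / len(members) for v, n in Counter(row[a] for row in members).items()}
            for a in range(len(X[0]))
        ]
    return fuzzy

def classify(x, prototypes, tables, unseen=2.0):
    """Assign x to the class whose crisp prototype is nearest (uniform attribute weights).
    Symbol pairs missing from a table get the assumed maximum distance `unseen`."""
    def dist(proto):
        return sum(t.get((xa, pa), unseen) for t, xa, pa in zip(tables, x, proto))
    return min(prototypes, key=lambda c: dist(prototypes[c]))

def classify_fuzzy(x, fuzzy, tables, unseen=2.0):
    """Assign x to the class with the smallest membership-weighted distance."""
    def dist(proto):
        return sum(m * t.get((xa, v), unseen)
                   for t, xa, memb in zip(tables, x, proto)
                   for v, m in memb.items())
    return min(fuzzy, key=lambda c: dist(fuzzy[c]))

# Toy two-position "sequences" with made-up labels, for illustration only.
X = [("a", "c"), ("a", "g"), ("a", "c"), ("t", "g"), ("t", "c"), ("t", "g")]
y = ["pos", "pos", "pos", "neg", "neg", "neg"]
tables = value_distance_tables(X, y)
prototypes = learn_prototypes(X, y, tables)
fuzzy = learn_fuzzy_prototypes(X, y)
print(prototypes)
print(classify(("a", "g"), prototypes, tables), classify_fuzzy(("t", "c"), fuzzy, tables))
```

Training in this sketch is a single counting pass plus a per-class minimization, which is consistent with the abstract's claim of simple and fast training; attribute weighting, which the thesis also learns, is left out here.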
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009112561
http://hdl.handle.net/11536/45168
Appears in Collections: Thesis


Files in This Item:

  1. 256101.pdf