Title: Prediction of DNA-Binding Sites in Proteins
Authors: Fu-Chieh Yu (游富傑)
Shinn-Ying Ho (何信瑩)
Institute of Bioinformatics and Systems Biology
Keywords: DNA-binding sites in proteins; DNA-binding proteins; position-specific scoring matrix (PSSM); support vector machine (SVM); fuzzy k-nearest neighbors (fuzzy k-NN)
Issue date: 2005
Abstract: This study investigates the design of accurate predictors of DNA-binding sites in proteins from amino acid sequences. Two classifiers, a support vector machine (SVM) and fuzzy k-nearest neighbors (fuzzy k-NN), are used to predict DNA-binding sites. The best-performing approach is a hybrid method, SVM-PSSM, which combines an SVM with the evolutionary information of amino acid sequences encoded in position-specific scoring matrices (PSSMs) derived from multiple sequence alignments. Because the numbers of binding and non-binding residues in proteins are markedly unequal, two additional weights that address this imbalance are optimized together with the standard SVM parameters to maximize net prediction (NP), the average of sensitivity (accuracy on binding residues) and specificity (accuracy on non-binding residues). To evaluate the generalization ability of SVM-PSSM, an additional dataset, PDC-59, was assembled from protein-DNA complex crystal structures as an independent test set; it consists of 59 protein chains with low mutual sequence identity. Using six-fold cross-validation and PSSM features, the SVM-based method achieves NP = 80.15% on the training dataset PDNA-62 and NP = 69.54% on the independent test dataset PDC-59, improving on the best existing method, which is based on neural networks, by 13.45% and 16.53%, respectively. In addition to the PSSM features, three physicochemical properties of amino acids related to protein-DNA interactions (solvent-accessible surface area, electric charge, and hydropathy index) are also used as SVM input features and analyzed. The results indicate that SVM-PSSM performs well in predicting DNA-binding sites of novel proteins from amino acid sequences.
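The net prediction (NP) measure defined in the abstract can be illustrated with a minimal Python sketch. The function name `net_prediction`, the 0/1 label encoding (1 = binding residue), and the toy labels are assumptions made for illustration only; they are not taken from the thesis, and non-binding residues are merely assumed to be the majority class here.

```python
def net_prediction(y_true, y_pred):
    """Return (sensitivity, specificity, NP) for binary labels.

    Assumed encoding: 1 = DNA-binding residue, 0 = non-binding residue.
    NP is the average of sensitivity and specificity.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # binding, predicted binding
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # binding, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-binding, predicted non-binding
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-binding, predicted binding
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2


def accuracy(y_true, y_pred):
    """Plain overall accuracy, shown only for comparison with NP."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels (made up, not data from the thesis): 2 binding, 8 non-binding residues.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10                        # never predicts a binding site
mixed = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]         # finds one of the two binding sites

for name, y_pred in [("all_negative", all_negative), ("mixed", mixed)]:
    sens, spec, np_score = net_prediction(y_true, y_pred)
    print(name, accuracy(y_true, y_pred), sens, spec, np_score)
```

In this toy case both predictors have the same overall accuracy, but the one that never predicts a binding site scores a lower NP. This is the kind of behaviour an imbalance-aware measure is meant to expose.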
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009351509
http://hdl.handle.net/11536/79862
Appears in collections: Theses


Files in this item:

  1. 150901.pdf