Title: Prediction of Subcellular Localization and RNA-binding Sites in Proteins (蛋白質細胞定位及核糖核酸結合點之預測)
Authors: Su, Chia-Yu (蘇家玉)
Hsu, Wen-Lian (許聞廉)
Hwang, Jenn-Kang (黃鎮剛)
Institute of Bioinformatics and Systems Biology
Keywords: protein subcellular localization prediction; RNA-binding site prediction in proteins; support vector machines; probabilistic latent semantic analysis; position specific scoring matrix; smoothed PSSM encoding scheme
Issue Date: 2008
Abstract: With the arrival of the post-genomic era, biological databases have accumulated a vast number of protein sequences awaiting analysis, and automated function annotation has become a major goal. Predicting subcellular localization and RNA-binding sites in proteins is crucial for function analysis, genome annotation, and drug target discovery. Determining localization or structure experimentally is expensive and time-consuming, so computational approaches are highly desirable.

We propose two protein subcellular localization prediction methods, PSL101 and PSLDoc. PSL101 combines a structural homology approach with a support vector machine (SVM) model that incorporates compartment-specific biological features derived from bacterial translocation pathways. PSLDoc represents each protein sequence as gapped dipeptides of various distances, incorporates evolutionary information from the position specific scoring matrix (PSSM), applies probabilistic latent semantic analysis to extract sequence features, and then predicts the localization with an SVM. Both methods achieve 93% overall accuracy for Gram-negative bacteria, and on a benchmark dataset with low homology to the training set they improve on the state-of-the-art results by 7.4%. Experimental results demonstrate that both the biological features derived from translocation pathways and the feature reduction obtained with document classification techniques significantly improve prediction performance. Moreover, the proposed biological features and gapped-dipeptide signatures are interpretable and can serve as a reference for further studies and experiment design.

For RNA-binding site prediction, we propose another method, RNAProB, which incorporates a new smoothed PSSM encoding scheme in an SVM model. For each amino acid in a protein sequence, the smoothed PSSM encoding takes into account the correlation and dependency among its neighboring residues. Experimental results show that smoothed PSSM encoding significantly enhances prediction performance, especially sensitivity. Our method outperforms the state-of-the-art systems by 4.90%~6.83%, 7.05%~26.90%, 0.88%~5.33%, and 0.10~0.23 in overall accuracy, sensitivity, specificity, and Matthews correlation coefficient, respectively. This supports our assumption that, by modeling the dependency among surrounding residues, smoothed PSSM encoding better resolves the ambiguity between RNA-interacting and non-interacting residues.

Because the proposed methods are general, they can be extended to other bioinformatics research topics. Moreover, the predicted localization and RNA-binding site information can help biologists infer protein function and find suitable drug targets. We therefore believe that our work can contribute to scientific discovery on a high-throughput basis.
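The abstract describes smoothed PSSM encoding only as taking neighboring residues into account, so the following is a minimal sketch of one plausible reading rather than the RNAProB implementation. It assumes each PSSM row is replaced by the sum of the rows inside a centered window, with positions beyond the sequence ends treated as zero rows. The function name smooth_pssm, the window parameter, and the toy two-column matrix are hypothetical and do not come from this record.

```python
from typing import List


def smooth_pssm(pssm: List[List[float]], window: int = 3) -> List[List[float]]:
    """Return a PSSM whose row i is the sum of rows i-half .. i+half.

    Assumption (not stated in the record): smoothing is a plain windowed
    sum, and neighbors outside the sequence contribute nothing.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(pssm)
    width = len(pssm[0]) if n else 0
    smoothed = []
    for i in range(n):
        row = [0.0] * width
        # Clip the window at both ends, which is equivalent to zero padding.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            for k in range(width):
                row[k] += pssm[j][k]
        smoothed.append(row)
    return smoothed


# Toy 4-residue matrix with 2 columns instead of the usual 20 amino acids.
toy = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [4.0, 1.0]]
smoothed_toy = smooth_pssm(toy, window=3)
```

With window=1 the function returns the original scores unchanged, so the window size alone controls how much neighbor context each residue's feature vector carries before it is passed to a classifier.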
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009251803
http://hdl.handle.net/11536/77499
Appears in Collections: Thesis


Files in This Item:

  1. 180301.pdf