Title: Using Scoring Card Method for Predicting and Characterizing Photosynthetic Proteins from Sequences
Authors: 方恩蘭
Vasylenko Tamara
Shinn-Ying Ho
Keywords: photosynthetic proteins; photosystem II; photosystem I; scoring card method; intelligent genetic algorithm; physicochemical properties
Publication date: 2015
Abstract: Photosynthesis is among the most important biological processes on Earth, and photosynthetic proteins (PSPs) have been studied extensively for more than fifty years. Although a considerable number of research articles have been published on the structure and function of the major photosynthetic protein complexes (e.g. Photosystem I and Photosystem II), recent studies continue to discover novel proteins involved in photosynthetic energy conversion. To date, fewer than 50% of Arabidopsis thaliana protein-coding genes have experimentally confirmed functional annotations. In the most studied photosynthetic bacterial strain, Synechocystis sp. PCC 6803, about half of the open reading frames encode hypothetical proteins. This and other evidence indicates that the thylakoid membrane network of chloroplasts still contains a number of proteins that may be as yet unknown subunits of the photosynthetic complexes, or that play auxiliary roles in the maintenance, repair, turnover, and biogenesis of the photosynthetic machinery. Traditional approaches to studying PSPs are time-consuming, expensive, and laborious. It is therefore desirable to develop effective in silico methods for discovering photosynthetic proteins, which can provide hints for further experiments. To date, no machine learning tool has been proposed specifically to predict photosynthetic protein function. On the other hand, about a dozen chloroplast localization predictors have been developed based on recognition of the N-terminal presequences (chloroplast transit peptides, cTPs) of proteins that are targeted to these plastids. Localization predictors already yield high accuracy, but they are subject to several limitations. First, not all chloroplast proteins have obvious cTPs. Second, unicellular photosynthetic bacteria do not have chloroplasts. Finally, not all chloroplast proteins are necessarily engaged in photosynthesis, since many other processes (e.g. amino acid biosynthesis) take place in these organelles. This thesis proposes a novel scoring card method (SCM)-based approach, SCMPSP, for predicting and characterizing photosynthetic proteins from sequences. SCMPSP uses dipeptide composition as its feature set. The proposed method estimates propensity scores for the 400 individual dipeptides from the difference between the dipeptide compositions of PSPs and non-PSPs, and then optimizes these scores with an Intelligent Genetic Algorithm (IGA). The propensity scores of the 20 natural amino acids are derived from the dipeptide scores and are used to discover informative physicochemical properties (PCPs) of the photosynthetic proteins. The SCMPSP method achieved a test accuracy of 71.54%, and its mean performance across all datasets used is 66.17%. Additionally, four physicochemical properties of PSPs were identified: 1) PSPs tend to be composed of amino acids with hydrophobic side chains; 2) PSPs prefer to form α-helices in membrane environments; 3) PSPs have low interaction with water; 4) PSPs tend to be composed of amino acids with electron-reactive side chains.
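Photosynthesis is a very important biological function on Earth, and photosynthetic proteins have been studied extensively for more than fifty years. Although a considerable number of research articles have been published on the structure and function of the major photosynthetic protein complexes (e.g. photosystem I and photosystem II), a growing number of recent studies continue to discover new proteins involved in photosynthetic energy conversion. To date, fewer than 50% of the protein-coding genes of Arabidopsis thaliana have experimentally confirmed functional annotations. Among the most studied photosynthetic bacterial strains, about half of the open reading frames of the cyanobacterium encode hypothetical proteins. This and other evidence shows that the thylakoid membrane network of chloroplasts still contains many proteins that may be unknown subunits of the photosynthetic complexes, or that play auxiliary roles in the maintenance, repair, turnover, and biogenesis of the whole photosynthetic machinery. Traditional methods of studying photosynthetic proteins are time-consuming, expensive, and laborious. It is therefore desirable to develop effective computational methods for discovering photosynthetic proteins, which can provide information to guide further experiments. So far, no machine learning tool has been proposed specifically to predict the function of photosynthetic proteins from a sequence. In a separate line of research, chloroplast localization predictors have mostly been developed around recognition of the N-terminal targeting sequences (cTPs) of proteins directed to these plastids. Although these localization predictors achieve good accuracy, they have several limitations. First, not all chloroplast proteins have obvious cTPs. Second, unicellular photosynthetic bacteria have no chloroplasts. Finally, not all chloroplast proteins are necessarily engaged in photosynthesis, because other processes (such as amino acid biosynthesis) also take place in these organelles. This study proposes a novel scoring card method (SCMPSP) that uses only sequence information to predict and characterize photosynthetic proteins. The feature set used by SCMPSP is dipeptide composition. The propensity scores of the 400 dipeptides are obtained by taking the difference between the dipeptide compositions of photosynthetic and non-photosynthetic proteins as the initial values of a genetic algorithm, which then optimizes them further. The propensity scores of the 20 basic amino acids, derived from the dipeptide propensity scores, are used to discover the physicochemical properties of photosynthetic proteins. The accuracy of SCMPSP on the independent test is 71.54%, and its average accuracy over all test datasets is 66.17%. Finally, four physicochemical properties of photosynthetic proteins were identified: 1) photosynthetic proteins favor amino acids with hydrophobic side chains; 2) in membrane environments, photosynthetic proteins readily form α-helices; 3) photosynthetic proteins interact little with water; 4) photosynthetic proteins tend to be composed of amino acids with electron-reactive side chains.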
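The scoring idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names are hypothetical, the rescaling of scores to the range 0 to 1000 and the averaging rule used to derive amino acid propensities are assumptions made for illustration, and the IGA optimization step is omitted entirely, so the scores below correspond only to the initial, unoptimized estimate.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides


def dipeptide_composition(sequence):
    """Fraction of each of the 400 dipeptides among the sequence's valid dipeptides."""
    counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = sum(counts[d] for d in DIPEPTIDES)  # ignores non-standard residues
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: counts[d] / total for d in DIPEPTIDES}


def mean_composition(sequences):
    """Average dipeptide composition over a set of sequences."""
    comps = [dipeptide_composition(s) for s in sequences]
    return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}


def initial_propensity_scores(psps, non_psps):
    """Initial dipeptide scores: composition difference between the two classes.

    Assumption: the difference is rescaled linearly to [0, 1000].
    The IGA refinement described in the abstract is not performed here.
    """
    pos, neg = mean_composition(psps), mean_composition(non_psps)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    if hi == lo:
        return {d: 500.0 for d in DIPEPTIDES}
    return {d: 1000.0 * (diff[d] - lo) / (hi - lo) for d in DIPEPTIDES}


def sequence_score(sequence, scores):
    """Score of a sequence: dipeptide composition weighted by propensity scores."""
    comp = dipeptide_composition(sequence)
    return sum(comp[d] * scores[d] for d in DIPEPTIDES)


def amino_acid_propensities(scores):
    """Assumption: a residue's propensity is the mean score of the dipeptides
    containing it in the first or the second position (40 terms per residue)."""
    return {
        a: sum(scores[a + b] + scores[b + a] for b in AMINO_ACIDS) / (2 * len(AMINO_ACIDS))
        for a in AMINO_ACIDS
    }
```

Under these assumptions, a sequence would be labeled a PSP when `sequence_score` exceeds a threshold chosen on training data, and the per-residue values from `amino_acid_propensities` could then be compared against known physicochemical indices.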
Appears in Collections: Thesis