Title: Predicting Protein Kinase-Specific Phosphorylation Sites Using Second-Order Hidden Markov Models
Authors: Chih-Kuo Yeh (葉智國)
Shinn-Ying Ho (何信瑩)
Institute of Bioinformatics and Systems Biology
Keywords: phosphorylation; kinase; Hidden Markov Model; Bayesian Information Criterion; second-order
Publication date: 2006
Abstract: Protein phosphorylation is an important post-translational modification that plays key roles in regulating essential cellular processes such as metabolism, cell signaling, differentiation, and membrane transport. Identifying phosphorylated proteins and their phosphorylation sites in the laboratory has traditionally been tedious and labor-intensive. Large-scale methods such as two-dimensional gel electrophoresis and mass spectrometry have recently made detection more efficient, but experimental identification remains expensive. Computational prediction of phosphorylation sites and their specific kinases from protein primary sequences can therefore provide a crucial selection step that reduces the number of experimental candidates. HMMer, which uses a first-order hidden Markov model (HMM-1) with the Plan7 architecture, is a conventional tool for predicting kinase-specific phosphorylation sites; its prediction accuracies on existing data sets for the important kinases PKA, PKC, and CDK are 82%, 74%, and 82%, respectively. Based on an analysis of HMMer, this thesis proposes an improved algorithm, iHMM (improved HMM), built on a second-order HMM (HMM-2) that draws richer context information from the sequences to raise prediction accuracy. Using the Bayesian Information Criterion (BIC) to select model parameters, iHMM tries to avoid the well-known over-fitting problem that arises when data sets are small. Eighteen data sets, grouped by kinase annotation (PKA, PKC, CDK, and others), were built by combining the Phospho.ELM and Swiss-Prot databases. iHMM was compared with HMMer and the conventional HMM-1 using 5-fold cross-validation over 30 independent runs. The results show that iHMM improves average accuracy by nearly 4.3% over HMMer and nearly 3.6% over the conventional HMM-1. The thesis also discusses the characteristics, advantages, and disadvantages of the three methods.
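The trade-off mentioned in the abstract (a second-order model sees more context but has far more parameters, which BIC penalizes when data are scarce) can be illustrated with a minimal sketch. This is not the iHMM algorithm from the thesis: it fits plain, fully observed Markov chains rather than hidden Markov models with the Plan7 architecture, and the function names, the toy 9-residue windows, and the parameter-counting convention are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def fit_log_likelihood(seqs, order, start):
    """Maximized log-likelihood of an order-`order` Markov chain on `seqs`.

    Only positions i >= start are scored, so that models of different
    order are compared on exactly the same observations.
    """
    ctx_counts, trans_counts = Counter(), Counter()
    for s in seqs:
        for i in range(start, len(s)):
            ctx = s[i - order:i]          # the `order` preceding residues
            ctx_counts[ctx] += 1
            trans_counts[ctx, s[i]] += 1
    # Maximum-likelihood estimate: P(symbol | ctx) = count / context count
    ll = sum(c * math.log(c / ctx_counts[ctx])
             for (ctx, _), c in trans_counts.items())
    n = sum(trans_counts.values())
    return ll, n

def num_free_params(order, alphabet_size):
    """Free transition probabilities: one distribution per context."""
    return alphabet_size ** order * (alphabet_size - 1)

def bic(log_lik, k, n):
    """Bayesian Information Criterion, k*ln(n) - 2*ln(L); lower is better."""
    return k * math.log(n) - 2.0 * log_lik

# Hypothetical 9-residue windows centred on S/T, NOT data from the thesis.
toy_sites = ["RRASVAGLK", "KRKSSLEEP", "GRRNSIHDI", "LRRASLGAA", "PKRKTSPEL"]

for order in (1, 2):
    ll, n = fit_log_likelihood(toy_sites, order, start=2)
    k = num_free_params(order, len(AMINO_ACIDS))
    print(f"order={order} params={k} logL={ll:.2f} BIC={bic(ll, k, n):.1f}")
```

On a data set this small, the second-order chain always fits at least as well as the first-order one, yet its much larger parameter count gives it the worse (higher) BIC. This is the kind of over-fitting that a BIC-based selection step is meant to guard against.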
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009451513
http://hdl.handle.net/11536/82004
Appears in collections: Theses


Files in this item:

  1. 151301.pdf