Title: Intelligent Genetic Algorithm for Microarray Data Analysis – Interpretable Gene Expression Classifier and Inference of Genetic Network
Original title: 運用智慧型基因演算法最佳化微陣列資料分析 - 可語意解讀基因表現量分類器之設計暨基因網路模型之重建
Authors: Chih-Hung Hsieh (謝志宏)
Shinn-Ying Ho (何信瑩)
Institute of Bioinformatics and Systems Biology
Keywords: Intelligent genetic algorithm; Orthogonal experimental design; Pattern recognition; Fuzzy classifier; Microarray data analysis; Genetic network
Issue Date: 2005
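Abstract: [Abstract, Chinese version] In research on the medical diagnosis of cancer and other diseases, microarray gene expression data analysis is one of the most important research areas today. Gene expression data provide rich information about genes, gene regulatory networks, and intracellular states. With microarray data analysis techniques, we can select from gene expression data the important genes that take part in gene regulation and reconstruct dynamic intracellular gene regulatory networks, and thereby discover new knowledge in molecular biology, biochemistry, biochemical engineering, and pharmaceutics. Microarray data analysis has two main goals: to determine how each gene is expressed under different cellular states (for example, in healthy cells and in cancer cells), and to study how genes within the same regulatory network regulate and influence one another. These two goals correspond to the gene expression classification problem and the gene regulatory network reconstruction problem, respectively.
First, for the gene expression classification problem, a classifier that is accurate, needs information from only a few genes, and yields learning results that can be read in natural language would be of decisive help to microarray data analysis and to subsequent cost-effective medical tests. However, many classifiers commonly used for microarray data analysis, such as support vector machines (SVM), neural networks, the k-nearest neighbor rule (k-NN), and logistic regression models, lack good linguistic interpretability. This thesis therefore proposes a linguistically interpretable gene expression classifier (iGEC) based on accurate and compact fuzzy classification rules. iGEC has three main design objectives: maximizing classification accuracy, minimizing the number of classification rules, and minimizing the number of genes used for classification. A novel intelligent genetic algorithm (IGA) is used to solve efficiently the iGEC design problem, which has a large number of tuning parameters. Eight commonly used gene expression data sets are used for performance evaluation. The experimental results show that iGEC produces an accurate, compact, and linguistically interpretable set of fuzzy classification rules (on average only 1.1 fuzzy rules per class), with an average test accuracy of 87.9%, an average of 3.9 fuzzy classification rules, and an average of 5.0 genes used for classification. Moreover, by these criteria, iGEC not only outperforms existing fuzzy rule-based classifiers on the gene expression classification problem but also achieves higher accuracy than some classifiers that are not rule-based.
Second, for the gene regulatory network reconstruction problem, we aim to use gene expression data to reconstruct dynamic gene regulatory networks effectively and so discover more knowledge in molecular biology, biochemistry, and related fields. The S-system genetic network model is well suited to describing biochemical network systems and can also be used to analyze the internal dynamics of a regulatory network. However, inferring an S-system model with N genes requires handling a system of nonlinear differential equations with 2N(N+1) parameters, which is a large-parameter optimization problem with a high computational cost. This thesis therefore proposes an intelligent two-stage evolutionary algorithm (iTEA) that efficiently reconstructs S-system genetic network models from time-series gene expression data. To handle so many parameters, iTEA consists of two stages, each using a divide-and-conquer strategy. The optimization problem is first divided into N subproblems with 2(N+1) parameters each. In the first stage, a novel intelligent genetic algorithm (IGA) based on orthogonal experimental design (OED) is used to optimize the solution of each subproblem. In the second stage, to deal with noise in the gene expression data, the solutions of the N subproblems are combined into an S-system network model with 2N(N+1) parameters, which is then refined by a novel OED-based simulated annealing algorithm (OSA). iTEA is evaluated on a single-CPU computer using simulated gene expression data with and without noise. The experimental results show that (1) IGA can effectively solve the subproblems with 2(N+1) parameters; (2) IGA has a clearly better optimization search capability than the previously used SPXGA algorithm; and (3) iTEA can efficiently solve the reconstruction problem for S-system gene regulatory network models.
Keywords: Evolutionary algorithm; Intelligent genetic algorithm; Orthogonal experimental design; Divide-and-conquer; Pattern recognition; Fuzzy classifier; Gene expression; Microarray data analysis; Gene regulatory network; Biochemical pathway identification; S-system genetic network model.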
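For readers unfamiliar with the S-system form, the parameter counts quoted in the abstract can be checked against the standard textbook formulation. The abstract itself does not print the equations, so the symbols below (X_i for the expression level of gene i, rate constants alpha_i and beta_i, kinetic orders g_ij and h_ij) are the conventional notation and are an assumption here, not taken from the thesis.

```latex
% Standard S-system form (notation assumed, not quoted from the thesis)
\frac{dX_i}{dt}
  = \alpha_i \prod_{j=1}^{N} X_j^{\,g_{ij}}
  - \beta_i  \prod_{j=1}^{N} X_j^{\,h_{ij}},
  \qquad i = 1, \dots, N

% Parameters in the equation for gene i:
%   \alpha_i, \beta_i            -> 2
%   g_{i1}, \dots, g_{iN}        -> N
%   h_{i1}, \dots, h_{iN}        -> N
% giving 2 + 2N = 2(N+1) per gene, and for all N genes
N \cdot 2(N+1) = 2N(N+1)
```

Under this form, a 5-gene network has 2·5·6 = 60 parameters in total, or 12 per single-gene subproblem, which matches the decomposition into N subproblems of 2(N+1) parameters described above.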
[Abstract] Microarray gene expression profiling is one of the most important research topics in cancer research and the clinical diagnosis of disease. Gene expression data provide valuable information for understanding genes, biological networks, and cellular states. With microarray techniques, we can identify the important genes that participate in genetic regulation and reconstruct dynamic cellular regulatory networks from gene expression data, and so discover more subtle and substantial functions in molecular biology, biochemistry, bioengineering, and pharmaceutics. One goal in analyzing expression data is to determine how genes are expressed under certain cellular conditions (e.g., in diseased and healthy cells). Another goal is to determine how the expression of a particular gene affects the expression of other genes in the same genetic network. These two goals correspond to two important problems in microarray data analysis: gene expression classification and genetic network inference.
First, for the gene expression classification problem, an accurate classifier with linguistic interpretability that uses a small number of relevant genes benefits both microarray data analysis and the development of inexpensive diagnostic tests. Several frequently used techniques for designing classifiers of microarray data, such as support vector machines, neural networks, the k-nearest neighbor rule, and logistic regression models, suffer from low interpretability. This thesis proposes an interpretable gene expression classifier (named iGEC) with an accurate and compact fuzzy rule base for microarray data analysis. The design of iGEC has three objectives to be optimized simultaneously: maximal classification accuracy, minimal number of rules, and minimal number of used genes. A novel intelligent genetic algorithm (IGA) is used to solve efficiently the design problem, which has a large number of tuning parameters. The performance of iGEC is evaluated using eight commonly used data sets. On average, iGEC yields an accurate, concise, and interpretable rule base (1.1 rules per class), with a test classification accuracy of 87.9%, 3.9 rules, and 5.0 used genes. Moreover, iGEC not only performs better than the existing fuzzy rule-based classifier in terms of these objectives, but is also more accurate than some existing non-rule-based classifiers.
Second, for the genetic network inference problem, it is desirable to reconstruct the regulatory relationships between genes from gene expression profiles. The S-system model is suitable for characterizing biochemical network systems and capable of analyzing the dynamics of the regulatory system. However, inferring an S-system model of an N-gene genetic network requires optimizing 2N(N+1) parameters in a set of nonlinear differential equations. This thesis proposes an intelligent two-stage evolutionary algorithm (iTEA) to infer efficiently the S-system models of genetic networks from time-series gene expression data. To cope with the curse of dimensionality, the proposed algorithm consists of two stages, each of which uses a divide-and-conquer strategy. The optimization problem is first decomposed into N subproblems with 2(N+1) parameters each. In the first stage, each subproblem is solved using the novel intelligent genetic algorithm (IGA), whose intelligent crossover is based on orthogonal experimental design (OED). In the second stage, the N solutions to the N subproblems are combined and refined using an OED-based simulated annealing algorithm in order to handle noisy gene expression profiles. The effectiveness of iTEA is evaluated using simulated expression patterns with and without noise, running on a single-processor PC. It is shown that 1) IGA is efficient enough to solve the subproblems; 2) IGA is significantly superior to the existing method SPXGA; and 3) iTEA performs well in inferring S-system models for dynamic pathway identification.
Keywords: Evolutionary algorithm; Intelligent genetic algorithm; Orthogonal experimental design; Divide-and-conquer; Pattern recognition; Fuzzy classifier; Gene expression; Microarray data analysis; Genetic network; Pathway identification; S-system model.
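The abstract names an intelligent crossover based on orthogonal experimental design (OED) but does not describe its mechanics. As a rough illustration of how OED-based crossover works in general, the sketch below recombines two parents by sampling a two-level orthogonal array and keeping, for each factor, the parent whose level has the better main effect. The function names `two_level_oa` and `oed_crossover` are hypothetical, each gene is treated as one factor, and fitness is minimized; these are simplifying assumptions, not the thesis's implementation.

```python
from typing import Callable, List, Sequence


def two_level_oa(k: int) -> List[List[int]]:
    """Build a two-level orthogonal array with 2**k rows and 2**k - 1 columns.

    Entry (r, c) is the parity of the bits shared by r and c, which makes
    every column balanced and every pair of columns orthogonal.
    """
    n = 2 ** k
    return [[bin(r & c).count("1") % 2 for c in range(1, n)] for r in range(n)]


def oed_crossover(
    p1: Sequence[float],
    p2: Sequence[float],
    fitness: Callable[[Sequence[float]], float],
) -> List[float]:
    """Combine two parents using main effects from an orthogonal array.

    Assumption: one factor per gene, level 0 = take the gene from p1,
    level 1 = take it from p2, and lower fitness is better.
    """
    n_factors = len(p1)
    k = 1
    while 2 ** k - 1 < n_factors:  # smallest array with enough columns
        k += 1
    oa = two_level_oa(k)
    parents = (p1, p2)

    # effect[j][level]: summed fitness of all trials using that level for factor j
    effect = [[0.0, 0.0] for _ in range(n_factors)]
    for row in oa:
        trial = [parents[row[j]][j] for j in range(n_factors)]
        f = fitness(trial)
        for j in range(n_factors):
            effect[j][row[j]] += f

    # For each factor keep the level with the smaller (better) main effect.
    return [
        parents[0 if effect[j][0] <= effect[j][1] else 1][j]
        for j in range(n_factors)
    ]


if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)
    child = oed_crossover([0.0, 5.0, 0.0, 5.0], [5.0, 0.0, 5.0, 0.0], sphere)
    print(child)  # [0.0, 0.0, 0.0, 0.0]
```

With four genes this evaluates 8 trial combinations instead of all 16, and for an additive fitness such as the sphere function the main effects recover the best gene from each parent exactly. For non-separable fitness functions the result is only an estimate, which is why such a crossover is usually embedded in a full evolutionary loop.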
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009351501
http://hdl.handle.net/11536/79853
Appears in Collections: Thesis


Files in This Item:

  1. 150101.pdf