標題: 正規化表示式的限制型多重序列比對之研究
On the Study of Multiple Sequence Alignment with Regular Expression Constraints
作者: 李威勳
Wei-Hsun Lee
盧錦隆
Chin Lung Lu
生物資訊及系統生物研究所
關鍵字: 限制型序列比對;正規化表示式;PROSITE 資料庫;sequence alignment with constraints;regular expression;PROSITE database
公開日期: 2006
摘要: 在生物資訊及計算生物領域中,多重序列比對 (Multiple Sequences Alignment) 在發掘基因體或蛋白質序列的生物意義上是很有用的工具。通常生物學家對序列的結構╱功能╱演化關係已有一些初步的認識,如活化部位的殘基、分子間的雙硫鍵、受質結合的部位、酵素的活化性及保守性的 Motifs 等等。因此在做多重序列比對時,生物學家希望有一個工具能讓一些結構性的╱功能性的╱保留性的核甘酸或殘基可以排在一起。 2004年我們的研究團隊已開發出一套限制型多重序列比對 (Multiple Sequence Alignment with Constraints) 的工具叫 MuSiC。至目前為止 MuSiC 已被許多生物學家證實在生物的研究上是相當有用的。然而,MuSiC 中的 Constraint 只能是允許 Mismatches 但不能允許Gap的簡單序列片段。很多生物重要的 Patterns 像是 PROSITE database 中的 Motifs 在 MuSiC 中是無法使用的。因此,在此論文中我們主要的目的為研究並開發出一套能夠使用正規表示式的限制型多重序列比對 (Multiple Sequence Alignment with Regular Expression Constraints) 的演算法與工具。 我們採用了 Progressive 的方法來解決正規表示式的限制型多重序列比對的問題。事實上,這個方法的核心在於設計出有效率的演算法來解決正規表示式的限制型兩條序列比對問題 (Pairwise Sequence Alignment with Regular Expression Constraints Problem)。我們是將正規表示式 (Regular Expression) 轉成有限狀態機 (Finite Automaton),並使用 Dynamic Programming 與 Divide-and-Conquer 方法來設計一個在時間與空間上都有效率的演算法來求得最佳化的正規表示式限制型兩條序列比對。然後,我們再跟據此演算法發展出能夠使用多個正規表示式的限制型多重序列比對工具:RE-MuSiC (Multiple Sequence Alignment with Regular Expression Constraints),其網址在 http://140.113.239.131/RE-MUSIC
Multiple sequence alignment (MSA) has received much attention in the fields of bioinformatics and computational biology because it is very useful for discovering the biological meanings of sequences. Usually, biologists may have advanced knowledge about the structural, functional, and /or evolutionary relationships about sequences of their interest, such as active site residues, intramolecular disulfide bonds, substrate binding sites, enzyme activities, conserved motifs (consensuses) and so on. They would expect an MSA program that is able to align these sequences such that the structural, functional, and/or conserved bases (i.e., nucleotides or residues) are aligned together. In 2004, our research group has already developed a tool, called MuSiC, for multiple sequence alignment with constraint. Since then, it has been proven by many biologists to be useful in biological research. Nevertheless, the constraints allowed in MuSiC can only be simple substrings allowing mismatches but disallowing gaps in the occurrences. Many biologically important patterns such as motifs in the PROSITE database cannot be supported by MuSiC, either. Hence, in this thesis, we study and develop an algorithm and a tool for the problem of multiple sequence alignment with regular expression constraints (RECMSA). We used a progressive approach to design an efficient program for solving the RECMSA problem. The kernel of this approach is an efficient algorithm for solving the problem of pairwise sequence alignment with regular expression constraints (RECPSA). We transform the regular expressions into a finite automaton and then use dynamic programming and divide-and-conquer approaches to develop a time and space efficient algorithm for optimally solving the RECPSA problem, which can be implemented effectively on a desktop PC with limited memory. Using this algorithm as the kernel, we developed a web-server called RE-MuSiC (Multiple Sequence Alignment with Regular Expression Constraints) that is available on-line at http://140.113.239.131/RE-MUSIC.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009451511
http://hdl.handle.net/11536/82003
Appears in Collections:Thesis


Files in This Item:

  1. 151101.pdf
  2. 151101.pdf