標題: MuSiC and MuSiC-ME:有效率的限制型多重序列比對工具
MuSiC and MuSiC-ME: Efficient Tools for Multiple Sequence Alignment with Constraints
作者: 黃彥斌
Yen Pin Huang
盧錦隆
Chin Lung Lu
生物資訊及系統生物研究所
關鍵字: 多重序列比對;Multiple Sequence Alignment;Dynamic Programming;Divide-and-Conquer;Progressive
公開日期: 2003
摘要: 在生物資訊及計算生物的領域中,多重序列比對 (Multiple Sequence Alignment)在發掘基因體或蛋白質序列的生物意義上是很有用的工具。通常生物學家對序列的結構/功能/演化關係已有一些初步的認識,如活化部位的殘基、分子間的雙硫鍵、受質結合的部位、酵素的活性及保守性的 Motifs 。因此,在做多重序列比對的時候,生物學家希望能有一個工具能夠讓一些結構性的/功能性的/保留性的核甘酸或殘基可以排比在一起。然而,目前已有的多重序列比對工具大都無法滿足這種需求,因為他們都只根據序列的內容去排比而忽略了其他在結構/功能/演化上已知的訊息。因此,在這個論文中,我們主要的目的是去研究發展一種稱為限制型的多重序列比對工具來解決這個問題。這個工具允許使用者輸入多條的序列及一些限制條件,每個限制條件對應在序列上一些結構性的/功能性的/保守性的核甘酸或殘基,然後產生一組多重序列比對使得一些滿足限制條件的核甘酸或殘基排比在一起。我們採用了所謂的 Progressive 的方法來解決限制型的多重序列比對問題。事實上,這個方法的核心在於能否設計出有效率的演算法來解決限制型兩條序列比對問題(Constrained Pairwise Sequence Alignment Problem)。我們採用 Dynamic Programming 及 Divide-and-Conquer這兩種不同的方法分別設計出一個在時間上有效率及另一個在空間上有效率的演算法來求得最佳化的限制型的兩條序列比對。然後,我們再根據這兩種演算法分別發展出兩個限制型多重序列比對工具:MuSiC (Multiple Seuqnece Alignment with Constraints) 及 MuSiC-ME (Memory-Efficient MuSiC)。同時,我們也利用一些冠狀病毒(包含 SARS)的 3’ UTR 序列來測試這兩種工具的可用性,並從其所產生的序列比對中找到在 SARS 病毒中會折疊成 Pseudoknot 結構的片斷。
Multiple sequence alignment (MSA) has received much attention in the fields of bioinformatics and computational biology because it is very useful for discovering the biological meanings of sequences. Usually, biologists may have advanced knowledge about the structures/functionalities/evolutionary relationships of sequences of interest, such as active site residues, intramolecular disulfide bonds, substrate binding sites, enzyme activities and conserved motifs (consensuses). They would expect an MSA program that is able to align these sequences such that the structural/ functional/conserved bases (i.e., nucleotides or residues) are aligned together. However, most available MSA programs cannot satisfy such a requirement because they generate an alignment based only on the content of the sequences, ignoring the known functional/structural/conserved information. Hence, in this thesis, we study and develop a so-called constrained multiple sequence alignment (CMSA) tool, which takes as input the sequences and several user-specified constraints, each with corresponding to the known functional/structural/conserved bases, and generates an output of alignment in which the bases corresponding to a user-specified constraint are aligned together. We use the progressive approach to design efficient programs for heuristically solving the CMSA problem. The kernel of this approach is an efficient algorithms for optimally solving the constrained pairwise alignment (CPSA) problem. We use two different approaches, called as dynamic programming and divide-and-conquer, to design a time-efficient algorithm and a memory-efficient algorithm respectively for optimally solving the CPSA program. Based on those two algorithms, we then develop two programs, called MuSiC (Multiple Sequence Alignment with Constraints) and MuSiCME (Memory-Efficient MuSiC), respectively. To demonstrate the applicabilities of our programs, we test them on a data set of RNA sequences of 3' UTRs of several coronaviruses, including SARS, for detecting a fragment in the 3' UTR of SARS that is able to fold into a stable pseudoknotted structure, where such a pseudoknot is considered to be involved in the RNA replication of coronaviruses.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009151514
http://hdl.handle.net/11536/61491
Appears in Collections:Thesis


Files in This Item:

  1. 151401.pdf