標題: 譜系樹上全共表型指標之機率分析Probabilistic Analysis of the Total Cophenetic Index in Phylogenetic Trees 作者: 陳麗安符麥克Chen, Li-AnFuchs, Michael應用數學系所 關鍵字: 譜系樹;隨機樹;極限定理;加法性參數;平衡指標;全共表型指標;phylogenetic tree;random tree;limit laws;additive parameter;balance index;total cophenetic index 公開日期: 2015 摘要: 譜系樹（phylogenetic tree）是演化生物學中廣泛被應用的重要工具，可用來表示操作分類單元（operational taxonomic units, OUT）之間的分類關係，尤其是演化歷史。以圖論的角度而言，譜系樹即是有根二元樹（rooted binary tree）。在譜系樹的研究中，樹的平衡（balance）是一個重要的課題。所謂平衡意指樹的對稱性，或可視為每個內點（internal node）擁有的後代（descendant）數目越接近則越平衡。這個性質可用平衡指標（balance index）量化之。此類型指標滿足遞迴關係，即一個樹的平衡指標，可由其左右子樹的該指標相加，再根據指標的性質補上特定函數值得到。若以機率模型生成譜系樹，則平衡指標可視為隨機變數。在生成譜系樹的方法中，最常被討論的機率模型包含Yule-Harding Model 及Uniform Model，前者是由根（root）出發，隨機選取一個現存葉片（leaf）將之一分為二，直至生成目標數量的葉片；後者則是將所有可能生成的樹型視為擁有相同被生成的機率。 事實上，平衡指標相當於理論計算機科學中的加法性參數（additive parameter）。此類型參數的研究在演算分析領域已發展多年，例如M. Sackin 在1972 年提出的Sackin’s index（Sn），是一樹中所有葉片的深度（depth）（即由根到葉片之最小路徑長度）之和。此指標在理論計算機科學中相當於用快速排序法（quicksort）排序隨機資料所需花費的比較數，或隨機二元搜尋樹（binary search tree）的總路徑長度。其期望值的解析公式在1993 年才由演化生物學家M. Kirkpatrick 及M. Slatkin 給出，然而當代演算分析大師D. Knuth（高納德）早在1973 年便將此參數的解析公式與近似解收錄在著作中。M. Régnier 及U. Rösler 亦分別在1989 年及1991 年發表其極限定理（limit law）。許多資料結構上的加法性參數也已有一般化的成果，例如H.-K Hwang（黃顯貴）與R. Neininger在2002 年的著作便對快速排序法上的加法性參數做了系統性的研究。 2013 年，A. Mir 等人定義了新的平衡指標――全共表型指標（total cophenetic index，\Phi_{n}），並給出其期望值在Yule-Harding Model 及Uniform Model 下的解析公式。此指標是一樹上所有相異葉片兩兩共同祖先（least common ancestor）的深度之和。G. Cardona 等人隨後計算出n 在Yule-Harding Model 下的二次動差及變異數的解析公式。上述研究皆是以解析公式為目標，並用代數方法直接計算完成，當動差的階數越高，計算也越繁瑣。事實上，我們也可以考慮n的生成函數（generating function）。理論計算機科學家P. Flajolet 和A. Odlyzko在1990 年提出奇異點分析（singularity analysis）用以處理生成函數，乃應用複變分析解決組合問題。此方法亦是有「解析組合（analytic combinatorics）之父」之稱的Flajolet 在其2009 年與R. Sedgewick 合著的史上第一本解析組合書藉中最重要的內容之一。本研究旨在以分析的角度，分別在Yule-Harding Model 及Uniform Model 下應用解析組合的方法處理n 的生成函數，近而得到n 所有動差的主要項，並進一步利用動差法（method of moment）證明極限定理。 此外，我們定義了一般化的全共表型指標（generalized total cophenetic index，\Phi_{\alpha, n}），即一樹中所有相異 個葉片之共同祖先的深度之和。根據此定義，S_{n} 實為當\alpha = 1 的特例，而\Phi_{n} 則為\alpha = 2 時的特例。我們也分別給出了在Yule-Harding Model 及Uniform Model 下\Phi_{n} 的所有動差。 本論文的架構可依章節內容分為文獻整理及研究結果兩部分。文獻整理主 要在第一章、第二章、附錄Ｂ及附錄Ｃ，而研究結果則在第三章、第四章、第五章、第六章及附錄Ａ中呈現。 第一章旨在介紹研究背景，包含譜系樹的定義、平衡指標的一般性質及 機率模型的介紹，並整理本研究探及的三種平衡指標：S_{n}、\Phi_{n} 及Colless index（C_{n}）之相關已知成果。第二章集中於研究方法的介紹，包含奇異點分析、基本法（elementary approach）及動差法。除了給出本研究應用的定理之外，亦藉由實例來說明方法的選用動機。 第三章使用奇異點分析，分別給出在Yule-Harding Model 及Uniform Model下\Phi_{n}的極限分布，其中生成函數的常用計算規則在附錄Ｂ有較詳細的介紹。第四章則用基本法進一步考慮在Yule-Harding Model 下，\Phi_{n}、S_{n} 及C_{n} 的聯合分布（joint distribution），這也是M. Blum 等人(2006) 提出的S_{n} 及C_{n} 的聯合分布之延伸。另外，我們也由此驗證前人給出的共變異數（covariance），其計算過程則以附錄Ａ輔助之。第五章討論\Phi_{n}的變化型。第一小節是回應A. Mir 等人(2013) 提及的指標\overline{\Phi}_{n} = S_{n} + \Phi_{n}，其中用到的Slutsky’s 定理會在附錄Ｃ有完整的證明。第二小節則提出一般化的全共表型指標，並分別算出在Yule-Harding Model 及Uniform Model 下所有動差的主要項。 最後，第六章會整理本研究所有的成果，並呈現相關的數值計算結果作為 本論文的總結。Phylogenetic trees are widely applied in evolutionary biology. They provide graphical representation of historical relations between operational taxonomic units (OTUs), in particular the evolutionary relationships. From a graph theoretic point of view, a phylogenetic tree is a rooted binary tree. The balance of a phylogenetic tree is probably one of the most widely studied tree shape properties. One can regard the balance of a rooted tree as its overall asymmetry, or the degree to which both children of each internal node tend to have the same number of descendant. This property can be quantified by balance indices. A balance index satisfies a recursive relation. When computing such an index of a tree, one can compute those of its two subtrees, and sum the results as well as the value of a specified function that depends on the property of the index. If a tree is generated by random models, then the balance indices can be regarded as random variables. Two of the most popular random models will be discussed. One is the Yule-Harding model, which randomly selects an extant leaf and replaces it with a cherry until the required number of leaves are obtained. Another one is the uniform model, which assigns the same probability to all trees with the same number of leaves. In fact, balance indices are equivalent to additive parameters in theoretical computer science. Such parameters have been studied for decades. For example, M. Sackin suggested the Sackin's index (S_{n}) in 1972, which is the sum of the depth of all leaves of a tree. Under the Yule-Harding model, this is equivalent to the number of comparison used by quicksort to sort a random input, or the total path length of a random binary search tree. The exact formula of the mean of S_{n} was given by the biologists M. Kirkpatrick and M. Slatkin in 1993. However, its exact formula as well as asymptotic solution has been mentioned in D. Knuth's book in 1973. Also, its limit law was given by M. R{\'e}gnier and U. R{\"o}sler in 1989 and 1991, respectively. Analysis on several structures and random models, as well as general analytic approaches, are already developed in the literature. For instance, H.-K. Hwang and R. Neininger published a systematic study of additive parameters on quicksort-liked structures in 2002. \\ In 2013, A. Mir et al. defined a new balance index---the total cophenetic index (\Phi_{n}), and provided the exact formulas of its mean under the Yule-Harding model and the uniform model. This index is the sum over all pairs of leaves, of the depth of their least common ancestor. Soon after, G. Cardona el al. published the second moment as well as the variance of \Phi_{n} under the Yule-Harding model. The above studies are all done by arithmetic computation. As the order of moments grows, computation will become more complicated. Also, for large tree size n, it might be more suitable to consider asymptotic formulas. In fact, we can consider the generating function of \Phi_{n}. Computer scientists P. Flajolet and Odlyzko A. introduced the method of singularity analysis to deal with generating functions in 1990. This method uses complex analysis to solve combinatorial problems. It is also one of the most important contents in the 2009 book---the first book-length material on this topic written by Flajolet, who is known as the father of analytic combinatorics, together with R. Sedgewick. One of the goal of this thesis is to apply the methods of analytic combinatorics to the generating function of \Phi_{n} and find the main terms of all moments under the Yule-Harding model and the uniform model. Also, we prove limit laws by the method of moments. Moreover, we define a generalized total cophenetic index, \Phi_{\alpha, n}, which is the sum over all set of size \alpha of leaves, of the depth of their least common ancestor. In this way, S_{n} is the special case when \alpha=1 and \Phi_{n} is \alpha=2. We also provide the asymptotics of all moments under the Yule-Harding model and the uniform model for this generalized total cophenetic index. This thesis aims to review literature and display our results. Literature arrangement is mainly in Chapter 1, 2, Appendix B and C, while the results are presented in Chapter 3, 4, 5, 6 and Appendix A. In Chapter 1, we give the background of our research problem, including the definition of phylogenetic trees, the properties of balance indices and random models. Previous results on \Phi_{n},S_{n} and Colless' index, C_{n} are also provided. Chapter 2 focuses on introducing our methods, such as singularity analysis, an elementary approach and the method of moments. We not only give the theorems we need, but also explain the motivations with examples. In Chapter 3, the limit law of \Phi_{n} will be discussed under the Yule-Harding model and the uniform model by singularity analysis. Rules of operations on generating functions will be introduced in Appendix B. Furthermore, Chapter 4 includes the joint distribution of \Phi_{n},S_{n} and C_{n} for the Yule-Harding model by an elementary approach, which is also an extension of the joint limit law of S_{n} and C_{n} given by M. Blum. et al. (2006). We also recompute covariances that appeared in previous papers, with the computation details presented in Appendix A. Chapter 5 considers variations of \Phi_{n}. The first part is about the index \overline{\Phi}_{n}=S_{n}+\Phi_{n} that was mentioned in A. Mir et al. (2013). We apply Slutsky's theorem in this section and include its proof in Appendix C. The second part will discuss the generalized total cophenetic index \Phi_{\alpha,n}. We give the asymptotics of its moments under the Yule-Harding model and the uniform model, where for the Yule-Harding model we use an elementary approach. Finally, we conclude this thesis by Chapter 6, which contains a summation of our work as well as several numerical results. URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070252229http://hdl.handle.net/11536/141795 Appears in Collections: Thesis