Least squares inference in phylogeny

Least squares inference in phylogeny generates a phylogenetic tree based on an observed matrix of pairwise genetic distances and optionally a weight matrix. The goal is to find a tree which satisfies the distance constraints as closely as possible.

Ordinary and weighted least squares

The discrepancy between the observed pairwise distances D_{ij} and the distances T_{ij} over a phylogenetic tree (i.e. the sum of the branch lengths in the path from leaf i to leaf j) is measured by:

S = \sum_{ij} w_{ij} (D_{ij} - T_{ij})^2

where the weights w_{ij} depend on the least squares method used.

Least squares distance tree construction aims to find the tree (topology and branch lengths) with minimal S. This is a non-trivial problem. It involves searching the discrete space of unrooted binary tree topologies, whose size grows faster than exponentially in the number of leaves. For n ≥ 3 leaves there are 1 • 3 • 5 • ... • (2n-5) different topologies, which is already 2,027,025 for n = 10. Enumerating them is infeasible even for a small number of leaves, so heuristic search methods are used to find a reasonably good topology. The evaluation of S for a given topology (which includes the computation of the branch lengths) is a linear least squares problem.

There are several ways to weight the squared errors (D_{ij} - T_{ij})^2, depending on the knowledge and assumptions about the variances of the observed distances. When nothing is known about the errors, or if they are assumed to be independently distributed and equal for all observed distances, then all the weights w_{ij} are set to one. This leads to an ordinary least squares estimate.

In the weighted least squares case the errors are assumed to be independent (or their correlations are not known). Given independent errors, a particular weight should ideally be set to the inverse of the variance of the corresponding distance estimate. Sometimes the variances may not be known, but they can be modeled as a function of the distance estimates. In the Fitch and Margoliash method (Fitch WM, Margoliash E. (1967). Construction of phylogenetic trees. "Science" 155: 279-84), for instance, it is assumed that the variances are proportional to the squared distances.
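To illustrate why evaluating S for a fixed topology is a linear least squares problem, the following minimal sketch fits the five branch lengths of a four-leaf tree with topology ((A,B),(C,D)). Everything in it is an assumption made for illustration and is not taken from any of the packages listed under External links: the function names, the 0/1 path matrix `QUARTET_PATHS`, the made-up distances in `observed`, and the choice to sum over unordered pairs i < j. It solves the weighted normal equations directly and does not constrain branch lengths to be non-negative, which practical implementations may do.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: branch lengths not identifiable")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_branch_lengths(paths, D, weights):
    """Minimise S = sum_k w_k (D_k - T_k)^2 for a fixed topology.

    paths[k][e] is 1 if branch e lies on the path joining leaf pair k,
    so the tree distance is T_k = sum_e paths[k][e] * length_e.
    Returns the fitted branch lengths and the resulting value of S.
    """
    m = len(paths[0])
    # Weighted normal equations: (X^T W X) lengths = X^T W D.
    XtWX = [[sum(w * p[i] * p[j] for p, w in zip(paths, weights))
             for j in range(m)] for i in range(m)]
    XtWD = [sum(w * p[i] * d for p, d, w in zip(paths, D, weights))
            for i in range(m)]
    lengths = solve_linear_system(XtWX, XtWD)
    T = [sum(pe * le for pe, le in zip(p, lengths)) for p in paths]
    S = sum(w * (d - t) ** 2 for w, d, t in zip(weights, D, T))
    return lengths, S


# Topology ((A,B),(C,D)). Branch order: a, b, c, d (leaf branches), e (internal).
QUARTET_PATHS = [
    [1, 1, 0, 0, 0],  # A-B
    [1, 0, 1, 0, 1],  # A-C
    [1, 0, 0, 1, 1],  # A-D
    [0, 1, 1, 0, 1],  # B-C
    [0, 1, 0, 1, 1],  # B-D
    [0, 0, 1, 1, 0],  # C-D
]

# Made-up observed distances, listed in the same pair order as QUARTET_PATHS.
observed = [3.0, 9.2, 9.9, 10.1, 10.8, 7.0]

# Ordinary least squares: all weights equal to one.
ols_lengths, ols_S = fit_branch_lengths(QUARTET_PATHS, observed, [1.0] * 6)

# Fitch-Margoliash style weights: variance assumed proportional to D^2.
fm_weights = [1.0 / d ** 2 for d in observed]
fm_lengths, fm_S = fit_branch_lengths(QUARTET_PATHS, observed, fm_weights)

print("OLS branch lengths:", ols_lengths, "S =", ols_S)
print("FM  branch lengths:", fm_lengths, "S =", fm_S)
```

A topology search would call a routine like `fit_branch_lengths` once per candidate topology, each with its own path matrix, and keep the topology with the smallest S.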

Generalized least squares

The ordinary and weighted least squares methods described above assume independent distance estimates. If the distances are derived from genomic data their estimates covary, because evolutionary events on internal branches (of the true tree) can push several distances up or down at the same time. The resulting covariances can be taken into account using the method of generalized least squares, i.e. minimizing the following quantity:

\sum_{ij,kl} w_{ij,kl} (D_{ij} - T_{ij}) (D_{kl} - T_{kl})

where w_{ij,kl} are the entries of the inverse of the covariance matrix of the distance estimates.
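The generalized criterion can be sketched as a quadratic form in the residuals D - T. In the example below, each leaf pair is flattened to a single index, and the matrix `W_full` is a made-up symmetric positive definite matrix standing in for the inverse covariance matrix. The function names and all numbers are hypothetical. The sketch also checks that a diagonal inverse covariance matrix gives back the weighted least squares sum.

```python
def gls_objective(D, T, W):
    """Return sum over pairs p, q of W[p][q] * (D[p] - T[p]) * (D[q] - T[q]).

    D and T hold observed and tree distances with one entry per leaf pair;
    W stands for the inverse covariance matrix of the distance estimates.
    """
    residuals = [d - t for d, t in zip(D, T)]
    n = len(residuals)
    return sum(W[p][q] * residuals[p] * residuals[q]
               for p in range(n) for q in range(n))


def wls_objective(D, T, weights):
    """Weighted least squares criterion S = sum_p w_p (D_p - T_p)^2."""
    return sum(w * (d - t) ** 2 for w, d, t in zip(weights, D, T))


# Made-up observed and tree distances for three leaf pairs.
D = [3.0, 9.2, 9.9]
T = [3.0, 9.0, 10.0]

# Hypothetical inverse covariance matrix with off-diagonal terms,
# i.e. the distance estimates are assumed to covary.
W_full = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]

# Diagonal special case: independent errors with weights 1, 2 and 3.
weights = [1.0, 2.0, 3.0]
W_diag = [[weights[p] if p == q else 0.0 for q in range(3)] for p in range(3)]

print("GLS with covariances:", gls_objective(D, T, W_full))
print("GLS, diagonal W:     ", gls_objective(D, T, W_diag))
print("WLS, same weights:   ", wls_objective(D, T, weights))
```

In practice the difficult part is estimating the covariance matrix itself. The sketch takes its inverse as given.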

External links

* [http://evolution.genetics.washington.edu/phylip.html PHYLIP], a freely distributed phylogenetic analysis package containing an implementation of the weighted least squares method
* [http://paup.csit.fsu.edu/ PAUP], a similar package available for purchase
* [http://www.cbrg.ethz.ch/darwin/index Darwin], a programming environment with a library of functions for statistics, numerics, sequence and phylogenetic analysis


