# Least squares inference in phylogeny


Least squares inference in phylogeny generates a phylogenetic tree based on an observed matrix of pairwise genetic distances and, optionally, a weight matrix. The goal is to find a tree that satisfies the distance constraints as well as possible.

Ordinary and weighted least squares

The discrepancy between the observed pairwise distances $D_{ij}$ and the distances $T_{ij}$ over a phylogenetic tree (i.e. the sum of the branch lengths on the path from leaf $i$ to leaf $j$) is measured by

$$S = \sum_{ij} w_{ij} (D_{ij} - T_{ij})^2$$

where the weights $w_{ij}$ depend on the least squares method used.

Least squares distance tree construction aims to find the tree (topology and branch lengths) with minimal $S$. This is a non-trivial problem. It involves searching the discrete space of unrooted binary tree topologies, whose size grows super-exponentially with the number of leaves. For $n \ge 3$ leaves there are $1 \cdot 3 \cdot 5 \cdots (2n-5)$ different topologies, which is already more than two million for $n = 10$. Enumerating them is infeasible even for a modest number of leaves, so heuristic search methods are used to find a reasonably good topology. The evaluation of $S$ for a given topology (which includes the computation of the branch lengths) is a linear least squares problem.

There are several ways to weight the squared errors $(D_{ij} - T_{ij})^2$, depending on the knowledge and assumptions about the variances of the observed distances. When nothing is known about the errors, or if they are assumed to be independently distributed and equal for all observed distances, then all the weights $w_{ij}$ are set to one. This leads to an ordinary least squares estimate.

In the weighted least squares case the errors are assumed to be independent (or their correlations are not known). Given independent errors, a particular weight should ideally be set to the inverse of the variance of the corresponding distance estimate. Sometimes the variances may not be known, but they can be modeled as a function of the distance estimates. In the Fitch and Margoliash method (Fitch WM, Margoliash E. (1967). Construction of phylogenetic trees. "Science" 155: 279-84), for instance, it is assumed that the variances are proportional to the squared distances, so that $w_{ij} \propto 1/D_{ij}^2$.
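The linear least squares step for a fixed topology can be illustrated with a minimal sketch. It assumes a hypothetical four-leaf example: the branch numbering, the path lists, the distance values and the helper names `design_matrix` and `fit_branch_lengths` are illustrative choices, not part of any package mentioned here. Branch lengths are left unconstrained, so a real implementation that requires non-negative lengths would need an extra step.

```python
import numpy as np

# Hypothetical quartet ((1,2),(3,4)): branches 0-3 lead to leaves 1-4,
# branch 4 is the internal branch. Each tuple lists the branches on the
# path between one leaf pair, in the order (1,2),(1,3),(1,4),(2,3),(2,4),(3,4).
PATHS_12_34 = [(0, 1), (0, 4, 2), (0, 4, 3), (1, 4, 2), (1, 4, 3), (2, 3)]
# Competing topology ((1,3),(2,4)) with the same branch numbering.
PATHS_13_24 = [(0, 4, 1), (0, 2), (0, 4, 3), (1, 4, 2), (1, 3), (2, 4, 3)]


def design_matrix(paths, n_branches):
    """One row per leaf pair; entry 1 where a branch lies on that pair's path."""
    A = np.zeros((len(paths), n_branches))
    for row, branches in enumerate(paths):
        A[row, list(branches)] = 1.0
    return A


def fit_branch_lengths(A, d, w=None):
    """Minimise S = sum_ij w_ij (D_ij - T_ij)^2 with T = A @ b, for a fixed topology."""
    d = np.asarray(d, dtype=float)
    w = np.ones_like(d) if w is None else np.asarray(w, dtype=float)
    sqrt_w = np.sqrt(w)
    # Scaling rows by sqrt(w) turns the weighted problem into an ordinary one.
    b, *_ = np.linalg.lstsq(A * sqrt_w[:, None], d * sqrt_w, rcond=None)
    residual = d - A @ b
    return b, float(np.sum(w * residual**2))


# Made-up distances that are exactly additive on ((1,2),(3,4)).
d = np.array([3.0, 9.0, 10.0, 10.0, 11.0, 7.0])
A_good = design_matrix(PATHS_12_34, 5)
A_bad = design_matrix(PATHS_13_24, 5)

b_ols, S_ols = fit_branch_lengths(A_good, d)              # all weights equal to one
b_fm, S_fm = fit_branch_lengths(A_good, d, w=1.0 / d**2)  # Fitch-Margoliash-style weights
_, S_bad = fit_branch_lengths(A_bad, d)                   # same data, other topology

print("OLS branch lengths:", np.round(b_ols, 6), "S =", round(S_ols, 6))
print("S on competing topology:", round(S_bad, 6))
```

Comparing `S_ols` with `S_bad` mirrors what a topology search does at each step: fit the branch lengths, evaluate $S$, and keep the topology with the smaller value.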

Generalized least squares

The ordinary and weighted least squares methods described above assume independent distance estimates. If the distances are derived from genomic data, their estimates covary, because evolutionary events on internal branches (of the true tree) can push several distances up or down at the same time. The resulting covariances can be taken into account using the method of generalized least squares, i.e. minimizing the following quantity:

$$\sum_{ij,kl} w_{ij,kl} (D_{ij} - T_{ij}) (D_{kl} - T_{kl})$$

where $w_{ij,kl}$ are the entries of the inverse of the covariance matrix of the distance estimates.
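For a fixed topology, the generalized criterion is again a linear problem in the branch lengths. The following sketch assumes a hypothetical four-leaf tree and an invented covariance matrix `V` in which the distances crossing the internal branch covary. The function name `gls_branch_lengths` and all numbers are illustrative only.

```python
import numpy as np


def gls_branch_lengths(A, d, V):
    """Minimise (d - A b)^T V^{-1} (d - A b) over the branch lengths b."""
    # Solve with V instead of forming its inverse explicitly.
    Vinv_A = np.linalg.solve(V, A)
    Vinv_d = np.linalg.solve(V, d)
    b = np.linalg.solve(A.T @ Vinv_A, A.T @ Vinv_d)
    r = d - A @ b
    return b, float(r @ np.linalg.solve(V, r))


# Hypothetical quartet ((1,2),(3,4)); columns 0-3 are the leaf branches,
# column 4 is the internal branch. Rows follow the pair order
# (1,2),(1,3),(1,4),(2,3),(2,4),(3,4).
A = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

# Made-up distances that are exactly additive on this tree.
d = np.array([3.0, 9.0, 10.0, 10.0, 11.0, 7.0])

# Assumed covariance: independent noise plus a shared component for the
# four distances whose paths cross the internal branch.
shared = A[:, 4]
V = 0.1 * np.eye(6) + 0.05 * np.outer(shared, shared)

b_gls, S_gls = gls_branch_lengths(A, d, V)
print("GLS branch lengths:", np.round(b_gls, 6), "criterion =", round(S_gls, 6))
```

With `V` set to the identity matrix the function reduces to ordinary least squares, and with a diagonal `V` it reduces to the weighted case with $w_{ij}$ equal to the inverse variances.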

* [http://evolution.genetics.washington.edu/phylip.html PHYLIP], a freely distributed phylogenetic analysis package containing an implementation of the weighted least squares method
* [http://paup.csit.fsu.edu/ PAUP], a similar package available for purchase
* [http://www.cbrg.ethz.ch/darwin/index Darwin], a programming environment with a library of functions for statistics, numerics, sequence and phylogenetic analysis

References

Wikimedia Foundation. 2010.
