Compositional heterogeneity and phylogenomic inference of metazoan relationships

Mol Biol Evol. 2010 Sep;27(9):2095-104. doi: 10.1093/molbev/msq097. Epub 2010 Apr 9.

Abstract

Compositional heterogeneity of sequences between taxa may cause systematic error in phylogenetic inference. The potential influence of such bias might be mitigated by strategies to reduce compositional heterogeneity in the data set or by phylogeny reconstruction methods that account for compositional heterogeneity. We adopted several of these strategies to analyze a large ribosomal protein data set representing all major metazoan taxa. Posterior predictive tests revealed that there is compositional bias in this data set. Only a few taxa with strongly deviating amino acid composition had to be excluded to reduce this bias. Taxon exclusion is therefore a good solution if these taxa are not central to the phylogenetic question at hand. Deleting individual proteins from the data matrix may be an appropriate method if compositional heterogeneity among taxa is concentrated in a few proteins. In the present data set, however, half of the ribosomal proteins had to be excluded before compositional heterogeneity was reduced to a degree at which the CAT model was no longer significantly violated. Recoding of amino acids into groups is another alternative, but it causes a loss of information and may result in poorly resolved trees, as demonstrated by the present data set. Bayesian inference with the CAT-BP model directly accounts for compositional heterogeneity between lineages by introducing breakpoints along the branches of the phylogeny at which the amino acid composition is allowed to change, but it is computationally expensive. Finally, a neighbor-joining tree based on equal input distances that consider pattern and rate heterogeneity showed several unusual groupings. These are most likely artifacts, probably caused by the loss of information that results from transforming the sequence data into distances.
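
To illustrate why recoding amino acids into groups loses information, the following minimal Python sketch recodes a protein sequence into a reduced alphabet. The abstract does not state which grouping was used, so the six Dayhoff categories here are an assumption, and the names `DAYHOFF_GROUPS` and `recode` are hypothetical.

```python
# Six-group Dayhoff scheme. This grouping is an assumption for illustration;
# the abstract does not name the recoding scheme that was applied.
DAYHOFF_GROUPS = {
    "0": "AGPST",
    "1": "DENQ",
    "2": "HKR",
    "3": "ILMV",
    "4": "FWY",
    "5": "C",
}

# Invert the table: one lookup entry per amino acid.
RECODE = {aa: code for code, members in DAYHOFF_GROUPS.items() for aa in members}


def recode(seq: str) -> str:
    """Map each residue to its group code; keep gaps, mark anything else as '?'."""
    out = []
    for aa in seq.upper():
        if aa == "-":
            out.append("-")
        else:
            out.append(RECODE.get(aa, "?"))
    return "".join(out)


if __name__ == "__main__":
    # Four different hydrophobic residues collapse to the same state,
    # which is the loss of information referred to above.
    print(recode("ILMV"))
    print(recode("MKV-AX"))
```

Substitutions within a group become invisible after recoding, which reduces compositional signal but also removes phylogenetic signal carried by those substitutions.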
As long as no more efficient phylogenetic inference methods are available that can directly account for compositional heterogeneity in large data sets, combining methods that reduce compositional heterogeneity in the data with methods that assume a stationary amino acid composition remains an option for controlling systematic errors in tree reconstruction that result from compositional bias. Our analyses indicated that the paraphyly of Deuterostomia recovered in some analyses is the result of systematic errors that also affected the relationships of Entoprocta and Ectoprocta.
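
As a rough illustration of how taxa with strongly deviating amino acid composition might be flagged before exclusion, the sketch below scores each taxon by a chi-square-like distance between its amino acid frequencies and the mean across taxa. This is a simple heuristic and not the posterior predictive test used in the study; the function names and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Relative frequency of each of the 20 amino acids, ignoring gaps and unknowns."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def deviation_scores(alignment):
    """Chi-square-like distance of each taxon's composition from the across-taxon mean.

    `alignment` maps taxon name -> aligned protein sequence (hypothetical input format).
    """
    comps = {taxon: composition(seq) for taxon, seq in alignment.items()}
    n = len(comps)
    mean = [sum(c[i] for c in comps.values()) / n for i in range(len(AMINO_ACIDS))]
    return {
        taxon: sum((f - m) ** 2 / m for f, m in zip(c, mean) if m > 0)
        for taxon, c in comps.items()
    }


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = {
        "taxonA": "AAGGLLKK",
        "taxonB": "AAGGLLKR",
        "taxonC": "WWWWCCCC",
    }
    scores = deviation_scores(toy)
    for taxon, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{taxon}\t{score:.3f}")
```

Taxa with the highest scores would be candidates for exclusion, provided they are not central to the phylogenetic question; any cutoff would need to be justified separately, for example with a proper statistical test.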

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Databases, Genetic
  • Phylogeny*
  • Ribosomal Proteins / genetics

Substances

  • Ribosomal Proteins