ParsimonyML


Presentation Transcript

Slide1: 

Introduction to characters and parsimony analysis

Character evolution: 

Heritable changes in features (morphology, gene sequences, etc.) provide the basis for inferring phylogeny. Such changes delimit what are usually referred to as the states of characters (e.g. presence or absence of a feature, or different nucleotide bases at specific sites in a sequence). The utility of characters depends on how often the changes that produce the different character states occur independently (homoplasy).

Unique and unreversed characters: 

Given a heritable evolutionary change that is unique and unreversed (e.g. the origin of hair) in an ancestral species, the presence of the novelty in any taxa must be due to inheritance from the ancestor. Similarly, absence in any taxa must be because the taxa are not descendants of that ancestor. The novelty will be a homology, acting as a badge or marker for the descendants of the ancestor. The taxa with the novelty will be a clade (e.g. Mammalia).

Unique and unreversed characters - Hair: 

Because hair evolved only once and is unreversed, it is homologous and provides unambiguous evidence for the clade Mammalia.
[Figure: tree of Lizard, Frog, Human and Dog with the character HAIR (absent/present) mapped on it and the single change, or step, marked.]

Slide5: 

Homoplasy - independent evolution
Homoplasy is similarity that is not homologous (not due to common ancestry).
Homoplasy is the result of independent evolution (convergence, parallelism, reversal).
Homoplasy can provide misleading evidence of phylogenetic relationships.

Homoplasy - independent evolution - Tails: 

Loss of tails evolved independently in humans and frogs - there are two steps on the true tree.
[Figure: the true tree for Human, Lizard, Frog and Dog with the character TAIL (absent/present) mapped on it.]

Homoplasy - misleading evidence of phylogeny: 

If misinterpreted as homology, the absence of tails would be evidence for a wrong tree grouping humans with frogs and lizards with dogs.
[Figure: the wrong tree, grouping Human with Frog and Lizard with Dog, with the character TAIL (absent/present) mapped on it.]

Homoplasy - reversal: 

Reversals are evolutionary changes back to an ancestral condition. As with any homoplasy, reversals can provide misleading evidence of relationships.
[Figure: a true tree and a wrong tree for taxa 1-10; taxon 10 is placed differently in the two trees.]

Homoplasy - a fundamental problem of phylogenetic inference: 

If there were no homoplastic similarities, inferring phylogeny would be easy - all the pieces of the jigsaw would fit together neatly. Distinguishing the misleading evidence of homoplasy from the reliable evidence of homology is a fundamental problem of phylogenetic inference.

Homoplasy and Incongruence: 

If we assume that there is a single correct phylogenetic tree, then when characters support conflicting phylogenetic trees we know that there must be some misleading evidence of relationships among the incongruent or incompatible characters. Incongruence between two characters implies that at least one of the characters is homoplastic and that at least one of the trees the characters support is wrong.

Incongruence or Incompatibility: 

These trees and characters are incongruent. Both trees cannot be correct, and at least one character must be homoplastic.
[Figure: two trees of Lizard, Frog, Human and Dog; one is supported by the character HAIR (absent/present), the other, grouping Human with Frog and Lizard with Dog, by the character TAIL (absent/present).]
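One way to make the incompatibility of the hair and tail characters concrete is the pairwise compatibility test for binary characters: two such characters can both fit some tree without homoplasy unless all four combinations of states occur among the taxa. The sketch below is a minimal illustration only; the 0/1 coding (0 = absent, 1 = present) and the function name `compatible` are assumptions made for this example.

```python
def compatible(char_a, char_b):
    """Return True if two binary characters can share a tree without homoplasy.

    The characters are incompatible exactly when all four state pairs
    (0,0), (0,1), (1,0) and (1,1) are observed among the taxa.
    """
    pairs = {(char_a[taxon], char_b[taxon]) for taxon in char_a}
    return len(pairs) < 4


# Assumed coding: 0 = absent, 1 = present.
hair = {"Lizard": 0, "Frog": 0, "Human": 1, "Dog": 1}
tail = {"Lizard": 1, "Frog": 0, "Human": 0, "Dog": 1}

print("hair vs tail compatible:", compatible(hair, tail))
```

Because lizards, frogs, humans and dogs each show a different combination of the two states, the test reports the pair as incompatible, so at least one of the two characters must be homoplastic.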

Distinguishing homology and homoplasy : 

Morphologists use a variety of techniques to distinguish homoplasy and homology. Homologous features are expected to display detailed similarity (in position, structure, development), whereas homoplastic similarities are more likely to be superficial. As recognised by Darwin, congruence with other characters provides the most compelling evidence for homology.

The importance of congruence: 

"The importance, for classification, of trifling characters, mainly depends on their being correlated with several other characters of more or less importance. The value indeed of an aggregate of characters is very evident ... a classification founded on any single character, however important that may be, has always failed."
Charles Darwin, Origin of Species

Congruence - 4: 

We prefer the 'true' tree because it is supported by multiple congruent characters.
[Figure: tree of Lizard, Frog, Human and Dog; the clade MAMMALIA is supported by hair, a single bone in the lower jaw, and lactation.]

Homoplasy in molecular data: 

Incongruence, and therefore homoplasy, can be common in molecular sequence data. One reason is that characters have a limited number of alternative character states (e.g. A, G, C and T). In addition, these states are chemically identical, so that homology and homoplasy are equally similar and cannot be distinguished through detailed study of structure or development.

Parsimony analysis: 

Parsimony methods provide one way of choosing among alternative phylogenetic hypotheses. The parsimony criterion favours hypotheses that maximise congruence and minimise homoplasy. It depends on the idea of the fit of a character to a tree.

Character Fit : 

Initially, we can define the fit of a character to a tree as the minimum number of steps required to explain the observed distribution of character states among taxa. This is determined by parsimonious character optimization. Characters differ in their fit to different trees.
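The minimum number of steps for one character on one tree can be counted with Fitch's algorithm, a standard form of parsimonious optimization for unordered characters. The sketch below applies it to the tail character from the earlier slides. The nested-tuple tree shapes and the 0/1 coding are assumptions for illustration, since the slide figures are not reproduced in this transcript.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, minimum steps) for a rooted binary tree.

    `tree` is either a taxon name or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra step needed.
        return shared, left_steps + right_steps
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_steps + right_steps + 1


# Assumed tree shapes and coding (0 = absent, 1 = present).
true_tree = ("Frog", ("Lizard", ("Human", "Dog")))
wrong_tree = (("Human", "Frog"), ("Lizard", "Dog"))
tail = {"Lizard": 1, "Frog": 0, "Human": 0, "Dog": 1}

print("tail steps on true tree :", fitch_steps(true_tree, tail)[1])
print("tail steps on wrong tree:", fitch_steps(wrong_tree, tail)[1])
```

Under these assumptions the tail character needs two steps on the true tree, matching the two independent losses described earlier, and only one step on the wrong tree.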

Character Fit - Amniota: 

[Figure: the same character optimized on two alternative trees of ray-finned fish, lungfish, frogs, salamanders, mammals, turtles, lizards, crocodiles, birds and snakes; it requires 3 steps on one tree and 1 step on the other.]

Parsimony Analysis: 

Given a set of characters, such as aligned sequences, parsimony analysis works by determining the fit (number of steps) of each character on a given tree. The sum over all characters is called the tree length. Most parsimonious trees (MPTs) have the minimum tree length needed to explain the observed distributions of all the characters.
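Tree length is simply the per-character step count summed over the whole data matrix. As a rough sketch, the code below scores the four characters mentioned in the earlier slides (hair, single lower jaw bone, lactation, tail) on two candidate trees. The tree shapes, the 0/1 coding and the helper names are assumptions for illustration.

```python
def fitch(tree, states):
    """Fitch optimization: return (state set, steps) for a nested-tuple tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, characters):
    """Sum the minimum number of steps over all characters."""
    return sum(fitch(tree, states)[1] for states in characters.values())


# Assumed coding: 0 = absent, 1 = present.
mammal_only = {"Lizard": 0, "Frog": 0, "Human": 1, "Dog": 1}
characters = {
    "hair": mammal_only,
    "single lower jaw bone": mammal_only,
    "lactation": mammal_only,
    "tail": {"Lizard": 1, "Frog": 0, "Human": 0, "Dog": 1},
}

# Assumed tree shapes.
trees = {
    "true tree": ("Frog", ("Lizard", ("Human", "Dog"))),
    "wrong tree": (("Human", "Frog"), ("Lizard", "Dog")),
}

for name, tree in trees.items():
    print(name, "length =", tree_length(tree, characters))
```

With this small matrix the true tree has length 5 and the wrong tree length 7, so the true tree is the more parsimonious even though it requires one extra step for the tail character.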

Parsimony in practice: 

Of these two trees, Tree 1 has the shorter length and is the more parsimonious. Both trees require some homoplasy (extra steps).

Results of parsimony analysis: 

One or more most parsimonious trees
Hypotheses of character evolution associated with each tree (where and how changes have occurred) - this may be very useful
Branch lengths (amounts of change associated with branches)
Various tree and character statistics describing the fit between tree and data
Suboptimal trees - optional

Parsimony - advantages: 

Is a simple method - easily understood operation
Does not seem to depend on an explicit model of evolution
Gives both trees and associated hypotheses of character evolution
Should give reliable results if the data are well structured and homoplasy is either rare or widely (randomly) distributed on the tree

Parsimony - disadvantages: 

May give misleading results if homoplasy is common or concentrated in particular parts of the tree, e.g.:
  thermophilic convergence
  base composition biases
  long branch attraction
Underestimates branch lengths

Parsimony can be inconsistent: 

Felsenstein (1978) developed a simple model phylogeny including four taxa and a mixture of short and long branches. Under this model parsimony will give the wrong tree. With more data the certainty that parsimony will give the wrong tree increases, so that parsimony is statistically inconsistent. Advocates of parsimony initially responded by claiming that Felsenstein's result showed only that his model was unrealistic. It is now recognised that long-branch attraction (in the Felsenstein Zone) is one of the most serious problems in phylogenetic inference. Long branches are attracted, but the similarity is homoplastic.
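The long-branch effect can be illustrated with an exact calculation on a four-taxon tree ((A,B),(C,D)) under a two-state symmetric model. The sketch below computes the probability of each parsimony-informative site pattern by summing over the states at the two internal nodes. The change probabilities (0.4 on the long branches to A and C, 0.05 on the other three branches) are hypothetical values chosen for illustration and are not taken from Felsenstein's paper.

```python
from itertools import product


def branch(x, y, p_change):
    """Probability of ending in state y from state x on a branch."""
    return p_change if x != y else 1.0 - p_change


def pattern_prob(tips, p_long, p_short):
    """Probability of tip states (A, B, C, D) on the true tree ((A,B),(C,D)).

    A and C sit on long branches; B, D and the internal branch are short.
    """
    a, b, c, d = tips
    total = 0.0
    for u, v in product((0, 1), repeat=2):  # states at the two internal nodes
        total += (0.5
                  * branch(u, a, p_long) * branch(u, b, p_short)
                  * branch(u, v, p_short)
                  * branch(v, c, p_long) * branch(v, d, p_short))
    return total


def split_prob(tips, p_long, p_short):
    """Probability of a pattern or its 0/1 mirror image."""
    mirrored = tuple(1 - s for s in tips)
    return (pattern_prob(tips, p_long, p_short)
            + pattern_prob(mirrored, p_long, p_short))


P_LONG, P_SHORT = 0.4, 0.05  # hypothetical values
supports_true = split_prob((0, 0, 1, 1), P_LONG, P_SHORT)   # groups A with B
supports_wrong = split_prob((0, 1, 0, 1), P_LONG, P_SHORT)  # groups A with C

print("P(site supports true tree AB|CD) =", round(supports_true, 6))
print("P(site supports wrong tree AC|BD) =", round(supports_wrong, 6))
```

With these values, sites grouping the two long branches are expected more often than sites supporting the true grouping (roughly 0.139 against 0.038), so adding more sites only strengthens parsimony's preference for the wrong tree.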

Methods other than parsimony: 

Methods other than parsimony

Phylogenetic analysis - different methods: 

Character-based methods
  Maximum parsimony
  Maximum likelihood
Distance-based methods

Slide27: 

Maximum Likelihood 1
Maximum likelihood methods of phylogenetic inference evaluate a hypothesis about evolutionary history (the branching order and branch lengths of a tree) in terms of the probability that a proposed model of the evolutionary process and the hypothesised history (tree) would give rise to the data we observe. So, given a model and data, we can estimate a tree (the maximum likelihood tree).

Slide28: 

Maximum Likelihood 2
Maximum likelihood estimates a parameter from observed data under an explicit model.
There is an explicit link between model + tree + data (poor model = poor tree?).
Likelihood also provides ways of evaluating models in terms of their log likelihoods, provided they are nested, i.e. one model is a special case of the other.

Slide29: 

Maximum likelihood tree reconstruction 1
1 CGAGAC
2 AGCGAC
3 AGATTA
4 GGATAG
What is the probability that unrooted Tree A (rather than another tree) could have generated the data shown under our chosen model?
[Figure: unrooted Tree A with tips 1, 2, 3 and 4.]

Slide30: 

Maximum likelihood tree reconstruction 1
1 CGAGAC
2 AGCGAC
3 AGATTA
4 GGATAG
What is the probability that unrooted Tree A (rather than another tree) could have generated the data shown under our chosen model?
[Figure: Tree A with tips 1, 2, 3 and 4. Note: rooting is arbitrary.]

Slide31: 

Maximum likelihood tree reconstruction 2
1 CGAGA C
2 AGCGA C
3 AGATT A
4 GGATA G
(the final column is site j)
The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model.
[Figure: Tree A with each of its two internal nodes taking any of the states A, C, G or T, giving 4 x 4 possibilities.]

Slide32: 

Maximum likelihood tree reconstruction 3
The likelihood of Tree A is the product of the likelihoods at each site.
In practice the logs of the site likelihoods are summed, because the product of many small per-site probabilities is too small to handle conveniently, and the total is reported as the log likelihood of the full tree.
The maximum likelihood tree is the one with the highest likelihood (this might not be Tree A, i.e. it could be another tree topology).
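The calculation described on these slides can be sketched for the four example sequences. For each site the code sums over the 4 x 4 possible states at the two internal nodes, then adds up the log site likelihoods. Several things here are assumptions for illustration: the Jukes and Cantor model is used as the chosen model, every branch is given the same hypothetical length of 0.1 substitutions per site, and Tree A is taken to group sequences 1 and 2 against 3 and 4.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of state y given state x after branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(column, t):
    """Likelihood of one alignment column on the tree ((s1,s2),(s3,s4))."""
    s1, s2, s3, s4 = column
    total = 0.0
    for x, y in product(BASES, repeat=2):  # 4 x 4 ancestral reconstructions
        total += (0.25                      # equilibrium frequency of x
                  * jc_prob(x, s1, t) * jc_prob(x, s2, t)
                  * jc_prob(x, y, t)
                  * jc_prob(y, s3, t) * jc_prob(y, s4, t))
    return total


def log_likelihood(seqs, t=0.1):
    """Sum of log site likelihoods; the order of seqs fixes the topology."""
    return sum(math.log(site_likelihood(col, t)) for col in zip(*seqs))


SEQS = ["CGAGAC", "AGCGAC", "AGATTA", "GGATAG"]
s1, s2, s3, s4 = SEQS

print("((1,2),(3,4)):", log_likelihood([s1, s2, s3, s4]))
print("((1,3),(2,4)):", log_likelihood([s1, s3, s2, s4]))
print("((1,4),(2,3)):", log_likelihood([s1, s4, s2, s3]))
```

A real analysis would also optimise the branch lengths and model parameters for each topology before comparing log likelihoods; fixing them, as here, only shows the mechanics of the calculation.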

Slide33: 

Maximum likelihood tree reconstruction 4 How are the probabilities of change calculated ? The probabilities used to calculate likelihoods depend on the assumed model

Slide34: 

Typical assumptions of ML substitution models
The probability of any change is independent of the prior history of the site (a Markov model).
Substitution probabilities do not change with time or over the tree (a homogeneous Markov process).
Change is time reversible, e.g. at equilibrium the amount of change from A to T is the same as from T to A.

Slide35: 

Maximum likelihood models 1
The model incorporates information about the rates at which each nucleotide is replaced by each alternative nucleotide. For DNA this can be expressed as a 4 x 4 rate matrix.
Other model parameters may include site-by-site rate variation, often modelled as a statistical distribution, for example a gamma distribution.

Slide36: 

Maximum likelihood models 2
Model parameters can be:
  estimated from the data (using maximum likelihood in PAUP)
  pre-set based upon assumptions about the data (for example that for all sequences all sites change at the same rate and all substitutions are equally likely - e.g. the widely used Jukes and Cantor model)
Wherever possible, avoid assumptions which are violated by the data, because they can lead to incorrect trees.

The true tree for Deinococcus and Thermus: 

[Figure: "The true tree" relating Thermus, Deinococcus, Aquifex and Bacillus.]

Slide38: 

The Jukes and Cantor (JC) model is the simplest model. It is a one-parameter model:
1) it assumes that all changes are equally probable (p=0.25)
2) unless modified, it assumes all sites can change and that they do so at the same rate

Output of JC ML analysis for (Thermus, Deinococcus, Bacillus, Aquifex): 

Tree 1: log likelihood = -4090 (best tree)
Tree 2: log likelihood = -4101 (true tree)
Tree 3: log likelihood = -4132
The Jukes and Cantor model in ML is unable to recover the true tree for this data set.

Slide40: 

The 16S rRNA genes of Aquifex, Bacillus, Deinococcus and Thermus
Does the JC model fit these data?

Exclude characters command in PAUP - exclude constant sites:
Character-exclusion status changed: 859 of 1273 characters excluded
Total number of characters now excluded = 859
Number of included characters = 414

Base frequencies command in PAUP:
Taxon            A         C         G         T     # sites
--------------------------------------------------------------
Aquifex       0.12319   0.38164   0.38164   0.11353    414
Deinococc     0.23188   0.22222   0.27295   0.27295    414
Thermus       0.13317   0.35835   0.37530   0.13317    413
Bacillus      0.23188   0.22705   0.26570   0.27536    414
--------------------------------------------------------------
Mean          0.18006   0.29728   0.32387   0.19879    413.75
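A quick way to see why the JC model is questionable here is to compare each taxon's base frequencies with the equal frequencies (0.25 each) that JC assumes, and to look at G+C content. The sketch below uses the frequencies printed in the PAUP output above; the variable and function names are illustrative only, and this is a descriptive summary rather than a formal test of fit.

```python
# Base frequencies (A, C, G, T) at the variable sites, copied from the PAUP output.
FREQS = {
    "Aquifex":   (0.12319, 0.38164, 0.38164, 0.11353),
    "Deinococc": (0.23188, 0.22222, 0.27295, 0.27295),
    "Thermus":   (0.13317, 0.35835, 0.37530, 0.13317),
    "Bacillus":  (0.23188, 0.22705, 0.26570, 0.27536),
}


def gc_content(freqs):
    """Return the combined frequency of G and C."""
    _, c, g, _ = freqs
    return c + g


def max_deviation_from_equal(freqs):
    """Largest absolute departure from the 0.25 expected under JC."""
    return max(abs(f - 0.25) for f in freqs)


gc = {taxon: gc_content(f) for taxon, f in FREQS.items()}
for taxon, f in FREQS.items():
    print(f"{taxon:10s} G+C = {gc[taxon]:.3f}  "
          f"max deviation from 0.25 = {max_deviation_from_equal(f):.3f}")
```

Aquifex and Thermus both have a G+C content above 0.7 at these sites, while Deinococcus and Bacillus are close to 0.5, which is the kind of shared base composition bias that can mislead a model assuming equal frequencies.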

Models can be made more parameter rich to increase their realism 1: 

The most common additional parameters are:
  A correction to allow different substitution rates for each type of nucleotide change
  A correction for the proportion of sites which are unable to change
  A correction for variable site rates at those sites which can change
PAUP will estimate the values of these additional parameters for you.

A gamma distribution can be used to model site rate heterogeneity : 

A gamma distribution can be used to model site rate heterogeneity
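To give a feel for what a gamma model of site rates implies, the sketch below draws relative rates from a gamma distribution with mean 1, using a shape value of 0.610459 like the one reported in the GTR output later in this transcript. The sampling approach, the sample size and the seed are assumptions for illustration; phylogenetic software typically works with a discretised gamma distribution rather than random draws.

```python
import random


def sample_site_rates(shape, n_sites, seed=0):
    """Draw relative site rates from a gamma distribution with mean 1.

    The scale is set to 1/shape so that the expected rate is 1.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]


rates = sample_site_rates(shape=0.610459, n_sites=100_000)
mean_rate = sum(rates) / len(rates)
slow_fraction = sum(r < 0.1 for r in rates) / len(rates)
fast_fraction = sum(r > 2.0 for r in rates) / len(rates)

print("mean relative rate       :", round(mean_rate, 3))
print("fraction with rate < 0.1 :", round(slow_fraction, 3))
print("fraction with rate > 2.0 :", round(fast_fraction, 3))
```

A shape parameter below 1 gives a strongly skewed distribution: many sites change very slowly while a minority change much faster than average.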

Slide43: 

The GTR model of sequence evolution
The general time reversible model (GTR) is the most general time-reversible substitution model because it assigns different rates for each type of substitution. For example, for the 16S ribosomal RNA data for Deinococcus, Thermus, Aquifex and Bacillus:

Tree number 1:
-Ln likelihood = 3985.30400
Estimated R-matrix:
  -2.7325625    0.4419956    1.42028      0.87028688
   0.4419956   -5.2448524    1.2621698    3.540687
   1.42028      1.2621698   -3.6824498    1
   0.87028688   3.540687     1           -5.4109739
Estimated value of proportion of invariable sites = 0.228318
Estimated value of gamma shape parameter = 0.610459

Models can be made more parameter rich to increase their realism 2: 

But the more parameters you estimate from the data, the more time is needed for an analysis and the more sampling error accumulates.
One might have a realistic model but large sampling errors.
Realism comes at a cost in time and precision!
Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate.
In general, use the simplest model which fits the data.
Use PAUP to compare the likelihoods of nested models that incorporate additional parameters.

Models can be made more parameter rich to increase their realism 3: 

Model                                                      log likelihood
JC (ML tree)                                                   -4090
JC + invariable sites                                          -4030
JC + invariable sites + gamma correction for variable sites    -4029
GTR + invariable sites + gamma correction for variable sites   -3985
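Nested models such as these are commonly compared with a likelihood ratio test: twice the difference in log likelihood is compared with a chi-square distribution whose degrees of freedom equal the number of extra parameters. The sketch below handles only the one-extra-parameter case, using the rounded log likelihoods from this slide. The function name is an assumption, and the plain chi-square reference distribution is itself an approximation, particularly for parameters such as the proportion of invariable sites that are tested at the boundary of their range.

```python
import math


def lrt_one_extra_parameter(lnl_simple, lnl_complex):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (test statistic, approximate p-value from chi-square with 1 df).
    For 1 df the chi-square survival function equals erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * (lnl_complex - lnl_simple)
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value


# Rounded log likelihoods from the slide.
LNL = {"JC": -4090.0, "JC+I": -4030.0, "JC+I+G": -4029.0}

stat_inv, p_inv = lrt_one_extra_parameter(LNL["JC"], LNL["JC+I"])
stat_gamma, p_gamma = lrt_one_extra_parameter(LNL["JC+I"], LNL["JC+I+G"])

print(f"JC vs JC+I     : statistic = {stat_inv:.1f}, p = {p_inv:.3g}")
print(f"JC+I vs JC+I+G : statistic = {stat_gamma:.1f}, p = {p_gamma:.3g}")
```

On these rounded values, adding invariable sites improves the fit enormously, whereas adding the gamma correction on top of that gives a statistic of only 2, which would not be significant at the usual 5% level (critical value 3.84 for 1 degree of freedom).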

Slide46: 

The 16S rRNA genes of Aquifex, Bacillus, Deinococcus, Thermus and Thermus ruber

Exclude characters command in PAUP - exclude constant sites:
Character-exclusion status changed: 837 characters excluded
Total number of characters now excluded = 837
Number of included characters = 436

Base frequencies command in PAUP:
Taxon            A         C         G         T     # sites
--------------------------------------------------------------
ruber         0.19725   0.27294   0.29587   0.23394    436
Aquifex       0.12156   0.38073   0.38532   0.11239    436
Deinococc     0.22477   0.22936   0.28211   0.26376    436
Thermus       0.13103   0.35862   0.37931   0.13103    435
Bacillus      0.22477   0.23394   0.27523   0.26606    436
--------------------------------------------------------------
Mean          0.17990   0.29509   0.32354   0.20147    435.80

Output of GTR-inv sites ML analysis for (Deinococcus, Bacillus, Aquifex, thermus and Thermus ruber): 

Tree 1: -log likelihood = 4439
Tree 2: -log likelihood = 4447
Tree 3: -log likelihood = 4437 (best tree = true tree)
With the addition of Thermus ruber, which has a base composition intermediate between those of thermophiles and mesophiles, GTR-inv sites ML recovers the Thermus + Deinococcus relationship.

Slide48: 

Estimation of ML substitution model parameters
Yang (1995) has shown that parameter estimates are reasonably stable across tree topologies provided trees are not “too wrong”. Thus one can obtain a tree using a quick method (useful when many sequences are being analysed) and then estimate parameters on that tree. These parameters can then be used in a search for the maximum likelihood tree.

Slide49: 

Parameter estimates using the “tree scores” command in PAUP*
Use the PAUP* tree scores command to obtain ML estimates over this tree of:
1) Proportion of invariant sites
2) Gamma shape parameter for variable sites
3) Substitution parameters for all types of change
[Figure: the maximum parsimony tree.]

Slide50: 

ML parameter estimates over a parsimony tree using tree scores in PAUP*

Tree number 1:
-Ln likelihood = 4432.16903
Estimated R-matrix:
  -2.992539     0.53399075   1.6835489    0.77499941
   0.53399075  -6.0877637    1.0048052    4.5489678
   1.6835489    1.0048052   -3.6883541    1
   0.77499941   4.5489678    1           -6.3239672
Corresponding Q-matrix:
  -0.77509276   0.12637319   0.52569668   0.12302289
   0.11475088  -1.1506065    0.31375553   0.72210013
   0.36178289   0.2377952   -0.75831742   0.15873934
   0.16654196   1.0765496    0.31225508  -1.5553467
Estimated value of proportion of invariable sites = 0.302946
Estimated value of gamma shape parameter = 0.629797

These values can then be used as the starting parameters for a full likelihood search.
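The relationship between the two printed matrices can be checked numerically. In the usual GTR parameterisation each off-diagonal entry of the Q-matrix is the corresponding R-matrix entry multiplied by a factor that depends only on the target base (its column), and each row of Q sums to zero. The sketch below recovers those column factors from the PAUP* output above. Reading the factors as being proportional to base frequencies, and the row/column order as A, C, G, T, are assumptions made here rather than statements from the slides.

```python
R = [
    [-2.992539, 0.53399075, 1.6835489, 0.77499941],
    [0.53399075, -6.0877637, 1.0048052, 4.5489678],
    [1.6835489, 1.0048052, -3.6883541, 1.0],
    [0.77499941, 4.5489678, 1.0, -6.3239672],
]
Q = [
    [-0.77509276, 0.12637319, 0.52569668, 0.12302289],
    [0.11475088, -1.1506065, 0.31375553, 0.72210013],
    [0.36178289, 0.2377952, -0.75831742, 0.15873934],
    [0.16654196, 1.0765496, 0.31225508, -1.5553467],
]


def column_ratios(r_matrix, q_matrix):
    """For each column j, return the ratios Q[i][j] / R[i][j] over rows i != j."""
    size = len(r_matrix)
    return [
        [q_matrix[i][j] / r_matrix[i][j] for i in range(size) if i != j]
        for j in range(size)
    ]


ratios = column_ratios(R, Q)
factors = [sum(col) / len(col) for col in ratios]
row_sums = [sum(row) for row in Q]

for base, factor, col in zip("ACGT", factors, ratios):
    spread = max(col) - min(col)
    print(f"column {base}: factor = {factor:.6f} (spread {spread:.1e})")
print("Q row sums:", [round(s, 6) for s in row_sums])
```

Within each column the three ratios agree to the precision of the printed output, which is consistent with Q being built from the symmetric R-matrix by a column-specific scaling.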

Slide51: 

Maximum Likelihood Tree

Slide52: 

Maximum Likelihood - advantages
Mathematically rigorous and performs well in computer simulations
Allows investigation of the fit between model and data
Provides a simple way of comparing trees according to their likelihoods (difference tests - Kishino-Hasegawa test)

Slide53: 

Maximum Likelihood - disadvantages
Maximum likelihood will only be consistent (converge on the true tree) if evolution proceeds according to the assumed model: how well does the model fit the data?
Becomes computationally infeasible with many taxa or many model parameters