Phylogenetic Hidden Markov Models: A Tool for Comparative Genomics : Phylogenetic Hidden Markov Models: A Tool for Comparative Genomics COMP571
Fall 2006
Comparative Genomics : Comparative Genomics Comparing genetic material from different species to understand evolution, conservation, gene function, disease inheritance, etc.
A 'feature' that is present/conserved across species may not be detected by looking at a single genome
What Is Compared? : What Is Compared? Gene location (physical location on the chromosome)
Gene structure (numbers and lengths of the various 'components')
Gene characteristics (codon usage, conserved synteny, etc.)
Today there is great interest in analyzing and understanding conserved non-coding sequences (CNS)
Slide4 : Regions of the human and mouse homologous genes: Coding exons (white), noncoding exons (gray}, introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions of GLASS are shown connected with arrows. Dark lines connecting the alignment regions denote very weak or no alignment. The predicted coding regions of ROSETTA in human, and the corresponding regins in mouse, are shown (white) between the genes and the alignment regions.
From 'Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction', Batzoglou et al., Genome Research.
Slide5 : Regions of conserved synteny between Mmu 16 and
the human genome. The analysis was done at the
protein level with the MUMmmer program; each line
represents a pair of orthologous genes present in the
mouse and human. Cyto, cytogenetic markers; SCF,
scaffold distribution; and Ori, orientation of scaffolds.
From 'A Comparison of Whole-genome Shotgun-
Derived Mouse Chromosome 16 and the Human
Genome', Mural et al., Science.
Combining Phylogeny and HMMs : Combining Phylogeny and HMMs Molecular evolution can be viewed as a combination of two Markov processes
One that operates in the dimension of space (along a genome)
One that operates in the dimension of time (along the branches of a phylogenetic tree)
Phylo-HMMs model this combination
A Brief History of Phylo-HMMs : A Brief History of Phylo-HMMs First proposed as a way of improving phylogenetic modeling that allow for variation among sites in the rate of substitution (Felsensteinandamp;Churchill, 1996; Yang, 1995)
Adapted for the problem of secondary structure prediction (Goldman et al., 1996; Thorne et al., 1996)
Later used for recombination detection (Husmeierandamp;Wright, 2001)
Recently, there has been a wide revival of interest in these models in connection with the explosion in the availability of comparative sequence data
Phylo-HMMs : Phylo-HMMs A phylo-HMM is a probabilistic machine that generates a multiple alignment, column by column, such that each column is defined by a phylogenetic model
Unlike single-sequence HMMs, the emission probabilities of phylo-HMMs are complex distributions defined by phylogenetic models
Slide9 : Single-sequence HMM Phylo-HMM
Phylogenetic models (1) : Phylogenetic models (1) Stochastic process of substitution that operates independently at each site in a genome
A character is first drawn at random from the background distribution and assigned to the root of the tree; character substitutions then occur randomly along the tree branches, from root to leaves
The characters at the leaves define an alignment column
Phylogenetic Models (2) : Phylogenetic Models (2) The different phylogenetic models associated with the states of a phylo-HMM may reflect different overall rates of substitution (as in conserved and non-conserved regions), different patterns of substitution or background distributions (as in coding and non-coding regions), or even different tree topologies (as with recombination)
Phylo-HMMs: Formal Definition : Phylo-HMMs: Formal Definition A phylo-HMM is a 4-tuple :
: set of states
: set of associated phylogenetic models
: transition probabilities
: initial-state probabilities
The Phylogenetic Model (1) : The Phylogenetic Model (1) :
: substitution rate matrix
: background frequencies
: binary tree
: branch lengths
The Phylogenetic Model (2) : The Phylogenetic Model (2) The model is defined with respect to an alphabet whose size is denoted d
The substitution rate matrix has dimension dxd
The background frequencies vector has dimension d
The tree has n leaves, corresponding to n extant taxa
The branch lengths are associated with the tree
Probability of the Data : Probability of the Data Let X be an alignment consisting of L columns and n rows, with the ith column denoted Xi
The probability that column Xi is emitted by state sj is simply the probability of Xi under the corresponding phylogenetic model,
This is the likelihood of the column given the tree, which can be computed efficiently using Felsenstein’s 'pruning' algorithm (which we described earlier)
Substitution Probabilities (1) : Substitution Probabilities (1) Felsenstein’s algorithm requires the conditional probabilities of substitution for all bases a,b and branch lengths tj
The probability of substitution of a base b for a base a along a branch of length t, denoted
is based on a continuous-time Markov model of substitution, defined by the rate matrix Qj
Substitution Probabilities (2) : Substitution Probabilities (2) In particular, for any given non-negative value t, the conditional probabilities for all a,b are given the dxd matrix , where
Example: HKY model : Example: HKY model represents the transition/transversion rate ratio for The - symbols indicate quantities required to make each row
sum to zero
Paths in Phylo-HMMs (1) : Paths in Phylo-HMMs (1) A 'path' through the phylo-HMM is a sequence of states such that
The joint probability of a path and and alignment is
Paths in Phylo-HMMs (2) : Paths in Phylo-HMMs (2) The likelihood is given by the sum over all paths
The maximum-likelihood path is
Computing the Probabilities : Computing the Probabilities The likelihood can be computed efficiently using the forward algorithm
The maximum-likelihood path can be computed efficiently using the Viterbi algorithm
The forward and backward algorithms can be combined to compute the posterior probability
Higher-order Markov Models for Emissions : Higher-order Markov Models for Emissions It is common with gene-finding HMMs to condition the emission probability of each observation on the observations that immediately precede it in the sequence
For example, in a 3-rd-codon-position state, the emission of a base xi='A' might have a fairly high probability if the previous two bases are xi-2='G' and xi-1='A' (GAA=Glu), but should have zero probability if the previous two bases are xi-2='T' and xi-1='A' (TAA=stop)
Higher-order Markov Models for Emission : Higher-order Markov Models for Emission Considering the N observations preceding each xi corresponds to using an Nth order Markov model for emissions
An Nth order model for emissions is typically parameterized in terms of (N+1)-tuples of observations, and conditional probabilities are computed as
Nth Order Phylo-HMMs : Nth Order Phylo-HMMs Sum over all possible alignment columns Y
(can be calculated efficiently by a slight modification
of Felsenstein’s 'pruning' algorithm Probability of the N-tuple