authorSTREAM Share PowerPoint. Anywhere

Comp Gen Phylo HMM

Uploaded from authorPOINT
Download as Download Not Available PPT
Presentation Description

No description available

What's up on authorSTREAM?
Views: 77
Like it  ( Likes) Dislike it  ( Dislikes)
Added: September 25, 2007 This presentation is Public
Presentation Category :Product Training/ Manuals
Presentation StatisticsNew!
Views on authorSTREAM: 77
Presentation Transcript

Phylogenetic Hidden Markov Models: A Tool for Comparative Genomics : Phylogenetic Hidden Markov Models: A Tool for Comparative Genomics COMP571 Fall 2006


Comparative Genomics : Comparative Genomics Comparing genetic material from different species to understand evolution, conservation, gene function, disease inheritance, etc. A 'feature' that is present/conserved across species may not be detected by looking at a single genome


What Is Compared? : What Is Compared? Gene location (physical location on the chromosome) Gene structure (numbers and lengths of the various 'components') Gene characteristics (codon usage, conserved synteny, etc.) Today there is great interest in analyzing and understanding conserved non-coding sequences (CNS)


Slide4 : Regions of the human and mouse homologous genes: Coding exons (white), noncoding exons (gray}, introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions of GLASS are shown connected with arrows. Dark lines connecting the alignment regions denote very weak or no alignment. The predicted coding regions of ROSETTA in human, and the corresponding regins in mouse, are shown (white) between the genes and the alignment regions. From 'Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction', Batzoglou et al., Genome Research.


Slide5 : Regions of conserved synteny between Mmu 16 and the human genome. The analysis was done at the protein level with the MUMmmer program; each line represents a pair of orthologous genes present in the mouse and human. Cyto, cytogenetic markers; SCF, scaffold distribution; and Ori, orientation of scaffolds. From 'A Comparison of Whole-genome Shotgun- Derived Mouse Chromosome 16 and the Human Genome', Mural et al., Science.


Combining Phylogeny and HMMs : Combining Phylogeny and HMMs Molecular evolution can be viewed as a combination of two Markov processes One that operates in the dimension of space (along a genome) One that operates in the dimension of time (along the branches of a phylogenetic tree) Phylo-HMMs model this combination


A Brief History of Phylo-HMMs : A Brief History of Phylo-HMMs First proposed as a way of improving phylogenetic modeling that allow for variation among sites in the rate of substitution (Felsensteinandamp;Churchill, 1996; Yang, 1995) Adapted for the problem of secondary structure prediction (Goldman et al., 1996; Thorne et al., 1996) Later used for recombination detection (Husmeierandamp;Wright, 2001) Recently, there has been a wide revival of interest in these models in connection with the explosion in the availability of comparative sequence data


Phylo-HMMs : Phylo-HMMs A phylo-HMM is a probabilistic machine that generates a multiple alignment, column by column, such that each column is defined by a phylogenetic model Unlike single-sequence HMMs, the emission probabilities of phylo-HMMs are complex distributions defined by phylogenetic models


Slide9 : Single-sequence HMM Phylo-HMM


Phylogenetic models (1) : Phylogenetic models (1) Stochastic process of substitution that operates independently at each site in a genome A character is first drawn at random from the background distribution and assigned to the root of the tree; character substitutions then occur randomly along the tree branches, from root to leaves The characters at the leaves define an alignment column


Phylogenetic Models (2) : Phylogenetic Models (2) The different phylogenetic models associated with the states of a phylo-HMM may reflect different overall rates of substitution (as in conserved and non-conserved regions), different patterns of substitution or background distributions (as in coding and non-coding regions), or even different tree topologies (as with recombination)


Phylo-HMMs: Formal Definition : Phylo-HMMs: Formal Definition A phylo-HMM is a 4-tuple : : set of states : set of associated phylogenetic models : transition probabilities : initial-state probabilities


The Phylogenetic Model (1) : The Phylogenetic Model (1) : : substitution rate matrix : background frequencies : binary tree : branch lengths


The Phylogenetic Model (2) : The Phylogenetic Model (2) The model is defined with respect to an alphabet  whose size is denoted d The substitution rate matrix has dimension dxd The background frequencies vector has dimension d The tree has n leaves, corresponding to n extant taxa The branch lengths are associated with the tree


Probability of the Data : Probability of the Data Let X be an alignment consisting of L columns and n rows, with the ith column denoted Xi The probability that column Xi is emitted by state sj is simply the probability of Xi under the corresponding phylogenetic model, This is the likelihood of the column given the tree, which can be computed efficiently using Felsenstein’s 'pruning' algorithm (which we described earlier)


Substitution Probabilities (1) : Substitution Probabilities (1) Felsenstein’s algorithm requires the conditional probabilities of substitution for all bases a,b and branch lengths tj The probability of substitution of a base b for a base a along a branch of length t, denoted is based on a continuous-time Markov model of substitution, defined by the rate matrix Qj


Substitution Probabilities (2) : Substitution Probabilities (2) In particular, for any given non-negative value t, the conditional probabilities for all a,b are given the dxd matrix , where


Example: HKY model : Example: HKY model represents the transition/transversion rate ratio for The - symbols indicate quantities required to make each row sum to zero


Paths in Phylo-HMMs (1) : Paths in Phylo-HMMs (1) A 'path' through the phylo-HMM is a sequence of states such that The joint probability of a path and and alignment is


Paths in Phylo-HMMs (2) : Paths in Phylo-HMMs (2) The likelihood is given by the sum over all paths The maximum-likelihood path is


Computing the Probabilities : Computing the Probabilities The likelihood can be computed efficiently using the forward algorithm The maximum-likelihood path can be computed efficiently using the Viterbi algorithm The forward and backward algorithms can be combined to compute the posterior probability


Higher-order Markov Models for Emissions : Higher-order Markov Models for Emissions It is common with gene-finding HMMs to condition the emission probability of each observation on the observations that immediately precede it in the sequence For example, in a 3-rd-codon-position state, the emission of a base xi='A' might have a fairly high probability if the previous two bases are xi-2='G' and xi-1='A' (GAA=Glu), but should have zero probability if the previous two bases are xi-2='T' and xi-1='A' (TAA=stop)


Higher-order Markov Models for Emission : Higher-order Markov Models for Emission Considering the N observations preceding each xi corresponds to using an Nth order Markov model for emissions An Nth order model for emissions is typically parameterized in terms of (N+1)-tuples of observations, and conditional probabilities are computed as


Nth Order Phylo-HMMs : Nth Order Phylo-HMMs Sum over all possible alignment columns Y (can be calculated efficiently by a slight modification of Felsenstein’s 'pruning' algorithm Probability of the N-tuple