http://creativecommons.org/licenses/by-sa/2.0/: http://creativecommons.org/licenses/by-sa/2.0/
BNFO 602, Lecture 2: BNFO 602, Lecture 2 Usman Roshan Some of the slides are based upon material by David Wishart of University of Alberta and Ron Shamir
of Tel Aviv University
Previously…: Previously… Model of DNA sequence evolution
Poisson model under two state characters
Derivation of expected number of changes on a single edge of a tree
Jukes-Cantor for four state model (DNA)
Estimating expected number of changes of two DNA sequences using maximum likelihood
Previously…: Previously… Distance based phylogeny reconstruction methods
UPGMA --- not additive
Neighbor Joining --- additive
Both are fast --- polynomial time
Easy to implement
Previously: Previously Simulation
Method for simulating evolution of DNA sequences on a fixed tree
Comparing two different phylogenies for computing the error rate
Effect of accuracy on real methods as a function of sequence length, number of sequences, and other factors
Sequence Alignment: Sequence Alignment Widely used in bioinformatics
Proteins and genes are of different lengths due to error in sequencing and genetic variation across species
Involves identifying evolutionary events: insertions, deletions, and substitutions
Goal is to 'align' sequences such that number of mutations is minimized
Sequencing Successes: Sequencing Successes T7 bacteriophage
completed in 1983
39,937 bp, 59 coded proteins
Escherichia coli
completed in 1998
4,639,221 bp, 4293 ORFs
Sacchoromyces cerevisae
completed in 1996
12,069,252 bp, 5800 genes
Sequencing Successes: Sequencing Successes Caenorhabditis elegans
completed in 1998
95,078,296 bp, 19,099 genes
Drosophila melanogaster
completed in 2000
116,117,226 bp, 13,601 genes
Homo sapiens
completed in 2003
3,201,762,515 bp, 31,780 genes
Genomes to Date: Genomes to Date 8 vertebrates (human, mouse, rat, fugu, zebrafish)
3 plants (arabadopsis, rice, poplar)
2 insects (fruit fly, mosquito)
2 nematodes (C. elegans, C. briggsae)
1 sea squirt
4 parasites (plasmodium, guillardia)
4 fungi (S. cerevisae, S. pombe)
200+ bacteria and archebacteria
2000+ viruses
So what do we do with all this sequence data?: So what do we do with all this sequence data?
So what do we do with all this sequence data?: So what do we do with all this sequence data? Comparative bioinformatics
DNA Sequence Evolution: DNA Sequence Evolution
Sequence alignments: Sequence alignments They tell us about
Function or activity of a new gene/protein
Structure or shape of a new protein
Location or preferred location of a protein
Stability of a gene or protein
Origin of a gene or protein
Origin or phylogeny of an organelle
Origin or phylogeny of an organism
And more…
Pairwise alignment: Pairwise alignment How to align two sequences?
Pairwise alignment: Pairwise alignment How to align two sequences?
We use dynamic programming
Treat DNA sequences as strings over the alphabet {A, C, G, T}
Pairwise alignment: Pairwise alignment
Dynamic programming: Dynamic programming Define V(i,j) to be the optimal pairwise alignment score between S1..i and T1..j (|S|=m, |T|=n)
Dynamic programming: Dynamic programming Time and space complexity is O(mn) Define V(i,j) to be the optimal pairwise alignment score between S1..i and T1..j (|S|=m, |T|=n)
Tabular computation of scores: Tabular computation of scores
Traceback to get alignment: Traceback to get alignment
Local alignment: Local alignment Finding optimally aligned local regions
Local alignment: Local alignment
Database searching: Database searching Suppose we have a set of 1,000,000 sequences
You have a query sequence q and want to find the m closest ones in the database---that means 1,000,000 pairwise alignments!
How to speed up pairwise alignments?
FASTA: FASTA FASTA was the first software for quick searching of a database
Introduced the idea of searching for k-mers
Can be done quickly by preprocessing database
FASTA: combine high scoring hits into diagonal runs: FASTA: combine high scoring hits into diagonal runs
BLAST: BLAST Key idea: search for k-mers (short matchig substrings)
quickly by preprocessing the database.
BLAST: BLAST This key idea can also be used for speeding up pairwise
alignments when doing multiple sequence alignments
Biologically realistic scoring matrices: Biologically realistic scoring matrices PAM and BLOSUM are most popular
PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins
BLOSUM is more recent and computed from blocks of sequences with sufficient similarity
PAM: PAM We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j
Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families
Compute probabilities of change and background probabilities by simple counting
PAM: PAM In this model the unit of evolution is the amount of evolution that will change 1 in 100 amino acids on the average The scoring matrix Sab is the ratio of Mab to pb
PAM Mij matrix (x10000): PAM Mij matrix (x10000)
Multiple sequence alignment: Multiple sequence alignment 'Two sequences whisper, multiple sequences shout out loud'---Arthur Lesk
Computationally very hard---NP-hard
Formally…: Formally…
Multiple sequence alignment: Multiple sequence alignment Unaligned sequences
GGCTT
TAGGCCTT
TAGCCCTTA
ACACTTC
ACTT Aligned sequences
_G_ _ GCTT_
TAGGCCTT_
TAGCCCTTA
A_ _CACTTC
A_ _C_ CTT_
Sum of pairs score: Sum of pairs score
Sum of pairs score: Sum of pairs score What is the sum of pairs score of this alignment?
Tree alignment score: Tree alignment score
Tree alignment score: Tree alignment score
Tree Alignment: Tree Alignment TAGGCCTT
(Human) TAGCCCTTA
(Monkey) ACCTT
(Cat) ACACTTC
(Lion) GGCTT
(Mouse)
Tree Alignment: Tree Alignment TAGGCCTT_
(Human) TAGCCCTTA
(Monkey) A__C_CTT_
(Cat) A__CACTTC
(Lion) _G__GCTT_
(Mouse)
Tree Alignment---depends on tree: Tree Alignment---depends on tree TAGGCCTT_
(Human) TAGCCCTTA
(Monkey) A__C_CTT_
(Cat) A__CACTTC
(Lion) _G__GCTT_
(Mouse) TA_CCCTT_ TA_CCCTTA TA_CCCTT_ TA_CCCTTA Tree alignment score = 15 Switch monkey and cat
Profiles: Profiles Before we see how to construct multiple alignments, how do we align two alignments?
Idea: summarize an alignment using its profile and align the two profiles
Profile alignment: Profile alignment
Iterative alignment(heuristic for sum-of-pairs): Iterative alignment (heuristic for sum-of-pairs) Pick a random sequence from input set S
Do (n-1) pairwise alignments and align to closest one t in S
Remove t from S and compute profile of alignment
While sequences remaining in S
Do |S| pairwise alignments and align to closest one t
Remove t from S
Iterative alignment: Iterative alignment Once alignment is computed randomly divide it into two parts
Compute profile of each sub-alignment and realign the profiles
If sum-of-pairs of the new alignment is better than the previous then keep, otherwise continue with a different division until specified iteration limit
Progressive alignment: Progressive alignment Idea: perform profile alignments in the order dictated by a tree
Given a guide-tree do a post-order search and align sequences in that order
Widely used heuristic
Can be used for solving tree alignment
Simultaneous alignment and phylogeny reconstruction: Simultaneous alignment and phylogeny reconstruction Given unaligned sequences produce both alignment and phylogeny
Known as the generalized tree alignment problem---MAX-SNP hard
Iterative improvement heuristic:
Take starting tree
Modify it using say NNI, SPR, or TBR
Compute tree alignment score
If better then select tree otherwise continue until reached a local minimum
Median alignment: Median alignment Idea: iterate over the phylogeny and align every triplet of sequences---takes o(m3) (in general for n sequences it takes O(2nmn) time
Same profiles can be used as in progressive alignment
Produces better tree alignment scores (as observed in experiments)
Iteration continues for a specified limit
Popular alignment programs: Popular alignment programs ClustalW: most popular, progressive alignment
MUSCLE: fast and accurate, progressive and iterative combination
T-COFFEE: slow but accurate, consistency based alignment (align sequences in multiple alignment to be close to the optimal pairwise alignment)
PROBCONS: slow but highly accurate, probabilistic consistency progressive based scheme
DIALIGN: very good for local alignments
MUSCLE: MUSCLE
MUSCLE: MUSCLE
MUSCLE: MUSCLE Profile sum-of-pairs score Log expectation score used by MUSCLE
Evaluation of multiple sequence alignments: Evaluation of multiple sequence alignments Compare to benchmark 'true' alignments
Use simulation
Measure conservation of an alignment
Measure accuracy of phylogenetic trees
How well does it align motifs?
More…
BAliBASE: BAliBASE Most popular benchmark of alignments
Alignments are based upon structure BAliBASE currently consists of 142 reference alignments, containing over 1000 sequences. Of the 200,000 residues in the database, 58% are defined within the core blocks. The remaining 42% are in ambiguous regions that cannot be reliably aligned. The alignments are divided into four hierarchical reference sets, reference 1 providing the basis for construction of the following sets. Each of the main sets may be further sub-divided into smaller groups, according to sequence length and percent similarity.
BAliBASE: BAliBASE The sequences included in the database are selected from alignments in either the FSSP or HOMSTRAD structural databases, or from manually constructed structural alignments taken from the literature. When sufficient structures are not available, additional sequences are included from the HSSP database (Schneider et al., 1997). The VAST Web server (Madej, 1995) is used to confirm that the sequences in each alignment are structural neighbours and can be structurally superimposed. Functional sites are identified using the PDBsum database (Laskowski et al., 1997) and the alignments are manually verified and adjusted, in order to ensure that conserved residues are aligned as well as the secondary structure elements.
BAliBASE: BAliBASE Reference 1 contains alignments of (less than 6) equi-distant sequences, ie. the percent identity between two sequences is within a specified range. All the sequences are of similar length, with no large insertions or extensions. Reference 2 aligns up to three 'orphan' sequences (less than 25% identical) from reference 1 with a family of at least 15 closely related sequences. Reference 3 consists of up to 4 sub-groups, with less than 25% residue identity between sequences from different groups. The alignments are constructed by adding homologous family members to the more distantly related sequences in reference 1. Reference 4 is divided into two sub-categories containing alignments of up to 20 sequences including N/C-terminal extensions (up to 400 residues), and insertions (up to 100 residues).
Comparison of alignments on BAliBASE: Comparison of alignments on BAliBASE
Parsimonious aligner (PAl): Parsimonious aligner (PAl) Construct progressive alignment A
Construct MP tree T on A
Construct progressive alignment A’ on guide-tree T
Set A=A’ and go to 3
Output alignment and tree with best MP score
PAl: PAl Faster than iterative improvement
Speed and accuracy both depend upon progressive alignment and MP heuristic
In practice MUSCLE and TNT are used for constructing alignments and MP trees
How does PAl compare against traditional methods?
PAl not designed for aligning structural regions but focuses on evolutionary conserved regions
Let’s look at performance under simulation
Evaluating alignments under simulation: Evaluating alignments under simulation We first need a way to evolve sequences with insertions and deletions
NOTE: evolutionary models we have encountered so far do not account for insertions and deletions
Not known exactly how to model insertions and deletions
ROSE: ROSE Evolve sequences under an i.i.d. Markov Model
Root sequence: probabilities given by a probability vector (for proteins default is Dayhoff et. al. values)
Substitutions
Edge length are integers
Probability matrix M is given as input (default is PAM1*)
For edge of length b probabilty of x ïƒ y is given by Mbxy
Insertion and deletions:
Insertions and deletions follow the same probabilistic model
For each edge probability to insert is iins .
Length of insertion is given by discrete probability distribution (normally exponential)
For edge of length b this is repeated b times.
Model tree can be specified as input
Evaluation of alignments: Evaluation of alignments Let’s simulate alignments and
phylogenies and compare them under
simulation!!
Parameters for simulation study: Parameters for simulation study Model trees: uniform random distribution and uniformly selected random edge lengths
Model of evolution: PAM with insertions and deletions probabilities selected from a gamma distribution (see ROSE software package)
Replicate settings: Settings of 50, 100, and 400 taxa, mean sequence lengths of 200 and 500 and avg branch lengths of 10, 25, and 50 were selected. For each setting 10 datasets were produced
Phylogeny accuracy: Phylogeny accuracy
Alignment accuracy: Alignment accuracy
Running time: Running time
Conclusions: Conclusions DIALIGN seems to perform best followed by PAl, MUSCLE, and PROBCONS
DIALIGN, however, is slower than PAl
Does this mean DIALIGN is the best alignment program?
Conclusions: Conclusions DIALIGN seems to perform best followed by PAl, MUSCLE, and PROBCONS
DIALIGN, however, is slower than PAl
Does this mean DIALIGN is the best alignment program?
Not necessarily: experiments were performed under uniform random trees with uniform random edge lengths. Not clear if this emulates the real deal.
Conclusions: Conclusions DIALIGN seems to perform best followed by PAl, MUSCLE, and PROBCONS
DIALIGN, however, is slower than PAl
Does this mean DIALIGN is the best alignment program?
Not necessarily: experiments were performed under uniform random trees with uniform random edge lengths. Not clear if this emulates the real deal.
What about sum-of-pairs vs MP scores?
Sum-of-pairs vs MP score: Sum-of-pairs vs MP score
Sum-of-pairs vs MP score: Sum-of-pairs vs MP score
Conclusions: Conclusions Optimizing MP scores under this simulation model leads to better phylogenies and alignments
Conclusions: Conclusions Optimizing MP scores under this simulation model leads to better phylogenies and alignments
What other models can we try?
Conclusions: Conclusions Optimizing MP scores under this simulation model leads to better phylogenies and alignments
What other models can we try?
Real data phylogenies as model trees
Birth-death model trees
Other distributions for model trees…
Branch lengths: similar issues…
Evolutionary model parameters estimated from real data