logging in or signing up xiehao0 Sibilla Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 24 Category: Education License: All Rights Reserved Like it (0) Dislike it (0) Added: February 28, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts: Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts XIE Huimin (谢惠民) Department of Mathematics, Suzhou University and HAO Bailin (郝柏林) T-Life Research Center, Fudan University Beijing Genomics Institute, Academia Sinica Institute of Theoretical Physics, Academia SinicaSlide2: Prokaryote Complete Genomes ( PCG ) K Biology-inspired mathematics Combinatorics Goulden-Jackson cluster method Factorizable Language Avioded and Rare K-strings K=6-9,15,18 Species-specificity of avoidance Phylogeny ( Failure ) Phylogeny Based on PCG Compositional Distance ( Success ) Decomposition and Reconstruction of AA sequences Graph theory: Euler pathsSlide5: 1. 2D Histogram of K-TuplesSlide6: g c a tThermoanaerobacter tengchongenic(K = 8): Thermoanaerobacter tengchongenic (K = 8) g c a tThe Algorithm (Hao Histogram) Implemented at:: The Algorithm (Hao Histogram) Implemented at: National Institute for Standard and Technology (NIST) http://math.nist.gov/~FHunt/GenPatterns/ European Bioinformatics Institute (EBI) http://industry.ebi.ac.uk/openBSA/bsa_viewers However, 2D only, no 1D histograms.Two Mathematical Problems: Two Mathematical Problems Dimensions of the complementary sets of portraits of tagged strings. Number of true and redundant missing strings. The two problems turn out to be one and the same, the first being graphic representation of the second.Two Methods to Solve the Problem: Two Methods to Solve the Problem Combinatorial solution: Goulden-Jackson cluster method (1979); number of dirty and clean words. Language theory solution: factorizable language, minimal deterministic finite-state automaton.2. 1D Histogram of K-Tuples: 2. 1D Histogram of K-Tuples Collect those K-tuples whose count fall in a bin from to , Plot the number of such K-tuples versus the counts, This is a 1D histogram or An expectation curve. Slide17: The effect of c+g content in 2D histograms of original genome and randomized sequence:Slide18: Escherichia coli original genomeSlide19: Escherichia coli randomized sequenceSlide20: Haemophilus influenzae randomized sequenceSlide21: Mycobacterium leprae original genomeSlide22: Mycobacterium laprae randomized sequenceSlide23: Mycobacterium tuberculosis original genomeSlide24: Mycobacterium tuberculosis randomized sequenceG+C Content of Some Bacteria: G+C Content of Some Bacteria3. Three Artificial Models Generating Sequences: 3. Three Artificial Models Generating Sequences Eiid: equal-probability independently and identically distributed model. Niid: nonequal-probability independently and identically distributed model. MMn: Markov model of order n Monte Carlo Methodestimation of expectation (ex) and standard deviation (sd) for an niid model: Monte Carlo Method estimation of expectation (ex) and standard deviation (sd) for an niid model (the compositions of a,c,g,t are 15:35:35:15, the length of sequence is , the value of K=8.)Validation about the Robustness of K-Histograms: a comparison of absolute error from ex in an experiment with sd as reference: Validation about the Robustness of K-Histograms: a comparison of absolute error from ex in an experiment with sd as referenceCompare the population of shuffling a given sequence and the population of sequence generated from a stochastic model.: Compare the population of shuffling a given sequence and the population of sequence generated from a stochastic model. F-test t-test4. A Theory for the Expectation Curve (1): Definition. For each , define a random variable Where random variable takes value 1 if the i-th K-tuple occurs exactly n times in the sequence, or takes value 0 if it does not occur. (1) 4. A Theory for the Expectation Curve (1) A Theory for the Expectation Curve (2): Theorem. For each , the mathematical expectation of random variable is given by Where the random variable is the occurrence number of K-tuples of I-th type. (1) A Theory for the Expectation Curve (2) The Exact Computation of Expectation Curve: The Exact Computation of Expectation Curve In order to compute the expectation curve we need to know the probability for each and . The Goulden-Jackson cluster method can be used successfully for the model of eiid. It is still difficult to do the computation for other models.Slide33: Two Experiments (for the model of eiid): compare with a K-histogram compare with Monte Carlo method the red curves are the standard deviation estimation obtained by Monte Carlo method. Poisson Approximation forthe Expectation Curve: Poisson Approximation for the Expectation Curve For each K-tuple calculate its expected number of appearing in sequence of length N, then use the formula of probability function of Poisson distribution and sum them for all K-tuples: Remark. This follows from a theorem in Percus and Whitlock, ACM Transaction on Modeling and Computer Simulation, 5 (1995) 87—100 (the model, however, can only be eiid, and the tuples must be overlapless). Comparison of Poissonapproximation with K-histogram for U. urealyticum: Comparison of Poisson approximation with K-histogram for U. urealyticumComparison of Poissonapproximation with 7-histogram for Haemophilus influenzae: Comparison of Poisson approximation with 7-histogram for Haemophilus influenzaeComparison of Poissonapproximation with 8-histogram for Haemophilus influenzae: Comparison of Poisson approximation with 8-histogram for Haemophilus influenzaeA comparison of Poisson approximation withMonte Carlo method: A comparison of Poisson approximation with Monte Carlo method In this computation the model is an niid, in which the parameters are taken from the randomized sequence of H. influenzae. 5. Analysis of the Mechanism of Multi-Modal K-histograms: 5. Analysis of the Mechanism of Multi-Modal K-histograms An example for H. influenzae. The length of its genome is 1830023. Under the simplified conditions of for , there are only 9 types of different of as shown in the following list.The following map shows the nine individual probability functions and their sum: The following map shows the nine individual probability functions and their sum Notice that the effect from the ratio of successive modes:For E. coli the ratio is 0.968931, hence the result is quite different: For E. coli the ratio is 0.968931, hence the result is quite different6. Analysis of Short-Range Correlation by K-Histograms: 6. Analysis of Short-Range Correlation by K-Histograms Two 8-histograms for E. coli, the left one is from its genome, and the right one is from its Markov model of order 1.Compare the 8-histograms of Markov Models of order from 2—7 for E. coli: Compare the 8-histograms of Markov Models of order from 2—7 for E. coliSlide44: Using Markov model of order 5 and Monte Carlo method to compare the 8-histogram of E. coli’s complete genome sequence with the ex and sd of MM5. the red curve is the expectation curve estimated by doing 50 times of simulation. this is the ratio curve for Reference:: Reference: Huimin Xie, Bailin Hao, “Visualization of K-tuple distribution in prokaryote complete genomes and their randomized counterparts”, CSB2002: IEEE Computer Systems Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, 2002, 31-42. 7. Discussion: 7. Discussion Most of the results shown above are of experimental nature, many problems are left for future study. How to select reasonably the value of K. How to use 1D visualization to protein? What are the properties of random variables ? How to compute exactly the expectation curve for the model of niid and MMn? Why the Poisson approximation is effective without considering the overlap of K-tuples? You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
xiehao0 Sibilla Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 24 Category: Education License: All Rights Reserved Like it (0) Dislike it (0) Added: February 28, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts: Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts XIE Huimin (谢惠民) Department of Mathematics, Suzhou University and HAO Bailin (郝柏林) T-Life Research Center, Fudan University Beijing Genomics Institute, Academia Sinica Institute of Theoretical Physics, Academia SinicaSlide2: Prokaryote Complete Genomes ( PCG ) K Biology-inspired mathematics Combinatorics Goulden-Jackson cluster method Factorizable Language Avioded and Rare K-strings K=6-9,15,18 Species-specificity of avoidance Phylogeny ( Failure ) Phylogeny Based on PCG Compositional Distance ( Success ) Decomposition and Reconstruction of AA sequences Graph theory: Euler pathsSlide5: 1. 2D Histogram of K-TuplesSlide6: g c a tThermoanaerobacter tengchongenic(K = 8): Thermoanaerobacter tengchongenic (K = 8) g c a tThe Algorithm (Hao Histogram) Implemented at:: The Algorithm (Hao Histogram) Implemented at: National Institute for Standard and Technology (NIST) http://math.nist.gov/~FHunt/GenPatterns/ European Bioinformatics Institute (EBI) http://industry.ebi.ac.uk/openBSA/bsa_viewers However, 2D only, no 1D histograms.Two Mathematical Problems: Two Mathematical Problems Dimensions of the complementary sets of portraits of tagged strings. Number of true and redundant missing strings. The two problems turn out to be one and the same, the first being graphic representation of the second.Two Methods to Solve the Problem: Two Methods to Solve the Problem Combinatorial solution: Goulden-Jackson cluster method (1979); number of dirty and clean words. Language theory solution: factorizable language, minimal deterministic finite-state automaton.2. 1D Histogram of K-Tuples: 2. 1D Histogram of K-Tuples Collect those K-tuples whose count fall in a bin from to , Plot the number of such K-tuples versus the counts, This is a 1D histogram or An expectation curve. Slide17: The effect of c+g content in 2D histograms of original genome and randomized sequence:Slide18: Escherichia coli original genomeSlide19: Escherichia coli randomized sequenceSlide20: Haemophilus influenzae randomized sequenceSlide21: Mycobacterium leprae original genomeSlide22: Mycobacterium laprae randomized sequenceSlide23: Mycobacterium tuberculosis original genomeSlide24: Mycobacterium tuberculosis randomized sequenceG+C Content of Some Bacteria: G+C Content of Some Bacteria3. Three Artificial Models Generating Sequences: 3. Three Artificial Models Generating Sequences Eiid: equal-probability independently and identically distributed model. Niid: nonequal-probability independently and identically distributed model. MMn: Markov model of order n Monte Carlo Methodestimation of expectation (ex) and standard deviation (sd) for an niid model: Monte Carlo Method estimation of expectation (ex) and standard deviation (sd) for an niid model (the compositions of a,c,g,t are 15:35:35:15, the length of sequence is , the value of K=8.)Validation about the Robustness of K-Histograms: a comparison of absolute error from ex in an experiment with sd as reference: Validation about the Robustness of K-Histograms: a comparison of absolute error from ex in an experiment with sd as referenceCompare the population of shuffling a given sequence and the population of sequence generated from a stochastic model.: Compare the population of shuffling a given sequence and the population of sequence generated from a stochastic model. F-test t-test4. A Theory for the Expectation Curve (1): Definition. For each , define a random variable Where random variable takes value 1 if the i-th K-tuple occurs exactly n times in the sequence, or takes value 0 if it does not occur. (1) 4. A Theory for the Expectation Curve (1) A Theory for the Expectation Curve (2): Theorem. For each , the mathematical expectation of random variable is given by Where the random variable is the occurrence number of K-tuples of I-th type. (1) A Theory for the Expectation Curve (2) The Exact Computation of Expectation Curve: The Exact Computation of Expectation Curve In order to compute the expectation curve we need to know the probability for each and . The Goulden-Jackson cluster method can be used successfully for the model of eiid. It is still difficult to do the computation for other models.Slide33: Two Experiments (for the model of eiid): compare with a K-histogram compare with Monte Carlo method the red curves are the standard deviation estimation obtained by Monte Carlo method. Poisson Approximation forthe Expectation Curve: Poisson Approximation for the Expectation Curve For each K-tuple calculate its expected number of appearing in sequence of length N, then use the formula of probability function of Poisson distribution and sum them for all K-tuples: Remark. This follows from a theorem in Percus and Whitlock, ACM Transaction on Modeling and Computer Simulation, 5 (1995) 87—100 (the model, however, can only be eiid, and the tuples must be overlapless). Comparison of Poissonapproximation with K-histogram for U. urealyticum: Comparison of Poisson approximation with K-histogram for U. urealyticumComparison of Poissonapproximation with 7-histogram for Haemophilus influenzae: Comparison of Poisson approximation with 7-histogram for Haemophilus influenzaeComparison of Poissonapproximation with 8-histogram for Haemophilus influenzae: Comparison of Poisson approximation with 8-histogram for Haemophilus influenzaeA comparison of Poisson approximation withMonte Carlo method: A comparison of Poisson approximation with Monte Carlo method In this computation the model is an niid, in which the parameters are taken from the randomized sequence of H. influenzae. 5. Analysis of the Mechanism of Multi-Modal K-histograms: 5. Analysis of the Mechanism of Multi-Modal K-histograms An example for H. influenzae. The length of its genome is 1830023. Under the simplified conditions of for , there are only 9 types of different of as shown in the following list.The following map shows the nine individual probability functions and their sum: The following map shows the nine individual probability functions and their sum Notice that the effect from the ratio of successive modes:For E. coli the ratio is 0.968931, hence the result is quite different: For E. coli the ratio is 0.968931, hence the result is quite different6. Analysis of Short-Range Correlation by K-Histograms: 6. Analysis of Short-Range Correlation by K-Histograms Two 8-histograms for E. coli, the left one is from its genome, and the right one is from its Markov model of order 1.Compare the 8-histograms of Markov Models of order from 2—7 for E. coli: Compare the 8-histograms of Markov Models of order from 2—7 for E. coliSlide44: Using Markov model of order 5 and Monte Carlo method to compare the 8-histogram of E. coli’s complete genome sequence with the ex and sd of MM5. the red curve is the expectation curve estimated by doing 50 times of simulation. this is the ratio curve for Reference:: Reference: Huimin Xie, Bailin Hao, “Visualization of K-tuple distribution in prokaryote complete genomes and their randomized counterparts”, CSB2002: IEEE Computer Systems Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, 2002, 31-42. 7. Discussion: 7. Discussion Most of the results shown above are of experimental nature, many problems are left for future study. How to select reasonably the value of K. How to use 1D visualization to protein? What are the properties of random variables ? How to compute exactly the expectation curve for the model of niid and MMn? Why the Poisson approximation is effective without considering the overlap of K-tuples?