gene finding 1 (Hannah; added November 16, 2007)

Presentation Transcript

BMB3600 - Bioinformatics:
March 25 – gene finding I
March 30 – gene finding 2
April 01 – prediction of binding motifs
April 06 – microarray data analysis
April 08 – sequence comparison
April 13 – protein function prediction 1
April 15 – protein function prediction 2
April 20 – protein structure prediction 1
April 22 – protein structure prediction 2
April 27 – take-home exam

Gene Finding I -- outline:
- Problem definition
- Basic gene structures in eukaryotic versus prokaryotic genomes
- Codons and reading frames
- Codon frequencies in coding versus non-coding regions
- Basic idea of distinguishing coding from non-coding regions
- Computational methods for distinguishing coding from non-coding regions
- Collecting data for model building
- How to develop a simple gene finder

Gene finding:
The human genome has ~3 billion base pairs and about 35,000 protein-coding genes (an early estimate; current annotations list roughly 20,000)
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
…………………………………
Where are the protein-encoding genes?

The basic idea of pattern recognition:
How do kids learn to distinguish “dogs” from “cats”?
- They are “trained” by being told “A is a dog”, “B is a cat”, “C is another dog”, …..
- They learn to “extract” common features (patterns) among the animals they were told are “dogs” and “cats”
- They then apply these extracted features to identify new dogs and cats
Pattern recognition is generally done by
- providing “training sets” whose members are individually labeled “positives” versus “negatives”, or “good” versus “bad”, etc.
- learning the general rules that separate the “positives” from the “negatives”, or “good” from “bad”, ….
- applying the learned rules to new situations

Gene finding through learning:
Learning “general rules” about finding genes
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
…………………………………
Over the years, numerous genes have been identified through experiments. Some DNA segments have also been verified experimentally to be non-genes.

Gene finding through learning:
So we know
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgt
gggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagag
gtcagtgactgatgatcgatgcatgcatg
gatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatg
ctagatcgtaggtagtagctagatgcagggataaacacacggaggc
gagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaa
…………………………………
[On the slide, each segment was marked as either a gene or a non-gene.]

Gene finding through learning:
Is the sequence below a gene?
Remember “dogs”, “cats” ….
but the “patterns” here are much more hidden and more complex than the distinguishing features between “dogs” and “cats”
gcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
We need to study the basic structures of genes first ….!

Basic Gene Structures:
Eukaryotic genes
- exons, introns, translation starts and stops, splice (donor/acceptor) junctions

Basic Gene Structure:
Prokaryotic genes
- coding regions, non-coding regions
- translation starts and stops
[Figure: several genes along a genome, labeled with promoter, start, and stop.]
Prokaryotic genes are easier to identify than eukaryotic genes because of their simpler gene structure and the high density of genes in the genome

Gene Structure -- codons:
- A triplet of nucleotides is called a codon
- There are 64 codons (4 * 4 * 4 = 64): AAA, ….., TTT
- Three codons (TAG, TGA, TAA) are called stop codons because they encode the termination signal of a gene
- Each of the other codons codes for an amino acid

Gene Structure – reading frame:
Reading (or translation) frame: each DNA segment has six possible reading frames
Forward strand: ATGGCTTACGCTTGA
  Reading frame #0  ATG GCT TAC GCT TGA
  Reading frame #1  TGG CTT ACG CTT GA.
  Reading frame #2  GGC TTA CGC TTG A..
Reverse strand: TCAAGCGTAAGCCAT
  Reading frame #0  TCA AGC GTA AGC CAT
  Reading frame #1  CAA GCG TAA GCC AT.
  Reading frame #2  AAG CGT AAG CCA T..

Gene Structure – open reading frame (ORF):
- Open reading frame (ORF): a segment of DNA with an in-frame start codon at one end, an in-frame stop codon at the other end, and no in-frame stop codon in between
- Each ORF has a fixed reading frame
How many genes can an ORF have inside it?
Answer: one, because an ORF has only one stop

Gene Structure -- open reading frame (ORF):
- Generally true: all long (> 300 bp) ORFs in prokaryotic genomes encode genes
- But this is not necessarily true for eukaryotic genomes
- Coding region: a gene in prokaryotic genomes, an exon in eukaryotic genomes

Gene Structure:
- Each coding region (exon or whole gene) has a fixed translation frame
- A coding region always sits inside an ORF of the same reading frame
- All exons of a gene are on the same strand
- Neighboring exons of a gene could have different reading frames
[Figure: exons of one gene lying in frame 1, frame 2, and frame 3.]

Gene Structure – reading frame consistency:
Now … we are talking about a little more “complex” features
Neighboring exons of a gene should be frame-consistent
ATG GCT TGG GCT TTA A -------------- GT TTC CCG GAG AT ------ T GGG
[Figure labels: exon 1, exon 2, exon 3; the dashed stretches between exons are removed by splicing.]
exon1 [i, j] in frame a and exon2 [m, n] in frame b are consistent if
  b = (m - j - 1 + a) mod 3
where 1 mod 3 = 1, 2 mod 3 = 2, 3 mod 3 = 0, 4 mod 3 = 1, 5 mod 3 = 2, ......

Codon Frequencies:
- Coding sequences are translated into protein sequences
- We found the following: the dimer frequency in protein sequences is NOT evenly distributed
- The average frequency is 5%
- Some amino acids prefer to be next to each other
- Some other amino acids prefer not to be next to each other
[Figure: amino acid dimer frequencies in Shewanella.]

Dicodon Frequencies:
Believe it or not, the biased (uneven) dimer frequencies are the foundation of many gene finding programs!
Basic idea: if a dimer has a lower than average dimer frequency, proteins prefer not to have that dimer in their sequences. Hence if we see a dicodon encoding this dimer, we may want to bet against this dicodon being in a coding region!

Dicodon Frequencies:
Hence if we see many such dicodons in a DNA segment, we may want to bet that this region is a non-coding region!
This is the very basic idea of gene finding!

Dicodon Frequencies:
Dicodon frequencies in coding versus non-coding regions are genome-dependent
[Figure: dicodon frequencies for Shewanella and bovine.]

Dicodon Frequencies:
Relative frequencies of a dicodon in coding versus non-coding regions
- Frequency of dicodon X (e.g., AAAAAA) in coding regions: total number of occurrences of X in coding regions divided by the total number of dicodon occurrences in coding regions
- Frequency of dicodon X (e.g., AAAAAA) in non-coding regions: total number of occurrences of X in non-coding regions divided by the total number of dicodon occurrences in non-coding regions
In the human genome, the frequency of dicodon “AAA AAA” is ~1% in coding regions versus ~5% in non-coding regions
Question: if you see a region with many “AAA AAA”, would you guess it is a coding or non-coding region?

Basic idea of gene finding:
- Most dicodons show bias towards either coding or non-coding regions; only a fraction of dicodons is neutral
- Foundation for coding region identification
- Dicodon frequencies are the key signal used for coding region detection; all gene finding programs use this information
- Regions consisting of dicodons that mostly tend to be in coding regions are probably coding regions; otherwise they are probably non-coding regions

Basic idea of gene finding:
in-frame (the correct frame) versus any-frame dicodons
ATG TTG GAT GCC CAG AAG ............
In-frame: ATG TTG GAT GCC CAG AAG
Not in-frame: TGTTGG, ATGCCC, AGAAG. and GTTGGA, TGCCCA, GAAG..
(Together, the in-frame and not in-frame dicodons make up the any-frame dicodons.)

Basic idea of gene finding:
In-frame dicodon frequencies provide a more sensitive measure than any-frame dicodon frequencies

Computational model for gene finding:
YES, it is still simple ……
Preference model:
- for each dicodon X (e.g., AAA AAA), calculate its frequencies in coding and non-coding regions, FC(X) and FN(X)
- calculate X’s preference value P(X) = log (FC(X)/FN(X))
Properties:
- P(X) is 0 if X has the same frequency in coding and non-coding regions
- P(X) is positive if X has a higher frequency in coding than in non-coding regions; the larger the difference, the more positive the score
- P(X) is negative if X has a higher frequency in non-coding than in coding regions; the larger the difference, the more negative the score

Computational model for gene finding:
Example: coding preference of a region (an any-frame model)
AAA ATT, AAA GAC, AAA TAG have the following frequencies
  FC(AAA ATT) = 1.4%, FN(AAA ATT) = 5.2%
  FC(AAA GAC) = 1.9%, FN(AAA GAC) = 4.8%
  FC(AAA TAG) = 0.0%, FN(AAA TAG) = 6.3%
We have (logarithms are base 10)
  P(AAA ATT) = log (1.4/5.2) = -0.57
  P(AAA GAC) = log (1.9/4.8) = -0.40
  P(AAA TAG) = - infinity (treating STOP codons differently)
A region consisting of only these dicodons is probably a non-coding region
Calculate the preference scores of all dicodons of the region and sum them up. If the total score is positive, predict the region to be a coding region; otherwise predict a non-coding region.

Computational model for gene finding:
Ok, now you may want to run away …..
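Before moving on, the any-frame example above can be checked with a few lines of Python. This is only a sketch: the frequencies are the three from the slide, the base-10 logarithm is inferred from the slide's numbers (-0.57 and -0.40), and the names `preference`, `region_score`, and `predict` are made up for illustration.

```python
import math

def preference(fc, fn):
    """P(X) = log10(FC(X) / FN(X)); -inf if X never occurs in coding regions."""
    if fn <= 0:
        raise ValueError("FN(X) must be positive")
    return -math.inf if fc == 0 else math.log10(fc / fn)

# (FC, FN) in percent, copied from the example above.
FREQS = {
    "AAAATT": (1.4, 5.2),
    "AAAGAC": (1.9, 4.8),
    "AAATAG": (0.0, 6.3),
}
PREF = {x: preference(fc, fn) for x, (fc, fn) in FREQS.items()}

def region_score(dicodons):
    """Sum the preference values of all dicodons seen in a region."""
    return sum(PREF[d] for d in dicodons)

def predict(dicodons):
    """Positive total score -> coding; otherwise non-coding."""
    return "coding" if region_score(dicodons) > 0 else "non-coding"

print({x: round(p, 2) for x, p in PREF.items()})
print(predict(["AAAATT", "AAAGAC", "AAAATT"]))
```

A dicodon ending in a stop codon drives the total to minus infinity, which matches the slide's remark that stop codons are treated differently: a single in-frame stop rules out a coding call for that region.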
In-frame preference model
Actually, the concept is still simple …….

Computational model for gene finding:
In-frame preference model (most commonly used in prediction programs)
Application step:
- For each possible reading frame of a region, calculate the total in-frame preference score P0(X), the total (in-frame + 1) preference score P1(X), and the total (in-frame + 2) preference score P2(X), and sum them up
- If the score is positive, predict it to be a coding region; otherwise non-coding

Computational Gene Finding:
Prediction procedure of coding region
Procedure:
- Calculate all ORFs of a DNA segment
- For each ORF, do the following:
  - slide through the ORF with an increment of 10 base pairs
  - calculate the preference score, in the same frame as the ORF, within a window of 60 base pairs, and assign the score to the center of the window
Example (forward strand in one particular frame)
[Figure: preference scores plotted along the sequence; axis ticks at 0, +5, and -5.]

Computational Gene Finding:
Making the call: coding or non-coding, and where the boundaries are
- Need a training set with known coding and non-coding regions
- Select threshold(s) to include as many known coding regions as possible and, at the same time, to exclude as many known non-coding regions as possible
  - If threshold = 0.2, we will include 90% of coding regions and also 10% of non-coding regions
  - If threshold = 0.4, we will include 70% of coding regions and also 6% of non-coding regions
  - If threshold = 0.5, we will include 60% of coding regions and also 2% of non-coding regions
Where to draw the line?

Computational Gene Finding:
Why dicodon (6mer)?
- Codon (3-mer) based models are not nearly as information-rich as dicodon-based models
- Tricodon (9-mer) based models need too many data points to be practical
- People have used 7-mer or 8-mer based models; they could provide better predictions than 6-mer based models
There are
  4*4*4 = 64 codons
  4*4*4*4*4*4 = 4,096 dicodons
  4*4*4*4*4*4*4*4*4 = 262,144 tricodons
To make our statistics reliable, we would need at least ~15 occurrences of each X-mer; so for tricodon-based models, we need at least 15 * 262,144 = 3,932,160 coding bases in our training data, which is probably not going to be available for most genomes

Collecting data:
Where can we collect the data for estimating the initial dicodon frequencies?
- GenBank: http://www.ncbi.nlm.nih.gov/entrez
- An example: Shewanella oneidensis MR-1
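The prediction procedure described earlier (find the ORFs, then slide a 60 bp window in 10 bp steps and score the dicodons that are in frame with the ORF) can be sketched in Python as follows. This is a minimal illustration, not the course's actual gene finder: it assumes ATG is the only start codon, it takes the preference table `pref` (dicodon to P(X)) as given, it scores unseen dicodons as 0, and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_orfs(seq):
    """Yield (start, end) for each ORF in the three frames of one strand.

    An ORF runs from the first in-frame ATG (assumed start codon) to the
    next in-frame stop codon; `end` is exclusive and includes the stop.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                yield start, i + 3
                start = None

def six_frame_orfs(seq):
    """ORFs in all six frames; '-' coordinates refer to the reverse strand."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start, end in find_orfs(s):
            yield strand, start, end

def window_scores(orf, pref, window=60, step=10):
    """Return [(window center, score)] along an ORF.

    Inside each window, only dicodons that start at a multiple of 3 from
    the ORF start (the ORF's own frame) are scored.
    """
    orf = orf.upper()
    scores = []
    for s in range(0, max(len(orf) - window, 0) + 1, step):
        first = s + (-s) % 3               # first in-frame position >= s
        end = min(s + window, len(orf))
        total = sum(pref.get(orf[i:i + 6], 0.0)
                    for i in range(first, end - 5, 3))
        scores.append((s + window // 2, total))
    return scores

seq = "ATGGCTTACGCTTGA"                    # forward strand from the slides
print(list(six_frame_orfs(seq)))
```

Comparing the window scores against a threshold chosen on the training set, as in the "Making the call" slide, would then turn the score profile into coding or non-coding calls.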