gene finding 1

Presentation Transcript

BMB3600 - Bioinformatics : 

March 25 – gene finding 1
March 30 – gene finding 2
April 01 – prediction of binding motifs
April 06 – microarray data analysis
April 08 – sequence comparison
April 13 – protein function prediction 1
April 15 – protein function prediction 2
April 20 – protein structure prediction 1
April 22 – protein structure prediction 2
April 27 – take-home exam

Gene Finding I -- outline: 

- Problem definition
- Basic gene structures in eukaryotic versus prokaryotic genomes
- Codons and reading frames
- Codon frequencies in coding versus non-coding regions
- Basic idea of distinguishing coding from non-coding regions
- Computational methods for distinguishing coding from non-coding regions
- Collecting data for model building
- How to develop a simple gene finder

Gene finding: 

The human genome has ~3 billion base pairs and was estimated to contain about 35,000 protein-coding genes (more recent estimates are closer to 20,000).

ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag …………………………………

Where are the protein-encoding genes?

The basic idea of pattern recognition: 

How do kids learn to distinguish “dogs” from “cats”?
- They are “trained” by being told “A is a dog”, “B is a cat”, “C is another dog”, …
- They learn to “extract” common features (patterns) among the animals they were told are “dogs” and “cats”
- They then apply these extracted features to identify new dogs and cats

Pattern recognition is generally done by
- providing “training sets” whose members are individually labeled “positives” versus “negatives”, or “good” versus “bad”, etc.
- learning the general rules that separate the “positives” from the “negatives”, or the “good” from the “bad”, …
- applying the learned rules to new situations

Gene finding through learning: 

Learning “general rules” about finding genes

(The slide repeats the DNA sequence from the “Gene finding” slide.)

Over the years, numerous genes have been identified through experiments. Some DNA segments have also been verified by experiments to be non-genes.

Gene finding through learning: 

So we know which segments are genes and which are non-genes:

ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgt
gggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagag
gtcagtgactgatgatcgatgcatgcatg
gatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatg
ctagatcgtaggtagtagctagatgcagggataaacacacggaggc
gagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaa
…………………………………

Legend: genes, non-genes

Gene finding through learning: 

Is this a gene?

gcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag

Remember “dogs” and “cats” … but the “patterns” here are much more hidden and more complex than the distinguishing features between “dogs” and “cats”.

We need to study the basic structures of genes first!

Basic Gene Structures: 

Eukaryotic genes: exons, introns, translation starts and stops, splice (donor/acceptor) junctions

Basic Gene Structure: 

Prokaryotic genes: coding regions, non-coding regions, translation starts and stops

[Figure labels: gene, gene, gene, promoter, start, stop]

Prokaryotic genes are easier to identify than eukaryotic genes because of the simplicity of their gene structure and the density of genes in the genome.

Gene Structure -- codons: 

- A triplet of nucleotides is called a codon
- There are 64 codons (4 * 4 * 4 = 64): AAA, …, TTT
- Three codons (TAG, TGA, TAA) are called stop codons because they encode the termination signal of a gene
- Each of the other 61 codons codes for an amino acid
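The counting on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the names `ALL_CODONS` and `SENSE_CODONS` are made up for the example.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}

# All 4 * 4 * 4 = 64 nucleotide triplets, from AAA to TTT.
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Every codon that is not a stop codon codes for an amino acid.
SENSE_CODONS = [codon for codon in ALL_CODONS if codon not in STOP_CODONS]

print(len(ALL_CODONS), len(SENSE_CODONS))
```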

Gene Structure – reading frame: 

Reading (or translation) frame: each DNA segment has six possible reading frames

Forward strand: ATGGCTTACGCTTGA
Reading frame #0: ATG GCT TAC GCT TGA
Reading frame #1: TGG CTT ACG CTT GA.
Reading frame #2: GGC TTA CGC TTG A..

Reverse strand: TCAAGCGTAAGCCAT
Reading frame #0: TCA AGC GTA AGC CAT
Reading frame #1: CAA GCG TAA GCC AT.
Reading frame #2: AAG CGT AAG CCA T..
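As a small sketch of the six frames, the code below splits the slide's example sequence into triplets on both strands. The function names are hypothetical, and the reverse strand is taken to be the reverse complement of the forward strand, which matches the example above.

```python
def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

def split_codons(seq, offset):
    """Cut seq into triplets starting at offset; a short trailing piece is kept."""
    return [seq[i:i + 3] for i in range(offset, len(seq), 3)]

def six_frames(seq):
    """Map (strand, frame number) to the list of triplets read in that frame."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = split_codons(strand_seq, offset)
    return frames

for key, codons in six_frames("ATGGCTTACGCTTGA").items():
    print(key, " ".join(codons))
```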

Gene Structure – open reading frame (ORF): 

Open reading frame (ORF): a segment of DNA with an in-frame start codon at one end, an in-frame stop codon at the other end, and no in-frame stop codon in the middle

Each ORF has a fixed reading frame

How many genes can an ORF have inside it? Answer: one, because an ORF has only one stop
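A minimal ORF scanner following this definition might look like the sketch below. It makes several assumptions that are not on the slide: ATG is the only start codon, each ORF begins at the first ATG after the previous in-frame stop, only the given strand is scanned (run it on the reverse complement for the other three frames), and `find_orfs` and `min_length` are made-up names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: ATG is the only start codon considered

def find_orfs(seq, min_length=0):
    """Return (start, end, frame) tuples for ORFs on the given strand.

    start is the index of the start codon, end is one past the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3, frame))
                start = None                   # the stop codon closes the ORF
    return orfs

print(find_orfs("ATGGCTTACGCTTGA"))
```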

Gene Structure -- open reading frame (ORF): 

Generally true: all long (> 300 bp) ORFs in prokaryotic genomes encode genes

But this is not necessarily true for eukaryotic genomes

Coding region: a gene in prokaryotic genomes; an exon in eukaryotic genomes

Gene Structure: 

- Each coding region (exon or whole gene) has a fixed translation frame
- A coding region always sits inside an ORF of the same reading frame
- All exons of a gene are on the same strand
- Neighboring exons of a gene could have different reading frames

[Figure labels: frame 1, frame 2, frame 3]

Gene Structure – reading frame consistency: 

Now we are talking about slightly more “complex” features

Neighboring exons of a gene should be frame-consistent

exon 1: ATG GCT TGG GCT TTA A
(intron) --------------
exon 2: GT TTC CCG GAG AT
(intron) ------
exon 3: T GGG

exon1 [i, j] in frame a and exon2 [m, n] in frame b are consistent if

b = (m - j - 1 + a) mod 3

Reminder:
1 mod 3 = 1
2 mod 3 = 2
3 mod 3 = 0
4 mod 3 = 1
5 mod 3 = 2
......

Splicing joins the exons, so their reading frames must line up.
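The consistency rule can be written directly as code. The sketch below only restates the slide's formula; how frames a and b are numbered is taken as given, and the function names and example coordinates are hypothetical.

```python
def expected_frame(frame_a, j, m):
    """Frame that exon2 (starting at m) must have, given that exon1 ends at j
    in frame_a: b = (m - j - 1 + a) mod 3, where m - j - 1 is the intron length."""
    return (m - j - 1 + frame_a) % 3

def frames_consistent(exon1, frame_a, exon2, frame_b):
    """exon1 = (i, j) and exon2 = (m, n) are coordinate pairs on the same strand."""
    _, j = exon1
    m, _ = exon2
    return frame_b == expected_frame(frame_a, j, m)

# Hypothetical coordinates: the intron between positions 15 and 30 has 14 bases.
print(frames_consistent((0, 15), 0, (30, 40), 2))
```

An intron whose length is a multiple of 3 leaves the frame unchanged; any other length shifts it by 1 or 2.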

Codon Frequencies: 

Coding sequences are translated into protein sequences

We found the following: the dimer frequency in protein sequences is NOT evenly distributed
- The average frequency is 5%
- Some amino acids prefer to be next to each other
- Some other amino acids prefer not to be next to each other

[Figure: dimer frequencies in Shewanella]

Dicodon Frequencies: 

Believe it or not, the biased (uneven) dimer frequencies are the foundation of many gene finding programs!

Basic idea: if a dimer has a lower than average frequency, proteins prefer not to have that dimer in their sequences. Hence, if we see a dicodon encoding this dimer, we may want to bet against this dicodon being in a coding region!

Dicodon Frequencies: 

Hence, if we see many such dicodons in a DNA segment, we may want to bet that this region is a non-coding region!

This is the very basic idea of gene finding!

Dicodon Frequencies: 

Dicodon frequencies in coding versus non-coding regions are genome-dependent

[Figure panels: Shewanella, bovine]

Dicodon Frequencies: 

Relative frequencies of a dicodon in coding versus non-coding regions
- Frequency of dicodon X (e.g., AAAAAA) in coding regions: the total number of occurrences of X in coding regions divided by the total number of dicodon occurrences in coding regions
- Frequency of dicodon X (e.g., AAAAAA) in non-coding regions: the total number of occurrences of X in non-coding regions divided by the total number of dicodon occurrences in non-coding regions

In the human genome, the frequency of dicodon “AAA AAA” is ~1% in coding regions versus ~5% in non-coding regions

Question: if you see a region with many “AAA AAA”, would you guess it is a coding or a non-coding region?
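The frequency definition above can be sketched as a simple counting routine. The function name and the `step` parameter are assumptions made for illustration: `step=3` counts only hexamers that start on a codon boundary of each region, while `step=1` counts hexamers at every position.

```python
from collections import Counter

def dicodon_frequencies(regions, step=3):
    """Return {dicodon: occurrences of that dicodon / all dicodon occurrences}.

    regions is an iterable of DNA strings, e.g. all known coding regions
    (or all known non-coding regions) of a genome.
    """
    counts = Counter()
    for seq in regions:
        for i in range(0, len(seq) - 5, step):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {dicodon: n / total for dicodon, n in counts.items()} if total else {}

# Toy input, not real data.
print(dicodon_frequencies(["ATGTTGGAT"]))
```

Running it once on known coding regions and once on known non-coding regions gives the two frequency tables that the rest of the lecture compares.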

Basic idea of gene finding: 

- Most dicodons show a bias towards either coding or non-coding regions; only a small fraction of dicodons is neutral
- This bias is the foundation for coding region identification
- Dicodon frequencies are the key signal used for coding region detection; all gene finding programs use this information

Regions consisting of dicodons that mostly tend to be in coding regions are probably coding regions; otherwise they are probably non-coding regions

Basic idea of gene finding : 

In-frame (the correct frame) versus any-frame dicodons

Example: ATG TTG GAT GCC CAG AAG ............

In-frame dicodons: ATGTTG, GATGCC, CAGAAG
Not in-frame dicodons (shifted by 1): TGTTGG, ATGCCC, AGAAG.
Not in-frame dicodons (shifted by 2): GTTGGA, TGCCCA, GAAG..

Any-frame: the in-frame and not in-frame dicodons taken together

Basic idea of gene finding: 

In-frame dicodon frequencies provide a more sensitive measure than any-frame dicodon frequencies

Computational model for gene finding: 

YES, it is still simple …

Preference model:
- for each dicodon X (e.g., AAA AAA), calculate its frequencies in coding and non-coding regions, FC(X) and FN(X)
- calculate X’s preference value P(X) = log (FC(X)/FN(X))

Properties:
- P(X) is 0 if X has the same frequency in coding and non-coding regions
- P(X) is positive if X has a higher frequency in coding than in non-coding regions; the larger the difference, the more positive the score
- P(X) is negative if X has a higher frequency in non-coding than in coding regions; the larger the difference, the more negative the score
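As a sketch, the preference value is a one-line function. The base-10 logarithm is an assumption here (it is the base that reproduces the numbers in the worked example on the next slide), and the handling of zero frequencies is a choice made for the sketch, not something the slide specifies.

```python
import math

def preference(fc, fn):
    """P(X) = log10(FC(X) / FN(X)) for one dicodon X."""
    if fc == 0 and fn == 0:
        return 0.0            # never observed anywhere: treated as neutral
    if fc == 0:
        return -math.inf      # never seen in coding regions
    if fn == 0:
        return math.inf       # never seen in non-coding regions
    return math.log10(fc / fn)

print(round(preference(1.4, 5.2), 2))
```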

Computational model for gene finding: 

Example: coding preference of a region (an any-frame model)

AAA ATT, AAA GAC, and AAA TAG have the following frequencies:
FC(AAA ATT) = 1.4%, FN(AAA ATT) = 5.2%
FC(AAA GAC) = 1.9%, FN(AAA GAC) = 4.8%
FC(AAA TAG) = 0.0%, FN(AAA TAG) = 6.3%

Using base-10 logarithms, we have:
P(AAA ATT) = log (1.4/5.2) = -0.57
P(AAA GAC) = log (1.9/4.8) = -0.40
P(AAA TAG) = - infinity (stop codons are treated differently)

A region consisting of only these dicodons is probably a non-coding region

Calculate the preference scores of all dicodons of the region and sum them up. If the total score is positive, predict the region to be a coding region; otherwise a non-coding region.
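The sum-and-threshold rule can be sketched as follows, using the two finite preference values from the example. Treating dicodons that are missing from the table as neutral (score 0) is an assumption of this sketch, and the function names are made up.

```python
# Preference values taken from the example above.
PREFERENCE = {"AAAATT": -0.57, "AAAGAC": -0.40}

def region_score(seq, pref):
    """Any-frame model: sum the preference of every hexamer in the region.

    Dicodons missing from the table count as 0 (an assumption)."""
    return sum(pref.get(seq[i:i + 6], 0.0) for i in range(len(seq) - 5))

def predict(seq, pref):
    return "coding" if region_score(seq, pref) > 0 else "non-coding"

region = "AAAATTAAAGAC"
print(region_score(region, PREFERENCE), predict(region, PREFERENCE))
```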

Computational model for gene finding: 

OK, now you may want to run away …

In-frame preference model

Actually, the concept is still simple …

Computational model for gene finding: 

In-frame preference model (most commonly used in prediction programs)

Application step: for each possible reading frame of a region, calculate
- the total in-frame preference score, Σ P0(X),
- the total (in-frame + 1) preference score, Σ P1(X),
- the total (in-frame + 2) preference score, Σ P2(X),
and sum them up

If the score is positive, predict the region to be a coding region; otherwise non-coding
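One way to read this step is that three preference tables are kept, one per position relative to the codon boundary, and each hexamer is scored with the table matching its offset. The sketch below follows that reading; the table layout, the function name, and the toy values are all assumptions, not the lecture's actual data.

```python
def in_frame_score(seq, frame, tables):
    """Score seq under the hypothesis that codons start at index `frame`.

    tables = (P0, P1, P2): preference tables for hexamers that start
    0, 1, or 2 bases after a codon boundary. Unknown hexamers count as 0.
    """
    total = 0.0
    for i in range(len(seq) - 5):
        offset = (i - frame) % 3          # position relative to the codon boundary
        total += tables[offset].get(seq[i:i + 6], 0.0)
    return total

# Toy tables, not real data.
tables = ({"ATGTTG": 1.0}, {"TGTTGG": 0.5}, {"GTTGGA": 0.25})
scores = {frame: in_frame_score("ATGTTGGA", frame, tables) for frame in range(3)}
print(scores)
```

The frame with the highest total is the most plausible reading frame, and a positive total suggests a coding region.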

Computational Gene Finding: 

Prediction procedure for coding regions

Procedure:
- Calculate all ORFs of a DNA segment
- For each ORF, do the following:
  - slide through the ORF with an increment of 10 base pairs
  - calculate the preference score, in the same frame as the ORF, within a window of 60 base pairs, and assign the score to the center of the window

Example (forward strand in one particular frame)

[Figure: preference scores plotted along the sequence; axis ticks at -5, 0, +5]
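The sliding-window step can be sketched as below. The window (60 bp) and increment (10 bp) come from the slide; the function name, the single preference table, and the rule that only hexamers on the ORF's codon boundaries are scored are assumptions of this sketch.

```python
def window_scores(orf_seq, pref, window=60, step=10):
    """Return (window center, score) pairs along an ORF.

    orf_seq starts at the ORF's start codon, so codon boundaries are the
    indices divisible by 3. Unknown hexamers count as 0.
    """
    scores = []
    for start in range(0, len(orf_seq) - window + 1, step):
        end = start + window
        first = start + (-start) % 3      # first codon boundary inside the window
        score = sum(pref.get(orf_seq[i:i + 6], 0.0)
                    for i in range(first, end - 5, 3))
        scores.append((start + window // 2, score))
    return scores

# Toy ORF and toy preference table, not real data.
for center, score in window_scores("AAA" * 30, {"AAAAAA": 0.1}):
    print(center, round(score, 2))
```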

Computational Gene Finding: 

Making the call: coding or non-coding, and where the boundaries are
- Need a training set with known coding and non-coding regions
- Select threshold(s) to include as many known coding regions as possible and, at the same time, exclude as many known non-coding regions as possible

If threshold = 0.2, we will include 90% of coding regions and also 10% of non-coding regions
If threshold = 0.4, we will include 70% of coding regions and also 6% of non-coding regions
If threshold = 0.5, we will include 60% of coding regions and also 2% of non-coding regions

Where to draw the line?
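The trade-off can be tabulated for any labeled training set with a few lines of code. The scores below are invented toy values, not the data behind the percentages above, and `rates_at_threshold` is a hypothetical helper name.

```python
def rates_at_threshold(coding_scores, noncoding_scores, threshold):
    """Fraction of known coding and of known non-coding regions that
    score at or above the threshold (i.e. would be called coding)."""
    coding_kept = sum(s >= threshold for s in coding_scores) / len(coding_scores)
    noncoding_kept = sum(s >= threshold for s in noncoding_scores) / len(noncoding_scores)
    return coding_kept, noncoding_kept

# Toy training scores (assumed values for illustration only).
coding = [0.1, 0.3, 0.45, 0.6, 0.9]
noncoding = [-0.5, -0.2, 0.0, 0.1, 0.25]

for threshold in (0.2, 0.4, 0.5):
    print(threshold, rates_at_threshold(coding, noncoding, threshold))
```

Raising the threshold excludes more non-coding regions but also loses more true coding regions, which is exactly the choice the slide asks about.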

Computational Gene Finding: 

Why dicodon (6-mer)?
- Codon (3-mer) based models are not nearly as information-rich as dicodon-based models
- Tricodon (9-mer) based models need too many data points to be practical
- People have used 7-mer or 8-mer based models; they could provide better predictions than 6-mer based models

There are
4*4*4 = 64 codons
4*4*4*4*4*4 = 4,096 dicodons
4*4*4*4*4*4*4*4*4 = 262,144 tricodons

To make our statistics reliable, we would need at least ~15 occurrences of each X-mer; so for tricodon-based models, we need at least 15*262,144 = 3,932,160 coding bases in our training data, which is probably not going to be available for most genomes
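The arithmetic behind this argument is easy to reproduce. The sketch below uses the slide's rule of thumb of about 15 occurrences per k-mer; the function names are made up for the example.

```python
MIN_OCCURRENCES = 15  # the slide's rule of thumb for reliable statistics

def kmer_count(k):
    """Number of distinct DNA k-mers."""
    return 4 ** k

def min_training_occurrences(k, min_occurrences=MIN_OCCURRENCES):
    """Lower bound on k-mer occurrences needed in the training data."""
    return min_occurrences * kmer_count(k)

for name, k in (("codon", 3), ("dicodon", 6), ("tricodon", 9)):
    print(name, kmer_count(k), min_training_occurrences(k))
```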

Collecting data: 

Where can we collect the data for estimating the initial dicodon frequencies?

GenBank

An example: Shewanella oneidensis MR-1
