Use of BLAST in Computational Biological Studies
BINDUMADHAVI
Category: Education. License: All Rights Reserved. Added: September 08, 2010.

Presentation Transcript

Use and working of BLAST in Computational Biological Studies

Computational Biology??: It is an interdisciplinary field that applies the techniques of computer science, mathematics and statistics to solve biological problems. Its main focus is the development of computational and statistical data-analysis methods, and of mathematical modelling and computational simulation techniques (representing a real-world thing using a computer program).

It is related to: Bioinformatics, Molecular modelling, Computational genomics, Protein structure prediction, Computational biochemistry, Computational biomodelling, Mathematical biology.

It is mainly developed…: To reduce the time and money spent on predicting, by traditional methods, the function of uncharacterised proteins for which no experimentally derived structure is available. Structure and function are related: significant homology between two sequences implies structural similarity between them, and so the function of an uncharacterised protein can be inferred.
Comparative molecular modelling and virtual screening of chemical libraries by docking analysis (identifying molecules that are likely to bind to a protein target of interest) are promising fields that allow faster and cheaper identification of new drug targets.

Detection of homologous sequences: The first step in characterising any unknown protein is searching for its homologues in databases. Of the two common homologue-search tools, FASTA and BLAST, BLAST is the more widely used.

Basic Local Alignment Search Tool (BLAST): BLAST is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. By aligning a query with previously characterised homologous sequences, regions of sequence similarity can be found, which may provide useful clues about structure, function and evolution. It generates local alignments.

Why BLAST??: It is a heuristic method (it takes shortcuts), which makes it computationally much faster than FASTA and the Smith-Waterman algorithm. Though it is slightly less accurate, it is reported to be almost 50 times faster than the others. It has high sensitivity (the ability to find most members of the protein family represented by the query sequence). It works by local alignment; being a heuristic, it finds good alignments quickly but does not guarantee the optimal one.

Slide 9: This tool is offered by various databases, such as: National Center for Biotechnology Information (NCBI) database, European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI) database, Ensembl database, DNA Data Bank of Japan (DDBJ).

Types of BLAST?:
blastn: searches a nucleotide database using a nucleotide sequence as the query.
blastp: searches a protein database using a protein sequence as the query.
blastx: searches a protein database using a nucleotide query sequence translated in all reading frames.
tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
tblastn: screens a nucleotide sequence database, dynamically translated in all reading frames, with a protein query sequence.

How to work with BLAST?

Input seq:

Slide 15: DATABASES: nr: all non-redundant GenBank CDS translations + RefSeq Proteins + PDB + SwissProt + PIR + PRF. Other choices: refseq, swissprot, pat, pdb, month, env_nr.
ENTREZ QUERY: e.g. protease NOT hiv1[organism]; 1000:2000[slen]; Mus musculus[organism] AND biomol_mrna[properties]; etc.

Slide 17: Expect threshold: This setting specifies the statistical significance threshold for reporting matches against database sequences.
BLOSUM62: A general-purpose matrix and the default choice in BLAST 2.0. A BLOSUM matrix assigns a score to each aligned pair of residues, based on how frequently that substitution is observed among conserved blocks within related proteins. BLOSUM62 is among the best of the available matrices for detecting weak protein similarities. Other supported options include PAM30, PAM70, BLOSUM80, and BLOSUM45.

Slide 18: GAP COSTS: A higher gap penalty causes less favourable characters to be aligned in order to avoid creating gaps; a lower penalty has the opposite effect.
FILTERING: Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
MASKING (mask for lookup table only): BLAST searches consist of two phases: finding hits based upon a lookup table, and then extending them. This option masks only for the purpose of constructing the lookup table used by BLAST, so that no hits are found based upon low-complexity sequence or repeats (if the repeat filter is checked).
The BLAST extensions are performed without masking, so they can be extended through low-complexity sequence.

During this time the BLAST algorithm works like this:
1) Remove low-complexity regions or sequence repeats in the query sequence: These regions might give high scores that keep the program from finding the actually significant sequences in the database, so they should be filtered out. SEG is used for protein sequences, DUST for DNA sequences, and XNU to filter tandem repeats in protein sequences.
2) Make a k-letter word list of the query sequence.
Slide 21:
3) List the possible matching words: This step is one of the main differences between BLAST and FASTA. FASTA considers all of the words common to the database and query sequences, but BLAST considers only the high-scoring words.
4) Organize the remaining high-scoring words into an efficient search tree.
5) Repeat steps 3 and 4 for each k-letter word in the query sequence.
6) Scan the database sequences for exact matches with the remaining high-scoring words.
7) Extend the exact matches to high-scoring segment pairs (HSPs).
Slide 22: BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease.
Slide 23:
8) List all of the HSPs in the database whose score is high enough to be considered: We list the HSPs whose scores are greater than the empirically determined cutoff score S.
9) Evaluate the significance of the HSP score.
10) Combine two or more HSP regions into a longer alignment: Sometimes we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relation between the query and the database sequence.
Slide 25:
11) Report the matches whose expect score is lower than a threshold parameter E.
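The word-list, scan, extend and report steps above can be sketched in a few lines of Python. This is only a toy illustration, not the actual BLAST implementation: the function names, the DNA alphabet, the +1/-1 scoring, the dictionary used in place of a search tree, the `x_drop` tolerance that decides when an extension stops, and the `K` and `lam` values in the expect-score formula E = K·m·n·e^(−λS) are all assumptions made for the example.

```python
import math
from itertools import product

ALPHABET = "ACGT"  # toy DNA alphabet (assumption)

def score(a, b):
    """Toy scoring (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def neighborhood(word, threshold):
    """Step 3: all same-length words scoring >= threshold against `word`."""
    return ["".join(c) for c in product(ALPHABET, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, c)) >= threshold]

def build_lookup(query, k, threshold):
    """Steps 2-5: map each high-scoring word to the query offsets it came from.
    A dict stands in for the search tree here."""
    table = {}
    for qi in range(len(query) - k + 1):
        for w in neighborhood(query[qi:qi + k], threshold):
            table.setdefault(w, []).append(qi)
    return table

def _walk(query, subject, qi, si, step, start, x_drop):
    """Extend in one direction; return (best score, number of residues kept)."""
    best = cur = start
    kept = n = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        cur += score(query[qi], subject[si])
        n += 1
        if cur > best:
            best, kept = cur, n
        elif best - cur >= x_drop:  # score has fallen too far: stop
            break
        qi += step
        si += step
    return best, kept

def extend(query, subject, qi, si, k, x_drop=2):
    """Step 7: ungapped extension of a seed to the right, then to the left."""
    seed = sum(score(query[qi + j], subject[si + j]) for j in range(k))
    best, right = _walk(query, subject, qi + k, si + k, +1, seed, x_drop)
    best, left = _walk(query, subject, qi - 1, si - 1, -1, best, x_drop)
    return best, qi - left, si - left, left + k + right

def search(query, subject, k=3, threshold=3, x_drop=2):
    """Steps 6-8: scan the subject for lookup-table hits and extend each one.
    Returns (score, query start, subject start, length), best score first."""
    table = build_lookup(query, k, threshold)
    hsps = set()
    for si in range(len(subject) - k + 1):
        for qi in table.get(subject[si:si + k], []):
            hsps.add(extend(query, subject, qi, si, k, x_drop))
    return sorted(hsps, reverse=True)

def e_value(s, m, n, K=0.1, lam=1.0):
    """Step 9: E = K*m*n*exp(-lam*s); K and lam are placeholder values."""
    return K * m * n * math.exp(-lam * s)

def report(hsps, m, n, max_e):
    """Step 11: keep only HSPs whose expect score is at most max_e."""
    return [h for h in hsps if e_value(h[0], m, n) <= max_e]

if __name__ == "__main__":
    q, s = "ACGTACGT", "TTACGTACGTTT"
    for sc, qs, ss, length in report(search(q, s), len(q), len(s), max_e=10.0):
        print(sc, q[qs:qs + length], s[ss:ss + length], e_value(sc, len(q), len(s)))
```

With `threshold` equal to the word length, only exact copies of the query words act as seeds; lowering it admits near-matching words, which is the idea behind step 3. Real searches use a substitution matrix such as BLOSUM62 for proteins, and statistical parameters fitted to the scoring system rather than the placeholder values used here.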
OUTPUT DESCRIPTION: The typical threshold for a good E-value from a BLAST search is 1e-5 (10^-5) or lower. The reason for such low values is that a per-comparison chance probability of 0.001 in a million-entry database would still leave about 1000 entries due to chance, whereas a probability of 1e-6 would leave only about one. The E-value itself is the number of chance hits expected over the whole search: $E = K m n\, e^{-\lambda S}$. The parameters K and λ represent natural scales for the search space and the scoring system respectively. The rest of the equation represents the size of the query (m), the size of the database (n), and the score S.

Slide 29:
MAXIMUM SCORE: The highest alignment score of a set of aligned segments from the same subject (database) sequence. The score is calculated, independently for each segment, from the sum of the match rewards and the mismatch, gap-open and gap-extend penalties. This normally gives the same sorting order as the E-value.
TOTAL SCORE: The sum of the alignment scores of all segments from the same subject sequence.
QUERY COVERAGE: The percentage of the query length that is included in the aligned segments.
LINKS: Provide direct access to other resources.

BLAST is used for:
Searching for homologues of an unknown sequence.
Which species have a protein that is related in lineage to a certain protein with a known amino-acid sequence?
Where does a certain sequence of DNA originate?
What other genes encode proteins that exhibit structures or motifs such as ones that have just been determined?
BLAST is also often used as part of other algorithms that require approximate sequence matching.

Slide 31: THE END. THANK YOU.