Bioinformatics (ifasnet)
Category: Science & Tech. License: Some Rights Reserved. Added: October 29, 2011.
Presentation Description: Introduction to bioinformatics and its tools

Presentation Transcript

Bioinformatics Advances, Tools, Web resources:
Dr. Kailash Choudhary, Assistant Professor, Lachoo Memorial College, Jodhpur

Science then, then and now:
In the beginning, there was thought and observation.

Science then, then and now:
For a long time this didn’t change. Man thought it would be enough to reason about the existing knowledge to explore everything there is to know. Back then, one single person could possess all knowledge in his cultural context - PANDA, Purohit

The achievements are still admirable …:
…as we can see

Science then, then and now:
A vast amount of information

Science then, then and now:
No single person or group is sure what is known. Known, but not known = not known

Initial Challenges in biology:
What makes us ill or unwell? Disease identification, disease-inducing agents
What keeps us healthy and makes us live longer? Drug discovery
Where do we all come from and what are we made of?
Genetics and beyond…

… and their implications:
Understand biological structures of increasing complexity:
Genes (Genomics): 1980s
Proteins (Proteomics): 1990s
Complex Carbohydrates (Glycomics): 2000s
Understand biological processes and the roles structures play in them (biosynthesis and biological processes)

The “old” biology:
The most challenging task for a scientist is to get good data

The “new” biology:
The most challenging task for a scientist is to make sense of lots of data

Data is being collected faster and in greater amounts

Bioinformatics

What is bioinformatics?

To me, bioinformatics is…:
Deduction of knowledge by computer analysis of biological data.
or: see 20000 pages on this issue on the WWW
“The field of science in which biology, computer science, and information technology merge to form a single discipline”

What is Bioinformatics?

What can BioInformatics do?:
Analyze genetic and molecular sequences
Look for patterns, similarities, matches
Identify structures
Store derived information
Large databases of genetic information

The data:
information stored in the genetic code (DNA)
protein sequences
3D structures
experimental results from various sources
patient statistics
scientific literature

Algorithmic developments:
Important part of research in bioinformatics: methods for
data storage
data retrieval
data analysis

What makes us human?:
CHIMP GENOME
Chimpanzees are similar to humans in so many ways: they are socially complex, sensitive and communicative, and yet indisputably on the animal side of the man/beast divide.
Scientists have now sequenced the genetic code of our closest living relative, showing the striking concordances and divergences between the two species, and perhaps holding up a mirror to our own humanity.

Slide 21:
Perhaps not surprising!!! Comparison between the full drafts of the human and chimp genomes revealed that they differ by only 1.23%.
How human are chimps?

Complete Genomes:
Year   Complete genomes
1994   0
1995   1
2004   234
2005   303 (eukaryotes 24, bacteria 240, archaea 39)

Paradigm shift over time:

Challenges: The Data Flood:
52 bacterial genomes completed and published in a year: ~100,000 genes
228 genomes ongoing: ~450,000 more genes when finished

The Post-Genomic Iceberg:
The Undiscovered Phenotype
The Undiscovered Genotype
Most genes are of unknown function
Undiscovered genomic diversity
Discovered Biology
Undiscovered Biology

What is Data mining?:
Santa’s rules: Blue or Circle
Banta’s rules: All the rest
Whose block is this?
Santa’s blocks
Banta’s blocks

What is Data mining?:
Question: Can you explain how?

The “post-genomics” era:
What’s next after collecting data?
Goal: to understand the functional networks of a living cell
Annotation
Comparative genomics
Structural genomics
Functional genomics

Gene Prediction: Computational Challenge:
Gene: A sequence of nucleotides coding for protein
Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

Gene Prediction: Computational Challenge:
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctat
gctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaa
gctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg
The same sequence is shown on the next two slides; the last one is labelled: Gene!

Two Approaches to Gene Prediction:
Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns).
Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.

Slide 34:
Similarity-Based Approach: Metaphor in Different Languages
If you could compare the day’s news in English, side-by-side to the same news in a foreign language, some similarities may become apparent

Slide 35:
Comparative genomics
Whole Genome Comparison
Concluding on regulatory networks

Chimps and Us

Slide 37:
Comparative genomics
Comparing ORFs
Identifying orthologs
Concluding on structure and function
Whole Genome Comparison
Concluding on regulatory networks

Slide 38:
Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.
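As a rough illustration of the gene prediction problem stated earlier (determine the beginning and end positions of genes), the sketch below scans the forward strand for open reading frames: an ATG start followed, in the same reading frame, by a stop codon. This is a deliberately naive stand-in and not the statistical or similarity-based methods named in the slides. The function name `find_orfs` and the `min_codons` threshold are assumptions made up for this example.

```python
# Naive open-reading-frame (ORF) scan: a toy stand-in for gene prediction.
# Assumptions: forward strand only, standard stop codons, no introns.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of forward-strand ORFs.

    start is the 0-based index of the ATG; end is the index just past
    the stop codon, so dna[start:end] is the whole ORF.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "ccATGaaatttTAAgg"
for begin, end in find_orfs(example):
    print(begin, end, example[begin:end])
```

Real gene finders also have to handle the reverse strand, introns and splice sites, and sequence statistics, which is why the slides call this a computational challenge.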
Conservation of IGFALS (insulin-like growth factor binding protein, acid-labile subunit) between human and mouse.

Slide 39:
Functional genomics
Genome-wide profiling of:
• mRNA levels
• Protein levels
Co-expression of genes and/or proteins

Slide 40:
Understanding the function of genes and other parts of the genome

Slide 41:
Functional genomics
Genome-wide profiling of:
• mRNA levels
• Protein levels
Co-expression of genes and/or proteins
Identifying protein-protein interaction
Networks of interactions

Slide 42:
A large network of 8184 interactions among 4140 S. cerevisiae proteins
A network of interactions can be built for all proteins in an organism

Slide 43:
Structural genomics
Assign structure to all proteins encoded in a genome

Protein Structure

Resources and Databases:
The different types of data are collected in databases:
Sequence databases
Structural databases
Databases of Experimental Results
All databases are connected

Database Types:
Sequence databases
  General: GenBank, EMBL; PIR, Swissprot; Genomes
  Special: TF binding sites; Promoters
Structure databases
  General: PDB
  Special: Specific protein families; folds
Databases of experimental results
  Co-expressed genes, prot-prot interaction, etc.

Gene database:
Give insight into gene functionality
Alternative splicing of genes
Alternative pattern of exons included to create gene product
EST

Genome Databases:
Data organized by species
Clones assembled into contiguous pieces (‘contigs’) or whole chromosomes
Information on non-coding regions
Relativity

Genome Browsers:
Annotation adds value to sequence
Easy “walk” through the genome
Comparative genomics

Genome Browsers:
Ensembl Genome Browser: http://www.ensembl.org
UCSC Genome Browser: http://genome.ucsc.edu/
WormBase: http://www.wormbase.org/
AceDB: http://www.acedb.org/
Comprehensive Microbial Resource: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
FlyBase:
http://flybase.bio.indiana.edu/

Slide 51:
beta globin

SNP database:
Single Nucleotide Polymorphisms (SNPs)
Single base difference at a single position between two different individuals of the same species
Play an important role in differentiation and disease

Sickle Cell Anemia:
Due to a single base change, swapping an A for a T, which causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin

Healthy Individual:
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACC ATG GTGCATCTGACTCCTGA
G G A G AAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTP E EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

Diseased Individual:
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACC ATG GTGCATCTGACTCCTGA
G G T G AAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTP V EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

Structure Databases:
3-dimensional structures of proteins, nucleic acids, molecular complexes etc.
3-D data is available due to techniques such as NMR and X-ray crystallography

Databases of Experimental Results:
Data such as experimental microarray images, expression data
Clustering information
Metabolic pathways, protein-protein interaction data

PubMed:
Literature Databases
MEDLINE publication database
Over 17,000 journals
15 million citations since 1950
Service of the National Library of Medicine
http://www.ncbi.nlm.nih.gov/PubMed

Slide 63:
GENOMIC DATA: GenBank, DDBJ, EMBL
ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR
PROTEIN: PIR, SWISS-PROT
STRUCTURE: PDB, MMDB, SCOP
LITERATURE: PubMed
PATHWAY: KEGG, COG
DISEASE: LocusLink, OMIM, OMIA
GENES: RefSeq, AllGenes, GDB
SNPs: dbSNP
ESTs: dbEST, UniGene
MOTIFS: BLOCKS, Pfam, Prosite
GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress

Entrez – NCBI Engine:
Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
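Entrez can also be queried from a program rather than a browser. The sketch below is a minimal example, assuming NCBI's public E-utilities `esearch` endpoint (which the slides do not mention) and using only the Python standard library. The helper name `esearch_url` is made up for this example, and the code only builds the request URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities "esearch" (not named in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an esearch request URL for one Entrez database.

    db     -- Entrez database name, e.g. "pubmed" or "protein"
    term   -- free-text query
    retmax -- maximum number of record IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and special characters in the query.
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = esearch_url("pubmed", "sickle cell anemia")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) returns the matching record IDs; that step is left out here so the sketch runs without network access.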
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar

Entrez – NCBI Engine

Present: User queries multiple sources:
Heterogeneous data sources on the web ?

Future: the Web-Service queries multiple sources

Bioinformatics Tools

Bioinformatics Tools:
Internet, Google: a wide array of tools, mostly free and open source, now exists for use
Do not reinvent the wheel!
But do be wary: Bioinformatician != Programmer
Most common tools are written in C, C++, Java or PERL; others in Fortran, Python

Sources for Tools:
http://sourceforge.net (51 projects)
http://bioinformatics.org (156 projects)
ftp://ftp.ncbi.nlm.nih.gov (NCBI)
http://www.ebi.ac.uk/Tools/ (EMBL)
http://www.blueprint.org (Blueprint Initiative)
http://www.geocities.com/bioinformaticsweb/toollink.html (Suresh’s Links)
http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm (Dutch course page)

Two Types of Tools:
Toolkits/APIs (programming aids): NCBI toolkits, Seqhound, BioPERL, BioJava, BioPHP…
End-User applications: BLAST, PsiPred, SMART, Rasmol…

NCBI Toolkit – what is it? Programming…:
The NCBI toolkit is based on 3 key libraries of code:
A core library of platform independent C standard functions and extensions. All NCBI toolkit code has this core dependence. A C++ toolkit is under development as well
A complete ASN.1 specification, I/O and code generation system for platform independent use of binary data. Recently XML support has been added as well
A graphical user interface abstraction which maps to Mac, Windows and X-Windows GUIs – FLTK in the C++ toolkit

SeqHound

Slide 77:
What is SeqHound?
Biological data is very heterogeneous, comes from various locations, in many different formats
e.g.
a biological sequence database may exist in ASN.1 format, a microarray expression database may exist in a flat-file format, and both might reference an organism but by different names
It is much more convenient to have access to the information in one centralized ‘data warehouse’ or ‘data mart’, represented in a single, cohesive data format
This is Seqhound’s mission

Slide 78:
What is SeqHound?
Diagram labels: Sequences, Sequence annotation, Literature, Interactions, Structure, SeqHound, Access Methods, User

BioPERL:
The Bioperl project – www.bioperl.org
Comprehensive, well documented set of Perl modules
Last stable release 1.4
Open Source (Artistic License) project that has recruited developers from all over the world
Modules available for alignments (BLAST, Clustal), sequence retrieval, annotations, sequence manipulation, gene prediction output, sequence databases, etc…
Stajich et al., The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct; 12(10): 1611-8. PMID: 12368254
Use with caution: things change fast

BioPython:
The Biopython project – biopython.org
“The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology”
Source code available under liberal terms
Part of Open Bioinformatics Foundation http://open-bio.org/
Discussion mailing list: biopython@biopython.org

Bio-*:
Open Bioinformatics Foundation Projects
BioRuby – Ruby language toolkit
BioPipe – workflow framework for BioPERL
BioMoby – Interoperability of biological data. Provides WSDL service descriptions to client through a MOBY central repository, which in turn tracks available data hosts and services
BioDAS – Distributed Annotation Server. Uses DAS/1 protocol to talk to various annotation servers and provide a single view to client, e.g. WormBase, FlyBase, Ensembl, TIGR

Bio-Extinct?:
Bio-XML – RIP?
Bio-PHP – RIP?
Bio-CORBA – RIP?
Be wary of using the ‘latest and greatest’
R.I.P.

ClustalW:
Possibly the most-used multiple sequence alignment program
Source code is available, can be linked directly to your code with a bit of work
Very fast, works with DNA and protein, very flexible: choice of penalties and scoring matrices

ClustalW:
Performs pairwise global alignments between all sequence pairs
Builds a guide tree
Builds multiple sequence alignment using guide tree, starting with most similar pair

ClustalW:
FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANF 60
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
             *:..* .:*:: .***** **:.:*   * *..***.*  :.. :*: *:.*.  ...*
Image adapted from Clustal documentation

Slide 86:
Python is an interpreted, interactive programming language created by Guido van Rossum around 1990. Python is fully dynamically typed and uses automatic memory management; it is thus similar to Perl, Ruby, Scheme, Smalltalk, and Tcl. Python is developed as an open source project, managed by the non-profit Python Software Foundation. Python 2.4.2 was released on September 28, 2005.
Biopython (http://biopython.org/) and biojava (www.biojava.org): Biopython and biojava are open source projects with very similar goals to bioperl
MATLAB Bioinformatics Toolbox: http://www.mathworks.com/products/bioinfo/

Slide 87:
R-language for Statistical Computing (http://www.r-project.org/): R is a free software environment for statistical computing and graphics.
It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
Bioconductor: Bioinformatics with R (http://www.bioconductor.org/)
The broad goals of Bioconductor are to:
provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data;
facilitate the integration of biological metadata in the analysis of experimental data, e.g. literature data from PubMed, annotation data from LocusLink;
allow the rapid development of extensible, scalable, and interoperable software;
promote high-quality and reproducible research;
provide training in computational and statistical methods for the analysis of genomic data.
Microarray Software Comparison (http://ihome.cuhk.edu.hk/~b400559/arraysoft_rpackages.html)
A website organizing and commenting on links to R software for gene expression data analysis, including software not available from Bioconductor or CRAN.

MATLAB:
The name MATLAB stands for matrix laboratory
A high-performance language for technical computing
Used extensively in industry and universities
An interactive system whose basic data element is an array that does not require dimensioning.
This allows you to solve many technical computing problems, especially those with matrix and vector formulations
Features a family of add-on application-specific solutions called toolboxes
Toolboxes (e.g., bioinformatics) are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems

Slide 89:
MATLAB Desktop

Homology Searches with BLAST:
BLASTN: nucleotide query vs nucleotide database
BLASTP: protein query vs protein database
BLASTX: automatic 6-frame translation of nucleotide query vs protein database
TBLASTN: protein query vs automatic 6-frame translation of nucleotide database
TBLASTX: automatic 6-frame translation of nucleotide query vs automatic 6-frame translation of nucleotide database

PSI-BLAST (Position-Specific Iterated BLAST):
combines statistically significant alignments produced by BLAST into a position-specific score matrix
searches the database using this matrix
allows multiple iterations of this process
runs at approximately the same speed per iteration as gapped BLAST
is much more sensitive to weak but biologically relevant sequence similarities

Five websites that all biologists should know:
NCBI (The National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/
EBI (The European Bioinformatics Institute): http://www.ebi.ac.uk/
The Canadian Bioinformatics Resource: http://www.cbr.nrc.ca/
SwissProt/ExPASy (Swiss Bioinformatics Resource): http://expasy.cbr.nrc.ca/sprot/
PDB (The Protein Databank): http://www.rcsb.org/PDB/

NCBI (http://www.ncbi.nlm.nih.gov/):
Entrez interface to databases: Medline/OMIM, Genbank/Genpept/Structures
BLAST server(s): five-plus flavors of BLAST
Draft Human Genome
Much, much more…

EBI (http://www.ebi.ac.uk/):
SRS database interface: EMBL, SwissProt, and many more
Many server-based tools: ClustalW, DALI, …

SwissProt (http://expasy.cbr.nrc.ca/sprot/):
Curation!!! Error rate in the information is greatly reduced in comparison to most other databases.
Extensive cross-linking to other data sources
SwissProt is the ‘gold-standard’ by which other databases can be measured, and is the best place to start if you have a specific protein to investigate

A few more resources to be aware of:
Human Genome Working Draft: http://genome.ucsc.edu/
TIGR (The Institute for Genomic Research): http://www.tigr.org/
Celera: http://www.celera.com/
(Model) Organism specific information:
Yeast: http://genome-www.stanford.edu/Saccharomyces/
Arabidopsis: http://www.tair.org/
Mouse: http://www.jax.org/
Fruitfly: http://www.fruitfly.org/
Nematode: http://www.wormbase.org/
Nucleic Acids Research Database Issue: http://nar.oupjournals.org/ (First issue every year)

UNEXPLORED AREAS…

The Computational Biology Challenge:
"In principle, the string of genetic bits holds long-sought secrets of human development, physiology and medicine. In practice, our ability to transform such information into understanding remains woefully inadequate."
International Human Genome Sequencing Consortium, “Initial sequencing and analysis of the human genome,” Nature 409: 860-921 (2001) [Emphasis added]

Predict Epitopes, Find Vaccine Targets:
Vaccines are often the only solution for viral diseases
Finding and developing effective vaccine targets (epitopes) is a slow and expensive process

Recognize Functional Sites, Help Scientists:
Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
Data mining of biological sequences to find rules for recognizing and understanding functional sites
Dragon’s 10x reduction of TSS recognition false positives

Diagnose Leukaemia, Benefit Children:
Childhood leukaemia is a heterogeneous disease
Treatment is based on subtype
3 different tests and 4 different experts are needed for diagnosis
Curable in USA, fatal in Indonesia

Understand Proteins, Fight Diseases:
Understanding the function and role of a protein needs organised information on interaction pathways
Such information is often reported in scientific papers but is seldom found in structured databases
Knowledge extraction system to process free text: extract protein names, extract interactions
Jak1

Conclusions:
We have only touched small parts of the elephant
Trial and error (intelligently) is often your best tool

Slide 104:
THANKS…
Dr. Kailash Choudhary, LMC AND IFAS, JODHPUR
Bioinformatics ifasnet Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 158 Category: Science & Tech.. License: Some Rights Reserved Like it (2) Dislike it (0) Added: October 29, 2011 This Presentation is Public Favorites: 0 Presentation Description Introduction to bioinformatics and its tools Comments Posting comment... Premium member Presentation Transcript Bioinformatics Advances, Tools, Web resources: Bioinformatics Advances, Tools, Web resources Dr. Kailash Choudhary Assistant Professor Lachoo Memorial College, JodhpurScience then, then and now: Science then , then and now In the beginning, there was thought and observation.Science then, then and now: Science then , then and now For a long time this didn’t change. Man thought it would be enough to reason about the existing knowledge to explore everything there is to know. Back then, one single person could possess all knowledge in his cultural context-PANDA, PurohitThe achievements are still admirable …: The achievements are still admirable … …as we can seeScience then, then and now: Science then, then and now A vast amount of informationScience then, then and now: Science then, then and now No single person or group is sure what is known . Known, But not known = not knownInitial Challenges in biology: Initial Challenges in biology What makes us ill or unwell? Disease identification, disease inducing agents What keeps us healthy and makes us live longer? Drug discovery Where do we all come from and what are we made of? 
Genetics and beyond… and their implications: … and their implications Understand biological structures of increasing complexity: Genes ( Genomics ): 1980s Proteins ( Proteomics ): 1990s Complex Carbohydrates ( Glycomics ): 2000s Understand biological processes and the roles structures play in them (biosynthesis and biological processes)The “old” biology: The “old” biology The most challenging task for a scientist is to get good dataThe “new” biology: The “new” biology The most challenging task for a scientist is to make sense of lots of dataData is being collected faster and in greater amounts: Data is being collected faster and in greater amountsBio informatics: Bio informaticsWhat is bioinformatics?: What is bioinformatics?To me the bioinformatics is…: To me the bioinformatics is… Deduction of knowledge by computer analysis of biological data. or: see 20000 pages on this issue on the WWW “The field of science in which biology, computer science, and information technology merge to form a single discipline”What is Bioinformatics?: What is Bioinformatics?What can BioInformatics do?: What can BioInformatics do? Analyze genetic and molecular sequences Look for patterns, similarities, matches Identify structures Store derived information Large databases of genetic informationThe data: The data information stored in the genetic code (DNA) protein sequences 3D structures experimental results from various sources patient statistics scientific literatureAlgorithmic developments: Algorithmic developments Important part of research in bioinformatics: methods for data storage data retrieval data analysisWhat makes us human?: What makes us human? CHIMP GENOME Chimpanzees are similar to humans in so many ways: they are socially complex, sensitive and communicative, and yet indisputably on the animal side of the man/beast divide. 
Scientists have now sequenced the genetic code of our closest living relative, showing the striking concordances and divergences between the two species, and perhaps holding up a mirror to our own humanity.Slide 21: Perhaps not surprising!!! Comparison between the full drafts of the human and chimp genomes revealed that they differ only by 1.23% How humans are chimps?Complete Genomes: • 1994 0 • 1995 1 • 2004 234 2005 303 eukaryotes 24 bacteria 240 archaea 39 Complete Genomes Paradigm shift over time: : Paradigm shift over time:Challenges: The Data Flood: Challenges: The Data Flood 52 bacterial genomes completed and published in a year ~100,000 genes 228 genomes ongoing ~450,000 more genes when finishedThe Post-Genomic Iceberg: The Post-Genomic Iceberg The Undiscovered Phenotype The Undiscovered Genotype Most genes are of unknown function Undiscovered genomic diversity Discovered Biology Undiscovered BiologyWhat is Data mining?: Santa’s rules : Blue or Circle Banta’s rules : All the rest What is Data mining? Whose block is this? Santa’s blocks Banta’s blocksWhat is Data mining?: What is Data mining? 
Question: Can you explain how?
The “post-genomics” era: What’s next after collecting data? Goal: to understand the functional networks of a living cell. Annotation; comparative genomics; structural genomics; functional genomics.
Gene Prediction: Computational Challenge: Gene: a sequence of nucleotides coding for a protein. Gene prediction problem: determine the beginning and end positions of genes in a genome.
Gene Prediction: Computational Challenge:
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg
Gene!
Two Approaches to Gene Prediction: Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns). Similarity-based: many human genes are similar to genes in mice, chickens, or even bacteria, so already known mouse, chicken, and bacterial genes may help to find human genes.
Slide 34: Similarity-Based Approach: Metaphor in Different Languages. If you could compare the day’s news in English side by side with the same news in a foreign language, some similarities might become apparent.
Slide 35: Comparative genomics: whole genome comparison; drawing conclusions about regulatory networks.
Chimps and Us
Slide 37: Comparative genomics: comparing ORFs; identifying orthologs; drawing conclusions about structure and function; whole genome comparison; drawing conclusions about regulatory networks.
Slide 38: Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.
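A rough way to picture the gene prediction problem posed on the earlier slides (finding where a gene begins and ends) is a naive open reading frame (ORF) scan: look for a start codon, then read three bases at a time until a stop codon appears. The sketch below is only an illustration under strong assumptions: it reads the forward strand only, ignores introns, and uses none of the statistical or similarity evidence that real gene finders rely on. The function name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
# Toy ORF scan: an illustration of the gene prediction problem,
# not a real gene finder (forward strand only, no introns, no statistics).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of forward-strand ORFs.

    start is the 0-based index of the A in ATG; end is one past the
    last base of the stop codon, so dna[start:end] is the whole ORF.
    min_codons counts every codon from ATG through the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                       # open a candidate gene
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3)) # close it at the stop codon
                start = None
    return sorted(orfs)


print(find_orfs("ccATGaaatttTAAgg"))  # → [(2, 14)]
```

Here the reported span covers ATG AAA TTT TAA. On a real genome a scan like this returns many spurious candidates, which is why the statistical and similarity-based approaches are needed.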
Conservation of the IGFALS (insulin-like growth factor binding protein, acid labile subunit) gene between human and mouse.
Slide 39: Functional genomics: genome-wide profiling of mRNA levels and protein levels; co-expression of genes and/or proteins.
Slide 40: Understanding the function of genes and other parts of the genome.
Slide 41: Functional genomics: genome-wide profiling of mRNA levels and protein levels; co-expression of genes and/or proteins; identifying protein-protein interactions; networks of interactions.
Slide 42: A large network of 8184 interactions among 4140 S. cerevisiae proteins. A network of interactions can be built for all proteins in an organism.
Slide 43: Structural genomics: assign a structure to all proteins encoded in a genome.
Protein Structure
Resources and Databases: The different types of data are collected in databases: sequence databases, structural databases, and databases of experimental results. All databases are connected.
Database Types:
  Sequence databases. General: GenBank, EMBL, PIR, Swiss-Prot, genomes. Special: TF binding sites, promoters.
  Structure databases. General: PDB. Special: specific protein families, folds.
  Databases of experimental results: co-expressed genes, protein-protein interactions, etc.
Gene database: Gives information about gene function. Alternative splicing of genes: alternative patterns of exons included to create the gene product. EST.
Genome Databases: Data organized by species; clones assembled into contiguous pieces (‘contigs’) or whole chromosomes; information on non-coding regions; relativity.
Genome Browsers: Annotation adds value to sequence; easy “walk” through the genome; comparative genomics.
Genome Browsers: Ensembl Genome Browser: http://www.ensembl.org ; UCSC Genome Browser: http://genome.ucsc.edu/ ; WormBase: http://www.wormbase.org/ ; AceDB: http://www.acedb.org/ ; Comprehensive Microbial Resource: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl ; FlyBase: http://flybase.bio.indiana.edu/
Slide 51: beta globin
SNP database: Single Nucleotide Polymorphisms (SNPs): a single base difference at a single position between two different individuals of the same species. They play an important role in differentiation and disease.
Sickle Cell Anemia: Caused by a single base change, an A swapped for a T, so that valine is incorporated instead of glutamic acid in hemoglobin (the codon GAG becomes GTG).
Healthy Individual:
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
Diseased Individual (the first GAG after CCT in the coding sequence reads GTG, matching the E to V change in the protein below):
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGT
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
Structure Databases: 3-dimensional structures of proteins, nucleic acids, molecular complexes, etc. 3-D data is available thanks to techniques such as NMR and X-ray crystallography.
Databases of Experimental Results: Data such as experimental microarray images (expression data); clustering information; metabolic pathways; protein-protein interaction data.
Literature Databases: PubMed: MEDLINE publication database; over 17,000 journals; 15 million citations since 1950; a service of the National Library of Medicine. http://www.ncbi.nlm.nih.gov/PubMed
Slide 63: Genomic data: GenBank, DDBJ, EMBL. Assembled genomes: GoldenPath, WormBase, TIGR. Protein: PIR, SWISS-PROT. Structure: PDB, MMDB, SCOP. Literature: PubMed. Pathway: KEGG, COG. Disease: LocusLink, OMIM, OMIA. Genes: RefSeq, AllGenes, GDB. SNPs: dbSNP. ESTs: dbEST, UniGene. Motifs: BLOCKS, Pfam, Prosite. Gene expression: Stanford MGDB, NetAffx, ArrayExpress.
Entrez – NCBI Engine: Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
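Besides the web pages, NCBI also exposes the Entrez databases through a programmatic interface, the E-utilities, which the slides do not cover. As a minimal sketch, assuming the public `esearch.fcgi` endpoint and its `db`, `term`, `retmax` and `retmode` parameters, the snippet below only builds the request URLs for one search term across several Entrez databases; it does not contact NCBI. The helper name `build_esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption of this sketch;
# the slides themselves only show the Entrez web interface).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# One text query, several databases: the idea behind an integrated search.
for db in ("pubmed", "protein", "taxonomy"):
    print(build_esearch_url(db, "sickle cell anemia"))
```

Fetching any of these URLs (for example with `urllib.request.urlopen`) would return the matching record identifiers for that database. NCBI's usage guidelines ask automated clients to limit their request rate.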
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar
Present: User queries multiple sources: Heterogeneous data sources on the web?
Future: the Web-Service queries multiple sources
Bioinformatics Tools: Internet, Google: a wide array of tools, mostly free and open source, now exists for use. Do not reinvent the wheel! But do be wary: Bioinformatician != Programmer. The most common tools are written in C, C++, Java or PERL; others in Fortran or Python.
Sources for Tools: http://sourceforge.net (51 projects); http://bioinformatics.org (156 projects); ftp://ftp.ncbi.nlm.nih.gov (NCBI); http://www.ebi.ac.uk/Tools/ (EMBL); http://www.blueprint.org (Blueprint Initiative); http://www.geocities.com/bioinformaticsweb/toollink.html (Suresh’s Links); http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm (Dutch course page)
Two Types of Tools: Toolkits/APIs (programming aids): NCBI toolkits, SeqHound, BioPERL, BioJava, BioPHP… End-user applications: BLAST, PsiPred, SMART, RasMol…
NCBI Toolkit – what is it? Programming…: The NCBI toolkit is based on 3 key libraries of code: (1) A core library of platform-independent C standard functions and extensions. All NCBI toolkit code has this core dependence. A C++ toolkit is under development as well. (2) A complete ASN.1 specification, I/O and code generation system for platform-independent use of binary data. Recently XML support has been added as well. (3) A graphical user interface abstraction which maps to Mac, Windows and X-Windows GUIs (FLTK in the C++ toolkit).
SeqHound
Slide 77: What is SeqHound? Biological data is very heterogeneous: it comes from various locations and in many different formats. For example, a biological sequence database may exist in ASN.1 format while a microarray expression database exists in a flat-file format, and both might reference an organism but by different names. It is much more convenient to have access to the information in one centralized ‘data warehouse’ or ‘data mart’, represented in a single, cohesive data format. This is SeqHound’s mission.
Slide 78: What is SeqHound? Diagram labels: Sequences, Sequence annotation, Literature, Interactions, Structure, SeqHound, Access Methods, User.
BioPERL: The Bioperl project (www.bioperl.org). Comprehensive, well documented set of Perl modules. Last stable release: 1.4. Open source (Artistic License) project that has recruited developers from all over the world. Modules available for alignments (BLAST, Clustal), sequence retrieval, annotations, sequence manipulation, gene prediction output, sequence databases, etc. Reference: Stajich et al., The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct; 12(10): 1611-8. PMID: 12368254. Use with caution: things change fast.
BioPython: The Biopython project (biopython.org). “The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.” Source code available under liberal terms. Part of the Open Bioinformatics Foundation, http://open-bio.org/. Discussion mailing list: biopython@biopython.org
Bio-*: Open Bioinformatics Foundation projects. BioRuby: Ruby language toolkit. BioPipe: workflow framework for BioPERL. BioMoby: interoperability of biological data; provides WSDL service descriptions to the client through a MOBY central repository, which in turn tracks available data hosts and services. BioDAS: Distributed Annotation Server; uses the DAS/1 protocol to talk to various annotation servers and provide a single view to the client, e.g. WormBase, FlyBase, Ensembl, TIGR.
Bio-Extinct?: Bio-XML – RIP? Bio-PHP – RIP? Bio-CORBA – RIP?
Be wary of using the ‘latest and greatest’. R.I.P.
ClustalW: Possibly the most-used multiple sequence alignment program. Source code is available and can be linked directly to your code with a bit of work. Very fast; works with DNA and protein; very flexible, with a choice of penalties and scoring matrices.
ClustalW: Performs pairwise global alignments between all sequence pairs; builds a guide tree; builds the multiple sequence alignment using the guide tree, starting with the most similar pair.
ClustalW:
FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANF 60
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
             *:..* .:*:: .***** **:.:*   * *..***.*  :.. :*: *:.*.  ...*
Image adapted from Clustal documentation.
Slide 86: Python is an interpreted, interactive programming language created by Guido van Rossum and first released in 1991. Python is fully dynamically typed and uses automatic memory management; it is thus similar to Perl, Ruby, Scheme, Smalltalk, and Tcl. Python is developed as an open source project, managed by the non-profit Python Software Foundation. Python 2.4.2 was released on September 28, 2005. Biopython (http://biopython.org/) and BioJava (www.biojava.org): open source projects with goals very similar to Bioperl’s. MATLAB Bioinformatics Toolbox: http://www.mathworks.com/products/bioinfo/
Slide 87: R language for Statistical Computing (http://www.r-project.org/): R is a free software environment for statistical computing and graphics.
It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
Bioconductor: Bioinformatics with R (http://www.bioconductor.org/): The broad goals of Bioconductor are to provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data; facilitate the integration of biological metadata in the analysis of experimental data (e.g. literature data from PubMed, annotation data from LocusLink); allow the rapid development of extensible, scalable, and interoperable software; promote high-quality and reproducible research; and provide training in computational and statistical methods for the analysis of genomic data.
Microarray Software Comparison (http://ihome.cuhk.edu.hk/~b400559/arraysoft_rpackages.html): A website organizing and commenting on links to R software for gene expression data analysis, including software not available from Bioconductor or CRAN.
MATLAB: The name MATLAB stands for matrix laboratory. A high-performance language for technical computing, used extensively in industry and universities. An interactive system whose basic data element is an array that does not require dimensioning.
This allows you to solve many technical computing problems, especially those with matrix and vector formulations. MATLAB features a family of add-on application-specific solutions called toolboxes. Toolboxes (e.g., bioinformatics) are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems.
Slide 89: MATLAB Desktop
Homology Searches with BLAST:
  BLASTN: nucleotide query vs nucleotide database
  BLASTP: protein query vs protein database
  BLASTX: automatic 6-frame translation of nucleotide query vs protein database
  TBLASTN: protein query vs automatic 6-frame translation of nucleotide database
  TBLASTX: automatic 6-frame translation of nucleotide query vs automatic 6-frame translation of nucleotide database
PSI-BLAST (Position-Specific Iterated BLAST): Combines statistically significant alignments produced by BLAST into a position-specific score matrix; searches the database using this matrix; allows multiple iterations of this process; runs at approximately the same speed per iteration as gapped BLAST; is much more sensitive to weak but biologically relevant sequence similarities.
Five websites that all biologists should know: NCBI (The National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/ ; EBI (The European Bioinformatics Institute): http://www.ebi.ac.uk/ ; The Canadian Bioinformatics Resource: http://www.cbr.nrc.ca/ ; SwissProt/ExPASy (Swiss Bioinformatics Resource): http://expasy.cbr.nrc.ca/sprot/ ; PDB (The Protein Databank): http://www.rcsb.org/PDB/
NCBI (http://www.ncbi.nlm.nih.gov/): Entrez interface to databases (Medline/OMIM, GenBank/GenPept/Structures); BLAST server(s), with five-plus flavors of BLAST; draft human genome; much, much more…
EBI (http://www.ebi.ac.uk/): SRS database interface (EMBL, SwissProt, and many more); many server-based tools (ClustalW, DALI, …).
SwissProt (http://expasy.cbr.nrc.ca/sprot/): Curation! The error rate in the information is greatly reduced in comparison to most other databases. Extensive cross-linking to other data sources. SwissProt is the ‘gold standard’ by which other databases can be measured, and is the best place to start if you have a specific protein to investigate.
A few more resources to be aware of: Human Genome Working Draft: http://genome.ucsc.edu/ ; TIGR (The Institute for Genomic Research): http://www.tigr.org/ ; Celera: http://www.celera.com/ ; (model) organism specific information: Yeast: http://genome-www.stanford.edu/Saccharomyces/ ; Arabidopsis: http://www.tair.org/ ; Mouse: http://www.jax.org/ ; Fruitfly: http://www.fruitfly.org/ ; Nematode: http://www.wormbase.org/ ; Nucleic Acids Research Database Issue: http://nar.oupjournals.org/ (first issue every year)
UNEXPLORED AREAS…
The Computational Biology Challenge: "In principle, the string of genetic bits holds long-sought secrets of human development, physiology and medicine. In practice, our ability to transform such information into understanding remains woefully inadequate."
International Human Genome Sequencing Consortium, “Initial sequencing and analysis of the human genome,” Nature 409: 860-921 (2001) [Emphasis added]
Predict Epitopes, Find Vaccine Targets: Vaccines are often the only solution for viral diseases. Finding and developing effective vaccine targets (epitopes) is a slow and expensive process.
Recognize Functional Sites, Help Scientists: Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments. Data mining of biological sequences to find rules for recognizing and understanding functional sites. Dragon’s 10x reduction of TSS recognition false positives.
Diagnose Leukaemia, Benefit Children: Childhood leukaemia is a heterogeneous disease. Treatment is based on subtype. 3 different tests and 4 different experts are needed for diagnosis. Curable in USA, fatal in Indonesia.
Understand Proteins, Fight Diseases: Understanding the function and role of a protein needs organised information on interaction pathways. Such information is often reported in scientific papers but is seldom found in structured databases. A knowledge extraction system to process free text: extract protein names; extract interactions (e.g. Jak1).
Conclusions: We have only touched small parts of the elephant. Trial and error (intelligently) is often your best tool.
Slide 104: THANKS… Dr. Kailash Choudhary, LMC and IFAS, Jodhpur