Bioinformatics

Presentation Description

Introduction to bioinformatics and its tools

Presentation Transcript

Bioinformatics Advances, Tools, Web resources: 

Bioinformatics Advances, Tools, Web resources Dr. Kailash Choudhary, Assistant Professor, Lachoo Memorial College, Jodhpur

Science then, then and now: 

Science then, then and now In the beginning, there was thought and observation.

Science then, then and now: 

Science then, then and now For a long time this didn’t change. Man thought it would be enough to reason about the existing knowledge to explore everything there is to know. Back then, one single person could possess all knowledge in his cultural context (the Panda, the Purohit).

The achievements are still admirable …: 

The achievements are still admirable … …as we can see

Science then, then and now: 

Science then, then and now A vast amount of information

Science then, then and now: 

Science then, then and now No single person or group is sure what is known. Known, but not known = not known

Initial Challenges in biology: 

Initial Challenges in biology What makes us ill or unwell? Disease identification, disease-inducing agents. What keeps us healthy and makes us live longer? Drug discovery. Where do we all come from and what are we made of? Genetics and beyond.

… and their implications: 

… and their implications Understand biological structures of increasing complexity: genes (genomics): 1980s; proteins (proteomics): 1990s; complex carbohydrates (glycomics): 2000s. Understand biological processes and the roles structures play in them (biosynthesis and biological processes).

The “old” biology: 

The “old” biology The most challenging task for a scientist is to get good data

The “new” biology: 

The “new” biology The most challenging task for a scientist is to make sense of lots of data

Data is being collected faster and in greater amounts: 

Data is being collected faster and in greater amounts

Bio informatics: 

Bio informatics

What is bioinformatics?: 

What is bioinformatics?

To me the bioinformatics is…: 

To me the bioinformatics is… Deduction of knowledge by computer analysis of biological data. Or: see the 20,000 pages on this issue on the WWW, for example: “The field of science in which biology, computer science, and information technology merge to form a single discipline”

What is Bioinformatics?: 

What is Bioinformatics?

What can BioInformatics do?: 

What can BioInformatics do? Analyze genetic and molecular sequences Look for patterns, similarities, matches Identify structures Store derived information Large databases of genetic information

The data: 

The data information stored in the genetic code (DNA) protein sequences 3D structures experimental results from various sources patient statistics scientific literature

Algorithmic developments: 

Algorithmic developments Important part of research in bioinformatics: methods for data storage data retrieval data analysis

What makes us human?: 

What makes us human? CHIMP GENOME Chimpanzees are similar to humans in so many ways: they are socially complex, sensitive and communicative, and yet indisputably on the animal side of the man/beast divide. Scientists have now sequenced the genetic code of our closest living relative, showing the striking concordances and divergences between the two species, and perhaps holding up a mirror to our own humanity.

Slide 21: 

Perhaps not surprising! How human are chimps? Comparison between the full drafts of the human and chimp genomes revealed that they differ by only 1.23% at the level of single-nucleotide substitutions.
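
A divergence figure of this kind comes from counting mismatched positions across aligned sequence. The sketch below shows the arithmetic on two short, invented strings (not real genome data); the function name percent_difference is an assumption made for illustration, and it ignores insertions and deletions entirely.

```python
def percent_difference(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * mismatches / len(seq_a)


# Toy, invented sequences: exactly one mismatch in 20 aligned positions.
human_like = "ATGGTGCATCTGACTCCTGA"
chimp_like = "ATGGTGCATCTGACTCCTGG"
print(percent_difference(human_like, chimp_like))  # 5.0
```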

Complete Genomes: 

Complete Genomes 1994: 0. 1995: 1. 2004: 234. 2005: 303 (eukaryotes 24, bacteria 240, archaea 39).

Paradigm shift over time: : 

Paradigm shift over time:

Challenges: The Data Flood: 

Challenges: The Data Flood 52 bacterial genomes completed and published in a year (~100,000 genes); 228 genomes ongoing (~450,000 more genes when finished).

The Post-Genomic Iceberg: 

The Post-Genomic Iceberg The Undiscovered Phenotype The Undiscovered Genotype Most genes are of unknown function Undiscovered genomic diversity Discovered Biology Undiscovered Biology

What is Data mining?: 

What is Data mining? Santa’s rules: Blue or Circle. Banta’s rules: All the rest. Whose block is this? (Santa’s blocks, Banta’s blocks)

What is Data mining?: 

What is Data mining? Question: Can you explain how?
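
The block example can be written down as a rule applied to labelled items. The sketch below encodes the rule stated on the slide ("Blue or Circle" belongs to Santa, everything else to Banta); the Block class, the owner function and the sample blocks are hypothetical names and data introduced only for illustration. Data mining proper runs in the other direction, inferring such a rule from the labelled examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    color: str
    shape: str


def owner(block: Block) -> str:
    """Apply the slide's rule: blue OR circle -> Santa, all the rest -> Banta."""
    if block.color == "blue" or block.shape == "circle":
        return "Santa"
    return "Banta"


# Invented sample blocks with the owner each should receive under the rule.
samples = [
    (Block("blue", "square"), "Santa"),
    (Block("red", "circle"), "Santa"),
    (Block("red", "square"), "Banta"),
]
for block, expected in samples:
    print(block, owner(block), owner(block) == expected)
```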

The “post-genomics” era : 

The “post-genomics” era What’s next after collecting data? Goal: to understand the functional networks of a living cell. Annotation; comparative genomics; structural genomics; functional genomics.

Gene Prediction: Computational Challenge: 

Gene Prediction: Computational Challenge Gene: a sequence of nucleotides coding for a protein. Gene Prediction Problem: determine the beginning and end positions of genes in a genome.
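
A very rough first approximation to "beginning and end positions" is an open reading frame (ORF) scan: look for a start codon followed, in the same frame, by a stop codon. The sketch below assumes the forward strand only, no introns, and the standard stop codons; find_orfs and its min_codons parameter are names invented here, and real gene finders are far more elaborate.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    start is the index of the A in ATG; end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop codon (including ATG).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


toy = "CCATGAAATAGGG"  # invented sequence containing ATG AAA TAG
for start, end in find_orfs(toy):
    print(start, end, toy[start:end])
```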

Gene Prediction: Computational Challenge: 

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene Prediction: Computational Challenge: 

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctat gctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaa gctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene Prediction: Computational Challenge: 

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctat gctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaa gctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg Gene!

Two Approaches to Gene Prediction: 

Two Approaches to Gene Prediction Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns). Similarity-based: many human genes are similar to genes in mice, chickens, or even bacteria, so already known mouse, chicken, and bacterial genes may help to find human genes.
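
The "different subwords" idea in the statistical approach can be illustrated by comparing k-mer frequencies. The sketch below estimates 3-mer frequencies from a coding-like and a non-coding-like training string and scores a candidate by a log-odds sum. Both training strings are invented, and the function names and the floor value for unseen k-mers are assumptions; real predictors train on large annotated sets and use richer models.

```python
from collections import Counter
from math import log2


def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Relative frequency of every overlapping k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def log_odds(seq: str, coding: dict, noncoding: dict, k: int = 3, floor: float = 1e-3) -> float:
    """Sum of log2(P_coding / P_noncoding) over the k-mers of seq.

    Positive values mean the k-mer usage looks more like the coding model.
    Unseen k-mers fall back to a small floor frequency (an arbitrary choice).
    """
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        score += log2(coding.get(kmer, floor) / noncoding.get(kmer, floor))
    return score


# Invented training data, for illustration only.
coding_model = kmer_frequencies("ATGGCCGCCGCC")
noncoding_model = kmer_frequencies("ATATATATATAT")
print(log_odds("GCCGCC", coding_model, noncoding_model) > 0)  # True
print(log_odds("ATATAT", coding_model, noncoding_model) < 0)  # True
```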

Slide 34: 

Similarity-Based Approach: Metaphor in Different Languages If you could compare the day’s news in English side by side with the same news in a foreign language, some similarities might become apparent.

Slide 35: 

Comparative genomics Whole Genome Comparison Concluding on regulatory networks

Chimps and Us: 

Chimps and Us

Slide 37: 

Comparative genomics Comparing ORFs Identifying orthologs Concluding on structure and function Whole Genome Comparison Concluding on regulatory networks

Slide 38: 

Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Conservation of IGFALS (insulin-like growth factor binding protein, acid-labile subunit) between human and mouse.

Slide 39: 

Functional genomics Genome-wide profiling of: • mRNA levels • Protein levels Co-expression of genes and/or proteins
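
Co-expression is commonly quantified as the correlation between two expression profiles measured over the same conditions. The sketch below computes a Pearson correlation with the standard library only; the profiles are invented numbers, and choosing Pearson correlation is an assumption, since the slide does not name a measure.

```python
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Invented mRNA levels for three hypothetical genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls as gene_a rises
print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```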

Slide 40: 

Understanding the function of genes and other parts of the genome

Slide 41: 

Functional genomics Genome-wide profiling of: • mRNA levels • Protein levels Co-expression of genes and/or proteins Identifying protein-protein interaction Networks of interactions

Slide 42: 

A network of interactions can be built for all proteins in an organism. A large network of 8184 interactions among 4140 S. cerevisiae proteins.
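
An interaction network of this kind is usually held as an undirected graph. The sketch below builds an adjacency map from a list of interacting pairs and ranks proteins by their number of partners. The protein labels P1 to P4 and the pairs are invented, and build_network and hubs are names made up for this illustration.

```python
from collections import defaultdict


def build_network(interactions):
    """Turn (protein, protein) pairs into an undirected adjacency map."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def hubs(network, top_n: int = 1):
    """Proteins with the most interaction partners (ties broken by name)."""
    return sorted(network, key=lambda p: (-len(network[p]), p))[:top_n]


# Invented interaction pairs, for illustration only.
pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P1", "P4")]
net = build_network(pairs)
print(len(net), "proteins;", len(pairs), "interactions")
print(hubs(net))
```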

Slide 43: 

Structural genomics Assign structure to all proteins encoded in a genome

Protein Structure: 

Protein Structure

Resources and Databases: 

Resources and Databases The different types of data are collected in databases: sequence databases, structural databases, and databases of experimental results. All databases are connected.

Database Types: 

Database Types Sequence databases: general (GenBank, EMBL; PIR, Swiss-Prot; genomes) and special (TF binding sites; promoters). Structure databases: general (PDB) and special (specific protein families, folds). Databases of experimental results: co-expressed genes, protein-protein interactions, etc.

Gene database: 

Gene database Give insight into gene function. Alternative splicing of genes: alternative patterns of exons are included to create the gene product. EST

Genome Databases: 

Genome Databases Data organized by species. Clones assembled into contiguous pieces (‘contigs’) or whole chromosomes. Information on non-coding regions. Relativity

Genome Browsers: 

Genome Browsers Annotation adds value to sequence Easy “walk” through the genome Comparative genomics

Genome Browsers: 

Genome Browsers Ensembl Genome Browser ( http://www.ensembl.org ) UCSC Genome Browser http://genome.ucsc.edu/ WormBase: http://www.wormbase.org/ AceDB: http://www.acedb.org/ Comprehensive Microbial Resource: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl FlyBase: http://flybase.bio.indiana.edu/

Slide 51: 

beta globin

SNP database: 

SNP database Single Nucleotide Polymorphisms (SNPs): a single-base difference at one position between two individuals of the same species. SNPs play an important role in differentiation and disease.

Sickle Cell Anemia: 

Sickle Cell Anemia Caused by a single A-to-T substitution in the beta-globin gene, which puts valine in place of glutamic acid in hemoglobin.

Healthy Individual: 

Healthy Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACC ATG GTGCATCTGACTCCTGA G G A G AAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTP E EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH

Diseased Individual: 

Diseased Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACC ATG GTGCATCTGACTCCT G T G GAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTP V EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH
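
The healthy and diseased proteins differ at one residue (MVHLTP E EKS versus MVHLTP V EKS). The sketch below takes the first 30 coding bases of the HBB mRNA shown above, applies the A-to-T change in the seventh codon (GAG to GTG) that yields that protein, locates the differing base and translates both versions. The codon table is the standard genetic code; the function names and the choice to use a 30-base excerpt are assumptions made for brevity.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order (first, second, third base).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon (trailing partial codon ignored)."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def find_snps(ref: str, alt: str):
    """List (0-based position, ref base, alt base) where two equal-length sequences differ."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(ref, alt)) if a != b]


healthy = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"  # start of the HBB coding sequence
sickle = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"   # same excerpt with GAG -> GTG in codon 7

print(find_snps(healthy, sickle))  # [(19, 'A', 'T')]
print(translate(healthy))          # MVHLTPEEKS
print(translate(sickle))           # MVHLTPVEKS
```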

Structure Databases: 

Structure Databases 3-dimensional structures of proteins, nucleic acids, molecular complexes etc 3-d data is available due to techniques such as NMR and X-Ray crystallography

Databases of Experimental Results: 

Databases of Experimental Results Data such as experimental microarray images- expression data Clustering information Metabolic pathways, protein-protein interaction data

PubMed: 

PubMed Literature databases: MEDLINE publication database. Over 17,000 journals; 15 million citations since 1950. Service of the National Library of Medicine. http://www.ncbi.nlm.nih.gov/PubMed

Slide 63: 

GENOMIC DATA: GenBank, DDBJ, EMBL. ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR. PROTEIN: PIR, SWISS-PROT. STRUCTURE: PDB, MMDB, SCOP. LITERATURE: PubMed. PATHWAY: KEGG, COG. DISEASE: LocusLink, OMIM, OMIA. GENES: RefSeq, AllGenes, GDB. SNPs: dbSNP. ESTs: dbEST, UniGene. MOTIFS: BLOCKS, Pfam, Prosite. GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress.

Entrez – NCBI Engine: 

Entrez – NCBI Engine Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others. http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar
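
Entrez can also be queried from a program. The sketch below only builds a search URL for NCBI's E-utilities (the programmatic interface to Entrez, which the slide does not mention) and makes no network request. The esearch_url helper is a name invented here; db, term and retmax are E-utilities parameters, and anyone using this for real should check NCBI's usage guidelines first.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# The same text query pointed at two different Entrez databases.
for database in ("pubmed", "protein"):
    print(esearch_url(database, "sickle cell anemia", retmax=5))
```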

Entrez – NCBI Engine: 

Entrez – NCBI Engine

Present: User queries multiple sources: 

Present: User queries multiple sources Heterogeneous data sources on the web ?

Future: the Web-Service queries multiple sources: 

Future: the Web-Service queries multiple sources

Bioinformatics Tools: 

Bioinformatics Tools

Bioinformatics Tools: 

Bioinformatics Tools Internet, Google – wide array of tools, mostly free and open source, now exist for use Do not reinvent the wheel! But do be wary, Bioinformatician != Programmer Most common tools are written in C, C++, Java or PERL; others in Fortran, Python

Sources for Tools: 

Sources for Tools http://sourceforge.net (51 projects) http://bioinformatics.org (156 projects) ftp://ftp.ncbi.nlm.nih.gov (NCBI) http://www.ebi.ac.uk/Tools/ (EMBL) http://www.blueprint.org (Blueprint Initiative) http://www.geocities.com/bioinformaticsweb/toollink.html (Suresh’s Links) http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm (Dutch course page)

Two Types of Tools: 

Two Types of Tools Toolkits/APIs – programming aids NCBI toolkits, Seqhound, BioPERL, BioJava, BioPHP… End-User applications BLAST, PsiPred, SMART, Rasmol…

NCBI Toolkit – what is it? Programming…: 

NCBI Toolkit – what is it? Programming… The NCBI toolkit is based on 3 key libraries of code: A core library of platform independent C standard functions and extensions. All NCBI toolkit code has this core dependence. A C++ toolkit is under development as well A complete ASN.1 specification, I/O and code generation system for platform independent use of binary data. Recently XML support has been added as well A graphical user interface abstraction which maps to Mac, Windows and X-Windows GUIs – FLTK in the C++ toolkit

SeqHound: 

SeqHound

Slide 77: 

What is SeqHound? Biological data is very heterogeneous: it comes from various locations and in many different formats. For example, a biological sequence database may exist in ASN.1 format while a microarray expression database exists in a flat-file format, and both might reference the same organism under different names. It is much more convenient to have access to the information in one centralized ‘data warehouse’ or ‘data mart’, represented in a single, cohesive data format. This is SeqHound’s mission.

Slide 78: 

What is SeqHound? Sequences Sequence annotation Literature Interactions SeqHound Access Methods User Structure

BioPERL: 

BioPERL The Bioperl project – www.bioperl.org Comprehensive, well documented set of Perl modules Last stable release 1.4 Open Source (Artistic License) project that has recruited developers from all over the world Modules available for alignments (BLAST, Clustal), sequence retrieval, annotations, sequence manipulation, gene prediction output, sequence databases, etc… Stajich et al ., The Bioperl toolkit: Perl modules for the life sciences. Genome Res . 2002 Oct; 12(10): 1611-8. PMID: 12368254 Use with caution: things change fast

BioPython: 

BioPython The Biopython project – biopython.org “The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology” Source code available under liberal terms Part of Open Bioinformatics Foundation http://open-bio.org/ Discussion mailing list: biopython@biopython.org

Bio-*: 

Bio-* Open Bioinformatics Foundation Projects BioRuby – Ruby language toolkit BioPipe – workflow framework for BioPERL BioMoby – Interoperability of biological data Provides WSDL service descriptions to client through a MOBY central repository, which in turn tracks available data hosts and services BioDAS – Distributed Annotation Server uses DAS/1 protocol to talk to various annotation servers and provide a single view to client e.g. WormBase, FlyBase, Ensembl, TIGR

Bio-Extinct?: 

Bio-Extinct? Bio-XML – RIP? Bio-PHP – RIP? Bio-CORBA – RIP? Be wary of using the ‘latest and greatest’ R.I.P.

ClustalW: 

ClustalW Possibly most-used multiple sequence alignment program Source code is available, can be linked directly to your code with a bit of work Very fast, works with DNA and protein, very flexible – choice of penalties and scoring matrices

ClustalW: 

ClustalW Performs pairwise global alignments between all sequence pairs Builds a guide tree Builds multiple sequence alignment using guide tree, starting with most similar pair
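
The first step, scoring every pair by global alignment and finding the most similar pair, can be sketched as follows. This is a plain Needleman-Wunsch score with a fixed match, mismatch and linear gap cost, which is an assumption made for brevity: ClustalW itself uses substitution matrices and more refined gap penalties, and the guide tree and progressive alignment steps are not shown. The sequences and function names are invented.

```python
from itertools import combinations


def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch score of the best global alignment (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def most_similar_pair(seqs: dict):
    """Names of the two sequences with the highest pairwise alignment score."""
    return max(
        combinations(sorted(seqs), 2),
        key=lambda pair: global_alignment_score(seqs[pair[0]], seqs[pair[1]]),
    )


# Invented toy sequences; a progressive aligner would start from the pair returned here.
toy = {"seq_a": "GATTACA", "seq_b": "GATTACA", "seq_c": "CCCCCCC"}
print(most_similar_pair(toy))
```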

ClustalW: 

ClustalW
FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANF 60
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
             *:..* .:*:: .***** **:.:*   * *..***.*  :.. :*: *:.*.  ...*
Image adapted from Clustal documentation

Slide 86: 

Python is an interpreted, interactive programming language created by Guido van Rossum around 1990 (first public release in 1991). Python is fully dynamically typed and uses automatic memory management; it is thus similar to Perl, Ruby, Scheme, Smalltalk, and Tcl. Python is developed as an open source project, managed by the non-profit Python Software Foundation. Python 2.4.2 was released on September 28, 2005. Biopython (http://biopython.org/) and BioJava (www.biojava.org): Biopython and BioJava are open source projects with goals very similar to those of BioPerl. MATLAB Bioinformatics Toolbox: http://www.mathworks.com/products/bioinfo/

Slide 87: 

R-language for Statistical Computing ( http://www.r-project.org/ ): R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS Bioconductor : Bioinformatics with R (http://www.bioconductor.org/) The broad goals of Bioconductor are to provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data; facilitate the integration of biological metadata in the analysis of experimental data: e.g. literature data from PubMed, annotation data from LocusLink; allow the rapid development of extensible, scalable, and interoperable software; promote high-quality and reproducible research; provide training in computational and statistical methods for the analysis of genomic data. Microarray Software Comparison (http://ihome.cuhk.edu.hk/~b400559/arraysoft_rpackages.html) A website organizing and commenting on links to R software for gene expression data analysis , including software not available from Bioconductor or CRAN.

MATLAB: 

MATLAB The name MATLAB stands for matrix laboratory. A high-performance language for technical computing. Used extensively in industry and universities. An interactive system whose basic data element is an array that does not require dimensioning. This allows you to solve many technical computing problems, especially those with matrix and vector formulations. Features a family of add-on application-specific solutions called toolboxes. Toolboxes (e.g., bioinformatics) are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems.

Slide 89: 

MATLAB Desktop

Homology Searches with BLAST: 

Homology Searches with BLAST BLASTN Nucleotide query vs nucleotide database BLASTP protein query vs protein database BLASTX automatic 6-frame translation of nucleotide query vs protein database TBLASTN protein query vs automatic 6-frame translation of nucleotide database TBLASTX automatic 6-frame translation of nucleotide query vs automatic 6-frame translation of nucleotide database
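
The "automatic 6-frame translation" used by BLASTX, TBLASTN and TBLASTX means translating a nucleotide sequence in three reading frames on each strand. The sketch below shows that step alone, using the standard genetic code; it is not BLAST's implementation, and the frame labels (+1 to +3, -1 to -3), the use of X for unrecognised codons, and the function names are assumptions.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order (first, second, third base).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translation(dna: str) -> dict:
    """Translate all three frames of the forward and reverse-complement strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```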

PSI-BLAST Position-Specific Iterated BLAST : 

PSI-BLAST Position-Specific Iterated BLAST combines statistically significant alignments produced by BLAST into a position-specific score matrix searches the database using this matrix allows multiple iterations of this process runs at approximately the same speed per iteration as gapped BLAST is much more sensitive to weak but biologically relevant sequence similarities
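
The position-specific score matrix at the heart of this loop can be illustrated with a much simplified construction: count residues per alignment column, add a pseudocount, and take the log-odds against a background frequency. The sketch assumes an ungapped alignment, a uniform background and a fixed pseudocount, none of which match what PSI-BLAST actually does (it weights sequences and uses data-dependent pseudocounts); the peptides and function names are invented.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet: str = AMINO_ACIDS, pseudocount: float = 1.0):
    """One {residue: log-odds score} dict per column of an ungapped alignment."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # uniform background: a simplifying assumption
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            res: log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return pssm


def score(pssm, seq: str) -> float:
    """Score a sequence of the same length as the matrix."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[res] for column, res in zip(pssm, seq))


# Invented aligned peptides standing in for the significant hits of one iteration.
hits = ["MKV", "MKL", "MRV"]
matrix = build_pssm(hits)
print(score(matrix, "MKV") > score(matrix, "AAA"))  # True
```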

Five websites that all biologists should know: 

Five websites that all biologists should know NCBI (The National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/ EBI (The European Bioinformatics Institute) http://www.ebi.ac.uk/ The Canadian Bioinformatics Resource http://www.cbr.nrc.ca/ SwissProt/ExPASy (Swiss Bioinformatics Resource) http://expasy.cbr.nrc.ca/sprot/ PDB (The Protein Data Bank) http://www.rcsb.org/PDB/

NCBI (http://www.ncbi.nlm.nih.gov/): 

NCBI ( http://www.ncbi.nlm.nih.gov/ ) Entrez interface to databases Medline/OMIM Genbank/Genpept/Structures BLAST server(s) Five-plus flavors of blast Draft Human Genome Much, much more…

EBI (http://www.ebi.ac.uk/): 

EBI ( http://www.ebi.ac.uk/ ) SRS database interface EMBL, SwissProt, and many more Many server-based tools ClustalW, DALI, …

SwissProt (http://expasy.cbr.nrc.ca/sprot/): 

SwissProt ( http://expasy.cbr.nrc.ca/sprot/ ) Curation!!! Error rate in the information is greatly reduced in comparison to most other databases. Extensive cross-linking to other data sources SwissProt is the ‘gold-standard’ by which other databases can be measured, and is the best place to start if you have a specific protein to investigate

A few more resources to be aware of: 

A few more resources to be aware of Human Genome Working Draft http://genome.ucsc.edu/ TIGR (The Institute for Genomic Research) http://www.tigr.org/ Celera http://www.celera.com/ (Model) Organism specific information: Yeast: http://genome-www.stanford.edu/Saccharomyces/ Arabidopsis: http://www.tair.org/ Mouse: http://www.jax.org/ Fruitfly: http://www.fruitfly.org/ Nematode: http://www.wormbase.org/ Nucleic Acids Research Database Issue http://nar.oupjournals.org/ (First issue every year)

UNEXPLORED AREAS…: 

UNEXPLORED AREAS…

The Computational Biology Challenge: 

The Computational Biology Challenge “In principle, the string of genetic bits holds long-sought secrets of human development, physiology and medicine. In practice, our ability to transform such information into understanding remains woefully inadequate.” International Human Genome Sequencing Consortium, “Initial sequencing and analysis of the human genome,” Nature 409: 860-921 (2001) [Emphasis added]

Predict Epitopes, Find Vaccine Targets: 

Predict Epitopes, Find Vaccine Targets Vaccines are often the only solution for viral diseases. Finding and developing effective vaccine targets (epitopes) is a slow and expensive process.

Recognize Functional Sites, Help Scientists: 

Recognize Functional Sites, Help Scientists Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments Data mining of bio seqs to find rules for recognizing & understanding functional sites Dragon’s 10x reduction of TSS recognition false positives

Diagnose Leukaemia, Benefit Children: 

Diagnose Leukaemia, Benefit Children Childhood leukaemia is a heterogeneous disease Treatment is based on subtype 3 different tests and 4 different experts are needed for diagnosis Curable in USA, fatal in Indonesia

Understand Proteins, Fight Diseases: 

Understand Proteins, Fight Diseases Understanding the function and role of a protein needs organised information on interaction pathways. Such information is often reported in scientific papers but is seldom found in structured databases. A knowledge extraction system processes free text to extract protein names and extract interactions (e.g. Jak1).
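
The simplest form of such an extraction system is a pattern match over sentences: a known protein name, an interaction verb, and another known protein name. The sketch below does only that. The verb list, the example sentences and the function name are invented for illustration, and real text-mining systems must cope with synonyms, passive voice, negation and much more.

```python
import re

# A small, hand-picked verb list; an assumption, not an established vocabulary.
INTERACTION_VERBS = ("interacts with", "binds", "phosphorylates", "activates")


def extract_interactions(text: str, protein_names):
    """Return (protein, verb, protein) triples found by a simple pattern match."""
    names = "|".join(re.escape(n) for n in sorted(protein_names, key=len, reverse=True))
    verbs = "|".join(re.escape(v) for v in INTERACTION_VERBS)
    pattern = re.compile(rf"\b({names})\s+({verbs})\s+({names})\b")
    return [match.groups() for match in pattern.finditer(text)]


# Invented sentences written for this example, not quotations from any paper.
abstract = "We show that Jak1 phosphorylates Stat3. Stat3 binds Jak1 in vitro."
for triple in extract_interactions(abstract, {"Jak1", "Stat3"}):
    print(triple)
```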

Conclusions: 

Conclusions We have only touched small parts of the elephant Trial and error (intelligently) is often your best tool

Slide 104: 

THANKS… Dr. Kailash Choudhary LMC AND IFAS, JODHPUR