Sheth MSFT eScience 14Oct2006


Presentation Transcript

Semantic empowerment of Life Science Applications October 2006: 

Semantic empowerment of Life Science Applications October 2006 Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.

Computation, data and semantics in life sciences: 

Computation, data and semantics in life sciences 'The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.' (Roger Brent, 1999) 'The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.' (L. Hood, 2000) 'Biological research is going to move from being hypothesis-driven to being data-driven.' (Robert Robbins) 'We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.' (Debra Goldfarb) We will show how semantics is a key enabler for achieving the above predictions and visions, in which information and process play a critical role.

Semantic Web and Life Science: 

Semantic Web and Life Science Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005). How much is that? Compare it to the estimate of the total words ever spoken by humans = 12 exabytes. Death by data. The need for: search; integration; analysis, decision support; discovery. Not data, but analysis and insight, leading to decisions and discovery.

Semantic empowerment of Life Science Applications: 

Semantic empowerment of Life Science Applications Life Science research today deals with highly heterogeneous as well as massive amounts of data distributed across the world. We need more automated ways for integration and analysis leading to insight and discovery - to understand cellular components, molecular functions and biological processes, and more importantly complex interactions and interdependencies between them.

Benefits of Semantics: 

Benefits of Semantics Development of large domain-specific knowledge for reference, common nomenclature, tagging. Integration of heterogeneous multi-source data: biomedical documents (text), scientific/experimental data and structured databases. Semantic search, browsing, integration, analysis, and discovery. Faster and more reliable discovery leading to quality of life improvements.

What is semantics & Semantic Web: 

What is semantics & Semantic Web Meaning and use of data. From syntax and structure to semantics (beyond formatting, organization, query interfaces, …). XML -> RDF -> OWL -> Rules -> Trust. Ontologies at the heart of the Semantic Web, capturing agreement and domain knowledge. (Automatic) semantic annotation, reasoning, … Also, increasing use of Service-Oriented Architecture -> Semantic Web services. W3C SW for Health Care and Life Sciences.

Semantic empowerment of Life Science Applications: 

Semantic empowerment of Life Science Applications This talk will demonstrate some of the efforts in: Building large (populated) life science ontologies (GlycO, ProPreO) Gathering/extracting knowledge and metadata: entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry) Semantic web services and registries, leading to better discovery/reuse of scientific tools and their composition Ontology-driven applications developed

Slide8: 

Semantic Applications Active Semantic Medical Records (demo): an operational health care application using multiple ontologies, semantic annotations and rule-based decision support. Semantic Browser (demo): contextual browsing of PubMed aided by ontology and schema-level (in future, instance-level) relationships. N-glycosylation process: an example of a scientific workflow. Integrated Semantic Information & Knowledge System (ISIS): integrated access and analysis of structured databases, scientific literature and experimental data. Others we will not discuss: SemBowser, SemDrug, …. Let us start with a couple of simple applications.

Life Science Ontologies: 

Life Science Ontologies ProPreO: an ontology for capturing process and lifecycle information related to proteomic experiments; 398 classes, 32 relationships, 3.1 million instances; published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO). GlycO: an ontology for structure and function of glycopeptides; 573 classes, 113 relationships; published through the National Center for Biomedical Ontology (NCBO).

N-Glycosylation metabolic pathway: 

N-Glycosylation metabolic pathway GNT-I attaches GlcNAc at position 2

GlycO ontology: 

GlycO ontology Challenge – model hundreds of thousands of complex carbohydrate entities. But the differences between the entities are small (e.g. just one component). How to model all the concepts but preclude redundancy → ensure maintainability, scalability.

Slide12: 

N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251 GlycoTree

EnzyO: 

EnzyO The enzyme ontology EnzyO is highly intertwined with GlycO. While its structure is mostly that of a taxonomy, it is highly restricted at the class level and hence allows for comfortable classification of enzyme instances from multiple organisms. GlycO together with EnzyO contains all the information that is needed for the description of metabolic pathways, e.g. N-Glycan biosynthesis.

Pathway representation in GlycO: 

Pathway representation in GlycO Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.

Zooming in a little …: 

Zooming in a little … The N-Glycan with KEGG ID 00015 is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145. The product of this reaction is the Glycan with KEGG ID 00020.
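
As a rough illustration of how a pathway could be inferred from reaction descriptions alone, the following Python sketch chains reactions by matching each product to the next substrate. It is not the GlycO reasoning mechanism. The first reaction uses the identifiers quoted on the slide; the second reaction, and the names `Reaction` and `infer_pathway`, are hypothetical and exist only so that the chain has more than one step.

```python
from collections import namedtuple

# One reaction description: a substrate glycan, a product glycan and the
# enzyme class that catalyzes the step.
Reaction = namedtuple("Reaction", "reaction_id substrate product enzyme_class")

REACTIONS = [
    # From the slide: glycan 00015 -> glycan 00020 via R05987 (EC 2.4.1.145).
    Reaction("R05987", "00015", "00020", "EC 2.4.1.145"),
    # HYPOTHETICAL follow-on step, present only so the chain has two links.
    Reaction("R_HYPOTHETICAL", "00020", "GLYCAN_HYPOTHETICAL", "EC_HYPOTHETICAL"),
]


def infer_pathway(reactions, start_glycan):
    """Chain reactions by matching each product to the next substrate.

    Simplifying assumption: at most one reaction consumes a given glycan.
    """
    by_substrate = {r.substrate: r for r in reactions}
    pathway, current, visited = [], start_glycan, set()
    while current in by_substrate and current not in visited:
        visited.add(current)  # guard against cycles in the reaction data
        step = by_substrate[current]
        pathway.append(step)
        current = step.product
    return pathway


for step in infer_pathway(REACTIONS, "00015"):
    print(f"{step.substrate} --{step.reaction_id} ({step.enzyme_class})--> {step.product}")
```

No pathway object is stored anywhere in this sketch; the ordered list of steps falls out of the substrate and product links, which is the point the slides make about GlycO.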

GlycO population : 

GlycO population Multiple data sources used in populating the ontology: KEGG (Kyoto Encyclopedia of Genes and Genomes), SWEETDB, CARBANK Database. Each data source has a different schema for storing data. There is significant overlap of instances in the data sources. Hence, entity disambiguation and a common representational format are needed.

Slide17: 

Ontology population workflow

Slide18: 

[][Asn]{[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-Manp] {[(3+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc] {}[(4+1)][b-D-GlcpNAc] {}}[(6+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc]{}}}}}} Ontology population workflow
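
The conversion from the bracketed linear notation to a nested XML description can be sketched in Python as follows. This is a minimal sketch and not the workflow's actual converter. The grammar (`[linkage][name]{children}`), the reading of `(4+1)` as link position and anomeric carbon, and the reading of `b-D-GlcpNAc` as anomer, chirality and monosaccharide (dropping the ring-form letter `p`) are assumptions inferred from the two representations shown in the slides. All function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

LINEAR = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
    "{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}"
    "[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}"
)

NODE_HEAD = re.compile(r"\[([^\]]*)\]\[([^\]]*)\]\{")    # [linkage][name]{
LINKAGE = re.compile(r"\((\d+)\+(\d+)\)")                # (4+1)
RESIDUE = re.compile(r"([ab])-([DL])-([A-Z][a-z]+)p(\w*)")  # b-D-GlcpNAc


def parse_node(text, pos=0):
    """Recursively parse one '[linkage][name]{children}' node."""
    match = NODE_HEAD.match(text, pos)
    if match is None:
        raise ValueError(f"expected '[linkage][name]{{' at offset {pos}")
    node = {"linkage": match.group(1), "name": match.group(2), "children": []}
    pos = match.end()
    while pos < len(text) and text[pos] != "}":
        child, pos = parse_node(text, pos)
        node["children"].append(child)
    if pos >= len(text):
        raise ValueError("unbalanced braces")
    return node, pos + 1  # skip the closing brace


def parse_glycan(linear):
    """Parse a whole linear-notation string into a nested dict."""
    text = re.sub(r"\s+", "", linear)
    root, end = parse_node(text)
    if end != len(text):
        raise ValueError("trailing characters after the root node")
    return root


def residue_element(node):
    """Build a <residue> element (and its children) from a parsed node."""
    link, carbon = LINKAGE.fullmatch(node["linkage"]).groups()
    anomer, chirality, base, suffix = RESIDUE.fullmatch(node["name"]).groups()
    elem = ET.Element("residue", {
        "link": link,
        "anomeric_carbon": carbon,
        "anomer": anomer,
        "chirality": chirality,
        "monosaccharide": base + suffix,  # e.g. Glc + NAc -> GlcNAc
    })
    for child in node["children"]:
        elem.append(residue_element(child))
    return elem


def to_xml(root):
    """Wrap the aglycon and the residue tree in a <Glycan> element."""
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", {"name": root["name"]})
    for child in root["children"]:
        glycan.append(residue_element(child))
    return glycan


print(ET.tostring(to_xml(parse_glycan(LINEAR)), encoding="unicode"))
```

A recursive-descent parser fits here because the notation nests each residue's children inside its own braces, so the call stack mirrors the branching of the glycan.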

Slide19: 

<Glycan>
  <aglycon name='Asn'/>
  <residue link='4' anomeric_carbon='1' anomer='b' chirality='D' monosaccharide='GlcNAc'>
    <residue link='4' anomeric_carbon='1' anomer='b' chirality='D' monosaccharide='GlcNAc'>
      <residue link='4' anomeric_carbon='1' anomer='b' chirality='D' monosaccharide='Man'>
        <residue link='3' anomeric_carbon='1' anomer='a' chirality='D' monosaccharide='Man'>
          <residue link='2' anomeric_carbon='1' anomer='b' chirality='D' monosaccharide='GlcNAc'>
          </residue>
          <residue link='4' anomeric_carbon='1' anomer='b' chirality='D' monosaccharide='GlcNAc'>
          </residue>
        </residue>
        <residue link='6' anomeric_carbon='1' anomer='a' chirality='D' monosaccharide='Man'>
          <residue link='2' anomeric_carbon='1' anomer='b' chirality='D' monosaccharide='GlcNAc'>
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
Ontology population workflow

Slide20: 

Ontology population workflow

ProPreO ontology: 

ProPreO ontology Two aspects of glycoproteomics: What is it? → identification. How much of it is there? → quantification. Heterogeneity in data generation process, instrumental parameters, formats. Need data and process provenance → ontology-mediated provenance. Hence, ProPreO models both the glycoproteomics experimental process and attendant data.

Slide22: 

ProPreO population: transformation to RDF. Scientific Data, Computational Methods, Ontology instances

Slide23: 

ProPreO population: transformation to RDF (diagram with columns Scientific Data, Computational Methods, RDF; key: Protein Path, Peptide Path). Scientific data: Protein Data (amino-acid sequence). Computational methods: Calculate Chemical Mass; Calculate Monoisotopic Mass; Determine N-glycosylation Consensus; Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence. RDF: Chemical Mass RDF, Monoisotopic Mass RDF, Amino-acid Sequence RDF. 'Protein RDF': chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus. 'Peptide RDF': chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus, parent protein.
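
Two of the computational methods named on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ProPreO population code: the monoisotopic mass is taken as the sum of standard monoisotopic residue masses plus one water, and the N-glycosylation consensus is taken to be the commonly used sequon N-X-S/T with X any residue except proline. The function names are hypothetical.

```python
import re

# Standard monoisotopic residue masses in daltons (amino acid minus water).
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # monoisotopic mass of H2O added for the free termini

# Lookahead so that overlapping sequons (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def monoisotopic_mass(sequence):
    """Monoisotopic mass of an unmodified peptide or protein sequence."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER


def n_glycosylation_sites(sequence):
    """1-based positions of Asn residues inside an N-X-S/T sequon (X != P)."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence)]


peptide = "MNGTNPSA"  # made-up example sequence, not from the slides
print(round(monoisotopic_mass(peptide), 4), n_glycosylation_sites(peptide))
```

In the example sequence the Asn at position 2 is reported because it is followed by G and T, while the Asn at position 5 is skipped because the next residue is proline.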

Semantic empowerment of Life Science Applications: 

Semantic empowerment of Life Science Applications This talk will demonstrate some of the efforts in: building large life science ontologies (GlycO, an ontology for the structure and function of glycopeptides, and ProPreO, an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications; entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and the resulting capability for integrated access and analysis of structured databases, scientific literature and experimental data; semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive; semantic applications developed.

Slide25: 

Relationship extraction from unstructured data (other related research: biological entity extraction)

Overview : 

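Overview [Figure: UMLS schema classes 'Biologically active substance' and 'Disease or Syndrome', linked by the relationships causes, affects and complicates; MeSH terms 'Fish Oils' and 'Raynaud’s Disease' attached to the classes by instance_of, with the relationship between them marked '???????'; PubMed document counts shown: 9284 documents, 4733 documents, 5 documents.]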

About the data used: 

About the data used UMLS: a high-level schema of the biomedical domain, with 136 classes and 49 relationships. Synonyms of all relationships obtained using variant lookup (tools from NLM), e.g. T147—effect, T147—induce, T147—etiology, T147—cause, T147—effecting, T147—induced. MeSH: terms already asserted as instances of one or more classes in UMLS. PubMed: abstracts annotated with one or more MeSH terms.

Example PubMed abstract (for the domain expert): 

Example PubMed abstract (for the domain expert)

Method – Parse Sentences in PubMed: 

Method – Parse Sentences in PubMed SS-Tagger (University of Tokyo) SS-Parser (University of Tokyo) (TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
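
To make the later extraction steps concrete, here is a simplified Python sketch that reads the bracketed parse shown above and pulls out a subject, verb and object. It is an assumption-laden illustration and not the method used in the talk: it takes the first NP under S as the subject, the first VB* tag under VP as the relationship and the first NP under VP as the object. All function names are hypothetical.

```python
PARSE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous)) (NN stimulation)) (PP (IN by) (NP (NN estrogen)))) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia)) "
    "(PP (IN of) (NP (DT the) (NN endometrium)))))))"
)


def read_tree(text):
    """Parse a Penn-Treebank-style bracketing into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root")
    return tree


def words(node):
    """All words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def find(node, label):
    """First node with the given label, depth-first."""
    if isinstance(node, str):
        return None
    if node[0] == label:
        return node
    for child in node[1]:
        found = find(child, label)
        if found is not None:
            return found
    return None


def child_with(node, test):
    """First direct child (a subtree) whose label satisfies test."""
    return next(c for c in node[1] if not isinstance(c, str) and test(c[0]))


def extract_fact(tree):
    """Return (subject, verb, object) from the first S node."""
    sentence = find(tree, "S")
    subject = child_with(sentence, lambda lab: lab == "NP")
    verb_phrase = child_with(sentence, lambda lab: lab == "VP")
    verb = child_with(verb_phrase, lambda lab: lab.startswith("VB"))
    obj = child_with(verb_phrase, lambda lab: lab == "NP")
    return (" ".join(words(subject)), " ".join(words(verb)), " ".join(words(obj)))


print(extract_fact(read_tree(PARSE)))
```

In a fuller pipeline the verb would then be matched against the relationship variants and the noun phrases against MeSH terms; this sketch stops at the raw subject, verb and object strings.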

Method – Identify entities and Relationships in Parse Tree: 

Method – Identify entities and Relationships in Parse Tree

Method – Fact Extraction from Parse Tree: 

Method – Fact Extraction from Parse Tree

Slide33: 

Semantic annotation of scientific/experimental data

ProPreO: Ontology-mediated provenance: 

ProPreO: Ontology-mediated provenance Mass Spectrometry (MS) Data: ms/ms peaklist data
parent ion m/z 830.9570, parent ion abundance 194.9604, parent ion charge 2
fragment ion m/z    fragment ion abundance
580.2985            0.3592
688.3214            0.2526
779.4759            38.4939
784.3607            21.7736
1543.7476           1.3822
1544.7595           2.9977
1562.8113           37.4790
1660.7776           476.5043

ProPreO: Ontology-mediated provenance: 

<ms-ms_peak_list>
  <parameter instrument='micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer' mode='ms-ms'/>
  <parent_ion m-z='830.9570' abundance='194.9604' z='2'/>
  <fragment_ion m-z='580.2985' abundance='0.3592'/>
  <fragment_ion m-z='688.3214' abundance='0.2526'/>
  <fragment_ion m-z='779.4759' abundance='38.4939'/>
  <fragment_ion m-z='784.3607' abundance='21.7736'/>
  <fragment_ion m-z='1543.7476' abundance='1.3822'/>
  <fragment_ion m-z='1544.7595' abundance='2.9977'/>
  <fragment_ion m-z='1562.8113' abundance='37.4790'/>
  <fragment_ion m-z='1660.7776' abundance='476.5043'/>
</ms-ms_peak_list>
Ontological Concepts. ProPreO: Ontology-mediated provenance. Semantically Annotated MS Data
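
Once the peak list carries named elements and attributes, it can be read with a generic XML parser instead of a format-specific one. The Python sketch below uses the standard library's `xml.etree.ElementTree` on the annotated peak list from this slide. The function names are hypothetical, and the neutral-mass helper is an added convenience that assumes the usual relation M = (m/z - proton mass) × z for a positively charged ion.

```python
import xml.etree.ElementTree as ET

PEAK_LIST_XML = """<ms-ms_peak_list>
  <parameter instrument='micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer' mode='ms-ms'/>
  <parent_ion m-z='830.9570' abundance='194.9604' z='2'/>
  <fragment_ion m-z='580.2985' abundance='0.3592'/>
  <fragment_ion m-z='688.3214' abundance='0.2526'/>
  <fragment_ion m-z='779.4759' abundance='38.4939'/>
  <fragment_ion m-z='784.3607' abundance='21.7736'/>
  <fragment_ion m-z='1543.7476' abundance='1.3822'/>
  <fragment_ion m-z='1544.7595' abundance='2.9977'/>
  <fragment_ion m-z='1562.8113' abundance='37.4790'/>
  <fragment_ion m-z='1660.7776' abundance='476.5043'/>
</ms-ms_peak_list>"""

PROTON_MASS = 1.007276  # daltons


def read_peak_list(xml_text):
    """Turn the annotated peak list into plain Python numbers."""
    root = ET.fromstring(xml_text)
    parameter = root.find("parameter")
    parent = root.find("parent_ion")
    return {
        "instrument": parameter.get("instrument"),
        "mode": parameter.get("mode"),
        "parent": {
            "mz": float(parent.get("m-z")),
            "abundance": float(parent.get("abundance")),
            "z": int(parent.get("z")),
        },
        "fragments": [
            (float(ion.get("m-z")), float(ion.get("abundance")))
            for ion in root.findall("fragment_ion")
        ],
    }


def neutral_mass(mz, z):
    """Neutral mass of a positively charged ion (assumed formula)."""
    return (mz - PROTON_MASS) * z


peaks = read_peak_list(PEAK_LIST_XML)
base_peak = max(peaks["fragments"], key=lambda pair: pair[1])
print(base_peak, round(neutral_mass(peaks["parent"]["mz"], peaks["parent"]["z"]), 4))
```

The same values appear in the raw peak list only as unlabeled columns; the annotation is what lets the code ask for `parent_ion` and `fragment_ion` by name.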

N-Glycosylation Process (NGP): 

N-Glycosylation Process (NGP)

Slide38: 

Semantic Annotation Applications Semantic Web Process to incorporate provenance

Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene : 

Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene Collaboration with Dr. Olivier Bodenreider (US National Library of Medicine, NIH, Bethesda, MD)

Biomedical Knowledge Repository: 

Biomedical Knowledge Repository Entrez Biomedical Knowledge Repository ….

Implementation: 

Implementation [Diagram labels: Entrez Gene; Entrez Gene XML; XSLT; Entrez Gene RDF; Entrez Gene RDF graph]

Web interface: 

Web interface [Diagram labels: Entrez Gene; Entrez Gene XML; XSLT; Entrez Gene RDF; Entrez Gene RDF graph] ….

XML: 

XML

Slide46: 

RDF Graph Example triple: subject = APP (geneid-351); predicate = eg:has_protein_reference_name_E; object = Alzheimer’s Disease

Slide47: 

RDF Graph Entrez Gene RDF graph (W3C Validator Site - http://www.w3.org/RDF/Validator/)

RDF: 

RDF

Connecting different genes: 

Connecting different genes Human APP gene is implicated in Alzheimer's disease. Which genes are functionally homologous to this gene? Genes shown: APP gene [Homo sapiens], APP gene [Gallus gallus], APP gene [Canis familiaris]. Protein reference names (eg:has_protein_reference_name_E) shown in the graph: protease nexin-II; amyloid beta A4 protein (appears three times); amyloid-beta protein; A4 amyloid protein; beta-amyloid peptide; amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease); cerebral vascular amyloid peptide; amyloid protein.

Inference: 

Inference Rules are objects that allow inference from RDF data [1]. Oracle 10g allows the creation of a rulebase based on RDFS (RDF Schema). Graph shown: eg:Gene-track_geneid/351, eg:has_protein_reference_name_E, 'amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)'; eg:Gene-track_geneid/351, eg:is_associated_with, eg:Neurodegenerative Diseases.
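
The idea of a rule that derives a new triple from existing RDF data can be illustrated with a very small forward-chaining loop in plain Python. This is not Oracle's rulebase API. The single rule below (a protein reference name mentioning 'Alzheimer disease' implies an association with eg:Neurodegenerative Diseases) is a hypothetical rule written for this example; only the identifiers in the triples are taken from the slide.

```python
GENE_351 = "eg:Gene-track_geneid/351"
HAS_NAME = "eg:has_protein_reference_name_E"
ASSOCIATED_WITH = "eg:is_associated_with"
NEURODEGENERATIVE = "eg:Neurodegenerative Diseases"

# Asserted data: one (subject, predicate, object) triple from the slide.
TRIPLES = {
    (GENE_351, HAS_NAME,
     "amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)"),
}


def alzheimer_rule(triples):
    """HYPOTHETICAL rule: a name mentioning Alzheimer disease implies
    an association with the neurodegenerative disease class."""
    return {
        (subject, ASSOCIATED_WITH, NEURODEGENERATIVE)
        for (subject, predicate, obj) in triples
        if predicate == HAS_NAME and "Alzheimer disease" in obj
    }


def forward_chain(triples, rules):
    """Apply every rule repeatedly until no new triple is derived."""
    closure = set(triples)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(closure) - closure
        if not derived:
            return closure
        closure |= derived


for triple in sorted(forward_chain(TRIPLES, [alzheimer_rule])):
    print(triple)
```

The derived eg:is_associated_with triple is never asserted by hand; it exists only because the rule fired on the asserted name, which is the behaviour a rulebase adds on top of stored RDF.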

Slide53: 

Integrated Semantic Information and Knowledge System (ISIS). PROTEOMICS WORKFLOW steps: Raw2mzXML, mzXML2Pkl, Pkl2pSplit, MASCOT Search, ProVault. EXPERIMENTAL DATA: Raw, mzXML, Pkl, pSplit, MASCOT result, ProVault result. Other components shown: ProPreO ontology; Experimental Data Semantic Annotation; Metadata File; Semantic Metadata Registry; PROTEOMECOMMONS; SPARQL query-based User Interface. Example questions: 'Have I made an error?' and 'Is the result erroneous?' Request in both cases: give me all result files from a similar organism, cell, preparation and mass spectrometric conditions, and compare results.

Summary, Observations, Conclusions: 

Summary, Observations, Conclusions We now have semantics and services enabled approaches that support semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …

Slide55: 

http://lsdis.cs.uga.edu http://knoesis.org http://lsdis.cs.uga.edu/projects/asdoc/ http://lsdis.cs.uga.edu/projects/glycomics/