logging in or signing up Andreas SMIREP Willi Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 75 Category: Entertainment License: All Rights Reserved Like it (1) Dislike it (0) Added: December 23, 2007 This Presentation is Public Favorites: 1 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Predictive Graph Mining using SMIREP: Predictive Graph Mining using SMIREP Andreas Karwath and Luc De Raedt Institut für Informatik Albert-Ludwigs Universität Freiburg GermanyOverview: Overview Introduction SMIREP Fragment decomposition Growing and Pruning Experiments on DTP’s HIV data set Perspectives and related workSMIREP: SMIREP What ? Building predictive models (rules) of combined substructures inspired by IREP(*) Basic idea Learning graph structures using linear strings (SMILES) Decompose SMILES string to fragments Use these fragments to grow rulesMolecules and Fragments: Molecules and Fragments 2D-structure essentially Graphs Fragments substructres Sequence of atoms and bonds Linear Fragments ‚O‘, ‚C‘, ‚Cl‘, ‘S‘,... denote elements ‚-‘ ... single bond ‚=‚ ... double bond ‚#‘ ... triple bond ‚:‘ ... aromatic bond (hydrogens implicit) ‚.‘: ionic bonds (disconnections) ‚O-c:c:c:c-Cl‘ Smiles encoding: Smiles encoding Smiles Compact encoding of molecular structure Used by computational chemists Supported by many tools (e.g. OpenBabel, Daylight) SMILES language: Atoms (‚C‘, ‚O‘, ‚S‘, ...) Bonds (‚-‘, ‚=‘, ‚#‘, ‚:‘, ‚.‘, ) Branches (‚(‘ and ‚)‘) Cycles (numbers)Smiles encoding: Smiles encoding 1 2SMILES and SMARTS: SMILES and SMARTS SMILES representing molecules in 2D SMARTS is a matching language for SMILES, representing fragments. SMARTS is derived from SMILES to represent (sub-) fragments. Almost all SMILES strings are valid SMARTS search strings. SMARTS allows the use of wildcards. an arbitrary bond is denoted by a '~ ‘ an arbitrary atom by an '*'. The language allow to be more restrictive or less restrictive than SMILES Here: Subset of SMART : SMILES + atom wildcard Instead of ‚=O‘ use ‚*=O‘ Pattern Domain I:Molecular Fragments: Pattern Domain I: Molecular Fragments A fragment is a sequence of linearly connected atoms (e.g., ‚‘-C:C:C:C:Cl‘) ‚O‘, ‚C‘, ‚Cl‘, ‚N‘, ‘S‘,... denote atoms ‚-‘ ... single bond ‚=‚ ... double bond ‚#‘ ... triple bond ‚:‘ ... aromatic bond ‚.‘ ... ionic bond (disconnection)Pattern Domain II:Molecular Substructures: Pattern Domain II: Molecular Substructures A substructure consists of atoms, bonds, branches, and ring structures branch ... denoted by ‚(‘ and ‚)‘, e.g. ccc(=O)ccc ring structure ... denoted y cyclic link numbers. e.g. c1ccccc1SMILES for other graphs: SMILES for other graphs SMILES syntax can be applied to other undirected graphs. vertices ~ atoms edges ~ bonds. using cyclic link numbers to denote cyclic structures Example: ‘C-12345C-6C-1C-2C-3C-4C-56’ or 'C12345C6C1C2C3C4C56‘for short: IREP: IREP Incremental Reduced Error Prunning (Fürnkranz 1995, Cohen 1996 (IREP*)) Repeat: Split dataset of Pos and Neg into GrowPos, GrowNeg, PrunePos, and PruneNeg Grow rule for GrowPos and GrowNeg Prune rule using PrunePos and PruneNeg if certain condition fullfilled add rule to rule set retract examples from Pos and Neg covered by rule else stop and return current rule setSMIREP: SMIREP SMILES-IREP Repeat: Split dataset of Pos and Neg into GrowPos, GrowNeg, PrunePos, and PruneNeg Select k random seeds sk from GrowPos Grow rules for each sk using GrowPos and GrowNeg Prune rules using PrunePos and PruneNeg if certain condition fullfilled add rule to rule set retract examples from Pos and Neg covered by rule else stop and return current rule setFragment Decomposition I: Fragment Decomposition I Bring SMILES string into internal representation renumber cyclic link number to unique cyclic link numbers introduce ‚{‘ and ‚}‘ change order of atom an number for the beginning of a cycle Example: c1c2c(ccc1)nc1c(cccc1c2Nc1ccc(cc1)S(=O)) becomes {0c{1cc(ccc}0)n{2cc(cccc}2c}1N{3cccc(cc}3)S(=O))Fragment decomposition II: Fragment decomposition II Given SMILES string Sint (in the internal representation) find recrsivly all ground fragments n build a fragment tree: Decompose Construct root r = node(Sint, r) find all cyclic fragments Sc match {NT}N on Sint , with N being some number an T being some substring find all linear fragments Sl match A(B)C on Sint , with A, B and C being substrings for all s {Sc Sl} do add new node n = node(s, l) with l being a label Decomposition Example: Decomposition Example {0c{1cc(ccc}0)ncc(cc}1NccccS(=O)) {0c{1cc(ccc}0) {1cc(ccc}0)ncc(cc}1 {0c{1cc ncc(cc}1NccccS(=O)) ccc}0 {0c{1cc ccc}0 ncc(cc}1 ccc}0 {1cc cc}1NccccS(=O) ncc =O cc}1NccccSGrowing: Growing Grow rules for a given seed example construct fragment tree using the seed‘s internal SMILES string to cater for ionic bonds (disconnections) grow more than one tree if necessary repeat: for all ground fragments refine rule and score select k best until no improvement return best ruleRefinement operator: Refinement operator Lengthening: add ground fragment as conjunction to a existing rule Ascending: Given the labels, the following patterns are constructed a: Construct patterns A(B) and AC, with label ab and ac respectively. b: Construct new pattern A(B) with label ba c: Construct new pattern AC with label ac i, where i is an integer: Construct new pattern, where the pattern is the parents pattern. ab, ac, ba, or ca: Construct pattern A(B)C, where A(B)C is the parent of A, B, or C. r, no construction of a new pattern, the fragment was a root node. Pruning: Pruning Prune rules given the best rules from growing in reverse to original refinement maximise score: p-n/p+n Example: r0 : '*=N' 'OCn1c(nc(cc1)NC)CC ; [c], [a,ab,b] r-1 : '*=N' 'OCn1c(nc(cc1)NC) ; [c], [a,ab] r-2 : '*=N' 'OCn1c ; [c], [a] r-3 : '*=N' ; [c] The HIV Data Set: The HIV Data Set Developmental Therapeutics Program’s AIDS Antiviral Screen Database One of the largest public domain databases of this type Measures protection of human CEM cells from HIV-1 infection using a soluble formazan assay 41768 compounds 40282 Confirmed Inactive (CI) 1069 Confirmed Moderately Active (CM) 417 Confirmed Active (CA)Experimental Setup: Experimental Setup Find predictive rules to differentiate active from moderate (CA/CM) or active from inactive compounds (CA/CI) Implemented in Python using OpenBabel for matching 10-old cross validation lasting just above 10 minutes for each fold on CA/CM and just above 2 hours on the CA/CI experiment Results: Results CA/CM: precTrain: 0.823 recTrain: 0.483 precTest: 0.735 recTest: 0.481 predAcc: 0.797 nb Rules: 19.4 CA/CM: precTrain: 0.823 recTrain: 0.278 precTest: 0.706 recTest: 0.250 nb Rules: 10.5 Example rules: Example rules CA/CM rule coveres 18 active and 0 inactive on training and covering 3 active and 0 inactive on the test set. CA/CI: rule coveres 18 active and 0 inactive on training and covering 3 active and 0 inactive on the test set. Example rules: Example rules 1.) 2.)Discussion: Discussion SMIREP is a predictive graph mining system based on linear string representations using a rule learner is reasonably effective has easy interpretable predictive models is aplicable to other graph mining problems Further Work: Further Work full RIPPER add background knowledge (calculated values like logP, mol.weights, accessabilities,...) apply to more datasets (PTE, ...) use different machine learning algorithms apply to different graph mining problems (mRNA,..)Conclusion: Conclusion ILP Use of data mining Dehaspe, Toivonen and King, KDD 98 Inokuchi, Washio and Motoda, PKDD 00 Muggleton et al. and other inductive logic programming approaches MolFea : simpler fragments but much better scaling; use of Smiles and Smart Klopman’s Case and Multi Case systems Derive fragments during predictive modeling MolFea : employ traditional data mining principles; declarative Thanks : Thanks You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
Andreas SMIREP Willi Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 75 Category: Entertainment License: All Rights Reserved Like it (1) Dislike it (0) Added: December 23, 2007 This Presentation is Public Favorites: 1 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Predictive Graph Mining using SMIREP: Predictive Graph Mining using SMIREP Andreas Karwath and Luc De Raedt Institut für Informatik Albert-Ludwigs Universität Freiburg GermanyOverview: Overview Introduction SMIREP Fragment decomposition Growing and Pruning Experiments on DTP’s HIV data set Perspectives and related workSMIREP: SMIREP What ? Building predictive models (rules) of combined substructures inspired by IREP(*) Basic idea Learning graph structures using linear strings (SMILES) Decompose SMILES string to fragments Use these fragments to grow rulesMolecules and Fragments: Molecules and Fragments 2D-structure essentially Graphs Fragments substructres Sequence of atoms and bonds Linear Fragments ‚O‘, ‚C‘, ‚Cl‘, ‘S‘,... denote elements ‚-‘ ... single bond ‚=‚ ... double bond ‚#‘ ... triple bond ‚:‘ ... aromatic bond (hydrogens implicit) ‚.‘: ionic bonds (disconnections) ‚O-c:c:c:c-Cl‘ Smiles encoding: Smiles encoding Smiles Compact encoding of molecular structure Used by computational chemists Supported by many tools (e.g. OpenBabel, Daylight) SMILES language: Atoms (‚C‘, ‚O‘, ‚S‘, ...) Bonds (‚-‘, ‚=‘, ‚#‘, ‚:‘, ‚.‘, ) Branches (‚(‘ and ‚)‘) Cycles (numbers)Smiles encoding: Smiles encoding 1 2SMILES and SMARTS: SMILES and SMARTS SMILES representing molecules in 2D SMARTS is a matching language for SMILES, representing fragments. SMARTS is derived from SMILES to represent (sub-) fragments. Almost all SMILES strings are valid SMARTS search strings. SMARTS allows the use of wildcards. an arbitrary bond is denoted by a '~ ‘ an arbitrary atom by an '*'. The language allow to be more restrictive or less restrictive than SMILES Here: Subset of SMART : SMILES + atom wildcard Instead of ‚=O‘ use ‚*=O‘ Pattern Domain I:Molecular Fragments: Pattern Domain I: Molecular Fragments A fragment is a sequence of linearly connected atoms (e.g., ‚‘-C:C:C:C:Cl‘) ‚O‘, ‚C‘, ‚Cl‘, ‚N‘, ‘S‘,... denote atoms ‚-‘ ... single bond ‚=‚ ... double bond ‚#‘ ... triple bond ‚:‘ ... aromatic bond ‚.‘ ... ionic bond (disconnection)Pattern Domain II:Molecular Substructures: Pattern Domain II: Molecular Substructures A substructure consists of atoms, bonds, branches, and ring structures branch ... denoted by ‚(‘ and ‚)‘, e.g. ccc(=O)ccc ring structure ... denoted y cyclic link numbers. e.g. c1ccccc1SMILES for other graphs: SMILES for other graphs SMILES syntax can be applied to other undirected graphs. vertices ~ atoms edges ~ bonds. using cyclic link numbers to denote cyclic structures Example: ‘C-12345C-6C-1C-2C-3C-4C-56’ or 'C12345C6C1C2C3C4C56‘for short: IREP: IREP Incremental Reduced Error Prunning (Fürnkranz 1995, Cohen 1996 (IREP*)) Repeat: Split dataset of Pos and Neg into GrowPos, GrowNeg, PrunePos, and PruneNeg Grow rule for GrowPos and GrowNeg Prune rule using PrunePos and PruneNeg if certain condition fullfilled add rule to rule set retract examples from Pos and Neg covered by rule else stop and return current rule setSMIREP: SMIREP SMILES-IREP Repeat: Split dataset of Pos and Neg into GrowPos, GrowNeg, PrunePos, and PruneNeg Select k random seeds sk from GrowPos Grow rules for each sk using GrowPos and GrowNeg Prune rules using PrunePos and PruneNeg if certain condition fullfilled add rule to rule set retract examples from Pos and Neg covered by rule else stop and return current rule setFragment Decomposition I: Fragment Decomposition I Bring SMILES string into internal representation renumber cyclic link number to unique cyclic link numbers introduce ‚{‘ and ‚}‘ change order of atom an number for the beginning of a cycle Example: c1c2c(ccc1)nc1c(cccc1c2Nc1ccc(cc1)S(=O)) becomes {0c{1cc(ccc}0)n{2cc(cccc}2c}1N{3cccc(cc}3)S(=O))Fragment decomposition II: Fragment decomposition II Given SMILES string Sint (in the internal representation) find recrsivly all ground fragments n build a fragment tree: Decompose Construct root r = node(Sint, r) find all cyclic fragments Sc match {NT}N on Sint , with N being some number an T being some substring find all linear fragments Sl match A(B)C on Sint , with A, B and C being substrings for all s {Sc Sl} do add new node n = node(s, l) with l being a label Decomposition Example: Decomposition Example {0c{1cc(ccc}0)ncc(cc}1NccccS(=O)) {0c{1cc(ccc}0) {1cc(ccc}0)ncc(cc}1 {0c{1cc ncc(cc}1NccccS(=O)) ccc}0 {0c{1cc ccc}0 ncc(cc}1 ccc}0 {1cc cc}1NccccS(=O) ncc =O cc}1NccccSGrowing: Growing Grow rules for a given seed example construct fragment tree using the seed‘s internal SMILES string to cater for ionic bonds (disconnections) grow more than one tree if necessary repeat: for all ground fragments refine rule and score select k best until no improvement return best ruleRefinement operator: Refinement operator Lengthening: add ground fragment as conjunction to a existing rule Ascending: Given the labels, the following patterns are constructed a: Construct patterns A(B) and AC, with label ab and ac respectively. b: Construct new pattern A(B) with label ba c: Construct new pattern AC with label ac i, where i is an integer: Construct new pattern, where the pattern is the parents pattern. ab, ac, ba, or ca: Construct pattern A(B)C, where A(B)C is the parent of A, B, or C. r, no construction of a new pattern, the fragment was a root node. Pruning: Pruning Prune rules given the best rules from growing in reverse to original refinement maximise score: p-n/p+n Example: r0 : '*=N' 'OCn1c(nc(cc1)NC)CC ; [c], [a,ab,b] r-1 : '*=N' 'OCn1c(nc(cc1)NC) ; [c], [a,ab] r-2 : '*=N' 'OCn1c ; [c], [a] r-3 : '*=N' ; [c] The HIV Data Set: The HIV Data Set Developmental Therapeutics Program’s AIDS Antiviral Screen Database One of the largest public domain databases of this type Measures protection of human CEM cells from HIV-1 infection using a soluble formazan assay 41768 compounds 40282 Confirmed Inactive (CI) 1069 Confirmed Moderately Active (CM) 417 Confirmed Active (CA)Experimental Setup: Experimental Setup Find predictive rules to differentiate active from moderate (CA/CM) or active from inactive compounds (CA/CI) Implemented in Python using OpenBabel for matching 10-old cross validation lasting just above 10 minutes for each fold on CA/CM and just above 2 hours on the CA/CI experiment Results: Results CA/CM: precTrain: 0.823 recTrain: 0.483 precTest: 0.735 recTest: 0.481 predAcc: 0.797 nb Rules: 19.4 CA/CM: precTrain: 0.823 recTrain: 0.278 precTest: 0.706 recTest: 0.250 nb Rules: 10.5 Example rules: Example rules CA/CM rule coveres 18 active and 0 inactive on training and covering 3 active and 0 inactive on the test set. CA/CI: rule coveres 18 active and 0 inactive on training and covering 3 active and 0 inactive on the test set. Example rules: Example rules 1.) 2.)Discussion: Discussion SMIREP is a predictive graph mining system based on linear string representations using a rule learner is reasonably effective has easy interpretable predictive models is aplicable to other graph mining problems Further Work: Further Work full RIPPER add background knowledge (calculated values like logP, mol.weights, accessabilities,...) apply to more datasets (PTE, ...) use different machine learning algorithms apply to different graph mining problems (mRNA,..)Conclusion: Conclusion ILP Use of data mining Dehaspe, Toivonen and King, KDD 98 Inokuchi, Washio and Motoda, PKDD 00 Muggleton et al. and other inductive logic programming approaches MolFea : simpler fragments but much better scaling; use of Smiles and Smart Klopman’s Case and Multi Case systems Derive fragments during predictive modeling MolFea : employ traditional data mining principles; declarative Thanks : Thanks