Andreas SMIREP

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Predictive Graph Mining using SMIREP: 

Predictive Graph Mining using SMIREP Andreas Karwath and Luc De Raedt Institut für Informatik Albert-Ludwigs Universität Freiburg Germany

Overview: 

Overview Introduction SMIREP Fragment decomposition Growing and Pruning Experiments on DTP’s HIV data set Perspectives and related work

SMIREP: 

SMIREP What ? Building predictive models (rules) of combined substructures inspired by IREP(*) Basic idea Learning graph structures using linear strings (SMILES) Decompose SMILES string to fragments Use these fragments to grow rules

Molecules and Fragments: 

Molecules and Fragments 2D-structure essentially Graphs Fragments substructres Sequence of atoms and bonds Linear Fragments ‚O‘, ‚C‘, ‚Cl‘, ‘S‘,... denote elements ‚-‘ ... single bond ‚=‚ ... double bond ‚#‘ ... triple bond ‚:‘ ... aromatic bond (hydrogens implicit) ‚.‘: ionic bonds (disconnections) ‚O-c:c:c:c-Cl‘

Smiles encoding: 

Smiles encoding Smiles Compact encoding of molecular structure Used by computational chemists Supported by many tools (e.g. OpenBabel, Daylight) SMILES language: Atoms (‚C‘, ‚O‘, ‚S‘, ...) Bonds (‚-‘, ‚=‘, ‚#‘, ‚:‘, ‚.‘, ) Branches (‚(‘ and ‚)‘) Cycles (numbers)

Smiles encoding: 

Smiles encoding 1 2

SMILES and SMARTS: 

SMILES and SMARTS SMILES representing molecules in 2D SMARTS is a matching language for SMILES, representing fragments. SMARTS is derived from SMILES to represent (sub-) fragments. Almost all SMILES strings are valid SMARTS search strings. SMARTS allows the use of wildcards. an arbitrary bond is denoted by a '~ ‘ an arbitrary atom by an '*'. The language allow to be more restrictive or less restrictive than SMILES Here: Subset of SMART : SMILES + atom wildcard Instead of ‚=O‘ use ‚*=O‘

Pattern Domain I: Molecular Fragments: 

Pattern Domain I: Molecular Fragments A fragment is a sequence of linearly connected atoms (e.g., ‚‘-C:C:C:C:Cl‘) ‚O‘, ‚C‘, ‚Cl‘, ‚N‘, ‘S‘,... denote atoms ‚-‘ ... single bond ‚=‚ ... double bond ‚#‘ ... triple bond ‚:‘ ... aromatic bond ‚.‘ ... ionic bond (disconnection)

Pattern Domain II: Molecular Substructures: 

Pattern Domain II: Molecular Substructures A substructure consists of atoms, bonds, branches, and ring structures branch ... denoted by ‚(‘ and ‚)‘, e.g. ccc(=O)ccc ring structure ... denoted y cyclic link numbers. e.g. c1ccccc1

SMILES for other graphs: 

SMILES for other graphs SMILES syntax can be applied to other undirected graphs. vertices ~ atoms edges ~ bonds. using cyclic link numbers to denote cyclic structures Example: ‘C-12345C-6C-1C-2C-3C-4C-56’ or 'C12345C6C1C2C3C4C56‘for short:

IREP: 

IREP Incremental Reduced Error Prunning (Fürnkranz 1995, Cohen 1996 (IREP*)) Repeat: Split dataset of Pos and Neg into GrowPos, GrowNeg, PrunePos, and PruneNeg Grow rule for GrowPos and GrowNeg Prune rule using PrunePos and PruneNeg if certain condition fullfilled add rule to rule set retract examples from Pos and Neg covered by rule else stop and return current rule set

SMIREP: 

SMIREP SMILES-IREP Repeat: Split dataset of Pos and Neg into GrowPos, GrowNeg, PrunePos, and PruneNeg Select k random seeds sk from GrowPos Grow rules for each sk using GrowPos and GrowNeg Prune rules using PrunePos and PruneNeg if certain condition fullfilled add rule to rule set retract examples from Pos and Neg covered by rule else stop and return current rule set

Fragment Decomposition I: 

Fragment Decomposition I Bring SMILES string into internal representation renumber cyclic link number to unique cyclic link numbers introduce ‚{‘ and ‚}‘ change order of atom an number for the beginning of a cycle Example: c1c2c(ccc1)nc1c(cccc1c2Nc1ccc(cc1)S(=O)) becomes {0c{1cc(ccc}0)n{2cc(cccc}2c}1N{3cccc(cc}3)S(=O))

Fragment decomposition II: 

Fragment decomposition II Given SMILES string Sint (in the internal representation) find recrsivly all ground fragments n build a fragment tree: Decompose Construct root r = node(Sint, r) find all cyclic fragments Sc match {NT}N on Sint , with N being some number an T being some substring find all linear fragments Sl match A(B)C on Sint , with A, B and C being substrings for all s  {Sc  Sl} do add new node n = node(s, l) with l being a label

Decomposition Example: 

Decomposition Example {0c{1cc(ccc}0)ncc(cc}1NccccS(=O)) {0c{1cc(ccc}0) {1cc(ccc}0)ncc(cc}1 {0c{1cc ncc(cc}1NccccS(=O)) ccc}0 {0c{1cc ccc}0 ncc(cc}1 ccc}0 {1cc cc}1NccccS(=O) ncc =O cc}1NccccS

Growing: 

Growing Grow rules for a given seed example construct fragment tree using the seed‘s internal SMILES string to cater for ionic bonds (disconnections) grow more than one tree if necessary repeat: for all ground fragments refine rule and score select k best until no improvement return best rule

Refinement operator: 

Refinement operator Lengthening: add ground fragment as conjunction to a existing rule Ascending: Given the labels, the following patterns are constructed a: Construct patterns A(B) and AC, with label ab and ac respectively. b: Construct new pattern A(B) with label ba c: Construct new pattern AC with label ac i, where i is an integer: Construct new pattern, where the pattern is the parents pattern. ab, ac, ba, or ca: Construct pattern A(B)C, where A(B)C is the parent of A, B, or C. r, no construction of a new pattern, the fragment was a root node.

Pruning: 

Pruning Prune rules given the best rules from growing in reverse to original refinement maximise score: p-n/p+n Example: r0 : '*=N'  'OCn1c(nc(cc1)NC)CC ; [c], [a,ab,b] r-1 : '*=N'  'OCn1c(nc(cc1)NC) ; [c], [a,ab] r-2 : '*=N'  'OCn1c ; [c], [a] r-3 : '*=N' ; [c]

The HIV Data Set: 

The HIV Data Set Developmental Therapeutics Program’s AIDS Antiviral Screen Database One of the largest public domain databases of this type Measures protection of human CEM cells from HIV-1 infection using a soluble formazan assay 41768 compounds 40282 Confirmed Inactive (CI) 1069 Confirmed Moderately Active (CM) 417 Confirmed Active (CA)

Experimental Setup: 

Experimental Setup Find predictive rules to differentiate active from moderate (CA/CM) or active from inactive compounds (CA/CI) Implemented in Python using OpenBabel for matching 10-old cross validation lasting just above 10 minutes for each fold on CA/CM and just above 2 hours on the CA/CI experiment

Results: 

Results CA/CM: precTrain: 0.823 recTrain: 0.483 precTest: 0.735 recTest: 0.481 predAcc: 0.797 nb Rules: 19.4 CA/CM: precTrain: 0.823 recTrain: 0.278 precTest: 0.706 recTest: 0.250 nb Rules: 10.5

Example rules: 

Example rules CA/CM rule coveres 18 active and 0 inactive on training and covering 3 active and 0 inactive on the test set. CA/CI: rule coveres 18 active and 0 inactive on training and covering 3 active and 0 inactive on the test set.

Example rules: 

Example rules 1.) 2.)

Discussion: 

Discussion SMIREP is a predictive graph mining system based on linear string representations using a rule learner is reasonably effective has easy interpretable predictive models is aplicable to other graph mining problems

Further Work: 

Further Work full RIPPER add background knowledge (calculated values like logP, mol.weights, accessabilities,...) apply to more datasets (PTE, ...) use different machine learning algorithms apply to different graph mining problems (mRNA,..)

Conclusion: 

Conclusion ILP Use of data mining Dehaspe, Toivonen and King, KDD 98 Inokuchi, Washio and Motoda, PKDD 00 Muggleton et al. and other inductive logic programming approaches MolFea : simpler fragments but much better scaling; use of Smiles and Smart Klopman’s Case and Multi Case systems Derive fragments during predictive modeling MolFea : employ traditional data mining principles; declarative

Thanks : 

Thanks