icml kdd2003

Uploaded from authorPOINTLite
Views:
 
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Statistical Learning from Relational Data: 

Statistical Learning from Relational Data Daphne Koller Stanford University Joint work with many many people

Relational Data is Everywhere: 

Relational Data is Everywhere The web Webpages (& the entities they represent), hyperlinks Social networks People, institutions, friendship links Biological data Genes, proteins, interactions, regulation Bibliometrics Papers, authors, journals, citations Corporate databases Customers, products, transactions

Relational Data is Different: 

Relational Data is Different Data instances not independent Topics of linked webpages are correlated Data instances are not identically distributed: Heterogeneous instances (papers, authors) No IID assumption  This is a good thing 

New Learning Tasks: 

New Learning Tasks Collective classification of related instances Labeling an entire website of related webpages Relational clustering Finding coherent clusters in the genome Link prediction & classification Predicting when two people are likely to be friends Pattern detection in network of related objects Finding groups (research groups, terrorist groups)

Probabilistic Models : 

Probabilistic Models Uncertainty model: space of “possible worlds”; probability distribution over this space. Worlds: often defined via a set of state variables medical diagnosis: diseases, symptoms, findings, … each world: an assignment of values to variables Number of worlds is exponential in # of vars 2n if we have n binary variables

Outline: 

Outline Relational Bayesian networks* Relational Markov networks Collective Classification Relational clustering * with Avi Pfeffer, Nir Friedman, Lise Getoor

Bayesian Networks: 

Bayesian Networks nodes = variables edges = direct influence Graph structure encodes independence assumptions: Letter conditionally independent of Intelligence given Grade Job Grade SAT Intelligence Difficulty

BN semantics: 

BN semantics Compact & natural representation: nodes have  k parents  2kn vs. 2n params parameters natural and easy to elicit conditional independencies in BN structure + local probability models full joint distribution over domain =

Reasoning using BNs: 

Full joint distribution specifies answer to any query: P(variable | evidence about others) Reasoning using BNs Letter Grade SAT Intelligence Difficulty Letter SAT

Bayesian Networks: Problem: 

Bayesian Networks: Problem Bayesian nets use propositional representation Real world has objects, related to each other Intelligence Difficulty Grade A C These “instances” are not independent

Relational Schema: 

Relational Schema Specifies types of objects in domain, attributes of each type of object & types of relations between objects Teach Student Intelligence Course Difficulty Professor Teaching-Ability In Take Classes Relations Attributes

St. Nordaf University: 

St. Nordaf University Teaches Teaches In-course In-course Registered In-course Prof. Smith Prof. Jones George Jane Welcome to CS101 Registered Registered Grade Grade Grade Satisfac Satisfac Satisfac World 

Relational Bayesian Networks: 

Relational Bayesian Networks Universals: Probabilistic patterns hold for all objects in class Locality: Represent direct probabilistic dependencies Links define potential interactions [K. & Pfeffer; Poole; Ngo & Haddawy]

RBN Semantics: 

Prof. Smith Prof. Jones Welcome to CS101 RBN Semantics George Jane Ground model: variables: attributes of all objects dependencies: determined by relational links & template model Welcome to CS101

The Web of Influence: 

low / high The Web of Influence easy / hard

Likelihood Function: 

Likelihood Function Likelihood of a BN with shared parameters Joint likelihood is a product of likelihood terms One for each attribute X.A and its family For each X.A, the likelihood function aggregates counts from all occurrences x.A in world  [Friedman, Getoor, K., Pfeffer, 1999]

Likelihood Function: Multinomials: 

Likelihood Function: Multinomials Sufficient statistics: Log-likelihood:

RBN Parameter Estimation: 

RBN Parameter Estimation MLE parameters: Bayesian estimation: Prior for each attribute X.A Posterior uses aggregated sufficient statistics aggregated sufficient statistics

Learning RBN Structure: 

Learning RBN Structure Define set of legal RBN structures Ones with legal class dependency graphs Define scoring function  Bayesian score Product of family scores: One for each X.A Uses aggregated sufficient statistics Search for high-scoring legal structure [Friedman, Getoor, K., Pfeffer, 1999]

Learning RBN Structure: 

Learning RBN Structure All operations done at class level Dependency structure = parents for X.A Acyclicity checked using class dependency graph Score computed at class level Individual objects only contribute to sufficient statistics Can be obtained efficiently using standard DB queries

Outline: 

Outline Relational Bayesian networks* Relational Markov networks† Collective Classification Relational clustering * with Avi Pfeffer, Nir Friedman, Lise Getoor † with Ben Taskar, Pieter Abbeel

Why Undirected Models?: 

Why Undirected Models? Symmetric, non-causal interactions E.g., web: categories of linked pages are correlated Cannot introduce direct edges because of cycles Patterns involving multiple entities E.g., web: “triangle” patterns Directed edges not appropriate “Solution”: Impose arbitrary direction Not clear how to parameterize CPD for variables involved in multiple interactions Very difficult within a class-based parameterization [Taskar, Abbeel, K. 2001]

Markov Networks: 

Markov Networks A Markov network is an undirected graph over some set of variables V Graph associated with a set of potentials i Each potential is factor over subset Vi Variables in Vi must be a (sub)clique in network

Markov Networks: 

Markov Networks Laura Noah Mary James Kyle

Relational Markov Networks: 

Relational Markov Networks Universals: Probabilistic patterns hold for all groups of objects Locality: Represent local probabilistic dependencies Sets of links give us possible interactions

RMN Semantics: 

Welcome to CS101 RMN Semantics Grade Grade Intelligence Intelligence George Jane Jill Intelligence Geo Study Group CS Study Group Grade Grade

Outline: 

Outline Relational Bayesian Networks Relational Markov Networks Collective Classification* Discriminative training Web page classification Link prediction Relational clustering * with Ben Taskar, Carlos Guestrin, Ming Fai Wong, Pieter Abbeel

Collective Classification: 

Model Structure Probabilistic Relational Model Training Data New Data Learning Inference Conclusions Collective Classification Train on one year of student intelligence, course difficulty, and grades Given only grades in following year, predict all students’ intelligence Example: Features: .x Labels: .y* Features: ’.x Labels: ’.y

Learning RMN Parameters: 

Learning RMN Parameters Template potential  Parameterize potentials as log-linear model

Max Likelihood Estimation: 

Max Likelihood Estimation maximizew Estimation Classification argmaxy .x .y* We don’t care about the joint distribution P(.x, .y)

Web  KB: 

Web  KB [Craven et al.]

Web Classification Experiments: 

Web Classification Experiments WebKB dataset Four CS department websites Bag of words on each page Links between pages Anchor text for links Experimental setup Trained on three universities Tested on fourth Repeated for all four combinations

Standard Classification: 

Professor department extract information computer science machine learning … Standard Classification Categories: faculty course project student other Page

Standard Classification: 

Standard Classification working with Tom Mitchell … Page test set error 4-fold CV: Trained on 3 universities Tested on 4th Discriminatively trained naïve Markov = Logistic Regression

Power of Context: 

Power of Context Professor? Student? Post-doc?

Collective Classification: 

Collective Classification

Collective Classification: 

Collective Classification Classify all pages collectively, maximizing the joint label probability test set error [Taskar, Abbeel, K., 2002]

More Complex Structure: 

More Complex Structure

More Complex Structure: 

More Complex Structure

Collective Classification: Results: 

Collective Classification: Results Logistic Links Section Link+Section [Taskar, Abbeel, K., 2002] test set error 35.4% error reduction over logistic

Max Conditional Likelihood: 

Max Conditional Likelihood maximizew Estimation Classification argmaxy .x .y* We don’t care about the conditional distribution P(.y |.x)

Max Margin Estimation: 

margin # labeling mistakes in y Max Margin Estimation [Taskar, Guestrin, K., 2003] (see also [Collins, 2002; Hoffman 2003]) Quadratic program  Exponentially many constraints  What we really want: correct class labels

Max Margin Markov Networks: 

Max Margin Markov Networks We use structure of Markov network to provide equivalent formulation of QP Exponential only in tree width of network Complexity = max-likelihood classification Can solve approximately in networks where induced width is too large Analogous to loopy belief propagation Can use kernel-based features! SVMs meet graphical models [Taskar, Guestrin, K., 2003]

WebKB Revisited: 

WebKB Revisited 16.1% relative reduction in error relative to cond. likelihood RMNs

Predicting Relationships: 

Predicting Relationships Even more interesting: relationships between objects Tom Mitchell Professor WebKB Project Sean Slattery Student

Predicting Relations: 

Predicting Relations Introduce exists/type attribute for each potential link Learn discriminative model for this attribute Collectively predict its value in new world Relation ... Page Word1 WordN From- ... Page Word1 WordN To- Exists/ Type ... LinkWord1 LinkWordN Category Category 72.9% error reduction over flat [Taskar, Wong, Abbeel, K., 2003]

Outline: 

Outline Relational Bayesian Networks Relational Markov Networks Collective Classification Relational clustering Movie data* Biological data† * with Ben Taskar, Eran Segal † with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe’er, Haidong Wang, Micha Shapira, David Botstein

Relational Clustering: 

Model Structure Probabilistic Relational Model Unlabeled Relational Data Learning Relational Clustering Given only students’ grades, cluster similar students Example: Clustering of instances

Learning w. Missing Data: EM: 

Learning w. Missing Data: EM EM Algorithm applies essentially unchanged E-step computes expected sufficient statistics, aggregated over all objects in class M-step uses ML (or MAP) parameter estimation Key difference: In general, the hidden variables are not independent Computation of expected sufficient statistics requires inference over entire network

Learning w. Missing Data: EM: 

Learning w. Missing Data: EM low / high easy / hard [Dempster et al. 77]

Movie Data: 

Movie Data Internet Movie Database http://www.imdb.com

Discovering Hidden Types: 

Discovering Hidden Types Type Type Type [Taskar, Segal, K., 2001] Learn model using EM

Discovering Hidden Types: 

Discovering Hidden Types [Taskar, Segal, K., 2001]

Biology 101: Gene Expression: 

Biology 101: Gene Expression DNA Swi5 Cells express different subsets of their genes in different tissues and under different conditions

Gene Expression Microarrays: 

Gene Expression Microarrays Measure mRNA level for all genes in one condition Hundreds of experiments Highly noisy Expression of gene i in experiment j Experiments Genes Induced Repressed

Standard Analysis: 

Standard Analysis Cluster genes by similarity of expression profiles Manually examine clusters to understand what’s common to genes in cluster

General Approach: 

General Approach Expression level is a function of gene properties and experiment properties Learn model that best explains the data Observed properties: gene sequence, array condition, … Hidden properties: gene cluster Assignment to hidden variables (e.g., module assignment) Expression level as function of properties

Clustering as a PRM: 

Level Gene Experiment Cluster Expression ID Clustering as a PRM

Modular Regulation: 

Modular Regulation Learn functional modules: Clusters of genes that are similarly controlled Learn control program for modules Expression as function of control genes

Module Network PRM: 

[Segal, Regev, Pe’er, Koller, Friedman, 2003] Level Gene Controlk Experiment Cluster Expression Control2 Control1 Module Network PRM Activity level of control gene in experiment

Experimental Results: 

Experimental Results Yeast Stress Data (Gasch et al.) 2355 genes that showed activity 173 experiments (microarrays): Diverse environmental stress conditions (e.g. heat shock) Learned module network with 50 modules: Cluster assignments are hidden variables Structure of dependency trees unknown Learned model using structural EM algorithm Segal et al., Nature Genetics, 2003

Biological Evaluation: 

Biological Evaluation Find sets of co-regulated genes (regulatory module) Find the regulators of each module [Segal et al., Nature Genetics, 2003] 46/50 30/50

Experimental Results: 

Experimental Results Hypothesis: Regulator ‘X’ regulates process ‘Y’ Experiment: Knock out ‘X’ and rerun the experiment X [Segal et al., Nature Genetics, 2003]

Differentially Expressed Genes: 

Differentially Expressed Genes [Segal et al., Nature Genetics, 2003]

Biological Experiments Validation: 

Were the differentially expressed genes predicted as targets? Rank modules by enrichment for diff. expressed genes Biological Experiments Validation [Segal et al., Nature Genetics, 2003]

Biology 102: Pathways: 

Biology 102: Pathways Pathways are sets of genes that act together to achieve a common function

Finding Pathways: Attempt I: 

Finding Pathways: Attempt I Use protein-protein interaction data

Finding Pathways: Attempt I: 

Finding Pathways: Attempt I Use protein-protein interaction data

Finding Pathways: Attempt I: 

Finding Pathways: Attempt I Use protein-protein interaction data Problems: Data is very noisy Structure is lost: Large connected component in interaction graph (3527/3589 genes)

Finding Pathways: Attempt II: 

Finding Pathways: Attempt II Use expression microarray clusters Pathway I Pathway II Problems: Expression is only ‘weak’ indicator of interaction Interacting pathways are not separable

Finding Pathways: Our Approach: 

Finding Pathways: Our Approach Use both types of data to find pathways Find “active” interactions using gene expression Find pathway-related co-expression using interactions Pathway I Pathway II Pathway III Pathway IV [Segal, Wang, K., 2003]

Probabilistic Model: 

Probabilistic Model ... Pathway Exp1 ExpN Gene Interacts [Segal, Wang, K., 2003] 1 Expression level in N arrays protein product interaction Compatibility potential Cluster all genes collectively, maximizing the joint model likelihood

Capturing Protein Complexes: 

Capturing Protein Complexes Independent data set of interacting proteins 0 50 100 150 200 250 300 350 400 0 10 20 30 40 50 60 70 80 90 100 Complex Coverage (%) Num Complexes Our method Standard expression clustering 124 complexes covered at 50% for our method 46 complexes covered at 50% for clustering [Segal, Wang, K., 2003]

RNAse Complex Pathway: 

YHR081W RRP40 RRP42 MTR3 RRP45 RRP4 RRP43 DIS3 TRM7 SKI6 RRP46 CSL4 RNAse Complex Pathway YHR081W SKI6 RRP42 RRP45 RRP46 RRP43 TRM7 RRP40 MTR3 RRP4 DIS3 CSL4 Includes all 10 known pathway genes Only 5 genes found by clustering [Segal, Wang, K., 2003]

Interaction Clustering: 

Interaction Clustering RNAse complex found by interaction clustering as part of cluster with 138 genes [Segal, Wang, K., 2003]

Truth in Advertising: 

Truth in Advertising Huge graphical models: 3000-50,000 hidden variables Hundreds of thousands of observed nodes Very densely connected Learning: Multiple iterations of model updates Each requires running inference on the model Inference: Exact inference is intractable Use belief propagation Single inference iteration: 1-6 hours Algorithmic ideas key to scaling

Relational Data: A New Challenge: 

Relational Data: A New Challenge Data consists of different types of instances Instances are related in complex networks Instances are not independent New tasks for machine learning Collective classification Relational clustering Link prediction Group detection Opportunity

Slide78: 

http://robotics.stanford.edu/~koller/