An Introduction to Causal Modeling and Discovery Using Graphical Models Greg CooperUniversity of Pittsburgh : An Introduction to Causal Modeling and Discovery Using Graphical Models Greg Cooper University of Pittsburgh
Overview : Overview Introduction
Representation
Inference
Learning
Evaluation
What Is Causality? : What Is Causality? Much consideration in philosophy
I will treat it as a primitive
Roughly, if we manipulate something and something else changes, then the former causally influences the latter.
Why is Causation Important? : Why is Causation Important? Causal issues arise in most fields including medicine, business, law, economics, and the sciences
An intelligent agent is continually considering what to do next in order to change the world (including the agent’s own mind). That is a causal question.
Representing Causation Using Causal Bayesian Networks : Representing Causation Using Causal Bayesian Networks A causal Bayesian network (CBN) represents some entity (e.g., a patient) that we want to model causally
Features of the entity are represented by variables/nodes in the CBN
Direct causation is represented by arcs
An Example of a Causal Bayesian Network Structure : An Example of a Causal Bayesian Network Structure History of Smoking (HS) Lung Cancer (LC) Chronic Bronchitis (CB) Fatigue (F) Weight Loss (WL)
An Example of the Accompanying Causal Bayesian Network Parameters : An Example of the Accompanying Causal Bayesian Network Parameters P(HS = no) = 0.80 P(HS = yes) = 0.20
P(CB = absent | HS = no) = 0.95 P(CB = present | HS = no) = 0.05
P(CB = absent | HS = yes) = 0.75 P(CB = present | HS = yes) = 0.25
P(LC = absent | HS = no) = 0.99995 P(LC = present | HS = no) = 0.00005
P(LC = absent | HS = yes) = 0.997 P(LC = present | HS = yes) = 0.003
Causal Markov Condition : Causal Markov Condition A node is independent of its non-effects given just its direct causes.
This is the key representational property of causal Bayesian networks.
Special case: A node is independent of its distant causes given just its direct causes.
General notion: Causality is local
Causal Modeling Framework : Causal Modeling Framework An underlying process generates entities that share the same causal network structure. The entities may have different parameters (probabilities).
Each entity independently samples the joint distribution defined by its CBN to generate values (data) for each variable in the CBN model
Slide10 : Entity Generator HS1 LC1 WL1 HS2 LC2 WL2 HS3 LC3 WL3 (no, absent, absent) existing entities entity
feature
values samples (yes, present, present) (yes, absent, absent) (no, absent, absent) (yes, absent, absent)
Discovering the Average Causal Bayesian Network : Discovering the Average Causal Bayesian Network HSavg LCavg WLavg
Some Key Types of Causal Relationships : Some Key Types of Causal Relationships HS LC HS WL Direct
Causation Indirect
Causation Confounding HS LC CB WL F Sampled = true Sampling bias
Inference Using a Single CBN When Given Evidence in the Form of Observations : Inference Using a Single CBN When Given Evidence in the Form of Observations History of Smoking (HS) Lung Cancer (LC) Chronic Bronchitis (CB) Fatigue (F) Weight Loss (WL) P(F | CB = present, WL = present, CBN1)
Inference : Inference The Markov Condition implies the following equation:
The above equation specifies the full joint probability distribution over the model variables.
From the joint distribution we can derive any conditional probability of interest.
Inference Algorithms : Inference Algorithms In the worst case, the brute force algorithm is exponential time in the number of variables in the model
Numerous exact inference algorithms have been developed that exploit independences among the variables in the causal Bayesian network.
However, in the worst case, these algorithms are exponential time.
Inference in causal Bayesian networks is NP-hard (Cooper, AIJ, 1990).
Inference Using a Single CBN When Given Evidence in the Form of Manipulations : Inference Using a Single CBN When Given Evidence in the Form of Manipulations P(F | MCB = present, CBN1) Let MCB be a new variable that can have the same values as CB (present, absent) plus the value observe.
Add an arc from MCB to CB.
Define the probability distribution of CB given its parents.
Inference Using a Single CBN When Given Evidence in the Form of Manipulations : Inference Using a Single CBN When Given Evidence in the Form of Manipulations History of Smoking (HS) Lung Cancer (LC) Chronic Bronchitis (CB) Fatigue (F) Weight Loss (WL) P(F | MCB = present, CBN1) MCB
A Deterministic Manipulation : A Deterministic Manipulation History of Smoking (HS) Lung Cancer (LC) Chronic Bronchitis (CB) Fatigue (F) Weight Loss (WL) P(F | MCB = present), CBN1) MCB
Inference Using a Single CBN When Given Evidence in the Form of Observations and Manipulations : Inference Using a Single CBN When Given Evidence in the Form of Observations and Manipulations History of Smoking (HS) Lung Cancer (LC) Chronic Bronchitis (CB) Fatigue (F) Weight Loss (WL) P(F | MCB = present, WL = present, CBN1) MCB
Inference Using Multiple CBNs: Model Averaging : Inference Using Multiple CBNs: Model Averaging
Some Key Reasons for Learning CBNs : Some Key Reasons for Learning CBNs Scientific discovery among measured variables
Example of general: What are the causal relationships among HS, LC, CB, F, and WL?
Example of focused: What are the causes of LC from among HS, CB, F, and WL?
Scientific discovery of hidden processes
Prediction
Example: The effect of not smoking on contracting lung cancer
Major Methods for Learning CBNs from Data : Major Methods for Learning CBNs from Data Constraint-based methods
Uses tests of independence to find patterns of relationships among variables that support causal relationships
Relatively efficient in discovery of causal models with hidden variables
See talk by Frederick Eberhardt this morning
Score-based methods Bayesian scoring
Allows informative prior probabilities of causal structure and parameters
Non-Bayesian scoring
Does not allow informative prior probabilities
Learning CBNs from Observational Data: A Bayesian Formulation : Learning CBNs from Observational Data: A Bayesian Formulation where D is observational data,
Si is the structure of CBNi,
and K is background knowledge and belief.
Learning CBNs from Observational Data When There Are No Hidden Variables : Learning CBNs from Observational Data When There Are No Hidden Variables where i are the parameters associated with Si and the sum is over all CBNs for which P(Sj | K) andgt; 0.
The BD Marginal Likelihood : The BD Marginal Likelihood The previous integral has the following closed form solution, when we assume Dirichlet priors (ijk and ij), multinomial likelihoods (Nijk and Nij denote counts), parameter independence, and parameter modularity:
Searching for Network Structures : Searching for Network Structures Greedy search often used
Hybrid methods have been explored that constraints and scoring
Some algorithms guarantee locating the generating model in the large sample limit (assuming Markov and Faithfulness conditions), as for example the GES algorithm (Chickering, JMLR, 2002)
The ability to approximate the generating network is often quite good
An excellent discussion and evaluation of several state-of-the-art methods, including a relatively new method (Max-Min Hill Climbing) is at:
Tsamardinos, Brown, Aliferis, Machine Learning, 2006.
The Complexity of Search : The Complexity of Search Given a complete dataset and no hidden variables, locating the Bayesian network structure that has the highest posterior probability is NP-hard (Chickering, AIS, 1996; Chickering, et al, JMLR, 2004).
We Can Learn More from Observational and Experimental Data Together than from Either One Alone : We Can Learn More from Observational and Experimental Data Together than from Either One Alone We cannot learn the above causal structure from observational or experimental data alone. We need both.
Learning CBNs from Observational Data When There Are Hidden Variables : Learning CBNs from Observational Data When There Are Hidden Variables where Hi (Hj) are the hidden variables in Si (Sj) and the sum in the numerator (denominator) is taken over all values of Hi (Hj).
Learning CBNs from Observational and Experimental Data: A Bayesian Formulation : Learning CBNs from Observational and Experimental Data: A Bayesian Formulation • For each model variable Xi that is experimentally manipulated in at least one case, introduce a potential parent MXi of Xi .
• Xi can have parents as well from among the other {X1, ..., Xi-1, Xi+1, ..., Xn} domain variables in the model.
• Priors on the distribution of Xi will include conditioning on MXi,when it is a parent of Xi, as well as conditioning on the other parents of Xi.
• Define MXi to have the same values vi1, vi2, ... , viq as Xi, plus a value o (for observe).
o When MXi has value vij in a given case, this represents that the
experimenter intended to manipulate Xi to have value vij in the case.
o When MXi has value observe in a given case, this represents that no
attempt was made by the experimenter to manipulate Xi, but
rather, Xi was merely observed to have the value recorded for it.
• With the above variable additions in place, use the previous Bayesian methods for causal modeling from observational data.
An Example Database Containing Observations and Manipulations : An Example Database Containing Observations and Manipulations
Faithfulness Condition : Faithfulness Condition Faithfulness Condition
Any independence among variables in the data generating distribution follows from the Markov Condition applied to the data generating causal structure.
A simple counter example:
Challenges of Bayesian Learning of Causal Networks : Challenges of Bayesian Learning of Causal Networks Major challenges
Large search spaces
Hidden variables
Feedback
Assessing parameter and structure priors
Modeling complicated distributions
The remainder of this talk will summarize several methods for dealing with hidden variables, which is arguably the biggest major challenge today
These examples provide only a small sample of previous research
Learning Belief Networks in the Presence of Missing Values and Hidden Variables(N. Friedman, ICML, 1997) : Learning Belief Networks in the Presence of Missing Values and Hidden Variables (N. Friedman, ICML, 1997) Assumes a fixed set of measured and hidden variables
Uses Expectation Maximization (EM) to 'fill in' the values of the hidden variable
Uses BIC to score causal network structures with the filled-in data. Greedily finds best structure and then returns to the EM step using this new structure.
Some subsequent work
Use patterns of induced relationships among the measured variables to suggest where to introduce hidden variables (Elidan, et al., NIPS, 2000)
Determining the cardinality of the hidden variables introduced (Elidan andamp; Friedman, UAI, 2001)
A Non-Parametric Bayesian Methods for Inferring Hidden Causes(Wood, et al., UAI, 2006) : A Non-Parametric Bayesian Methods for Inferring Hidden Causes (Wood, et al., UAI, 2006) Learns hidden causes of measured variables
Assumes binary variables and noisy-OR interactions
Uses MCMC to sample the hidden structures
Allows in principle an infinite number of hidden variables
In practice, the number of optimal hidden variables is constrained by the measured data
hidden variables measured variables
Bayesian Learning of Measurement and Structural Model(Silva & Scheines, ICML, 2006) : Bayesian Learning of Measurement and Structural Model (Silva andamp; Scheines, ICML, 2006) Learns the following type of models
Assumes continuous variables, mixture of Gaussian distributions, and linear interactions
hidden variables measured variables
Mixed Ancestral Graphs* : Mixed Ancestral Graphs* A MAG(G) is a graphical object that contains only the observed variables, causal arcs, and a new relationship for representing hidden confounding.
There exist methods for scoring linear MAGS (Richardson andamp; Spirtes Ancestral Graph Markov Models, Annals of Statistics, 2002) * This slide was adapted from a slide provided by Peter Spirtes.
A Theoretical Study of Y Structures for Causal Discovery(Mani, Spirtes, Cooper, UAI, 2006) : A Theoretical Study of Y Structures for Causal Discovery (Mani, Spirtes, Cooper, UAI, 2006) Learn a Bayesian network structure on the measured variables
Identify patterns in the structure that suggest causal relationships
The 'Y' structure shown in green supports that D is an unconfounded cause of F. A B C D E F
Causal Discovery Using Subsets of Variables : Causal Discovery Using Subsets of Variables Search for an estimate M of the Markov blanket of a variable X (e.g., Aliferis, et al., AMIA, 2002)
X is independent of other variables in the generating causal network model, conditioned on the variables in X’s Markov blanket
Within M search for patterns among the variables that suggest a causal relationship to X (e.g., Mani, doctoral dissertation, Un. of Pittsburgh, 2006)
Causal Identifiability : Causal Identifiability Generally depends upon
Markov Condition
Faithfulness Condition
Informative structural relationships among the measured variables
Example of the 'Y structure':
C E A B
Evaluation of Causal Discovery : Evaluation of Causal Discovery In evaluating a classifier, the correct answer in any instance is just the value of some variable of interest, which typically is explicitly in the data set. This make evaluation relatively straightforward.
In evaluating the output of a causal discovery algorithm, the answer is not in the dataset. In general we need some outside knowledge to confirm that the causal output is correct. This makes evaluation relatively difficult. Thus, causal discovery algorithms have not been thoroughly evaluated.
Methods for Evaluating Causal Discovery Algorithms : Methods for Evaluating Causal Discovery Algorithms Simulated data
Real data with expert judgments of causation
Real data with previously validated causal relationships
Real data with follow up experiments
An Example of an Evaluation Using Simulated Data(Mani, poster here) : An Example of an Evaluation Using Simulated Data (Mani, poster here) Generated 20,000 observational data samples from each of five CBNs that were manually constructed
Applied the BLCD algorithm, which considers many 4-variable subsets of all the variables and applies Bayesian scoring. It is based on the causal properties of 'Y' structures.
Results
Precision: 83%
Recall: 27%
An Example of an Evaluation Using Previously Validated Causal Relationships(Yoo, et al., PSB, 2002) : An Example of an Evaluation Using Previously Validated Causal Relationships (Yoo, et al., PSB, 2002) ILVS is a Bayesian method that considers pairwise relationships among a set of variables
It works best when given both observational and experimental data
ILVS was applied to a previously collected DNA microarray dataset on 9 genes that control galactose metabolism in yeast (Ideker, et al., Science, 2001) The causal relationships among the genes have been extensively studied and reported in the literature.
ILVS predicted 12 of 27 known causal relationships among the genes (44% recall) and of those 12 eight were correct (67% precision)
Yoo has explored numerous extensions to ILVS
An Example of an Evaluation Using Real Data with Follow Up Experiments(Sachs, et al., Science, 2005) : An Example of an Evaluation Using Real Data with Follow Up Experiments (Sachs, et al., Science, 2005) Experimentally manipulated human immune system cells
Used flow cytometry to measure the effects on 11 proteins and phospholipids on a large number of individual cells
Used a Bayesian method for causally learning from observational and experimental data
Derived 17 causal relationships with high probability
15 highly supported by the literature (precision = 15/17 = 88%)
The other two were confirmed experimentally by the authors (precision = 17/17 = 100%)
Three causal relationships were missed ('recall' = 17 /20 = 85%)
A Possible Approach to Combining Causal Discovery and Feature Selection : A Possible Approach to Combining Causal Discovery and Feature Selection Use prior knowledge and statistical associations to develop overlapping groups of features (variables)
Derive causal probabilistic relationships within groups
Have the causal groups constrain each other
Determine additional groups of features that might constrain causal relationships further
Either go to step 2 or step 6
Model average within and across groups to derive approximate model-averaged causal relationships
David Danks 2002. Learning the Causal Structure of Overlapping Variable Sets. In S. Lange, K. Satoh, andamp; C.H. Smith, eds. Discovery Science: Proceedings of the 5th International Conference. Berlin: Springer-Verlag. pp. 178-191.
Some Suggestions for Further Information : Some Suggestions for Further Information Books
Glymour, Cooper (eds), Computation, Causation, and Discovery (MIT Press, 1999)
Pearl, Causality: Models, Reasoning, and Inference (Cambridge University Press, 2000)
Spirtes, Glymour, Scheines, Causation, Prediction, and Search (MIT Press, 2001)
Neapolitan, Learning Bayesian Networks (Prentice Hall, 2003)
Conferences
UAI, ICML, NIPS, AAAI, IJCAI
Journals
JMLR, Machine Learning
Acknowledgement : Acknowledgement Thanks to Peter Spirtes for his comments on an outline of this talk