A Guided Tour of Finite Mixture Models: From Pearson to the WebICML ‘01 Keynote TalkWilliams College, MAJune 29th 2001 : A Guided Tour of Finite Mixture Models: From Pearson to the Web ICML ‘01 Keynote Talk Williams College, MA June 29th 2001 Padhraic Smyth
Information and Computer Science
University of California, Irvine
www.datalab.uci.edu
Outline : Outline What are mixture models?
Definitions and examples
How can we learn mixture models?
A brief history and illustration
What are mixture models useful for?
Applications in Web and transaction data
Recent research in mixtures?
Acknowledgements : Acknowledgements Students:
Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
Collaborators
David Heckerman, Chris Meek, Heikki Mannila, Christine McLaren, Geoff McLachlan, David Wolpert
Funding
NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft Research, IBM Research, HNC Software.
Finite Mixture Models : Finite Mixture Models
Finite Mixture Models : Finite Mixture Models
Finite Mixture Models : Finite Mixture Models
Finite Mixture Models : Finite Mixture Models
Finite Mixture Models : Finite Mixture Models Weightk ComponentModelk Parametersk
Example: Mixture of Gaussians : Example: Mixture of Gaussians Gaussian mixtures:
Example: Mixture of Gaussians : Example: Mixture of Gaussians Gaussian mixtures: Each mixture component is a multidimensional Gaussian with its own mean mk and covariance 'shape' Sk
Example: Mixture of Gaussians : Example: Mixture of Gaussians Gaussian mixtures: Each mixture component is a multidimensional Gaussian with its own mean mk and covariance 'shape' Sk
e.g., K=2, 1-dim:
{q, a} = {m1 , s1 , m2 , s2 , a1}
Slide12 :
Slide13 :
Slide14 :
Example: Mixture of Naïve Bayes : Example: Mixture of Naïve Bayes
Example: Mixture of Naïve Bayes : Example: Mixture of Naïve Bayes
Example: Mixture of Naïve Bayes : Example: Mixture of Naïve Bayes
Conditional Independence
model for each component
(often quite useful to first-order)
Mixtures of Naïve Bayes : Mixtures of Naïve Bayes 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Terms Documents
Mixtures of Naïve Bayes : Mixtures of Naïve Bayes 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Terms Documents 1 1 1 1 Component 1 Component 2
Other Component Models : Other Component Models Mixtures of Rectangles
Pelleg and Moore (ICML, 2001)
Mixtures of Trees
Meila and Jordan (2000)
Mixtures of Curves
Quandt and Ramsey (1978)
Mixtures of Sequences
Poulsen (1990)
Interpretation of Mixtures : Interpretation of Mixtures 1. C has a direct (physical) interpretation
e.g., C = {age of fish}, C = {male, female}
Interpretation of Mixtures : Interpretation of Mixtures 1. C has a direct (physical) interpretation
e.g., C = {age of fish}, C = {male, female}
2. C might have an interpretation
e.g., clusters of Web surfers
Interpretation of Mixtures : Interpretation of Mixtures 1. C has a direct (physical) interpretation
e.g., C = {age of fish}, C = {male, female}
2. C might have an interpretation
e.g., clusters of Web surfers
3. C is just a convenient latent variable
e.g., flexible density estimation
Graphical Models for Mixtures : Graphical Models for Mixtures E.g., Mixtures of Naïve Bayes: C
Graphical Models for Mixtures : Graphical Models for Mixtures E.g., Mixtures of Naïve Bayes: C X1 X2 X3
Graphical Models for Mixtures : Graphical Models for Mixtures E.g., Mixtures of Naïve Bayes: (discrete, hidden) C (observed) X1 X2 X3
Sequential Mixtures : Sequential Mixtures C X1 X2 X3 X1 X2 X3 C X1 X2 X3 C Time t-1 Time t Time t+1
Sequential Mixtures : Sequential Mixtures C X1 X2 X3 X1 X2 X3 C X1 X2 X3 C Time t-1 Time t Time t+1 Markov Mixtures = C has Markov dependence
= Hidden Markov Model
(here with naïve Bayes)
C = discrete state,
couples observables through time
Dynamic Mixtures : Dynamic Mixtures Computer Vision
mixtures of Kalman filters for tracking
Atmospheric Science
mixtures of curves and dynamical models for cyclones
Economics
mixtures of switching regressions for the US economy
Limitations of Mixtures : Limitations of Mixtures Discrete state space
not always appropriate
e.g., in modeling dynamical systems
Training
no closed form solution, can be tricky
Interpretability
many different mixture solutions may explain the same data
Learning of mixture models : Learning of mixture models
Learning Mixtures from Data : Learning Mixtures from Data Consider fixed K
e.g., Unknown parameters Q = {m1 , s1 , m2 , s2 , a1}
Given data D = {x1,…….xN}, we want to find the parameters Q that 'best fit' the data
Early Attempts : Early Attempts Weldon’s data, 1893
- n=1000 crabs from Bay of Naples
- Ratio of forehead to body length
- suspected existence of 2 separate species
Early Attempts : Early Attempts Karl Pearson, 1894:
- JRSS paper
- proposed a mixture of 2 Gaussians
- 5 parameters Q = {m1 , s1 , m2 , s2 , a1}
- parameter estimation -andgt; method of moments
- involved solution of 9th order equations!
(see Chapter 10, Stigler (1986), The History of Statistics)
Slide35 : 'The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task…. But I fear he will have few successors…..'
Charlier (1906)
Maximum Likelihood Principle : Maximum Likelihood Principle Fisher, 1922
assume a probabilistic model
likelihood = p(data | parameters, model)
find the parameters that make the data most likely
Maximum Likelihood Principle : Maximum Likelihood Principle Fisher, 1922
assume a probabilistic model
likelihood = p(data | parameters, model)
find the parameters that make the data most likely
1977: The EM Algorithm : 1977: The EM Algorithm Dempster, Laird, and Rubin
general framework for likelihood-based parameter estimation with missing data
start with initial guesses of parameters
Estep: estimate memberships given params
Mstep: estimate params given memberships
Repeat until convergence
converges to a (local) maximum of likelihood
Estep and Mstep are often computationally simple
generalizes to maximum a posteriori (with priors)
Slide39 :
Example of a Log-Likelihood Surface : Example of a Log-Likelihood Surface Log Scale for Sigma 2
Mean 2
Slide41 :
Slide42 :
Slide43 :
Slide44 :
Slide45 :
Slide46 :
Slide47 :
Slide48 :
Slide49 :
Slide50 : Anemia Group Control Group
Slide51 : Data for an Individual Patient Healthy state Anemic state (Cadez et al., Machine Learning, in press)
Alternatives to EM : Alternatives to EM Method of Moments
EM is more efficient
Direct optimization
e.g., gradient descent, Newton methods
EM is simpler to implement
Sampling (e.g., MCMC)
Minimum distance, e.g.,
How many components? : How many components? 2 general approaches:
1. Best density estimator
e.g., what predicts best on new data
2. True number of components
- cannot be answered from data alone
Slide54 : Data-generating
process ('truth') K=2 Model Class 'Closest' model in terms
of KL distance
Slide55 : Data-generating
process ('truth') K=10 Model Class
Prescriptions for Model Selection : Prescriptions for Model Selection Minimize distance to 'truth'
Maximize predictive 'logp score'
gives an estimate of KL(model, truth)
pick model that predicts best on validation data
Bayesian techniques
p(k|data): impossible to compute exactly
closed-form approximations:
BIC, Autoclass, etc
sampling
Monte Carlo techniques: quite tricky for mixtures
Mixture model applications : Mixture model applications
What are Mixtures used for? : What are Mixtures used for? Modeling heterogeneity
e.g., inferring multiple species in biology
Handling missing data
e.g., variables and cases missing in model-building
Density estimation
e.g., as flexible priors in Bayesian statistics
Clustering
components as clusters
Model Averaging
combining density models
Mixtures of non-vector data : Mixtures of non-vector data Example
N individuals, and sets of sequences for each
e.g., Web session data
Clustering of the N individuals?
Vectorize data and apply vector methods?
Pairwise distances of sets of sequences?
Parameter estimate for each individual and then cluster?
Mixtures of {Sequences, Curves, …} : Mixtures of {Sequences, Curves, …}
Mixtures of {Sequences, Curves, …} : Mixtures of {Sequences, Curves, …} Generative Model
- select a component ck for individual i
- generate data according to p(Di | ck)
- p(Di | ck) can be very general
- e.g., sets of sequences, spatial patterns, etc
[Note: given p(Di | ck), we can define an EM algorithm]
Application 1: Web Log Visualization (Cadez, Heckerman, Meek, Smyth, KDD 2000) : Application 1: Web Log Visualization (Cadez, Heckerman, Meek, Smyth, KDD 2000) MSNBC Web logs
2 million individuals per day
different session lengths per individual
difficult visualization and clustering problem
WebCanvas
uses mixtures of SFSMs to cluster individuals based on their observed sequences
software tool: EM mixture modeling + visualization
Slide63 :
Example: Mixtures of SFSMs : Example: Mixtures of SFSMs Simple model for traversal on a Web site
(equivalent to first-order Markov with end-state)
Generative model for large sets of Web users
- different behaviors andlt;=andgt; mixture of SFSMs
EM algorithm is quite simple: weighted counts
Slide65 : WebCanvas: Cadez, Heckerman, et al, KDD 2000
Slide66 :
Timing Results : Timing Results
Transaction Data : Transaction Data
Profiling for Transaction Data : Profiling for Transaction Data Profiling
given transactions for a set of individuals
infer a predictive model for future transactions of individual i
typical applications:
automated recommender systems: e.g., Amazon.com
personalization: e.g., wireless information services
existing techniques
collaborative filtering, association rules
unsuited for prediction, e.g., inclusion of seasonality.
Slide70 : Retail Data Set
200,000 individuals
all market baskets over 2 years
50 departments, 50k items
Problem
predict what an individual will purchase in the future
want to generalize across products
want to allow heterogeneity Application 2: Transaction Data (Cadez, Smyth, Mannila, KDD 2001)
Transaction Data : Transaction Data
Examples of Mixture Components : Examples of Mixture Components Components
'encode'
typical
combinations
of clothes
Mixture-based Profiles : Mixture-based Profiles Predictive
Profile for
Individual i Basket
model for
Component k
('Basis function') Probability
that Individual i
engages in
'behavior' k,
given that they
enter the store
Hierarchical Bayes Model : Hierarchical Bayes Model Empirical Prior on
Mixture Weights Individual 1 B1 Individual i B1 B2 B3 Individual N B1 B2 Individuals with little data get 'shrunk' to the prior
Individuals with a lot of data are more data-driven
Data and Profile Example : Data and Profile Example
Data and Profile Example : Data and Profile Example
Data and Profile Example : Data and Profile Example
Data and Profile Example : Data and Profile Example
Data and Profile Example : Data and Profile Example No Training
Data for 14 No Purchases above Dept 25
Slide80 :
Slide81 :
Transaction mixtures : Transaction mixtures Mixture-based profiles
interpretable and flexible
more accurate than non-mixture approaches
training time = linear(number of items)
Applications
early detection of high-value customers
visualization and exploration
forecasting customer behavior
Extensions of Mixtures : Extensions of Mixtures
Extensions : Multiple Causes : Extensions : Multiple Causes Single 'cause' variable C X
Extensions: Multiple Causes : Extensions: Multiple Causes Single 'cause' variable
Multiple Causes ('Factors')
examples
Hoffman (1999,2000) for text
ICA for signals C X C2 X C1 C3
Extensions: High Dimensions : Extensions: High Dimensions Density estimation is difficult in high-d
Extensions : High Dimensions : Extensions : High Dimensions Density estimation is difficult in high-d
Extensions: High Dimensions : Extensions: High Dimensions Density estimation is difficult in high-d Global
PCA
Extensions: High Dimensions : Extensions: High Dimensions Density estimation is difficult in high-d Mixtures of PCA
(Tipping and Bishop
1999)
Extensions: Predictive Mixtures : Extensions: Predictive Mixtures Standard mixture model
Extensions: Predictive Mixtures : Extensions: Predictive Mixtures Standard mixture model
Conditional Mixtures:
e.g., mixtures of experts, Jordan and Jacobs (1994) Input-dependent
Extensions: Learning Algorithms : Extensions: Learning Algorithms Fast algorithms
kd-trees (Moore, 1998)
Caching(Bradley et al.)
Random projections:
DasGupta (2000)
Mean-squared error criteria
Scott (2000)
Bayesian techniques
reversible-jump MCMC (Green, et al.)
Classic References : Classic References
Statistical Analysis of Finite Mixture Distributions
Titterington, Smith, and Makov
Wiley, 1985
Finite Mixture Models
McLachlan and Peel
Wiley 2000
Conclusions : Conclusions Mixtures
flexible 'tool' in the machine learner’s toolbox
Beyond mixtures of Gaussians…
mixtures of sequences, curves, trees, etc.
Applications
numerous and broad
Slide95 :
Slide96 :
Concavity of Likelihood (Cadez and Smyth, NIPS 2000) : Concavity of Likelihood (Cadez and Smyth, NIPS 2000)
Application 3: Model Averaging : Application 3: Model Averaging Prediction of
Model k Weight of
Model k Bayesian model averaging for p(x):
- since we don’t know which model (if any)
is the true one, average out this uncertainty Prediction of
x given data D
Stacked Mixtures(Smyth and Wolpert, Machine Learning, 1999) : Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999) Simple idea:
use cross-validation to estimate the weights,
rather than using a Bayesian scheme
Two-phase learning
1. Learn each model Mk on Dtrain
2. Learn mixture model weights on Dvalidation - components are fixed
- EM just learns the weights
Outperforms any single model selection technique
Even outperforms 'cheating'
Query Approximation(Pavlov and Smyth, KDD 2001) : Query Approximation (Pavlov and Smyth, KDD 2001) Approximate
Models Query
Generator Large
Database
Query Approximation : Query Approximation Large
Database Construct
Probability
Models
Offline
e.g., mixtures,
belief networks, etc Approximate
Models Query
Generator
Query Approximation : Query Approximation Query
Generator Large
Database Construct
Probability
Models
Offline
e.g., mixtures,
belief networks, etc Provide
Fast Query
Answers
Online
Approximate
Models
Stacking for Query Model Combining : Stacking for Query Model Combining Conjunctive
Queries
on Microsoft
Web Data,
32k records,
294 attributes
Available
online at
UCI KDD Archive
Slide104 : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Attributes Observation
Vectors 1 1 1 1
Slide105 : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Attributes Observation
Vectors C1 C1 C1 C1 C1 C1 C1 C2 C2 C2 C2 C2 C2 C2 1 1 1 1 Treat as Missing C2
Slide106 : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Attributes Observation
Vectors C1 C1 C1 C1 C1 C1 C1 C2 C2 C2 C2 C2 C2 C2 1 1 1 1 Treat as Missing P(C1|x1) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C2|x1) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) E-Step: estimate component
membership probabilities given
current parameter estimates
Slide107 : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Attributes Observation
Vectors C1 C1 C1 C1 C1 C1 C1 C2 C2 C2 C2 C2 C2 C2 1 1 1 1 Treat as Missing P(C1|x1) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C2|x1) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) M-Step: use 'fractional' weighted data
to get new estimates of the parameters