Share PowerPoint. Anywhere!

icml01 smyth

Uploaded from authorPOINT
Download as Download Not Available PPT
Presentation Description

No description available

Views: 12
Like it  ( Likes) Dislike it  ( Dislikes)
Added: September 25, 2007 This presentation is Public
Presentation Category :Science & Technology
Tags Add Tags
Presentation StatisticsNew!
Views on authorSTREAM: 12
Presentation Transcript

A Guided Tour of Finite Mixture Models: From Pearson to the WebICML ‘01 Keynote TalkWilliams College, MAJune 29th 2001 : A Guided Tour of Finite Mixture Models: From Pearson to the Web ICML ‘01 Keynote Talk Williams College, MA June 29th 2001 Padhraic Smyth Information and Computer Science University of California, Irvine www.datalab.uci.edu


Outline : Outline What are mixture models? Definitions and examples How can we learn mixture models? A brief history and illustration What are mixture models useful for? Applications in Web and transaction data Recent research in mixtures?


Acknowledgements : Acknowledgements Students: Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov Collaborators David Heckerman, Chris Meek, Heikki Mannila, Christine McLaren, Geoff McLachlan, David Wolpert Funding NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft Research, IBM Research, HNC Software.


Finite Mixture Models : Finite Mixture Models


Finite Mixture Models : Finite Mixture Models


Finite Mixture Models : Finite Mixture Models


Finite Mixture Models : Finite Mixture Models


Finite Mixture Models : Finite Mixture Models Weightk ComponentModelk Parametersk


Example: Mixture of Gaussians : Example: Mixture of Gaussians Gaussian mixtures:


Example: Mixture of Gaussians : Example: Mixture of Gaussians Gaussian mixtures: Each mixture component is a multidimensional Gaussian with its own mean mk and covariance 'shape' Sk


Example: Mixture of Gaussians : Example: Mixture of Gaussians Gaussian mixtures: Each mixture component is a multidimensional Gaussian with its own mean mk and covariance 'shape' Sk e.g., K=2, 1-dim: {q, a} = {m1 , s1 , m2 , s2 , a1}


Slide12 :


Slide13 :


Slide14 :


Example: Mixture of Naïve Bayes : Example: Mixture of Naïve Bayes


Example: Mixture of Naïve Bayes : Example: Mixture of Naïve Bayes


Example: Mixture of Naïve Bayes : Example: Mixture of Naïve Bayes Conditional Independence model for each component (often quite useful to first-order)


Mixtures of Naïve Bayes : Mixtures of Naïve Bayes 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Terms Documents


Mixtures of Naïve Bayes : Mixtures of Naïve Bayes 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Terms Documents 1 1 1 1 Component 1 Component 2


Other Component Models : Other Component Models Mixtures of Rectangles Pelleg and Moore (ICML, 2001) Mixtures of Trees Meila and Jordan (2000) Mixtures of Curves Quandt and Ramsey (1978) Mixtures of Sequences Poulsen (1990)


Interpretation of Mixtures : Interpretation of Mixtures 1. C has a direct (physical) interpretation e.g., C = {age of fish}, C = {male, female}


Interpretation of Mixtures : Interpretation of Mixtures 1. C has a direct (physical) interpretation e.g., C = {age of fish}, C = {male, female} 2. C might have an interpretation e.g., clusters of Web surfers


Interpretation of Mixtures : Interpretation of Mixtures 1. C has a direct (physical) interpretation e.g., C = {age of fish}, C = {male, female} 2. C might have an interpretation e.g., clusters of Web surfers 3. C is just a convenient latent variable e.g., flexible density estimation


Graphical Models for Mixtures : Graphical Models for Mixtures E.g., Mixtures of Naïve Bayes: C


Graphical Models for Mixtures : Graphical Models for Mixtures E.g., Mixtures of Naïve Bayes: C X1 X2 X3


Graphical Models for Mixtures : Graphical Models for Mixtures E.g., Mixtures of Naïve Bayes: (discrete, hidden) C (observed) X1 X2 X3


Sequential Mixtures : Sequential Mixtures C X1 X2 X3 X1 X2 X3 C X1 X2 X3 C Time t-1 Time t Time t+1


Sequential Mixtures : Sequential Mixtures C X1 X2 X3 X1 X2 X3 C X1 X2 X3 C Time t-1 Time t Time t+1 Markov Mixtures = C has Markov dependence = Hidden Markov Model (here with naïve Bayes) C = discrete state, couples observables through time


Dynamic Mixtures : Dynamic Mixtures Computer Vision mixtures of Kalman filters for tracking Atmospheric Science mixtures of curves and dynamical models for cyclones Economics mixtures of switching regressions for the US economy


Limitations of Mixtures : Limitations of Mixtures Discrete state space not always appropriate e.g., in modeling dynamical systems Training no closed form solution, can be tricky Interpretability many different mixture solutions may explain the same data


Learning of mixture models : Learning of mixture models


Learning Mixtures from Data : Learning Mixtures from Data Consider fixed K e.g., Unknown parameters Q = {m1 , s1 , m2 , s2 , a1} Given data D = {x1,…….xN}, we want to find the parameters Q that 'best fit' the data


Early Attempts : Early Attempts Weldon’s data, 1893 - n=1000 crabs from Bay of Naples - Ratio of forehead to body length - suspected existence of 2 separate species


Early Attempts : Early Attempts Karl Pearson, 1894: - JRSS paper - proposed a mixture of 2 Gaussians - 5 parameters Q = {m1 , s1 , m2 , s2 , a1} - parameter estimation -andgt; method of moments - involved solution of 9th order equations! (see Chapter 10, Stigler (1986), The History of Statistics)


Slide35 : 'The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task…. But I fear he will have few successors…..' Charlier (1906)


Maximum Likelihood Principle : Maximum Likelihood Principle Fisher, 1922 assume a probabilistic model likelihood = p(data | parameters, model) find the parameters that make the data most likely


Maximum Likelihood Principle : Maximum Likelihood Principle Fisher, 1922 assume a probabilistic model likelihood = p(data | parameters, model) find the parameters that make the data most likely


1977: The EM Algorithm : 1977: The EM Algorithm Dempster, Laird, and Rubin general framework for likelihood-based parameter estimation with missing data start with initial guesses of parameters Estep: estimate memberships given params Mstep: estimate params given memberships Repeat until convergence converges to a (local) maximum of likelihood Estep and Mstep are often computationally simple generalizes to maximum a posteriori (with priors)


Slide39 :


Example of a Log-Likelihood Surface : Example of a Log-Likelihood Surface Log Scale for Sigma 2 Mean 2


Slide41 :


Slide42 :


Slide43 :


Slide44 :


Slide45 :


Slide46 :


Slide47 :


Slide48 :


Slide49 :


Slide50 : Anemia Group Control Group


Slide51 : Data for an Individual Patient Healthy state Anemic state (Cadez et al., Machine Learning, in press)


Alternatives to EM : Alternatives to EM Method of Moments EM is more efficient Direct optimization e.g., gradient descent, Newton methods EM is simpler to implement Sampling (e.g., MCMC) Minimum distance, e.g.,


How many components? : How many components? 2 general approaches: 1. Best density estimator e.g., what predicts best on new data 2. True number of components - cannot be answered from data alone


Slide54 : Data-generating process ('truth') K=2 Model Class 'Closest' model in terms of KL distance


Slide55 : Data-generating process ('truth') K=10 Model Class


Prescriptions for Model Selection : Prescriptions for Model Selection Minimize distance to 'truth' Maximize predictive 'logp score' gives an estimate of KL(model, truth) pick model that predicts best on validation data Bayesian techniques p(k|data): impossible to compute exactly closed-form approximations: BIC, Autoclass, etc sampling Monte Carlo techniques: quite tricky for mixtures


Mixture model applications : Mixture model applications


What are Mixtures used for? : What are Mixtures used for? Modeling heterogeneity e.g., inferring multiple species in biology Handling missing data e.g., variables and cases missing in model-building Density estimation e.g., as flexible priors in Bayesian statistics Clustering components as clusters Model Averaging combining density models


Mixtures of non-vector data : Mixtures of non-vector data Example N individuals, and sets of sequences for each e.g., Web session data Clustering of the N individuals? Vectorize data and apply vector methods? Pairwise distances of sets of sequences? Parameter estimate for each individual and then cluster?


Mixtures of {Sequences, Curves, …} : Mixtures of {Sequences, Curves, …}


Mixtures of {Sequences, Curves, …} : Mixtures of {Sequences, Curves, …} Generative Model - select a component ck for individual i - generate data according to p(Di | ck) - p(Di | ck) can be very general - e.g., sets of sequences, spatial patterns, etc [Note: given p(Di | ck), we can define an EM algorithm]


Application 1: Web Log Visualization (Cadez, Heckerman, Meek, Smyth, KDD 2000) : Application 1: Web Log Visualization (Cadez, Heckerman, Meek, Smyth, KDD 2000) MSNBC Web logs 2 million individuals per day different session lengths per individual difficult visualization and clustering problem WebCanvas uses mixtures of SFSMs to cluster individuals based on their observed sequences software tool: EM mixture modeling + visualization


Slide63 :


Example: Mixtures of SFSMs : Example: Mixtures of SFSMs Simple model for traversal on a Web site (equivalent to first-order Markov with end-state) Generative model for large sets of Web users - different behaviors andlt;=andgt; mixture of SFSMs EM algorithm is quite simple: weighted counts


Slide65 : WebCanvas: Cadez, Heckerman, et al, KDD 2000


Slide66 :


Timing Results : Timing Results


Transaction Data : Transaction Data


Profiling for Transaction Data : Profiling for Transaction Data Profiling given transactions for a set of individuals infer a predictive model for future transactions of individual i typical applications: automated recommender systems: e.g., Amazon.com personalization: e.g., wireless information services existing techniques collaborative filtering, association rules unsuited for prediction, e.g., inclusion of seasonality.


Slide70 : Retail Data Set 200,000 individuals all market baskets over 2 years 50 departments, 50k items Problem predict what an individual will purchase in the future want to generalize across products want to allow heterogeneity Application 2: Transaction Data (Cadez, Smyth, Mannila, KDD 2001)


Transaction Data : Transaction Data


Examples of Mixture Components : Examples of Mixture Components Components 'encode' typical combinations of clothes


Mixture-based Profiles : Mixture-based Profiles Predictive Profile for Individual i Basket model for Component k ('Basis function') Probability that Individual i engages in 'behavior' k, given that they enter the store


Hierarchical Bayes Model : Hierarchical Bayes Model Empirical Prior on Mixture Weights Individual 1 B1 Individual i B1 B2 B3 Individual N B1 B2 Individuals with little data get 'shrunk' to the prior Individuals with a lot of data are more data-driven


Data and Profile Example : Data and Profile Example


Data and Profile Example : Data and Profile Example


Data and Profile Example : Data and Profile Example


Data and Profile Example : Data and Profile Example


Data and Profile Example : Data and Profile Example No Training Data for 14 No Purchases above Dept 25


Slide80 :


Slide81 :


Transaction mixtures : Transaction mixtures Mixture-based profiles interpretable and flexible more accurate than non-mixture approaches training time = linear(number of items) Applications early detection of high-value customers visualization and exploration forecasting customer behavior


Extensions of Mixtures : Extensions of Mixtures


Extensions : Multiple Causes : Extensions : Multiple Causes Single 'cause' variable C X


Extensions: Multiple Causes : Extensions: Multiple Causes Single 'cause' variable Multiple Causes ('Factors') examples Hoffman (1999,2000) for text ICA for signals C X C2 X C1 C3


Extensions: High Dimensions : Extensions: High Dimensions Density estimation is difficult in high-d


Extensions : High Dimensions : Extensions : High Dimensions Density estimation is difficult in high-d


Extensions: High Dimensions : Extensions: High Dimensions Density estimation is difficult in high-d Global PCA


Extensions: High Dimensions : Extensions: High Dimensions Density estimation is difficult in high-d Mixtures of PCA (Tipping and Bishop 1999)


Extensions: Predictive Mixtures : Extensions: Predictive Mixtures Standard mixture model


Extensions: Predictive Mixtures : Extensions: Predictive Mixtures Standard mixture model Conditional Mixtures: e.g., mixtures of experts, Jordan and Jacobs (1994) Input-dependent


Extensions: Learning Algorithms : Extensions: Learning Algorithms Fast algorithms kd-trees (Moore, 1998) Caching(Bradley et al.) Random projections: DasGupta (2000) Mean-squared error criteria Scott (2000) Bayesian techniques reversible-jump MCMC (Green, et al.)


Classic References : Classic References Statistical Analysis of Finite Mixture Distributions Titterington, Smith, and Makov Wiley, 1985 Finite Mixture Models McLachlan and Peel Wiley 2000


Conclusions : Conclusions Mixtures flexible 'tool' in the machine learner’s toolbox Beyond mixtures of Gaussians… mixtures of sequences, curves, trees, etc. Applications numerous and broad


Slide95 :


Slide96 :


Concavity of Likelihood (Cadez and Smyth, NIPS 2000) : Concavity of Likelihood (Cadez and Smyth, NIPS 2000)


Application 3: Model Averaging : Application 3: Model Averaging Prediction of Model k Weight of Model k Bayesian model averaging for p(x): - since we don’t know which model (if any) is the true one, average out this uncertainty Prediction of x given data D


Stacked Mixtures(Smyth and Wolpert, Machine Learning, 1999) : Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999) Simple idea: use cross-validation to estimate the weights, rather than using a Bayesian scheme Two-phase learning 1. Learn each model Mk on Dtrain 2. Learn mixture model weights on Dvalidation - components are fixed - EM just learns the weights Outperforms any single model selection technique Even outperforms 'cheating'


Query Approximation(Pavlov and Smyth, KDD 2001) : Query Approximation (Pavlov and Smyth, KDD 2001) Approximate Models Query Generator Large Database


Query Approximation : Query Approximation Large Database Construct Probability Models Offline e.g., mixtures, belief networks, etc Approximate Models Query Generator


Query Approximation : Query Approximation Query Generator Large Database Construct Probability Models Offline e.g., mixtures, belief networks, etc Provide Fast Query Answers Online Approximate Models


Stacking for Query Model Combining : Stacking for Query Model Combining Conjunctive Queries on Microsoft Web Data, 32k records, 294 attributes Available online at UCI KDD Archive


Slide104 : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Attributes Observation Vectors 1 1 1 1


Slide105 : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Attributes Observation Vectors C1 C1 C1 C1 C1 C1 C1 C2 C2 C2 C2 C2 C2 C2 1 1 1 1 Treat as Missing C2


Slide106 : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Attributes Observation Vectors C1 C1 C1 C1 C1 C1 C1 C2 C2 C2 C2 C2 C2 C2 1 1 1 1 Treat as Missing P(C1|x1) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C2|x1) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) E-Step: estimate component membership probabilities given current parameter estimates


Slide107 : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Attributes Observation Vectors C1 C1 C1 C1 C1 C1 C1 C2 C2 C2 C2 C2 C2 C2 1 1 1 1 Treat as Missing P(C1|x1) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C1|..) P(C2|x1) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) P(C2|..) M-Step: use 'fractional' weighted data to get new estimates of the parameters