logging in or signing up dhillon2004 itcc Cubemiddle Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 130 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 06, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Information Theoretic Clustering, Co-clustering and Matrix Approximations Inderjit S. Dhillon University of Texas, Austin : Information Theoretic Clustering, Co-clustering and Matrix Approximations Inderjit S. Dhillon University of Texas, Austin Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu andamp; D. Modha Data Mining Seminar Series, Mar 26, 2004 Clustering: Unsupervised Learning: Clustering: Unsupervised Learning Grouping together of 'similar' objects Hard Clustering -- Each object belongs to a single cluster Soft Clustering -- Each object is probabilistically assigned to clusters Contingency Tables: Contingency Tables Let X and Y be discrete random variables X and Y take values in {1, 2, …, m} and {1, 2, …, n} p(X, Y) denotes the joint probability distribution—if not known, it is often estimated based on co-occurrence data Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. Key Obstacles in Clustering Contingency Tables High Dimensionality, Sparsity, Noise Need for robust and scalable algorithms Co-Clustering: Co-Clustering Simultaneously Cluster rows of p(X, Y) into k disjoint groups Cluster columns of p(X, Y) into l disjoint groups Key goal is to exploit the 'duality' between row and column clustering to overcome sparsity and noise Co-clustering Example for Text Data: Co-clustering Example for Text Data document word word clusters document clusters Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix Co-clustering and Information Theory: Co-clustering and Information Theory View 'co-occurrence' matrix as a joint probability distribution over row andamp; column random variables We seek a 'hard-clustering' of both rows and columns such that 'information' in the compressed matrix is maximized. Information Theory Concepts: Information Theory Concepts Entropy of a random variable X with probability distribution p: The Kullback-Leibler (KL) Divergence or 'Relative Entropy' between two probability distributions p and q: Mutual Information between random variables X and Y: “Optimal” Co-Clustering: 'Optimal' Co-Clustering Seek random variables and taking values in {1, 2, …, k} and {1, 2, …, l} such that mutual information is maximized: where = R(X) is a function of X alone where = C(Y) is a function of Y alone Related Work : Related Work Distributional Clustering Pereira, Tishby andamp; Lee (1993), Baker andamp; McCallum (1998) Information Bottleneck Tishby, Pereira andamp; Bialek(1999), Slonim, Friedman andamp; Tishby (2001), Berkhin andamp; Becher(2002) Probabilistic Latent Semantic Indexing Hofmann (1999), Hofmann andamp; Puzicha (1999) Non-Negative Matrix Approximation Lee andamp; Seung(2000) Information-Theoretic Co-clustering: Information-Theoretic Co-clustering Lemma: 'Loss in mutual information' equals p is the input distribution q is an approximation to p Can be shown that q(x,y) is a maximum entropy approximation subject to cluster constraints. Slide11: Slide12: Slide13: Slide14: Slide15: #parameters that determine q(x,y) are: Decomposition Lemma: Decomposition Lemma Question: How to minimize ? Following Lemma reveals the Answer: Note that may be thought of as the 'prototype' of row cluster. Similarly, Co-Clustering Algorithm: Co-Clustering Algorithm [Step 1] Set . Start with , Compute . [Step 2] For every row , assign it to the cluster that minimizes [Step 3] We have . Compute . [Step 4] For every column , assign it to the cluster that minimizes [Step 5] We have . Compute . Iterate 2-5. Properties of Co-clustering Algorithm: Properties of Co-clustering Algorithm Main Theorem: Co-clustering 'monotonically' decreases loss in mutual information Co-clustering converges to a local minimum Can be generalized to multi-dimensional contingency tables q can be viewed as a 'low complexity' non-negative matrix approximation q preserves marginals of p, and co-cluster statistics Implicit dimensionality reduction at each step helps overcome sparsity andamp; high-dimensionality Computationally economical Slide19: Slide20: Slide21: Slide22: Applications -- Text Classification: Applications -- Text Classification Assigning class labels to text documents Training and Testing Phases Document collection Class-1 Class-m Grouped into classes Training Data New Document Classifier (Learns from Training data) New Document With Assigned class Feature Clustering (dimensionality reduction): Feature Clustering (dimensionality reduction) Feature Selection Feature Clustering Document Bag-of-words 1 m Vector Of words Select the 'best' words Throw away rest Frequency based pruning Information criterion based pruning Document Bag-of-words Vector Of words 1 m Cluster#1 Cluster#k Do not throw away words Cluster words instead Use clusters as features Word#1 Word#k Experiments: Experiments Data sets 20 Newsgroups data 20 classes, 20000 documents Classic3 data set 3 classes (cisi, med and cran), 3893 documents Dmoz Science HTML data 49 leaves in the hierarchy 5000 documents with 14538 words Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt Implementation Details Bow – for indexing,co-clustering, clustering and classifying Results (20Ng): Results (20Ng) Classification Accuracy on 20 Newsgroups data with 1/3-2/3 test-train split Divisive clustering beats feature selection algorithms by a large margin The effect is more significant at lower number of features Results (Dmoz): Results (Dmoz) Classification Accuracy on Dmoz data with 1/3-2/3 test train split Divisive Clustering is better at lower number of features Note contrasting behavior of Naïve Bayes and SVMs Results (Dmoz): Results (Dmoz) Naïve Bayes on Dmoz data with only 2% Training data Note that Divisive Clustering achieves higher maximum than IG with a significant 13% increase Divisive Clustering performs better than IG at lower training data Slide29: Science Math Physics Social Science Number Theory Logic Mechanics Quantum Theory Economics Archeology Flat classifier builds a classifier over the leaf classes in the above hierarchy Hierarchical Classifier builds a classifier at each internal node of the hierarchy Hierarchical Classification Results (Dmoz): Results (Dmoz) Hierarchical Classifier (Naïve Bayes at each node) Hierarchical Classifier: 64.54% accuracy at just 10 features (Flat achieves 64.04% accuracy at 1000 features) Hierarchical Classifier improves accuracy to 68.42 % from 64.42%(maximum) achieved by flat classifiers Anecdotal Evidence: Anecdotal Evidence Top few words sorted in Clusters obtained by Divisive and Agglomerative approaches on 20 Newsgroups data Co-Clustering Results (CLASSIC3): Co-Clustering Results (CLASSIC3) Results – Binary (subset of 20Ng data): Results – Binary (subset of 20Ng data) Precision – 20Ng data: Precision – 20Ng data Results: Sparsity (Binary_subject data) : Results: Sparsity (Binary_subject data) Results: Sparsity (Binary_subject data) : Results: Sparsity (Binary_subject data) Results (Monotonicity): Results (Monotonicity) Conclusions: Conclusions Information-theoretic approach to clustering, co-clustering and matrix approximation Implicit dimensionality reduction at each step to overcome sparsity andamp; high-dimensionality Theoretical approach has the potential of extending to other problems: Multi-dimensional co-clustering MDL to choose number of co-clusters Generalized co-clustering by Bregman divergence More Information: More Information Email: inderjit@cs.utexas.edu Papers are available at: http://www.cs.utexas.edu/users/inderjit 'Divisive Information-Theoretic Feature Clustering for Text Classification', Dhillon, Mallela andamp; Kumar, Journal of Machine Learning Research(JMLR), March 2003 (also KDD, 2002) 'Information-Theoretic Co-clustering', Dhillon, Mallela andamp; Modha, KDD, 2003. 'Clustering with Bregman Divergences', Banerjee, Merugu, Dhillon andamp; Ghosh, SIAM Data Mining Proceedings, April, 2004. 'A Generalized Maximum Entropy Approach to Bregman Co-clustering andamp; Matrix Approximation', Banerjee, Dhillon, Ghosh, Merugu andamp; Modha, working manuscript, 2004. You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
dhillon2004 itcc Cubemiddle Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 130 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 06, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Information Theoretic Clustering, Co-clustering and Matrix Approximations Inderjit S. Dhillon University of Texas, Austin : Information Theoretic Clustering, Co-clustering and Matrix Approximations Inderjit S. Dhillon University of Texas, Austin Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu andamp; D. Modha Data Mining Seminar Series, Mar 26, 2004 Clustering: Unsupervised Learning: Clustering: Unsupervised Learning Grouping together of 'similar' objects Hard Clustering -- Each object belongs to a single cluster Soft Clustering -- Each object is probabilistically assigned to clusters Contingency Tables: Contingency Tables Let X and Y be discrete random variables X and Y take values in {1, 2, …, m} and {1, 2, …, n} p(X, Y) denotes the joint probability distribution—if not known, it is often estimated based on co-occurrence data Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. Key Obstacles in Clustering Contingency Tables High Dimensionality, Sparsity, Noise Need for robust and scalable algorithms Co-Clustering: Co-Clustering Simultaneously Cluster rows of p(X, Y) into k disjoint groups Cluster columns of p(X, Y) into l disjoint groups Key goal is to exploit the 'duality' between row and column clustering to overcome sparsity and noise Co-clustering Example for Text Data: Co-clustering Example for Text Data document word word clusters document clusters Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix Co-clustering and Information Theory: Co-clustering and Information Theory View 'co-occurrence' matrix as a joint probability distribution over row andamp; column random variables We seek a 'hard-clustering' of both rows and columns such that 'information' in the compressed matrix is maximized. Information Theory Concepts: Information Theory Concepts Entropy of a random variable X with probability distribution p: The Kullback-Leibler (KL) Divergence or 'Relative Entropy' between two probability distributions p and q: Mutual Information between random variables X and Y: “Optimal” Co-Clustering: 'Optimal' Co-Clustering Seek random variables and taking values in {1, 2, …, k} and {1, 2, …, l} such that mutual information is maximized: where = R(X) is a function of X alone where = C(Y) is a function of Y alone Related Work : Related Work Distributional Clustering Pereira, Tishby andamp; Lee (1993), Baker andamp; McCallum (1998) Information Bottleneck Tishby, Pereira andamp; Bialek(1999), Slonim, Friedman andamp; Tishby (2001), Berkhin andamp; Becher(2002) Probabilistic Latent Semantic Indexing Hofmann (1999), Hofmann andamp; Puzicha (1999) Non-Negative Matrix Approximation Lee andamp; Seung(2000) Information-Theoretic Co-clustering: Information-Theoretic Co-clustering Lemma: 'Loss in mutual information' equals p is the input distribution q is an approximation to p Can be shown that q(x,y) is a maximum entropy approximation subject to cluster constraints. Slide11: Slide12: Slide13: Slide14: Slide15: #parameters that determine q(x,y) are: Decomposition Lemma: Decomposition Lemma Question: How to minimize ? Following Lemma reveals the Answer: Note that may be thought of as the 'prototype' of row cluster. Similarly, Co-Clustering Algorithm: Co-Clustering Algorithm [Step 1] Set . Start with , Compute . [Step 2] For every row , assign it to the cluster that minimizes [Step 3] We have . Compute . [Step 4] For every column , assign it to the cluster that minimizes [Step 5] We have . Compute . Iterate 2-5. Properties of Co-clustering Algorithm: Properties of Co-clustering Algorithm Main Theorem: Co-clustering 'monotonically' decreases loss in mutual information Co-clustering converges to a local minimum Can be generalized to multi-dimensional contingency tables q can be viewed as a 'low complexity' non-negative matrix approximation q preserves marginals of p, and co-cluster statistics Implicit dimensionality reduction at each step helps overcome sparsity andamp; high-dimensionality Computationally economical Slide19: Slide20: Slide21: Slide22: Applications -- Text Classification: Applications -- Text Classification Assigning class labels to text documents Training and Testing Phases Document collection Class-1 Class-m Grouped into classes Training Data New Document Classifier (Learns from Training data) New Document With Assigned class Feature Clustering (dimensionality reduction): Feature Clustering (dimensionality reduction) Feature Selection Feature Clustering Document Bag-of-words 1 m Vector Of words Select the 'best' words Throw away rest Frequency based pruning Information criterion based pruning Document Bag-of-words Vector Of words 1 m Cluster#1 Cluster#k Do not throw away words Cluster words instead Use clusters as features Word#1 Word#k Experiments: Experiments Data sets 20 Newsgroups data 20 classes, 20000 documents Classic3 data set 3 classes (cisi, med and cran), 3893 documents Dmoz Science HTML data 49 leaves in the hierarchy 5000 documents with 14538 words Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt Implementation Details Bow – for indexing,co-clustering, clustering and classifying Results (20Ng): Results (20Ng) Classification Accuracy on 20 Newsgroups data with 1/3-2/3 test-train split Divisive clustering beats feature selection algorithms by a large margin The effect is more significant at lower number of features Results (Dmoz): Results (Dmoz) Classification Accuracy on Dmoz data with 1/3-2/3 test train split Divisive Clustering is better at lower number of features Note contrasting behavior of Naïve Bayes and SVMs Results (Dmoz): Results (Dmoz) Naïve Bayes on Dmoz data with only 2% Training data Note that Divisive Clustering achieves higher maximum than IG with a significant 13% increase Divisive Clustering performs better than IG at lower training data Slide29: Science Math Physics Social Science Number Theory Logic Mechanics Quantum Theory Economics Archeology Flat classifier builds a classifier over the leaf classes in the above hierarchy Hierarchical Classifier builds a classifier at each internal node of the hierarchy Hierarchical Classification Results (Dmoz): Results (Dmoz) Hierarchical Classifier (Naïve Bayes at each node) Hierarchical Classifier: 64.54% accuracy at just 10 features (Flat achieves 64.04% accuracy at 1000 features) Hierarchical Classifier improves accuracy to 68.42 % from 64.42%(maximum) achieved by flat classifiers Anecdotal Evidence: Anecdotal Evidence Top few words sorted in Clusters obtained by Divisive and Agglomerative approaches on 20 Newsgroups data Co-Clustering Results (CLASSIC3): Co-Clustering Results (CLASSIC3) Results – Binary (subset of 20Ng data): Results – Binary (subset of 20Ng data) Precision – 20Ng data: Precision – 20Ng data Results: Sparsity (Binary_subject data) : Results: Sparsity (Binary_subject data) Results: Sparsity (Binary_subject data) : Results: Sparsity (Binary_subject data) Results (Monotonicity): Results (Monotonicity) Conclusions: Conclusions Information-theoretic approach to clustering, co-clustering and matrix approximation Implicit dimensionality reduction at each step to overcome sparsity andamp; high-dimensionality Theoretical approach has the potential of extending to other problems: Multi-dimensional co-clustering MDL to choose number of co-clusters Generalized co-clustering by Bregman divergence More Information: More Information Email: inderjit@cs.utexas.edu Papers are available at: http://www.cs.utexas.edu/users/inderjit 'Divisive Information-Theoretic Feature Clustering for Text Classification', Dhillon, Mallela andamp; Kumar, Journal of Machine Learning Research(JMLR), March 2003 (also KDD, 2002) 'Information-Theoretic Co-clustering', Dhillon, Mallela andamp; Modha, KDD, 2003. 'Clustering with Bregman Divergences', Banerjee, Merugu, Dhillon andamp; Ghosh, SIAM Data Mining Proceedings, April, 2004. 'A Generalized Maximum Entropy Approach to Bregman Co-clustering andamp; Matrix Approximation', Banerjee, Dhillon, Ghosh, Merugu andamp; Modha, working manuscript, 2004.