dhillon2004 itcc

Uploaded from authorPOINT
Views:
 
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Information Theoretic Clustering, Co-clustering and Matrix Approximations Inderjit S. Dhillon University of Texas, Austin : 

Information Theoretic Clustering, Co-clustering and Matrix Approximations Inderjit S. Dhillon University of Texas, Austin Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu andamp; D. Modha Data Mining Seminar Series, Mar 26, 2004

Clustering: Unsupervised Learning: 

Clustering: Unsupervised Learning Grouping together of 'similar' objects Hard Clustering -- Each object belongs to a single cluster Soft Clustering -- Each object is probabilistically assigned to clusters

Contingency Tables: 

Contingency Tables Let X and Y be discrete random variables X and Y take values in {1, 2, …, m} and {1, 2, …, n} p(X, Y) denotes the joint probability distribution—if not known, it is often estimated based on co-occurrence data Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. Key Obstacles in Clustering Contingency Tables High Dimensionality, Sparsity, Noise Need for robust and scalable algorithms

Co-Clustering: 

Co-Clustering Simultaneously Cluster rows of p(X, Y) into k disjoint groups Cluster columns of p(X, Y) into l disjoint groups Key goal is to exploit the 'duality' between row and column clustering to overcome sparsity and noise

Co-clustering Example for Text Data: 

Co-clustering Example for Text Data document word word clusters document clusters Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix

Co-clustering and Information Theory: 

Co-clustering and Information Theory View 'co-occurrence' matrix as a joint probability distribution over row andamp; column random variables We seek a 'hard-clustering' of both rows and columns such that 'information' in the compressed matrix is maximized.

Information Theory Concepts: 

Information Theory Concepts Entropy of a random variable X with probability distribution p: The Kullback-Leibler (KL) Divergence or 'Relative Entropy' between two probability distributions p and q: Mutual Information between random variables X and Y:

“Optimal” Co-Clustering: 

'Optimal' Co-Clustering Seek random variables and taking values in {1, 2, …, k} and {1, 2, …, l} such that mutual information is maximized: where = R(X) is a function of X alone where = C(Y) is a function of Y alone

Related Work : 

Related Work Distributional Clustering Pereira, Tishby andamp; Lee (1993), Baker andamp; McCallum (1998) Information Bottleneck Tishby, Pereira andamp; Bialek(1999), Slonim, Friedman andamp; Tishby (2001), Berkhin andamp; Becher(2002) Probabilistic Latent Semantic Indexing Hofmann (1999), Hofmann andamp; Puzicha (1999) Non-Negative Matrix Approximation Lee andamp; Seung(2000)

Information-Theoretic Co-clustering: 

Information-Theoretic Co-clustering Lemma: 'Loss in mutual information' equals p is the input distribution q is an approximation to p Can be shown that q(x,y) is a maximum entropy approximation subject to cluster constraints.

Slide11: 


Slide12: 


Slide13: 


Slide14: 


Slide15: 

#parameters that determine q(x,y) are:

Decomposition Lemma: 

Decomposition Lemma Question: How to minimize ? Following Lemma reveals the Answer: Note that may be thought of as the 'prototype' of row cluster. Similarly,

Co-Clustering Algorithm: 

Co-Clustering Algorithm [Step 1] Set . Start with , Compute . [Step 2] For every row , assign it to the cluster that minimizes [Step 3] We have . Compute . [Step 4] For every column , assign it to the cluster that minimizes [Step 5] We have . Compute . Iterate 2-5.

Properties of Co-clustering Algorithm: 

Properties of Co-clustering Algorithm Main Theorem: Co-clustering 'monotonically' decreases loss in mutual information Co-clustering converges to a local minimum Can be generalized to multi-dimensional contingency tables q can be viewed as a 'low complexity' non-negative matrix approximation q preserves marginals of p, and co-cluster statistics Implicit dimensionality reduction at each step helps overcome sparsity andamp; high-dimensionality Computationally economical

Slide19: 


Slide20: 


Slide21: 


Slide22: 


Applications -- Text Classification: 

Applications -- Text Classification Assigning class labels to text documents Training and Testing Phases Document collection Class-1 Class-m Grouped into classes Training Data New Document Classifier (Learns from Training data) New Document With Assigned class

Feature Clustering (dimensionality reduction): 

Feature Clustering (dimensionality reduction) Feature Selection Feature Clustering Document Bag-of-words 1 m Vector Of words Select the 'best' words Throw away rest Frequency based pruning Information criterion based pruning Document Bag-of-words Vector Of words 1 m Cluster#1 Cluster#k Do not throw away words Cluster words instead Use clusters as features Word#1 Word#k

Experiments: 

Experiments Data sets 20 Newsgroups data 20 classes, 20000 documents Classic3 data set 3 classes (cisi, med and cran), 3893 documents Dmoz Science HTML data 49 leaves in the hierarchy 5000 documents with 14538 words Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt Implementation Details Bow – for indexing,co-clustering, clustering and classifying

Results (20Ng): 

Results (20Ng) Classification Accuracy on 20 Newsgroups data with 1/3-2/3 test-train split Divisive clustering beats feature selection algorithms by a large margin The effect is more significant at lower number of features

Results (Dmoz): 

Results (Dmoz) Classification Accuracy on Dmoz data with 1/3-2/3 test train split Divisive Clustering is better at lower number of features Note contrasting behavior of Naïve Bayes and SVMs

Results (Dmoz): 

Results (Dmoz) Naïve Bayes on Dmoz data with only 2% Training data Note that Divisive Clustering achieves higher maximum than IG with a significant 13% increase Divisive Clustering performs better than IG at lower training data

Slide29: 

Science Math Physics Social Science Number Theory Logic Mechanics Quantum Theory Economics Archeology Flat classifier builds a classifier over the leaf classes in the above hierarchy Hierarchical Classifier builds a classifier at each internal node of the hierarchy Hierarchical Classification

Results (Dmoz): 

Results (Dmoz) Hierarchical Classifier (Naïve Bayes at each node) Hierarchical Classifier: 64.54% accuracy at just 10 features (Flat achieves 64.04% accuracy at 1000 features) Hierarchical Classifier improves accuracy to 68.42 % from 64.42%(maximum) achieved by flat classifiers

Anecdotal Evidence: 

Anecdotal Evidence Top few words sorted in Clusters obtained by Divisive and Agglomerative approaches on 20 Newsgroups data

Co-Clustering Results (CLASSIC3): 

Co-Clustering Results (CLASSIC3)

Results – Binary (subset of 20Ng data): 

Results – Binary (subset of 20Ng data)

Precision – 20Ng data: 

Precision – 20Ng data

Results: Sparsity (Binary_subject data) : 

Results: Sparsity (Binary_subject data)

Results: Sparsity (Binary_subject data) : 

Results: Sparsity (Binary_subject data)

Results (Monotonicity): 

Results (Monotonicity)

Conclusions: 

Conclusions Information-theoretic approach to clustering, co-clustering and matrix approximation Implicit dimensionality reduction at each step to overcome sparsity andamp; high-dimensionality Theoretical approach has the potential of extending to other problems: Multi-dimensional co-clustering MDL to choose number of co-clusters Generalized co-clustering by Bregman divergence

More Information: 

More Information Email: inderjit@cs.utexas.edu Papers are available at: http://www.cs.utexas.edu/users/inderjit 'Divisive Information-Theoretic Feature Clustering for Text Classification', Dhillon, Mallela andamp; Kumar, Journal of Machine Learning Research(JMLR), March 2003 (also KDD, 2002) 'Information-Theoretic Co-clustering', Dhillon, Mallela andamp; Modha, KDD, 2003. 'Clustering with Bregman Divergences', Banerjee, Merugu, Dhillon andamp; Ghosh, SIAM Data Mining Proceedings, April, 2004. 'A Generalized Maximum Entropy Approach to Bregman Co-clustering andamp; Matrix Approximation', Banerjee, Dhillon, Ghosh, Merugu andamp; Modha, working manuscript, 2004.