Automatic Subspace Clustering Of High Dimensional Data For Data Mining Application: Automatic Subspace Clustering Of High Dimensional Data For Data Mining Application Author Rakesh Agrawal , Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan CLIQUE Prepared by : Raed T Aldahdooh Supervised : Dr. Wesam Ashour
Schedule: Introduction Motivation Contributions Of The Paper Subspace Clustering CLIQUE (Clustering in Quest ) Performance Experiments Conclusions Schedule
Introduction: Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98) CLIQUE can be considered as both density-based and grid-based Clustering high-dimensional data. Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space Introduction
Major challenges of Clustering High-Dimensional Data: Many irrelevant dimensions may mask clusters. Distance measure becomes meaningless—due to equi -distance. Clusters may exist only in some subspaces. Major challenges of Clustering High-Dimensional Data
The Curse of Dimensionality (graphs adapted from Parsons et al. KDD Explorations 2004): Only data in one dimension is relatively packed. Adding a dimension “stretch” the points across that dimension, making them further apart. Density decrease dramatically. Distance measure becomes meaningless —due to equi -distance. The Curse of Dimensionality (graphs adapted from Parsons et al. KDD Explorations 2004)
Clustering in High Dimensional Space: Methods Feature transformation: only effective if most dimensions are relevant PCA “ Principal component analysis ” & SVD “ Singular value decomposition ” useful only when features are highly correlated/redundant Feature selection: wrapper or filter approaches useful to find a subspace where the data have nice clusters Subspace-clustering: find clusters in all the possible subspaces CLIQUE , ProClus , and frequent pattern-based clustering Clustering in High Dimensional Space
Motivation: The need for developing new algorithms Effective treatment of high dimensionality: To effectively extract information from a huge amount of data in databases. In other words. The running time of algorithms must be predictable and usable in large database. Interpretability of results: User expect clustering results in the high dimensional data to be interpretable, comprehensible. Scalability and usability: Many clustering algorithms don’t well in a large database may contain millions of objects, Clustering on a sample of a given data set may lead to biased results. In other words, The clustering technique should be fast and scale with the number of dimensions and the size of input and insensitive to the order of input data. Motivation
Contributions of the paper: CLIQUE satisfies the above desiderata ( Effective , interpretability, Scalability and Usability ). CLIQUE can automatically finds subspaces with high density clusters. CLIQUE generates a minimal description for each cluster in DNF expressions. Empirical evaluation shows that CLIQUE scales linearly with the number of input records and has good scalability as the number of dimension in the dimensionality of the hidden cluster . Contributions of the paper
DNF Expression: a disjunctive normal form (DNF) is a standardization (or normalization) of a logical formula which is a disjunction of conjunctive clauses. A disjunction of conjunctions where every variable or its negation is represented once in each conjunction ( a minterm ) each minterm appears only once Example: DNF of p q is (p q) ( p q ). DNF Expression
Why Subspace Clustering? (adapted from Parsons et al. SIGKDD Explorations 2004) : Why Subspace Clustering? (adapted from Parsons et al. SIGKDD Explorations 2004) Clusters may exist only in some subspaces. Subspace-clustering: find clusters in all the subspaces.
Subspace Clustering: Subspace Clustering What’s ( a)unit (b)dense unit ( c)a cluster ( d)a minimal description of a cluster. In Figure 1,the two dim space(age , salary ) has been partitioned by a 10x10 grid. ξ =10 The unit u=(30≤age<35) Λ (1≤salary<2) A and B are both region A=(30≤ age<50) Λ (4≤salary<8) B =(40≤age<60) Λ (2≤salary<6) Assuming the dense units have been shaded , AUB is a cluster( A,B are connected regions) A∩B is not a maximal region. The minimal description for this cluster AUB is the DNF expression ： ( (30≤ age<50) Λ (4≤salary<8)) v ( (40≤age<60) Λ (2≤salary<6 ))
: In Figure2. Assuming T=20 % ( density threshold _ 3 point ) If selectivity(u)>T then u is a dense unit. Where selectivity in the fraction of total data points contained in the unit . No 2-dimen unit is dense and there are no clusters In the original data space. Subspace Clustering The points are projected on the salary dimension , there are three 1-dim dense units, and there are two clusters in the 1-dim salary subspace, C=(5≤salary<7 )and D=(2≤salary<3) But there is no dense unit and cluster in 1-dim age subspace
CLIQUE Algorithms: 3. Generation of minimal description for the clusters . CLIQUE consists of the following three steps ： Identification of subspace that contain clusters. Identification of clusters . Generation of minimal description for the clusters . CLIQUE Algorithms Title in here 2. Identification of clusters. 1. Identification of subspace that contain clusters. CLIQUE consists of the following three steps ：
Identification of subspaces: Downward closure (DC) property : If a cluster is satisfied in a k -dimensional space, it is also satisfied in all of its ( k -1) -dimensional subspaces. Due to the DC property, identification of subspaces is carried out in an iterative bottom-up fashion (from lower to higher dimensional subspaces). Identification of subspaces
Identification of subspace that contain cluster: The difficulty in identifying subspaces that contain clusters lies in finding dense units in different subspaces. A. using a bottom-up algorithm to find dense units that exploits the monotonicity of the clustering criterion with respect to dimensionality to prune the search space. Lemma1 (monotonicity) ： If k-dim unit is dense ,then so are it’s projections in (k-1)-dim space . The bottom-up algorithm process Determines 1-dim dense unit and interaction(self-join) to get 2-dim dense unit. Until having (k-1)dim dense units, We can self-join D K-1 to get the candidate k-dim units. we discard those dense units from Ck which have a projection (k-1 )-dim that isn't included in Ck-1 . B. Making the bottom-up algorithm faster with MDL-base pruning. Identification of subspace that contain cluster
Identification of subspaces: A. Determination of dense units Determine the set D1 of all one-dimensional dense units. k=1 While D k ≠ do k=k+1 Determine the set D k as the set of all the k-dimensional dense units all of whose (k-1)-dimensional projections, belong to Dk-1. End while Identification of subspaces
Identification of subspaces: B. Determination of high coverage subspaces . Determine all the subspaces that contain at least one dense unit. Sort these subspaces in descending order according to their coverage (fraction of the num. of points of the original data set they contain). Optimize a suitably defined Minimum Description Length criterion function and determine a threshold under which a coverage is considered “low”. Select the subspaces with “high” coverage. Identification of subspaces
Finding Clusters: The input to the step of Finding Clusters is a set of dense units D all in the same k-dim space. Depth-first search algorithm Using a Depth –first search algorithm to find the connected components of the graph , By starting with some U in D, Assign it the first cluster number and find all the units it is connected to, then if there still are units in D that have not yet been visited , we find one and repeat the procedure. Finding Clusters
Identification of clusters: For each high coverage subspace S do Consider the set E of all the dense units in S. While E ≠ m´ =1 Select a randomly chosen unit u from E. Assign to Cm´, u and all units of E that are connected to u. E=E-Cm´ End while End for Identification of clusters
Generating minimal cluster descriptions: Generating minimal cluster descriptions The input to this step consists of disjoint clusters in k-dim subspace. The goal is to generate a minimal description of each cluster with two steps: Covering with maximal region. Minimal cover .
Slide 21: The CLIQUE Algorithm (cont.) 3. Minimal description of clusters The minimal description of a cluster C, produced by the Last procedure, is the minimum possible union of hyper rectangular regions. For example A B is the minimum cluster description of the shaded region. C D E is a non-minimal cluster description of the same region.
Slide 22: The CLIQUE Algorithm (cont.) 3. Minimal description of clusters (algorithm) For each cluster C do 1 st stage c =0 While C ≠ c = c +1 Choose a dense unit in C For i =1 to l Grow the unit in both directions along the i -th dimension, trying to cover as many units in C as possible (boxes that are not belong to C should not be covered). End for Define the set I containing all the units covered by the above procedure C=C - I End while 2 nd stage Remove all covers whose units are covered by at least another cover.
How to obtain a maximal region covering a dense unit U: How to obtain a maximal region covering a dense unit U
Example: A two dimensional grid of lines of edge size ξ applied in the two-dimensional feature space. Two-dimensional and one-dimensional units are defined: u i q denotes the i -th one dimensional unit along x q u ij denotes the two dimensional unit resulting from the Cartesian product of the i -th and j - th intervals along x 1 and x 2 , respectively. ξ =1 0 and τ =8% (thus, each unit containing more than 5 points is considered to be dense). Example
Slide 25: The points in u 48 and u 58 , u 75 and u 76 , u 83 and u 93 are collinear.
1- Identification of subspaces: One-dimensional dense units: D 1 ={ u 2 1 , u 3 1 , u 4 1 , u 5 1 , u 8 1 , u 9 1 , u 1 2 , u 2 2 , u 3 2 , u 5 2 , u 6 2 } Two-dimensional dense units: D 2 ={ u 21 , u 22 , u 32 , u 33 , u 83 , u 93 } 1- Identification of subspaces Notes: Although each one of the u 48 , u 75 , u 76 contains more that 5 points, they are not included in D 2 . Although it seems unnatural for u 83 and u 93 to be included in D 2 , they are included since u 3 2 is dense. All subspaces of the two-dimensional space contain clusters.
2- Identification of cluster: One-dimensional clusters: C 1 ={ u 2 1 , u 3 1 , u 4 1 , u 5 1 } C 2 ={ u 8 1 , u 9 1 } C 3 ={ u 1 2 , u 2 2 , u 3 2 } C 4 ={ u 5 2 , u 6 2 } Two-dimensional clusters: C 5 ={ u 21 , u 22 , u 32 , u 33 } C 6 ={ u 83 , u 93 } 2- Identification of cluster One-dimensional dense units: D 1 ={ u 2 1 , u 3 1 , u 4 1 , u 5 1 , u 8 1 , u 9 1 , u 1 2 , u 2 2 , u 3 2 , u 5 2 , u 6 2 } Two-dimensional dense units: D 2 ={ u 21 , u 22 , u 32 , u 33 , u 83 , u 93 }
3- Description of cluster: C 1 ={( x 1 ): 1 x 1 <5} C 2 ={( x 1 ): 7 x 1 <9} C 3 ={( x 2 ): 0 x 2 <3} C 4 ={( x 2 ): 4 x 2 <6} C 5 ={( x 1 , x 2 ): 1 x 1 <2 , 0 x 2 <2}{( x 1 , x 2 ): 2 x 1 <3 , 1 x 2 <3} C 6 ={( x 1 , x 2 ): 7 x 1 <9 , 2 x 2 <3} Note that C 2 and C 6 are essentially the same cluster, which is reported twice by the algorithm. 3- Description of cluster
Performance Experiments: We now empirically evaluate CLIQUE using synthetic data (Generator from M.Zait and H.Messatfa . a comparative study of clustering methods) The goals of the experiments are to assess the efficiency of CLIQUE ： Efficiency ： Determine how the running time scales with Dimensionality of the data space. Dimensionality of clusters. Size of data. Accuracy ： Test if CLIQUE recovers known clusters in some subspaces of a high dimensional data space. Performance Experiments
Synthetic data results: Synthetic data results
Accuracy：Comparisons with BIRCH, DBSCAN: Accuracy ： Comparisons with BIRCH, DBSCAN Using clusters embedded in 5-dim subspaces while varying the dimensional of the space from 5 to50. CLIQUE was able to recover all clusters in every case.
Strength and Weakness of CLIQUE: Strength automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces insensitive to the order of records in input and does not presume some canonical data distribution scales linearly with the size of input and has good scalability as the number of dimensions in the data increases Weakness The accuracy of the clustering result may be degraded at the expense of simplicity of the method Strength and Weakness of CLIQUE
Conclusions: T he problem of high dimensionality is often tackled by requiring the user to specify the subspace for cluster analysis . But user-identification of quite error-prone . CLIQUE can find clusters embedded in subspaces of high dimensional data without requiring the user to guess subspaces that might have interesting clusters. CLIQUE generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. CLIQUE is insensitive to the order of input records , Some clustering algorithms are sensitive to the order of input data. Empirical evolution shows that CLIQUE scales linearly with the size of input and has good scalability as the number of dimension in the data. CLIQUE can accurately discover clusters embedded in lower dimensional subspaces. Conclusions