Clustering : By Babu Ram Dawadi
Clustering : A cluster is a collection of data objects in which the objects are similar to one another within the same cluster and dissimilar to the objects in other clusters
Cluster analysis is the process of finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters.
Clustering: Given a database D = {t1, t2, …, tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f: D -> {1, …, k} where each ti is assigned to one cluster Kj, 1 <= j <= k. Here, k is the number of clusters.
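The definition above can be made concrete with a short sketch, assuming dis is the Euclidean distance and each cluster is represented by a center point (the database D, the centers, and k = 2 below are hypothetical toy values):

```python
import math

def dis(a, b):
    """Euclidean distance between two objects given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A clustering is a mapping f: D -> {1, ..., k}.
# Hypothetical 2-D database D and k = 2 cluster representatives.
D = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0)]
centers = [(1.1, 0.9), (8.0, 9.0)]

def f(t):
    """Assign object t to the (1-based) index of its nearest center."""
    return min(range(len(centers)), key=lambda j: dis(t, centers[j])) + 1

assignments = [f(t) for t in D]   # -> [1, 1, 2]
```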
Examples of Clustering Applications : Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Categories of Clustering : Partition-based
Hierarchical
Density-based
Grid-based
Model-based
Partitioning Algorithms: Basic Concept : Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method : Choose k, the number of clusters to be determined
Choose k objects randomly as the initial cluster centers
Repeat
Assign each object to its closest cluster center, using Euclidean distance
Compute new cluster centers by calculating the mean point of each cluster
Until
No change in cluster centers, or
No object changes its cluster
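The loop above can be sketched in plain Python (a minimal implementation, assuming objects are numeric tuples; the sample data and k below are hypothetical):

```python
import random

def k_means(points, k, max_iters=100):
    """Plain k-means following the steps above."""
    centers = random.sample(points, k)             # k random initial centers
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each object goes to its closest center
        # (squared Euclidean distance yields the same closest center).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: new center = mean point of each cluster.
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                 # no change: converged
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data with two obvious groups:
centers, clusters = k_means(
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)
```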
The K-Means Clustering Method : (figures showing successive iterations of k-means on a sample dataset)
Weakness of K-means : Applicable only when the mean is defined; then what about categorical data?
Need to specify K, the number of clusters, in advance
run the algorithm with different K values
Unable to handle noisy data and outliers
Works best when clusters are of approximately equal size
Hierarchical Clustering : Clustering comes in the form of a tree (a dendrogram) visualizing how data contribute to individual clusters
Clustering is realized in a successive manner through:
successive splits, or
successive aggregations
Hierarchical Clustering : Provides a graphical illustration of relationships between the data in the form of a dendrogram
The dendrogram is a binary tree
Two fundamental approaches:
Bottom-up (agglomerative approach)
Top-down (divisive approach)
Hierarchical Clustering: Types : Agglomerative (bottom-up):
starts with as many clusters as there are records, each cluster containing only one record. Pairs of clusters are then successively merged until the number of clusters reduces to k.
at each stage, the pair of clusters that are nearest to each other is merged. If the merging is continued, it terminates in a hierarchy of clusters built with just a single cluster containing all the records.
Divisive (top-down): takes the opposite approach from the agglomerative techniques. It starts with all the records in one cluster, and then tries to split that cluster into smaller pieces.
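The agglomerative procedure can be sketched as follows. The linkage rule is an assumption (the slides do not fix one); this sketch uses single linkage on hypothetical 2-D records:

```python
def agglomerative(points, k):
    """Bottom-up clustering sketch: start with one cluster per record,
    repeatedly merge the nearest pair until only k clusters remain."""
    clusters = [[p] for p in points]            # each record is its own cluster

    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(d2(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        # Find the nearest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical records with two natural groups:
groups = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
```

Recording the merge order instead of stopping at k would yield the dendrogram described above.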
Hierarchical Clustering : (figure contrasting top-down and bottom-up construction)
Hierarchical methods : Agglomerative methods start with each object in the data forming its own cluster, and then successively merge the clusters until one large cluster is formed (that encompasses the entire dataset)
Divisive methods start by considering the entire data as one cluster and then split up the cluster(s) until each object forms one cluster
(figure: agglomerative and divisive dendrograms; cutting the tree can remove an outlier)
Density-Based Clustering Methods : Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al. (SIGMOD’99)
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering: Background : The basic terms
The neighbourhood of an object enclosed in a circle with radius Eps is called the Eps-neighbourhood of that object
An object whose Eps-neighbourhood contains at least a minimum number of points (MinPts) is called a core object
An object A from a dataset is directly density-reachable from object B if A is a member of the Eps-neighbourhood of B and B is a core object
Density-Based Clustering : Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
Density-connected
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of Applications with Noise : Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
DBSCAN: The Algorithm : Arbitrarily select a point p
Retrieve all points density-reachable from p wrt Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
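The steps above can be sketched in Python. This is a naive O(n²) version without the spatial indexing a real implementation would use, and the eps/min_pts values in the usage line are hypothetical:

```python
def dbscan(points, eps, min_pts):
    """DBSCAN sketch following the steps above.
    Returns one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        # Eps-neighbourhood of point i (includes i itself).
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)               # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:                # not a core point: noise for now
            labels[i] = -1
            continue
        cluster_id += 1                         # i is a core point: new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:                            # expand via density-reachability
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id          # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nj = neighbours(j)
            if len(nj) >= min_pts:              # j is a core point too: expand
                queue.extend(nj)
    return labels

# Hypothetical data: a dense 2x2 block plus one isolated noise point.
labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)],
                eps=1.5, min_pts=3)
```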
Grid-Based Clustering Method : Uses a multi-resolution grid data structure
Several interesting methods
STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
A multi-resolution clustering approach using wavelet method
CLIQUE: Agrawal, et al. (SIGMOD’98)
Grid-Based Clustering (© 2007 Cios / Pedrycz / Swiniarski / Kurgan) : Describes the structure of data in the language of generic geometric constructs: hyperboxes and their combinations
Collection of clusters of different geometry
Formation of clusters by merging adjacent hyperboxes of the grid
Grid-Based Clustering Steps : Formation of the grid structure
Insertion of data into the grid structure
Computation of the density index of each hyperbox of the grid structure
Sorting the hyperboxes with respect to the values of their density index
Identification of cluster centers (viz. the hyperboxes of the highest density)
Traversal of neighboring hyperboxes and merging process
Choice of the grid:
a too-coarse grid may not capture the details of the structure in the data.
a too-fine grid produces a significant computational overhead.
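The steps above can be sketched for the 2-D case. The cell size and density threshold are hypothetical parameters, the density index is simply the point count per cell, and adjacency is taken as the 8-neighbourhood of a cell:

```python
from collections import defaultdict

def grid_cluster(points, cell_size, density_threshold):
    """Grid-based clustering sketch: hash points into cells, keep the
    dense cells, and merge adjacent dense cells into clusters."""
    # Steps 1-2: form the grid structure and insert the data.
    grid = defaultdict(list)
    for p in points:
        cell = (int(p[0] // cell_size), int(p[1] // cell_size))
        grid[cell].append(p)

    # Step 3: density index of each cell = number of points it holds.
    dense = {c for c, pts in grid.items() if len(pts) >= density_threshold}

    # Steps 5-6: grow clusters by traversing neighbouring dense cells.
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        stack, cluster = [cell], []
        seen.add(cell)
        while stack:
            cx, cy = stack.pop()
            cluster.extend(grid[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(cluster)
    return clusters

# Hypothetical data: two dense regions and one sparse point that is dropped.
clusters = grid_cluster(
    [(0.1, 0.1), (0.5, 0.5), (3.1, 3.1), (3.5, 3.5), (9.0, 9.0)],
    cell_size=1.0, density_threshold=2)
```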
STING: A Statistical Information Grid : The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different levels of resolution
STING: A Statistical Information Grid : Each cell at a high level is partitioned into a number of smaller cells in the next lower level
Statistical info of each cell is calculated and stored beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated from parameters of lower level cell
count, mean, standard deviation (s), min, max
type of distribution—normal, uniform, etc.
For each cell in the current level compute the confidence interval
STING: A Statistical Information Grid : Remove the irrelevant cells from further consideration
When finished examining the current layer, proceed to the next lower level
Repeat this process until the bottom layer is reached
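The claim that higher-level cell parameters can be computed from lower-level cells can be illustrated with a minimal roll-up sketch for count, mean, min, and max (combining the standard deviation and distribution type is omitted, and at least one child cell is assumed to be non-empty):

```python
def parent_stats(children):
    """Roll up STING cell statistics from child cells to their parent
    without touching the raw data. Each child is a dict holding its
    precomputed count, mean, min, and max."""
    count = sum(c["count"] for c in children)
    # Weighted mean: each child's mean weighted by its point count.
    mean = sum(c["count"] * c["mean"] for c in children) / count
    return {
        "count": count,
        "mean": mean,
        # Empty children (count == 0) contribute nothing to min/max.
        "min": min(c["min"] for c in children if c["count"]),
        "max": max(c["max"] for c in children if c["count"]),
    }
```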
Advantages:
Query-independent, easy to parallelize, incremental update
O(K), where K is the number of grid cells at the lowest level
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
Model-Based Clustering Methods : Attempt to optimize the fit between the data and some mathematical model
Statistical and AI approaches
COBWEB (Fisher’87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Clustering Method : A classification tree
Summary : Cluster analysis groups objects based on their similarity and has wide applications
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud detection, etc. and can be performed by statistical, distance-based or deviation-based approaches