Chapter 3.3: Clustering



Presentation Transcript

Clustering

By: Babu Ram Dawadi

Clustering

A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis is the process of finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.

Formally: given a database D = {t1, t2, ..., tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f: D -> {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k. Here k is the number of clusters.
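As a concrete illustration of this definition, the mapping f can be computed directly once dis is fixed; the sketch below uses Euclidean distance and a hypothetical toy database D with two made-up cluster centers (all data here is illustrative, not from the slides):

```python
import math

# Hypothetical toy database D of 2-D objects, and k = 2 fixed cluster centers.
D = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
centers = [(1.0, 1.5), (8.5, 8.0)]

def dis(a, b):
    # Euclidean distance, playing the role of dis(ti, tj).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The mapping f: D -> {1, ..., k}; each object ti goes to its nearest center Kj.
f = {t: min(range(len(centers)), key=lambda j: dis(t, centers[j])) + 1
     for t in D}

print(f)  # each object mapped to cluster 1 or 2
```

With these centers, the two points near the origin map to cluster 1 and the two points near (8, 8) map to cluster 2.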

Examples of Clustering Applications

- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth-observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Categories of Clustering

- Partition-based
- Hierarchical
- Density-based
- Grid-based
- Model-based

Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters. Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.

Heuristic methods: the k-means and k-medoids algorithms.
- k-means (MacQueen '67): each cluster is represented by the center of the cluster
- k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

1. Choose k, the number of clusters to be determined
2. Choose k objects randomly as the initial cluster centers
3. Repeat:
   - Assign each object to its closest cluster center, using Euclidean distance
   - Compute new cluster centers as the mean point of each cluster
4. Until no cluster center changes and no object changes its cluster
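The steps above can be sketched in a few lines of Python; this is a minimal illustration on made-up data, not a production implementation (real data, k, and the seed are assumptions):

```python
import math
import random

def kmeans(points, k, seed=0):
    """Minimal k-means sketch following the steps above (Euclidean distance)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: k random initial centers
    while True:
        # Step 3a: assign each object to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Step 3b: recompute each center as the mean point of its cluster.
        new_centers = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # step 4: centers no longer change
            return new_centers, clusters
        centers = new_centers

# Two well-separated hypothetical blobs of points.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(pts, k=2)
```

On this toy data the algorithm converges to the two blob means regardless of which two points are drawn as initial centers.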

K-Means Clustering

[Slides 7-11 step through a worked K-Means clustering example, shown as figures.]

Weakness of K-Means

- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance (in practice, run the algorithm with different k values)
- Unable to handle noisy data and outliers
- Works best when clusters are of approximately equal size

Hierarchical Clustering

- Clustering comes in the form of a tree, a dendrogram, visualizing how data contribute to individual clusters
- Clustering is realized in a successive manner, through successive splits or successive aggregations

Hierarchical Clustering

- Provides a graphical illustration of relationships between the data in the form of a dendrogram
- The dendrogram is a binary tree
- Two fundamental approaches: bottom-up (agglomerative) and top-down (divisive)

Hierarchical Clustering: Types

Agglomerative (bottom-up): starts with as many clusters as there are records, each cluster having only one record. Pairs of clusters are then successively merged until the number of clusters reduces to k; at each stage, the pair of clusters that are nearest to each other is merged. If the merging is continued, it terminates in a hierarchy of clusters built with just a single cluster containing all the records.

Divisive (top-down): takes the opposite approach from the agglomerative techniques. It starts with all the records in one cluster, and then tries to split that cluster into smaller pieces.
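The agglomerative process can be sketched directly: start with singleton clusters and repeatedly merge the nearest pair. This toy version uses single-link (closest-pair) distance, one common choice among several; the data is made up for illustration:

```python
import math

def agglomerative(points, k):
    """Bottom-up sketch: start with one cluster per record, repeatedly
    merge the two nearest clusters until only k remain."""
    clusters = [[p] for p in points]          # every record is its own cluster

    def single_link(a, b):
        # Cluster distance = distance of the closest pair (single link).
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # Find the nearest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(pts, k=3)
```

Continuing the loop down to k = 1 would produce the full dendrogram hierarchy described above.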

Hierarchical Clustering

[Slide 16: figure contrasting top-down and bottom-up hierarchical clustering.]

Hierarchical Methods

- Agglomerative methods start with each object in the data forming its own cluster, and then successively merge the clusters until one large cluster is formed that encompasses the entire dataset
- Divisive methods start by considering the entire data as one cluster and then split up the cluster(s) until each object forms its own cluster

[Slide 18: figure illustrating agglomerative vs. divisive clustering and outlier removal.]

Density-Based Clustering Methods

Clustering based on density (a local cluster criterion), such as density-connected points.

Major features:
- Discovers clusters of arbitrary shape
- Handles noise
- One scan
- Needs density parameters as a termination condition

Several interesting studies:
- DBSCAN: Ester et al. (KDD'96)
- OPTICS: Ankerst et al. (SIGMOD'99)
- DENCLUE: Hinneburg & Keim (KDD'98)
- CLIQUE: Agrawal et al. (SIGMOD'98)

Density-Based Clustering: Background

The basic terms:
- The neighbourhood of an object enclosed in a circle with radius Eps is called the Eps-neighbourhood of that object
- An object whose Eps-neighbourhood contains at least a minimum number of points (MinPts) is called a core object
- An object A in a dataset is directly density-reachable from an object B if A is a member of the Eps-neighbourhood of B and B is a core object

Density-Based Clustering

- Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that each pi+1 is directly density-reachable from pi
- Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
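The basic terms translate almost literally into code. This sketch checks Eps-neighbourhoods, core objects, and direct density-reachability on a hypothetical point set (the data and parameter values are illustrative assumptions):

```python
import math

def eps_neighbourhood(points, p, eps):
    """All objects within radius eps of p: the Eps-neighbourhood of p."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """p is a core object if its Eps-neighbourhood holds at least MinPts points."""
    return len(eps_neighbourhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, a, b, eps, min_pts):
    """a is directly density-reachable from b: a lies in b's
    Eps-neighbourhood and b is a core object."""
    return a in eps_neighbourhood(points, b, eps) and is_core(points, b, eps, min_pts)

pts = [(0, 0), (0, 1), (1, 0), (5, 5)]
```

With Eps = 1.5 and MinPts = 3, (0, 0) is a core object (its neighbourhood holds three points), while the isolated point (5, 5) is not.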

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise

DBSCAN: The Algorithm

1. Arbitrarily select a point p
2. Retrieve all points density-reachable from p wrt. Eps and MinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
5. Continue the process until all of the points have been processed
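A minimal sketch of these steps, assuming Euclidean distance and a brute-force neighbourhood query (real implementations use spatial indexes); the test data is made up:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch following the steps above.
    Returns a label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = {p: None for p in points}
    cluster_id = -1

    def region_query(p):
        # Brute-force Eps-neighbourhood query.
        return [q for q in points if math.dist(p, q) <= eps]

    for p in points:
        if labels[p] is not None:
            continue                          # already processed
        neighbours = region_query(p)
        if len(neighbours) < min_pts:
            labels[p] = -1                    # border point or noise for now
            continue
        cluster_id += 1                       # p is a core point: new cluster
        labels[p] = cluster_id
        seeds = [q for q in neighbours if q != p]
        while seeds:                          # expand via density-reachability
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id        # noise turns out to be a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbours = region_query(q)
            if len(q_neighbours) >= min_pts:  # q is also a core point: keep growing
                seeds.extend(q_neighbours)
    return labels

# Two hypothetical dense blobs plus one isolated noise point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (8, 8), (8, 9), (9, 8), (9, 9), (4, 4)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Here the two blobs become two clusters of arbitrary shape and (4, 4) stays labelled as noise, matching the behaviour described above.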

Grid-Based Clustering Method

Uses a multi-resolution grid data structure.

Several interesting methods:
- STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
- WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
- CLIQUE: Agrawal et al. (SIGMOD'98)

Grid-Based Clustering : 

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 25 Grid-Based Clustering Describe structure of data in the language of generic geometric constructs – hyperboxes and their combinations Collection of clusters of different geometry Formation of clusters by merging adjacent hyperboxes of the grid

Grid-Based Clustering Steps

1. Formation of the grid structure
2. Insertion of data into the grid structure
3. Computation of the density index of each hyperbox of the grid structure
4. Sorting of the hyperboxes with respect to the values of their density index
5. Identification of cluster centers (viz. the hyperboxes of highest density)
6. Traversal of neighboring hyperboxes and merging

Choice of the grid: too rough a grid may not capture the details of the structure in the data; too detailed a grid produces a significant computational overhead.
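The six steps above can be sketched for a 2-D square grid; the cell size, density threshold, and data below are all illustrative assumptions, and merging is simplified to edge-adjacent cells:

```python
from collections import Counter

def grid_clusters(points, cell, density_threshold):
    """Sketch of the grid-based steps: bin points into square cells,
    keep the dense cells, and merge edge-adjacent dense cells into clusters."""
    # Steps 1-3: form the grid, insert the data, compute each cell's density index.
    density = Counter((int(x // cell), int(y // cell)) for x, y in points)
    dense = {c for c, n in density.items() if n >= density_threshold}
    clusters, seen = [], set()
    # Step 4-5: visit hyperboxes densest-first as candidate cluster centers.
    for start in sorted(dense, key=density.get, reverse=True):
        if start in seen:
            continue
        # Step 6: traverse neighbouring dense cells and merge them.
        stack, cluster = [start], set()
        while stack:
            cx, cy = stack.pop()
            if (cx, cy) in seen or (cx, cy) not in dense:
                continue
            seen.add((cx, cy))
            cluster.add((cx, cy))
            stack += [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]
        clusters.append(cluster)
    return clusters

pts = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4), (5.1, 5.2), (5.3, 5.4)]
found = grid_clusters(pts, cell=1.0, density_threshold=2)
```

Making `cell` smaller here shows the trade-off stated above: finer grids resolve more detail but create many more cells to process.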

STING: A Statistical Information Grid

- The spatial area is divided into rectangular cells
- There are several levels of cells, corresponding to different levels of resolution

STING: A Statistical Information Grid

- Each cell at a high level is partitioned into a number of smaller cells at the next lower level
- Statistical information for each cell is calculated and stored beforehand, and is used to answer queries
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, standard deviation (s), min, max, and the type of distribution (normal, uniform, etc.)
- For each cell in the current level, compute the confidence interval
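The bottom-up parameter propagation can be illustrated for count, mean, min, and max (standard deviation and distribution type combine similarly but need more bookkeeping); the child-cell values below are made up:

```python
def merge_cells(children):
    """Sketch of STING's parameter propagation: a higher-level cell's
    statistics are derived from its lower-level child cells.
    Each child is a dict with keys: count, mean, min, max."""
    n = sum(c["count"] for c in children)
    return {
        "count": n,
        # The parent's mean is the count-weighted mean of the children's means.
        "mean": sum(c["count"] * c["mean"] for c in children) / n,
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }

# Two hypothetical lower-level cells merged into their parent cell.
children = [
    {"count": 2, "mean": 1.0, "min": 0.0, "max": 2.0},
    {"count": 2, "mean": 3.0, "min": 2.5, "max": 3.5},
]
parent = merge_cells(children)
```

Because each level is computed once from the level below, queries can later be answered from the stored statistics without rescanning the data.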

STING: A Statistical Information Grid

- Remove the irrelevant cells from further consideration
- When finished examining the current layer, proceed to the next lower level
- Repeat this process until the bottom layer is reached

Advantages: query-independent, easy to parallelize, incremental update; O(K), where K is the number of grid cells at the lowest level.

Disadvantage: all cluster boundaries are either horizontal or vertical, so no diagonal boundary is detected.

Model-Based Clustering Methods

- Attempt to optimize the fit between the data and some mathematical model
- Statistical and AI approaches
- COBWEB (Fisher '87):
  - A popular and simple method of incremental conceptual learning
  - Creates a hierarchical clustering in the form of a classification tree
  - Each node refers to a concept and contains a probabilistic description of that concept

COBWEB Clustering Method

[Figure: a classification tree.]

Summary

- Cluster analysis groups objects based on their similarity and has wide applications
- Clustering algorithms can be categorized into partitioning, hierarchical, density-based, grid-based, and model-based methods
- Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
