slide 1: 3/22/2012
1
K-means Algorithm g
Cluster Analysis in Data Mining
Presented by Zijun Zhang
Algorithm Description
What is Cluster Analysis
Cluster analysis groups data objects based only on
information found in data that describes the objects and their
relationships.
Goal of Cluster Analysis
The objects within a group be similar to one another and jgp
different from the objects in other groups
slide 2: 3/22/2012
2
Algorithm Description
Types of Clustering
Partitioning and Hierarchical Clustering
Hierarchical Clustering
- A set of nested clusters organized as a hierarchical tree
Partitioning Clustering gg
- A division data objects into non-overlapping subsets
clusters such that each data object is in exactly one subset
Algorithm Description
p4
p1
p3
p2
A Partitional Clustering Hierarchical Clustering
slide 3: 3/22/2012
3
Algorithm Description
What is K-means
1. Partitional clustering approach
2. Each cluster is associated with a centroid center point
3. Each point is assigned to the cluster with the closest centroid
4 Number of clusters K must be specified 4. Number of clusters K must be specified
Algorithm Statement
Basic Algorithm of K-means
slide 4: 3/22/2012
4
Algorithm Statement
Details of K-means
1 Initial centroids are often chosen randomly 1. Initial centroids are often chosen randomly.
- Clusters produced vary from one run to another
2. The centroid is typically the mean of the points in the cluster.
3.‘Closeness’ is measured by Euclidean distance cosine similarity correlation
etc.
4. K-means will converge for common similarity measures mentioned above.
5. Most of the convergence happens in the first few iterations. 5. Most of the convergence happens in the first few iterations.
- Often the stopping condition is changed to ‘Until relatively few points
change clusters’
Algorithm Statement
Euclidean Distance
A simple example: Find the distance between two points the original
and the point 34
slide 5: 3/22/2012
5
Algorithm Statement
Update Centroid
We use the following equation to calculate the n dimensional We use the following equation to calculate the n dimensional
centroid point amid k n-dimensional points
Example: Find the centroid of 3 2D points 24 52
and 89 and 89
Example of K-means
Select three initial centroids
1
1.5
2
2.5
3
y
Iteration 1
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
x
slide 6: 3/22/2012
6
Example of K-means
Assigning the points to nearest K clusters and re-compute the
centroids
1
1.5
2
2.5
3
y
Iteration 3
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
x
Example of K-means
K-means terminates since the centroids converge to certain points
and do not change.
1
1.5
2
2.5
3
y
Iteration 6
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
x
slide 7: 3/22/2012
7
Example of K-means
2
2.5
3
Iteration 1
2
2.5
3
Iteration 2
2
2.5
3
Iteration 3
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
x
y
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
x
y
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
x
y
3
Iteration 4
3
Iteration 5
3
Iteration 6
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
x
y
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
x
y
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
x
y
Example of K-means
Demo of K-means
slide 8: 3/22/2012
8
Evaluating K-means Clusters
Most common measure is Sum of Squared Error SSE
For each point the error is the distance to the nearest cluster
To get SSE we square these errors and sum them To get SSE we square these errors and sum them.
x is a data point in cluster Ci and mi is the representative point for cluster
Ci
can show that mi corresponds to the center mean of the cluster
Given two clusters we can choose the one with the smallest error
K
iC x
i
i
x m dist SSE
1
2
Given two clusters we can choose the one with the smallest error
One easy way to reduce SSE is to increase K the number of clusters
A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
Problem about K
How to choose K
1. Use another clustering method like EM.
2. Run algorithm on data with several different values of K.
3. Use the prior knowledge about the characteristics of the problem.
slide 9: 3/22/2012
9
Problem about initialize centers
How to initialize centers
- Random Points in Feature Space
- Random Points From Data Set
- Look For Dense Regions of Space
- Space them uniformly around the feature space
Cluster Quality
slide 10: 3/22/2012
10
Cluster Quality
Limitation of K-means
K-means has problems when clusters are of
differing g
Sizes
Densities
Non-globular shapes
K hbl h hd i K-means has problems when the data contains
outliers.
slide 11: 3/22/2012
11
Limitation of K-means
Original Points
K-means 3 Clusters
Application of K-means
Image Segmentation
The k-means clustering algorithm is commonly used in
computer vision as a form of image segmentation. The
results of the segmentation are used to aid border detection
and object recognition.
slide 12: 3/22/2012
12
K-means in Wind Energy
Clustering can be applied to detect
b lit i i d d t b l abnormality in wind data abnormal
vibration
Monitor Wind Turbine Conditions
Beneficial to preventative maintenance
K means can be more powerful and K-means can be more powerful and
applicable after appropriate modifications
K-means in Wind Energy
Modified K-means
slide 13: 3/22/2012
13
K-means in Wind Energy
Clustering cost function
2
1
1
ji
k
ji
iC
dk
n
x
xc x c
1
k
i
i
nm
2 1
k
1
1
1
ji
ji k
iC
i
i
dk
m
x
xc x c
K-means in Wind Energy
Determination of k value
002
0.03
0.04
0.05
0.06
0.07
0.08
0.09
Cost of clustering
0
0.01
0.02
23 456789 10 11 12 13
Number of clusters
slide 14: 3/22/2012
14
K-means in Wind Energy
Summary of clustering result
No. of Cluster c
1
Drive train acc. c
2
Wind speed Number of points Percentage
1 71.9612 9.97514 313 8.75524
2 65.8387 9.42031 295 8.25175
3 233.9184 9.57990 96 2.68531
4 17.4187 7.13375 240 6.71329
5 3.3706 8.99211 437 12.22378
6 0.3741 0.40378 217 6.06993
7 18.1361 8.09900 410 11.46853
8 0.7684 10.56663 419 11.72028
9 62.0493 8.81445 283 7.91608
10 81.7522 10.67867 181 5.06294
11 83.8067 8.10663 101 2.82517
12 0.9283 9.78571 583 16.30769
K-means in Wind Energy
Visualization of monitoring result
slide 15: 3/22/2012
15
K-means in Wind Energy
Visualization of vibration under normal condition
14
4
6
8
10
12
14
Wind speed m/s
0
2
0 20 40 60 80 100 120 140
Drive train acceleration
Reference
1. Introduction to Data Mining P.N. Tan M. Steinbach V. Kumar Addison Wesley
2. An efficient k-means clustering algorithm: Analysis and implementation T. Kanungo D. M.
Mount N. Netanyahu C. Piatko R. Silverman and A. Y. Wu IEEE Trans. PatternAnalysis
and Machine Intelligence 24 2002 881-892
3. http://www.cs.cmu.edu/cga/ai-course/kmeans.pdf
4. http://www.cse.msstate.edu/url/teaching/CSE6633Fall08/lec1620k-means.pdf
slide 16: 3/22/2012
16
Appendix One
Original Points
K-means 2 Clusters
Appendix Two
Original Points K-means Clusters
One solution is to use many clusters.
Find parts of clusters but need to put together.