k-means Clustering :
k-means Clustering Partition data into groups (clusters)
Used in machine learning, collaborative filtering etc….
Extremely large data sets
Algorithm similar to expectation maximization
Choose cluster centers and iteratively assign points and recalculate centroids
Parallelizing k-means on MapReduce :
Parallelizing k-means on MapReduce Use MapReduce to calculate each iteration
Map Phase:
Input: Point
Output: Closest Cluster ID, Point
Reduce Phase:
Output: Cluster ID, Centroid(Points)
This requires a small amount of shared data (the k cluster centers)
Clustering 2-d Points(harder to visualize higher dimensional data) :
Clustering 2-d Points(harder to visualize higher dimensional data)
Netflix Prize Dataset :
Netflix Prize Dataset 100,000,000 Data Points
Each point is a (User ID, Movie ID, Movie Rating)
Run k-means to cluster movies based on user ratings
Groups of movies rated similarly by users will be closer together
Netflix Prize Dataset :
Netflix Prize Dataset MapReduce Version got approximately 10x speedup over traditional
Bottleneck is still the data transfer
Achieved performance speedup by using a combiner (i.e. local reducer)
Movies with similar genres clustered together
e.g. Batman, Batman Begins, Batman Returns, Batman and Robin