CS 264 Presentation

Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

k-means Clustering on MapReduce : 

k-means Clustering on MapReduce Jesse Cohen December 2009

k-means Clustering : 

k-means Clustering Partition data into groups (clusters) Used in machine learning, collaborative filtering etc…. Extremely large data sets Algorithm similar to expectation maximization Choose cluster centers and iteratively assign points and recalculate centroids

Parallelizing k-means on MapReduce : 

Parallelizing k-means on MapReduce Use MapReduce to calculate each iteration Map Phase: Input: Point Output: Closest Cluster ID, Point Reduce Phase: Output: Cluster ID, Centroid(Points) This requires a small amount of shared data (the k cluster centers)

Clustering 2-d Points(harder to visualize higher dimensional data) : 

Clustering 2-d Points(harder to visualize higher dimensional data)

Netflix Prize Dataset : 

Netflix Prize Dataset 100,000,000 Data Points Each point is a (User ID, Movie ID, Movie Rating) Run k-means to cluster movies based on user ratings Groups of movies rated similarly by users will be closer together

Netflix Prize Dataset : 

Netflix Prize Dataset MapReduce Version got approximately 10x speedup over traditional Bottleneck is still the data transfer Achieved performance speedup by using a combiner (i.e. local reducer) Movies with similar genres clustered together e.g. Batman, Batman Begins, Batman Returns, Batman and Robin