spectral slides1

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Clustering: 

Clustering

Example application: posterizing: 

Example application: posterizing Lots of pixels, many colors Want to pick just a few colors Solution: treat RGB triples as points in R3 and cluster Use center points of clusters as new colors

Posterization problems: 

Posterization problems The “distance” in RGB space… Sqrt[ (r1 - r2 ) 2 + (g1 - g2 ) 2 + (b1 - b2 ) 2 ] … is not “perceptually uniform” Distance of 0.2 in one area (black-to-grey distance, for example) may seem much larger than another (yellow-to-yellow/green) Approach ignores pixel adjaceny in the image One solution: cluster in R5 = (x,y,r,g,b)

Tissue Classification: 

Tissue Classification

Problems: 

Problems Not really a “clustering” problem, although similar tissues tend to be clustered Fundamentally a mixture-model: a pixel contains both bone and soft-tissue, for example.

Friendship nets: Facebook: 

Friendship nets: Facebook Given facebook data… Construct “clusters of friends”

Problems: 

Problems No coordinates Information about “closeness” is 0/1 (“have you friended me yet???”)

Netflix: 

Netflix

Approach: 

Approach N = number of movies Netflix has My coordinates = (1,0,0,1,1,0,… where xi = 1 means “I liked movie I” Now finding clusters lets Netflix make recommendations

Problems: 

Problems My coordinates really look like this: (*,*,0,*,*,*,…,*,1,1,*,*,…) with “*” meaning “Never seen the movie and don’t know.” Even if we did know all my coordinates, the problem lies in {0,1}N rather than RN; is our euclidean intuition really appropriate?

Document classification: 

Document classification Could represent a document by a vector (0,1,0,…) representing whether each English word (aardvark, and, anchovy, …) occurs in the doc Clusters represent “similar topics”

Problems: 

Problems Non-isotropic distance: two documents having “the” in common are far less likely to be similar than two with “aardvark” in common Really need a distance metric that compensates for this before applying clustering

Conclusion: 

Conclusion Clustering seems to have a lot of interesting applications… But it’s important, before starting, to have an embedding of your data in RN where distance in RN is really related to distance between items (at least for small distances!)

A first clustering algorithm: 

A first clustering algorithm Assuming data that’s really pretty well clustered…how do you find the clusters? Intro to K-means

Distance vs. Cluster distance: 

Distance vs. Cluster distance

Conclusion: 

Conclusion We might want to use a distance, D, to indicate how much two things are “in the same cluster” Tempting to write D(pi, pk) Really needs to be D({p1, p2, …}, pi, pk) One view of clustering is that we want to use euclidean distance, d, to bootstrap discovery of cluster distance, D. K-means works when d and D are very similar.

How are distance and “cluster distance” related: 

How are distance and “cluster distance” related “If you’re a friend of my friend, you’re my friend” Suggests a graph-theory approach: find connected components in a graph Edges in graph when two points are “close enough”

Problems: 

Problems Edges in graph when two points are “close enough” Very data-sensitive: a small perturbation of data can join two clusters These points are much “closer” than these: “If you’re friends with lots of my friends, you’re my friend.”

Leads to study of “how connected are nodes in a graph”?: 

Leads to study of “how connected are nodes in a graph”? The travelling token problem was an intro to that question

Solution to travelling token: 

Solution to travelling token

Slide32: 

…and so on Make a VERY long movie Play it VERY rapidly How “pink” each node appears tells you what fraction of the time the token spends there.

Insight:: 

Insight:

Create a shorter movie!: 

Create a shorter movie! Two tokens (possibly at same spot) in each frame

“Doubled” movie: 

“Doubled” movie

Apply repeatedly: 

Apply repeatedly Limiting version of the movie… Has a huge number of moving tokens every frame must look the same!

Every frame looks the same: 

Every frame looks the same If ui = number of tokens at node i di = degree of node i j is a neighbor of i Then i sends j a quantity ui / di tokens “every frame looks the same” if j sends that many back to i. That happens exactly if uk / dk = constant Population of tokens at a node is proportional to node’s degree!

Matrix form: 

Matrix form Let aij be 1 if i and j are connected be 0 otherwise Let D = diag(deg of node 1, deg of node 2, …) Let M = D-1 A Then our solution u satisfies Mu = u Insight: Things related to graph diffusion, neighboring, clustering, are related to eigenvalue problems.

Some Graph Terminology and Notation: 

Some Graph Terminology and Notation

Ng, Jordan, Weiss: 

Ng, Jordan, Weiss S = {s1, s2, …, sn} in Rp. Want to cluster into k subsets Form n x n matrix A with aij = exp(-||si – sj||)2/2s2) except aii = 0 D = diag(row sums of A); L = D-1/2 A D-1/2 Find k largest (column) eigenvectors of D; arrange in an n x k matrix, X

Slide44: 

X contains eigenvectors as columns Normalize each row of X to get Y. Rows of Y are points on the unit sphere. There’s one row per original point Cluster these points on the unit sphere using k-means; use these clusters on S.

Matlab: 

Matlab function y = njw(s, sigma, k) % s = nx3 array of pts; k =#clust. n = size(s, 1); [X,Y] = meshgrid(1:n, 1:n); diffs = s(X,:) - s(Y, :); dists = reshape(dot(diffs', diffs'), n, n);%squared dists. A = exp(-dists/(2*sigma^2)); for i = 1:n % ICK A(i,i) = 0; end D = sum(A'); %% ith entry is sum of A's ith row L = diag(1 ./ (D .^ 0.5)) * A * diag(1 ./ D .^ 0.5); [X,D] = eigs(L,k); % k largest eigenvals/vecs of L. Y = diag((1./dot(X', X')).^0.5) * X; % normalize rows to unit length IDX = kmeans(Y,k, 'emptyaction', 'singleton'); % IDX is a vector of cluster-ids, one per point of S.

Visualization part…: 

Visualization part… clf; hold on; for t = 1:k pts = s(IDX == t, :); c = hsv2rgb( [(t-1)/k, 0.8, 0.8] ) plot3(pts(:,1), pts(:, 2), pts(:, 3),… 'o', 'Color', c); end hold off; figure(gcf);