part3

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Slide1: 

Web Community Construction

Problem with Web Searching: 

Problem with Web Searching Current Web search tools such as Yahoo!, AltaVista, Google, … return more information than those required Example: search “Java Programming”, Google returns 1,330,000 (more than 1 million) Web pages, AltaVista returns 16,921,862 (more than 16 million) Web pages. Users only concern about a small and interesting portion of the returned search results. Necessary to build web communities. Related research work: Constructing good Web communities. Identifying related Web pages. Clustering Web pages.

Web Community: 

Web Community A web community is defined as a set of web pages that link to more web pages within the set than to web pages outside the community (Flake/Lawrence/Giles 2000) Hyperlink analysis if a web page A has a hyperlink to page B, then the author of page A usually considers that page B contains valuable information that is related to page A; hyperlinks convey a considerable amount of latent human judgment;

Some recent works on Web Communities: 

Some recent works on Web Communities Noise page elimination Related page finding Web page clustering

Authority and Hub Pages : 

Authority and Hub Pages A page is a good authoritative page with respect to a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query. A page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query. Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other.

Constructing Web Community: 

Given a query/topic, a web community consists of two distinct, but interrelated, types of pages: authority pages (authorities) and hub pages (hubs). Constructing Web Community Hubs Authorities Figure 2. A skeletal web community

Slide7: 

Construct Good Quality Web Page Communities World Wide Web Concerned Info. Space Web Page Community Hubs Authorities Search Engines ? ?

Slide8: 

Kleinberg’s HITS Algorithm (Hypertext Induced Topic Search) Concerned Info. Space Calculate Authority/Hub values Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R.

HITS Algorithm: 

HITS Algorithm Hub and authority value computation "p  q" denotes "page p has a hyperlink to page q". Normalize the a and h vectors after each iteration. Improved HITS algorithm (Bharat and Henzinger, 1998) ,

Slide10: 

Problems with various HITS algorithms Concerned Info. Space Community Construction Algorithms Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R. Topic Drift Problem Noise pages

Example: 

Example Noise page --- contains no query terms Experiment Query Term: Harvard Search Engine: AltaVista Size of Base Set: 8064 Size of Root Set: 200

Slide12: 

Ten arbitrarily selected noise pages

Slide13: 

Ten arbitrarily selected topic-related pages

Slide14: 

Authorities & Hubs (by HITS algorithm)

Slide15: 

Noise Page Elimination Concerned Info. Space Community Construction Algorithms Good Quality Web Page Community Noise Page Elimination Algorithms !

Hyperlink Analysis via SVD: 

Hyperlink hyperlink matrix Hyperlink matrix singular value decomposition getting intrinsic relationships between the pages Hyperlink Analysis via SVD

SVD of a Matrix: 

SVD of a Matrix Definition of SVD then where

Slide18: 

Approximation Property is the best approximation to A with rank k ( ).

Noise Page Elimination Algorithm: 

Noise Page Elimination Algorithm Main Procedure of the Algorithm Hyperlinks Matrix A SVD of A Approximation Matrix Ak Ak captures main structure information of A Ak filters minor factors in A For a proper k, computing cost can be reduced

Slide20: 

Noise Page Elimination Algorithm B R Matrix A = (aij) n×n Matrix S=(sij)(m-n)×n Size(R)= n Size(B)= m î í ì = ® ® = otherwise j i i j j i s ij 0 or or 1

Slide21: 

Matrix A SVD of A Choose parameter k such that Approximation matrix of A Rows of matrix Coordinate vector of page i in k-dim. space. page i  B - R i = 1, 2, …, n.

Slide22: 

Matrix S Each row s in S, s =(g1, g2, …, gn) in new k-space Pages in A and S are mapped into the vectors in the same k-dimensional subspace, it is possible to measure the similarity (relevance degree) between a page in S and a page in A. Yes No Keep page i in B Eliminate page i from B

Choice of Parameter c: 

Choice of Parameter c Algorithm is called c = avgAlgo Algorithm c = maxAlgo Algorithm c = minAlgo Algorithm

Slide24: 

Authorities & Hubs after Elimination (NPEA and HITS algorithm)

Finding Related Web Pages: 

Given a URL or web page, find a set of pages that address the same topic of the original page. Co-citation algorithm: the sibling pages of the given URL u that have the most number of common parent pages with u are the related pages. HITS and Improved HITS algorithms: authority pages with the highest authority values are related pages. Finding Related Web Pages

Slide26: 

Finding related pages Co-Citation Algorithm Given page (URL) u, page source S is constructed in the following way: chooses up to B (e.g. 2000) arbitrary parents of u; for each of these parents p, adds to S up to BF (e.g. 8) children of p that surround the link from p to u. The elements of S are siblings. for each s in S, co-citation degree of s and u is determined; the algorithm returns the 10 pages that have the highest co-citation degrees with u as the relevant pages.

Slide27: 

Extended Co-Citation Algorithm Given page u, Choose up to B parent pages of u, and add up to BF children pages (different from u) of each parent page. choose up to F children pages of u, and add up to FB parent pages (diff from u) of each child page. B, F, BF and FB are used to keep the page source to a reasonable size. In practice, choose B = FB = 200, F = BF = 40.

Web Page Clustering: 

With a set of pages, the key is to find a similarity between the pages Text content based algorithms Regard a web page as a set of words or phrases; Each page is represented as a feature vector; The similarity is measured as where wik is the k-th component of the feature vector of the i-th page, and L is the length of the feature vectors. Web Page Clustering

Slide29: 

Hyperlink based algorithms Co-citation algorithm Linkage vector algorithm Each page P is represented as two linkage vectors: Pout and Pin ; The similarity is defined as : dot product of vectors Pout and Qout .

Web Page Clustering : 

Page Source Construction Given a user’s query topic, page source S = R  BV  FV with hyperlinks. Web Page Clustering

Slide31: 

Page Similarity Assume: size(R) = m, size(V) = n, V = BV  FV Hyperlinks Adjacency Matrix C

Slide32: 

(Out-link relationship of page i in R) (In-link relationship of page i in R) Out-link similarity of pages i and j In-link similarity of pages i and j

Slide33: 

Similarity between any two pages i and j in R: Similarity Matrix SM = (smi,j)mm , where MODij = || rowi || + || rowj || + || coli || + || colj ||

Slide34: 

Matrix-Based Clustering Algorithms Three main steps: Similarity Matrix Permutation Purpose: put closely related pages together in the similarity matrix SM. Matrix Partition Purpose: cluster pages at one level with the partition. Hierarchical Clustering Recursively apply the above two steps to the existing clusters.

Slide35: 

Matrix-based hierarchical clustering diagram: Similarity Matrix mm Page Source SM CL1 CL2 Matrix Partition Hierarchical Clusters Permuted Similarity Matrix mm SM’

Slide36: 

An example of matrix permutation and partition 1 0 0 0 .47 0 .6 0 0 0 1 0 0 0 0 .49 0 0 0 0 1 0 0 0 0 .7 .66 0 0 0 1 0 0 0 0 .42 .47 0 0 0 1 0 .9 0 0 0 0 0 0 0 1 0 .92 .45 .6 .49 0 0 .9 0 1 0 0 0 0 .7 0 0 .92 0 1 0 0 0 .66 .42 0 .45 0 0 1 P1 P2 P3 P4 P5 P6 P7 P8 P9 P1 P2 P3 P4 P5 P6 P7 P8 P9 (a) A similarity matrix (b) Permuted matrix and partition 1 .47 .6 0 0 0 0 0 0 .47 1 .9 0 0 0 0 0 0 .6 .9 1 .49 0 0 0 0 0 0 0 .49 1 0 0 0 0 0 0 0 0 0 1 .42 0 0 0 0 0 0 0 .42 1 .66 0 .45 0 0 0 0 0 .66 1 .7 0 0 0 0 0 0 0 .7 1 .92 0 0 0 0 0 .45 0 .92 1 D P1 P5 P7 P2 P4 P9 P3 P8 P6 P1 P5 P7 P2 P4 P9 P3 P8 P6 Cluster1 = { P1 , P5 , P7 , P2 } , Cluster2 = { P4 , P9 , P3 , P8 , P6 }

Slide37: 

Examples of Some Major Clusters