logging in or signing up part3 Nikita Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 88 Category: Entertainment License: All Rights Reserved Like it (1) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Slide1: Web Community Construction Problem with Web Searching: Problem with Web Searching Current Web search tools such as Yahoo!, AltaVista, Google, … return more information than those required Example: search “Java Programming”, Google returns 1,330,000 (more than 1 million) Web pages, AltaVista returns 16,921,862 (more than 16 million) Web pages. Users only concern about a small and interesting portion of the returned search results. Necessary to build web communities. Related research work: Constructing good Web communities. Identifying related Web pages. Clustering Web pages. Web Community: Web Community A web community is defined as a set of web pages that link to more web pages within the set than to web pages outside the community (Flake/Lawrence/Giles 2000) Hyperlink analysis if a web page A has a hyperlink to page B, then the author of page A usually considers that page B contains valuable information that is related to page A; hyperlinks convey a considerable amount of latent human judgment; Some recent works on Web Communities: Some recent works on Web Communities Noise page elimination Related page finding Web page clustering Authority and Hub Pages : Authority and Hub Pages A page is a good authoritative page with respect to a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query. A page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query. Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other. Constructing Web Community: Given a query/topic, a web community consists of two distinct, but interrelated, types of pages: authority pages (authorities) and hub pages (hubs). Constructing Web Community Hubs Authorities Figure 2. A skeletal web communitySlide7: Construct Good Quality Web Page Communities World Wide Web Concerned Info. Space Web Page Community Hubs Authorities Search Engines ? ?Slide8: Kleinberg’s HITS Algorithm (Hypertext Induced Topic Search) Concerned Info. Space Calculate Authority/Hub values Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R.HITS Algorithm: HITS Algorithm Hub and authority value computation "p q" denotes "page p has a hyperlink to page q". Normalize the a and h vectors after each iteration. Improved HITS algorithm (Bharat and Henzinger, 1998) , Slide10: Problems with various HITS algorithms Concerned Info. Space Community Construction Algorithms Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R. Topic Drift Problem Noise pagesExample: Example Noise page --- contains no query terms Experiment Query Term: Harvard Search Engine: AltaVista Size of Base Set: 8064 Size of Root Set: 200 Slide12: Ten arbitrarily selected noise pagesSlide13: Ten arbitrarily selected topic-related pagesSlide14: Authorities & Hubs (by HITS algorithm)Slide15: Noise Page Elimination Concerned Info. Space Community Construction Algorithms Good Quality Web Page Community Noise Page Elimination Algorithms !Hyperlink Analysis via SVD: Hyperlink hyperlink matrix Hyperlink matrix singular value decomposition getting intrinsic relationships between the pages Hyperlink Analysis via SVD SVD of a Matrix: SVD of a Matrix Definition of SVD then where Slide18: Approximation Property is the best approximation to A with rank k ( ). Noise Page Elimination Algorithm: Noise Page Elimination Algorithm Main Procedure of the Algorithm Hyperlinks Matrix A SVD of A Approximation Matrix Ak Ak captures main structure information of A Ak filters minor factors in A For a proper k, computing cost can be reduced Slide20: Noise Page Elimination Algorithm B R Matrix A = (aij) n×n Matrix S=(sij)(m-n)×n Size(R)= n Size(B)= m î í ì = ® ® = otherwise j i i j j i s ij 0 or or 1Slide21: Matrix A SVD of A Choose parameter k such that Approximation matrix of A Rows of matrix Coordinate vector of page i in k-dim. space. page i B - R i = 1, 2, …, n. Slide22: Matrix S Each row s in S, s =(g1, g2, …, gn) in new k-space Pages in A and S are mapped into the vectors in the same k-dimensional subspace, it is possible to measure the similarity (relevance degree) between a page in S and a page in A. Yes No Keep page i in B Eliminate page i from B Choice of Parameter c: Choice of Parameter c Algorithm is called c = avgAlgo Algorithm c = maxAlgo Algorithm c = minAlgo Algorithm Slide24: Authorities & Hubs after Elimination (NPEA and HITS algorithm)Finding Related Web Pages: Given a URL or web page, find a set of pages that address the same topic of the original page. Co-citation algorithm: the sibling pages of the given URL u that have the most number of common parent pages with u are the related pages. HITS and Improved HITS algorithms: authority pages with the highest authority values are related pages. Finding Related Web PagesSlide26: Finding related pages Co-Citation Algorithm Given page (URL) u, page source S is constructed in the following way: chooses up to B (e.g. 2000) arbitrary parents of u; for each of these parents p, adds to S up to BF (e.g. 8) children of p that surround the link from p to u. The elements of S are siblings. for each s in S, co-citation degree of s and u is determined; the algorithm returns the 10 pages that have the highest co-citation degrees with u as the relevant pages.Slide27: Extended Co-Citation Algorithm Given page u, Choose up to B parent pages of u, and add up to BF children pages (different from u) of each parent page. choose up to F children pages of u, and add up to FB parent pages (diff from u) of each child page. B, F, BF and FB are used to keep the page source to a reasonable size. In practice, choose B = FB = 200, F = BF = 40. Web Page Clustering: With a set of pages, the key is to find a similarity between the pages Text content based algorithms Regard a web page as a set of words or phrases; Each page is represented as a feature vector; The similarity is measured as where wik is the k-th component of the feature vector of the i-th page, and L is the length of the feature vectors. Web Page Clustering Slide29: Hyperlink based algorithms Co-citation algorithm Linkage vector algorithm Each page P is represented as two linkage vectors: Pout and Pin ; The similarity is defined as : dot product of vectors Pout and Qout . Web Page Clustering : Page Source Construction Given a user’s query topic, page source S = R BV FV with hyperlinks. Web Page Clustering Slide31: Page Similarity Assume: size(R) = m, size(V) = n, V = BV FV Hyperlinks Adjacency Matrix C Slide32: (Out-link relationship of page i in R) (In-link relationship of page i in R) Out-link similarity of pages i and j In-link similarity of pages i and jSlide33: Similarity between any two pages i and j in R: Similarity Matrix SM = (smi,j)mm , where MODij = || rowi || + || rowj || + || coli || + || colj || Slide34: Matrix-Based Clustering Algorithms Three main steps: Similarity Matrix Permutation Purpose: put closely related pages together in the similarity matrix SM. Matrix Partition Purpose: cluster pages at one level with the partition. Hierarchical Clustering Recursively apply the above two steps to the existing clusters.Slide35: Matrix-based hierarchical clustering diagram: Similarity Matrix mm Page Source SM CL1 CL2 Matrix Partition Hierarchical Clusters Permuted Similarity Matrix mm SM’ Slide36: An example of matrix permutation and partition 1 0 0 0 .47 0 .6 0 0 0 1 0 0 0 0 .49 0 0 0 0 1 0 0 0 0 .7 .66 0 0 0 1 0 0 0 0 .42 .47 0 0 0 1 0 .9 0 0 0 0 0 0 0 1 0 .92 .45 .6 .49 0 0 .9 0 1 0 0 0 0 .7 0 0 .92 0 1 0 0 0 .66 .42 0 .45 0 0 1 P1 P2 P3 P4 P5 P6 P7 P8 P9 P1 P2 P3 P4 P5 P6 P7 P8 P9 (a) A similarity matrix (b) Permuted matrix and partition 1 .47 .6 0 0 0 0 0 0 .47 1 .9 0 0 0 0 0 0 .6 .9 1 .49 0 0 0 0 0 0 0 .49 1 0 0 0 0 0 0 0 0 0 1 .42 0 0 0 0 0 0 0 .42 1 .66 0 .45 0 0 0 0 0 .66 1 .7 0 0 0 0 0 0 0 .7 1 .92 0 0 0 0 0 .45 0 .92 1 D P1 P5 P7 P2 P4 P9 P3 P8 P6 P1 P5 P7 P2 P4 P9 P3 P8 P6 Cluster1 = { P1 , P5 , P7 , P2 } , Cluster2 = { P4 , P9 , P3 , P8 , P6 }Slide37: Examples of Some Major Clusters You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
part3 Nikita Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 88 Category: Entertainment License: All Rights Reserved Like it (1) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Slide1: Web Community Construction Problem with Web Searching: Problem with Web Searching Current Web search tools such as Yahoo!, AltaVista, Google, … return more information than those required Example: search “Java Programming”, Google returns 1,330,000 (more than 1 million) Web pages, AltaVista returns 16,921,862 (more than 16 million) Web pages. Users only concern about a small and interesting portion of the returned search results. Necessary to build web communities. Related research work: Constructing good Web communities. Identifying related Web pages. Clustering Web pages. Web Community: Web Community A web community is defined as a set of web pages that link to more web pages within the set than to web pages outside the community (Flake/Lawrence/Giles 2000) Hyperlink analysis if a web page A has a hyperlink to page B, then the author of page A usually considers that page B contains valuable information that is related to page A; hyperlinks convey a considerable amount of latent human judgment; Some recent works on Web Communities: Some recent works on Web Communities Noise page elimination Related page finding Web page clustering Authority and Hub Pages : Authority and Hub Pages A page is a good authoritative page with respect to a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query. A page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query. Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other. Constructing Web Community: Given a query/topic, a web community consists of two distinct, but interrelated, types of pages: authority pages (authorities) and hub pages (hubs). Constructing Web Community Hubs Authorities Figure 2. A skeletal web communitySlide7: Construct Good Quality Web Page Communities World Wide Web Concerned Info. Space Web Page Community Hubs Authorities Search Engines ? ?Slide8: Kleinberg’s HITS Algorithm (Hypertext Induced Topic Search) Concerned Info. Space Calculate Authority/Hub values Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R.HITS Algorithm: HITS Algorithm Hub and authority value computation "p q" denotes "page p has a hyperlink to page q". Normalize the a and h vectors after each iteration. Improved HITS algorithm (Bharat and Henzinger, 1998) , Slide10: Problems with various HITS algorithms Concerned Info. Space Community Construction Algorithms Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R. Topic Drift Problem Noise pagesExample: Example Noise page --- contains no query terms Experiment Query Term: Harvard Search Engine: AltaVista Size of Base Set: 8064 Size of Root Set: 200 Slide12: Ten arbitrarily selected noise pagesSlide13: Ten arbitrarily selected topic-related pagesSlide14: Authorities & Hubs (by HITS algorithm)Slide15: Noise Page Elimination Concerned Info. Space Community Construction Algorithms Good Quality Web Page Community Noise Page Elimination Algorithms !Hyperlink Analysis via SVD: Hyperlink hyperlink matrix Hyperlink matrix singular value decomposition getting intrinsic relationships between the pages Hyperlink Analysis via SVD SVD of a Matrix: SVD of a Matrix Definition of SVD then where Slide18: Approximation Property is the best approximation to A with rank k ( ). Noise Page Elimination Algorithm: Noise Page Elimination Algorithm Main Procedure of the Algorithm Hyperlinks Matrix A SVD of A Approximation Matrix Ak Ak captures main structure information of A Ak filters minor factors in A For a proper k, computing cost can be reduced Slide20: Noise Page Elimination Algorithm B R Matrix A = (aij) n×n Matrix S=(sij)(m-n)×n Size(R)= n Size(B)= m î í ì = ® ® = otherwise j i i j j i s ij 0 or or 1Slide21: Matrix A SVD of A Choose parameter k such that Approximation matrix of A Rows of matrix Coordinate vector of page i in k-dim. space. page i B - R i = 1, 2, …, n. Slide22: Matrix S Each row s in S, s =(g1, g2, …, gn) in new k-space Pages in A and S are mapped into the vectors in the same k-dimensional subspace, it is possible to measure the similarity (relevance degree) between a page in S and a page in A. Yes No Keep page i in B Eliminate page i from B Choice of Parameter c: Choice of Parameter c Algorithm is called c = avgAlgo Algorithm c = maxAlgo Algorithm c = minAlgo Algorithm Slide24: Authorities & Hubs after Elimination (NPEA and HITS algorithm)Finding Related Web Pages: Given a URL or web page, find a set of pages that address the same topic of the original page. Co-citation algorithm: the sibling pages of the given URL u that have the most number of common parent pages with u are the related pages. HITS and Improved HITS algorithms: authority pages with the highest authority values are related pages. Finding Related Web PagesSlide26: Finding related pages Co-Citation Algorithm Given page (URL) u, page source S is constructed in the following way: chooses up to B (e.g. 2000) arbitrary parents of u; for each of these parents p, adds to S up to BF (e.g. 8) children of p that surround the link from p to u. The elements of S are siblings. for each s in S, co-citation degree of s and u is determined; the algorithm returns the 10 pages that have the highest co-citation degrees with u as the relevant pages.Slide27: Extended Co-Citation Algorithm Given page u, Choose up to B parent pages of u, and add up to BF children pages (different from u) of each parent page. choose up to F children pages of u, and add up to FB parent pages (diff from u) of each child page. B, F, BF and FB are used to keep the page source to a reasonable size. In practice, choose B = FB = 200, F = BF = 40. Web Page Clustering: With a set of pages, the key is to find a similarity between the pages Text content based algorithms Regard a web page as a set of words or phrases; Each page is represented as a feature vector; The similarity is measured as where wik is the k-th component of the feature vector of the i-th page, and L is the length of the feature vectors. Web Page Clustering Slide29: Hyperlink based algorithms Co-citation algorithm Linkage vector algorithm Each page P is represented as two linkage vectors: Pout and Pin ; The similarity is defined as : dot product of vectors Pout and Qout . Web Page Clustering : Page Source Construction Given a user’s query topic, page source S = R BV FV with hyperlinks. Web Page Clustering Slide31: Page Similarity Assume: size(R) = m, size(V) = n, V = BV FV Hyperlinks Adjacency Matrix C Slide32: (Out-link relationship of page i in R) (In-link relationship of page i in R) Out-link similarity of pages i and j In-link similarity of pages i and jSlide33: Similarity between any two pages i and j in R: Similarity Matrix SM = (smi,j)mm , where MODij = || rowi || + || rowj || + || coli || + || colj || Slide34: Matrix-Based Clustering Algorithms Three main steps: Similarity Matrix Permutation Purpose: put closely related pages together in the similarity matrix SM. Matrix Partition Purpose: cluster pages at one level with the partition. Hierarchical Clustering Recursively apply the above two steps to the existing clusters.Slide35: Matrix-based hierarchical clustering diagram: Similarity Matrix mm Page Source SM CL1 CL2 Matrix Partition Hierarchical Clusters Permuted Similarity Matrix mm SM’ Slide36: An example of matrix permutation and partition 1 0 0 0 .47 0 .6 0 0 0 1 0 0 0 0 .49 0 0 0 0 1 0 0 0 0 .7 .66 0 0 0 1 0 0 0 0 .42 .47 0 0 0 1 0 .9 0 0 0 0 0 0 0 1 0 .92 .45 .6 .49 0 0 .9 0 1 0 0 0 0 .7 0 0 .92 0 1 0 0 0 .66 .42 0 .45 0 0 1 P1 P2 P3 P4 P5 P6 P7 P8 P9 P1 P2 P3 P4 P5 P6 P7 P8 P9 (a) A similarity matrix (b) Permuted matrix and partition 1 .47 .6 0 0 0 0 0 0 .47 1 .9 0 0 0 0 0 0 .6 .9 1 .49 0 0 0 0 0 0 0 .49 1 0 0 0 0 0 0 0 0 0 1 .42 0 0 0 0 0 0 0 .42 1 .66 0 .45 0 0 0 0 0 .66 1 .7 0 0 0 0 0 0 0 .7 1 .92 0 0 0 0 0 .45 0 .92 1 D P1 P5 P7 P2 P4 P9 P3 P8 P6 P1 P5 P7 P2 P4 P9 P3 P8 P6 Cluster1 = { P1 , P5 , P7 , P2 } , Cluster2 = { P4 , P9 , P3 , P8 , P6 }Slide37: Examples of Some Major Clusters