logging in or signing up hsearch Lassie Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 84 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Web Page Clustering Using Heuristic Search in the Web Graph: Web Page Clustering Using Heuristic Search in the Web Graph Ron Bekkerman Shlomo Zilberstein James AllanExample application: Example application Given a person name Find out everything about this person What is available in the Web Possible solution: Query a search engine with the person name Retrieve documents Cluster retrieved documents Build largest clusters possible Query:: “Michel Décary” Query: “Michel Décary”Query:: Query: “Michel Décary” Lawyer Computer scientist Chanson singerQuery:: Query: “Michel Décary” Lawyer Computer scientist Chanson singer Query:: Query: “Michel Décary” Lawyer Computer scientist Chanson singer Query:: Query: “Michel Décary” Lawyer Computer scientist Chanson singer + Summary of observations: Summary of observations Topical clustering is not enough Although not to be ignored Web graph topology should be exploited Close pages tend to be semantically related There’s a reason for hyperlinking page A → page B Be careful: arbitrary connections exist as well Apply beam search to find close pages Use heuristics to prune undesired branchesExample: breadth first search: Example: breadth first search Clustering by multi-agent search: Clustering by multi-agent search Each page is represented by a Web agent Whose task is to traverse the Web graph And meet as many other agents as possibleReal-world case: Real-world case The Web is tightly interconnected About 70% agents meet after 3 search iterations Which is clearly an undesired outcome Heuristic 1: Heuristic 1 Elimination of high-degree nodes Both high in-degree and high out-degree They often connect unrelated pages www.google.comHeuristic 2: Heuristic 2 Person name sharing Expanded nodes share a hyperlink AND a person name (ignore too popular names) Jonathan Stern Jonathan SternHeuristic 3: Heuristic 3 Anchor text sharing Anchor texts often summarize the content of hyperlinked pages Same idea as in person name sharing But much simpler to implement No sophisticated information extraction needed Shallow parsing of HTML is enough Again, ignore too frequent anchor texts Like “Contact Us” or “Copyright” Algorithmic enhancement: Algorithmic enhancement Unpleasant artifact: too long connections Too weak semantic relationships Proposed solution: keep track of cluster’s diameter Start with a tightly connected component Add pages found within one hopExperimentation domains: Experimentation domains Web appearance disambiguation Given pages retrieved on N people names From one social network Filter out pages that refer to their unrelated namesakes Clustering of Web search results Represent ranked lists of retrieved documents as clusters of semantically related documents Bekkerman & McCallum, WWW-05 Hearst & Pedersen, SIGIR-96Slide17: Disambiguation dataset 12 names out of Melinda Gervasio’s social networkDisambiguation results: Disambiguation results SHS – sequential heuristic search (basic algorithm) IHS – incremental heuristic search With the enhancement of diameter tracking h/d – high degree node elimination names – person name heuristic Each point: one iteration of search Only 2 iterations are enoughJaguar dataset: Jaguar dataset 100 pages retrieved on query “Jaguar” 23 different categories! The task is to build 3 clusters Of cars, Mac OS and wild catsJaguar results: Jaguar results SHS algorithm fails 70 agents meet together anchors – anchor text heuristic Best performance: High degree AND anchors heuristicsTopical & topological clustering: Topical & topological clustering Build topical clusters Enrich topical clusters with pages obtained by heuristic search based clustering Bekkerman & McCallum, WWW-05 Best performance: after one iteration of heuristic search only!Conclusion: Conclusion First application of heuristic search to the Web graph Very simple algorithms / heurstics Heuristic search theory yet to be applied E.g., can an admissible heuristic be proposed? Search can be performed in real time! Modern search engines store the link structure of most of the Web Maximum 2 search iterations are enough Fully distributable in a multi-agent fashion You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
hsearch Lassie Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 84 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Web Page Clustering Using Heuristic Search in the Web Graph: Web Page Clustering Using Heuristic Search in the Web Graph Ron Bekkerman Shlomo Zilberstein James AllanExample application: Example application Given a person name Find out everything about this person What is available in the Web Possible solution: Query a search engine with the person name Retrieve documents Cluster retrieved documents Build largest clusters possible Query:: “Michel Décary” Query: “Michel Décary”Query:: Query: “Michel Décary” Lawyer Computer scientist Chanson singerQuery:: Query: “Michel Décary” Lawyer Computer scientist Chanson singer Query:: Query: “Michel Décary” Lawyer Computer scientist Chanson singer Query:: Query: “Michel Décary” Lawyer Computer scientist Chanson singer + Summary of observations: Summary of observations Topical clustering is not enough Although not to be ignored Web graph topology should be exploited Close pages tend to be semantically related There’s a reason for hyperlinking page A → page B Be careful: arbitrary connections exist as well Apply beam search to find close pages Use heuristics to prune undesired branchesExample: breadth first search: Example: breadth first search Clustering by multi-agent search: Clustering by multi-agent search Each page is represented by a Web agent Whose task is to traverse the Web graph And meet as many other agents as possibleReal-world case: Real-world case The Web is tightly interconnected About 70% agents meet after 3 search iterations Which is clearly an undesired outcome Heuristic 1: Heuristic 1 Elimination of high-degree nodes Both high in-degree and high out-degree They often connect unrelated pages www.google.comHeuristic 2: Heuristic 2 Person name sharing Expanded nodes share a hyperlink AND a person name (ignore too popular names) Jonathan Stern Jonathan SternHeuristic 3: Heuristic 3 Anchor text sharing Anchor texts often summarize the content of hyperlinked pages Same idea as in person name sharing But much simpler to implement No sophisticated information extraction needed Shallow parsing of HTML is enough Again, ignore too frequent anchor texts Like “Contact Us” or “Copyright” Algorithmic enhancement: Algorithmic enhancement Unpleasant artifact: too long connections Too weak semantic relationships Proposed solution: keep track of cluster’s diameter Start with a tightly connected component Add pages found within one hopExperimentation domains: Experimentation domains Web appearance disambiguation Given pages retrieved on N people names From one social network Filter out pages that refer to their unrelated namesakes Clustering of Web search results Represent ranked lists of retrieved documents as clusters of semantically related documents Bekkerman & McCallum, WWW-05 Hearst & Pedersen, SIGIR-96Slide17: Disambiguation dataset 12 names out of Melinda Gervasio’s social networkDisambiguation results: Disambiguation results SHS – sequential heuristic search (basic algorithm) IHS – incremental heuristic search With the enhancement of diameter tracking h/d – high degree node elimination names – person name heuristic Each point: one iteration of search Only 2 iterations are enoughJaguar dataset: Jaguar dataset 100 pages retrieved on query “Jaguar” 23 different categories! The task is to build 3 clusters Of cars, Mac OS and wild catsJaguar results: Jaguar results SHS algorithm fails 70 agents meet together anchors – anchor text heuristic Best performance: High degree AND anchors heuristicsTopical & topological clustering: Topical & topological clustering Build topical clusters Enrich topical clusters with pages obtained by heuristic search based clustering Bekkerman & McCallum, WWW-05 Best performance: after one iteration of heuristic search only!Conclusion: Conclusion First application of heuristic search to the Web graph Very simple algorithms / heurstics Heuristic search theory yet to be applied E.g., can an admissible heuristic be proposed? Search can be performed in real time! Modern search engines store the link structure of most of the Web Maximum 2 search iterations are enough Fully distributable in a multi-agent fashion