hsearch

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Web Page Clustering Using Heuristic Search in the Web Graph: 

Web Page Clustering Using Heuristic Search in the Web Graph Ron Bekkerman Shlomo Zilberstein James Allan

Example application: 

Example application Given a person name Find out everything about this person What is available in the Web Possible solution: Query a search engine with the person name Retrieve documents Cluster retrieved documents Build largest clusters possible

Query:: 

“Michel Décary” Query: “Michel Décary”

Query:: 

Query: “Michel Décary” Lawyer Computer scientist Chanson singer

Query:: 

Query: “Michel Décary” Lawyer Computer scientist Chanson singer

Query:: 

Query: “Michel Décary” Lawyer Computer scientist Chanson singer

Query:: 

Query: “Michel Décary” Lawyer Computer scientist Chanson singer +

Summary of observations: 

Summary of observations Topical clustering is not enough Although not to be ignored Web graph topology should be exploited Close pages tend to be semantically related There’s a reason for hyperlinking page A → page B Be careful: arbitrary connections exist as well Apply beam search to find close pages Use heuristics to prune undesired branches

Example: breadth first search: 

Example: breadth first search

Clustering by multi-agent search: 

Clustering by multi-agent search Each page is represented by a Web agent Whose task is to traverse the Web graph And meet as many other agents as possible

Real-world case: 

Real-world case The Web is tightly interconnected About 70% agents meet after 3 search iterations Which is clearly an undesired outcome

Heuristic 1: 

Heuristic 1 Elimination of high-degree nodes Both high in-degree and high out-degree They often connect unrelated pages www.google.com

Heuristic 2: 

Heuristic 2 Person name sharing Expanded nodes share a hyperlink AND a person name (ignore too popular names) Jonathan Stern Jonathan Stern

Heuristic 3: 

Heuristic 3 Anchor text sharing Anchor texts often summarize the content of hyperlinked pages Same idea as in person name sharing But much simpler to implement No sophisticated information extraction needed Shallow parsing of HTML is enough Again, ignore too frequent anchor texts Like “Contact Us” or “Copyright”

Algorithmic enhancement: 

Algorithmic enhancement Unpleasant artifact: too long connections Too weak semantic relationships Proposed solution: keep track of cluster’s diameter Start with a tightly connected component Add pages found within one hop

Experimentation domains: 

Experimentation domains Web appearance disambiguation Given pages retrieved on N people names From one social network Filter out pages that refer to their unrelated namesakes Clustering of Web search results Represent ranked lists of retrieved documents as clusters of semantically related documents Bekkerman & McCallum, WWW-05 Hearst & Pedersen, SIGIR-96

Slide17: 

Disambiguation dataset 12 names out of Melinda Gervasio’s social network

Disambiguation results: 

Disambiguation results SHS – sequential heuristic search (basic algorithm) IHS – incremental heuristic search With the enhancement of diameter tracking h/d – high degree node elimination names – person name heuristic Each point: one iteration of search Only 2 iterations are enough

Jaguar dataset: 

Jaguar dataset 100 pages retrieved on query “Jaguar” 23 different categories! The task is to build 3 clusters Of cars, Mac OS and wild cats

Jaguar results: 

Jaguar results SHS algorithm fails 70 agents meet together anchors – anchor text heuristic Best performance: High degree AND anchors heuristics

Topical & topological clustering: 

Topical & topological clustering Build topical clusters Enrich topical clusters with pages obtained by heuristic search based clustering Bekkerman & McCallum, WWW-05 Best performance: after one iteration of heuristic search only!

Conclusion: 

Conclusion First application of heuristic search to the Web graph Very simple algorithms / heurstics Heuristic search theory yet to be applied E.g., can an admissible heuristic be proposed? Search can be performed in real time! Modern search engines store the link structure of most of the Web Maximum 2 search iterations are enough Fully distributable in a multi-agent fashion