15 2 SelectedTopics Hilltop Sudiksha
Added: October 19, 2007
Presentation Transcript

Web Search – Summer Term 2006
VII. Selected Topics - The Hilltop Algorithm
(c) Wolfgang Hürst, Albert-Ludwigs-University

Recap (PageRank and HITS):
PageRank and HITS both search the web based on
a) Relevance (content, anchor text, ...)
b) Quality, importance, authority, ...
The latter is based on link structure.
PageRank: Global, query-independent, recursive calculation over all pages
HITS: Local subgraph containing relevant documents; distinguishes between hubs and authorities

Hilltop [1]: Basic Idea:
Observation: Many web users (authors)
- Create web pages with link lists about topics they are very familiar with (experts)
- Maintain these pages well / try to keep them up to date
- Link to good, high-quality pages
Idea: Try to find such pages automatically and use their link structure for ranking
Compare HITS: Similar to hubs, but with a more explicit description of "expert" sources and a global view (i.e. query independent)

Hilltop [1]: Basic Idea (cont.):
An expert page is a page that is about a certain topic and has links to many non-affiliated pages on that topic
- "non-affiliated" means authors from non-affiliated organizations (modeled, e.g., by URL processing)
- "links to many ... pages" can be modeled (e.g.) by a threshold
A page is an authority on a query topic if, and only if, some of the best experts on that query topic point to it

Hilltop [1]: Basic Idea (cont.):
General approach:
1. Identify experts (in advance, i.e. query independent)
2. Select experts for a particular topic (depending on a specific query)
3. Use these experts to find and rank authorities for this topic

Identifying good expert pages:
What makes a good expert and how can they be found?
A good expert is objective, diverse, and unbiased, and points to numerous non-affiliated pages.
Two hosts can be defined as affiliated if
- they share the same first 3 octets of the IP address, OR
- the rightmost non-generic token in the hostname is the same (tokens = substrings of a hostname delimited by ".")

Identifying good expert pages:
1st: Divide all (indexed) web pages into groups of affiliated ones
2nd: Get experts (i.e. pages pointing to lots of non-affiliated pages) based on their number of links to different groups (e.g. using a threshold)
Note: This is all topic-independent!
Possible extensions:
- Consider topic-related clusters (if available)
- Consider special characteristics of a page (e.g. similar formatting, etc.)

Indexing experts:
Identification of experts: done in advance, i.e. topic / query independent
Selection of experts for a particular topic: done during the search process, i.e. query dependent
Therefore: create an inverted file for all pages that have been identified as experts
Only index so-called key phrases, i.e.
- Take all words in the title, in headlines (<h1>, <h2>, ...
tags), in the anchor text of a URL
- Associate these phrases with the respective URLs

Search: Get and rank authorities:
With this, we have:
- Experts for different topics
- All information we need to select all experts for a particular topic given the query terms q_i
Query processing is now done in two steps:
1. Select & rate experts (based on query)
2. Select & rate authorities (based on experts)

1. Select & rate experts:
Select a page as an expert (e.g.) if all query terms q_i are associated with at least one URL
Rate the selected experts by calculating an expert score for each expert
For this, we define, for a key phrase p of the expert and the query q:
- LevelScore(p) = Weighting of the type of key phrase (e.g. title: 16, heading: 6, anchor: 1)
- FullnessFactor(p,q) = Measure for the number of terms in p that are covered by the query terms q. Following [1], let plen be the length of p and m the number of terms in p that are not in q:
$$\mathrm{FullnessFactor}(p,q) = \begin{cases} 1 & \text{if } m \le 2 \\ 1 - \dfrac{m-2}{\mathit{plen}} & \text{otherwise} \end{cases}$$

1. Select & rate experts (cont.):
Based on the LevelScore and FullnessFactor, measures S_i are calculated as follows:
$$S_i = \sum_{p} \mathrm{LevelScore}(p) \times \mathrm{FullnessFactor}(p,q)$$
(with the sum running over all key phrases p that contain k-i query terms, where k is the number of query terms)
The expert score is finally calculated as
$$\mathrm{Expert\_Score} = 2^{32} S_0 + 2^{16} S_1 + S_2$$

2. Select & rate authorities:
Select pages as targets if they are referenced by at least two of these experts
Rate them by calculating a target score:
1. Calculate an Edge_Score(E,T) for each edge (i.e., link) from any expert E to target page T
$$\mathrm{Edge\_Score}(E,T) = \mathrm{Expert\_Score}(E) \times \sum_{\text{query terms } q} \mathrm{occ}(q,T)$$
with occ(q,T) = number of different key phrases for T containing q

2. Select & rate authorities (cont.):
1. Calculate an Edge_Score(E,T) for each edge (i.e., link) from any expert E to target page T
2. Check all experts pointing to the same target and, for affiliated experts, remove all edges but the one with the highest Edge_Score
3.
The Target_Score is now calculated as the sum of all remaining Edge_Scores
Possible extension: Combine Target_Scores with a page-dependent Match_Score (depending on the appearance of search terms on the page)

Hilltop: Summary:
Preprocessing:
- Divide the web into groups of affiliated pages (based on their authors / URLs)
- Select experts (based on linkage and groups)
Searching: Select and rate
1. Experts referencing pages about a particular topic (represented by the query)
2. Authorities for this particular topic

Hilltop: Discussion:
Main properties (when compared to PageRank and HITS):
- Topic/query-dependent (unlike PageRank)
- Pre-selection of experts (unlike HITS), i.e.
  - all experts are considered (no subgraph)
  - efficient online calculation can be done
- Page content and structure are considered
Potential problems / criticism:
- Uses lots of intuitive assumptions that are modeled by heuristics

References:
[1] Bharat, Mihaila: When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. ACM Transactions on Information Systems, Vol. 20, No. 1, Jan. 2002
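
The preprocessing described on the "Identifying good expert pages" slides (group affiliated hosts, then pick pages that link to many different groups) might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the set GENERIC_TOKENS, the function names, the union-find grouping, and the threshold parameter are all hypothetical choices made here, not details given in the slides.

```python
# Assumption: a tiny hand-picked set of generic hostname tokens; a real
# system would need a far more complete list.
GENERIC_TOKENS = {"www", "com", "org", "net", "edu", "co", "uk", "de", "mx"}


def affiliation_key(hostname):
    """Return the rightmost non-generic token of a hostname."""
    for token in reversed(hostname.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return hostname.lower()  # every token was generic: fall back to the host


def group_hosts(hosts):
    """Map each host to a group id, merging affiliated hosts transitively.

    hosts: dict mapping hostname -> IPv4 address string.
    Two hosts are merged if they share the first 3 IP octets OR the same
    rightmost non-generic hostname token.
    """
    parent = {h: h for h in hosts}

    def find(h):
        # Union-find lookup with path halving.
        while parent[h] != h:
            parent[h] = parent[parent[h]]
            h = parent[h]
        return h

    first_seen = {}  # affiliation feature -> first host showing it
    for host, ip in hosts.items():
        features = (
            ("ip", tuple(ip.split(".")[:3])),
            ("token", affiliation_key(host)),
        )
        for feature in features:
            if feature in first_seen:
                parent[find(host)] = find(first_seen[feature])
            else:
                first_seen[feature] = host

    group_ids = {}
    return {h: group_ids.setdefault(find(h), len(group_ids)) for h in hosts}


def is_expert(linked_hosts, groups, threshold):
    """True if a page links to at least `threshold` different groups."""
    return len({groups[h] for h in linked_hosts}) >= threshold


hosts = {
    "www.ibm.com": "9.1.2.3",
    "ibm.co.mx": "10.0.0.1",     # same token "ibm" as www.ibm.com
    "example.org": "10.0.0.77",  # same first 3 octets as ibm.co.mx
    "other.net": "192.168.5.5",
}
groups = group_hosts(hosts)
```

Because affiliation is defined with an OR, the sketch treats it as transitive: "example.org" ends up in the same group as "www.ibm.com" via "ibm.co.mx". Whether that transitivity is desired is a design choice the slides leave open.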
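
The two query-time scoring steps ("Select & rate experts" and "Select & rate authorities") can be illustrated with a small Python sketch. It uses the LevelScore weights and formulas from the slides, with m and plen as defined in [1]. The data layout (key phrases as (level, terms) pairs, edges as tuples), the function names, and the rule that a key phrase matching no query term is skipped are assumptions made for this example only.

```python
from collections import defaultdict

LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}  # weights from the slides


def fullness_factor(phrase, query):
    """1 if at most 2 phrase terms are not in the query, else 1 - (m-2)/plen."""
    plen = len(phrase)
    m = sum(1 for term in phrase if term not in query)
    return 1.0 if m <= 2 else 1.0 - (m - 2) / plen


def expert_score(key_phrases, query):
    """key_phrases: list of (level, list_of_terms); query: set of terms."""
    k = len(query)
    s = [0.0, 0.0, 0.0]  # S_0, S_1, S_2
    for level, phrase in key_phrases:
        matched = len(query & set(phrase))
        i = k - matched
        # Assumption: phrases without any query term do not contribute.
        if matched > 0 and i <= 2:
            s[i] += LEVEL_SCORE[level] * fullness_factor(phrase, query)
    return 2**32 * s[0] + 2**16 * s[1] + s[2]


def edge_score(expert_score_value, occ):
    """occ: dict query term -> number of different key phrases containing it."""
    return expert_score_value * sum(occ.values())


def target_scores(edges):
    """edges: iterable of (expert_id, group_id, target, edge_score).

    Keeps targets referenced by at least two experts, retains only the
    highest-scoring edge per group of affiliated experts, and sums the rest.
    """
    experts = defaultdict(set)
    best = defaultdict(float)  # (target, group) -> highest Edge_Score
    for expert_id, group_id, target, score in edges:
        experts[target].add(expert_id)
        best[(target, group_id)] = max(best[(target, group_id)], score)
    totals = defaultdict(float)
    for (target, _group), score in best.items():
        if len(experts[target]) >= 2:
            totals[target] += score
    return dict(totals)


query = {"hilltop", "algorithm"}
phrases = [
    ("title", ["hilltop", "algorithm"]),                     # all k terms: S_0
    ("anchor", ["hilltop", "paper"]),                        # k-1 terms: S_1
    ("heading", ["the", "best", "web", "search", "links"]),  # no match
]
score = expert_score(phrases, query)

edges = [
    ("e1", 0, "t", 10.0),
    ("e2", 0, "t", 7.0),  # affiliated with e1: only the best edge counts
    ("e3", 1, "t", 5.0),
    ("e4", 2, "u", 9.0),  # "u" has a single expert and is dropped
]
ranked = target_scores(edges)
```

The large factors 2^32 and 2^16 mean that a key phrase containing all query terms outweighs any realistic number of phrases containing fewer, so the three S_i values act almost like a lexicographic ordering.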