logging in or signing up Relation Extraction Gourmet Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 166 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 03, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript CS345Data Mining: CS345 Data Mining Mining the Web for Structured Data Our view of the web so far…: Our view of the web so far… Web pages as atomic units Great for some applications e.g., Conventional web search But not always the right model Going beyond web pages: Going beyond web pages Question answering What is the height of Mt Everest? Who killed Abraham Lincoln? Relation Extraction Find all andlt;company,CEOandgt; pairs Virtual Databases Answer database-like queries over web data E.g., Find all software engineering jobs in Fortune 500 companies Question Answering: Question Answering E.g., Who killed Abraham Lincoln? Naïve algorithm Find all web pages containing the terms 'killed' and 'Abraham Lincoln' in close proximity Extract k-grams from a small window around the terms Find the most commonly occuring k-grams Question Answering: Question Answering Naïve algorithm works fairly well! Some improvements Use sentence structure e.g., restrict to noun phrases only Rewrite questions before matching 'What is the height of Mt Everest' becomes 'The height of Mt Everest is andlt;blankandgt;' The number of pages analyzed is more important than the sophistication of the NLP For simple questions Reference: Dumais et al Relation Extraction: Relation Extraction Find pairs (title, author) Where title is the name of a book E.g., (Foundation, Isaac Asimov) Find pairs (company, hq) E.g., (Microsoft, Redmond) Find pairs (abbreviation, expansion) (ADA, American Dental Association) Can also have tuples with andgt;2 components Relation Extraction: Relation Extraction Assumptions: No single source contains all the tuples Each tuple appears on many web pages Components of tuple appear 'close' together Foundation, by Isaac Asimov Isaac Asimov’s masterpiece, the andlt;emandgt;Foundationandlt;/emandgt; trilogy There are repeated patterns in the way tuples are represented on web pages Naïve approach: Naïve approach Study a few websites and come up with a set of patterns e.g., regular expressions letter = [A-Za-z. ] title = letter{5,40} author = letter{10,30} andlt;bandgt;(title)andlt;/bandgt; by (author) Problems with naïve approach: Problems with naïve approach A pattern that works on one web page might produce nonsense when applied to another So patterns need to be page-specific, or at least site-specific Impossible for a human to exhaustively enumerate patterns for every relevant website Will result in low coverage Better approach (Brin): Better approach (Brin) Exploit duality between patterns and tuples Find tuples that match a set of patterns Find patterns that match a lot of tuples DIPRE (Dual Iterative Pattern Relation Extraction) Patterns Tuples Match Generate DIPRE Algorithm: DIPRE Algorithm R Ã SampleTuples e.g., a small set of andlt;title,authorandgt; pairs O Ã FindOccurrences(R) Occurrences of tuples on web pages Keep some surrounding context P Ã GenPatterns(O) Look for patterns in the way tuples occur Make sure patterns are not too general! R Ã MatchingTuples(P) Return or go back to Step 2 Occurrences: Occurrences e.g., Titles and authors Restrict to cases where author and title appear in close proximity on web page andlt;liandgt;andlt;bandgt; Foundation andlt;/bandgt; by Isaac Asimov (1951) url = http://www.scifi.org/bydecade/1950.html order = [title,author] (or [author,title]) denote as 0 or 1 prefix = 'andlt;liandgt;andlt;bandgt; ' (limit to e.g., 10 characters) middle = 'andlt;/bandgt; by ' suffix = '(1951) ' occurrence = (’Foundation’,’Isaac Asimov’,url,order,prefix,middle,suffix) Patterns: Patterns andlt;liandgt;andlt;bandgt; Foundation andlt;/bandgt; by Isaac Asimov (1951) andlt;pandgt;andlt;bandgt; Nightfall andlt;/bandgt; by Isaac Asimov (1941) order = [title,author] (say 0) shared prefix = andlt;bandgt; shared middle = andlt;/bandgt; by shared suffix = (19 pattern = (order,shared prefix, shared middle, shared suffix) URL Prefix: URL Prefix Patterns may be specific to a website Or even parts of it Add urlprefix component to pattern http://www.scifi.org/bydecade/1950.html occurence: andlt;liandgt;andlt;bandgt; Foundation andlt;/bandgt; by Isaac Asimov (1951) http://www.scifi.org/bydecade/1940.html occurence: andlt;pandgt;andlt;bandgt; Nightfall andlt;/bandgt; by Isaac Asimov (1941) shared urlprefix = http://www.scifi.org/bydecade/19 pattern = (urlprefix,order,prefix,middle,suffix) Generating Patterns: Generating Patterns Group occurences by order and middle Let O = set of occurences with the same order and middle pattern.order = O.order pattern.middle = O.middle pattern.urlprefix = longest common prefix of all urls in O pattern.prefix = longest common prefix of occurrences in O pattern.suffix = longest common suffix of occurrences in O Example: Example http://www.scifi.org/bydecade/1950.html occurence: andlt;liandgt;andlt;bandgt; Foundation andlt;/bandgt; by Isaac Asimov (1951) http://www.scifi.org/bydecade/1940.html occurence: andlt;pandgt;andlt;bandgt; Nightfall andlt;/bandgt; by Isaac Asimov (1941) order = [title,author] middle = ' andlt;/bandgt; by ' urlprefix = http://www.scifi.org/bydecade/19 prefix = 'andlt;bandgt; ' suffix = ' (19' Example: Example http://www.scifi.org/bydecade/1950.html occurence: Foundation, by Isaac Asimov, has been hailed… http://www.scifi.org/bydecade/1940.html occurence: Nightfall, by Isaac Asimov, tells the tale of… order = [title,author] middle = ', by ' urlprefix = http://www.scifi.org/bydecade/19 prefix = '' suffix = ', ' Pattern Specificity: Pattern Specificity We want to avoid generating patterns that are too general One approach: For pattern p, define specificity = |urlprefix||middle||prefix||suffix| Suppose n(p) = number of occurences that match the pattern p Discard patterns where n(p) andlt; nmin Discard patterns p where specificity(p)n(p) andlt; threshold Pattern Generation Algorithm: Pattern Generation Algorithm Group occurences by order and middle Let O = a set of occurences with the same order and middle p = GeneratePattern(O) If p meets specificity requirements, add p to set of patterns Otherwise, try to split O into multiple subgroups by extending the urlprefix by one character If all occurences in O are from the same URL, we cannot extend the urlprefix, so we discard O Extending the URL prefix: Extending the URL prefix Suppose O contains occurences from urls of the form http://www.scifi.org/bydecade/195?.html http://www.scifi.org/bydecade/194?.html urlprefix = http://www.scifi.org/bydecade/19 When we extend the urlprefix, we split O into two subsets: urlprefix = http://www.scifi.org/bydecade/194 urlprefix = http://www.scifi.org/bydecade/195 Finding occurrences and matches: Finding occurrences and matches Finding occurrences Use inverted index on web pages Examine resulting pages to extract occurrences Finding matches Use urlprefix to restrict set of pages to examine Scan each page using regex constructed from pattern Relation Drift: Relation Drift Small contaminations can easily lead to huge divergences Need to tightly control process Snowball (Agichtein and Gravano) Trust only tuples that match many patterns Trust only patterns with high 'support' and 'confidence' Pattern support: Pattern support Similar to DIPRE Eliminate patterns not supported by at least nmin known good tuples either seed tuples or tuples generated in a prior iteration Pattern Confidence: Pattern Confidence Suppose tuple t matches pattern p What is the probability that tuple t is valid? Call this probability the confidence of pattern p, denoted conf(p) Assume independent of other patterns How can we estimate conf(p)? Categorizing pattern matches: Categorizing pattern matches Given pattern p, suppose we can partition its matching tuples into groups p.positive, p.negative, and p.unknown Grouping methodology is application-specific Categorizing Matches: Categorizing Matches e.g., Organizations and Headquarters A tuple that exactly matches a known pair (org,hq) is positive A tuple that matches the org of a known tuple but a different hq is negative Assume org is key for relation A tuple that matches a hq that is not a known city is negative Assume we have a list of valid city names All other occurrences are unknown Categorizing Matches: Categorizing Matches Books and authors One possibility… A tuple that matches a known tuple is positive A tuple that matches the title of a known tuple but has a different author is negative Assume title is key for relation All other tuples are unknown Can come up with other schemes if we have more information e.g., list of possible legal people names Example: Example Suppose we know the tuples Foundation, Isaac Asimov Startide Rising, David Brin Suppose pattern p matches Foundation, Isaac Asimov Startide Rising, David Brin Foundation, Doubleday Rendezvous with Rama, Arthur C. Clarke |p.positive| = 2, |p.negative| = 1, |p.unknown| = 1 Pattern Confidence (1): Pattern Confidence (1) pos(p) = |p.positive| neg(p) = |p.negative| un(p) = |p.unknown| conf(p) = pos(p)/(pos(p)+neg(p)) Pattern Confidence (2): Pattern Confidence (2) Another definition – penalize patterns with many unknown matches conf(p) = pos(p)/(pos(p)+neg(p)+un(p)) where 0 · · 1 Tuple confidence: Tuple confidence Suppose candidate tuple t matches patterns p1 and p2 What is the probability that t is an valid tuple? Assume matches of different patterns are independent events Tuple confidence: Tuple confidence Pr[t matches p1 and t is not valid] = 1-conf(p1) Pr[t matches p2 and t is not valid] = 1-conf(p2) Pr[t matches {p1,p2} and t is not valid] = (1-conf(p1))(1-conf(p2)) Pr[t matches {p1,p2} and t is valid] = 1 - (1-conf(p1))(1-conf(p2)) If tuple t matches a set of patterns P conf(t) = 1 - p2P(1-conf(p)) Snowball algorithm: Snowball algorithm Start with seed set R of tuples Generate set P of patterns from R Compute support and confidence for each pattern in P Discard patterns with low support or confidence Generate new set T of tuples matching patterns P Compute confidence of each tuple in T Add to R the tuples t2T with conf(t)andgt;threshold. Go back to step 2 Some refinements: Some refinements Give more weight to tuples found earlier Approximate pattern matches Entity tagging Tuple confidence: Tuple confidence If tuple t matches a set of patterns P conf(t) = 1 - p2P(1-conf(p)) Suppose we allow tuples that don’t exactly match patterns but only approximately conf(t) = 1 - p2P(1-conf(p)match(t,p)) You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
Relation Extraction Gourmet Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 166 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 03, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript CS345Data Mining: CS345 Data Mining Mining the Web for Structured Data Our view of the web so far…: Our view of the web so far… Web pages as atomic units Great for some applications e.g., Conventional web search But not always the right model Going beyond web pages: Going beyond web pages Question answering What is the height of Mt Everest? Who killed Abraham Lincoln? Relation Extraction Find all andlt;company,CEOandgt; pairs Virtual Databases Answer database-like queries over web data E.g., Find all software engineering jobs in Fortune 500 companies Question Answering: Question Answering E.g., Who killed Abraham Lincoln? Naïve algorithm Find all web pages containing the terms 'killed' and 'Abraham Lincoln' in close proximity Extract k-grams from a small window around the terms Find the most commonly occuring k-grams Question Answering: Question Answering Naïve algorithm works fairly well! Some improvements Use sentence structure e.g., restrict to noun phrases only Rewrite questions before matching 'What is the height of Mt Everest' becomes 'The height of Mt Everest is andlt;blankandgt;' The number of pages analyzed is more important than the sophistication of the NLP For simple questions Reference: Dumais et al Relation Extraction: Relation Extraction Find pairs (title, author) Where title is the name of a book E.g., (Foundation, Isaac Asimov) Find pairs (company, hq) E.g., (Microsoft, Redmond) Find pairs (abbreviation, expansion) (ADA, American Dental Association) Can also have tuples with andgt;2 components Relation Extraction: Relation Extraction Assumptions: No single source contains all the tuples Each tuple appears on many web pages Components of tuple appear 'close' together Foundation, by Isaac Asimov Isaac Asimov’s masterpiece, the andlt;emandgt;Foundationandlt;/emandgt; trilogy There are repeated patterns in the way tuples are represented on web pages Naïve approach: Naïve approach Study a few websites and come up with a set of patterns e.g., regular expressions letter = [A-Za-z. ] title = letter{5,40} author = letter{10,30} andlt;bandgt;(title)andlt;/bandgt; by (author) Problems with naïve approach: Problems with naïve approach A pattern that works on one web page might produce nonsense when applied to another So patterns need to be page-specific, or at least site-specific Impossible for a human to exhaustively enumerate patterns for every relevant website Will result in low coverage Better approach (Brin): Better approach (Brin) Exploit duality between patterns and tuples Find tuples that match a set of patterns Find patterns that match a lot of tuples DIPRE (Dual Iterative Pattern Relation Extraction) Patterns Tuples Match Generate DIPRE Algorithm: DIPRE Algorithm R Ã SampleTuples e.g., a small set of andlt;title,authorandgt; pairs O Ã FindOccurrences(R) Occurrences of tuples on web pages Keep some surrounding context P Ã GenPatterns(O) Look for patterns in the way tuples occur Make sure patterns are not too general! R Ã MatchingTuples(P) Return or go back to Step 2 Occurrences: Occurrences e.g., Titles and authors Restrict to cases where author and title appear in close proximity on web page andlt;liandgt;andlt;bandgt; Foundation andlt;/bandgt; by Isaac Asimov (1951) url = http://www.scifi.org/bydecade/1950.html order = [title,author] (or [author,title]) denote as 0 or 1 prefix = 'andlt;liandgt;andlt;bandgt; ' (limit to e.g., 10 characters) middle = 'andlt;/bandgt; by ' suffix = '(1951) ' occurrence = (’Foundation’,’Isaac Asimov’,url,order,prefix,middle,suffix) Patterns: Patterns andlt;liandgt;andlt;bandgt; Foundation andlt;/bandgt; by Isaac Asimov (1951) andlt;pandgt;andlt;bandgt; Nightfall andlt;/bandgt; by Isaac Asimov (1941) order = [title,author] (say 0) shared prefix = andlt;bandgt; shared middle = andlt;/bandgt; by shared suffix = (19 pattern = (order,shared prefix, shared middle, shared suffix) URL Prefix: URL Prefix Patterns may be specific to a website Or even parts of it Add urlprefix component to pattern http://www.scifi.org/bydecade/1950.html occurence: andlt;liandgt;andlt;bandgt; Foundation andlt;/bandgt; by Isaac Asimov (1951) http://www.scifi.org/bydecade/1940.html occurence: andlt;pandgt;andlt;bandgt; Nightfall andlt;/bandgt; by Isaac Asimov (1941) shared urlprefix = http://www.scifi.org/bydecade/19 pattern = (urlprefix,order,prefix,middle,suffix) Generating Patterns: Generating Patterns Group occurences by order and middle Let O = set of occurences with the same order and middle pattern.order = O.order pattern.middle = O.middle pattern.urlprefix = longest common prefix of all urls in O pattern.prefix = longest common prefix of occurrences in O pattern.suffix = longest common suffix of occurrences in O Example: Example http://www.scifi.org/bydecade/1950.html occurence: andlt;liandgt;andlt;bandgt; Foundation andlt;/bandgt; by Isaac Asimov (1951) http://www.scifi.org/bydecade/1940.html occurence: andlt;pandgt;andlt;bandgt; Nightfall andlt;/bandgt; by Isaac Asimov (1941) order = [title,author] middle = ' andlt;/bandgt; by ' urlprefix = http://www.scifi.org/bydecade/19 prefix = 'andlt;bandgt; ' suffix = ' (19' Example: Example http://www.scifi.org/bydecade/1950.html occurence: Foundation, by Isaac Asimov, has been hailed… http://www.scifi.org/bydecade/1940.html occurence: Nightfall, by Isaac Asimov, tells the tale of… order = [title,author] middle = ', by ' urlprefix = http://www.scifi.org/bydecade/19 prefix = '' suffix = ', ' Pattern Specificity: Pattern Specificity We want to avoid generating patterns that are too general One approach: For pattern p, define specificity = |urlprefix||middle||prefix||suffix| Suppose n(p) = number of occurences that match the pattern p Discard patterns where n(p) andlt; nmin Discard patterns p where specificity(p)n(p) andlt; threshold Pattern Generation Algorithm: Pattern Generation Algorithm Group occurences by order and middle Let O = a set of occurences with the same order and middle p = GeneratePattern(O) If p meets specificity requirements, add p to set of patterns Otherwise, try to split O into multiple subgroups by extending the urlprefix by one character If all occurences in O are from the same URL, we cannot extend the urlprefix, so we discard O Extending the URL prefix: Extending the URL prefix Suppose O contains occurences from urls of the form http://www.scifi.org/bydecade/195?.html http://www.scifi.org/bydecade/194?.html urlprefix = http://www.scifi.org/bydecade/19 When we extend the urlprefix, we split O into two subsets: urlprefix = http://www.scifi.org/bydecade/194 urlprefix = http://www.scifi.org/bydecade/195 Finding occurrences and matches: Finding occurrences and matches Finding occurrences Use inverted index on web pages Examine resulting pages to extract occurrences Finding matches Use urlprefix to restrict set of pages to examine Scan each page using regex constructed from pattern Relation Drift: Relation Drift Small contaminations can easily lead to huge divergences Need to tightly control process Snowball (Agichtein and Gravano) Trust only tuples that match many patterns Trust only patterns with high 'support' and 'confidence' Pattern support: Pattern support Similar to DIPRE Eliminate patterns not supported by at least nmin known good tuples either seed tuples or tuples generated in a prior iteration Pattern Confidence: Pattern Confidence Suppose tuple t matches pattern p What is the probability that tuple t is valid? Call this probability the confidence of pattern p, denoted conf(p) Assume independent of other patterns How can we estimate conf(p)? Categorizing pattern matches: Categorizing pattern matches Given pattern p, suppose we can partition its matching tuples into groups p.positive, p.negative, and p.unknown Grouping methodology is application-specific Categorizing Matches: Categorizing Matches e.g., Organizations and Headquarters A tuple that exactly matches a known pair (org,hq) is positive A tuple that matches the org of a known tuple but a different hq is negative Assume org is key for relation A tuple that matches a hq that is not a known city is negative Assume we have a list of valid city names All other occurrences are unknown Categorizing Matches: Categorizing Matches Books and authors One possibility… A tuple that matches a known tuple is positive A tuple that matches the title of a known tuple but has a different author is negative Assume title is key for relation All other tuples are unknown Can come up with other schemes if we have more information e.g., list of possible legal people names Example: Example Suppose we know the tuples Foundation, Isaac Asimov Startide Rising, David Brin Suppose pattern p matches Foundation, Isaac Asimov Startide Rising, David Brin Foundation, Doubleday Rendezvous with Rama, Arthur C. Clarke |p.positive| = 2, |p.negative| = 1, |p.unknown| = 1 Pattern Confidence (1): Pattern Confidence (1) pos(p) = |p.positive| neg(p) = |p.negative| un(p) = |p.unknown| conf(p) = pos(p)/(pos(p)+neg(p)) Pattern Confidence (2): Pattern Confidence (2) Another definition – penalize patterns with many unknown matches conf(p) = pos(p)/(pos(p)+neg(p)+un(p)) where 0 · · 1 Tuple confidence: Tuple confidence Suppose candidate tuple t matches patterns p1 and p2 What is the probability that t is an valid tuple? Assume matches of different patterns are independent events Tuple confidence: Tuple confidence Pr[t matches p1 and t is not valid] = 1-conf(p1) Pr[t matches p2 and t is not valid] = 1-conf(p2) Pr[t matches {p1,p2} and t is not valid] = (1-conf(p1))(1-conf(p2)) Pr[t matches {p1,p2} and t is valid] = 1 - (1-conf(p1))(1-conf(p2)) If tuple t matches a set of patterns P conf(t) = 1 - p2P(1-conf(p)) Snowball algorithm: Snowball algorithm Start with seed set R of tuples Generate set P of patterns from R Compute support and confidence for each pattern in P Discard patterns with low support or confidence Generate new set T of tuples matching patterns P Compute confidence of each tuple in T Add to R the tuples t2T with conf(t)andgt;threshold. Go back to step 2 Some refinements: Some refinements Give more weight to tuples found earlier Approximate pattern matches Entity tagging Tuple confidence: Tuple confidence If tuple t matches a set of patterns P conf(t) = 1 - p2P(1-conf(p)) Suppose we allow tuples that don’t exactly match patterns but only approximately conf(t) = 1 - p2P(1-conf(p)match(t,p))