04 05 knowitall
Cubemiddle
Added: October 09, 2007. License: All Rights Reserved.

Presentation Transcript

KnowItAll:
April 5, 2007
William Cohen

Announcements:
Reminder: project presentations (or progress report)
  Sign up for a 30min presentation (or else)
  First pair of slots is April 17
  Last pair of slots is May 10
William is out of town April 6-April 9
  So, no office hours Friday.
Next week: no critiques assigned
  But I will lecture

Bootstrapping:
BM’98, Brin’98, Hearst ’92
Scalability, surface patterns, use of web crawlers…
Learning, semi-supervised learning, dual feature spaces…
Deeper linguistic features, free text…
Collins & Singer ’99, Riloff & Jones ’99, Cucerzan & Yarowsky ’99
Etzioni et al 2005
Rosenfeld and Feldman 2006
…
…
Stevenson & Greenwood 2005
Clever idea for learning relation patterns & strong experimental results
De-emphasize duality, focus on distance between patterns.

Know It All:

Architecture:
Set of (disjoint?) predicates to consider + two names for each
~= [H92]
Context: keywords from user to filter out non-domain pages
… ?

Architecture:

Bootstrapping - 1:
“city”
query
template
rule

Bootstrapping - 2:
Each discriminator U is a function:
  fU(x) = hits(“city x”)/hits(“x”)
i.e.
  fU(“Pittsburgh”) = hits(“city Pittsburgh”)/hits(“Pittsburgh”)
These are then used to create features: fU(x)>θ and fU(x)<θ

Bootstrapping - 3:
Submit the queries & apply the rules to produce initial seeds.
Evaluate each seed with each discriminator U: e.g., compute PMI stats like: |hits(“city Boston”)| / |hits(“Boston”)|
Take the top seeds from each class and call them POSITIVE, then use disjointness of classes to find NEGATIVE seeds.
Train a NaiveBayes classifier using thresholded U’s as features.

Bootstrapping - 4:
Estimate using the classifier based on the previously-trained discriminators
Some ad hoc stopping conditions… (“signal to noise” ratio)

Architecture - 2:

Extensions to KnowItAll:
Problem: Unsupervised learning finds clusters: what if the text doesn’t support the clustering we want?
  E.g., target is “scientist”, but natural clusters are “biologist”, “physicist”, “chemist”
Solution: subclass extraction
  Modify template/rule system to extract subclasses of target class (e.g., scientist → chemist, biologist, …)
  Check extracted subclasses with WordNet and/or PMI-like method (as for instances)
  Extract from each subclass recursively

Extensions to KnowItAll:
Problem: Set of rules is limited:
  Derived from fixed set of “templates” (general patterns ~ from H92)
Solution 1: Pattern learning: augment the initial set of rules derivable from templates
  Search for instances I on the web
  Generate patterns: some substring of I in context: “b1 … b4 I a1 … a4”
  Assume classes are disjoint and estimate recall/precision of each pattern P
  Exclude patterns that cover only one seed (very low recall)
  Take the top 200 remaining patterns and
    Evaluate them as extractors “using PMI” (?)
    Evaluate them as discriminators (in usual way?)
  Examples: “headquartered in <city>”, “<city> hotels”, …

Extensions to KnowItAll:
Solution 2: List extraction: augment the initial set of rules with rules that are local to a specific web page
  Search for pages containing small sets of instances (e.g., “London Paris Rome Pittsburgh”)
  For each page P:
    Find subtrees T of the DOM tree that contain >k seeds
    Find longest common prefix/suffix of the seeds in T
      [Some heuristics added to generalize this further]
    Find all other strings inside T with the same prefix/suffix
  Heuristically select the “best” wrapper for a page
  Wrapper = P, T, prefix, suffix

Slide15:
T1
  w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

Slide16:
T2
  w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

Slide17:
T3
  w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w3: Italy, Japan, France, Israel, Spain, Brazil

Slide18:
T4
  w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w3: Italy, Japan, France, Israel, Spain, Brazil
  w4: Italy, Japan

Slide19:
[…]

Results - City:

Results - Film:

Results - Scientist:

Observations:
Corpus is accessed indirectly through Google API
  Only use top k discriminators
  Run extractors via query keywords & extract
  Limited by network access time
Lots of moving parts to engineer
  Rule templates
  Signal-to-noise
  LE wrapper evaluation details
  Parameters: number of discriminators, number of seeds to keep, number of names per concept, …

KnowItNow: Son of KnowItAll:
Goal: faster results, not better results
Difference 1:
  Store documents locally
  Build local index (Bindings Engine) optimized for finding instances of KnowItAll rules and patterns
  Based on inverted index term
  → (doc, position, contextInfo)

KnowItNow: Son of KnowItAll:
Difference 2:
  New model (URNS model) to merge information from multiple extraction rules
  Intuition: instances generated from each extractor are assumed to be a mixture of two distributions
    Random noise from large instance pool
    Stuff with known structure (e.g., uniform, Zipf’s law, …)
  Using EM you can estimate mixture probabilities and parameters of non-noisy data
  Prob(x is noise | x extracted)

KnowItNow: Son of KnowItAll:
…
137 colors = 41% of mass
15,346 colors = 59% of mass
Prob(noise) = 0.59
Non-noisy data: uniform over 137 instances
…
Prob(noise) = 0.59
Non-noisy data: Zipf’s over >N instances
41% of mass fits power law, 59% of mass doesn’t
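The discriminator from the Bootstrapping - 2 slide, fU(x) = hits(“city x”)/hits(“x”), and its thresholded feature can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit counts in HITS are invented, and the names hits, make_discriminator and threshold_features are hypothetical helpers, not part of KnowItAll. The NaiveBayes step from Bootstrapping - 3 is not shown.

```python
from typing import Callable, Dict, List

# Invented hit counts standing in for search-engine results (assumption:
# a real system would query a search API instead of this table).
HITS: Dict[str, int] = {
    "Pittsburgh": 1000, "city Pittsburgh": 200,
    "Boston": 2000, "city Boston": 500,
    "Vertigo": 1000, "city Vertigo": 10,
}


def hits(query: str) -> int:
    """Return the hit count for a query, or 0 if the query is unknown."""
    return HITS.get(query, 0)


def make_discriminator(phrase: str) -> Callable[[str], float]:
    """Build fU(x) = hits(phrase + " " + x) / hits(x) for one discriminator phrase."""
    def f_u(x: str) -> float:
        denominator = hits(x)
        # Guard against division by zero for strings with no hits at all.
        return hits(f"{phrase} {x}") / denominator if denominator else 0.0
    return f_u


def threshold_features(
    x: str, discriminators: List[Callable[[str], float]], theta: float
) -> List[bool]:
    """One boolean feature per discriminator: True when fU(x) > theta."""
    return [f_u(x) > theta for f_u in discriminators]


f_city = make_discriminator("city")
for name in ("Pittsburgh", "Boston", "Vertigo"):
    print(name, f_city(name), threshold_features(name, [f_city], theta=0.1))
```

The slide lists both fU(x)>θ and fU(x)<θ as features; the sketch keeps only the first, since the second is its negation except at equality. The threshold 0.1 is arbitrary here.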
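The list-extraction steps on the Solution 2 slide (find the common prefix/suffix around the seeds in a subtree T, then pull out every other string with the same prefix/suffix) can be sketched roughly as follows. Assumptions: a subtree is represented simply as a list of raw HTML strings, one per node, the markup is invented, and learn_wrapper / apply_wrapper are hypothetical names. The extra generalization heuristics and the "best wrapper" selection mentioned on the slide are not modeled.

```python
import os
import re
from typing import List, Optional, Tuple


def common_suffix(strings: List[str]) -> str:
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(
    nodes: List[str], seeds: List[str], k: int = 1
) -> Optional[Tuple[str, str]]:
    """Return the (prefix, suffix) shared by all seed occurrences, or None
    when the subtree contains k or fewer distinct seeds."""
    befores, afters, matched = [], [], set()
    for node in nodes:
        for seed in seeds:
            start = node.find(seed)
            if start >= 0:
                matched.add(seed)
                befores.append(node[:start])               # text before the seed
                afters.append(node[start + len(seed):])    # text after the seed
    if len(matched) <= k:
        return None
    return common_suffix(befores), os.path.commonprefix(afters)


def apply_wrapper(nodes: List[str], prefix: str, suffix: str) -> List[str]:
    """Extract every string that sits between prefix and suffix in any node."""
    if not prefix or not suffix:
        return []  # an empty delimiter would match almost anything
    pattern = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(suffix))
    return [match for node in nodes for match in pattern.findall(node)]


# Invented subtree: three country items share one markup, the last differs.
subtree = [
    "<li><b>Italy</b></li>",
    "<li><b>Japan</b></li>",
    "<li><b>France</b></li>",
    "<li><i>Dog</i></li>",
]
wrapper = learn_wrapper(subtree, seeds=["Italy", "Japan"])
if wrapper is not None:
    print(wrapper, apply_wrapper(subtree, *wrapper))
```

With the seeds “Italy” and “Japan”, the learned wrapper picks up “France” from the same markup and skips “Dog”, which mirrors the narrower wrappers (w3, w4) on the T3 and T4 slides.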