logging in or signing up DIPRE FunnyGuy Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 99 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 17, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Extracting Patterns and Relations from the World Wide WebAuthor: Sergey Brin: Extracting Patterns and Relations from the World Wide Web Author: Sergey Brin Shui-Lung Chuang Jun. 29, 2000Introduction: Introduction Google (by Larry Page and Sergey Brin) Rank Web pages by link authority Crawl plenty of Web data A small mirror of the World Wide Web Provide document content and hyperlink structure Web content mining What can we do for these data? Re-discover the information encoded by the authors Structure the data to more useful informationA Case StudyConstructing A Book DB: A Case Study Constructing A Book DB A book is represented as a relation (title,author) From thousands of online sources v.s. from a few large information sources (wrapper) Challenges for extracting information from the Web Distributed information sources Many different formats Web DataProposed Approach: Proposed Approach Duality of patterns and relations A good set of patterns a good set of tuples A good set of tuples a good set of patterns Dual Iterative Pattern Relation Extraction Web DataRepresentation : Representation A book (title,author) Occurrences of books (author,title,order,url,prefix,middle,suffix) Patterns for books (order,urlprefix,prefix,middle,suffix) e.g., order=T, urlurlprefix* *prefix, author, middle, title, suffix* Author: [A-Z][A-Za-z .,&]5,30[A-Za-z.] Title: [A-Z0-9][A-Za-z0-9 .,:’#!?;&]4,45[A-Za-z0-9!]Pattern Generation: Pattern Generation GenOnePattern(O) Verify that the order and the middle are the same op.order order and op.middle middle op.urlprefix longest matching prefix of all the urls op.prefix longest matching suffix of all prefix’s op.suffix longest matching prefix of all suffix’s GenPattern(O) Split O into O1,…,Ok by order and middle For each group Oi, p GenOnePattern(Oi) If p meets specificity requirements then output p, Otherwise If all o in Oi have the same URL then reject Oi Else, split Oi into subgroups by the characters in their urls. Repeat step 2 for these subgroups.Example: Example fgrep title fgrep author SeedsExample: Example fgrep title fgrep author Seeds GenPattern() Regular Expression MatchingExperiment of Finding Books: Experiment of Finding Books Total test data 24 million web pages totaling 147 gigabytes Experimental results (3 iterations) Quality of results 19/20 are bonafide books 5/20 are not found on Amazon (2.5 million books)Other Issues: Other Issues Theoretical pattern evaluation Given a large database D, R is the target relation and R’ is an approximation of R Coverage: |R’R| / |R| Error rate: |R’-R| / |R’| In the Web, a low error rate is more critical than high coverage Pattern specificity -log(P(XMd(p)), X is a random var. over a uniform dist. Estimation: specificity(p)=|middle||urlprefix||prefix||suffix| Performance ScalabilitySummary: Summary Extracting information from the Web Data source Well known: wrapper Need to detect: discriminative patterns Data format Semi-structured: formatting hints, e.g., HTML tags Plane text: linguistic hints Pattern generation Hand-written: accurate but time-consuming Auto-generated: error-prone Dual Iterative Pattern Relation Extraction -------------- -------------- -------------- -------------- -------------- Back-end DB Query Pattern/Template -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
DIPRE FunnyGuy Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 99 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 17, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Extracting Patterns and Relations from the World Wide WebAuthor: Sergey Brin: Extracting Patterns and Relations from the World Wide Web Author: Sergey Brin Shui-Lung Chuang Jun. 29, 2000Introduction: Introduction Google (by Larry Page and Sergey Brin) Rank Web pages by link authority Crawl plenty of Web data A small mirror of the World Wide Web Provide document content and hyperlink structure Web content mining What can we do for these data? Re-discover the information encoded by the authors Structure the data to more useful informationA Case StudyConstructing A Book DB: A Case Study Constructing A Book DB A book is represented as a relation (title,author) From thousands of online sources v.s. from a few large information sources (wrapper) Challenges for extracting information from the Web Distributed information sources Many different formats Web DataProposed Approach: Proposed Approach Duality of patterns and relations A good set of patterns a good set of tuples A good set of tuples a good set of patterns Dual Iterative Pattern Relation Extraction Web DataRepresentation : Representation A book (title,author) Occurrences of books (author,title,order,url,prefix,middle,suffix) Patterns for books (order,urlprefix,prefix,middle,suffix) e.g., order=T, urlurlprefix* *prefix, author, middle, title, suffix* Author: [A-Z][A-Za-z .,&]5,30[A-Za-z.] Title: [A-Z0-9][A-Za-z0-9 .,:’#!?;&]4,45[A-Za-z0-9!]Pattern Generation: Pattern Generation GenOnePattern(O) Verify that the order and the middle are the same op.order order and op.middle middle op.urlprefix longest matching prefix of all the urls op.prefix longest matching suffix of all prefix’s op.suffix longest matching prefix of all suffix’s GenPattern(O) Split O into O1,…,Ok by order and middle For each group Oi, p GenOnePattern(Oi) If p meets specificity requirements then output p, Otherwise If all o in Oi have the same URL then reject Oi Else, split Oi into subgroups by the characters in their urls. Repeat step 2 for these subgroups.Example: Example fgrep title fgrep author SeedsExample: Example fgrep title fgrep author Seeds GenPattern() Regular Expression MatchingExperiment of Finding Books: Experiment of Finding Books Total test data 24 million web pages totaling 147 gigabytes Experimental results (3 iterations) Quality of results 19/20 are bonafide books 5/20 are not found on Amazon (2.5 million books)Other Issues: Other Issues Theoretical pattern evaluation Given a large database D, R is the target relation and R’ is an approximation of R Coverage: |R’R| / |R| Error rate: |R’-R| / |R’| In the Web, a low error rate is more critical than high coverage Pattern specificity -log(P(XMd(p)), X is a random var. over a uniform dist. Estimation: specificity(p)=|middle||urlprefix||prefix||suffix| Performance ScalabilitySummary: Summary Extracting information from the Web Data source Well known: wrapper Need to detect: discriminative patterns Data format Semi-structured: formatting hints, e.g., HTML tags Plane text: linguistic hints Pattern generation Hand-written: accurate but time-consuming Auto-generated: error-prone Dual Iterative Pattern Relation Extraction -------------- -------------- -------------- -------------- -------------- Back-end DB Query Pattern/Template -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------