DIPRE

Uploaded from authorPOINTLite
Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Extracting Patterns and Relations from the World Wide Web Author: Sergey Brin: 

Extracting Patterns and Relations from the World Wide Web Author: Sergey Brin Shui-Lung Chuang Jun. 29, 2000

Introduction: 

Introduction Google (by Larry Page and Sergey Brin) Rank Web pages by link authority Crawl plenty of Web data A small mirror of the World Wide Web Provide document content and hyperlink structure Web content mining What can we do for these data? Re-discover the information encoded by the authors Structure the data to more useful information

A Case Study Constructing A Book DB: 

A Case Study Constructing A Book DB A book is represented as a relation (title,author) From thousands of online sources v.s. from a few large information sources (wrapper) Challenges for extracting information from the Web Distributed information sources Many different formats Web Data

Proposed Approach: 

Proposed Approach Duality of patterns and relations A good set of patterns  a good set of tuples A good set of tuples  a good set of patterns Dual Iterative Pattern Relation Extraction Web Data

Representation : 

Representation A book (title,author) Occurrences of books (author,title,order,url,prefix,middle,suffix) Patterns for books (order,urlprefix,prefix,middle,suffix) e.g., order=T, urlurlprefix* *prefix, author, middle, title, suffix* Author: [A-Z][A-Za-z .,&]5,30[A-Za-z.] Title: [A-Z0-9][A-Za-z0-9 .,:’#!?;&]4,45[A-Za-z0-9!]

Pattern Generation: 

Pattern Generation GenOnePattern(O) Verify that the order and the middle are the same op.order  order and op.middle  middle op.urlprefix  longest matching prefix of all the urls op.prefix  longest matching suffix of all prefix’s op.suffix  longest matching prefix of all suffix’s GenPattern(O) Split O into O1,…,Ok by order and middle For each group Oi, p  GenOnePattern(Oi) If p meets specificity requirements then output p, Otherwise If all o in Oi have the same URL then reject Oi Else, split Oi into subgroups by the characters in their urls. Repeat step 2 for these subgroups.

Example: 

Example fgrep title  fgrep author Seeds

Example: 

Example fgrep title  fgrep author Seeds GenPattern() Regular Expression Matching

Experiment of Finding Books: 

Experiment of Finding Books Total test data 24 million web pages totaling 147 gigabytes Experimental results (3 iterations) Quality of results 19/20 are bonafide books 5/20 are not found on Amazon (2.5 million books)

Other Issues: 

Other Issues Theoretical pattern evaluation Given a large database D, R is the target relation and R’ is an approximation of R Coverage: |R’R| / |R| Error rate: |R’-R| / |R’| In the Web, a low error rate is more critical than high coverage Pattern specificity -log(P(XMd(p)), X is a random var. over a uniform dist. Estimation: specificity(p)=|middle||urlprefix||prefix||suffix| Performance Scalability

Summary: 

Summary Extracting information from the Web Data source Well known: wrapper Need to detect: discriminative patterns Data format Semi-structured: formatting hints, e.g., HTML tags Plane text: linguistic hints Pattern generation Hand-written: accurate but time-consuming Auto-generated: error-prone Dual Iterative Pattern Relation Extraction -------------- -------------- -------------- -------------- -------------- Back-end DB Query Pattern/Template -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------