Stalker

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

A Hierarchical Approach to Wrapper Induction [Muslea et al, In Proceedings of Agents ‘99]: 

A Hierarchical Approach to Wrapper Induction [Muslea et al, In Proceedings of Agents ‘99] Acknowledgements: Weijun Lou Dr. Graig Knoblock Dr. Ion Muslea Dr. Steve Minton Dr. Subbarao Kambhampati

Outline of the presentation: 

Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A

Slide3: 

Name: KFC Cuisine: Fast Food Locations: Venice (310) 123-4567, (800) 888-4412. L.A. (213) 987-6543. Encino (818) 999-4567, (888) 727-3131. RESTAURANT Name List ( Locations ) Cuisine City List (PhoneNumbers) AreaCode Phone The Embedded Catalog Tree (ECT)

The Embedded Catalog Tree (ECT) (continuous): 

The Embedded Catalog Tree (ECT) (continuous) Common conventions for structuring HTML page: The information on a page often exhibits some hierarchy Semistructured information is often presented in the form of lists of tuples The formalism (ECT) can describe the structure of a wide-wage of semistructured documents

The Embedded Catalog Tree (ECT) (continuous): 

The Embedded Catalog Tree (ECT) (continuous) The formalism: The leaves are the items of interest for user The internal nodes (include root) represent lists of k-tuples Each item in the k-tuple can be either a leaf l or another list L (called an embedded list) k indicates the number of items in the tuple

Outline of the presentation : 

Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A

Extraction Rules: 

Extraction Rules Why need rules: Given the ECT together with an extraction rule attached each edge and a iteration rule associated with each list node, the problem of extracting any item of interest by simply determining the path P from root to the corresponding leaf and by successively extracting each node x ∈P from its parent p.

Slide8: 

Training Example: 1: <p> Name: <b> Yala </b> <p>Cuisine: Thai <p> <i> 2: 4000 Colfax, Phoenix, AZ 85258 (602) 508-1570 3: </i><br><i> 4: 523 Vernon, Las Vegas, NV 89104 (702) 578-2293 5: </i><br><i> 6: 403 Pico, LA, CA 90007 (213) 789-0008 7: </i> Example of Rule Induction To extract the name, apply R1: SkipTo(<b>) start rule and R2: SkipTo</b> end rule to the parent node to identify the prefix and suffix of the item. Alternative rules to R1 can be R3: SkipTo(Name) SkipTo(<b>) or R4: SkipTo(Name Symbol HtmlTag). - match correctly. To extract list node, apply start rule SkipTo(<p><i>) and end rule skipTo(</i>). To break the list into individual tuples, apply SkipTo(<i>) and end rule SkipTo(</i>) iterately to the list node.

Extraction rules: 

Extraction rules How to express extraction rules ? – a sequence of landmarks: Landmark – a group of consecutive tokens (e.g. words, numbers, HTML tags, substrings, etc), eg. <b> Wildcard – a class of tokens, eg. Number, Sign, HtmlTag, AllCaps, etc. Function-like expression which take a landmark or a wildcard as arguments, eg. SkipTo(:<b>) Cascade functions, eg. SkipTo(AllCaps) NextLandmark(Number) Disjunctive rule, eg. either SkipTo(<b>) or SkipTo(<i>)

Slide10: 

Training Example: 1: <p> Name: <b> Yala </b> <p>Cuisine: Thai <p> <i> 2: 4000 Colfax, Phoenix, AZ 85258 (602) 508-1570 3: </i><br><i> 4: 523 Vernon, Las Vegas, NV 89104 (702) 578-2293 5: </i><br><i> 6: 403 Pico, LA, CA 90007 (213) 789-0008 7: </i> Example of Rule Induction (cont’s) To extract area-code, apply SkipTo(‘(‘) and end rule SkipTo(‘)’) iterately to each tuple To find beginning of ZIP code, apply R5: SkipTo(,) SkipUntil(Number) or R6: SkipTo(AllCaps) NextLandmark(Number)

Extraction Rules (cont’s): 

Extraction Rules (cont’s) Advantages Hierarchical extraction based on the ECT formalism allows to wrap info sources that have arbitrary many levels Each node is extracted independently of its siblings, allows missing items or items ordering variation in info sources. In general, using inductive algorithm (Stalker) with hierarchical extraction (ECT) turns an extremely hard problem into several simple ones: rather than finding a single rule, find a number of simpler rules which deal with the easier task of extracting one item from its ECT

Outline of the presentation: 

Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A

Extraction Rules as Finite Automata: 

Extraction Rules as Finite Automata Landmark automata: a group of function-like SkipTo()s that must be applied in a pro-established order represents a landmark automaton. Any extraction rule in previous slides is a landmark automaton. Linear landmark: landmark described by a sequence of tokens and wildcards. Linear landmarks are sufficiently expressive to allow efficient navigation within ECT Linear landmarks can be generated and refined in a simple way during learning

Extraction Rules as Finite Automata (cont’s): 

Extraction Rules as Finite Automata (cont’s) Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone: (818)508-1570 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 E4: 403 Vernon, <b>Watts</b>, Phone: (310)798-0008 Rule for start of the area code: either Skipto(‘(‘) or SkipTo (Phone) SkipTo (<b>) An SLG for the start of the area code Landmark automata (LA) are nondeterministic finite automata in which Si->Sj (i=j) is labeled by a landmark li,j; i.e. the transition takes place iff the input in the state is a string that is accepted by the landmark li,j. Simple Landamark Grammars (SLGs) are the class of Las that correspond to the disjuctive rules

Extraction Rules as Finite Automata (cont’s): 

Extraction Rules as Finite Automata (cont’s) The properties of SLG: The initial state S0 has a branching –factor of k It has exactly k accepting states (one per branch) All k branches that leave the S0 are sequential Las Linear Landmarks label each non loop transitions; All looping transitions have the meaning “consume all tokens until you encounter the linear landmark that leads to the next state s3 s0 s1 s2 ( Phone <b> s0

Outline of the presentation: 

Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A

Example of Extraction Task: 

Example of Extraction Task NAME Casablanca Restaurant STREET 220 Lincoln Boulevard CITY Venice PHONE (310) 392-5751

The Wrapper Architecture: 

Extraction Rules Query Data Information Extractor EC Tree The Wrapper Architecture

Learning the Extraction Rules: 

Inductive Learning System Extraction Rules Labeled Pages Learning the Extraction Rules GUI

The STALKER Algorithm: 

The STALKER Algorithm STALKER(Examples) let RetVal be an empty SLG WHILE Examples =0 aDisjunct = LearnDisjunct(Examples) remove all examples covered by aDisjunct add aDisjunct to RetVal return RetVal LearnDisjunct(Examples) Terminals = Wildcards U GetTokens(Examples) Candidates = GetInitailCandidateds(Examples) WHILE candidates ≠0 DO let D = BestDisjunct (Candidates) IF D is a perfect Disjunct THEN return D FOR EACH t ∈Terminals DO Candidates = Candidates U Refine(D, t) remove D from Candidates return best disjunct

Slide21: 

Example of Rule Induction Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone: (818)508-1570 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 E4: 403 Vernon, <b>Watts</b>, Phone: (310)798-0008 Initial candidates in the first iteration R1 R2 R3 R4

Slide22: 

Example of Rule Induction Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 Initial candidates and Refinements in the second iteration

Slide23: 

STALKER (cont’s) Stalker is a typical sequential covering algorithm Stalker can be easily extended to uses SkipUntil() and NextLandMark() feature No change on the rule refining process GenerateInitCandidates() need to change to also generate SkipTo()SkipUntil() and SkipTo()NexLandmark()

Outline of the presentation: 

Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A

Experimental Results: 

Experimental Results Illustrative information sources Comparison between WIEN and STALKER Correctness levels for S1 and S2

Results Analysis: 

Results Analysis Stalker usually requires less than 10 examples to obtain a 97% average accuracy over 500 trials Termination criteria Stalker: either reach the 97% accuracy threshold or consumed 10 examples WIEN: 100% accuracy The most impressiveness lies in WIEN uses two orders of magnitude more example than Stalker While Stalker allows both missing items and various item ordering, accuracy ranges between 85% and 100% Stalker is reasonably fast Stalker: less than 20 sec per experiment for S1 and S2; less than 40 sec per item for S2 and S4. WIEN: 5 sec per experiment for S1 and 83 sec for S2 For easier task like S2 which has extremely regular structure, except for list extraction rule, all other rules have a 100% accuracy based on a single example! The learning curves for the list extraction and list iteration for all four sources show that these two rules can be extracted relative independently of the extraction difficulty of different source, they all converges quick and get accuracy level above 95% - From another perspective, it proves the usefulness of EC formalism

Outline of the presentation: 

Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A

Strength vs Weakness: 

Strength vs Weakness Strength: Powerful extraction language EC Formalism: turns hard problem into a problem that is extremely easy in practice. Stalker: allows miss items and various item orderings One hard-to-extract item does not affect others Disadvantage: Requires hand-craft labelling, though usually 10 examples are sufficient. Man-made thresholds in termination criteria. For some hard tasks like S4, the accuracy obtained is still not satisfactory.

Outline of the presentation: 

Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A