Information Extractionfrom the World Wide Web : Information Extraction from the World Wide Web William W. Cohen
Carnegie Mellon University
Andrew McCallum
University of Massachusetts Amherst
KDD 2003
Example: The Problem : Example: The Problem Martin Baker, a person Genomics job Employers job posting form
Example: A Solution : Example: A Solution
Extracting Job Openings from the Web : Extracting Job Openings from the Web
Slide5 : Job Openings:
Category = Food Services
Keyword = Baker
Location = Continental U.S.
Slide6 : Data Mining the Extracted Job Information
Slide7 : IE from Research Papers
IE fromChinese Documents regarding Weather : IE from Chinese Documents regarding Weather Chinese Academy of Sciences 200k+ documents
several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
IE from SEC Filings : IE from SEC Filings This filing covers the period from December 1996 to September 1997.
ENRON GLOBAL POWER andamp; PIPELINES L.L.C.
CONSOLIDATED BALANCE SHEETS
(IN THOUSANDS, EXCEPT SHARE AMOUNTS)
SEPTEMBER 30, DECEMBER 31,
1997 1996
------------- ------------
(UNAUDITED)
ASSETS
Current Assets
Cash and cash equivalents $ 54,262 $ 24,582
Accounts receivable 8,473 6,301
Current portion of notes receivable 1,470 1,394
Other current assets 336 404
-------- --------
Total Current Assets 71,730 32,681
-------- --------
Investments in to Unconsolidated Subsidiaries 286,340 298,530
Notes Receivable 16,059 12,111
-------- --------
Total Assets $374,408 $343,843
======== ========
LIABILITIES AND SHAREHOLDERS' EQUITY
Current Liabilities
Accounts payable $ 13,461 $ 11,277
Accrued taxes 1,910 1,488
-------- --------
Total Current Liabilities 15,371 49,348
-------- --------
Deferred Income Taxes 525 4,301
The U.S. energy markets in 1997 were subject to significant fluctuation Data mine these reports for
- suspicious behavior,
to better understand
what is normal.
What is “Information Extraction” : What is 'Information Extraction' Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation.
Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.'
Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION
What is “Information Extraction” : What is 'Information Extraction' Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation.
Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.'
Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION
Bill Gates CEO Microsoft
Bill Veghte VP Microsoft
Richard Stallman founder Free Soft.. IE
What is “Information Extraction” : What is 'Information Extraction' Information Extraction =
segmentation + classification + clustering + association As a family of techniques: October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation.
Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.'
Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction” : What is 'Information Extraction' Information Extraction =
segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation.
Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.'
Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction” : What is 'Information Extraction' Information Extraction =
segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation.
Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.'
Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction” : What is 'Information Extraction' Information Extraction =
segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation.
Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.'
Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation * * * *
IE in Context : IE in Context Create ontology Segment
Classify
Associate
Cluster Load DB Spider Query,
Search Data mine IE Document collection Database Filter by relevance Label training data Train extraction models
Why IE from the Web? : Why IE from the Web? Science
Grand old dream of AI: Build large KB* and reason with it. IE from the Web enables the creation of this KB.
IE from the Web is a complex problem that inspires new advances in machine learning.
Profit
Many companies interested in leveraging data currently 'locked in unstructured text on the Web'.
Not yet a monopolistic winner in this space.
Fun!
Build tools that we researchers like to use ourselves: Cora andamp; CiteSeer, MRQE.com, FAQFinder,…
See our work get used by the general public. * KB = 'Knowledge Base'
Tutorial Outline : Tutorial Outline IE History
Landscape of problems and solutions
Parade of models for segmenting/classifying:
Sliding window
Boundary finding
Finite state machines
Trees
Overview of related problems and solutions
Association, Clustering
Integration with Data Mining
Where to go from here 15 min break
IE History : IE History Pre-Web
Mostly news articles
De Jong’s FRUMP [1982]
Hand-built system to fill Schank-style 'scripts' from news wire
Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]
Most early work dominated by hand-built models
E.g. SRI’s FASTUS, hand-built FSMs.
But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
AAAI ’94 Spring Symposium on 'Software Agents'
Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
Tom Mitchell’s WebKB, ‘96
Build KB’s from the Web.
Wrapper Induction
Initially hand-build, then ML: [Soderland ’96], [Kushmeric ’97],…
What makes IE from the Web Different? : www.apple.com/retail What makes IE from the Web Different? Less grammar, but more formatting andamp; linking The directory structure, link structure, formatting andamp; layout of the Web is its own new grammar. Apple to Open Its First Retail Store
in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.
'Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week,' said Steve Jobs, Apple's CEO. 'We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles.' www.apple.com/retail/soho www.apple.com/retail/soho/theatre.html Newswire Web
Landscape of IE Tasks (1/4):Pattern Feature Domain : Landscape of IE Tasks (1/4): Pattern Feature Domain Text paragraphs
without formatting Grammatical sentences and some formatting andamp; links Non-grammatical snippets, rich formatting andamp; links Tables Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
Landscape of IE Tasks (2/4):Pattern Scope : Landscape of IE Tasks (2/4): Pattern Scope Web site specific Genre specific Wide, non-specific Amazon.com Book Pages Resumes University Names Formatting Layout Language
Landscape of IE Tasks (3/4):Pattern Complexity : Landscape of IE Tasks (3/4): Pattern Complexity Closed set He was born in Alabama… Regular set Phone: (413) 545-1323 Complex pattern University of Arkansas
P.O. Box 140
Hope, AR 71802 …was among the six houses sold by Hope Feldman that year. Ambiguous patterns, needing context and many sources of evidence The CALD main office can be reached at 412-268-1299 The big Wyoming sky… U.S. states U.S. phone numbers U.S. postal addresses Person names Headquarters:
1128 Main Street, 4th Floor
Cincinnati, Ohio 45210 Pawel Opalinski, Software Engineer at WhizBang Labs. E.g. word patterns:
Landscape of IE Tasks (4/4):Pattern Combinations : Landscape of IE Tasks (4/4): Pattern Combinations Single entity Person: Jack Welch Binary relationship Relation: Person-Title
Person: Jack Welch
Title: CEO N-ary record 'Named entity' extraction Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Relation: Company-Location
Company: General Electric
Location: Connecticut Relation: Succession
Company: General Electric
Title: CEO
Out: Jack Welsh
In: Jeffrey Immelt Person: Jeffrey Immelt Location: Connecticut
Evaluation of Single Entity Extraction : Evaluation of Single Entity Extraction Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. TRUTH: PRED: Precision = = # correctly predicted segments 2 # predicted segments 6 Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. Recall = = # correctly predicted segments 2 # true segments 4 F1 = Harmonic mean of Precision andamp; Recall = ((1/P) + (1/R)) / 2 1
State of the Art Performance : State of the Art Performance Named entity recognition
Person, Location, Organization, …
F1 in high 80’s or low- to mid-90’s
Binary relation extraction
Contained-in (Location1, Location2) Member-of (Person1, Organization1)
F1 in 60’s or 70’s or 80’s
Wrapper induction
Extremely accurate performance obtainable
Human effort (~30min) required on each site
Landscape of IE Techniques (1/1):Models : Landscape of IE Techniques (1/1): Models Any of these models can be used to capture words, formatting or both. Lexicons Alabama
Alaska
…
Wisconsin
Wyoming Abraham Lincoln was born in Kentucky. member? …and beyond
Landscape:Focus of this Tutorial : Landscape: Focus of this Tutorial Pattern complexity Pattern feature domain Pattern scope Pattern combinations Models closed set regular complex ambiguous words words + formatting formatting site-specific genre-specific general entity binary n-ary lexicon regex window boundary FSM CFG
Sliding Windows : Sliding Windows
Extraction by Sliding Window : Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement
Extraction by Sliding Window : Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement
Extraction by Sliding Window : Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement
Extraction by Sliding Window : Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement
A “Naïve Bayes” Sliding Window Model : A 'Naïve Bayes' Sliding Window Model [Freitag 1997] 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun w t-m w t-1 w t w t+n w t+n+1 w t+n+m prefix contents suffix Other examples of sliding window: [Baluja et al 2000]
(decision tree over individual words andamp; their context) If P('Wean Hall Rm 5409' = LOCATION) is above some threshold, extract it. … … Estimate Pr(LOCATION|window) using Bayes rule
Try all 'reasonable' windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr('Place' in prefix|LOCATION)
“Naïve Bayes” Sliding Window Results : 'Naïve Bayes' Sliding Window Results GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. Domain: CMU UseNet Seminar Announcements Field F1
Person Name: 30%
Location: 61%
Start Time: 98%
SRV: a realistic sliding-window-classifier IE system : SRV: a realistic sliding-window-classifier IE system What windows to consider?
all windows containing as many tokens as the shortest example, but no more tokens than the longest example
How to represent a classifier? It might:
Restrict the length of window;
Restrict the vocabulary or formatting used before/after/inside window;
Restrict the relative order of tokens;
Etc…
andlt;titleandgt;Course Information for CS213andlt;/titleandgt;
andlt;h1andgt;CS 213 C++ Programmingandlt;/h1andgt; [Frietag AAAI ‘98]
SRV: a rule-learner for sliding-window classification : SRV: a rule-learner for sliding-window classification Top-down rule learning:
let RULES = ;;
while (there are uncovered positive examples) {
// construct a rule R to add to RULES
let R be a rule covering all examples;
while (R covers too many negative examples) {
let C = argmaxC VALUE( R, Randamp;C, uncoveredExamples)
over some set of candidate conditions C
let R = R - C;
}
let RULES = RULES + {R};
}
SRV: a rule-learner for sliding-window classification : SRV: a rule-learner for sliding-window classification Rule learning: greedily add conditions to rules, rules to rule set
Search metric: SRV algorithm greedily adds conditions to maximize 'information gain'
To prevent overfitting:
rules are built on 2/3 of data, then their false positive rate is estimated on the 1/3 holdout set.
Candidate conditions: … courseNumber(X) :-
tokenLength(X,=,2),
every(X, inTitle, false),
some(X, A, andlt;previousTokenandgt;, inTitle, true),
some(X, B, andlt;andgt;, tripleton, true)
Learning “first-order” rules : Learning 'first-order' rules A sample 'zero-th' order rule set:
(tok1InTitle andamp; tok1StartsPara andamp; tok2triple)
or (prevtok2EqCourse andamp; prevtok1EqNumber) or …
First-order 'rules' can be learned the same way—with additional search to find best 'condition'
phrase(X) :- firstToken(X,A), not startPara(A),
nextToken(A,B), triple(B)
phrase(X) :- firstToken(X,A), prevToken(A,C), eq(C,’number’),
prevToken(C,D), eq(D,’course’)
Semantics:
'p(X) :- q(X),r(X,Y),s(Y)' = '{X : exists Y : q(X) and r(X,Y) and s(Y)}'
SRV: a rule-learner for sliding-window classification : SRV: a rule-learner for sliding-window classification Primitive predicates used by SRV:
token(X,W), allLowerCase(W), numerical(W), …
nextToken(W,U), previousToken(W,V)
HTML-specific predicates:
inTitleTag(W), inH1Tag(W), inEmTag(W),…
emphasized(W) = 'inEmTag(W) or inBTag(W) or …'
tableNextCol(W,U) = 'U is some token in the column after the column W is in'
tablePreviousCol(W,V), tableRowHeader(W,T),…
SRV: a rule-learner for sliding-window classification : SRV: a rule-learner for sliding-window classification Non-primitive 'conditions' used by SRV:
every(+X, f, c) = for all W in X : f(W)=c
some(+X, W, andlt;f1,…,fkandgt;, g, c)= exists W: g(fk(…(f1(W)…))=c
tokenLength(+X, relop, c):
position(+W,direction,relop, c):
e.g., tokenLength(X,andgt;,4), position(W,fromEnd,andlt;,2)
Utility of non-primitive conditions in greedy rule search : Utility of non-primitive conditions in greedy rule search Greedy search for first-order rules is hard because useful conditions can give no immediate benefit:
phrase(X) :- token(X,A), prevToken(A,B),inTitle(B),
nextToken(A,C), tripleton(C)
Rapier: an alternative approach : Rapier: an alternative approach A bottom-up rule learner:
initialize RULES to be one rule per example;
repeat {
randomly pick N pairs of rules (Ri,Rj);
let {G1…,GN} be the consistent pairwise generalizations;
let G* = Gi that optimizes 'compression'
let RULES = RULES + {G*} – {R’: covers(G*,R’)}
}
where compression(G,RULES) = size of RULES- {R’: covers(G,R’)} and 'covers(G,R)' means every example matching G matches R [Califf andamp; Mooney, AAAI ‘99]
Slide44 : andlt;titleandgt;Course Information for CS213andlt;/titleandgt;
andlt;h1andgt;CS 213 C++ Programmingandlt;/h1andgt; … andlt;titleandgt;Syllabus and meeting times for Eng 214andlt;/titleandgt;
andlt;h1andgt;Eng 214 Software Engineering for Non-programmers andlt;/h1andgt;… courseNum(window1) :- token(window1,’CS’), doubleton(‘CS’), prevToken(‘CS’,’CS213’), inTitle(‘CS213’), nextTok(‘CS’,’213’), numeric(‘213’), tripleton(‘213’), nextTok(‘213’,’C++’), tripleton(‘C++’), …. courseNum(window2) :- token(window2,’Eng’), tripleton(‘Eng’), prevToken(‘Eng’,’214’), inTitle(‘214’), nextTok(‘Eng’,’214’), numeric(‘214’), tripleton(‘214’), nextTok(‘214’,’Software’), … courseNum(X) :- token(X,A),
prevToken(A, B), inTitle(B), nextTok(A,C)),
numeric(C), tripleton(C), nextTok(C,D), …
Rapier: an alternative approach : Rapier: an alternative approach Combines top-down and bottom-up learning
Bottom-up to find common restrictions on content
Top-down greedy addition of restrictions on context
Use of part-of-speech and semantic features (from WORDNET).
Special 'pattern-language' based on sequences of tokens, each of which satisfies one of a set of given constraints
andlt; andlt;tok2{‘ate’,’hit’},POS2{‘vb’}andgt;, andlt;tok2{‘the’}andgt;, andlt;POS2{‘nn’andgt;andgt;
Rapier: results – precision/recall : Rapier: results – precision/recall
Rapier – results vs. SRV : Rapier – results vs. SRV
Rule-learning approaches to sliding-window classification: Summary : Rule-learning approaches to sliding-window classification: Summary SRV, Rapier, and WHISK [Soderland KDD ‘97]
Representations for classifiers allow restriction of the relationships between tokens, etc
Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
Use of these 'heavyweight' representations is complicated, but seems to pay off in results
Can simpler representations for classifiers work?
BWI: Learning to detect boundaries : BWI: Learning to detect boundaries Another formulation: learn three probabilistic classifiers:
START(i) = Prob( position i starts a field)
END(j) = Prob( position j ends a field)
LEN(k) = Prob( an extracted field has length k)
Then score a possible extraction (i,j) by
START(i) * END(j) * LEN(j-i)
LEN(k) is estimated from a histogram
[Freitag andamp; Kushmerick, AAAI 2000]
BWI: Learning to detect boundaries : BWI: Learning to detect boundaries BWI uses boosting to find 'detectors' for START and END
Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).
Each 'pattern' is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
Weak learner for 'patterns' uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns
BWI: Learning to detect boundaries : BWI: Learning to detect boundaries Field F1
Person Name: 30%
Location: 61%
Start Time: 98%
Problems with Sliding Windows and Boundary Finders : Problems with Sliding Windows and Boundary Finders Decisions in neighboring parts of the input are made independently from each other.
Naïve Bayes Sliding Window may predict a 'seminar end time' before the 'seminar start time'.
It is possible for two overlapping windows to both be above threshold.
In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
Finite State Machines : Finite State Machines
Hidden Markov Models : Hidden Markov Models S t - 1 S t O t S t+1 O t +1 O t - 1 ... ... Finite state model Graphical model Parameters: for all states S={s1,s2,…}
Start state probabilities: P(st )
Transition probabilities: P(st|st-1 )
Observation (emission) probabilities: P(ot|st )
Training:
Maximize probability of training observations (w/ prior) HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
... transitions observations o1 o2 o3 o4 o5 o6 o7 o8 Generates:
State
sequence
Observation
sequence Usually a multinomial over atomic, fixed alphabet
IE with Hidden Markov Models : IE with Hidden Markov Models Yesterday Pedro Domingos spoke this example sentence. Yesterday Pedro Domingos spoke this example sentence. Person name: Pedro Domingos Given a sequence of observations: and a trained HMM: Find the most likely state sequence: (Viterbi) Any words said to be generated by the designated 'person name'
state extract as a person name: person name location name background
HMM Example: “Nymble” : HMM Example: 'Nymble' Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99] Task: Named Entity Extraction Train on ~500k words of news wire text. Case Language F1 .
Mixed English 93%
Upper English 91%
Mixed Spanish 90% [Bikel, et al 1998], [BBN 'IdentiFinder'] Person Org Other (Five other name classes) start-of-sentence end-of-sentence Transition probabilities Observation probabilities P(st | st-1, ot-1 ) P(ot | st , st-1 ) Back-off to: Back-off to: P(st | st-1 ) P(st ) P(ot | st , ot-1 ) P(ot | st ) P(ot ) or Results:
We want More than an Atomic View of Words : We want More than an Atomic View of Words Would like richer representation of text:
many arbitrary, overlapping features of the words. S t - 1 S t O t S t+1 O t +1 O t - 1 identity of word
ends in '-ski'
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in hyperlink anchor
last person name was female
next two words are 'and Associates' … … part of noun phrase is 'Wisniewski' ends in
'-ski'
Problems with Richer Representationand a Joint Model : Problems with Richer Representation and a Joint Model These arbitrary features are not independent.
Multiple levels of granularity (chars, words, phrases)
Multiple dependent modalities (words, formatting, layout)
Past andamp; future
Two choices: Model the dependencies.
Each state would have its own Bayes Net. But we are already starved for training data! Ignore the dependencies.
This causes 'over-counting' of evidence (ala naïve Bayes). Big problem when combining evidence, as in Viterbi! S t - 1 S t O t S t+1 O t +1 O t - 1 S t - 1 S t O t S t+1 O t +1 O t - 1
Conditional Sequence Models : Conditional Sequence Models We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o):
Can examine features, but not responsible for generating them.
Don’t have to explicitly model their dependencies.
Don’t 'waste modeling effort' trying to generate what we are given at test time anyway.
From HMMs to CRFs : Joint Conditional St-1 St Ot St+1 Ot+1 Ot-1 ... ... (A super-special case of
Conditional Random Fields.) Conditional Finite State Sequence Models From HMMs to CRFs [Lafferty, McCallum, Pereira 2001] [McCallum, Freitag andamp; Pereira, 2000] where
Conditional Random Fields : Conditional Random Fields St St+1 St+2 O = Ot, Ot+1, Ot+2, Ot+3, Ot+4 St+3 St+4 1. FSM special-case:
linear chain among unknowns, parameters tied across time steps. [Lafferty, McCallum, Pereira 2001] 2. In general:
CRFs = 'Conditionally-trained Markov Network'
arbitrary structure among unknowns 3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]:
Parameters tied across hits from SQL-like queries ('clique templates')
Feature Functions : Feature Functions Yesterday Pedro Domingos spoke this example sentence. s3 s1 s2 s4 o = o1 o2 o3 o4 o5 o6 o7
Efficient Inference : Efficient Inference
Learning Parameters of CRFs : Learning Parameters of CRFs Methods:
iterative scaling (quite slow)
conjugate gradient (much faster)
limited-memory quasi-Newton methods, BFGS (super fast)
[Sha andamp; Pereira 2002] andamp; [Malouf 2002] Maximize log-likelihood of parameters L = {lk} given training data D Log-likelihood gradient:
Voted Perceptron Sequence Models : Voted Perceptron Sequence Models [Collins 2002] Like CRFs with stochastic gradient ascent and a Viterbi approximation. Avoids calculating the partition function (normalizer), Zo,
but gradient ascent, not 2nd-order or conjugate gradient method. Analogous to the gradient for this one training instance
General CRFs vs. HMMs : General CRFs vs. HMMs More general and expressive modeling technique
Comparable computational efficiency
Features may be arbitrary functions of any or all observations
Parameters need not fully specify generation of observations; require less training data
Easy to incorporate domain knowledge
State means only 'state of process', vs 'state of process' and 'observational history I’m keeping'
MEMM & CRF Related Work : MEMM andamp; CRF Related Work Maximum entropy for language tasks:
Language modeling [Rosenfeld ‘94, Chen andamp; Rosenfeld ‘99]
Part-of-speech tagging [Ratnaparkhi ‘98]
Segmentation [Beeferman, Berger andamp; Lafferty ‘99]
Named entity recognition 'MENE' [Borthwick, Grishman,…’98]
HMMs for similar language tasks
Part of speech tagging [Kupiec ‘92]
Named entity recognition [Bikel et al ‘99]
Other Information Extraction [Leek ‘97], [Freitag andamp; McCallum ‘99]
Serial Generative/Discriminative Approaches
Speech recognition [Schwartz andamp; Austin ‘93]
Reranking Parses [Collins, ‘00]
Other conditional Markov models
Non-probabilistic local decision models [Brill ‘95], [Roth ‘98]
Gradient-descent on state path [LeCun et al ‘98]
Markov Processes on Curves (MPCs) [Saul andamp; Rahim ‘99]
Voted Perceptron-trained FSMs [Collins ’02]
Person name Extraction : Person name Extraction [McCallum 2001, unpublished]
Person name Extraction : Person name Extraction
Features in Experiment : Features in Experiment Capitalized Xxxxx
Mixed Caps XxXxxx
All Caps XXXXX
Initial Cap X….
Contains Digit xxx5
All lowercase xxxx
Initial X
Punctuation .,:;!(), etc
Period .
Comma ,
Apostrophe ‘
Dash -
Preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
In stopword list (the, of, their, etc)
In honorific list (Mr, Mrs, Dr, Sen, etc)
In person suffix list (Jr, Sr, PhD, etc)
In name particle list (de, la, van, der, etc)
In Census lastname list; segmented by P(name)
In Census firstname list; segmented by P(name)
In locations lists (states, cities, countries)
In company name list ('J. C. Penny')
In list of company suffixes (Inc, andamp; Associates, Foundation) Hand-built FSM person-name extractor says yes, (prec/recall ~ 30/95)
Conjunctions of all previous feature pairs, evaluated at the current time step.
Conjunctions of all previous feature pairs, evaluated at current step and one step ahead.
All previous features, evaluated two steps ahead.
All previous features, evaluated one step behind.
Total number of features = ~500k
Training and Testing : Training and Testing Trained on 65k words from 85 pages, 30 different companies’ web sites.
Training takes 4 hours on a 1 GHz Pentium.
Training precision/recall is 96% / 96%.
Tested on different set of web pages with similar size characteristics.
Testing precision is 92 – 95%, recall is 89 – 91%.
Part-of-speech Tagging : Part-of-speech Tagging The asbestos fiber , crocidolite, is unusually resilient once
it enters the lungs , with even brief exposures to it causing
symptoms that show up decades later , researchers said . DT NN NN , NN , VBZ RB JJ IN
PRP VBZ DT NNS , IN RB JJ NNS TO PRP VBG
NNS WDT VBP RP NNS JJ , NNS VBD . 45 tags, 1M words training data, Penn Treebank Using spelling features* * use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
[Lafferty, McCallum, Pereira 2001]
Table Extraction from Government Reports : Table Extraction from Government Reports Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat:
United States, 1993-95
--------------------------------------------------------------------------------
: : Production of Milk and Milkfat 2/
: Number :-------------------------------------------------------
Year : of : Per Milk Cow : Percentage : Total
:Milk Cows 1/:-------------------: of Fat in All :------------------
: : Milk : Milkfat : Milk Produced : Milk : Milkfat
--------------------------------------------------------------------------------
: 1,000 Head --- Pounds --- Percent Million Pounds
:
1993 : 9,589 15,704 575 3.66 150,582 5,514.4
1994 : 9,500 16,175 592 3.66 153,664 5,623.7
1995 : 9,461 16,451 602 3.66 155,644 5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
Table Extraction from Government Reports : Table Extraction from Government Reports Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat:
United States, 1993-95
--------------------------------------------------------------------------------
: : Production of Milk and Milkfat 2/
: Number :-------------------------------------------------------
Year : of : Per Milk Cow : Percentage : Total
:Milk Cows 1/:-------------------: of Fat in All :------------------
: : Milk : Milkfat : Milk Produced : Milk : Milkfat
--------------------------------------------------------------------------------
: 1,000 Head --- Pounds --- Percent Million Pounds
:
1993 : 9,589 15,704 575 3.66 150,582 5,514.4
1994 : 9,500 16,175 592 3.66 153,664 5,623.7
1995 : 9,461 16,451 602 3.66 155,644 5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
CRF Labels: Non-Table
Table Title
Table Header
Table Data Row
Table Section Data Row
Table Footnote
... (12 in all) [Pinto, McCallum, Wei, Croft, 2003] Features: Percentage of digit chars
Percentage of alpha chars
Indented
Contains 5+ consecutive spaces
Whitespace in this line aligns with prev.
...
Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
100+ documents from www.fedstats.gov
Table Extraction Experimental Results : Table Extraction Experimental Results Line labels,
percent correct 95 % 65 % D error
= 85% 85 % HMM Stateless
MaxEnt CRF w/out
conjunctions CRF 52 % [Pinto, McCallum, Wei, Croft, 2003]
Named Entity Recognition : Named Entity Recognition CRICKET -
MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract.
Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional. Labels: Examples: PER Yayuk Basuki
Innocent Butare
ORG 3M
KDP
Leicestershire
LOC Leicestershire
Nirmal Hriday
The Oval
MISC Java
Basque
1,000 Lakes Rally Reuters stories on international news Train on ~300k words
Automatically Induced Features : Automatically Induced Features Index Feature
0 inside-noun-phrase (ot-1)
5 stopword (ot)
20 capitalized (ot+1)
75 word=the (ot)
100 in-person-lexicon (ot-1)
200 word=in (ot+2)
500 word=Republic (ot+1)
711 word=RBI (ot) andamp; header=BASEBALL
1027 header=CRICKET (ot) andamp; in-English-county-lexicon (ot)
1298 company-suffix-word (firstmentiont+2)
4040 location (ot) andamp; POS=NNP (ot) andamp; capitalized (ot) andamp; stopword (ot-1)
4945 moderately-rare-first-name (ot-1) andamp; very-common-last-name (ot)
4474 word=the (ot-2) andamp; word=of (ot) [McCallum 2003]
Named Entity Extraction Results : Named Entity Extraction Results Method F1 # parameters
BBN's Identifinder, word features 79% ~500k
CRFs word features, 80% ~500k
w/out Feature Induction
CRFs many features, 75% ~3 million
w/out Feature Induction
CRFs many candidate features 90% ~60k
with Feature Induction [McCallum andamp; Li, 2003]
Inducing State-Transition Structure : Inducing State-Transition Structure [Chidlovskii, 2000] K-reversible grammars Structure learning for HMMs + IE
[Seymore et al 1999]
[Frietag andamp; McCallum 2000]
Limitations of Finite State Models : Limitations of Finite State Models Finite state models have a linear structure
Web documents have a hierarchical structure
Are we suffering by not modeling this structure more explicitly?
How can one learn a hierarchical extraction model?
Tree-based Models : Tree-based Models
Slide82 : Extracting from one web site
Use site-specific formatting information: e.g., 'the JobTitle is a bold-faced paragraph in column 2'
For large well-structured sites, like parsing a formal language
Extracting from many web sites:
Need general solutions to entity extraction, grouping into records, etc.
Primarily use content information
Must deal with a wide range of ways that users present data.
Analogous to parsing natural language
Problems are complementary:
Site-dependent learning can collect training data for a site-independent learner
Site-dependent learning can boost accuracy of a site-independent learner on selected key sites
Slide83 :
Slide84 :
Slide85 :
STALKER: Hierarchical boundary finding : STALKER: Hierarchical boundary finding Main idea:
To train a hierarchical extractor, pose a series of learning problems, one for each node in the hierarchy
At each stage, extraction is simplified by knowing about the 'context.'
[Muslea,Minton andamp; Knoblock 99]
Slide87 :
Slide88 :
Slide89 :
Slide90 : (BEFORE=(:), AFTER=null)
Slide91 : (BEFORE=(:), AFTER=null)
Slide92 : (BEFORE=(:), AFTER=null)
Stalker: hierarchical decomposition of two web sites : Stalker: hierarchical decomposition of two web sites
Stalker: summary and results : Stalker: summary and results Rule format:
'landmark automata' format for rules which extended BWI’s format
E.g.: andlt;aandgt;W. Cohenandlt;/aandgt; CMU: Web IE andlt;/liandgt;
BWI: BEFORE=(andlt;, /, a,andgt;, ANY, :)
STALKER: BEGIN = SkipTo(andlt;, /, a, andgt;), SkipTo(:)
Top-down rule learning algorithm
Carefully chosen ordering between types of rule specializations
Very fast learning: e.g. 8 examples vs. 274
A lesson: we often control the IE training data!
Why low sample complexity is important in “wrapper learning” : Why low sample complexity is important in 'wrapper learning' At training time, only four examples are available—but
one would like to generalize to future pages as well…
“Wrapster”: a hybrid approach to representing wrappers : 'Wrapster': a hybrid approach to representing wrappers Common representations for web pages include:
a rendered image
a DOM tree (tree of HTML markup andamp; text)
gives some of the power of hierarchical decomposition
a sequence of tokens
a bag of words, a sequence of characters, a node in a directed graph, . . .
Questions:
How can we engineer a system to generalize quickly?
How can we explore representational choices easily?
[Cohen,Jensenandamp;Hurst WWW02]
Wrapster architecture : Wrapster architecture Bias is an ordered set of 'builders'.
Builders are simple 'micro-learners'.
A single master algorithm co-ordinates learning.
Hybrid top-down/bottom-up rule learning
Terminology:
Span: substring of page, created by a predicate
Predicate: subset of span x span, created by a builder
Builder: a 'micro-learner', created by hand
Wrapster predicates : Wrapster predicates A predicate is a binary relation on spans:
p(s; t) means that t is extracted from s.
Membership in a predicate can be tested:
– Given (s,t), is p(s,t) true?
Predicates can be executed:
– EXECUTE(s,t) = { t : p(s,t) }
Example Wrapster predicate : Example Wrapster predicate http://wasBang.org/aboutus.html
WasBang.com contact info:
Currently we have offices in two locations:
Pittsburgh, PA
Provo, UT html head body p p ul li li a a 'Pittsburgh, PA' 'Provo, UT' 'WasBang.com .. info:' 'Currently..' …
Example Wrapster predicate : Example Wrapster predicate
Example:
p(s1,s2) iff s2 are the tokens below an li node inside a ul node inside s1.
EXECUTE(p,s1) extracts
– 'Pittsburgh, PA'
– 'Provo, UT' http://wasBang.org/aboutus.html
WasBang.com contact info:
Currently we have offices in two locations:
Pittsburgh, PA
Provo, UT
Wrapster builders : Wrapster builders Builders are based on simple, restricted languages, for example:
Ltagpath: p is defined by tag1,…,tagk and ptag1,…,tagk(s1,s2) is true iff s1 and s2 correspond to DOM nodes and s2 is reached from s1 by following a path ending in tag1,…,tagk
EXECUTE(pul,li,s1) = {'Pittsburgh,PA', 'Provo, UT'}
Lbracket: p is defined by a pair of strings (l,r), and pl,r(s1,s2) is true iff s2 is preceded by l and followed by r.
EXECUTE(pin,locations,s1) = {'two'}
Wrapster builders : Wrapster builders For each language L there is a builder B which implements:
LGG( positive examples of p(s1,s2)): least general p in L that covers all the positive examples (like pairwise generalization)
For Lbracket, longest common prefix and suffix of the examples.
REFINE(p, examples ): a set of p’s that cover some but not all of the examples.
For Ltagpath, extend the path with one additional tag that appears in the examples.
Builders/languages can be combined:
E.g. to construct a builder for (L1 and L2) or
(L1 composeWith L2)
Wrapster builders - examples : Wrapster builders - examples Compose `tagpaths’ and `brackets’
E.g., 'extract strings between ‘(‘ and ‘)’ inside a list item inside an unordered list'
Compose `tagpaths’ and language-based extractors
E.g., 'extract city names inside the first paragraph'
Extract items based on position inside a rendered table, or properties of the rendered text
E.g., 'extract items inside any column headed by text containing the words ‘Job’ and ‘Title’'
E.g. 'extract items in boldfaced italics'
Composing builders : Composing builders Composing builders for Ltagpath and Lbracket.
LGG of the locations would be
(ptags composeWith pL,R )
where
tags = ul,li
L = '('
R = ')' Jobs at WasBang.com:
Call (888)-555-1212 now to apply!
Webmaster (New York). Perl, servlets essential.
Librarian (Pittsburgh). MLS required.
Ski Instructor (Vancouver). Snowboarding skills also useful.
Composing builders – structural/global : Composing builders – structural/global Composing builders for Ltagpath and Lcity
Lcity = {pcity} where pcity(s1,s2) iff s2 is a city name inside of s2.
LGG of the locations would be
ptags composeWith pcity Jobs at WasBang.com:
Call Alberta Hill at 1-888-555-1212 now to apply!
Webmaster (New York). Perl, servlets essential.
Librarian (Pittsburgh). MLS required.
Ski Instructor (Vancouver). Snowboarding skills also useful.
Table-based builders : Table-based builders How to represent 'links to pages about singers'?
Builders can be based on a geometric view of a page.
Wrapster results : Wrapster results F1 #examples
Wrapster results : Wrapster results Examples needed for 100% accuracy
Broader Issues in IE : Broader Issues in IE
Broader View : Broader View Create ontology Segment
Classify
Associate
Cluster Load DB Spider Query,
Search Data mine IE Document collection Database Filter by relevance Label training data Train extraction models Up to now we have been focused on segmentation and classification
Broader View : Broader View Create ontology Segment
Classify
Associate
Cluster Load DB Spider Query,
Search Data mine IE Tokenize Document collection Database Filter by relevance Label training data Train extraction models Now touch on some other issues 1 2 3 4 5
(1) Association as Binary Classification : (1) Association as Binary Classification [Zelenko et al, 2002] Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair. Person-Role (Christos Faloutsos, KDD 2003 General Chair) NO Person-Role ( Ted Senator, KDD 2003 General Chair) YES Person Person Role Do this with SVMs and tree kernels over parse trees.
(1) Association with Finite State Machines : (1) Association with Finite State Machines [Ray andamp; Craven, 2001] … This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. … DET this
N enzyme
N ubc6
V localizes
PREP to
ART the
ADJ endoplasmic
N reticulum
PREP with
ART the
ADJ catalytic
N domain
V facing
ART the N cytosol Subcellular-localization (UBC6, endoplasmic reticulum)
(1) Association using Parse Tree : (1) Association using Parse Tree [Miller et al 2000] Simultaneously POS tag, parse, extract andamp; associate! Increase space of parse constituents to include entity and relation tags
Notation Description .
ch head constituent category
cm modifier constituent category
Xp X of parent node
t POS tag
w word
Parameters e.g. .
P(ch|cp) P(vp|s)
P(cm|cp,chp,cm-1,wp) P(per/np|s,vp,null,said)
P(tm|cm,th,wh) P(per/nnp|per/np,vbd,said)
P(wm|cm,tm,th,wh) P(nance|per/np,per/nnp,vbd,said) (This is also a great example of extraction using a tree model.)
(1) Association with Graphical Models : (1) Association with Graphical Models [Roth andamp; Yih 2002] Capture arbitrary-distance dependencies among predictions.
(1) Association with Graphical Models : (1) Association with Graphical Models [Roth andamp; Yih 2002] Also capture long-distance dependencies among predictions. Local language models contribute evidence to entity classification. Random variable over the class of entity #1, e.g. over {person, location,…} Local language models contribute evidence to relation classification. Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of,…} Dependencies between classes of entities and relations! Inference with loopy belief propagation. person? person lives-in
(1) Association with Graphical Models : (1) Association with Graphical Models [Roth andamp; Yih 2002] Also capture long-distance dependencies among predictions. Local language models contribute evidence to entity classification. Random variable over the class of entity #1, e.g. over {person, location,…} Local language models contribute evidence to relation classification. Random variable over t