ie survey

Uploaded from authorPOINT
Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Information Extractionand Integration: an Overview: 

Information Extraction and Integration: an Overview William W. Cohen Carnegie Mellon University April 26, 2004

Example: The Problem: 

Example: The Problem Martin Baker, a person Genomics job Employers job posting form

Example: A Solution: 

Example: A Solution

Extracting Job Openings from the Web : 

Extracting Job Openings from the Web

Slide5: 

Job Openings: Category = Food Services Keyword = Baker Location = Continental U.S.

Slide6: 

IE from Research Papers

What is “Information Extraction”: 

What is 'Information Extraction' Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation. Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION

What is “Information Extraction”: 

What is 'Information Extraction' Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation. Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. IE

What is “Information Extraction”: 

What is 'Information Extraction' Information Extraction = segmentation + classification + clustering + association As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation. Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation aka 'named entity extraction'

What is “Information Extraction”: 

What is 'Information Extraction' Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation. Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

What is “Information Extraction”: 

What is 'Information Extraction' Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation. Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

What is “Information Extraction”: 

What is 'Information Extraction' Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation. Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation * * * *

Tutorial Outline: 

Tutorial Outline IE History Landscape of problems and solutions Models for named entity recognition: Sliding window Boundary finding Finite state machines Overview of related problems and solutions Association, Clustering Integration with Data Mining

IE History: 

IE History Pre-Web Mostly news articles De Jong’s FRUMP [1982] Hand-built system to fill Schank-style 'scripts' from news wire Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96] Early work dominated by hand-built models E.g. SRI’s FASTUS, hand-built FSMs. But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98] Web AAAI ’94 Spring Symposium on 'Software Agents' Much discussion of ML applied to Web. Maes, Mitchell, Etzioni. Tom Mitchell’s WebKB, ‘96 Build KB’s from the Web. Wrapper Induction Initially hand-build, then ML: [Soderland ’96], [Kushmeric ’97],… Citeseer; Cora; FlipDog; contEd courses, corpInfo, …

IE History: 

IE History Biology Gene/protein entity extraction Protein/protein fact interaction Automated curation/integration of databases At CMU: SLIF (Murphy et al, subcellular information from images + text in journal articles) Email EPCA, PAL, RADAR, CALO: intelligent office assistant that 'understands' some part of email At CMU: web site update requests, office-space requests; calendar scheduling requests; social network analysis of email.

IE is different in different domains!: 

www.apple.com/retail IE is different in different domains! Example: on web there is less grammar, but more formatting andamp; linking The directory structure, link structure, formatting andamp; layout of the Web is its own new grammar. Apple to Open Its First Retail Store in New York City MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. 'Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week,' said Steve Jobs, Apple's CEO. 'We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles.' www.apple.com/retail/soho www.apple.com/retail/soho/theatre.html Newswire Web

Landscape of IE Tasks (1/4):Degree of Formatting: 

Landscape of IE Tasks (1/4): Degree of Formatting Text paragraphs without formatting Grammatical sentences and some formatting andamp; links Non-grammatical snippets, rich formatting andamp; links Tables Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Landscape of IE Tasks (2/4):Intended Breadth of Coverage: 

Landscape of IE Tasks (2/4): Intended Breadth of Coverage Web site specific Genre specific Wide, non-specific Amazon.com Book Pages Resumes University Names Formatting Layout Language

Landscape of IE Tasks (3/4):Complexity: 

Landscape of IE Tasks (3/4): Complexity Closed set He was born in Alabama… Regular set Phone: (413) 545-1323 Complex pattern University of Arkansas P.O. Box 140 Hope, AR 71802 …was among the six houses sold by Hope Feldman that year. Ambiguous patterns, needing context and many sources of evidence The CALD main office can be reached at 412-268-1299 The big Wyoming sky… U.S. states U.S. phone numbers U.S. postal addresses Person names Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210 Pawel Opalinski, Software Engineer at WhizBang Labs. E.g. word patterns:

Landscape of IE Tasks (4/4):Single Field/Record: 

Landscape of IE Tasks (4/4): Single Field/Record Single entity Person: Jack Welch Binary relationship Relation: Person-Title Person: Jack Welch Title: CEO N-ary record 'Named entity' extraction Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Relation: Company-Location Company: General Electric Location: Connecticut Relation: Succession Company: General Electric Title: CEO Out: Jack Welsh In: Jeffrey Immelt Person: Jeffrey Immelt Location: Connecticut

Landscape of IE Techniques (1/1):Models: 

Landscape of IE Techniques (1/1): Models Any of these models can be used to capture words, formatting or both. Lexicons Alabama Alaska … Wisconsin Wyoming Abraham Lincoln was born in Kentucky. member?

Sliding Windows: 

Sliding Windows

Extraction by Sliding Window: 

Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement

Extraction by Sliding Window: 

Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement

Extraction by Sliding Window: 

Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement

Extraction by Sliding Window: 

Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement

A “Naïve Bayes” Sliding Window Model: 

A 'Naïve Bayes' Sliding Window Model [Freitag 1997] 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun w t-m w t-1 w t w t+n w t+n+1 w t+n+m prefix contents suffix If P('Wean Hall Rm 5409' = LOCATION) is above some threshold, extract it. … … Estimate Pr(LOCATION|window) using Bayes rule Try all 'reasonable' windows (vary length, position) Assume independence for length, prefix words, suffix words, content words Estimate from data quantities like: Pr('Place' in prefix|LOCATION)

“Naïve Bayes” Sliding Window Results: 

'Naïve Bayes' Sliding Window Results GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. Domain: CMU UseNet Seminar Announcements Field F1 Person Name: 30% Location: 61% Start Time: 98%

SRV: a realistic sliding-window-classifier IE system: 

SRV: a realistic sliding-window-classifier IE system What windows to consider? all windows containing as many tokens as the shortest example, but no more tokens than the longest example How to represent a classifier? It might: Restrict the length of window; Restrict the vocabulary or formatting used before/after/inside window; Restrict the relative order of tokens; Use inductive logic programming techniques to express all these… andlt;titleandgt;Course Information for CS213andlt;/titleandgt; andlt;h1andgt;CS 213 C++ Programmingandlt;/h1andgt; [Frietag AAAI ‘98]

SRV: a rule-learner for sliding-window classification: 

SRV: a rule-learner for sliding-window classification Primitive predicates used by SRV: token(X,W), allLowerCase(W), numerical(W), … nextToken(W,U), previousToken(W,V) HTML-specific predicates: inTitleTag(W), inH1Tag(W), inEmTag(W),… emphasized(W) = 'inEmTag(W) or inBTag(W) or …' tableNextCol(W,U) = 'U is some token in the column after the column W is in' tablePreviousCol(W,V), tableRowHeader(W,T),…

SRV: a rule-learner for sliding-window classification: 

SRV: a rule-learner for sliding-window classification Non-primitive 'conditions' used by SRV: every(+X, f, c) = for all W in X : f(W)=c some(+X, W, andlt;f1,…,fkandgt;, g, c)= exists W: g(fk(…(f1(W)…))=c tokenLength(+X, relop, c): position(+W,direction,relop, c): e.g., tokenLength(X,andgt;,4), position(W,fromEnd,andlt;,2)

Rapier – results vs. SRV: 

Rapier – results vs. SRV

Rule-learning approaches to sliding-window classification: Summary: 

Rule-learning approaches to sliding-window classification: Summary SRV, Rapier, and WHISK [Soderland KDD ‘97] Representations for classifiers allow restriction of the relationships between tokens, etc Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog) Use of these 'heavyweight' representations is complicated, but seems to pay off in results Some questions to consider: Can simpler, propositional representations for classifiers work (see Roth and Yih) What learning methods to consider (NB, ILP, boosting, semi-supervised – see Collins andamp; Singer) When do we want to use this method vs fancier ones?

BWI: Learning to detect boundaries: 

BWI: Learning to detect boundaries Another formulation: learn three probabilistic classifiers: START(i) = Prob( position i starts a field) END(j) = Prob( position j ends a field) LEN(k) = Prob( an extracted field has length k) Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i) LEN(k) is estimated from a histogram [Freitag andamp; Kushmerick, AAAI 2000]

BWI: Learning to detect boundaries: 

BWI: Learning to detect boundaries Field F1 Person Name: 30% Location: 61% Start Time: 98%

Problems with Sliding Windows and Boundary Finders: 

Problems with Sliding Windows and Boundary Finders Decisions in neighboring parts of the input are made independently from each other. Expensive for long entity names Sliding Window may predict a 'seminar end time' before the 'seminar start time'. It is possible for two overlapping windows to both be above threshold. In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Finite State Machines: 

Finite State Machines

IE with Hidden Markov Models: 

IE with Hidden Markov Models Yesterday Pedro Domingos spoke this example sentence. Yesterday Pedro Domingos spoke this example sentence. Person name: Pedro Domingos Given a sequence of observations: and a trained HMM: Find the most likely state sequence: (Viterbi) Any words said to be generated by the designated 'person name' state extract as a person name: person name location name background

HMM for Segmentation: 

HMM for Segmentation Simplest Model: One state per entity type

What is a “symbol” ???: 

What is a 'symbol' ??? Cohen =andgt; 'Cohen', 'cohen', 'Xxxxx', 'Xx', … ? 4601 =andgt; '4601', '9999', '9+', 'number', … ? Datamold: choose best abstraction level using holdout set

HMM Example: “Nymble”: 

HMM Example: 'Nymble' Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99] Task: Named Entity Extraction Train on ~500k words of news wire text. Case Language F1 . Mixed English 93% Upper English 91% Mixed Spanish 90% [Bikel, et al 1998], [BBN 'IdentiFinder'] Person Org Other (Five other name classes) start-of-sentence end-of-sentence Transition probabilities Observation probabilities P(st | st-1, ot-1 ) P(ot | st , st-1 ) Back-off to: Back-off to: P(st | st-1 ) P(st ) P(ot | st , ot-1 ) P(ot | st ) P(ot ) or Results:

What is a symbol?: 

What is a symbol? Bikel et al mix symbols from two abstraction levels

What is a symbol?: 

What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words. S t - 1 S t O t S t+1 O t +1 O t - 1 identity of word ends in '-ski' is capitalized is part of a noun phrase is in a list of city names is under node X in WordNet is in bold font is indented is in hyperlink anchor … … … part of noun phrase is 'Wisniewski' ends in '-ski' Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

What is a symbol?: 

What is a symbol? S t - 1 S t O t S t+1 O t +1 O t - 1 identity of word ends in '-ski' is capitalized is part of a noun phrase is in a list of city names is under node X in WordNet is in bold font is indented is in hyperlink anchor … … … part of noun phrase is 'Wisniewski' ends in '-ski' Idea: replace generative model in HMM with a maxent model, where state depends on observations

What is a symbol?: 

What is a symbol? S t - 1 S t O t S t+1 O t +1 O t - 1 identity of word ends in '-ski' is capitalized is part of a noun phrase is in a list of city names is under node X in WordNet is in bold font is indented is in hyperlink anchor … … … part of noun phrase is 'Wisniewski' ends in '-ski' Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state

What is a symbol?: 

What is a symbol? S t - 1 S t O t S t+1 O t +1 O t - 1 identity of word ends in '-ski' is capitalized is part of a noun phrase is in a list of city names is under node X in WordNet is in bold font is indented is in hyperlink anchor … … … part of noun phrase is 'Wisniewski' ends in '-ski' Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history

Ratnaparkhi’s MXPOST : 

Ratnaparkhi’s MXPOST Sequential learning problem: predict POS tags of words. Uses MaxEnt model described above. Rich feature set. To smooth, discard features occurring andlt; 10 times.

Conditional Markov Models (CMMs) aka MEMMs aka Maxent Taggers vs HMMS: 

Conditional Markov Models (CMMs) aka MEMMs aka Maxent Taggers vs HMMS St-1 St Ot St+1 Ot+1 Ot-1 ... St-1 St Ot St+1 Ot+1 Ot-1 ...

Label Bias Problem: 

Label Bias Problem Consider this MEMM, and enough training data to perfectly model it: Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1 Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’ = 0.5 * 1 *1 Pr(0123|rib)=1 Pr(0453|rob)=1

Another view of label bias [Sha & Pereira]: 

Another view of label bias [Sha andamp; Pereira] So what’s the alternative?

CMMs to CRFs: 

CMMs to CRFs

What’s the new model look like?: 

What’s the new model look like? What’s independent?

Graphical comparison among HMMs, MEMMs and CRFs: 

Graphical comparison among HMMs, MEMMs and CRFs HMM MEMM CRF

Sha & Pereira results: 

Sha andamp; Pereira results CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

Sha & Pereira results: 

Sha andamp; Pereira results in minutes, 375k examples

Combining Sliding Windows and Finite-State Models: 

Combining Sliding Windows and Finite-State Models

Problems with Tagging-based NER: 

Problems with Tagging-based NER Decisions are made token-by-token Feature engineering can be awkward: How long is the extracted name? Does the extracted name appear in a dictionary? Is the extracted name similar to (an acronym for, a short version of...) a name appearing earlier in the document? 'Tag engineering' is also crucial: Do we use one state per entity? a special state for beginning an entity? a special state for end?

Semi-Markov models for IE: 

Semi-Markov models for IE Train on sequences of labeled segments, not labeled words. S=(start,end,label) Build probability model of segment sequences, not word sequences Define features f of segments (Approximately) optimize feature weights on training data f(S) = words xt...xu, length, previous words, case information, ..., distance to known name maximize: with Sunita Sarawagi, IIT Bombay

Segments vs tagging: 

Segments vs tagging t x y t,u x y f(xt,yt) f(xj,yj)

A training algorithm for CSMM’s (1): 

A training algorithm for CSMM’s (1) Review: Collins’ perceptron training algorithm Correct tags Viterbi tags

A training algorithm for CSMM’s (2): 

A training algorithm for CSMM’s (2) Variant of Collins’ perceptron training algorithm: voted perceptron learner for TTRANS like Viterbi

A training algorithm for CSMM’s (3): 

A training algorithm for CSMM’s (3) Variant of Collins’ perceptron training algorithm: voted perceptron learner for TSEGTRANS like Viterbi

Results: 

Results

Slide64: 


Results: varying history: 

Results: varying history

Results: changing the dictionary: 

Results: changing the dictionary

Results: vs CRF: 

Results: vs CRF

Results: vs CRF: 

Results: vs CRF

Broader Issues in IE: 

Broader Issues in IE

Broader View: 

Broader View Create ontology Segment Classify Associate Cluster Load DB Spider Query, Search Data mine IE Document collection Database Filter by relevance Label training data Train extraction models Up to now we have been focused on segmentation and classification

Broader View: 

Broader View Create ontology Segment Classify Associate Cluster Load DB Spider Query, Search Data mine IE Tokenize Document collection Database Filter by relevance Label training data Train extraction models Now touch on some other issues 1 2 3 4 5

(1) Association as Binary Classification: 

(1) Association as Binary Classification [Zelenko et al, 2002] Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair. Person-Role (Christos Faloutsos, KDD 2003 General Chair)  NO Person-Role ( Ted Senator, KDD 2003 General Chair)  YES Person Person Role Do this with SVMs and tree kernels over parse trees.

(1) Association with Finite State Machines: 

(1) Association with Finite State Machines [Ray andamp; Craven, 2001] … This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. … DET this N enzyme N ubc6 V localizes PREP to ART the ADJ endoplasmic N reticulum PREP with ART the ADJ catalytic N domain V facing ART the N cytosol Subcellular-localization (UBC6, endoplasmic reticulum)

(1) Association with Graphical Models: 

(1) Association with Graphical Models [Roth andamp; Yih 2002] Capture arbitrary-distance dependencies among predictions.

(1) Association with Graphical Models: 

(1) Association with Graphical Models [Roth andamp; Yih 2002] Also capture long-distance dependencies among predictions. Local language models contribute evidence to entity classification. Random variable over the class of entity #1, e.g. over {person, location,…} Local language models contribute evidence to relation classification. Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of,…} Dependencies between classes of entities and relations! Inference with loopy belief propagation. location person lives-in

Broader View: 

Broader View Create ontology Segment Classify Associate Cluster Load DB Spider Query, Search Data mine IE Tokenize Document collection Database Filter by relevance Label training data Train extraction models Now touch on some other issues 1 2 3 4 5 When do two extracted strings refer to the same object?

(2) Information Integration: 

(2) Information Integration Goal might be to merge results of two IE systems: [Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001], [Richardson andamp; Domingos 2003]

(2) Other Information Integration Issues: 

(2) Other Information Integration Issues Distance metrics for text – which work well? [Cohen, Ravikumar, Fienberg, 2003] Finessing integration by soft database operations based on similarity [Cohen, 2000] Integration of complex structured databases: (capture dependencies among multiple merges) [Cohen, MacAllister, Kautz KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003]

IE Resources: 

IE Resources Data RISE, http://www.isi.edu/~muslea/RISE/index.html Linguistic Data Consortium (LDC) Penn Treebank, Named Entities, Relations, etc. http://www.biostat.wisc.edu/~craven/ie Papers, tutorials, lectures, code http://www.cs.cmu.edu/~wcohen/10-707

References: 

References [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p194-201. [Califf andamp; Mooney 1999], Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99). [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002) [Cohen, Kautz, McAllester 2000] Cohen, W; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000). [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98. [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3). [Cohen, 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeeth International Conference (ML-2000). [Collins andamp; Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999. [De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. andamp; Ringle, M. H. (eds), Strategies for Natural Language Processing. Larence Erlbaum, 1982, 149-176. [Freitag 98] Freitag, D: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98). [Freitag, 1999], Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University. [Freitag 2000], Freitag, D: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000). Freitag andamp; Kushmerick, 1999] Freitag, D; Kushmerick, D.: Boosted Wrapper Induction. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99) [Freitag andamp; McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinakge. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11. [Kushmerick, 2000] Kushmerick, N: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118(pp 15-68). [Lafferty, McCallum andamp; Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001. [Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego. [McCallum, Freitag andamp; Pereira 2000] McCallum, A.; Freitag, D.; and Pereira. F., Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000 [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226 - 233.

References: 

References [Muslea et al, 1999] Muslea, I.; Minton, S.; Knoblock, C. A.: A Hierarchical Approach to Wrapper Induction. Proceedings of Autonomous Agents-99. [Muslea et al, 2000] Musclea, I.; Minton, S.; and Knoblock, C. Hierarhical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems. [Nahm andamp; Mooney, 2000] Nahm, Y.; and Mooney, R. A mutually beneficial integration of data mining and information extraction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 627--632, Austin, TX. [Punyakanok andamp; Roth 2001] Punyakanok, V.; and Roth, D. The use of classifiers in sequential inference. Advances in Neural Information Processing Systems 13. [Ratnaparkhi 1996] Ratnaparkhi, A., A maximum entropy part-of-speech tagger, in Proc. Empirical Methods in Natural Language Processing Conference, p133-141. [Ray andamp; Craven 2001] Ray, S.; and Craven, Ml. Representing Sentence Structure in Hidden Markov Models for Information Extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA. Morgan Kaufmann. [Soderland 1997]: Soderland, S.: Learning to Extract Text-Based Information from the World Wide Web. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97). [Soderland 1999] Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1/3):233-277.