Information Extraction: Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)
Information Extraction (IE): Information Extraction (IE) Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
Transform unstructured information in a corpus of documents or web pages into a structured database.
Applied to different types of text:
Newspaper articles
Web pages
Scientific articles
Newsgroup messages
Classified ads
Medical notes
Information Extraction vs. NLP?: Information Extraction vs. NLP? Information extraction attempts to find some of the structure and meaning in (hopefully template-driven) web pages.
As IE becomes more ambitious and text becomes more free-form, IE ultimately becomes equivalent to NLP.
The web does give NLP one particular boost:
Massive corpora..
MUC: MUC DARPA funded significant efforts in IE in the early to mid 1990’s.
Message Understanding Conference (MUC) was an annual event/competition where results were presented.
Focused on extracting information from news articles:
Terrorist events
Industrial joint ventures
Company management changes
Information extraction is of particular interest to the intelligence community (CIA, NSA).
Other Applications: Other Applications Job postings:
Newsgroups: Rapier from austin.jobs
Web pages: Flipdog
Job resumes:
BurningGlass
Mohomine
Seminar announcements
Company information from the web
Continuing education course info from the web
University information from the web
Apartment rental ads
Molecular biology information from MEDLINE
Sample Job Posting: Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID:
SOFTWARE PROGRAMMER
Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more
experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training.
Present Operating System is DOS. May go to OS-2 or UNIX in future.
Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
kimander@memphisonline.com
Extracted Job Template: Extracted Job Template computer_science_job
id: 56nigp$mrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996
Amazon Book Description: Amazon Book Description ….
The Age of Spiritual Machines : When Computers Exceed Human Intelligence
by
Ray Kurzweil
List Price: $14.95
Our Price: $11.96
You Save: $2.99
(20%)
…
Extracted Book Template: Extracted Book Template Title: The Age of Spiritual Machines :
When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:
:
Web Extraction: Web Extraction Many web pages are generated automatically from an underlying database.
Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).
However, output is intended for human consumption, not machine interpretation.
An IE system for such generated pages allows the web site to be viewed as a structured database.
An extractor for a semi-structured web site is sometimes referred to as a wrapper.
Process of extracting from such pages is sometimes referred to as screen scraping.
Web Extraction using DOM Trees: Web Extraction using DOM Trees Web extraction may be aided by first parsing web pages into DOM trees.
Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.
May still need regex patterns to identify proper portion of the final CharacterData node.
Sample DOM Tree Extraction: Sample DOM Tree Extraction [Figure: DOM tree with element nodes HTML, BODY, HEADER, FONT, B, A and character-data leaves “Age of Spiritual Machines”, “by”, “Ray Kurzweil”]
Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData
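A minimal sketch of path-based extraction for the two paths above, using Python's standard xml.etree.ElementTree. The PAGE string is a hypothetical, well-formed stand-in for the sample page; real HTML is rarely well-formed XML and would need an HTML parser to build the tree first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the sample page.
PAGE = (
    "<HTML><BODY>"
    "<B>Age of Spiritual Machines</B>"
    "<FONT>by <A>Ray Kurzweil</A></FONT>"
    "</BODY></HTML>"
)

# Extraction patterns as paths from the root (HTML) down to the element
# whose character data holds the filler.
PATHS = {
    "Title": "BODY/B",
    "Author": "BODY/FONT/A",
}

def extract(page, paths):
    root = ET.fromstring(page)  # root is the HTML element
    result = {}
    for slot, path in paths.items():
        node = root.find(path)
        has_text = node is not None and node.text
        result[slot] = node.text.strip() if has_text else None
    return result

print(extract(PAGE, PATHS))
```

A slot whose path does not exist in the page comes back as None rather than raising, which is convenient when the page layout varies slightly.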
Template Types: Template Types Slots in template typically filled by a substring from the document.
Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
Terrorist act: threatened, attempted, accomplished.
Job type: clerical, service, custodial, etc.
Company type: SEC code
Some slots may allow multiple fillers.
Programming language
Some domains may allow multiple extracted templates per document.
Multiple apartment listings in one ad
Simple Extraction Patterns: Simple Extraction Patterns Specify an item to extract for a slot using a regular expression pattern.
Price pattern: “\$\d+(\.\d{2})?\b”
May require preceding (pre-filler) pattern to identify proper context.
Amazon list price:
Pre-filler pattern: “List Price: ”
Filler pattern: “\$\d+(\.\d{2})?\b”
May require succeeding (post-filler) pattern to identify the end of the filler.
Amazon list price:
Pre-filler pattern: “List Price: ”
Filler pattern: “.+”
Post-filler pattern: “”
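The pre-filler and filler patterns can be tried out with Python's re module. This is only a sketch: the sample text is the Amazon description flattened onto one line, and extract_after is a hypothetical helper name.

```python
import re

# Filler pattern for a price: a dollar sign, digits, optional cents.
FILLER = r"\$\d+(?:\.\d{2})?\b"

def extract_after(text, pre_filler, filler=FILLER):
    """Return the first filler that immediately follows the literal pre-filler."""
    match = re.search(re.escape(pre_filler) + "(" + filler + ")", text)
    return match.group(1) if match else None

text = "List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%)"

print(re.findall(FILLER, text))             # every price on the page
print(extract_after(text, "List Price: "))  # only the list price
print(extract_after(text, "Our Price: "))   # only the discounted price
```

The filler pattern has no \b before the dollar sign, because there is no word boundary between a space and “$”; with a leading \b the pattern would not match a price that follows whitespace.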
Simple Template Extraction: Simple Template Extraction Extract slots in order, starting the search for the filler of slot n+1 where the filler for slot n ended. Assumes slots always appear in a fixed order.
Title
Author
List price
…
Alternatively, make patterns specific enough to identify each filler when always starting the search from the beginning of the document.
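The in-order strategy can be sketched as follows. The slot names and patterns are illustrative assumptions written for the sample book description, not patterns from the slides; group 1 of each pattern is taken as the filler.

```python
import re

PRICE = r"\$\d+(?:\.\d{2})?"

# Slots in their assumed fixed order; group 1 of each pattern is the filler.
SLOT_PATTERNS = [
    ("title", r"(.+?)\s*\nby\n"),
    ("author", r"by\s+(.+)"),
    ("list_price", r"List Price:\s*(" + PRICE + ")"),
    ("price", r"Price:\s*(" + PRICE + ")"),
]

def extract_template(text, slot_patterns):
    template, pos = {}, 0
    for slot, pattern in slot_patterns:
        # Search only from where the previous filler ended.
        match = re.compile(pattern).search(text, pos)
        if match is None:
            template[slot] = None
            continue
        template[slot] = match.group(1)
        pos = match.end(1)
    return template

PAGE = (
    "The Age of Spiritual Machines : When Computers Exceed Human Intelligence\n"
    "by\nRay Kurzweil\n"
    "List Price: $14.95\nOur Price: $11.96\nYou Save: $2.99\n(20%)\n"
)

print(extract_template(PAGE, SLOT_PATTERNS))
```

Note that the deliberately loose "price" pattern picks up the discounted price only because its search begins after the list price; started from the top of the document it would return the list price instead.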
Pre-Specified Filler Extraction: Pre-Specified Filler Extraction If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
Job category
Company type
Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
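As one possible instance of text categorization, here is a small bag-of-words naive Bayes classifier with add-one smoothing. The choice of classifier and the six training postings are assumptions made for illustration; any document classifier could play this role.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: (document text, slot value) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Score the whole document against each possible filler; return the best."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training postings for the "job type" slot.
TRAINING = [
    ("type letters and file office records", "clerical"),
    ("answer phones and schedule office meetings", "clerical"),
    ("clean floors and empty trash", "custodial"),
    ("mop floors and clean windows", "custodial"),
    ("serve food and greet customers", "service"),
    ("wait tables and serve drinks", "service"),
]

model = train(TRAINING)
print(classify("clean and mop the office floors", model))
```

The filler ("custodial" here) never has to appear in the posting itself, which is exactly the situation this kind of slot is meant to cover.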
Learning for IE: Learning for IE Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.
Alternative is to use machine learning:
Build a training set of documents paired with human-produced filled extraction templates.
Learn extraction patterns for each slot using an appropriate machine learning algorithm.
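To make the idea concrete, here is a deliberately naive learner: it takes the text preceding each human-marked filler and keeps the longest suffix shared by all training documents as the pre-filler pattern. This heuristic is an assumption chosen for brevity; it is not Rapier or any specific algorithm from the literature, and the training documents are made up.

```python
import re

def common_suffix(strings):
    """Longest string that every input ends with."""
    shortest = min(len(s) for s in strings)
    i = 0
    while i < shortest and len({s[-1 - i] for s in strings}) == 1:
        i += 1
    first = strings[0]
    return first[len(first) - i:]

def learn_pre_filler(examples):
    """examples: (document, filler) pairs taken from human-filled templates."""
    contexts = [doc[:doc.index(filler)] for doc, filler in examples]
    return common_suffix(contexts)

def apply_pattern(doc, pre_filler, filler_pattern=r"\S+"):
    match = re.search(re.escape(pre_filler) + "(" + filler_pattern + ")", doc)
    return match.group(1) if match else None

# Hypothetical training set for the list-price slot.
TRAINING = [
    ("Book A\nby Ann Lee\nList Price: $14.95\nOur Price: $11.96", "$14.95"),
    ("Book B\nby Bob Smith\nList Price: $20.00\nOur Price: $16.00", "$20.00"),
]

learned = learn_pre_filler(TRAINING)
new_doc = "Book C\nby Cy Doe\nList Price: $9.99\nOur Price: $7.50"
print(repr(learned))
print(apply_pattern(new_doc, learned))
```

With only two examples the learned context can easily overfit or underfit; real learners generalize over many documents and also learn filler and post-filler patterns.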
Finding “Sweet Spots” in computer-mediated cooperative work: Finding “Sweet Spots” in computer-mediated cooperative work It is possible to get by with techniques blithely ignorant of semantics when you have humans in the loop
All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
…and the human very gratefully does the in-depth analysis on those few potential solutions
Examples:
The incredible success of “Bag of Words” model!
Bag of letters would be a disaster ;-)
Bag of sentences and/or NLP would be good
..but only to your discriminating and irascible searchers ;-)
Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs: Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks
It is like “cycle stealing”—except we are stealing “human brain cycles” (the most idle of the computers if there is ever one ;-)
Remember the mice in The Hitchhiker's Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
Collaborative knowledge compilation (wikipedia!)
Collaborative Curation
Collaborative tagging
Paid collaboration/contracting
Many big open issues
How do you pose the problem such that it can be solved using collaborative computing?
How do you “incentivize” people into letting you steal their brain cycles?
Pay them! (Amazon mturk.com )
Tapping into the Collective Unconscious: Tapping into the Collective Unconscious Another thread of exciting research is driven by the realization that the web is not random at all!
It is written by humans
…so analyzing its structure and content allows us to tap into the collective unconscious ..
Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
Examples:
Analyzing term co-occurrences in the web-scale corpora to capture semantic information (today’s paper)
Analyzing the link-structure of the web graph to discover communities
DoD and NSA are very much into this as a way of breaking terrorist cells
Analyzing the transaction patterns of customers (collaborative filtering)
4/19: Information Extraction from unstructured text: 4/19: Information Extraction from unstructured text
Information Extraction from Unstructured Text: Automated Support for “Semantic Web”: Information Extraction from Unstructured Text: Automated Support for “Semantic Web” Semantic web needs:
Tagged data
Background knowledge
(blue sky approaches to) automate both
Knowledge Extraction
Extract base level knowledge (“facts”) directly from the web
Automated tagging
Start with a background ontology and tag other web pages
Semtag/Seeker
Extraction from Free Text involves Natural Language Processing: Extraction from Free Text involves Natural Language Processing If extracting from automatically generated web pages, simple regex patterns usually work.
If extracting from more natural, unstructured, human-written text, some NLP may help.
Part-of-speech (POS) tagging
Mark each word as a noun, verb, preposition, etc.
Syntactic parsing
Identify phrases: NP, VP, PP
Semantic word categories (e.g. from WordNet)
KILL: kill, murder, assassinate, strangle, suffocate
Off-the-shelf software available to do this!
The “Brill” tagger
Extraction patterns can use POS or phrase tags. This is analogous to regex patterns on DOM trees for structured text.
I. Generate-n-Test Architecture: I. Generate-n-Test Architecture Generic extraction patterns (Hearst ’92):
“…Cities such as Boston, Los Angeles, and Seattle…”
(“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), …
Problem cases:
“…Detailed information for several countries such as maps, …” (filtered out by requiring ProperNoun(head(NP)))
“I listen to pretty much all music but prefer country such as Garth Brooks”
Template-driven extraction (where the template is in terms of the syntax tree)
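A rough sketch of the generate step for the “C such as NP1, NP2, and NP3” pattern. It assumes a capitalization heuristic in place of a real NP chunker and proper-noun test, so PROPER_NP and generate_candidates are illustrative names rather than part of any described system.

```python
import re

# Stand-in for ProperNoun(head(NP)): one or more capitalized words.
PROPER_NP = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

SUCH_AS = re.compile(
    r"(\w+) such as ("
    + PROPER_NP
    + r"(?:, " + PROPER_NP + r")*"
    + r"(?:,? and " + PROPER_NP + r")?)"
)

def generate_candidates(sentence):
    """Return (instance, class) pairs, i.e. candidate IS-A facts."""
    candidates = []
    for match in SUCH_AS.finditer(sentence):
        cls = match.group(1)
        for np in re.split(r",? and |, ", match.group(2)):
            candidates.append((np, cls))
    return candidates

print(generate_candidates("Cities such as Boston, Los Angeles, and Seattle are growing."))
print(generate_candidates("Detailed information for several countries such as maps, ..."))
print(generate_candidates("I listen to pretty much all music but prefer country such as Garth Brooks"))
```

The proper-noun requirement rejects “maps”, but (Garth Brooks, country) still gets through as a candidate, which is why a separate test phase is needed.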
Test: Test
Assess candidate extractions using Mutual Information (PMI-IR) (Turney ’01). Many variations are possible…
Assessment: Assessment PMI = frequency of I & D co-occurrence, relative to the frequency of I alone (I = instance, D = discriminator)
5-50 discriminators Di
Each PMI for Di is a feature fi
Naïve Bayes evidence combination: PMI is used for feature selection. NBC (naïve Bayes classifier) is used for learning. Hits (search-engine hit counts) are used for assessing
PMI as well as conditional probabilities
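A sketch of how naive Bayes could combine several discriminator features into one belief that a candidate is correct. It assumes each PMI score has already been thresholded into a boolean feature; the prior and all conditional probabilities below are invented for illustration.

```python
def nb_posterior(features, prior, p_fire_given_correct, p_fire_given_wrong):
    """P(candidate is correct | features), assuming independent features.

    features[i] is True when the PMI for discriminator i is above its
    threshold (the thresholding step is an assumption of this sketch).
    """
    correct, wrong = prior, 1.0 - prior
    for fired, p_c, p_w in zip(features, p_fire_given_correct, p_fire_given_wrong):
        correct *= p_c if fired else 1.0 - p_c
        wrong *= p_w if fired else 1.0 - p_w
    return correct / (correct + wrong)

# Hypothetical numbers for two discriminators.
PRIOR = 0.5
P_FIRE_GIVEN_CORRECT = [0.9, 0.8]
P_FIRE_GIVEN_WRONG = [0.2, 0.1]

print(nb_posterior([True, True], PRIOR, P_FIRE_GIVEN_CORRECT, P_FIRE_GIVEN_WRONG))
print(nb_posterior([False, False], PRIOR, P_FIRE_GIVEN_CORRECT, P_FIRE_GIVEN_WRONG))
```

In practice the conditional probabilities would be estimated from a set of known correct and incorrect instances rather than written by hand.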
Assessment In Action: Assessment In Action I = “Yakima” (1,340,000)
D =
I+D = “Yakima city” (2760)
PMI = (2760 / 1.34M) ≈ 0.002
I = “Avocado” (1,000,000)
I+D = “Avocado city” (10)
PMI = 0.00001 << 0.002
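The arithmetic can be checked directly, taking PMI here to be the hit count for the instance together with the discriminator divided by the hit count for the instance alone. With the counts above, the Yakima ratio works out to roughly 0.002, about two hundred times the Avocado ratio.

```python
def pmi(hits_instance_with_discriminator, hits_instance):
    """Hit-count ratio used as the PMI score in this example."""
    return hits_instance_with_discriminator / hits_instance

yakima = pmi(2760, 1_340_000)   # "Yakima city" vs. "Yakima"
avocado = pmi(10, 1_000_000)    # "Avocado city" vs. "Avocado"

print(f"Yakima:  {yakima:.5f}")
print(f"Avocado: {avocado:.5f}")
print(f"Ratio:   {yakima / avocado:.0f}x")
```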
Some Sources of ambiguity: Some Sources of ambiguity Time: “Clinton is the president” (in 1996).
Context: “common misconceptions..”
Opinion: Elvis…
Multiple word senses: Amazon, Chicago, Chevy Chase, etc.
Dominant senses can mask recessive ones!
Approach: unmasking. ‘Chicago –City’
Chicago: Chicago [Figure; labels: City, Movie]
Chicago Unmasked: Chicago Unmasked [Figure; labels: City sense, Movie sense]
Impact of Unmasking on PMI: Impact of Unmasking on PMI
Name          Recessive   Original   Unmask   Boost
Washington    city        0.50       0.99     96%
Casablanca    city        0.41       0.93     127%
Chevy Chase   actor       0.09       0.58     512%
Chicago       movie       0.02       0.21     972%
CBioC: Collaborative Bio-Curation: CBioC: Collaborative Bio-Curation Motivation
To help extract information nuggets from articles and abstracts and store them in a database.
The challenge is that the number of articles is huge and keeps growing, and that extracting from them requires processing natural language.
The two existing approaches
human curation and use of automatic information extraction systems
They are not able to meet the challenge, as the first is expensive, while the second is error-prone.
CBioC (cont’d): CBioC (cont’d) Approach: We propose a solution that is inexpensive, and that scales up.
Our approach takes advantage of automatic information extraction methods as a starting point,
Based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles.
We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information.
We refer to our approach as “Collaborative Curation”.
Using the C-BioCurator System (cont’d): Using the C-BioCurator System (cont’d)
Slide43: What is the main difference between KnowItAll and CBioC? Assessment: KnowItAll does it by search-engine hit counts; CBioC does it by voting.
Annotation: Annotation “The Chicago Bulls announced yesterday that Michael Jordan will. . . ”
“The Chicago Bulls
announced yesterday that
Michael Jordan will...”
Semantic Annotation: Semantic Annotation The simplest task of metadata extraction on natural language (NL) is to establish a “type” relation between entities in the NL resources and concepts in ontologies (named entity identification).
Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt
Semantics: Semantics Semantic Annotation
- The content of annotation consists of some rich semantic information
- Targeted not only at human reader of resources but also software agents
- formal : metadata following structural standards
  informal : personal notes written in the margin while reading an article
- explicit : carry sufficient information for interpretation
  tacit : many personal annotations (telegraphic and incomplete)
http://www-scf.usc.edu/~csci586/slides/6
Uses of Annotation: Uses of Annotation http://www-scf.usc.edu/~csci586/slides/8
Objectives of Annotation: Objectives of Annotation Generate Metadata for existing information
e.g., author-tag in HTML
RDF descriptions to HTML
Content description to Multimedia files
Employ metadata for
Improved search
Navigation
Presentation
Summarization of contents
http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf
Annotation: Annotation Current practice of annotation for knowledge identification and extraction
Reduce burden of text annotation for Knowledge Management
www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt
SemTag & Seeker: SemTag & Seeker WWW-03 Best Paper Prize
Seeded with TAP ontology (72k concepts)
And ~700 human judgments
Crawled 264 million web pages
Extracted 434 million semantic tags
Automatically disambiguated
SemTag: SemTag Research project at IBM
Very large scale – largest to date
264 million web pages
Goal: to provide early set of widespread semantic tags through automated generation
SemTag: SemTag Uses broad, shallow knowledge base
TAP – lexical and taxonomic information about popular objects
Music
Movies
Sports
Etc.
SemTag: SemTag Problem:
No write access to original document, so how do you annotate?
Solution:
Store annotations in a web-available database
SemTag: SemTag Semantic Label Bureau
Separate store of semantic annotation information
HTTP server that can be queried for annotation information
Example
Find all semantic tags for a given document
Find all semantic tags for a particular object
SemTag: SemTag Methodology
SemTag: SemTag Three phases
Spotting Pass:
Tokenize the document
All instances plus 20 word window
Learning Pass:
Find corpus-wide distribution of terms at each internal node of taxonomy
Based on a representative sample
Tagging Pass:
Scan windows to disambiguate each reference
Finally determined to be a TAP object
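A sketch of the spotting pass: find every occurrence of a known label and keep the surrounding words as its context. Whitespace tokenization, the punctuation stripping, and splitting the window evenly on both sides of the label are all simplifying assumptions; spot and half_window are hypothetical names.

```python
def spot(text, labels, half_window=10):
    """Return (label, context tokens) for every label occurrence in text.

    half_window=10 assumes the 20-word window is split evenly around the label.
    """
    tokens = text.split()
    normalized = [t.lower().strip(".,;:!?\"'()") for t in tokens]
    spots = []
    for label in labels:
        parts = label.lower().split()
        n = len(parts)
        for i in range(len(normalized) - n + 1):
            if normalized[i:i + n] == parts:
                left = tokens[max(0, i - half_window):i]
                right = tokens[i + n:i + n + half_window]
                spots.append((label, left + right))
    return spots

text = "The new Jaguar accelerates faster than any other car on the road today"
print(spot(text, ["jaguar"], half_window=3))
```

Each (label, context) pair is what the later passes work on: the learning pass estimates term distributions from such contexts, and the tagging pass scores them to decide which taxonomy node, if any, the label refers to.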
SemTag: SemTag Another problem magnified by the scale:
Ambiguity Resolution
Two fundamental categories of ambiguities:
Some labels appear at multiple locations
Some entities have labels that occur in contexts that have no representative in the taxonomy
SemTag: SemTag Solution:
Taxonomy Based Disambiguation (TBD)
TBD expectation:
Human tuned parameters used in small, critical sections
Automated approaches deal with bulk of information
SemTag: SemTag TBD methodology:
Each node in the taxonomy is associated with a set of labels
Cats, Football, Cars all contain “jaguar”
Each label in the text is stored with a window of 20 words – the context
Each node has an associated similarity function mapping a context to a similarity
Higher similarity more likely to contain a reference
SemTag: SemTag Similarity:
Built a 200,000 word lexicon (the 200,100 most common words, minus the 100 most common)
200,000 dimensional vector space
Training: spots (label, context) and correct node
Estimated the distribution of terms for nodes
Standard cosine similarity
TFIDF vectors (context vs. node)
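A small sketch of the context-versus-node comparison using TF-IDF weights and cosine similarity. The three nodes and their term lists are invented for the “jaguar” example, and IDF is computed over nodes rather than over a web-scale corpus, so this only illustrates the mechanics.

```python
import math
from collections import Counter

# Hypothetical term samples for three taxonomy nodes that share the label "jaguar".
NODE_TERMS = {
    "Cats": "jaguar jungle prey hunt fur".split(),
    "Cars": "jaguar engine road drive speed".split(),
    "Football": "jaguar touchdown quarterback season".split(),
}

def idf_table(node_terms):
    n = len(node_terms)
    df = Counter(t for terms in node_terms.values() for t in set(terms))
    return {t: math.log(n / d) for t, d in df.items()}

def tfidf(tokens, idf):
    # Words outside the lexicon are dropped.
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def best_node(context, node_terms):
    idf = idf_table(node_terms)
    c_vec = tfidf(context.lower().split(), idf)
    sims = {node: cosine(c_vec, tfidf(terms, idf)) for node, terms in node_terms.items()}
    return max(sims, key=sims.get), sims

node, sims = best_node("the jaguar engine roared down the road", NODE_TERMS)
print(node, sims)
```

The ambiguous label itself carries no weight here (it appears under every node, so its IDF is zero); the decision rests entirely on the surrounding context words.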
SemTag: SemTag References inside the taxonomy vs. References outside the taxonomy
Multiple nodes: b = r b != p(v)
Is a context c appropriate for a node v?
SemTag: SemTag Some internal nodes very popular:
Associate a measurement of how accurate Sim is likely to be at a node
Also, how ambiguous the node is overall (consistency of human judgment)
TBD Algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v
82% accuracy on 434 million spots
SemTag: SemTag
Summary: Summary Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web
Extraction complexity depends on whether the text you have is “templated” or “free-form”
Extraction from templated text can be done by regular expressions
Extraction from free form text requires NLP
Can be done in terms of part-of-speech tagging
“Annotation” involves connecting terms in a free form text to items in the background knowledge
It too can be automated