InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction
May 31 2003, Edmonton, Alberta
NAACL-HLT Workshop on the Analysis of Geographic References
Huifeng Li, Rohini K. Srihari, Cheng Niu, and Wei Li
Cymfony Inc.
Contents
Overview of Information Extraction System: InfoXtract
Introduction of Location Normalization (LocNZ)
Task of LocNZ
Problems and Proposed Method
Algorithm for LocNZ
Experimental Evaluation
Future Work
Overview of InfoXtract
InfoXtract produces the following information objects from a text:
Named Entities (NEs) - “Bill Gates, chairman of Microsoft...”
Correlated Entities (CEs) - “Bill Gates, chairman of Microsoft...”
Subject-Verb-Object (SVO) triples - Both syntactic and semantic forms of the structures
Entity Profiles - Profiles for entity types like people and organizations
General Events (GEs) - Domain-independent events
Event
Argument structures centering around verb with the associated information
“who did what to whom when (or how often) and where”
Predefined Events (PEs) - Domain-specific events
System components: NLP and machine learning integrated into IE
POS tagging
Shallow and deep parsing
Named Entity tagging
Combining supervised & unsupervised machine learning techniques
Concept-based analysis
Word sense disambiguation
Location / Time normalization
Co-reference analysis
Entity Profile fusion
Event extraction, fusion and linking
InfoXtract Architecture
[Figure: InfoXtract architecture diagram. Labeled components: Source Document, Document Processor, Process Manager, Tokenlist, Knowledge Resources (Lexicon Resources, Grammars, Language models), Linguistic Processor(s) (Tokenizer, Lexicon Lookup, Pragmatic Filtering, POS Tagging, Named Entity Detection, Shallow Parsing, Deep Parsing, Relationship Detection, Time Normalization, Alias/Coreference Linking, Location Normalization, Profile/Event Linking, Profile/Event Merge), Output Manager, Zoned Text Document, XML Formatted Extracted Document (NE, PE, CE, SVO, CO, Profile, CGE), Document & Error log, Web Server (HTTP POST, HTTP response, CORBA). Legend: Grammar Module; Procedure or Statistical Model; Hybrid Module.]
Introduction of Location Normalization
Task of location normalization (LocNZ)
Identify the correct sense of an ambiguous location named entity
(1) Decide if a location name is a city, a province or a country
Support NE Tagger to decide sub-tag
New York (NeLoc) =>New York (NeLoc, NeCty)
(2) Decide which city, state or country a city, island or state belongs to
18 states have a city named Boston
Boston => Alabama, Arkansas, Massachusetts, Missouri,…
Result of LocNZ can be used to
(1) Support event extraction, merging and event visualization
Indicate where the event occurred
(2) Support profile generation
Provide location information of a person or an organization
(3) Support question answering
Provide location area for document categorization
Event and Profile Generation
Person profile:
Name: Julian Werner Hill
Position: Research chemist
Age: 91
Birth-place:
Affiliation: Du Pont Co.
Education: MIT
Location profile:
Name: St. Louis
State: Missouri
Country: United States of America
Zipcode: 63101
Latitude: 38.634616
Longitude: 90.191313
Related_profiles:
Event Template: argument structures centering around a verb, with the associated information
Input: Alvin Karloff was replaced by John Doe as CEO of ABC at New York last month.
key verb: replace
who: John Doe
whom-what: Alvin Karloff
complement: CEO of ABC
when: last month
where:
Profile Template: presents the subject's most noteworthy characteristics and achievements
Event Visualization
The result of LocNZ indicates the place where an event occurred
Event type:
Who:
When: 1996-01-07
Where:
Preceding_event:
Subsequent_event:
Predicate: Die
Who: Julian Werner Hill
When: 1996-01-07
Where:
Problems in Location Normalization
Difference between LocNZ and general WSD
Selection restriction is not sufficient
WSD: verb sense tagging relies mainly on co-occurrence constraints of semantic structures, Verb-Subject and Verb-Object in particular
LocNZ: depends primarily on the co-occurrence of related location entities in the same discourse (text)
Fewer clues in a text than for verb and noun sense disambiguation
‘located in’ indicates only that ‘San Francisco’ is a location, not which San Francisco it is
Example: The Golden Gate Bridge is located in San Francisco
Lack of sources for default senses of location names
The Tipster Gazetteer provides only a small part of the default senses
Little previous research on solving LocNZ
Major Types of Ambiguities
City versus country and state name ambiguity
Canada (CITY) Kansas (PROVINCE 1) United States (COUNTRY)
Canada (CITY) Kentucky (PROVINCE 1) United States (COUNTRY)
Canada (COUNTRY)
New York state versus New York city
Same city name in different provinces
33 Washington entries in the Gazetteer
Washington (CITY) Arkansas (PROVINCE 1) United States (COUNTRY)
Washington (CITY) California (PROVINCE 1) United States (COUNTRY)
Washington (CITY) Connecticut (PROVINCE 1) United States (COUNTRY)
Washington (CITY) District of Columbia (PROVINCE 1) United States (COUNTRY)
Washington (CITY) Georgia (PROVINCE 1) United States (COUNTRY)
Washington (CITY) Illinois (PROVINCE 1) United States (COUNTRY)
Washington (CITY) Indiana (PROVINCE 1) United States (COUNTRY)
Washington (CITY) Iowa (PROVINCE 1) United States (COUNTRY)
Washington (CITY) Kansas (PROVINCE 1) United States (COUNTRY)
Washington (CITY) Kentucky (PROVINCE 1) United States (COUNTRY) … …
… …
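The ambiguity above can be pictured as a name-to-senses lookup. The following is a minimal sketch, assuming a hypothetical `Sense` record and a tiny in-memory gazetteer; only the five entries are taken from the slide, and the field names and functions are illustrative, not InfoXtract's actual data model.

```python
from collections import defaultdict
from typing import NamedTuple


class Sense(NamedTuple):
    """One gazetteer entry (hypothetical layout)."""
    name: str
    kind: str       # "CITY", "PROVINCE" or "COUNTRY"
    province: str   # empty string when not applicable
    country: str


# A few entries copied from the slide; a real gazetteer holds many more.
ENTRIES = [
    Sense("Canada", "CITY", "Kansas", "United States"),
    Sense("Canada", "CITY", "Kentucky", "United States"),
    Sense("Canada", "COUNTRY", "", "Canada"),
    Sense("Washington", "CITY", "Arkansas", "United States"),
    Sense("Washington", "CITY", "District of Columbia", "United States"),
]


def build_index(entries):
    """Group entries by lower-cased name so lookup is case-insensitive."""
    index = defaultdict(list)
    for sense in entries:
        index[sense.name.lower()].append(sense)
    return index


def candidate_senses(index, name):
    """Return every sense recorded for a location name (possibly none)."""
    return index.get(name.lower(), [])


if __name__ == "__main__":
    index = build_index(ENTRIES)
    for sense in candidate_senses(index, "Canada"):
        print(sense)
```

Every name with more than one candidate sense is an input to the disambiguation steps that follow.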
Example of Text with Location Names
CNN news: http://www.cnn.com/2003/WEATHER/02/19/winter.storm.delays.ap/index.html
A traveler gets the bad news as he looks at the departures list that shows all canceled flights at the Philadelphia International Airport.
MIAMI (AP) -- Travelers heading to and from the Northeast faced continued uncertainty Tuesday, even as airports in the mid-Atlantic region began slowly digging themselves out from one of the worst winter storms on record.
……
No flights left Florida for Baltimore-Washington International Airport until Tuesday afternoon. That airport was one of the hardest-hit by the storm, with a snowfall total of 28 inches.
Rosanna Blum, 38, of Hunt Valley, Maryland, had a confirmed seat on a Miami to Baltimore flight Tuesday afternoon, but still wasn't optimistic that she'd actually have the chance to use it.
……
Theresa York, from Maryland, works the phones at Miami Airport as she tries to find a flight back home.
……
"It's surreal," said Dawn Shuford, 35, as she reclined against her suitcase in a darkened hallway at BWI. She'd been trying since Sunday morning to get home to Seattle.
The Washington area's two other airports, Reagan National and Dulles, also had limited service.
Marty Legrow, from Connecticut, rests on her suitcase at Ronald Reagan National Airport in Washington.
Philadelphia International Airport resumed operations Tuesday but still expected to cancel about one-third of its flights. Flights slowly resumed at New York's LaGuardia, Kennedy and Newark airports, and Boston's Logan, where more than 2 feet of snow fell, had one runway open.
……
Margie D'Onofrio, 48, of King Of Prussia, Pennsylvania, and a travel companion left the Bahamas on Sunday, hoping to fly back to Philadelphia. They made it to Miami, and D'Onofrio said she did not expect to be home anytime Tuesday.
……
Passengers camped out overnight at many airports. Many fliers called ahead Tuesday and weren't clogging airports unnecessarily, Orlando International Airport spokeswoman Carolyn Fennell said.
Our Previous Method [Li et al. 2002]
(1) Lexical grammar processing with local context
Identify City or State
City of Buffalo; New York State
Disambiguate meaning of a word
e.g. Williamsville, New York, USA
e.g. Brussels, Belgium
Propagate the analysis result within a text where it appears
One sense per discourse (Gale, Yarowsky et al, 1992)
(2) Construct a graph and calculate the maximum weight spanning tree with Kruskal’s algorithm, considering global information
Node: Location name senses
Edge: Similarity weight between two location name senses
Calculate similarities between locations in the graph referring to predefined similarity table
Choose maximum weight spanning tree that reflects most probable location senses in the document
(3) Default sense application
If similarity value is lower than a threshold, apply default senses
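Step (2) can be sketched as a Kruskal-style greedy pass over the sense graph. This is an illustration under stated assumptions, not the implementation from [Li et al. 2002]: the edge weights are made up (the predefined similarity table is not reproduced here), and the rule that an edge is rejected when it would give a location a second sense is an assumption about how the sense constraint is enforced.

```python
def kruskal_senses(edges):
    """Pick one sense per location with a Kruskal-style greedy pass.

    edges: iterable of ((loc_a, sense_a), (loc_b, sense_b), weight).
    Returns {location: chosen sense}.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = {}
    # Kruskal: visit all edges from heaviest to lightest.
    for (loc_a, sense_a), (loc_b, sense_b), _ in sorted(
        edges, key=lambda e: e[2], reverse=True
    ):
        # Assumed constraint: a location keeps the first sense it receives.
        if chosen.get(loc_a, sense_a) != sense_a:
            continue
        if chosen.get(loc_b, sense_b) != sense_b:
            continue
        root_a, root_b = find(loc_a), find(loc_b)
        if root_a == root_b:
            continue  # the edge would close a cycle
        parent[root_a] = root_b
        chosen[loc_a] = sense_a
        chosen[loc_b] = sense_b
    return chosen


if __name__ == "__main__":
    # Hypothetical weights: higher means the two senses are more closely related.
    edges = [
        (("Toronto", "Ontario"), ("Vancouver", "British Columbia"), 3.0),
        (("Toronto", "Ontario"), ("Vancouver", "Washington"), 1.0),
        (("Toronto", "Illinois"), ("Vancouver", "Washington"), 2.0),
        (("Toronto", "Illinois"), ("Vancouver", "British Columbia"), 1.0),
    ]
    print(kruskal_senses(edges))
```

The `sorted` call over every sense-to-sense edge is the step that becomes expensive when a text mentions many locations with many senses each.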
Problems of Previous Method
For MST calculation, all the weighted edges must be sorted
When a text contains many locations and each location has over 20 senses, the number of edges grows sharply, sorting the edges takes a long time, and the resulting weights are not distinctive enough
Solution: Adopted Prim’s Algorithm for MST combined with heuristics
If a location has a country sense, select that sense as the default sense of that location (heuristic 1)
If a location has province or capital senses, select that sense as the default sense after applying local context (heuristic 2)
The number of location mentions and the distance between them are taken into account
The previous method could not reflect these factors
Assign weights to the sense nodes in the constructed graph
Choose the node with the maximum weight
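A Prim-style alternative grows the tree outward from senses that are already resolved, so no global edge sort is needed. The sketch below is hypothetical: `prim_senses`, the `(province, country)` sense tuples, and the `similarity` scores are all assumptions for illustration, and it does not model the mention-count and distance factors.

```python
import heapq
import itertools


def similarity(sense_a, sense_b):
    """Hypothetical score for two (province, country) senses."""
    if sense_a[0] and sense_a[0] == sense_b[0]:
        return 2  # same province
    if sense_a[1] == sense_b[1]:
        return 1  # same country
    return 0


def prim_senses(candidates, seed, score=similarity):
    """Grow a tree from already-resolved senses, Prim style.

    candidates: {location: [sense, ...]} for ambiguous locations.
    seed: {location: sense} resolved earlier (e.g. by a default heuristic).
    """
    resolved = dict(seed)
    heap = []
    tie = itertools.count()  # tie-breaker so the heap never compares senses

    def push_edges(source_loc):
        # Offer every sense of every unresolved location as a frontier edge.
        for loc, senses in candidates.items():
            if loc in resolved:
                continue
            for sense in senses:
                weight = score(resolved[source_loc], sense)
                heapq.heappush(heap, (-weight, next(tie), loc, sense))

    for loc in list(resolved):
        push_edges(loc)
    while heap:
        _, _, loc, sense = heapq.heappop(heap)
        if loc in resolved:
            continue  # stale entry: this location was settled meanwhile
        resolved[loc] = sense
        push_edges(loc)
    return resolved


if __name__ == "__main__":
    candidates = {
        "Toronto": [("Ontario", "Canada"), ("Illinois", "United States")],
        "Vancouver": [("British Columbia", "Canada"), ("Washington", "United States")],
    }
    seed = {"Canada": ("", "Canada")}  # country sense taken as the default
    print(prim_senses(candidates, seed))
```

Seeding the tree with the senses fixed by the heuristics is one way to combine them with Prim's algorithm; whether the original system does exactly this is not stated on the slide.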
Weight Calculation
Table 1: Impact weight of Sense2 on Sense1
Weight Assigned to Sense Nodes
Canada {Kansas, Kentucky, Country}
Vancouver {British Columbia, Washington, Port in USA, Port in Canada}
New York {Prov in USA, New York City, …}
Toronto {Ontario, New South Wales, Illinois, …}
Charlottetown {Prov in USA, New York City, …}
Prince Edward Island {Island in Canada, Island in South Africa, Province in Canada}
Quebec {City in Quebec, Quebec Prov, Connecticut, …}
Modified Algorithm
(1) Look up the location gazetteer to associate candidate senses with each location NE;
(2) If a location has a country sense, select that sense as the default sense of that location (heuristic);
(3) Call the pattern matching sub-module for local patterns like “Williamsville, New York, USA”;
(4) Apply the ‘one sense per discourse’ principle to each disambiguated location name to propagate the selected sense to its other mentions within the document;
(5) Apply the default sense heuristics to a location with province or capital senses;
(6) Call Prim’s algorithm in the discourse sub-module to resolve the remaining ambiguities;
(7) If the difference between the maximum sense weight and the next largest sense weight is equal to or lower than a threshold, choose the default sense of that name from the lexicon. Otherwise, choose the sense with the maximum weight as output.
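The final decision rule can be sketched as a margin test between the two best-scoring senses. The function name, the weights, and the threshold value below are hypothetical; the slides do not give the actual threshold.

```python
def pick_sense(weights, default_sense, threshold):
    """Choose a sense from {sense: weight}, falling back to the default.

    The top sense wins only if it beats the runner-up by more than
    `threshold`; otherwise the lexicon default is returned.
    """
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return default_sense
    if len(ranked) == 1:
        return ranked[0][0]
    (best, best_weight), (_, second_weight) = ranked[0], ranked[1]
    if best_weight - second_weight <= threshold:
        return default_sense
    return best


if __name__ == "__main__":
    # Margin of 0.2 is within the threshold, so the default is used.
    print(pick_sense({"Alabama": 4.8, "Massachusetts": 5.0}, "Massachusetts", 0.5))
    # Margin of 4.0 is decisive, so the top-weighted sense overrides the default.
    print(pick_sense({"Kansas": 6.0, "Country": 2.0}, "Country", 0.5))
```

Falling back to the default when the margin is small keeps weak discourse evidence from overriding a sense that is usually correct.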
Experimental Evaluation
Discussion
Note: Columns 5-9 use the default-sense heuristics
Local patterns alone (Col-4) contribute 12% to the overall performance
Proper use of default senses and the heuristics (Col-5) achieves close to 90%
Prim’s algorithm (Col-7) is clearly better than the previous method using Kruskal’s algorithm (Col-6), by 13%
But neither method outperforms default senses
Using all three types of evidence, the new hybrid method reaches 96% (Col-9)
Future Work
Extend the scope of location normalization
Extend processing scope
Physical structure
famous building, bridge, airport, lake, street name,…
Extend gazetteer
Introduce more context information for disambiguation
Upgrade default meaning assignment