Slide1: Named Entity Recognition
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
Kalina Bontcheva
RANLP, Borovets, Bulgaria, 8th September 2003
Slide2: Structure of the Tutorial
task definition
applications
corpora, annotation
evaluation and testing
how to
preprocessing
approaches to NE
baseline
rule-based approaches
learning-based approaches
multilinguality
future challenges
Information Extraction: Information Extraction (IE) pulls facts and structured information from the content of large text collections.
IR - IE - NLU
MUC: Message Understanding Conferences
ACE: Automatic Content Extraction
MUC-7 tasks: MUC-7 tasks
NE: Named Entity recognition and typing
CO: co-reference resolution
TE: Template Elements
TR: Template Relations
ST: Scenario Templates
An Example: The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
TE: the rocket is "shiny red" and Head's "brainchild".
TR: Dr. Head works for We Build Rockets Inc.
ST: a rocket launching event occurred with the various participants.
Performance levels: Vary according to text type, domain, scenario, language
NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
CO: 60-70% resolution
TE: 80%
TR: 75-80%
ST: 60% (but: human level may be only 80%)
What are Named Entities?: NER involves identifying proper names in texts and classifying them into a set of predefined categories of interest
Person names
Organizations (companies, government organisations, committees, etc)
Locations (cities, countries, rivers, etc)
Date and time expressions
What are Named Entities (2): Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.
Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
MUC-7 entity definition guidelines [Chinchor’97]
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
What are NOT NEs (MUC-7): Artefacts – Wall Street Journal
Common nouns, referring to named entities – the company, the committee
Names of groups of people and things named after people – the Tories, the Nobel prize
Adjectives derived from names – Bulgarian, Chinese
Numbers which are not times, dates, percentages, and money amounts
Basic Problems in NE: Variation of NEs – e.g. John Smith, Mr Smith, John.
Ambiguity of NE types: John Smith (company vs. person)
May (person vs. month)
Washington (person vs. location)
1945 (date vs. time)
Ambiguity with common words, e.g. "may"
More complex problems in NE: Issues of style, structure, domain, genre etc.
Punctuation, spelling, spacing, formatting, ... all have an impact:
Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom
Tell me more about Leonardo
Da Vinci
Applications: Can help summarisation, ASR and MT
Intelligent document access
Browse document collections by the entities that occur in them
Formulate more complex queries than IR can answer
Example application domains:
News
Scientific articles, e.g., MEDLINE abstracts
Application - Threat tracker: Search by entity:
http://www.alias-i.com/iraq/feature_description/entity_search.html
Application Example - KIM: Browsing by entity and ontology: http://www.ontotext.com/kim
Application Example - KIM: Ontotext’s KIM formal query over OWL (including relations between entities) and results
Application Example - Perseus: Time-line and geographic visualisation: http://www.perseus.tufts.edu/
Some NE Annotated Corpora: MUC-6 and MUC-7 corpora - English
CONLL shared task corpora
http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
TIDES surprise language exercise (NEs in Cebuano and Hindi)
ACE – English - http://www.ldc.upenn.edu/Projects/ACE/
The MUC-7 corpus: 100 documents in SGML
News domain
1880 Organizations (46%)
1324 Locations (32%)
887 Persons (22%)
Inter-annotator agreement very high (~97%)
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
The MUC-7 Corpus (2): The MUC-7 Corpus (2) CAPE CANAVERAL, Fla. &MD; Working in chilly temperatures Wednesday night, NASA ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
Endeavour, with an international crew of six, was set to blast off from the Kennedy Space Center on Thursday at 4:18 a.m. EST, the start of a 49-minute launching period. The nine day shuttle flight was to be the 12th launched in darkness.
NE Annotation Tools - Alembic: NE Annotation Tools - Alembic
NE Annotation Tools – Alembic (2): NE Annotation Tools – Alembic (2)
NE Annotation Tools - GATE: NE Annotation Tools - GATE
Corpora and System Development: Corpora are typically divided into a training portion and a testing portion
Rules/learning algorithms are developed or trained on the training portion
They are then tuned on the testing portion in order to optimise:
Rule priorities, rule effectiveness, etc.
Parameters of the learning algorithm and the features used
Evaluation set – the best system configuration is run on this data to obtain the system's performance
No further tuning once the evaluation set has been used!
Performance Evaluation: Evaluation metric – mathematically defines how to measure the system’s performance against a human-annotated gold standard
Scoring program – implements the metric and provides performance measures
For each document and over the entire corpus
For each type of NE
The Evaluation Metric: Precision = correct answers / answers produced
Recall = correct answers / total possible correct answers
Trade-off between precision and recall
F-Measure = (β² + 1)PR / (β²R + P) [van Rijsbergen 75]
β reflects the weighting between precision and recall, typically β = 1
The Evaluation Metric (2): We may also want to take account of partially correct answers:
Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
Why: NE boundaries are often misplaced, so some results are only partially correct
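The partial-credit formulas on this slide can be sketched as a small function. This is only an illustration: the function and argument names are assumptions, and the F-measure is computed for the balanced case β = 1 only.

```python
def partial_credit_scores(correct, partial, incorrect, missing):
    """Precision, recall and balanced F-measure (beta = 1),
    giving half credit to partially correct answers."""
    credited = correct + 0.5 * partial
    produced = correct + incorrect + partial   # answers the system produced
    possible = correct + missing + partial     # answers in the gold standard
    precision = credited / produced if produced else 0.0
    recall = credited / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 9 correct, 0 partial, 3 incorrect, 2 missing.
p, r, f = partial_credit_scores(correct=9, partial=0, incorrect=3, missing=2)
print(round(p, 2), round(r, 2), round(f, 2))
```

With no partial matches the function reduces to the plain precision and recall definitions.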
The MUC scorer (1): Document: 9601020572
--------------------------------------------------------------------
                 POS  ACT |  COR  PAR  INC |  MIS  SPU  NON | REC PRE
-------------------------+----------------+----------------+--------
SUBTASK SCORES           |                |                |
 enamex                  |                |                |
  organization   11   12 |    9    0    0 |    2    3    0 |  82  75
  person         24   26 |   24    0    0 |    0    2    0 | 100  92
  location       27   31 |   25    0    0 |    2    6    0 |  93  81
…
* * * SUMMARY SCORES * * *
--------------------------------------------------------------------
                 POS  ACT |  COR  PAR  INC |  MIS  SPU  NON | REC PRE
-------------------------+----------------+----------------+--------
TASK SCORES              |                |                |
 enamex                  |                |                |
  organization 1855 1757 | 1553    0   37 |  265  167   30 |  84  88
  person        883  859 |  797    0   13 |   73   49    4 |  90  93
  location     1322 1406 | 1199    0   13 |  110  194    7 |  91  85
The MUC scorer (2): The MUC scorer (2)
Using the detailed report we can track errors in each document, for each NE in the text
ENAMEX cor inc PERSON PERSON "Wernher von Braun" "Braun"
ENAMEX cor inc PERSON PERSON "von Braun" "Braun"
ENAMEX cor cor PERSON PERSON "Braun" "Braun"
…
ENAMEX cor cor LOCATI LOCATI "Saturn" "Saturn"
…
The GATE Evaluation Tool: The GATE Evaluation Tool
Regression Testing: Need to track the system’s performance over time
When a change is made to the system, we want to know what the implications are over the entire corpus
Why: because an improvement in one case can lead to problems in others
GATE offers an automated tool to help with the NE development task over time
Regression Testing (2): At corpus level – GATE’s corpus benchmark tool – tracking the system’s performance over time
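The idea behind regression testing can be sketched as a comparison of per-type scores between two runs over the same corpus. This is not GATE's corpus benchmark tool; the function name and the scores below are hypothetical.

```python
def find_regressions(previous, current, tolerance=0.0):
    """Return the entity types whose F-measure dropped by more than
    `tolerance` between two runs, with (old, new) scores."""
    regressions = {}
    for entity_type, old_f in previous.items():
        new_f = current.get(entity_type, 0.0)
        if old_f - new_f > tolerance:
            regressions[entity_type] = (old_f, new_f)
    return regressions


# Hypothetical F-measures before and after a rule change.
previous_run = {"Person": 0.92, "Organization": 0.84, "Location": 0.88}
current_run = {"Person": 0.94, "Organization": 0.80, "Location": 0.88}
print(find_regressions(previous_run, current_run))
```

Here the change improves Person but harms Organization, which is the kind of side effect a corpus-level check is meant to expose.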
Pre-processing for NE Recognition: Format detection
Word segmentation (for languages like Chinese)
Tokenisation
Sentence splitting
POS tagging
Two kinds of NE approaches: Knowledge Engineering
rule based
developed by experienced language engineers
make use of human intuition
requires only small amount of training data
development could be very time consuming
some changes may be hard to accommodate
Learning Systems
use statistics or other machine learning
developers do not need LE expertise
requires large amounts of annotated training data
some changes may require re-annotation of the entire training corpus
annotators are cheap (but you get what you pay for!)
Baseline: list lookup approach: System that recognises only entities stored in its lists (gazetteers).
Advantages - Simple, fast, language independent, easy to retarget (just create lists)
Disadvantages – impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
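A minimal sketch of the list lookup baseline, assuming pre-tokenised input and a greedy longest-match strategy (the matching strategy, the function name and the tiny gazetteer are assumptions for illustration):

```python
def gazetteer_lookup(tokens, gazetteer, max_len=5):
    """Greedy longest-match lookup of token sequences in a gazetteer.

    `gazetteer` maps a tuple of tokens to an entity type.
    Returns (start, end, type) spans, with `end` exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in gazetteer:
                match = (i, i + length, gazetteer[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans


# Illustrative gazetteer entries only.
GAZETTEER = {
    ("New", "York"): "Location",
    ("York",): "Location",
    ("IBM",): "Organization",
}
tokens = "IBM opened an office in New York".split()
print(gazetteer_lookup(tokens, GAZETTEER))
```

The sketch also shows the listed disadvantages: a name missing from the lists is never found, and an ambiguous entry always receives the single type stored for it.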
Creating Gazetteer Lists: Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva02])
Locations lists
US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names (e.g., [Manov03])
UN site: http://unstats.un.org/unsd/citydata
Global Discovery database from Europa technologies Ltd, UK (e.g., [Ignat03])
Automatic collection from annotated training data
Shallow Parsing Approach (internal structure): Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location:
Cap. Word + {City, Forest, Center, River}
e.g. Sherwood Forest
Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
e.g. Portobello Street
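The two internal-evidence patterns above can be approximated with a regular expression. This is a rough sketch: the capitalised-word pattern is a simplification, and a real system would work over tokens rather than raw text.

```python
import re

# Head words taken from the two patterns on this slide.
HEAD_WORDS = ["City", "Forest", "Center", "River",
              "Street", "Boulevard", "Avenue", "Crescent", "Road"]

# One capitalised word followed by one of the head words.
LOCATION_RE = re.compile(
    r"\b[A-Z][a-z]+ (?:%s)\b" % "|".join(map(re.escape, HEAD_WORDS))
)


def find_internal_evidence(text):
    """Return location candidates matched by internal structure alone."""
    return [m.group() for m in LOCATION_RE.finditer(text)]


print(find_internal_evidence("They met in Sherwood Forest near Portobello Street."))
```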
Problems with the shallow parsing approach: Ambiguously capitalised words (first word in sentence)
[All American Bank] vs. All [State Police]
Semantic ambiguity
"John F. Kennedy" = airport (location)
"Philip Morris" = organisation
Structural ambiguity
[Cable and Wireless] vs. [Microsoft] and [Dell]
[Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]
Shallow Parsing Approach with Context: Use of context-based patterns is helpful in ambiguous cases
In isolation, "David Walton" and "Goldman Sachs" are indistinguishable
But given the phrase "David Walton of Goldman Sachs" with the Person entity "David Walton" already recognised, the pattern "[Person] of [Organization]" identifies "Goldman Sachs" correctly.
Identification of Contextual Information: Use KWIC index and concordancer to find windows of context around entities
Search for repeated contextual patterns of either strings, other entities, or both
Manually post-edit list of patterns, and incorporate useful patterns into new rules
Repeat with new entities
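The first two steps (context windows around entities, then counting repeated patterns) might look like the sketch below. The data layout, with each document given as a token list plus (start, end, type) entity spans, is an assumption for illustration.

```python
from collections import Counter


def kwic(tokens, entity_spans, window=2):
    """Yield (left context, entity type, right context) for each entity span."""
    for start, end, etype in entity_spans:
        left = tuple(tokens[max(0, start - window):start])
        right = tuple(tokens[end:end + window])
        yield left, etype, right


def count_context_patterns(documents, window=2):
    """Count how often each context pattern occurs around annotated entities."""
    counts = Counter()
    for tokens, spans in documents:
        for left, etype, right in kwic(tokens, spans, window):
            counts[(left, "[%s]" % etype, right)] += 1
    return counts


tokens = "David Walton joined Goldman Sachs as analyst".split()
spans = [(0, 2, "PERSON"), (3, 5, "ORGANIZATION")]
for pattern, freq in count_context_patterns([(tokens, spans)], window=1).most_common():
    print(freq, pattern)
```

The frequency-ranked list is what would then be post-edited by hand before turning useful patterns into rules.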
Examples of context patterns: [PERSON] earns [MONEY]
[PERSON] joined [ORGANIZATION]
[PERSON] left [ORGANIZATION]
[PERSON] joined [ORGANIZATION] as [JOBTITLE]
[ORGANIZATION]'s [JOBTITLE] [PERSON]
[ORGANIZATION] [JOBTITLE] [PERSON]
the [ORGANIZATION] [JOBTITLE]
part of the [ORGANIZATION]
[ORGANIZATION] headquarters in [LOCATION]
price of [ORGANIZATION]
sale of [ORGANIZATION]
investors in [ORGANIZATION]
[ORGANIZATION] is worth [MONEY]
[JOBTITLE] [PERSON]
[PERSON], [JOBTITLE]
Caveats: Patterns are only indicators based on likelihood
Can set priorities based on frequency thresholds
Need training data for each domain
More semantic information would be useful (e.g. to cluster groups of verbs)
Rule-based Example: FACILE: Used in MUC-7 [Black et al 98]
Uses Inxight’s LinguistiX tools for tagging and morphological analysis
Database for external information, role similar to a gazetteer
Linguistic info per token, encoded as feature vector:
Text offsets
Orthographic pattern (first/all capitals, mixed, lowercase)
Token and its normalised form
Syntax – category and features
Semantics – from database or morphological analysis
Morphological analyses
Example: (1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (^PER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)
PER_CIV_F – female civilian (from database)
FACILE (2): Context-sensitive rules written in special rule notation, executed by an interpreter
Writing rules in PERL is too error-prone and hard
Rules of the kind: A => B\C/D, where:
A is a set of attribute-value expressions and optional score, the attributes refer to elements of the input token feature vector
B and D are left and right context respectively and can be empty
B, C, D are sequences of attribute-value pairs and Kleene regular expression operations; variables are also supported
[syn=NP, sem=ORG] (0.9) => \ [norm="university"], [token="of"], [sem=REGION|COUNTRY|CITY] / ;
FACILE (3): # Rule for the mark up of person names when the first name is not
# present or known from the gazetteers: e.g. 'Mr J. Cass',
[SYN=PROP,SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M, LAST=_S]
#_F, _I, _M, _S are variables, transfer info from RHS
=>
[SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE]
\[SYN=NAME, ORTH=I|O, TOKEN=_I]?,
[ORTH=C|A, SYN=PROP, TOKEN=_F]?,
[SYN=NAME, ORTH=I|O, TOKEN=_I]?,
[SYN=NAME, TOKEN=_M]?,
[ORTH=C|A|O,SYN=PROP,TOKEN=_S, SOURCE!=RULE]
#proper name, not recognised by a rule
/;
FACILE (4): Preference mechanism:
The rule with the highest score is preferred
Longer matches are preferred to shorter matches
Results are always one semantic categorisation of the named entity in the text
Evaluation (MUC-7 scores):
Organization: 86% precision, 66% recall
Person: 90% precision, 88% recall
Location: 81% precision, 80% recall
Dates: 93% precision, 86% recall
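The preference mechanism listed above can be sketched as a selection over competing rule matches. The slide does not say how score and length interact, so this sketch assumes score is compared first and span length breaks ties; the dictionary layout is also hypothetical.

```python
def choose_match(candidates):
    """Pick exactly one candidate: highest score first, then the longest span.

    Each candidate is a dict with 'start', 'end' (exclusive), 'score', 'sem'.
    The ordering of the two criteria is an assumption.
    """
    return max(candidates, key=lambda c: (c["score"], c["end"] - c["start"]))


# Hypothetical competing analyses of the same stretch of text.
candidates = [
    {"start": 0, "end": 2, "score": 0.9, "sem": "ORG"},
    {"start": 0, "end": 3, "score": 0.9, "sem": "ORG"},
    {"start": 0, "end": 3, "score": 0.5, "sem": "PER"},
]
print(choose_match(candidates))
```

Returning a single candidate mirrors the point that each named entity ends up with one semantic categorisation.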
Example Rule-based System - ANNIE: Created as part of GATE
GATE – Sheffield’s open-source infrastructure for language processing
GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
GATE has a finite-state pattern-action rule language, used by ANNIE
ANNIE modified for MUC guidelines – 89.5% f-measure on MUC-7 corpus
Slide52: NE Components
The ANNIE system – a reusable and easily extendable set of components
Gazetteer lists for rule-based NE: Needed to store the indicator strings for the internal structure and context rules
Internal location indicators – e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations
Internal organisation indicators – e.g., company designators {GmbH, Ltd, Inc, …}
Produces Lookup results of the given kind
The Named Entity Grammars: Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
Hand-coded rules applied to annotations to identify NEs
Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
Use of contextual information
Finds person names, locations, organisations, dates, addresses.
Slide55: NE Rule in JAPE
JAPE: a Java Annotation Patterns Engine
Light, robust regular-expression-based processing
Cascaded finite state transduction
Low-overhead development of new components
Simplifies multi-phase regex processing
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+  // from tokeniser
  {Lookup.kind == companyDesignator}        // from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }
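For readers unfamiliar with JAPE, the logic of the Company1 rule can be mimicked in Python. This is only an analogy, not how GATE executes rules: tokens are represented here as hypothetical dictionaries carrying an `orth` feature from the tokeniser and an optional `lookup` feature from the gazetteer.

```python
def match_company(tokens):
    """Find one or more upperInitial tokens followed by a company designator.

    Returns annotation-like dicts with token offsets (`end` exclusive).
    """
    spans = []
    for end, token in enumerate(tokens):
        if token.get("lookup") != "companyDesignator":
            continue
        start = end
        # Walk back over the run of capitalised tokens before the designator.
        while (start > 0
               and tokens[start - 1].get("orth") == "upperInitial"
               and tokens[start - 1].get("lookup") != "companyDesignator"):
            start -= 1
        if start < end:  # the pattern needs at least one upperInitial token
            spans.append({"start": start, "end": end + 1,
                          "kind": "company", "rule": "Company1"})
    return spans


tokens = [
    {"text": "shares", "orth": "lowercase"},
    {"text": "of", "orth": "lowercase"},
    {"text": "Goldman", "orth": "upperInitial"},
    {"text": "Sachs", "orth": "upperInitial"},
    {"text": "Inc", "orth": "upperInitial", "lookup": "companyDesignator"},
    {"text": "rose", "orth": "lowercase"},
]
print(match_company(tokens))
```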
Slide56: Named Entities in GATE
Using co-reference to classify ambiguous NEs: Orthographic co-reference module that matches proper names in a document
Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
May not reclassify already classified entities
Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
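A toy version of this idea, covering only the two cases mentioned (a surname matching a full name, and an abbreviation matching a long form), is sketched below. The real orthographic co-reference module uses more matching rules; the names and the designator list here are assumptions.

```python
DESIGNATORS = {"Ltd", "Inc", "GmbH"}  # illustrative company designators


def acronym(name):
    """Initial letters of the capitalised words, ignoring company designators."""
    words = [w.strip(".") for w in name.split()]
    return "".join(w[0] for w in words
                   if w and w not in DESIGNATORS and w[0].isupper())


def classify_unknown(unknown, classified):
    """Give an unclassified name the type of a classified entity it matches.

    `classified` maps entity strings to types. Returns a type or None.
    """
    for name, etype in classified.items():
        if unknown == name.split()[-1] or unknown == acronym(name):
            return etype
    return None


classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
print(classify_unknown("Bonfield", classified))
print(classify_unknown("IBM", classified))
```

Only unclassified names are passed in, which reflects the point that already classified entities are not reclassified.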
Named Entity Coreference: Named Entity Coreference
DEMO: DEMO
Machine Learning Approaches: ML approaches frequently break down the NE task into two parts:
Recognising the entity boundaries
Classifying the entities into the NE categories
Some work is only on one task or the other
Tokens in text are often coded with the IOB scheme
O – outside, B-XXX – first word in NE, I-XXX – all other words in NE
Easy to convert to/from inline MUC-style markup
Argentina B-LOC
played O
with O
Del B-PER
Bosque I-PER
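Converting IOB tags to inline markup can be sketched in a few lines. The ENAMEX tag name follows the MUC convention, but reusing the short IOB labels (LOC, PER) as TYPE values is a simplification made for this example.

```python
def iob_to_inline(tagged_tokens):
    """Convert (token, IOB tag) pairs to inline ENAMEX-style markup."""
    out = []
    open_type = None
    for token, tag in tagged_tokens:
        # Any O tag or new B- tag closes the entity currently open.
        if (tag == "O" or tag.startswith("B-")) and open_type:
            out[-1] += "</ENAMEX>"
            open_type = None
        if tag.startswith("B-"):
            open_type = tag[2:]
            out.append('<ENAMEX TYPE="%s">%s' % (open_type, token))
        else:
            out.append(token)
    if open_type:
        out[-1] += "</ENAMEX>"
    return " ".join(out)


example = [("Argentina", "B-LOC"), ("played", "O"), ("with", "O"),
           ("Del", "B-PER"), ("Bosque", "I-PER")]
print(iob_to_inline(example))
```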
IdentiFinder [Bikel et al 99]: Based on Hidden Markov Models
Features
Capitalisation
Numeric symbols
Punctuation marks
Position in the sentence
14 features in total, combining above info, e.g., containsDigitAndDash (09-96), containsDigitAndComma (23,000.00)
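Word features of this kind assign each token to exactly one class. The sketch below covers the two features named on the slide plus a few others; the additional feature names and the order of the checks are illustrative assumptions, not the exact IdentiFinder feature set.

```python
def word_feature(token, is_first_word=False):
    """Map a token to a single orthographic feature class."""
    has_digit = any(ch.isdigit() for ch in token)
    if has_digit and "-" in token:
        return "containsDigitAndDash"    # e.g. 09-96
    if has_digit and "," in token:
        return "containsDigitAndComma"   # e.g. 23,000.00
    if has_digit:
        return "otherNum"
    if token.isupper():
        return "allCaps"
    if is_first_word:
        return "firstWord"               # capitalisation is uninformative here
    if token[:1].isupper():
        return "initCap"
    if token.islower():
        return "lowerCase"
    return "other"


for tok in ["09-96", "23,000.00", "IBM", "Smith", "the"]:
    print(tok, word_feature(tok))
```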
IdentiFinder (2): MUC-6 (English) and MET-1 (Spanish) corpora used for evaluation
Mixed case English
IdentiFinder - 94.9% f-measure
Best rule-based – 96.4%
Spanish mixed case
IdentiFinder – 90%
Best rule-based - 93%
Lower case names, noisy training data, less training data
Training data: 650,000 words, but similar performance with half of the data. Fewer than 100,000 words reduces performance to below 90% on English
MENE [Borthwick et al 98]: Combining rule-based and ML NE to achieve better performance
Tokens tagged as: XXX_start, XXX_continue, XXX_end, XXX_unique, other (non-NE), where XXX is an NE category
Uses Maximum Entropy
One only needs to find the best features for the problem
ME estimation routine finds the best relative weights for the features
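The token tagging scheme described above (start, continue, end, unique, other) can be illustrated by converting entity spans into per-token tags. The span format and function name are assumptions made for this sketch.

```python
def spans_to_mene_tags(n_tokens, spans):
    """Turn (start, end, type) spans, `end` exclusive, into per-token tags
    XXX_start / XXX_continue / XXX_end / XXX_unique / other."""
    tags = ["other"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = etype + "_unique"
        else:
            tags[start] = etype + "_start"
            for i in range(start + 1, end - 1):
                tags[i] = etype + "_continue"
            tags[end - 1] = etype + "_end"
    return tags


tokens = "Argentina played with Vicente Del Bosque".split()
spans = [(0, 1, "location"), (3, 6, "person")]
for token, tag in zip(tokens, spans_to_mene_tags(len(tokens), spans)):
    print(token, tag)
```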
MENE (2): Features
Binary features – “token begins with capitalised letter”, “token is a four-digit number”
Lexical features – dependencies on the surrounding tokens (window ±2) e.g., “Mr” for people, “to” for locations
Dictionary features – equivalent to gazetteers (first names, company names, dates, abbreviations)
External systems – whether the current token is recognised as an NE by a rule-based system
MENE (3): MUC-7 formal run corpus
MENE – 84.2% f-measure
Rule-based systems it uses – 86%-91%
MENE + rule-based systems – 92%
Learning curve
20 docs – 80.97%
40 docs – 84.14%
100 docs – 89.17%
425 docs – 92.94%
NE Recognition without Gazetteers [Mikheev et al 99]: How big should gazetteer lists be?
Experiment with simple list lookup approach on MUC-7 corpus
Learned lists – MUC-7 training corpus
1228 person names
809 organisations
770 locations
Common lists (from the Web)
5000 locations
33,000 organisations
27,000 person names
NE Recognition without Gazetteers (2): NE Recognition without Gazetteers (2)
NE Recognition without Gazetteers (3): System combines rule-based grammars and statistical (MaxEnt) models
Full gaz – 4900 LOC, 30,000 ORG, 10,000 PER
Some locs – 200 countries + continents + 8 planets
Ltd gaz – Some locs + lists inferred from 30 processed texts in the same domain
NE Recognition without Gazetteers (4): NE Recognition without Gazetteers (4)
Fine-grained Classification of NEs [Fleischman 02]: Finer-grained categorisation needed for applications like question answering
Person classification into 8 sub-categories – athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police.
Approach using local context and global semantic information such as WordNet
Used a decision list classifier and IdentiFinder to automatically construct a training set from untagged data
Held-out set of 1300 instances hand annotated
Fine-grained Classification of NEs (2): Word frequency features – how often the words surrounding the target instance occur with a specific category in training
For each of the 8 categories, 10 distinct word positions = 80 features per instance
3 words before & after the instance
The two-word bigrams immediately before and after the instance
The three-word trigrams before/after the instance
Fine-grained Classification of NEs (3): Topic signatures and WordNet information
Compute lists of terms that signal relevance to a topic/category [Lin&Hovy 00] & expand with WordNet synonyms to counter unseen examples
Politician – campaign, republican, budget
The topic signature features convey information about the overall context in which each instance exists
Due to differing contexts, instances of the same name in a single text were classified differently
Fine-grained Classification of NEs (4): MemRun chooses the prevailing sub-category for each name based on its most frequent classification
An orthomatching-like algorithm is developed to match George Bush, Bush, and George W. Bush
Experiments with k-NN, Naïve Bayes, SVMs, Neural Networks and C4.5 show that C4.5 is best
Experiments with different feature configurations – 70.4% with all features discussed here
Future work: treating finer grained classification as a WSD task (categories are different senses of a person)
Multilingual Named Entity Recognition: Recent experiments are aimed at NE recognition in multiple languages
TIDES surprise language evaluation exercise measures how quickly researchers can develop NLP components in a new language
CONLL’02, CONLL’03 focus on language-independent NE recognition
Analysis of the NE Task in Multiple Languages [Palmer&Day 97]: Analysis of the NE Task in Multiple Languages [Palmer&Day 97]
Analysis of Multilingual NE (2): Numerical and time expressions are very easy to capture using rules
Together they constitute about 20-30% of all NEs
All numerical expressions in the 6 languages required only 5 patterns
Time expressions similarly require only a few rules (fewer than 30 per language)
Many of these rules are reusable across the languages
Analysis of Multilingual NE (3): Suggests a method for calculating the lower bound for system performance given a corpus in the target language
Conclusion: Much of the NE task can be achieved by simple string analysis and common phrasal contexts
Zipf’s law: the prevalence of frequent phenomena allows high scores to be achieved directly from the training data
Chinese, Japanese, and Portuguese corpora had a lower bound above 70%
Substantial further advances require language specificity
What is needed for multilingual NE: Extensive support for non-Latin scripts and text encodings, including conversion utilities
Automatic recognition of encoding [Ignat et al 03]
Occupied up to 2/3 of the TIDES Hindi effort
Bi-lingual dictionaries
Annotated corpus for evaluation
Internet resources for gazetteer list collection (e.g., phone books, yellow pages, bi-lingual pages)
Multilingual support - Alembic: Japanese example
Editing Multilingual Data: GATE Unicode Kit (GUK)
Complements Java’s facilities
Support for defining Input Methods (IMs)
currently 30 IMs for 17 languages
Pluggable in other applications (e.g. JEdit)
Slide83: Multilingual Data - GATE
All processing, visualisation and editing tools use GUK
Gazetteer-based Approach to Multilingual NE [Ignat et al 03]: Deals with locations only
Even more ambiguity than in one language:
Multiple places that share the same name, such as the fourteen cities and villages in the world called ‘Paris’
Place names that are also words in one or more languages, such as ‘And’ (Iran), ‘Split’ (Croatia)
Places have varying names in different languages (Italian ‘Venezia’ vs. English ‘Venice’, German ‘Venedig’, French ‘Venise’)
Gazetteer-based multilingual NE (2): Disambiguation module applies heuristics based on location size and country mentions (prefer the locations from the country mentioned most)
Performance evaluation:
853 locations from 80 English texts
96.8% precision
96.5% recall
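The two disambiguation heuristics (country mentions and location size) can be sketched as a ranking over candidate gazetteer entries. How the original system combines them is not stated here, so this sketch assumes country mentions are compared first and size breaks ties; the population figures are rough illustrative values.

```python
from collections import Counter


def disambiguate(candidates, country_mentions):
    """Choose among gazetteer entries that share a name.

    Prefers the country mentioned most often in the text, then the
    larger place. Each candidate has 'country' and 'population' keys.
    """
    counts = Counter(country_mentions)
    return max(candidates, key=lambda c: (counts[c["country"]], c["population"]))


# Two of the places called 'Paris' (approximate populations).
paris = [
    {"name": "Paris", "country": "France", "population": 2000000},
    {"name": "Paris", "country": "USA", "population": 25000},
]
print(disambiguate(paris, ["USA", "USA", "France"])["country"])
print(disambiguate(paris, [])["country"])
```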
Machine Learning for Multilingual NE: The CONLL 2002 and 2003 shared tasks covered NE in Spanish and Dutch (2002) and in English and German (2003)
The most popular ML techniques used:
Maximum Entropy (5 systems)
Hidden Markov Models (4 systems)
Connectionist methods (4 systems)
Combining ML methods has been shown to boost results
ML for NE at CONLL (2): The choice of features is at least as important as the choice of ML algorithm
Lexical features (words)
Part-of-speech
Orthographic information
Affixes
Gazetteers
External, unmarked data is useful to derive gazetteers and for extracting training instances
ML for NE at CONLL (3): English (f-measure)
Baseline - 59.5% (list lookup of entities with 1 class in training data)
Systems – between 60.2% and 88.76%
German (f-measure)
Baseline – 30.3%
Systems – between 47.7% and 72.4%
Spanish (f-measure)
Baseline – 35.9%
Systems – between 60.9% and 81.4%
Dutch (f-measure)
Baseline – 53.1%
Systems – between 56.4% and 77%
TIDES surprise language exercise: Collaborative effort between a number of sites to develop resources and tools for various LE tasks on a surprise language
Tasks: IE (including NE), machine translation, summarisation, cross-language IR
Dry-run lasted 10 days on the Cebuano language from the Philippines
Surprise language was Hindi, announced at the start of June 2003; duration 1 month
Language categorisation: LDC – survey of 300 largest languages (by population) to establish what resources are available
http://www.ldc.upenn.edu/Projects/TIDES/language-summary-table.html
Classification dimensions:
Dictionaries, news texts, parallel texts, e.g., Bible
Script, orthography, words separated by spaces
The Surprise Languages: Cebuano:
Latin script and words are spaced, but
Few resources and little work, so
Medium difficulty
Hindi:
Non-Latin script, different encodings used, words are spaced, no capitalisation
Many resources available
Medium difficulty
Named Entity Recognition for TIDES: Information on other systems and results from TIDES is still unavailable to non-TIDES participants
It will be made available by the end of 2003 in a special issue of ACM Transactions on Asian Language Information Processing (TALIP), "Rapid Development of Language Capabilities: The Surprise Languages"
The Sheffield approach is presented below, because it is not subject to these restrictions
Dictionary-based Adaptation of an English POS tagger: Substituted a Hindi/Cebuano lexicon for the English one in a Brill-like tagger
Hindi/Cebuano lexicon derived from a bi-lingual dictionary
Used empty ruleset since no training data available
Used default heuristics (e.g. return NNP for capitalised words)
Very experimental, but reasonable results
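With an empty ruleset, a Brill-like tagger reduces to lexicon lookup plus default heuristics. The sketch below shows that reduced behaviour; it is not the tagger actually used, and the lexicon entries are placeholders rather than real dictionary data.

```python
def tag(tokens, lexicon, default_tag="NN"):
    """Tag tokens by lexicon lookup, falling back on simple heuristics.

    Unknown capitalised words get NNP; other unknown words get `default_tag`
    (the choice of NN as the default is an assumption).
    """
    tags = []
    for token in tokens:
        if token.lower() in lexicon:
            tags.append(lexicon[token.lower()])
        elif token[:1].isupper():
            tags.append("NNP")
        else:
            tags.append(default_tag)
    return tags


# Placeholder entries standing in for a lexicon derived from a bilingual dictionary.
lexicon = {"word_a": "NN", "word_b": "VB"}
print(tag(["Maria", "word_b", "word_a", "xyz"], lexicon))
```

Because NE grammars lean on POS tags such as NNP, even this crude fallback gives later stages something to work with.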
Evaluation of the Tagger: No formal evaluation was possible
Estimate around 67% accuracy on Hindi – evaluated by a native speaker on 1000 words
Created in 2 person days
Results and a tagging service made available to other researchers in TIDES
Important pre-requisite for NE recognition
NE grammars: Most English JAPE rules are based on POS tags and gazetteer lookup
Grammars can be reused for languages with similar word order, orthography etc.
No time to make a detailed study of Cebuano, but it is very similar in structure to English
Most of the rules were left as for English, with some adjustments, especially to handle dates
Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages
Evaluation Results: Evaluation Results
Future challenges: Towards semantic tagging of entities
New evaluation metrics for semantic entity recognition
Expanding the set of entities recognised – e.g., vehicles, weapons, substances (food, drug)
Finer-grained hierarchies, e.g., types of Organizations (government, commercial, educational, etc.), Locations (regions, countries, cities, water, etc)
Future challenges (2): Standardisation of the annotation formats
[Ide & Romary 02] – RDF-based annotation standards
[Collier et al 02] – multi-lingual named entity annotation guidelines
Aimed at defining how to annotate in order to make corpora more reusable and lower the overhead of writing format conversion tools
MUC used inline markup
TIDES and ACE used stand-off markup, but two different kinds (XML vs. one word per line)
Towards Semantic Tagging of Entities: The MUC NE task tagged selected segments of text whenever that text represents the name of an entity.
In ACE (Automatic Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) the mentions in the text of the entities themselves.
ACE focuses on domain- and genre-independent approaches
ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)
ACE Entities: Dealing with
Proper names – e.g., England, Mr. Smith, IBM
Pronouns – e.g., he, she, it
Nominal mentions – the company, the spokesman
Identify which mentions in the text refer to which entities, e.g.,
Tony Blair, Mr. Blair, he, the prime minister, he
Gordon Brown, he, Mr. Brown, the chancellor
ACE Example: ACE Example
ACE Entities (2): ACE Entities (2) Some entities can have different roles, i.e., behave as Organizations, Locations, or Persons – GPEs (Geo-political entities)
New York [GPE – role: Person], flush with Wall Street money, has a lot of loose change jangling in its pockets.
All three New York [GPE – role: Location] regional commuter train systems were found to be punctual more than 90 percent of the time.
Further information on ACE: Further information on ACE ACE is a closed-evaluation initiative, which does not allow the publication of results
Further information on guidelines and corpora is available at:
http://www.ldc.upenn.edu/Projects/ACE/
ACE also includes other IE tasks; for further details, see Doug Appelt’s presentation: http://www.clsp.jhu.edu/ws03/groups/sparse/presentations/doug.ppt
Evaluating Richer NE Tagging: Evaluating Richer NE Tagging Need for new metrics when evaluating hierarchy/ontology-based NE tagging
Need to take into account distance in the hierarchy
Tagging a company as a charity is less wrong than tagging it as a person
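One way to make "distance in the hierarchy" concrete is to count the edges between the gold type and the predicted type and turn that count into partial credit. The sketch below is an illustrative assumption, not a metric proposed in this tutorial: the toy hierarchy `PARENT`, the function names, and the `1 / (1 + distance)` scoring formula are all hypothetical.

```python
# Hypothetical type hierarchy, stored as child -> parent.
PARENT = {
    "Person": "Entity",
    "Organization": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(type_name):
    """Return the path from a type up to the root, starting with the type itself."""
    path = [type_name]
    while type_name in PARENT:
        type_name = PARENT[type_name]
        path.append(type_name)
    return path


def distance(a, b):
    """Number of edges between two types, going through their lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    for steps_up, type_name in enumerate(path_a):
        if type_name in path_b:
            return steps_up + path_b.index(type_name)
    raise ValueError("types share no common ancestor")


def partial_credit(gold, predicted):
    """Score 1.0 for an exact match, decreasing as the types get further apart."""
    return 1.0 / (1 + distance(gold, predicted))


print(partial_credit("Company", "Charity"))  # siblings under Organization
print(partial_credit("Company", "Person"))   # related only via the root
```

Under these assumptions, tagging a company as a charity (distance 2) earns more credit than tagging it as a person (distance 3), matching the intuition on the slide.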
Further Reading: Further Reading Aberdeen J., Day D., Hirschman L., Robinson P. and Vilain M. MITRE: Description of the Alembic System Used for MUC-6. MUC-6 proceedings. Pages 141-155. Columbia, Maryland. 1995.
Black W.J., Rinaldi F., Mowatt D. Facile: Description of the NE System Used For MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 29 April - 1 May, 1998.
Borthwick A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation. 1999
Bikel D., Schwartz R., Weischedel R. An algorithm that learns what’s in a name. Machine Learning 34, pp. 211-231, 1999
Carreras X., Màrquez L., Padró L. Named Entity Extraction using AdaBoost. The 6th Conference on Natural Language Learning. 2002
Chang J.S., Chen S. D., Zheng Y., Liu X. Z., and Ke S. J. Large-corpus-based methods for Chinese personal name recognition. Journal of Chinese Information Processing, 6(3):7-15, 1992
Chen H.H., Ding Y.W., Tsai S.C. and Bian G.W. Description of the NTU System Used for MET2. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 29 April - 1 May, 1998.
Chinchor N. MUC-7 Named Entity Task Definition Version 3.5. Available from ftp.muc.saic.com/pub/MUC/MUC7-guidelines, 1997
Further reading (2): Further reading (2) Collins M., Singer Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999
Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 489-496, July 2002
Gotoh Y., Renals S. Information extraction from broadcast news. Philosophical Transactions of the Royal Society of London, series A: Mathematical, Physical and Engineering Sciences, 2000.
Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the MUC-6 workshop, Washington. November 1995.
[Ign03a] C. Ignat and B. Pouliquen and A. Ribeiro and R. Steinberger. Extending an Information Extraction Tool Set to Eastern-European Languages. Proceedings of Workshop on Information Extraction for Slavonic and other Central and Eastern European Languages (IESL'03). 2003.
Krupka G. R., Hausman K. IsoQuest Inc.: Description of the NetOwl™ Extractor System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 29 April - 1 May, 1998.
McDonald D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky editors: Corpus Processing for Lexical Acquisition. Pages 21-39. MIT Press. Cambridge, MA. 1996
Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 29 April - 1 May, 1998
Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 29 April - 1 May, 1998
Further reading (3): Further reading (3) Palmer D., Day D.S. A Statistical Profile of the Named Entity Task. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31- April 3, 1997.
Sekine S., Grishman R. and Shinnou H. A decision tree method for finding and classifying names in Japanese texts. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998
Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity Identification Using Class-based Language Model. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), pp. 967-973, 2002.
Takeuchi K., Collier N. Use of Support Vector Machines in Extended Named Entity Recognition. The 6th Conference on Natural Language Learning. 2002
D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
M. M. Wood and S. J. Lydon and V. Tablan and D. Maynard and H. Cunningham. Using parallel texts to improve recall in IE. Recent Advances in Natural Language Processing, Bulgaria, 2003.
D. Maynard, V. Tablan and H. Cunningham. NE recognition without training data on a language you don't speak. ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.
Further reading (4): Further reading (4) H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, Y. Wilks. Multimedia Indexing through Multisource and Multilingual Information Extraction; the MUMIS project. Data and Knowledge Engineering, 2003.
D. Manov and A. Kiryakov and B. Popov and K. Bontcheva and D. Maynard, H. Cunningham. Experiments with geographic knowledge for information extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03, Canada, 2003.
H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.
H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.
D. Maynard, H. Cunningham, K. Bontcheva, M. Dimitrov. Adapting A Robust Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.
E. Paskaleva and G. Angelova and M. Yankova and K. Bontcheva and H. Cunningham and Y. Wilks. Slavonic Named Entities in GATE. 2003. CS-02-01.
K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.