Presentation Transcript
Slide1 : Adapting an English Information
Extraction System to Swedish
Kristofer Franzén
Information and Language Engineering Group
Human Computer Interaction and Language Engineering Laboratory
Swedish Institute of Computer Science
franzen@sics.se
Overview : Overview
Information Extraction
The Proteus system at NYU
Changes made to the system
Experiment results
The SICS IE system
Information Extraction? : Information Extraction? Capturing predefined events or relations in texts.
Example scenarios:
Changes in corporate executive management personnel.
Capture all information about people entering or leaving senior positions in companies.
Aircraft accidents.
Capture all information about the flight, airline, accident location, aircraft model, flight origin, flight destination, casualties, etc.
Template filling : Template filling
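- Karo Bio. Per-Olof Mårtensson har åter utsetts till VD efter att sedan förra våren ha varit ordförande. Mårtensson efterträds på ordförandeposten av Bertil Hållsten, tidigare chef för SE-Bankens läkemedelsfonder.
(English: Karo Bio. Per-Olof Mårtensson has again been appointed CEO after serving as chairman since last spring. Mårtensson is succeeded as chairman by Bertil Hållsten, formerly head of SE-Banken's pharmaceutical funds.)
POSITION VD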
COMPANY Karo Bio
IN-PERSON Per-Olof Mårtensson
POSITION ordförande
COMPANY Karo Bio
IN-PERSON Bertil Hållsten
OUT-PERSON Per-Olof Mårtensson
POSITION chef
COMPANY SE-Bankens läkemedelsfonder
OUT-PERSON Bertil Hållsten
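The filled templates above can be thought of as simple records, one per position change. The following is a minimal sketch, assuming a hypothetical `SuccessionEvent` record type and `render` helper (neither comes from the Proteus system); the slot names follow the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuccessionEvent:
    """One filled template; slot names follow the slide, the class itself is hypothetical."""
    position: str
    company: str
    in_person: Optional[str] = None   # person taking the position, if stated
    out_person: Optional[str] = None  # person leaving the position, if stated


# The three templates filled from the Karo Bio example text.
events = [
    SuccessionEvent("VD", "Karo Bio", in_person="Per-Olof Mårtensson"),
    SuccessionEvent("ordförande", "Karo Bio",
                    in_person="Bertil Hållsten",
                    out_person="Per-Olof Mårtensson"),
    SuccessionEvent("chef", "SE-Bankens läkemedelsfonder",
                    out_person="Bertil Hållsten"),
]


def render(event: SuccessionEvent) -> str:
    """Print a template in the slot-per-line layout used on the slide."""
    lines = [f"POSITION {event.position}", f"COMPANY {event.company}"]
    if event.in_person:
        lines.append(f"IN-PERSON {event.in_person}")
    if event.out_person:
        lines.append(f"OUT-PERSON {event.out_person}")
    return "\n".join(lines)


for event in events:
    print(render(event))
    print()
```

Empty slots are simply omitted on output, which mirrors how the first and third templates on the slide carry only one person slot each.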
General system architecture : General system architecture
Local text analysis
Discourse analysis
Incremental pattern matching : Incremental pattern matching …and lexical generalization
Totte Boll, tidigare VD i Eckym Ropos Inc., har utsetts till …
(English: Totte Boll, formerly CEO of Eckym Ropos Inc., has been appointed …)
[person]name , tidigare [position]np i [company]name , [utse]vg-pass till …
[person]np , [position-in-company]np , [utse]vg-pass till …
[person]np-entity , [appoint]vg-pass till …
which would match the beginning of the following event-pattern
np-entity(person) vg(appoint, voice=pass) 'till' np(position) ('av' np(company))?
Incremental pattern matching : Incremental pattern matching …and lexical generalization
Totte Boll, tidigare VD i Eckym Ropos Inc., har utsetts till …
[person]name , tidigare [position]np i [company]name , [utse]vg till …
  [person]name: string="Totte Boll"
  [utse]vg: tense=perf, voice=pass
[person]name , [post-in-company]np , [utse]vg till …
  [person]name: …
  [post-in-company]np: tense=former, post="VD", org="Eckym Ropos Inc."
  [utse]vg: …
[person]np [appoint]vg-pass till …
  [person]np: outPos=[post-in-company], ...
  [appoint]vg-pass: ...
which would match the beginning of the following event-pattern
np(person) vg(appoint, voice=pass) 'till' np(position) ('av' np(company))?
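The incremental reduction on this slide can be sketched as a sequence of rewrite passes, each of which replaces a short run of constituents with one higher-level constituent and carries the features along. This is only an illustration under assumptions: the dictionary representation, the `rewrite` helper, the rule functions and the one-entry `LEXICON` are all hypothetical and are not the Proteus implementation.

```python
def c(cat, cls, **feats):
    """Build a constituent: syntactic category, semantic class, extra features."""
    return {"cat": cat, "cls": cls, **feats}


def is_c(x, cat, cls):
    """True if x is a constituent with the given category and class."""
    return isinstance(x, dict) and x["cat"] == cat and x["cls"] == cls


# Output of the first analysis step, as shown on the slide.
chunks = [
    c("name", "person", string="Totte Boll"), ",", "tidigare",
    c("np", "position", string="VD"), "i",
    c("name", "company", string="Eckym Ropos Inc."), ",",
    c("vg", "utse", tense="perf", voice="pass"), "till",
]


def rewrite(seq, width, rule):
    """Slide a window over seq; where rule returns a constituent, substitute it."""
    out, i = [], 0
    while i < len(seq):
        window = seq[i:i + width]
        new = rule(window) if len(window) == width else None
        if new is not None:
            out.append(new)
            i += width
        else:
            out.append(seq[i])
            i += 1
    return out


def post_in_company(w):
    # tidigare [position]np i [company]name  ->  [post-in-company]np
    if (w[0] == "tidigare" and is_c(w[1], "np", "position")
            and w[2] == "i" and is_c(w[3], "name", "company")):
        return c("np", "post-in-company", tense="former",
                 post=w[1]["string"], org=w[3]["string"])
    return None


def person_with_apposition(w):
    # [person]name , [post-in-company]np ,  ->  [person]np with outPos feature
    if (is_c(w[0], "name", "person") and w[1] == ","
            and is_c(w[2], "np", "post-in-company") and w[3] == ","):
        return c("np", "person", string=w[0]["string"], outPos=w[2])
    return None


LEXICON = {"utse": "appoint"}  # assumed lexical generalization table


def generalize_verb(w):
    # [utse]vg  ->  [appoint]vg, keeping tense and voice
    x = w[0]
    if isinstance(x, dict) and x["cat"] == "vg" and x["cls"] in LEXICON:
        return {**x, "cls": LEXICON[x["cls"]]}
    return None


stage1 = rewrite(chunks, 4, post_in_company)
stage2 = rewrite(stage1, 4, person_with_apposition)
stage3 = rewrite(stage2, 1, generalize_verb)


def starts_event(seq):
    """Check the prefix: np(person) vg(appoint, voice=pass) 'till'."""
    return (len(seq) >= 3 and is_c(seq[0], "np", "person")
            and is_c(seq[1], "vg", "appoint")
            and seq[1].get("voice") == "pass" and seq[2] == "till")


print(starts_event(stage3))
```

After the three passes the sequence is reduced to a person noun phrase, a generalized passive verb group and the literal 'till' ("to"), which is the prefix the event pattern expects; the former position survives as the `outPos` feature on the person.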
Syntactic generalization : Syntactic generalization Metarules transform patterns to capture all syntactic variations.
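Assam Pärks styrelse har utsett Totte Boll till ny styrelseordförande.
(Assam Pärks' board has appointed Totte Boll as new chairman of the board.)
Totte Boll utses i morgon till ny styrelseordförande i Assam Pärks.
(Totte Boll will be appointed new chairman of the board of Assam Pärks tomorrow.)
Totte Boll, tidigare VD i Eckym Ropos Inc., har utsetts till styrelseordförande i Assam Pärks.
(Totte Boll, formerly CEO of Eckym Ropos Inc., has been appointed chairman of the board of Assam Pärks.)
Totte Boll, utsedd till ny styrelseordförande i Assam Pärks, är en glad lax.
(Totte Boll, appointed new chairman of the board of Assam Pärks, is a cheerful fellow.)
Totte Boll, som utsågs till styrelseordförande i Assam Pärks igår, har framtiden för sig.
(Totte Boll, who was appointed chairman of the board of Assam Pärks yesterday, has the future ahead of him.)
Assam Pärks nye styrelseordförande Totte Boll skyr inga medel.
(Assam Pärks' new chairman of the board, Totte Boll, stops at nothing.)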
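One way to picture metarules is as functions that take a single clause-level description (verb, agent, patient) and emit one surface pattern per syntactic variant. The sketch below is an assumption-laden illustration, not the Proteus metarule mechanism: the function `apply_metarules`, the variant names and the `form=participle` feature are hypothetical, and the pattern notation simply imitates the event pattern shown earlier in the deck.

```python
def apply_metarules(verb, agent, patient, prep="till"):
    """Expand one clause-level pattern into several surface variants (hypothetical)."""
    tail = f"'{prep}' np(position)"
    return {
        # Assam Pärks styrelse har utsett Totte Boll till ...
        "active": f"np({agent}) vg({verb}, voice=act) np({patient}) {tail}",
        # Totte Boll ... har utsetts till ...
        "passive": f"np({patient}) vg({verb}, voice=pass) {tail} ('av' np({agent}))?",
        # Totte Boll, som utsågs till ...
        "relative": f"np({patient}) ',' 'som' vg({verb}, voice=pass) {tail}",
        # Totte Boll, utsedd till ...
        "reduced_relative": f"np({patient}) ',' vg({verb}, form=participle) {tail}",
    }


variants = apply_metarules("appoint", agent="company", patient="person")
for name, pattern in variants.items():
    print(f"{name}: {pattern}")
```

The pattern writer states the appointment relation once and the variants are derived mechanically. The nominal construction in the last example sentence (a company's new chairman followed by a name) is not covered by this sketch.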
Changes made to the system : Changes made to the system Lexical analysis
Input format
Rule predicates
Domain- and task-independent patterns
Knowledge bases
Scenario specific patterns
Swedish management succession : Swedish management succession
Training corpus: 34 news articles (51 events), F-score 55
Test corpus: 50 news articles (87 events), F-score 25
MUC-6 systems: F-score 48-56
Proteus today: F-score 65
F-score = (2 * precision * recall) / (precision + recall)
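The F-score formula on this slide is the harmonic mean of precision and recall. A minimal sketch follows; the function name `f_score` and the input values are illustrative assumptions, not figures from the experiment, which reports only the resulting F-scores.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: (2 * precision * recall) / (precision + recall).

    Works on either the 0-1 or the 0-100 scale, as long as both inputs
    use the same one. Returns 0.0 when both inputs are zero.
    """
    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall) / (precision + recall)


# Hypothetical values on the 0-100 scale used on the slide.
print(f_score(60, 40))  # 48.0
```

Because it is a harmonic mean, the score is pulled toward the lower of the two inputs, so a system cannot reach a high F-score by maximizing precision or recall alone.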
Possible reasons : Possible reasons Overtraining (overfitting)
Mismatching interpretation of the template filling rules
System design (scenario-specific and language-specific details too tightly integrated into the core system)
Conclusions : Conclusions Linguistic differences were not a problem
Pragmatic differences were not a problem
A complex system is not easy to reconfigure
Compiling a test corpus is difficult
SICS IE system : SICS IE system Goal
Domain-, language- and platform-independent, modular, open and free IE system. Within one year ;-)
What we have
A general annotation-based (TIPSTER) infrastructure for a document processing system and some low-level pattern bases.
Next to do
A pattern matching language conforming to the Common Pattern Specification Language (CPSL) definition
A set of scenario-independent pattern libraries