darpaMP

Presentation Transcript

Multilingual PennTools that capture parses and predicate-argument structures, and their use in Applications: 

Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Fernando Pereira
University of Pennsylvania
March 11, 2002
TIDES SITE VISIT

Outline: 

Overview
  Objectives, resource development, applications
Supervised Training of Individual Components
  parsers
  semantic taggers
Training with labeled and unlabeled data
  co-training
  active learning (annotation tools)

Objectives: 

Resources ($200K)
  Chinese TreeBank II
  Parallel Korean/English TreeBanks
  PropBank
  Multilingual Annotation Tool (Tom Morton, Nianwen Xue, Jeremy Lacivita)
  NYU, MITRE, LDC

Objectives (cont): 

PennTools ($300K)
  Morphological Analyzers (at LDC)
  Major decrease in parser development time and parser running time (Dan Bikel, Carlos Prolo, Anoop Sarkar)
  Automatic Predicate Argument Tagging (Dan Gildea)
  Word Sense Disambiguation, English & Chinese (Hoa Dang)

Chinese TreeBank II Fu-dong Chiou, Nianwen Xue: 

Cost of CTB I, 100K words: $270K
  Additional 40K (20K, 20K)
  Speedup given automatic parses? Doubled
  Compare HK, Sinorama, People’s Daily
2002: 360K words, $100K
  Chiang’s parser doubles annotation speed
  96K words bracketed as of March 8, 2002
  110K Xinhua news, 200K other newswire, 50K DLI corpus
  Release of original 100K + 150K planned for June

English Translation: CTB I TIDES: 

Beijing E-C Translation LTD
12-week estimate, actual 15 weeks, Nov
100K words, around $10K ($0.06 per character)
3rd pass for error correction taking longer than expected
40K of 100K done

Chinese PropBank - DOD: 

Proposal stage, 2 years, $275K a year
Year One (just got funded)
  Develop lexicon guidelines, 2,600 verbs
  Tag 100K CTB
Year Two
  Extend guidelines, up to 5,000 or 6,000 verbs
  Tag additional 400K CTB II
Spin-off: Chinese lexicon

Richer CTB Annotations TIDES ($25K): 

Coreference Tagging (Susan Converse)
  Draft guidelines
  100K words tagged
Sense tagging (Hoa Dang)

Korean/English Parallel TreeBank Chunghye Han, Narae Han, Allen Lee (CoGenTex/Penn/Systran: ARL MT Project): 

Defense Language Institute data
  50K word corpus of military messages
  Same corpus available in Chinese
Guidelines for POS tagging, bracketing
  http://www.cis.upenn.edu/~xtag/koreantag/index.html
Companion Transfer Lexicon, 4,000 entries
READY TO RELEASE

English PropBank Paul Kingsbury, Scott Cotton: 

1M words of Treebank
New semantic augmentations
  Predicate-argument relations for verbs, label arguments (arg0, arg1, arg2)
First subtask, 300K word financial subcorpus
Spin-off: English lexical resource
  3500+ verbs
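
To make the numbered argument labels concrete, here is a minimal Python sketch of how one predicate-argument annotation could be held in memory. The class name, field names, example sentence, and role assignments are illustrative assumptions; this is not the PropBank file format, and the sentence is not taken from the corpus.

```python
from dataclasses import dataclass, field


@dataclass
class PredicateInstance:
    """One annotated verb occurrence with its labeled arguments (hypothetical layout)."""
    lemma: str
    arguments: dict = field(default_factory=dict)  # label -> text span

    def describe(self) -> str:
        # Render the labels in sorted order so arg0 comes before arg1, and so on.
        parts = [f"{label}={text!r}" for label, text in sorted(self.arguments.items())]
        return f"{self.lemma}(" + ", ".join(parts) + ")"


# Invented example sentence: "The company sold the unit to investors."
sold = PredicateInstance(
    lemma="sell",
    arguments={
        "arg0": "The company",  # assumed role: the seller
        "arg1": "the unit",     # assumed role: the thing sold
        "arg2": "investors",    # assumed role: the buyer
    },
)

print(sold.describe())
```

The point of the numbered labels is that the same label stays attached to the same participant even when the syntax changes, for example in a passive rewording of the sentence.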

English PropBank – Current Status: 

Frames files
  787 verb lemmas (932 including phrasal variants)
  363/ VerbNet semi-automatic expansions (subtask/PB)
First subtask: 300K financial subcorpus
  22,595 unique predicates annotated out of 29K (80%)
    6K+ remaining (7 weeks, 2000@week, first pass)
  1040 verb lemmas out of 1700+ (59%)
    700 remaining (3.5 months, 200@month)
PropBank (including some of Brown?)
  34,437 predicates annotated out of 118K (29%)
  1040 verb lemmas out of 3500 (29%)

Summary of Resources: 

[Table: Project, 2002 Funds, Status, Completion Date]

Objectives (cont): 

Applications: ($200K) + ($150K)
  Relation Extraction and MT
  Initial experiments with MUC 7
  Korean/English MT system wrap-up
  Plans for investigating statistical MT approaches

Information Extraction with TAG (Libin Shen, Anoop Sarkar, and Jinying Chen): 

Template Relation (TR) task of the 7th Message Understanding Conference
F-measure of 78% on sentence-level relations, which is comparable to the best results in MUC-7
Convert IE into a discriminative problem
Syntactic analysis with Supertagger [Joshi 1994] and Lightweight Dependency Analyzer [Srinivas 1997]
Machine learning with Boosting algorithm [Schapire 2000]
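
For readers unfamiliar with the F-measure quoted above, the following Python sketch shows the standard definition as a weighted harmonic mean of precision and recall. The counts are made up for illustration and are not the MUC-7 figures; the slide does not say which weighting was used, so the balanced case (beta = 1) is an assumption.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (balanced F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was found correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented counts: relations the system got right, proposed, and in the answer key.
correct, predicted, gold = 30, 40, 50
precision = correct / predicted  # share of proposed relations that are right
recall = correct / gold          # share of answer-key relations that were found

print(f"P={precision:.2f} R={recall:.2f} F={f_measure(precision, recall):.2f}")
```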

Korean/English ARL MT System: New Parser Evaluation Treebank trained – Anoop Sarkar: 

Dependency evaluation: 75.7% on test data, 97.58% on training data
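
The slide does not spell out how the dependency score is computed. A common choice, assumed here, is unlabeled attachment accuracy: the fraction of words whose predicted head matches the gold-standard head. The function name and the toy head indices below are hypothetical.

```python
def dependency_accuracy(gold_heads, predicted_heads):
    """Fraction of words whose predicted head index equals the gold head index."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted analyses must cover the same words")
    if not gold_heads:
        return 0.0
    matches = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return matches / len(gold_heads)


# Toy four-word sentence; each entry is the index of the word's head, 0 = root.
gold = [2, 0, 2, 3]
pred = [2, 0, 2, 2]  # the last word is attached to the wrong head

print(f"{dependency_accuracy(gold, pred):.1%}")
```

A large gap between training and test scores, as reported on the slide, is the usual sign that a parser fits its training treebank much more closely than it generalizes to unseen sentences.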

Statistical Approaches to MT (Dan Gildea, Yuan Ding, Owen Rambow): 

Tree-based alignment: use one or both sets of trees from parallel treebanks to constrain alignments, compare with unstructured alignments (IBM models).
Word-sense disambiguation: apply maximum entropy model of word sense disambiguation to translation selection.
Monolingual corpora: translation selection based on dependency statistics from monolingual corpora.
Statistical generation: PropBank as underlying representation for statistical generation (JHU summer workshop).