Presentation Transcript

Multilingual PennTools that capture parses and predicate-argument structures, and their use in Applications
Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Fernando Pereira
University of Pennsylvania
March 11, 2002
TIDES SITE VISIT

Outline
- Overview
  - Objectives, resource development, applications
- Supervised Training of Individual Components
  - parsers
  - semantic taggers
- Training with labeled and unlabeled data
  - co-training
  - active learning (annotation tools)

Objectives
- Resources ($200K)
  - Chinese TreeBank II
  - Parallel Korean/English TreeBanks
  - PropBank
  - Multilingual Annotation Tool (Tom Morton, Nianwen Xue, Jeremy Lacivita)
  - NYU, MITRE, LDC

Objectives (cont)
- PennTools ($300K)
  - Morphological Analyzers (at LDC)
  - Major decrease in parser development time and parser running time (Dan Bikel, Carlos Prolo, Anoop Sarkar)
  - Automatic Predicate Argument Tagging (Dan Gildea)
  - Word Sense Disambiguation, English & Chinese (Hoa Dang)

Chinese TreeBank II
Fu-dong Chiou, Nianwen Xue
- Cost of CTB I, 100K words: $270K
- Additional 40K (20K, 20K): speedup given automatic parses? Doubled
- Compare HK, Sinorama, People’s Daily
- 2002: 360K words, $100K
- Chiang’s parser doubles annotation speed
- 96K words bracketed as of March 8, 2002
- 110K Xinhua news, 200K other newswire, 50K DLI corpus
- Release of original 100K + 150K planned for June

English Translation: CTB I (TIDES)
- Beijing E-C Translation LTD
- 12-week estimate, actual 15 weeks, Nov
- 100K words, around $10K (.06 per char)
- 3rd pass for error correction taking longer than expected
- 40K/100K done

Chinese PropBank (DOD)
- Proposal stage, 2 yrs, 275K a year
- Year One (just got funded)
  - Develop lexicon guidelines, 2600 verbs
  - Tag 100K CTB
- Year Two
  - Extend guidelines, up to 5000 or 6000 verbs
  - Tag additional 400K CTB II
- Spinoff: Chinese lexicon

Richer CTB Annotations, TIDES ($25K)
- Coreference Tagging (Susan Converse)
  - Draft guidelines
  - 100K words tagged
- Sense tagging (Hoa Dang)

Korean/English Parallel TreeBank
Chunghye Han, Narae Han, Allen Lee (CoGenTex/Penn/Systran: ARL MT Project)
- Defense Language Institute data
- 50K word corpus of military messages
- Same corpus available in Chinese
- Guidelines for POS tagging, bracketing: http://www.cis.upenn.edu/~xtag/koreantag/index.html
- Companion Transfer Lexicon, 4000 entries
- READY TO RELEASE

English PropBank
Paul Kingsbury, Scott Cotton
- 1M words of Treebank
- New semantic augmentations
  - Predicate-argument relations for verbs, label arguments (arg0, arg1, arg2)
- First subtask: 300K word financial subcorpus
- Spin-off: English lexical resource, 3500+ verbs

English PropBank: Current Status
- Frames files
  - 787 verb lemmas (includes phrasal variants: 932)
  - 363/ VerbNet semi-automatic expansions (subtask/PB)
- First subtask: 300K financial subcorpus
  - 22,595 unique predicates annotated out of 29K (80%)
  - 6K+ remaining (7 weeks, 2000@week, first pass)
  - 1040 verb lemmas out of 1700+ (59%)
  - 700 remaining (3.5 months, 200@month)
- PropBank (including some of Brown?)
  - 34,437 predicates annotated out of 118K (29%)
  - 1040 verb lemmas out of 3500 (29%)

Summary of Resources
Columns: Project, 2002 Funds, Status, Completion Date

Objectives (cont)
- Applications: ($200K) + ($150K)
  - Relation Extraction and MT
  - Initial experiments with MUC 7
  - Korean/English MT system wrap-up
  - Plans for investigating statistical MT approaches

Information Extraction with TAG
(Libin Shen, Anoop Sarkar, and Jinying Chen)
- Template Relation (TR) task of the 7th Message Understanding Conference
- F-measure of 78% on sentence-level relations, which is comparable to the best results in MUC-7
- Convert IE into a discriminative problem
- Syntactic analysis with Supertagger [Joshi 1994] and Lightweight Dependency Analyzer [Srinivas 1997]
- Machine learning with Boosting algorithm [Schapire 2000]

Korean/English ARL MT System: New Parser Evaluation
Treebank trained (Anoop Sarkar)
- Dependency Evaluation: 75.7% on test, 97.58% on training

Statistical Approaches to MT
(Dan Gildea, Yuan Ding, Owen Rambow)
- Tree-based alignment: use one or both sets of trees from parallel treebanks to constrain alignments, compare with unstructured alignments (IBM models).
- Word-sense disambiguation: apply maximum entropy model of word sense disambiguation to translation selection.
- Monolingual corpora: translation selection based on dependency statistics from monolingual corpora.
- Statistical generation: PropBank as underlying representation for statistical generation (JHU summer workshop).
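
The English PropBank slides describe labeling each verb's arguments as arg0, arg1, arg2. A minimal sketch of how one such annotation might be held in code follows. The class name `Predicate`, its `label` method, and the example sentence are hypothetical illustrations, not PropBank's actual file format or tooling.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Predicate:
    """One annotated verb instance with its numbered arguments (hypothetical layout)."""
    lemma: str
    args: Dict[str, str] = field(default_factory=dict)

    def label(self, role: str, span: str) -> None:
        # Roles follow the arg0/arg1/arg2 naming used on the slide.
        if not (role.startswith("arg") and role[3:].isdigit()):
            raise ValueError(f"unexpected role label: {role}")
        self.args[role] = span


# Hypothetical sentence, not taken from the corpus:
# "The company sold the unit to investors."
sold = Predicate(lemma="sell")
sold.label("arg0", "The company")
sold.label("arg1", "the unit")
sold.label("arg2", "to investors")

for role in sorted(sold.args):
    print(role, "->", sold.args[role])
```

Keeping the role labels as plain strings mirrors the numbered-argument scheme on the slide; what each number means for a given verb would come from that verb's frames file.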