seattle04fst xle
Nathaniel
Added: September 28, 2007
License: All Rights Reserved

Presentation Transcript

Integrating Finite-state Morphologies with Deep LFG Grammars:
  Tracy Holloway King

FST and deep grammars:
  Finite state tokenizers and morphologies can be integrated into deep processing systems
  Integrated tokenizers
    eliminate the need for preprocessing
    allow the grammar writer more control over the input
  Morphologies
    eliminate the need to list (multiple) surface forms in the lexicon
    eliminate the need for lexical entries for words with predictable subcategorization frames

Talk outline:
  Basic integrated system
  Integrating morphology FSTs
  Interaction of tokenization and morphology

Basic Architecture:

Example steps through the system:
  Input string: Boys appeared.
  Tokenizing: boys TB appeared TB . TB
  Morphology:
    boy +Noun +Pl
    appear +Verb +PastBoth +123SP
    . +Punct
  C-structure/F-structure: next slides

C-structure tree:

F-structure AVM:

The wider system: XLE:
  Handwritten grammars for various languages
    Substantial for English, German, Japanese, Norwegian
    Also: Arabic, Chinese, Urdu, Korean, Welsh, Malagasy, Turkish
  Robustness mechanisms
    Fragment grammar rules
    Morphological guessers
    Skimming when resource limits approached
  Ambiguity management (packing)
    Compute all analyses (no “aggressive pruning”)
    Propagate packed ambiguities across processing modules
  Stochastic disambiguation
    MaxEnt models to select from packed (f-)structures
  Other processing available: generation, semantics, transfer/rewriting
  Comparisons to other systems/tasks
    Parsing WSJ (Riezler et al, ACL 2002)
    Comparison to Collins model 3 (Riezler et al, NAACL 2004)

FST Morphologies:
  Associate surface form with
    a lemma (stem/canonical form)
    a set of tags
  Process is non-deterministic
    can have many analyses for one surface form
    grammar has to be able to deal with multiple analyses (morphological ambiguity)
  Issue: can the grammar control rampant morphological ambiguity?
    Arabic vowelless representations

Example Morphology Output:
  turnips <=> turnip +Noun +Pl
  Mary <=> Mary +Prop +Giv +Fem +Sg
  falls <=> fall +Noun +Pl
            fall +Verb +Pres +3sg
  broken <=> break +Verb +PastPerf +123SP
             broken +Verb +PastPart } +Adj
  New York <=> New York +Prop +Place +USAState +Prefer
               New York +Prop +Place +City +Prefer
               [ plus analyses of New and York ]

Morphologies and lexicons:
  Without a morphology, need to list all surface forms in the lexicon
    bad for English
    horrible for languages like Finnish and Arabic
  With a morphology, one entry for the stem form
    go V XLE @(V-INTRANS go).
    for: go, goes, going, gone, went
  With additional integration, words with predictable subcategorization frames need no entry

Basic idea:
  Run surface forms of words through the morphology to produce stems and tags
    MorphConfig file specifies which morphologies the grammar uses
  Look up stems and tags in the lexicon
  Sublexical phrase structure rules build syntactic nodes covering the stems and tags
  Standard grammar rules build larger phrases

Lexical entries for tags:
  boys ==> boy +Noun +Pl
  boy N XLE @(NOUN boy).
  +Noun N_SFX XLE @(PERS 3) @(EXISTS NTYPE).
  +Pl NNUM_SFX XLE @(NUM pl).

Sublexical rules for tags:
  Build up lexical nodes from stem plus tags
  Rules are identical to standard phrase structure rules
    Except display can hide the sublexical information
  N --> N_BASE N_SFX_BASE NNUM_SFX_BASE.

Resulting structures:

Lexical entries:
  Stems with unpredictable subcategorization frames need entries
    verbs
    adjectives with obliques (proud of her)
    nouns with that complements (the idea that he laughed)
  Most lexical items have predictable frames determined by part of speech
    common and proper nouns
    adjectives
    adverbs
    numbers

-unknown lexical entry:
  Match any stem to the entry
  Provide desired functional information
    %stem will pass in the appropriate surface form (i.e., the lemma/stem)
  Constrain application via morphological tag possibilities
  -unknown N XLE @(NOUN %stem);
           A XLE @(ADJ %stem);
           ADV XLE @(ADVERB %stem).

-unknown example:
  The box boxes.
  Lexicon entries:
    box V XLE @(V-INTRANS %stem).
    -unknown N XLE @(NOUN %stem); ADV…; A...
  Morphology output:
    box ==> box +Noun +Sg | +Verb +Non3Sg
    boxes ==> box +Noun +Pl | +Verb +3Sg
  Build up four effective lexical entries
    1 noun, 1 verb, 1 adverb, 1 adjective
    adverb and adjective fail sublexically
    noun and verb relevant for the sentence

Inflectional morphology summary:
  Integrating FST morphologies significantly reduces lexicon development work
  Verbs and other unpredictable items are listed only under their stem form
  Predictable items such as nouns are processed via -unknown and never listed in the lexicon

Guessers:
  Even large industrial FST morphologies are not complete
  Novel words usually have regular morphology
  Build an FST guesser based on this
    Words with capital letters are proper nouns (Saakashvili)
    Words ending in -ed are past tense verbs or deverbal adjectives
  Guessed words will go through -unknown
    no difference from standard morphological output
    can add +Guessed tag for further control

Guessers: controlling application:
  Apply guesser in the grammar only if there is no form in the regular morphology
    don't guess unless you have to
  Control this with the MorphConfig
    use multiple FST morphologies
    stop looking once an analysis is found

Sample MorphConfig:
  STANDARD ENGLISH MORPHOLOGY (1.0)
  TOKENIZE:
  english.tok.parse.fst
  ANALYZE USEFIRST:
  english.infl.fst        (try regular morphology first)
  english.guesser.fst     (if fail, guess)
  MULTIWORD:
  english.standard.mwe.fst

Multiple morphology FSTs:
  In addition to the regular morphology and guesser, can have other morphologies
    morphology for technical terms, part numbers, etc.
  These can be applied in sequence or in parallel (cascaded or unioned)
  ANALYZE USEALL:
  english.infl.fst            (try regular morphology)
  english.eureka.parts.fst    (and also part names)

Morphology vs. surface form:
  System always allows surface form through
  Lexicon can match this form
    for multiword expressions
    override/supplement morphological analysis
  Example: or as adverb (Or you could leave now.)
    or ADV * @(ADVERB or);
       CONJ XLE @(CONJ or).

Tokenizers:
  Tokenizers break strings (sentences) into tokens (words)
  Need to (for English):
    break off punctuation
      Mary laughs. ==> Mary TB laughs TB . TB
    lower case certain letters
      The dog ==> the TB dog

Tokenization and morphology:
  Linguistic analysis may govern tokenization
  Are English contracted auxiliaries affixes or clitics?
    affixes: John'll ==> no tokenization
      John +Noun +Proper +Fut
    clitics: John'll ==> John TB 'll TB
      John +Noun +Proper
      will +Fut
  Arabic determiners and conjunctions
    both written with adjacent words
    determiner as an affix giving +Def (Albint the-girl)
    conjunction tokenized separately (wakutub and-books)

Non-deterministic tokenizers: Punctuation:
  Cannot just break off punctuation and insert a TB
  Comma haplology
    Find the dog, a poodle. ==> find TB the TB dog TB , TB a TB poodle TB , TB . TB
  Period haplology
    Go to Palm Dr. ==> go TB to TB Palm TB Dr. TB . TB
  Resulting tokenizer is non-deterministic
    System must be able to handle multiple inputs

Capitalization:
  Initial capitals are optionally lower cased
    The boy left. ==> the boy left.
    Mary left. ==> Mary left.
  Example for both types of non-determinism
    Bush saw them. ==> { Bush | bush } TB saw TB them TB [, TB]* . TB
  Tokenization rules vary from language to language and by choice of linguistic analysis

Conclusions:
  System architecture integrates FST techniques with deep LFG parsing
    tokenizers
    morphologies and guessers
  Allows generalizations to be factored out
    properties of words
    properties of strings
  Allows use of existing large-scale lexical resources
    avoids redundant specification
  System is actively in use in ParGram grammars

Shallow Markup:
  Preprocessing with shallow markup can reduce ambiguity and speed processing
    Tokenizer must be able to process the markup
  Part of speech tagging:
    I/PRP_ saw/VBD_ her/PRP_ duck/VB_.
  Named entities
    <person>General Mills</person> bought it.

POS tagging:
  POS tags are not relevant for tokenizing, but the tokenizer must skip them
    She walks/VBZ_. should be treated like She walks.
  The morphology must only insert compatible tags
    A mapping table states allowable combinations
      /VBZ_ +Verb +3sg
      /NN_ +Noun +Sg
    These are encoded into a filtering FST
    Only compatible tags are passed to the grammar

POS tagging example:
  I saw her duck
    duck +Noun +Sg
    duck +Verb +Pres +Non3sg
    both possibilities passed to the grammar
  I saw her duck/VB_.
    only +Verb +Pres +Non3sg possibility is compatible with /VB_ POS tag
    only this possibility is passed to the grammar

Named Entities:
  Named entities appear in text as XML markup
    <person>General Mills</person> bought it.
  Tokenizer creates special tag for these
    puts literal spaces instead of TBs
    allows version without markup for fallback
    General Mills TB +NamedEntity TB
    General TB +Title TB Mills +Proper TB
  Lexical entry added for +NamedEntity
  Sublexical N and NAME rules allow the tag

Sample Named Entity output:
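
The difference between ANALYZE USEFIRST and ANALYZE USEALL in the MorphConfig slides can be illustrated with a small sketch. This is not XLE's implementation: plain Python functions stand in for the compiled FSTs, the names `regular`, `guesser`, `use_first`, and `use_all` are hypothetical, and the guessed tag strings (for example `+Past`) are made up for illustration. Only the two guessing heuristics and the `+Guessed` tag come from the slides.

```python
from typing import Callable, List

# A "morphology" here is a function from a surface form to a list of
# analyses. Real systems use compiled FSTs; this is only a stand-in.
Morphology = Callable[[str], List[str]]

# Toy replacement for english.infl.fst, using analyses quoted in the slides.
REGULAR = {
    "boys": ["boy +Noun +Pl"],
    "falls": ["fall +Noun +Pl", "fall +Verb +Pres +3sg"],
}


def regular(surface: str) -> List[str]:
    """Look the surface form up in the toy regular morphology."""
    return list(REGULAR.get(surface, []))


def guesser(surface: str) -> List[str]:
    """Toy guesser using the two heuristics from the Guessers slide."""
    analyses = []
    if surface[:1].isupper():
        # Words with capital letters are guessed to be proper nouns.
        analyses.append(f"{surface} +Prop +Guessed")
    if surface.endswith("ed") and len(surface) > 2:
        # Crude stemming: just strip "ed". Tag names are illustrative.
        analyses.append(f"{surface[:-2]} +Verb +Past +Guessed")
        analyses.append(f"{surface} +Adj +Guessed")
    return analyses


def use_first(surface: str, morphologies: List[Morphology]) -> List[str]:
    """USEFIRST: stop at the first morphology that returns an analysis."""
    for morphology in morphologies:
        analyses = morphology(surface)
        if analyses:
            return analyses
    return []


def use_all(surface: str, morphologies: List[Morphology]) -> List[str]:
    """USEALL: collect the analyses of every morphology (a union)."""
    collected: List[str] = []
    for morphology in morphologies:
        collected.extend(morphology(surface))
    return collected


if __name__ == "__main__":
    for word in ["falls", "Saakashvili", "walked"]:
        print(word, "=>", use_first(word, [regular, guesser]))
```

With `use_first`, the guesser is consulted only for words such as Saakashvili that the regular morphology does not cover, which is the "don't guess unless you have to" behaviour. With `use_all`, every listed morphology contributes analyses, which matches the case where part names should be analysed in addition to the regular morphology.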
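
The POS-tag filtering described in the POS tagging slides can be sketched in a few lines. The slides say the mapping table is compiled into a filtering FST; the sketch below swaps in a plain dictionary and set comparison instead, so it shows the filtering logic only. The names `POS_TABLE`, `split_pos`, and `filter_analyses` are hypothetical, the `/VB_` row is an assumption (the slides only state that `+Verb +Pres +Non3sg` is compatible with `/VB_`), and passing analyses through unfiltered for an unlisted tag is a design choice of this sketch.

```python
from typing import List, Optional, Tuple

# Mapping from POS markup to the morphological tags an analysis must carry.
# The /VBZ_ and /NN_ rows come from the slides; /VB_ is an assumed entry.
POS_TABLE = {
    "/VBZ_": {"+Verb", "+3sg"},
    "/NN_": {"+Noun", "+Sg"},
    "/VB_": {"+Verb"},
}


def split_pos(token: str) -> Tuple[str, Optional[str]]:
    """Split 'duck/VB_' into ('duck', '/VB_'); untagged tokens get None."""
    if token.endswith("_") and "/" in token:
        word, _, tag = token.rpartition("/")
        return word, "/" + tag
    return token, None


def filter_analyses(analyses: List[str], pos: Optional[str]) -> List[str]:
    """Keep only the analyses whose tags include those required by pos."""
    if pos is None or pos not in POS_TABLE:
        # No markup, or markup not in the table: pass everything through.
        return list(analyses)
    required = POS_TABLE[pos]
    kept = []
    for analysis in analyses:
        # Tags are the '+'-prefixed items; the rest is the stem.
        tags = {item for item in analysis.split() if item.startswith("+")}
        if required <= tags:
            kept.append(analysis)
    return kept


if __name__ == "__main__":
    duck = ["duck +Noun +Sg", "duck +Verb +Pres +Non3sg"]
    for token in ["duck", "duck/VB_"]:
        word, pos = split_pos(token)
        print(token, "=>", filter_analyses(duck, pos))
```

For the untagged token both analyses of duck are kept, and for `duck/VB_` only the verb analysis survives, mirroring the two cases on the POS tagging example slide. Sentence-final punctuation after the markup (as in `duck/VB_.`) is left to the tokenizer and not handled here.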