Vorl 03 Annotation cont Gulkund
Added: November 15, 2007

Corpus Annotation II
Martin Volk
Universität Zürich
Eurospider Information Technology AG

Overview
- Clean-Up and Text Structure Recognition
- Sentence Boundary Recognition
- Proper Name Recognition and Classification
- Part-of-Speech Tagging
- Tagging Correction and Sentence Boundary Correction
- Lemmatisation and Lemma Filtering
- NP/PP Chunk Recognition
- Recognition of Local and Temporal PPs
- Clause Boundary Recognition

Part-of-Speech Tagging
Tagging was done with the Tree-Tagger (Helmut Schmid, IMS-Stuttgart).
The Tree-Tagger
- is a statistical tagger,
- uses the STTS tag set (50 PoS tags and 3 tags for punctuation),
- assigns one tag to each word,
- preserves pre-set tags.

Tagging Correction
Correction of observed tagger problems:
- Sentence-initial adjectives are often tagged as noun (NN):
  '...liche[nr]' or '...ische[nr]' --> ADJA
- Verb group patterns:
  the verb in front of 'worden' must be a perfect participle:
  VVXXX + 'worden' --> VVPP
  if verb + modal verb, then the verb must be an infinitive:
  VVXXX + VMYYY --> VVINF
- Unknown prepositions (a, via, innert, ennet)

Correction of sentence boundaries
E.g.: a suspected ordinal number followed by a capitalized determiner, pronoun, preposition, or adverb --> insert sentence boundary.
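The three tag-rewriting rules from the Tagging Correction slide can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function name `correct_tags`, the (word, tag) pair representation, and the literal reading of '[nr]' as "ends in n or r" are assumptions. VVXXX and VMYYY are read as "any STTS tag starting with VV" and "any STTS tag starting with VM".

```python
import re

# Assumption: '...liche[nr]' / '...ische[nr]' is read literally as a regex suffix.
ADJ_SUFFIX = re.compile(r"(liche|ische)[nr]$")


def correct_tags(sentence):
    """Return a corrected copy of one sentence given as (word, tag) pairs."""
    fixed = list(sentence)

    # Rule 1: sentence-initial NN with an adjectival suffix becomes ADJA.
    if fixed:
        word, tag = fixed[0]
        if tag == "NN" and ADJ_SUFFIX.search(word):
            fixed[0] = (word, "ADJA")

    # Rules 2 and 3: look at each full verb and the token that follows it.
    for i in range(len(fixed) - 1):
        word, tag = fixed[i]
        next_word, next_tag = fixed[i + 1]
        if not tag.startswith("VV"):
            continue
        if next_word == "worden":
            fixed[i] = (word, "VVPP")    # VVXXX + 'worden' --> VVPP
        elif next_tag.startswith("VM"):
            fixed[i] = (word, "VVINF")   # VVXXX + VMYYY --> VVINF
    return fixed


example = [("Technischen", "NN"), ("Support", "NN"), ("hat", "VAFIN"),
           ("er", "PPER"), ("liefern", "VVFIN"), ("müssen", "VMINF")]
corrected = correct_tags(example)
```

In this made-up example the first token is retagged as ADJA and 'liefern' as VVINF; all other tags are left alone.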
Open question: Could all sentence boundary detection be done after PoS tagging?

Lemmatisation
Lemmatisation was done with Gertwol (by Lingsoft Oy, Helsinki) for adjectives, nouns, prepositions, and verbs.
Gertwol
- is a two-level morphology analyzer for German,
- is lexicon-based,
- returns all possible interpretations for each word form,
- segments compound words dynamically,
- analyzes hyphenated compounds only if all parts are known (e.g. Software-Aktien but not Informix-Aktien)
  --> feed last element to Gertwol.

Lemma Filtering (a project by Julian Käser)
After lemmatisation: merging of Gertwol and tagger information.
- Case 1: The lemma was prespecified during proper name recognition (IBMs --> IBM).
- Case 2: Gertwol does not find a lemma --> insert the word form as lemma (with '?').

Lemma Filtering
- Case 3: Gertwol finds exactly one lemma for the given PoS --> insert the lemma.
- Case 4: Gertwol finds multiple lemmas for the given PoS --> disambiguate and insert the best lemma.
Disambiguation weights the segmentation symbols:
- Strong compound segment boundary: 4 points
- Weak compound segment boundary: 2 points
- Derivational segment boundary: 1 point
The lemma with the lowest score wins!
Example: Abteilungen --> Abt~ei#lunge (5 points) vs. Ab|teil~ung (3 points)

Lemma Filtering
- Case 5: Gertwol finds a lemma but not for the given PoS --> this indicates a tagger error (Gertwol is more reliable than the tagger).
- Case 5.1: Gertwol finds a lemma for exactly one PoS --> insert the lemma and exchange the PoS tag.
- Case 5.2: Gertwol finds lemmas for more than one PoS --> find closest PoS tag, or guess.

Lemma Filtering
0.74% of all PoS tags were exchanged (2% of Adj, N, V tags).
In other words: about 14,000 tags per annual volume of the ComputerZeitung were exchanged.
85% are cases with exactly one Gertwol tag, 15% are guesses.

Limitations of Gertwol
Compounds are lemmatized only if all parts are known.
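Returning briefly to Case 4 of lemma filtering: the point-based disambiguation can be sketched as follows. The weights are the ones given on the slide. The mapping of symbols to boundary types ('#' strong, '|' weak, '~' derivational) is inferred from the Abteilungen example and is an assumption, as are the function names.

```python
# Weights from the slide; symbol assignment inferred from the example scores.
BOUNDARY_WEIGHTS = {
    "#": 4,  # strong compound segment boundary (assumed symbol)
    "|": 2,  # weak compound segment boundary (assumed symbol)
    "~": 1,  # derivational segment boundary (assumed symbol)
}


def segmentation_score(analysis):
    """Sum the weights of all segmentation symbols in one analysis string."""
    return sum(BOUNDARY_WEIGHTS.get(ch, 0) for ch in analysis)


def best_lemma(analyses):
    """Pick the analysis with the lowest score (first one wins on a tie)."""
    return min(analyses, key=segmentation_score)


candidates = ["Abt~ei#lunge", "Ab|teil~ung"]
scores = {c: segmentation_score(c) for c in candidates}
winner = best_lemma(candidates)
```

With these weights the two analyses score 5 and 3 points, as on the slide, so 'Ab|teil~ung' is selected. How the original project broke ties is not stated; taking the first candidate is just a placeholder.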
Idea: Use the corpus for lemmatizing the remaining compounds.
Example: kaputtreden, Waferfabriken
Solution: If the first part occurs standing alone AND the second part occurs standing alone with a lemma, then segment and lemmatize, and store the first part as lemma (of itself)!

NP/PP Chunk Recognition (a project by Dominik A. Merz)
Pattern matcher with patterns over PoS tags.
Example patterns:
  ADV ADJA --> AP
  APPR ART ADJA NN --> PP
  APPR ART AP NN --> PP
Note: The morphological information provided by Gertwol (e.g. grammatical case, number, gender) was not used!

Representation Format
The NEGRA export format
- is a line-based format,
- works with pointers for tree structure,
- comprises node labels (constituents) and edge labels (grammatical functions),
- has no provision for semantic information.
Therefore: We use the comment field.

Recognition of temporal PPs (a project by Stefan Höfler)
A second step towards semantic annotation.
Starting point:
Prepositions (3) that always introduce a temporal PP: binnen, während, zeit
Prepositions (30) that may introduce a temporal PP: ab, an, auf, bis, ... + additional evidence
Additional evidence:
Temporal adverb in PP: heute, niemals, wann, ...
Temporal noun in PP: Minute, Stunde, Jahr, Anfang, ...

Recognition of temporal PPs
Evaluation corpus: 990 sentences with 263 manually checked temporal PPs
Result: Precision: 81%, Recall: 76%

Recognition of local PPs
Starting point:
Prepositions (3) that always introduce a local PP: fern, oberhalb, südlich von
Prepositions (30) that may introduce a local PP: ab, auf, bei, ... + additional evidence
Additional evidence:
Local adverb in PP: dort, hier, oben, rechts, ...
Local noun in PP: Strasse, Quartier, Land, Norden, <GEO>, ...

Recognition of temporal and local PPs

A Word on Recall and Precision
The focus varies with the application!
For my project: Precision is more important than Recall!
Idea: If I annotate something, then I want to be 'sure' that it is correct.

Clause Boundary Recognition (a project by Gaudenz Lügstenmann)
Definition: A clause is a unit consisting of a full verb together with its (non-clausal) complements and adjuncts.
A sentence consists of one or more clauses, and a clause consists of one or more phrases.
Clauses are important for determining the cooccurrence of verbs and PPs (among other things).

Clause Boundary Recognition
Exceptions from the definition: clauses with more than one verb
- Coordinated verbs (e.g. Daten können überführt und verarbeitet werden)
- Perception verb + infinitive verb (= AcI) (e.g. die den Markt wachsen sehen)
- 'lassen' + infinitive verb (e.g. lässt die Handbücher übertragen)

Clause Boundary Recognition
Exceptions from the definition: clauses without a verb
- Elliptical clauses (e.g. in coordinated structures)
  Example: Er beobachtet den Markt und seine Mitarbeiter die Konkurrenz.

Clause Boundary Recognition
The CB recognizer is realized as a pattern matcher over PoS tags (34 patterns).
Example:
  Comma + Relative Pronoun
  Finite verb ... + Conjunction + ... Finite Verb
Most difficult: CB without overt punctuation symbol or trigger word
Example: Simple Budgetreduzierungen in der IT in den Vordergrund zu stellen <CB> ist der falsche Ansatz.

Clause Boundary Recognition
Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
Results (counting all CBs): Precision: 95.8%, Recall: 84.9%
Results (counting only intra-sentential CBs): Precision: 90.5%, Recall: 61.1%

Using a PoS Tagger for Clause Boundary Recognition
A CB recognizer can be seen as a disambiguator over commas and CB trigger tokens (if we disregard the CBs without trigger). A tagger may serve the same purpose.
Example:
  ... schrieb der Präsident,<Co> Michael Eisner,<Co> im Jahresbericht.
  ... schrieb der Präsident,<CB> der Michael Eisner kannte,<CB> im Jahresbericht.

Using a PoS Tagger for Clause Boundary Recognition
Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
Training the Brill-Tagger on 75% and applying it to the remaining 25%.
Results: 93% Precision, 91% Recall
Caution: very small evaluation corpus!

Clause Boundary Recognition vs. Clause Recognition
CB recognition marks only the boundaries. It does not identify discontinuous parts of clauses.
Example:
  Nur ein Projekt der Volkswagen AG, <CB> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, <CB> stößt in ähnliche Dimensionen vor.
  <C> Nur ein Projekt der Volkswagen AG, <C> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, </C> stößt in ähnliche Dimensionen vor. </C>
Clause Recognition should be done with a recursive parsing approach because of clause nesting.
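To make the nesting argument concrete, here is a small recursive-descent reader for the <C> ... </C> bracket notation used in the example above. It does not recognize clauses in raw text; it only shows how a recursive procedure recovers the discontinuous matrix clause once the brackets are there. The function names and the whitespace tokenization are assumptions for illustration, and the input is assumed to contain no tags other than <C> and </C>.

```python
import re

# Tokens are either a clause tag or a run of non-space, non-'<' characters.
TOKEN = re.compile(r"</?C>|[^<\s]+")


def parse_clauses(text):
    """Parse '<C> ... </C>' bracketing into nested lists of words."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_clause():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "<C>":
            raise ValueError("expected <C>")
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != "</C>":
            if tokens[pos] == "<C>":
                children.append(parse_clause())  # recurse into nested clause
            else:
                children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing </C>")
        pos += 1  # consume the closing tag
        return children

    tree = parse_clause()
    if pos != len(tokens):
        raise ValueError("trailing material after outermost clause")
    return tree


def own_words(clause):
    """Words belonging directly to this clause, skipping nested clauses."""
    return " ".join(item for item in clause if isinstance(item, str))


sentence = ("<C> Nur ein Projekt der Volkswagen AG, <C> die ihre europäischen "
            "Vertragswerkstätten per Satellit vernetzen will, </C> "
            "stößt in ähnliche Dimensionen vor. </C>")
tree = parse_clauses(sentence)
matrix_clause = own_words(tree)
relative_clause = own_words(next(c for c in tree if isinstance(c, list)))
```

Here `matrix_clause` joins the two parts of the outer clause that the relative clause separates, which is exactly the information that flat <CB> markers do not provide.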