Lect 05 Annotation cont
Added: February 05, 2008
Category: Education
License: All Rights Reserved

Presentation Transcript

Corpus Annotation II:
Martin Volk, Stockholm University

Overview:
- Clean-Up and Text Structure Recognition
- Sentence Boundary Recognition
- Proper Name Recognition and Classification
- Part-of-Speech Tagging
- Tagging Correction and Sentence Boundary Correction
- Lemmatisation and Lemma Filtering
- NP/PP Chunk Recognition
- Recognition of Local and Temporal PPs
- Clause Boundary Recognition

Slide3 (pipeline diagram; components and resources as labelled on the slide):
- Tokenizer and Sentence Boundary Recognizer: Input Docs, Abbreviations
- Proper Name Recognizer (Persons, Locations, …): First Name list, Location list, …
- Part-of-Speech Tagger and Lemmatiser: Training Corpus SUC
- Swetwol Morph. Analyser for Lemmas, Tags, Compounds: Morph. Rules, Lexicon

Part-of-Speech Tagging for German:
- Was done with the Tree-Tagger (from Helmut Schmid, IMS Stuttgart).
- The Tree-Tagger
  - is a statistical tagger.
  - uses the STTS tag set (50 PoS tags and 3 tags for punctuation).
  - assigns 1 tag to each word form.
  - preserves pre-set tags.

A statistical Part-of-Speech tagger:
- learns tagging "rules" from a manually Part-of-Speech annotated corpus (= training corpus).
    Vid/PR kliniken/NN i/PR Huddinge/PM övervakas/VB nu/AB Mijailovic/PM ständigt/AB av/PR två/RG vårdare/NN.
- applies the learned "rules" to new sentences.
- Problems:
  - words that were not in the training corpus.
  - words with many possible tags.

Two Swedish example word forms with multiple PoS tags in SUC:
- av
  - adverb (AB): 48 times
  - particle (PL): 407 times
  - proper name (PM): 4 times
  - preposition (PR): 14'580 times
  - foreign word (UO): 2 times
- lagar (EN: laws or to make/repair)
  - noun (NN): 43 times
  - verb (VB): 5 times

Part-of-Speech Tagging for Swedish:
- is done with the TreeTagger,
- which is trained on SUC (Stockholm-Umeå-Corpus; 1 million words)
- with the SUC tag set (slightly enlarged): originally 22 tags plus VBFIN, VBINF, VBSUP, VBIMP.
- has an estimated error rate of 4% (i.e. every 25th word is incorrectly tagged!)

Part-of-Speech Tagging with Lemmatisation:
- The TreeTagger also assigns lemmas that it has learned from the training corpus.
- Rule: If word form W in the corpus has lemma L1 with tag T1 and lemma L2 with tag T2, then the TreeTagger will assign the lemma corresponding to the chosen tag.
- Example (Swedish):
  - låg --> ligger (EN: to lie) and VVFIN (finite full verb)
  - låg --> låg (EN: low) and JJ (adjective)
- A nice example of PoS tagging as word sense disambiguation.

PoS Tagging with Lemmatisation:
But it is possible that word form W has more than one lemma with tag T1 in the training corpus.
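
The lemma rule, including the case where one tag carries several lemmas, can be sketched as a lookup keyed on (word form, tag). This is only an illustration and not the TreeTagger's actual implementation: the triples are the låg and kön examples from these slides, and the names `TRAINING_TRIPLES`, `build_lemma_table` and `lemmas_for` are assumptions made for the sketch.

```python
from collections import defaultdict

# Toy (word form, PoS tag, lemma) triples taken from the slide examples.
# A real table would be collected from the training corpus (assumption).
TRAINING_TRIPLES = [
    ("låg", "VVFIN", "ligger"),
    ("låg", "JJ", "låg"),
    ("kön", "NN", "kö"),
    ("kön", "NN", "kön"),
]


def build_lemma_table(triples):
    """Map each (word form, tag) pair to the sorted lemmas seen with it."""
    table = defaultdict(set)
    for form, tag, lemma in triples:
        table[(form, tag)].add(lemma)
    return {key: sorted(lemmas) for key, lemmas in table.items()}


def lemmas_for(table, form, tag):
    """Return every lemma recorded for the chosen tag (no disambiguation)."""
    return table.get((form, tag), [])


table = build_lemma_table(TRAINING_TRIPLES)
print(lemmas_for(table, "låg", "JJ"))   # one lemma: the tag disambiguates
print(lemmas_for(table, "kön", "NN"))   # two lemmas: the tag does not help
```

Once the tagger has chosen a tag, the lookup is unambiguous for låg but still returns two candidates for kön, which is the situation this slide describes.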
Example (Swedish): kön
- kön --> kö (EN: queue), noun
- kön --> kön (EN: gender, sex), noun
The TreeTagger will simply assign all lemmas to W that go with T1 (no lemma disambiguation).

Tagging Correction in German:
Correction of observed tagger problems:
- Sentence-initial adjectives are often tagged as noun (NN):
    '...liche[nr]' or '...ische[nr]' --> ADJA
- Verb group patterns:
  - the verb in front of 'worden' must be a perfect participle: VVXXX + 'worden' --> VVPP
  - if verb + modal verb, then the verb must be an infinitive: VVXXX + VMYYY --> VVINF
- Unknown prepositions (a, via, innert, ennet)

Correction of sentence boundaries:
- E.g.: suspected ordinal number followed by a capitalized determiner or pronoun or preposition or adverb --> insert sentence boundary.
- Open question: Could all sentence boundary detection be done after PoS tagging?

Lemmatisation for Swedish:
- is (partly) done by the TreeTagger by re-using the lemmas from SUC (Stockholm-Umeå-Corpus)
- Limits:
  - word forms that are not in SUC. In particular:
    - names --> proper name recognition
    - compounds --> Swetwol
    - neologisms, foreign expressions --> ??
  - SUC lemmas have no compound boundaries: (byskolan --> byskola), (konstindustriskolan --> konstindustriskola)
  - elliptical compounds (e.g. kostnads- och tidseffektivt) --> ?? TreeTagger ignores the hyphen.
  - upper case / lower case (e.g. Bo vs. bo) --> ?? TreeTagger treats them separately.

Morphological information such as case, number, gender etc.
- is important for correct linguistic analysis.
- could be taken from SUC based on the triple word form – PoS tag – lemma
- Examples:
  - kön – NN – kön: NEUtrum SINgular INDefinite NOMinative
  - kön – NN – kö: UTRum SINgular DEFinite NOMinative
- Limits: word forms that are not in SUC, and triples that have more than 1 set of morphological features.

Lemmatisation for Swedish:
- can be done with Swetwol (Lingsoft Oy, Helsinki) for
  - adjectives (inflection lyckligt - lyckliga, gradation söt - sötare - sötaste),
  - nouns (inflection hus – husen – huset),
  - verbs (inflection arbeta – arbetar - arbetat …).
- Swetwol
  - is a two-level morphology analyzer for Swedish
  - is lexicon-based
  - returns all possible interpretations for each word form:
      kön --> kön N NEU INDEF SG/PL NOM
      kön --> kö N UTR DEF SG NOM
  - segments compound words dynamically if all parts are known: cirkusskolan --> cirkus#skola
  - analyzes hyphenated compounds only if all parts are known:
      FN-uppdraget --> FN-uppdrag
      tPA-plantan --> ?? (although plantan --> planta)
      --> feed last element to Swetwol

Lemmatisation for German:
- can be done with Gertwol (Lingsoft Oy, Helsinki) for
  - adjectives (inflection schöne - schönes, gradation schöner - schönste),
  - nouns (inflection Haus – Hauses – Häuser – Häusern),
  - prepositions (contraction zum – zur – zu), and
  - verbs (inflection zeige – zeigst – zeigt – zeigte – zeigten …).
- Gertwol
  - is a two-level morphology analyzer for German
  - is lexicon-based
  - returns all possible interpretations for each word form
  - segments compound words dynamically
  - analyzes hyphenated compounds only if all parts are known, e.g.
      Software-Aktien but not Informix-Aktien
      --> feed last element to Gertwol

Lemma Filtering (a project by Julian Käser):
After lemmatisation: Merging of Ger/Swetwol and tagger information
- Case 1: The lemma was prespecified during proper name recognition (IBMs --> IBM)
- Case 2: Ger/Swetwol does not find a lemma --> insert the word form as lemma (mark it with '?')

Lemma Filtering:
- Case 3: Ger/Swetwol finds exactly one lemma for the given PoS --> insert the lemma
- Case 4: Ger/Swetwol finds multiple lemmas for the given PoS --> disambiguate and insert the best lemma
  - Disambiguation weights the segmentation symbols:
    - Strong compound segment boundary: 4 points
    - Weak compound segment boundary: 2 points
    - Derivational segment boundary: 1 point
  - The lemma with the lowest score wins!
  - Examples:
    - Abteilungen --> Abt~ei#lunge (5 points) vs. Ab|teil~ung (3 points)
    - rådhusklockan --> råd|hus#klocka (6 p.) vs. råd#hus#klocka (8 p.)

Lemma Filtering:
- Case 5: Ger/Swetwol finds a lemma but not for the given PoS --> this indicates a tagger error (Ger/Swetwol is more reliable than the tagger.)
  - Case 5.1: Ger/Swetwol finds a lemma for exactly one PoS --> insert the lemma and exchange the PoS tag
  - Case 5.2: Ger/Swetwol finds lemmas for more than one PoS --> find "closest" PoS tag, or guess
- Option: Check if the PoS tag in the corpus was licensed by SUC. If yes, ask the user for a decision.

Lemma Filtering for German:
- 0.74% of all PoS tags were exchanged (2% of Adjective tags, Noun tags, Verb tags).
- In other words: ~ 14'000 tags / annual volume of the ComputerZeitung were exchanged.
- 85% are cases with exactly one Gertwol tag, 15% are guesses.

Limitations of Gertwol:
Compounds are lemmatized only if all parts are known.
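
Returning briefly to Case 4 of the lemma filtering: the boundary weighting can be sketched in a few lines. The mapping of symbols to weights ('#' strong, '|' weak, '~' derivational) is an assumption inferred from the four worked examples on the slide, and the names `BOUNDARY_WEIGHTS`, `segmentation_score` and `best_lemma` are hypothetical, not taken from the project.

```python
# Assumed symbol-to-weight mapping, inferred from the slide's examples:
# '#' = strong compound boundary, '|' = weak compound boundary,
# '~' = derivational boundary.
BOUNDARY_WEIGHTS = {"#": 4, "|": 2, "~": 1}


def segmentation_score(analysis):
    """Sum the weights of all segmentation symbols in one analysis string."""
    return sum(BOUNDARY_WEIGHTS.get(char, 0) for char in analysis)


def best_lemma(analyses):
    """Pick the analysis with the lowest score (first one wins on a tie)."""
    return min(analyses, key=segmentation_score)


for candidates in (["Abt~ei#lunge", "Ab|teil~ung"],
                   ["råd|hus#klocka", "råd#hus#klocka"]):
    scored = [(a, segmentation_score(a)) for a in candidates]
    print(scored, "->", best_lemma(candidates))
```

With this mapping the sketch reproduces the scores given on the slide (5 vs. 3 and 6 vs. 8). How ties are broken is not stated in the slides; taking the first candidate is an arbitrary choice here.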
Idea: Use a corpus for lemmatizing remaining compounds.
- Examples: kaputtreden, Waferfabriken
- Solution: If the first part occurs standing alone AND the second part occurs standing alone with a lemma, then segment and lemmatize, and store the first part as lemma (of itself)!

NP/PP Chunk Recognition (a project by Dominik A. Merz):
- adapted to Swedish by Jerker Hagman, 2004
- Pattern matcher with patterns over PoS tags
- Example patterns:
    ADV ADJ --> AP
    ART AP NN --> NP
    PR NP --> PP
- [Figure: example tree]

Jerker Hagman's results:
- 135 chunking rules
- Categories:
  - AdjP, AdvP
  - MPN, Coordinated_MPN, MPN_genitive
  - NP, Coordinated_NP, NP_genitive
  - PP
  - VerbCluster (hade gått), InfinitiveGroup (att minska)
- Evaluation against a small treebank: 75% precision, 68% recall

Recognition of temporal PPs in German (a project by Stefan Höfler):
- A second step towards semantic annotation.
- Starting point:
  - Prepositions (3) that always introduce a temporal PP: binnen, während, zeit
  - Prepositions (30) that may introduce a temporal PP: ab, an, auf, bis, ... + additional evidence
- Additional evidence:
  - Temporal adverb in PP: heute, niemals, wann, ...
  - Temporal noun in PP: Minute, Stunde, Jahr, Anfang, ...

Recognition of temporal PPs:
- Evaluation corpus: 990 sentences with 263 manually checked temporal PPs
- Result: Precision: 81%, Recall: 76%

Recognition of local PPs:
- Starting point:
  - Prepositions that always introduce a local PP: fern, oberhalb, südlich von
  - Prepositions that may introduce a local PP: ab, auf, bei, ... + additional evidence
- Additional evidence:
  - Local adverb in PP: dort, hier, oben, rechts, ...
  - Local noun in PP: Strasse, Quartier, Land, Norden, <LOC>, ...

Recognition of temporal and local PPs:

A Word on Recall and Precision:
The focus varies with the application!
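
For reference, the precision and recall figures quoted in the evaluations are ratios over correctly annotated items. A minimal sketch, using made-up counts that do not come from any of the evaluations in these slides (the function names are likewise only illustrative):

```python
def precision(true_pos, false_pos):
    """Share of annotated items that are correct."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Share of items in the gold standard that were found."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 80, 20, 40
print(f"precision = {precision(tp, fp):.2f}, recall = {recall(tp, fn):.2f}")
```

An annotator that marks fewer items tends to lower `false_pos` at the cost of raising `false_neg`, which is the trade-off this slide is about.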
Often: Precision is more important than Recall!
Idea: If I annotate something, then I want to be 'sure' that it is correct.

Clause Boundary Recognition (a project by Gaudenz Lügstenmann):
- Definition: A clause is a unit consisting of a full verb together with its (non-clausal) complements and adjuncts.
- A sentence consists of one or more clauses, and a clause consists of one or more phrases.
- Clauses are important for determining the cooccurrence of verbs and PPs (among other things).

Slide29:
<S> Mijailovic vårdas på sjukhus
<S> Anna Lindhs mördare Mijailo Mijailovic är så sjuk <CB> att han förts till sjukhus.
<S> Sedan i lördags vårdas han vid rättspsykiatriska kliniken på Karolinska universitetssjukhuset i Huddinge.
<S> Dit fördes han <CB> sedan en läkare vid Kronobergshäktet i Stockholm konstaterat <CB> att han det fanns risk <CB> att han skulle försöka <CB> ta livet av sig i häktet.
<S> Det skriver Aftonbladet och Expressen.
<S> Mijailovic, <CB> som väntar på rättegången i Högsta domstolen <CB> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, <CB> ska enligt tidningarna ha slutat ta sina tabletter <CB> och blivit starkt förvirrad.
<S> Enligt Kriminalvårdsstyrelsens bestämmelser ska i sådana fall en fånge föras till sjukhus.
Dagens Nyheter, 20. Sept. 2004

Clause Boundary Recognition:
Exceptions from the definition: Clauses with more than one verb:
- Coordinated verbs: Daten können überführt und verarbeitet werden.
- Perception verb + infinitive verb (=AcI): die den Markt wachsen sehen.
- 'lassen' + infinitive verb: lässt die Handbücher übertragen

Clause Boundary Recognition:
Exceptions from the definition: Clauses without a verb:
- Elliptical clauses (e.g. in coordinated structures)
- Examples:
  - Er beobachtet den Markt und seine Mitarbeiter die Konkurrenz.
  - Heute kann die Welt nur mehr knapp 30 dieser früher äusserst populären Riesenbilder bewundern, drei davon in der Schweiz.

Clause Boundary Recognition:
- The German CB recognizer is realized as a pattern matcher over PoS tags (34 patterns).
- Example patterns:
  - Comma + Relative Pronoun
  - Finite verb ... + Conjunction + ... Finite Verb
- Most difficult: CB without overt punctuation symbol or trigger word
  - Example: Simple Budgetreduzierungen in der IT in den Vordergrund zu stellen <CB> ist der falsche Ansatz.
  - This happens often in Swedish.?

Clause Boundary Recognition for German:
- Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
- Results (counting all CBs): Precision: 95.8%, Recall: 84.9%
- Results (counting only intra-sentential CBs): Precision: 90.5%, Recall: 61.1%

Using a PoS Tagger for Clause Boundary Recognition in German:
- A CB recognizer can be seen as a disambiguator over commas and CB-trigger-tokens (if we disregard the CBs without trigger).
- A tagger may serve the same purpose.
- Example:
    ... schrieb der Präsident,<Co> Michael Eisner,<Co> im Jahresbericht.
    ... schrieb der Präsident,<CB> der Michael Eisner kannte,<CB> im Jahresbericht.

Using a PoS Tagger for Clause Boundary Recognition in German:
- Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
- Training the Brill-Tagger on 75% and applying it on the remaining 25%
- Results: 93% Precision, 91% Recall
- Caution: very small evaluation corpus!!

Clause Boundary Recognition vs. Clause Recognition:
CB recognition marks only the boundaries. It does not identify discontinuous parts of clauses. It does not identify nesting.
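
A boundary-only recognizer in this sense can be sketched as a single scan over (word, tag) pairs. The sketch implements just the "Comma + Relative Pronoun" pattern listed for the German recognizer and is not the project's set of 34 patterns. The STTS tags (`$,` for comma, `PRELS` for relative pronoun) are assigned by hand here, and the function name `mark_clause_boundaries` is an assumption for illustration.

```python
def mark_clause_boundaries(tagged):
    """Insert <CB> after a comma that is followed by a relative pronoun."""
    output = []
    for i, (word, tag) in enumerate(tagged):
        output.append(word)
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "$," and next_tag == "PRELS":
            output.append("<CB>")
    return output


# Hand-tagged versions of the two example sentences (tags are assumptions).
relative = [("schrieb", "VVFIN"), ("der", "ART"), ("Präsident", "NN"),
            (",", "$,"), ("der", "PRELS"), ("Michael", "NE"),
            ("Eisner", "NE"), ("kannte", "VVFIN"), (",", "$,"),
            ("im", "APPRART"), ("Jahresbericht", "NN"), (".", "$.")]
apposition = [("schrieb", "VVFIN"), ("der", "ART"), ("Präsident", "NN"),
              (",", "$,"), ("Michael", "NE"), ("Eisner", "NE"),
              (",", "$,"), ("im", "APPRART"), ("Jahresbericht", "NN"),
              (".", "$.")]

print(" ".join(mark_clause_boundaries(relative)))
print(" ".join(mark_clause_boundaries(apposition)))
```

The single pattern finds the boundary that opens the relative clause and leaves the apposition commas alone. The comma that closes the relative clause would need a further pattern, and the flat output carries no nesting information.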
Example:
<S> Mijailovic, <CB> som väntar på rättegången i Högsta domstolen <CB> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, <CB> ska enligt tidningarna ha slutat ta sina tabletter <CB> och blivit starkt förvirrad.
<C> Mijailovic, <C> som väntar på rättegången i Högsta domstolen <C> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, </C></C> ska enligt tidningarna ha slutat ta sina tabletter </C><C> och blivit starkt förvirrad.</C>
Clause Recognition should be done with a recursive parsing approach because of clause nesting.

Summary:
- Part-of-Speech tagging based on statistical methods is robust and reliable.
- The TreeTagger assigns PoS tags and lemmas.
- Swetwol is a morphological analyser that, given a word form, outputs the PoS tag, the lemma and the morphological features for all its readings.
- Multiple knowledge sources (e.g. PoS-tagger and Swetwol) may lead to conflicting tags.
- Chunking (partial parsing) builds partial trees.
- Clause boundary detection can be realized as pattern matching over PoS tags.
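
To make the point about pattern matching over PoS tags concrete, the three example chunk patterns from the NP/PP chunking slide (ADV ADJ --> AP, ART AP NN --> NP, PR NP --> PP) can be applied bottom-up until nothing matches any more. This is a minimal sketch under stated assumptions: the rule ordering, the leftmost-match strategy and the names `RULES`, `label` and `chunk` are illustrative choices, not the design of the chunker described in the slides.

```python
# The three example patterns from the chunking slide: (tag sequence, parent).
RULES = [
    (("ADV", "ADJ"), "AP"),
    (("ART", "AP", "NN"), "NP"),
    (("PR", "NP"), "PP"),
]


def label(node):
    """A node is either a plain tag or a (category, children) pair."""
    return node if isinstance(node, str) else node[0]


def chunk(tags):
    """Repeatedly replace the leftmost matching tag sequence by its parent."""
    nodes = list(tags)
    changed = True
    while changed:
        changed = False
        for pattern, parent in RULES:
            width = len(pattern)
            for start in range(len(nodes) - width + 1):
                window = nodes[start:start + width]
                if tuple(label(node) for node in window) == pattern:
                    nodes[start:start + width] = [(parent, window)]
                    changed = True
                    break
            if changed:
                break
    return nodes


print(chunk(["PR", "ART", "ADV", "ADJ", "NN"]))
```

Each replacement shortens the node list, so the loop terminates. Tags that match no pattern stay at the top level, which is why the result is a partial tree and not a full parse.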