Presentation Transcript (added November 16, 2007)

Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge
  Johnny Bigert and Ola Knutsson
  Royal Institute of Technology, Stockholm, Sweden
  johnny@nada.kth.se, knutsson@nada.kth.se

Detection of context-sensitive spelling errors
  Identification of less-frequent grammatical constructions in the face of sparse data
  Hybrid method:
    Unsupervised error detection
    Linguistic knowledge used for phrase transformations

Properties
  Finds difficult error types in unrestricted text (spelling errors resulting in an existing word, etc.)
  No prior knowledge required, i.e. no classification of errors or confusion sets

A first approach
  Algorithm:
    for each position i in the stream
      if the frequency of (t_{i-1} t_i t_{i+1}) is low
        report error to the user
      else
        report no error

Sparse data
  Problems:
    Data sparseness for trigram statistics
    Phrase and clause boundaries may produce almost any trigram

Sparse data
  Example: "It is every manager's task to…"
  "It is every" is tagged (pn.neu.sin.def.sub/obj, vb.prs.akt, dt.utr/neu.sin.ind) and has a frequency of zero
  Probable cause: out of a million words in the corpus, only 709 have been assigned the tag (dt.utr/neu.sin.ind)

Sparse data
  We try to replace "It is every manager's task to…" with "It is a manager's task to…"

Sparse data
  "It is every" is tagged (pn.neu.sin.def.sub/obj, vb.prs.akt, dt.utr/neu.sin.ind) and has a frequency of 0
  "It is a" is tagged (pn.neu.sin.def.sub/obj, vb.prs.akt, dt.utr.sin.ind) and has a frequency of 231
  (dt.utr/neu.sin.ind) has a frequency of 709
  (dt.utr.sin.ind) has a frequency of 19112

Tag replacements
  When replacing a tag:
    Not all tags are suitable as replacements
    Not all replacements are equally appropriate…
    …and thus, we require a penalty or probability for the replacement

Tag replacements
  To be considered:
    Manual work to create the probabilities for each tag set and language
    The probabilities are difficult to estimate manually
    Automatic estimation of the probabilities (other paper)

Tag replacements
  Examples of replacement probabilities:
    Mannen var glad. (The man was happy.)
    Mannen är glad. (The man is happy.)

    Probability  Original tag      Replacement tag
    100%         vb.prt.akt.kop    vb.prt.akt.kop
    74%          vb.prt.akt.kop    vb.prs.akt.kop
    50%          vb.prt.akt.kop    vb.prt.akt____
    48%          vb.prt.akt.kop    vb.prt.sfo

Tag replacements
  Examples of replacement probabilities:
    Mannen talar med de anställda. (The man talks to the employees.)
    Mannen talar med våra anställda. (The man talks to our employees.)

    Probability  Original tag          Replacement tag
    100%         dt.utr/neu.plu.def    dt.utr/neu.plu.def
    44%          dt.utr/neu.plu.def    dt.utr/neu.plu.ind/def
    42%          dt.utr/neu.plu.def    ps.utr/neu.plu.def
    41%          dt.utr/neu.plu.def    jj.pos.utr/neu.plu.ind.nom

Weighted trigrams
  Replacing (t_1 t_2 t_3) with (r_1 r_2 r_3):
    f = freq(r_1 r_2 r_3) · penalty
    penalty = Pr[replace t_1 with r_1] · Pr[replace t_2 with r_2] · Pr[replace t_3 with r_3]

Weighted trigrams
  Replacement of tags:
    Calculate f for all representatives of t_1, t_2 and t_3 (typically 3 · 3 · 3 of them)
    The weighted frequency is the sum of the penalized frequencies

Algorithm
  Algorithm:
    for each position i in the stream
      if the weighted frequency of (t_{i-1} t_i t_{i+1}) is low
        report error to the user
      else
        report no error

An improved algorithm
  Problems with sparse data:
    Phrase and clause boundaries may produce almost any trigram
    Use clauses as the unit for error detection to avoid clause boundaries

Phrase transformations
  We identify phrases in order to transform rare constructions into more frequent ones:
    Replacing the phrase with its head
    Removing phrases (e.g. AdvP, PP)

Phrase transformations
  Example:
    Alla hundar som är bruna är lyckliga (All dogs that are brown are happy)
    becomes: Hundarna är lyckliga (The dogs are happy)

Phrase transformations
  Den bruna (jj.sin) hunden (the brown dog)
  De bruna (jj.plu) hundarna (the brown dogs)

Phrase transformations
  The same example with a tagging error:
    Alla hundar som är bruna (jj.sin) är lyckliga (All dogs that are brown are happy)
  Robust NP detection yields:
    Hundarna är lyckliga (The dogs are happy)

Results
  Error types found:
    context-sensitive spelling errors
    split compounds
    spelling errors
    verb chain errors

Comparison between probabilistic methods
  The unsupervised method has a good error detection capacity but also a high rate of false alarms
  The introduction of linguistic knowledge dramatically reduces the number of false alarms

Future work
  The error detection method is not restricted to part-of-speech tags; we are considering adapting the method to phrase n-grams
  Error classification
  Generation of correction suggestions

Summing up
  Detection of context-sensitive spelling errors
  Combining an unsupervised error detection method with robust shallow parsing

Internal Evaluation
  POS-tagger: 96.4%
  NP-recognition: P=83.1% and R=79.5%
  Clause boundary recognition: P=81.4% and R=86.6%
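The weighted-trigram computation on the slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout, the threshold that defines "low", and the 0.5 replacement probability are all hypothetical. Only the two trigram frequencies (0 and 231) come from the slides.

```python
from itertools import product

def weighted_frequency(trigram, trigram_freq, replacements):
    """Sum of penalized frequencies over all combinations of representatives.

    replacements maps a tag to a list of (representative, probability) pairs;
    a tag with no entry is only represented by itself with probability 1.0.
    """
    t1, t2, t3 = trigram
    total = 0.0
    for (r1, p1), (r2, p2), (r3, p3) in product(
        replacements.get(t1, [(t1, 1.0)]),
        replacements.get(t2, [(t2, 1.0)]),
        replacements.get(t3, [(t3, 1.0)]),
    ):
        penalty = p1 * p2 * p3  # product of the three replacement probabilities
        total += trigram_freq.get((r1, r2, r3), 0) * penalty
    return total

def detect_errors(tags, trigram_freq, replacements, threshold):
    """Return the positions i whose trigram (t[i-1], t[i], t[i+1]) is rare."""
    flagged = []
    for i in range(1, len(tags) - 1):
        trigram = (tags[i - 1], tags[i], tags[i + 1])
        if weighted_frequency(trigram, trigram_freq, replacements) < threshold:
            flagged.append(i)
    return flagged

# Tags and trigram frequencies from the "It is every" / "It is a" slides.
PN = "pn.neu.sin.def.sub/obj"
VB = "vb.prs.akt"
DT_EVERY = "dt.utr/neu.sin.ind"
DT_A = "dt.utr.sin.ind"
trigram_freq = {(PN, VB, DT_EVERY): 0, (PN, VB, DT_A): 231}

# The 0.5 probability is an assumed value, not taken from the slides.
replacements = {DT_EVERY: [(DT_EVERY, 1.0), (DT_A, 0.5)]}

print(weighted_frequency((PN, VB, DT_EVERY), trigram_freq, replacements))
print(detect_errors([PN, VB, DT_EVERY], trigram_freq, replacements, threshold=1.0))
```

With the assumed probability, the unseen trigram borrows half of the frequency of its more common representative, so it is no longer reported. Without any replacements the same position would be flagged, which is the false alarm the weighting is meant to avoid.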
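The two phrase transformations named on the slides (replacing a phrase with its head, and removing phrases such as AdvP and PP) could look roughly as follows once a shallow parser has chunked a clause. This is a hypothetical sketch: the `Chunk` structure, the phrase labels other than NP, AdvP and PP, and the simplified tags (`DT`, `NN`, `HP`, `VB`, `JJ`, `AB`) are placeholders invented for illustration and are not the tag set or parser output used by the authors.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    kind: str        # phrase label, e.g. "NP", "PP", "AdvP" (labels assumed)
    tags: List[str]  # part-of-speech tags of the words in the phrase
    head: int = 0    # index of the head word within tags

REMOVABLE = {"AdvP", "PP"}  # phrase types that are dropped entirely

def transform(chunks: List[Chunk]) -> List[str]:
    """Reduce a chunked clause to a shorter, more frequent tag sequence."""
    out: List[str] = []
    for chunk in chunks:
        if chunk.kind in REMOVABLE:
            continue  # remove the whole phrase
        if chunk.kind == "NP":
            out.append(chunk.tags[chunk.head])  # keep only the head's tag
        else:
            out.extend(chunk.tags)  # leave other phrases untouched
    return out

# Rough analogue of "Alla hundar som är bruna är lyckliga",
# with an adverb added here to show phrase removal.
clause = [
    Chunk("NP", ["DT", "NN", "HP", "VB", "JJ"], head=1),
    Chunk("VC", ["VB"]),
    Chunk("AdvP", ["AB"]),
    Chunk("AP", ["JJ"]),
]
print(transform(clause))
```

Because everything inside the noun phrase except the head is discarded, a wrong tag on a word inside it (such as the jj.sin tagging error on the slide) does not reach the trigram check, provided the phrase itself is still recognized.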