logging in or signing up ofer yaniv Camilla Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 52 Category: Education License: All Rights Reserved Like it (0) Dislike it (0) Added: January 17, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Unsupervised Learning of Natural Language Morphology using MDL: Unsupervised Learning of Natural Language Morphology using MDL John Goldsmith November 9, 2001Unsupervised learning: Unsupervised learning Input: untagged text in orthographic or phonetic form with spaces (or punctuation) separating words. But no tagging or text preparation.Output: Output List of stems, suffixes, and prefixes List of signatures. A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. Hence, a stem in a corpus has a unique signature. A signature has a unique set of stems associated with it …(example of signature in English): (example of signature in English) NULL.ed.ing.s ask call point = ask asked asking asks call called calling calls point pointed pointing points Minimum Description Length (MDL): Minimum Description Length (MDL) Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989) Work by Michael Brent and Carl de Marcken on word-discovery using MDLEssence of MDL: Essence of MDL We are given a corpus, and a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. (“Given”? Given by who? We’ll get back to that.) (Remember: a distribution is a set of non-negative numbers summing to 1.0.) Slide7: The higher the probability is that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data. Better said: -1 * log probability (corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data. This is known as the optimal compressed length of the data, given the model. Using base 2 logs, this number is a measure in information theoretic bits.Essence of MDL…: Essence of MDL… The goodness of the morphology is also measured by how compact the morphology is. We can measure the compactness of a morphology in information theoretic bits.How can we measure the compactness of a morphology?: How can we measure the compactness of a morphology? Let’s consider a naïve version of description length: count the number of letters. This naïve version is nonetheless helpful in seeing the intuition involved.Slide10: Naive Minimum Description Length Corpus: jump, jumps, jumping laugh, laughed, laughing sing, sang, singing the, dog, dogs total: 62 letters Analysis: Stems: jump laugh sing sang dog (20 letters) Suffixes: s ing ed (6 letters) Unanalyzed: the (3 letters) total: 29 letters. Notice that the description length goes UP if we analyze sing into s+ingEssence of MDL…: Essence of MDL… The best overall theory of a corpus is the one for which the sum of log prob (corpus) + length of the morphology (that’s the description length) is the smallest. Essence of MDL…: Essence of MDL…Overall logic: Overall logic Search through morphology space for the morphology which provides the smallest description length.Slide14: Corpus Pick a large corpus from a language -- 5,000 to 1,000,000 words.Slide15: Corpus Bootstrap heuristic Feed it into the “bootstrapping” heuristic...Slide16: Corpus Out of which comes a preliminary morphology, which need not be superb. Morphology Bootstrap heuristic Slide17: Corpus Morphology Bootstrap heuristic incremental heuristics Feed it to the incremental heuristics...Slide18: Corpus Morphology Bootstrap heuristic incremental heuristics modified morphology Out comes a modified morphology.Slide19: Corpus Morphology Bootstrap heuristic incremental heuristics modified morphology Is the modification an improvement? Ask MDL! Slide20: Corpus Morphology Bootstrap heuristic modified morphology If it is an improvement, replace the morphology... Garbage Slide21: Corpus Bootstrap heuristic incremental heuristics modified morphology Send it back to the incremental heuristics again... Slide22: Morphology incremental heuristics modified morphology Continue until there are no improvements to try. 1. Bootstrap heuristic: 1. Bootstrap heuristic A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes. In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so search space has at least members.Better bootstrap heuristics: Better bootstrap heuristics Heuristic, not perfection! Several good heuristics. Best is a modification of a good idea of Zellig Harris (1955): Current variant: Cut words at certain peaks of successor frequency. Problems: can over-cut; can under-cut;Successor frequency: Successor frequency g o v e r n Empirically, only one letter follows “gover”: “n”Slide26: Successor frequency g o v e r n m Empirically, 6 letters follows “govern”: “m” i o s e #Slide27: Successor frequency g o v e r n m Empirically, 1 letter follows “governm”: “e” e g o v e r 1 n 6 m 1 e peak of successor frequencyLots of errors…: Lots of errors… c o n s e r v a t i v e s 9 18 11 6 4 1 2 1 1 2 1 1 wrong right wrongEven so…: Even so… We set conditions: Accept cuts with stems at least 5 letters in length; Demand that successor frequency be a clear peak: 1… N … 1 (e.g. govern-ment) Then for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems to it.2. Incremental heuristics: 2. Incremental heuristics Course-grained to fine-grained 1. Stems and suffixes to split: Accept any analysis of a word if it consists of a known stem and a known suffix. 2. Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis. We’ll return to this in a moment.Incremental heuristic: Incremental heuristic 3.Slide stem-suffix boundary to the left: Again, use MDL to decide. How do we use MDL to decide?Using MDL to judge a potential stem: Using MDL to judge a potential stem act, acted, action, acts, acting. We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s Let’s compute cost versus savings of signature NULL.ed.ion.ing.s Savings: Stem savings: 4 copies of the stem act: that’s 3 x 4 = 12 letters = almost 60 bits.Cost of NULL.ed.ing.s: Cost of NULL.ed.ing.s A pointer to each suffix: To give a feel for this: Total cost of suffix list: about 30 bits. Cost of pointer to signature: total cost is Slide34: Cost of signature: about 45 bits Savings: about 60 bits so MDL says: Do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.Repair heuristics: using MDL: Original morphology + Compressed data Repair heuristics: using MDL We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compared the two lengths. Revised morphology+ compressed data < > You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
ofer yaniv Camilla Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 52 Category: Education License: All Rights Reserved Like it (0) Dislike it (0) Added: January 17, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Unsupervised Learning of Natural Language Morphology using MDL: Unsupervised Learning of Natural Language Morphology using MDL John Goldsmith November 9, 2001Unsupervised learning: Unsupervised learning Input: untagged text in orthographic or phonetic form with spaces (or punctuation) separating words. But no tagging or text preparation.Output: Output List of stems, suffixes, and prefixes List of signatures. A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. Hence, a stem in a corpus has a unique signature. A signature has a unique set of stems associated with it …(example of signature in English): (example of signature in English) NULL.ed.ing.s ask call point = ask asked asking asks call called calling calls point pointed pointing points Minimum Description Length (MDL): Minimum Description Length (MDL) Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989) Work by Michael Brent and Carl de Marcken on word-discovery using MDLEssence of MDL: Essence of MDL We are given a corpus, and a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. (“Given”? Given by who? We’ll get back to that.) (Remember: a distribution is a set of non-negative numbers summing to 1.0.) Slide7: The higher the probability is that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data. Better said: -1 * log probability (corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data. This is known as the optimal compressed length of the data, given the model. Using base 2 logs, this number is a measure in information theoretic bits.Essence of MDL…: Essence of MDL… The goodness of the morphology is also measured by how compact the morphology is. We can measure the compactness of a morphology in information theoretic bits.How can we measure the compactness of a morphology?: How can we measure the compactness of a morphology? Let’s consider a naïve version of description length: count the number of letters. This naïve version is nonetheless helpful in seeing the intuition involved.Slide10: Naive Minimum Description Length Corpus: jump, jumps, jumping laugh, laughed, laughing sing, sang, singing the, dog, dogs total: 62 letters Analysis: Stems: jump laugh sing sang dog (20 letters) Suffixes: s ing ed (6 letters) Unanalyzed: the (3 letters) total: 29 letters. Notice that the description length goes UP if we analyze sing into s+ingEssence of MDL…: Essence of MDL… The best overall theory of a corpus is the one for which the sum of log prob (corpus) + length of the morphology (that’s the description length) is the smallest. Essence of MDL…: Essence of MDL…Overall logic: Overall logic Search through morphology space for the morphology which provides the smallest description length.Slide14: Corpus Pick a large corpus from a language -- 5,000 to 1,000,000 words.Slide15: Corpus Bootstrap heuristic Feed it into the “bootstrapping” heuristic...Slide16: Corpus Out of which comes a preliminary morphology, which need not be superb. Morphology Bootstrap heuristic Slide17: Corpus Morphology Bootstrap heuristic incremental heuristics Feed it to the incremental heuristics...Slide18: Corpus Morphology Bootstrap heuristic incremental heuristics modified morphology Out comes a modified morphology.Slide19: Corpus Morphology Bootstrap heuristic incremental heuristics modified morphology Is the modification an improvement? Ask MDL! Slide20: Corpus Morphology Bootstrap heuristic modified morphology If it is an improvement, replace the morphology... Garbage Slide21: Corpus Bootstrap heuristic incremental heuristics modified morphology Send it back to the incremental heuristics again... Slide22: Morphology incremental heuristics modified morphology Continue until there are no improvements to try. 1. Bootstrap heuristic: 1. Bootstrap heuristic A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes. In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so search space has at least members.Better bootstrap heuristics: Better bootstrap heuristics Heuristic, not perfection! Several good heuristics. Best is a modification of a good idea of Zellig Harris (1955): Current variant: Cut words at certain peaks of successor frequency. Problems: can over-cut; can under-cut;Successor frequency: Successor frequency g o v e r n Empirically, only one letter follows “gover”: “n”Slide26: Successor frequency g o v e r n m Empirically, 6 letters follows “govern”: “m” i o s e #Slide27: Successor frequency g o v e r n m Empirically, 1 letter follows “governm”: “e” e g o v e r 1 n 6 m 1 e peak of successor frequencyLots of errors…: Lots of errors… c o n s e r v a t i v e s 9 18 11 6 4 1 2 1 1 2 1 1 wrong right wrongEven so…: Even so… We set conditions: Accept cuts with stems at least 5 letters in length; Demand that successor frequency be a clear peak: 1… N … 1 (e.g. govern-ment) Then for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems to it.2. Incremental heuristics: 2. Incremental heuristics Course-grained to fine-grained 1. Stems and suffixes to split: Accept any analysis of a word if it consists of a known stem and a known suffix. 2. Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis. We’ll return to this in a moment.Incremental heuristic: Incremental heuristic 3.Slide stem-suffix boundary to the left: Again, use MDL to decide. How do we use MDL to decide?Using MDL to judge a potential stem: Using MDL to judge a potential stem act, acted, action, acts, acting. We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s Let’s compute cost versus savings of signature NULL.ed.ion.ing.s Savings: Stem savings: 4 copies of the stem act: that’s 3 x 4 = 12 letters = almost 60 bits.Cost of NULL.ed.ing.s: Cost of NULL.ed.ing.s A pointer to each suffix: To give a feel for this: Total cost of suffix list: about 30 bits. Cost of pointer to signature: total cost is Slide34: Cost of signature: about 45 bits Savings: about 60 bits so MDL says: Do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.Repair heuristics: using MDL: Original morphology + Compressed data Repair heuristics: using MDL We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compared the two lengths. Revised morphology+ compressed data < >