ofer yaniv

Uploaded from authorPOINTLite
Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Unsupervised Learning of Natural Language Morphology using MDL: 

Unsupervised Learning of Natural Language Morphology using MDL John Goldsmith November 9, 2001

Unsupervised learning: 

Unsupervised learning Input: untagged text in orthographic or phonetic form with spaces (or punctuation) separating words. But no tagging or text preparation.

Output: 

Output List of stems, suffixes, and prefixes List of signatures. A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. Hence, a stem in a corpus has a unique signature. A signature has a unique set of stems associated with it …

(example of signature in English): 

(example of signature in English) NULL.ed.ing.s ask call point = ask asked asking asks call called calling calls point pointed pointing points

Minimum Description Length (MDL): 

Minimum Description Length (MDL) Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989) Work by Michael Brent and Carl de Marcken on word-discovery using MDL

Essence of MDL: 

Essence of MDL We are given a corpus, and a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. (“Given”? Given by who? We’ll get back to that.) (Remember: a distribution is a set of non-negative numbers summing to 1.0.)

Slide7: 

The higher the probability is that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data. Better said: -1 * log probability (corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data. This is known as the optimal compressed length of the data, given the model. Using base 2 logs, this number is a measure in information theoretic bits.

Essence of MDL…: 

Essence of MDL… The goodness of the morphology is also measured by how compact the morphology is. We can measure the compactness of a morphology in information theoretic bits.

How can we measure the compactness of a morphology?: 

How can we measure the compactness of a morphology? Let’s consider a naïve version of description length: count the number of letters. This naïve version is nonetheless helpful in seeing the intuition involved.

Slide10: 

Naive Minimum Description Length Corpus: jump, jumps, jumping laugh, laughed, laughing sing, sang, singing the, dog, dogs total: 62 letters Analysis: Stems: jump laugh sing sang dog (20 letters) Suffixes: s ing ed (6 letters) Unanalyzed: the (3 letters) total: 29 letters. Notice that the description length goes UP if we analyze sing into s+ing

Essence of MDL…: 

Essence of MDL… The best overall theory of a corpus is the one for which the sum of log prob (corpus) + length of the morphology (that’s the description length) is the smallest.

Essence of MDL…: 

Essence of MDL…

Overall logic: 

Overall logic Search through morphology space for the morphology which provides the smallest description length.

Slide14: 

Corpus Pick a large corpus from a language -- 5,000 to 1,000,000 words.

Slide15: 

Corpus Bootstrap heuristic Feed it into the “bootstrapping” heuristic...

Slide16: 

Corpus Out of which comes a preliminary morphology, which need not be superb. Morphology Bootstrap heuristic

Slide17: 

Corpus Morphology Bootstrap heuristic incremental heuristics Feed it to the incremental heuristics...

Slide18: 

Corpus Morphology Bootstrap heuristic incremental heuristics modified morphology Out comes a modified morphology.

Slide19: 

Corpus Morphology Bootstrap heuristic incremental heuristics modified morphology Is the modification an improvement? Ask MDL!

Slide20: 

Corpus Morphology Bootstrap heuristic modified morphology If it is an improvement, replace the morphology... Garbage

Slide21: 

Corpus Bootstrap heuristic incremental heuristics modified morphology Send it back to the incremental heuristics again...

Slide22: 

Morphology incremental heuristics modified morphology Continue until there are no improvements to try.

1. Bootstrap heuristic: 

1. Bootstrap heuristic A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes. In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so search space has at least members.

Better bootstrap heuristics: 

Better bootstrap heuristics Heuristic, not perfection! Several good heuristics. Best is a modification of a good idea of Zellig Harris (1955): Current variant: Cut words at certain peaks of successor frequency. Problems: can over-cut; can under-cut;

Successor frequency: 

Successor frequency g o v e r n Empirically, only one letter follows “gover”: “n”

Slide26: 

Successor frequency g o v e r n m Empirically, 6 letters follows “govern”: “m” i o s e #

Slide27: 

Successor frequency g o v e r n m Empirically, 1 letter follows “governm”: “e” e g o v e r 1 n 6 m 1 e peak of successor frequency

Lots of errors…: 

Lots of errors… c o n s e r v a t i v e s 9 18 11 6 4 1 2 1 1 2 1 1 wrong right wrong

Even so…: 

Even so… We set conditions: Accept cuts with stems at least 5 letters in length; Demand that successor frequency be a clear peak: 1… N … 1 (e.g. govern-ment) Then for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems to it.

2. Incremental heuristics: 

2. Incremental heuristics Course-grained to fine-grained 1. Stems and suffixes to split: Accept any analysis of a word if it consists of a known stem and a known suffix. 2. Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis. We’ll return to this in a moment.

Incremental heuristic: 

Incremental heuristic 3.Slide stem-suffix boundary to the left: Again, use MDL to decide. How do we use MDL to decide?

Using MDL to judge a potential stem: 

Using MDL to judge a potential stem act, acted, action, acts, acting. We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s Let’s compute cost versus savings of signature NULL.ed.ion.ing.s Savings: Stem savings: 4 copies of the stem act: that’s 3 x 4 = 12 letters = almost 60 bits.

Cost of NULL.ed.ing.s: 

Cost of NULL.ed.ing.s A pointer to each suffix: To give a feel for this: Total cost of suffix list: about 30 bits. Cost of pointer to signature: total cost is

Slide34: 

Cost of signature: about 45 bits Savings: about 60 bits so MDL says: Do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.

Repair heuristics: using MDL: 

Original morphology + Compressed data Repair heuristics: using MDL We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compared the two lengths. Revised morphology+ compressed data < >