Unsupervised Knowledge-Free Morpheme Boundary Detection : Unsupervised Knowledge-Free Morpheme Boundary Detection
Stefan Bordag
University of Leipzig
Example
Related work
Part One: Generating training data
Part Two: Training and Applying a Classificator
Preliminary results
Further research
Example: clearly early : Example: clearly early The examples used throughout this presentation are clearly and early
In one case, the stem is clear and in the other early
Other word forms of same lemmas:
clearly: clearest, clear, clearer, clearing
early: earlier, erliest
Semantically related words:
clearly: logically, really, totally, weakly, …
early: morning, noon, day, month, time, …
Correct morpheme boundaries analysis:
clearly → clear-ly but not *clearl-y or *clea-rly
early → early or earl-y but not *ear-ly
1. Three approaches to morpheme boundary detection : 1. Three approaches to morpheme boundary detection Three kinds of approaches:
Genetic Algorithms and the Minimum Description Length model
(Kazakov 97 & 01), (Goldsmith 01), (Creutz 03 & 05)
This approach utilizes only word list, not the context information for each word from corpus.
This possibly results in an upper limit on achievable performance (especially with regards to irregularities).
One advantage is that smaller corpora sufficient
Semantics based
(Schone & Jurafsky 01), (Baroni 03)
General problem of this approach with examples like deeply and deepness where semantic similarity is unlikely
Letter Successor Variety (LSV) based
(Harris 55), (Hafer & Weiss 74) first application, but low performance
Also applied only to a word list
Further hampered by noise in the data
2. New solution in two parts : 2. New solution in two parts sentences cooccurrences The talk was very
informative The talk 1
Talk was 1
…
similar words Talk speech 20
Was is 15
… clear-ly
lately
early
… train classifier clear-ly
late-ly
early
… apply classifier compute LSV s = LSV * freq * multiletter *
bigram
2.1. First part: Generating training data with LSV and distributed Semantics : 2.1. First part: Generating training data with LSV and distributed Semantics Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information
Frequency of word A and B is nA and nB
Frequency of cooccurrence of A with B is nAB
Corpus size is n
Significance computation is Poisson approximation of log-likelihood (Dunning 93) (Quasthoff & Wolff 02)
Neighbors of “clearly“ : Neighbors of “clearly“ Most significant left neighbors
very
quite
so
It‘s
most
it‘s
shows
results
that‘s
stated
Quite
Most significant right neighbors
defined
written
labeled
marked
visible
demonstrated
superior
stated
shows
demonstrates
understood
clearly It’s clearly labeled very clearly shows
2.2. New solution as combination of two existing approaches : 2.2. New solution as combination of two existing approaches Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker
Similar words to “clearly“ : Similar words to “clearly“ Most significant right neighbors
defined
written
labeled
marked
visible
demonstrated
superior
stated
shows
demonstrates
understood …
weakly
legally
closely
clearly
greatly
linearly
really
… Most significant left neighbors
very
quite
so
It‘s
most
it‘s
shows
results
that‘s
stated
Quite
2.3. New solution as combination of two existing approaches : 2.3. New solution as combination of two existing approaches Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker
Sort those words by edit distance and keep 150 most similar → since further words only add random noise
Similar words to “clearly“ sorted by edit distance : Similar words to “clearly“ sorted by edit distance Most significant
left neighbors
very
quite
so
It‘s
most
it‘s
shows
results
that‘s
stated
Quite
Most significant
right neighbors
defined
written
labeled
marked
visible
demonstrated
superior
stated
shows
demonstrates
understood
Sorted List
clearly
closely
greatly
legally
linearly
really
weakly
…
2.4. New solution as combination of two existing approaches : 2.4. New solution as combination of two existing approaches Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker
Sort those words by edit distance and keep 150 most similar → since further words only add random noise
Compute letter successor variety for each transition between two characters of the input word
Report boundaries where the LSV is above threshold
2.5. Letter successor variety : 2.5. Letter successor variety Letter successor variety: Harris (55)
where word-splitting occurs if the number of distinct letters that follows a given sequence of characters surpasses the threshold.
Input are the 150 most similar words
Observing how many different letters occur after a part of the string:
#c- In the given list after #c- 5 letters
#cl- only 3 letters
#cle- only 1 letter
…
-ly# but reversed before –ly# 16 different letters (16 different stems preceding the suffix –ly#)
# c l e a r l y #
28 5 3 1 1 1 1 1 f. left (thus after #cl 5 various letters)
1 1 2 1 3 16 10 14 f. right (thus before -y# 10 var. letters)
2.5.1. Balancing factors : 2.5.1. Balancing factors LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:
freq: Frequency differences between beginning and middle of word
multiletter: Representation of single phonemes with several letters
bigram: Certain fixed combinations of letters
Final score s for each possible boundary is then:
s = LSV * freq * multiletter * bigram
2.5.2. Balancing factors: Frequency : 2.5.2. Balancing factors: Frequency LSV is not normalized against frequency
28 different first letters within 150 words
5 different second letters within 11 words, beginning with c
3 different third letters within 4 words, beginning with cl
Computing frequency weight freq:
4 out of 11 begin with #cl- then weight is 4/11
# c l e a r l y #
150 11 4 1 1 1 1 1 of 11 4 begin with cl
0.1 0.4 0.3 1 1 1 1 1 from left
2.5.3. Balancing factors: Multiletter Phonemes : 2.5.3. Balancing factors: Multiletter Phonemes Problem: Two or more letters which together represent one phoneme “carry away” the nominator for the overlap factor quotient:
Letter split variety:
# s c h l i m m e
7 1 7 2 1 1 2
2 1 1 1 2 4 15
Computing overlap factor:
150 27 18 18 6 5 5 5
2 2 2 2 3 7 105 150
^ thus at this point the LSV 7 is weighted 1 (18/18), but since sch is one phoneme, it should have been 18/150 !
Solution: Ranking of bi- and trigrams, highest receives weight of 1.0
Overlap factor is recomputed as weighted average:
In this case that means 1.0 * 27/150, since ‘sch’ is the highest trigram and has a weight of 1.0.
2.5.4. Balancing factors: Bigrams : 2.5.4. Balancing factors: Bigrams It is obvious that –th– in English is almost never to be divided
Computation of bigram ranking over all words in word list and give 0.1 weight to highest ranked and 1.0 to lowest ranked.
LSV score then multiplied with resulting weight.
Thus, the German –ch- which is the highest ranked bigram receives a penalty of 0.1 and thus it is nearly impossible that it becomes a morpheme boundary
2.5.5. Sample computation : 2.5.5. Sample computation Compute letter successor variety:
# c l e a r - l y # # e a r l y #
28 5 3 1 1 1 1 1 40 5 1 1 2 1
1 1 2 1 3 16 10 10 1 2 1 4 6 19
Balancing: Frequencies:
150 11 4 1 1 1 1 1 150 9 2 2 2 1
1 1 2 2 5 76 90 150 1 2 2 6 19 150
Balancing: Multiletter weights:
Bi l 0.4 0.1 0.5 0.2 0.5 0.0 0.2 0.2 0.5 0.0
Tri r 0.1 0.1 0.1 0.1 0.0 0.0 0.1 0.0
Bi l 0.5 0.2 0.5 0.0 0.1 0.3 0.5 0.0 0.1 0.3
Tri r 0.1 0.1 0.0 0.0 0.2 0.0 0.0 0.2
Balancing: Bigram weight:
0.1 0.5 0.2 0.5 0.0 0.1 0.2 0.5 0.0 0.1
Left and Right LSV scores:
0.1 0.3 0.0 0.4 1.0 0.9 0.0 0.0 0.5 1.7
0.3 0.9 0.1 0.0 12.4 3.7 1.0 0.0 0.7 0.2
Computing right score for clear-ly:
16*(76/90+0.1*76/150)/(1.0+0.1)*(1-0.0)=12.4
Sum scores for left and right:
0.4 1.2 0.1 0.4 13.4 4.6 1.0 0.1 1.2 2.0
threshold: 5
clear-ly early
3. Second Part: Training and Applying classifier : 3. Second Part: Training and Applying classifier Any word list can be stored in a trie (Fredkin:60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison:68)
Example:
clearly
early
lately
clear
late
r y e a l t e r a l a l c e ¤ root l c l a e ¤ ¤ ¤ ¤ a ¤ = End or
beginning of word
3.1. PCT as a Classificator : 3.1. PCT as a Classificator clear ly late ear ¤ root cl late ¤ ¤ ¤ ¤ clear-ly, late-ly, early,
Clear, late clear ly late ear ¤ root cl late ¤ ¤ ¤ ¤ ly=2 ly=1 ly=1 ly=1 Amazing?ly amazing-ly ly=1 ly=1 add known
information retrieve
known
information Apply deepest
found node dear?ly dearly ¤=1 ¤=1 ¤=1 ¤=1 ¤=1
4. Evaluation : 4. Evaluation Boundary measuring: each boundary detected can be correct or wrong (precision) or boundaries can be not detected (recall)
First evaluation is global LSV with the proposed improvements
Evaluating LSV Precision vs. Recall : Evaluating LSV Precision vs. Recall
Evaluating LSV F-measure : Evaluating LSV F-measure
Evaluating combination Precision vs. Recall : Evaluating combination Precision vs. Recall
Evaluating combination F-measure : Evaluating combination F-measure
Comparing combination with global LSV : Comparing combination with global LSV
4.1. Results : 4.1. Results German newspaper corpus with 35 million sentences
English newspaper corpus with 13 million sentences
4.2. Statistics : 4.2. Statistics
4.3. Assessing true error rate : 4.3. Assessing true error rate Typical sample list of words considered as wrong due to CELEX:
Tau-sende Tausend-e
senegales-isch-e senegalesisch-e
sensibelst-en sens-ibel-sten
separat-ist-isch-e separ-at-istisch-e
tris-t trist
triump-hal triumph-al
trock-en trocken
unueber-troff-en un-uebertroffen
trop-f-en tropf-en
trotz-t-en trotz-ten
ver-traeum-t-e vertraeumt-e
Reasons:
Gender –e (in (Creutz & Lagus 05) for example counted as correct)
compounds (sometimes separated, sometimes not)
-t-en Error
With proper names –isch often not analyzed
Connecting elements
4.4. Real example : 4.4. Real example Orien-tal
Orien-tal-ische
Orien-tal-ist
Orien-tal-ist-en
Orien-tal-ist-ik
Orien-tal-ist-in
Orient-ier-ung
Orient-ier-ungen
Orient-ier-ungs-hilf-e
Orient-ier-ungs-hilf-en
Orient-ier-ungs-los-igkeit
Orient-ier-ungs-punkt
Orient-ier-ungs-punkt-e
Orient-ier-ungs-stuf-e Ver-trau-enskrise
Ver-trau-ensleute
Ver-trau-ens-mann
Ver-trau-ens-sache
Ver-trau-ensvorschuß
Ver-trau-ensvo-tum
Ver-trau-ens-würd-igkeit
Ver-traut-es
Ver-trieb-en
Ver-trieb-spartn-er
Ver-triebene
Ver-triebenenverbände
Ver-triebs-beleg-e
5. Further research : 5. Further research Examine quality on various language types
Improve trie-based classificator
Possibly combine with other existing algorithms
Find out how to acquire morphology of non-concatenative languages
Deeper analysis:
find deletions
alternations
insertions
morpheme classes etc.
6. References : 6. References (Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Effcient unsupervized recursive word segmentation using minimun desctiption length. In Proceedings of Coling 2004, Geneva, Switzerland, 2004. GLDV-Tagung, pages 93-99, Leipzig, March 1998. Deutscher Universitätsverlag.
(Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003. France, http://www.sle.sharp.co.uk/senseval2/, 5-6 July 2001.
(Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using morfessor 1.0. In Publications in Computer and Information Science, Report A81. Helsinki University of Technology, March 2005.
(Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998.
(Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
6. References II : 6. References II (Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001.
(Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.
(Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955.
(Kazakov 97) Dimitar Kazakov. Unsupervised learning of na¨ive morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997.
(Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations. 2002.
(Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language-independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001, Seattle, WA., August 2001. Machine Learning: Beyond Supervision.
E. Gender-e vs. Frequency-e : E. Gender-e vs. Frequency-e vs. Gender-e
Schule 8.4
Devise 7.8
Sonne 4.5
Abendsonne 5.3
Abende 5.5
Liste 6.5
vs. other-e
andere 8.4
keine 6.8
rote 11.6
stolze 8.0
drehte 10.8
winzige 9.7
lustige 13.2
rufe 4.4
Dumme 12.6 Frequency-e
Affe 2.7
Junge 5.3
Knabe 4.6
Bursche 2.4
Backstage 3.0