Cognate or False Friend? Ask the Web! : Cognate or False Friend? Ask the Web! Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Preslav Nakov, University of California, Berkeley
Elena Paskaleva, Bulgarian Academy of Sciences
A Workshop on Acquisition and Management of Multilingual Lexicons
Introduction : Introduction Cognates and false friends
Cognates are pairs of words in different languages that sound similar and are translations of each other
False friends are pairs of words in two languages that sound similar but differ in their meanings
The problem
Design an algorithm that can distinguish between cognates and false friends
Cognates and False Friends : Cognates and False Friends Examples of cognates
ден in Bulgarian = день in Russian (day)
idea in English = идея in Bulgarian (idea)
Examples of false friends
майка in Bulgarian (mother) ≠ майка in Russian (vest)
prost in German (cheers) ≠ прост in Bulgarian (stupid)
gift in German (poison) ≠ gift in English (present)
The Paper in One Slide : The Paper in One Slide Measuring semantic similarity
Analyze the words' local contexts
Use the Web as a corpus
Similar contexts → similar words
Context translation → cross-lingual similarity
Evaluation
200 pairs of words
100 cognates and 100 false friends
11pt average precision: 95.84%
Contextual Web Similarity : Contextual Web Similarity What is local context?
A few words before and after the target word
The words in the local context of a given word are semantically related to it
Need to exclude the stop words: prepositions, pronouns, conjunctions, etc.
Stop words appear in all contexts
Need for a sufficiently big corpus
Example text: "Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers."
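The idea of a local context can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `local_context`, the tiny `STOP_WORDS` set, and the choice to drop stop words before cutting the window are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"of", "and", "from", "our", "by", "for", "the", "a", "in"}

EXCERPT = ("Same day delivery of fresh flowers, roses, and unique gift baskets "
           "from our online boutique. Flower delivery online by local florists "
           "for birthday flowers.")

def local_context(text, target, size=3, stop_words=STOP_WORDS):
    """Count the words found within `size` positions of each `target` occurrence."""
    # Lowercase, tokenize on word characters, and drop stop words first
    # (assumption: the window is measured over the remaining content words).
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stop_words]
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            window = tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]
            counts.update(w for w in window if w != target)
    return counts

if __name__ == "__main__":
    print(local_context(EXCERPT, "flowers"))
```

Without lemmatization, "flower" and "flowers" are treated as different words, so only the two occurrences of "flowers" contribute here.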
Contextual Web Similarity : Contextual Web Similarity Web as a corpus
The Web can be used as a corpus to extract the local context for a given word
The Web is the largest possible corpus
Contains big corpora in any language
Searching for a word in Google can return up to 1 000 text excerpts
The target word is given along with its local context: a few words before and after it
Target language can be specified
Contextual Web Similarity : Contextual Web Similarity Web as a corpus
Example: Google query for "flower"
Contextual Web Similarity : Contextual Web Similarity Measuring semantic similarity
For two given words, their local contexts are extracted from the Web
A set of words and their frequencies
Semantic similarity is measured as similarity between these local contexts
Local contexts are represented as frequency vectors for a given set of words
The cosine between the frequency vectors in Euclidean space is calculated
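As a small illustration of the cosine measure, the sketch below compares two sparse frequency vectors stored as dictionaries. The function name `cosine` and the word counts are made up for the example and do not come from the experiments.

```python
import math

def cosine(freq1, freq2):
    """Cosine between two sparse frequency vectors given as {word: count} dicts."""
    dot = sum(n * freq2.get(w, 0) for w, n in freq1.items())
    norm1 = math.sqrt(sum(n * n for n in freq1.values()))
    norm2 = math.sqrt(sum(n * n for n in freq2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty context is treated as unrelated to everything
    return dot / (norm1 * norm2)

if __name__ == "__main__":
    # Invented counts, for illustration only.
    flower = {"delivery": 3, "roses": 2, "online": 1}
    computer = {"software": 4, "online": 2}
    print(cosine(flower, computer))
```

Only words that appear in both contexts contribute to the dot product, so two words with disjoint contexts get a similarity of 0.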
Contextual Web Similarity : Contextual Web Similarity Example of context word frequencies for the words "flower" and "computer"
Contextual Web Similarity : Contextual Web Similarity Example of frequency vectors
Similarity = cosine(v1, v2), where v1 is the vector for "flower" and v2 is the vector for "computer"
Cross-Lingual Similarity : Cross-Lingual Similarity We are given two words in different languages L1 and L2
We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}
Measuring cross-lingual similarity:
We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2
We translate the context C1 into L2 using the glossary G, obtaining C1*
We measure the distance between C1* and C2
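The three steps can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the glossary is modeled as a plain dictionary from L1 words to L2 words, and dropping context words that are missing from the glossary (on both sides) is an assumption, as is the use of the cosine as the comparison. The counts are invented.

```python
import math

def cosine(v1, v2):
    """Cosine between two sparse {word: count} vectors."""
    dot = sum(n * v2.get(w, 0) for w, n in v1.items())
    norm1 = math.sqrt(sum(n * n for n in v1.values()))
    norm2 = math.sqrt(sum(n * n for n in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def translate_context(c1, glossary):
    """Map an L1 context into L2 through the glossary; unknown words are dropped."""
    c1_star = {}
    for word, count in c1.items():
        if word in glossary:
            q = glossary[word]
            c1_star[q] = c1_star.get(q, 0) + count
    return c1_star

def cross_lingual_similarity(c1, c2, glossary):
    """Compare the translated context C1* with C2, restricted to glossary words."""
    c1_star = translate_context(c1, glossary)
    known = set(glossary.values())
    c2_known = {w: n for w, n in c2.items() if w in known}
    return cosine(c1_star, c2_known)

if __name__ == "__main__":
    glossary = {"слънце": "солнце", "нощ": "ночь"}   # Bulgarian -> Russian
    c1 = {"слънце": 2, "нощ": 1, "онлайн": 5}        # invented counts
    c2 = {"солнце": 4, "ночь": 2, "сайт": 7}         # invented counts
    print(cross_lingual_similarity(c1, c2, glossary))
```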
Reverse Context Lookup : Reverse Context Lookup Local context extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc.
Internet terms appear in any Web page
Such words are not likely to be associated with the target word
Example (for the word flowers)
"send flowers online", "flowers here", "order flowers here"
Will the word "flowers" appear in the local context of "send", "online" and "here"?
Reverse Context Lookup : Reverse Context Lookup If two words are semantically related, both should appear in the local contexts of each other
Let #(x, y) = number of occurrences of x in the local context of y
For any word w and a word from its local context wc, we define their strength of semantic association p(w,wc) as follows:
p(w, wc) = min{ #(w, wc), #(wc,w) }
We use p(w,wc) as vector coordinates when measuring semantic similarity
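A minimal sketch of the reverse lookup, assuming the local contexts of all relevant words have already been collected into a dictionary `contexts` (a hypothetical structure mapping each word to the counts of the words seen around it). The counts below are invented.

```python
def association_strengths(word, contexts):
    """Return p(word, wc) = min(#(word, wc), #(wc, word)) for each context word wc.

    contexts[y][x] holds #(x, y): how often x occurs in the local context of y.
    """
    strengths = {}
    for wc, forward in contexts.get(word, {}).items():
        # forward = #(wc, word); reverse = #(word, wc), found by looking up
        # the local context of wc and checking whether `word` occurs there.
        reverse = contexts.get(wc, {}).get(word, 0)
        p = min(forward, reverse)
        if p > 0:
            strengths[wc] = p
    return strengths

if __name__ == "__main__":
    contexts = {
        "flowers": {"send": 5, "online": 7, "roses": 4},
        "send": {"flowers": 3, "email": 9},
        "online": {"shop": 8},            # "flowers" never appears here
        "roses": {"flowers": 6},
    }
    print(association_strengths("flowers", contexts))
```

In this toy data "online" is frequent around "flowers", but "flowers" never shows up around "online", so its strength is 0 and it drops out of the vector.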
Web Similarity Using Seed Words : Web Similarity Using Seed Words Adaptation of the Fung&Yee'98 algorithm*
We have a bilingual glossary G: L1 → L2 of translation pairs and target words w1, w2
We search Google for co-occurrences of the target words with the glossary entries
Compare the co-occurrence vectors
for each {p, q} ∈ G compare
max(google#("w1 p"), google#("p w1"))
with
max(google#("w2 q"), google#("q w2"))
* P. Fung and L. Y. Yee. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414–420, 1998
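The comparison above can be sketched as follows. The function `hits` is a hypothetical stand-in for google#, that is, any callable returning the number of pages matching an exact phrase; no real search API is assumed. The slides do not say how the two co-occurrence vectors are compared, so the cosine used here is an assumption.

```python
import math

def seed_vector(target, seeds, hits):
    """One coordinate per seed: the larger of the two adjacent-phrase hit counts."""
    return [max(hits(f"{target} {s}"), hits(f"{s} {target}")) for s in seeds]

def cosine(v1, v2):
    """Cosine between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def seed_similarity(w1, w2, glossary, hits):
    """Compare w1 and w2 through their co-occurrences with the glossary pairs."""
    v1 = seed_vector(w1, [p for p, _ in glossary], hits)
    v2 = seed_vector(w2, [q for _, q in glossary], hits)
    return cosine(v1, v2)

if __name__ == "__main__":
    # Fake hit counts, for illustration only.
    fake = {"w1 a": 10, "a w1": 4, "b w1": 2, "w2 x": 5, "x w2": 1, "w2 y": 1}
    hits = lambda phrase: fake.get(phrase, 0)
    print(seed_similarity("w1", "w2", [("a", "x"), ("b", "y")], hits))
```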
Evaluation Data Set : Evaluation Data Set We use 200 Bulgarian/Russian pairs of words:
100 cognates and 100 false friends
Manually assembled by a linguist
Manually checked in several large monolingual and bilingual dictionaries
Limited to nouns only
Experiments : Experiments We tested a few modifications of our contextual Web similarity algorithm
Use of TF.IDF weighting
Preserve the stop words
Use of lemmatization of the context words
Use of different context sizes (2, 3, 4 and 5)
Use of a small and a large bilingual glossary
Compared it with the seed words algorithm
Compared with traditional orthographic similarity measures: LCSR and MEDR
Experiments : Experiments BASELINE: random
MEDR: minimum edit distance ratio
LCSR: longest common subsequence ratio
SEED: the "seed words" algorithm
WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop words filtering, no lemmatization, no reverse context lookup, no TF.IDF-weighting
NO-STOP: WEB3 without stop words removal
WEB1, WEB2, WEB4 and WEB5: WEB3 with context size of 1, 2, 4 and 5
LEMMA: WEB3 with lemmatization
HUGEDICT: WEB3 with the huge glossary
REVERSE: the "reverse context lookup" algorithm
COMBINED: WEB3 + lemmatization + huge glossary + reverse context lookup
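For reference, the two orthographic baselines in the list above can be sketched as below. The slides do not spell out the formulas, so this assumes the common definitions: LCSR is the length of the longest common subsequence divided by the length of the longer word, and MEDR is one minus the edit distance divided by the length of the longer word.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming, one row)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0

def medr(a, b):
    longest = max(len(a), len(b))
    return 1 - edit_distance(a, b) / longest if longest else 1.0

if __name__ == "__main__":
    print(lcsr("ден", "день"), medr("ден", "день"))
    print(lcsr("майка", "майка"), medr("майка", "майка"))
```

Both measures give the maximum score of 1.0 to the false friends майка / майка, which is why spelling alone cannot separate them from cognates.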
Resources : Resources We used the following resources:
Bilingual Bulgarian / Russian glossary: 3 794 pairs of translation words
Huge bilingual glossary: 59 583 word pairs
A list of 599 Bulgarian stop words
A list of 508 Russian stop words
Bulgarian lemma dictionary: 1 000 000 wordforms and 70 000 lemmata
Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata
Evaluation : Evaluation We order the pairs of words from the testing dataset by the calculated similarity
False friends are expected to appear at the top and the cognates at the bottom
We evaluate the 11pt average precision of the obtained ordering
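The evaluation measure can be sketched as follows, assuming the usual definition of 11-point interpolated average precision: precision is interpolated at recall levels 0.0, 0.1, ..., 1.0 as the best precision reached at that recall or higher, and the eleven values are averaged. False friends are treated as the relevant items; the function name and input format are hypothetical.

```python
def eleven_point_average_precision(ranked_labels):
    """ranked_labels: booleans in ranked order, True where the pair is a false friend."""
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0
    # (recall, precision) measured at every position that holds a relevant item.
    points = []
    found = 0
    for rank, is_relevant in enumerate(ranked_labels, 1):
        if is_relevant:
            found += 1
            points.append((found / total_relevant, found / rank))
    interpolated = []
    for i in range(11):
        level = i / 10
        # Best precision at any recall >= level (small tolerance for float error).
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / 11

if __name__ == "__main__":
    # Pairs sorted by increasing similarity, so false friends should come first.
    print(eleven_point_average_precision([True, True, False, False]))
    print(eleven_point_average_precision([True, False, True]))
```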
Results (11pt Average Precision) : Results (11pt Average Precision) Comparing BASELINE, LCSR, MEDR, SEED and WEB3 algorithms
Results (11pt Average Precision) : Results (11pt Average Precision) Comparing different context sizes; keeping the stop words
Results (11pt Average Precision) : Results (11pt Average Precision) Comparing different improvements of the WEB3 algorithm
Results (Precision-Recall Graph) : Results (Precision-Recall Graph) Comparing the recall-precision graphs of evaluated algorithms
Results: The Ordering for WEB3 : Results: The Ordering for WEB3
Discussion : Discussion Our approach is original because:
Introduces a semantic similarity measure
Not orthographic or phonetic
Uses the Web as a corpus
Does not rely on any preexisting corpora
Uses reverse-context lookup
Significant improvement in quality
Is applied to an original problem
Classification of almost identically spelled true/false friends
Discussion : Discussion Very good results: over 95% 11pt average precision
It is not 100% accurate
Typical mistakes are synonyms, hyponyms, words influenced by cultural, historical and geographical differences
The Web as a corpus introduces noise
Google returns the first 1 000 results only
Google ranks news portals, travel agencies and retail sites higher than books, articles and forum posts
Local context could contain noise
Conclusion and Future Work : Conclusion and Future Work Conclusion
Algorithm that can distinguish between cognates and false friends
Analyzes words' local contexts, using the Web as a corpus
Future Work
Better glossaries
Automatic augmentation of the glossary
Different language pairs
Questions? : Questions? Cognate or False Friend? Ask the Web!