224u 07 lec3


Word Sense Disambiguation: 

Word Sense Disambiguation CS 224U – 2007 Much borrowed material from slides by Ted Pedersen, Massimo Poesio, Dan Jurafsky, Andras Csomai, and Jim Martin

Word senses: 

Word senses pike

An example LEXICAL ENTRY from a machine-readable dictionary: STOCK, from the LDOCE:

An example LEXICAL ENTRY from a machine-readable dictionary: STOCK, from the LDOCE
0100 a supply (of something) for use: a good stock of food
0200 goods for sale: Some of the stock is being taken without being paid for
0300 the thick part of a tree trunk
0400 (a) a piece of wood used as a support or handle, as for a gun or tool (b) the piece which goes across the top of an ANCHOR^1 (1) from side to side
0500 (a) a plant from which CUTTINGs are grown (b) a stem onto which another plant is GRAFTed
0600 a group of animals used for breeding
0700 farm animals usu. cattle; LIVESTOCK
0800 a family line, esp. of the stated character
0900 money lent to a government at a fixed rate of interest
1000 the money (CAPITAL) owned by a company, divided into SHAREs
1100 a type of garden flower with a sweet smell
1200 a liquid made from the juices of meat, bones, etc., used in cooking
…..

WORD SENSE DISAMBIGUATION: 

WORD SENSE DISAMBIGUATION

Identifying the sense of a word in its context: 

Identifying the sense of a word in its context
The task of Word Sense Disambiguation is to determine which of the various senses of a word is invoked in context:
  the seed companies cut off the tassels of each plant, making it male sterile
  Nissan's Tennessee manufacturing plant beat back a United Auto Workers organizing effort with aggressive tactics
This is generally viewed as a categorization/tagging task, so it is similar to POS tagging.
But this is a simplification! There is less agreement on what the senses are, so the UPPER BOUND is lower.
Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory. It involves unsupervised techniques.
Clear potential uses include Machine Translation, Information Retrieval, Question Answering, Knowledge Acquisition, even Parsing, though in practice the implementation path hasn’t always been clear.

Early Days of WSD: 

Early Days of WSD
Noted as a problem for Machine Translation (Weaver, 1949)
A word can often only be translated if you know the specific sense intended (a bill in English could be a pico or a cuenta in Spanish)
Bar-Hillel (1960) posed the following problem:
  Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
Is “pen” a writing instrument or an enclosure where children play?
…declared it unsolvable, and left the field of MT (!):
“Assume, for simplicity’s sake, that pen in English has only the following two meanings: (1) a certain writing utensil, (2) an enclosure where small children can play. I now claim that no existing or imaginable program will enable an electronic computer to determine that the word pen in the given sentence within the given context has the second of the above meanings, whereas every reader with a sufficient knowledge of English will do this ‘automatically’.” (1960, p. 159)

Bar-Hillel: 

Bar-Hillel "Let me state rather dogmatically that there exists at this moment no method of reducing the polysemy of the, say, twenty words of an average Russian sentence in a scientific article below a remainder of, I would estimate, at least five or six words with multiple English renderings, which would not seriously endanger the quality of the machine output. Many tend to believe that by reducing the number of initially possible renderings of a twenty word Russian sentence from a few tens of thousands (which is the approximate number resulting from the assumption that each of the twenty Russian words has two renderings on the average, while seven or eight of them have only one rendering) to some eighty (which would be the number of renderings on the assumption that sixteen words are uniquely rendered and four have three renderings apiece, forgetting now about all the other aspects such as change of word order, etc.) the main bulk of this kind of work has been achieved, the remainder requiring only some slight additional effort" (Bar-Hillel, 1960, p. 163).

Identifying the sense of a word in its context: 

Identifying the sense of a word in its context
Most early work used semantic networks, frames, logical reasoning, or "expert system" methods for disambiguation based on contexts (e.g., Small 1980, Hirst 1988).
The problem got quite out of hand: the word expert for "throw" is "currently six pages long, but should be ten times that size" (Small and Rieger 1982).
Supervised machine learning sense disambiguation through use of context is frequently extremely successful, and is a straightforward classification problem.
However, it requires extensive annotated training data.
Much recent work focuses on minimizing the need for annotation.

Philosophy: 

Philosophy
"You shall know a word by the company it keeps" -- Firth
"You say: the point isn't the word, but its meaning, and you think of the meaning as a thing of the same kind as the word, though also different from the word. Here the word, there the meaning. The money, and the cow that you can buy with it. (But contrast: money, and its use.)" -- Wittgenstein, Philosophical Investigations
"For a large class of cases---though not for all---in which we employ the word 'meaning' it can be defined thus: the meaning of a word is its use in the language." -- Wittgenstein, Philosophical Investigations

Corpora used for word sense disambiguation work: 

Corpora used for word sense disambiguation work
Sense Annotated (difficult and expensive to build):
  Semcor (200,000 words from Brown)
  DSO (192,000 semantically annotated occurrences of 121 nouns and 70 verbs)
  Training data for Senseval competitions (lexical samples and running text)
Non Annotated (available in large quantity):
  newswire, Web, …

modest: 

modest
In evident apprehension that such a prospect might frighten off the young or composers of more modest_1 forms --
Tort reform statutes in thirty-nine states have effected modest_9 changes of substantive and remedial law
The modest_9 premises are announced with a modest and simple name -
In the year before the Nobel Foundation belatedly honoured this modest_0 and unassuming individual,
LinkWay is IBM's response to HyperCard, and in Glasgow (its UK launch) it impressed many by providing colour, by its modest_9 memory requirements,
In a modest_1 mews opposite TV-AM there is a rumpled hyperactive figure
He is also modest_0: the "help to" is a nice touch.

SEMCOR: 

SEMCOR
<contextfile concordance="brown">
<context filename="br-h15" paras="yes">
…..
<wf cmd="ignore" pos="IN">in</wf>
<wf cmd="done" pos="NN" lemma="fig" wnsn="1" lexsn="1:10:00::">fig.</wf>
<wf cmd="done" pos="NN" lemma="6" wnsn="1" lexsn="1:23:00::">6</wf>
<punc>)</punc>
<wf cmd="done" pos="VBP" ot="notag">are</wf>
<wf cmd="done" pos="VB" lemma="slip" wnsn="3" lexsn="2:38:00::">slipped</wf>
<wf cmd="ignore" pos="IN">into</wf>
<wf cmd="done" pos="NN" lemma="place" wnsn="9" lexsn="1:15:05::">place</wf>
<wf cmd="ignore" pos="IN">across</wf>
<wf cmd="ignore" pos="DT">the</wf>
<wf cmd="done" pos="NN" lemma="roof" wnsn="1" lexsn="1:06:00::">roof</wf>
<wf cmd="done" pos="NN" lemma="beam" wnsn="2" lexsn="1:06:00::">beams</wf>
<punc>,</punc>

Dictionary-based approaches: 

Dictionary-based approaches
Lesk (1986):
  Retrieve from the MRD all sense definitions of the word to be disambiguated
  Compare with sense definitions of words in context
  Choose the sense with the most overlap
Example:
  PINE
    1 kinds of evergreen tree with needle-shaped leaves
    2 waste away through sorrow or illness
  CONE
    1 solid body which narrows to a point
    2 something of this shape whether solid or hollow
    3 fruit of certain evergreen trees
Disambiguate: PINE CONE
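The overlap idea in the PINE CONE example can be sketched in a few lines of Python. This is a minimal illustration of simplified Lesk, not Lesk's original implementation: the definitions are copied from the example, while the stop list, the whitespace tokenizer, and the function names are assumptions made for the sketch.

```python
# Toy dictionary copied from the PINE / CONE example.
DEFINITIONS = {
    "pine": {
        1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness",
    },
    "cone": {
        1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees",
    },
}

# Assumed stop list; the example does not specify one.
STOPWORDS = {"of", "to", "a", "the", "or", "with", "which", "this",
             "through", "whether"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def lesk(target, context_word):
    """Pick the sense of `target` whose definition shares the most
    content words with all definitions of `context_word`."""
    context_bag = set()
    for definition in DEFINITIONS[context_word].values():
        context_bag |= content_words(definition)
    overlaps = {
        sense: len(content_words(definition) & context_bag)
        for sense, definition in DEFINITIONS[target].items()
    }
    best = max(overlaps, key=overlaps.get)
    return best, overlaps


print(lesk("pine", "cone"))  # sense 1 of PINE wins via "evergreen"
print(lesk("cone", "pine"))  # sense 3 of CONE wins via "evergreen"
```

Note that "tree" and "trees" do not match here because the sketch does no stemming; the only shared content word is "evergreen".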

Frequency-based word-sense disambiguation: 

Frequency-based word-sense disambiguation
If you have a corpus in which each word is annotated with its sense, you can collect unigram statistics (count the number of times each sense occurs in the corpus):
  P(SENSE)
  P(SENSE|WORD)
E.g., if you have 5845 uses of the word bridge:
  5641 cases in which it is tagged with the sense STRUCTURE
  194 instances with the sense DENTAL-DEVICE
Frequency-based WSD can get about 60-70% correct!
The WordNet first sense heuristic is good!
To improve upon these results, we need context.
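As a rough sketch, the most-frequent-sense baseline amounts to counting sense tags and taking the maximum. The two counts below are the ones given for bridge; normalizing over just these two listed senses is an assumption of the sketch (the stated total of 5845 may include uses not itemized here), and the function names are illustrative.

```python
from collections import Counter

# Sense counts for "bridge" as listed in the example.
sense_counts = Counter({"STRUCTURE": 5641, "DENTAL-DEVICE": 194})


def sense_distribution(counts):
    """Relative-frequency estimate of P(SENSE|WORD) over the listed senses."""
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}


def most_frequent_sense(counts):
    """The baseline prediction: always answer with the commonest sense."""
    return max(counts, key=counts.get)


dist = sense_distribution(sense_counts)
print(most_frequent_sense(sense_counts))   # STRUCTURE
print(round(dist["STRUCTURE"], 3))
```

A word this skewed is easy for the baseline; the 60-70% figure quoted above is an average over words with much flatter sense distributions.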

Traditional selectional restrictions: 

Traditional selectional restrictions
One type of contextual information is the information about the type of arguments that a verb takes – its SELECTIONAL RESTRICTIONS:
  AGENT EAT FOOD-STUFF
  AGENT DRIVE VEHICLE
Example:
  Which airlines serve DENVER?
  Which airlines serve BREAKFAST?
Limitations:
  In his two championship trials, Mr. Kulkarni ATE GLASS on an empty stomach, accompanied only by water and tea.
  But it fell apart in 1931, perhaps because people realized that you can’t EAT GOLD for lunch if you’re hungry
Resnik (1998): 44% with these methods

Context in general: 

Context in general But it’s not just classic selectional restrictions that are useful context Often simply knowing the topic is really useful!

Supervised approaches to WSD: the rebirth of Naïve Bayes in CompLing: 

Supervised approaches to WSD: the rebirth of Naïve Bayes in CompLing A Naïve Bayes Classifier chooses the most probable sense for a word given the context: As usual, this can be expressed as: The “NAÏVE” ASSUMPTION: all the features are independent
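The formulas for this slide are not preserved in the transcript. A standard way to write the Naïve Bayes decision rule is given below; the symbols are assumptions chosen for this sketch ($s$ for a candidate sense, $C$ for the context, $f_1, \dots, f_n$ for the features extracted from the context), not necessarily the notation used on the original slide.

```latex
\begin{align*}
\hat{s} &= \arg\max_{s} P(s \mid C) \\
        &= \arg\max_{s} \frac{P(C \mid s)\, P(s)}{P(C)}
         = \arg\max_{s} P(C \mid s)\, P(s) \\
P(C \mid s) &\approx \prod_{j=1}^{n} P(f_j \mid s)
  \qquad \text{(the na\"{\i}ve independence assumption)}
\end{align*}
```

The denominator $P(C)$ is dropped because it is the same for every sense, so it does not affect which sense maximizes the expression.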

An example of use of Naïve Bayes classifiers: Gale, Church, and Yarowsky (1992): 

An example of use of Naïve Bayes classifiers: Gale, Church, and Yarowsky (1992) Used this method to disambiguate word senses, using an ALIGNED CORPUS (Hansard) to get the word senses

Gale et al: words as contextual clues: 

Gale et al: words as contextual clues
Gale et al view a ‘context’ as a set of words
Good clues for the different senses of DRUG:
  Medication: prices, prescription, patent, increase, consumer, pharmaceutical
  Illegal substance: abuse, paraphernalia, illicit, alcohol, cocaine, traffickers
To determine which interpretation is more likely, extract words (e.g. ABUSE) from context, and use P(abuse|medicament), P(abuse|drogue)
To estimate these probabilities, use SMOOTHED relative frequencies:
  P(abuse|medicament) ≈ C(abuse, medicament) / C(medicament)
  P(medicament) ≈ C(medicament) / C(drug)
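A small Python sketch of a bag-of-words Naïve Bayes sense classifier along these lines follows. It is not Gale et al's system: the four training contexts are hypothetical (built from the clue words listed above, not from Hansard), add-one smoothing is an assumed choice of smoothing, and C(sense) in the word model is taken to be the number of word tokens seen with that sense.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sense-tagged contexts; NOT real Hansard data.
TRAINING = [
    ("medicament", ["prices", "prescription", "patent"]),
    ("medicament", ["consumer", "pharmaceutical", "prices"]),
    ("drogue", ["abuse", "illicit", "cocaine"]),
    ("drogue", ["traffickers", "abuse", "alcohol"]),
]


def train(examples):
    """Collect sense counts, per-sense word counts, and the vocabulary."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def classify(context, sense_counts, word_counts, vocab):
    """Return (best_sense, log_scores) for a list of context words."""
    total = sum(sense_counts.values())
    scores = {}
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log P(sense)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            if w in vocab:  # words never seen in training are skipped
                # add-one smoothed log P(w | sense)
                score += math.log((word_counts[sense][w] + 1) / denom)
        scores[sense] = score
    return max(scores, key=scores.get), scores


model = train(TRAINING)
print(classify(["abuse", "cocaine"], *model)[0])        # drogue
print(classify(["prescription", "prices"], *model)[0])  # medicament
```

Working in log space avoids numerical underflow when the context contains many words.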

Gale, Church, and Yarowsky (1992): EDA: 

Gale, Church, and Yarowsky (1992): EDA

Results: 

Results
Gale et al (1992): a disambiguation system using this algorithm was correct for about 90% of occurrences of six ambiguous nouns in the Hansard corpus: duty, drug, land, language, position, sentence
Good clues for drug:
  medication sense: prices, prescription, patent, increase
  illegal substance sense: abuse, paraphernalia, illicit, alcohol, cocaine, traffickers
BUT THIS WAS FOR TWO CLEARLY DIFFERENT SENSES
Of course, that may be the most important case to get right…

Broad context vs. Collocations: 

Broad context vs. Collocations

Other methods for WSD: 

Other methods for WSD
Supervised:
  Brown et al, 1991: using mutual information to combine senses into groups
  Yarowsky (1992): using a thesaurus and a topic-classified corpus
  More recently, any machine learning method whose name you know
Unsupervised: sense DISCRIMINATION
  Schuetze 1996: using EM algorithm based clustering, LSA
Mixed:
  Yarowsky’s 1995 bootstrapping algorithm
  Quite cool
  A pioneering example of context and content constraining each other. More on this later
Principles:
  One sense per collocation
  One sense per discourse

Evaluation: 

Evaluation
Baseline: is the system good or an improvement?
  Unsupervised: Random, Simple-Lesk
  Supervised: Most Frequent, Lesk-plus-corpus
Upper bound: agreement between humans?

SENSEVAL: 

SENSEVAL
Goals:
  Provide a common framework to compare WSD systems
  Standardise the task (especially evaluation procedures)
  Build and distribute new lexical resources (dictionaries and sense tagged corpora)
Web site: http://www.senseval.org/
“There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of Senseval is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.” from: http://www.sle.sharp.co.uk/senseval2

SENSEVAL History: 

SENSEVAL History
ACL-SIGLEX workshop (1997): Yarowsky and Resnik paper
SENSEVAL-I (1998): Lexical Sample for English, French, and Italian
SENSEVAL-II (Toulouse, 2001): Lexical Sample and All Words
Organization: Kilgarriff (Brighton)
SENSEVAL-III (2004)
SENSEVAL-IV -> SEMEVAL (2007)

WSD at SENSEVAL-II: 

WSD at SENSEVAL-II Choosing the right sense for a word among those of WordNet

English All Words: All N, V, Adj, Adv: 

English All Words: All N, V, Adj, Adv
Data: 3 texts for a total of 1770 words
Average polysemy: 6.5
Example: (part of) Text 1
The art of change-ringing is peculiar to the English and, like most English peculiarities , unintelligible to the rest of the world . -- Dorothy L. Sayers , " The Nine Tailors "
ASLACTON , England -- Of all scenes that evoke rural England , this is one of the loveliest : An ancient stone church stands amid the fields , the sound of bells cascading from its tower , calling the faithful to evensong . The parishioners of St. Michael and All Angels stop to chat at the church door , as members here always have . […]

English All Words Systems: 

English All Words Systems
Unsupervised (6):
  UNED (relevance matrix over Gutenberg project corpus)
  Illinois (Lexical Proximity)
  Malaysia (MTD, Machine Tractable Dictionary)
  Litkowski (New Oxford Dictionary and Contextual Clues)
  Sheffield (Anaphora and WN hierarchy)
  IRST (WordNet Domains)
Supervised (5):
  S. Sebastian (decision lists in Semcor)
  UCLA (Semcor, Semantic Distance and Density, AltaVista for frequency)
  Sinequa (Semcor and Semantic Classes)
  Antwerp (Semcor, Memory Based Learning)
  Moldovan (Semcor plus an additional sense tagged corpus, heuristics)

English Lexical Sample: 

English Lexical Sample
Data: 8699 texts for 73 words
Average WN polysemy: 9.22
Training Data: 8166 (average 118/word)
Baseline (commonest): 0.47 precision
Baseline (Lesk): 0.51 precision

Lexical Sample: 

Lexical Sample
Example: to leave
<instance id="leave.130">
<context>
I 'd been seeing Johnnie almost a year now, but I still didn't want to <head>leave</head> him for five whole days.
</context>
</instance>
<instance id="leave.157">
<context>
And he saw them all as he walked up and down. At two that morning, he was still walking -- up and down Peony, up and down the veranda, up and down the silent, moonlit beach. Finally, in desperation, he opened the refrigerator, filched her hand lotion, and <head>left</head> a note.
</context>
</instance>
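Instances in this format can be read with Python's standard-library XML parser. The sketch below is only an illustration: the enclosing `<corpus>` element and the `read_instances` helper are assumptions added so that the two instances above form one parseable document, and the real Senseval distribution files may be organized differently.

```python
import xml.etree.ElementTree as ET

# The two instances above, wrapped in an assumed <corpus> root element.
SAMPLE = """<corpus>
<instance id="leave.130">
<context>
I 'd been seeing Johnnie almost a year now, but I still didn't want to <head>leave</head> him for five whole days.
</context>
</instance>
<instance id="leave.157">
<context>
And he saw them all as he walked up and down. At two that morning, he was still walking -- up and down Peony, up and down the veranda, up and down the silent, moonlit beach. Finally, in desperation, he opened the refrigerator, filched her hand lotion, and <head>left</head> a note.
</context>
</instance>
</corpus>"""


def read_instances(xml_text):
    """Return (instance id, target word, left context, right context) tuples."""
    rows = []
    for inst in ET.fromstring(xml_text).iter("instance"):
        context = inst.find("context")
        head = context.find("head")
        left = (context.text or "").split()   # text before <head>
        right = (head.tail or "").split()     # text after </head>
        rows.append((inst.get("id"), head.text, left, right))
    return rows


for inst_id, target, left, right in read_instances(SAMPLE):
    # Show a two-word window on each side of the target word.
    print(inst_id, left[-2:], target, right[:2])
```

Splitting the context at the `<head>` element gives the left and right windows that collocation-style features are usually drawn from.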

English Lexical Sample Systems: 

English Lexical Sample Systems
Unsupervised (5): Sunderland, UNED, Illinois, Litkowski, ITRI
Supervised (12): S. Sebastian, Sinequa, CS 224N, Pedersen, Korea, Yarowsky, Resnik, Pennsylvania, Barcelona, Moldovan, Alicante, IRST

Finding Predominant Word Senses in Untagged Text: 

Finding Predominant Word Senses in Untagged Text Diana McCarthy & Rob Koeling & Julie Weeds & John Carroll

Predominant senses: 

Predominant senses

First sense Heuristic: 

First sense Heuristic

The power of the first sense heuristic: 

The power of the first sense heuristic

Finding predominant senses: 

Finding predominant senses Why do you need automated methods?

Domain Dependence: 

Domain Dependence E.g. star

Thesaurus: 

Thesaurus How it will be used…

Automatically obtaining a thesaurus: 

Automatically obtaining a thesaurus

Obtaining the thesaurus: 

Obtaining the thesaurus Mutual information of two words given a relation: The original Lin formulation:

Obtaining the thesaurus (continued): 

Obtaining the thesaurus (continued) Distributional similarity Ds(w,n)=

WordNet similarities: 

WordNet similarities
Lesk:
JCN: corpus based
  IC(s) = -log(p(s))
  D(s1,s2) = IC(s1) + IC(s2) - 2 × IC(s3), where s3 is the lowest common subsumer of s1 and s2
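The JCN distance can be computed directly from the two definitions above. In this sketch the sense probabilities are hypothetical numbers chosen for illustration (in practice they would be estimated from corpus counts propagated up the WordNet hierarchy), and the sense names are made up.

```python
import math

# Hypothetical sense probabilities, NOT corpus estimates. A subsumer's
# probability is at least as large as that of anything below it.
P = {"animal": 0.1, "dog": 0.01, "cat": 0.02}


def ic(p):
    """Information content: IC(s) = -log(p(s))."""
    return -math.log(p)


def jcn_distance(p_s1, p_s2, p_lcs):
    """D(s1,s2) = IC(s1) + IC(s2) - 2 * IC(s3), with s3 the lowest
    common subsumer of s1 and s2."""
    return ic(p_s1) + ic(p_s2) - 2 * ic(p_lcs)


# Distance between "dog" and "cat", whose lowest common subsumer is "animal".
print(jcn_distance(P["dog"], P["cat"], P["animal"]))
# A sense compared with itself is its own subsumer, so the distance is zero.
print(jcn_distance(P["dog"], P["dog"], P["dog"]))
```

The more specific (less probable) the shared subsumer, the smaller the distance between the two senses.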

Obtaining predominant sense: 

Obtaining predominant sense For each sense si of word w calculate where
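The formula for this slide is not preserved in the transcript. The sketch below follows the prevalence score as it is usually described for McCarthy et al. (2004), and should be checked against the paper: each distributional neighbour n of w contributes its distributional similarity to w, shared out among the senses of w in proportion to their WordNet similarity to n. The neighbour list for "star" and all scores are hypothetical, as are the function and variable names.

```python
def prevalence_scores(neighbours, sense_sim):
    """Rank the senses of a word.

    neighbours: {neighbour: distributional similarity to the target word}
    sense_sim:  {sense: {neighbour: WordNet similarity}}, where each value is
                assumed to be already maximized over the neighbour's senses.
    """
    scores = {}
    for sense in sense_sim:
        total = 0.0
        for n, dss in neighbours.items():
            # Normalize over all senses of the target word for this neighbour.
            norm = sum(sense_sim[s][n] for s in sense_sim)
            if norm > 0:
                total += dss * sense_sim[sense][n] / norm
        scores[sense] = total
    return scores


# Hypothetical thesaurus entry and similarity scores for "star".
neighbours = {"planet": 0.3, "actor": 0.1}
sense_sim = {
    "celestial_body": {"planet": 0.8, "actor": 0.1},
    "celebrity": {"planet": 0.2, "actor": 0.9},
}

scores = prevalence_scores(neighbours, sense_sim)
print(max(scores, key=scores.get))  # celestial_body
```

With a thesaurus built from a different domain, say entertainment news, "actor" would likely be the stronger neighbour and the ranking could flip, which is the point of the domain dependence example.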

Evaluation on SemCor: 

Evaluation on SemCor
PS – accuracy of finding predominant sense according to SemCor
WSD – WSD accuracy using automatically determined MFS

Senseval 2 evaluation: 

Senseval 2 evaluation
The best system at Senseval 2 obtained 69% precision and recall (it also used SemCor and MFS information)

Domain specific corpora: 

Domain specific corpora

Domain specific results: 

Domain specific results
