Lexical Semantics


Preliminaries:

Preliminaries Lexical semantics is the study of the meanings of individual words. A lexeme is an entry in a lexicon consisting of a pairing of a form with a single meaning representation. A lexicon is a collection of lexemes. A lemma is the grammatical form used to represent a lexeme: the lemma, or citation form, for sing, sang, and sung is sing, while sing, sang, and sung themselves are wordforms. The process of mapping from a wordform to a lemma is called lemmatization.
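
As a minimal sketch, the lemmatization step can be illustrated with NLTK's WordNet-based lemmatizer (this assumes nltk and its wordnet data are installed; it is one possible lemmatizer, not the only one):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for wordform in ["sang", "sung", "sings"]:
    # pos="v" tells the lemmatizer to treat each wordform as a verb
    print(wordform, "->", lemmatizer.lemmatize(wordform, pos="v"))
# expected: sang -> sing, sung -> sing, sings -> sing
```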

Word Sense:

Word Sense The lemma bank can mean "financial institution" or "sloping mound". A word sense is a discrete representation of one aspect of the meaning of a word. Loosely following lexicographic tradition, each sense is written with a superscript on the orthographic form of the lemma, as in bank¹ and bank². When the senses are unrelated, as here, the two senses are HOMONYMS, and the relation between them is one of HOMONYMY. When two senses are related semantically, we call the relationship between them POLYSEMY.

Relation between Senses:

Relation between Senses When the meanings of two senses of two different words (lemmas) are identical or nearly identical, we say the two senses are synonyms: couch/sofa, vomit/throw up, filbert/hazelnut, car/automobile. Antonyms, by contrast, are words with opposite meanings: long/short, big/little, fast/slow.

Relation between Senses:

Relation between Senses One sense is a hyponym of another sense if the first sense is more specific, denoting a subclass of the other. Car is a hyponym of vehicle, dog is a hyponym of animal, and mango is a hyponym of fruit. A semantic field is an attempt to capture a more integrated, or holistic, relationship among entire sets of words from a single domain. Consider the following set of words extracted from the ATIS corpus: reservation, flight, travel, buy, price, cost, fare, rates, meal, plane.
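
A hedged sketch of checking a hyponym relation with NLTK's WordNet interface; the synset identifiers car.n.01 and vehicle.n.01 are assumptions about which senses are intended:

```python
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")          # assumed sense: the automobile
vehicle = wn.synset("vehicle.n.01")  # assumed sense: a conveyance

# car is a hyponym of vehicle if vehicle appears somewhere above car
# in the transitive closure of the hypernym relation.
ancestors = set(car.closure(lambda s: s.hypernyms()))
print(vehicle in ancestors)  # expected: True
```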

WordNet:

WordNet WordNet is a database of lexical relations. WordNet consists of three separate databases: one for nouns, one for verbs, and a third for adjectives and adverbs. Closed-class words are not included in WordNet. Each database consists of a set of lemmas, each one annotated with a set of senses. The WordNet 3.0 release has 117,097 nouns, 11,488 verbs, 22,141 adjectives, and 4,601 adverbs.
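
A small sketch of querying WordNet through NLTK (assumes the wordnet data has been downloaded); each synset pairs a sense with its gloss:

```python
from nltk.corpus import wordnet as wn

# list every sense (synset) WordNet records for the lemma "bass"
for synset in wn.synsets("bass"):
    print(synset.name(), "|", synset.definition())
# prints one line per sense, e.g. the fish senses and the low-tone senses
```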

Event Participants:

Event Participants Thematic Roles. Sasha broke the window. Pat opened the door. The roles of the subjects of the verbs break and open are Breaker and Opener respectively: breaking events have Breakers, and opening events have Openers. Breakers and Openers have something in common. They are both actors, often animate, and they have direct causal responsibility for their events. Thematic roles are one attempt to capture this semantic commonality: Breakers, Openers, and similar participants are all instances of the thematic role AGENT.

The Proposition Bank:

The Proposition Bank The Proposition Bank (PropBank) is a resource of sentences annotated with semantic roles. The English PropBank labels all the sentences in the Penn TreeBank. Each sense of each verb has a specific set of roles, which are given only numbers rather than names: Arg0, Arg1, Arg2, and so on. Arg0 is used to represent the PROTO-AGENT and Arg1 the PROTO-PATIENT; the semantics of the other roles are specific to each verb sense, so the Arg2 of one verb is likely to have nothing in common with the Arg2 of another verb.

PropBank Examples:

Agree
Arg0: Agreer
Arg1: Proposition
Arg2: Other entity agreeing
Ex1: [Arg0 The group] agreed [Arg1 it wouldn’t make an offer unless it had Georgia Gulf’s consent].
Ex2: [ArgM-TMP Usually] [Arg0 John] agrees [Arg2 with Mary] [Arg1 on everything].

Fall
Arg1: Logical subject, patient, thing falling
Arg2: Extent, amount fallen
Arg3: Start point
Arg4: End point, end state of Arg1
ArgM-LOC: Medium
Ex1: [Arg1 Sales] fell [Arg4 to $251.2 million] [Arg3 from $278.7 million].
Ex2: [Arg1 The average junk bond] fell [Arg2 by 4.2%] [ArgM-TMP in October].
Note that there is no Arg0 role for fall, because the normal subject of fall is a PROTO-PATIENT.
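
As an illustration only (not the PropBank file format), the agree roleset above could be held in a small data structure like the following; the field names are assumptions:

```python
# Hypothetical in-memory representation of the PropBank-style roleset for "agree".
agree_roleset = {
    "lemma": "agree",
    "roles": {
        "Arg0": "Agreer",
        "Arg1": "Proposition",
        "Arg2": "Other entity agreeing",
    },
}

# One annotated example, stored as (label, text) spans.
example = [
    ("Arg0", "The group"),
    ("REL", "agreed"),
    ("Arg1", "it wouldn't make an offer unless it had Georgia Gulf's consent"),
]
print(agree_roleset["roles"][example[0][0]])  # -> "Agreer"
```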

FrameNet:

FrameNet The FrameNet project is another semantic role labeling project. Where roles in the PropBank project are specific to an individual verb, roles in the FrameNet project are specific to a frame. A frame is a script-like structure that instantiates a set of frame-specific semantic roles called frame elements.
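
As a hedged sketch, a frame and its frame elements can be pictured as a simple record; the frame name and elements below are illustrative approximations, not the exact FrameNet inventory:

```python
# Illustrative, simplified frame: a script-like structure plus its
# frame-specific roles (frame elements).
commerce_buy = {
    "frame": "Commerce_buy",
    "frame_elements": ["Buyer", "Goods", "Seller", "Money"],
}

# An annotated sentence maps text spans to frame elements of that frame.
annotation = {"Buyer": "Chuck", "Goods": "some eggs", "Seller": "Jenny"}
print(all(fe in commerce_buy["frame_elements"] for fe in annotation))  # True
```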

Primitive Decomposition:

Primitive Decomposition Primitive decomposition is another way to represent the meaning of a word, in terms of a finite set of sub-lexical primitives. Hen, rooster, and chick all mean chicken, differing in age and sex. This can be represented using semantic features, symbols that represent some sort of primitive meaning:
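
A minimal sketch of such a feature decomposition; the particular primitives chosen here are assumptions made for illustration:

```python
# Hypothetical primitive decomposition of hen / rooster / chick.
semantic_features = {
    "hen":     {"CHICKEN": True, "ADULT": True,  "FEMALE": True},
    "rooster": {"CHICKEN": True, "ADULT": True,  "FEMALE": False},
    "chick":   {"CHICKEN": True, "ADULT": False, "FEMALE": None},  # sex unspecified
}
for word, features in semantic_features.items():
    print(word, features)
```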

Metaphor:

Metaphor Metaphor is the use of words and phrases whose meanings come from one domain to refer to and reason about a concept in a completely different domain. Examples: “He was drowning in paperwork.” “He is the apple of my eye.” “Time is a thief.”

Computational Lexical Semantics:

Computational Lexical Semantics

WORD SENSE DISAMBIGUATION:

WORD SENSE DISAMBIGUATION Word sense disambiguation (WSD) is the task of examining word tokens in context and determining which sense of each word is being used. WSD algorithms take as input a word in context along with a fixed inventory of potential word senses, and return the correct word sense for that use. Two distinct senses exist for the word "bass": a type of fish, and tones of low frequency, as in the sentences: I went fishing for some sea bass. The bass line of the song is too weak.

Supervised WORD SENSE DISAMBIGUATION:

Supervised WORD SENSE DISAMBIGUATION Supervised machine learning approaches are often used to handle lexical sample tasks. For each word, a number of corpus instances are selected and hand-labeled with the correct sense of the target word in each. Classifiers are then trained using these labeled examples, and unlabeled target words in context can then be labeled using such a trained classifier.

Extracting Feature Vectors for Supervised Learning:

Extracting Feature Vectors for Supervised Learning The first step in supervised training is to extract a useful set of features that are predictive of word senses. A feature vector consisting of numeric or nominal values is used to encode this linguistic information as an input to most machine learning algorithms.

Extracting Feature Vectors for Supervised Learning:

Extracting Feature Vectors for Supervised Learning The first type of feature is collocational: a feature vector extracted from a window of two words to the right and left of the target word, made up of the words themselves and their respective parts of speech, i.e. [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}].
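
A hedged sketch of extracting such a collocational feature vector with NLTK's default POS tagger; the tagger choice (which assumes the averaged_perceptron_tagger data is available), the padding symbol, and the example bass sentence are illustrative assumptions:

```python
import nltk

def collocational_features(tokens, i, window=2):
    """Words and POS tags in a +/- window around the target token at index i."""
    tagged = nltk.pos_tag(tokens)
    feats = []
    for j in range(i - window, i + window + 1):
        if j == i:
            continue  # skip the target word itself
        if 0 <= j < len(tagged):
            feats.extend(tagged[j])           # (word, POS) flattened into the vector
        else:
            feats.extend(("<PAD>", "<PAD>"))  # off-sentence positions
    return feats

tokens = "An electric guitar and bass player stand off to one side".split()
print(collocational_features(tokens, tokens.index("bass")))
```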

Extracting Feature Vectors for Supervised Learning:

Extracting Feature Vectors for Supervised Learning The second type of feature consists of bag-of-words information about neighboring words. A bag of words is an unordered set of words, ignoring their exact position. The simplest bag-of-words approach represents the context of a target word by a vector of features, each binary feature indicating whether a vocabulary word w does or doesn’t occur in the context.

Extracting Feature Vectors for Supervised Learning:

Extracting Feature Vectors for Supervised Learning For example, a bag-of-words vector consisting of the 12 most frequent content words from a collection of bass sentences drawn from the WSJ corpus gives an ordered word feature set. Using these word features with a window size of 10, an example sentence would be represented by the following binary vector: [0,0,0,1,0,0,0,0,0,0,1,0] (More in Chapter 23.)
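
A minimal sketch of building such a binary bag-of-words vector; the vocabulary below is a made-up stand-in, not the actual 12 WSJ words:

```python
def bag_of_words_vector(context_tokens, vocab):
    """1 if the vocabulary word occurs anywhere in the context window, else 0."""
    present = set(context_tokens)
    return [1 if w in present else 0 for w in vocab]

vocab = ["fishing", "guitar", "player", "river", "sound", "band"]  # placeholder vocabulary
context = "an electric guitar and bass player stand off to one side".split()
print(bag_of_words_vector(context, vocab))   # -> [0, 1, 1, 0, 0, 0]
```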

Naive Bayes and Decision List Classifiers:

Naive Bayes and Decision List Classifiers The naive Bayes classifier approach to WSD is based on the premise that choosing the best sense ŝ out of the set of possible senses S for a feature vector f amounts to choosing the most probable sense given that vector. In other words: ŝ = argmax_{s ∈ S} P(s|f). It is difficult to collect reasonable statistics for this equation directly: a bag-of-words vector defined over a vocabulary of just 20 words would have 2^20 possible feature vectors.

Naive Bayes and Decision List Classifiers:

Naive Bayes and Decision List Classifiers We first reformulate the problem in the usual Bayesian manner: ŝ = argmax_{s ∈ S} P(f|s)P(s)/P(f) = argmax_{s ∈ S} P(f|s)P(s). Even this equation isn’t helpful enough, since the data that associates specific vectors f with each sense s is also too sparse. But we do have information about individual feature-value pairs in the context of specific senses, so we can make an independence assumption.

Naive Bayes and Decision List Classifiers:

Naive Bayes and Decision List Classifiers Assuming that the features are conditionally independent given the word sense yields the following approximation for P(f|s): P(f|s) ≈ ∏_{j=1}^{n} P(f_j|s). That is, we can estimate the probability of an entire vector given a sense by the product of the probabilities of its individual features given that sense. The final naive Bayes classifier for WSD is therefore: ŝ = argmax_{s ∈ S} P(s) ∏_{j=1}^{n} P(f_j|s).
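
A small from-scratch sketch of this classifier over bag-of-words features; the toy training examples are invented, and add-one smoothing is added as an assumption to avoid zero probabilities:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_list, sense). Returns the counts the classifier needs."""
    sense_counts = Counter(s for _, s in examples)
    feature_counts = defaultdict(Counter)          # sense -> Counter over features
    vocab = set()
    for feats, sense in examples:
        feature_counts[sense].update(feats)
        vocab.update(feats)
    return sense_counts, feature_counts, vocab, len(examples)

def classify(feats, model):
    sense_counts, feature_counts, vocab, n = model
    best, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / n)                # log P(s)
        total = sum(feature_counts[sense].values())
        for f in feats:                            # sum of log P(f_j | s), add-one smoothed
            score += math.log((feature_counts[sense][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = sense, score
    return best

model = train_naive_bayes([
    (["fishing", "river"], "bass_fish"),
    (["sea", "fishing"], "bass_fish"),
    (["guitar", "band"], "bass_music"),
    (["play", "guitar"], "bass_music"),
])
print(classify(["guitar", "sound"], model))   # expected: bass_music
```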

WSD: DICTIONARY AND THESAURUS METHODS:

WSD: DICTIONARY AND THESAURUS METHODS A method for using a dictionary or thesaurus as an indirect kind of supervision is the Lesk algorithm.

WSD: DICTIONARY AND THESAURUS METHODS:

WSD: DICTIONARY AND THESAURUS METHODS In the bank example, sense bank¹ ("financial institution") has two words overlapping with the context, deposits and mortgage, while sense bank² has zero, so sense bank¹ is chosen.
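
A hedged sketch of the simplified Lesk idea using WordNet glosses via NLTK: the sense signature here is the gloss plus example sentences, and stop-word handling is omitted for brevity.

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_tokens):
    """Pick the WordNet sense whose gloss/examples overlap most with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        signature = set(sense.definition().split())
        for example in sense.examples():
            signature.update(example.split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

context = ("the bank can guarantee deposits will eventually cover future "
           "tuition costs because it invests in adjustable-rate mortgage securities").split()
print(simplified_lesk("bank", context))
```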

Selectional Restrictions and Selectional Preferences:

Selectional Restrictions and Selectional Preferences Consider the two senses of dish: one ("a piece of dishware normally used as a container for holding or serving food") has hypernyms like artifact, while the other ("a particular item of prepared food") has hypernyms like food. The verbs wash and stir-fry impose selectional restrictions on their THEME semantic roles: the restriction imposed by wash conflicts with the food sense, and the restriction imposed by stir-fry conflicts with the dishware sense, so each verb selects a different sense of dish.

Selectional Restrictions and Selectional Preferences:

Selectional Restrictions and Selectional Preferences The selectional preference strength can be defined by the difference in information between two distributions: 1. the distribution of expected semantic classes P(c), and 2. the distribution of expected semantic classes for the particular verb P(c|v). The greater the difference between these distributions, the more information the verb is giving us about possible objects. This difference can be quantified by the relative entropy between these two distributions, or Kullback-Leibler divergence (Kullback and Leibler, 1951).

Kullback-Leibler divergence:

Kullback-Leibler divergence The Kullback-Leibler or KL divergence D(P||Q) can be used to express the difference between two probability distributions P and Q: D(P||Q) = Σ_x P(x) log ( P(x) / Q(x) ). The selectional preference strength uses the KL divergence to express how much information, in bits, the verb v expresses about the possible semantic class of its argument: it is the divergence D( P(c|v) || P(c) ) = Σ_c P(c|v) log ( P(c|v) / P(c) ).
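
A minimal sketch of the KL divergence and the resulting selectional preference strength, computed over toy class distributions (the classes and probabilities are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """D(P||Q) = sum_x P(x) * log2(P(x)/Q(x)), in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

# toy distributions over the semantic classes of a verb's object
p_c = {"food": 0.25, "artifact": 0.25, "person": 0.25, "event": 0.25}            # P(c)
p_c_given_eat = {"food": 0.85, "artifact": 0.05, "person": 0.05, "event": 0.05}  # P(c|eat)

preference_strength = kl_divergence(p_c_given_eat, p_c)
print(round(preference_strength, 3))   # larger value -> the verb is more selective
```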

MINIMALLY SUPERVISED WSD: BOOTSTRAPPING:

MINIMALLY SUPERVISED WSD: BOOTSTRAPPING The supervised approach and the dictionary-based approach to WSD both require large hand-built resources: supervised training sets in one case, large dictionaries in the other. Bootstrapping algorithms (also called semi-supervised learning or minimally supervised learning) need only a very small hand-labeled training set. The most widely emulated bootstrapping algorithm for WSD is the Yarowsky algorithm.

Yarowsky algorithm:

Yarowsky algorithm The goal of the Yarowsky algorithm is to learn a classifier for a target word from a small seed set of labeled instances of each sense and a much larger unlabeled corpus. The algorithm first trains an initial decision-list classifier on the seed set. It then uses this classifier to label the unlabeled corpus. Next it selects the examples in the unlabeled corpus that it is most confident about, removes them, and adds them to the training set. The algorithm then trains a new decision-list classifier on this larger training set and iterates, applying the classifier to the now-smaller unlabeled set, extracting a new training set, and so on. With each iteration of this process the training corpus grows and the untagged corpus shrinks. The process is repeated until some sufficiently low error rate on the training set is reached, or until no further examples from the untagged corpus are above threshold.
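
A hedged sketch of this bootstrapping loop: scikit-learn's Bernoulli naive Bayes stands in for the decision-list classifier, and the confidence threshold, feature vectors, and toy data are illustrative assumptions, not the original experimental setup.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def bootstrap_wsd(X_seed, y_seed, X_unlabeled, threshold=0.95, max_iter=10):
    """Self-training loop in the spirit of the Yarowsky algorithm."""
    X_train, y_train = X_seed.copy(), y_seed.copy()
    X_pool = X_unlabeled.copy()
    for _ in range(max_iter):
        clf = BernoulliNB().fit(X_train, y_train)   # stand-in for a decision list
        if len(X_pool) == 0:
            break                                    # untagged corpus is exhausted
        probs = clf.predict_proba(X_pool)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break                                    # nothing left above threshold
        # move the most confidently labeled examples into the training set
        X_train = np.vstack([X_train, X_pool[confident]])
        y_train = np.concatenate([y_train, clf.classes_[probs[confident].argmax(axis=1)]])
        X_pool = X_pool[~confident]
    return clf

# toy invocation on random binary feature vectors
rng = np.random.default_rng(0)
X_seed = rng.integers(0, 2, size=(10, 20))
y_seed = rng.integers(0, 2, size=10)
X_unlabeled = rng.integers(0, 2, size=(100, 20))
model = bootstrap_wsd(X_seed, y_seed, X_unlabeled)
```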

Yarowsky algorithm:

Yarowsky algorithm At the initial stage, only the seed sentences are labeled by their collocates; at an intermediate stage, more collocates have been discovered and more instances have been labeled and moved from the unlabeled set to the training set; and so on until the final stage.

WORD SIMILARITY: THESAURUS METHODS:

WORD SIMILARITY: THESAURUS METHODS The oldest and simplest thesaurus-based algorithms are based on the intuition that the shorter the path between two words or senses in the graph defined by the thesaurus hierarchy, the more similar they are. For example, the concept dime is most similar to nickel and coin, less similar to money, and even less similar to Richter scale. Formally, we specify path length as follows: pathlen(c1,c2) = the number of edges in the shortest path in the thesaurus graph between the sense nodes c1 and c2. Path-based similarity can be defined just as the path length, often with a log transform, resulting in the following common definition: simpath(c1,c2) = −log pathlen(c1,c2).
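
A hedged sketch using NLTK's WordNet path similarity (NLTK scores 1/(1 + pathlen) rather than applying the −log transform, but pairs are ordered the same way); the synset identifiers are assumptions about which senses are intended:

```python
from nltk.corpus import wordnet as wn

dime = wn.synset("dime.n.01")      # the coin
nickel = wn.synset("nickel.n.02")  # assumed to be the coin sense
money = wn.synset("money.n.01")

print(dime.path_similarity(nickel))  # expected to be higher ...
print(dime.path_similarity(money))   # ... than this, since money is less similar
```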

RESNIK SIMILARITY:

RESNIK SIMILARITY Resnik proposes to estimate the common amount of information between two concepts by the information content of their lowest common subsumer (LCS). More formally, the Resnik similarity measure is: simresnik(c1,c2) = −log P(LCS(c1,c2)). Lin (1998b) extended the Resnik intuition by pointing out that a similarity metric between objects A and B needs to do more than measure the amount of information in common between A and B; it must also account for the information needed to describe each of them individually. The final Lin similarity function is: simlin(c1,c2) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) ).
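
A hedged sketch with NLTK's implementations of these two measures; it assumes the wordnet_ic data has been downloaded (nltk.download('wordnet_ic')) and that the Brown-corpus information content file and the chosen synsets are appropriate:

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")   # P(c) estimates from the Brown corpus
c1, c2 = wn.synset("dime.n.01"), wn.synset("nickel.n.02")

print(c1.res_similarity(c2, brown_ic))   # information content of the lowest common subsumer
print(c1.lin_similarity(c2, brown_ic))   # Lin's normalized variant, in [0, 1]
```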

SIMILARITY MEASURES:

SIMILARITY MEASURES Five thesaurus-based similarity measures.

SEMANTIC ROLE LABELING:

SEMANTIC ROLE LABELING Semantic role labeling is the task of automatically finding the semantic roles for each predicate in a sentence. More specifically, that means determining which constituents in a sentence are semantic arguments for a given predicate, and then determining the appropriate role for each of those arguments. Semantic role labeling is also called thematic role labeling, case role assignment, or shallow semantic parsing. Current approaches to semantic role labeling are based on supervised machine learning and hence require access to adequate amounts of training and testing materials. FrameNet and PropBank are used: they specify what counts as a predicate, define the set of roles used in the task, and provide training and test data.

SEMANTIC ROLE LABELING:

SEMANTIC ROLE LABELING FrameNet employs a large number of frame-specific frame elements as roles. PropBank makes use of a smaller number of numbered argument labels, which can be interpreted as verb-specific labels.

SEMANTIC ROLE LABELING:

SEMANTIC ROLE LABELING The CLASSIFYNODE component can be a simple 1-of-N classifier that assigns a semantic role. CLASSIFYNODE can be trained on labeled data such as FrameNet or PropBank.

SEMANTIC ROLE LABELING:

SEMANTIC ROLE LABELING The resulting parse is then traversed to find all predicate-bearing words. For each of these predicates, the tree is traversed again to determine which role each constituent in the parse plays with respect to that predicate: a set of features is extracted for each constituent, and a classifier trained on an appropriate training set is passed this feature set and makes the appropriate role assignment.
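
A hedged sketch of that pipeline over an NLTK parse tree: the feature set, the "verb label" test for predicates, the head-word heuristic, and the rule-based stand-in classifier are all illustrative assumptions, not a trained PropBank/FrameNet system.

```python
from nltk.tree import Tree

def semantic_role_label(parse, classify_node):
    """Return (predicate, constituent, role) triples for one parsed sentence."""
    labels = []
    # traverse the parse to find predicate-bearing words (here: any verb tag)
    predicates = [st for st in parse.subtrees() if st.label().startswith("VB")]
    for pred in predicates:
        # traverse again, asking the classifier which role (if any) each
        # constituent plays with respect to this predicate
        for constituent in parse.subtrees(lambda t: t.height() > 2):
            features = {
                "predicate": " ".join(pred.leaves()),
                "phrase_type": constituent.label(),
                "head_word": constituent.leaves()[-1],  # crude head-word stand-in
            }
            role = classify_node(features)              # e.g. 'Arg0', 'Arg1', 'NONE'
            if role != "NONE":
                labels.append((features["predicate"], " ".join(constituent.leaves()), role))
    return labels

# toy usage with a hand-built parse and a trivial rule-based stand-in classifier
sentence = Tree.fromstring("(S (NP (NNP Sasha)) (VP (VBD broke) (NP (DT the) (NN window))))")
dummy = lambda f: "Arg0" if f["phrase_type"] == "NP" and f["head_word"] == "Sasha" else "NONE"
print(semantic_role_label(sentence, dummy))
```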

UNSUPERVISED SENSE DISAMBIGUATION:

UNSUPERVISED SENSE DISAMBIGUATION In unsupervised sense disambiguation we don’t use human-defined word senses. Instead, the set of ‘senses’ of each word is created automatically from the instances of each word in the training set. Schütze introduced a method for unsupervised sense disambiguation. In this method, each instance of a word in the training set is first represented by distributional context feature vectors that are a slight generalization of the feature vectors described above. Schütze defines the context vector of a word w not as this first-order vector, but by its second-order co-occurrence: the context vector for a word w is built by taking each word x in the context of w, computing the word vector for each x, and then taking the centroid (average) of those vectors.
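
A minimal sketch of a second-order context vector: average the (first-order) co-occurrence vectors of the words surrounding w. The tiny vocabulary and counts below are made-up placeholders.

```python
import numpy as np

# hypothetical first-order co-occurrence vectors, one per vocabulary word
word_vectors = {
    "fish":   np.array([0, 3, 0, 0, 2]),
    "river":  np.array([3, 0, 0, 0, 4]),
    "guitar": np.array([0, 0, 0, 5, 0]),
    "play":   np.array([0, 0, 5, 0, 0]),
    "water":  np.array([2, 4, 0, 0, 0]),
}

context = ["river", "water", "fish"]          # words around one token of w
second_order = np.mean([word_vectors[x] for x in context], axis=0)
print(second_order)   # centroid of the context words' vectors
```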

UNSUPERVISED SENSE DISAMBIGUATION:

UNSUPERVISED SENSE DISAMBIGUATION These context vectors (whether first-order or second-order) can be used for unsupervised sense disambiguation of a word w in only three steps: 1. For each token of word w in a corpus, compute a context vector. 2. Use a clustering algorithm to cluster these word-token context vectors into a predefined number of groups or clusters; each cluster defines a sense of w. 3. Compute the vector centroid of each cluster; each vector centroid is a sense vector representing that sense of w.

UNSUPERVISED SENSE DISAMBIGUATION:

UNSUPERVISED SENSE DISAMBIGUATION Agglomerative clustering is a technique where each of the N training instances is initially assigned to its own cluster. New clusters are then formed in a bottom-up fashion by successively merging the two clusters that are most similar. This process continues until either a specified number of clusters is reached or some global goodness measure among the clusters is achieved.
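
A minimal sketch combining the three steps listed earlier with the agglomerative clustering just described; the random vectors stand in for real second-order context vectors, and the choice of two clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
context_vectors = rng.normal(size=(200, 50))   # step 1 stand-in: one vector per token of w

# Step 2: cluster the token context vectors; each cluster is taken as a sense of w.
clustering = AgglomerativeClustering(n_clusters=2).fit(context_vectors)

# Step 3: the centroid of each cluster is the sense vector for that sense.
sense_vectors = np.vstack([
    context_vectors[clustering.labels_ == k].mean(axis=0)
    for k in range(clustering.n_clusters_)
])
print(sense_vectors.shape)   # (2, 50): one sense vector per discovered sense
```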