03d Thesaurus Semantic Network 05 pub


Thesauri and Semantic Networks: 

Thesauri: 

It is intuitive to use a thesaurus to expand a query and improve accuracy. A query about “dogs” might well be expanded to include “canine” if a thesaurus were consulted. The problem is that it is easy to add a “bad” word: a synonym for “dog” might well be “pet”, and then the query becomes too generic.

Manual vs. Automatic: 

Manual: use a readily available machine-readable thesaurus (e.g. Roget’s).
Automatic: build a thesaurus automatically in a language-independent fashion. The notion is that an algorithm that can build a thesaurus automatically could be applied to many different languages.

Automatic Thesauri Generation: 

Two approaches are described here (others are in the book):
Term Co-occurrence (Salton, 1971)
Term Context (Gauch, 1996)

Thesaurus Generation with Term Co-occurrence: 

A thesaurus is generated by finding similar terms. Terms that co-occur with each other more often than a threshold are considered similar. A term-term similarity matrix is created, holding the similarity coefficient (SC) between every pair of terms ti and tj.

Term Co-occurrence (example): 

Term vectors (term-document mapping):
t1 = <1 1>
t2 = <0 1>
SC(t1, t2) = <1 1> · <0 1> = (1)(0) + (1)(1) = 1 (dot product)
SC(t1, t2) = SC(t2, t1) (the coefficient is symmetric)
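The example can be checked with a few lines of Python. This is only a minimal sketch: the function names `dot` and `similarity_matrix` and the dictionary layout are illustrative assumptions, not part of the original slides.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(term_vectors):
    """Return {(ti, tj): SC} for every ordered pair of distinct terms."""
    terms = list(term_vectors)
    return {
        (ti, tj): dot(term_vectors[ti], term_vectors[tj])
        for ti in terms
        for tj in terms
        if ti != tj
    }


# Term-document vectors from the example: one component per document.
term_vectors = {"t1": [1, 1], "t2": [0, 1]}
sc = similarity_matrix(term_vectors)
print(sc[("t1", "t2")])  # 1
print(sc[("t1", "t2")] == sc[("t2", "t1")])  # True
```

Storing both (t1, t2) and (t2, t1) is redundant because the coefficient is symmetric; it is done here only to keep lookups simple.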

Expanding Query using Term Co-occurrence : 

For a given term ti, the top t most similar terms, based on SC, are picked. These terms can then be used for query expansion.
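A possible sketch of this selection step follows. The function `expand_query`, the nested-dictionary layout `sc[ti][tj]`, the `threshold` parameter, and all scores are hypothetical and chosen only for illustration.

```python
def expand_query(query_terms, sc, t=2, threshold=0):
    """Append, for each query term, its top-t most similar terms by SC."""
    expanded = list(query_terms)
    for term in query_terms:
        neighbours = sc.get(term, {})
        # Highest SC first; ties are broken alphabetically.
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        for other, score in ranked[:t]:
            if score > threshold and other not in expanded:
                expanded.append(other)
    return expanded


# Hypothetical similarity scores, nested as sc[ti][tj].
sc = {
    "dog": {"canine": 5, "pet": 3, "hill": 1},
    "hill": {"dog": 1},
}
print(expand_query(["dog"], sc, t=2))  # ['dog', 'canine', 'pet']
```

Note that "pet" is added here, which illustrates how an expansion term can make the query too generic.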

Problems with Term Co-occurrence: 

A very frequent term will co-occur with everything.
Very general terms will co-occur with other general terms (“hairy” will co-occur with “furry”).

Thesaurus Generation with Term Context: 

The notion here is that term co-occurrence is useful, but many unrelated terms will co-occur. The proposed improvement is to treat words that are used with similar context words as similar.

Context Words: 

Consider:
The dog ran up the hill.
The canine ran down the hill.
We hope to find that “dog” and “canine” are synonyms because of the context words around them.

Context Vectors: 

Step 1: Choose the key parameters. Identify the context terms that will be used, identify the target terms (terms for which we want synonyms), and select a window of how many context words we care about. For a given target term, we choose how many context words to its left and right to watch. A window of size 3 means that we watch context words at positions -3, -2, -1, +1, +2, +3.
Step 2: Build the context vectors around each target term.
Step 3: Compute the similarity between two target term vectors.
Step 4: Identify expansion terms.

Step 1: Choose Key Parameters: 

Identify the context words that will be used: pick the top 200 most common terms.
Identify the target terms (terms that we want synonyms for). This is the hard part: we don’t want terms that are too frequent, as they will be vague, general terms, and we don’t want terms that are too infrequent, because they won’t co-occur with anything.
Select the window of context words: let’s choose -3 to +3, a six-word window.
Determine the weights for the components of the context vector.
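One way to make these choices concrete is sketched below. The slides do not say how "too frequent" and "too infrequent" are decided, so the frequency band (`min_tf`, `max_tf`) and the function name `choose_parameters` are assumptions made purely for illustration.

```python
from collections import Counter


def choose_parameters(tokens, n_context=200, min_tf=2, max_tf=50):
    """Pick context words (most common) and target terms (mid-frequency)."""
    counts = Counter(tokens)
    context_words = [w for w, _ in counts.most_common(n_context)]
    # Assumed heuristic: keep terms whose frequency falls inside a band.
    target_terms = sorted(w for w, c in counts.items() if min_tf <= c <= max_tf)
    return context_words, target_terms


tokens = "the dog ran up the hill the canine ran down the hill".split()
context_words, target_terms = choose_parameters(
    tokens, n_context=3, min_tf=1, max_tf=1
)
print(target_terms)  # ['canine', 'dog', 'down', 'up']
```

On this toy text the three context words are "the", "ran", and "hill"; a real collection would use the top 200 as described above.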

Step 2: Build Context Vectors: 

Each vector has one element for each context word at each position in the term window. So with 200 context words and six positions (-3, -2, -1, +1, +2, +3), each vector has 200 × 6 = 1200 components.
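The layout of such a vector can be sketched as follows. This is a simplified illustration, not the published method: it stores raw counts rather than weights, treats the text as one continuous token stream (so a window may cross a sentence boundary), and the function name `build_context_vectors` is hypothetical.

```python
def build_context_vectors(tokens, target_terms, context_words, window=3):
    """Count context words at each offset around every target-term occurrence."""
    offsets = [o for o in range(-window, window + 1) if o != 0]
    col = {w: k for k, w in enumerate(context_words)}
    targets = set(target_terms)
    size = len(offsets) * len(context_words)
    vectors = {t: [0] * size for t in target_terms}
    for i, tok in enumerate(tokens):
        if tok not in targets:
            continue
        for p, off in enumerate(offsets):
            j = i + off
            if 0 <= j < len(tokens) and tokens[j] in col:
                # One block of len(context_words) slots per window position.
                vectors[tok][p * len(context_words) + col[tokens[j]]] += 1
    return vectors


tokens = "the dog ran up the hill the canine ran down the hill".split()
vectors = build_context_vectors(tokens, ["dog", "canine"], ["the", "ran", "hill"])
print(len(vectors["dog"]))  # 18 = 6 positions x 3 context words
```

With 200 context words and the same window, the same code would produce the 1200-component vectors described above.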

Component Weights: 

The goal is to give a higher weight to a context term whose co-occurrence frequency with the target term is large relative to the overall frequencies of the two terms. For a given context term j and target term i:

$w_{ij} = \log\left(\frac{N \cdot df_{ij}}{tf_i \cdot tf_j} + 1\right)$

where
tf_i = total occurrences of term i in the collection for a given window size
tf_j = total occurrences of term j in the collection for a given window size
df_ij = total documents that contain the co-occurrence of term i and term j in a given window size.
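A small numeric sketch of the weight formula is given below. The slide does not state the base of the logarithm, so the natural log is assumed here; the function name `component_weight` and all counts are hypothetical.

```python
import math


def component_weight(n, df_ij, tf_i, tf_j):
    """w = log((N * df_ij) / (tf_i * tf_j) + 1), natural log assumed."""
    return math.log((n * df_ij) / (tf_i * tf_j) + 1)


# Hypothetical counts: N = 1000, df_ij = 5, tf_i = 50, tf_j = 100.
w = component_weight(1000, 5, 50, 100)
print(round(w, 4))  # 0.6931

# A pair that never co-occurs gets weight log(1) = 0.
print(component_weight(1000, 0, 50, 100))  # 0.0
```

The "+ 1" inside the logarithm keeps the weight at zero, rather than undefined, when two terms never co-occur.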

Step 3: Compute Similarity: 

For each target term, compute its similarity to all other target terms using their context vectors. The dot product can be used.

Step 4: Identifying Expansion Terms: 

Expand the target terms in the query using the top t most similar terms. Various thresholds for t can be used.
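Steps 3 and 4 together might look like the sketch below. The function `most_similar` and the short four-component vectors with made-up weights are assumptions for illustration only.

```python
def most_similar(target, vectors, t=1):
    """Rank the other target terms by dot product with `target`; keep the top t."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    scores = [
        (dot(vectors[target], vec), term)
        for term, vec in vectors.items()
        if term != target
    ]
    # Highest similarity first; ties are broken alphabetically.
    scores.sort(key=lambda st: (-st[0], st[1]))
    return [term for _score, term in scores[:t]]


# Hypothetical 4-component context vectors (weights are made up).
vectors = {
    "dog": [0.7, 0.0, 0.7, 0.2],
    "canine": [0.6, 0.1, 0.7, 0.0],
    "hill": [0.0, 0.9, 0.0, 0.1],
}
print(most_similar("dog", vectors, t=1))  # ['canine']
```

Raising t admits weaker matches (here "hill" with t=2), which is why the choice of t matters.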

Semantic Networks: 

Semantic networks attempt to resolve the mismatch problem. Instead of matching query terms against document terms, they measure the semantic distance between them. Premise: terms that share the same meaning are closer (smaller distance) to each other in the semantic network. See the publicly available tool WordNet (www.cogsci.princeton.edu/~wn).

Semantic Networks: 

A semantic network shows, for each word, its relationships to other words (recent efforts, as of 2004, aim to incorporate phrases). For “dog” and “canine”, a synonym arc would exist. To expand a query, find the word in the semantic network and follow the various arcs to other related words. Different distance measures can be used to compute the distance from one word in the network to another.
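Following arcs to related words can be sketched as a breadth-first walk over a small graph. The toy network, the function `related_words`, and the use of arc count as the distance are all illustrative assumptions; arc count is only one of the possible distance measures, and this is not the WordNet API.

```python
from collections import deque


def related_words(network, start, max_distance=1):
    """Breadth-first walk: map each reachable word to its arc count from `start`."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if distances[word] == max_distance:
            continue  # do not expand beyond the distance limit
        for _link_type, neighbour in network.get(word, []):
            if neighbour not in distances:
                distances[neighbour] = distances[word] + 1
                queue.append(neighbour)
    del distances[start]
    return distances


# Hypothetical toy network: word -> list of (link type, related word) arcs.
network = {
    "dog": [("synonym", "canine"), ("is-a", "mammal")],
    "canine": [("synonym", "dog")],
    "mammal": [("is-a", "animal")],
}
print(related_words(network, "dog", max_distance=2))
# {'canine': 1, 'mammal': 1, 'animal': 2}
```

A query on "dog" could then be expanded with the words at distance 1, or with more distant words at the risk of making the query too general.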

Types of Links in WordNet: 

Synonyms: dog, canine
Antonyms (opposite): night, day
Hyponyms (is-a): dog, mammal
Meronyms (part-of): roof, house
Entailment (one entails the other): buy, pay
Troponyms (two words related by entailment must occur at the same time): limp, walk

Summary: 

Pros: Thesauri and semantic networks (WordNet) can be used to suggest good words to users (“more like this”).
Cons: Little improvement has been found with automatic techniques that expand a query without user intervention. Manual thesauri and WordNet are language dependent.