Thesauri and Semantic Networks
Thesauri
Thesauri: It is intuitive to use a thesaurus to expand a query and improve its accuracy.
A query about “dogs” might well be expanded to include “canine” if a thesaurus were consulted.
The problem is that expansion can just as easily add a “bad” word: a thesaurus might list “pet” for “dog”, and the expanded query would then be too generic.
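A minimal sketch of thesaurus-based expansion, to make the “pet” problem concrete. The thesaurus entries, the function name `expand_query` and the `blocked` parameter are assumptions made up for illustration; they do not come from any real thesaurus or library.

```python
# Toy thesaurus; the entries are invented for illustration only.
THESAURUS = {
    "dog": ["canine", "pet"],
    "cat": ["feline", "pet"],
}

def expand_query(query, thesaurus, blocked=frozenset()):
    """Return the query terms plus their thesaurus entries, skipping blocked words."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in blocked and candidate not in expanded:
                expanded.append(candidate)
    return expanded

# Naive expansion pulls in the overly generic word "pet".
print(expand_query("dog food", THESAURUS))
# Blocking known-generic words is one (hypothetical) way to limit the damage.
print(expand_query("dog food", THESAURUS, blocked={"pet"}))
```

The first call returns the generic term along with the useful one, which is the risk described above; the second call shows that some filtering step is needed before expansion terms are trusted.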
Manual vs. Automatic: Manual
Use a readily available machine-readable thesaurus (e.g. Roget’s).
Automatic
Build a thesaurus automatically, in a language-independent fashion.
The idea is that an algorithm that can build a thesaurus automatically can be applied to many different languages.
Automatic Thesauri Generation: Two approaches (the ones we will describe; others are in the book)
Term Co-occurrence (Salton, 1971)
Term Context (Gauch, 1996)
Thesaurus Generation with Term Co-occurrence: A thesaurus is generated by finding similar terms.
Terms that co-occur with each other more often than a threshold are considered similar.
A term-term similarity matrix is created, holding a similarity coefficient (SC) for every pair of terms ti and tj.
Term Co-occurrence (example): Term vectors (term-document mapping) for two terms, t1 and t2
SC(t1, t2) = t1 · t2 = 1 (dot product)
SC(t1, t2) = SC(t2, t1) (the coefficient is symmetric)
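The dot-product computation can be sketched as follows. The term-document vectors are invented for illustration (they are not the vectors from the original slide), and `similarity_matrix` is a hypothetical helper name.

```python
# Hypothetical term-document vectors: one entry per document,
# 1 if the term occurs in that document, 0 otherwise.
term_vectors = {
    "dog":    [1, 1, 0, 1],
    "canine": [0, 1, 0, 0],
    "hill":   [1, 0, 1, 0],
}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity_matrix(vectors):
    """Build the term-term matrix: SC for every pair of distinct terms."""
    terms = sorted(vectors)
    return {
        ti: {tj: dot(vectors[ti], vectors[tj]) for tj in terms if tj != ti}
        for ti in terms
    }

sc = similarity_matrix(term_vectors)
print(sc["dog"]["canine"])                          # 1: they share one document
print(sc["dog"]["canine"] == sc["canine"]["dog"])   # True: SC is symmetric
```

With binary vectors the dot product simply counts the documents in which both terms appear, so a threshold on SC is a threshold on the number of shared documents.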
Expanding Query using Term Co-occurrence: For a given term ti, the top t most similar terms, based on SC, are picked.
These terms can then be used for query expansion.
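Picking the top t terms is a sort over one row of the similarity matrix. The sketch below assumes the matrix is stored as a dict of dicts; the SC values and the name `top_similar` are hypothetical.

```python
def top_similar(term, sc, t):
    """Return up to t terms with the highest non-zero SC to `term`, best first."""
    # Sort by descending SC; break ties alphabetically so the result is stable.
    ranked = sorted(sc[term].items(), key=lambda item: (-item[1], item[0]))
    return [other for other, score in ranked[:t] if score > 0]

# Invented SC values for one row of a term-term matrix.
sc_row = {"dog": {"canine": 5, "puppy": 4, "hill": 1, "tax": 0}}

print(top_similar("dog", sc_row, 2))
print(top_similar("dog", sc_row, 4))   # "tax" is dropped because its SC is 0
```

Dropping zero-scoring terms is a design choice of this sketch, not something the slides prescribe; without it, a large t would pad the query with unrelated terms.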
Problems with Term Co-occurrence: A very frequent term will co-occur with everything.
Very general terms will co-occur with other general terms (“hairy” will co-occur with “furry”).
Thesaurus Generation with Term Context: The notion here is that term co-occurrence is useful, but many unrelated terms will co-occur.
The proposed improvement is to treat words that are used with similar context words as similar.
Context Words: Consider
The dog ran up the hill.
The canine ran down the hill.
We hope to find that “dog” and “canine” are synonyms because of the context words around them.
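The two sentences can be checked mechanically. This sketch collects the words seen at each relative position around a target term and intersects the results; the function names and the simple tokenizer are assumptions for illustration.

```python
def tokenize(sentence):
    """Lowercase, drop full stops, split on whitespace (deliberately simplistic)."""
    return sentence.lower().replace(".", "").split()

def context_words(tokens, target, window=3):
    """Map each relative position (-window..-1, +1..+window) to the words seen there."""
    seen = {}
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                seen.setdefault(offset, set()).add(tokens[j])
    return seen

dog = context_words(tokenize("The dog ran up the hill."), "dog")
canine = context_words(tokenize("The canine ran down the hill."), "canine")

# Context words that "dog" and "canine" share at the same position.
shared = {
    off: dog[off] & canine[off]
    for off in dog
    if off in canine and dog[off] & canine[off]
}
print(shared)
```

The two targets share “the” at -1, “ran” at +1 and “the” at +3, and differ only at +2 (“up” versus “down”), which is the kind of overlap the term-context approach relies on.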
Context Vectors: Step 1
Identify the context terms that will be used.
Identify the target terms (terms for which we want synonyms).
Select the window of context words we care about: for a given target term, choose how many context words to its left and to its right to watch. A window of size 3 watches the context words at positions
-3, -2, -1, +1, +2, +3
Step 2
Build the context vectors around each target term.
Step 3
Compute the similarity between two target term vectors.
Step 4
Identify expansion terms.
Step 1: Choose Key Parameters: Identify the context words that will be used.
Pick the 200 most common terms.
Identify the target terms (terms that we want synonyms for).
This is the hard part: we don’t want terms that are too frequent, because they will be vague, general terms, and we don’t want terms that are too infrequent, because they won’t co-occur with anything.
Select the window of context words.
Let’s choose -3 to +3, a six-word window.
Determine the weights for the components of the context vector.
Step 2: Build Context Vectors: Each vector has one element for each context word at each position in the term window.
So if we have 200 context words and six positions (-3, -2, -1, +1, +2, +3), each vector will have 1200 components.
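A sketch of how such a positional vector might be laid out, using raw counts as component values (weighting is a separate step). The function name `build_context_vector` and the tiny context-term list are assumptions for illustration.

```python
def build_context_vector(tokens, target, context_terms, window=3):
    """Count how often each (context term, position) pair occurs around `target`.

    Returns the vector and the (term, offset) -> index mapping used to lay it out.
    """
    offsets = [o for o in range(-window, window + 1) if o != 0]
    slot = {}
    for term in context_terms:
        for offset in offsets:
            slot[(term, offset)] = len(slot)   # one component per term per position

    vector = [0] * len(slot)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for offset in offsets:
            j = i + offset
            if 0 <= j < len(tokens) and (tokens[j], offset) in slot:
                vector[slot[(tokens[j], offset)]] += 1
    return vector, slot

tokens = "the dog ran up the hill".split()
vector, slot = build_context_vector(tokens, "dog", ["the", "ran", "hill"])
print(len(vector))                 # 3 context terms x 6 positions = 18
print(vector[slot[("ran", 1)]])    # "ran" was seen once at position +1

# With 200 (placeholder) context terms the vector has 200 x 6 = 1200 components.
big, _ = build_context_vector(tokens, "dog", [f"term{k}" for k in range(200)])
print(len(big))
```

Most components stay at zero for any one target term, so a real implementation would probably store these vectors sparsely.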
Component Weights: The goal is to give a higher weight to a context term that co-occurs with the target term more often than the overall frequencies of the two terms would suggest.
For a given context term j and target term i:
w = log( (N · dfij) / (tfi · tfj) + 1 )
tfi = total occurrences of term i in the collection for a given window size
tfj = total occurrences of term j in the collection for a given window size
dfij = total number of documents that contain a co-occurrence of term i and term j within the given window size.
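The weight formula translates directly into code. Two things here are assumptions, since the slide does not state them: N is taken to be a collection-size constant supplied by the caller, and the logarithm is taken to be the natural log. The function name and the example numbers are hypothetical.

```python
import math

def component_weight(n, df_ij, tf_i, tf_j):
    """w = log((N * df_ij) / (tf_i * tf_j) + 1).

    n     : collection-size constant N (its exact meaning is an assumption here)
    df_ij : documents containing a co-occurrence of i and j within the window
    tf_i  : total occurrences of target term i
    tf_j  : total occurrences of context term j
    """
    if tf_i <= 0 or tf_j <= 0:
        raise ValueError("term frequencies must be positive")
    return math.log((n * df_ij) / (tf_i * tf_j) + 1)

# Invented counts: same term frequencies, different co-occurrence counts.
never = component_weight(1000, 0, 50, 200)
rare = component_weight(1000, 5, 50, 200)
often = component_weight(1000, 20, 50, 200)
print(never, rare < often)
```

The +1 inside the logarithm keeps the weight at zero when the two terms never co-occur, and dividing by tfi · tfj discounts context terms that are simply frequent everywhere.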
Step 3: Compute Similarity: For each target term, compute its similarity to all other target terms using their context vectors.
The dot product can be used.
Step 4: Identifying Expansion Terms: Expand the target terms in the query using the top t most similar terms. Various values of t can be used.
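Steps 3 and 4 can be sketched together: score every other target term by the dot product of context vectors, then append the top t to the query. The vectors are stored sparsely as dicts keyed by (context word, position); all weights, names and the tie-breaking rule are assumptions for illustration.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {component: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u                      # iterate over the smaller vector
    return sum(w * v[k] for k, w in u.items() if k in v)

def expand_with_context(query_terms, context_vectors, t):
    """Append, for each target term in the query, its top t most similar targets."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in context_vectors:
            continue                     # not a target term; leave it alone
        scored = [
            (sparse_dot(context_vectors[term], vec), other)
            for other, vec in context_vectors.items()
            if other != term
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for score, other in scored[:t]:
            if score > 0 and other not in expanded:
                expanded.append(other)
    return expanded

# Invented weighted context vectors for three target terms.
context_vectors = {
    "dog":    {("the", -1): 1.2, ("ran", 1): 0.9, ("up", 2): 0.4},
    "canine": {("the", -1): 1.1, ("ran", 1): 0.8, ("down", 2): 0.5},
    "hill":   {("the", -1): 1.0, ("up", -2): 0.7},
}

print(expand_with_context(["dog", "food"], context_vectors, t=1))
print(expand_with_context(["dog", "food"], context_vectors, t=2))
```

Note that “hill” still scores above zero against “dog” purely through the shared context word “the”, so a larger t quickly admits weaker matches; this is why the choice of t matters.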
Semantic Networks
Semantic Networks: An attempt to resolve the mismatch problem.
Instead of matching query terms against document terms, measure the semantic distance between them.
Premise: Terms that share the same meaning are closer (smaller distance) to each other in the semantic network.
See publicly available tool, WordNet (www.cogsci.princeton.edu/~wn)
Semantic Networks: Build a network that shows, for each word, its relationships to other words (recent efforts, 2004, aim to incorporate phrases).
For “dog” and “canine”, a synonym arc would exist.
To expand a query, find the word in the semantic network and follow the various arcs to other related words.
Different distance measures can be used to compute the distance from one word in the network to another.
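As one possible distance measure, the sketch below counts the arcs on the shortest path between two words using breadth-first search. The toy network is hand-built for illustration and is not WordNet data; treating every arc as undirected and equally costly is an assumption of this sketch.

```python
from collections import deque

# Hypothetical toy network of (word, relation, word) arcs.
ARCS = [
    ("dog", "synonym", "canine"),
    ("dog", "is-a", "mammal"),
    ("cat", "is-a", "mammal"),
    ("roof", "part-of", "house"),
]

def build_network(arcs):
    """Adjacency sets; relation labels are ignored and arcs are undirected here."""
    graph = {}
    for a, _relation, b in arcs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def distances_from(graph, start):
    """Shortest path length (in arcs) from `start` to every reachable word."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for neighbour in graph.get(word, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[word] + 1
                queue.append(neighbour)
    return dist

network = build_network(ARCS)
dist = distances_from(network, "dog")
print(dist.get("canine"), dist.get("cat"), dist.get("roof"))

# Query expansion: follow arcs out to a maximum distance of 1.
print(sorted(w for w, d in dist.items() if 0 < d <= 1))
```

Here “canine” is one arc from “dog”, “cat” is two arcs away through “mammal”, and “roof” is unreachable. A measure that weighted arcs by relation type would be a natural variation.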
Types of Links in WordNet: Synonyms
dog, canine
Antonyms (opposite)
night, day
Hyponyms (is-a)
dog, mammal
Meronyms (part-of)
roof, house
Entailment (one entails the other)
buy, pay
Troponyms (one verb names a particular manner of the other; the two are related by entailment and must occur at the same time)
limp, walk
Summary: Pros
Thesauri and semantic networks (WordNet) can be used to suggest good expansion words to users (“more like this”).
Cons
Little improvement has been found from automatic techniques that expand the query without user intervention.
Manual thesauri and WordNet are language dependent.