Semantic Indexing with Typed Terms using Rapid Annotation :
16th of August 2005
TKE-05 Workshop on Semantic Indexing, Copenhagen
Chris Biemann
University of Leipzig
Outline :
The benefits of typed terms and relations
Alleviating the ontology bottleneck
Rapid annotation
Sources for annotation candidates
Annotation tools
Case study: Annotation of „Deutscher Wortschatz“
Conclusion
Typed terms and relations :
The bag-of-words model treats all terms equally:
Document similarity based on all terms
No views on data possible
Typed terms and relations:
Multiple views on documents w.r.t. types
Document similarity restricted to types and augmented by relations
Enables some tasks of Question Answering
Motivating example: untyped :
Documents:
1. The government official A. Smith signed a contract over the purchase of 100 tanks from weapon manufacturer B. Miller.
2. "Weapon sales increased", a government official stated, "especially tanks sell well"
3. A holiday cruise on a yacht invites to take photos of seagulls.
4. The photos show A. Smith on a cruise with B. Miller's yacht.
[Figure: similarity of terms and the resulting clustering of documents 1 to 4, based on all terms]
Motivating example: type PERSON :
Documents:
1. The government official A. Smith signed a contract over the purchase of 100 tanks from weapon manufacturer B. Miller.
2. "Weapon sales increased", a government official stated, "especially tanks sell well"
3. A holiday cruise on a yacht invites to take photos of seagulls.
4. The photos show A. Smith on a cruise with B. Miller's yacht.
[Figure: similarity of terms and the resulting clustering of documents 1 to 4, restricted to terms of type PERSON]
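The contrast between the two views can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the talk: Jaccard overlap as the similarity measure, a tiny hand-written stopword list, and a hypothetical type lexicon `TYPES` that marks the two names as PERSON.

```python
import re

# The four example documents (quotation marks normalised to ASCII).
DOCS = {
    1: "The government official A. Smith signed a contract over the purchase "
       "of 100 tanks from weapon manufacturer B. Miller.",
    2: "Weapon sales increased, a government official stated, "
       "especially tanks sell well",
    3: "A holiday cruise on a yacht invites to take photos of seagulls.",
    4: "The photos show A. Smith on a cruise with B. Miller's yacht.",
}

# Hypothetical type lexicon; in practice this would come from annotation.
TYPES = {"A. Smith": "PERSON", "B. Miller": "PERSON"}

# Assumed minimal stopword list, chosen for this example only.
STOPWORDS = {"the", "of", "on", "to", "from", "with", "over"}


def terms(text):
    """Return typed multiword terms plus lowercased words of 2+ letters."""
    found = {t for t in TYPES if t in text}
    for t in found:
        text = text.replace(t, " ")
    words = {w.lower() for w in re.findall(r"[A-Za-z]{2,}", text)}
    return found | (words - STOPWORDS)


TERMS = {doc_id: terms(text) for doc_id, text in DOCS.items()}


def view(doc_terms, wanted_type=None):
    """Restrict a term set to one type; None means the untyped view."""
    if wanted_type is None:
        return doc_terms
    return {t for t in doc_terms if TYPES.get(t) == wanted_type}


def jaccard(a, b):
    """Jaccard overlap of two term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(doc_id, wanted_type=None):
    """Return the other document with the highest similarity in this view."""
    own = view(TERMS[doc_id], wanted_type)
    others = [d for d in DOCS if d != doc_id]
    return max(others, key=lambda d: jaccard(own, view(TERMS[d], wanted_type)))


print(most_similar(1))            # untyped view: 2
print(most_similar(1, "PERSON"))  # PERSON view: 4
```

In this sketch, document 1 is closest to document 2 when all terms count, and closest to document 4 when only PERSON terms count.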
The ontology bottleneck :
Semantic Web people believe that annotation with ontology relations will enable semantic search, ...
Annotation: Choose an ontology, label all instances in the document
Problems:
New documents have to be annotated all over again
Merging of ontologies
Despite tools, users are reluctant to annotate their documents
[Figure: Doc 1 ... Doc n, each with its own annotation (Anno 1 ... Anno n), connected to a merged ontology and an interface]
Centralized annotation :
Types and relations for terms are assigned globally, once and for all.
No (logically grounded, consistent) ontology, but a free collection of types and relations suited to the problem
Annotation is done for document collections
[Figure: Doc 1 ... Doc n form a document collection with a single annotation, connected to an interface]
Generating Candidates for Annotation :
Given N terms from the collection, it is not feasible to present N² pairs to an annotator. Most of the pairs will not be related.
Needed: a method that produces terms with similar types and related pairs at a high rate
Method here:
Co-occurrence statistics: Pairs of terms that occur significantly often together in sentences/documents.
Co-occurrences of higher orders: pairs of terms that have similar co-occurrence statistics
Co-occurrences reflect syntagmatic and paradigmatic relations; the former are ruled out in higher orders
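How co-occurrence counting cuts down the candidate pairs can be sketched as follows. This is a simplified illustration, not the method used in the talk: a raw count threshold (`min_count`, an assumed parameter) stands in for the significance test, and the toy sentences are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, for each unordered term pair, the sentences containing both."""
    counts = Counter()
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence)), 2):
            counts[(a, b)] += 1
    return counts


def candidate_pairs(sentences, min_count=2):
    """Keep only pairs that co-occur in at least min_count sentences."""
    counts = cooccurrence_counts(sentences)
    return sorted(pair for pair, k in counts.items() if k >= min_count)


# Invented toy corpus: each sentence is already tokenised into terms.
toy = [
    ["cat", "dog", "pet"],
    ["cat", "mouse"],
    ["dog", "pet", "food"],
    ["cat", "pet", "food"],
]

vocabulary = {term for sentence in toy for term in sentence}
all_pairs = len(vocabulary) * (len(vocabulary) - 1) // 2

print(all_pairs)             # 10 unordered pairs over 5 terms
print(candidate_pairs(toy))  # [('cat', 'pet'), ('dog', 'pet'), ('food', 'pet')]
```

Only 3 of the 10 possible pairs would be shown to the annotator in this toy setting.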
The cats and dogs example :
cat co-occurrences: dog, her, food, pet, litter, she, burglar, animal, my, mouse, feline, Garfield, like, Cat, bag
cat order 2: cats, pet, dog, animals, animal, dogs, pets, neutered, her, she, Synindex, like, tabbie, pigs, shelter
cat order 4: pet, pets, cats, dog, pigs, animals, dogs, animal, owners, zoo, wild, birds, rabbits, puppies, tiger
Graphical annotation tool: colourizing co-occurrences :
Specifying types and relations :
Clicking on a node or edge opens a context menu restricted to the part of speech (POS)
Web-based annotation tool for arbitrary candidate sources :
Rule-based candidate generation :
If some annotation is already present, then rules can be specified to obtain candidates at an even higher rate.
It is possible to guess the type of candidates
Example:
Rule 1: If IS-A(A,B) and PROPERTY(B), then PROPERTY(A); yields LIVING(dog) as candidate
Rule 2: If IS-A(A,B) and COHYPONYM(A,C), then IS-A(C,B); yields IS-A(cat, animal) as candidate
[Figure: graph over dog, cat and animal with IS-A and CO-HYPONYM edges and the property LIVING]
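The two rules can be sketched as a single pass over the existing annotation. The data layout (sets of tuples, a dict of properties) and the function name `rule_candidates` are assumptions made for this illustration; treating COHYPONYM as symmetric is also an assumption. The output is a set of candidates that still has to be accepted or rejected by an annotator.

```python
def rule_candidates(is_a, cohyponym, properties):
    """Apply Rule 1 and Rule 2 once and return new candidate annotations.

    is_a:       set of (A, B) pairs meaning IS-A(A, B)
    cohyponym:  set of (A, C) pairs meaning COHYPONYM(A, C)
    properties: dict mapping a property name to the set of terms holding it
    """
    candidates = set()

    # Rule 1: IS-A(A,B) and PROPERTY(B)  =>  PROPERTY(A)
    for a, b in is_a:
        for prop, holders in properties.items():
            if b in holders and a not in holders:
                candidates.add((prop, a))

    # Rule 2: IS-A(A,B) and COHYPONYM(A,C)  =>  IS-A(C,B)
    # Assumption: COHYPONYM is symmetric, so both directions are used.
    symmetric = cohyponym | {(c, a) for a, c in cohyponym}
    for a, b in is_a:
        for x, c in symmetric:
            if x == a and (c, b) not in is_a:
                candidates.add(("IS-A", c, b))

    return candidates


# Existing annotation from the example above.
is_a = {("dog", "animal")}
cohyponym = {("dog", "cat")}
properties = {"LIVING": {"animal"}}

print(sorted(rule_candidates(is_a, cohyponym, properties)))
# [('IS-A', 'cat', 'animal'), ('LIVING', 'dog')]
```

Running the rules again after IS-A(cat, animal) has been accepted would then also propose LIVING(cat).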
Tool to accept or reject rule-based candidates :
Case study: Annotating Deutscher Wortschatz (www.wortschatz.uni-leipzig.de) :
In terms of numbers:
In 1,000 hours, annotators could choose between
46 semantic types and
57 relations, and produced
150,000 type instances and
150,000 relation instances for over
80,000 distinct terms, that is a text coverage of
90%, with a speed of
5 units per minute
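The stated speed can be checked against the other figures, assuming that a "unit" is one type instance or one relation instance (this reading is an assumption, but it reproduces the number):

```python
hours = 1_000
type_instances = 150_000
relation_instances = 150_000

# Total annotated units divided by total annotation time in minutes.
units_per_minute = (type_instances + relation_instances) / (hours * 60)

print(units_per_minute)  # 5.0
```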
Different relations from different sources :
Example: Query resolution with types and relations :
Query: "Find documents mentioning at least two heads of computer companies!"
1. Translate into formal query:
Qset = {B | IS-A(A, computer company), HEAD-OF(B,A)}
b1 ∈ Qset, b2 ∈ Qset, b1 ≠ b2
2. Access search engine with possible b1, b2
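The two steps can be sketched on invented data. All company names, person names and documents below are hypothetical placeholders, and a plain substring match over an in-memory dictionary stands in for the search engine.

```python
# Hypothetical annotation: relation instances as sets of tuples.
IS_A = {
    ("AcmeSoft", "computer company"),
    ("ByteWorks", "computer company"),
    ("FreshBread", "bakery"),
}
HEAD_OF = {
    ("Alice Example", "AcmeSoft"),
    ("Bob Sample", "ByteWorks"),
    ("Carol Test", "FreshBread"),
}

# Hypothetical document collection.
DOCS = {
    "d1": "Alice Example met Bob Sample at a trade fair.",
    "d2": "Alice Example opened a new office.",
    "d3": "Bob Sample had lunch with Carol Test.",
}


def qset(is_a, head_of, category):
    """Step 1: Qset = {B | IS-A(A, category), HEAD-OF(B, A)}."""
    members = {a for a, b in is_a if b == category}
    return {b for b, a in head_of if a in members}


def matching_docs(docs, people, at_least=2):
    """Step 2: documents mentioning at least `at_least` distinct members."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if sum(p in text for p in people) >= at_least
    )


heads = qset(IS_A, HEAD_OF, "computer company")
print(sorted(heads))               # ['Alice Example', 'Bob Sample']
print(matching_docs(DOCS, heads))  # ['d1']
```

Counting distinct members of Qset per document covers the condition b1 ≠ b2, since each name is counted at most once.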
What Google found: Find documents mentioning at least two heads of computer companies! :
#1 hit 14.08.2005 www.google.com
Conclusion :
Typed terms and relations can facilitate processing of electronic documents for a wide range of applications
Rapid annotation alleviates the acquisition bottleneck by
- globally annotating
- local dependencies
Intuitive tools for annotation are highly important for producing large amounts of annotation in a short time
QUESTIONS?!? :
THANK YOU
Bonus material :
Co-occurrences
Co-occurrences of higher orders
Statistical Co-occurrences :
Occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors)
Significant Co-occurrences reflect relations between words
Significance Measure (log-likelihood):
- k is the number of sentences containing a and b together
- ab is (number of sentences with a) * (number of sentences with b)
- n is the total number of sentences in the corpus
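The formula for the measure is not given in the text above. One measure that uses exactly k, ab and n is the negative log of the Poisson probability of seeing k joint sentences when x = ab/n are expected under independence, that is x - k log x + log k!. Whether this is the measure meant here is an assumption; the sketch below only shows how the three quantities could combine.

```python
import math


def poisson_significance(k, count_a, count_b, n):
    """Negative log Poisson probability of k joint sentences.

    x = ab / n is the expected number of joint sentences if a and b
    were independent. Assumed measure, not confirmed by the text above.
    Requires count_a, count_b and n to be positive.
    """
    x = count_a * count_b / n
    # math.lgamma(k + 1) equals log(k!) without computing the factorial.
    return x - k * math.log(x) + math.lgamma(k + 1)


# Hypothetical counts: a and b each occur in 10 of 1000 sentences,
# so about 0.1 joint sentences are expected by chance.
for k in (0, 1, 5):
    print(k, round(poisson_significance(k, 10, 10, 1000), 3))
```

The value grows as k rises further above the expected count x, so pairs seen together far more often than chance rank highest. For k below x the value is not a meaningful significance score.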
Iterating Co-occurrences :
(sentence-based) co-occurrences of first order: words that co-occur significantly often together in sentences
co-occurrences of second order: words that co-occur significantly often in co-occurrence sets of first order
co-occurrences of n-th order: words that co-occur significantly often in co-occurrence sets of (n-1)th order
When calculating a higher order, the significance values of the preceding order are not relevant. A co-occurrence set consists of the N highest ranked co-occurrences of a word.
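The iteration can be sketched by treating every co-occurrence set of the previous order as if it were a sentence. This is a simplified illustration: raw counts rank the co-occurrences instead of a significance measure, and `top_n`, `min_count` and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_sets(units, top_n=3, min_count=1):
    """Map each word to its top_n co-occurring words over the given units."""
    partners = defaultdict(Counter)
    for unit in units:
        for a, b in combinations(sorted(set(unit)), 2):
            partners[a][b] += 1
            partners[b][a] += 1
    result = {}
    for word, counts in partners.items():
        # Rank by count (descending), break ties alphabetically.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[word] = [w for w, k in ranked if k >= min_count][:top_n]
    return result


def next_order(sets, top_n=3, min_count=1):
    """Compute the next order: each previous co-occurrence set is one unit.

    Only set membership is passed on; the counts of the preceding
    order are discarded, as stated above.
    """
    return cooccurrence_sets(list(sets.values()), top_n, min_count)


# Invented toy corpus in which "cat" and "dog" never share a sentence.
toy = [["cat", "purrs"], ["dog", "barks"], ["cat", "pet"], ["dog", "pet"]]

first = cooccurrence_sets(toy)
second = next_order(first)

print(first["cat"])   # ['pet', 'purrs']
print(second["cat"])  # ['dog']
```

Here "cat" and "dog" become second-order co-occurrences because both appear in the first-order set of "pet", although they never occur in the same sentence.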
Constructed Example I :
Constructed Example II :