Semantic Indexing with Typed Terms using Rapid Annotation
16th of August 2005
TKE-05 Workshop on Semantic Indexing, Copenhagen
Chris Biemann, University of Leipzig

Outline:
- The benefits of typed terms and relations
- Alleviating the ontology bottleneck
- Rapid annotation
- Sources for annotation candidates
- Annotation tools
- Case study: Annotation of "Deutscher Wortschatz"
- Conclusion

Typed terms and relations:
The bag-of-words model treats all terms equally:
- Document similarity is based on all terms
- No views on the data are possible
Typed terms and relations:
- Multiple views on documents w.r.t. types
- Document similarity restricted to types and augmented by relations
- Enables some tasks of Question Answering

Motivating example: untyped
Documents:
1. The government official A. Smith signed a contract over the purchase of 100 tanks from weapon manufacturer B. Miller.
2. "Weapon sales increased", a government official stated, "especially tanks sell well".
3. A holiday cruise on a yacht invites to take photos of seagulls.
4. The photos show A. Smith on a cruise with B. Miller's yacht.
[Figure: similarity of terms; clustering of documents 1-4]

Motivating example: type PERSON
Documents:
1. The government official A. Smith signed a contract over the purchase of 100 tanks from weapon manufacturer B. Miller.
2. "Weapon sales increased", a government official stated, "especially tanks sell well".
3. A holiday cruise on a yacht invites to take photos of seagulls.
4. The photos show A. Smith on a cruise with B. Miller's yacht.
[Figure: similarity of terms; clustering of documents 1-4]

The ontology bottleneck:
Semantic Web people believe that annotation with ontology relations will enable semantic search, ...
Annotation: Choose an ontology, label all instances in the document
Problems:
- New documents have to be annotated all over again
- Merging of ontologies
- Despite tools, users are reluctant to annotate their documents
[Figure: Doc 1 ... Doc n, each with its own annotation (Anno 1 ... Anno n); merged ontology; interface]

Centralized annotation:
- Types and relations for terms are assigned globally and once and for all.
- No (logically grounded, consistent) ontology, but a free collection of types and relations suited to the problem
- Annotation is done for document collections
[Figure: Doc 1 ... Doc n as one document collection with a single annotation; interface]

Generating Candidates for Annotation:
Given N terms from the collection, it is not feasible to present N² pairs to an annotator. Most of the pairs will not be related.
Needed: a method that produces terms with similar types and related pairs at a high rate
Method here:
- Co-occurrence statistics: pairs of terms that occur significantly often together in sentences/documents.
- Co-occurrences of higher orders: pairs of terms that have similar co-occurrence statistics
Co-occurrences reflect syntagmatic and paradigmatic relations; the former are ruled out in higher orders.

The cats and dogs example:
cat co-occurrences: dog, her, food, pet, litter, she, burglar, animal, my, mouse, feline, Garfield, like, Cat, bag
cat order 2: cats, pet, dog, animals, animal, dogs, pets, neutered, her, she, Synindex, like, tabbie, pigs, shelter
cat order 4: pet, pets, cats, dog, pigs, animals, dogs, animal, owners, zoo, wild, birds, rabbits, puppies, tiger

Graphical annotation tool: colourizing co-occurrences

Specifying types and relations:
Clicking on a node / edge opens a context menu restricted to POS

Web-based annotation tool for arbitrary candidate sources

Rule-based candidate generation:
If some annotation is already present, then rules can be specified to obtain candidates at an even higher rate.
It is possible to guess the type of candidates.
Example:
- Rule 1: If IS-A(A,B) and PROPERTY(B), then PROPERTY(A); yields LIVING(dog) as candidate
- Rule 2: If IS-A(A,B) and COHYPONYM(A,C), then IS-A(C,B); yields IS-A(cat, animal) as candidate
[Figure: graph with the nodes dog, cat and animal, LIVING labels, and IS-A and CO-HYPONYM edges]

Tool to accept or reject rule-based candidates

Case study: Annotating Deutscher Wortschatz
www.wortschatz.uni-leipzig.de
In terms of numbers: In 1,000 hours, annotators could choose between 46 semantic types and 57 relations, and produced 150,000 type instances and 150,000 relation instances for over 80,000 distinct terms. That is a text coverage of 90%, at a speed of 5 units per minute.

Different relations from different sources

Example: Query resolution with types and relations
Query: "Find documents mentioning at least two heads of computer companies!"
1. Translate into formal query: Qset = {B | IS-A(A, computer company), HEAD-OF(B,A)}; b1 ∈ Qset, b2 ∈ Qset, b1 ≠ b2
2. Access search engine with possible b1, b2

What Google found: Find documents mentioning at least two heads of computer companies!
#1 hit, 14.08.2005, www.google.com

Conclusion:
- Typed terms and relations can facilitate processing of electronic documents for a wide range of applications
- Rapid annotation alleviates the acquisition bottleneck by
  - globally annotating
  - local dependencies
- Intuitive tools for annotation are highly important to achieve large amounts in short time

QUESTIONS?!?
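
Looking back at the rule-based candidate generation slide, the two rules can be sketched in a few lines of Python. This is only an illustration, not the tool used in the case study: the function name `generate_candidates`, the tuple representation of annotations, and the starting facts IS-A(dog, animal), LIVING(animal) and COHYPONYM(dog, cat) are assumptions read off the slide's example graph.

```python
def generate_candidates(is_a, cohyponym, properties):
    """Propose new annotation candidates from existing annotations.

    Hypothetical representation (not from the slides):
      is_a       -- set of (A, B) pairs meaning IS-A(A, B)
      cohyponym  -- set of (A, C) pairs meaning COHYPONYM(A, C)
      properties -- set of (PROPERTY, term) pairs, e.g. ("LIVING", "animal")
    """
    candidates = set()

    # Rule 1: IS-A(A, B) and PROPERTY(B)  =>  PROPERTY(A)
    for a, b in is_a:
        for prop, term in properties:
            if term == b:
                candidates.add((prop, a))

    # Rule 2: IS-A(A, B) and COHYPONYM(A, C)  =>  IS-A(C, B)
    for a, b in is_a:
        for a2, c in cohyponym:
            if a2 == a:
                candidates.add(("IS-A", c, b))

    # Only propose what is not annotated yet; an annotator still has to
    # accept or reject every candidate.
    known = set(properties) | {("IS-A", a, b) for a, b in is_a}
    return candidates - known


# Starting annotations assumed from the slide's example graph.
is_a = {("dog", "animal")}
cohyponym = {("dog", "cat")}
properties = {("LIVING", "animal")}

for candidate in sorted(generate_candidates(is_a, cohyponym, properties)):
    print(candidate)
```

With these assumed facts the sketch proposes IS-A(cat, animal) and LIVING(dog), the two candidates named on the slide. The rules only raise the rate of good candidates; they do not replace the accept/reject step.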
THANK YOU

Bonus material:
- Co-occurrences
- Co-occurrences of higher orders

Statistical Co-occurrences:
- Occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors)
- Significant co-occurrences reflect relations between words
Significance Measure (log-likelihood):
- k is the number of sentences containing a and b together
- ab is (number of sentences with a)*(number of sentences with b)
- n is the total number of sentences in the corpus

Iterating Co-occurrences:
- (sentence-based) co-occurrences of first order: words that co-occur significantly often together in sentences
- co-occurrences of second order: words that co-occur significantly often in collocation sets of first order
- co-occurrences of n-th order: words that co-occur significantly often in collocation sets of (n-1)th order
When calculating a higher order, the significance values of the preceding order are not relevant. A co-occurrence set consists of the N highest ranked co-occurrences of a word.

Constructed Example I

Constructed Example II
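
The iteration described in the bonus slides can be sketched as follows. The formula of the significance measure is not part of this transcript, so the sketch assumes a Poisson-style score, x - k*ln(x) + ln(k!) with x = ab/n, chosen only because it uses exactly the three quantities the slide defines; it may differ from the measure on the original slide. The function names, the toy corpus and the `top_n` parameter are likewise hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def significance(k, a, b, n):
    """ASSUMED measure (the slide's formula is not in the transcript):
    x - k*ln(x) + ln(k!), with x = a*b/n the expected joint count."""
    x = a * b / n
    return x - k * math.log(x) + math.lgamma(k + 1)


def cooccurrence_sets(units, top_n=3):
    """Return, per item, its top_n most significant co-occurring items.

    For the first order, `units` are sentences (lists of words); for a
    higher order, they are the co-occurrence sets of the preceding order.
    """
    n = len(units)
    freq = Counter()
    pair_freq = Counter()
    for unit in units:
        items = sorted(set(unit))
        freq.update(items)
        pair_freq.update(combinations(items, 2))

    scored = {}
    for (a, b), k in pair_freq.items():
        if k <= freq[a] * freq[b] / n:
            continue  # seen together no more often than expected by chance
        sig = significance(k, freq[a], freq[b], n)
        scored.setdefault(a, []).append((sig, b))
        scored.setdefault(b, []).append((sig, a))

    # Keep only the ranking; the significance values themselves are not
    # passed on to the next order.
    return {
        word: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:top_n]]
        for word, pairs in scored.items()
    }


# Toy corpus (made up): "cat" and "dog" never share a sentence.
sentences = [
    ["cat", "eats", "food"],
    ["dog", "eats", "food"],
    ["cat", "chases", "mouse"],
    ["dog", "chases", "ball"],
    ["bird", "sings"],
    ["sun", "shines"],
]

first_order = cooccurrence_sets(sentences)
second_order = cooccurrence_sets(list(first_order.values()))

print("cat, order 1:", first_order["cat"])
print("cat, order 2:", second_order["cat"])
```

In this toy corpus "dog" is absent from the first-order set of "cat", because the two words never occur in the same sentence, but it shows up in the second-order set, because both words occur in similar first-order sets. This mirrors the claim that higher orders favour paradigmatic relations.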