sepln 2003
Added: October 03, 2007

Presentation Transcript

Linguistic Processing of Classification Hierarchies:
  Bernardo Magnini
  ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica
  Trento - Italy

Current Research Topics on Text Processing at ITC-irst:
  Question/Answering: TREC style
  Information Extraction: ML approach, DOT.KOM project
  Lexical Acquisition and Linguistic Resources: MultiWordnet, Wordnet Domains, corpora for Italian
  Word Sense Disambiguation: based on domains, MEANING project
  NLP for Knowledge Management: Edamok project
  Evaluation of NLP Technologies: QA at CLEF-2003, Senseval-3

Outline:
  Classification Hierarchies (CH)
  Concept hierarchies
  Approaches toward interoperability of CHs
  Semantic interpretation of CHs
  Making the information explicit: the role of linguistic and world knowledge
  Experimental setting
  Preliminary results with CTXMATCH algorithm

Organizing papers: A senior researcher:
  Diagram labels: Work, WSD, QA, Papers, Projects, Experiments, Senseval-2, ACL-02, Submission, Camera ready, Submission
  Knowledge about the domain is used
  Classification schemas are repeated
  Labels are interpreted in their context

Organizing papers: A young researcher:
  Diagram labels: Home, Articles, Code, 2002, 2001, 2000, Senseval-2, ACL-02, workshops, Int. conferences, journals
  A different view for the same documents
  Redundant information
  Different labels for the same concept

Organizing papers: A student:
  Diagram labels: Disambiguation, Results-all-word-Eng., Senseval-Call-for-paper, Senseval-article, Meaning-project, Algorithm-description, Acl-article-final-version, Lexical-sample-training-data
  Less structure corresponds to more complex labels
  Any kind of document is allowed (text, images, code, …)

Questions:
  Can a system automatically discover similarities among different views of the same documents?
  Example: retrieving documents in classification B using the schema of classification A
  How much reasoning is involved?
  Labels are expressed in a natural language. Is there a role for NLP technologies?

Classification Hierarchies – CH (1):
  Taxonomic organization of documents
  Easy to build: no formal language is required
  Widely used:
    Web directories (Google, Yahoo!, Looksmart, portals)
    Marketplace catalogues for product classifications
    File systems
    Local ontologies
  Documents are classified at all levels of the hierarchy
  The structure of a CH reflects both the documents and world knowledge

Classification Hierarchies (2):
  Diagram labels: Vacation, 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA
  Semi-structured: relations among nodes are not formally defined.
  Document dependent: CHs are organized according to the documents that have to be classified.
  Specificity criterion: a document is classified in the most specific node of the hierarchy.

Interoperability among CHs:
  Commercial interest: Distributed Knowledge Management in corporations
  Scientific interest. Various terms have been recently used, including:
    Meaning negotiation
    Semantic coordination
    Mapping between domain models
    Semantic mediation
    Ontology merging, integration or alignment
    Integration of hierarchical categorization
  Fits well in the Semantic Web perspective
  Common goal: find mappings between nodes of two classification hierarchies

Interoperability among CHs:
  Diagram: Source CH and Target CH
  Diagram labels: Vacation, 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA, Sea holidays, Italy in Europe
  The last build of the slide adds a "?"
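
A minimal sketch of a classification hierarchy as a labelled tree, with documents allowed at every level and filed under the specificity criterion. The transcript flattens the Vacation diagram, so the nesting used here (which places sit under which year) is an assumption, and the names Node, paths and classify are hypothetical helpers introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a classification hierarchy (CH)."""
    label: str
    children: list[Node] = field(default_factory=list)
    documents: list[str] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        """Attach a child node and return it, so trees can be built step by step."""
        self.children.append(child)
        return child


def paths(node, prefix=()):
    """Yield (path, node) for every node in the hierarchy, root first."""
    here = prefix + (node.label,)
    yield here, node
    for child in node.children:
        yield from paths(child, here)


def classify(root, doc, labels):
    """Specificity criterion: follow `labels` down from the root as far as
    the hierarchy allows and file the document at the deepest node reached."""
    node = root
    for label in labels:
        match = next((c for c in node.children if c.label == label), None)
        if match is None:
            break
        node = match
    node.documents.append(doc)
    return node


# Assumed nesting of the Vacation example; only the labels come from the slide.
root = Node("Vacation")
y2001 = root.add(Node("2001"))
y2000 = root.add(Node("2000"))
sea_2001 = y2001.add(Node("Sea"))
y2001.add(Node("Lake"))
sea_2000 = y2000.add(Node("Sea"))
y2000.add(Node("Mountains"))
sea_2001.add(Node("Tuscany"))
sea_2001.add(Node("Spain"))
sea_2000.add(Node("USA"))

for path, _ in paths(root):
    print("/".join(path))
```

Because the same label (Sea) occurs under two parents, a node is only identified by its full path, which is why a label has to be read in the context of its ancestors.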

Qualitative mapping:
  Diagram: Source CH and Target CH
  Diagram labels: Vacation, 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA, Sea holidays, Italy in Europe, 2001, Tuscany
  Relations shown across the builds of the slide:
    More general
    More specific
    Equivalent
    Not compatible
    Compatible

Approaches to CH mapping:
  Approaches to CH mapping can be grouped into four classes, according to the kind of information used:
    Based on document content
    Based on document classifications
    Based on structural information
    Based on semantic interpretation of labels (CTXMATCH)

1. Mapping based on Documents:
  Consider the content of the documents
  Procedure [Madhavan et al. AAAI-2002]:
    Train a classifier on documents of the source CH
    Apply the classifier to documents of the target CH
  Drawbacks:
    Needs the documents
    Only textual documents can be considered
    Does not consider structural information
    Does not produce qualitative mappings

2. Mapping based on Classifications:
  Consider the number of documents in common between nodes of different CHs
  Procedure [Ichise et al. IJCAI-2003]:
    Compute a statistical model of the classification criteria of source and target CHs
    Determine similarity between pairs of nodes in source and target
  Drawbacks:
    Needs documents in common
    Does not produce qualitative mappings

3. Mapping Based on Structural Information (1):
  Consider node definitions and their lexical expansions
  Procedure [Calvanese et al. ISWC 2001]:
    Automatically propose candidate mappings based on lexicographic criteria
    Correct mappings are validated by a domain expert
  Drawbacks:
    Requires human intervention
    Feasible for ontology integration, not for CHs

3. Mapping Based on Structural Information (2):
  Consider structural constraints among nodes
  Procedure [Daude et al. ACL-2000, this conference]:
    Select candidate pairs with lexicographic criteria
    Select structural constraints
    Use relaxation labelling to choose the best candidate
  Drawbacks:
    Good for WordNet, but CHs have a lot of implicit knowledge
    Does not produce qualitative mappings

4. Mapping Based on Semantic Interpretation:
  Consider linguistic processing of nodes and world knowledge
  Procedure [Bouquet et al. ISWC-2003, to appear]:
    Build a logical interpretation for the source and the target nodes
    Compute the relation between the two logical forms
  Drawbacks:
    Requires world knowledge
    Requires tuning of linguistic tools for CHs

Semantic Interpretation (1):
  Diagram labels: Images, Beach, Mountain, Italy
  Relations shown: More specific, More specific
  World knowledge is necessary

Semantic Interpretation (2):
  Diagram labels: Images, Beach, Mountain, Italy
  Relations shown: More specific, More specific, More specific, Equivalent

Linguistic Processing of CHs:
  How do linguistic techniques work on CHs?
    Tokenization and part-of-speech tagging
    Multiword recognition
    Named entity recognition
    Word sense disambiguation
  Which peculiar problems are posed by CHs as far as their semantic interpretation is concerned?
  How much implicit information is it possible to extract from CHs?

Part of Speech Tags (1):
  Diagram labels: Vacation, 2001, 2000, Sea, Lake, Beach, Mountains, Tuscany, Spain, USA
  Nouns are prevalent
  Limited context available for solving ambiguities

Part of Speech Tags (2):
  POS tagger: TNT [Brants, ANLP-2000]
  CH: 5k tokens extracted from a balanced set of CHs (web directories, file systems, product catalogues, ontologies), both for English and Italian
  Text:
    English: training over 1M words (BNC)
    Italian: training over 50k words (Elsnet)

Tokenization:
  Parentheses and acronyms
  Examples from UNSPSC: Credit agencies; Business credit agencies; Business credit gathering or reporting services; Value added network (VAN) services

Abbreviations:
  Examples from EClass: Potato, pot. product; Semi-instant product (veg.)

Multiwords:
  Multiword on two contiguous levels
  Multiword on one level
  Examples from Google: Billiards, Players, Sport, United States

Coordination:
  Conjunction: Alternative and Holistic medicine
  Disjunction: Witch doctors or voodoo services
  From UNSPSC: Healthcare Services

Multilinguality:
  Spanish, English, Mixed

Lexical Ambiguity:
  Structural information provides context for word sense disambiguation
  The connections between WSD and web directories have been investigated by [Gonzalo et al. 2003]
  Example from Google: Trees, Apple tree, Plants

Arc Interpretation:
  Relations among nodes are not formally defined
  Instance-of
  In CHs, documents classified under a certain node A are a subset of the documents classified under a parent node of A.
  According to our world knowledge, the relation between two nodes can be interpreted in various ways.

Arc Interpretation:
  Relations among nodes are not formally defined
  Part-of
  Example from Google: Images, Pisa, Florence, Tuscany

Arc Interpretation:
  Relations among nodes are not formally defined
  Generic associations
  Example from Google: Television, Cable_TV, Public_Access, Satellite, Guides

Arc Interpretation:
  Relations among nodes are not formally defined
  Meta-level criteria
  Example from Google: World, Languages, A, Afrikaans, B, Bali

Implicit Negation:
  Trentino is part of North Italy
  From ITC-irst personnel office: Origin of ITC-irst employees
  Diagram labels: Italy, North except Trentino, Center, South, Trentino

CTXMATCH Algorithm:
  Semantic explicitation
    Linguistic analysis of labels: shallow parsing, access to WordNet, multiwords
    Contextualization
      Sense filtering (use WordNet as knowledge repository)
      Sense composition (use WordNet as knowledge repository)
  Semantic comparison
    Build a logical form (description logics)
    Compute the logical relation between the two formulas (SAT solver)

An Experimental Setting: Matching Web Directories:
  Task: automatically discover qualitative mappings among corresponding directories of Google and Yahoo
  CTXMATCH:
    Input: a pair <N1, N2> belonging to CH1 and CH2
    Output: a relation holding between N1 and N2: more general, more specific, equivalent, no relation
  Evaluation: define a metric considering the documents (URLs) classified both by Google and Yahoo. Define a mapping between this metric and the CTXMATCH relations.
  Baseline: string match of the paths of the two nodes.

Matching Google and Yahoo!: Linguistic Analysis

Matching Google and Yahoo!: Preliminary Results:
  Google: Architecture/History/Periods_and_Styles/Gothic
  is more specific than
  Yahoo: Architecture/History/Medieval

Ongoing and Future Experiments:
  Web directories: build a reference benchmark for evaluating matching algorithms.
    Include Looksmart
    Google English vs Google Italian
  File systems
  Collaboration: Edamok, SWAP, MEANING
  Domain specific applications
    Medical classification: integration of UML in the algorithm
    Public Administration: matching document classification hierarchies for automatic routing
  Edamok project: www.edamok.itc.it
    Papers, algorithm specifications, case studies

Conclusions:
  Interoperability of Classification Hierarchies
    Scientific interest: Semantic Web community
    Application oriented interest
  NLP can play a crucial role
  A proper experimental setting is necessary for comparing different approaches
  CTXMATCH:
    Qualitative mappings
    Semantic interpretation based on linguistic analysis
    Preliminary results
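
The input/output contract of the experimental setting (a pair of nodes in, one qualitative relation out) and the string-match baseline can be illustrated with a toy sketch. This is not the CTXMATCH implementation: the slides describe description-logic forms compared with a SAT solver over WordNet knowledge, whereas the code below treats a path as a plain conjunction of its labels and replaces WordNet with a one-entry hand-written table. The BROADER table, the fact that "gothic" falls under "medieval", and all function names are assumptions made only to reproduce the Google/Yahoo example.

```python
# Hypothetical stand-in for world knowledge (CTXMATCH takes this from WordNet).
# Each entry reads: the key is a narrower concept than the value.
BROADER = {"gothic": "medieval"}

GOOGLE = "Architecture/History/Periods_and_Styles/Gothic"
YAHOO = "Architecture/History/Medieval"


def normalise(path):
    """Split a directory path into lower-cased labels."""
    return [label.lower() for label in path.strip("/").split("/")]


def baseline_match(path_a, path_b):
    """Baseline: string match of the two paths (assumed here to mean
    equality after lower-casing)."""
    return normalise(path_a) == normalise(path_b)


def subsumed_by(concept, other):
    """True if `concept` equals `other` or reaches it by following BROADER."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept == other:
            return True
        seen.add(concept)
        concept = BROADER.get(concept)
    return False


def entails(labels_a, labels_b):
    """Node A entails node B if every label of B is equal to, or broader
    than, some label of A."""
    return all(any(subsumed_by(a, b) for a in labels_a) for b in labels_b)


def relation(path_a, path_b):
    """Return the qualitative relation of node A with respect to node B."""
    a, b = normalise(path_a), normalise(path_b)
    a_entails_b, b_entails_a = entails(a, b), entails(b, a)
    if a_entails_b and b_entails_a:
        return "equivalent"
    if a_entails_b:
        return "more specific"
    if b_entails_a:
        return "more general"
    return "no relation"


print(baseline_match(GOOGLE, YAHOO))
print(relation(GOOGLE, YAHOO))
```

In this sketch the baseline finds no match for the two paths because the strings differ, while the knowledge-based comparison can still relate them; that contrast is what the toy is meant to show, and it says nothing about the accuracy of the real algorithm.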