SEPLN 2003


Presentation Transcript

Linguistic Processing of Classification Hierarchies: 

Bernardo Magnini, ITC-irst (Istituto per la Ricerca Scientifica e Tecnologica), Trento, Italy

Current Research Topics on Text Processing at ITC-irst: 

Question answering (TREC style); information extraction (ML approach, DOT.KOM project); lexical acquisition and linguistic resources (MultiWordNet, WordNet Domains, corpora for Italian); word sense disambiguation (based on domains, MEANING project); NLP for knowledge management (Edamok project); evaluation of NLP technologies (QA at CLEF-2003, Senseval-3).

Outline: 

Classification hierarchies (CHs); concept hierarchies; approaches toward interoperability of CHs; semantic interpretation of CHs; making the information explicit: the role of linguistic and world knowledge; experimental setting; preliminary results with the CTXMATCH algorithm.

Organizing papers: A senior researcher: 

Folder labels in the figure: Work, WSD, QA, Papers, Projects, Experiments, Senseval-2, ACL-02, Submission, Camera ready, Submission. Observations: knowledge about the domain is used; classification schemas are repeated; labels are interpreted in their context.

Organizing papers: A young researcher: 

Folder labels in the figure: Home, Articles, Code, 2002, 2001, 2000, Senseval-2, ACL-02, workshops, Int. conferences, journals. Observations: a different view of the same documents; redundant information; different labels for the same concept.

Organizing papers: A student: 

Labels in the figure: Disambiguation, Results-all-word-Eng., Senseval-Call-for-paper, Senseval-article, Meaning-project, Algorithm-description, Acl-article-final-version, Lexical-sample-training-data. Observations: less structure corresponds to more complex labels; any kind of document is allowed (text, images, code, …).

Questions: 

Can a system automatically discover similarities among different views of the same documents? Example: retrieving documents in classification B using the schema of classification A. How much reasoning is involved? Labels are expressed in natural language: is there a role for NLP technologies?

Classification Hierarchies – CH (1): 

Taxonomic organization of documents. Easy to build: no formal language is required. Widely used: Web directories (Google, Yahoo!, Looksmart, portals), marketplace catalogues for product classification, file systems, local ontologies. Documents are classified at all levels of the hierarchy. The structure of a CH reflects both the documents and world knowledge.

Classification Hierarchies (2): 

Example hierarchy (figure labels): Vacation, 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA. Semi-structured: relations among nodes are not formally defined. Document dependent: CHs are organized according to the documents that have to be classified. Specificity criterion: a document is classified in the most specific node of the hierarchy.

Interoperability among CHs: 

Commercial interest: distributed knowledge management in corporations. Scientific interest: various terms have recently been used, including meaning negotiation; semantic coordination; mapping between domain models; semantic mediation; ontology merging, integration or alignment; integration of hierarchical categorization. Fits well in the Semantic Web perspective. Common goal: find mappings between nodes of two classification hierarchies.

Interoperability among CHs:

[Figure: a source CH and a target CH side by side. Node labels: Vacation, 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; Sea holidays, Italy in Europe. The last step of the slide adds a "?" to the figure.]

Qualitative mapping:

[Figure sequence: the same source CH and target CH (node labels: Vacation, 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; Sea holidays, Italy in Europe), with the labels 2001 and Tuscany added from the second step on. Successive steps label the mapping between the two hierarchies as: more general, more specific, equivalent, not compatible, compatible.]
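The relation names on the qualitative mapping slides (more general, more specific, equivalent, not compatible, compatible) can be made concrete by reading each node as the set of documents it would contain. The sketch below rests on that assumption only; the function name `qualitative_relation` and the set-based reading are illustrative and are not taken from the slides.

```python
def qualitative_relation(source_docs, target_docs):
    """Relation of the source node to the target node, read as document sets."""
    source, target = set(source_docs), set(target_docs)
    if source == target:
        return "equivalent"
    if source < target:          # proper subset: source covers fewer documents
        return "more specific"
    if source > target:          # proper superset: source covers more documents
        return "more general"
    if source & target:          # partial overlap
        return "compatible"
    return "not compatible"      # no document in common


print(qualitative_relation({"d1"}, {"d1", "d2"}))
```

Under this reading, "compatible" means the two nodes merely overlap, while "not compatible" means they share nothing.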

Approaches to CH mapping: 

Approaches to CH mapping can be grouped into four classes, according to the kind of information used: (1) based on document content; (2) based on document classifications; (3) based on structural information; (4) based on semantic interpretation of labels (CTXMATCH).

1. Mapping based on Documents: 

Consider the content of the documents. Procedure [Madhavan et al. AAAI-2002]: train a classifier on the documents of the source CH; apply the classifier to the documents of the target CH. Drawbacks: needs the documents; only textual documents can be considered; does not consider structural information; does not produce qualitative mappings.
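The two-step procedure (train on the source CH, apply to the target CH) can be sketched with a small bag-of-words naive Bayes classifier. This is a stand-in chosen for brevity, not the classifier of the cited work; the function names, the toy documents, and the majority-vote step are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """labelled_docs: list of (source_node, text). Returns the model tuple."""
    word_counts = defaultdict(Counter)   # node -> word frequencies
    doc_counts = Counter()               # node -> number of documents
    vocabulary = set()
    for node, text in labelled_docs:
        words = text.lower().split()
        word_counts[node].update(words)
        doc_counts[node] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the source node with the highest log-probability for text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_node, best_score = None, -math.inf
    for node in sorted(doc_counts):
        total_words = sum(word_counts[node].values())
        score = math.log(doc_counts[node] / total_docs)
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score
            count = word_counts[node][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_node, best_score = node, score
    return best_node


def map_target_node(model, target_docs):
    """Map a target node to the source node most of its documents fall into."""
    votes = Counter(classify(model, text) for text in target_docs)
    return votes.most_common(1)[0][0]


# Toy source CH documents (invented for illustration)
source = [
    ("Sea", "beach sand waves swimming"),
    ("Sea", "boat waves beach sun"),
    ("Mountains", "hiking snow peak trail"),
    ("Mountains", "ski snow peak cabin"),
]
model = train(source)
mapped = map_target_node(model, ["sun beach holiday", "waves and sand"])
print(mapped)
```

The output is a single best-matching source node, which illustrates the last drawback listed on the slide: nothing in it says whether the target node is more general, more specific, or equivalent.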

2. Mapping based on Classifications:

Consider the number of documents that nodes of different CHs have in common. Procedure [Ichise et al. IJCAI-2003]: compute a statistical model of the classification criteria of the source and target CHs; determine the similarity between pairs of nodes in source and target. Drawbacks: needs documents in common; does not produce qualitative mappings.
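As a rough illustration of similarity computed from shared documents, the sketch below ranks node pairs by Jaccard overlap of their document sets. Jaccard overlap is a deliberately simple substitute for the statistical model of the cited work, and the node names and document ids are invented.

```python
def jaccard(docs_a, docs_b):
    """Share of documents two nodes have in common (0.0 when both are empty)."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_pairs(source_ch, target_ch):
    """source_ch, target_ch: dict mapping node label -> iterable of document ids."""
    pairs = [
        (jaccard(s_docs, t_docs), s_node, t_node)
        for s_node, s_docs in source_ch.items()
        for t_node, t_docs in target_ch.items()
    ]
    return sorted(pairs, reverse=True)   # most similar pair first


source_ch = {"Sea": {"u1", "u2", "u3"}, "Mountains": {"u4", "u5"}}
target_ch = {"Sea holidays": {"u1", "u2"}, "Hiking": {"u5", "u6"}}
ranking = rank_pairs(source_ch, target_ch)
for score, s_node, t_node in ranking:
    print(f"{s_node} / {t_node}: {score:.2f}")
```

A score of this kind only works when the two hierarchies classify some of the same documents, which is the first drawback noted on the slide.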

3. Mapping Based on Structural Information (1): 

Consider node definitions and their lexical expansions. Procedure [Calvanese et al. ISWC 2001]: automatically propose candidate mappings based on lexicographic criteria; correct mappings are validated by a domain expert. Drawbacks: requires human intervention; feasible for ontology integration, not for CHs.

3. Mapping Based on Structural Information (2): 

Consider structural constraints among nodes. Procedure [Daude et al. ACL-2000, this conference]: select candidate pairs with lexicographic criteria; select structural constraints; use relaxation labelling to choose the best candidate. Drawbacks: good for WordNet, but CHs have a lot of implicit knowledge; does not produce qualitative mappings.

4. Mapping Based on Semantic Interpretation: 

Consider linguistic processing of nodes and world knowledge. Procedure [Bouquet et al. ISWC-2003, to appear]: build a logical interpretation for the source and the target nodes; compute the relation between the two logical forms. Drawbacks: requires world knowledge; requires tuning of linguistic tools for CHs.

Semantic Interpretation (1): 

[Figure: nodes labelled Images, Beach, Mountain, Italy, with two relations marked "more specific".] World knowledge is necessary.

Semantic Interpretation (2): 

[Figure: nodes labelled Images, Beach, Mountain, Italy, with three relations marked "more specific" and one marked "equivalent".]

Linguistic Processing of CHs: 

How do linguistic techniques work on CHs? Tokenization and part-of-speech tagging; multiword recognition; named entity recognition; word sense disambiguation. Which peculiar problems do CHs pose for semantic interpretation? How much implicit information can be extracted from CHs?

Part of Speech Tags (1): 

Figure labels: Vacation, 2001, 2000, Sea, Lake, Beach, Mountains, Tuscany, Spain, USA. Nouns are prevalent. Limited context is available for resolving ambiguities.

Part of Speech Tags (2): 

POS tagger: TnT [Brants, ANLP-2000]. CH data: 5k tokens extracted from a balanced set of CHs (web directories, file systems, product catalogues, ontologies), for both English and Italian. Text data: English, training on over 1M words (BNC); Italian, training on over 50k words (Elsnet).

Tokenization: 

Parentheses and acronyms. Labels from UNSPSC: Credit agencies; Business credit agencies; Business credit gathering or reporting services; Value added network (VAN) services.
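A label such as "Value added network (VAN) services" mixes a phrase with a parenthesized acronym for it. The sketch below shows one way a tokenizer might separate the two and check that the acronym is spelled by the initials of neighbouring words; the regular expression and the function names are assumptions, not part of the presented system.

```python
import re

# Two or more capital letters inside parentheses, e.g. "(VAN)"
ACRONYM = re.compile(r"\(([A-Z]{2,})\)")


def split_acronyms(label):
    """Return (label without parenthesized acronyms, list of acronyms found)."""
    acronyms = ACRONYM.findall(label)
    cleaned = re.sub(r"\s+", " ", ACRONYM.sub(" ", label)).strip()
    return cleaned, acronyms


def expands(acronym, label):
    """True if a run of consecutive word initials in label spells the acronym."""
    initials = "".join(word[0].upper() for word in label.split())
    return acronym in initials


label = "Value added network (VAN) services"
cleaned, acronyms = split_acronyms(label)
print(cleaned, acronyms, expands(acronyms[0], cleaned))
```

Removing the acronym before further analysis keeps it from being tagged as a separate, unknown noun.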

Abbreviations: 

Labels from EClass: "Potato, pot. product"; "Semi-instant product (veg.)".

Multiwords: 

Two cases: multiword on two contiguous levels; multiword on one level. Figure labels (from Google): Sport, Billiards, Players, United States.

Coordination: 

Two cases: conjunction and disjunction. Labels from UNSPSC: Healthcare Services; Alternative and Holistic medicine; Witch doctors or voodoo services.

Multilinguality: 

Cases shown: Spanish, English, mixed.

Lexical Ambiguity: 

Structural information provides context for word sense disambiguation. The connections between WSD and web directories have been investigated by [Gonzalo et al. 2003]. Figure labels (from Google): Plants, Trees, Apple tree.

Arc Interpretation: 

Relations among nodes are not formally defined. Example relation: instance-of. In CHs, the documents classified under a node A are a subset of the documents classified under a parent node of A. According to our world knowledge, the relation between two nodes can be interpreted in various ways.
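The subset property is the one guarantee an arc gives regardless of how it is interpreted. A minimal sketch, assuming a hypothetical `Node` class and an invented arrangement of the labels Vacation, Sea, and Tuscany:

```python
class Node:
    """A CH node holding its own documents and its child nodes."""

    def __init__(self, label, documents=(), children=()):
        self.label = label
        self.documents = set(documents)   # documents classified directly here
        self.children = list(children)

    def documents_under(self):
        """Documents at this node plus those of all its descendants."""
        found = set(self.documents)
        for child in self.children:
            found |= child.documents_under()
        return found


# Hypothetical hierarchy and file names, for illustration only
tuscany = Node("Tuscany", {"pisa.jpg"})
sea = Node("Sea", {"beach.jpg"}, [tuscany])
vacation = Node("Vacation", {"tickets.pdf"}, [sea])

print(tuscany.documents_under() <= sea.documents_under() <= vacation.documents_under())
```

The check holds for any tree built this way, whether the arc stands for instance-of, part-of, or a looser association.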

Arc Interpretation: 

Relations among nodes are not formally defined. Example relation: part-of. Figure labels (from Google): Images, Pisa, Florence, Tuscany.

Arc Interpretation: 

Relations among nodes are not formally defined. Example relation: generic associations. Figure labels (from Google): Television, Cable_TV, Public_Access, Satellite, Guides.

Arc Interpretation: 

Relations among nodes are not formally defined. Example relation: meta-level criteria. Figure labels (from Google): World Languages, A, Afrikaans, B, Bali.

Implicit Negation: 

Trentino is part of North Italy. Figure (from the ITC-irst personnel office: origin of ITC-irst employees), text in order: Italy; North; except Trentino; Center; South; Trentino.
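One reading of this slide, taken here as an assumption, is that North and Trentino are sibling nodes, so that North implicitly means "North except Trentino" once we know that Trentino is part of the North. The sketch below encodes that reading; the `PART_OF` table and both function names are hypothetical.

```python
# Assumed world knowledge: part -> whole
PART_OF = {"Trentino": "North"}


def interpret(label, siblings):
    """Return (label, excluded): siblings that are parts of label are excluded."""
    excluded = sorted(s for s in siblings if PART_OF.get(s) == label)
    return label, excluded


def render(label, excluded):
    """Spell out the implicit negation in words."""
    if not excluded:
        return label
    return label + " except " + ", ".join(excluded)


siblings = ["North", "Center", "South", "Trentino"]
readings = {
    s: render(*interpret(s, [other for other in siblings if other != s]))
    for s in siblings
}
print(readings["North"])
```

Without the part-of fact, nothing in the labels themselves signals that documents about Trentino are missing from North.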

CTXMATCH Algorithm: 

Semantic explicitation: linguistic analysis of labels (shallow parsing, access to WordNet, multiwords); contextualization, that is, sense filtering and sense composition (both use WordNet as knowledge repository). Semantic comparison: build a logical form (description logics); compute the logical relation between the two formulas (SAT solver).
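The semantic comparison step can be illustrated with propositional formulas and an exhaustive truth-table check, which stands in for the SAT solver named on the slide. The atoms, the single axiom (gothic implies medieval), and the omission of the Periods_and_Styles level are simplifying assumptions; the two paths are the Google and Yahoo directories quoted in the preliminary results.

```python
from itertools import product


def entails(atoms, axioms, premise, conclusion):
    """True if every world satisfying axioms and premise satisfies conclusion."""
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(ax(world) for ax in axioms) and premise(world) and not conclusion(world):
            return False   # counterexample found
    return True


def relation(atoms, axioms, form1, form2):
    """Qualitative relation of node 1 to node 2, from two entailment checks."""
    forward = entails(atoms, axioms, form1, form2)    # N1 implies N2
    backward = entails(atoms, axioms, form2, form1)   # N2 implies N1
    if forward and backward:
        return "equivalent"
    if forward:
        return "more specific"
    if backward:
        return "more general"
    return "no relation"


atoms = ["architecture", "history", "gothic", "medieval"]
# Assumed world knowledge: whatever is gothic is medieval
axioms = [lambda w: (not w["gothic"]) or w["medieval"]]
google = lambda w: w["architecture"] and w["history"] and w["gothic"]
yahoo = lambda w: w["architecture"] and w["history"] and w["medieval"]

result = relation(atoms, axioms, google, yahoo)
print(result)
```

Dropping the axiom turns the answer into "no relation", which shows where world knowledge enters the comparison.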

An Experimental Setting: Matching Web Directories: 

Task: automatically discover qualitative mappings among corresponding directories of Google and Yahoo!. CTXMATCH input: a pair <N1, N2> of nodes belonging to CH1 and CH2, respectively. Output: the relation holding between N1 and N2: more general, more specific, equivalent, or no relation. Evaluation: define a metric based on the documents (URLs) classified by both Google and Yahoo!, and define a mapping between this metric and the CTXMATCH relations. Baseline: string match of the paths of the two nodes.
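The slide does not spell out how the string-match baseline works. The sketch below is one plausible guess: paths are normalized, equal paths count as equivalent, and a path that extends the other counts as more specific. The normalization and the prefix rule are both assumptions.

```python
def normalize(path):
    """Lower-case each path component and treat underscores as spaces."""
    return [part.replace("_", " ").lower() for part in path.strip("/").split("/")]


def baseline_relation(path1, path2):
    """Relation of path1 to path2 using string comparison only."""
    p1, p2 = normalize(path1), normalize(path2)
    if p1 == p2:
        return "equivalent"
    if p1[:len(p2)] == p2:
        return "more specific"   # path1 extends path2
    if p2[:len(p1)] == p1:
        return "more general"    # path2 extends path1
    return "no relation"


print(baseline_relation(
    "Architecture/History/Periods_and_Styles/Gothic",
    "Architecture/History/Medieval",
))
```

For this pair the sketch finds no relation, since no amount of string comparison links Gothic to Medieval.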

Matching Google and Yahoo! : Linguistic Analysis: 

Matching Google and Yahoo! : Linguistic Analysis

Matching Google and Yahoo! : Preliminary Results: 

Google: Architecture/History/Periods_and_Styles/Gothic is more specific than Yahoo: Architecture/History/Medieval.

Ongoing and Future Experiments: 

Web directories: build a reference benchmark for evaluating matching algorithms; include Looksmart; Google English vs. Google Italian. File systems. Collaboration: Edamok, SWAP, MEANING. Domain-specific applications: medical classification (integration of UML in the algorithm); public administration (matching document classification hierarchies for automatic routing). Edamok project: www.edamok.itc.it (papers, algorithm specifications, case studies).

Conclusions: 

Interoperability of classification hierarchies: scientific interest (Semantic Web community) and application-oriented interest. NLP can play a crucial role. A proper experimental setting is necessary for comparing different approaches. CTXMATCH: qualitative mappings; semantic interpretation based on linguistic analysis; preliminary results.