IDAR26

Added: August 26, 2007
Presentation Transcript

Topic Oriented Semi-supervised Document Clustering : Jiangtao Qiu, Changjie Tang. Computer School, Sichuan University


OUTLINE : 1. Introduction 2. Motivation 3. Topic Semantic Annotation 4. Optimizing Hierarchical Clustering 5. Experiments 6. Conclusion


1. INTRODUCTION : We are developing a text mining prototype system. It aims to mine associative events, generate hypotheses, etc. At present, we have completed content extraction from web pages, document classification, and document clustering.


1. INTRODUCTION : [Diagram: prototype system. Stages: collecting data (web pages, text); preprocessing (remove noise, get feature vectors); classification and clustering (deriving the needed texts and vectors); mining (associative events, etc.); presenting.]


2. MOTIVATION : Traditional document clustering is usually treated as unsupervised learning. General method: (1) extract feature vectors; (2) compute the similarity among the vectors; (3) build a dissimilarity matrix; (4) run the clustering algorithm.
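
The general method above can be sketched in a few lines of Python. This is only an illustration under assumptions the slides do not state: TF-IDF is used as the feature weighting, cosine similarity as the similarity measure, and 1 minus cosine as the dissimilarity. The function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dissimilarity_matrix(vectors):
    """Symmetric matrix of (1 - cosine similarity) with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - cosine_similarity(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical toy corpus: two sport documents and one medical document.
docs = [
    "rockets center grabs rebound".split(),
    "rockets forward grabs rebound".split(),
    "cancer medicine trial results".split(),
]
matrix = dissimilarity_matrix(tfidf_vectors(docs))
```

Any clustering algorithm that accepts a precomputed dissimilarity matrix can then be run on `matrix`.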


2. Motivation : New challenge: can we group documents according to the user’s need?


3. Topic Semantic Annotation : We propose a new semi-supervised document clustering approach, topic oriented document clustering, which can group documents according to the user’s need.


3. Topic Semantic Annotation : Several issues need to be addressed: (1) How to represent the user’s need? (2) How to represent the relationship between the need and the documents? (3) How to evaluate the similarity of documents according to the need?


3. Topic Semantic Annotation : 3.1 How to represent the user’s need? We propose a multiple-attribute topic structure to represent the user’s need. A topic is the user’s focus, represented by a word. We use the concept set C of an ontology as the attribute set: the attributes of a topic are a collection of concepts {p1, …, pn} ⊆ C that describe the topic well.


3. Topic Semantic Annotation : 3.1 How to represent the user’s need? For example: we collect documents about Yao Ming. Several people named Yao Ming appear in the corpus, and we want to group the documents by person. We set ‘Yao Ming’ as the topic and choose background, place, and named entity as its attributes.


3. Topic Semantic Annotation : 3.1 How to represent the user’s need? Reasons for choosing the three attributes: 1. Many words have a background; for example, ‘cancer’ has the background ‘medicine’. When words such as ‘coach’ and ‘stadium’ appear in a document, it can be inferred that the people involved in the document are related to ‘sport’. We modified the ontology to add a background for the words in it.


3. Topic Semantic Annotation : 3.1 How to represent the user’s need? Reasons for choosing the three attributes: 2. Place distinguishes different people well. The places where people grew up and have lived may clearly separate one person from another.


3. Topic Semantic Annotation : 3.1 How to represent the user’s need? Reasons for choosing the three attributes: 3. Named entities may be used to describe the semantics of the topic. Names of people, institutions, and organizations that do not occur in the dictionary are called named entities.


3. Topic Semantic Annotation : 3.2 How to represent the relationship between the need and the documents? We represent the relationship between the topic and the documents by annotating the documents with topic semantics.


3. Topic Semantic Annotation : 3.2 How to represent the relationship between the need and the documents? Given a topic T with attributes p1, …, pn and a document S with words {t1, …, tm}: (1) check whether a word ti can be mapped to an attribute pj through the ontology; (2) check whether ti is semantically correlated with T, that is, whether the distance between ti and T is not larger than a threshold; (3) if both hold, insert ti into the vector Pj = {…, ti}; (4) when all words have been explored, we obtain the attribute vectors P1 = {…, ti}, …, Pn = {…, tm}. We call this process topic-semantic annotation.
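
A minimal sketch of the annotation process, under loudly stated assumptions: the ontology is replaced by a hypothetical lookup table from word to (attribute, concept), the semantic distance between a word and the topic is a hypothetical precomputed number, and each vector stores concept counts (matching the pair notation used in the slides' example). None of these names or values come from the slides.

```python
from collections import Counter

# Hypothetical toy ontology: word -> (attribute, concept).
TOY_ONTOLOGY = {
    "rebound": ("background", "sport"),
    "coach": ("background", "sport"),
    "houston": ("place", "Houston"),
    "tumor": ("background", "medicine"),
}

# Hypothetical semantic distance from each word to the topic.
TOY_DISTANCE = {"rebound": 0.2, "coach": 0.2, "houston": 0.3, "tumor": 0.9}


def annotate(words, attributes, ontology, distance, threshold):
    """Build one counted vector per attribute for a single document."""
    vectors = {attribute: Counter() for attribute in attributes}
    for word in words:
        key = word.lower()
        entry = ontology.get(key)
        if entry is None:
            continue  # the word cannot be mapped to any attribute
        attribute, concept = entry
        if attribute not in vectors:
            continue  # mapped to an attribute the topic does not use
        if distance.get(key, float("inf")) > threshold:
            continue  # not semantically correlated with the topic
        vectors[attribute][concept] += 1
    return vectors


doc = "The coach watched the rebound in Houston near a tumor clinic".split()
result = annotate(doc, ["background", "place"], TOY_ONTOLOGY,
                  TOY_DISTANCE, threshold=0.5)
```

Here "tumor" maps to an attribute but is dropped because its assumed distance to the topic exceeds the threshold.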


3. Topic Semantic Annotation : 3.2 How to represent the relationship between the need and the documents? Example: "Houston Rockets center Yao Ming grabs a rebound in front of Detroit Pistons forward Rasheed Wallace and Rockets forward Shane Battier during the first half of their NBA game in Auburn Hills, Michigan." Topic: Yao Ming. Attributes: p1 = background, p2 = place, p3 = named entity. Feature vectors: P1 = {<sport, 4>}; P2 = {<Houston, 1>, <Michigan, 1>, <Detroit, 1>}; P3 = {<Rasheed Wallace, 1>, <Shane Battier, 1>, <Auburn Hills, 1>}.


3. Topic Semantic Annotation : 3.3 How to evaluate the similarity of documents according to the need? [Diagram: documents d1 and d2, each represented by attribute vectors V1 = {…}, …, Vn = {…}.]
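
The transcript does not preserve the similarity formula, so the following is only one plausible reading, not the authors' method: compare two documents attribute by attribute with cosine similarity and take a weighted average. The function names, the equal default weights, and the toy documents are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(count * v.get(key, 0) for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_similarity(doc_a, doc_b, weights=None):
    """Weighted average of per-attribute cosine similarities (an assumption)."""
    attributes = sorted(set(doc_a) | set(doc_b))
    if weights is None:
        weights = {attribute: 1.0 for attribute in attributes}
    total = sum(weights.get(a, 0.0) for a in attributes)
    if total == 0.0:
        return 0.0
    score = sum(
        weights.get(a, 0.0) * cosine(doc_a.get(a, {}), doc_b.get(a, {}))
        for a in attributes
    )
    return score / total


# Hypothetical annotated documents: attribute -> {concept: count}.
d1 = {"background": {"sport": 4},
      "place": {"Houston": 1, "Michigan": 1, "Detroit": 1}}
d2 = {"background": {"sport": 2}, "place": {"Houston": 1}}
d3 = {"background": {"medicine": 3}, "place": {"Chengdu": 1}}

sim_same_person = topic_similarity(d1, d2)
sim_other_person = topic_similarity(d1, d3)
```

A dissimilarity matrix for clustering could then use `1 - topic_similarity(a, b)` for each pair of documents.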


4. Optimizing Hierarchical Clustering : Motivation: current clustering algorithms often require the user to set parameters such as the number of clusters, a radius, or a density threshold. Users who lack the experience to choose these parameters will find it difficult to obtain a good clustering solution.


4. Optimizing Hierarchical Clustering : Solution: 1. Build a clustering tree with a hierarchical clustering algorithm. 2. Use a criterion function to recommend the best clustering solution on the tree to the user.


4. Optimizing Hierarchical Clustering : Solution: [Diagram: the two worst solutions on the clustering tree: all samples in one cluster (one cluster), and each sample as its own cluster (five clusters).]


4. Optimizing Hierarchical Clustering : Solution: combining intra-cluster distance with inter-cluster distance, we propose a criterion function. With it, the best clustering solution can be recommended to the user without any parameter setting.


4. Optimizing Hierarchical Clustering : The best clustering solution can be provided to the user by a criterion function, without parameter setting. [Diagram: clustering tree over samples A, B, C, D, E, built bottom up, with Level 1 through Level 5 marked; the level with the smallest DistanceSum is selected.]
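
The slides name the criterion DistanceSum but the transcript does not give its formula. The sketch below assumes one plausible form: for each cluster, the sum of member-to-centroid distances plus the distance from the cluster centroid to the global centroid. Under this assumption, "all samples in one cluster" and "each sample its own cluster" score about the same and act as the worst cases. The centroid-linkage merge rule, the function names, and the toy points are also assumptions.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def distance_sum(clusters, global_center):
    """Assumed criterion: within-cluster plus centroid-to-global distances."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(math.dist(p, center) for p in cluster)  # within cluster
        total += math.dist(center, global_center)            # between clusters
    return total


def best_level(points):
    """Build the tree bottom up; return (score, clusters) of the best level."""
    global_center = centroid(points)
    clusters = [[p] for p in points]
    best = (distance_sum(clusters, global_center), [list(c) for c in clusters])
    while len(clusters) > 1:
        # Merge the two clusters whose centroids are closest.
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: math.dist(centroid(clusters[ab[0]]),
                                                   centroid(clusters[ab[1]])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        score = distance_sum(clusters, global_center)
        if score < best[0]:
            best = (score, [list(c) for c in clusters])
    return best


# Hypothetical samples: a tight group of three and a tight group of two.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
best_score, best_clusters = best_level(points)
```

On these toy points the selected level has two clusters, with no cluster count supplied by the user.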


5. Experiments : To the best of our knowledge, topic oriented document clustering has not been well addressed in existing work. The experiments in this study therefore compare our approach with the unsupervised clustering approach.


5. Experiments : Dataset: web pages involving three people named ‘Li Ming’. Purpose: clustering the documents by person.


5. Experiments : Experiment 1: comparison with TFIDF on time performance.


5. Experiments : Experiment 1: comparison with TFIDF on dimensionality.


5. Experiments : Experiment 2: 1. Build dissimilarity matrices using the new approach and the traditional approach. 2. Run document clustering on each matrix. 3. Compare the clustering solutions using F-Measure.
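
The slides do not say which F-Measure variant was used. A common choice for clustering, assumed here, matches each true class to its best-scoring cluster and weights the result by class size. The function name and the label encoding are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure of a clustering."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    # overlap[(cls, clu)] = number of documents of class cls in cluster clu.
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            common = overlap[(cls, clu)]
            if common == 0:
                continue
            precision = common / clu_size
            recall = common / cls_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (cls_size / n) * best
    return total


# Hypothetical labels: three people, two documents each.
truth = ["a", "a", "b", "b", "c", "c"]
perfect = clustering_f_measure(truth, [0, 0, 1, 1, 2, 2])
one_cluster = clustering_f_measure(truth, [0, 0, 0, 0, 0, 0])
```

A perfect clustering scores 1; putting every document into a single cluster scores lower because precision drops for every class.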


5. Experiments : Experiment 2:


6. Conclusion : Experiments show that the new approach is feasible and effective. However, to further improve performance, some work remains, such as improving the accuracy of named entity recognition.


Thanks! : Any questions?