Topic Oriented Semi-supervised Document Clustering
Jiangtao Qiu, Changjie Tang
Computer School, Sichuan University
OUTLINE : 1. Introduction
2. Motivation
3. Topic Semantic Annotation
4. Optimizing Hierarchical Clustering
5. Experiments
6. Conclusion
1. INTRODUCTION : We are developing a text mining prototype system.
It aims to mine associated events, generate hypotheses, etc.
At present, we have completed content extraction from web pages, document classification, and document clustering.
1. INTRODUCTION : Prototype system pipeline:
Collecting data: web pages, text
Preprocess: remove noise, get feature vectors
Classification and clustering: derive the needed texts
Mining: mine associated events, etc.
Presenting
2. MOTIVATION : Traditional document clustering is usually considered unsupervised learning. General method:
(1) Extract feature vectors
(2) Compute the similarity among vectors
(3) Build the dissimilarity matrix
(4) Perform clustering
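The first three steps of the general method can be sketched as follows. This is a minimal illustration under stated assumptions: TF-IDF weighting and cosine similarity are common choices but are not prescribed at this point in the slides, and the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dissimilarity_matrix(vectors):
    """Pairwise dissimilarity, taken here as 1 - cosine similarity."""
    n = len(vectors)
    return [[1.0 - cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy corpus, already tokenized.
docs = [
    "yao ming rockets nba game".split(),
    "rockets nba game rebound".split(),
    "cancer medicine hospital doctor".split(),
]
matrix = dissimilarity_matrix(tfidf_vectors(docs))
```

The resulting matrix is what a clustering algorithm would take as input in step (4).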
2. Motivation : New challenge: can we group documents according to the user’s need?
3. Topic Semantic Annotation : We propose a new semi-supervised document clustering approach: topic oriented document clustering.
It can group documents according to the user’s need.
3. Topic Semantic Annotation : Several issues need to be addressed:
(1) How to represent the user’s need?
(2) How to represent the relationship between the need and documents?
(3) How to evaluate the similarity of documents according to the need?
3. Topic Semantic Annotation : 3.1 How to represent the user’s need? We propose a multiple-attribute topic structure to represent the user’s need.
A topic is a user’s focus, represented by a word. We use the concept set C of an ontology as the attribute set. The attributes of a topic are a collection of concepts
{p1, ..., pn} ⊆ C that describe the topic well.
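The topic structure can be pictured as a small record: one topic word plus a tuple of attributes drawn from the concept set C. The sketch below is illustrative only; the `Topic` class, its `is_valid` method, and the contents of `CONCEPTS` are hypothetical names and values, not part of the original system.

```python
from dataclasses import dataclass

# Hypothetical concept set C taken from an ontology.
CONCEPTS = frozenset({"background", "place", "named entity", "time"})


@dataclass(frozen=True)
class Topic:
    """A user's focus: a topic word described by attributes p1..pn."""
    word: str
    attributes: tuple

    def is_valid(self, concepts):
        """True if every attribute is a concept of the ontology ({p1..pn} is a subset of C)."""
        return set(self.attributes) <= set(concepts)


topic = Topic(word="Yao Ming", attributes=("background", "place", "named entity"))
```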
3. Topic Semantic Annotation : 3.1 How to represent the user’s need? For example:
We collect documents about Yao Ming.
There are several people named Yao Ming in the corpus.
We want to group the documents by the different people named Yao Ming.
We set ‘Yao Ming’ as the topic.
We choose background, place, and named entity as attributes.
3. Topic Semantic Annotation : 3.1 How to represent the user’s need? Reasons for choosing the three attributes:
1. Many words have a background. For example, ‘cancer’ has the background ‘medicine’. When words such as ‘coach’ and ‘stadium’ appear in a document,
it can be inferred that the people involved in the document are
related to ‘sport’.
We have modified the ontology by adding a background to the words in it.
3. Topic Semantic Annotation : 3.1 How to represent the user’s need? Reasons for choosing the three attributes:
2. Place can distinguish different people well: the places where people grew up and have lived may tell them apart.
3. Topic Semantic Annotation : 3.1 How to represent the user’s need? Reasons for choosing the three attributes:
3. Named entities may be used to describe the semantics of the topic. Names of people, institutions, and organizations that
do not occur in the dictionary are called named entities.
3. Topic Semantic Annotation : 3.2 How to represent the relationship between the need and documents? We represent the relationship between a topic and documents by
annotating the documents with topic semantics.
3. Topic Semantic Annotation : 3.2 How to represent the relationship between the need and documents? Given a topic T with attributes p1, ..., pn and a document S with words {t1, …, tn}:
(1) If a word ti can be mapped to an attribute pj through the ontology,
(2) and ti is semantically correlated with T (the distance between ti and T is not larger than a threshold),
(3) then insert ti into the vector Pj: Pj = {…, ti}.
When all words have been explored, we obtain the attribute vectors:
P1 = {…, ti}
…
Pn = {…, tm}
We call the above process topic-semantic annotation.
3. Topic Semantic Annotation : 3.2 How to represent the relationship between the need and documents? Example: Houston Rockets center Yao Ming grabs a rebound in front of
Detroit Pistons forward Rasheed Wallace and Rockets forward
Shane Battier during the first half of their NBA game in
Auburn Hills, Michigan.
Topic: Yao Ming
Attributes: p1=background, p2=place, p3=named entity
Feature vectors:
P1={<sport, 4>}
P2={<Houston, 1>, <Michigan, 1>, <Detroit, 1>}
P3={<Rasheed Wallace, 1>, <Shane Battier, 1>, <Auburn Hills, 1>}
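Topic-semantic annotation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy `ONTOLOGY` dictionary, the `annotate` function, and the `Yao_Ming` single-token convention are hypothetical, and the distance between a word ti and the topic T is assumed to be the token distance within the document, since the slides do not define it. Named entity recognition is left out.

```python
from collections import Counter

# Hypothetical toy ontology: lowercased word -> (attribute, value stored in the vector).
ONTOLOGY = {
    "rebound": ("background", "sport"),
    "coach": ("background", "sport"),
    "stadium": ("background", "sport"),
    "houston": ("place", "Houston"),
    "detroit": ("place", "Detroit"),
    "michigan": ("place", "Michigan"),
}


def annotate(tokens, topic, attributes, threshold):
    """Build one weighted vector per attribute from the words near the topic."""
    vectors = {p: Counter() for p in attributes}
    topic_positions = [i for i, tok in enumerate(tokens) if tok == topic]
    if not topic_positions:
        return vectors  # the topic does not occur in this document
    for i, tok in enumerate(tokens):
        entry = ONTOLOGY.get(tok.lower())
        if entry is None:
            continue  # the word cannot be mapped to any attribute
        attribute, value = entry
        if attribute not in vectors:
            continue
        # Assumption: distance = number of tokens to the nearest topic mention.
        distance = min(abs(i - pos) for pos in topic_positions)
        if distance <= threshold:
            vectors[attribute][value] += 1
    return vectors


tokens = ("Houston Rockets center Yao_Ming grabs a rebound in front of "
          "Detroit Pistons forward Rasheed Wallace").split()
vectors = annotate(tokens, "Yao_Ming", ("background", "place"), threshold=5)
```

With a larger threshold, words farther from the topic (here "Detroit") are also inserted into their attribute vector.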
3. Topic Semantic Annotation : 3.3 How to evaluate the similarity of documents according to the need? Each of the two documents d1 and d2 is represented by its attribute vectors V1={…}, …, Vn={…}, and the documents are compared through these vectors.
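One possible way to compare two annotated documents is sketched below. The slides do not preserve the actual similarity formula, so everything here is an assumption: the sketch takes the cosine similarity of each pair of attribute vectors and averages them with optional weights. The names `topic_similarity` and `weights` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_similarity(doc_a, doc_b, weights=None):
    """Weighted average of per-attribute cosine similarities (assumed form)."""
    attributes = sorted(set(doc_a) | set(doc_b))
    if weights is None:
        weights = {p: 1.0 for p in attributes}
    total = sum(weights[p] for p in attributes)
    if not total:
        return 0.0
    score = sum(weights[p] * cosine(doc_a.get(p, {}), doc_b.get(p, {})) for p in attributes)
    return score / total


# Hypothetical annotated documents: attribute -> {value: weight}.
d1 = {"background": {"sport": 4}, "place": {"Houston": 1, "Michigan": 1}}
d2 = {"background": {"sport": 2}, "place": {"Houston": 1}}
d3 = {"background": {"medicine": 3}, "place": {"Chengdu": 1}}
```

The dissimilarity used for clustering could then be taken as one minus this score.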
4. Optimizing Hierarchical Clustering : Motivation:
Current clustering algorithms often require the user to set parameters such as the number of clusters, a radius, or a density threshold. If users lack the experience to choose these parameters, it is difficult to produce a good clustering solution.
4. Optimizing Hierarchical Clustering : Solution:
1. Build a clustering tree using a hierarchical clustering algorithm.
2. Recommend the best clustering solution on the clustering tree to the user by using a criterion function.
4. Optimizing Hierarchical Clustering : Solution:
The two extremes of the clustering tree are the worst solutions: all samples in one cluster (one cluster), or each sample as its own cluster (five clusters).
4. Optimizing Hierarchical Clustering : Solution:
Combining intra-cluster distance with inter-cluster distance, we propose a criterion function.
With this criterion function, the best clustering solution can be provided to the user without any parameter setting.
4. Optimizing Hierarchical Clustering : The clustering tree over samples A, B, C, D, E is built bottom up and has five levels (Level 1 to Level 5).
The level with the smallest DistanceSum is recommended to the user as the best clustering solution, without parameter setting.
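The idea of picking a level of the clustering tree by a criterion can be sketched as follows. The slides do not give the formula for DistanceSum, so the form used here is an assumption: the sum of the distances from every sample to its cluster centroid, plus the sum of the distances from every cluster centroid to the global centroid. Under this assumed form, the two extremes (one cluster, or one cluster per sample) give the same value. Centroid linkage, the function names, and the five sample points are also hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def distance_sum(clusters, global_center):
    """Assumed criterion: intra-cluster distances plus centroid-to-global distances."""
    intra = sum(math.dist(p, centroid(c)) for c in clusters for p in c)
    inter = sum(math.dist(centroid(c), global_center) for c in clusters)
    return intra + inter


def clustering_tree(points):
    """Bottom-up agglomerative clustering; returns every level of the tree."""
    clusters = [[p] for p in points]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # Merge the two clusters whose centroids are closest.
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: math.dist(centroid(clusters[ab[0]]),
                                                   centroid(clusters[ab[1]])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append(list(clusters))
    return levels


def recommend(points):
    """Return the level of the tree with the smallest DistanceSum."""
    center = centroid(points)
    return min(clustering_tree(points), key=lambda level: distance_sum(level, center))


# Hypothetical samples A, B, C, D, E.
samples = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 10)]
best = recommend(samples)
```

No number of clusters, radius, or density threshold is passed in; the level is chosen by the criterion alone.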
5. Experiments : To the best of our knowledge, topic oriented document clustering has not been well addressed in existing work.
The experiments in this study therefore compare our approach with the unsupervised clustering approach.
5. Experiments : Dataset:
Web pages involving three people named ‘Li Ming’.
Purpose:
Cluster the documents by person.
5. Experiments : Experiment 1: comparison with TFIDF on time performance
5. Experiments : Experiment 1: comparison with TFIDF on dimensionality
5. Experiments : Experiment 2:
1. Use the new approach and the traditional approach to build dissimilarity matrices.
2. Perform document clustering on each matrix.
3. Compare the clustering solutions using F-Measure.
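The comparison in step 3 can be illustrated with a common definition of the clustering F-Measure: for each true class, take the best F-score over all clusters, then average these scores weighted by class size. The slides do not say which variant was used, so this definition is an assumption, and the function name and labels below are hypothetical.

```python
from collections import Counter


def f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-score of a clustering (assumed variant)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j  # share of the cluster that belongs to the class
            recall = n_ij / n_i     # share of the class captured by the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


# Hypothetical ground truth: which person each document is about.
people = ["a", "a", "b", "b", "c"]
perfect = f_measure(people, [0, 0, 1, 1, 2])
lumped = f_measure(people, [0, 0, 0, 0, 0])
```

A score of 1 means every person's documents form exactly one cluster; lumping all documents together scores lower.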
6. Conclusion : Experiments show that the new approach is feasible and effective.
However, to further improve performance, more work is needed, such as improving the accuracy of named entity recognition.
Thanks! : Any questions?