Web Document Clustering : Web Document Clustering 2006. 8. 24
한국해양대학교
공과대학 IT공학부
김재훈
Contents : Contents IR and Web Search
Problems on Web Search Results
Document Clustering?
Why Cluster Documents on Web Search Results?
Applications of Document Clustering
Clustering Approaches
More Applications
Discussions
Typical IR Task : Typical IR Task Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.
Find:
A ranked set of documents that are relevant to the query.
IR System : IR System
Relevance : Relevance a subjective judgment
may include:
Being on the proper subject.
Being timely (recent information).
Being authoritative (from a trusted source).
Satisfying the goals of the user and his/her intended use of the information (information need).
Keyword Search : Keyword Search Simplest notion of relevance is that
the query string appears verbatim in the document.
Slightly less strict notion is that
the words in the query appear frequently in the document, in any order (bag of words).
Intelligent IR : Intelligent IR Taking into account
the meaning of the words used.
Taking into account
the order of words in the query.
Taking into account
the authority of the source.
Adapting to the user based on
the direct or indirect feedback.
Web Search : Web Search Application of IR to
HTML documents on the World Wide Web.
Differences with typical IR:
Must assemble document corpus by spidering the web.
Can use the structural layout information in HTML (XML).
Can use the link structure of the web.
Documents change uncontrollably.
Web Search System : Web Search System
Web users’ information needs : Web users’ information needs classification based on the type of answers expected by the Web user
1. Question answering
type of answer : a very short answer (single sentence, part of the document)
query example “What is the population of Korea?”
2. Named page finding - Homepage finding
type of answer : a single document containing specific information
query example “Authors instructions for the KISS journal”
3. Topic relevance
type of answer: a range of documents relevant to the topic of their information need
query example “KOREA’s immigration policy”
4. Online services
type of answer: a range of documents that allow the user to access a particular service
query example “download mp3”
5. Topic distillation
type of answer: a range of documents that are relevant key resources on the topic (Relevance + Quality)
query example “highway safety”
Web search results : Web search results Big problem
A long list of search results, ranked by their relevancies to the given query
Engine : search engine, car part, Engine Corp.
A time consuming task when multiple sub-topics of the given query are mixed together.
Possible solution
To (online) cluster web search results
Evidence
Relationships between the results
Cluster Hypothesis (van Rijsbergen 1979):
“Closely related documents tend to be relevant to the same requests.”
Potential results
Aids user-engine interaction
Browsing
Help users express his need
An example - Vivisimo : An example - Vivisimo
What is clustering? : What is clustering? Partition unlabeled examples into disjoint subsets of clusters, such that:
Examples within a cluster are very similar
Examples in different clusters are very different the act of grouping similar objects into sets
unsupervised classification
no predefined classes
Typical applications
to get insight into data as a preprocessing step
What is clustering? (cont.) : What is clustering? (cont.)
What is document clustering? : What is document clustering? Document clustering :
To group similar documents into sets
To group similar or related documents into a common cluster.
Eg) Scatter/Gather
Why not just document clustering? : Why not just document clustering? Web search results clustering is a version of document clustering, but…
Billions of pages
Constantly changing
Data mainly unstructured and heterogeneous
Additional information to consider (i.e. links, click-through data, etc.)
Why cluster web documents? : Why cluster web documents? 1. For whole corpus analysis/navigation
Better user interface
2. For improving recall in search applications
Better search results
3. For better navigation of search results
Effective “user recall” will be higher
4. For speeding up vector space retrieval
Faster search
1. Corpus analysis/navigation : 1. Corpus analysis/navigation Standard IR Index Aardvark, 15 Blueberry, 200 Capricorn, 1, 45-55 Dog, 79-99 Egypt, 65 Falafel, 78-90 Giraffes, 45-59
… Table of Contents
1. Science of Cognition 1.a. Motivations
1.a.i. Intellectual Curiosity 1.a.ii. Practical Applications 1.b. History of Cognitive Psychology 2. The Neural Basis of Cognition
2.a. The Nervous System
2.b. Organization of the Brain
2.c. The Visual System
3. Perception and Attention
3.a. Sensory Memory
3.b. Attention and Sensory Information Processing Document clusters
Some time, TOC is very useful
1. Corpus analysis/navigation (cont.) : 1. Corpus analysis/navigation (cont.) Document clustering
Can induce a tree of topics
To allow user to browse through corpus to find information
Crucial need:
meaningful labels for topic nodes.
Yahoo!: manual hierarchy
Often not available for new document collection
For visualizing a document collection and its themes : For visualizing a document collection and its themes Wise et al, “Visualizing the non-visual” PNNL
ThemeScapes, Cartia
[Mountain height = cluster size]
2. Improvement of search recall : 2. Improvement of search recall Cluster hypothesis
Documents with similar text are related
Therefore, to improve search recall:
To cluster docs in corpus a priori (pre-clustering)
To return other docs in the cluster containing D when a query matches a doc D.
Hope if we do this: The query “car” will also return docs containing automobile
Because clustering grouped together docs containing car with those containing automobile.
3. Better navigation of search results : 3. Better navigation of search results For grouping search results thematically
clusty.com / Vivisimo
3. Better navigation of search results - Kartoo.com - : 3. Better navigation of search results - Kartoo.com -
3. Better navigation of search results - iboogie.com - : 3. Better navigation of search results - iboogie.com -
3. Better navigation of search results - mooter.com - : 3. Better navigation of search results - mooter.com -
3. Better navigation of search results : 3. Better navigation of search results One can also view grouping documents with the same sense of a word as clustering
Given the results of a search (say Jaguar, or NLP), partition into groups of related docs
Can be viewed as a form of word sense disambiguation
E.g., jaguar may have senses:
The car company
The animal
The football team
The video game
4. Speeding up vector space retrieval : 4. Speeding up vector space retrieval In vector space retrieval,
To find nearest doc vectors to query vector
To entail finding the similarity of the query to every doc – slow (for some applications)
By clustering docs in corpus a priori
To find nearest docs in cluster close to query
To be inexact but avoid exhaustive similarity computation
What Is A Good Clustering? : What Is A Good Clustering? Internal criterion:
A good clustering will produce high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is high
the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used
External criterion:
The quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes
Assessable with gold standard data
Main issues : Main issues Online or offline clustering?
What to use as input
Entire documents
Snippets
Structure information (links)
Other data (i.e. click-through)
Use stop word lists, stemming, etc.
How to define similarity?
Content (i.e. vector-space model)
Link analysis
Usage statistics
How to group similar documents?
How to label the groups?
Components Of Clustering System : Components Of Clustering System 1. Feature selection/extraction
the most effective subset from patterns
2. Feature representation
one or more transformations of the input features (attributes) to produce new salient features.
3. Pattern proximity (measure similarity)
The distance of pairs of patterns (features).
4. Clustering (Grouping)
Hard (a partition of the data into groups)
Fuzzy (each pattern has a variable degree)
5. Data abstraction
Extracting a compact representation of a data set
Automatic analysis, Human-oriented.
6. Evaluation
to measure the performance of competing algorithms
1. Feature Selection/Extraction : 1. Feature Selection/Extraction What is the representation of documents?
document = a set of features
What is the features?
Linguistic structure in documents
Co-occurrences of Terms : n-gram
Semantic structure : case structure
Meta-data of documents
Authors
Citation
Co-citation document : documents cites the examined documents
bibliographic coupling : documents are cited by the examined documents
1. Feature Selection/Extraction : 1. Feature Selection/Extraction Which terms to use as axes for vector space?
Better is to use highest weight mid-frequency words – the most discriminating terms
Pseudo-linguistic heuristics, e.g.,
drop stop-words
stemming/lemmatization
use only nouns/noun phrases
Good clustering should “figure out” some of these
2. Feature representation : 2. Feature representation Documents are transformed into vectors of weighted terms
Data matrix
(two modes)
How to compute xij
tf*idf
3. Pattern proximity (measure similarity) : 3. Pattern proximity (measure similarity) Dissimilarity matrix
(one mode)
How to compute dij
Dissimilarity - Distance
Similarity - inverse of distance
Proximity Measure : Proximity Measure Dissimilarity
The Euclidean distance:
The Manhattan distance:
Weighted distance:
Similarity
Inner Product
Cos Coefficient
4. Clustering Approach and/or Algorithm : 4. Clustering Approach and/or Algorithm (1) Using contents of documents
(2) Using user’s usage logs
(3) Using current search engines
(4) Using hyperlinks
(5) Other classical methods
(1) Using Contents of Documents : (1) Using Contents of Documents Based on snippets returned by web search engines.
Be as good as clusters created using the full text of Web documents.
Suffix Tree Clustering (STC) :
incremental, O(n) time algorithm
three logical steps:
(1) document “cleaning”
(2) identifying base clusters using a suffix tree
(3) combining these base clusters into clusters
(2) Using user’s usage logs : (2) Using user’s usage logs Relevancy information is objectively reflected by the usage logs
An experimental result on www.nasa.gov/
(3) Using current web search engines - Metacrawler - : (3) Using current web search engines - Metacrawler -
(4) Using hyperlinks : (4) Using hyperlinks Cluster web documents based on both the textual and hyperlink
The hyperlink structure is used as the dominant factor in the similarity metric
Kleinberg’s HITS (Hypertext Induced Topic Selection) algorithm
based purely on hyperlink information.
authority and hub documents for a user query.
only cover the most popular topics and leave out the less popular ones.
Other classical approaches : Other classical approaches
Hierarchical Clustering : Hierarchical Clustering Clusters are created in levels actually creating sets of clusters at each level.
Agglomerative
Initially each item in its own cluster
Iteratively clusters are merged together
Bottom Up
Divisive
Initially all items in one cluster
Large clusters are successively divided
Top Down
Hierarchical Clustering (Cent) : Hierarchical Clustering (Cent) Use distance matrix as clustering criteria.
This method does not require the number of clusters k as an input, but needs a termination condition. agglomerative divisive
Agglomerative Example : Agglomerative Example B A E C D 4 Threshold of 2 3 5 1 A B C D E
Distance Between Two Clusters : Distance Between Two Clusters Single-Link Method : Nearest Neighbor
Complete-Link : Furthest Neighbor
Average-link : all cross-cluster pairs.
Single-Link Method : Single-Link Method b a Distance Matrix Euclidean Distance (1) (2) (3) a,b,c c c d a,b d d a,b,c,d
Complete-Link Method : Complete-Link Method b a Distance Matrix Euclidean Distance (1) (2) (3) a,b c c d a,b d c,d a,b,c,d
Example of the Chaining Effect : Example of the Chaining Effect Single-link (10 clusters) Complete-link (2 clusters)
Effect of Bias towards Spherical Clusters : Effect of Bias towards Spherical Clusters Single-link (2 clusters) Complete-link (2 clusters)
Partitioning clustering : Partitioning clustering Partitioning method:
Construct a partition of a database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
K-means algorithm : K-means algorithm Initial center of cluster are randomly selected
Assign objects to cluster using distances between center and object
Re-compute the center of each cluster
Return step2 until stopping criteria is satisfied
K-means algorithm : K-means algorithm Example
What is the problem of k-Means? : What is the problem of k-Means? The k-means algorithm is sensitive to outliers !
Since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster.
5. Data Abstraction - Cluster labelling - : 5. Data Abstraction - Cluster labelling - How do we generate meaningful descriptions, or labels, for clusters?
it is an open issue
Possible solution
To list 5-10 most frequent features from the cluster
features: index terms, noun phrases, proper names, …
To show features that occur frequently in this cluster and not frequently in others
Differential labeling
To summarize multiple documents in the cluster
6. Cluster validation : 6. Cluster validation How can we tell if a clustering is good or not?
ask users whether they agree with the clusters
agreement with human clustering is problematic (Macskassy et al., 1998)
use statistical techniques to measure qualities of the cluster,
e.g. the purity, or the divergence from random clustering
For information retrieval
do we manage to cluster relevant documents together? This is the “holy grail” for cluster-based IR
can be evaluated through cluster-based retrieval
Approaches to evaluating : Approaches to evaluating Anecdotal
“I wrote this clustering algorithm and look what it found!”
No benchmarks, no comparison possible
User inspection
Experts evaluate the results and score them
Expensive / time-consuming
Ground “truth” comparison
To compare clustering results to a known taxonomy like Yahoo
The static prior taxonomy may be incomplete/wrong in places
Microeconomic / utility
Net economic gain produced by an approach (vs. another approach)
How do various clustering methods affect the quality of what’s retrieved?
Compare two IR algorithms
1. send query, present ranked results
2. send query, cluster results, present clusters
Purely quantitative measures
Probability of generating clusters found
Average distance between cluster members
Cluster Validity Assessment : Cluster Validity Assessment (Ci, Cj) : inter-cluster distance
(Ck) : intra-cluster distance D ( C K ) d ( C i , C j )
Inter-cluster distance : (Ci, Cj) : Inter-cluster distance : (Ci, Cj) Average linkage
Centroid linkage
Complete linkage
Single linkage
Average to centroids linkage
Intra-cluster distance: (Ck) : Intra-cluster distance: (Ck) Complete diameter
Average diameter
Centroid diameter
DB(Davies-Bouldin) Index : DB(Davies-Bouldin) Index K : the number of clusters
Sij : Distance between the center zi of cluster i Ci & the center zj of cluster j Cj
Si : Scatter between Ci and Cj
Davies-Bouldin (DB) Index (2) : Davies-Bouldin (DB) Index (2) The objective:
To minimize the index the number of clusters
optimal number of clusters, K
The clusters with minimized DB index
Small values of DB
correspond to good clusters
Dunn’s Index : Dunn’s Index The objective
To identify sets of clusters that are compact and well separated
to maximize the inter-cluster distances and minimize the intra-cluster distances
Large values of VD
correspond to good clusters
Other Cluster Validity Measures : Other Cluster Validity Measures https://www.cs.tcd.ie/Nadia.Bolshakova/validation_algorithms.html
Calinski Harabasz (CH) Index
Index I
Xie and Beni
F-Measure
-Measure
Silhouette
Symmetry
Graph-Based Boundary Analysis
Partition Coefficient
Separation Index
Classification Entropy (CE)
Etc.
Other clustering applications : Other clustering applications Relevance feedback (Iwayama, 2000)
Search engine query logs
cluster queries on the basis of the documents that users selected from these queries (Beeferman & Berger, 2000)
Recommender systems
cluster users based on preferences, purchases, etc. (Sarwar et al., 2002)
Topic detection and tracking (TDT) & Novelty Detection
cluster news stories together and try to find emerging new topics (Allan, 2001)
cluster previously seen information, and measure novelty of incoming information by its distance to that already seen (TREC Novelty track, 2002, 2003)
Discussion : Discussion In general, clustering presents problems and challenges:
Selection of attributes for clustering
Selection of clustering method
Generation of cluster representations
Validity/quality of the generated clustering
Updating clustering structures
e.g. inserting or deleting a document from a hierarchy
Effectiveness gains are not always evident
Computational expenses
Looking back : Looking back Document clustering has been applied to IR for over 30 years
the research focus has shifted over the years
efficiency (70s), effectiveness (80s), other applications (e.g. browsing, visualisation (90s)), dynamic clustering (00s), …?
But, there are still many unresolved issues:
Cluster representation
Cluster-based search strategies
Dynamic clustering (per-query basis)
Algorithmic aspects
Document representation for clustering (e.g. reduce noise by using only the most important part of a document)
New applications for clustering in IR
As a conclusion : As a conclusion Clustering is a useful tool for IR
It is theoretically solid and intuitively plausible
Although it has been used for over 30 years there are still open issues
It is still an interesting area to research
Some references : Some references If you only read one article/reference:
Tombros, A., PhD Thesis, 2002. Chapter 3 (optionally 4 & 5) available at: http://www.dcs.qmul.ac.uk/~tassos/publications.html
Willett, P., Recent trends in hierarchic document clustering: A critical review. Information Processing & Management, 24(5):577-597, 1988.
More than worth to have a look at:
van Rijsbergen, C.J., Information Retrieval. London: Butterworths, 2nd Edition, 1979; available at http://www.dcs.gla.ac.uk/~keith/Preface.html
Also recommended :
Hearst, M.A. and Pedersen, J.O. Re-examining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of ACM SIGIR ‘96, pages 76-84, 1996.
Some references : Some references Zamir, O. and Etzioni, O. Web document clustering: A feasibility demonstration. In Proceedings of ACM SIGIR ‘09, pages 46-54, 1998.
Iwayama, M. Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In Proceedings of ACM SIGIR ’00, pages 10-16, 2000.
Beeferman, D. and Berger, A. Agglomerative clustering of a search engine log. In Proceedings of the 6th International Conference on Knowledge Discovery in Data, pages 407-416, 2000.
B.M. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Recommender Systems for Large-Scale E-Commerce: Scalable Neighborhood Formation Using Clustering. Proceedings of the 5th International Conference on Computer and Information Technology, 2002.
J. Allan, J. Carbonell, G. Doddington, J. Yamron and Y. Yang. Topic detection and tracking pilot study. In Topic Detection and Tracking Workshop Report, 2001.
Some references : Some references Macskassy, S.A., Banerjee, A., Davidson, B.D., Hirsh, H. Human performance on clustering web pages: a preliminary study. In Proceedings of The 4th Knowledge Discovery and Data Mining Conference (KDD-98), pp. 264-268, 1998.
Goldszmidt, M. and Sahami, M. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS-98-044, SRI International, 1998. Available at http://dbpubs.stanford.edu:8090/pub/1998-60
Bradley, P., Fayyad, U., Reina, C. Scaling EM (expectation maximization) algorithm to large databases, Microsoft Research Technical Report, MSR-TR-98, 1998.
Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W. Scatter/Gather: A cluster based approach to browsing large document collections. In Proceedings of ACM SIGIR ‘92, pages 126-135, 1992.
Acknowledgement : Acknowledgement C. Manning and P. Raghavan Why cluster documents http://www.stanford.edu/class/cs276a/handouts/lecture13.ppt
I. Corna, Cluster Validation, http://www.cosc.brocku.ca/Offerings/4V80/similarity.ppt
T. Tombros, Clustering for IR, http://qmir.dcs.qmul.ac.uk/teaching/2004/ir2/week02/lecture/ClusteringForIR.ppt
External Evaluation of Cluster Quality : External Evaluation of Cluster Quality Assesses clustering with respect to ground truth
Assume that there are C gold standard classes, while our clustering algorithms produce k clusters, π1, π2, …, πk with ni members.
Simple measure: purity, the ratio between the dominant class in the cluster πi and the size of cluster πi
Others are entropy of classes in clusters (or mutual information between classes and clusters)
Purity :
Cluster I Cluster II Cluster III Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5 Purity