Automatic Indexing &Text Categorization: Automatic Indexing andamp; Text Categorization Dr. Miguel E. Ruiz
School of Informatics
Department of Library and Information Studies
Special presentation for
LIS514 Indexing and Surrogation
Dr. June Abbas
Text Processing: Text Processing Document Text +
ect. stopwords Noun
groups stemming Automatic
Recognition Text Full text Structure Index term Vector Space Model (Logical representation): Vector Space Model (Logical representation) Documents and queries are expressed using a vector whose components are all the possible index terms(t). Each index term has an associated weight that indicates the importance of the index term in the document (or query). Vector Space Model (Logical representation): Vector Space Model (Logical representation) In other words, the document dj and the query q are represented as t-dimensional vectors. dj q Documents in Vector Space: Documents in Vector Space t1 t2 t3 D1 D2 D10 D3 D9 D4 D7 D8 D5 D11 D6 For example, in a three dimensional space documents and queries would look like this. Q Document Vectors : Document Vectors nova galaxy heat Hollywood film role diet fur
1.0 0.5 0.3
1.0 0.8 0.7
0.9 1.0 0.5
0.5 0.7 0.9
0.6 1.0 0.3 0.2 0.8
0.7 0.5 0.1 0.3
D9 Document ids Terms Vector Space Model: Vector Space Model The vector space model proposes to evaluate the degree of similarity of document dj with regard to the query q as the correlation between the two vectors dj and q. Vector Space Model: Vector Space Model This correlation can be quantified in different ways, for example by the cosine of the angle between these two vectors.
Vector Space Model: Computing Similarity Scores: Vector Space Model: Computing Similarity Scores 1.0 0.8 0.6 0.8 0.4 0.6 0.4 1.0 0.2 0.2 Vector Space Model: Vector Space Model Since wi,j 0 and Wi,q 0 , sim(dj,q) varies between 0 to +1. The vector space model assumes that the similarity value is an indication of the relevance of the document to the given query.
Thus vector space model ranks the retrieved documents by their similarity with the query. Vector Space Model: Vector Space Model How can we compute the values of the weights wi,j ?
One of the most popular methods is based on combining two factors:
The importance of each index term in the document
The importance of the index term in the collection Vector Space Model: Vector Space Model Importance of the index term in the document:
This can be measured by the number of times that the term appears in the document. This is called the term frequency which is denoted by the symbol tf. Vector Space Model: Vector Space Model The importance of the index term in the collection:
An index term that appears in every document in the collection is not very useful, but a term that occurs in only a few documents may indicate that these few documents could be relevant to a query that uses this term.
This factor is usually called the inverse document frequency or the idf factor. Vector Space Model: Vector Space Model Mathematically The inverse document frequency (idf) can be expressed as:
N= number of documents in the collection
ni = number of document that contain the term i
Vector Space Model: Vector Space Model Combining these two factors we can obtain the weight of an index term i as:
Also called the tf-idf weighting scheme Text Processing: Text Processing Index term selection ca be done manually or automatically
A method for automatically selecting index terms was proposed by Luhn (in 1958) based on the use of Zipf’s Law Text Processing: Text Processing Zipf’s Law:
Let f be the frequency of occurrence of various words in a given position of the text and r be their rank order, that is, the order of their frequency of occurrence. The product of f r is approximately a constant. Text Processing: Text Processing Luhn proposed that if we graph f versus r we can define an upper and lower cut-off. Words that fall within these two cut-off points are likely to be good predictors of the contents of a document. Zipf’s Law applied to indexing: Zipf’s Law applied to indexing Words by rank order Frequency of words Text Processing: Text Processing Thesaurus index terms
In its simple form a thesaurus consist of a precompiled list of words and for each word a set of synonyms.
We can assign terms from a thesaurus (either manually or automatically) to the set of index terms of the document. Automatic Text Categorization: Automatic Text Categorization Categorization in Linguistic: Categorization in Linguistic Rosch’s Hypothesis:
Each category has its own structure. Some members are better examples of the category.
Any category has a member called prototype. Overview of Automatic Text Categorization: Overview of Automatic Text Categorization Text categorization is the process of algorithmically analyzing an electronic document to assign a set of categories (or index terms) from a predefined vocabulary to succinctly describe the content of a document. Machine Learning and Text Categorization: Machine Learning and Text Categorization Machine learning is an area of artificial intelligence that concentrates on the study of algorithms that 'learn' how to perform specific tasks.
Supervised learning: Learning from a set of previously categorized examples. (Pattern Matching)
Unsupervised learning: finding useful relations between members of the target set. (Clustering) Machine Learning and Text Categorization: Machine Learning and Text Categorization Machine learning methods used for Text categorization:
Support vector machines Slide26: Training Process Feature
epoch Errorandlt; e Subset
Selection no yes Threshold
optimization Trained classifier Slide27: Text Categorization Process Document Build
Vector Thresholding Assigned
Categories Applications of Text Categorization: Applications of Text Categorization Document indexing
Automatic generation of metadata
Classification of patents
Word sense disambiguation
Essay grading. Feature Selection: Feature Selection Feature selection consists in choosing the words or phrases are are better predictors of a given category.
Mutual Information Correlation Coefficient: Correlation Coefficient Hierarchical Mixture of Experts: Hierarchical Mixture of Experts Based on 'divide-and-conquer' principle.
Jordan and Jacobs (1993) Expert 1 Expert 2 Expert 3 Expert 4 Gate 2.1 Gate 2.2 Gate 1 x x x x x x x y1.1 y1.2 y2.1 y2.2 y2 y y1 Training set selection:Category Zone: Training set selection:Category Zone - + + + + + - - - - - - - - - - - - - Slide33: The category zone is defined as the set of positive examples plus significant negative examples. It is inspired by the query zone concept proposed by Singhal, Mitra, andamp; Buckley (1997). Knn-based Category Zones: Knn-based Category Zones For each positive example SMART System
(Retrieval) Top k retrieved
documents Add example
and top K
retrieved docs Category zone Slide35: Experimental Collection MeSH Categories (~23,00 concepts)
we selected the Heart Diseases sub tree (119 concepts)
OHSUMED collection (233,445 MEDLINE records) from 1987 - 1991
Training set 183,229 records. (1987-90)
Test set 50,216 records (1991)
Sets of Categories: Sets of Categories Set of categories in the 'Heart Diseases' subtree (HD-119): a total of 103 categories
High frequency categories (HD-49) with at least 75 examples in the collection
Medium frequency categories (HD-28) between 15 and 74 examples
Low frequency categories (HD-26) between 1 and 14 examples.
Slide37: Performance Measure Recall= a/(a+c)
F1=(2*P*R)/(P+R) Categories assigned by human indexers are
Taken as gold standard Research Questions: Research Questions Does a hierarchical classifier built on the HME model improve performance when compared to a flat classifier?
How does our hierarchical method compare with other text categorization approaches? Baselines: Baselines Rocchio Classifier: is the vector of the difference between the centroid of positive examples, and the centroid of negative examples.
Baselines: Baselines Flat neural network classifier:
A classifier obtained by combining the output
of the expert networks. Expert 1 Expert 2 Expert 49 . . . Final Hierarchical Classifier: Final Hierarchical Classifier Comparison between Classifiers: Comparison between Classifiers Challenges of text categorization using MeSH: Challenges of text categorization using MeSH Hierarchy of the indexing vocabulary andamp; indexing rules
Indexing rules (i.e. assign the most specific category)
Use of Qualifiers (or sub-headings)
Multi-hierarchy (a category can appear in multiple places in the hierarchy)
i.e. Acquired immunodeficiency Syndrome appears as a virus infection and as a sexually transmitted disease
No specific indication of which sense is being assigned
Limited number of categories are assigned manually (only the 'most important' are assigned) Examples of Operational Automatic Indexing Systems: Examples of Operational Automatic Indexing Systems NLM Indexing Project: NLM Indexing Project Medical Text Indexer (MTI)
Project that explores the automatic indexing tools for current indexing practices in the National Library of Medicine.
Fully Automatic or semi-automatic indexing tool.
Overview of MTI: Overview of MTI NLM Indexing Project: NLM Indexing Project The initial version of the system generated too many spurious indexing terms.
Strict (Terms recommended by MetaMap and Pubmed Related Citations)
Medium (discard terns that are too general)
Base filtering (rules used to add, boost substitute and remove terms)
Examples: http://ii.nlm.nih.gov/Demo/II_demo.html Systems that use Automatic Text Categorization: Systems that use Automatic Text Categorization A good example of commercial software that uses text categorization is:
Inxight smart discovery Taxonomy management and categorization tool http://www.inxight.com/products/smartdiscovery/tc/