Automatic Indexing

Views:

Category: Entertainment

Presentation Description

No description available.

Presentation Transcript

Automatic Indexing &Text Categorization:

Automatic Indexing andamp; Text Categorization Dr. Miguel E. Ruiz School of Informatics Department of Library and Information Studies Special presentation for LIS514 Indexing and Surrogation Dr. June Abbas

Text Processing:

Text Processing Document Text + Structure Accents, Spacing, ect. stopwords Noun groups stemming Automatic or manual indexing. Structure Recognition Text Full text Structure Index term

Vector Space Model (Logical representation):

Vector Space Model (Logical representation) Documents and queries are expressed using a vector whose components are all the possible index terms(t). Each index term has an associated weight that indicates the importance of the index term in the document (or query).

Vector Space Model (Logical representation):

Vector Space Model (Logical representation) In other words, the document dj and the query q are represented as t-dimensional vectors.  dj q

Documents in Vector Space:

Documents in Vector Space t1 t2 t3 D1 D2 D10 D3 D9 D4 D7 D8 D5 D11 D6 For example, in a three dimensional space documents and queries would look like this. Q

Document Vectors :

Document Vectors nova galaxy heat Hollywood film role diet fur 1.0 0.5 0.3 0.5 1.0 1.0 0.8 0.7 0.9 1.0 0.5 1.0 1.0 0.9 1.0 0.5 0.7 0.9 0.6 1.0 0.3 0.2 0.8 0.7 0.5 0.1 0.3 D1 D2 D3 D4 D5 D6 D7 D8 D9 Document ids Terms

Vector Space Model:

Vector Space Model The vector space model proposes to evaluate the degree of similarity of document dj with regard to the query q as the correlation between the two vectors dj and q.

Vector Space Model:

Vector Space Model This correlation can be quantified in different ways, for example by the cosine of the angle between these two vectors.

Vector Space Model: Computing Similarity Scores:

Vector Space Model: Computing Similarity Scores 1.0 0.8 0.6 0.8 0.4 0.6 0.4 1.0 0.2 0.2

Vector Space Model:

Vector Space Model Since wi,j  0 and Wi,q  0 , sim(dj,q) varies between 0 to +1. The vector space model assumes that the similarity value is an indication of the relevance of the document to the given query. Thus vector space model ranks the retrieved documents by their similarity with the query.

Vector Space Model:

Vector Space Model How can we compute the values of the weights wi,j ? One of the most popular methods is based on combining two factors: The importance of each index term in the document The importance of the index term in the collection

Vector Space Model:

Vector Space Model Importance of the index term in the document: This can be measured by the number of times that the term appears in the document. This is called the term frequency which is denoted by the symbol tf.

Vector Space Model:

Vector Space Model The importance of the index term in the collection: An index term that appears in every document in the collection is not very useful, but a term that occurs in only a few documents may indicate that these few documents could be relevant to a query that uses this term. This factor is usually called the inverse document frequency or the idf factor.

Vector Space Model:

Vector Space Model Mathematically The inverse document frequency (idf) can be expressed as: Where: N= number of documents in the collection ni = number of document that contain the term i

Vector Space Model:

Vector Space Model Combining these two factors we can obtain the weight of an index term i as: Also called the tf-idf weighting scheme

Text Processing:

Text Processing Index term selection ca be done manually or automatically A method for automatically selecting index terms was proposed by Luhn (in 1958) based on the use of Zipf’s Law

Text Processing:

Text Processing Zipf’s Law: Let f be the frequency of occurrence of various words in a given position of the text and r be their rank order, that is, the order of their frequency of occurrence. The product of f  r is approximately a constant.

Text Processing:

Text Processing Luhn proposed that if we graph f versus r we can define an upper and lower cut-off. Words that fall within these two cut-off points are likely to be good predictors of the contents of a document.

Zipf’s Law applied to indexing:

Zipf’s Law applied to indexing Words by rank order Frequency of words

Text Processing:

Text Processing Thesaurus index terms In its simple form a thesaurus consist of a precompiled list of words and for each word a set of synonyms. We can assign terms from a thesaurus (either manually or automatically) to the set of index terms of the document.

Automatic Text Categorization:

Automatic Text Categorization

Categorization in Linguistic:

Categorization in Linguistic Rosch’s Hypothesis: Each category has its own structure. Some members are better examples of the category. Any category has a member called prototype.

Overview of Automatic Text Categorization:

Overview of Automatic Text Categorization Text categorization is the process of algorithmically analyzing an electronic document to assign a set of categories (or index terms) from a predefined vocabulary to succinctly describe the content of a document.

Machine Learning and Text Categorization:

Machine Learning and Text Categorization Machine learning is an area of artificial intelligence that concentrates on the study of algorithms that 'learn' how to perform specific tasks. Supervised learning: Learning from a set of previously categorized examples. (Pattern Matching) Unsupervised learning: finding useful relations between members of the target set. (Clustering)

Machine Learning and Text Categorization:

Machine Learning and Text Categorization Machine learning methods used for Text categorization: Linear classifiers Neural networks Decision trees Inductive learning Support vector machines

Slide26:

Training Process Feature Selection Training examples Training epoch Errorandlt; e Subset Selection no yes Threshold optimization Trained classifier

Slide27:

Text Categorization Process Document Build Document Vector Thresholding Assigned Categories

Applications of Text Categorization:

Applications of Text Categorization Document indexing Automatic generation of metadata Classification of patents Document filtering Adaptive filtering Spam filtering Word sense disambiguation Essay grading.

Feature Selection:

Feature Selection Feature selection consists in choosing the words or phrases are are better predictors of a given category. Examples Correlation Coefficient Mutual Information

Correlation Coefficient:

Correlation Coefficient

Hierarchical Mixture of Experts:

Hierarchical Mixture of Experts Based on 'divide-and-conquer' principle. Jordan and Jacobs (1993)    Expert 1 Expert 2 Expert 3 Expert 4 Gate 2.1 Gate 2.2 Gate 1 x x x x x x x y1.1 y1.2 y2.1 y2.2 y2 y y1

Training set selection:Category Zone:

Training set selection:Category Zone - + + + + + - - - - - - - - - - - - -

Slide33:

The category zone is defined as the set of positive examples plus significant negative examples. It is inspired by the query zone concept proposed by Singhal, Mitra, andamp; Buckley (1997).

Knn-based Category Zones:

Knn-based Category Zones For each positive example SMART System (Retrieval) Top k retrieved documents Add example and top K retrieved docs Category zone

Slide35:

Experimental Collection MeSH Categories (~23,00 concepts) we selected the Heart Diseases sub tree (119 concepts) OHSUMED collection (233,445 MEDLINE records) from 1987 - 1991 Training set 183,229 records. (1987-90) Test set 50,216 records (1991)

Sets of Categories:

Sets of Categories Set of categories in the 'Heart Diseases' subtree (HD-119): a total of 103 categories High frequency categories (HD-49) with at least 75 examples in the collection Medium frequency categories (HD-28) between 15 and 74 examples Low frequency categories (HD-26) between 1 and 14 examples.

Slide37:

Performance Measure Recall= a/(a+c) Precision= a/(a+b) F1=(2*P*R)/(P+R) Categories assigned by human indexers are Taken as gold standard

Research Questions:

Research Questions Does a hierarchical classifier built on the HME model improve performance when compared to a flat classifier? How does our hierarchical method compare with other text categorization approaches?

Baselines:

Baselines Rocchio Classifier: is the vector of the difference between the centroid of positive examples, and the centroid of negative examples.

Baselines:

Baselines Flat neural network classifier: A classifier obtained by combining the output of the expert networks. Expert 1 Expert 2 Expert 49 . . .

Final Hierarchical Classifier:

Final Hierarchical Classifier

Comparison between Classifiers:

Comparison between Classifiers

Challenges of text categorization using MeSH:

Challenges of text categorization using MeSH Hierarchy of the indexing vocabulary andamp; indexing rules Indexing rules (i.e. assign the most specific category) Use of Qualifiers (or sub-headings) Multi-hierarchy (a category can appear in multiple places in the hierarchy) i.e. Acquired immunodeficiency Syndrome appears as a virus infection and as a sexually transmitted disease No specific indication of which sense is being assigned Limited number of categories are assigned manually (only the 'most important' are assigned)

Examples of Operational Automatic Indexing Systems:

Examples of Operational Automatic Indexing Systems

NLM Indexing Project:

NLM Indexing Project Medical Text Indexer (MTI) Project that explores the automatic indexing tools for current indexing practices in the National Library of Medicine. Fully Automatic or semi-automatic indexing tool.

Overview of MTI

NLM Indexing Project:

NLM Indexing Project The initial version of the system generated too many spurious indexing terms. MTI Filtering Strict (Terms recommended by MetaMap and Pubmed Related Citations) Medium (discard terns that are too general) Base filtering (rules used to add, boost substitute and remove terms) Examples: http://ii.nlm.nih.gov/Demo/II_demo.html

Systems that use Automatic Text Categorization:

Systems that use Automatic Text Categorization A good example of commercial software that uses text categorization is: Inxight smart discovery Taxonomy management and categorization tool http://www.inxight.com/products/smartdiscovery/tc/