09

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations : 

Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations Alexander Gelbukh www.Gelbukh.com

Previous chapter: Conclusions: 

Previous chapter: Conclusions Modeling of text helps predict behavior of systems Zipf law, Heaps’ law Describing formally the structure of documents allows to treat a part of their meaning automatically, e.g., search Languages to describe document syntax SGML, too expensive HTML, too simple XML, good combination

Text operations: 

Text operations Linguistic operations Document clustering Compression Encription (not discussed here)

Linguistic operations: 

Linguistic operations Purpose: Convert words to “meanings” Synonyms or related words Different words, same meaning. Morphology Foot / feet, woman / female Homonyms Same words, different meanings. Word senses River bank / financial bank Stopwords Word, no meaning. Functional words The

For good or for bad?: 

For good or for bad? More exact matching Less noise, better recall Unexpected behavior Difficult for users to grasp Harms if introduces errors More expensive Adds a whole new technology Maintenance; language dependents Slows down Good if done well, harmful if done badly

Document preprocessing: 

Document preprocessing Lexical analysis (punctuation, case) Simple but must be careful Stopwords. Reduces index size and pocessing time Stemming: connected, connection, connections, ... Multiword expressions: hot dog, B-52 Here, all the power of linguistic analysis can be used Selection of index terms Often nouns; noun groups: computer science Construction of thesaurus synonymy: network of related concepts (words or phrases)

Stemming: 

Stemming Methods Linguistic analysis: complex, expensive maintenance Table lookup: simple, but needs data Statistical (Avetisyan): no data, but imprecise Suffix removal Suffix removal Porter algorithm. Martin Porter. Ready code on his website Substitution rules: sses  s, s   stresses  stress.

Better stemming: 

Better stemming The whole problematics of computational linguistics POS disambiguation well  adverb or noun? Oil well. Statistical methods. Brill tagger Syntactic analysis. Syntactic disambiguation Word sense disambiguatiuon bank1 and bank2 should be different stems Statistical methods Dictionary-based methods. Lesk algorithm Semantic analysis

Thesaurus: 

Thesaurus Terms (controlled vocabulary) and relationships Terms used for indexing represent a concept. One word or a phrase. Usually nouns sense. Definition or notes to distinguish senses: key (door). Relationships Paradigmatic: Synonymy, hierarchical (is-a, part), non-hierarchical Syntagmatic: collocations, co-occurrences WordNet. EuroWordNet synsets

Use of thesurus: 

Use of thesurus To help the user to formulate the query Navigation in the hierarchy of words Yahoo! For the program, to collate related terms woman  female fuzzy comparison: woman  0.8 * female. Path length

Yahoo! vs. thesaurus: 

Yahoo! vs. thesaurus The book says Yahoo! is based on a thesaurus. I disagree Tesaurus: words of language organized in hierarchy Document hierarchy: documents attached to hierarchy This is word sense disambiguation I claim that Yahoo! is based on (manual) WSD Also uses thesaurus for navigation

Text operations: 

Text operations Linguistic operations Document clustering Compression Encription (not discussed here)

Document clustering: 

Document clustering Operation on the whole collection Global vs. local Global: whole collection At compile time, one-time operation Local Cluster the results of a specific query At runtime, with each query Is more a query transformation operation Already discussed in Chapter 5

Text operations: 

Text operations Linguistic operations Document clustering Compression Encription (not discussed here)

Compression: 

Compression Gain: storage, transmission, search Lost: time on compressing/decompressing In IR: need for random access. Blocks do not work Also: pattern matching on compressed text

Compression methods: 

Compression methods Statistical Huffman: fixed size per symbol. More frequent symbols shorter Allows starting decompression from any symbol Arithmetic: dynamic coding Need to decompress from the beginning Not for IR Dictionary Pointers to previous occurrences. Lampel-Ziv Again not for IR

Compression ratio: 

Compression ratio Size compressed / size decompressed Huffman, units = words: up to 2 bits per char Close to the limit = entropy. Only for large texts! Other methods: similar ratio, but no random access Shannon: optimal length for symbol with probability p is - log2 p Entropy: Limit of compression Average length with optimal coding Property of model

Modeling: 

Modeling Find probability for the next symbol Adaptive, static, semi-static Adaptive: good compression, but need to start from beginning Static (for language): poor compression, random access Semi-static (for specific text; two-pass): both OK Word-based vs. character-based Word-based: better compression and search

Huffman coding: 

Huffman coding Each symbol is encoded, sequentially More frequent symbols have shorter codes No code is a prefix of another one How to build the tree: book Byte codes are better Allow for sequential search

Dictionary-based methods: 

Dictionary-based methods Static (simple, poor compression), dynamic, semi-static. Lempel-Ziv: references to previous occurrence Adaptive Disadvantages for IR Need to decode from the very beginning New statistical methods perform better

Comparison of methods: 

Comparison of methods

Compression of inverted files: 

Compression of inverted files Inverted file: words + lists of docs where they occur Lists of docs are ordered. Can be compressed Seen as lists of gaps. Short gaps occur more frequently Statistical compression Our work: order the docs for better compression We code runs of docs Minimize the number of runs Distance: # of different words TSP.

Research topics: 

Research topics All computational linguistics Improved POS tagging Improved WSD Uses of thesaurus for user navigation for collating similar terms Better compression methods Searchable compression Random access

Conclusions: 

Conclusions Text transformation: meaning instead of strings Lexical analysis Stopwords Stemming POS, WSD, syntax, semantics Ontologies to collate similar stems Text compression Searchable Random access Word-based statistical methods (Huffman) Index compression

Slide25: 

Thank you! Till compensation lecture