logging in or signing up File_organization1 aehla Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 31 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: February 01, 2009 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript File organization : File organization File Organizations (Indexes) : File Organizations (Indexes) Choices for accessing data during query evaluation Scan the entire collection Typical in early (batch) retrieval systems Computational and I/O costs are O(characters in collection) Practical for only “small” text collections Large memory systems make scanning feasible Use indexes for direct access Evaluation time O(query term occurrences in collection) Practical for “large” collections Many opportunities for optimization Hybrids: Use small index, then scan a subset of the collection Indexes : Indexes What should the index contain? Database systems index primary and secondarykeys This is the hybrid approach Index provides fast access to a subset of database records Scan subset to find solution set IR Problem: Cannot predict keys that people will use in queries Every word in a document is a potential search term IR Solution: Index by all keys (words) ?full text indexes Indexes : Indexes Index is accessed by the atoms of a query language, The atoms are called “features” or “keys” or “terms” Most common feature types: – Words in text, punctuation – Manually assigned terms (controlled and uncontrolled vocabulary) – Document structure (sentence and paragraph boundaries) – Inter- or intra-document links (e.g., citations) Composed features – Feature sequences (phrases, names, dates, monetary amounts) – Feature sets (e.g., synonym classes) Indexing and retrieval models drive choices – Must be able to construct all components of those models Indexing choices (there is no “right” answer) Tokenization,Case folding (e.g., New vs new, Apple vs apple), Stopwords (e.g., the, a, its), Morphology (e.g., computer, computers, computing, computed) Index granularity has a large impact on speed and effectiveness Index Contents : Index Contents The contents depend upon the retrieval model Feature presence/absence Boolean Statistical (tf, df, ctf, doclen, maxtf) Often about 10% the size of the raw data, compressed Positional Feature location within document Granularities include word, sentence, paragraph, etc Coarse granularities are less precise, but take less space Word-level granularity about 20-30% the size of the raw data,compressed Indexes: Implementation : Indexes: Implementation Common implementations of indexes Bitmaps Signature files Inverted files Common index components Dictionary (lexicon) Postings document ids word positions No positional data indexed Indexes: Bitmaps : Indexes: Bitmaps Bag-of-words index only For each term, allocate vector with one bit per document If feature present in document n, set nth bit to 1, otherwise 0 Boolean operations very fast Space efficient for common terms (why?) Space inefficient for rare terms (why?) Good compression with run-length encoding (why?) Not widely used Indexes: Signature Files : Indexes: Signature Files Bag-of-words only For each term, allocate fixed size s-bit vector (signature) Define hash function: Multiple functions: word ? 1..s [selects which bits to set] Each term has an s-bit signature may not be unique! OR the term signatures to form document signature Long documents are a problem (why?) Usually segment them into smaller pieces / blocks Signature File Example : Signature File Example Signature File Example : Signature File Example Indexes: Signature Files : Indexes: Signature Files At query time: Lookup signature for query (how?) If all corresponding 1-bits are “on” in document signature, document probably contains that term Vary s to control P (false positive) Note space tradeoff Optimal s changes as collection grows (why?) Widely studied but Not widely used Indexes: Inverted Lists : Indexes: Inverted Lists Inverted lists are currently the most common indexing technique Source file: collection, organized by document Inverted file: collection organized by term one record per term, listing locations where term occurs During evaluation, traverse lists for each query term OR: the union of component lists AND: an intersection of component lists Proximity: an intersection of component lists SUM: the union of component lists; each entry has a score Inverted Files : Inverted Files Inverted Files : Inverted Files Word-Level Inverted File : Word-Level Inverted File How big is the index? : How big is the index? For an n word collection: Lexicon Heaps’ Law: V = O(nß), 0.4 < ß< 0.6 TREC-2: 1 GB text, 5 MB lexicon Postings at most, one per occurrence of the word in the text: O(n) Inverted Search Algorithm : Inverted Search Algorithm Find query elements (terms) in the lexicon Retrieve postings for each lexicon entry Manipulate postings according to the retrieval model Word-Level Inverted File : Word-Level Inverted File Query: 1.porridge & pot (BOOL) 2.“porridge pot” (BOOL) 3. porridge pot (VSM) lexicon posting Answer Lexicon Data Structures : Lexicon Data Structures According to Heaps’ Law, the size of lexicon maybe very large Hash table O(1) lookup, with constant h() and collision handling Supports exact-match lookup May be complex to expand B-Tree On-disk storage with fast retrieval and good caching behavior Supports exact-match and range-based lookup O(log n) lookups to find a list Usually easy to expand In-memory Inversion Algorithm : In-memory Inversion Algorithm Create an empty lexicon For each document d in the collection, Read document, parse into terms For each indexing term t, fd,t = frequency of t in d If t is not in lexicon, insert it Append <d, fd,t> to postings list for t Output each postings list into inverted file For each term, start new file entry Append each <d,fd,t> to the entry Compress entry Write entry out to file. Complexity of In-memory Inv. : Complexity of In-memory Inv. Time: O(n) for n-byte text Space Lexicon: space for unique words + offsets Postings, 10 bytes per entry document number: 4 bytes frequency count: 2 bytes (allows 65536 max occ) "next" pointer: 4 bytes Is this affordable? For 5GB collection, at 10 bytes/entry, for 400M entries, need 4GB of main memory Idea 1: Partition the text : Idea 1: Partition the text Invert a chunk of the text at a time Then, merge each sub-indexes into one complete index Idea 2: Sort-based Inversion : Idea 2: Sort-based Inversion Invert in two passes: Output records <t, d, ft> to a temp. file Sort the records using external merge sort read a chunk of the temp file sort it using Quicksort write it back into the same place then merge-sort the chunks in place Read sorted file, and write inverted file Access Optimizations : Access Optimizations Skip lists: A table of contents to the inverted list Embedded pointers that jump ahead n documents (why is this useful?) Separating presence information from location information Many operators only need presence information Location information takes substantial space (I/O) If split, reduced I/O for presence operators increased I/O for location operators (or larger index) Common in CD-ROM implementations Inverted file compression : Inverted file compression Inverted lists are usually compressed Inverted files with word locations are about the size of the raw data Distribution of numbers is skewed Most numbers are small (e.g., word locations, term frequency) Distribution can be made more skewed easily Delta encoding: 5, 8, 10, 17 ? 5, 3, 2, 7 Simple compression techniques are often the best choice Simple algorithms nearly as effective as complex algorithms Simple algorithms much faster than complex algorithms Goal: Time saved by reduced I/O > Time required to uncompress Compress ? Huffman encoding Inverted file compression : Inverted file compression The longst lists, which take up the most space, have the most frequent (probable) words. Compressing the longest lists would save the most space. The longest lists should compress easily because they contain the least information (why?) Algorithms: Delta encoding Variable-length encoding Unary codes Gamma codes Delta codes Variable-Byte Code Golomb code Inverted List Indexes: Compression : Inverted List Indexes: Compression Delta Encoding ("Storing Gaps") Reduces range of numbers. keep d in ascending order 3, 5, 20, 21, 23, 76, 77, 78 becomes: 3, 2, 15, 1, 2, 53, 1, 1 Produces a more skewed distribution. Increases probability of smaller numbers. Stemming also increases the probability of smaller numbers. (Why?) Variable-Byte Code : Variable-Byte Code Binary, but use minimum number of bytes 7 bits to store value, 1 bit to flag if there is another byte 0 < x < 128: 1 byte 128 < x < 16384: 2 bytes 16384 < x < 2097152: 3 bytes Integral byte sizes for easy coding Very effective for medium-sized numbers A little wasteful for very small numbers Best trade-off between smallest index size and fastest query time Index/IR toolkits : Index/IR toolkits The Lemur Toolkit for Language Modeling and Information Retrieval Java-based indexing and search technology,provide web search application software http://lucene.apache.org/nutch/ http://www.lemurproject.org/ Summary : Summary Common implementations of indexes Bitmap, signature file, inverted file Common index components Lexicon & postings Inverted Search Algorithm Inversion algorithm Inverted file access optimizations compression The End : The End Vector Space Model : Vector Space Model ??d???q???????????m???,???????TF·IDF,?????????????,?: (?????tf,idf??) Vocabulary Growth (Heap’s Law) : Vocabulary Growth (Heap’s Law) How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? Vocabulary has no upper bound due to proper names, typos, etc. New words occur less frequently as vocabulary grows If V is the size of the vocabulary and the n is the length of the corpus in words: V = Knß (0< ß <1) Typical constants: K ˜ 10-100 ß ˜ 0.4-0.6 (approx. square-root of n) Query Answer : Query Answer 1.porridge & pot (BOOL) d2 2.“porridge pot” (BOOL) null 3. porridge pot (VSM) d2 > d1>d5 Next page? Slide 35: 3. porridge pot (VSM) N=6,f(d1)??d1??? back You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
File_organization1 aehla Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 31 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: February 01, 2009 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript File organization : File organization File Organizations (Indexes) : File Organizations (Indexes) Choices for accessing data during query evaluation Scan the entire collection Typical in early (batch) retrieval systems Computational and I/O costs are O(characters in collection) Practical for only “small” text collections Large memory systems make scanning feasible Use indexes for direct access Evaluation time O(query term occurrences in collection) Practical for “large” collections Many opportunities for optimization Hybrids: Use small index, then scan a subset of the collection Indexes : Indexes What should the index contain? Database systems index primary and secondarykeys This is the hybrid approach Index provides fast access to a subset of database records Scan subset to find solution set IR Problem: Cannot predict keys that people will use in queries Every word in a document is a potential search term IR Solution: Index by all keys (words) ?full text indexes Indexes : Indexes Index is accessed by the atoms of a query language, The atoms are called “features” or “keys” or “terms” Most common feature types: – Words in text, punctuation – Manually assigned terms (controlled and uncontrolled vocabulary) – Document structure (sentence and paragraph boundaries) – Inter- or intra-document links (e.g., citations) Composed features – Feature sequences (phrases, names, dates, monetary amounts) – Feature sets (e.g., synonym classes) Indexing and retrieval models drive choices – Must be able to construct all components of those models Indexing choices (there is no “right” answer) Tokenization,Case folding (e.g., New vs new, Apple vs apple), Stopwords (e.g., the, a, its), Morphology (e.g., computer, computers, computing, computed) Index granularity has a large impact on speed and effectiveness Index Contents : Index Contents The contents depend upon the retrieval model Feature presence/absence Boolean Statistical (tf, df, ctf, doclen, maxtf) Often about 10% the size of the raw data, compressed Positional Feature location within document Granularities include word, sentence, paragraph, etc Coarse granularities are less precise, but take less space Word-level granularity about 20-30% the size of the raw data,compressed Indexes: Implementation : Indexes: Implementation Common implementations of indexes Bitmaps Signature files Inverted files Common index components Dictionary (lexicon) Postings document ids word positions No positional data indexed Indexes: Bitmaps : Indexes: Bitmaps Bag-of-words index only For each term, allocate vector with one bit per document If feature present in document n, set nth bit to 1, otherwise 0 Boolean operations very fast Space efficient for common terms (why?) Space inefficient for rare terms (why?) Good compression with run-length encoding (why?) Not widely used Indexes: Signature Files : Indexes: Signature Files Bag-of-words only For each term, allocate fixed size s-bit vector (signature) Define hash function: Multiple functions: word ? 1..s [selects which bits to set] Each term has an s-bit signature may not be unique! OR the term signatures to form document signature Long documents are a problem (why?) Usually segment them into smaller pieces / blocks Signature File Example : Signature File Example Signature File Example : Signature File Example Indexes: Signature Files : Indexes: Signature Files At query time: Lookup signature for query (how?) If all corresponding 1-bits are “on” in document signature, document probably contains that term Vary s to control P (false positive) Note space tradeoff Optimal s changes as collection grows (why?) Widely studied but Not widely used Indexes: Inverted Lists : Indexes: Inverted Lists Inverted lists are currently the most common indexing technique Source file: collection, organized by document Inverted file: collection organized by term one record per term, listing locations where term occurs During evaluation, traverse lists for each query term OR: the union of component lists AND: an intersection of component lists Proximity: an intersection of component lists SUM: the union of component lists; each entry has a score Inverted Files : Inverted Files Inverted Files : Inverted Files Word-Level Inverted File : Word-Level Inverted File How big is the index? : How big is the index? For an n word collection: Lexicon Heaps’ Law: V = O(nß), 0.4 < ß< 0.6 TREC-2: 1 GB text, 5 MB lexicon Postings at most, one per occurrence of the word in the text: O(n) Inverted Search Algorithm : Inverted Search Algorithm Find query elements (terms) in the lexicon Retrieve postings for each lexicon entry Manipulate postings according to the retrieval model Word-Level Inverted File : Word-Level Inverted File Query: 1.porridge & pot (BOOL) 2.“porridge pot” (BOOL) 3. porridge pot (VSM) lexicon posting Answer Lexicon Data Structures : Lexicon Data Structures According to Heaps’ Law, the size of lexicon maybe very large Hash table O(1) lookup, with constant h() and collision handling Supports exact-match lookup May be complex to expand B-Tree On-disk storage with fast retrieval and good caching behavior Supports exact-match and range-based lookup O(log n) lookups to find a list Usually easy to expand In-memory Inversion Algorithm : In-memory Inversion Algorithm Create an empty lexicon For each document d in the collection, Read document, parse into terms For each indexing term t, fd,t = frequency of t in d If t is not in lexicon, insert it Append <d, fd,t> to postings list for t Output each postings list into inverted file For each term, start new file entry Append each <d,fd,t> to the entry Compress entry Write entry out to file. Complexity of In-memory Inv. : Complexity of In-memory Inv. Time: O(n) for n-byte text Space Lexicon: space for unique words + offsets Postings, 10 bytes per entry document number: 4 bytes frequency count: 2 bytes (allows 65536 max occ) "next" pointer: 4 bytes Is this affordable? For 5GB collection, at 10 bytes/entry, for 400M entries, need 4GB of main memory Idea 1: Partition the text : Idea 1: Partition the text Invert a chunk of the text at a time Then, merge each sub-indexes into one complete index Idea 2: Sort-based Inversion : Idea 2: Sort-based Inversion Invert in two passes: Output records <t, d, ft> to a temp. file Sort the records using external merge sort read a chunk of the temp file sort it using Quicksort write it back into the same place then merge-sort the chunks in place Read sorted file, and write inverted file Access Optimizations : Access Optimizations Skip lists: A table of contents to the inverted list Embedded pointers that jump ahead n documents (why is this useful?) Separating presence information from location information Many operators only need presence information Location information takes substantial space (I/O) If split, reduced I/O for presence operators increased I/O for location operators (or larger index) Common in CD-ROM implementations Inverted file compression : Inverted file compression Inverted lists are usually compressed Inverted files with word locations are about the size of the raw data Distribution of numbers is skewed Most numbers are small (e.g., word locations, term frequency) Distribution can be made more skewed easily Delta encoding: 5, 8, 10, 17 ? 5, 3, 2, 7 Simple compression techniques are often the best choice Simple algorithms nearly as effective as complex algorithms Simple algorithms much faster than complex algorithms Goal: Time saved by reduced I/O > Time required to uncompress Compress ? Huffman encoding Inverted file compression : Inverted file compression The longst lists, which take up the most space, have the most frequent (probable) words. Compressing the longest lists would save the most space. The longest lists should compress easily because they contain the least information (why?) Algorithms: Delta encoding Variable-length encoding Unary codes Gamma codes Delta codes Variable-Byte Code Golomb code Inverted List Indexes: Compression : Inverted List Indexes: Compression Delta Encoding ("Storing Gaps") Reduces range of numbers. keep d in ascending order 3, 5, 20, 21, 23, 76, 77, 78 becomes: 3, 2, 15, 1, 2, 53, 1, 1 Produces a more skewed distribution. Increases probability of smaller numbers. Stemming also increases the probability of smaller numbers. (Why?) Variable-Byte Code : Variable-Byte Code Binary, but use minimum number of bytes 7 bits to store value, 1 bit to flag if there is another byte 0 < x < 128: 1 byte 128 < x < 16384: 2 bytes 16384 < x < 2097152: 3 bytes Integral byte sizes for easy coding Very effective for medium-sized numbers A little wasteful for very small numbers Best trade-off between smallest index size and fastest query time Index/IR toolkits : Index/IR toolkits The Lemur Toolkit for Language Modeling and Information Retrieval Java-based indexing and search technology,provide web search application software http://lucene.apache.org/nutch/ http://www.lemurproject.org/ Summary : Summary Common implementations of indexes Bitmap, signature file, inverted file Common index components Lexicon & postings Inverted Search Algorithm Inversion algorithm Inverted file access optimizations compression The End : The End Vector Space Model : Vector Space Model ??d???q???????????m???,???????TF·IDF,?????????????,?: (?????tf,idf??) Vocabulary Growth (Heap’s Law) : Vocabulary Growth (Heap’s Law) How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? Vocabulary has no upper bound due to proper names, typos, etc. New words occur less frequently as vocabulary grows If V is the size of the vocabulary and the n is the length of the corpus in words: V = Knß (0< ß <1) Typical constants: K ˜ 10-100 ß ˜ 0.4-0.6 (approx. square-root of n) Query Answer : Query Answer 1.porridge & pot (BOOL) d2 2.“porridge pot” (BOOL) null 3. porridge pot (VSM) d2 > d1>d5 Next page? Slide 35: 3. porridge pot (VSM) N=6,f(d1)??d1??? back