File organization :File organization
File Organizations (Indexes) :File Organizations (Indexes) Choices for accessing data during query evaluation
Scan the entire collection
Typical in early (batch) retrieval systems
Computational and I/O costs are O(characters in collection)
Practical for only “small” text collections
Large memory systems make scanning feasible
Use indexes for direct access
Evaluation time O(query term occurrences in collection)
Practical for “large” collections
Many opportunities for optimization
Hybrids: Use small index, then scan a subset of the collection
Indexes :Indexes What should the index contain?
Database systems index primary and secondarykeys
This is the hybrid approach
Index provides fast access to a subset of database records
Scan subset to find solution set
IR Problem:
Cannot predict keys that people will use in queries
Every word in a document is a potential search term
IR Solution: Index by all keys (words) ?full text indexes
Indexes :Indexes Index is accessed by the atoms of a query language, The atoms are called “features” or “keys” or “terms”
Most common feature types:
– Words in text, punctuation
– Manually assigned terms (controlled and uncontrolled vocabulary)
– Document structure (sentence and paragraph boundaries)
– Inter- or intra-document links (e.g., citations)
Composed features
– Feature sequences (phrases, names, dates, monetary amounts)
– Feature sets (e.g., synonym classes)
Indexing and retrieval models drive choices
– Must be able to construct all components of those models
Indexing choices (there is no “right” answer)
Tokenization,Case folding (e.g., New vs new, Apple vs apple), Stopwords (e.g., the, a, its), Morphology (e.g., computer, computers, computing, computed)
Index granularity has a large impact on speed and effectiveness
Index Contents :Index Contents The contents depend upon the retrieval model
Feature presence/absence
Boolean
Statistical (tf, df, ctf, doclen, maxtf)
Often about 10% the size of the raw data, compressed
Positional
Feature location within document
Granularities include word, sentence, paragraph, etc
Coarse granularities are less precise, but take less space
Word-level granularity about 20-30% the size of the raw data,compressed
Indexes: Implementation :Indexes: Implementation Common implementations of indexes
Bitmaps
Signature files
Inverted files
Common index components
Dictionary (lexicon)
Postings
document ids
word positions No positional data indexed
Indexes: Bitmaps :Indexes: Bitmaps Bag-of-words index only
For each term, allocate vector with one bit per document
If feature present in document n, set nth bit to 1, otherwise 0
Boolean operations very fast
Space efficient for common terms (why?)
Space inefficient for rare terms (why?)
Good compression with run-length encoding (why?)
Not widely used
Indexes: Signature Files :Indexes: Signature Files Bag-of-words only
For each term, allocate fixed size s-bit vector (signature)
Define hash function:
Multiple functions: word ? 1..s [selects which bits to set]
Each term has an s-bit signature
may not be unique!
OR the term signatures to form document signature
Long documents are a problem (why?)
Usually segment them into smaller pieces / blocks
Signature File Example :Signature File Example
Signature File Example :Signature File Example
Indexes: Signature Files :Indexes: Signature Files At query time:
Lookup signature for query (how?)
If all corresponding 1-bits are “on” in document signature, document probably contains that term
Vary s to control P (false positive)
Note space tradeoff
Optimal s changes as collection grows (why?)
Widely studied but Not widely used
Indexes: Inverted Lists :Indexes: Inverted Lists Inverted lists are currently the most common indexing technique
Source file: collection, organized by document
Inverted file: collection organized by term
one record per term, listing locations where term occurs
During evaluation, traverse lists for each query term
OR: the union of component lists
AND: an intersection of component lists
Proximity: an intersection of component lists
SUM: the union of component lists; each entry has a score
Inverted Files :Inverted Files
Inverted Files :Inverted Files
Word-Level Inverted File :Word-Level Inverted File
How big is the index? :How big is the index? For an n word collection:
Lexicon
Heaps’ Law: V = O(nß), 0.4 < ß< 0.6
TREC-2: 1 GB text, 5 MB lexicon
Postings
at most, one per occurrence of the word in the text: O(n)
Inverted Search Algorithm :Inverted Search Algorithm Find query elements (terms) in the lexicon
Retrieve postings for each lexicon entry
Manipulate postings according to the retrieval model
Word-Level Inverted File :Word-Level Inverted File Query:
1.porridge & pot (BOOL) 2.“porridge pot” (BOOL)
3. porridge pot (VSM) lexicon posting Answer
Lexicon Data Structures :Lexicon Data Structures According to Heaps’ Law, the size of lexicon maybe very large
Hash table
O(1) lookup, with constant h() and collision handling
Supports exact-match lookup
May be complex to expand
B-Tree
On-disk storage with fast retrieval and good caching behavior
Supports exact-match and range-based lookup
O(log n) lookups to find a list
Usually easy to expand
In-memory Inversion Algorithm :In-memory Inversion Algorithm Create an empty lexicon
For each document d in the collection,
Read document, parse into terms
For each indexing term t,
fd,t = frequency of t in d
If t is not in lexicon, insert it
Append to postings list for t
Output each postings list into inverted file
For each term, start new file entry
Append each to the entry
Compress entry
Write entry out to file.
Complexity of In-memory Inv. :Complexity of In-memory Inv. Time: O(n) for n-byte text
Space
Lexicon: space for unique words + offsets
Postings, 10 bytes per entry
document number: 4 bytes
frequency count: 2 bytes (allows 65536 max occ)
"next" pointer: 4 bytes
Is this affordable?
For 5GB collection, at 10 bytes/entry, for 400M entries, need 4GB of main memory
Idea 1: Partition the text :Idea 1: Partition the text Invert a chunk of the text at a time
Then, merge each sub-indexes into one complete index
Idea 2: Sort-based Inversion :Idea 2: Sort-based Inversion Invert in two passes:
Output records to a temp. file
Sort the records using external merge sort
read a chunk of the temp file
sort it using Quicksort
write it back into the same place
then merge-sort the chunks in place
Read sorted file, and write inverted file
Access Optimizations :Access Optimizations Skip lists:
A table of contents to the inverted list
Embedded pointers that jump ahead n documents (why is this useful?)
Separating presence information from location information
Many operators only need presence information
Location information takes substantial space (I/O)
If split,
reduced I/O for presence operators
increased I/O for location operators (or larger index)
Common in CD-ROM implementations
Inverted file compression :Inverted file compression Inverted lists are usually compressed
Inverted files with word locations are about the size of the raw data
Distribution of numbers is skewed
Most numbers are small (e.g., word locations, term frequency)
Distribution can be made more skewed easily
Delta encoding: 5, 8, 10, 17 ? 5, 3, 2, 7
Simple compression techniques are often the best choice
Simple algorithms nearly as effective as complex algorithms
Simple algorithms much faster than complex algorithms
Goal: Time saved by reduced I/O > Time required to uncompress Compress ? Huffman encoding
Inverted file compression :Inverted file compression The longst lists, which take up the most space, have the most frequent (probable) words.
Compressing the longest lists would save the most space.
The longest lists should compress easily because they contain the least information (why?)
Algorithms:
Delta encoding
Variable-length encoding
Unary codes
Gamma codes
Delta codes
Variable-Byte Code
Golomb code
Inverted List Indexes: Compression :Inverted List Indexes: Compression Delta Encoding ("Storing Gaps")
Reduces range of numbers.
keep d in ascending order
3, 5, 20, 21, 23, 76, 77, 78
becomes: 3, 2, 15, 1, 2, 53, 1, 1
Produces a more skewed distribution.
Increases probability of smaller numbers.
Stemming also increases the probability of smaller numbers. (Why?)
Variable-Byte Code :Variable-Byte Code Binary, but use minimum number of bytes
7 bits to store value, 1 bit to flag if there is another byte
0 < x < 128: 1 byte
128 < x < 16384: 2 bytes
16384 < x < 2097152: 3 bytes
Integral byte sizes for easy coding
Very effective for medium-sized numbers
A little wasteful for very small numbers
Best trade-off between smallest index size and fastest query time
Index/IR toolkits :Index/IR toolkits The Lemur Toolkit for Language Modeling and Information Retrieval
Java-based indexing and search technology,provide web search application software http://lucene.apache.org/nutch/ http://www.lemurproject.org/
Summary :Summary Common implementations of indexes
Bitmap, signature file, inverted file
Common index components
Lexicon & postings
Inverted Search Algorithm
Inversion algorithm
Inverted file access optimizations
compression
The End :The End
Vector Space Model :Vector Space Model ??d???q???????????m???,???????TF·IDF,?????????????,?: (?????tf,idf??)
Vocabulary Growth (Heap’s Law) :Vocabulary Growth (Heap’s Law) How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
Vocabulary has no upper bound due to proper names, typos, etc.
New words occur less frequently as vocabulary grows
If V is the size of the vocabulary and the n is the length of the corpus in words:
V = Knß (0< ß <1)
Typical constants:
K ˜ 10-100
ß ˜ 0.4-0.6 (approx. square-root of n)
Query Answer :Query Answer 1.porridge & pot (BOOL)
d2
2.“porridge pot” (BOOL)
null
3. porridge pot (VSM)
d2 > d1>d5
Next page?
Slide 35:3. porridge pot (VSM)
N=6,f(d1)??d1??? back