Space-Efficient Algorithms for Document Retrieval:

Space-Efficient Algorithms for Document Retrieval Veli Mäkinen
University of Helsinki Joint work with Niko Välimäki

Introduction:

Introduction Information
Retrieval Document
Retrieval Inverted Index Combinatorial
Pattern Matching Text Indexing Suffix tree Field Problem Solution [PST06] [Mut02] practice: space limits
theory: time limits [Sad07 & this paper]

Text Indexing:

Text Indexing Let T = t1t2 ... tn be a text string from an ordered alphabet Σ.
Text Indexing problem is to build an index structure for T that supports the following operations on a given pattern P=p1p2 ... pm:
Count(P): How many times P occurs in T?
List(P): list the occurrence positions of P in T.

Document Retrieval:

Document Retrieval Let D={T1,T2,...Tk} be a set of text documents of total length n.
Document Retrieval problem is to build an index for D that supports the following operation on a given pattern P=p1p2 ... pm: - Find(P): List the documents that contain P (in the order of relevance,...)

Inverted Index & Document Retrieval:

Inverted Index & Document Retrieval ...
be: (d1,4) (d1,18) ... (d2,74) (d2,139)...
...
to: (d1,1) (d1,15) ...(d2,136)...
... Find("to be")= Remove duplicates((Find("to")+3)∩Find("be"))
= d1 (Hamlet), d2 (Merchant of Venice),... To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching. Creating inverted file over Shakespeare's plays...............................

Suffix Array & Document Retrieval (1/2):

Suffix Array & Document Retrieval (1/2) Build generalized suffix array of D:
1 2 .... 6853491 6853492 6853493 6853494 ...
To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.

Suffix Array & Document Retrieval:

Suffix Array & Document Retrieval Build generalized suffix array of D:
Locate the interval containing all occurrences of pattern P:
Remove duplicates:
1 2 .... 6853491 6853492 6853493 6853494 ...
1 2 .... 6853491 6853492 6853493 6853494 ... "to be" d1 (Hamlet), d2 (Merchant of Venice),...

Time-Optimal Document Retrieval Theorem [Mut02]: Document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size O(n log n) bits, where ndoc is the number of documents matching the query.
Observation: The solution is not space-optimal, as the document collection can be represented in n log |Σ| bits.

Space-Optimal Document Retrieval:

Space-Optimal Document Retrieval Theorem [Sad02]: Document retrieval problem can be solved in O(f(m,n)+ndoc·g(n)) time using an index structure of size |CSA|+4n+o(n)+O(k log (n/k)) bits, where
|CSA| ≤ n log |Σ| (1+o(1)) is the size of the compressed suffix array used;
f(m,n)=O(m log n) is the pattern search time; and
Ω(logεn)=g(n) is the time to decode a suffix array value.

Our Result: Space- and Time-Efficient Document Retrieval:

Our Result: Space- and Time-Efficient Document Retrieval Theorem: Document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size |CSA|+2n+o(n)+n log k(1+o(1)) bits, when |Σ|,k polylog(n);
for unbounded |Σ|,k the time bound components become O(m log |Σ|) and O(ndoc log k), respectively.

Details of Our Result (1/3):

Details of Our Result (1/3) We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences.
We use the generalized wavelet tree [GGV03,FMMN07] to store document numbers according to the suffix array order.

Details of Our Result (2/3):

Details of Our Result (2/3) Observation: prev[i]=selectdoc[i](doc,rankdoc[i](doc,i)-1), where
rankk'(A,i) gives the number of times value k' appears in A[1,i]; and
selectk'(A,j) gives the position of the j-th occurrence of value k' in A.

Details of Our Result (3/3):

Details of Our Result (3/3) The generalized wavelet tree representation of doc-array provides constant time rank and select when kpolylog (n).
Constant time Range Minimum Queries (RMQ) on implicit prev-array can be supported using 2n+o(n) bits [FH07].

A simpler way to obtain the O(ndoc log k) result...:

Extensions The approach can easily be extended to
report the documents in relevance order under standard scoring schemes like TF*IDF; and
show context around the first/several/all occurrences in selected documents.

Small experiment:

Small experiment 50MB English text
k=200 size query time
m=3 query time
m=4

You do not have the permission to view this presentation. In order to view it, please
contact the author of the presentation.

Send to Blogs and Networks

Processing ....

Premium member

Use HTTPs

HTTPS (Hypertext Transfer Protocol Secure) is a protocol used by Web servers to transfer and display Web content securely. Most web browsers block content or generate a “mixed content” warning when users access web pages via HTTPS that contain embedded content loaded via HTTP. To prevent users from facing this, Use HTTPS option.