space efficient algorithms veli

Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Space-Efficient Algorithms for Document Retrieval: 

Space-Efficient Algorithms for Document Retrieval Veli Mäkinen University of Helsinki Joint work with Niko Välimäki

Introduction: 

Introduction Information Retrieval Document Retrieval Inverted Index Combinatorial Pattern Matching Text Indexing Suffix tree Field Problem Solution [PST06] [Mut02] practice: space limits theory: time limits [Sad07 & this paper]

Text Indexing: 

Text Indexing Let T = t1t2 ... tn be a text string from an ordered alphabet Σ. Text Indexing problem is to build an index structure for T that supports the following operations on a given pattern P=p1p2 ... pm: Count(P): How many times P occurs in T? List(P): list the occurrence positions of P in T.

Document Retrieval: 

Document Retrieval Let D={T1,T2,...Tk} be a set of text documents of total length n. Document Retrieval problem is to build an index for D that supports the following operation on a given pattern P=p1p2 ... pm: - Find(P): List the documents that contain P (in the order of relevance,...)

Inverted Index & Document Retrieval: 

Inverted Index & Document Retrieval ... be: (d1,4) (d1,18) ... (d2,74) (d2,139)... ... to: (d1,1) (d1,15) ...(d2,136)... ... Find("to be")= Remove duplicates((Find("to")+3)∩Find("be")) = d1 (Hamlet), d2 (Merchant of Venice),... To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching. Creating inverted file over Shakespeare's plays...............................

Suffix Array & Document Retrieval (1/2): 

Suffix Array & Document Retrieval (1/2) Build generalized suffix array of D: 1 2 .... 6853491 6853492 6853493 6853494 ... To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.

Suffix Array & Document Retrieval: 

Suffix Array & Document Retrieval Build generalized suffix array of D: Locate the interval containing all occurrences of pattern P: Remove duplicates: 1 2 .... 6853491 6853492 6853493 6853494 ... 1 2 .... 6853491 6853492 6853493 6853494 ... "to be" d1 (Hamlet), d2 (Merchant of Venice),...

Muthukrishnan's improvement: 

Muthukrishnan's improvement 1 2 .... 6853491 6853492 6853493 6853494 ... 6 4 .... 2 1 1 3 doc "to be"

Time-Optimal Document Retrieval: 

Time-Optimal Document Retrieval Theorem [Mut02]: Document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size O(n log n) bits, where ndoc is the number of documents matching the query. Observation: The solution is not space-optimal, as the document collection can be represented in n log |Σ| bits.

Space-Optimal Document Retrieval: 

Space-Optimal Document Retrieval Theorem [Sad02]: Document retrieval problem can be solved in O(f(m,n)+ndoc·g(n)) time using an index structure of size |CSA|+4n+o(n)+O(k log (n/k)) bits, where |CSA| ≤ n log |Σ| (1+o(1)) is the size of the compressed suffix array used; f(m,n)=O(m log n) is the pattern search time; and Ω(logεn)=g(n) is the time to decode a suffix array value.

Our Result: Space- and Time-Efficient Document Retrieval: 

Our Result: Space- and Time-Efficient Document Retrieval Theorem: Document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size |CSA|+2n+o(n)+n log k(1+o(1)) bits, when |Σ|,k polylog(n); for unbounded |Σ|,k the time bound components become O(m log |Σ|) and O(ndoc log k), respectively.

Details of Our Result (1/3): 

Details of Our Result (1/3) We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences. We use the generalized wavelet tree [GGV03,FMMN07] to store document numbers according to the suffix array order.

Details of Our Result (2/3): 

Details of Our Result (2/3) Observation: prev[i]=selectdoc[i](doc,rankdoc[i](doc,i)-1), where rankk'(A,i) gives the number of times value k' appears in A[1,i]; and selectk'(A,j) gives the position of the j-th occurrence of value k' in A.

Details of Our Result (3/3): 

Details of Our Result (3/3) The generalized wavelet tree representation of doc-array provides constant time rank and select when kpolylog (n). Constant time Range Minimum Queries (RMQ) on implicit prev-array can be supported using 2n+o(n) bits [FH07].

A simpler way to obtain the O(ndoc log k) result...: 

A simpler way to obtain the O(ndoc log k) result... 1 2 3 4 5 6 7 8 9 doc 2 3 4 2 1 2 3 1 4 2 2 1 2 1 3 4 3 4 1 1 2 2 2 3 3 4 4

Extensions: 

Extensions The approach can easily be extended to report the documents in relevance order under standard scoring schemes like TF*IDF; and show context around the first/several/all occurrences in selected documents.

Small experiment: 

Small experiment 50MB English text k=200 size query time m=3 query time m=4

authorStream Live Help