Share PowerPoint. Anywhere!

space efficient algorithms veli

Uploaded from authorPOINT Lite
Download as Download Not Available PPT
Presentation Description

No description available

Like authorSTREAM?


You can vote once a day till December
10th, Vote Now!
Views: 8
Like it  ( Likes) Dislike it  ( Dislikes)
Added: January 04, 2008 This presentation is Public
Presentation Category :Education
Presentation StatisticsNew!
Views on authorSTREAM: 8
Presentation Transcript

Space-Efficient Algorithms for Document Retrieval : Space-Efficient Algorithms for Document Retrieval Veli Mäkinen University of Helsinki Joint work with Niko Välimäki


Introduction : Introduction Information Retrieval Document Retrieval Inverted Index Combinatorial Pattern Matching Text Indexing Suffix tree Field Problem Solution [PST06] [Mut02] practice: space limits theory: time limits [Sad07 & this paper]


Text Indexing : Text Indexing Let T = t1t2 ... tn be a text string from an ordered alphabet Σ. Text Indexing problem is to build an index structure for T that supports the following operations on a given pattern P=p1p2 ... pm: Count(P): How many times P occurs in T? List(P): list the occurrence positions of P in T.


Document Retrieval : Document Retrieval Let D={T1,T2,...Tk} be a set of text documents of total length n. Document Retrieval problem is to build an index for D that supports the following operation on a given pattern P=p1p2 ... pm: - Find(P): List the documents that contain P (in the order of relevance,...)


Inverted Index & Document Retrieval : Inverted Index & Document Retrieval ... be: (d1,4) (d1,18) ... (d2,74) (d2,139)... ... to: (d1,1) (d1,15) ...(d2,136)... ... Find("to be")= Remove duplicates((Find("to")+3)∩Find("be")) = d1 (Hamlet), d2 (Merchant of Venice),... To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching. Creating inverted file over Shakespeare's plays...............................


Suffix Array & Document Retrieval (1/2) : Suffix Array & Document Retrieval (1/2) Build generalized suffix array of D: 1 2 .... 6853491 6853492 6853493 6853494 ... To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.


Suffix Array & Document Retrieval : Suffix Array & Document Retrieval Build generalized suffix array of D: Locate the interval containing all occurrences of pattern P: Remove duplicates: 1 2 .... 6853491 6853492 6853493 6853494 ... 1 2 .... 6853491 6853492 6853493 6853494 ... "to be" d1 (Hamlet), d2 (Merchant of Venice),...


Muthukrishnan's improvement : Muthukrishnan's improvement 1 2 .... 6853491 6853492 6853493 6853494 ... 6 4 .... 2 1 1 3 doc "to be"


Time-Optimal Document Retrieval : Time-Optimal Document Retrieval Theorem [Mut02]: Document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size O(n log n) bits, where ndoc is the number of documents matching the query. Observation: The solution is not space-optimal, as the document collection can be represented in n log |Σ| bits.


Space-Optimal Document Retrieval : Space-Optimal Document Retrieval Theorem [Sad02]: Document retrieval problem can be solved in O(f(m,n)+ndoc·g(n)) time using an index structure of size |CSA|+4n+o(n)+O(k log (n/k)) bits, where |CSA| ≤ n log |Σ| (1+o(1)) is the size of the compressed suffix array used; f(m,n)=O(m log n) is the pattern search time; and Ω(logεn)=g(n) is the time to decode a suffix array value.


Our Result: Space- and Time-Efficient Document Retrieval : Our Result: Space- and Time-Efficient Document Retrieval Theorem: Document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size |CSA|+2n+o(n)+n log k(1+o(1)) bits, when |Σ|,k polylog(n); for unbounded |Σ|,k the time bound components become O(m log |Σ|) and O(ndoc log k), respectively.


Details of Our Result (1/3) : Details of Our Result (1/3) We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences. We use the generalized wavelet tree [GGV03,FMMN07] to store document numbers according to the suffix array order.


Details of Our Result (2/3) : Details of Our Result (2/3) Observation: prev[i]=selectdoc[i](doc,rankdoc[i](doc,i)-1), where rankk'(A,i) gives the number of times value k' appears in A[1,i]; and selectk'(A,j) gives the position of the j-th occurrence of value k' in A.


Details of Our Result (3/3) : Details of Our Result (3/3) The generalized wavelet tree representation of doc-array provides constant time rank and select when kpolylog (n). Constant time Range Minimum Queries (RMQ) on implicit prev-array can be supported using 2n+o(n) bits [FH07].


A simpler way to obtain the O(ndoc log k) result... : A simpler way to obtain the O(ndoc log k) result... 1 2 3 4 5 6 7 8 9 doc 2 3 4 2 1 2 3 1 4 2 2 1 2 1 3 4 3 4 1 1 2 2 2 3 3 4 4


Extensions : Extensions The approach can easily be extended to report the documents in relevance order under standard scoring schemes like TF*IDF; and show context around the first/several/all occurrences in selected documents.


Small experiment : Small experiment 50MB English text k=200 size query time m=3 query time m=4