Succinct Data Structures for Permutations, Functions and Suffix Arrays: Succinct Data Structures for Permutations, Functions and Suffix Arrays Ian Munro
University of Waterloo
Joint work with F. Fich, M. He, J. Horton, A. López-Ortiz, S. Srinivasa Rao, Rajeev Raman, Venkatesh Raman
How do we encode a permutation or
generalization … function or
specialization … suffix array
in a small amount of space
and still perform queries in constant time ???
Permutations: a Shortcut Notation: Permutations: a Shortcut Notation Let P be a simple array giving π; P[i] = π[i]
Also have B[i] be a pointer t positions back in (the cycle of) the permutation;
B[i]= π-t[i] .. But only define B for every tth position in cycle. (t is a constant; ignore cycle length “round-off”)
So array representation
P = [8 4 12 5 13 x x 3 x 2 x 10 1]
1 2 3 4 5 6 7 8 9 10 11 12 13 2 4 5 13 1 8 3 12 10
Representing Shortcuts: Representing Shortcuts In a cycle there is a B every t positions …
But these positions can be in arbitrary order
Which i’s have a B, and how do we store it?
Keep a vector of all positions
0 indicates no B 1 indicates a B
Rank gives the position of B[“i”] in B array
So: π(i) and π -1(i) in O(1) time & (1+ε)n lg n bits
Theorem: Under a pointer machine model with space (1+ ε) n references, we need time 1/ε to answer π and π -1 queries; i.e. this is as good as it gets.
Getting n lg n Bits: an Aside: Getting n lg n Bits: an Aside This is the best we can do for O(1) operations
But using Benes networks:
1-Benes network is a 2 input/2 output switch
r+1-Benes network … join tops to tops
1
2
3
4
5
6
7
8 3
5
7
8
1
6
4
2 R-Benes Network R-Benes Network
A Benes Network: A Benes Network Realizing the permutation
(3 5 7 8 1 6 4 2)
1
2
3
4
5
6
7
8 3
5
7
8
1
6
4
2
What can we do with it?: What can we do with it? Divide into blocks of lg lg n gates … encode their actions in a word. Taking advantage of regularity of address mechanism
and also
Modify approach to avoid power of 2 issue
Can trace a path in time O(lg n/(lg lg n)
This is the best time we are able get for π and π-1 in minimum space.
Observe: This method “violates” the pointer machine lower bound by using “micropointers”.
Back to the main track: Powers of π: Back to the main track: Powers of π Consider the cycles of π
( 2 6 8)( 3 5 9 10)( 4 1 7)
Keep a bit vector to indicate the start of each cycle
( 2 6 8 3 5 9 10 4 1 7)
Ignoring parentheses, view as new permutation, ψ.
Note: ψ-1(i) is position containing i …
So we have ψ and ψ-1 as before
Use ψ-1(i) to find i, then bit vector (rank, select) to find πk or π-k
Functions: Functions Now consider arbitrary functions [n]→[n]
“A function is just a hairy permutation”
All tree edges lead to a cycle
Challenges here: Challenges here Essentially write down the components in a convenient order and use the n lg n bits to describe the mapping (as per permutations)
To get fk(i):
Find the level ancestor (k levels up) in a tree
Or
Go up to root and apply f the remaining number of steps around a cycle
Level Ancestors: Level Ancestors There are several level ancestor techniques using
O(1) time and O(n) WORDS.
Adapt Bender & Farach-Colton to work in O(n) bits
But going the other way …
f-k is a set: f-k is a set Moving Down the tree requires care
f-3( ) = ( )
The trick:
Report all nodes on a given level of a tree in time proportional to the number of nodes, and
Don’t waste time on trees with no answers
Final Function Result: Final Function Result Given an arbitrary function f: [n]→[n]
With an n lg n + O(n) bit representation we can compute fk(i) in O(1) time and f-k(i) in time O(1 + size of answer).
Back to Text … And Suffix Arrays: Back to Text … And Suffix Arrays Text T[1..n] over (a,b)*# (a<#<b)
There are 2n-1 such texts, which of the n! suffix arrays are valid?
1 2 3 4 5 6 7 8
SA= 4 7 5 1 8 3 6 2
is
a b b a a b a #
SA-1= 4 8 6 1 3 7 2 5
M= 4 7 1 5 8 2 3 6 isn’t ..why?
Ascending to Max: Ascending to Max M is a permutation so M-1 is its inverse
i.e. M-1[i] says where i is in M
Ascending-to-Max: 1 i n-2
M-1[i] < M-1[n] and M-1[i+1] < M-1[n] M-1[i] < M-1[i+1]
M-1[i] > M-1[n] and M-1[i+1] > M-1[n] M-1[i] > M-1[i+1]
4 7 5 1 8 3 6 2 OK
4 7 1 5 8 2 3 6 NO
Non-Nesting: Non-Nesting Non-Nesting: 1 i,j n-1 and M-1[i]<M-1[j]
M-1[i] < M-1[i+1] and M-1[j] < M-1[j+1] M-1[i+1] < M-1[j+1]
M-1[i] > M-1[i+1] and M-1[j] > M-1[j+1] M-1[i+1] < M-1[j+1]
4 7 5 1 8 3 6 2 OK
4 7 1 5 8 2 3 6 NO
Characterization Theorem for Suffix Arrays on Binary Texts: Characterization Theorem for Suffix Arrays on Binary Texts Theorem: Ascending to Max & Non-nesting Suffix Array
Corollary: Clean method of breaking SA into segments
Corollary: Linear time algorithm to check whether SA is valid
Cardinality Queries: Cardinality Queries T= a b a a a b b a a a b a a b b #
Remember lengths longest run of a’s and of b’s
SA (broken by runs, but not stored explicitly)
8 3 | 9 4 12| 1 10 5 13|16 | 7 2 11 15|6 14
Ba, bit vector .. If SA-1[i-1] in an “a” section store 1 in Ba,[SA-1[i]], else 0
Ba 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1
Create rank structure on Ba, and similarly Bb, (Note these are reversed except at #)
Algorithm Count(T,P)
s ← 1; e ←n; i ← m;
while i>0 and se do
if P[i[=a then
s← rank1(Ba,s-1)+1; e←rank1(Ba,e)
else
s← na + 2 + rank1(Bb,s-1); e←na + 1 +rank1(Bb,e)
i ← i-1
Return max(e-s+1,0)
Time: O(length of query)
Listing Queries: Listing Queries Complex methods
Key idea: for queries of length at least d, index every dth position .. For T and forT(reversed)
So we have matches for T[i..n] and T[1,i-1]
View these as points in 2 space (Ferragina & Manzini and Grossi & Vitter)
Do a range query (Alstrup et al)
Variety of results follow
General Conclusion: General Conclusion Interesting, and useful, combinatorial objects can be:
Stored succinctly … O(lower bound) +o()
So that
Natural queries are performed in O(1) time (or at least very close)
This can make the difference between using them and not …