33 Huynh

Uploaded from authorPOINT
Views:
 
     
 

Presentation Description

No description available.

Comments

By: ihab22 (25 month(s) ago)

Although i was looking for exact matching algorithms but i liked this one and i want to download it.

Presentation Transcript

Approximate String Matching using Compressed Suffix Arrays: 

Approximate String Matching using Compressed Suffix Arrays T.N.D. Huynh, W.K. Sung National University of Singapore W.K. Hon, T.W. Lam The University of Hong Kong

String Matching Problem: 

String Matching Problem Given a text T of length n over an alphabet Σ, a pattern P of length m, find all occurrences of P inside the text T E.g., T = barbara P = bar  2 occurrences, at position 1 and 4 in T

Index for String Matching: 

Index for String Matching Often, T is given ahead, which is going to be matched with various P later Also, n » m. E.g., T = Human Genome ~ 3 * 109 P = Gene ~ 103  It pays to waste some space to build an index for T that will facilitate later matching

Index for String Matching [Examples]: 

Index for String Matching [Examples]

k-Approximate String Matching: 

k-Approximate String Matching Find all occurrences of P in T that have at most k 'errors' (mismatch, edits) from P E.g., T = barbara P = rba  5 occurrences, at positions 1 (delete r from P), 2 (insert a to P), 3 (match), 4 (delete r from P), 6 (delete b from P)

Previous Work & Our Result (k=1): 

Previous Work andamp; Our Result (k=1)

Our Index: 

Our Index Our index is Suffix Array + Inverse Definition 1: The suffix array of T is an array SA such that SA[i] stores the starting position of the i-th smallest suffix of T

An Example of Suffix Array: 

An Example of Suffix Array E.g., T = barbara

Our Index: Suffix Array + Inverse: 

Our Index: Suffix Array + Inverse Lemma 1: Given a pattern P. Suppose P occurs in T. Then all (exact) occurrences of P corresponds to a range, say [st,ed], in SA such that SA[st], SA[st+1], …, SA[ed] are position of all such occurrences.

Our Index: Suffix Array + Inverse: 

Our Index: Suffix Array + Inverse Lemma 2: Given the range [st1,ed1] for P1 and the range [st2,ed2] for P2. Then, the range [st,ed] for P1P2 can be found in O(log n) time, based on SA and its inverse. Idea of proof: Similar to Manber andamp; Myers’ algorithm, using binary search.

Our Index: Suffix Array + Inverse: 

Our Index: Suffix Array + Inverse Corollary 3: Given the range [st,ed] for P, and an array C such that C[c] stores the total occurrences of a character in T that is smaller than the character c. Then, the range of cP can be found in O(log n) time. Proof: Directly follows from Lemma 2 since [C[c-1]+1, C[c]] is the range of SA that corresponds to c.

1-Approximate Matching Algorithm: 

1-Approximate Matching Algorithm [The delete case] Find the range [sti,edi] for P[1…i], for every i  [1,m] Find the range [sti’,edi’] for P[i…m], for every i  [1,m] For every i  [1,m], find the range of P[1…i-1] P[i+1..m]. Report the occurrences.  Time complexity: O(m log n + occ)

1-Approximate Matching Algorithm: 

1-Approximate Matching Algorithm For the mismatch case or other edit cases, the algorithm is similar, except that in Step 3, we have to find the range for |Σ|m strings (instead of m strings in the delete case).  Time complexity: O(|Σ|m log n+occ)

The General Case: 

The General Case Our algorithm can be extended to solve the general k-approximate matching problem. The time complexity will be: O(|Σ|k mk log n + occ) Further, if we replace SA + Inverse by CSA of Grossi andamp; Vitter, the space becomes O(n) bits, and the time will be blown up by an O(log n) factor

Future Work: 

Future Work Can we improve the time to O(m + occ) for the 1-approximate matching problem?