logging in or signing up 33 Huynh Gulkund Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 129 Category: Product Traini.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 24, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... By: ihab22 (25 month(s) ago) Although i was looking for exact matching algorithms but i liked this one and i want to download it. Saving..... Post Reply Close Saving..... Edit Comment Close Premium member Presentation Transcript Approximate String Matching using Compressed Suffix Arrays: Approximate String Matching using Compressed Suffix Arrays T.N.D. Huynh, W.K. Sung National University of Singapore W.K. Hon, T.W. Lam The University of Hong Kong String Matching Problem: String Matching Problem Given a text T of length n over an alphabet Σ, a pattern P of length m, find all occurrences of P inside the text T E.g., T = barbara P = bar 2 occurrences, at position 1 and 4 in T Index for String Matching: Index for String Matching Often, T is given ahead, which is going to be matched with various P later Also, n » m. E.g., T = Human Genome ~ 3 * 109 P = Gene ~ 103 It pays to waste some space to build an index for T that will facilitate later matching Index for String Matching [Examples]: Index for String Matching [Examples] k-Approximate String Matching: k-Approximate String Matching Find all occurrences of P in T that have at most k 'errors' (mismatch, edits) from P E.g., T = barbara P = rba 5 occurrences, at positions 1 (delete r from P), 2 (insert a to P), 3 (match), 4 (delete r from P), 6 (delete b from P) Previous Work & Our Result (k=1): Previous Work andamp; Our Result (k=1) Our Index: Our Index Our index is Suffix Array + Inverse Definition 1: The suffix array of T is an array SA such that SA[i] stores the starting position of the i-th smallest suffix of T An Example of Suffix Array: An Example of Suffix Array E.g., T = barbara Our Index: Suffix Array + Inverse: Our Index: Suffix Array + Inverse Lemma 1: Given a pattern P. Suppose P occurs in T. Then all (exact) occurrences of P corresponds to a range, say [st,ed], in SA such that SA[st], SA[st+1], …, SA[ed] are position of all such occurrences. Our Index: Suffix Array + Inverse: Our Index: Suffix Array + Inverse Lemma 2: Given the range [st1,ed1] for P1 and the range [st2,ed2] for P2. Then, the range [st,ed] for P1P2 can be found in O(log n) time, based on SA and its inverse. Idea of proof: Similar to Manber andamp; Myers’ algorithm, using binary search. Our Index: Suffix Array + Inverse: Our Index: Suffix Array + Inverse Corollary 3: Given the range [st,ed] for P, and an array C such that C[c] stores the total occurrences of a character in T that is smaller than the character c. Then, the range of cP can be found in O(log n) time. Proof: Directly follows from Lemma 2 since [C[c-1]+1, C[c]] is the range of SA that corresponds to c. 1-Approximate Matching Algorithm: 1-Approximate Matching Algorithm [The delete case] Find the range [sti,edi] for P[1…i], for every i [1,m] Find the range [sti’,edi’] for P[i…m], for every i [1,m] For every i [1,m], find the range of P[1…i-1] P[i+1..m]. Report the occurrences. Time complexity: O(m log n + occ) 1-Approximate Matching Algorithm: 1-Approximate Matching Algorithm For the mismatch case or other edit cases, the algorithm is similar, except that in Step 3, we have to find the range for |Σ|m strings (instead of m strings in the delete case). Time complexity: O(|Σ|m log n+occ) The General Case: The General Case Our algorithm can be extended to solve the general k-approximate matching problem. The time complexity will be: O(|Σ|k mk log n + occ) Further, if we replace SA + Inverse by CSA of Grossi andamp; Vitter, the space becomes O(n) bits, and the time will be blown up by an O(log n) factor Future Work: Future Work Can we improve the time to O(m + occ) for the 1-approximate matching problem? You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
33 Huynh Gulkund Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 129 Category: Product Traini.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 24, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... By: ihab22 (25 month(s) ago) Although i was looking for exact matching algorithms but i liked this one and i want to download it. Saving..... Post Reply Close Saving..... Edit Comment Close Premium member Presentation Transcript Approximate String Matching using Compressed Suffix Arrays: Approximate String Matching using Compressed Suffix Arrays T.N.D. Huynh, W.K. Sung National University of Singapore W.K. Hon, T.W. Lam The University of Hong Kong String Matching Problem: String Matching Problem Given a text T of length n over an alphabet Σ, a pattern P of length m, find all occurrences of P inside the text T E.g., T = barbara P = bar 2 occurrences, at position 1 and 4 in T Index for String Matching: Index for String Matching Often, T is given ahead, which is going to be matched with various P later Also, n » m. E.g., T = Human Genome ~ 3 * 109 P = Gene ~ 103 It pays to waste some space to build an index for T that will facilitate later matching Index for String Matching [Examples]: Index for String Matching [Examples] k-Approximate String Matching: k-Approximate String Matching Find all occurrences of P in T that have at most k 'errors' (mismatch, edits) from P E.g., T = barbara P = rba 5 occurrences, at positions 1 (delete r from P), 2 (insert a to P), 3 (match), 4 (delete r from P), 6 (delete b from P) Previous Work & Our Result (k=1): Previous Work andamp; Our Result (k=1) Our Index: Our Index Our index is Suffix Array + Inverse Definition 1: The suffix array of T is an array SA such that SA[i] stores the starting position of the i-th smallest suffix of T An Example of Suffix Array: An Example of Suffix Array E.g., T = barbara Our Index: Suffix Array + Inverse: Our Index: Suffix Array + Inverse Lemma 1: Given a pattern P. Suppose P occurs in T. Then all (exact) occurrences of P corresponds to a range, say [st,ed], in SA such that SA[st], SA[st+1], …, SA[ed] are position of all such occurrences. Our Index: Suffix Array + Inverse: Our Index: Suffix Array + Inverse Lemma 2: Given the range [st1,ed1] for P1 and the range [st2,ed2] for P2. Then, the range [st,ed] for P1P2 can be found in O(log n) time, based on SA and its inverse. Idea of proof: Similar to Manber andamp; Myers’ algorithm, using binary search. Our Index: Suffix Array + Inverse: Our Index: Suffix Array + Inverse Corollary 3: Given the range [st,ed] for P, and an array C such that C[c] stores the total occurrences of a character in T that is smaller than the character c. Then, the range of cP can be found in O(log n) time. Proof: Directly follows from Lemma 2 since [C[c-1]+1, C[c]] is the range of SA that corresponds to c. 1-Approximate Matching Algorithm: 1-Approximate Matching Algorithm [The delete case] Find the range [sti,edi] for P[1…i], for every i [1,m] Find the range [sti’,edi’] for P[i…m], for every i [1,m] For every i [1,m], find the range of P[1…i-1] P[i+1..m]. Report the occurrences. Time complexity: O(m log n + occ) 1-Approximate Matching Algorithm: 1-Approximate Matching Algorithm For the mismatch case or other edit cases, the algorithm is similar, except that in Step 3, we have to find the range for |Σ|m strings (instead of m strings in the delete case). Time complexity: O(|Σ|m log n+occ) The General Case: The General Case Our algorithm can be extended to solve the general k-approximate matching problem. The time complexity will be: O(|Σ|k mk log n + occ) Further, if we replace SA + Inverse by CSA of Grossi andamp; Vitter, the space becomes O(n) bits, and the time will be blown up by an O(log n) factor Future Work: Future Work Can we improve the time to O(m + occ) for the 1-approximate matching problem?