Speeding up pattern matching by text compression Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa Department of Informatics, Kyushu University, Japan Department of AI, Kyushu Institute of Technology, Japan Contents Pattern matching on compressed text. A unifying framework for compressed pattern matching (Collage System) Byte pair encoding (BPE). Pattern matching algorithm on BPE compressed text. Experimental result. Conclusion. Pattern Matching Problem Pattern Text matching Pattern matching is one of the most fundamental operations in string processing. Recently, Knuth-Morris-Pratt a new trend for accelerating pattern matching has (1974) emerged: Speeding up pattern matching by text compression. Boyer-Moore (1977) From the traditional criteria for data compression, i.e., Aho-Corasick (1975) compression ratio and compression/decompression time, adaptive dictionary Shift-Ormethods (1992) such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up the pattern matching since an extra work is needed to keep track of compression mechanism. Pattern Matching on Compressed Text original text Search File transfer on Secondary disk storage It requires extra time and space. on Memory compressed text File transfer on Secondary disk storage on Memory Expand Search on Memory Pattern Matching on Compressed Text compressed text Search directly File transfer on Secondary disk storage on Memory GOAL 1 To perform a faster search in compressed texts in comparison with a regular decompression followed by an ordinary search. GOAL 2 Speeding up pattern matching by text compression To perform a faster search in compressed texts in comparison with an ordinary search in the original texts. Previous Results(1) year researcher compression 1988 1992 1992 1994 1994 1995 1996 1996 1997 1997 Eliam-Tsoreff and Vishkin Amir, Landau, and Vishkin Amir and Benson Amir, Benson, and Farach Manber Farach and Thorup Gasieniec, et al. Amir, Benson and Farach Karpinski, Rytter, and Shinohara run-length two-dimensional run-length two-dimensional run-length two-dimensional run-length Miyazaki, Shinohara, and Takeda 1997 Takeda 1998 Fukamachi, Shinohara, and Takeda 1998 Kida, et al. 1998 Shibata original compression scheme LZ77 LZ77 LZW straight-line programs straight-line programs finite state encoding Huffman encoding LZW byte pair encoding Previous Results(2) year researcher compression 1998 de Moura, Navarro, Ziviani, and Baeza-Yates Word based encoding 1999 Shibata, Takeda, Shinohara, and Arikawa Antidictionary based 1999 Kida, Takeda, Shinohara, and Arikawa 1999 Navarro and Raffinot Unifying framework 1999 Kida, et al. Today’s talk 2000 Shibata, et al. LZW LZ family Dictionary based methods (Collage system) Byte pair encoding A Unifying Framework for Compressed Pattern Matching Previous: Compression A PM Algorithm A Compression B PM Algorithm B Compression C PM Algorithm C Collage system Compression A Kida et al.[1999]: Compression B Compression C Pattern matching algorithm on the unifying framework Collage System Definition and Several Examples Dictionary Based Compression Original text factorize into a series of phrases encoding compressed text Dictionary structure How to choose the phrases. How to design the data structure of the dictionary. How to encode phrases. Collage System Collage system is a pair 〈D, S 〉 D : A sequence of assignments (Dictionary structure) X1 := expr1 ; X2 := expr2 ; ・・・ Xn := exprn ; S : A sequence of variables defined in D (Compressed text) S = Xi1 , Xi2 ,・・・, Xil ( Xi ∈D ) ||D|| = n : number of assignments in D |S| = l : number of variables in S Collage System D : A sequence of assignments (Dictionary structure) X1 = expr1 ; X2 = expr2 ; ・・・ Xn = exprn ; where exprk are ... a a ∈Σ∪{ε}, (primitive assignment) Xi ・ Xj for i, j < k, (concatenation) ( Xi ) j for i < k and integer j ( j times repetition) [ j ]X for i < k and integer j (prefix truncation) for i < k and integer j (suffix truncation) i Xi [j] Example of Collage System D: X1 = a ; X2 = b ; X3 = X1・X2 ; X4 = X2・X1 ; X5 = ( X3 )3 ; X6 = [ 3 ]X5 ; X7 = X6・X4 ; T(X7) prefix truncation X7 X6 X4 ab 3 times repetition X5 ba ababab X3 bab X1 X2 babba S : X3 , X6 , X4 , X7 abbabbababba [ 3 ] (( a X2 X1 b )3 ) b a height(X7) = 4 height(D) = 4 ??? Pattern Matching Algorithm on a Collage System Compressed pattern matching on a collage system Theorem[Kida et al. 1999] Problem of compressed pattern matching can be solved in O( (||D||+|S|)・height(D) + m2 + r ) time using O( ||D|| + m2 ) space. If D contains no truncation, it can be solved in O( ||D|| + |S| + m2 + r ) time. ||D|| : |S| : m : r : number of assignments in D number of variables in S pattern length number of pattern occurrences Basic Idea Pattern π= a b a b b 0 a 1 b 2 a 3 b 4 b 5 : goto function : failure function state: 0 original text: S : 1 2 3 4 3 4 5 a b a b a b b a Xi1 Xi2 Xi3 Xi4 1 Jump and Output The function Jump( j, u) =δKMP( j, u) • It simulates the sequence of state transitions for u. •The domain is Q×D Reply in O(1) time The set Output( j, u) ={1≦i≦|u| | P = a suffix of P[1: j]・u[1: i]} •This set contains the pattern occurrences. Reply in O( l ) time Realization of Jump and Output for Jump( q, Xk) , if Xk is ... a Xi ・ Xj O(1) time If the factor concatenation problem for length m string can be solved in O(1) time, it can be solved in O(1) time. for Output( q, Xk), if Xk is ... a Xi ・ Xj O(1) time Size of the set Output It can be enumerate in O( l ) time from Output of Xi and Xj. Factor Concatenation Problem Instance: Two factors x and y of a string P each represented as a node of suffix trie of P. Question: Is the string xy a factor of P ? If ‘yes’ then return its node number. example: P = COPACABANA OPA , CABAN OPACABAN concatenate ‘Yes’! P[2:9] Solution to the problem • Using a suffix trie, it can be solved in O(m) time after preprocessing of O(m2) time and space. • Using a two-dimensional lookup table, it can be solved in O(1), but we need O(m4) time and space preprocessing. It can be solved in O(1) time after O(m2) space and time preprocessing. Outline of Our Algorithm pattern P and collage system 〈D, S 〉 ( S := Xi1 , Xi2 ,・・・, Xin ) Output. All occurrences of the patterns. Input. /* preprocessing of D and P */ preprocess(D); preprocess(P); l:=0; q:=0; for j:=1 to n do begin for each dOutput(q, Xij) do report ‘pattern occurs at position l+d ’; q:= Jump(q, Xij); /* state transition */ l:= l + |Xij |; /* calculation of the offset */ end Compressed pattern matching on a collage system truncation LZ77, LZSS, etc... not suitable for speeding up pattern matching O( (||D|| + |S| )・height(D) + m2 + r ) time no truncation LZ78, LZW, BPE, Run-length, etc... O( ||D|| + |S| + m2 + r ) time Byte Pair Encoding original encoding algorithm and modified algorithm Byte Pair Encoding ABAB AB AB Text: T = ABABCDEBDEFABDEABC AB→G GGCDEBDEFGDEGC DE DE DE DE→H GC GC GGCHBHFGHGC GC→I GIHBHFGHI Code A B C D E F G H I Pair A B C D E F AB DE GC Pair Table Used Character Byte Pair Encoding “collage system” Text: T = ABABCDEBDEFABDEABC AB→G GGCDEBDEFGDEGC DE→H GGCHBHFGHGC GC→I GIHBHFGHI S : X7 , X9 , X8 , X2 , X8 , X6 , X7 , X8 , X9 D: X1 = A; X2 = B ; X3 = C ; X4 = D ; X5 = E ; X6 = F ; X7 = X1・X2 ; X8 = X4・X5 ; X9 = X7・X3 ; Speeding up of compression Time complexity of BPE O(uN) u : The number of character codes, N : Text length using doubly-linked list O(u + N) time Speed-up of compression original text: we apply the BPE algorithm to the first block. D: X1 = A X2 = C ・ X3 = X2 X1 X255 = X247・X8 X256 = X125・X48 BPE compressed text: Pattern Matching Machine for multiple replacement [Arikawa et al. 1984] Comparison of Compression Ratio and time BPE are worse than those of “Compress” and “Gzip” compression Ratio(%) BPE original modified Brown corpus ( 6.8Mb) 51.0 59.0 Medline (60.3Mb) 56.2 59.0 Genbank (17.1Mb) 30.8 32.5 compression time(sec) Brown corpus Medline Genbank Compress Gzip 43.7 42.3 26.8 39.0 33.3 23.1 It is drastically accelerated by our modification 196.9 1699.9 440.6 8.0 60.7 16.5 12.7 73.3 19.3 37.7 242.2 100.9 Compressed pattern matching on BPE compressed text Problem of compressed pattern matching on BPE compressed text can be solved in O( ||D|| + |S| + m2 + r ) time. ||D|| ≦256 -The dictionary D is encoded separately from the sequence S. -The size of D is small enough. -The variables of S are encoded using a fixed length code. a clinicallyoriented subset of Medlin a data set from GenBank Experimental result Medline data (compression ratio is 59%) Genbank data (compression ratio is 32%) 0.25 0.80 KMP 0.70 KMP 0.20 0.50 our algorithm run time (sec) run time (sec) 0.60 0.15 Agrep 0.40 0.10 0.30 our algorithm Agrep 0.20 5 10 15 20 25 30 0.05 5 pattern length 10 15 20 pattern length Ultra ... 25 30 Concluding Remarks Conclusion and Future Works Conclusion We introduced compressed pattern matching from practical viewpoints. We observed that our algorithm is reduced at the same rate as the compression ratio compared with uncompressed case. We also observed that it is occasionally faster than Agrep. Future Works More recent work • Can we reduce the complexity of the preprocessing? O(m2) O(m) • To develop a sublinear algorithm on BPE compressed texts. • To develop an approximate pattern matching algorithm on a collage system. • To develop a new compression which is suitable for compressed pattern matching. More recent work Does text compression speed up such a sublinear time algorithm? A Boyer-Moore type algorithm for compressed pattern matching [CPM2000] We proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts. More recent work Medline data (compression ratio is 59%) Genbank data (compression ratio is 32%) 0.80 0.25 KMP 0.70 KMP 0.20 0.15 our algorithm 0.50 run time (sec) run time (sec) 0.60 Agrep 0.10 0.40 our algorithm Agrep 0.30 0.05 most recent work 0.20 5 10 15 most20 recent work30 25 pattern length 0.00 5 10 15 20 pattern length 25 30
© Copyright 2024 ExpyDoc