データ圧縮による 文字列検索の高速化(仮称)

Speeding up pattern matching
by text compression
Yusuke Shibata, Takuya Kida,
Shuichi Fukamachi, Masayuki Takeda,
Ayumi Shinohara, Takeshi Shinohara,
Setsuo Arikawa
Department of Informatics, Kyushu University, Japan
Department of AI, Kyushu Institute of Technology, Japan
Contents
Pattern matching on compressed text.
A unifying framework for compressed
pattern matching (Collage System)
Byte pair encoding (BPE).
Pattern matching algorithm on BPE compressed
text.
Experimental result.
Conclusion.
Pattern Matching Problem
Pattern
Text
matching
Pattern matching is one of the most fundamental operations
in string processing.
Recently, Knuth-Morris-Pratt
a new trend for accelerating
pattern matching has
(1974)
emerged: Speeding up pattern matching by text compression.
Boyer-Moore (1977)
From the traditional criteria for data compression, i.e.,
Aho-Corasick
(1975)
compression
ratio and compression/decompression
time,
adaptive dictionary
Shift-Ormethods
(1992) such as the Lempel-Ziv family
are often preferred. However, such methods cannot speed up
the pattern matching since an extra work is needed to keep
track of compression mechanism.
Pattern Matching on Compressed Text
original text
Search
File transfer
on Secondary
disk storage
It requires extra
time and space.
on Memory
compressed text
File transfer
on Secondary
disk storage
on Memory
Expand
Search
on Memory
Pattern Matching on Compressed Text
compressed text
Search directly
File transfer
on Secondary
disk storage
on Memory
GOAL 1
To perform a faster search in compressed texts in comparison
with a regular decompression followed by an ordinary search.
GOAL 2
Speeding up pattern matching
by text compression
To perform a faster search in compressed texts in comparison
with an ordinary search in the original texts.
Previous Results(1)
year
researcher
compression
1988
1992
1992
1994
1994
1995
1996
1996
1997
1997
Eliam-Tsoreff and Vishkin
Amir, Landau, and Vishkin
Amir and Benson
Amir, Benson, and Farach
Manber
Farach and Thorup
Gasieniec, et al.
Amir, Benson and Farach
Karpinski, Rytter, and Shinohara
run-length
two-dimensional run-length
two-dimensional run-length
two-dimensional run-length
Miyazaki, Shinohara, and Takeda
1997 Takeda
1998 Fukamachi, Shinohara, and Takeda
1998 Kida, et al.
1998 Shibata
original compression scheme
LZ77
LZ77
LZW
straight-line programs
straight-line programs
finite state encoding
Huffman encoding
LZW
byte pair encoding
Previous Results(2)
year
researcher
compression
1998 de Moura, Navarro, Ziviani, and
Baeza-Yates
Word based encoding
1999 Shibata, Takeda, Shinohara, and
Arikawa
Antidictionary based
1999 Kida, Takeda, Shinohara, and
Arikawa
1999 Navarro and Raffinot
Unifying
framework
1999 Kida, et al.
Today’s talk
2000 Shibata, et al.
LZW
LZ family
Dictionary based methods
(Collage system)
Byte pair encoding
A Unifying Framework
for Compressed Pattern Matching
Previous:
Compression A
PM Algorithm A
Compression B
PM Algorithm B
Compression C
PM Algorithm C
Collage system
Compression A
Kida et al.[1999]:
Compression B
Compression C
Pattern matching algorithm
on the unifying framework
Collage System
Definition and Several Examples
Dictionary Based Compression
Original
text
factorize into
a series of phrases
encoding
compressed
text
Dictionary
structure
How to choose the phrases.
How to design the data structure of the dictionary.
How to encode phrases.
Collage System
Collage system is a pair 〈D, S 〉
D : A sequence of assignments (Dictionary structure)
X1 := expr1 ; X2 := expr2 ; ・・・ Xn := exprn ;
S : A sequence of variables defined in D
(Compressed text)
S = Xi1 , Xi2 ,・・・, Xil
( Xi ∈D )
||D|| = n : number of assignments in D
|S| = l : number of variables in S
Collage System
D : A sequence of assignments (Dictionary structure)
X1 = expr1 ; X2 = expr2 ; ・・・ Xn = exprn ;
where exprk are ...
a
a ∈Σ∪{ε},
(primitive assignment)
Xi ・ Xj
for i, j < k,
(concatenation)
( Xi ) j
for i < k and integer j ( j times repetition)
[ j ]X
for i < k and integer j
(prefix truncation)
for i < k and integer j
(suffix truncation)
i
Xi
[j]
Example of Collage System
D:
X1 = a ;
X2 = b ;
X3 = X1・X2 ;
X4 = X2・X1 ;
X5 = ( X3 )3 ;
X6 = [ 3 ]X5 ;
X7 = X6・X4 ;
T(X7)
prefix
truncation
X7
X6
X4
ab
3 times
repetition X5
ba
ababab
X3
bab
X1 X2
babba
S : X3 , X6 , X4 , X7
abbabbababba
[ 3 ] ((
a
X2
X1
b )3 ) b
a
height(X7) = 4
height(D) = 4
???
Pattern Matching Algorithm
on a Collage System
Compressed pattern matching on a collage system
Theorem[Kida et al. 1999]
Problem of compressed pattern matching
can be solved in
O( (||D||+|S|)・height(D) + m2 + r ) time
using O( ||D|| + m2 ) space.
If D contains no truncation, it can be solved in
O( ||D|| + |S| + m2 + r ) time.
||D|| :
|S| :
m :
r :
number of assignments in D
number of variables in S
pattern length
number of pattern occurrences
Basic Idea
Pattern π= a b a b b
0
a
1
b
2
a
3
b
4
b
5
: goto function
: failure function
state: 0
original text:
S :
1
2
3
4
3
4
5
a b a b a b b a
Xi1 Xi2
Xi3
Xi4
1
Jump and Output
The function Jump( j, u) =δKMP( j, u)
• It simulates the sequence of state transitions for u.
•The domain is Q×D
Reply in
O(1) time
The set Output( j, u)
={1≦i≦|u| | P = a suffix of P[1: j]・u[1: i]}
•This set contains the pattern occurrences.
Reply in
O( l ) time
Realization of Jump and Output
for Jump( q, Xk) , if Xk is ...
a
Xi ・ Xj
O(1) time
If the factor concatenation problem for
length m string can be solved in O(1)
time, it can be solved in O(1) time.
for Output( q, Xk), if Xk is ...
a
Xi ・ Xj
O(1) time
Size of the
set Output
It can be enumerate in O( l ) time
from Output of Xi and Xj.
Factor Concatenation Problem
Instance: Two factors x and y of a string P
each represented as a node of suffix trie of P.
Question: Is the string xy a factor of P ?
If ‘yes’ then return its node number.
example: P = COPACABANA
OPA , CABAN
OPACABAN
concatenate
‘Yes’! P[2:9]
Solution to the problem
• Using a suffix trie, it can be solved in O(m) time
after preprocessing of O(m2) time and space.
• Using a two-dimensional lookup table, it can be solved in
O(1), but we need O(m4) time and space
preprocessing.
It can be solved in O(1) time after
O(m2) space and time preprocessing.
Outline of Our Algorithm
pattern P and collage system 〈D, S 〉
( S := Xi1 , Xi2 ,・・・, Xin )
Output. All occurrences of the patterns.
Input.
/* preprocessing of D and P */
preprocess(D); preprocess(P);
l:=0; q:=0;
for j:=1 to n do begin
for each dOutput(q, Xij) do
report ‘pattern occurs at position l+d ’;
q:= Jump(q, Xij); /* state transition */
l:= l + |Xij |;
/* calculation of the offset */
end
Compressed pattern matching on a collage system
truncation
LZ77, LZSS, etc...
not suitable for
speeding up
pattern
matching
O( (||D|| + |S| )・height(D) + m2 + r ) time
no truncation
LZ78, LZW,
BPE, Run-length, etc...
O( ||D|| + |S| + m2 + r ) time
Byte Pair Encoding
original encoding algorithm
and modified algorithm
Byte Pair Encoding
ABAB
AB AB
Text: T = ABABCDEBDEFABDEABC
AB→G
GGCDEBDEFGDEGC
DE DE DE
DE→H
GC
GC
GGCHBHFGHGC
GC→I
GIHBHFGHI
Code
A
B
C
D
E
F
G
H
I
Pair
A
B
C
D
E
F
AB
DE
GC
Pair Table
Used
Character
Byte Pair Encoding “collage system”
Text: T = ABABCDEBDEFABDEABC
AB→G
GGCDEBDEFGDEGC
DE→H
GGCHBHFGHGC
GC→I
GIHBHFGHI
S : X7 , X9 , X8 , X2 , X8 , X6 , X7 , X8 , X9
D:
X1 = A;
X2 = B ;
X3 = C ;
X4 = D ;
X5 = E ;
X6 = F ;
X7 = X1・X2 ;
X8 = X4・X5 ;
X9 = X7・X3 ;
Speeding up of compression
Time complexity of BPE O(uN)
u : The number of character codes,
N : Text length
using doubly-linked list
O(u + N) time
Speed-up of compression
original text:
we apply the BPE algorithm to the first block.
D:
X1 = A
X2 = C
・
X3 = X2 X1
X255 = X247・X8
X256 = X125・X48
BPE compressed text:
Pattern Matching Machine
for multiple replacement
[Arikawa et al. 1984]
Comparison of Compression Ratio and time
BPE are worse than those
of “Compress” and “Gzip”
compression Ratio(%)
BPE
original modified
Brown corpus ( 6.8Mb)
51.0
59.0
Medline
(60.3Mb) 56.2
59.0
Genbank
(17.1Mb) 30.8
32.5
compression time(sec)
Brown corpus
Medline
Genbank
Compress
Gzip
43.7
42.3
26.8
39.0
33.3
23.1
It is drastically accelerated
by our modification
196.9
1699.9
440.6
8.0
60.7
16.5
12.7
73.3
19.3
37.7
242.2
100.9
Compressed pattern matching
on BPE compressed text
Problem of compressed pattern matching
on BPE compressed text can be solved in
O( ||D|| + |S| + m2 + r ) time.
||D|| ≦256
-The dictionary D is encoded separately from the sequence S.
-The size of D is small enough.
-The variables of S are encoded using a fixed length code.
a clinicallyoriented subset of
Medlin
a data set
from GenBank
Experimental result
Medline data
(compression ratio is 59%)
Genbank data
(compression ratio is 32%)
0.25
0.80
KMP
0.70
KMP
0.20
0.50
our algorithm
run time (sec)
run time (sec)
0.60
0.15
Agrep
0.40
0.10
0.30
our algorithm
Agrep
0.20
5
10
15
20
25
30
0.05
5
pattern length
10
15
20
pattern length
Ultra ...
25
30
Concluding Remarks
Conclusion and Future Works
Conclusion

We introduced compressed pattern matching
from practical viewpoints.

We observed that our algorithm is reduced at the
same rate as the compression ratio compared
with uncompressed case.

We also observed that it is occasionally faster than
Agrep.
Future Works
More recent work
• Can we reduce the complexity of the
preprocessing? O(m2)  O(m)
• To develop a sublinear algorithm on BPE
compressed texts.
• To develop an approximate pattern matching
algorithm on a collage system.
• To develop a new compression which is
suitable for compressed pattern matching.
More recent work
Does text compression speed up
such a sublinear time
algorithm?
A Boyer-Moore type algorithm for
compressed pattern matching [CPM2000]
We proposed a Boyer-Moore (BM) type algorithm
for pattern matching in BPE compressed texts.
More recent work
Medline data
(compression ratio is 59%)
Genbank data
(compression ratio is 32%)
0.80
0.25
KMP
0.70
KMP
0.20
0.15
our algorithm
0.50
run time (sec)
run time (sec)
0.60
Agrep
0.10
0.40
our algorithm
Agrep
0.30
0.05
most recent work
0.20
5
10
15
most20 recent
work30
25
pattern length
0.00
5
10
15
20
pattern length
25
30