文字列照合アルゴリズム

1
北海道大学
Hokkaido University
Lecture on Information Knowledge Network
"Information retrieval and pattern matching"
Laboratory of Information Knowledge Network,
Division of Computer Science,
Graduate School of Information Science and Technology,
Hokkaido University
Takuya KIDA
2011/11/22
Lecture on Information knowledge network
The 6th
Pattern matching on
compression text
About data compression
Motivation and aim of this study
Pattern matching on Huffman encoded text
Pattern matching on LZW compressed text
Unified framework: Collage system
Aspect of speeding-up of pattern matching
by text compression: BPE compression
2011/11/22
Lecture on Information
knowledge network
2
北海道大学 Hokkaido University
3
About data compression
Lossless compression
Non-universal encoding
Lossy compression
Entropy encoding
Huffman
encoding
Runlength
JPEG
Arithmetic
encoding
MPEG
Universal encoding
Dictionary-based Grammar-based
Statistical
PPM
LZ78
MP3
Sequitur
sort-based
LZ77
LZW BPE
BWT
used for image and
voice data
※reference: Managing Gigabytes: Compressing and Indexing Documents and Images, I. H. Witten, A. Moffat, T. C. Bell, Morgan Kaufmann Pub, 1999.
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
4
Aim of this study
Ordinal
pattern matching machine
Original text
Ordinal
pattern matching machine
decompress
Compressed
text
Compressed
text
2011/11/22
Original text
Pattern matching machine
for compressed texts
Lecture on Information knowledge network
北海道大学 Hokkaido University
5
Example of application
We want to pack a lot of data into a small computer such as a
mobile phone and PDA as much as possible!
 Because of small amount of memory, to construct an extra
index structure isn’t good solution!
 However, we want to retrieve at high speed!

Personal databases
Short memos
E-mails
Business cards
Directories
Schedule tables
E-books
KOJIEN
E2J/J2E dictionaries
※写真はsharp mi110と東芝V601T
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
6
Difficulty of PM on compressed texts
Document files
There might hardly be "To decrease capacity, the text data is
preserved by compressing it" in the category that personally
uses the computer today when the capacity of the hard disk and
the memory has grown enough. I have not used this function
though the function to reduce capacity putting compression on
Windows in each folder is provided. It will be seemed as an
advantage none to compress the text data because there are 100
harms though preserving it by compressing it if it is a
multimedia data like the image and the voice data, etc. is natural.
However, the good policy doing the compression preservation
deleting neither for instance a large amount of log file nor past
mail data, etc.In a word
2011/11/22
Compressed document files
0111100001111001111111010110100010
1010100111101000101110011010111101
1000111011111101001101011111001101
0011100110110000011111101011010111
11111100000101001001010011010
1. The starting position of each
codeword is invisible
2. Representation of each string
is not unique
Lecture on Information knowledge network
北海道大学 Hokkaido University
7
Our goal
Search-on-the-fly method
Decompress-then-search method
Goal: Do pattern matching
faster than the above!
Search-without-decompress method
※ 上図イラストは竹田正幸先生の作
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
8
Lempel-Ziv-Welch (LZW) compression
T. A. Welch: A technique for high performance data compression, IEEE Comput., Vol.17, pp.8-19, 1984.
Text T:
a b ab ab ba b c aba bc abab
Compressed text E(T): 1 2 4
0
a
b
a
6
b
1
4
2
b
7
c
a
c
5
9
8
12
b
a
b
3
a
10
4
5 23
6
9
11
Let D be the set of strings
entered in the dictionary trie
D = {a, b, c, ab, ba, bc, ca, aba,
abb, bab, bca, abab}
D is constructed adaptively
11
Dictionary trie
|D| = O(compressed text length)
※ LZW is used for UNIX compress command, GIF image format, and so on.
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
9
Move of Aho-Corasick PM machine
AC machine for pattern set Π= {aba, ababb, abca, bb}
0
a
b
1
2
a
c
b
Current state:
Text:
Output:
2011/11/22
8
{aba}
5
{ababb, bb}
7
a
{abca}
: goto function
: failure function
{ } : output
{bb}
1
a
6
4
b
9
b
0
3
b
2
b
3
a
4
b
aba
3
a
4
b
aba
1
5
b
a
bb
ababb
Lecture on Information knowledge network
北海道大学 Hokkaido University
10
Idea for doing pattern matching on LZW texts
T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple pattern matching in LZW compressed text,
Proc. Data Compression Conference, pp. 103-112, IEEE Computer Society, Mar. 1998.

To simulate the move of AC machine on LZW compressed texts
0
a
b
1
2
a
c
b
Current state:
Text:
Comp. text:
Output:
2011/11/22
8
6
4
{aba}
b
5
{ababb, bb}
7
a
{abca}
: goto function
9
b
0
3
b
: failure function
{ } : output
{bb}
1
2
a
b
1
2
3
a
4
b
3
a
4
b
4
4
aba
aba
1
5
b
a
5
bb
ababb
Lecture on Information knowledge network
北海道大学 Hokkaido University
11
Core functions:Jump & Output

Can we compute two functions Jump and Output well?
–function Jump(q, u) :
simulates the consecutive transitions caused by string u in O(1) time.
 The domain is Q×D.
It needs O(m|D|) It can be realized
 returns the state number
space by a naïve
in O(m2+|D|)
of AC machine
way.
space!

–function Output(q, u) :
reports the occurrences within the string obtained by concatenating
the string corresponding to state q and string u in O(r) time.
 The domain is Q×D.
It needs O(m|D|) It can be realized
 returns the set of
space by a naïve
in O(m2+|D|)
pattern IDs
way.
space!

2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
12
function Jump
Let δ(q,u) be the (extended) state transition function※ of the AC machine.
O(m3) space
Jump(q, u) =
δ(q, u)
δ(ε, u)
if u is a factor of some pattern,
otherwise.
O(|D|) space
O(m2) space O(m2) space※
Ancestor(N'1 (q, u'), |u'|-|u|)
if u is a factor of some pattern,
Jump(q, u) =
δ(ε, u)
otherwise.
O(|D|) space
※ δ(q,u) returns the state position after making transition from the state q by string u.
※ u’ is the string corresponding to the nearest ancestor node of u that is also explicit on the
generalized suffix trie for P.
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
13
function Output
u~
:the longest prefix of u that is also a suffix of a pattern.
~ i <|u|, |p|< i,
A(u) = {〈i,p〉| p∈Π, |u|<
and u[i-|p|+1...i ]=p}
Note that state q corresponds
to a prefix of some pattern
O(m2) space
O(|D|) space
Output(q, u) = Output(q, u)
~ ∪ A(u)
q
p1
2011/11/22
u
u~
p2
p1
p2
Lecture on Information knowledge network
北海道大学 Hokkaido University
14
Pseudo code of Kida, et al.[1998]’s algorithm
PMonLZW (E(T) = u1u2…un, Π: pattern set)
1 Construct AC machine and generalized suffix trie for Π;
2 Initialize the dictionary trie for E(T);
3 Preprocess Jump(q,u) and Output(q,u)
for any q and u∈{a pattern π∈Πのfactor}
4 l ← 0;
5 q ← q0;
6 for i ← 1…n do
7
for each 〈d ,π〉∈Output(q, ui) do
8
report pattern π occurs at position l+d;
9
q ← Jump(q, ui);
10
l ← l + |ui|;
11
Update the dictionary trie;
/* enter the string for node ui+1 into D */
12
Update variables for Jump(q, ui+1) and Output(q, ui+1);
/* compute δ(ε,ui+1), A(ui+1), ui+1’, and |ui+1| by using its parent info. */
13
end of for
14 end of for
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
15
The result of Kida, et al. [1998]

The original idea is from
– A. Amir, G. Benson, and M. Farach: Let sleeping files lie: Pattern
matching in Z-compressed files, J. Computer and System Sciences,
Vol.52, pp.299-307, 1996.

It simulates KMP on LZW compressed texts
By simulating Aho-Corasick(AC)pattern matching machine,
we can do multiple pattern matching.
2
 It takes O(m +|D|) time and space for preprocessing.
2
 It scans compressed texts in O(n+r) time with O(m +|D|)
space for multiple patterns, and reports all the occurrences.

※ This firstly appears in “T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple pattern
matching in LZW compressed text, Proc. Data Compression Conference, pp. 103-112, IEEE Computer
Society, Mar. 1998.” Its Journal ed. Appears in “T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S.
Arikawa: Multiple Pattern Matching in LZW Compressed Text, Journal of Discrete Algorithms, 1(1), pp.
133-158, Hermes Science Publishing, Dec. 2000.”
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
16
Idea for applying bit-parallel technique
T. Kida, M. Takeda, A. Shinohara, and S. Arikawa: Shift-And approach to pattern matching in LZW compressed text,
Proc. CPM'99, LNCS1645, pp. 1-13, Springer-Verlag, Jul. 1999.
Pattern P:=
Text T:=
a
a
b
a
c
aabac
a
1
0
0
0
0
a
1
1
0
0
0
b
0
0
1
0
0
a
1
0
0
1
0
Jump!
2011/11/22
a
1
1
0
0
0
Mask bits
c
0
0
0
0
0
a
1
0
0
0
0
a
1
1
0
0
0
b
0
0
1
0
0
a
1
0
0
1
0
c
0
0
0
0
1
a
1
0
0
0
0
b
0
0
0
0
0
a
1
1
0
1
0
b
0
0
1
0
0
c
0
0
0
0
1
Jump!
Lecture on Information knowledge network
北海道大学 Hokkaido University
17
Extended state updating function f’

For any a∈Σ, u∈Σ*, S∈{1,…, m}, we define as
follows.
–
–
–
–

M(a) = { 1< i < m | P[i] = a }
f(S, a) = ((S ⊕ 1)∪{1}) ∩ M(a)
f’(S,ε) = S and f’(S, ua) = f’( f(S, u), a)
M’(u) = f’({1,・・・, m}, u)
O(|D|) time and space
Then, for any u∈Σ*, S∈{1,・・・, m}, we define as
–f’(S, u) = ((S ⊕ |u|)∪{1,・・・, |u|}) ∩ M’(u)
O(1) time
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
18
function Output (Bit-parallel type)

Definition:
O(|D|) time and space
–Output(S, u) = { 1 < j < |u| | m∈S }
–U(u) = {1 < j < |u| | i <m and u[1..i] =P[m-i+1..m] }
–A(u) = {1 < j < |u| | m < i and u[1-m+1..i]=P }
–Output(S, u) =((m
⊖ S)∩U(u)) ∪ A(u)
q
u
P
P
(m ⊖ S)∩U(u)
2011/11/22
O(|D|) time and space
A(u)
Lecture on Information knowledge network
北海道大学 Hokkaido University
19
The result of Kida, et al. [1999]
applied the bit-parallel technique based on Shift-And method
to processing of functions Jump and Output to speed up.
 It uses O(m+|Σ|) time and space for preprocessing.
 For a given pattern, it scans a given compressed text in O(n+r)
time and O(m+|D|) space, and it reports all the occurrences.
 It excels in the extensibility as well as Shift-And method.

– pattern matching for a generalized pattern
– pattern matching with allowing k mismatches
– multiple pattern matching
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
20
Achievement of our aim!
CPU time(sec.)
1.4
1.2
AlphaStation XP1000
(Alpha21264: 667MHz)
Tru64 UNIX V4.0F
1.0
Genbank(DNA base sequence)
17.1Mbyte
0.8
Search-on-the-fly method
0.6
compress(LZW)+KMP
gunzip(LZ77)+KMP
0.4
Search-without-decompress method
0.2
T. Kida, et al.[1998]
Speeding-up by bit-parallelism[1999]
0
2011/11/22
5
10 15 20 25
Pattern length
30
Lecture on Information knowledge network
北海道大学 Hokkaido University
21
Take a breath
2010.12.24 RG Gundam1/1
(@Higashi-Shizuoka Park)
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
22
Why do you need compressed PM?
We have enough storage
space now. Why do you
compress small data like
text documents?
If …
Goal 2
The time for doing
pattern matching
on the original text
>
A new goal!
The time for doing
compressed pattern
matching
×
×
2011/11/22
×
×
Lecture on Information knowledge network
北海道大学 Hokkaido University
23
A new goal! (Goal 2)
CPU time(sec.)
1.4
1.2
AlphaStation XP1000
(Alpha21264: 667MHz)
Tru64 UNIX V4.0F
1.0
Genbank(DNA base sequence)
17.1Mbyte
0.8
Search-on-the-fly method
0.6
compress(LZW)+KMP
gunzip(LZ77)+KMP
0.4
0.2
0
2011/11/22
Search-without-decompress method
Matching by KMP
on the original text
T. Kida, et al.[1998]
Overwhelmingly faster!
5
10 15 20 25
Pattern length
Speeding-up by bit-parallelism[1999]
30
Lecture on Information knowledge network
北海道大学 Hokkaido University
24
Byte Pair Encoding (BPE) method
18
Text
ABABCDEBDEFABDEABC
G
GGCDEBDEFGDEGC
dictionary
H
G →AB
H →DE
I → GC
GGCHBHFGHGC
I
After substitutions G I H B H F G H I
Size:256
= 1 byte
9
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
25
Achievement of Goal 2
0.8
The fastest in the previous
CPU time(sec.)
0.7
0.6
AlphaStation XP1000
(Alpha21264: 667MHz)
Tru64 UNIX V4.0F
Medline(English text)
60.3Mbyte
Matching by KMP on the original text
0.5
Search-without-decompress method
0.4
Compressed PM on BPE (KMP type)
0.3
Agrep on the original text
0.2
Search-without-decompress method
0.1
0.0
2011/11/22
Compressed PM on BPE (BM type)
Shibata, et al. (2000)
5
10 15 20 25
Pattern length
30
Lecture on Information knowledge network
北海道大学 Hokkaido University
26
Summarize the above…
ordinal
2
The original uncompressed text
for LZSS
4
GOAL
GOAL
High compression
Text compressed by LZSS
for LZW
GOAL
3
Medium compression
Text compressed by LZW
for BPE
1
GOAL
Low compression
Text compressed by BPE
2011/11/22
…but it’s the most suitable for PM!
Lecture on Information knowledge network
北海道大学 Hokkaido University
27
Paradigm shift 1
Develop
a novel compression method
which is suitable for
pattern matching!
Choosing a suitable compression
enables us to accelerate
pattern matching!
Develop pattern matching algorithms
for each compression methods
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
28
Data compression methods for PM

Dense coding type
– [ETDC] Nieves R. Brisaboa, Eva Lorenzo Iglesias, Gonzalo Navarro, and Jose R. Parama:
An efficient compression code for text databases, In ECIR2003, pp. 468-481, 2003.
– [SCDC] Nieves R. Brisaboa, Antonio Farina, Gonzalo Navarro, and Maria F. Esteller:
(s, c)-dense coding: An optimized compression code for natural language text databases, In
SPIRE2003, pp. 122-136, 2003.
– [FibC] Shmuel Tomi Klein and Miri Kopel Ben-Nissan: Using fibonacci compression codes
as alternatives to dense codes, In DCC2008, pp. 472-481, 2008.
– [SVVC] Nieves R. Brisaboa, Antonio Farina, Juan-Ramon Lopez, Gonzalo Navarro, and
Eduardo R. Lopez: A new searchable variable-to-variable compressor, In DCC2010, pp.
199-208, 2010.

VF coding type (including grammar-based compressions)
– [BPEX] Shirou Maruyama, Yohei Tanaka, Hiroshi Sakamoto, and Masayuki Takeda:
Context-sensitive grammar transform: Compression and pattern matching, In SPIRE2008,
LNCS5280, pp. 27-38, Nov. 2008.
– [DynC] Shmuel T. Klein and Dana Shapira: Improved variable-to-fixed length codes,
In SPIRE2008, pp. 39-50, 2009.
– [STVF] Takashi Uemura, Satoshi Yoshida, Takuya Kida, Tatsuya Asai, and Seishi Okamoto:
Training parse trees for efficient VF coding, In SPIRE2010, pp. 179-184, 2010.
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
29
Paradigm shift 2
Break difficulties of
various processing by using
the compression technology!
We can speed up pattern matching
by compressing the data.
We use the data compression technology
to reduce the cost for storing
and transferring the data.
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
30
Doing something by using compression

Speeding up the calculation of similarity between two long
strings by compression technique.
– “A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices”,
M. Crochemore, G. M. Landau, and M. Ziv-Ukelson, Proceeding of 13th Symposium on Discrete Algorithm,
pp.679-688, 2002

Processing a very huge graph structure on memory at high
speed by compression technique.
– Shinichi Nakano(Gunma University) “Graph compression with query
support”
Their method can represent a triangulated planar graph in 2m+o(n) bit and
moreover can support some queries on it.

Speeding up the query processing for XML data by compression
technique.
– Tetsuya Maita and Hiroshi Sakamoto(Kyushu Institute of Technology)
2011/11/22
Lecture on Information knowledge network
北海道大学 Hokkaido University
31
The 6th summary

Pattern matching algorithms on compressed texts
– Pattern matching on Huffman encoded text → automaton with synchronization
– Pattern matching on LZW compressed text → simulating the move of KMP(AC) on the
compressed text

Unified framework: Collage system
– A formal system to represent a text compressed by lexicographical
compression method
– We have clarified what kind of compression methods are suitable for pattern matching.

Aspect of speed-up pattern matching by compression
– BPE compression: it has low compression ratio, but it can speed up pattern matching
– Our experimental results showed that we could do pattern matching faster than doing
on the original texts

A big paradigm shift caused
– The data compression technology can be used in the other purposes rather than
reducing the data size

The next theme (which is the final topic of "Information retrieval and pattern
matching“)
– Various topics I didn’t mention about
2011/11/22
Lecture on Information knowledge network