全文検索のためのデータ構造と構成の効率について定兼邦彦東京大学理学系研究科情報科学専攻 http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/fulltext.ppt 内容 • 全文検索のためのデータ構造の比較 – 検索時間 – ディスク容量 – 更新時間 • 検索精度 2 背景 • 電子化された文書の普及 – WWW, メール – 新聞, 辞書, 書籍 – ゲノムデータベース • 大量のテキストから高速に検索したい – もれがないようにしたい – 必要なもののみ欲しい 3 全文検索のアルゴリズム • sequential search • signature file [Moders 49] – 各文書がどのキーワードを含むか • inverted file [Bleir 67] – 各キーワードがどこにあるか • digital tree (trie) – 任意のキーワード 4 Inverted fileのデータ構造キーワードごとに出現文書，位置を記憶 • sorted array – キーワードの出現位置のリスト • prefix B-tree – 更新が簡単 • trie – prefixをコンパクトに表現 5 Word indexes vs. Full-text indexes word indexes • 決まったキーワードのみ – サイズが小さい – 構成が早い • データ構造 – sorted array – prefix B-tree – trie full-text indexes • 任意のキーワード – サイズが大きい – 構成が遅い • データ構造 – suffix array – String B-tree – suffix tree 6 Full-text indexのデータ構造 • suffix tree [Weiner 73] • suffix array [Manber, Myers 93] • String B-tree[Ferragina, Grossi 95] 7 Suffix tree ・文字列の全てのsuffix(接尾辞)を表す compacted trie ・メモリ上では線形時間で構成可能 ×サイズが大きい a b ×unbalanced b a $ b$ $ a b $ abab$ 8 Suffix array ・文字列の全てのsuffixのポインタを辞書順にソートした配列・省スペース(5N) ×更新が遅い 3 ab$ 1 abab$ 4 b$ 2 bab$ 9 String B-tree ・suffixのポインタをB-treeで表したもの・検索時のdiskアクセスが少ない(blind tree) ・最悪時の性能が良い・挿入、削除が容易・サイズ: 13N ×１から作るのは遅い abab$ 10 I/O complexity • 検索のI/O complexity • 更新のI/O complexity • 構成のI/O complexity 11 検索のI/O complexity ・Suffix tree – Nに依存しない ○String B-tree ・Suffix array O p logB   occ  p  occ  O  logB N   B   occ p  O  log2 N   B B  p : キーワード長 occ : 答えの数 N : 文字列長 12 更新のI/O complexity • Suffix tree – Nに依存しない • String B-tree • Suffix array – 追加する量が多ければ String B-treeと差はない O p logB  O p logB ( N  p) O p log2 N  N  p : キーワード長 N : 文字列長 B : ディスクページサイズ 13 構成のI/O complexity N N   ◎suffix tree O logM / B  (optimal) M B N N   ○suffix array O log N log  2 M /B M B ・String B-tree O( N logB N ) N : 文字列長 M : メモリサイズ B : ディスクページサイズ 14 構成アルゴリズム • Suffix treeの構成 – メモリ上 – ディスク上 • Suffix arrayの構成 – メモリ上 – ディスク上 15 Suffix treeの構成 • メモリ上 (線形時間) * * * * Weiner 73 McCreight 76 Ukkonen 95 Farach 97 • divide and conquer, batch処理 • ディスク上 * Farach, Ferragina, Muthukrishnan 98 16 Disk上でのsuffix tree構成 • アルゴリズムをsortingとscanで表現 • 数のsortingと同じI/O complexity (optimal) N N O( log M / B ) I/O B M 17 Sorting I/O complexity • 次の問題はsortingと同じI/O complexityを持つ – – – – – – – treeのノードのlcaをK個 (K個のrange minima) tree T のEuler Tour ET(T)とノードの深さ文字列中の任意の位置のK文字 treeの各ノードの子孫でmarkされているもの uncompacted trieのmerge suffix treeの全てのsuffix link suffix treeの構成 18 Block vs. Random I/O N N •2-way merge O ( log 2 ) block I/O B M N N •M/B-way merge O( log M / B ) random I/O B M N N 補題: random I/Oがo ( log M / B ) 回のsorting B M N N log 2 ) 回のI/Oを必要とする。アルゴリズムは  ( B M 19 Disk上のsuffix treeの問題点 treeの枝が文字列へのポインタで表されている treeをたどる際にrandom accessが生じる古典的なアルゴリズムは適さない (divide and conquerを用いる) 20 Algorithm outline $ a b • Odd treeを作る a $ • Even treeを作る b $ • mergeする Odd tree b $ a b $ Even tree $a b b a $ a $ b $ b $ 21 Building the odd tree • 連続する2文字を1つの文字とみなし長さN/2の文字列を作る • 新しい文字列のsuffix treeを再帰的に作る • 文字を元に戻す $ A abab$ AA$ $ A $ $ a b $ a b $ 22 Building the even tree • 偶数番目のsuffixを辞書順にradix sortする – (先頭の文字, 奇数番目のsuffixの辞書順) • 隣り合うsuffix間のlcpを求める • compacted trieを作る abab$ 2 4 $ a b 1 $ a b $ 2 3 b 2: (b,ab$) = (b,2) $ 4: (b,$) = (b,1) a b $ Even tree 23 Merging the odd and even trees • • • • • anchor pairを見つける side tree pairに分割する pull nodeを見つける merge nodeを見つける TeとToをmergeする 24 Suffix arrayのメモリ上での構成 • quick sort ×文字列の比較なので非常に遅い • ternary partitioning[Bentley, Sedgewick 97] ○無駄な文字列比較が少ない ×極端に遅くなることがある • doubling algorithm – Manber, Myers 93 – Sadakane, Imai 98 • 多くの場合最速 25 Doubling algorithm • Karp, Miller, Rosenberg 72 • ディスク上の文字列ソート[Arge et al. 97] • 長さ 1, 2, 4, … の部分文字列を数値に変換 – log n 回の比較で文字列を区別できる 26 Suffix sorting by doubling (1/5) • 各suffixを先頭の1文字でグループに分ける – グループに番号をつける • 各グループの中をsuffixの2文字目で分ける – 番号を更新 (番号が異なる先頭の2文字が異なる) • 各グループの中をsuffixの3,4文字目で分ける – グループの番号でソート • 全てのsuffixの順序が決まるまで繰り返す – 順序の決まっているグループはskipする 27 Suffix sorting by doubling (2/5) V[I[i]+1] V[I[i]] I[i] 0 1 10 11 1 6 11 6 6 6 10 11 11 11 3 12 6 1 4 7 ｅｅｎｏｏｏｏ＄ｏｂｒｔｒｔｅｎｔｎｔｏｏｏ 10 ｏｂｅ＄ 3 3 6 0 1 1 3 3 13 2 11 ＄ｂｂｅｅｏ＄ｒ 5 6 5 ｒｎｏｔ 0 ｔｏｂｅ 8 ｔｔｏｂｔｏｂｅｏｒｎｏｔｔｏｂｅ$ 先頭の２文字でソート 28 6 9 ｔｏｂｅ Suffix sorting by doubling (3/5) V[I[i]+2] 8 0 1 1 V[I[i]] 0 I[i] 13 2 3 4 3 1 1 4 5 6 6 8 9 10 11 11 13 11 12 3 6 1 10 4 7 5 0 9 8 ｏｒｎｏｏｔｔｏｒｎｏｔｔｏｂｅｔｏｂｅｔｔｏｂ＄ｂｂｅｅｎｏｏｂｅｅ＄ｏｔｅｒｏ＄ｎｔｏｒｏｂｅ＄ｔｏｂｅｏｒｎｏｔｔｏｂｅ$ 先頭の４文字でソート 29 Suffix sorting by doubling (4/5) V[I[i]+4] 8 V[I[i]] 0 I[i] 13 11 2 1 2 3 0 4 5 6 7 8 9 10 11 11 13 12 3 6 10 1 4 7 5 0 9 8 ｎｏｔｔｏｂｅ＄ｏｒｎｏｏｔｔｏｒｎｏｔｔｏｂｅｏｒｔｏｂｅ＄ｔｔｏｂ＄ｂｂｅｅｅｅ＄ｏｒ＄ｏｎｒｏｂｅｏｔｏｂｅｏｒｎｏｔｔｏｂｅ$ 先頭の８文字でソート 30 Suffix sorting by doubling (5/5) V[I[i]] 0 I[i] 13 11 2 1 2 3 4 5 6 7 8 9 10 11 12 13 12 3 6 10 1 4 7 5 9 0 8 ｎｏｔｔｏｂｅ＄ｏｒｎｏｏｔｔｏｒｎｏｔｔｏｂｅｏｒｔｏｂｅ＄ｔｔｏｂ＄ｂｂｅｅｅｅ＄ｏｒ＄ｏｎｒｏｂｅｏｔｏｂｅｏｒｎｏｔｔｏｂｅ$ ソート終了 31 Suffix arrayのディスク上での構成 • Gonnet, Baeza-Yates, Snider 92 – diskはsequential accessのみ 2 N 2 MB I/O • Crauser, Ferragina 98 – doubling algorithm + discarding N N log 2 N (log M / B ) B M I/O 32 Doubling algorithm + discarding • doubling algorithmをディスク上で行う • log2 N 回の反復 • M/B-way マージソートを用いるメモリ内と異なる点 • すでにソートされている部分はスキップ 33 Word indexes vs. Full-text indexes 網羅性 word indexes • 単語の先頭のみ – データ量は約1/7 (日本語, 英語とも) • 検索もれの可能性 – 形態素解析が必要 – DNA配列には使えない full-text indexes • 全ての部分文字列 • 長いものを見つけるのが得意 • 検索結果にごみが入る – – – – 京都のつもりが東京都ルパンのつもりがダブルパンチはらだのつもりがはらだたしい AND検索で回避? 35 Full-text indexの利点・欠点・検索結果は文字列へのポインタ △ポインタから文書番号への変換が必要 ○超高速grepとして利用できる ×サイズが大きい・Full-text indexからword indexは構成可能 – テキストを走査する – 必要の無いindexに印をつける – indexを走査し、印のついているものを削除 36 課題 • 検索結果のごみをなくす – AND検索? • シソーラスの利用 – OR検索 • 構造化された文書からの検索 – 見出しのみから検索など • データの収集速度 – 元の文書を圧縮して送る – word indexだけ送る 37 圧縮と検索の統合 • Block sorting圧縮法[Burrows, Wheeler 94] – suffix arrayに従い文字列を並べ替えてから圧縮伸張時に文字列とsuffix arrayが復元されるテキストを転送する際はBlock sortingで圧縮しておけば良い [Sadakane, Imai 98a] 38 謝辞貴重なコメントをくださったNTTの原田昌紀氏、中村隆幸氏に感謝いたします。 39 参考文献(1/3) [1] L.Arge, P.Ferragina, R.Grossi, and J.S. Vitter. On sorting strings in external memory. In ACM Symposium on Theory of Computing, pp. 540--548, 1997. URL [2] J.L. Bentley and R.Sedgewick. Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360--369, 1997. URL [3] M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithms. Technical Report 124, Digital SRC Research Report, 1994. URL [4] A.Crauser and P.Ferragina. External memory construction of full-text indexes. In DIMACS Workshop on External Memory Algorithms and/or Visualization, 1998. URL [5] M.Farach. Optimal Suffix Tree Construction with Large Alphabets. In 38th Symp. on Foundations of Computer Science, pp. 137--143, 1997. URL 40 参考文献(2/3) [6] P.Ferragina and R.Grossi. The String B-Tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 1998. URL [7] G.H. Gonnet, R.Baeza-Yates, and T.Snider. New Indices for Text: PAT trees and PAT arrays. In W.Frakes and R.Baeza-Yates, editors, Information Retrieval: Algorithms and Data Structures, chapter5, pp. 66--82. Prentice-Hall, 1992. URL [8] R.M. Karp, R.E. Miller, and A.L. Rosenberg. Rapid identification of repeated patterns in strings, arrays and trees. In 4th ACM Symposium on Theory of Computing, pp. 125-136, 1972. [9] U.Manber and G.Myers. Suffix arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, Vol.22, No.5, pp. 935--948, October 1993. 41 参考文献(3/3) [10] E.M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, Vol.23, No.12, pp. 262--272, 1976. [11] K.Sadakane and H.Imai. A Cooperative Distributed Text Database Management Method Unifying Search and Compression Based on the Burrows-Wheeler Transformation. In URL Proceedings of NewDB’98, 1998. [12] K.Sadakane and H.Imai. Constructing Suffix Arrays of Large Texts. In Proceedings of DEWS'98, 1998. URL [13] E.Ukkonen. On-line construction of suffix trees. Algorithmica, Vol.14, No.3, pp. 249-260, September 1995. [14] P.Weiner. Linear Pattern Matching Algorihms. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pp. 1--11, 1973. 42