Inferring Strings from Suffix Trees and Links on a Binary Alphabet Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan Outline Reverse Problems on String Data Structures Suffix Tree, Suffix Links Reverse Problem on Suffix Trees Efficient Solution Inferring a Labeling Function Suffix Tour Graph On a Binary Alphabet Reverse Problems on String Data Structures border arrays, suffix arrays, DAWG, etc. Direct Problem string data structure Given a string, compute its data structure. Reverse Problem data structure string Given a data structure, compute its string. Solving reverse problems could lead to deeper understanding of strings and data structures. Hot Topic Reverse Problems on String Data Structures Border Array • [Franek et al., 2002] • [Duval et al., 2005] Suffix Array • [Duval and Lefebvre, 2002] • [Bannai et al., 2003] • [Schürmann et al., 2005] DAWG • [Bannai et al., 2003] Parameterized Border Array • [I et al., 2009] • [I et al., 2010] KMP Failure Function • [Gawrychowski et al., 2010] Runs • [Matsubara et al., 2010] Palindromic Structure • [I et al., 2010] Prefix Table • [Clement et al., 2009] Cover Array • [Crochemore et al., 2010] LPF Table • [He et al., 2011] We consider the reverse problem on suffix trees. Suffix Tree, Suffix Links The suffix tree of w is the compacted trie which represents the suffixes of w. The suffix link of a node points to the node that represents the substring obtained by deleting the first character. 12345678 ababaaa$ ababaaa$ ababaaa$ ababaaa$ ababaaa$ ababaaa$ ababaaa$ ababaaa$ $ 8 7 $ $ 6 Index of suffixes. a b a b a a $ 5 a a $ 3 a Suffix link a a $ b aa a $ b aa 4 a $ 1 2 Direct Problem on Suffix Trees Input : A string w. Output : The suffix tree of w. $ 8 w ababaaa$ 7 $ $ 6 a b a b a a $ 5 a a $ 3 a a $ b aa 4 a $ 1 It can be solved in linear time [e.g. Ukkonen, 1995]. a b aa a $ 2 Reverse Problem on Suffix Trees Input : An unlabeled ordered rooted tree T. Output : A string which realizes T (if such exists). A string w is said to realize T if the suffix tree of w is isomorphic to T. Reverse Problem on Suffix Trees Input : An unlabeled ordered rooted tree T and links f. Output : A string which realizes T and f (if such exists). A string w is said to realize (T, f ) if the suffix tree of w and its suffix links are isomorphic to T and f. link function f Reverse Problem on Suffix Trees Input : An unlabeled ordered rooted tree T and links f. Output : A string which realizes T and f (if such exists). A string w is said to realize (T, f ) if the suffix tree of w and its suffix links are isomorphic to T and f. $ a b link function f 12345678 ababaaa$ 8 7 6 5 4 3 1 2 Reverse Problem on Suffix Trees Input : An unlabeled ordered rooted tree T and links f for inner nodes. Output : A string which realizes T and f (if such exists). A string w is said to realize (T, f ) if the suffix tree of w and its suffix links for inner nodes are isomorphic to T and f. link function f ababaaa$ aaababa$ aababaa$ abaaaba$ How can we solve this problem? Input : An unlabeled ordered rooted tree T and links f for inner nodes. Output : A string which realizes T and f (if such exists). How can we solve this problem? Input : An unlabeled ordered rooted tree T and links f for inner nodes. Output : A string which realizes T and f (if such exists). If we can infer a “correct” order of leaves, we can get a string. $ a b 12345678 ababaaa$ 8 7 6 5 4 3 1 2 How can we solve this problem? Input : An unlabeled ordered rooted tree T and links f for inner nodes. Output : A string which realizes T and f (if such exists). If we can infer a “correct” order of leaves, we can get a string. $ a b 12345678 ababaaa$ 8 7 A naïve solution of considering all permutations takes O(n!) time. 6 5 4 2 We need to take into account some “constraints” on leaves’ order, 3 are 1 implicitly given by input (T, f ). which We introduce suffix tour graphs to capture the constraints. Outline Reverse Problems on String Data Structures Suffix Tree, Suffix Links Reverse Problem on Suffix Trees Efficient Solution Inferring a Labeling Function Suffix Tour Graph On a Binary Alphabet Notations Input (T, f ) V : the set of nodes of T E : the set of edges of T : the root node of T Vin : the set of inner nodes of T Vleaf : the set of leaf nodes of T ordered rooted tree T f : Vin{}Vin v V, V(v), Vin(v) and Vleaf(v) respectively represent the set of nodes, inner nodes and leaf nodes of the subtree rooted at v. children(v) : the set of children of v. chi(v) : the i-th child of v. par(v) : the parent of v. Preconditions of an Input 1. The first child of the root node is a leaf. ch1() Vleaf 2. There exists a path of function f from any node v0 Vin{} to the root node . v0 Vin{}, v1, v2, …, vk s.t. vk and vi f (vi1) for any 1 i k satisfied not satisfied In what follows… Infer the first character of the string of each edge, namely, a labeling function g : E ∪{$}. $ $ $ a b a b a a $ a a $ b aa a $ a a a $ b aa a $ Conditions for g to hold Infer the first character of the string of each edge, namely, a labeling function g : E ∪{$}. 1. The edge from to its first child is labeled with $. g((, ch1())) $. $ Conditions for g to hold Infer the first character of the string of each edge, namely, a labeling function g : E ∪{$}. 1. The edge from to its first child is labeled with $. g((, ch1())) $. $ 2. The labels for the children are sorted in lexicographical order. v V, 1 i |children(v)|, g((v, chi(v))) g((v, chi1(v))). v a c d e Conditions for g to hold Infer the first character of the string of each edge, namely, a labeling function g : E ∪{$}. 3. Condition on links of parent-child nodes. v V{}, vp par(v), there exists u children( f (vp)) s.t. g((vp, v)) g(( f (vp), u)). In addition, if v Vin then f (v) V(u). f (vp) c vp u c f (v) v Labels for Inner Edges By Condition 3, the labels for inner edges (edges from inner nodes to inner nodes) can be uniquely determined. If the determined labels contradict Condition 2, the input turns out to be invalid. $ a a b invalid b $ a a a b Lg and Dg When a labeling function g holds Conditions 1~3, we define the following values for any node v. Lg(v) |{uVleaf | f (par(u))par(v), g((par(u), u))g((par(v), v))}| Dg(v) = yV(v) Lg(y) Lg(v) means # of leaves in the following situation. $ 1 $ par(v) c u $ c 0 v a 1 b 0 1 par(u) 2 0 a a b 0 a 1 a 0 1 b 0 1 b Lg and Dg When a labeling 1~3, Constraints in leaves’function order : g holds Conditions Dg(v) leaves in Vleaf(v) The leaf of in Vleaf(v).values have constraints wenext define theu is following for any node v.on such u’s. Lg(v) |{uVleaf | f (par(u))par(v), g((par(u), u))g((par(v), v))}| Dg(v) = yV(v) Lg(y) Lg(v) means # of leaves in the following situation. c u $ $ c par(u) a b a b 0 2 2 2 a 0 0 0 0 v 1 8 0 4 1 1 par(v) $ 1 1 1 1 a b 0 0 0 0 a b 1 1 1 1 Conditions for g to hold 4. # of leaves of subtree rooted at v must be at least Dg(v). |Vleaf(v)| Dg(v) 0 Lg(v) means # of leaves in the following situation. $ 1 1 par(v) c par(u) c u a 0 0 1 $ 1 1 $ 1 1 8 b 0 4 a b 0 0 2 2 2 a 0 0 0 0 v 1 a 1 1 b 1 1 1 1 1 a b 0 0 0 0 1 0 1 0 0 Suffix Tour Graph When a labeling function g holds Conditions 1~4, we define the suffix tour graph STGg (VG, EG) w.r.t. g. VG V EG {(u, v) | uVleaf, f (par(u))par(v), g((par(u), u))g((par(v), v))} ∪{(u, v)k | (u, v)E, k |Vleaf(v)| Dg(v)} $ 0 0 1 a STGg b 1 $ $ a b 0 a a 1 a 1 1 b 1 0 0 b 0 Lg(v) means # of leaves in the following situation. Suffix Tour Graph par(v) When a labeling function g holds Conditions 1~4, c par(u) we define the suffix tour graph STGg (VG, EG) w.r.t. g. v c VG V u EG {(u, v) | uVleaf, f (par(u))par(v), g((par(u), u))g((par(v), v))} ∪{(u, v)k | (u, v)E, k |Vleaf(v)| Dg(v)} $ 0 0 1 a STGg b 1 $ $ a b 0 a a 1 a 1 1 b 1 0 0 b 0 Lg(v) means # of leaves in the following situation. Suffix Tour Graph par(v) When a labeling function g holds Conditions 1~4, c par(u) we define the suffix tour graph STGg (VG, EG) w.r.t. g. v c VG V u EG {(u, v) | uVleaf, f (par(u))par(v), g((par(u), u))g((par(v), v))} ∪{(u, v)k | (u, v)E, k |Vleaf(v)| Dg(v)} $ 0 0 1 a STGg b 1 $ $ a b 0 a a 1 a 1 1 b 1 0 0 b 0 STGg is an Eulerian graph. (possibly disjoint) Necessary and Sufficient Condition for (T, f ) and g to be valid STGg has an Eulerian cycle that contains and all leaves. When there exists such a cycle, a correct order of leaves that realizes (T, f ) and g can be obtained by the order of visiting leaves on the cycle. $ 0 0 1 a STGg b a b 0 a a 1 a 1 1 8 1 $ $ b 1 0 0 b 0 7 6 5 STGg is an Eulerian graph. (possibly disjoint) 4 3 1 2 Necessary and Sufficient Condition for (T, f ) and g to be valid STGg has an Eulerian cycle that contains and all leaves. When there exists such a cycle, a correct order of leaves that realizes (T, f ) and g can be obtained by the order of visiting leaves on the cycle. Example for an invalid labeling function g. $ 0 1 1 a STGg b 1 $ a a b 0 a b 1 a 1 0 b 1 0 0 b 0 STGg is an Eulerian graph. (possibly disjoint) Computing an Eulerian Cycle An Eulerian cycle can be computed in linear time in the graph size. We also showed that the size of STGg is linear in the input size. Given g, we can check if g is valid or not by constructing STGg ⇒ computing an Eulerian cycle in linear time in the input size. What remains is to find a valid labeling function g. On a Binary Alphabet In the case of binary alphabets, due to Conditions 1~4 it suffices to consider at most five labeling functions. $ 0 0 1 a STGg b 1 $ $ a b 0 a a 1 a 1 1 b 1 0 0 b 0 On a Binary Alphabet In the case of binary alphabets, due to Conditions 1~4 it suffices to consider at most five labeling functions. $ 0 0 1 a STGg b 1 $ $ a b 1 a b 1 a 1 0 b 1 0 0 b 0 On a Binary Alphabet In the case of binary alphabets, due to Conditions 1~4 it suffices to consider at most five labeling functions. $ 0 0 1 a STGg b 1 $ a a b 0 $ b 1 $ 1 1 a 1 0 0 a 0 On a Binary Alphabet In the case of binary alphabets, due to Conditions 1~4 it suffices to consider at most five labeling functions. $ 0 0 1 a STGg b 1 $ a a b 1 $ b 1 $ 1 0 b 1 0 0 b 0 On a Binary Alphabet In the case of binary alphabets, due to Conditions 1~4 it suffices to consider at most five labeling functions. $ 0 1 1 a STGg b 1 $ a a b 0 a b 1 a 1 0 b 1 0 0 b 0 On a Binary Alphabet In the case of binary alphabets, due to Conditions 1~4 it suffices to consider at most five labeling functions. On a binary alphabet, the reverse problem on Theorem suffix trees can be solved in linear time. $ 0 1 1 a STGg b 1 $ a a b 0 a b 1 a 1 0 b 1 0 0 b 0 Summary We introduced suffix tour graphs which lead to the efficient solution of the reverse problem on suffix trees. (Note that it can be applied to non-binary cases.) On a binary alphabet, we showed that the problem can be solved in linear time in the input size. Open Problems What about non-binary cases? ⇒It seems to be difficult ⇒since # of labeling functions g increase combinatorially. What about the problem in which suffix links are not given? ⇒I do not have any idea. Exercise? Compute a string which realizes this tree and links. Hints These labels are determined uniquely. $ a $ a b b $ a a b a a b $ a b b $ a b b Exercise? Compute a string which realizes this tree and links. babaabaaababaa$ babaababaaabaa$ babaaababaabaa$
© Copyright 2024 ExpyDoc