Document

Inferring Strings from Suffix Trees and Links
on a Binary Alphabet
Tomohiro I, Shunsuke Inenaga,
Hideo Bannai, Masayuki Takeda
Kyushu University, Japan
Outline
Reverse Problems on String Data Structures
Suffix Tree, Suffix Links
Reverse Problem on Suffix Trees
Efficient Solution
Inferring a Labeling Function
Suffix Tour Graph
On a Binary Alphabet
Reverse Problems on String Data Structures
border arrays, suffix arrays, DAWG,
etc.
Direct Problem
string
data structure
Given a string, compute its data structure.
Reverse Problem
data structure
string
Given a data structure, compute its string.
Solving reverse problems could lead to deeper understanding
of strings and data structures.
Hot Topic
Reverse Problems on String Data Structures
Border Array
• [Franek et al., 2002]
• [Duval et al., 2005]
Suffix Array
• [Duval and Lefebvre, 2002]
• [Bannai et al., 2003]
• [Schürmann et al., 2005]
DAWG
• [Bannai et al., 2003]
Parameterized Border Array
• [I et al., 2009]
• [I et al., 2010]
KMP Failure Function
• [Gawrychowski et al., 2010]
Runs
• [Matsubara et al., 2010]
Palindromic Structure
• [I et al., 2010]
Prefix Table
• [Clement et al., 2009]
Cover Array
• [Crochemore et al., 2010]
LPF Table
• [He et al., 2011]
We consider the reverse problem on suffix trees.
Suffix Tree, Suffix Links
The suffix tree of w is the compacted trie which represents
the suffixes of w.
The suffix link of a node points to the node that represents
the substring obtained by deleting the first character.
12345678
ababaaa$
ababaaa$
ababaaa$
ababaaa$
ababaaa$
ababaaa$
ababaaa$
ababaaa$
$
8
7
$
$
6
Index of suffixes.
a
b
a b
a
a
$
5
a
a
$
3
a
Suffix link
a
a
$
b
aa
a
$
b
aa 4
a
$
1
2
Direct Problem on Suffix Trees
Input : A string w.
Output : The suffix tree of w.
$
8
w  ababaaa$
7
$
$
6
a
b
a b
a
a
$
5
a
a
$
3
a
a
$
b
aa 4
a
$
1
It can be solved in linear time [e.g. Ukkonen, 1995].
a
b
aa
a
$
2
Reverse Problem on Suffix Trees
Input : An unlabeled ordered rooted tree T.
Output : A string which realizes T (if such exists).
A string w is said to realize T
if the suffix tree of w is isomorphic to T.
Reverse Problem on Suffix Trees
Input : An unlabeled ordered rooted tree T and links f.
Output : A string which realizes T and f (if such exists).
A string w is said to realize (T, f ) if the suffix tree of w
and its suffix links are isomorphic to T and f.
link function f
Reverse Problem on Suffix Trees
Input : An unlabeled ordered rooted tree T and links f.
Output : A string which realizes T and f (if such exists).
A string w is said to realize (T, f ) if the suffix tree of w
and its suffix links are isomorphic to T and f.
$
a
b
link function f
12345678
ababaaa$
8
7
6
5
4
3
1
2
Reverse Problem on Suffix Trees
Input : An unlabeled ordered rooted tree T and links f for inner nodes.
Output : A string which realizes T and f (if such exists).
A string w is said to realize (T, f ) if the suffix tree of w
and its suffix links for inner nodes are isomorphic to T and f.
link function f
ababaaa$
aaababa$
aababaa$
abaaaba$
How can we solve this problem?
Input : An unlabeled ordered rooted tree T and links f for inner nodes.
Output : A string which realizes T and f (if such exists).
How can we solve this problem?
Input : An unlabeled ordered rooted tree T and links f for inner nodes.
Output : A string which realizes T and f (if such exists).
If we can infer a “correct” order of leaves, we can get a string.
$
a
b
12345678
ababaaa$
8
7
6
5
4
3
1
2
How can we solve this problem?
Input : An unlabeled ordered rooted tree T and links f for inner nodes.
Output : A string which realizes T and f (if such exists).
If we can infer a “correct” order of leaves, we can get a string.
$
a
b
12345678
ababaaa$
8
7
A naïve solution of considering all permutations takes O(n!) time. 
6
5
4
2
We need to take into account some “constraints” on leaves’ order,
3 are
1 implicitly given by input (T, f ).
which
 We introduce suffix tour graphs to capture the constraints.
Outline
Reverse Problems on String Data Structures
Suffix Tree, Suffix Links
Reverse Problem on Suffix Trees
Efficient Solution
Inferring a Labeling Function
Suffix Tour Graph
On a Binary Alphabet
Notations
Input (T, f )
V : the set of nodes of T
E : the set of edges of T
 : the root node of T
Vin : the set of inner nodes of T
Vleaf : the set of leaf nodes of T
ordered rooted tree T

f : Vin{}Vin
 v  V,
V(v), Vin(v) and Vleaf(v) respectively represent
the set of nodes, inner nodes and leaf nodes of the subtree rooted at v.
children(v) : the set of children of v.
chi(v) : the i-th child of v.
par(v) : the parent of v.
Preconditions of an Input
1. The first child of the root node  is a leaf.
ch1()  Vleaf
2. There exists a path of function f
from any node v0  Vin{} to the root node .
 v0  Vin{},  v1, v2, …, vk
s.t. vk   and vi  f (vi1) for any 1  i  k
satisfied

not satisfied

In what follows…
Infer the first character of the string of each edge,
namely, a labeling function g : E  ∪{$}.
$
$
$
a

b
a b
a
a
$
a
a
$
b
aa
a
$
a
a
a
$
b
aa
a
$
Conditions for g to hold
Infer the first character of the string of each edge,
namely, a labeling function g : E  ∪{$}.
1. The edge from  to its first child is labeled with $.
g((, ch1()))  $.
$

Conditions for g to hold
Infer the first character of the string of each edge,
namely, a labeling function g : E  ∪{$}.
1. The edge from  to its first child is labeled with $.
g((, ch1()))  $.
$
2. The labels for the children
are sorted in lexicographical order.
 v  V, 1  i  |children(v)|,
g((v, chi(v)))  g((v, chi1(v))).

v
a c d
e
Conditions for g to hold
Infer the first character of the string of each edge,
namely, a labeling function g : E  ∪{$}.
3. Condition on links of parent-child nodes.
 v  V{}, vp  par(v),
there exists u  children( f (vp)) s.t. g((vp, v))  g(( f (vp), u)).
In addition, if v  Vin then f (v)  V(u).
f (vp)
c
vp
u
c
f (v)
v
Labels for Inner Edges
By Condition 3, the labels for inner edges (edges from
inner nodes to inner nodes) can be uniquely determined.
If the determined labels contradict Condition 2,
the input turns out to be invalid.
$
a
a b

invalid
b
$
a
a

a
b
Lg and Dg
When a labeling function g holds Conditions 1~3,
we define the following values for any node v.
Lg(v)  |{uVleaf | f (par(u))par(v), g((par(u), u))g((par(v), v))}|
Dg(v) = yV(v) Lg(y)
Lg(v) means # of leaves
in the following situation.
$
1
$
par(v)
c
u
$
c
0
v
a
1
b
0
1
par(u)

2
0
a
a b
0
a
1
a
0
1
b
0
1
b
Lg and Dg
When a labeling
1~3,
Constraints
in leaves’function
order : g holds Conditions
Dg(v) leaves in
Vleaf(v)
The
leaf of
in Vleaf(v).values have
constraints
wenext
define
theu is
following
for any
node v.on such u’s.
Lg(v)  |{uVleaf | f (par(u))par(v), g((par(u), u))g((par(v), v))}|
Dg(v) = yV(v) Lg(y)
Lg(v) means # of leaves
in the following situation.
c
u
$
$
c
par(u)
a
b
a b
0 2
2 2
a
0 0 0 0
v
1 8
0 4
1 1
par(v)
$
1 1

1 1
a
b
0 0 0 0
a
b
1 1 1 1
Conditions for g to hold
4. # of leaves of subtree rooted at v must be at least Dg(v).
|Vleaf(v)|  Dg(v)  0
Lg(v) means # of leaves
in the following situation.
$
1 1
par(v)
c
par(u)
c
u
a
0
0
1
$
1 1
$
1
1 8
b
0 4
a b
0
0 2
2 2
a
0 0 0 0
v

1
a
1 1
b
1 1 1 1
1
a
b
0 0 0 0
1
0
1
0
0
Suffix Tour Graph
When a labeling function g holds Conditions 1~4,
we define the suffix tour graph STGg  (VG, EG) w.r.t. g.
VG  V
EG  {(u, v) | uVleaf, f (par(u))par(v), g((par(u), u))g((par(v), v))}
∪{(u, v)k | (u, v)E, k  |Vleaf(v)|  Dg(v)}
$
0
0
1
a
STGg
b
1
$
$

a b
0
a
a
1
a
1
1
b
1
0
0
b
0

Lg(v) means # of leaves
in the following
situation.
Suffix
Tour Graph
par(v)
When a labeling function g holds Conditions 1~4,
c
par(u)
we define the suffix tour graph STGg  (VG, EG) w.r.t. g.
v
c
VG  V
u
EG  {(u, v) | uVleaf, f (par(u))par(v),
g((par(u), u))g((par(v), v))}
∪{(u, v)k | (u, v)E, k  |Vleaf(v)|  Dg(v)}
$
0
0
1
a
STGg
b
1
$
$

a b
0
a
a
1
a
1
1
b
1
0
0
b
0

Lg(v) means # of leaves
in the following
situation.
Suffix
Tour Graph
par(v)
When a labeling function g holds Conditions 1~4,
c
par(u)
we define the suffix tour graph STGg  (VG, EG) w.r.t. g.
v
c
VG  V
u
EG  {(u, v) | uVleaf, f (par(u))par(v),
g((par(u), u))g((par(v), v))}
∪{(u, v)k | (u, v)E, k  |Vleaf(v)|  Dg(v)}
$
0
0
1
a
STGg
b
1
$
$

a b
0
a
a
1
a
1
1
b
1
0
0
b
0
STGg is an Eulerian graph.
(possibly disjoint)

Necessary and Sufficient Condition
for (T, f ) and g to be valid
STGg has an Eulerian cycle that contains  and all leaves.
When there exists such a cycle,
a correct order of leaves that realizes (T, f ) and g can be obtained
by the order of visiting leaves on the cycle.
$
0
0
1
a
STGg
b
a b
0
a
a
1
a
1
1

8
1
$
$

b
1
0
0
b
0
7
6
5
STGg is an Eulerian graph.
(possibly disjoint)
4
3
1
2
Necessary and Sufficient Condition
for (T, f ) and g to be valid
STGg has an Eulerian cycle that contains  and all leaves.
When there exists such a cycle,
a correct order of leaves that realizes (T, f ) and g can be obtained
by the order of visiting leaves on the cycle.
Example for an invalid labeling function g.
$
0
1
1
a
STGg
b
1
$
a

a b
0
a
b
1
a
1
0
b
1
0
0
b
0
STGg is an Eulerian graph.
(possibly disjoint)

Computing an Eulerian Cycle
An Eulerian cycle can be computed in linear time in the
graph size.
We also showed that the size of STGg is linear in the
input size.
Given g, we can check if g is valid or not by
constructing STGg ⇒ computing an Eulerian cycle
in linear time in the input size.
What remains is to find a valid labeling function g.
On a Binary Alphabet
In the case of binary alphabets, due to Conditions 1~4
it suffices to consider at most five labeling functions.
$
0
0
1
a
STGg
b
1
$
$

a b
0
a
a
1
a
1
1
b
1
0
0
b
0

On a Binary Alphabet
In the case of binary alphabets, due to Conditions 1~4
it suffices to consider at most five labeling functions.
$
0
0
1
a
STGg
b
1
$
$

a b
1
a
b
1
a
1
0
b
1
0
0
b
0

On a Binary Alphabet
In the case of binary alphabets, due to Conditions 1~4
it suffices to consider at most five labeling functions.
$
0
0
1
a
STGg
b
1
$
a

a b
0
$
b
1
$
1
1
a
1
0
0
a
0

On a Binary Alphabet
In the case of binary alphabets, due to Conditions 1~4
it suffices to consider at most five labeling functions.
$
0
0
1
a
STGg
b
1
$
a

a b
1
$
b
1
$
1
0
b
1
0
0
b
0

On a Binary Alphabet
In the case of binary alphabets, due to Conditions 1~4
it suffices to consider at most five labeling functions.
$
0
1
1
a
STGg
b
1
$
a

a b
0
a
b
1
a
1
0
b
1
0
0
b
0

On a Binary Alphabet
In the case of binary alphabets, due to Conditions 1~4
it suffices to consider at most five labeling functions.
On a binary alphabet, the reverse problem on
Theorem
suffix trees can be solved in linear time.
$
0
1
1
a
STGg
b
1
$
a

a b
0
a
b
1
a
1
0
b
1
0
0
b
0

Summary
We introduced suffix tour graphs which lead to
the efficient solution of the reverse problem on suffix trees.
(Note that it can be applied to non-binary cases.)
On a binary alphabet, we showed that the problem can be
solved in linear time in the input size.
Open Problems
What about non-binary cases?
⇒It seems to be difficult
⇒since # of labeling functions g increase combinatorially.
What about the problem in which suffix links are not given?
⇒I do not have any idea.
Exercise?
Compute a string which realizes this tree and links.
Hints
These labels are determined uniquely.
$
a
$
a
b
b
$
a
a
b
a
a
b
$
a
b
b
$ a
b
b
Exercise?
Compute a string which realizes this tree and links.
babaabaaababaa$
babaababaaabaa$
babaaababaabaa$