Document

Kouji Shimohira, Shunsuke Inenaga,
Hideo Bannai, Masayuki Takeda
Kyushu University, Japan
Outline
 Non-linear Text
 Computing Longest Common Subsequence
of Acyclic Non-linear Texts
 Computing Longest Common Subsequence
of Cyclic Non-linear Texts
 Conclusions and Future works
Outline
 Non-linear Text
 Computing Longest Common Subsequence
of Acyclic Non-linear Texts
 Computing Longest Common Subsequence
of Cyclic Non-linear Texts
 Conclusions and Future works
Non-linear Text
Non-linear Text G = ( V, E , L )
A directed graph with vertices labeled by characters.
V : the set of vertices.
E : the set of arcs.
L : V → Σ : a labeling function.
e.g.
V = { v1 , v2 , v3 , v4 , v5 , v6 , v7 , v8 , v9 }
E = {(v1,v2),(v2,v3),(v2,v7),(v3,v4),(v4,v5),
(v5,v6),(v6,v7),(v6,v8),(v7,v8),(v8,v9)}
L = { v1 → a, v2 → b, v3 → c, v4 → a, v5 → a,
v6 → b, v7 → b, v8 → a, v9 → c }
G
a
c
1
b
3
a
a
4
5
b
6
a
2
b
7
8
c
9
Non-linear Text
L(v) : the character label of vertex v
P(v) : the set of paths that end at vartex v.
L(P(v)) : the set of strings spelled by paths in P(v)
P(G) : the set of paths in G. ( P(G) = {P(v) | v∈V } )
L(P(G)) : the set of strings spelled by paths in P(G) (= L(G))
substr(L(G)) : the set of substrings of strings in L(G)
subseq(L(G)) : the set of subseqences of strings in L(G)
e.g.
G’
a
c
1
b
3
a
a
4
5
b
L(v7) = b
6
a
2
b
7
P(v7) = {v1v2v3v4v5v6v7 , v1v2v7 }
8
c
9
L(P(v7)) = {abcaabb , abb }
bcaa ∈ substr(L(G))
aca ∈ subseq(L(G))
Algorithms on Non-linear Text
problem
Substring
Matching
Approximate
Matching
Longest Common
Substring
Longest Common
Subsequence
text
pattern
time complexity
acyclic graph
linear
O(n+m|E|)
tree
linear
O(n)
graph
linear
O(n+m|E|)
graph with edit
operations
linear
NP-complete [Amir et al, 1997]
graph
linear with edit
operations
text 1
text 2
acyclic graph
[Park & Kim, 1995]
[Akutsu, 1993]
[Amir et al, 1997]
O(m(n+e))
[Navarro, 2000]
acyclic graph
O(|E1||E2|)
(this work)
graph
acyclic graph
O(|E1||E2|)
(this work)
acyclic graph
acyclic graph
O(|Σ|2|E1||E2|)
acyclic graph
acyclic graph
O(|E1||E2|)
graph
graph
[Thang, 2011]
(this work)
O(|E1||E2|+|V1||V2|log|Σ|) (this work)
|E1| : the number of arcs in text 1, |E2| : the number of arcs in text 2, |Σ| : the alphabet size,
|V1| : the number of vertices in text 1, |V2| : the number of vertices in text 2.
Outline
 Non-linear Text
 Computing Longest Common Subsequence
of Acyclic Non-linear Texts
 Computing Longest Common Subsequence
of Cyclic Non-linear Texts
 Conclusions and Future works
Longest Common Subsequence
Problem for Acyclic Non-linear Texts
Problem 1
Input : Acyclic non-linear texts G1=(V1, E1, L1) and G2=(V2, E2, L2)
Output : Length of longest string in subseq(G1)∩subseq(G2)
e.g.
G1 a
1
b
2
c
3
c
4
d
5
b
6
G2 a 1
d
4
c2
d3
b5
a
6
subseq(G1) = { a ,b, c, d, ab, ac, ad,
bb, bc, bd, cb, cd, db, abb, abc,
abd, acb, acd, adb, bcb, bcd,
bdb, cdb, abcb, abcd, abdb,
acdb, bcdb, abcdb }
subseq(G2) = { a, b, c, d, ab, ac, ad,
ba, ca, cb, cd, da, db, aba, aca,
acb, acd, ada, adb, cba, cda,
cdb, dba, acba, acda, acdb, adba,
cdba, acdba }
Output = 4
Algorithm 1
Algorithm 1 : Computing the length of longest common
subsequence of acyclic non-linear texts
Sort vertices of G1 and G2 in
topological order
Let C be a |V1|×|V2| integer table
(Ci,j : the length of a longest string
in
subseq(P(v1,i))∩subseq(P(v2,j))
Compute Ci,j using dynamic programing
for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2|
Return max Ci,j
Topological Sort
Sort vertices of G1 and G2 in topological order
G1 a
c
G2 a
d
1
4
1
4
b
d
c
b
2
5
c
b
G1
3
a
6
1
b
2
c
3
c
4
d
5
b
6
v1,1 v1,2 v1,3 v1,4 v1,5 v1,6
sort
2
5
d
a
3
6
G2
a1 c
2
d
3
d
4
b
5
b
6
v2,1 v2,2 v2,3 v2,4 v2,5 v2,6
Dynamic Programing table
Let C be a |V1|×|V2| integer table
(Ci,j : the length of a longest string in subseq(P(v1,i))∩subseq(P(v2,j))
G1
a
G1
1
b
G2
a1 c
2
2
c
d
3
3
c
d
4
4
d
b
5
5
b
b
6
6
G2
C a1 c
a1
b
2
c
3
c
4
d
5
b
6
2
d
3
d
4
b
5
b
6
L1(v1,i) ≠ L2(v2,j)
Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2|
If L1(v1,i) ≠ L2(v2,j) then
Ci,j = max ({ Ck,j | (v1,k , v1,i)∈E1}∪{Ci,ℓ | (v2,ℓ , v2,j)∈E2}∪{0})
G1 a
c
G2 a
d
1
4
1
4
b
d
c
b
2
5
2
5
c
b
d
a
3
6
3
6
G1
G2
C a1 c2 d3 d4 b5 b6
S
1,1
1,2
1,1
1,2
S
a1 1 1 1 1 1 1
b
1,1
1,2
1
2
c3
2,1
c4
S
5
b
6
1
2,1
1
2
0
2
4,5
1
5,3
2
2
2,6
4,2
1,1
3
2
1
1
2,5
2,5
4,1
3,2
2
1
2
1
1,2
2,4
4,3
3,2
1
1
2
0
1,4
3,2
S
1,1
d
1,3
3
1
L1(v1,i) = L2(v2,j)
Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2|
If L1(v1,i) = L2(v2,j) then
Ci,j = 1 + max ({ Ck,ℓ | (v1,k , v1,i)∈E1 , (v2,ℓ , v2,j)∈E2 }∪{0})
G1 a
c
G2 a
d
1
4
1
4
b
d
c
b
2
5
2
5
c
b
d
a
3
6
G1
G2
C a1 c2 d3 d4 b5 b6
S
1,1
1,2
1,1
1,2
S
a1 1 1 1 1 1 1
b
6
d
b
1,2
1
2
c3
c
3
1,1
2,1
2,1
S
3,1
2
2
3
+1
2
0
3
1
2
1
5,5
3
5,3
2
2
4,5
5,3
5,4
2
2,6
4,2
1,1
5,3
2
1
1
2,5
2,5
4,1
3,2
3,2
1
2,4
4,3
3,2
1,2
1
2
1
1
5
1
2
0
1,4
3,2
S
1,1
6
1
1
4
1,3
4
3
Output
Return max Ci,j
max Ci,j = 4
G2
C a1 c
G1 a
c
G2 a
d
1
4
1
4
b
d
c
b
2
5
2
5
c
b
d
a
3
6
G1
a
b
c
c
3
6
d
b
S
1,1
S
S
1,1
3,2
3,1
6
1
1
5
4,3
3,2
2
3,2
1
5,3
2
1
2
1
5,5
3
5,3
2
2
4,5
5,3
5,4
3
2,6
4,2
1,1
2
2
0
3
2,5
2,5
4,1
1
2
1
1
S
1,2
2,4
b
5
1
1
2
b
1,2
1,4
3,2
4
1
1
2
0
4
1,1
1,3
2,1
d
3
1
1
1
3
1,2
1,2
2,1
d
1
1
2
2
1,1
1
1
Output : 4
3
6,5
4
4
6
Time Complexity
Sort vertices of G1 and G2 in
topological order
Linear time
Let C be a |V1|×|V2| integer table
(Ci,j : the length of a longest string
in
subseq(P(v1,i))∩subseq(P(v2,j))
Compute Ci,j using dynamic programing
for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2|
Return max Ci,j
O(|E1||E2|) time
Time Complexity
Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2|
case of L1(v1,i) ≠ L2(v2,j)
ei : the number of arcs incoming to v1,i
fj : the number of arcs incoming to v2,j
To compute the value of Ci,j ,
ei + fj elements in table C are used.
To compute Ci,j for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2|
G1
G2
C a1 c2 d3 d4 b5 b6
S
1,1
1,2
1,1
1,2
S
a1 1 1 1 1 1 1
b
2,1
c4
S
b
1,2
1
2
c3
d
= O(|E1||E2|)
1,1
2,1
1
2
1
2
1
2
1
5,5
3
5,3
2
2
4,5
5,3
5,4
3
2
0
3
2
2,6
4,2
1,1
5,3
2
1
1
2,5
2,5
4,1
3,2
3,2
1
2
1
1,2
2,4
4,3
3,2
3,1
1
2
0
1,4
3,2
S
1,1
6
1
1
5
1,3
3
6,5
4
4
Time Complexity
Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2|
case of L1(v1,i) = L2(v2,j)
ei : the number of arcs incoming to v1,i
fj : the number of arcs incoming to v2,j
To compute the value of Ci,j ,
ei fj elements in table C are used.
To compute Ci,j for all 1≦ i ≦ |V1| and 1≦ j ≦ |V1|
G1
G2
C a1 c2 d3 d4 b5 b6
S
1,1
1,2
1,1
1,2
S
a1 1 1 1 1 1 1
b
d
b
1,2
1
2
c3
c
= O(|E1||E2|)
1,1
2,1
2,1
S
3,1
2
2
1
2
1
5,5
3
5,3
2
2
4,5
5,3
5,4
3
2
0
3
2
2,6
4,2
1,1
5,3
2
1
1
2,5
2,5
4,1
3,2
3,2
1
2,4
4,3
3,2
1,2
1
2
1
1
5
1
2
0
1,4
3,2
S
1,1
6
1
1
4
1,3
3
6,5
4
4
Time Complexity
Sort vertices of G1 and G2 in
topological order
Linear time
Let C be a |V1|×|V2| interger array
(Ci,j : the length of a longest string
in
subseq(P(v1,i))∩subseq(P(v2,j))
Compute Ci,j using dymnamic programing
for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2|
O(|E1||E2|) time
Return max Ci,j
O(|V1||V2|) time
The total time complexity is O(|E1||E2|) time.
Outline
 Non-linear Text
 Computing Longest Common Subsequence
of Acyclic Non-linear Texts
 Computing Longest Common Subsequence
of Cyclic Non-linear Texts
 Conclusions and Future works
Longest Common Subsequence
Problem for Cyclic Non-linear Texts
Problem 2
Input : Non-linear texts G1=(V1, E1, L1) and G2=(V2, E2, L2)
Output : ∞
(if subseq(G1)∩subseq(G2) is infinite)
The Length of longest string in subseq(G1)∩subseq(G2)
(otherwise)
e.g. 1
G1 a
a
c
Character “d” is in a cycle in both G1 and G2.
b
d
c
G2 a
c
d
a
b
d
ccdccdccd······ ∈ L(G1)
bdbdbdbd······ ∈ L(G2)
dddddd········ ∈ subseq(G1)∩subseq(G2)
Output = ∞
Longest Common Subsequence
Problem for Cyclic Non-linear Texts
Problem 2
Input : Non-linear texts G1=(V1, E1, L1) and G2=(V2, E2, L2)
Output : ∞
(if subseq(G1)∩subseq(G2) is infinite)
The Length of longest string in subseq(G1)∩subseq(G2)
(otherwise)
e.g. 2
G1 a
a
c
b
d
c
G2 a
c
d
a
b
a
subseq(G1)∩subseq(G2) = {a, b, c, d,
aa, ab, ac, ad, cd, aab, aac, aad,
acd, aacd}
Output = 4
Algorithm 2
Algorithm 2 : Computing the length of longest common
subsequence of cyclic non-linear texts
Transform G1 and G2 into G’1 and G’2 based
on the strongly connected components
Check whether subseq(G1)∩subseq(G2) is
infinite or not
Sort vertices of G’1 and G’2 in
topological order
Let C be a |V’1|×|V’2| integer table
(Ci,j : the length of a longest string
in
subseq(P(v’1,i))∩subseq(P(v’2,j))
Compute Ci,j using dynamic programing for
all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2|
Return max Ci,j
Strongly Connected Component
Transform G1 and G2 into G’1 and G’2 based on the strongly connected components
G1 a
a
G’ {a}
c
1
{a}
3
{c,d}
2
1
b
d
{b}
c
4
transform
G2 a
a
c
d
b
a
G’ {a}
1
{c}
2
{d}
3
2
{a,b}
4
strongly connected component
cyclic non-linear texts
acyclic non-linear texts
Check whether output is infinity or not.
Check whether subseq(G1)∩subseq(G2) is infinite or not
S1, S2 : the union of sets of labels of vertices that have a self-loop in G’1 , G’2
case of S1∩S2 ≠ Ø
Let c be any character in S1∩S2.
An infinite repetition c* of c is a common subsequence of G1 and G2.
Hence, output = ∞.
case of S1∩S2 = Ø
subseq(G1)∩subseq(G2) is finite.
G’ {a}
1
{a}
G’ {a}
2
1
{c}
2
{d}
2
1
{b}
3
{c,d}
4
{a,b}
4
3
S1 = {c,d}
S2 = {a}∪{a,b} = {a,b}
S1∩S2 = {c,d}∩{a,b} = Ø
Algorithm 2
Sort vertices of G’1 and G’2 in topological order
G’ {a}
1
{a}
2
G’
1
{b}
3
{c,d}
1
4
sort
G’ {a}
1
{c}
2
{d}
3
G’
2
2
{a,b}
4
{a}
1
{a}
2
{b}
3
{c,d}
v’1,
v’1,
v’1,
v’1,
1
2
3
4
{a}
1
{c}
2
{d}
3
{a,b}
v’2,
v’2,
v’2,
v’2,
1
2
3
4
4
4
Algorithm 2
Let C be a |V’1|×|V’2| integer table
(Ci,j : the length of a longest string in subseq(P(v’1,i))∩subseq(P(v’2,j))
G’
2
G’
1
a
1
a
2
b
3
{c,d}
C
4
G’
1
G’
2
{a}
1
c
2
d
3
{a,b}
4
{a}
1
{c}
2
{d}
3
{a,b}
{a}
1
0
0
0
0
{a}
2
0
0
0
0
{b}
3
0
0
0
0
0
0
0
0
{c,d}
4
4
Algorithm 2
Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2|
If L’1(v’1,i) ∩ L’2(v’2,j) ≠ Ø then
Ci,j = 1 + max ({ Ck,ℓ | (v’1,k , v’1,i)∈E’1 , (v’2,ℓ , v’2,j)∈E’2 }∪{0})
G’
2
G’ {a}
1
{a}
C
2
G’
1
{b}
3
{c,d}
1
1
{c}
{a}
{d}
3
{b}
2
{a,b}
4
{c,d}
2,1
2
1
1,2
2
2,3
2,1
2
S
2,2
2,2
{a,b}
1
2
2
3
3
1,2
2,1
2,1
{d}
1
2
2
2
1,1
1,1
4
{c}
1
1
4
2
1
S
{a}
G’ {a}
{a}
2
2,2
+1 2
3
4,2
3
40
0
4
Algorithm 2
Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2|
If L’1(v’1,i) ∩ L’2(v’2,j) = Ø then
Ci,j = max ({ Ck,j | (v’1,k , v’1,i)∈E’1}∪{Ci,ℓ | (v’2,ℓ , v’2,j)∈E’2}∪{0})
G’
2
G’ {a}
1
{a}
C
2
G’
1
{b}
3
{c,d}
1
1
{c}
{a}
{d}
3
{b}
2
{a,b}
4
{c,d}
2,1
2
2
2
2,2
2
4,2
3
1
1,2
2,3
2,1
2
S
2,2
2,2
{a,b}
1
2
2
3
3
1,2
2,1
2,1
{d}
1
2
2
2
1,1
1,1
4
{c}
1
1
4
2
1
S
{a}
G’ {a}
{a}
3
4,3
4
40
4
Algorithm 2
Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2|
If L’1(v’1,i) ∩ L’2(v’2,j) = Ø then
Ci,j = max ({ Ck,j | (v’1,k , v’1,i)∈E’1}∪{Ci,ℓ | (v’2,ℓ , v’2,j)∈E’2}∪{0})
G’
2
G’ {a}
1
{a}
C
2
G’
1
{b}
3
{c,d}
1
1
{c}
{a}
3
{b}
2
{a,b}
4
{c,d}
2,1
4
2
2
2
2,2
2
4,2
3
1
1,2
2,3
2,1
2
S
2,2
2,2
{a,b} 4
1
2
2
3
3
1,2
2,1
2,1
{d}
1
2
2
2
1,1
1,1
{d}
{c}
1
1
4
2
1
S
{a}
G’ {a}
{a}
3
4,3
4
40
Output
Return max Ci,j
G’
2
G’ {a}
1
{a}
C
2
G’
1
{b}
3
{c,d}
1
1
{c}
{a}
3
{b}
2
{a,b}
4
{c,d}
2,1
4
2
2
2
2,2
2
4,2
3
1
1,2
2,3
2,1
2
S
2,2
2,2
{a,b}
1
2
2
3
3
1,2
2,1
2,1
{d}
1
2
2
2
1,1
1,1
{d}
{c}
1
1
4
2
1
S
{a}
G’ {a}
{a}
3
4,3
4
4
4
Time Complexity
Transform G1 and G2 into G’1 and G’2 based
on the strongly connected components
Check whether subseq(G1)∩subseq(G2) is
infinite or not
Sort vertices of G’1 and G’2 in
topological order
Let C be a |V’1|×|V’2| integer table
(Ci,j : the length of a longest string
in
subseq(P(v’1,i))∩subseq(P(v’2,j))
Linear time
O(|Σ|log|Σ|) time
Linear time
Compute Ci,j
Compute Ci,j using dynamic programing for
all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2|
O(|E’1||E’2|+|V’1||V’2|log|Σ|) time
Return max Ci,j
Compare L’(v1,i) and L’(v2,j)
Time Complexity
Transform G1 and G2 into G’1 and G’2 based
on the strongly connected components
Check whether subseq(G1)∩subseq(G2) is
infinite or not
Sort vertices of G’1 and G’2 in
topological order
Linear time
O(|Σ|log|Σ|) time
Linear time
The total
complexity
Let C betime
a |V’1|×
|V’2| integer tableis O(|E1||E2|+|V1||V2|log|Σ|) time.
(Ci,j : the length of a longest string
in
subseq(P(v’1,i))∩subseq(P(v’2,j))
Compute Ci,j using dynamic programing for
all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2|
Return max Ci,j
O(|E’1||E’2|+|V’1||V’2|log|Σ|) time
O(|V’1||V’2|) time
Conclusions and Future works
problem
Longest Common
Substring
Longest Common
Subsequence
time complexity
text 1
text 2
acyclic graph
acyclic graph
O(|E1||E2|)
graph
acyclic graph
O(|E1||E2|)
acyclic graph
acyclic graph
O(|E1||E2|)
graph
graph
O(|E1||E2|+|V1||V2|log|Σ|)
Future works
・Longest Common Substring Problem on Cyclic Non-linear text
・the case where the number of times a vertex can be used is bounded
・pattern matching with non-linear patterns
Thank You For Listening