Kouji Shimohira, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan Outline Non-linear Text Computing Longest Common Subsequence of Acyclic Non-linear Texts Computing Longest Common Subsequence of Cyclic Non-linear Texts Conclusions and Future works Outline Non-linear Text Computing Longest Common Subsequence of Acyclic Non-linear Texts Computing Longest Common Subsequence of Cyclic Non-linear Texts Conclusions and Future works Non-linear Text Non-linear Text G = ( V, E , L ) A directed graph with vertices labeled by characters. V : the set of vertices. E : the set of arcs. L : V → Σ : a labeling function. e.g. V = { v1 , v2 , v3 , v4 , v5 , v6 , v7 , v8 , v9 } E = {(v1,v2),(v2,v3),(v2,v7),(v3,v4),(v4,v5), (v5,v6),(v6,v7),(v6,v8),(v7,v8),(v8,v9)} L = { v1 → a, v2 → b, v3 → c, v4 → a, v5 → a, v6 → b, v7 → b, v8 → a, v9 → c } G a c 1 b 3 a a 4 5 b 6 a 2 b 7 8 c 9 Non-linear Text L(v) : the character label of vertex v P(v) : the set of paths that end at vartex v. L(P(v)) : the set of strings spelled by paths in P(v) P(G) : the set of paths in G. ( P(G) = {P(v) | v∈V } ) L(P(G)) : the set of strings spelled by paths in P(G) (= L(G)) substr(L(G)) : the set of substrings of strings in L(G) subseq(L(G)) : the set of subseqences of strings in L(G) e.g. G’ a c 1 b 3 a a 4 5 b L(v7) = b 6 a 2 b 7 P(v7) = {v1v2v3v4v5v6v7 , v1v2v7 } 8 c 9 L(P(v7)) = {abcaabb , abb } bcaa ∈ substr(L(G)) aca ∈ subseq(L(G)) Algorithms on Non-linear Text problem Substring Matching Approximate Matching Longest Common Substring Longest Common Subsequence text pattern time complexity acyclic graph linear O(n+m|E|) tree linear O(n) graph linear O(n+m|E|) graph with edit operations linear NP-complete [Amir et al, 1997] graph linear with edit operations text 1 text 2 acyclic graph [Park & Kim, 1995] [Akutsu, 1993] [Amir et al, 1997] O(m(n+e)) [Navarro, 2000] acyclic graph O(|E1||E2|) (this work) graph acyclic graph O(|E1||E2|) (this work) acyclic graph acyclic graph O(|Σ|2|E1||E2|) acyclic graph acyclic graph O(|E1||E2|) graph graph [Thang, 2011] (this work) O(|E1||E2|+|V1||V2|log|Σ|) (this work) |E1| : the number of arcs in text 1, |E2| : the number of arcs in text 2, |Σ| : the alphabet size, |V1| : the number of vertices in text 1, |V2| : the number of vertices in text 2. Outline Non-linear Text Computing Longest Common Subsequence of Acyclic Non-linear Texts Computing Longest Common Subsequence of Cyclic Non-linear Texts Conclusions and Future works Longest Common Subsequence Problem for Acyclic Non-linear Texts Problem 1 Input : Acyclic non-linear texts G1=(V1, E1, L1) and G2=(V2, E2, L2) Output : Length of longest string in subseq(G1)∩subseq(G2) e.g. G1 a 1 b 2 c 3 c 4 d 5 b 6 G2 a 1 d 4 c2 d3 b5 a 6 subseq(G1) = { a ,b, c, d, ab, ac, ad, bb, bc, bd, cb, cd, db, abb, abc, abd, acb, acd, adb, bcb, bcd, bdb, cdb, abcb, abcd, abdb, acdb, bcdb, abcdb } subseq(G2) = { a, b, c, d, ab, ac, ad, ba, ca, cb, cd, da, db, aba, aca, acb, acd, ada, adb, cba, cda, cdb, dba, acba, acda, acdb, adba, cdba, acdba } Output = 4 Algorithm 1 Algorithm 1 : Computing the length of longest common subsequence of acyclic non-linear texts Sort vertices of G1 and G2 in topological order Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i))∩subseq(P(v2,j)) Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2| Return max Ci,j Topological Sort Sort vertices of G1 and G2 in topological order G1 a c G2 a d 1 4 1 4 b d c b 2 5 c b G1 3 a 6 1 b 2 c 3 c 4 d 5 b 6 v1,1 v1,2 v1,3 v1,4 v1,5 v1,6 sort 2 5 d a 3 6 G2 a1 c 2 d 3 d 4 b 5 b 6 v2,1 v2,2 v2,3 v2,4 v2,5 v2,6 Dynamic Programing table Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i))∩subseq(P(v2,j)) G1 a G1 1 b G2 a1 c 2 2 c d 3 3 c d 4 4 d b 5 5 b b 6 6 G2 C a1 c a1 b 2 c 3 c 4 d 5 b 6 2 d 3 d 4 b 5 b 6 L1(v1,i) ≠ L2(v2,j) Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2| If L1(v1,i) ≠ L2(v2,j) then Ci,j = max ({ Ck,j | (v1,k , v1,i)∈E1}∪{Ci,ℓ | (v2,ℓ , v2,j)∈E2}∪{0}) G1 a c G2 a d 1 4 1 4 b d c b 2 5 2 5 c b d a 3 6 3 6 G1 G2 C a1 c2 d3 d4 b5 b6 S 1,1 1,2 1,1 1,2 S a1 1 1 1 1 1 1 b 1,1 1,2 1 2 c3 2,1 c4 S 5 b 6 1 2,1 1 2 0 2 4,5 1 5,3 2 2 2,6 4,2 1,1 3 2 1 1 2,5 2,5 4,1 3,2 2 1 2 1 1,2 2,4 4,3 3,2 1 1 2 0 1,4 3,2 S 1,1 d 1,3 3 1 L1(v1,i) = L2(v2,j) Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2| If L1(v1,i) = L2(v2,j) then Ci,j = 1 + max ({ Ck,ℓ | (v1,k , v1,i)∈E1 , (v2,ℓ , v2,j)∈E2 }∪{0}) G1 a c G2 a d 1 4 1 4 b d c b 2 5 2 5 c b d a 3 6 G1 G2 C a1 c2 d3 d4 b5 b6 S 1,1 1,2 1,1 1,2 S a1 1 1 1 1 1 1 b 6 d b 1,2 1 2 c3 c 3 1,1 2,1 2,1 S 3,1 2 2 3 +1 2 0 3 1 2 1 5,5 3 5,3 2 2 4,5 5,3 5,4 2 2,6 4,2 1,1 5,3 2 1 1 2,5 2,5 4,1 3,2 3,2 1 2,4 4,3 3,2 1,2 1 2 1 1 5 1 2 0 1,4 3,2 S 1,1 6 1 1 4 1,3 4 3 Output Return max Ci,j max Ci,j = 4 G2 C a1 c G1 a c G2 a d 1 4 1 4 b d c b 2 5 2 5 c b d a 3 6 G1 a b c c 3 6 d b S 1,1 S S 1,1 3,2 3,1 6 1 1 5 4,3 3,2 2 3,2 1 5,3 2 1 2 1 5,5 3 5,3 2 2 4,5 5,3 5,4 3 2,6 4,2 1,1 2 2 0 3 2,5 2,5 4,1 1 2 1 1 S 1,2 2,4 b 5 1 1 2 b 1,2 1,4 3,2 4 1 1 2 0 4 1,1 1,3 2,1 d 3 1 1 1 3 1,2 1,2 2,1 d 1 1 2 2 1,1 1 1 Output : 4 3 6,5 4 4 6 Time Complexity Sort vertices of G1 and G2 in topological order Linear time Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i))∩subseq(P(v2,j)) Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2| Return max Ci,j O(|E1||E2|) time Time Complexity Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2| case of L1(v1,i) ≠ L2(v2,j) ei : the number of arcs incoming to v1,i fj : the number of arcs incoming to v2,j To compute the value of Ci,j , ei + fj elements in table C are used. To compute Ci,j for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2| G1 G2 C a1 c2 d3 d4 b5 b6 S 1,1 1,2 1,1 1,2 S a1 1 1 1 1 1 1 b 2,1 c4 S b 1,2 1 2 c3 d = O(|E1||E2|) 1,1 2,1 1 2 1 2 1 2 1 5,5 3 5,3 2 2 4,5 5,3 5,4 3 2 0 3 2 2,6 4,2 1,1 5,3 2 1 1 2,5 2,5 4,1 3,2 3,2 1 2 1 1,2 2,4 4,3 3,2 3,1 1 2 0 1,4 3,2 S 1,1 6 1 1 5 1,3 3 6,5 4 4 Time Complexity Compute Ci,j using dynamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2| case of L1(v1,i) = L2(v2,j) ei : the number of arcs incoming to v1,i fj : the number of arcs incoming to v2,j To compute the value of Ci,j , ei fj elements in table C are used. To compute Ci,j for all 1≦ i ≦ |V1| and 1≦ j ≦ |V1| G1 G2 C a1 c2 d3 d4 b5 b6 S 1,1 1,2 1,1 1,2 S a1 1 1 1 1 1 1 b d b 1,2 1 2 c3 c = O(|E1||E2|) 1,1 2,1 2,1 S 3,1 2 2 1 2 1 5,5 3 5,3 2 2 4,5 5,3 5,4 3 2 0 3 2 2,6 4,2 1,1 5,3 2 1 1 2,5 2,5 4,1 3,2 3,2 1 2,4 4,3 3,2 1,2 1 2 1 1 5 1 2 0 1,4 3,2 S 1,1 6 1 1 4 1,3 3 6,5 4 4 Time Complexity Sort vertices of G1 and G2 in topological order Linear time Let C be a |V1|×|V2| interger array (Ci,j : the length of a longest string in subseq(P(v1,i))∩subseq(P(v2,j)) Compute Ci,j using dymnamic programing for all 1≦ i ≦ |V1| and 1≦ j ≦ |V2| O(|E1||E2|) time Return max Ci,j O(|V1||V2|) time The total time complexity is O(|E1||E2|) time. Outline Non-linear Text Computing Longest Common Subsequence of Acyclic Non-linear Texts Computing Longest Common Subsequence of Cyclic Non-linear Texts Conclusions and Future works Longest Common Subsequence Problem for Cyclic Non-linear Texts Problem 2 Input : Non-linear texts G1=(V1, E1, L1) and G2=(V2, E2, L2) Output : ∞ (if subseq(G1)∩subseq(G2) is infinite) The Length of longest string in subseq(G1)∩subseq(G2) (otherwise) e.g. 1 G1 a a c Character “d” is in a cycle in both G1 and G2. b d c G2 a c d a b d ccdccdccd······ ∈ L(G1) bdbdbdbd······ ∈ L(G2) dddddd········ ∈ subseq(G1)∩subseq(G2) Output = ∞ Longest Common Subsequence Problem for Cyclic Non-linear Texts Problem 2 Input : Non-linear texts G1=(V1, E1, L1) and G2=(V2, E2, L2) Output : ∞ (if subseq(G1)∩subseq(G2) is infinite) The Length of longest string in subseq(G1)∩subseq(G2) (otherwise) e.g. 2 G1 a a c b d c G2 a c d a b a subseq(G1)∩subseq(G2) = {a, b, c, d, aa, ab, ac, ad, cd, aab, aac, aad, acd, aacd} Output = 4 Algorithm 2 Algorithm 2 : Computing the length of longest common subsequence of cyclic non-linear texts Transform G1 and G2 into G’1 and G’2 based on the strongly connected components Check whether subseq(G1)∩subseq(G2) is infinite or not Sort vertices of G’1 and G’2 in topological order Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i))∩subseq(P(v’2,j)) Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2| Return max Ci,j Strongly Connected Component Transform G1 and G2 into G’1 and G’2 based on the strongly connected components G1 a a G’ {a} c 1 {a} 3 {c,d} 2 1 b d {b} c 4 transform G2 a a c d b a G’ {a} 1 {c} 2 {d} 3 2 {a,b} 4 strongly connected component cyclic non-linear texts acyclic non-linear texts Check whether output is infinity or not. Check whether subseq(G1)∩subseq(G2) is infinite or not S1, S2 : the union of sets of labels of vertices that have a self-loop in G’1 , G’2 case of S1∩S2 ≠ Ø Let c be any character in S1∩S2. An infinite repetition c* of c is a common subsequence of G1 and G2. Hence, output = ∞. case of S1∩S2 = Ø subseq(G1)∩subseq(G2) is finite. G’ {a} 1 {a} G’ {a} 2 1 {c} 2 {d} 2 1 {b} 3 {c,d} 4 {a,b} 4 3 S1 = {c,d} S2 = {a}∪{a,b} = {a,b} S1∩S2 = {c,d}∩{a,b} = Ø Algorithm 2 Sort vertices of G’1 and G’2 in topological order G’ {a} 1 {a} 2 G’ 1 {b} 3 {c,d} 1 4 sort G’ {a} 1 {c} 2 {d} 3 G’ 2 2 {a,b} 4 {a} 1 {a} 2 {b} 3 {c,d} v’1, v’1, v’1, v’1, 1 2 3 4 {a} 1 {c} 2 {d} 3 {a,b} v’2, v’2, v’2, v’2, 1 2 3 4 4 4 Algorithm 2 Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i))∩subseq(P(v’2,j)) G’ 2 G’ 1 a 1 a 2 b 3 {c,d} C 4 G’ 1 G’ 2 {a} 1 c 2 d 3 {a,b} 4 {a} 1 {c} 2 {d} 3 {a,b} {a} 1 0 0 0 0 {a} 2 0 0 0 0 {b} 3 0 0 0 0 0 0 0 0 {c,d} 4 4 Algorithm 2 Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2| If L’1(v’1,i) ∩ L’2(v’2,j) ≠ Ø then Ci,j = 1 + max ({ Ck,ℓ | (v’1,k , v’1,i)∈E’1 , (v’2,ℓ , v’2,j)∈E’2 }∪{0}) G’ 2 G’ {a} 1 {a} C 2 G’ 1 {b} 3 {c,d} 1 1 {c} {a} {d} 3 {b} 2 {a,b} 4 {c,d} 2,1 2 1 1,2 2 2,3 2,1 2 S 2,2 2,2 {a,b} 1 2 2 3 3 1,2 2,1 2,1 {d} 1 2 2 2 1,1 1,1 4 {c} 1 1 4 2 1 S {a} G’ {a} {a} 2 2,2 +1 2 3 4,2 3 40 0 4 Algorithm 2 Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2| If L’1(v’1,i) ∩ L’2(v’2,j) = Ø then Ci,j = max ({ Ck,j | (v’1,k , v’1,i)∈E’1}∪{Ci,ℓ | (v’2,ℓ , v’2,j)∈E’2}∪{0}) G’ 2 G’ {a} 1 {a} C 2 G’ 1 {b} 3 {c,d} 1 1 {c} {a} {d} 3 {b} 2 {a,b} 4 {c,d} 2,1 2 2 2 2,2 2 4,2 3 1 1,2 2,3 2,1 2 S 2,2 2,2 {a,b} 1 2 2 3 3 1,2 2,1 2,1 {d} 1 2 2 2 1,1 1,1 4 {c} 1 1 4 2 1 S {a} G’ {a} {a} 3 4,3 4 40 4 Algorithm 2 Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2| If L’1(v’1,i) ∩ L’2(v’2,j) = Ø then Ci,j = max ({ Ck,j | (v’1,k , v’1,i)∈E’1}∪{Ci,ℓ | (v’2,ℓ , v’2,j)∈E’2}∪{0}) G’ 2 G’ {a} 1 {a} C 2 G’ 1 {b} 3 {c,d} 1 1 {c} {a} 3 {b} 2 {a,b} 4 {c,d} 2,1 4 2 2 2 2,2 2 4,2 3 1 1,2 2,3 2,1 2 S 2,2 2,2 {a,b} 4 1 2 2 3 3 1,2 2,1 2,1 {d} 1 2 2 2 1,1 1,1 {d} {c} 1 1 4 2 1 S {a} G’ {a} {a} 3 4,3 4 40 Output Return max Ci,j G’ 2 G’ {a} 1 {a} C 2 G’ 1 {b} 3 {c,d} 1 1 {c} {a} 3 {b} 2 {a,b} 4 {c,d} 2,1 4 2 2 2 2,2 2 4,2 3 1 1,2 2,3 2,1 2 S 2,2 2,2 {a,b} 1 2 2 3 3 1,2 2,1 2,1 {d} 1 2 2 2 1,1 1,1 {d} {c} 1 1 4 2 1 S {a} G’ {a} {a} 3 4,3 4 4 4 Time Complexity Transform G1 and G2 into G’1 and G’2 based on the strongly connected components Check whether subseq(G1)∩subseq(G2) is infinite or not Sort vertices of G’1 and G’2 in topological order Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i))∩subseq(P(v’2,j)) Linear time O(|Σ|log|Σ|) time Linear time Compute Ci,j Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2| O(|E’1||E’2|+|V’1||V’2|log|Σ|) time Return max Ci,j Compare L’(v1,i) and L’(v2,j) Time Complexity Transform G1 and G2 into G’1 and G’2 based on the strongly connected components Check whether subseq(G1)∩subseq(G2) is infinite or not Sort vertices of G’1 and G’2 in topological order Linear time O(|Σ|log|Σ|) time Linear time The total complexity Let C betime a |V’1|× |V’2| integer tableis O(|E1||E2|+|V1||V2|log|Σ|) time. (Ci,j : the length of a longest string in subseq(P(v’1,i))∩subseq(P(v’2,j)) Compute Ci,j using dynamic programing for all 1≦ i ≦ |V’1| and 1≦ j ≦ |V’2| Return max Ci,j O(|E’1||E’2|+|V’1||V’2|log|Σ|) time O(|V’1||V’2|) time Conclusions and Future works problem Longest Common Substring Longest Common Subsequence time complexity text 1 text 2 acyclic graph acyclic graph O(|E1||E2|) graph acyclic graph O(|E1||E2|) acyclic graph acyclic graph O(|E1||E2|) graph graph O(|E1||E2|+|V1||V2|log|Σ|) Future works ・Longest Common Substring Problem on Cyclic Non-linear text ・the case where the number of times a vertex can be used is bounded ・pattern matching with non-linear patterns Thank You For Listening
© Copyright 2024 ExpyDoc