ソードコードの編集に基づいた

テキストマイニング技術を応用した
メソッドクローン検出手法の提案
〇山中裕樹1, 吉田則裕2,
崔恩瀞1, 井上克郎1
1 大阪大学
2 奈良先端科学技術大学院大学
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
メソッドクローン
 メソッド単位のコードクローン
►互いに一致または類似した処理を行うメソッド
►ソースコードのコピーアンドペーストなどによって発生
クローンペア
2
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
メソッドクローンの定義[1]
種類
意味
タイプ1 レイアウト・空白・コメントの違いを除き完全に一致している
タイプ2 タイプ1に加え変数名・型の違いを除き構文的に一致している
タイプ3 タイプ2に加え文が挿入・削除・変更されている
タイプ4 構文上異なる実装だが，同一処理を実行している
[1]C. K. Roy, J. R. Cordy, R. Koschke. Comparison and evaluation of code
clone detection techniques and tools: a qualitative approach. Science of
Computer Programming, Vol. 74, No. 7, pp. 470–495, 2009.
3
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
メソッドクローン：タイプ1
種類
意味
タイプ1 レイアウト・空白・コメントの違いを除き完全に一致している
タイプ2 タイプ1に加え変数名・型の違いを除き構文的に一致している
タイプ3 タイプ2に加え文が挿入・削除・変更されている
タイプ4 構文上異なる実装だが，同一処理を実行している
int sum(int[] data){
int sum = 0;
for(int i=0; i<data.length; i++){
sum = sum + data[i];
}
return sum;
}
メソッド1
int sum(int[] data){
int sum = 0;
for(int i=0; i<data.length; i++){
sum = sum + data[i];
}
return sum;
}
メソッド2
4
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
メソッドクローン：タイプ2
種類
意味
タイプ1 レイアウト・空白・コメントの違いを除き完全に一致している
タイプ2 タイプ1に加え変数名・型の違いを除き構文的に一致している
タイプ3 タイプ2に加え文の挿入・削除・変更されている
タイプ4 構文上異なる実装だが，同一処理を実行している
int sum(int[] data){
int sum = 0;
for(int i=0; i<data.length; i++){
sum = sum + data[i];
}
return sum;
}
double sum(double[] data){
double sum = 0;
for(int j=0; j<data.length; j++){
sum = sum + data[j];
}
return sum;
}
メソッド1
メソッド2
5
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
メソッドクローン：タイプ3
種類
意味
タイプ1 レイアウト・空白・コメントの違いを除き完全に一致している
タイプ2 タイプ1に加え変数名・型の違いを除き構文的に一致している
タイプ3 タイプ2に加え文が挿入・削除・変更されている
タイプ4 構文上異なる実装だが，同一処理を実行している
int sum(int[] data){
int sum = 0;
for(int i=0; i<data.length; i++){
sum = sum + data[i];
}
return sum;
}
メソッド1
int sum(int[] data){
int sum = 0;
for(int i=0; i<data.length; i++){
sum += data[i];
}
return sum;
}
メソッド2
6
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
メソッドクローン：タイプ4
種類
意味
タイプ1 レイアウト・空白・コメントの違いを除き完全に一致している
タイプ2 タイプ1に加え変数名・型の違いを除き構文的に一致している
タイプ3 タイプ2に加え文が挿入・削除・変更されている
タイプ4 構文上異なる実装だが，同一処理を実行している
int sum(int[] data){
int sum = 0;
for(int i=0; i<data.length; i++){
sum = sum + data[i];
}
return sum;
}
メソッド1
int sum(int[] data){
int sum = 0, i=0;
while(i<data.length){
sum += data[i]; i++;
}
return sum;
}
メソッド2
7
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
本研究の概要
 タイプ1-タイプ4のメソッドクローン検出手法の提案
►テキストマイニング技術を利用
►メソッドを特徴ベクトルに変換し，類似度を計算
処理が類似したメソッドを高速に検出
8
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
メソッドクローン検出の目的
 リファクタリング支援
Class S
Class A
Class B
hoge()
hoge()
メソッド
の引上げ
hoge()
Class A
Class B
クローン
 一貫した修正漏れによるバグの検出（タイプ3,4）
int sum(int[] data){
if ( data == null)
return null;
・・・
}
例外処理の
追加漏れ
int sum(int[] data){
//例外処理漏れ
・・・
}
9
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
既存のクローン検出手法
 構文的な類似性に着目した手法[2][3]
►メリット：タイプ1,タイプ2のクローンを高速に検出可能
►デメリット：タイプ3,タイプ4のクローンを検出が困難
 意味的な類似性に着目した手法[4]
►メリット：タイプ1- タイプ4のクローンを検出可能
►デメリット：検出時間に膨大な時間がかかる
[2] T. Kamiya, S. Kusumoto, K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale
source code. IEEE Trans. Softw. Eng., Vol. 28, No. 7, pp. 654–670, 2002.
[3] L. Jiang, G. Misherghi, Z. Su, S. Glondu. DECKARD: scalable and accurate tree-based detection of code clones. In Proc. of
ICSE ’07, pp. 96-105, 2007.
[4] R. Komondoor, S. Horwitz. Using slicing to identify duplication in source code. In Proc. of SAS ’01, pp. 40–56, 2001.
10
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
提案手法の概要
 メソッド中のワードに重みを付け特徴ベクトルを計算
►識別子名を構成する単語（関数名，変数名）
►予約語に含まれる単語（ if, while ）
 近似アルゴリズムを用いて特徴ベクトルをクラスタリング
• タイプ1からタイプ4のメソッドクローンを検出
• 近似アルゴリズムにより高速な検出を実現
11
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
検出アルゴリズム
STEP1：各メソッドからワードの抽出
STEP2: ワードに対して重みを計算し特徴ベクトルに変換
STEP3: 各メソッドの特徴ベクトルをクラスタリング
STEP4: クラスタ中の特徴ベクトル間の類似度を計算
STEP1
STEP2
STEP3
STEP4
メソッド A
ワード回数
xxx
3
yyy
2
・・・・・・
メソッドB
ソースコード
ワード回数
xxx
2
yyy
4
・・・
メソッド A
{a1 , a2 , a3 , }
メソッドA
メソッドB
メソッドB
{b1 , b2 , b3 , }
メソッドC
メソッドD
メソッドE
特徴ベクトル
クラスタ
・・・
ワードリスト
類似度メソッド対クローン
メソッドA
0.95
✓
メソッドB
メソッドC
0.70
メソッドD
メソッドC
0.70
メソッドE
メソッドD
0.90
✓
メソッドE
・・・
・・・
・・・
クローンペアリスト
12
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
STEP1：ワードの抽出
 各メソッドから識別子名・予約語を抽出
►2文字以下の識別子は1つのメタワードとして扱う
►複数の単語の場合，区切り文字や大文字で分割
 例：dataSize ⇒ data + size
メソッドAのソースコード
int sum(int[] data, int dataSize){
int sum = 0;
for(int i=0; i<dataSize; i++)
sum += data[i];
return sum;
}
メソッドAのワードリスト
予約語
分割した
識別子名
ワード出現回数
int
4
for
1
return
1
data
4
sum
4
size
2
メタワード
4
13
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
STEP1：ワードの抽出
 各メソッドから識別子名・予約語を抽出
►2文字以下の識別子は1つのメタワードとして扱う
►複数の単語の場合，区切り文字や大文字で分割
 例：dataSize ⇒ data + size
メソッドAのソースコード
int sum(int[] data, int dataSize){
int sum = 0;
for(int i=0; i<dataSize; i++)
sum += data[i];
return sum;
}
メソッドAのワードリスト
予約語
分割した
識別子名
ワード出現回数
int
4
for
1
return
1
data
4
sum
4
size
2
メタワード
4
14
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
STEP1：ワードの抽出
 各メソッドから識別子名・予約語を抽出
►2文字以下の識別子は1つのメタワードとして扱う
►複数の単語の場合，区切り文字や大文字で分割
 例：dataSize ⇒ data + size
メソッドAのソースコード
int sum(int[] data, int dataSize){
int sum = 0;
for(int i=0; i<dataSize); i++)
sum += data[i];
return sum;
}
メソッドAのワードリスト
予約語
分割した
識別子名
ワード出現回数
int
4
for
1
return
1
data
4
sum
4
size
2
メタワード
4
15
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
STEP2：特徴ベクトルの計算
 TF-IDF法を利用
►文書中の単語に関する重み付けの手法
►TF値とIDF値の積で表される
メソッド中の
ワードの出現頻度
×
ソースコード全体の
ワードの希少さ
各ワードの重みを特徴量として
各メソッドを特徴ベクトルに変換
16
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
TF-IDFの計算例
メソッドAにおけるワード sum のTF-IDF値を求める
全メソッドのワード含有のチェック
メソッドAのワードリスト
ワード
int
for
return
data
出現回数
4
1
1
4
sum
4
メタワード
2
4
予約語
分割した
識別子名
sumの出現回数
size
TFsum
4

 0.20
20
全ワードの出現回数
ワード
メソッドA メソッドB メソッドC メソッドD
・
・
・
・
・
・
sum
✔
・
・
・
・
・
・
・
・
・
IDFsum
・
・
・
✔
全メソッド数
×
・
・
・
・
・
・
・
・
・
・
・
・
4
 log   0.30
2
sumを含むメソッド数
TF - IDFsum  0.06
ワードsumの
特徴量
17
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
STEP3：特徴ベクトルのクラスタリング
 LSH(Locality-Sensitive Hashing)[5]を利用
►近似最近傍探索アルゴリズムの一つ
 ハッシュ関数を用いて高速にクラスタリング可能
►クローンペアと成りうる候補を絞ることが目的
メソッド名
特徴ベクトル
メソッドA
(5,4,2,1,・・・ )
メソッドB
(0,0,2,2,・・・ )
メソッドC
(0,0,2,2,・・・ )
メソッドD
(3,4,2,1,・・・ )
・・・
・・・
各メソッドの特徴ベクトル
メソッドA
メソッドD
クラスタリング
メソッドB
メソッドC
メソッドのクラスタ
[5] P. Indyk, R. Motwani. Approximate nearest neighbors: towards removing the
curse of dimensionality. In Proc. of STOC ’98, pp. 604-613, 1998.
18
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
STEP4: ベクトル間の類似度計算
 各クラスタ内で特徴ベクトル間の類似度を計算
►コサイン類似度を利用
 
 
►特徴ベクトル a , b 間の類似度の計算方法
 
a b
 
 
sim a , b  cos a , b   
a b




閾値（0.9）以上であればクローンとして検出
19
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
評価実験：リサーチクエッション
 RQ1:
►タイプ1-タイプ4のクローンを検出できるか
 RQ2：
►既存手法と比較して本手法は有用であるか
 RQ3:
►どのようなメソッドクローンを検出できるか
20
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
RQ1: 検出性の評価
 Royらのベンチマーク[6]を利用
►タイプ1-タイプ4の16個のクローンペアが用意
ベンチマーク
検出結果
タイプ1
3
3
タイプ2
4
2
タイプ3
5
5
タイプ4
4
4
全タイプのクローンを検出することを確認
[6] C. K. Roy, J. R. Cordy, R. Koschke. Comparison and evaluation of code clone detection techniques and
tools: a qualitative approach. Science of Computer Programming, Vol. 74, No. 7, pp.470-495, 2009.
21
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
RQ2: 既存手法との比較
 メモリベースの検出ツールMeCC[7]と比較
►タイプ1-タイプ4の手続き（関数）単位のクローン検出
 MeCCの評価実験で利用されたOSSに適用
►検出精度と検出時間を比較
プロジェクト名
ApacheHTTPD
Python
PostgreSQL
言語
C
C
C
サイズ
343KLOC
435KLOC
937KLOC
[7] H. Kim, Y. Jung, S. Kim, K. Yi. MeCC: memory comparison-based clone
detector. In Proc. of ICSE ’11, pp. 301–310, 2011.
22
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
RQ2: 検出精度の比較結果
 適合率と各タイプの正解検出クローン数を比較
プロジェクト名
Python
ApacheHTTPD
本手法
PostgreSQL
合計
Python
ApacheHTTPD
MeCC
PostgreSQL
合計
適合率
94.5%
95.4%
94.7%
94.6%
85.3%
87.5%
83.1%
85.00%
タイプ1
19
71
57
147
3
2
9
14
検出クローン
タイプ2 タイプ3 タイプ4
103
159
21
100
190
11
230
341
17
433
690
49
127
82
13
84
71
10
120
88
14
331
241
37
• MeCCよりも高い適合率でクローンを検出
• 各タイプのクローン検出数もMeCCより多い
23
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
RQ2: 検出時間の比較結果
 MeCCの実行環境と同一のものを利用
► Ubuntu 64-bit ( RAM: 8.0GB, CPU: Intel Xeon 2.4GHz)
ApacheHTTPD
Python
PostgreSQL
本手法
1m43s
2m13s
4m39s
MeCC
310m34s
65m26s
428m32s
MeCCよりも高速に
クローン検出を行うことが可能
24
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
RQ3: 検出クローンの例(1/2)
 繰り返し処理の置き換え
static const XML_Char *
poolCopyStringN(STRING_POOL *pool, const
XML_Char *s, int n)
{
if (!pool->ptr && !poolGrow(pool))
return NULL;
for (; n > 0; --n, s++) {
if (!poolAppendChar(pool, *s))
return NULL;
}
s = pool->start;
poolFinish(pool);
return s;
}
for文を用いた繰り返し
static const XML_Char * FASTCALL
poolCopyString(STRING_POOL *pool, const
XML_Char *s)
{
// 例外処理漏れの可能性がある!
do {
if (!poolAppendChar(pool, *s))
return NULL;
}while (*s++);
s = pool->start;
poolFinish(pool);
return s;
}
do-while文を用いた繰り返し
25
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
RQ3: 検出クローンの例(2/2)
 文の並び替え
static PyObject *
static PyObject *
dequeiter_next(dequeiterobject *it)
dequereviter_next(dequeiterobject *it)
{
{
PyObject *item;
PyObject *item;
if (it->deque->state != it->state) {
if (it->counter == 0)
it->counter = 0;
return NULL;
PyErr_SetString(PyExc_RuntimeError,
if (it->deque->state != it->state) {
"deque mutated during iteration");
it->counter = 0;
return NULL;
PyErr_SetString(PyExc_RuntimeError,
}
"deque mutated during iteration");
if (it->counter == 0)
return NULL;
return NULL;
}
assert (!(it->b == it->deque->rightblock &&
assert (!(it->b == it->deque->leftblock &&
it->index > it->deque->rightindex));
it->index < it->deque->leftindex));
・
・
・
・
例外処理の
・
・
位置が異なる
}
}
26
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
まとめと今後の課題
 まとめ
►テキストマイニング技術を用いたタイプ4のクローン検出
手法を提案
►既存手法と比較して高精度・高速なクローン検出が
可能
 今後の課題
►他のクローン検出ツールとの比較
►タイプ2のクローン検出精度の向上
 辞書を用いたワードのクラスタリングなどを利用
27
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
ご清聴ありがとうございました
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Download Report