International Journal of Computer Application
Available online on http://www.rspublication.com/ijca/ijca_index.htm
Issue 4, Volume 3 (May-June 2014)
ISSN: 2250-1797
Secure Mining of Classification in Horizontally Distributed Databases
Mohamed A. Ouda, Sameh A. Salem, Ihab A. Ali, and El-Sayed M. Saad
Department of Communication and Computer, Faculty of Engineering
Helwan University, Cairo - Egypt
Abstract
Data mining can extract interesting knowledge from large amounts of data, but in distributed databases these data are split among different parties. Privacy concerns may prevent parties from directly sharing data that could lead to mutual benefit. This paper proposes a secure mining protocol using the K-Nearest Neighbor (KNN) classification method over horizontally partitioned data. The proposed protocol depends mainly on the integration of the Elliptic Curve Cryptography (ECC) public key cryptosystem and Diffie-Hellman key exchange. Comparisons with the RSA public key cryptosystem show that the proposed protocol achieves relatively better performance. A security analysis of the presented protocol, along with the experimental results, shows its applicability with different parameters such as the number of parties involved and the size of the underlying datasets.
Keywords: privacy preserving; secure multi-party computation; K-Nearest Neighbor algorithm.
I. Introduction
Data mining techniques are used increasingly by the private and public sectors to identify patterns, trends and exceptions. Data mining functionalities include establishing correlations through association rule mining, supervised and unsupervised machine learning through classification and clustering analysis, rare-event analysis through outlier detection, and trend analysis for sequential and time-series data.
Data mining techniques are generally performed on two types of data environment: centralized databases or data warehouses, and decentralized data sets. Here we use distributed classification discovery as a scenario. The risk of violating the privacy of individuals when data are shared is still one of the main drawbacks of distributed data mining. We study the problem of secure classification mining in horizontally partitioned databases. In this model, several sites hold homogeneous databases, meaning that the databases share the same set of attributes but hold information on different entities.
This paper is organized as follows: the literature survey and related work are given in Section 2. Section 3 introduces the cryptographic privacy-preserving mechanisms that are used. Section 4 presents the proposed protocol. Section 5 describes the implemented system. The analysis and evaluation of the proposed protocol are given in Section 6, followed by the conclusion.
II. Related work
Research on privacy-preserving data mining has attracted great interest in recent years [1]-[3]. Privacy-preserving algorithms have been proposed for different data mining applications, including clustering [4]-[7], association rule mining across multiple databases [8]-[10], Bayes classification [11]-[13], and decision trees [1][2]. The Secure Multiparty Computation paradigm provides cryptographic solutions for protecting privacy in any distributed computation [14][15]. The works most related to ours are [16]-[20]. Herein, we study applying Elliptic Curve Cryptography (ECC) integrated with Diffie-Hellman key exchange for
secure KNN classification mining. Work on privacy-preserving data mining using ECC can also be found in [21][22]. The contributions of this paper include the following:
 A solution for KNN classification with horizontal collaboration.
 A privacy and efficiency analysis showing how the performance scales with various factors.
II.1 Homogeneous collaborative classification mining with privacy preserving
There have been two approaches to privacy preserving in multiparty classification mining. One is the randomization approach [23][24], in which the data are altered or perturbed before being submitted to the data miner such that the actual individual data cannot be recovered, while certain computations can still be applied to the data. The other is a cryptographic approach, mostly using SMC (Secure Multiparty Computation) [25][26]. The most efficient privacy-preserving solutions can often be designed for specific distributed computations. We focus on SMC-based classification mining algorithms over horizontally partitioned data within the cryptographic approach. The SMC literature defines two basic adversarial models:
- Semi-Honest Model: semi-honest adversaries follow the protocol faithfully, but may try to infer the secret information of the other parties from the data they see during the execution of the protocol.
- Malicious Model: malicious adversaries may do anything to infer secret information. They can abort the protocol at any time, send spurious messages, spoof messages, collude with other (malicious) parties, etc.
III. Cryptographic privacy-preserving mechanisms
A- RSA Public-Key Cryptographic Algorithm
RSA is the most widely used public-key cryptosystem [27]. Its security relies on the number-theoretic fact that factoring large integers is computationally hard. In the RSA algorithm, as in Fig. 1, the key pair (e, d) is generated by the receiver, who posts the encryption key e on a public medium while keeping the decryption key d secret.
Fig. 1: Public-key encryption schemes: an illustration
B- Diffie-Hellman Protocol
The Diffie-Hellman protocol [28] is the basic public-key construction proposed for secret key sharing. If party A and party B first agree to use a specific curve, field size, and type of arithmetic, they then establish the shared secret key as follows; note that only scalar multiplication of the base point P is needed to implement the Diffie-Hellman protocol.
1. A and B each choose a random private key, k_a and k_b respectively.
2. A and B compute (k_a P) and (k_b P), respectively, and send them to the opposite side.
3. A and B both compute the shared secret key SK = k_a (k_b P) = k_b (k_a P).
C- Elliptic Curve Cryptography
The elliptic curves we use are restricted to elements of a finite field; in this work we use ECC defined over the odd prime field F_p. An introduction to elliptic curves and their usage in cryptography can be found in [29]. Elliptic curve cryptosystems over finite fields have the advantage that the key size can be much smaller
compared to other cryptosystems such as RSA; e.g. a 160-bit key in ECC is considered as secure as a 1024-bit key in RSA [30][31]. This leads to a faster encryption/decryption process.
The security of ECC depends on the difficulty of the Elliptic Curve Discrete Logarithm Problem (ECDLP), which states: "Given an elliptic curve E defined over a finite field F_p of order n, a point P in E(F_p), and a point Q in E(F_p), find the integer k in [0, n-1] such that Q = kP." The integer k is called the discrete logarithm of Q to the base P. It is computationally infeasible to obtain k if k is sufficiently large. Hence the main operation involved in ECC is point multiplication, i.e. multiplication of a scalar k with a point P on the curve to obtain another point Q on the curve.
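As a small illustration of point multiplication and the Diffie-Hellman shared point k_a(k_b P) = k_b(k_a P), the following Python sketch implements affine point addition and double-and-add scalar multiplication over F_p. The toy curve y^2 = x^3 + 2x + 3 (mod 97), the base point (3, 6), and the private keys are arbitrary small values chosen for readability and are not part of the paper; a real deployment would use a standardized, secure curve.

P_MOD = 97          # field prime p (toy value)
A, B = 2, 3         # coefficients of y^2 = x^3 + Ax + B (mod p)
INF = None          # point at infinity (group identity)

def ec_add(p1, p2):
    """Add two affine points on the toy curve."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                              # P + (-P) = O
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD   # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD          # chord slope
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def ec_mul(k, point):
    """Double-and-add scalar multiplication k*P."""
    result, addend = INF, point
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result

P = (3, 6)                                 # base point on the toy curve
ka, kb = 13, 27                            # illustrative private keys of A and B
Qa, Qb = ec_mul(ka, P), ec_mul(kb, P)      # exchanged public points
assert ec_mul(ka, Qb) == ec_mul(kb, Qa)    # both sides derive the same shared point
print("shared point:", ec_mul(ka, Qb))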
IV. Proposed privacy-preserving classification in a homogeneous collaborative environment
IV.1 Proposed model for privacy preserving:
- Key Generation for the Privacy-Preserving Model:
In our proposed protocol we extend Elliptic Curve [30] Diffie-Hellman key exchange [28] to a multiparty cryptosystem for a distributed dataset environment. In the proposed model, the master site and the distributed slave sites on both ends of the communication send a public key which can be seen by anyone. The public key is then combined with the private key to create a shared secret key which, due to the underlying mathematics, is the same on both sides. This shared secret key is then used to derive a new key that can be used by either site (master and slave) for encrypting and decrypting messages.
[Fig. 2, described in place of the original diagram: the master site computes its public key A = g^{pk_m} mod p and sends the domain parameters and public key (g, p, A) to every slave site. Each slave site s (s = 1, ..., n) computes its public key B_s = g^{pk_s} mod p and returns it to the master. Each slave site then forms its shared secret key S_{k_s} = A^{pk_s} mod p, while the master forms the same value for that site as S_{k_s} = B_s^{pk_m} mod p.]
Fig. 2: Generation of shared secret keys S_{k_s} at the master and slave sites
Where:
p: a prime number on which the finite field F_p is defined.
g: a base point taken from the elliptic group.
pk_s, pk_m: the private keys of slave site S_s and the master site S_m respectively, selected from the interval [1, p-1], 1 <= s <= n.
B_s, A: the public keys of slave site S_s and the master site S_m respectively.
The shared secret key S_{k_s} is:
S_{k_s} = A^{pk_s} mod p = (g^{pk_m} mod p)^{pk_s} mod p = g^{pk_m pk_s} mod p = (g^{pk_s} mod p)^{pk_m} mod p = B_s^{pk_m} mod p
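The following minimal Python sketch mirrors the Fig. 2 exchange using the paper's modular-exponentiation notation. The prime p, the base g, and the key range are illustrative toy values only (real parameters would be curve points of adequate size), and the helper name keypair is our own.

import random

p = 7919                     # small prime standing in for the field order (toy value)
g = 2                        # base element g (a curve point in the paper)

def keypair():
    """Pick a private key pk and publish the public value g^pk mod p."""
    pk = random.randrange(2, p - 1)
    return pk, pow(g, pk, p)

n = 4                                        # number of slave sites
pk_m, A = keypair()                          # master key pair (pk_m, A)
slaves = [keypair() for _ in range(n)]       # (pk_s, B_s) for each slave site s

for s, (pk_s, B_s) in enumerate(slaves, 1):
    sk_slave  = pow(A,  pk_s, p)             # slave side:  A^{pk_s}  mod p
    sk_master = pow(B_s, pk_m, p)            # master side: B_s^{pk_m} mod p
    assert sk_slave == sk_master             # both equal g^{pk_m*pk_s} mod p
    print(f"site {s}: shared secret S_k{s} = {sk_master}")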
- Proposed Encryption Scheme:
1) Let (G, E, D, M) be an ECC cryptography scheme, where G is the key-generation algorithm, E and D are the encryption and decryption algorithms, and M is the message space.
2) Let s, 1 <= s <= n (n = number of sites/parties), and let k be the number of nearest neighbors, fixed beforehand.
3) At the master site and each slave site the public and private key pair of the ECC algorithm is generated. The key pair for a slave site s is (B_s, pk_s), 1 <= s <= n, and for the master site it is (A, pk_m).
4) Each slave party/site s exchanges its public encryption key with the master site.
5) The shared secret key S_{k_s} is generated for each slave site s, 1 <= s <= n, and the master site. This shared secret key is then used by both the master and the slave site for encrypting and decrypting messages. Each slave site s calculates its shared secret key as:
S_{k_s} = A^{pk_s} mod p = g^{pk_m pk_s} mod p
where A is the public key of the master site, A = g^{pk_m} mod p. The shared secret key at the master site corresponding to each slave site is calculated as:
S_{k_s} = B_s^{pk_m} mod p, where B_s = g^{pk_s} mod p is the public key of site s, so that
S_{k_s} = (g^{pk_s})^{pk_m} mod p = g^{pk_m pk_s} mod p.
6) Given a minimum distance d_{s,min} in M and corresponding class label cl_s in M at site s, form the plaintext message
m_s = {d_{s,min}, cl_s},  1 <= s <= n
7) The encrypted values are computed as:
E_{S_{k_s}}(m_s) = m_s * g^{pk_m pk_s} mod p,  1 <= s <= n     (1)
- Proposed Decryption Scheme:
To decrypt E_{S_{k_s}}(m_s) = m_s * g^{pk_m pk_s} mod p at the master site, the decryption key, which is the inverse of the shared key S_{k_s}, is calculated: D_{S_{k_s}} = (g^{pk_m pk_s})^{-1} mod p = (S_{k_s})^{-1}, such that:
(S_{k_s})^{-1} * E_{S_{k_s}}(m_s) = ((g^{pk_m pk_s})^{-1} mod p)(m_s * g^{pk_m pk_s} mod p) = (g^{pk_m pk_s})^{-1} g^{pk_m pk_s} m_s mod p = m_s mod p
then D_{S_{k_s}}(E_{S_{k_s}}(m_s)) = m_s mod p = m_s,  1 <= s <= n     (2)
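A minimal Python sketch of the masking-style encryption of Eq. (1) and the inverse-based decryption of Eq. (2) is given below. The parameters are toy values, and m_s is assumed to be encoded as a single integer smaller than p so that m_s mod p = m_s; the encoding of the pair {d_min, class} into one integer is not specified here.

p, g = 7919, 2
pk_m, pk_s = 1234, 567                     # illustrative private keys

shared = pow(g, pk_m * pk_s, p)            # S_ks = g^{pk_m*pk_s} mod p

def encrypt(m):
    """Eq. (1): E_{S_ks}(m_s) = m_s * g^{pk_m*pk_s} mod p."""
    return (m * shared) % p

def decrypt(c):
    """Eq. (2): multiply by the modular inverse of the shared key."""
    return (c * pow(shared, -1, p)) % p

m_s = 3141                                 # e.g. an integer encoding of (d_s_min, cl_s)
assert decrypt(encrypt(m_s)) == m_s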
- Correctness of the proposed model:
Let |F_p| be the order of the finite group F_p; then x^{|F_p|} = 1 for all x in F_p, as established by Lagrange's theorem in group theory [32]. The order |F_p| of the group F_p is known to all sites.
a- The value D_{S_{k_s}} = (g^{pk_m pk_s})^{-1} mod p at the master site is calculated as follows. The master knows its private key pk_m and the public key B_s of slave site s, B_s = g^{pk_s} mod p; then:
(g^{pk_s} mod p)^{|F_p| - pk_m} mod p = g^{|F_p| pk_s - pk_m pk_s} mod p
= g^{|F_p| pk_s} * g^{-pk_m pk_s} mod p
= 1^{pk_s} * g^{-pk_m pk_s} mod p
= g^{-pk_m pk_s} mod p
= (g^{pk_m pk_s})^{-1} mod p
b- The value D_{S_{k_s}} = (g^{pk_m pk_s})^{-1} mod p at the slave site is computed as follows. The slave knows its private key pk_s and the public key of the master site A = g^{pk_m} mod p; then:
(g^{pk_m} mod p)^{|F_p| - pk_s} mod p = g^{pk_m |F_p| - pk_m pk_s} mod p
= g^{|F_p| pk_m} * g^{-pk_m pk_s} mod p
= 1^{pk_m} * g^{-pk_m pk_s} mod p
= (g^{pk_m pk_s})^{-1} mod p
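A small numeric check of this argument is sketched below, with the assumption that |F_p| is read as the order p - 1 of the multiplicative group (Fermat's little theorem, x^{p-1} = 1 mod p); the parameters are toy values, not the paper's.

p, g = 7919, 2
pk_m, pk_s = 1234, 567

B_s = pow(g, pk_s, p)                      # public key of slave s
A   = pow(g, pk_m, p)                      # public key of the master

# (a) master side: B_s^{(p-1) - pk_m} = g^{-pk_m*pk_s} = (g^{pk_m*pk_s})^{-1}
D_master = pow(B_s, (p - 1) - pk_m, p)
# (b) slave side:  A^{(p-1) - pk_s} yields the same inverse
D_slave  = pow(A, (p - 1) - pk_s, p)

inverse = pow(pow(g, pk_m * pk_s, p), -1, p)
assert D_master == D_slave == inverse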
The result of decryption from a site s is m_s = {d_{s,min}, cl_s}, which represents the minimum distance among the k neighbors and the corresponding dominant class label used to produce the global predicted class of the classifier.
In homogeneous collaborative classification mining it is required to obtain data mining results over the whole database DB = DB_1 ∪ DB_2 ∪ ... ∪ DB_n, where each DB_i (1 <= i <= n) is a partial database located at one of the distributed sites P_1, P_2, ..., P_n (i.e. n divisions). Each partial database DB_i has r attributes and a different number of entities.
IV.2 The K-Nearest Neighbor classifier:
The standard K-Nearest Neighbor classification algorithm [33][34] is an instance-based learning algorithm that has been shown to be very effective for a variety of problem domains. The objective of k-nearest neighbor classification is to discover the k nearest neighbors of a given instance, and then assign a class label to the given instance according to the majority class of those k nearest neighbors, as shown in Fig. 3. The standard KNN algorithm can be stated as follows:
a) Consider learning discrete-valued target functions of the form f : R^r -> C, where C is the finite set {c_1, c_2, ..., c_s} of class values.
b) Building the training set: for each training example (x, f(x)), add the example to the training set list.
c) Classification: given a query instance X_p to be classified, let x_1, x_2, ..., x_k denote the k instances from the training set that are nearest to X_p. Then return
f(X_p) <- arg max_{c in C} sum_{i=1}^{k} delta(c, f(x_i)), where delta(a, b) = 1 if a = b and delta(a, b) = 0 otherwise.
The value f(X_p) represents the dominant class value of the query instance X_p.
To build a k-nearest neighbor classifier, the key point is to privately obtain the k nearest instances for a given point, which is achieved in the proposed protocol.
- The distance function used in this work is the standard Euclidean distance, defined as:
D(X, Y) = sqrt( sum_{i=1}^{r} (X[i] - Y[i])^2 )     (3)
where r is the dimension of the instance space, X[i] denotes the i-th component value of data object X, and D(X, Y) is the distance between the two data objects X and Y.
Fig.3: The classification of a test point at K=3, based on the
classes of its nearest neighbors.
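The following minimal Python sketch shows the local KNN step described above: Eq. (3) as the distance, the k nearest training instances, and a majority vote over their class labels. The tiny training set is illustrative only, and returning the nearest-neighbor distance alongside the majority class is one plausible reading of the local (d_min, cl) pair used later in the protocol.

import math
from collections import Counter

def euclidean(x, y):
    """Eq. (3): D(X, Y) = sqrt(sum_i (X[i] - Y[i])^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, query, k=3):
    """Return (predicted majority class, distance of the nearest neighbor)."""
    ranked = sorted(training, key=lambda ex: euclidean(ex[0], query))
    neighbours = ranked[:k]
    majority = Counter(label for _, label in neighbours).most_common(1)[0][0]
    d_min = euclidean(ranked[0][0], query)
    return majority, d_min

training_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
                ((4.0, 4.2), "B"), ((3.8, 4.0), "B"), ((0.9, 1.3), "A")]
print(knn_classify(training_set, (1.1, 1.0), k=3))   # -> ('A', ...)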
IV.3 Proposed protocol
The proposed protocol presents a method for privately computing the data mining process from distributed sources without disclosing any information about the sources or their data except what is revealed by the final classification result. The proposed protocol develops a solution for privacy-preserving k-nearest neighbor classification. In this paper, a semi-honest adversary model is used.
The integrated PPDM algorithm of the K Nearest Neighbor classifier is as follows:
Input: DB_1, DB_2, ..., DB_n at n sites, each with d_j data objects, j in {1, ..., n}; each data object X = (x_1, x_2, ..., x_r) of r-dimensional space, r >= 1; K = number of nearest neighbors fixed beforehand, K > 1 an odd integer; y_i class values, i in {1, ..., m}; l attribute values; X_p the query instance.
Output: Global class value of the query instance X_p.
1. d_{i,min} represents the minimum neighbor distance with the majority class relative to query instance X_p, and class label cl_i is the class corresponding to d_{i,min}.
2. For i = 1 ... n do // generating shared secret keys
3. The master and slave sites generate public and private key pairs using the ECC algorithm;
4. The master and slave sites exchange their public keys and the shared secret keys S_{k_i} are constructed.
5. End For // generating shared secret keys
6. For i = 1 ... n do // computing d_{i,min} (local KNN) at the n sites and performing the encryption process
7. Each site S_i locally computes d_{i,min} and the corresponding class value cl_i according to the K nearest neighbor algorithm relative to query instance X_p.
8. Using ECC, encrypt (d_{i,min}, cl_i) to C_i = E_{S_{k_i}}(d_{i,min}, cl_i) as per Eq. (1).
9. The encrypted value C_i is sent to the master site.
10. End For // computing d_{i,min} and performing the encryption process
11. For i = 1 ... n do // decryption process at the master site
12. Decrypt C_i as per Eq. (2) to get d_{i,min} and cl_i.
13. End For // decryption process
14. Construct the mapping table that maps the relative difference between d_{i,min} and every d_{j,min}, i != j and i, j in [1, n], to +1 or -1.
15. Calculate the weight of each row in the mapping table by summing the row elements.
16. Determine the global minimum distance, which corresponds to the minimum weight in the mapping table.
17. Output the predicted class that matches the global minimum distance (minimum weight in the mapping table).
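The compact Python simulation below ties steps 2-17 together under toy parameters: key generation, the Eq. (1)/(2) masking, and the mapping-table decision. The local (d_min, class) pairs are assumed to be precomputed by each site's local KNN run and scaled to integers smaller than p; the class label is carried in the clear here purely to keep the sketch short, whereas in the protocol it is part of the encrypted message m_s.

import random

p, g = 104729, 5                                  # toy prime field and base

def keypair():
    pk = random.randrange(2, p - 1)
    return pk, pow(g, pk, p)

# --- steps 2-5: key generation and shared secrets -------------------------
pk_m, A = keypair()                               # master key pair
local_results = [(412, "c1"), (143, "c2"), (987, "c3")]   # scaled d_min, class per site
sites = [(*keypair(), d_min, cls) for d_min, cls in local_results]

# --- steps 6-10: each slave encrypts its d_min with its shared secret -----
ciphertexts = []
for pk_s, B_s, d_min, cls in sites:
    shared = pow(A, pk_s, p)                      # S_ks on the slave side
    ciphertexts.append(((d_min * shared) % p, cls, B_s))   # Eq. (1); class sent alongside

# --- steps 11-13: master decrypts with B_s^{pk_m} -------------------------
decrypted = []
for c, cls, B_s in ciphertexts:
    shared = pow(B_s, pk_m, p)                    # same S_ks recomputed at the master
    decrypted.append(((c * pow(shared, -1, p)) % p, cls))  # Eq. (2)

# --- steps 14-17: mapping table weights and global class ------------------
d = [dm for dm, _ in decrypted]
weights = [sum(1 if d[i] - d[j] >= 0 else -1 for j in range(len(d)))
           for i in range(len(d))]
print(decrypted[weights.index(min(weights))][1])  # -> 'c2' (smallest local d_min)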
Global computation in the KNN classifier:
Every client/site sends its encrypted local d_{i,min} and corresponding class value cl_i to the master client (acting as a trusted third party, TTP). After decrypting each d_{i,min} and its class label cl_i, the master client holds a sequence of d_{i,min}, 1 <= i <= n (for n parties/sites), which it uses to construct the permutation mapping table. To construct the mapping table we compare every value in the sequence with the other values; if the result is greater than or equal to zero the entry inserted in the permutation mapping table is +1, otherwise it is -1. For example, if d_{1,min} - d_{2,min} >= 0 the value in the mapping table is +1, otherwise it is -1.
As an example, consider the sequence d_{1,min}, d_{2,min}, d_{3,min}, d_{4,min} of four parties/sites with d_{2,min} < d_{1,min} < d_{4,min} < d_{3,min}. Then (d_{1,min} - d_{2,min}), (d_{1,min} - d_{3,min}), (d_{1,min} - d_{4,min}), (d_{2,min} - d_{3,min}), (d_{2,min} - d_{4,min}), (d_{3,min} - d_{4,min}) map to {+1, -1, -1, -1, -1, +1}, as presented in Table 1. The weight of any element in the sequence relative to the others is the algebraic sum of the row corresponding to that element. Since d_{2,min} has the smallest weight, -2, its corresponding class label is taken as the predicted value for the query instance.
Table 1: An example of a permutation mapping table

         | d1_min | d2_min | d3_min | d4_min | weight
d1_min   |   +1   |   +1   |   -1   |   -1   |    0
d2_min   |   -1   |   +1   |   -1   |   -1   |   -2
d3_min   |   +1   |   +1   |   +1   |   +1   |   +4
d4_min   |   +1   |   +1   |   -1   |   +1   |   +2
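The short Python sketch below reproduces the master-site global step for the example of Table 1: build the +1/-1 mapping table from the decrypted local minimum distances, sum each row into a weight, and return the class label whose row has the smallest weight. The numeric distances and class names are illustrative only.

def global_class(d_mins, classes):
    n = len(d_mins)
    # +1 if d_i - d_j >= 0, else -1 (diagonal entries compare a value with itself)
    table = [[1 if d_mins[i] - d_mins[j] >= 0 else -1 for j in range(n)]
             for i in range(n)]
    weights = [sum(row) for row in table]
    winner = weights.index(min(weights))      # smallest weight = smallest distance
    return classes[winner], table, weights

d_mins  = [0.40, 0.10, 0.90, 0.70]            # d2 < d1 < d4 < d3, as in Table 1
classes = ["c1", "c2", "c3", "c4"]
cls, table, weights = global_class(d_mins, classes)
print(weights)    # [0, -2, 4, 2], matching the weight column of Table 1
print(cls)        # 'c2', the class of the smallest local distance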
[Fig. 4, described in place of the original flowcharts:
(a) ECC-based flow: input K, X_p, and DB_1, DB_2, ..., DB_n; the master and slave sites generate their public and private key pairs as in Fig. 2; the master and slave sites exchange their public keys and the shared secret keys are constructed; the KNN mining process is performed at each slave site and the local properties are generated; the local properties are encrypted using the shared secret key at each slave site and then transferred to the master site; the local properties are decrypted at the master site and the mapping table is constructed; the predicted class corresponding to the minimum weight in the mapping table is established.
(b) RSA-based flow: input K, X_p, and DB_1, DB_2, ..., DB_n; the master site generates public and private key pairs for the n sites: (e_1, d_1), (e_2, d_2), ..., (e_n, d_n); the public keys (e_1, e_2, ..., e_n) are sent from the master to the slave sites, one for each site; the KNN mining process is performed at each slave site and the local properties are generated; the local properties are encrypted using the public key at each slave site and then transferred to the master site; the local properties are decrypted at the master site using the corresponding private key and the mapping table is constructed; the predicted class corresponding to the minimum weight in the mapping table is established.]
Fig. 4: The process of the proposed protocol using ECC (a) and RSA (b) encryption algorithms.
V. Implemented System
To show the applicability of the proposed protocol and its performance when dealing with different numbers of sites/parties, the protocol has been tested on various real-world datasets listed in Table 2 [35]. In these experiments, the datasets are securely and horizontally distributed among 2, 3, 4, 5, and 6 sites/parties.
The data object sets were generated on each local site/party independently. For the central reference classification we used the union of the local data object sets. As we assume that the central classification is optimal, we measure the performance time of our proposed approach w.r.t. the central encrypted classification. We varied both the number of data objects and the number of client sites. We compared the proposed protocol to a single run of KNN classification on all data objects. In order to evaluate the proposed protocol, we carried out the local classifications sequentially. We collected the encrypted representatives of all local runs, and then applied a global classification on these representatives after the decryption process. For all these steps we always used the same computer. All experiments were developed using C# (Standard Edition 2010) on an Intel Core 2 Duo 2.0 GHz machine with 4 GB RAM.
Table 2: Data Sets

Dataset Name             | Attribute Characteristics | Number of Instances | Number of Attributes | Number of Classes | Area
Landsat satellite images | Integer                   | 4435                | 36                   | 7                 | Physical
Radar return             | Integer, Real             | 210                 | 34                   | 2                 | Physical
Adult                    | Categorical, Integer      | 6000                | 13                   | 2                 | Social
V.2 Performance Tests
The experiments performed are as follows:
The first experiment is done on a real dataset called Statlog (Landsat Satellite), made available by NASA. This database was generated from Landsat Multi-Spectral Scanner image data. The database consists of the multi-spectral values of pixels in 3x3 neighborhoods of a satellite image, and the classification associated with the central pixel in each neighborhood. In the sample database, the class of a pixel is coded as a number. This sample dataset contains 36 attributes with 6,435 instances, and its aim is to predict the soil class of an image land instance. Table 3 shows the performance and accuracy of this experiment. Table 4 compares the performance and accuracy of the RSA encryption method with the proposed one.
The second experiment is done on the radar dataset, collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere; "bad" returns are those that do not, i.e. their signals pass through the ionosphere. This sample dataset contains 34 attributes with 351 instances. Table 5 shows the performance and accuracy of this experiment. Table 6 compares the performance and accuracy of the RSA encryption method with the proposed one.
The last experiment is done on the Adult dataset, also known as the Census Income dataset, which predicts whether income exceeds $50K/year based on census data. This dataset contains 14 attributes with 48,842 instances. Table 7 shows the performance and accuracy of this experiment. Table 8 compares the performance and accuracy of the RSA encryption method with the proposed one.
For the comparative study we use the following notations:
- Relative time overhead = execution time with proposed encryption / execution time without proposed encryption.
- Speedup = execution time of RSA / execution time of proposed encryption.
- Relative accuracy = accuracy of RSA / accuracy of proposed encryption.
Table 3: Execution time and accuracy of the distributed/centralized Land Satellite Images dataset

No. of sites | Dataset size (No. of records) | Distributed: Accuracy % | Distributed: Time without encryption (ms) | Distributed: Time with encryption (ms) | Relative time overhead | Centralized (encrypted): Accuracy % | Centralized (encrypted): Time (ms)
2 | 1490 | 82.1 |  9435 |  9866 | 1.04 | 85.3 | 16790
3 | 2235 | 83.9 | 10530 | 11650 | 1.10 | 87.0 | 24012
4 | 2980 | 84.1 | 11467 | 12884 | 1.12 | 87.3 | 30162
5 | 3725 | 83.0 | 12690 | 14883 | 1.17 | 87.6 | 36993
6 | 4470 | 84.3 | 13991 | 16482 | 1.18 | 88.1 | 44646
Fig. 5: The overhead of the proposed encryption algorithm on the execution time of the distributed Land Satellite Images dataset (execution time in ms vs. number of sites, with and without encryption).
Fig. 6: The execution time of the proposed encryption algorithm on the (distributed/centralized) Land Satellite Images dataset (execution time in ms vs. number of sites).
Table 4: Execution time and accuracy of the proposed and RSA encryption protocols for the distributed Land Satellite Images dataset

No. of sites | Dataset size (No. of records) | Proposed encryption time (ms) | RSA encryption time (ms) | Speedup | Proposed encryption accuracy % | RSA encryption accuracy % | Relative accuracy
2 | 1490 |  9866 | 13392 | 1.36 | 82.1 | 83.5  | 1.02
3 | 2235 | 11650 | 17491 | 1.50 | 83.9 | 84.5  | 1.01
4 | 2980 | 12884 | 22783 | 1.77 | 84.1 | 84.75 | 1.01
5 | 3725 | 14883 | 25714 | 1.73 | 83.0 | 83.25 | 1.00
6 | 4470 | 16482 | 29975 | 1.78 | 84.3 | 84.0  | 0.99
Fig. 7: Execution time of the proposed and RSA encryption algorithms for the distributed Land Satellite Images dataset (execution time in ms vs. number of sites).
Table 5: Execution time and accuracy of the distributed/centralized Radar Return dataset

No. of sites | Dataset size (No. of records) | Distributed: Accuracy % | Distributed: Time without encryption (ms) | Distributed: Time with encryption (ms) | Relative time overhead | Centralized (encrypted): Accuracy % | Centralized (encrypted): Time (ms)
2 |  70 | 87.23 | 429 |  749 | 1.74 | 89.36 | 1614
3 | 105 | 89.36 | 568 |  801 | 1.40 | 92.20 | 1923
4 | 140 | 92.20 | 704 |  956 | 1.35 | 92.91 | 1951
5 | 175 | 92.91 | 838 | 1185 | 1.41 | 93.10 | 2313
6 | 210 | 92.10 | 987 | 1375 | 1.39 | 93.62 | 2385
Fig. 8: The overhead of the proposed encryption algorithm on the execution time of the distributed Radar Return dataset (execution time in ms vs. number of sites, with and without encryption).
Fig. 9: The execution time of the proposed encryption algorithm on the (distributed/centralized) Radar Return dataset (execution time in ms vs. number of sites).
Table 6: Execution time and accuracy of the proposed and RSA encryption protocols for the distributed Radar Return dataset

No. of sites | Dataset size (No. of records) | Proposed encryption time (ms) | RSA encryption time (ms) | Speedup | Proposed encryption accuracy % | RSA encryption accuracy % | Relative accuracy
2 |  70 |  749 | 3672 | 4.90 | 87.23 | 86.52 | 0.99
3 | 105 |  801 | 5006 | 6.24 | 89.36 | 87.23 | 0.97
4 | 140 |  956 | 6843 | 7.15 | 92.20 | 91.41 | 0.99
5 | 175 | 1185 | 8188 | 6.90 | 92.91 | 92.91 | 1.00
6 | 210 | 1375 | 9888 | 7.19 | 92.10 | 92.91 | 1.01
Fig. 10: Execution time of the proposed and RSA encryption algorithms for the distributed Radar Return dataset (execution time in ms vs. number of sites).
Table 7: Execution time and accuracy of the distributed/centralized Adult dataset

No. of sites | Dataset size (No. of records) | Distributed: Accuracy % | Distributed: Time without encryption (ms) | Distributed: Time with encryption (ms) | Relative time overhead | Centralized (encrypted): Accuracy % | Centralized (encrypted): Time (ms)
2 | 2000 | 78.56 | 4939 |  5960 | 1.20 | 78.23 | 10890
3 | 3000 | 79.38 | 5985 |  7223 | 1.20 | 79.88 | 13148
4 | 4000 | 79.62 | 6930 |  8319 | 1.20 | 80.25 | 16606
5 | 5000 | 80.04 | 8610 | 10100 | 1.17 | 80.52 | 20490
6 | 6000 | 80.62 | 9520 | 11404 | 1.19 | 81.97 | 24013
Fig. 11: The overhead of the proposed encryption algorithm on the execution time of the distributed Adult dataset (execution time in ms vs. number of sites, with and without encryption).
Fig. 12: The execution time of the proposed encryption algorithm on the (distributed/centralized) Adult dataset (execution time in ms vs. number of sites).
Table 8: Execution time and accuracy of the proposed and RSA encryption algorithms for the distributed Adult dataset

No. of sites | Dataset size (No. of records) | Proposed encryption time (ms) | RSA encryption time (ms) | Speedup | Proposed encryption accuracy % | RSA encryption accuracy % | Relative accuracy
2 | 2000 |  5960 | 25429 | 4.26 | 78.56 | 78.85 | 1.003
3 | 3000 |  7223 | 34749 | 4.81 | 79.38 | 79.69 | 1.003
4 | 4000 |  8319 | 44451 | 5.34 | 79.62 | 79.79 | 1.002
5 | 5000 | 10100 | 54860 | 5.43 | 80.04 | 80.42 | 1.002
6 | 6000 | 11404 | 55131 | 4.83 | 80.62 | 80.88 | 1.003
Fig. 13: Execution time of the proposed and RSA encryption algorithms for the distributed Adult dataset (execution time in ms vs. number of sites).
- The proposed protocol is run with K = 3 for the KNN classification algorithm. We performed the distributed classification with and without the proposed encryption to measure how the proposed encryption algorithm affects system performance.
- Figs. 5, 8, and 11 show the execution time for the distributed Land Satellite Images, Radar Return and Adult datasets with and without the proposed encryption scheme. The experimental results show that the overhead in performance time does not exceed 20% relative to the non-encrypted case for the large datasets, e.g. Adult and Land Satellite Images, which means that the encryption overhead of the proposed protocol is within an acceptable range.
- Figs. 6, 9, and 12 show the execution time of the encrypted centralized system compared to the proposed encrypted distributed one. Comparing the execution time for the encrypted distributed datasets of 6 sites with the centralized case, the reduction in execution time for the proposed distributed system of 6 sites with a total of 4470 records of the Land Satellite Images dataset is not less than 41% relative to the centralized one with the same dataset size, as in Table 3. For the Radar Return dataset in Table 5, the proposed distributed system of 6 sites with a total of 210 records reduces the execution time to more than 48% of the centralized one. Moreover, in Fig. 12 the reduction in performance time relative to the centralized case for the proposed distributed system of 6 sites with a total of 6000 records of the Adult dataset is above 45%. This shows how encrypted centralized classification affects system performance compared with the proposed distributed encrypted scheme.
- Figs. 7, 10, and 13 show that the proposed protocol has better performance than the RSA encryption algorithm with comparable accuracy. The speedup ranges from 1.36 to 7.19 for the tested datasets.
VI. Analysis and evaluation of the proposed protocol
To evaluate a privacy-preserving algorithm, three main issues have to be discussed: privacy (security), accuracy and efficiency.
- Privacy-preserving analysis
Since all the model parameters are completely present at all parties, evaluation can be performed easily. The party that wants to classify an instance using the KNN evaluation procedure can do so locally, so no interaction between parties is needed. Thus there is no question of privacy being revealed or compromised. The result of the local classification is encrypted with the local shared secret key, which is semantically secure, and transmitted to the master site. According to the composition theorem [21], if the main protocol is partitioned into sub-protocols such that the output of one sub-protocol is the input to the next and all intermediate results are kept private, then the whole protocol is privacy preserving.
 So each party P_i encrypts its output pair d_{i,min} and corresponding class value cl_i as per Eq. (1): the plaintext is m_s = {d_{s,min}, cl_s}, 1 <= s <= n, and E_{S_{k_s}}(m_s) = m_s * g^{pk_m pk_s} mod p is the cipher encryption of the output pair. These encrypted values are transmitted to the master client for the global classification. Thus the output of each party is securely transmitted to the master client to compute the global classification without leaking any information about the private data of a party except its output.
 ECC is semantically secure due to the difficulty of the Elliptic Curve Discrete Logarithm Problem. Using ECC in combination with the Diffie-Hellman protocol is also believed to make the public key encryption more secure.
 The participants and the master site learn nothing but the results, as in secure computation.
 No realized attack is possible, since the master site does not hold any part of the database.
- Accuracy of the proposed protocol
The master client, which decrypts d_{i,min} and the corresponding class value cl_i, produces accurate results with the ECC cryptosystem. As shown in Tables 3, 5, and 7, the accuracy of the classifier for 2 to 6 parties is 78-92%; it varies with the dataset size and the number of parties, but the accuracy range is still acceptable and, as the number of parties increases, the accuracy improves.
- Efficiency of the proposed protocol
The efficiency gain of the algorithm is mainly reflected in the decrease in time complexity. The proposed KNN classifier algorithm reduces the time complexity mainly in two respects.
 First, the global d_{g,min} and corresponding class value cl_g are generated quickly, since the KNN classifier algorithm is executed locally at every party P_i. This enables solutions where the communication cost is independent of the size of the database and greatly cuts down communication costs compared with centralized data mining, which needs to transfer all data into a data warehouse before performing the data mining algorithm.
 Second, the encryption/decryption key size is shorter than in other public key encryption methods (e.g. RSA) for the same level of security.
- Complexity analysis of the protocol
a- Communication cost
Let β and γ denote the number of bits of the public key and of the ciphertext sent to the master site, respectively, and let n be the total number of sites/parties. The total communication cost is 2nβ + nγ, arising from steps 4 and 9 of the proposed protocol.
b- The computational cost is affected by:
 the generation of n cryptographic key pairs and of n shared secret keys;
 a total of n encryptions and n decryptions;
 the complexity of the local KNN algorithm, O(kqr), where r is the dimension of a training sample, q is the number of training samples and k is the parameter of the KNN algorithm, plus additional computations of n^2 additions, n(n-1) subtractions and n log(n) operations for sorting n numbers.
Therefore, the complexity of the KNN classifier for n parties dominates not only the other computational costs but also the communication cost. Consequently, the overall complexity of the proposed model is O(nkqr).
Conclusion:
In this paper we have provided a solution for K Nearest Neighbor classification with horizontal collaboration using the integration of the Elliptic Curve Cryptography (ECC) public key cryptosystem and Diffie-Hellman key exchange. To analyze the overall efficiency, not only the computational cost but also the communication cost is considered when examining the performance of the proposed protocol. The computational time of the proposed protocol is compared with the RSA public key encryption algorithm as well as with the centralized data mining approach using real-world datasets. Experimental results show that the overhead in execution time due to the proposed protocol does not exceed 20% compared with the non-encrypted distributed scheme for large datasets such as the Land Satellite Images and Adult datasets, while it gives not less than a 40% reduction in execution
time relative to the centralized scheme with the same dataset size. Experimental results also show that the proposed protocol with ECC has better performance than the RSA encryption algorithm at the same level of security. As demonstrated through the theoretical analysis and the experimental results, the proposed protocol achieves significant improvements in terms of privacy, accuracy and efficiency.
References:
[1] Yehuda Lindell and Benny Pinkas. “Privacy Preserving Data Mining”. In Proceedings of the 20th Annual
International Cryptology Conference (CRYPTO), pages 36–54, Santa Barbara, CA, USA, 2000.
[2] R. Agrawal, A. Evfimievski, and R. Srikant, "Information sharing across private databases," in SIGMOD
’03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data. NewYork,
NY, USA: ACM Press, pp. 86–97, 2003.
[3] D. E. O'Leary, "Some privacy issues in knowledge discovery: The OECD personal privacy guidelines," IEEE
Expert: Intelligent Systems and Their Applications, vol. 10, no.2, pp. 48–52, 1995.
[4] J. Vaidya and C. Clifton, “Privacy-preserving k-means clustering over vertically partitioned data,”in KDD
’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data
mining. NewYork, NY, USA: ACM, pp.206–215, 2003.
[5] G. Jagannathan and R. N. Wright, “Privacy-preserving distributed k-means clustering over arbitrarily
partitioned data,”in KDD’05: Proceedings of the eleventh ACM SIGKDD international conference on
Knowledge discovery in data mining. New York, NY, USA: ACM, pp.593–599, 2005.
[6] M. Gorawski and L. Slabinski, “Implementation of homomorphic encryption in privacy-preserving clustering
distributed data” , in 2nd ADBIS Workshop on Data Mining and Knowledge Discovery, Thessaloniki,
Greece, 6 September 2006 in conjunction with 10th East-European Conference on Advances in Databases and
Information Systems ADBIS’ , pp.37–47, September 2006.
[7] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright, “A new privacy-preserving distributed k-clustering
algorithm”, in SDM, 2006.
[8] J. Vaidya and C. Clifton, “Privacy preserving association rule mining in vertically partitioned data”, in KDD’
02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data
mining. New York, NY, USA: ACM, pp. 639–644, 2002.
[9] M. Kantarcioglu and C. Clifton, “Privacy-Preserving Distributed Mining of Association Rules on
Horizontally Partitioned Data”, IEEE Transactions on Knowledge and Data Engineering, vol.16 , no.9,
pp.1026-1037, 2004.
[10] J. Vaidya, M. Kantarcioglu, and C. Clifton, “Privacy-preserving naïve bayes classification ”, VLDBJ., vol.
17, no. 4, pp. 879–898, 2008.
[11] M. Kantarcioglu and J. Vaidya, "Privacy-Preserving Naïve Bayes Classifier for Horizontally Partitioned Data", IEEE Workshop on Privacy-Preserving Data Mining, 2003.
[12] D. Meng, K. Sivakumar, and H. Kargupta, “Privacy-sensitive bayesian network parameter learning”, in
ICDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining. Washington, DC,
USA: IEEE Computer Society, pp. 487–490, 2004.
[13] R. Wright and Z. Yang, “Privacy-preserving bayesian network structure computation on distributed
heterogeneous data.” in KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on
Knowledge discovery and data mining. New York, NY, USA: ACM, pp. 713–718, 2004.
[14] O. Goldreich, S. Micali, and A.Wigderson,“How to play any mental game”, in STOC ’87: Proceedings of
the nineteenth annual ACM symposium on Theory of computing. New York, NY, USA: ACM, pp. 218 –229,
1987.
[15] A. C. Yao, "How to generate and exchange secrets", in Proceedings of the 27th Annual Symposium on
Foundations of Computer Science. Washington, DC, USA: IEEE Computer Society, pp.162–167, 1986.
[16] Justin Zhan , Liwu Chang and , Stan Matwin “Privacy Preserving K-nearest Neighbor Classification”, in
International Journal of Network Security ,2005.
[17] J. Zhan, and S. Matwin. “A Crypto-Based Approach to Privacy-Preserving Collaborative Data Mining”. In
Sixth IEEE international conference of data mining, IEEE, 2006.
[18] J. Zhan. “Privacy-Preserving Collaborative Data Mining”. PhD thesis, Department of Computer Science,
University of Ottawa, IEEE, 2006.
[19] Shaneck, M., Kim, Y., Kumar, V. “Privacy Preserving Nearest Neighbor Search”. In: 6th IEEE International
Conference on Data Mining Workshops, pp. 541–545, 2006.
[20] Yinian Qi and Mikhail J. Atallah, "Efficient Privacy-Preserving k-Nearest Neighbor Search". In: 28th
International Conference on Distributed Computing Systems, pp. 311–319, 2008.
[21] Ankit Chouhan, Sankita Patel, and D. C. Jinwala. “Comparative analysis of elliptic curve cryptography
based algorithms to implement privacy preserving clustering through secure multiparty computations”. In
Information Security, Scientific Research, 2013.
[22] Pooja Gupta, Ashish Kumar.” Two phase privacy preserving data mining”. In ComEngApp-Journal, vol. 2,
No. 3, 2013.
[23] R. Agrawal and R. Srikant. “Privacy-preserving data mining”. In Proceeding of the ACM SIGMOD
Conference on Management of data, pages 439-450. ACM Press May 2000.
[24] D. Agrawal and C. Aggarwal. “On the design and quantification of privacy preserving data mining
algorithms”. In the Proceeding of the 20th ACM SIGACT- SIGMOD-SIGART Symposium on Principles of
database System, pages 247-255, Santa Barbara, CA, May 21-23, 2001.
[25] A. C. Yao, "How to generate and exchange secrets", in Proceedings of the 27th Annual Symposium on
Foundations of Computer Science. Washington, DC, USA: IEEE Computer Society, pp.162–167, 1986.
[26] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu, "Tools for privacy preserving distributed
data mining,” ACM SIGKDD Explorations, vol. 4, 2003.
[27] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes”, Advances in
Cryptography - EUROCRYPT '99, pp. 223-238, Prague, Czech Republic, 1999.
[28] W. Diffie, and M. Hellman, “New directions in cryptography”, IEEE Transactions on Information Theory,
vol. 22, pp. 644-654, 1976.
[29] I.F. Blake, G. Seroussi, and N. P. Smart, Elliptic curves in cryptography, Cambridge: Cambridge University
Press, 2000.
[30] N. Koblitz, “Elliptic Curve Cryptosystems”, Mathematics of Computation, 48, pp. 203- 209, 1987.
[31] Darrel Hankerson, Alfred Menezes and Scott Vanstone, Guide to Elliptic Curve Cryptography.
[32] Gallian, Joseph, Contemporary Abstract Algebra (6th ed.), Boston: Houghton Mifflin, 2006.
[33] Nitin Bhatia and Vandana, Survey of Nearest Neighbor Techniques, in (IJCSIS) International Journal of
Computer Science and Information Security, Vol. 8, No. 2, 2010.
[34] T. Cover and P. Hart. Nearest neighbor pattern classification. In IEEE Transaction of Information Theory,
Vol. 13, pp.21-27, January, 1968.
[35] Ronny Kohavi and Barry Becker, “UCI Repository of Machine Learning Databases”,
http://archive.ics.uci.edu/ml/datasets.html, Data Mining and Visualization, Silicon Graphics, 1996.