International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 Secure Mining of Classification in Horizontally Distributed Databases Mohamed A.Ouda, Sameh A. Salem, Ihab A. Ali, and El-Sayed M.Saad Department of Communication and Computer, Faculty of Engineering Helwan University, Cairo - Egypt Abstract Data mining can extract interesting knowledge from large amount of data, but in distributed databases these large amounts of data is split among different parties. Privacy concerns may prevent parities from directly sharing the data which could lead to mutual benefit. This paper addresses a proposed secure mining protocol using K-Nearest Neighbor (KNN) classification method over horizontally partitioned data. The proposed protocol depends mainly on the integration of Elliptic Curve Cryptography (ECC) public key cryptosystem and Diffie-Hellman Key Exchange. Comparisons with RSA public key cryptosystem show that the proposed protocol has relatively achieved the best performance. A security analysis of the presented protocols along with the experimental results show their applicability with different parameters such as number of parties involved, and the size of the underlying datasets. Keywords: privacy-preserving; secure multi-party computation; K- Nearest Neighbor algorithm. I. Introduction Data mining techniques are used increasingly by private and public sectors to identify patterns, trends and exceptions. Data mining functionalities include establishing correlations through association rules, supervised and unsupervised machine learning through classification and clustering analysis, rare events analysis through outlier detection and trend analysis for sequential and time series. Data mining techniques in general are performed on two types of data environment: centralized databases or data warehouses and decentralized data sets. Here we use the distributed classification discovery as a scenario. The issue of violating the privacy of individuals when data to be shared is still one of the main drawbacks of distributed data mining. We study the problem of secured classification mining in horizontally partitioned databases. In that model there are several sites that hold homogeneous databases, which mean databases share the same set of attributes but hold information on different entities. This paper is organized as follows: Literature survey and related work in section 2 .Section 3 introduces the cryptographic privacy preserving mechanisms that are used. Section 4 shows the proposed protocol. Section 5 presents the implemented system. Analysis and evaluation of proposed protocol is shown in section 6 and finally the conclusion. II. Related work A research on privacy-preserving data mining has become of great interest since few years ago [1]-[3]. There have been proposed privacy preserving algorithms for different data mining applications, including clustering [4]-[7], association rules mining across multiple databases [8]-[10], Bayes classification [11]-[13], and decision trees [1][2]. The Secure Multiparty Computation paradigm provides cryptographic solutions for protecting privacy in any distributed computation [14][15]. Works most related to ours are [16]-[20]. Herein, we study applying Elliptic Curve Cryptography (ECC) integrated with Diffie-Hellman Key Exchange for R S. Publication (rspublication.com), [email protected] Page 108 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 secured KNN classification mining. The work done on privacy preserving data mining using ECC can be also shown in [21][22]. The contributions of this paper contain the following: A solution for KNN classification with horizontal collaboration. Privacy and efficiency analysis to show the performance scaling up with various factors. II.1 Homogeneous collaborative classifying mining with privacy preserving There have been two approaches for privacy-preserving in multiparty classification mining. One is a randomization approach [23][24] in which altering or perturbing the data before submitting to the data miner such that the actual individual data cannot be recovered, while certain computations can still be applied to the data. The other is a cryptographic approach mostly using SMC (Secure Multiparty Computation) [25][26]. Most efficient privacy preserving solutions can be often designed for specific distributed computations. We focus on SMC-based classification mining algorithms on horizontally partitioned data in the cryptographic approach. The SMC literature defines two basic adversarial models : - Semi-Honest Model: Semi-honest adversaries follow the protocol faithfully, but can try to infer the secret information of the other parties from the data they see during the execution of the protocol . - Malicious Model: Malicious adversaries may do anything to infer secret information. They can abort the protocol at any time, send spurious messages, spoof messages, collude with other (malicious) parties, etc. III. Cryptographic privacy preserving mechanisms: A- RSA Public-Key Cryptographic Algorithm RSA is the most widely used in public-key cryptosystem [27]. Its security depends on the fact of number theory in which the factorization of big integer is very difficult. In RSA algorithm as in Fig.(1), key-pair (e, d) is generated by the receiver, who posts the encryption-key e on a public media, while keeping the decryption-key d secret. Fig. 1: Public-key encryption schemes: an illustration B- Diffe-Hellman Protocol The Diffie-Hellman protocol [28] is the basic public-key cryptosystem proposed for secret key sharing. If party 𝐴 and party 𝐵 first agree to use a specific curve, field size, and type of mathematics, they then share the secret key by process as follows. We can see that we just need scalar multiplication 𝑃 in order to implement the Diffie-Hellman protocol. 1. 𝐴 and 𝐵 each chose random private key 𝑘𝑎 and 𝑘𝑏 2. 𝐴 and 𝐵 each calculate (𝑘𝑎 𝑃) and (𝑘𝑏 𝑃), and send them to opposite side. 3. 𝐴 and 𝐵 both compute the shared secret key 𝑆𝐾 = 𝑘𝑎 (𝑘𝑏 𝑃) = 𝑘𝑏 (𝑘𝑎 𝑃). C- Elliptic Curve Cryptography The elliptic curves we are using are restricted to elements of a finite field. In our work we use ECC defined over odd prime field 𝐹𝑝 . Introduction to elliptic curves and their usage in cryptography can be found in [29]. Elliptic curve cryptosystems over finite field have some advantages like the key size can be much smaller R S. Publication (rspublication.com), [email protected] Page 109 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 compared to other cryptosystems like RSA, e.g. a 160-bit key in ECC is considered to be as secure as 1024-bit key in RSA [30][31]. This leads to faster encryption/decryption process. The security of ECC depends on the difficulty of Elliptic Curve Discrete Logarithm Problem (ECDLP), which states that, “Given an elliptic curve E defined over a finite field 𝐹𝑝 of order 𝑛 , a point 𝑃 ∈ 𝐸 (𝐹𝑝 ) , and a point 𝑄 ∈ 𝐸 (𝐹𝑝 ) , find the integer 𝑘 ∈ [0, 𝑛 − 1] such that 𝑄 = 𝑘 𝑃. The integer 𝑘 is called the discrete logarithm of Q to the base P. It is computationally infeasible to obtain 𝑘, if 𝑘 is sufficiently large. Hence the main operation involved in ECC is point multiplication. i.e. multiplication of a scalar 𝑘 with any point 𝑃 on the curve to obtain another point 𝑄 on the curve. IV. Proposed privacy-preserving classification in homogeneous collaborative environment IV.1 proposed model for privacy Preserving: -Key Generation for Privacy Preserving Model: In our proposed protocol we extend Elliptic Curve [30] Diffie-Hellman Key Exchange [28] to be multiparties’ cryptosystem for distributed environment dataset. In the proposed model, master site and distributed slave sites on both ends of communication send a public key which can be seen by anyone. The public key is then combined with the private key to create a shared secret key which, due to the underlying mathematics, is the same on both sides. This shared secret key is then used to hash a new key that can be used by either site (master and slave site) for encrypting and decrypting messages. 𝑔, 𝑝, 𝐴 Slave site1 B1 g pk1 mod p Master site 𝐵1 S k1 A pk1 mod p 𝑔, 𝑝, 𝐴 A g pk m mod p 𝑔, 𝑝, 𝐴 Slave site2 B2 g pk 2 mod p S k2 A pk2 mod p Sk B1 pk mod p m 1 S k 2 B2 𝐵2 pk m mod p S k n Bn pk m mod p 𝑔, 𝑝, 𝐴 Slave site n Bn g pkn mod p 𝐵𝑛 S kn A pkn mod p Fig.2: Generation of shared secret keys S ks at the master and slave sites Where: p : is a prime number on which the finite field Fp is defined. g : is a base point taken from the elliptic group. pk s , pk m : are private keys of slave site S s and the master site S m respectively selected from the interval [ 1, p 1 ], 1 s n . Bs , A : are public keys of slave site S s and the master site S m respectively. Shared secret key S k s Sks mod( A pks , p) mod(mod(g pkm , p) pks , p) mod(mod(g pkm pks , p)) mod(mod(g pks , p) pkm , p) mod(Bs pkm , p) -Proposed Encryption Scheme: 1) Let (G, E, D, M) be a ECC cryptography scheme, where G is an algorithm generating keys, E and D are the encryption and decryption algorithms, and M is the message space. R S. Publication (rspublication.com), [email protected] Page 110 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 2) Let s, 1 s n , ( n number of sites/parties), and k is number of nearest neighbors beforehand. 3) At the master site and each slave site the public and private key pair of ECC algorithm is generated. Key pair for a slave site s is : ( Bs , pk s ) , 1 s n , and for the master site is: ( A, pkm ) 4) Each slave party/site s exchanges public encryption key with the master site. 5) The shared secret key S k is generated for each slave site s, 1 s n and the master site. This shared secret s key is then used by both master and slave site for encrypting and decrypting of messages. Each slave site s calculate its shared secret key as follows: S k s mod( A pks , p) mod(g pkm pks , p) Where A is the public key of the master site, A mod(g pkm , p) . While the shared secret key at the master site corresponding each slave site is calculated as follows: S k s mod(Bs pkm , p) , Bs mod(g pks , p) , where Bs is the public key of site s . S k s mod(g pk s ) pkm , p) mod(g pkm pk s , p) . 6) Given a minimum distance ds min M messages where, ms {ds min , cls } , 1 s n 7) the encrypted values are computed as: and corresponding class label cls M , at site ES k (ms ) mod(ms g pkm pks , p), 1 s n s si as plaintext (1) - Proposed Decryption Scheme: To decrypt ES k (ms ) mod(ms g pkm pks , p) at the master site, the decryption key is calculated which represents s the inverse value of the shared key S k , then DS k mod((g pkm pks ) 1, p) (S k s ) 1 , such that: s 1 (Sk s ( ES k (ms ))) (mod((g ) , p)(mod(ms g pkm pks , p)) mod((g pkm pks )1 g pkm pks ms , p) mod(ms , p) s then, DS k ( ES k (ms )) mod(ms , p) ms , s s pkm pks 1 s 1 s n (2) - Correctness of the proposed model: |F | As | Fp | , is the order of the finite group Fp then, x 1 for all x in Fp , as established from Lagrange's p theorem in group theory [32]. The order | Fp | , of the group Fp is known for all sites. a- The value DS mod((g pk m pk s ks ) 1 , p) at the master client will be calculated as follows: It is known its private key pk m , and the public key Bs of slave site s , Bs mod(g pks , p) then: mod((g pks , p)|Fp| pkm , p) mod(g |Fp| pks pkm pks , p) mod(g |Fp| pks g pkm pks , p) mod(1 pks g pkm pks , p) mod(g pkm pks , p) mod((g pkm pks ) 1, p) b- The value DSk mod((g pkm pks ) 1 , p) at the slave site will be computed as follows: s It is known its private key pk s , and the public key of master site A mod(g pk , p) m then , mod((g pkm , p)|Fp| pks , p) mod(g pkm|Fp| pkm pks , p) mod(g |Fp| pkm g pkm pks , p) R S. Publication (rspublication.com), [email protected] Page 111 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 mod(1 pkm g pkm pks , p) mod((g pkm pks ) 1, p) The result of decryption from one site s is ms {d s min , cls } which represent the minimum distance of k neighbors and corresponding dominant class label to produce the global predicted class of the classifier. In homogeneous collaborative classification mining it is required to get data mining results over the whole database 𝐷𝐵 = 𝐷𝐵1 ∪ 𝐷𝐵2 ∪ … ∪ 𝐷𝐵𝑛 . where each DBi (1 i n) represents a partial database located at one of the distributed sites P1 , P2 ,...,Pn (i.e. n -divisions). Each partial database DBi has r attributes and different number of entities. IV.2 The K-Nearest Neighbor classifier: Standard data mining algorithm K-Nearest Neighbor classification [33][34] is an instance based learning algorithm that has been shown to be very effective for a variety of problem domains. The objective of k-nearest neighbor classification is to discover k nearest neighbors for a given instance, then assign a class label to the given instance according to the majority class of the k nearest neighbors as shown in Fig. 3. The standard KNN algorithm can be stated as follows: a) Consider learning discrete-valued target functions of the form 𝑓 ∶ 𝑅 𝛾 → 𝐶, where 𝐶 is the finite set 𝑐1 , 𝑐2 , . . , 𝑐𝑠 , where 𝑐1 , 𝑐2 , . . , 𝑐𝑠 are class values. b) Building training set: For each training example(𝑥, 𝑓(𝑥)), add the example to the list training set. c) Classification algorithm: Given a query instance 𝑋𝑝 to be classified, let 𝑥1 , 𝑥2 , . . , 𝑥𝑘 denote the k instances from training set that are nearest to 𝑋𝑝 . Then return 𝑓 𝑋𝑝 ← arg 𝑚𝑎𝑥𝑐∈𝐶 𝑘𝑖=1 𝛿(𝑐, 𝑓(𝑥𝑖 )) , where 𝛿 𝑎, 𝑏 = 1 if 𝑎 = 𝑏 and 𝛿 𝑎, 𝑏 = 0 otherwise. The value 𝑓 𝑋𝑝 represents the dominant class value of query instance 𝑋𝑝 . To build a k-nearest neighbor classifier, the key point is to privately obtain k-nearest instances for a given point which will be achieved in the proposed protocol. - The distance function used in this work is the standard Euclidean distance which is defined as : D( X , Y ) r i 1 ( X [i] Y [i])2 where r is the dimension space of an instance 𝑋 , 𝑋[𝑖] denote the i 𝐷(𝑋, 𝑌) is the distance between two data objects 𝑋, 𝑌. (3) th component value of data object 𝑋, and Fig.3: The classification of a test point at K=3, based on the classes of its nearest neighbors. IV.3 Proposed protocol The proposed protocol presents a method for privately computing data mining process from distributed sources without disclosing any information about the sources or their data except that revealed by final classification result. The proposed protocol develops a solution for privacy–preserving k-nearest neighbor classification. In this paper, a semi-honest model for an adversary is used. R S. Publication (rspublication.com), [email protected] Page 112 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 The Integrated PPDM Algorithm of K Nearest Classifier is as follows: Input: DB1 , DB2 ,...,DBn at n sites, each of d j data objects, j {1,...,n} , each data object X x1 , x2 ,...,xr of r dimension space, r 1 , 𝐾 = number of nearest neighbors beforehand, 𝑘 > 1 integer odd value , 𝑦𝑖 class values i {1,...,m} , l attribute values, 𝑋𝑝 query instance Output: Global class value of query instance 𝑋𝑝 . 1. 𝑑𝑖 𝑚𝑖𝑛 Represents minimum neighbor distance with majority class relative to query instance 𝑋𝑝 , and class label 𝑐𝑙𝑖 is the corresponding class of 𝑑𝑖 𝑚𝑖𝑛 . 2. For i =1 … n do // generating shared secret keys 3. Master and slave sites generate public and private key pair using ECC Algorithm ; 4. The master and slave sites exchange their public keys and shared secret keys S ki are constructed. 5. End For // generating shared secret keys. 6. For i =1 … n do // computing 𝑑𝑖 𝑚𝑖𝑛 (local KNN) at 𝑛 sites and performing the encryption process. 7. Each site 𝑆𝑖 locally computes 𝑑𝑖 𝑚𝑖𝑛 and corresponding class value 𝑐𝑙𝑖 according to 𝐾 nearest algorithm relative to query instance 𝑋𝑝 . 8. Using ECC encrypts (𝑑𝑖 𝑚𝑖𝑛 , 𝑐𝑙𝑖 ) to 𝐶𝑖 =𝐸𝑆𝑘𝑖 𝑑𝑖 𝑚𝑖𝑛 , 𝑐𝑙𝑖 as per Eq. (1). 9. The encrypted value 𝐶𝑖 is sent to the master site. 10. End For // computing 𝑑𝑖 𝑚𝑖𝑛 and performing the encryption process. 11. For i =1 … n do // Decryption process at master site 12. Decrypt 𝐶𝑖 as per Eq. (2) to get 𝑑𝑖 𝑚𝑖𝑛 and 𝑐𝑙𝑖 13. End For //decryption process 14. Construct the mapping table that maps the relative difference between 𝑑𝑖 𝑚𝑖𝑛 with all 𝑑𝑗 𝑚𝑖𝑛 𝑖 ≠ 𝑗 & 𝑖, 𝑗 ∈ 1,𝑚 𝑡𝑜 +1,−1 15. Calculate the weight for each row in the mapping table by adding the row elements and get the sum. 16. Determine the global min distance which corresponds to min weight in the mapping table. 17. Get the predicted class that match global min distance (min weight in the mapping table). Global computation in KNN Classifier: Every client/site send its encrypted local 𝑑𝑖 𝑚𝑖𝑛 , and corresponding class value 𝑐𝑙𝑖 to master client (as a third trusted party TTP). Master client after decryption each 𝑑𝑖 𝑚𝑖𝑛 and its class label 𝑐𝑙𝑖 , has a sequence of 𝑑𝑖 𝑚𝑖𝑛 , 1 ≤ i ≤ 𝑛 (for 𝑛 parties/sites) which uses to construct the permutation mapping table. To construct a mapping table we compare every value in the sequence with other values and if the result is equal or greater than zero the result inserted in the permutation mapping table will be +1 otherwise will be −1, e.g if 𝑑1 𝑚𝑖𝑛 − 𝑑2 𝑚𝑖𝑛 ≥ 0 the value in the mapping table is +1 otherwise is -1. As an example, let us have the sequence 𝑑1 𝑚𝑖𝑛 , 𝑑2 𝑚𝑖𝑛 , 𝑑3 𝑚𝑖𝑛 and 𝑑4 𝑚𝑖𝑛 of four parties/sites and 𝑑2 𝑚𝑖𝑛 < 𝑑1 𝑚𝑖𝑛 < 𝑑4 𝑚𝑖𝑛 < 𝑑3 𝑚𝑖𝑛 then (𝑑1 𝑚𝑖𝑛 − 𝑑2 𝑚𝑖𝑛 ), (𝑑1 𝑚𝑖𝑛 − 𝑑3 𝑚𝑖𝑛 ), (𝑑1 𝑚𝑖𝑛 − 𝑑4 𝑚𝑖𝑛 ), (𝑑2 𝑚𝑖𝑛 − 𝑑3 𝑚𝑖𝑛 ), (𝑑2 𝑚𝑖𝑛 − 𝑑4 𝑚𝑖𝑛 ), (𝑑3 𝑚𝑖𝑛 − 𝑑4 𝑚𝑖𝑛 ) will be {+1, −1, −1, −1, −1, +1} as presented in Table 1. The weight for any element in the sequence relative to the others is the algebraic sum of the row corresponding to that element. Since 𝑑2 𝑚𝑖𝑛 has the smallest weight -2, then its corresponding class label will be the predicted value of query instance. Table 1: An example of a permutation mapping table 𝑑1 𝑚𝑖𝑛 𝑑2 𝑚𝑖𝑛 𝑑3 𝑚𝑖𝑛 𝑑4 𝑚𝑖𝑛 𝑑1 𝑚𝑖𝑛 +1 -1 +1 +1 𝑑2 𝑚𝑖𝑛 +1 +1 +1 +1 𝑑3 𝑚𝑖𝑛 -1 -1 +1 -1 𝑑4 𝑚𝑖𝑛 -1 -1 +1 +1 R S. Publication (rspublication.com), [email protected] weight 0 -2 +4 +2 Page 113 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 start start Iput K, 𝑋𝑝 , and DB1,DB2,…,DBn Iput K, 𝑋𝑝 , and DB1, DB2,…,DBn The master and slave sites generate their public and private key pair as fig. 2. The master site generate public and private key pair for n sites: (𝑒1 , 𝑑1 ), (𝑒2 , 𝑑2 ), …, (𝑒𝑛 , 𝑑𝑛 ). The master and slave sites exchange their public keys and secret shared keys are constructed Public keys (𝑒1 , 𝑒2 , … , 𝑒𝑛 ) are sent from master to slave sites, one for each site. KNN mining process is performed at each slave site and local properties are generated. KNN mining process is performed at each slave site and local properties are generated. Local properties are encrypted using secret shared key at each slave site and then transferred to the master site. Local properties are encrypted using public key at each slave site and then transferred to the master site. Local properties are decrypted at the master site and mapping table is constructed. Local properties are decrypted at the master site using corresponding private key and mapping table is constructed The predicted class(a) corresponding to the min weight at the (b)mapping table is established. The predicted class corresponding to the min weight at the mapping table is established. (c) (d) (e) End (f) (b) End Fig.4: The process of proposed protocol using ECC (a) and RSA (b) encryption algorithms. V. Implemented System To show the applicability of the proposed protocol and its performance when we are dealing with different number of sites/parties, the protocol has been tested on various real-world datasets as per Table 2 [35]. In these experiments, datasets are securely and horizontally distributed among 2,3,4,5, and 6 sites/parties. The data objects sets were generated on each local site/party independently. For the central reference classification we used the union of the local data object sets. As we suppose that the central classification is optimal, we measure the performance time of our proposed approach w.r.t. the central encrypted classification. We varied both the number of data objects and the number of client sites. We compared proposed protocol to a single run of KNN classification on all data objects. In order to evaluate the proposed protocol, we carried out the local classification sequentially. We collected all encrypted representatives of all local runs, and then applied a global classification on these representatives after decryption process. For all these steps we always used the same computer. All experiments were developed using C# standard Edition 2010 on Intel® Core2 Duo, 2.0 GHz, 4 GB RAM machine. R S. Publication (rspublication.com), [email protected] Page 114 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 Table 2: Data Sets Dataset Name Landsat satellite images Radar return Adult Attribute Characteristics Number of Instances Number of Attributes Number of classes Area integer 4435 36 7 Physical Integer, Real 210 34 2 Physical Categorical, Integer 6000 13 2 Social V.2 Performance Tests The experiments that are performed are as follows: The first experiment is done on real dataset called Statlog (Landsat Satellite) that can be available by NASA. This database was generated from Landsat Multi-Spectral Scanner image data. The database consists of the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image, and the classification associated with the central pixel in each neighborhood. In the sample database, the class of a pixel is coded as a number. This sample dataset contains 36 attributes with 6,435 instances and its aim is to predict the class of the soil of an image land instance. Table 3 shows the performance and accuracy of this experiment. Table 4 shows a comparative performance and accuracy of RSA encryption method with the proposed one. The second experiment is done on radar data set which is collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. “Good” radar returns are those showing evidence of some type of structure in the ionosphere. “Bad” returns are those that do not; their signals pass through the ionosphere. This sample dataset contains 34 attributes with 351 instances. Table 5 shows the performance and accuracy of this experiment. Table 6 shows a comparative performance and accuracy of RSA encryption method with the proposed one. The last experiment is done on Adult model, which is known as Census Income dataset, which predict whether income exceeds $50k/year base on census data. This model contains 14 attributes with 48842 instance. Table 7 shows the performance and accuracy of this experiment. Table 8 shows a comparative performance and accuracy of RSA encryption method with the proposed one. For comparison study we use the following notations: - Relative time overhead = Execution time with proposed encryption / Execution time without proposed encryption. - Speedup = Execution time of RSA/Execution time of proposed encryption -Relative accuracy = Accuracy of RSA/Accuracy of Proposed encryption Table3: Execution Time and accuracy of distributed/centralized Land Satellite Images dataset Encrypted Centralized data set Distributed data set No. of sites Dataset size (No. of records) 2 3 4 5 6 1490 2235 2980 3725 4470 Accuracy % 82.1 83.9 84.1 83.0 84.3 Execution Time (ms) Without With proposed proposed encryption encryption 9435 10530 11467 12690 13991 Relative time Overhead Accuracy % Execution Time (ms) 1.04 1.10 1.12 1.17 1.18 85.3 87.0 87.3 87.6 88.1 16790 24012 30162 36993 44646 9866 11650 12884 14883 16482 R S. Publication (rspublication.com), [email protected] Page 115 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 18,000 16,000 Execution Time (ms) 14,000 12,000 10,000 Without Encryption 8,000 With Encryption 6,000 4,000 2,000 0 2 3 4 5 6 No. of sites Fig. 5: The overhead of proposed encryption algorithm on execution time of distributed Land Satellite Images dataset 50,000 45,000 Execution Time (ms) 40,000 35,000 30,000 Distributed data set 25,000 Centralized data set 20,000 15,000 10,000 5,000 0 2 3 4 5 6 No. of sites Fig.6: The execution time of proposed encryption algorithm on (distributed/ centralized)Land Satellite Images dataset. Table 4: Execution time and accuracy of proposed and RSA encryption protocols for distributed Land Satellite Images dataset. No. of sites 2 3 4 5 6 Dataset size (No. of records) 1490 2235 2980 3725 4470 Execution Time (ms) Proposed RSA encryption encryption 9866 13392 11650 17491 12884 22783 14883 25714 16482 29975 Speed up 1.36 1.50 1.77 1.73 1.78 Accuracy % Proposed RSA encryption encryption 82.1 83.5 83.9 84.5 84.1 84.75 83 83.25 84.3 84 R S. Publication (rspublication.com), [email protected] Relative Accuracy 1.02 1.01 1.01 1.00 0.99 Page 116 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 35000 Execution Time (ms) 30000 25000 20000 Proposed Encryption RSA Encryption 15000 10000 5000 0 2 3 4 5 6 No. of sites Fig 7: Execution time of proposed and RSA encryption algorithm for distributed Land Satellite Images dataset. Table5: Execution Time and accuracy of distributed/centralized Radar Return dataset Encrypted Centralized data set Distributed data set No. of sites Dataset size (No. of records) 2 3 4 5 6 70 105 140 175 210 Accuracy % 87.23 89.36 92.2 92.91 92.1 Execution Time (ms) Without With proposed proposed encryption encryption 429 568 704 838 987 Relative time Overhead Accuracy % Execution Time (ms) 1.74 1.40 1.35 1.41 1.39 89.36 92.20 92.91 93.10 93.62 1614 1923 1951 2313 2385 749 801 956 1185 1375 1600 1400 Execution Time (ms) 1200 1000 Without Encryption 800 With Encryption 600 400 200 0 2 3 4 5 6 No. of sites Fig. 8: The overhead of proposed encryption algorithm on execution time of distributed Radar Return dataset R S. Publication (rspublication.com), [email protected] Page 117 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 3,000 Execution Time (ms) 2,500 2,000 Distributed data set 1,500 Centralized data set 1,000 500 0 2 3 4 5 6 No. of sites Fig.9: The execution time of proposed encryption algorithm on (distributed/ centralized) Radar Return dataset Table 6: Execution time and accuracy of proposed and RSA encryption protocols for distributed Radar Return dataset Dataset size (No. of records) 70 105 140 175 210 No. of sites 2 3 4 5 6 Execution Time (ms) Proposed RSA encryption encryption 749 3672 801 5006 956 6843 1185 8188 1375 9888 Speed up 4.90 6.24 7.15 6.90 7.19 Accuracy % Proposed RSA encryption encryption 87.23 86.52 89.36 87.23 92.2 91.41 92.91 92.91 92.1 92.91 Relative Accuracy 0.99 0.97 0.99 1.00 1.01 Execution Time (ms) 12000 10000 8000 Proposed Encryption 6000 RSA Encryption 4000 2000 0 2 3 4 5 6 No. of sites Fig. 10: Execution time of proposed and RSA encryption algorithm for distributed Radar Return dataset. Table7: Execution time and accuracy of distributed/centralized Adult dataset. Encrypted Centralized data set Distributed data set No. of sites Dataset size (No. of records) 2 3 4 5 6 2000 3000 4000 5000 6000 Accuracy % 78.56 79.38 79.62 80.04 80.62 Execution Time (ms) Without With proposed proposed encryption encryption 4939 5985 6930 8610 9520 Relative time Overhead Accuracy % Execution Time (ms) 1.20 1.20 1.20 1.17 1.19 78.23 79.88 80.25 80.52 81.97 10890 13148 16606 20490 24013 5960 7223 8319 10100 11404 R S. Publication (rspublication.com), [email protected] Page 118 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 12000 Execution Time (ms) 10000 8000 Without Encryption 6000 With Encryption 4000 2000 0 2 3 4 5 6 No. of sites Fig. 11: The overhead of proposed encryption algorithm on execution time of distributed Adult dataset 30000 Execution Time (ms) 25000 20000 Distributed dataset 15000 Centralized dataset 10000 5000 0 2 3 4 5 6 No. of sites Fig.12: The execution time of proposed encryption algorithm on (distributed/ centralized) Adult dataset. Table 8: Execution time and accuracy of proposed and RSA encryption algorithm for distributed Adult dataset. No. of sites Dataset size (No. of records) 2 Execution Time (ms) Proposed encryption RSA encryption 2000 5960 25429 3 3000 7223 4 4000 5 6 Accuracy % Speed up Relative Accuracy Proposed encryption RSA encryption 4.26 78.56 78.85 1.003 34749 4.81 79.38 79.69 1.003 8319 44451 5.34 79.62 79.79 1.002 5000 10100 54860 5.43 80.04 80.42 1.002 6000 11404 55131 4.83 80.62 80.88 1.003 R S. Publication (rspublication.com), [email protected] Page 119 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 60000 Execution Time (ms) 50000 40000 Proposed Encryption 30000 RSA Encryption 20000 10000 0 2 3 4 5 6 No. of sites Fig. 13: Execution time of proposed and RSA encryption Algorithm for distributed Adult dataset. - Our proposed protocol is performed with K number equal 3 for KNN classification algorithm. We did distributed classification with and without proposed encryption to know how the proposed encryption algorithm can affect the system performance. - Fig 5, 8, and 11 show the execution time for distributed datasets Land Satellite Images, Radar return and Adult with and without proposed encryption scheme and the experimental results show that the overhead in performance time does not exceed 20% relative to nonencrypted value for large size dataset e.g. Adult and Land Satellite Images, which means that the overhead due to encryption for the proposed protocol is within acceptable range. - Fig. 6, 9, and 12 show the execution time of encrypted centralized system compared to the proposed encrypted distributed one. Comparing the execution time for encrypted distributed datasets of 6 sites with centralized one, the reduction of execution time in the proposed distributed system of 6 sites of total size 4470 record of Land Satellite Images dataset is not less than 41% the centralized one with the same dataset size as in table 3. But for the case of Radar Return dataset in table 5, the proposed distributed system of 6 sites of total size 210 record reduce the execution time to more than 48% of centralized one. Moreover in Fig.12 the reduction in performance time relative to centralized one for proposed distributed system of 6 sites of total size 6000 record of Adult dataset is above 45%. This is shows how encrypted centralized clustering can affect the performance of the system comparing it with the proposed distributed encrypted scheme. - In Fig. 7,10, and 13 show that the proposed protocol has better performance than RSA encryption algorithm with comparable/equal accuracy to. The speed up parameter ranges from 1.36 to 7.19 for the tested data sets. VI. Analysis and evaluation of proposed protocol To evaluate privacy preserving algorithm, we have to discuss three main issues which are: privacy (security), accuracy and efficiency. - Privacy preserving analysis Since all the model parameters are completely present with all parties then evaluation can be performed easily. The party that wants to classify an instance using KNN evaluation procedure can do that locally, so no interactions between parties. Thus there is no question of privacy being revealed or compromised. The result of local classification is encrypted and transmitted to the master site using local shared secret key which is semantically secured. According to the composition theorem [21], if the main protocol is partitioned into sub-protocols such that the output of one sub-protocol is the input to the next and all intermediate results are kept private, then the whole protocol would be privacy preserving. R S. Publication (rspublication.com), [email protected] Page 120 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 So each party 𝑃𝑖 encrypts its output pair 𝑑𝑖 𝑚𝑖𝑛 and corresponding class value 𝑐𝑙𝑖 as per equation (1). ms {d s min , cls } , 1 s n ms is plaintext and ES k (ms ) mod(ms g pkm pks , p), is a cipher encryption s of output pair. These encrypted values are transmitted to the master client for global classification. So the output of each party is securely transmitted to the master client to compute the global classification without leaking any information about the private data of a party except its output. ECC is semantically secured due to the difficulty of the Elliptic Curve Discrete Logarithm Problem. Also using ECC in combination with Diffie-Hellman protocol is believed to make public key encryption more secure. The participants and the master site learn nothing but the results as in secure computation. No realized attack is possible, since the master site does not hold any part of the database. - Accuracy of proposed protocol Master client, which decrypts, 𝑑𝑖 𝑚𝑖𝑛 and corresponding class value 𝑐𝑙𝑖 produce accurate results with ECC cryptosystem. As shown in Tables 3,5, and 7 the accuracy of the classifier for parties between 2 to 6 is 78 – 92%, and is varied according to data set size and number of parties but accuracy range is still accepted and as long as the number of parties increases the accuracy gets better. - Efficiency of proposed protocol Raising efficiency of the algorithm is mainly shown the decreases in time complexity. Proposed-KNN classifier algorithm reduces the time complexity mainly in two aspects. First, global 𝑑𝑔 𝑚𝑖𝑛 and corresponding class value 𝑐𝑙𝑔 are quickly generated, since the KNN classifier algorithm executed locally for every party 𝑃𝑖 , this enables solutions where the communication cost is independent of the size of the database and greatly cut down communication costs comparing with centralized data mining which needs to transfer all data into data warehouse to perform data mining algorithm. Second, the length of encryption – decryption key size is shorter than other public key encryption methods (e.g. RSA) with the same level of security. -The complexity analysis of the protocol a- The communication cost Let us use β and to denote the number of bits of public key size and ciphertext sent to the master site respectively and n is the total number of sites/parties. The total communication cost is the cost of 2nβ+n from steps 4 and 9 in the proposed protocol. b- The computational cost is affected by: The generation of n cryptographic key pair and generation of n shared secret keys. The total number of n encryptions and n decryptions. Complexity for local KNN algorithm is O(kqr), where r is the dimension of training sample , q is number of training samples and k is the parameter of KNN algorithm. Additional computations as n2 additions, n(n-1) subtractions and n log(n) sorting n numbers. Therefore, the complexity of KNN classifier for n parties is dominant for not only the other computational costs but also for communication costs too. Consequently, the overall complexity of the proposed model = O(nkqr). Conclusion: In this paper we have provided a solution for K Nearest Neighbor classification with horizontal collaboration using the integration of Elliptic Curve Cryptography (ECC) public key cryptosystem and DiffieHellman Key Exchange. To analyze the overall efficiency not only the computational cost is considered but also the communication cost for examining the performance of proposed protocol. The computational time of the proposed protocol is compared with RSA public key encryption algorithm as well as the centralized data mining approach using real world datasets. Experimental results show that the overhead in the execution time due to proposed protocol is not exceeding more than 20% compared with non-encrypted distributed scheme in large datasets as in Land Satellite Images and adult dataset. While, it gives not less than 40% reduction in execution R S. Publication (rspublication.com), [email protected] Page 121 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 time relative to the centralized scheme with the same size of dataset. Experimental results show that the proposed protocol with ECC has better performance compared to RSA encryption algorithm with the same level of security. As demonstrated through the theoretical analysis and the experimental results, the proposed protocol achieves significant improvements in terms of privacy, accuracy and efficiency. References: [1] Yehuda Lindell and Benny Pinkas. “Privacy Preserving Data Mining”. In Proceedings of the 20th Annual International Cryptology Conference (CRYPTO), pages 36–54, Santa Barbara, CA, USA, 2000. ]2[ R. Agrawal, A. Evfimievski, and R. Srikant, “Information sharing across Private databases,” in SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data. NewYork, NY, USA: ACM Press, pp. 86–97, 2003. ]3[ D. E. O’ Leary, “Some privacy issues in knowledge discovery: The oecd personal privacy guidelines,” IEEE Expert: Intelligent Systems and Their Applications, vol. 10, no.2, pp. 48–52, 1995. [4] J. Vaidya and C. Clifton, “Privacy-preserving k-means clustering over vertically partitioned data,”in KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. NewYork, NY, USA: ACM, pp.206–215, 2003. [5] G. Jagannathan and R. N. Wright, “Privacy-preserving distributed k-means clustering over arbitrarily partitioned data,”in KDD’05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. New York, NY, USA: ACM, pp.593–599, 2005. [6] M. Gorawski and L. Slabinski, “Implementation of homomorphic encryption in privacy-preserving clustering distributed data” , in 2nd ADBIS Workshop on Data Mining and Knowledge Discovery, Thessaloniki, Greece, 6 September 2006 in conjunction with 10th East-European Conference on Advances in Databases and Information Systems ADBIS’ , pp.37–47, September 2006. [7] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright, “A new privacy-preserving distributed k-clustering algorithm”, in SDM, 2006. [8] J. Vaidya and C. Clifton, “Privacy preserving association rule mining in vertically partitioned data”, in KDD’ 02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, pp. 639–644, 2002. [9] M. Kantarcioglu and C. Clifton, “Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data”, IEEE Transactions on Knowledge and Data Engineering, vol.16 , no.9, pp.1026-1037, 2004. [10] J. Vaidya, M. Kantarcioglu, and C. Clifton, “Privacy-preserving naïve bayes classification ”, VLDBJ., vol. 17, no. 4, pp. 879–898, 2008. [11] Kantarcioglu M., Vaidya J. “Privacy-Preserving Naïve Bayes Classifier for Horizontally Partitioned Data. IEEE Work shop on Privacy-Preserving Data Mining, 2003. [12] D. Meng, K. Sivakumar, and H. Kargupta, “Privacy-sensitive bayesian network parameter learning”, in ICDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, pp. 487–490, 2004. [13] R. Wright and Z. Yang, “Privacy-preserving bayesian network structure computation on distributed heterogeneous data.” in KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, pp. 713–718, 2004. [14] O. Goldreich, S. Micali, and A.Wigderson,“How to play any mental game”, in STOC ’87: Proceedings of the nineteenth annual ACM symposium on Theory of computing. New York, NY, USA: ACM, pp. 218 –229, 1987. [15]A. C. Yao, “How to generate and exchange secrets”, in Proceedings of the 27 th Annual Symposium on Foundations of Computer Science. Washington, DC, USA: IEEE Computer Society, pp.162–167, 1986. [16] Justin Zhan , Liwu Chang and , Stan Matwin “Privacy Preserving K-nearest Neighbor Classification”, in International Journal of Network Security ,2005. [17] J. Zhan, and S. Matwin. “A Crypto-Based Approach to Privacy-Preserving Collaborative Data Mining”. In Sixth IEEE international conference of data mining, IEEE, 2006. R S. Publication (rspublication.com), [email protected] Page 122 International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 3 (May-June 2014) ISSN: 2250-1797 [18] J. Zhan. “Privacy-Preserving Collaborative Data Mining”. PhD thesis, Department of Computer Science, University of Ottawa, IEEE, 2006. [19] Shaneck, M., Kim, Y., Kumar, V. “Privacy Preserving Nearest Neighbor Search”. In: 6th IEEE International Conference on Data Mining Workshops, pp. 541–545, 2006. [20] Yinian Qi, and Mikhail J. Atallah” Efficient Privacy-Preserving k-Nearest Neighbor Search”. In: 28th International Conference on Distributed Computing Systems, pp. 311–319, 2008. [21] Ankit Chouhan, Sankita Patel, and D. C. Jinwala. “Comparative analysis of elliptic curve cryptography based algorithms to implement privacy preserving clustering through secure multiparty computations”. In Information Security, Scientific Research, 2013. [22] Pooja Gupta, Ashish Kumar.” Two phase privacy preserving data mining”. In ComEngApp-Journal, vol. 2, No. 3, 2013. [23] R. Agrawal and R. Srikant. “Privacy-preserving data mining”. In Proceeding of the ACM SIGMOD Conference on Management of data, pages 439-450. ACM Press May 2000. [24] D. Agrawal and C. Aggarwal. “On the design and quantification of privacy preserving data mining algorithms”. In the Proceeding of the 20th ACM SIGACT- SIGMOD-SIGART Symposium on Principles of database System, pages 247-255, Santa Barbara, CA, May 21-23, 2001. ]25[ A. C. Yao, “How to generate and exchange secrets” in Proceedings of the 27th Annual Symposium on Foundations of Computer Science. Washington, DC, USA: IEEE Computer Society, pp.162–167, 1986. ]26[ C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu, “Tools for privacy preserving distributed data mining,” ACM SIGKDD Explorations, vol. 4, 2003. [27] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes”, Advances in Cryptography - EUROCRYPT '99, pp. 223-238, Prague, Czech Republic, 1999. [28] W. Diffie, and M. Hellman, “New directions in cryptography”, IEEE Transactions on Information Theory, vol. 22, pp. 644-654, 1976. [29] I.F. Blake, G. Seroussi, and N. P. Smart, Elliptic curves in cryptography, Cambridge: Cambridge University Press, 2000. [30] N. Koblitz, “Elliptic Curve Cryptosystems”, Mathematics of Computation, 48, pp. 203- 209, 1987. [31] Darrel Hankerson, Alfred Menezes and Scott Vanstone, Guide to Elliptic Curve Cryptography. [32] Gallian, Joseph, Contemporary Abstract Algebra (6th ed.), Boston: Houghton Mifflin, 2006. [33] Nitin Bhatia and Vandana, Survey of Nearest Neighbor Techniques, in (IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010. [34] T. Cover and P. Hart. Nearest neighbor pattern classification. In IEEE Transaction of Information Theory, Vol. 13, pp.21-27, January, 1968. [35] Ronny Kohavi and Barry Becker, “UCI Repository of Machine Learning Databases”, http://archive.ics.uci.edu/ml/datasets.html, Data Mining and Visualization, Silicon Graphics, 1996. R S. Publication (rspublication.com), [email protected] Page 123
© Copyright 2024 ExpyDoc