Journal of Computational Information Systems 10: 19 (2014) 8201-8208
Available at http://www.Jofcis.com

High Representatives Selection Method for Near Duplicate Documents Detection

Gaudence UWAMAHORO, Zuping ZHANG*
School of Information Science and Engineering, Central South University, Changsha 410083, China

Abstract

The high dimensionality of text documents is an obstacle to identifying near-duplicate documents. Systems are expected to generate the desired results in a short time, but in practice they are slow and some results are omitted. Several methods have been proposed to remedy that shortcoming, but efficiency is still an issue. We propose a method based on a small set of highly selective frequent terms extracted from each document to increase efficiency. On average we use only about 12% of the original document size and still obtain a better similarity score than with the full documents, which leads to a decrease in the time of the search process.

Keywords: Frequent Term; Near-duplicate Document Detection; Text Document; Apriori Algorithm; Document Representation

1 Introduction

Access to digital documents has become easy because of Internet technology. In research areas such as Artificial Intelligence, Text Mining, and Information Retrieval, researchers need information from those documents. Text mining tools are required to extract this information, but it is a challenge because text is not structured. Some documents are duplicates or near duplicates of others, the differences arising from operations such as edits, deletions, and substitutions, as in [1, 2]. Several methods have been proposed, such as edit distance, fingerprinting, shingling, checksumming, bag-of-words, and similarity measures like cosine and Jaccard, as in [3, 4]. They work well, but they still suffer from limited efficiency.

Shingling is one technique for estimating the degree of similarity between documents: the more shingles two documents share, the more likely they are near duplicates, and if their shingle sets are identical the two documents are considered equivalent, as in [5]. Fetterly et al. use five-grams as shingles, sample 84 shingles for each document, and then build them into six super shingles; documents that share two super shingles are near duplicates [6]. The drawback of shingling is that the set of shingles is larger than the document itself. Any method that reduces the number of shingles reduces processing time, but it also reduces the accuracy of the result. Most existing methods share the shortcoming of being efficient only for small documents. To overcome it, different fingerprinting techniques have been proposed in [7, 8]. Xiao et al. in [9] proposed a new filtering technique that exploits the ordering information in prefix filtering. Another method, based on a concept tree, was proposed in [10], where each document is represented as a tree. In this paper, the proposed method is based on frequent terms extracted with the Apriori algorithm proposed by Agrawal and also used in [11].

Project supported by the National Natural Science Foundation of China (Grant No. 61379109, M1321007) and the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20120162110077). We would like to thank the anonymous referees for their helpful comments and suggestions.
* Corresponding author. Email address: [email protected] (Zuping ZHANG).
1553-9105 / Copyright © 2014 Binary Information Press
DOI: 10.12733/jcis11719    October 1, 2014
A document is made of sentences, and the support of a term or termset is the number of sentences in which it appears.

2 Proposed Method

The high dimensionality of documents is an issue for near-duplicate document detection. We propose an efficient detection method based on representing each document by a small, highly selective set of frequent termsets generated with the Apriori algorithm, instead of using all frequent termsets: when documents are compared through the portion of frequent terms they share, the full set is not necessary, and dropping it saves both space and time in the search process. The longer the frequent termsets used in the comparison, the lower the chance that any given frequent term becomes a candidate to represent a document. We also observed that, with the tf-idf approach, many terms in a document never become candidates during a comparison: terms that appear in only one of the two documents take up space and increase the search time. In this paper we do not consider the order in which terms occur within a sentence.

The proposed system consists of four phases: text preprocessing (stopword removal, lowercasing of every term, and tokenization), construction of phrases or chunks, generation of frequent termsets for every document in the collection, and similarity measurement using the Jaccard similarity. A document is treated as a set of chunks (phrases), each made of the set of terms occurring in that phrase. For example, document D = {{american, action, film, slowly, drowning, death, sea}, {asian, wire-fu, copycats, pretty, death, leaving, likes}, ...}. From these chunks we find the termsets that occur frequently and can represent each document in the collection, which provides a fast way to compare documents.

Consider a document D made of a set of phrases S, where each phrase is a set of terms, and let minsup be the support threshold, i.e. the minimum support a termset must have over the phrases. Let S = {S1, ..., Sn} be the set of phrases of a document and T be the set of all terms occurring in the phrases of S. Each phrase Sj is represented by the set of terms occurring in it, so that Sj ⊆ T. Let minsup be a real number, 0 ≤ minsup ≤ 1. For a termset Tj ⊆ T, a phrase Sj is said to contain Tj if and only if Tj ⊆ Sj. Given a set of documents, the steps of the proposed method are shown in Fig. 1.

2.1 Algorithm description

In our proposed method, each document in the collection is split into fixed-size chunks of length 7, called phrases in this paper, which means that each phrase is made of seven consecutive terms. Let Dn = {d1, d2, ..., dn} be a collection of n documents of different sizes; each document is made of chunks of seven consecutive words, which we call phrases.

Fig. 1: Frequent termsets selection for document representation

Therefore di = {p1, p2, p3, ..., pb}, where b is the number of phrases in the document. The set of phrases of each document is the input of the algorithm that generates the sets of terms appearing frequently in different phrases of the document, also called frequent termsets; a frequent k-termset is a frequent termset made of k terms (for example, a 2-termset has k = 2).
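As a concrete illustration of the preprocessing and chunking phase just described, the following is a minimal Python sketch. The tokenizer and the small stopword list are simplifying assumptions on our part; the paper does not specify the exact preprocessing tools it used.

```python
import re

# A small illustrative stopword list; the paper does not state which list was used.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are"}

def preprocess(text):
    """Lowercase, tokenize, and remove stopwords."""
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def make_phrases(tokens, length=7):
    """Split the token stream into fixed-size chunks (phrases) of seven terms,
    each phrase being treated as a set of terms."""
    return [set(tokens[i:i + length]) for i in range(0, len(tokens), length)]

if __name__ == "__main__":
    raw = "An American action film slowly drowning to death in a sea of Asian wire-fu copycats"
    print(make_phrases(preprocess(raw)))
```

With a 7-term chunk length, a document of b phrases is simply the list returned by make_phrases, which is the input of the frequent-termset generation step described next.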
In our method, we fix the minimum support minsup, which is the basis of frequent termset generation: a term (or termset) in a document is considered frequent if its support is at least minsup. The minimum support is fixed so that each term has a high chance of being generated as a frequent term; for each document it is set to 2/b, which means that to be frequent a term must appear in at least two phrases (chunks) of the document. The frequent termsets generated have different sizes (lengths). Let Ck be the set of candidate k-termsets and Fk the set of frequent k-termsets. We generate frequent termsets of increasing size: singletons (k = 1), doubletons (k = 2), triples (k = 3), quadruples (k = 4), and so on. The frequent k-termsets are generated from the frequent (k−1)-termsets; for example, the quadruples are generated from the triples. This follows from the Apriori property: a k-termset cannot be frequent unless all of its subsets are frequent.

After generating all frequent termsets for each document, we have to choose those that will represent the document. The choice is guided by the similarity between the pair of documents being compared, computed with the Jaccard similarity: we compare the frequent termsets of both documents length by length. For example, if k = 4 is the maximum length of frequent termsets, we compare the frequent 4-termsets of one document with the frequent 4-termsets of the other, record the similarity, and continue down to the minimum length; the termsets that represent the document are those that yield the highest similarity score.

Using the Apriori algorithm, the first pass counts term occurrences over all phrases of the document to obtain the frequent 1-termsets F1; each subsequent pass generates the frequent k-termsets Fk from Fk−1, where k is the termset length. Consider a document D as a set of chunks of fixed length L, so that D = {t1, t2, t3, ..., tn}, where t1, t2, t3, ..., tn are the chunks.

LookforFrequent Algorithm
Input: T[]: the chunks (phrases) of a document; minimum-support
Output: F: the set of frequent termsets
   F1 = {frequent 1-termsets in T}
1. for (k = 2; Fk−1 ≠ ∅; k++) do
      Ck = apriori-compute(Fk−1)        // candidate k-termsets built from Fk−1
2.    for all chunks w in T do
         Cw = subset(Ck, w)             // candidates of size k contained in w
3.       for all candidate termsets d ∈ Cw do
            d.count++
4.       end for
5.    end for
      Fk = {d ∈ Ck | d.count ≥ minimum-support}
6. end for
7. return F = ∪k Fk

To select the frequent termsets that represent each document, we choose the ones that yield the highest similarity score when the two documents are compared, starting from the maximum length of frequent termsets. To calculate the similarity of two documents, we compare all frequent termsets of the same length, from the maximum length down to the minimum length. Take, for example, a maximum length of k = 4, i.e. frequent 4-termsets in which each termset is composed of 4 terms; the minimum length is 1, i.e. frequent 1-termsets. First, the algorithm checks the maximum frequent-termset length of each document. If both documents have the same maximum length, the similarity at that length is computed; otherwise the algorithm moves to length k − 1 (maximum length minus one). The comparison continues until the frequent 1-termsets are reached, and the frequent termsets compared at the length giving the highest similarity score are chosen as the representation of each document.
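The LookforFrequent pseudocode above can be realized, for instance, with a straightforward Apriori-style routine over the phrases of one document. The sketch below is an assumption-laden illustration: it uses an absolute support count (min_count = 2, i.e. minsup = 2/b), represents termsets as frozensets, and reuses make_phrases/preprocess from the earlier sketch; the function and variable names are ours, not the authors'.

```python
from itertools import combinations

def look_for_frequent(phrases, min_count=2):
    """Return frequent termsets level by level (Apriori-style).

    `phrases` is a list of term sets (the chunks of one document). A termset
    is frequent if it is contained in at least `min_count` phrases."""
    def support(termset):
        return sum(1 for p in phrases if termset <= p)

    # F1: frequent single terms.
    terms = sorted({t for p in phrases for t in p})
    current = [frozenset([t]) for t in terms
               if support(frozenset([t])) >= min_count]
    frequent, k = {}, 1
    while current:
        frequent[k] = current
        prev = set(current)
        # Candidate generation: union of two frequent k-termsets that differ
        # in one term, then prune candidates with an infrequent k-subset
        # (the Apriori property).
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k + 1}
        candidates = [c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))]
        current = [c for c in candidates if support(c) >= min_count]
        k += 1
    return frequent
```

Under these assumptions, the per-document representation used later would be obtained as frequent = look_for_frequent(make_phrases(preprocess(text))), a dictionary mapping each length k to the frequent k-termsets of that document.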
Our method has two parts: one identifies the frequent termsets of every document, using the Identify-Frequent (IF) algorithm; the other chooses the frequent termsets that yield the highest similarity score during comparison, using the High Frequent Termset Representation (HFTR) algorithm. Let D = {d1, d2, d3, ..., dn} be a set of documents.

Identify-Frequent (IF) Algorithm
Input: document di
Output: L: list of frequent termsets of different sizes
   Terms ← []
1. for each di in D
2.    for each chunk of di
         Terms ← terms-chunks(di)
3.    end for
4. end for
5. L ← lookforfrequent(Terms)
6. return L

High Frequent Termset Representation (HFTR) Algorithm
Input: (di, Li), (dj, Lj): documents di and dj with their lists of frequent termsets
Output: the maximum similarity score based on frequent termsets
1. for each document in D
      D ← list(di)
      counter = n            // n: the maximum frequent-termset length
2. for all (di, dj) ∈ D
3.    while counter > 0
         Simfreq ← jaccard(di.Lcounter, dj.Lcounter)
         counter = counter − 1
4.    end while
5. end for
6. end for
7. return max(Simfreq)

2.2 Time complexity

The operations we consider are those that generate the frequent termsets and those that choose the frequent termsets for document representation. Building the chunks of every document takes O(n) time. The time spent also depends on the input size, since each document contains many terms. The generation of the frequent termsets is done with Apriori, which uses a bottom-up, breadth-first search: at each iteration, a set of candidate termsets is generated, the occurrences of each candidate are counted, and the large (frequent) termsets are selected according to the predetermined minimum support fixed by the user. The time complexity of this operation is O(n²).

After all frequent termsets have been obtained, the main remaining operation is to choose the frequent termsets that represent each document, by comparing frequent termsets of equal length starting from the maximum length. The time required depends on the maximum length: if it is 5, there are 5 comparison levels, from the maximum length down to one. Each comparison uses set intersection and union to compute the Jaccard similarity, and at each iteration the size of the representations being compared decreases, so the comparison time decreases as well; this operation is O(n). The overall time complexity of our algorithm is therefore O(n²). As the number of frequent termsets of the documents being compared decreases, the comparison time decreases. The work space cannot exceed the running time, since writing to each memory cell requires at least a constant amount of time.
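Before moving to the experiments, here is a minimal Python sketch of the HFTR comparison step as described above: the two representations are compared level by level with the Jaccard similarity, from the largest termset length present in both documents down to 1, and the highest score is kept. The function names and the dictionary layout (k mapped to a list of frozensets, as in the Apriori sketch above) are illustrative assumptions, not the authors' implementation.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def hftr_similarity(freq_a, freq_b):
    """Compare two documents level by level, from the largest termset length
    present in both down to 1, and return the best Jaccard score.

    `freq_a` and `freq_b` map k -> list of frequent k-termsets, as produced
    by the look_for_frequent sketch above."""
    if not freq_a or not freq_b:
        return 0.0
    best = 0.0
    for k in range(min(max(freq_a), max(freq_b)), 0, -1):
        score = jaccard(set(freq_a.get(k, [])), set(freq_b.get(k, [])))
        best = max(best, score)
    return best
```

For a pair of documents this would be called as hftr_similarity(look_for_frequent(...), look_for_frequent(...)), returning the score of the termset length that best matches the two representations.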
3 Experiment

We show the process of comparing pairs of documents for near-duplicate detection using our method and show that it performs better than the method that uses all words. Both methods calculate similarity with the Jaccard measure, i.e. the number of terms shared by the two documents divided by the total number of distinct terms of both documents. Take two documents: cv683 12167.txt (D1) with 733 words and minsup 0.019, and cv692 15451.txt (D2) with 805 words and minsup 0.01.

Each document is represented by sets of frequent termsets of different lengths, generated according to minsup; a term is frequent if it appears in at least 2 phrases. D1 = {s1, s2, s3} and D2 = {s21, s22, s23, s24}, where s1 is the set of frequent 1-termsets of D1, with 79 terms, s2 its set of frequent 2-termsets, with 35 terms, and s3 its set of frequent 3-termsets, with 5 terms; s21 is the set of frequent 1-termsets of D2, with 94 terms, s22 its set of frequent 2-termsets, with 35 terms, s23 its set of frequent 3-termsets, with 8 terms, and s24 its set of frequent 4-termsets, with 1 term.

To choose the frequent termsets that represent each document, the proposed method starts from the sets with the largest frequent-termset length over the two documents, here the 4-termsets. Since document cv683 12167.txt has no frequent 4-termsets, we move on to the 3-termsets and then the 2-termsets, until we reach the 1-termsets. Comparing these documents on their frequent 1-termsets (frequent termsets of minimum length 1) yields a similarity score higher than the one obtained using all terms of the documents, so the set of frequent 1-termsets is chosen to represent each document. The similarity using all words is 0.0520156046814, whereas the similarity using frequent termsets is 0.0520231213873. Keeping all terms is costly in space; with frequent termsets the size of each document is greatly reduced and the running time drops to a minimum, as shown in Table 1.

Using documents taken from the positive part of the txt sentoken data in our experiment, we have seen that highly selective frequent termsets extracted from a document can be used instead of all of its words to obtain the similarity between a pair of documents quickly. The abbreviations in Table 1, D.Rep., RTw, RTF, S ALLw and S FQF, stand respectively for document representation, running time using all words, running time using frequent termsets, similarity using all words, and similarity using frequent termsets. In columns 1 and 2, documents are identified by their ids followed by their sizes.

Table 1: Document time and size comparison

Documents                  D.Rep.                   RTw        RTF        S ALLw     S FQF
id1(733w) - id2(805w)      id1(79w) - id2(94w)      0.0249     0.0031     0.0520     0.0520
id3(558w) - id4(472w)      id3(62w) - id4(65w)      0.0161     0.0024     0.0456     0.0472
id5(301w) - id4(472w)      id5(40w) - id4(65w)      0.0124     0.0021     0.0232     0.0285
id1(733w) - id6(349w)      id1(79w) - id6(12w)      0.016639   0.001871   0.049907   0.054945
id1(733w) - id3(558w)      id1(79w) - id3(62w)      0.0197     0.0026     0.0627     0.0709
id7(533w) - id1(733w)      id7(74w) - id1(79w)      0.0194     0.0028     0.0529     0.0653
id8(651w) - id1(733w)      id8(90w) - id1(79w)      0.0212     0.0030     0.0578     0.0710

Table 1 lists the pairs of documents being compared, with the number of terms in each document; the D.Rep. column gives the number of frequent termsets selected to represent each document.
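To make this protocol concrete, the following sketch shows how one entry of a table like Table 1 could be produced with the helper functions sketched earlier (preprocess, make_phrases, look_for_frequent, jaccard, hftr_similarity). The file paths and the choice to time only the similarity computation (treating frequent-termset generation as an offline step) are assumptions on our part rather than details given in the paper.

```python
import time

def all_words_similarity(tokens_a, tokens_b):
    """Baseline: Jaccard similarity over the full vocabularies of two documents."""
    return jaccard(set(tokens_a), set(tokens_b))

def compare_pair(path_a, path_b):
    """Time the all-words baseline against the frequent-termset comparison
    for one pair of documents (hypothetical file paths)."""
    docs = []
    for path in (path_a, path_b):
        with open(path, encoding="utf-8") as fh:
            docs.append(preprocess(fh.read()))

    t0 = time.perf_counter()
    s_all = all_words_similarity(*docs)
    rt_w = time.perf_counter() - t0

    # Assumption: frequent termsets are built once per document offline;
    # only the comparison itself is timed here.
    reps = [look_for_frequent(make_phrases(tokens)) for tokens in docs]
    t0 = time.perf_counter()
    s_freq = hftr_similarity(*reps)
    rt_f = time.perf_counter() - t0
    return rt_w, rt_f, s_all, s_freq
```

Calling, say, compare_pair("cv683_12167.txt", "cv692_15451.txt") would return the two running times and the two similarity scores for that pair; the exact values in Table 1 depend on the authors' implementation and hardware.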
Document representation by selected frequent termsets reduces the size of the documents and the memory required during comparison, as shown in Table 2. Over the documents used in our experiments, which contain 4402 terms in total, our method uses only 516 terms; the size is reduced by 89.86%.

Table 2: Document size representation

Document            Doc. size with      Size of doc.        Reduced size        Percentage
                    all words           representation      (No. of words)      (reduced size)
                    (No. of words)      (No. of words)
cv683 12167.txt     733                 79                  724                 98.7%
cv692 15451.txt     805                 94                  711                 88.3%
cv003 11664.txt     558                 62                  496                 88.8%
cv087 1989.txt      472                 65                  407                 86.2%
cv778 17330.txt     301                 40                  261                 86.7%
cv779 17881.txt     349                 12                  337                 96.5%
cv086 18371.txt     533                 74                  459                 86.1%
cv508 16006.txt     651                 90                  561                 86.1%

The proposed method provides high efficiency: since the size of the documents to search in is reduced, the comparison time is reduced, as shown in Fig. 2. The proposed HFTR method is faster than the traditional method that uses all the words of each document, and its similarity score is greater than or equal to the one obtained when all words are used in the similarity calculation, as shown in Fig. 3.

Fig. 2: Time comparison between the All-words and HFTR methods
Fig. 3: Similarity comparison between the All-words and HFTR methods

4 Conclusion

In this paper, we propose a new method, based on highly selective frequent termsets, for representing a document and efficiently identifying the similarity between documents. Comparing a pair of documents using all of their words is costly in time; we therefore proposed representing each document by its frequent termsets, which significantly reduces both the size of the representation and the time needed to compute the similarity between two documents. We found that describing documents by chunks and using a low minimum support to identify frequent termsets helps to obtain informative frequent termsets of minimum length that can represent a document. When the similarity is calculated from the common frequent termsets two documents share, the termsets chosen to represent a document have little negative effect on the similarity score. We also observed that the Jaccard similarity computed in the traditional way, over all words of both documents, and the similarity computed over the frequent termsets chosen to represent them, are almost the same, with the latter often slightly higher. Representing documents by frequent termsets of minimum length is a robust method because it saves both space and time during document comparison.

References

[1] M. G. de Carvalho, A. H. F. Laender, M. A. Goncalves, A. S. da Silva, A genetic programming approach to record deduplication, IEEE Transactions on Knowledge and Data Engineering, vol. 24(3), 2012, pp. 399-412.
[2] E. Valls, P. Rosso, Detection of near-duplicate user generated contents: the SMS spam collection, in: Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, 2011, pp. 27-34.
[3] Y. Lin, T. Liao, S. Lee, Detecting near-duplicate documents using sentence-level features and supervised learning, 2012, pp. 1467-1476.
[4] B. Karthikeyan, V. Vaithiyanathan, C. V. Lavanya, Similarity detection in source code using data mining techniques, European Journal of Scientific Research, ISSN 1450-216X, vol. 62(4), 2011, pp. 500-505.
[5] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, ISBN 13: 9780521865715, Cambridge University Press, 2008.
[6] D. Fetterly, M. Manasse, M. Najork, On the evolution of clusters of near-duplicate web pages, in: Proceedings of the First Latin American Web Congress (LA-Web), 2003, pp. 37-45.
[7] G. S. Manku, A. Jain, A. Das Sarma, Detecting near-duplicates for web crawling, in: Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, 2007, pp. 141-150.
[8] M. Charikar, Similarity estimation techniques from rounding algorithms, in: Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), 2002, pp. 380-388.
[9] C. Xiao, W. Wang, X. Lin, J. X. Yu, Efficient similarity joins for near duplicate detection, in: Proceedings of the 17th International Conference on World Wide Web, Beijing, China, 2008, pp. 131-140.
[10] P. Lakkaraju, S. Gauch, M. Speretta, Document similarity based on concept tree distance, in: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, USA, 2008, pp. 127-132.
[11] X. Niu, D. Zhang, H. Liu, Research on query expansion based on association rules and application in web information retrieval, Journal of Computational Information Systems, vol. 4(3), 2008, pp. 977-984.