
Journal of Computational Information Systems 10: 19 (2014) 8201–8208
Available at http://www.Jofcis.com
High Representatives Selection Method for Near Duplicate Documents Detection
Gaudence UWAMAHORO, Zuping ZHANG ∗
School of Information Science and Engineering, Central South University, Changsha 410083, China
Abstract
The high dimensionality of text documents is an obstacle to identifying near duplicate documents. Systems are required to generate the wanted results in a short time, but they are slow and some results are omitted. Several methods have been proposed to remedy that shortcoming, but efficiency is still an issue. We propose a method based on highly selected terms from the documents to increase efficiency. On average we use about 12% of the original document size, obtain a similarity score at least as good as the one computed on the original documents, and thereby decrease the time of the searching process.
Keywords: Frequent Term; Near-duplicate Document Detection; Text Document; Apriori Algorithm;
Document Representation
1 Introduction
Access to digital documents has become easy because of Internet technology. In research areas such as Artificial Intelligence, Text Mining and Information Retrieval, researchers need information from those documents. Text mining tools are required to extract this information, but the task is challenging because text is not structured. Some documents are duplicates or near duplicates of others due to changes in their content produced by operations such as editing, deletion and substitution, as in [1, 2]. Several methods have been proposed, such as edit distance, fingerprinting, shingling, checksumming, bag-of-words and similarity measures like cosine and Jaccard, as in [3, 4]. They work well but still suffer from poor efficiency. The shingling technique is one of the methods for estimating the degree of similarity between documents: the more shingles two documents share, the more they are near duplicates, and if their shingle sets are identical the two documents are equivalent, as in [5]. Fetterly et al. use five-grams as shingles, sample 84 shingles for each document and then combine them into six super shingles; documents that share two super shingles are considered near duplicates [6].
The drawback of shingles is that their total size can be greater than the size of the document itself. Any method used to reduce the shingle size reduces the processing time, but it also reduces the accuracy of the result.
Project supported by the National Natural Science Foundation of China (Grant Nos. 61379109, M1321007) and the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20120162110077). We would like to thank the anonymous referees for their helpful comments and suggestions.
Corresponding author. Email address: [email protected] (Zuping ZHANG).
1553–9105 / Copyright © 2014 Binary Information Press
DOI: 10.12733/jcis11719
October 1, 2014
Most of the methods in use share the shortcoming of being efficient only for small documents. To overcome this shortcoming, different fingerprinting techniques have been proposed in [7, 8]. Xiao et al. in [9] proposed a new filtering technique that exploits the ordering information in prefix filtering. Another method, based on a concept tree, was proposed in [10], where each document is represented as a tree. In this paper, the proposed method is based on frequent terms extracted with the Apriori algorithm proposed by Agrawal and also used in [11]. A document is made of sentences, and the support of a term or termset is the number of sentences in which it appears.
2 Proposed Method
The high dimensionality of documents is an issue for near duplicate document detection. We propose an efficient method that represents each document using only a small number of highly selected frequent terms generated with the Apriori algorithm, instead of using all frequent termsets, which are not all necessary when documents are compared on the portion of frequent terms shared by a pair of documents; this saves space and time in the search process. The longer the frequent termsets used in the comparison, the smaller the chance that a given frequent term becomes one of the candidates to represent a document. We also observed that, with the tf-idf method, many terms of a document are never candidates during comparison: terms that appear in only one of the two documents take space and increase the time of the search process. In this paper we do not consider the order in which the terms of a sentence occur.
The proposed system consists of four phases: a text preprocessing phase (stopword removal, lowercasing of every term and tokenization), construction of phrases or chunks, finding the frequent termsets of every document in the collection, and similarity measurement using the Jaccard similarity. A document is considered to be a set of chunks (phrases), each made of a set of terms, for example D = {{american, action, film, slowly, drowning, death, sea}, {asian, wire-fu, copycats, pretty, death, leaving, likes}, ...}. From the terms appearing in those chunks we find the frequent termsets that can represent each document in the collection, which provides a fast way to compare documents.
Consider a document D made of a set of phrases S, where each phrase is a set of terms, and let minsup be the support threshold, i.e. the minimum support a term must have over the phrases. Let S = {S1, ..., Sn} be the set of phrases of a document and T be the set of all terms occurring in the phrases of S. Each phrase Sj is represented by the set of terms occurring in it, so that Sj ⊆ T. Let minsup be a real number, 0 ≤ minsup ≤ 1. If Tj is a set of terms, a phrase Sj is said to contain Tj if and only if Tj ⊆ Sj. Given a set of documents, the steps of the proposed method are shown in Fig. 1.
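To make these definitions concrete, the following minimal Python sketch (our own illustration, not part of the original paper) expresses phrase containment and support: a phrase Sj is a set of terms, a termset Tj is contained in Sj if and only if Tj ⊆ Sj, and the support of Tj in a document is the fraction of phrases that contain it.

def contains(phrase, termset):
    # A phrase Sj contains a termset Tj if and only if Tj is a subset of Sj.
    return termset <= phrase

def support(termset, phrases):
    # Support of a termset: fraction of the document's phrases that contain it.
    return sum(1 for p in phrases if contains(p, termset)) / len(phrases)

# Example with three phrases: the termset {"death"} appears in two of them,
# so its support 2/3 reaches the threshold minsup = 2/b and it is frequent.
S = [{"american", "action", "film", "death"},
     {"asian", "copycats", "death", "leaving"},
     {"pretty", "likes", "sea"}]
print(support({"death"}, S) >= 2 / len(S))   # prints True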
2.1 Algorithm description
In our proposed method, each document in the collection is split into fixed-size chunks of length 7, called phrases in this paper, which means that each phrase is made of seven consecutive terms. Let Dn = {d1, d2, ..., dn} be a collection of n documents of different sizes. Each document in the collection is made of chunks of seven consecutive words, which we call phrases in this paper.
Fig. 1: Frequent termsets selection for document representation
Therefore di = {p1, p2, p3, ..., pb}, where b is the number of phrases of the document. The set of phrases of each document is used as input to the algorithm that generates the sets of terms appearing frequently in different phrases of the document, also called frequent termsets. A frequent k-termset is made of k terms; for example, a 2-termset has k = 2. In our method we fix the minimum support minsup, which is the basis of frequent term generation: a term of a document is considered a frequent term if its support is minsup or more. The minimum support is fixed in such a way that each term is given a high chance of being generated as a frequent term. The minimum support of each document is calculated as 2/b, which means that for a term to be frequent it must appear in at least two phrases (chunks) of the document. The frequent termsets generated have different sizes (lengths).
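The chunking step and the per-document threshold minsup = 2/b can be sketched as follows. This is a minimal illustration under our own assumptions (regular-expression tokenization and a tiny stopword list), and names such as build_phrases are ours, not the paper's.

import re

STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "to", "it"}  # illustrative subset only

def build_phrases(text, chunk_len=7):
    # Preprocessing: lowercase, tokenize, drop stopwords.
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    # Phrase construction: consecutive chunks of chunk_len terms, each kept as a set.
    return [set(tokens[i:i + chunk_len]) for i in range(0, len(tokens), chunk_len)]

def minimum_support(phrases):
    # minsup = 2/b: a term must occur in at least two of the b phrases to be frequent.
    return 2 / len(phrases)

With the support function of the previous sketch, a term t of a document with phrases P is frequent exactly when support({t}, P) >= minimum_support(P).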
Let Ck be the set of candidate termsets of size k. We count the frequent termsets of every size: singletons (k = 1), doubletons (k = 2), triples (k = 3), quadruples (k = 4), and so on. The frequent k-termsets are generated from the frequent (k-1)-termsets; for example, the quadruples are generated from the triples. This means that a k-termset cannot be frequent unless all of its subsets are frequent. After generating all frequent termsets of each document, we have to choose the ones that will represent that document. The choice is guided by the similarity between the pair of documents being compared, calculated with the Jaccard similarity: we compare the frequent termsets of both documents length by length. For example, if the maximum length of the frequent termsets is k = 4, we compare the frequent 4-termsets of one document with the frequent 4-termsets of the other, record the similarity, and continue down to the minimum length; the termsets that represent a document are the ones used to obtain the highest similarity score. With the Apriori algorithm, the first pass counts the occurrences of the terms in all phrases of the document to obtain the frequent 1-termsets F1. Subsequently, the frequent k-termsets Fk are generated from Fk-1, where k is the termset length. Consider a document D as a set of chunks of fixed length L; therefore D = {t1, t2, t3, ..., tn}, where t1, t2, t3, ..., tn are the chunks of D.
LookforFrequent Algorithm
Input: T[]: the phrases (chunks) of the document; minimum-support
Output: F: the set of frequent termsets
F1 = {frequent 1-termsets in T}
1. for (k = 2; Fk-1 ≠ ∅; k++) do
      Ck = apriori-compute(Fk-1)
2.    for all chunks w in T do
         Cw = subset(Ck, w), where Cw is the set of candidate termsets of size k contained in w
3.       for all candidate termsets d ∈ Cw do
            d.count++
4.       end for
5.    end for
      Fk = {d ∈ Ck | d.count ≥ minimum-support}
6. end for
7. Return F = ∪k Fk
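A compact Python rendering of this procedure might look as follows. It is a sketch under our assumptions: candidate k-termsets are built by joining pairs of frequent (k-1)-termsets and pruned only by the support count, omitting the full Apriori subset-pruning step for brevity.

def look_for_frequent(phrases, min_count=2):
    # phrases: the list of term sets (chunks) of one document.
    # A termset is frequent if it is contained in at least min_count phrases
    # (min_count = 2 corresponds to minsup = 2/b).
    def count(termset):
        return sum(1 for p in phrases if termset <= p)

    terms = set().union(*phrases) if phrases else set()
    current = [frozenset([t]) for t in terms if count(frozenset([t])) >= min_count]
    frequent = {1: current}            # F1
    k = 2
    while current:                     # stop once F(k-1) is empty
        # Candidate k-termsets obtained by joining frequent (k-1)-termsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if count(c) >= min_count]
        if current:
            frequent[k] = current      # Fk
        k += 1
    return frequent                    # the Fk, grouped by termset length k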
To select the frequent termsets that represent each document, we choose the ones that give the highest similarity score when two documents are compared, starting from the maximum length of their frequent termsets. To calculate the similarity of two documents, we compare all frequent termsets of the same length, from the maximum length down to the minimum length. Take, for example, a maximum length of k = 4, i.e. frequent 4-termsets in which each termset is composed of 4 terms; the minimum length is 1, i.e. frequent 1-termsets. First, the algorithm checks the maximum length of the frequent termsets of each document. If both documents have the same maximum length, the similarity between them can be calculated at that length; otherwise the algorithm moves to length k-1 (maximum length minus one).
The comparison of both documents continues until it reaches the frequent 1-termsets, and the frequent termsets that yield the highest similarity score are chosen as the representation of each document. Our method has two parts: one identifies the frequent termsets of every document using the Identify-Frequent (IF) algorithm, and the other chooses the frequent termsets that give the highest similarity score during comparison, called High Frequent Termset Representation (HFTR). Let D = {d1, d2, d3, ..., dn} be a set of documents.
Identify-Frequent (IF) Algorithm
Input: document di
Output: L: list of frequent termsets of different sizes
Terms ← []
1. for each di in D
2.    for each chunk in di
         Terms ← Terms ∪ {chunk}
3.    end for
4. end for
5. L ← LookforFrequent(Terms)
6. Return L
High Frequent Termset Representation (HFTR) Algorithm
Input: di and Li: document di and its list of frequent termsets; dj and Lj: document dj and its list of frequent termsets
Output: maximum similarity score based on frequent termsets
1. for each document in D
      D ← list(di)
      counter = n
2. for all (di, dj) ∈ D
3.    while counter > 0
         Simfreq ← jaccard(di.Lcounter, dj.Lcounter)
         counter = counter − 1
4.    end while
5. end for
6. end for
7. Return max(Simfreq)
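The level-by-level comparison performed by HFTR can be sketched as below (again our own rendering): at each length k, from the largest length the two documents have in common down to 1, the Jaccard similarity is computed over the two sets of frequent k-termsets, and the highest score is returned; the level at which that maximum is reached gives the termsets chosen to represent the pair.

def jaccard(a, b):
    # Jaccard similarity of two sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hftr_similarity(freq_i, freq_j):
    # freq_i, freq_j: dicts mapping termset length k to the frequent k-termsets
    # of each document (the output of look_for_frequent above).
    best = 0.0
    top = min(max(freq_i, default=0), max(freq_j, default=0))
    for k in range(top, 0, -1):        # from the largest common length down to 1
        level_i = set(map(frozenset, freq_i.get(k, [])))
        level_j = set(map(frozenset, freq_j.get(k, [])))
        best = max(best, jaccard(level_i, level_j))
    return best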
2.2 Time complexity
In our algorithm the operations considered are those needed to generate the frequent termsets and those needed to choose the frequent termsets for document representation. The time for building the chunks of every document is O(n). The time spent also depends on the input size, since each document is made of several terms. The generation of the frequent termsets is done with Apriori, which uses a bottom-up, breadth-first search approach: at each iteration a set of candidate termsets is generated, the occurrences of each candidate termset are counted, and the large termsets are decided based on the predetermined minimum support fixed by the user. The time complexity of that operation is O(n²). After obtaining all frequent termsets, the main operation of our algorithm is to choose the frequent termsets that represent each document, by comparing the frequent termsets starting from the maximum length.
The time required for that operation depends on the length of the frequent termsets. If the maximum length is 5, the number of comparison levels is 5, since the comparison starts at the maximum length and goes down to the minimum length, which is one. Since each comparison uses intersection and union to calculate the similarity, and at each iteration the size of the documents to compare decreases, the time also decreases; the time complexity of this operation is O(n). The overall time complexity of our algorithm is O(n²). As the number of frequent termsets of the documents to compare decreases, the time used to compare them decreases.
The working space cannot exceed the running time, since writing to each memory cell requires at least a constant amount of time.
3 Experiment
We show the process of comparing pairs of documents for near duplicate detection using our method and show that it performs better than the method that uses all words. Both methods calculate similarity with the Jaccard coefficient, i.e. the number of terms shared by both documents over the total number of distinct terms of both documents. Take two documents, cv683_12167.txt (D1) with 733 words and minsup 0.019, and cv692_15451.txt (D2) with 805 words and minsup 0.01. Each document is a set of sets of frequent termsets of different lengths generated based on minsup; a term is frequent if it appears in at least 2 phrases. D1 = {s1, s2, s3} and D2 = {s21, s22, s23, s24}, where s1 is the set of frequent 1-termsets with 79 terms, s2 is the set of frequent 2-termsets with 35 terms and s3 is the set of frequent 3-termsets with 5 terms in document D1, while s21 is the set of frequent 1-termsets with 94 terms, s22 is the set of frequent 2-termsets with 35 terms, s23 is the set of frequent 3-termsets with 8 terms and s24 is the set of frequent 4-termsets with 1 termset in document D2. To choose the frequent terms that represent the documents, the proposed method starts from the sets with the highest frequent termset length of the two documents, in this comparison the frequent 4-termsets. As the set of 4-termsets of cv683_12167.txt is empty, we continue with the frequent 3-termsets, then the 2-termsets, until we reach the frequent 1-termsets. Comparing these documents on the frequent 1-termsets, i.e. the frequent termsets of minimum length 1, gives a similarity score higher than the one obtained using all terms of the documents, so the set of frequent 1-termsets is chosen to represent each document. The similarity using all words is 0.0520156046814, whereas using frequent termsets it is 0.0520231213873. Keeping all terms is costly in space, while using frequent termsets the size of each document is greatly reduced and the running time is reduced to a minimum, as shown in Table 1.
Using documents taken from the positive part of the txt_sentoken data in our experiment, we have seen that highly selected frequent termsets can be used instead of all the words of a document to obtain the similarity between a pair of documents quickly. The abbreviations in Table 1, D.Rep., RTw, RTF, S ALLw and S FQF, stand respectively for document representation, running time using all words, running time using frequent termsets, similarity using all words and similarity using frequent termsets. In columns 1 and 2, documents are represented by their identifiers (ids) followed by their sizes.
Table 1: Document time and size comparison

Documents                D.Rep.                RTw        RTF        S ALLw     S FQF
id1(733w) - id2(805w)    id1(79w) - id2(94w)   0.0249     0.0031     0.0520     0.0520
id3(558w) - id4(472w)    id3(62w) - id4(65w)   0.0161     0.0024     0.0456     0.0472
id5(301w) - id4(472w)    id5(40w) - id4(65w)   0.0124     0.0021     0.0232     0.0285
id1(733w) - id6(349w)    id1(79w) - id6(12w)   0.016639   0.001871   0.049907   0.054945
id1(733w) - id3(558w)    id1(79w) - id3(62w)   0.0197     0.0026     0.0627     0.0709
id7(533w) - id1(733w)    id7(74w) - id1(79w)   0.0194     0.0028     0.0529     0.0653
id8(651w) - id1(733w)    id8(90w) - id1(79w)   0.0212     0.0030     0.0578     0.0710
Table 1 lists the pairs of documents compared, with the number of terms of each document. The Doc-Representation column (column two of Table 1) gives the number of frequent termsets selected to represent each document. Representing documents by selected frequent termsets reduces the size of the documents and the memory space required during comparison, as shown in Table 2. From the documents used in our experiments, with 4402 terms in total, our method uses only 516 terms, i.e. the size is reduced by 89.86%.
Table 2: Document size representation

Document            All words        Doc. Representation   Reduced size     Percentage
                    (No. of words)   (No. of words)        (No. of words)   (reduced size)
cv683_12167.txt     733              79                    724              98.7%
cv692_15451.txt     805              94                    711              88.3%
cv003_11664.txt     558              62                    496              88.8%
cv087_1989.txt      472              65                    407              86.2%
cv778_17330.txt     301              40                    261              86.7%
cv779_17881.txt     349              12                    337              96.5%
cv086_18371.txt     533              74                    459              86.1%
cv508_16006.txt     651              90                    561              86.1%
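As a rough, self-contained illustration of how one row of Table 1 is produced, the sketches above can be chained as below. The two strings merely stand in for a pair of txt_sentoken reviews, so the toy output does not reproduce the scores reported in the tables; on this input the frequent-termset score nevertheless comes out higher than the all-words score, matching the trend of Table 1.

doc1 = ("an american action film slowly drowning its hero in a sea of explosions "
        "while the hero keeps drowning in the same sea of loud explosions")
doc2 = ("asian wire-fu copycats leave the same hero drowning in a pretty sea "
        "of explosions and the copycats keep leaving the hero in that sea")

p1, p2 = build_phrases(doc1), build_phrases(doc2)

# Baseline: Jaccard similarity over all (preprocessed) words of both documents.
sim_all = jaccard(set().union(*p1), set().union(*p2))

# Proposed: Jaccard similarity over the selected frequent termsets (HFTR).
sim_freq = hftr_similarity(look_for_frequent(p1), look_for_frequent(p2))

print(sim_all, sim_freq)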
The proposed method provides high efficiency, since the size of the documents to be searched is reduced, which leads to a reduced comparison time, as shown in Fig. 2. The proposed HFTR method is faster than the traditional method that uses all the words of each document, and the similarity score it yields is greater than the one obtained when all words are used in the similarity calculation, as shown in Fig. 3.
Fig. 2: Time comparison between using All-words and HFTR methods
Fig. 3: Similarity comparison between using All-words and HFTR methods
4 Conclusion
In this paper, we propose a new method that represents a document by highly selected frequent termsets in order to identify the similarity between documents efficiently. We show that comparing a pair of documents using all of their words is costly in time, and we propose instead to represent each document by frequent termsets, which leads to a significantly reduced size and a reduced time to obtain the similarity between documents. We found that describing a document by chunks and using a small minimum support to identify its frequent termsets helps to obtain important frequent termsets of minimum length that can represent the document. When the similarity is calculated from the frequent termsets shared by two documents under the proposed method, the frequent termsets chosen to represent a document have little adverse effect on the similarity score. We have also observed that the similarity
calculated with the same Jaccard measure is higher or almost the same when the documents are represented by their chosen frequent termsets as when the traditional method using all the words of both documents is applied. Representing documents by frequent termsets of minimum length is a robust method because it saves space and time during document comparison.
References
[1] M. G. de Carvalho, A. H. F. Laender, M. A. Goncalves, A. S. da Silva, A genetic programming approach to record deduplication, IEEE Transactions on Knowledge and Data Engineering, vol. 24(3), 2012, pp. 399-412.
[2] E. Valls, P. Rosso, Detection of near-duplicate user generated contents: the SMS spam collection, in: Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, 2011, pp. 27-34.
[3] Y. Lin, T. Liao, S. Lee, Detecting near-duplicate documents using sentence-level features and supervised learning, 2012, pp. 1467-1476.
[4] B. Karthikeyan, V. Vaithiyanathan, C. V. Lavanya, Similarity detection in source code using data mining techniques, European Journal of Scientific Research, ISSN 1450-216X, vol. 62(4), 2011, pp. 500-505.
[5] C. D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008, ISBN 9780521865715.
[6] D. Fetterly, M. Manasse, M. Najork, On the evolution of clusters of near-duplicate web pages, in: Proceedings of the First Latin American Web Congress (LA-WEB), 2003, pp. 37-45.
[7] G. S. Manku, A. Jain, A. Das Sarma, Detecting near-duplicates for web crawling, in: Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, 2007, pp. 141-150.
[8] M. Charikar, Similarity estimation techniques from rounding algorithms, in: Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), 2002, pp. 380-388.
[9] C. Xiao, W. Wang, X. Lin, J. X. Yu, Efficient similarity joins for near duplicate detection, in: Proceedings of the 17th International Conference on World Wide Web (WWW), Beijing, China, 2008, pp. 131-140.
[10] P. Lakkaraju, S. Gauch, M. Speretta, Document similarity based on concept tree distance, in: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, USA, 2008, pp. 127-132.
[11] X. Niu, D. Zhang, H. Liu, Research on query expansion based on association rules and application in web information retrieval, Journal of Computational Information Systems, vol. 4(3), 2008, pp. 977-984.