Handbook of Research on Text and Web Mining Technologies

Min Song
New Jersey Institute of Technology, USA

Yi-fang Brook Wu
New Jersey Institute of Technology, USA

Volume I

Information Science Reference
Hershey • New York
Director of Editorial Content: Kristin Klinger
Director of Production: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Typesetter: Chris Hrobak
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com
and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanbookstore.com
Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by
any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does
not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Handbook of research on text and web mining technologies / Min Song and Yi-Fang Wu, editors.
p. cm.
Includes bibliographical references and index.
Summary: "This handbook presents recent advances and surveys of applications in text and web mining of interest to researchers and end-users"--Provided by publisher.
ISBN 978-1-59904-990-8 (hardcover) -- ISBN 978-1-59904-991-5 (ebook)
1. Data mining--Handbooks, manuals, etc. 2. Web databases--Handbooks, manuals, etc. I. Song, Min, 1969- II. Wu, Yi-Fang, 1970-
QA76.9.D343H43 2008
005.75'9--dc22
2008013118
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book set is original material. The views expressed in this book are those of the authors, but not necessarily of
the publisher.
If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating
the library's complimentary electronic access to this publication.
Chapter IX
Featureless Data Clustering
Wilson Wong
University of Western Australia, Australia
Wei Liu
University of Western Australia, Australia
Mohammed Bennamoun
University of Western Australia, Australia
ABSTRACT
Feature-based semantic measurements have played a dominant role in conventional data clustering
algorithms for many existing applications. However, the applicability of existing data clustering approaches to a wider range of applications is limited due to issues such as the complexity involved in semantic computation, the long pre-processing time required for feature preparation, and the poor extensibility of semantic measurements due to non-incremental feature sources. This chapter first summarises many commonly used clustering algorithms and feature-based semantic measurements, and then highlights their shortcomings to make way for the proposal of an adaptive clustering approach based on featureless
semantic measurements. The chapter concludes with experiments demonstrating the performance and
wide applicability of the proposed clustering approach.
INTRODUCTION
Data clustering has a wide range of applicability ranging from ontology learning (Wong et al., 2006)
and market research, to pattern recognition and image processing (Jain et al., 1999). Depending on the
areas of applications, many different names such as cluster analysis, numerical taxonomy, automatic
classification, botryology and typological analysis have been devised to refer to essentially the same
practice of data clustering. Each application of data clustering can be characterised by two aspects, namely,
the manner through which the clusters are formed, and the criteria that govern the formation of clusters.
The first aspect relates to the choice of clustering algorithm while the second aspect dictates the type of
semantic measure (i.e. similarity or relatedness) to be used. The choice of clustering algorithm and semantic measure very much depends on the data elements to be clustered, computational constraints, and also the desired results. In the past, data clustering has been particularly successful with certain
types of data such as documents, software systems, lexical units, webpages, Uniform Resource Identifiers
(URIs) and images. This gives rise to various sub-areas of clustering such as document clustering
(Steinbach et al., 2000), software botryology (Tzerpos and Holt, 1998), term clustering (Wong et al.,
2007), webpage clustering (Wang and Kitsuregawa, 2002), usage-based URI clustering (Mobasher et
al., 1999) and clustering-based image segmentation (Jain et al. 1999).
Unlike documents, webpages, software systems, and images or visual objects, terms and URIs
are essentially featureless. The semantic relation between visual objects can be established by feature
analysis based on visible (e.g. physical and behavioral) traits. The metadata and text that documents
and webpages contain can be useful as features. However, establishing the semantic relation between
data elements such as terms depends on something less tangible, namely, background knowledge which
humans acquired through their senses over the years. Similarly, URIs alone are essentially strings of
characters whose significance can be difficult to establish without visiting the actual resources such as
webpages. Along the same line of thought, Lagus et al. (1996) state “In principle a document might be
encoded as a histogram of its words. . . symbolic words as such retain no information of their relatedness". To illustrate, how does one know that a cow has four legs without actually looking at it or acquiring such a fact from someone else? Equally challenging is how one can establish that the URI http://del.icio.us/ actually refers to a social bookmarking site and not a website related to cooking
or food without prior exposure? In this chapter, we refer to documents, webpages, images and software
systems as decomposable data while non-decomposable data is used exclusively to denote terms and
URIs. Decomposable data elements are those that can be meaningfully analysed in terms of their components or constituent parts for uncovering semantics or significance. Obviously, having the ability to
take apart a data element for extracting features to assist in further stages of semantic relation analysis
can be very helpful in data clustering. Putting aside the computational complexity and complications
related to feature extraction and feature selection, clustering involving decomposable data can be much
more objective since the criteria for cluster formation can be explicitly identified.
The focus of this chapter is on the clustering of non-decomposable data (i.e. terms and URIs) where
the absence of explicit features poses certain challenges which require adjustments to be made, especially with regard to the computation of semantic relations and to the clustering algorithm itself. Firstly, a
recent survey in ontology learning (Gomez-Perez and Manzano-Macho, 2003) reveals that all reviewed
systems which apply term clustering methods rely on contextual cues surrounding the terms as features.
However, the large collections of documents, and the predefined patterns and templates, that are required for the extraction of contextual cues make the portability and extensibility of such systems difficult. Similarly, the conventional approach to URI clustering for Web personalization requires a large amount
of log records to enable the use of feature-based semantic measures. Consequently, featureless semantic
measures are fast becoming a necessity for clustering non-decomposable data. Secondly, to compensate
for the lack of configurable and objective measurement of semantic relation through the use of features,
changes to the clustering algorithm are also necessary. In feature-based semantic measurements, the
type and granularity of features can be selected or refined depending on application requirements. For
example, the choice of features for face recognition can come from certain regions of the face such
as ears, and some characteristics such as colour, intensity or texture can have different significance.
On the other hand, featureless measurement typically relies on more prevalent or dominant aspects to
determine semantic relations, and hence, can be rather subjective. As such, the clustering algorithm
itself has to play a more significant role for refining the clusters to compensate for the lack of control
during semantic measurement.
In this chapter, we will look at an enhanced 3-pass Tree-Traversing Ant (TTA) algorithm for clustering
both terms and URIs. This enhanced TTA algorithm is an improvement over the original 2-pass version
proposed by Wong et al. (2007) which was designed specifically for term clustering. This 3-pass TTA
is empowered by two featureless measurements of relatedness and distance, namely, the Normalised
Google Distance (NGD) by Cilibrasi and Vitanyi (2007), and n° of Wikipedia (n°W) by Wong et al.
(2006). While sharing minor functionalities with the conventional hierarchical algorithm, the 3-pass
TTA is actually a hybrid algorithm that performs both division and agglomeration whose performance
is controlled using ant-sorting algorithms (Lumer and Faieta, 1994). Before that, we will have a brief
look into several common clustering algorithms, and semantic measures in Section 2. In Section 3,
we will present the enhanced 3-pass clustering algorithm, and the relatedness and distance measures
involved. In Section 4, we will include some experimental results related to the clustering of terms and
URIs. We conclude this chapter with an outlook to potential extensions of this work and suggested
directions in Section 5.
BACKGROUND AND RELATED WORKS
Two of the most common families of data clustering algorithms are hierarchical and non-hierarchical clustering. In hierarchical clustering, successive clusters are discovered using previously established clusters, while in non-hierarchical clustering, all clusters are established at once. Hierarchical
algorithms can be further divided into divisive algorithms and agglomerative algorithms. Divisive
clustering begins with all data elements in one cluster, which is later successively divided into smaller
ones. On the other hand, agglomerative clustering begins with single-element clusters and works by
successively merging them into larger ones. As for non-hierarchical clustering, some of the common
algorithms are K-means and fuzzy C-means. Other less conventional clustering algorithms include ant-based clustering, which is inspired by the collective behaviour in decentralized, self-organized systems.
In addition to the clustering algorithm, a function or measure for determining the similarity, distance or
relatedness between elements is essential. In conventional clustering, feature-based measures such as
cosine similarity, Jaccard distance or Dice similarity (Salton and McGill, 1983) are employed to support
the use of hierarchical and non-hierarchical clustering. In regard to the types of semantic relations, certain
researchers consider the terms “similarity” and “relatedness” to be interchangeable. In the strictest sense,
semantic similarity deals only with synonymy, while semantic relatedness incorporates a wider range of
concepts such as antonymy, hyponymy and meronymy. For example, the words “buy” and “purchase”
belong to the same semantic class and can be used interchangeably within the same context. In that
sense, the two words are semantically similar. On the other hand, the words “car” and “tyre” cannot be
considered as similar. Nonetheless, we can definitely accept that “car” is related to “tyre” in some way
(e.g. through meronymy) more than “car” is to “vegetarian”. Due to the absence of feature extraction
and selection during the measurement of semantic relation using featureless measures, the distinction
between similarity and relatedness can be rather obscure. As such, having discussed the differences,
we will continue to use the terms semantic similarity and semantic relatedness interchangeably.
Hierarchical algorithms are relatively common in term clustering, especially for related applications
such as ontology learning. Faure and Nedellec (1998) presented a corpus-based conceptual clustering
method as part of an ontology learning system called ASIUM. The clustering method was designed for
aggregating basic classes based on a distance measure inspired by the Hamming distance. The basic
classes are formed prior to clustering in a phase for extracting sub-categorization frames (Faure and
Poibeau, 2000). Terms that appear on at least two different occasions with the same verb, and the same preposition or syntactic role, can be regarded as semantically similar such that they can substitute for each other in a particular context. Such terms are extracted in the form of selection restrictions as part of
the sub-categorization frames. These semantically similar terms will form the basic classes. The basic
classes form the lowest level of the ontology and are successively aggregated to construct a hierarchy
from the bottom up. Each time, only two basic classes are compared. The clustering begins by computing the distance between all pairs of basic classes and aggregating those with a distance less than the user-defined threshold. The distance between two classes containing the same words with the same frequencies is 0. On the other hand, a pair of classes without a single common word has a distance of 1. In other
words, the terms in the basic classes act as features, allowing for inter-class comparison. The measure
for distance is defined as

$$dist(C_1, C_2) = 1 - \frac{N_{comm}\,\frac{\sum F_{C_1}}{\sum_{i} f(word_i^{C_1})} + N_{comm}\,\frac{\sum F_{C_2}}{\sum_{i} f(word_i^{C_2})}}{card(C_1) + card(C_2)}$$

where card(C1) and card(C2) are the number of words in (i.e. the size of) classes C1 and C2 respectively, and Ncomm is the number of words common to both C1 and C2. ΣFC1 and ΣFC2 are the sums of frequencies of the words in C1 and C2 which also occur in C2 and C1, respectively. f(wordiC1) and f(wordiC2) are respectively the frequency of the i-th word of classes C1 and C2.

Maedche and Volz (2001) presented a bottom-up hierarchical clustering method that functions as part of the ontology learning system Text-to-Onto.
This clustering method, designed for clustering terms, relies on an all-knowing oracle, denoted by H,
which is capable of returning possible hypernyms for a given term. In other words, the performance
of the clustering algorithm has an upper-bound limited by the ability of the oracle to know all possible
hypernyms for a term. The oracle is constructed with the help of WordNet and lexico-syntactic patterns (Cimiano and Staab, 2005). During the clustering phase, the algorithm is fed with a list of
terms and the similarity between each pair is computed using a cosine measure. For this purpose, the
syntactic surface dependencies of each term, in the form of modifiers of various types of phrases, are
extracted and used as the features for that term. The algorithm is an extremely long list of nested if-else statements. For the sake of brevity, it suffices to know that the algorithm attempts to examine the hypernym–hyponym relations between all pairs of terms as it decides to place a term as a hypernym of the other or vice versa, or to place two terms under a new parent. Each time information about the
hypernymy relation between two terms is required, the oracle is consulted. The projection H(t) will
return a set of tuples (x, y) where x is a hypernym for term t, and y is the number of times the algorithm
has found evidence for it.
Besides the conventional hierarchical clustering, the less common idea of ant-based clustering was
first proposed by Deneubourg et al. (1991) as part of an attempt to explain the different types of emergent technologies inspired by nature. Ants are represented as agents that move at random around some simulated environment such as a square grid. Objects are randomly placed in this environment and the ants can pick up the objects, move them and drop them. These three basic operations are influenced by the
distribution of the objects. Objects that are surrounded or isolated by dissimilar ones are more likely to
be picked up and later dropped elsewhere in the surroundings of more similar ones. The picking up and dropping of objects are governed by the following probabilities:

$$P_{pick} = \left(\frac{k_p}{k_p + f(i)}\right)^2 \qquad P_{drop} = \left(\frac{f(i)}{k_d + f(i)}\right)^2$$
where f(i) is an estimation of the distribution density of the objects in the ants’ immediate environment
(i.e. local neighborhood) with respect to the object that the ants are considering to pick or drop, and
kp and kd are two threshold constants. The choice of f(i) varies depending on the cost and other factors related to the environment and the data items. As f(i) decreases below kp, the probability of picking up the object is very high, and the opposite occurs when f(i) exceeds kp. As for the probability of dropping
an object, high f(i) exceeding kd will induce the ants to give up the object while low f(i), which is less
than kd, will encourage the ants to hold onto the object. The combination of these three simple operations and the heuristics behind them gave birth to the notion of basic ants for clustering, also known
as standard ant clustering algorithm (SACA). Lumer and Faieta (1994) further extended and improved
SACA in terms of the numerical aspects of the algorithm, and the convergence time. The authors represented the objects in terms of numerical vectors and the distance between the vectors is computed
using the Euclidean distance. Hence, given δ(i, j)∈[0, 1] as the Euclidean distance between object i (i.e. i is the location of the object in the center of the neighborhood) and every other neighboring object j,
the neighborhood function f(i) is defined by the authors as

$$f(i) = \max\left(0,\ \frac{1}{s^2} \sum_{j} \left(1 - \frac{\delta(i, j)}{\alpha}\right)\right) \qquad (1)$$
where s² is the size of the local neighborhood, and α∈(0, 1] is the constant for scaling the distance among
objects. In other words, an ant will have to consider the average similarity of object i with respect to
all other objects j in the local neighborhood before performing an operation (i.e. pickup or drop). As
the value of f(i) is obtained by averaging the total similarities over the number of neighboring cells, s²,
empty cells which do not contribute to the overall similarity must be penalized. In addition, the radius
of perception (i.e. the extent to which objects are taken into consideration for f(i)) of each ant at the
center of the local neighborhood is given by (s − 1)/2.
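For concreteness, a minimal Python sketch of these operations is given below, under the assumption that objects are numeric vectors; the function names are illustrative rather than taken from the original papers.

```python
import numpy as np

def neighbourhood_density(obj, neighbours, alpha, s):
    """Lumer-Faieta f(i) of Equation (1): scaled similarity of obj to its
    neighbours, averaged over all s*s cells so that empty cells act as a
    penalty; negative totals are clamped to zero."""
    total = sum(1.0 - np.linalg.norm(obj - m) / alpha for m in neighbours)
    return max(0.0, total / (s * s))

def p_pick(f_i, k_p):
    """Deneubourg pick-up probability: approaches 1 as f(i) falls below k_p."""
    return (k_p / (k_p + f_i)) ** 2

def p_drop(f_i, k_d):
    """Deneubourg drop probability: approaches 1 as f(i) rises above k_d."""
    return (f_i / (k_d + f_i)) ** 2
```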
Even though most conventional clustering algorithms appear to be more established, researchers
(Handl et al., 2003) have shown that certain commonly adopted algorithms such as K-means and average-link agglomerative clustering yield mediocre results in comparison with ant-based algorithms. Handl et
al. (2003) have also demonstrated certain desirable properties in ant-based algorithms such as
• Tolerance to different cluster sizes;
• Ability to identify the number of clusters;
• Performance that increases with the size of the dataset; and
• Graceful degradation in the face of overlapping clusters.
Despite such advantages, the potential of ant-based algorithms remains relatively unexplored for possible applications in term clustering.
In addition to terms, the second type of non-decomposable data of interest to us is Uniform Resource Identifiers (URIs). The clustering of URIs is studied primarily for the purpose of Web personalisation and navigation. In this area, the navigation histories of users are employed for clustering URIs.
As such, the resulting clusters are URIs related through user navigational patterns instead of content
similarity or site authority. Mobasher et al. (1999) employ association rule hypergraph partitioning (Han
et al., 1997) to cluster URIs based solely on their co-occurrence across different user sessions. As such,
no computation of similarity or distance is required. The authors argued that conventional clustering
techniques that require user sessions for the computation of feature-based similarity or distance are not
feasible since the size of navigation histories is often very large. The conventional usage-based clustering of URIs was extended by Manco et al. (2003). In addition to the use of user sessions as features, the
authors included the similarity of Web page content in the clustering of URIs for Web personalisation.
Contents are simply reduced into a “collection of all meaningful and unique word stems appearing in
pi [the Web page] itself.” (Manco et al., 2003) during a preprocessing phase and the weighting scheme
Term Frequency-Inverse Document Frequency (TF-IDF) is then employed to compute weights for the
terms.
All in all, the use of feature-based similarity or relatedness measures remains predominant in
clustering non-decomposable data. In conventional term clustering, contextual cues located around
words are used primarily as features to enable feature-based similarity or relatedness measurement.
For example, if both words “cow” and “goat” appear frequently with “grass”, then we can deduce
that the two words are related in some way through "grass". However, such approaches are inevitably faced with challenges, including the sourcing of a large, authoritative supply of contextual cues, and the complications in extracting and preparing such features. Similarly, the clustering of URIs has
traditionally been very much reliant on access logs or navigational histories as sources of features. To
address these issues, we propose an enhanced, adaptive approach inspired by hierarchical and ant-based
algorithms, for clustering non-decomposable data using only featureless relatedness measurements in
the next section.
ADAPTIVE CLUSTERING USING FEATURELESS MEASURES
In this section, we will discuss the details of the adaptive clustering approach, including the algorithms
of the three passes involved, and the relatedness and distance measure employed.
Featureless Relatedness and Distance Measures
The adaptive approach presented in this chapter for clustering terms and URIs makes use of two types
of featureless measures, namely, Normalised Google Distance (NGD) and n° of Wikipedia (n°W). For
a thorough discussion on the derivation of NGD, interested readers can refer to Cilibrasi and Vitanyi
(2007). At the fundamental level, NGD relies heavily on co-occurrence statistics obtained from search
engines such as Google, which represents the indexable portion of the Web. More importantly, the
index size of Google and its constant update provide NGD with access to a dynamic textual resource
for inducing the required statistical evidence to estimate relatedness. The relatedness between any two
data elements x and y, either terms or URIs, is established using the equation below:
$$sim(x, y) = 1 - NGD(x, y)^{\theta} \qquad (2)$$

where θ is a constant for scaling the distance value computed by NGD (Cilibrasi and Vitanyi, 2007).
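A minimal sketch of this computation in Python follows, assuming a caller-supplied page_count function that returns search engine hit counts and an estimate n_pages of the index size (both names, and the use of a space-separated query for co-occurrence counts, are assumptions of the sketch):

```python
from math import log

def ngd(x, y, page_count, n_pages):
    """Normalised Google Distance (Cilibrasi & Vitanyi, 2007) computed
    from page counts; n_pages is the (estimated) size of the index."""
    fx, fy = log(page_count(x)), log(page_count(y))
    fxy = log(page_count(x + ' ' + y))  # pages containing both x and y
    return (max(fx, fy) - fxy) / (log(n_pages) - min(fx, fy))

def sim(x, y, page_count, n_pages, theta=1.0):
    """Equation (2): relatedness as the complement of the scaled distance."""
    return 1.0 - ngd(x, y, page_count, n_pages) ** theta
```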
The second measure utilises the online graph-structured textual resource, Wikipedia. Similar to NGD, the notion of calculating relatedness or distance without relying too much on features has only gained popularity in recent years, as evidenced by the dates of the relevant publications. For example, Milne (2007) proposed
the Wikipedia Link Vector Model (WLVM) for calculating semantic relatedness between terms using
the link structures and titles of articles found within the Wikipedia articles. The relatedness between
two articles is computed using the Vector Space Model (VSM), which is used extensively in Information
Retrieval (IR). However, the links found within the articles are utilised to construct the vectors instead
of the conventional use of terms in IR. Other proposed computations of relatedness using Wikipedia
include those by Gabrilovich and Markovitch (2007) and Strube and Ponzetto (2006), which they referred to as Explicit Semantic Analysis (ESA) and WikiRelate!, respectively. Unlike the approach by Milne (2007), WikiRelate! and ESA rely heavily on the textual content of Wikipedia articles.
In this chapter, we look at the measurement of distance using Wikipedia from the perspective of
graph theory instead of conventional VSM. The distance measure used in this chapter for clustering is
n°W proposed by Wong et al. (2006). In this measure, articles in Wikipedia are regarded as manifestations of the information encoded in terms. Following this, each term a can then be represented using the
corresponding article doca in Wikipedia. In Wikipedia, articles are organized using categorical indices
which eventually lead to the highest level, namely, “Categories”. The organisation of articles in Wikipedia loosely resembles a directed acyclic graph with a root node, instead of a pure tree structure. Hence,
finding the distance between two terms a and b can be reduced to finding how closely located the two corresponding articles doca and docb are in the Wikipedia categorical indices. The problem of
finding the degree of separation between two articles can be considered as a single-source shortest path
problem. For this purpose, Dijkstra's algorithm for finding the shortest path between two vertices
(i.e. articles) is employed. Formally, the distance between terms a and b given by n°W is defined as

$$\delta(a, b) = n^{\circ}W(doc_a, doc_b) = \sum_{k=1}^{|SP|} c_{e_k} \qquad (3)$$
where n°W(doca, docb) is the degree of separation between the articles doca and docb which correspond to terms a and b, respectively. The degree of separation is computed as the sum of the costs of all edges along the shortest path between articles doca and docb in the graph of Wikipedia articles. SP is the set of edges along the shortest path, ek is the k-th edge or element in set SP, |SP| is the number of edges along the shortest path, and cek is the cost associated with the k-th edge.
A 3-Pass Clustering Approach
The 3-pass clustering approach presented in this chapter is an improvement over the original 2-pass
Tree-Traversing Ant (TTA) algorithm proposed by Wong et al. (2007) for term clustering. The enhancement of the approach through the introduction of an additional pass is necessary to further improve the
applicability of TTA. This approach is referred to as adaptive due to its ability to cluster both terms and URIs without the need for any algorithmic revision or new sources of features. Recall
that both terms and URIs are non-decomposable. Hence, it is worthwhile to reassert that TTA is able
to overcome the difficulty of clustering a wide range of non-decomposable data with the assistance of
featureless similarity measures such as NGD and n°W.
Algorithm 1 shows the main function of the adaptive clustering approach using TTA. As stated in Algorithm 1, the clustering approach needs to be provided with an initial input set of elements T, and the definition of all required thresholds.

Algorithm 1.
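In outline, the main function can be sketched as follows, assuming each pass is implemented as a separate routine (first_pass, second_pass and third_pass are hypothetical names, reused by the sketches later in this chapter):

```python
def tta_cluster(elements, s1_t, s2_t, delta_t,
                enable_second=True, enable_third=True):
    """Hypothetical main loop of the 3-pass TTA: a mandatory divisive
    first pass, followed by optional consolidation and relocation."""
    leaf_nodes = first_pass(elements, s1_t)           # divide root into leaves
    if enable_second:
        leaf_nodes = second_pass(leaf_nodes, s2_t)    # merge related leaves
    if enable_third:
        leaf_nodes = third_pass(leaf_nodes, delta_t)  # relocate singletons
    return leaf_nodes
```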
First Pass using NGD
The first pass of the approach can be regarded as divisive clustering where the initial set of all terms is placed in one root node r0. In the context of TTA, each node corresponds to a cluster where the contents
of the node are the elements in the cluster. An ant will randomly pick a term, and proceed to sense its relatedness with every other term in that same node. Two new sub-nodes are created to accommodate
the two least similar terms a and b. Unlike the sensing capability of the basic ants discussed during the
start of the chapter, the sense coverage of TTA includes the two sub-nodes rm1 and rm2 corresponding to
the immediate main node rm. Accordingly, instead of estimating f(i) as the averaged similarity defined over the s² terms surrounding term i as basic ants do, the neighbourhood function for TTA is defined as the maximum similarity between term a in main node rm and its neighbours ru∈{rm1,rm2}. Formally, we define the density of neighbourhood ru with respect to term a during the first pass as:
$$f^1_{TTA}(a, r_u) = \max_{\forall b \in r_u} sim(a, b) \qquad (4)$$
where ru can either be rm1 or rm2, and sim(a, b) is as defined in Equation (2). Besides deciding on whether
to drop an item or not, as in the case of basic ants, TTAs have to make one additional decision, namely,
where to drop. A TTA will decide on where to drop a term based on the neighbourhood densities that
it has memorised for all sub-nodes ru of the main node rm. Formally, the decision on whether to drop term a on sub-node ry depends on:
$$P^1_{drop}(a, r_y) = \begin{cases} 1 & \text{if } f^1_{TTA}(a, r_y) = \max_{r_u \in \{r_{m1}, r_{m2}\}} f^1_{TTA}(a, r_u) \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
We employ a threshold S1T to control the division of nodes into further sub-nodes. Whenever a TTA
senses that the similarity between the two least similar terms exceeds S1T, no further branching out will
be performed. Ideally, S1T should be set as high as possible to produce nodes of extremely fine granularity.
Tightly-coupled nodes produced during the first pass can be reconsolidated and improved if necessary
through the subsequent passes. Setting a high S1T is also important for identifying outliers which would
not have been detected otherwise. Algorithm 2 summarises the process of splitting the initial cluster of all elements into subsequent tree-structured nodes.

Algorithm 2.
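A compact sketch of this divisive pass follows; it assumes the sim(a, b) function of the Equation (2) sketch is available globally, and it simplifies the ant metaphor into a deterministic loop:

```python
from itertools import combinations

def first_pass(terms, s1_t):
    """Sketch of the divisive first pass (cf. Algorithm 2)."""
    leaves, pending = [], [list(terms)]
    while pending:
        node = pending.pop()
        if len(node) < 2:
            leaves.append(node)
            continue
        # The two least similar terms seed the two new sub-nodes.
        a, b = min(combinations(node, 2), key=lambda p: sim(p[0], p[1]))
        if sim(a, b) > s1_t:      # node already tight enough: stop dividing
            leaves.append(node)
            continue
        sub_a, sub_b = [a], [b]
        for t in node:
            if t is a or t is b:
                continue
            # Equations (4)-(5): drop t on the sub-node of higher density.
            f_a = max(sim(t, x) for x in sub_a)
            f_b = max(sim(t, x) for x in sub_b)
            (sub_a if f_a >= f_b else sub_b).append(t)
        pending.extend([sub_a, sub_b])
    return leaves
```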
Second Pass using NGD
The second pass introduced in this chapter is a new addition to the overall adaptive clustering approach
using TTA. This new pass can be regarded as agglomerative clustering where the already partitioned nodes achieved during the first pass are improved through consolidation. After the first pass, all terms
are moved to the extremities of the tree, which are the leaf nodes. The collection of all leaf nodes is denoted as the set LN. Essentially, the conditions which determine the consolidation of leaf node ri with
another leaf node rj are
• high inter-cluster (i.e. inter-node) similarity between ri and rj; and
• low intra-cluster (i.e. intra-node) variation of the combined ri and rj.
The first condition ensures that the two nodes ri and rj have high relatedness, while the second
condition makes sure that the combination of ri and rj results in a compact node. A compact or tightly-coupled node is a node containing elements that are closely situated around the relatedness mean. The
inter-cluster similarity between nodes ri and rj is simply the group-average similarity between them,
and it is defined as:
$$S(r_i, r_j) = \frac{\sum_{a \in r_i} \sum_{b \in r_j} sim(a, b)}{|r_i|\,|r_j|} \qquad (6)$$
where sim(a,b) is as defined in Equation (2). As for the intra-cluster variation of the combined ri and
rj denoted as V(ri,rj), the standard deviation of pair-wise similarities of elements in ri and rj from the
group-average similarity S(ri,rj) is employed:
$$V(r_i, r_j) = \sqrt{\frac{\sum_{a \in r_i} \sum_{b \in r_j} \left(sim(a, b) - S(r_i, r_j)\right)^2}{|r_i|\,|r_j|}} \qquad (7)$$
In this new second pass, the TTAs utilise the inter-cluster similarity S(ri,rj) and the intra-cluster variation V(ri,rj) as their measures for neighbourhood density. Note that there is no limit imposed on the size
of the nodes to be considered for consolidation. Following the trails laid out during the first pass, a TTA then randomly picks a leaf node and senses the neighbourhood densities with all other leaf nodes.
The merging is performed if the neighbourhood density exceeds a certain threshold S2T. We define the
neighbourhood density of every other leaf nodes ri∈LN with the leaf node rx as:
$$f^2_{TTA}(r_x, r_i) = \frac{1}{e^{(H[S] - S(r_x, r_i))}\, e^{V(r_x, r_i)}} \qquad (8)$$
where H[S] is the maximum inter-cluster similarity of any two leaf nodes from the set of all leaf nodes
LN defined as:
$$H[S] = \max_{\forall (r_i, r_j) \in LN,\ r_i \neq r_j} S(r_i, r_j) \qquad (9)$$
H[S] is a constant for any LN. Intuitively, a high neighbourhood density is obtained if the pair rx and ri
has extremely high inter-cluster similarity (i.e. achieved through minimum difference from the maximum
inter-cluster similarity), and low intra-cluster variation. At one extreme, when the variation V(rx,r y) = 0
and when the difference H[S]-S(rx,ri)=0, the maximum neighbourhood density of 1 is achieved. After
sensing the neighbourhood densities, the decision on whether to drop the elements in leaf node rx on
another leaf node r y is determined by:
$$P^2_{drop}(r_x, r_y) = \begin{cases} 1 & \text{if } f^2_{TTA}(r_x, r_y) = \max_{r_i \in LN} f^2_{TTA}(r_x, r_i) \ \wedge\ f^2_{TTA}(r_x, r_y) > S^2_T \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$
where S2T is a user-specified acceptable relatedness threshold. The process of consolidating leaf nodes during the second pass is summarised in Algorithm 3.

Algorithm 3.
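The consolidation logic can be sketched as follows, again assuming a global sim(a, b); the sketch greedily merges the densest pair of leaf nodes until no pair satisfies Equation (10):

```python
from itertools import combinations
from math import exp, sqrt

def group_sim(r_i, r_j):
    """Equation (6): group-average inter-cluster similarity."""
    return sum(sim(a, b) for a in r_i for b in r_j) / (len(r_i) * len(r_j))

def group_var(r_i, r_j):
    """Equation (7): deviation of pair-wise similarities from S(ri, rj)."""
    s = group_sim(r_i, r_j)
    total = sum((sim(a, b) - s) ** 2 for a in r_i for b in r_j)
    return sqrt(total / (len(r_i) * len(r_j)))

def second_pass(leaves, s2_t):
    """Sketch of the consolidation pass (cf. Algorithm 3)."""
    while len(leaves) > 1:
        h_s = max(group_sim(x, y) for x, y in combinations(leaves, 2))
        # Equation (8): neighbourhood density of every candidate pair.
        density = {(i, j): exp(-(h_s - group_sim(leaves[i], leaves[j])))
                           * exp(-group_var(leaves[i], leaves[j]))
                   for i, j in combinations(range(len(leaves)), 2)}
        (i, j), best = max(density.items(), key=lambda kv: kv[1])
        if best <= s2_t:          # Equation (10): no pair is dense enough
            break
        leaves[i] = leaves[i] + leaves[j]
        del leaves[j]
    return leaves
```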
Third Pass using n°W
The third pass in this chapter is the original second pass proposed by Wong et al. (2007), which utilises
Wikipedia as the source for computing relatedness. Functionally, this third pass is essentially the same as the new second pass reported earlier, where consolidations of leaf nodes produced by the first
pass are carried out. However, unlike the second pass, the computational concern of a TTA during the
third pass is limited to relocating any leaf node rx with a single element (i.e. an isolated leaf node, where |rx| = 1).
This third pass serves two purposes as a final, housekeeping phase, namely, to ensure the least number
of compact clusters with maximal coverage, and to isolate outliers. Using a different featureless measure
based on a more authoritative source (i.e. Wikipedia) compared to Google, this pass aims to provide a
more definite means of determining relatedness and hence, ensuring more accurate consolidations of
clusters. Every decision to drop the single element a carried by a TTA is made after the average distances between the isolated leaf node rx and all other leaf nodes ri∈LN have been sensed. Formally, the density
of neighborhood ri with respect to the isolated leaf node rx during the third pass is defined as:
$$f^3_{TTA}(r_x, r_i) = \frac{\sum_{a \in r_x} \sum_{b \in r_i} \delta(a, b)}{|r_x|\,|r_i|} \qquad (11)$$
where δ(a,b) is the distance measure between the two elements a and b defined in Equation (3). After
all neighbourhood densities have been sensed, the TTA will decide on the most ideal leaf node ry on which to drop the single element from rx, using the function:
$$P^3_{drop}(r_x, r_y) = \begin{cases} 1 & \text{if } f^3_{TTA}(r_x, r_y) = \min_{r_i \in LN} f^3_{TTA}(r_x, r_i) \ \wedge\ f^3_{TTA}(r_x, r_y) \leq \delta_T \\ 0 & \text{otherwise} \end{cases} \qquad (12)$$
where δT is a distance threshold. Intuitively, a TTA will drop the single element a from the isolated leaf
node rx on the leaf node ry, if all elements in ry collectively yield the minimum average distance with
rx that satisfies the distance threshold δT. In the case of a being an outlier, it will have very high average distances with all other leaf nodes. Without a threshold, the TTA will still relocate the outlier to
the leaf node that has the minimum average distance which, by comparison with all other non-outlier
cases, is still relatively high. Hence, the purpose of δT is very similar to that of S2T in that
both are meant to isolate outliers, and to control the relocation of single elements or more generally, the
consolidation of nodes. Ideally, this pass intends to produce the minimal number of nodes to ensure that
each one covers as many semantically-related elements as possible while making sure that the relatedness
between elements in the same node remains considerably high. Algorithm 4 outlines the steps involved in the third pass of clustering using n°W.

Algorithm 4.
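A corresponding sketch of the relocation pass is given below, assuming a global delta(a, b) implementing the n°W distance of Equation (3):

```python
def third_pass(leaves, delta_t):
    """Sketch of the relocation pass (cf. Algorithm 4)."""
    def avg_dist(r_x, r_i):
        """Equation (11): average n°W distance between two leaf nodes."""
        return (sum(delta(a, b) for a in r_x for b in r_i)
                / (len(r_x) * len(r_i)))

    for r_x in [r for r in leaves if len(r) == 1]:
        others = [r for r in leaves if r is not r_x]
        if not others:
            break
        r_y = min(others, key=lambda r_i: avg_dist(r_x, r_i))
        if avg_dist(r_x, r_y) <= delta_t:   # Equation (12): relocate element
            r_y.extend(r_x)
            leaves.remove(r_x)
    return leaves
```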
Summary of the 3-Pass TTA
Note that in this 3-pass TTA, only the first pass of the clustering approach is mandatory. The remaining
two passes can be activated or disabled depending on the types of data to be clustered, the desired results,
and the acceptable execution time. In fact, clustering using TTA can be further extended and improved
beyond what is presented in this chapter. Since the foundation of TTA is by itself an adaptive approach, new
passes can be introduced, or the existing ones replaced. In addition, more refined featureless relatedness
and distance measures can be proposed to further improve the clustering of non-decomposable data.
Clustering Thresholds
In addition to the clustering algorithms and semantic measures, another issue that has a crucial effect on the clustering results is the choice of thresholds. In order to provide users with a certain degree of control over
the clustering process while avoiding the need to manually determine the thresholds, we use numerical
descriptors of central tendency and dispersion which are related to the neighbourhood density functions.
For the first pass, we need to automatically determine a few values that the users can select to use as
the threshold S1T. Recall from the previous section that the partitioning of the clusters is controlled by S1T.
Whenever the similarity between the least similar pair of elements exceeds S1T, the division of clusters
will stop. In other words, the threshold S1T is related to the pair-wise similarity of elements measured
using the function sim(a,b) in Equation (2). As such, there are a few possible values that we can choose as
the threshold S1T, namely, the maximum pair-wise similarity H, the minimum pair-wise similarity L, the
average pair-wise similarity E, and the standard deviation of the pair-wise similarity SD, of the elements
in the input set T.
$$H[sim] = \max_{\forall (a,b) \in T,\ a \neq b} sim(a, b)$$

$$L[sim] = \min_{\forall (a,b) \in T,\ a \neq b} sim(a, b)$$
$$E[sim] = \frac{\sum_{a \in T} \sum_{b \in T,\ b \neq a} sim(a, b)}{|T|\,(|T| - 1)}$$

$$SD[sim] = \sqrt{\frac{\sum_{a \in T} \sum_{b \in T,\ b \neq a} \left(sim(a, b) - E[sim]\right)^2}{|T|\,(|T| - 1)}}$$
Using these numerical descriptors, we can set S1T to either H[sim], L[sim], E[sim], E[sim]+SD[sim], or
E[sim]−SD[sim] depending on the desired output. Since H[sim] > E[sim]+SD[sim] > E[sim] > E[sim]−SD[sim] > L[sim], setting an increasingly higher S1T will prolong the partitioning process and hence create more
refined clusters or nodes.
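Deriving S1T from these descriptors can be sketched as follows, again assuming a global sim(a, b); the descriptor labels are our own:

```python
from itertools import permutations
from math import sqrt

def s1_threshold(terms, choice='E+SD'):
    """Sketch: derive S1T from descriptors of the pair-wise similarities
    over the input set T. All |T|(|T|-1) ordered pairs are enumerated."""
    sims = [sim(a, b) for a, b in permutations(terms, 2)]
    e = sum(sims) / len(sims)
    sd = sqrt(sum((x - e) ** 2 for x in sims) / len(sims))
    return {'H': max(sims), 'L': min(sims), 'E': e,
            'E+SD': e + sd, 'E-SD': e - sd}[choice]
```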
As for the second pass, we can follow the same idea of utilising numerical descriptors which are related to f²TTA. Firstly, we formulate S2T in the same way we did with f²TTA:
$$S^2_T = \frac{1}{e^{(H[S] - \tau_S)}\, e^{\tau_V}}$$
where H[S] is as defined in Equation (9), and τS and τV are the two variables related to the inter-cluster
similarity and intra-cluster variation functions S(ri,rj) and V(ri,rj) in Equations (6) and (7), respectively.
Similar to S1T, we can set the individual variables τS and τV using the various numerical descriptors related
to the corresponding functions. For τS, we have the maximum, minimum, average and standard deviation
of the inter-cluster similarity, which are H[S], L[S], E[S] and SD[S], respectively. In the case of τV, we have the maximum, minimum, average and standard deviation of the intra-cluster variation, which are H[V], L[V], E[V] and SD[V], respectively. Since the second pass deals with leaf nodes (i.e. groups of elements)
instead of elements alone, we need to define the numerical descriptors in terms of the set of all leaf nodes
LN. In addition to H[S] as defined in Equation (9), the remaining numerical descriptors for inter-cluster
similarity are determined as:
$$L[S] = \min_{\forall (r_i, r_j) \in LN,\ r_i \neq r_j} S(r_i, r_j)$$

$$E[S] = \frac{\sum_{r_i \in LN} \sum_{r_j \in LN,\ r_j \neq r_i} S(r_i, r_j)}{|LN|\,(|LN| - 1)}$$

$$SD[S] = \sqrt{\frac{\sum_{r_i \in LN} \sum_{r_j \in LN,\ r_j \neq r_i} \left(S(r_i, r_j) - E[S]\right)^2}{|LN|\,(|LN| - 1)}}$$
and similarly, the numerical descriptors for intra-cluster variations are defined as:
$$H[V] = \max_{\forall (r_i, r_j) \in LN,\ r_i \neq r_j} V(r_i, r_j)$$
$$L[V] = \min_{\forall (r_i, r_j) \in LN,\ r_i \neq r_j} V(r_i, r_j)$$

$$E[V] = \frac{\sum_{r_i \in LN} \sum_{r_j \in LN,\ r_j \neq r_i} V(r_i, r_j)}{|LN|\,(|LN| - 1)}$$

$$SD[V] = \sqrt{\frac{\sum_{r_i \in LN} \sum_{r_j \in LN,\ r_j \neq r_i} \left(V(r_i, r_j) - E[V]\right)^2}{|LN|\,(|LN| - 1)}}$$
The two variables τS and τV have a proportional and inversely-proportional effect on S2T, respectively.
As τS increases, so does S2T. On the other hand, an increasing τV will produce a decreasing S2T.
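Concretely, once H[S] and the chosen descriptors have been computed over LN, S2T follows directly; a one-line sketch:

```python
from math import exp

def s2_threshold(h_s, tau_s, tau_v):
    """Sketch: S2T from H[S] and the chosen descriptors, e.g.
    tau_s = E[S] + SD[S] and tau_v = E[V]; the result rises with
    tau_s and falls with tau_v, matching the formula above."""
    return exp(-(h_s - tau_s)) * exp(-tau_v)
```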
In short, this method of associating the thresholds with certain combinations of numerical descriptors is important because it allows the actual threshold values to adjust automatically to the input set. Requiring users to manually provide arbitrary values is inappropriate and can lead to poor performance.
EXPERIMENTS AND DISCUSSIONS
To demonstrate the applicability of the enhanced 3-pass clustering approach to non-decomposable data,
we conducted five experiments using three sets of terms from chemistry and materials science, and two
sets of URIs across various general topics.
Term Clustering using the Adaptive 3-Pass Approach
The use of the original 2-pass TTA for clustering terms has been thoroughly evaluated (Wong et al.,
2007) using terms from various genres, ranging from diseases, wildlife, people and wine to politics,
finance and technology. Using Lexical Overlap (LO) (Maedche and Staab 2002) as the performance
metric, the original 2-pass TTA achieved an average of 97% over ten experimental datasets. LO is simply the intersection between the discovered clusters and the recommended (i.e. manually created) clusters. Obviously, a higher LO is achieved when the discovered clusters share many common data elements with the recommended ones. Through examples, quantitative experiments and discussion, Wong et al. (2007) identified the following strengths of the 2-pass TTA when used for clustering terms:
• Able to further distinguish hidden structures within clusters;
• Flexible in regards to the discovery of clusters;
• Capable of identifying and isolating outliers;
• High tolerance to differing cluster sizes;
• Able to produce consistent results;
• Able to identify implicit taxonomic relationships between clusters; and
• Inherent capability of coping with synonyms, word senses and fluctuations in term usage.
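For reference, a hedged sketch of the LO metric mentioned above is given below; the pairing of each recommended cluster with its best-overlapping discovered cluster is an assumption of the sketch rather than necessarily the exact definition of Maedche and Staab (2002):

```python
def lexical_overlap(discovered, recommended):
    """Hedged sketch of LO: for each recommended cluster, the fraction of
    its members found in the best-matching discovered cluster, averaged
    over all recommended clusters."""
    scores = []
    for rec in recommended:
        best = max(len(set(rec) & set(dis)) / len(rec) for dis in discovered)
        scores.append(best)
    return sum(scores) / len(scores)
```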
For this section, we have conducted three independent experiments to assess the enhanced 3-pass
approach on term clustering. We constructed three sets of terms in the following categories: "chemical compounds", "explosives and toxins" and "materials". There are 16 terms in the category "chemical compounds", 15 terms in the "explosives and toxins" category, and 19 terms in the category
“materials”. In general, the 3-pass clustering approach using TTA was able to discover most of the naturally-occurring groups of terms within the respective sets, at the level of granularity that enables human
comprehension. The term clustering results are illustrated in Figures 1, 2 and 3 accompanied by some
brief discussions. For all three experiments using terms, we fixed the thresholds S1T to E[sim]+SD[sim], τS
and τV to E[S]+SD[S] and E[V]-SD[V], respectively, and δT to 10. In fact, these threshold values are generally applicable for term clustering using the 3-pass approach, and will generate desirable results.
We carried out the first experiment by clustering the 16 terms belonging to the category “chemical compounds”. In this experiment, both the second and third pass of the clustering approach were
involved. Figure 1 shows the results of term clustering from all three passes. After the first pass alone,
the four expected clusters have already emerged. In Pass 1 of Figure 1, clusters A, B, C and D represent
Figure 1. The result of clustering 16 terms from the category “chemical compounds”. The following
thresholds were achieved using the recommended numerical descriptors: S1T = 0.71 and S2T = 0.84.
There were four intended clusters of terms from the set. Clusters A, B, C and D represent "photographic
chemicals”, “disinfectants/biocides”, “cosmetic chemicals” and “flavour enhancers”, respectively. A
symbol “′” attached to a cluster designator represents a revision of that cluster. A double prime means
that the cluster has been revised twice.
“photographic chemicals”, “disinfectants/biocides”, “cosmetic chemicals” and “flavour enhancers”.
The prime symbol “′” attached to the cluster designators simply represents the revision or extension of
those clusters. For example, cluster B′′ in Pass 3 of Figure 1 is the result of expanding cluster B′ in Pass
2 of Figure 1 by adding another element “sodium dichloroisocyanurate”. While certain chemicals may
be used for multiple purposes, the clusters captured by TTA represent their dominant applications. While
the resulting clusters after the first pass may show some promising results, they have not yet achieved
maximum coverage. There are certain qualified terms that have yet to join the appropriate clusters. For
example, “monosodium glutamate” is a common “flavour enhancer”. This is when the second pass,
which is the enhancement added to the adaptive clustering approach, contributes to the improvement
of clustering results. As shown in Pass 2 of Figure 1, clusters B′ and D′ were appropriately extended to
include one additional element each. However, three isolated clusters (i.e. single-element clusters) remain,
namely, “sodium dichloroisocyanurate”, “pyrocatechol” and “guanosine monophosphate”. During the
final, housekeeping phase, these three isolated clusters were relocated to their intended categories by
the third pass as shown in Pass 3 of Figure 1.
The second experiment was conducted using 15 terms belonging to the category “explosives and
toxins”. As one can observe from Figure 2, this experiment was different in that only the first two
passes were required to discover the naturally-occurring clusters. We were anticipating three clusters
in the result, and as shown in Pass 2 of Figure 2, our expectation was again fulfilled. Terms from the
sub-categories “nerve agents”, “mycotoxins”, and “explosives” were grouped into cluster A, B and
C, respectively. Moreover, one is able to note that the second pass alone was adequate to relocate the
remaining isolated leaf node in Pass 1 of Figure 2, which is “Cytochalasin”. Since the computation of
Figure 2. The result of clustering 15 terms from the category “explosives and toxins”. The following
thresholds were achieved using the recommended numerical descriptors: S1T = 0.70 and S2T = 0.87.
There were three intended clusters of terms from the set. Clusters A, B and C represent "nerve agents",
“mycotoxins”, and “explosives”, respectively. The symbol “′” attached to a cluster designator represents
a revision of that cluster, while the subscript “ij” signifies that cluster Xij is the result of the consolidation
of clusters Xi and Xj. The superscript “P” denotes the parent status of the cluster, which is indirectly
formed through the sharing of a common node by several clusters belonging to taxonomically-related
semantic classes.
distance using n°W can be slightly more intensive due to the traversal of Wikipedia categorical indices,
the occasional avoidance of the third pass can be beneficial. Another interesting feature revealed from
this experiment is the clustering approach’s capability of revealing hidden structures within clusters
as pointed out by Wong et al. (2007). In Pass 2 of Figure 2, both clusters C12 and C3, which are joined
by a common ancestor to form cluster CP, contain the names of the various types of explosives. Within
this single parent cluster of “explosives”, the approach has further discovered two sub-clusters where
cluster C3 contains a specific type of explosives known as “Sprengel explosives” (i.e. “Oxyliquit” and
“Panclastite”). The discovery of such virtual groupings of clusters related through the parent-child
relationship is usually motivated by the fact that the clusters involved are strongly related in some way
but are not suitable to be directly consolidated into a single cluster.
For the last experiment involving terms, we conducted clustering using the set of 19 terms from
the category “materials”. There were four clusters produced from the clustering, and the cluster memberships properly reflect their intended semantic class. Cluster A consists of the names of the various
“thermoplastics”, while cluster B is made up of several “silicate minerals”. Clusters C and D contain
the names of several types of "woods" and "glass", respectively. Similar to the first experiment, the second and
third passes were required to finalise the clustering process. Unlike the second experiment where the
third pass was of no use, the third pass in this third experiment played an extremely important role. One
can observe from Pass 1 of Figure 3 the overly partitioned clusters. In comparison to the previous two
Figure 3. The result of clustering 19 terms from the category “materials”. The following thresholds
were achieved using the recommended numerical descriptors: S1T = 0.66 and S2T = 0.85. There were
four intended clusters of terms from the set. Clusters A, B, C and D represent "thermoplastics", "silicate minerals", "woods" and "glass", respectively. The symbol "′" attached to a cluster designator
represents a revision of that cluster.
experiments, there were many isolated clusters produced after the first pass. Similarly, the output after
the second pass, as shown in Pass 2 of Figure 3, did not show much improvement. In fact, the intended
cluster D did not appear even after the second pass. One of the possible reasons is that the threshold was
set much higher than necessary. As we have mentioned at the start of this section, we utilised the same
settings of numerical descriptors to define all thresholds for all experiments. Nevertheless, the overall
performance of the 3-pass approach was not affected at all since we have the third pass to improve the
results before delivering them. This is clearly shown in Pass 3 of Figure 3 where all remaining isolated clusters were properly relocated and the last remaining cluster D, which we had been expecting, successfully emerged.
URI Clustering using the Adaptive 3-Pass Approach
As we have pointed out before, the clustering of URIs has been conventionally carried out using log records or navigational histories for Web personalisation purposes. As a result, huge collections of website-specific access logs are required to validate the performance of clustering approaches. Instead, we have
constructed two arbitrary sets of URIs of various genres to demonstrate the wide applicability of the
clustering approach presented in this chapter. As we have mentioned before, the clustering of URIs using the proposed 3-pass approach requires the third pass to be disabled since the relatedness between
URIs cannot be established using Wikipedia. Two experiments were conducted for this section using two sets of URIs from the categories "publishers, online retailers & fast-food franchises" and "news agencies, travel services & cooking recipes". The first set consists of 16 URIs, and the second set has 24
URIs. For the two experiments using URIs, we set the thresholds S1T to E[sim]+SD[sim], and τS and τV to E[S]+SD[S] and E[V], respectively. The pair-wise similarity or relatedness between URIs is inherently lower than that of terms. One of the reasons behind such a trend is the relatively lower observable co-occurrence of URIs as compared to terms on the World Wide Web.
Figure 4. The result of clustering 16 URIs from the category “publishers, online retailers & fast-food
franchises”. The following thresholds were achieved using the recommended numerical descriptors:
S1T = 0.51 and S2T = 0.88. There were three intended clusters of URIs from the set. Clusters A, B and C represent "fast-food franchises", "online retailers" and "academic publishers", respectively. The
symbol “′” attached to a cluster designator represents a revision of that cluster.
Figure 5. The result of clustering 24 URIs from the category “news agencies, travel services & cooking recipes”.
The following thresholds were achieved using the recommended numerical descriptors: S1T = 0.69 and S2T =
0.88. There were three intended clusters of URIs from the set. Clusters A, B and C represent "travel services", "cooking recipes" and "news agencies", respectively. The symbol "′" attached to a cluster designator represents
a revision of that cluster. The subscript “ij” signifies that cluster Xij is the result of the consolidation of clusters
Xi and Xj while the superscript “P” denotes the parent status of the cluster, which is indirectly formed through
the sharing of a common node by several clusters belonging to taxonomically-related semantic classes.
Even though the same combination of numerical descriptors is used to define the threshold for the first pass, namely, E[sim]+SD[sim], the values for S1T are relatively lower, as shown in the captions of Figures 4 and 5.
The first experiment using URIs was carried out with a set of 16 URIs from the sub-categories "fast-food franchises", "online retailers" and "academic publishers". Using the measure NGD, backed by the
World Wide Web, and the TTA algorithm, these 16 URIs were successfully grouped into their intended
clusters using only the first two passes. There was strictly no feature extraction or selection involved since URIs, like terms, cannot be further decomposed into meaningful constituents. After the
completion of the clustering process as shown in Pass 2 of Figure 4, cluster A contains all URIs from
the input set belonging to the various famous "fast-food franchises". As for clusters B and C, we have the URIs of "online retailers" and "academic publishers", respectively.
We used 24 URIs for the last experiment. We were expecting three clusters A, B and C to represent
URIs of websites belonging to the categories “travel services”, “cooking recipes” and “news agencies”,
respectively. Similar to the results of Figure 2, TTA’s ability to uncover sub-clusters within larger clusters
remains evident in URI clustering. As shown in Pass 2 of Figure 5, there are three other sub-clusters
within the “travel services” category, which is represented by the parent cluster AP. The parent cluster
AP is made up of website URIs related to the various types of travel services. For example, the subcluster A1 contains URIs of prominent travel guide websites including “ frommers.com”, “lonelyplanet.
com” and “ fodors.com”. On the other hand, sub-cluster A2 consists of the URIs of two worldwide
accommodation listing and reservation websites. The URIs of travel advisory websites were grouped
into cluster A3. All in all, these three sub-clusters are highly related to one another but at the same time,
they belong to distinct groups. As such, instead of consolidating all of these travel-related URIs into
a single cluster, they were organised into a hierarchy of distinct sub-clusters which are connected by a
common theme, namely, “travel services”. As for cluster B12′ in Pass 2 of Figure 5, it was achieved through
the consolidation of the individual clusters B1 and B2, and later, the inclusion of two isolated clusters
“chinesefood.about.com” and “thaifood.about.com”. Lastly, the significance of the choice of thresholds
is also revealed through this experiment. Unlike the first four experiments, there was an isolated cluster
that remained even after the clustering process was completed. This isolated cluster contains the single
URI “del.icio.us” as shown in Pass 2 of Figure 5. Depending on one’s perception, this isolated cluster
can be perceived as either a clustering outlier, or simply another significant cluster. This demonstrates
the 3-pass approach's capability of identifying and isolating outliers, subject to the careful selection of thresholds.
As such, our proposed method of initialising the thresholds using the various combinations of numerical
descriptors is both relevant and important in ensuring the quality of clustering output.
In a nutshell, we have shown that the enhanced 3-pass clustering approach proposed in this chapter is flexible: it can cluster both terms and URIs without any algorithmic revision or sourcing of relevant features. Its adaptability is demonstrated through the automatic adjustment of thresholds to the input set. In addition, selectively enabling the second and third passes based on the output of the preceding pass allows the system to avoid unnecessary processing while still improving the final clustering output where needed. The iterative refinement of the clustering result across three phases, together with the use of two different sources for semantic measurement, also helps to ensure the robustness of the approach.
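To make this control flow concrete, the sketch below wires the three passes together. The pass bodies are deliberately simple stand-ins (singleton initialisation, average-link merging and nearest-cluster relocation) rather than the TTA passes themselves, and merge_t and outlier_t correspond to the automatically initialised thresholds discussed above.

    from itertools import combinations

    def avg_link(c1, c2, distance):
        # Average pairwise distance between two clusters.
        return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    def three_pass(elements, distance, merge_t, outlier_t):
        # Pass 1 (mandatory): stand-in for the TTA first pass; here each
        # element simply starts in its own cluster.
        clusters = [[e] for e in elements]
        # Pass 2 (consolidation): repeatedly merge the closest pair of
        # clusters while their average-link distance stays below merge_t.
        while len(clusters) > 1:
            i, j = min(combinations(range(len(clusters)), 2),
                       key=lambda p: avg_link(clusters[p[0]], clusters[p[1]], distance))
            if avg_link(clusters[i], clusters[j], distance) >= merge_t:
                break
            clusters[i] += clusters.pop(j)
        # Pass 3 (relocation): move each remaining singleton to its nearest
        # cluster unless it lies beyond outlier_t, in which case it is kept
        # as an isolated (outlier) cluster.
        singles = [c for c in clusters if len(c) == 1]
        rest = [c for c in clusters if len(c) > 1]
        for [e] in singles:
            if rest:
                best = min(rest, key=lambda c: avg_link([e], c, distance))
                if avg_link([e], best, distance) < outlier_t:
                    best.append(e)
                    continue
            rest.append([e])
        return rest

In this skeleton, the second and third passes naturally disable themselves when they have nothing to do, which mirrors the selective enablement described above.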
CONCLUSION AND FUTURE TRENDS
Data clustering is an active research area which is highly relevant to many engineering, business and marketing applications. Many established clustering approaches rely on hierarchical and partitional algorithms together with feature-based semantic measures. Owing to the challenges of preparing, during preprocessing, the features required for similarity or relatedness computation, much research in this area has focused on reducing dimensionality and improving feature extraction and selection. However, these efforts have limited applicability when we are dealing with data elements that have few or no features to rely on. Such data elements are what we refer to as non-decomposable data: can we resolve a term or a URI into its constituent parts for analysis in the same way as we would an image or a document? A key characteristic of non-decomposable data is that comprehending and appreciating the semantics or significance of such data requires background knowledge. For example, how can one know that the word “grass” refers to something green without having acquired such knowledge through the physical senses?
The World Wide Web and other continuously-updated structured resources such as Wikipedia and the Open Directory Project offer a promising starting point for harnessing general consensus and inducing semantics. The Normalised Google Distance (NGD) and n° of Wikipedia (n°W) are two examples. In this chapter, we presented an enhanced 3-pass clustering approach using the Tree-Traversing Ant (TTA) algorithm and the two featureless semantic measures NGD and n°W. The first pass, which is mandatory, divides the elements into naturally-occurring clusters of appropriate granularity. The second and third passes improve the final clustering output by performing cluster consolidation and single-element relocation, respectively. The flexibility of the approach is ensured through the substitutability of the second pass with the third and vice versa, and through the use of two different sources for computing semantic relatedness and distance. This 3-pass approach does not require any algorithmic revision or new source of features to switch between term and URI clustering. In addition, we have introduced a method of associating the clustering thresholds with various combinations of numerical descriptors to allow for automatic adjustment depending on the input set. This avoids the problem of manually defining arbitrary values to suit each clustering input. Various other strengths of TTA, such as its ability to isolate outliers and to discover internal groupings within stable clusters, were also demonstrated through our five experiments using both terms and URIs.
There are a few possible research directions related to this area of non-decomposable data clustering using featureless semantic measures. The first concerns the algorithm itself: more study is required to extend the 3-pass clustering approach for use with large datasets, as issues of computational complexity and resource consumption will inevitably arise as more data elements are involved in clustering. Secondly, study into the coverage and limitations of the featureless measures is also necessary. Even though the World Wide Web is the largest repository of text ever assembled, and Wikipedia is constantly growing, are there terms or URIs that the two associated measures cannot cope with? If so, what are the consequences?
ACKNOWLEDGMENT
This research was supported by the Australian Endeavour International Postgraduate Research Scholarship and a 2008 Research Grant from the University of Western Australia.
REFERENCES
Cimiano, P. & Staab, S. (2005). Learning concept hierarchies from text with a guided agglomerative
clustering algorithm. In Proceedings of the Workshop on Learning and Extending Lexical Ontologies
with Machine Learning Methods, Bonn, Germany.
Cilibrasi, R. & Vitanyi, P. (2007). The Google similarity distance. IEEE Transactions on Knowledge
and Data Engineering, 19(3):370-383.
Deneubourg, J., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C. & Chretien, L. (1991). The dynamics of collective sorting: robot-like ants and ant-like robots. In Proceedings of the 1st International
Conference on Simulation of Adaptive Behavior: From Animals to Animats, France.
Faure, D. & Poibeau, T. (2000). First experiments of using semantic knowledge learned by ASIUM for
information extraction task using INTEX. In Proceedings of the 1st Workshop on Ontology Learning,
Berlin, Germany.
Faure, D., & Nedellec, C. (1998). A corpus-based conceptual clustering method for verb frames and
ontology acquisition. In Proceedings of the 1st International Conference on Language Resources and
Evaluation (LREC), Granada, Spain.
Gabrilovich, E. & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India.
Gomez-Perez, A. & Manzano-Macho, D. (2003). A survey of ontology learning methods and techniques.
(Report No. 1.5). OntoWeb Consortium.
Han, E., Karypis, G., Kumar, V. & Mobasher, B. (1997). Clustering based on association rule hypergraphs. In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.
Handl, J., Knowles, J. & Dorigo, M. (2003). Ant-based clustering: a comparative study of its relative
performance with respect to k-means, average link and 1d-som. (Report No. TR/IRIDIA/2003-24).
Universite Libre de Bruxelles.
Jain, A., Murty, M. & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3):264-323.
Lagus, K., Honkela, T., Kaski, S. & Kohonen, T. (1996). Self-organizing maps of document collections: A new approach to interactive exploration. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining.
Lumer, E. & Faieta, B. (1994). Diversity and adaptation in populations of clustering ants. In Proceedings
of the 3rd International Conference on Simulation of Adaptive Behavior: From Animals to Animats 3.
Maedche, A. & Volz, R. (2001). The ontology extraction & maintenance framework: text-to-onto. In
Proceedings of the IEEE International Conference on Data Mining, California, USA.
Maedche, A. & Staab, S. (2002). Measuring similarity between ontologies. In Proceedings of the European Conference on Knowledge Acquisition and Management (EKAW), Madrid, Spain.
Manco, G., Ortale, R. & Sacca, D. (2003). Similarity-based clustering of Web transactions. In Proceedings of the ACM Symposium on Applied Computing.
Milne, D. (2007). Computing semantic relatedness using Wikipedia link structure. In Proceedings of the New Zealand Computer Science Research Student Conference (NZCSRSC), Hamilton, New Zealand.
Mobasher, B., Cooley, R. & Srivastava, J. (1999). Creating adaptive Web sites through usage-based
clustering of URLs. In Proceedings of the Workshop on Knowledge and Data Engineering Exchange.
Salton, G. & McGill, M. (1983). Introduction to Modern Information Retrieval. New York, NY: McGraw-Hill.
Steinbach, M., Karypis, G. & Kumar, V. (2000). A comparison of document clustering techniques.
(Report No. 00-034). University of Minnesota.
Strube, M. & Ponzetto, S. P. (2006). WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), Boston, USA.
Tzerpos, V. & Holt, R. (1998). Software botryology, automatic clustering of software systems. In
Proceedings of the 9th International Workshop on Database and Expert Systems Applications.
Wang, Y. & Kitsuregawa, M. (2002). Evaluating contents-link coupled Web page clustering for Web
search results. In Proceedings of the 11th International Conference on Information and Knowledge
Management (CIKM), Virginia, USA.
Wong, W., Liu, W. & Bennamoun, M. (2007). Tree-traversing ant algorithm for term clustering based
on featureless similarities. Data Mining and Knowledge Discovery, 15(3):349-381.
Wong, W., Liu, W. & Bennamoun, M. (2006). Terms clustering using tree-traversing ants and featureless
similarities. In Proceedings of the International Symposium on Practical Cognitive Agents and Robots
(PCAR), Perth, Australia.
KEY TERMS
Clustering: The process of discovering naturally-occurring groups of data elements.
Semantic Measure: A computational means of determining to what extent two elements are semantically related or similar.
Term: A word used in a domain-specific context.
Uniform Resource Identifier (URI): A compact string of characters used to identify or name a
resource.