Chapter IX
Featureless Data Clustering

Wilson Wong, University of Western Australia, Australia
Wei Liu, University of Western Australia, Australia
Mohammed Bennamoun, University of Western Australia, Australia

ABSTRACT

Feature-based semantic measurements have played a dominant role in conventional data clustering algorithms for many existing applications. However, the applicability of existing data clustering approaches to a wider range of applications is limited due to issues such as the complexity involved in semantic computation, the long pre-processing time required for feature preparation, and the poor extensibility of semantic measurement due to non-incremental feature sources. This chapter first summarises the commonly used clustering algorithms and feature-based semantic measurements, and then highlights their shortcomings to make way for the proposal of an adaptive clustering approach based on featureless semantic measurements.
The chapter concludes with experiments demonstrating the performance and wide applicability of the proposed clustering approach.

INTRODUCTION

Data clustering has a wide range of applicability, ranging from ontology learning (Wong et al., 2006) and market research to pattern recognition and image processing (Jain et al., 1999). Depending on the area of application, many different names, such as cluster analysis, numerical taxonomy, automatic classification, botryology and typological analysis, have been devised to refer to essentially the same practice of data clustering. Each application of data clustering can be characterised by two aspects, namely, the manner through which the clusters are formed, and the criteria that govern the formation of clusters. The first aspect relates to the choice of clustering algorithm, while the second dictates the type of semantic measure (i.e. similarity or relatedness) to be used. The choice of clustering algorithm and semantic measure very much depends on the data elements to be clustered, computational constraints, and the desired results.

In the past, data clustering has been particularly successful with certain types of data such as documents, software systems, lexical units, webpages, Uniform Resource Identifiers (URIs) and images. This gives rise to various sub-areas of clustering such as document clustering (Steinbach et al., 2000), software botryology (Tzerpos and Holt, 1998), term clustering (Wong et al., 2007), webpage clustering (Wang and Kitsuregawa, 2002), usage-based URI clustering (Mobasher et al., 1999) and clustering-based image segmentation (Jain et al., 1999).

Unlike documents, webpages, software systems, and images or visual objects, terms and URIs are essentially featureless. The semantic relation between visual objects can be established by feature analysis based on visible (e.g. physical and behavioural) traits. The metadata and text that documents and webpages contain can be useful as features. However, establishing the semantic relation between data elements such as terms depends on something less tangible, namely, the background knowledge which humans acquire through their senses over the years. Similarly, URIs alone are essentially strings of characters whose significance can be difficult to establish without visiting the actual resources such as webpages. Along the same line of thought, Lagus et al. (1996) state: “In principle a document might be encoded as a histogram of its words... symbolic words as such retain no information of their relatedness”. To illustrate, how does one know that a cow has four legs without actually looking at one or acquiring that fact from someone else? Equally challenging is establishing that the URI http://del.icio.us/ actually refers to a social bookmarking site, and not a website related to cooking or food, without prior exposure.

In this chapter, we refer to documents, webpages, images and software systems as decomposable data, while non-decomposable data is used exclusively to denote terms and URIs. Decomposable data elements are those that can be meaningfully analysed in terms of their components or constituent parts for uncovering semantics or significance.
Obviously, having the ability to take apart a data element to extract features that assist in further stages of semantic relation analysis can be very helpful in data clustering. Putting aside the computational complexity and the complications related to feature extraction and feature selection, clustering involving decomposable data can be much more objective, since the criteria for cluster formation can be explicitly identified. The focus of this chapter is on the clustering of non-decomposable data (i.e. terms and URIs), where the absence of explicit features poses certain challenges which require adjustments to be made, especially with regard to the computation of semantic relations and the clustering algorithm itself.

Firstly, a recent survey in ontology learning (Gomez-Perez and Manzano-Macho, 2003) reveals that all reviewed systems which apply term clustering methods rely on contextual cues surrounding the terms as features. However, the large collection of documents, and the predefined patterns and templates, required for the extraction of contextual cues make the portability and extensibility of such systems difficult. Similarly, the conventional approach to URI clustering for Web personalisation requires large amounts of log records to enable the use of feature-based semantic measures. Consequently, featureless semantic measures are fast becoming a necessity for clustering non-decomposable data.

Secondly, to compensate for the lack of configurable and objective measurement of semantic relation through the use of features, changes to the clustering algorithm are also necessary. In feature-based semantic measurements, the type and granularity of features can be selected or refined depending on application requirements. For example, the features chosen for face recognition can come from certain regions of the face such as the ears, and characteristics such as colour, intensity or texture can carry different significance. Featureless measurement, on the other hand, typically relies on more prevalent or dominant aspects to determine semantic relations, and hence can be rather subjective. As such, the clustering algorithm itself has to play a more significant role in refining the clusters to compensate for the lack of control during semantic measurement.

In this chapter, we will look at an enhanced 3-pass Tree-Traversing Ant (TTA) algorithm for clustering both terms and URIs. This enhanced TTA algorithm is an improvement over the original 2-pass version proposed by Wong et al. (2007), which was designed specifically for term clustering. The 3-pass TTA is empowered by two featureless measurements of relatedness and distance, namely, the Normalised Google Distance (NGD) by Cilibrasi and Vitanyi (2007), and n° of Wikipedia (n°W) by Wong et al. (2006). While sharing minor functionalities with conventional hierarchical algorithms, the 3-pass TTA is actually a hybrid algorithm that performs both division and agglomeration, and whose behaviour is controlled using ant-sorting heuristics (Lumer and Faieta, 1994). Before that, we will have a brief look at several common clustering algorithms and semantic measures in Section 2. In Section 3, we will present the enhanced 3-pass clustering algorithm, and the relatedness and distance measures involved. In Section 4, we will include some experimental results related to the clustering of terms and URIs.
We conclude this chapter with an outlook on potential extensions of this work and suggested directions in Section 5.

BACKGROUND AND RELATED WORKS

Two of the most common families of data clustering algorithms are hierarchical and non-hierarchical clustering. In hierarchical clustering, successive clusters are discovered using previously established clusters, while clusters are established all at once by non-hierarchical algorithms. Hierarchical algorithms can be further divided into divisive algorithms and agglomerative algorithms. Divisive clustering begins with all data elements in one cluster, which is later successively divided into smaller ones. Agglomerative clustering, on the other hand, begins with single-element clusters and works by successively merging them into larger ones. As for non-hierarchical clustering, some of the common algorithms are K-means and fuzzy C-means. Other less conventional clustering algorithms include ant-based clustering, which is inspired by the collective behaviour of decentralised, self-organised systems.

In addition to the clustering algorithm, a function or measure for determining the similarity, distance or relatedness between elements is essential. In conventional clustering, feature-based measures such as cosine similarity, Jaccard distance or Dice similarity (Salton and McGill, 1983) are employed to support the use of hierarchical and non-hierarchical clustering. With regard to the types of semantic relations, certain researchers consider the terms “similarity” and “relatedness” to be interchangeable. In the strictest sense, semantic similarity deals only with synonymy, while semantic relatedness incorporates a wider range of relations such as antonymy, hyponymy and meronymy. For example, the words “buy” and “purchase” belong to the same semantic class and can be used interchangeably within the same context. In that sense, the two words are semantically similar. On the other hand, the words “car” and “tyre” cannot be considered similar. Nonetheless, we can definitely accept that “car” is related to “tyre” in some way (e.g. through meronymy) more than “car” is to “vegetarian”. Due to the absence of feature extraction and selection during the measurement of semantic relation using featureless measures, the distinction between similarity and relatedness can be rather obscure. As such, having discussed the differences, we will continue to use the terms semantic similarity and semantic relatedness interchangeably.
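To make the contrast with the featureless measures discussed later concrete, the following minimal Python sketch shows one such feature-based measure, cosine similarity, applied to contextual-cue vectors; the terms and cue counts here are invented purely for illustration:

```python
import math
from collections import Counter

def cosine_similarity(features_a, features_b):
    # Cosine similarity between two sparse feature vectors represented
    # as {feature: weight} mappings.
    shared = set(features_a) & set(features_b)
    dot = sum(features_a[f] * features_b[f] for f in shared)
    norm_a = math.sqrt(sum(w * w for w in features_a.values()))
    norm_b = math.sqrt(sum(w * w for w in features_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical contextual-cue counts for two terms.
cow = Counter({"grass": 5, "milk": 3, "farm": 2})
goat = Counter({"grass": 4, "farm": 3, "cheese": 2})
print(cosine_similarity(cow, goat))  # roughly 0.78
```

Note that such a measure presupposes that features can be extracted at all, which is precisely what non-decomposable data lacks.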
Hierarchical algorithms are relatively common in term clustering, especially for related applications such as ontology learning. Faure and Nedellec (1998) presented a corpus-based conceptual clustering method as part of an ontology learning system called ASIUM. The clustering method was designed to aggregate basic classes based on a distance measure inspired by the Hamming distance. The basic classes are formed prior to clustering, in a phase for extracting sub-categorisation frames (Faure and Poibeau, 2000). Terms that appear on at least two different occasions with the same verb, and the same preposition or syntactic role, can be regarded as semantically similar, such that they can substitute one another in a particular context. Such terms are extracted in the form of selection restrictions as part of the sub-categorisation frames, and these semantically similar terms form the basic classes.

The basic classes form the lowest level of the ontology and are successively aggregated to construct a hierarchy from the bottom up. Each time, only two basic classes are compared. The clustering begins by computing the distance between all pairs of basic classes and aggregating those with a distance less than a user-defined threshold. The distance between two classes containing the same words with the same frequencies is 0, while a pair of classes without a single common word has a distance of 1. In other words, the terms in the basic classes act as features, allowing for inter-class comparison. The distance measure is defined in terms of the following quantities: card(C1) and card(C2), the number of words in (i.e. the size of) classes C1 and C2, respectively; Ncomm, the number of words common to both C1 and C2; ΣFC1 and ΣFC2, the sums of the frequencies of the words in C1 and C2, respectively, that occur in both classes; and f(word_i^C1) and f(word_i^C2), the frequencies of the i-th word of classes C1 and C2, respectively.

Maedche and Volz (2001) presented a bottom-up hierarchical clustering method that functions as part of the ontology learning system Text-to-Onto. This clustering method, designed for clustering terms, relies on an all-knowing oracle, denoted by H, which is capable of returning possible hypernyms for a given term. In other words, the performance of the clustering algorithm has an upper bound limited by the ability of the oracle to know all possible hypernyms for a term. The oracle is constructed with the help of WordNet and lexico-syntactic patterns (Cimiano and Staab, 2005). During the clustering phase, the algorithm is fed a list of terms and the similarity between each pair is computed using a cosine measure. For this purpose, the syntactic surface dependencies of each term, in the form of modifiers of various types of phrases, are extracted and used as the features for that term. The algorithm is an extremely long list of nested if-else statements. For the sake of brevity, it suffices to know that the algorithm attempts to examine the hypernym-hyponym relations between all pairs of terms as it decides to place one term as a hypernym of the other or vice versa, or to place the two terms under a new parent. Each time information about the hypernymy relation between two terms is required, the oracle is consulted. The projection H(t) returns a set of tuples (x, y) where x is a hypernym for term t, and y is the number of times the algorithm has found evidence for it.

Besides conventional hierarchical clustering, the less common idea of ant-based clustering was first proposed by Deneubourg et al. (1991) as part of an attempt to explain the different types of emergent technologies inspired by nature. Ants are represented as agents that move around some simulated environment, such as a square grid, at random. Objects are randomly placed in this environment, and the ants can pick objects up, move them around and drop them. These three basic operations are influenced by the distribution of the objects: objects that are surrounded by dissimilar ones, or isolated, are more likely to be picked up and to be dropped later in the vicinity of more similar ones. The picking up and dropping of objects are governed by the probabilities:

P_pick(i) = ( k_p / (k_p + f(i)) )^2

P_drop(i) = ( f(i) / (k_d + f(i)) )^2

where f(i) is an estimation of the distribution density of the objects in the ant's immediate environment (i.e. local neighbourhood) with respect to the object that the ant is considering picking up or dropping, and k_p and k_d are two threshold constants.
The choice of f(i) varies depending on the cost and other factors related to the environment and the data items. As f(i) decreases below k_p, the probability of picking up the object becomes very high, and the opposite occurs when f(i) exceeds k_p. As for the probability of dropping an object, a high f(i) exceeding k_d will induce the ant to give up the object, while a low f(i), below k_d, will encourage the ant to hold onto it. The combination of these three simple operations and the heuristics behind them gave birth to the notion of basic ants for clustering, also known as the standard ant clustering algorithm (SACA).

Lumer and Faieta (1994) further extended and improved SACA in terms of the numerical aspects of the algorithm, and the convergence time. The authors represented the objects as numerical vectors, with the distance between vectors computed using the Euclidean distance. Hence, given δ(i, j) ∈ [0, 1] as the Euclidean distance between object i (i.e. i is the location of the object in the centre of the neighbourhood) and every other neighbouring object j, the neighbourhood function f(i) is defined by the authors as:

f(i) = max{ 0, (1/s^2) Σ_j ( 1 − δ(i, j)/α ) }    (1)

where s^2 is the size of the local neighbourhood, and α ∈ (0, 1] is a constant for scaling the distances among objects. In other words, an ant has to consider the average similarity of object i with respect to all other objects j in the local neighbourhood before performing an operation (i.e. pick up or drop). As the value of f(i) is obtained by averaging the total similarity over the number of neighbouring cells, s^2, empty cells, which do not contribute to the overall similarity, are effectively penalised. In addition, the radius of perception (i.e. the extent to which objects are taken into consideration for f(i)) of each ant at the centre of the local neighbourhood is given by (s − 1)/2.

Even though most conventional clustering algorithms appear to be more established, researchers (Handl et al., 2003) have shown that certain commonly adopted algorithms, such as K-means and average-link agglomerative clustering, yield mediocre results in comparison with ant-based algorithms. Handl et al. (2003) have also demonstrated certain desirable properties of ant-based algorithms, such as:

• tolerance to different cluster sizes;
• the ability to identify the number of clusters;
• performance that increases with the size of the dataset; and
• graceful degradation in the face of overlapping clusters.

Despite such advantages, the potential of ant-based algorithms remains relatively unexplored for possible applications in term clustering.
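To make the basic ant heuristics concrete, the following minimal Python sketch combines the pick and drop probabilities described above with the Lumer-Faieta neighbourhood function of Equation (1). The grid itself is abstracted away, and the distance function and parameter values are illustrative assumptions, not settings from the cited works:

```python
import random

def neighbourhood_density(i, neighbours, distance, s, alpha):
    # Lumer-Faieta f(i) of Equation (1): scaled similarity of object i to
    # its neighbours, averaged over all s*s cells so that empty cells
    # implicitly lower the density.
    total = sum(1.0 - distance(i, j) / alpha for j in neighbours)
    return max(0.0, total / (s * s))

def p_pick(f_i, k_p):
    # High when f(i) is well below k_p (object sits among dissimilar ones).
    return (k_p / (k_p + f_i)) ** 2

def p_drop(f_i, k_d):
    # High when f(i) is well above k_d (dense, similar surroundings).
    return (f_i / (k_d + f_i)) ** 2

# One decision step for a laden ant; objects are integers and distance
# is a toy function, purely for illustration.
f_i = neighbourhood_density(5, [4, 6, 7], lambda i, j: abs(i - j) / 10.0,
                            s=3, alpha=1.0)
if random.random() < p_drop(f_i, k_d=0.3):
    print("drop the object at the current cell")
```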
In addition to terms, the second type of non-decomposable data of interest to us is Uniform Resource Identifiers (URIs). The clustering of URIs is studied primarily for the purpose of Web personalisation and navigation. In this area, the navigation histories of users are employed for clustering URIs. As such, the resulting clusters are URIs related through user navigational patterns instead of content similarity or site authority. Mobasher et al. (1999) employ association rule hypergraph partitioning (Han et al., 1997) to cluster URIs based solely on their co-occurrence across different user sessions, so that no computation of similarity or distance is required. The authors argued that conventional clustering techniques that require user sessions for the computation of feature-based similarity or distance are not feasible, since the size of the navigation history is often very large. This conventional usage-based clustering of URIs was extended by Manco et al. (2003). In addition to the use of user sessions as features, the authors included the similarity of Web page content in the clustering of URIs for Web personalisation. Contents are simply reduced into a “collection of all meaningful and unique word stems appearing in pi [the Web page] itself” (Manco et al., 2003) during a preprocessing phase, and the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme is then employed to compute weights for the terms.

All in all, the use of feature-based similarity or relatedness measures remains predominant in clustering non-decomposable data. In conventional term clustering, contextual cues located around words are used primarily as features to enable feature-based similarity or relatedness measurement. For example, if both the words “cow” and “goat” appear frequently with “grass”, then we can deduce that the two words are related in some way through “grass”. However, such approaches are inevitably faced with challenges, including sourcing such a large, authoritative supply of contextual cues, and the complications involved in extracting and preparing such features. Similarly, the clustering of URIs has traditionally been very much reliant on access logs or navigational histories as sources of features. To address these issues, in the next section we propose an enhanced, adaptive approach, inspired by hierarchical and ant-based algorithms, for clustering non-decomposable data using only featureless relatedness measurements.

ADAPTIVE CLUSTERING USING FEATURELESS MEASURES

In this section, we will discuss the details of the adaptive clustering approach, including the algorithms of the three passes involved, and the relatedness and distance measures employed.

Featureless Relatedness and Distance Measures

The adaptive approach presented in this chapter for clustering terms and URIs makes use of two types of featureless measures, namely, Normalised Google Distance (NGD) and n° of Wikipedia (n°W). For a thorough discussion on the derivation of NGD, interested readers can refer to Cilibrasi and Vitanyi (2007). At the fundamental level, NGD relies heavily on co-occurrence statistics obtained from search engines such as Google, which represent the indexable portion of the Web. More importantly, the index size of Google and its constant updates provide NGD with access to a dynamic textual resource for inducing the statistical evidence required to estimate relatedness. The relatedness between any two data elements x and y, either terms or URIs, is established using the equation below:

sim(x, y) = 1 − NGD(x, y)^θ    (2)

where θ is a constant for scaling the distance value returned by NGD (Cilibrasi and Vitanyi, 2007).
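NGD itself is derived by Cilibrasi and Vitanyi (2007) from the page counts f(x), f(y) and f(x, y) returned by a search engine, together with the number of indexed pages N. The minimal Python sketch below illustrates the computation and the conversion to relatedness in Equation (2); the hit counts and index size are invented stand-ins for actual search engine queries:

```python
import math

def ngd(fx, fy, fxy, n):
    # Normalised Google Distance: fx and fy are the hit counts of the two
    # individual queries, fxy the count of the joint query, and n the
    # (estimated) number of pages indexed by the search engine.
    if fxy == 0:
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return ((max(log_fx, log_fy) - log_fxy)
            / (math.log(n) - min(log_fx, log_fy)))

def sim(fx, fy, fxy, n, theta=1.0):
    # Equation (2): sim(x, y) = 1 - NGD(x, y)^theta.
    return 1.0 - ngd(fx, fy, fxy, n) ** theta

# Hypothetical counts for two related terms.
print(sim(fx=1_200_000, fy=900_000, fxy=250_000, n=50_000_000_000))
```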
The second measure utilises the online graph-structured textual resource, Wikipedia. Similar to NGD, such a notion of calculating relatedness or distance without relying heavily on features has only gained popularity in recent years, as evidenced by the publication dates. For example, Milne (2007) proposed the Wikipedia Link Vector Model (WLVM) for calculating semantic relatedness between terms using the link structures and titles of articles found within Wikipedia. The relatedness between two articles is computed using the Vector Space Model (VSM), which is used extensively in Information Retrieval (IR). However, the links found within the articles are utilised to construct the vectors, instead of the terms conventionally used in IR. Other proposed computations of relatedness using Wikipedia include those by Strube and Paolo-Ponzetto (2006) and Gabrilovich and Markovitch (2007), referred to as WikiRelate! and Explicit Semantic Analysis (ESA), respectively. Unlike the approach by Milne (2007), WikiRelate! and ESA rely heavily on the textual content of Wikipedia articles.

In this chapter, we look at the measurement of distance using Wikipedia from the perspective of graph theory instead of the conventional VSM. The distance measure used in this chapter for clustering is n°W, proposed by Wong et al. (2006). In this measure, articles in Wikipedia are regarded as manifestations of the information encoded in terms. Following this, each term a can be represented using the corresponding article doc_a in Wikipedia. In Wikipedia, articles are organised using categorical indices which eventually lead to the highest level, namely, “Categories”. The organisation of articles in Wikipedia loosely resembles a directed acyclic graph with a root node, rather than a pure tree structure. Hence, finding the distance between two terms a and b can be reduced to finding how closely located the two corresponding articles doc_a and doc_b are in the Wikipedia categorical indices. The problem of finding the degree of separation between two articles can be considered as a single-source shortest path problem. For this purpose, Dijkstra's algorithm for finding the shortest path between two vertices (i.e. articles) is employed. Formally, the distance between terms a and b given by n°W is defined as:

δ(a, b) = n°W(doc_a, doc_b) = Σ_{k=1}^{|SP|} c_{e_k}    (3)

where n°W(doc_a, doc_b) is the degree of separation between the articles doc_a and doc_b corresponding to terms a and b, respectively. The degree of separation is computed as the sum of the costs of all edges along the shortest path between doc_a and doc_b in the graph of Wikipedia articles. SP is the set of edges along the shortest path and e_k is the k-th edge or element in the set SP. |SP| is the number of edges along the shortest path and c_{e_k} is the cost associated with the k-th edge.
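The sketch below illustrates this computation under stated assumptions: a toy category graph with unit edge costs stands in for the Wikipedia categorical indices, and Dijkstra's algorithm returns the cost of the shortest path between two articles, as in Equation (3):

```python
import heapq

def now_distance(graph, doc_a, doc_b):
    # n°W of Equation (3): cost of the shortest path between two articles.
    # graph maps an article to a list of (neighbour, edge_cost) pairs.
    dist = {doc_a: 0.0}
    queue = [(0.0, doc_a)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == doc_b:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return float("inf")  # no path: the articles are unreachable

# Toy fragment of a category graph; real edges would be drawn from the
# Wikipedia categorical indices.
graph = {
    "Cow": [("Bovines", 1.0)],
    "Bovines": [("Cow", 1.0), ("Mammals", 1.0)],
    "Goat": [("Mammals", 1.0)],
    "Mammals": [("Bovines", 1.0), ("Goat", 1.0)],
}
print(now_distance(graph, "Cow", "Goat"))  # 3.0
```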
A 3-Pass Clustering Approach

The 3-pass clustering approach presented in this chapter is an improvement over the original 2-pass Tree-Traversing Ant (TTA) algorithm proposed by Wong et al. (2007) for term clustering. The enhancement of the approach through the introduction of an additional pass is necessary to further improve the applicability of TTA. The approach is referred to as adaptive due to its ability to cluster both terms and URIs without the need for any algorithmic revision or new source of features. Recall that both terms and URIs are non-decomposable. Hence, it is worth reasserting that TTA is able to overcome the difficulty of clustering a wide range of non-decomposable data with the assistance of featureless similarity measures such as NGD and n°W. Algorithm 1 shows the main function of the adaptive clustering approach using TTA. As stated in Algorithm 1, the clustering approach needs to be provided with an initial input set of elements T, and the definition of all required thresholds.

Algorithm 1. The main function of the adaptive clustering approach using TTA.

First Pass using NGD

The first pass of the approach can be regarded as divisive clustering, where the initial set of all terms is placed in one root node r_0. In the context of TTA, each node corresponds to a cluster, where the contents of the node are the elements of the cluster. An ant randomly picks a term, and proceeds to sense its relatedness to every other term in the same node. Two new sub-nodes are created to accommodate the two least similar terms a and b. Unlike the sensing capability of the basic ants discussed at the start of the chapter, the sense coverage of a TTA includes the two sub-nodes r_m1 and r_m2 of the immediate main node r_m. Accordingly, instead of estimating f(i) as the similarity averaged over the s^2 terms surrounding term i as basic ants do, the neighbourhood function for TTA is defined as the maximum similarity between term a in main node r_m and its neighbours r_u ∈ {r_m1, r_m2}. Formally, we define the density of neighbourhood r_u with respect to term a during the first pass as:

f1_TTA(a, r_u) = max_{∀b ∈ r_u} sim(a, b)    (4)

where r_u can either be r_m1 or r_m2, and sim(a, b) is as defined in Equation (2). Besides deciding whether or not to drop an item, as in the case of basic ants, TTAs have to make one additional decision, namely, where to drop it. A TTA decides where to drop a term based on the neighbourhood densities it has memorised for all sub-nodes r_u of the main node r_m. Formally, the decision on whether to drop term a on sub-node r_y depends on:

P1_drop(a, r_y) = 1 if f1_TTA(a, r_y) = max_{r_u ∈ {r_m1, r_m2}} f1_TTA(a, r_u); 0 otherwise    (5)

We employ a threshold S1T to control the division of nodes into further sub-nodes. Whenever a TTA senses that the similarity between the two least similar terms exceeds S1T, no further branching out is performed. Ideally, S1T should be set as high as possible to produce nodes of extremely fine granularity. Tightly-coupled nodes produced during the first pass can be reconsolidated and improved, if necessary, through the subsequent passes. Setting a high S1T is also important for identifying outliers which would not be detected otherwise. Algorithm 2 summarises the process of splitting the initial cluster of all elements into subsequent tree-structured nodes.

Algorithm 2. The first pass: splitting the initial cluster of all elements into tree-structured nodes.
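The following Python sketch illustrates a single, simplified divisive step of this first pass: the two least similar terms seed the sub-nodes, each remaining term is dropped on the denser sub-node as in Equations (4) and (5), and splitting stops once even the least similar pair exceeds S1T. The recursion over sub-nodes and the ant bookkeeping of Algorithm 2 are omitted:

```python
def f1_tta(a, sub_node, sim):
    # Equation (4): density of a sub-node w.r.t. term a is the maximum
    # similarity between a and the terms already placed there.
    return max(sim(a, b) for b in sub_node)

def split_node(node, sim, s1t):
    # One divisive step of the first pass (simplified, non-recursive).
    if len(node) < 2:
        return [node]
    pairs = [(sim(a, b), a, b)
             for i, a in enumerate(node) for b in node[i + 1:]]
    min_sim, a, b = min(pairs)
    if min_sim > s1t:
        return [node]  # node is already tightly coupled: stop branching
    left, right = [a], [b]  # sub-nodes seeded by the least similar pair
    for term in node:
        if term in (a, b):
            continue
        # Equation (5): drop the term on the denser sub-node.
        denser = (left if f1_tta(term, left, sim) >= f1_tta(term, right, sim)
                  else right)
        denser.append(term)
    return [left, right]
```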
Second Pass using NGD

The second pass introduced in this chapter is a new addition to the overall adaptive clustering approach using TTA. This new pass can be regarded as agglomerative clustering, where the nodes partitioned during the first pass are improved through consolidation. After the first pass, all terms are moved to the extremities of the tree, which are the leaf nodes. The collection of all leaf nodes is denoted as the set LN. Essentially, the conditions which determine the consolidation of leaf node r_i with another leaf node r_j are:

• high inter-cluster (i.e. inter-node) similarity between r_i and r_j; and
• low intra-cluster (i.e. intra-node) variation of the combined r_i and r_j.

The first condition ensures that the two nodes r_i and r_j have high relatedness, while the second condition makes sure that the combination of r_i and r_j results in a compact node. A compact or tightly-coupled node is a node containing elements that are closely situated around the relatedness mean. The inter-cluster similarity between nodes r_i and r_j is simply the group-average similarity between them, defined as:

S(r_i, r_j) = ( Σ_{a ∈ r_i} Σ_{b ∈ r_j} sim(a, b) ) / ( |r_i| |r_j| )    (6)

where sim(a, b) is as defined in Equation (2). As for the intra-cluster variation of the combined r_i and r_j, denoted as V(r_i, r_j), the standard deviation of the pairwise similarities of elements in r_i and r_j from the group-average similarity S(r_i, r_j) is employed:

V(r_i, r_j) = sqrt( Σ_{a ∈ r_i} Σ_{b ∈ r_j} ( sim(a, b) − S(r_i, r_j) )^2 / ( |r_i| |r_j| ) )    (7)

In this new second pass, the TTAs utilise the inter-cluster similarity S(r_i, r_j) and the intra-cluster variation V(r_i, r_j) as their measures of neighbourhood density. Note that there is no limit imposed on the size of the nodes considered for consolidation. Following the trails laid out during the first pass, a TTA randomly picks up a leaf node and senses the neighbourhood densities with all other leaf nodes. The merging is performed if the neighbourhood density exceeds a certain threshold S2T. We define the neighbourhood density of every other leaf node r_i ∈ LN with the leaf node r_x as:

f2_TTA(r_x, r_i) = 1 / ( e^(H[S] − S(r_x, r_i)) · e^(V(r_x, r_i)) )    (8)

where H[S] is the maximum inter-cluster similarity of any two leaf nodes from the set of all leaf nodes LN, defined as:

H[S] = max_{∀(r_i, r_j) ∈ LN, r_i ≠ r_j} S(r_i, r_j)    (9)

H[S] is a constant for any LN. Intuitively, a high neighbourhood density is obtained if the pair r_x and r_i has extremely high inter-cluster similarity (i.e. minimal difference from the maximum inter-cluster similarity) and low intra-cluster variation. At one extreme, when the variation V(r_x, r_y) = 0 and the difference H[S] − S(r_x, r_i) = 0, the maximum neighbourhood density of 1 is achieved. After sensing the neighbourhood densities, the decision on whether to drop the elements in leaf node r_x on another leaf node r_y is determined by:

P2_drop(r_x, r_y) = 1 if f2_TTA(r_x, r_y) = max_{r_i ∈ LN} f2_TTA(r_x, r_i) ∧ f2_TTA(r_x, r_y) > S2T; 0 otherwise    (10)

where S2T is a user-specified threshold of acceptable relatedness. The process of consolidating leaf nodes during the second pass is summarised in Algorithm 3.

Algorithm 3. The second pass: consolidation of leaf nodes.
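The sketch below traces the quantities of Equations (6) to (10) for one consolidation decision. For brevity it considers merging only the first leaf node with its densest neighbour, rather than iterating until no further merge is possible, and it represents leaf nodes as plain lists:

```python
import math
from itertools import combinations

def group_avg_sim(ri, rj, sim):
    # Equation (6): group-average inter-cluster similarity.
    return sum(sim(a, b) for a in ri for b in rj) / (len(ri) * len(rj))

def intra_variation(ri, rj, sim):
    # Equation (7): standard deviation of the pairwise similarities
    # around the group average.
    mean = group_avg_sim(ri, rj, sim)
    sq = sum((sim(a, b) - mean) ** 2 for a in ri for b in rj)
    return math.sqrt(sq / (len(ri) * len(rj)))

def f2_tta(rx, ri, sim, h_s):
    # Equation (8): density is highest for near-maximal inter-cluster
    # similarity combined with low intra-cluster variation.
    return 1.0 / (math.exp(h_s - group_avg_sim(rx, ri, sim))
                  * math.exp(intra_variation(rx, ri, sim)))

def consolidate_once(leaf_nodes, sim, s2t):
    # Merge the first leaf node with its densest neighbour if the
    # density exceeds S2T (Equation (10)).
    h_s = max(group_avg_sim(ri, rj, sim)          # Equation (9)
              for ri, rj in combinations(leaf_nodes, 2))
    rx, rest = leaf_nodes[0], leaf_nodes[1:]
    best = max(rest, key=lambda ri: f2_tta(rx, ri, sim, h_s))
    if f2_tta(rx, best, sim, h_s) > s2t:
        return [rx + best] + [ri for ri in rest if ri is not best]
    return leaf_nodes
```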
Third Pass using n°W

The third pass in this chapter is the original second pass proposed by Wong et al. (2007), which utilises Wikipedia as the source for computing relatedness. Functionally, this third pass is essentially the same as the new second pass reported earlier, in that consolidations of the leaf nodes produced by the first pass are carried out. However, unlike the second pass, the computational concern of a TTA during the third pass is limited to relocating leaf nodes with a single element r_x (i.e. isolated leaf nodes), or |r_x| = 1. This third pass serves two purposes as a final, housekeeping phase, namely, to ensure the least number of compact clusters with maximal coverage, and to isolate outliers. Using a different featureless measure based on a more authoritative source (i.e. Wikipedia) compared to Google, this pass aims to provide a more definite means of determining relatedness, and hence ensures more accurate consolidations of clusters. Every decision to drop the single element a carried by a TTA is made after the average distances between the isolated leaf node r_x and all other leaf nodes r_i ∈ LN have been sensed. Formally, the density of neighbourhood r_i with respect to the isolated leaf node r_x during the third pass is defined as:

f3_TTA(r_x, r_i) = ( Σ_{a ∈ r_x} Σ_{b ∈ r_i} δ(a, b) ) / ( |r_x| |r_i| )    (11)

where δ(a, b) is the distance measure between the two elements a and b defined in Equation (3). After all neighbourhood densities have been sensed, the TTA decides on the most ideal leaf node r_y on which to drop the single element from r_x using the function:

P3_drop(r_x, r_y) = 1 if f3_TTA(r_x, r_y) = min_{r_i ∈ LN} f3_TTA(r_x, r_i) ∧ f3_TTA(r_x, r_y) ≤ δT; 0 otherwise    (12)

where δT is a distance threshold. Intuitively, a TTA will drop the single element a from the isolated leaf node r_x on the leaf node r_y if all elements in r_y collectively yield the minimum average distance with r_x that satisfies the distance threshold δT. In the case of a being an outlier, it will have very high average distances with all other leaf nodes. Without a threshold, the TTA would still relocate the outlier to the leaf node with the minimum average distance which, by comparison with all other non-outlier cases, is still considerably high. Hence, the purpose of δT is very similar to that of S2T, in that both are meant to isolate outliers, and to control the relocation of single elements or, more generally, the consolidation of nodes. Ideally, this pass produces the minimal number of nodes, ensuring that each one covers as many semantically-related elements as possible while the relatedness between elements in the same node remains considerably high. Algorithm 4 outlines the steps involved in the third pass of clustering using n°W.

Algorithm 4. The third pass: relocation of isolated leaf nodes using n°W.
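A simplified sketch of this relocation logic follows; it treats leaf nodes as lists, takes the distance function δ of Equation (3) as given, and omits the ant movements detailed in Algorithm 4:

```python
def f3_tta(rx, ri, delta):
    # Equation (11): average n°W distance between the isolated leaf
    # node rx and another leaf node ri.
    return sum(delta(a, b) for a in rx for b in ri) / (len(rx) * len(ri))

def relocate_singletons(leaf_nodes, delta, delta_t):
    # Move each single-element node into the leaf node with the minimum
    # average distance, provided the threshold of Equation (12) is
    # satisfied; otherwise the element is kept apart as an outlier.
    singles = [r for r in leaf_nodes if len(r) == 1]
    others = [r for r in leaf_nodes if len(r) > 1]
    for rx in singles:
        best = min(others, key=lambda ri: f3_tta(rx, ri, delta))
        if f3_tta(rx, best, delta) <= delta_t:
            best.extend(rx)
        else:
            others.append(rx)  # an outlier remains isolated
    return others
```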
Summary of the 3-Pass TTA

Note that in this 3-pass TTA, only the first pass of the clustering approach is mandatory. The remaining two passes can be activated or disabled depending on the types of data to be clustered, the desired results, and the acceptable execution time. In fact, clustering using TTA can be further extended and improved beyond what is presented in this chapter. Since the foundation of TTA is by itself an adaptive approach, new passes can be introduced, or the existing ones replaced. In addition, more refined featureless relatedness and distance measures can be proposed to further improve the clustering of non-decomposable data.

Clustering Thresholds

In addition to the clustering algorithms and semantic measures, another issue that has a crucial effect on the clustering results is the choice of thresholds. In order to provide users with a certain degree of control over the clustering process while avoiding the need to determine the thresholds manually, we use numerical descriptors of central tendency and dispersion which are related to the neighbourhood density functions. For the first pass, we need to automatically determine a few values from which users can select the threshold S1T. Recall from the previous section that the partitioning of the clusters is controlled by S1T: whenever the similarity between the least similar pair of elements exceeds S1T, the division of clusters stops. In other words, the threshold S1T is related to the pairwise similarity of elements measured using the function sim(a, b) in Equation (2). As such, there are a few possible values that we can choose as the threshold S1T, namely, the maximum pairwise similarity H, the minimum pairwise similarity L, the average pairwise similarity E, and the standard deviation of the pairwise similarities SD, of the elements in the input set T:

H[sim] = max_{∀(a,b) ∈ T, a ≠ b} sim(a, b)

L[sim] = min_{∀(a,b) ∈ T, a ≠ b} sim(a, b)

E[sim] = ( Σ_{a ∈ T} Σ_{b ∈ T, a ≠ b} sim(a, b) ) / ( |T| (|T| − 1) )

SD[sim] = sqrt( Σ_{a ∈ T} Σ_{b ∈ T, a ≠ b} ( sim(a, b) − E[sim] )^2 / ( |T| (|T| − 1) ) )

Using these numerical descriptors, we can set S1T to either H[sim], L[sim], E[sim], E[sim]+SD[sim], or E[sim]−SD[sim], depending on the desired output. Since H[sim] > E[sim]+SD[sim] > E[sim] > E[sim]−SD[sim] > L[sim], setting an increasingly higher S1T will prolong the partitioning process and hence create more refined clusters or nodes.

As for the second pass, we can follow the same idea of utilising numerical descriptors, in this case related to f2_TTA. Firstly, we formulate S2T in the same way as f2_TTA:

S2T = 1 / ( e^(H[S] − τS) · e^(τV) )

where H[S] is as defined in Equation (9), and τS and τV are two variables related to the inter-cluster similarity and intra-cluster variation functions S(r_i, r_j) and V(r_i, r_j) in Equations (6) and (7), respectively. Similar to S1T, we can set the individual variables τS and τV using the various numerical descriptors related to the corresponding functions. For τS, we have the maximum, minimum, average and standard deviation of the inter-cluster similarity, namely H[S], L[S], E[S] and SD[S], respectively. In the case of τV, we have the maximum, minimum, average and standard deviation of the intra-cluster variation, namely H[V], L[V], E[V] and SD[V], respectively. Since the second pass deals with leaf nodes (i.e. groups of elements) instead of individual elements, we need to define the numerical descriptors in terms of the set of all leaf nodes LN. In addition to H[S] as defined in Equation (9), the remaining numerical descriptors for inter-cluster similarity are determined as:

L[S] = min_{∀(r_i, r_j) ∈ LN, r_i ≠ r_j} S(r_i, r_j)

E[S] = ( Σ_{r_i ∈ LN} Σ_{r_j ∈ LN, r_i ≠ r_j} S(r_i, r_j) ) / ( |LN| (|LN| − 1) )

SD[S] = sqrt( Σ_{r_i ∈ LN} Σ_{r_j ∈ LN, r_i ≠ r_j} ( S(r_i, r_j) − E[S] )^2 / ( |LN| (|LN| − 1) ) )

and similarly, the numerical descriptors for intra-cluster variation are defined as:

H[V] = max_{∀(r_i, r_j) ∈ LN, r_i ≠ r_j} V(r_i, r_j)

L[V] = min_{∀(r_i, r_j) ∈ LN, r_i ≠ r_j} V(r_i, r_j)

E[V] = ( Σ_{r_i ∈ LN} Σ_{r_j ∈ LN, r_i ≠ r_j} V(r_i, r_j) ) / ( |LN| (|LN| − 1) )

SD[V] = sqrt( Σ_{r_i ∈ LN} Σ_{r_j ∈ LN, r_i ≠ r_j} ( V(r_i, r_j) − E[V] )^2 / ( |LN| (|LN| − 1) ) )

The two variables τS and τV have a proportional and an inversely-proportional effect on S2T, respectively: as τS increases, so does S2T, while an increasing τV produces a decreasing S2T. In short, this method of associating the thresholds with certain combinations of numerical descriptors allows the actual values to be adjusted automatically depending on the input set. Requiring users to provide arbitrary values manually is inappropriate and can lead to poor performance.
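The sketch below computes the descriptors for the first-pass threshold and assembles S2T from a chosen τS and τV; the particular combination (e.g. S1T = E[sim]+SD[sim]) is left to the caller, as in the experiments that follow:

```python
import math
from itertools import permutations

def descriptors(elements, sim):
    # H, L, E and SD of the pairwise similarities over the input set T;
    # ordered pairs (a, b) with a != b match the |T|(|T| - 1) denominator.
    sims = [sim(a, b) for a, b in permutations(elements, 2)]
    h, l = max(sims), min(sims)
    e = sum(sims) / len(sims)
    sd = math.sqrt(sum((s - e) ** 2 for s in sims) / len(sims))
    return h, l, e, sd

def s2t(h_s, tau_s, tau_v):
    # S2T = 1 / (e^(H[S] - tau_S) * e^(tau_V)), mirroring f2_TTA.
    return 1.0 / (math.exp(h_s - tau_s) * math.exp(tau_v))

# e.g. setting S1T = E[sim] + SD[sim]:
# h, l, e, sd = descriptors(terms, sim); s1t = e + sd
```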
EXPERIMENTS AND DISCUSSIONS

To demonstrate the applicability of the enhanced 3-pass clustering approach to non-decomposable data, we conducted five experiments using three sets of terms from chemistry and materials science, and two sets of URIs across various general topics.

Term Clustering using the Adaptive 3-Pass Approach

The use of the original 2-pass TTA for clustering terms has been thoroughly evaluated (Wong et al., 2007) using terms from various genres, ranging from diseases, wildlife, people and wine to politics, finance and technology. Using Lexical Overlap (LO) (Maedche and Staab, 2002) as the performance metric, the original 2-pass TTA achieved an average of 97% over ten experimental datasets. LO is simply the intersection between the discovered clusters and the recommended (i.e. manually created) clusters. Obviously, higher LO is achieved when the discovered clusters share many common data elements with the recommended ones.
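Maedche and Staab (2002) define LO over the lexica of two ontologies, and the chapter does not spell out the exact mapping to clusters, so the Python sketch below is only one plausible reading: each recommended cluster is scored against its best-overlapping discovered cluster. The example clusters are invented:

```python
def lexical_overlap(discovered, recommended):
    # Fraction of recommended cluster members that are recovered by the
    # best-overlapping discovered cluster (an assumed interpretation).
    matched = 0
    total = sum(len(c) for c in recommended)
    for rec in recommended:
        best = max(discovered, key=lambda d: len(set(d) & set(rec)))
        matched += len(set(best) & set(rec))
    return matched / total

discovered = [{"bleach", "peroxide"}, {"msg", "gmp"}]
recommended = [{"bleach", "peroxide"}, {"msg", "gmp", "ethyl maltol"}]
print(lexical_overlap(discovered, recommended))  # 0.8
```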
While the resulting clusters after the first pass may show some promising results, they have not yet achieved maximum coverage. There are certain qualified terms that have yet to join the appropriate clusters. For example, “monosodium glutamate” is a common “flavour enhancer”. This is when the second pass, which is the enhancement added to the adaptive clustering approach, contributes to the improvement of clustering results. As shown in Pass 2 of Figure 1, clusters B′ and D′ were appropriately extended to include one additional element each. However, three isolated clusters (i.e. single-element clusters) remain, namely, “sodium dichloroisocyanurate”, “pyrocatechol” and “guanosine monophosphate”. During the final, housekeeping phase, these three isolated clusters were relocated to their intended categories by the third pass as shown in Pass 3 of Figure 1. The second experiment was conducted using 15 terms belonging to the category “explosives and toxins”. As one can observe from Figure 2, this experiment was different in that only the first two passes were required to discover the naturally-occurring clusters. We were anticipating three clusters in the result, and as shown in Pass 2 of Figure 2, our expectation was again fulfilled. Terms from the sub-categories “nerve agents”, “mycotoxins”, and “explosives” were grouped into cluster A, B and C, respectively. Moreover, one is able to note that the second pass alone was adequate to relocate the remaining isolated leaf node in Pass 1 of Figure 2, which is “Cytochalasin”. Since the computation of Figure 2. The result of clustering 15 terms from the category “explosives and toxins”. The following thresholds were achieved using the recommended numerical descriptors: S1T = 0.70 and S2T = 0.87. There were three intended clusters of terms from the set. Cluster A, B and C represents “nerve agents”, “mycotoxins”, and “explosives”, respectively. The symbol “′” attached to a cluster designator represents a revision of that cluster, while the subscript “ij” signifies that cluster Xij is the result of the consolidation of clusters Xi and Xj. The superscript “P” denotes the parent status of the cluster, which is indirectly formed through the sharing of a common node by several clusters belonging to taxonomically-related semantic classes. 157 Featureless Data Clustering distance using n°W can be slightly more intensive due to the traversal of Wikipedia categorical indices, the occasional avoidance of the third pass can be beneficial. Another interesting feature revealed from this experiment is the clustering approach’s capability of revealing hidden structures within clusters as pointed out by Wong et al. (2007). In Pass 2 of Figure 2, both clusters C12 and C3, which are joined by a common ancestor to form cluster CP, contain the names of the various types of explosives. Within this single parent cluster of “explosives”, the approach has further discovered two sub-clusters where cluster C3 contains a specific type of explosives known as “Sprengel explosives” (i.e. “Oxyliquit” and “Panclastite”). The discovery of such virtual groupings of clusters related through the parent-child relationship is usually motivated by the fact that the clusters involved are strongly related in some way but are not suitable to be directly consolidated into a single cluster. For the last experiment involving terms, we conducted clustering using the set of 19 terms from the category “materials”. 
There were four clusters produced from the clustering, and the cluster memberships properly reflect their intended semantic class. Cluster A consists of the names of the various “thermoplastics”, while cluster B is made up of several “silicate minerals”. Clusters C and D contain the names of the several types of “woods” and “glass”. Similar to the first experiment, the second and third passes were required to finalise the clustering process. Unlike the second experiment where the third pass was of no use, the third pass in this third experiment played an extremely important role. One can observe from Pass 1 of Figure 3 the overly partitioned clusters. In comparison to the previous two Figure 3. The result of clustering 19 terms from the category “materials”. The following thresholds were achieved using the recommended numerical descriptors: S1T = 0.66 and S2T = 0.85. There were four intended clusters of terms from the set. Cluster A, B, C and D represents “thermoplastics”, “silicate minerals”, “woods” and “glass”, respectively. The symbol “′” attached to a cluster designator represents a revision of that cluster. 158 Featureless Data Clustering experiments, there were many isolated clusters produced after the first pass. Similarly, the output after the second pass, as shown in Pass 2 of Figure 3, did not show much improvement. In fact, the intended cluster D did not appear even after the second pass. One of the possible reasons is that the threshold was set much higher than necessary. As we have mentioned at the start of this section, we utilised the same settings of numerical descriptors to define all thresholds for all experiments. Nevertheless, the overall performance of the 3-pass approach was not affected at all since we have the third pass to improve the results before delivering them. This is clearly shown in Part 3 of Figure 3 where all remaining isolated clusters were properly relocated and the last remaining cluster D, which we had been expecting, had successfully emerged. URI Clustering using the Adaptive 3-Pass Approach As we have pointed before, the clustering of URIs has been conventionally carried out using log records or navigational histories for Web personalisation purposes. As a result, huge collection of website-specific access logs are required to validate the performance of clustering approaches. Instead, we have constructed two arbitrary sets of URIs of various genres to demonstrate the wide applicability of the clustering approach presented in this chapter. As we have mentioned before, the clustering of URIs using the proposed 3-pass approach requires the third pass to be disabled since the relatedness between URIs cannot be established using Wikipedia. Two experiments were conducted for this section using two sets of URIs from the categories “publishers, online retailers & fast-food franchises” and “news agencies, travel guides & cooking recipes”. The first set consists of 16 URIs, and the second set has 24 URIs. For the two experiments using URIs, we set the thresholds S1T to E[sim]+SD[sim], and τS and τV to E[S]+SD[S] and E[V], respectively. The pair-wise similarity or relatedness between URIs is inherently lower than that of terms. One of the reasons behind such trend is the relatively lower observable co-occurrence of URIs as compared to terms on the World Wide Web. Even though the same combination Figure 4. The result of clustering 16 URIs from the category “publishers, online retailers & fast-food franchises”. 
Even though the same combination of numerical descriptors is used to define the threshold for the first pass, namely E[sim]+SD[sim], the values of S1T are relatively lower, as shown in the captions of Figures 4 and 5.

Figure 4. The result of clustering 16 URIs from the category “publishers, online retailers & fast-food franchises”. The following thresholds were obtained using the recommended numerical descriptors: S1T = 0.51 and S2T = 0.88. There were three intended clusters of URIs from the set. Clusters A, B and C represent “fast-food franchises”, “online retailers” and “academic publishers”, respectively. The symbol “′” attached to a cluster designator represents a revision of that cluster.

The first experiment using URIs was carried out with a set of 16 URIs from the sub-categories “fast-food franchises”, “online retailers” and “academic publishers”. Using the NGD measure, backed by the World Wide Web, and the TTA algorithm, these 16 URIs were successfully grouped into their intended clusters using only the first two passes. There was strictly no feature extraction and selection involved, since URIs, like terms, cannot be further decomposed into meaningful constituents. After the completion of the clustering process, as shown in Pass 2 of Figure 4, cluster A contains all URIs from the input set belonging to various famous “fast-food franchises”. As for clusters B and C, we have the URIs of “online retailers” and “academic publishers”.

We used 24 URIs for the last experiment. We were expecting three clusters A, B and C to represent URIs of websites belonging to the categories “travel services”, “cooking recipes” and “news agencies”, respectively.

Figure 5. The result of clustering 24 URIs from the category “news agencies, travel services & cooking recipes”. The following thresholds were obtained using the recommended numerical descriptors: S1T = 0.69 and S2T = 0.88. There were three intended clusters of URIs from the set. Clusters A, B and C represent “travel services”, “cooking recipes” and “news agencies”, respectively. The symbol “′” attached to a cluster designator represents a revision of that cluster. The subscript “ij” signifies that cluster Xij is the result of the consolidation of clusters Xi and Xj, while the superscript “P” denotes the parent status of the cluster, which is indirectly formed through the sharing of a common node by several clusters belonging to taxonomically-related semantic classes.

Similar to the results of Figure 2, TTA's ability to uncover sub-clusters within larger clusters remains evident in URI clustering. As shown in Pass 2 of Figure 5, there are three sub-clusters within the “travel services” category, which is represented by the parent cluster AP. The parent cluster AP is made up of website URIs related to the various types of travel services. For example, the sub-cluster A1 contains the URIs of prominent travel guide websites including “frommers.com”, “lonelyplanet.com” and “fodors.com”. On the other hand, sub-cluster A2 consists of the URIs of two worldwide accommodation listing and reservation websites. The URIs of travel advisory websites were grouped into cluster A3. All in all, these three sub-clusters are highly related to one another, but at the same time they belong to distinct groups. As such, instead of consolidating all of these travel-related URIs into a single cluster, they were organised into a hierarchy of distinct sub-clusters connected by a common theme, namely, “travel services”.
As for cluster B12′ in Pass 2 of Figure 5, it was achieved through the consolidation of the individual clusters B1 and B2, and later the inclusion of the two isolated clusters “chinesefood.about.com” and “thaifood.about.com”.

Lastly, the significance of the choice of thresholds is also revealed through this experiment. Unlike the first four experiments, there was an isolated cluster that remained even after the clustering process was completed. This isolated cluster contains the single URI “del.icio.us”, as shown in Pass 2 of Figure 5. Depending on one's perception, this isolated cluster can be perceived as either a clustering outlier or simply another significant cluster. This demonstrates the 3-pass approach's capability of identifying and isolating outliers, subject to the careful selection of thresholds. As such, our proposed method of initialising the thresholds using the various combinations of numerical descriptors is both relevant and important in ensuring the quality of the clustering output.

In a nutshell, we have shown that the enhanced 3-pass clustering approach proposed in this chapter is flexible, due to its ability to cluster both terms and URIs without any need for algorithmic revision or the sourcing of relevant features. The adaptability of the 3-pass approach is demonstrated through the automatic adjustment of thresholds depending on the input set. In addition, the selective enablement of the second and third passes based on the output of the preceding phase allows the system to avoid unnecessary processing, or to further improve the final clustering output. The iterative style of improving the clustering result, spread across three different phases, and the use of two different sources for semantic measurement also help to ensure the robustness of the approach.

CONCLUSION AND FUTURE TRENDS

Data clustering is an active research area which is highly relevant to many engineering, business and marketing applications. Many established clustering approaches have relied on hierarchical and partitional algorithms with the use of feature-based semantic measures. Due to the challenges involved in the preprocessing phase to prepare the features required for similarity or relatedness computation, many research efforts in this area have focused primarily on reducing dimensionality and improving feature extraction and selection. However, these efforts have limited applicability when we are dealing with data elements that have little or no features to rely on. Such data elements are what we refer to as non-decomposable data. Can we resolve a term or a URI into its constituent parts for analysis in the same way as an image or a document? A key characteristic of non-decomposable data is that the comprehension and appreciation of the semantics or significance of such data requires background knowledge. For example, how can one know that the word “grass” refers to something green without acquiring such knowledge through the physical senses? The World Wide Web and other forms of continuously-updated structured resources such as Wikipedia and the Open Directory Project offer a promising starting point for developing avenues for harnessing general consensus and inducing semantics. The Normalised Google Distance (NGD) and n° of Wikipedia (n°W) are two examples. In this chapter, we presented an enhanced 3-pass clustering approach using the Tree-Traversing Ant (TTA) algorithm, and the two featureless semantic measures NGD and n°W.
The first pass, which is mandatory, divides the elements into naturally-occurring clusters of appropriate granularity. The second and third passes improve the final clustering output by performing cluster consolidation and single-element relocation, respectively. The flexibility of the approach is ensured through the substitutability of the second pass with the third and vice versa, and through the use of two different sources for computing semantic relatedness and distance. This 3-pass approach does not require any algorithmic revision or new source of features to switch between term and URI clustering. In addition, we have introduced a method of associating the clustering thresholds with various combinations of numerical descriptors to allow for automatic adjustment depending on the input set. This avoids the need to manually define arbitrary values to suit the clustering input. Various other strengths of TTA, such as the ability to isolate outliers and to discover internal groupings within stable clusters, were also demonstrated through our five experiments using both terms and URIs.

There are a few possible research directions related to this area of non-decomposable data clustering using featureless semantic measures. The first concerns the algorithm itself: more study is required to extend the 3-pass clustering approach for use with large datasets. Issues such as computational complexity and resource consumption will inevitably arise as more and more data elements are involved in clustering. Secondly, study into the coverage and limitations of the featureless measures is also necessary. Even though the World Wide Web is the largest repository of text ever assembled, and Wikipedia is constantly growing, are there any terms or URIs that the two associated measures cannot cope with? If so, what are the consequences?

ACKNOWLEDGMENT

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the Research Grant 2008 by the University of Western Australia.

REFERENCES

Cimiano, P. & Staab, S. (2005). Learning concept hierarchies from text with a guided agglomerative clustering algorithm. In Proceedings of the Workshop on Learning and Extending Lexical Ontologies with Machine Learning Methods, Bonn, Germany.

Cilibrasi, R. & Vitanyi, P. (2007). The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3):370-383.

Deneubourg, J., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C. & Chretien, L. (1991). The dynamics of collective sorting: Robot-like ants and ant-like robots. In Proceedings of the 1st International Conference on Simulation of Adaptive Behavior: From Animals to Animats, France.

Faure, D. & Poibeau, T. (2000). First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX. In Proceedings of the 1st Workshop on Ontology Learning, Berlin, Germany.

Faure, D. & Nedellec, C. (1998). A corpus-based conceptual clustering method for verb frames and ontology acquisition. In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC), Granada, Spain.

Gabrilovich, E. & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India.

Gomez-Perez, A. & Manzano-Macho, D. (2003). A survey of ontology learning methods and techniques. (Report No. 1.5). OntoWeb Consortium.
Han, E., Karypis, G., Kumar, V. & Mobasher, B. (1997). Clustering based on association rule hypergraphs. In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.

Handl, J., Knowles, J. & Dorigo, M. (2003). Ant-based clustering: A comparative study of its relative performance with respect to k-means, average link and 1d-som. (Report No. TR/IRIDIA/2003-24). Universite Libre de Bruxelles.

Jain, A., Murty, M. & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3):264-323.

Lagus, K., Honkela, T., Kaski, S. & Kohonen, T. (1996). Self-organizing maps of document collections: A new approach to interactive exploration. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining.

Lumer, E. & Faieta, B. (1994). Diversity and adaptation in populations of clustering ants. In Proceedings of the 3rd International Conference on Simulation of Adaptive Behavior: From Animals to Animats 3.

Maedche, A. & Volz, R. (2001). The ontology extraction & maintenance framework: Text-To-Onto. In Proceedings of the IEEE International Conference on Data Mining, California, USA.

Maedche, A. & Staab, S. (2002). Measuring similarity between ontologies. In Proceedings of the European Conference on Knowledge Acquisition and Management (EKAW), Madrid, Spain.

Manco, G., Ortale, R. & Sacca, D. (2003). Similarity-based clustering of Web transactions. In Proceedings of the ACM Symposium on Applied Computing.

Milne, D. (2007). Computing semantic relatedness using Wikipedia link structure. In Proceedings of the New Zealand Computer Science Research Student Conference (NZCSRSC), Hamilton, New Zealand.

Mobasher, B., Cooley, R. & Srivastava, J. (1999). Creating adaptive Web sites through usage-based clustering of URLs. In Proceedings of the Workshop on Knowledge and Data Engineering Exchange.

Salton, G. & McGill, M. (1983). Introduction to modern information retrieval. New York, NY: McGraw-Hill.

Steinbach, M., Karypis, G. & Kumar, V. (2000). A comparison of document clustering techniques. (Report No. 00-034). University of Minnesota.

Strube, M. & Ponzetto, S. P. (2006). WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), Boston, USA.

Tzerpos, V. & Holt, R. (1998). Software botryology: Automatic clustering of software systems. In Proceedings of the 9th International Workshop on Database and Expert Systems Applications.

Wang, Y. & Kitsuregawa, M. (2002). Evaluating contents-link coupled Web page clustering for Web search results. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM), Virginia, USA.

Wong, W., Liu, W. & Bennamoun, M. (2007). Tree-traversing ant algorithm for term clustering based on featureless similarities. Data Mining and Knowledge Discovery, 15(3):349-381.

Wong, W., Liu, W. & Bennamoun, M. (2006). Terms clustering using tree-traversing ants and featureless similarities. In Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia.

KEY TERMS

Clustering: The process of discovering naturally-occurring groups of data elements.

Semantic Measure: A computational means of determining to what extent two elements are semantically related or similar.

Term: A word used in a domain-specific context.

Uniform Resource Identifier (URI): A compact string of characters used to identify or name a resource.