2006 Technical Reports

A Framework for Clustering Categorical Data based on Empirical Distributions

Bill Andreopoulos, Aijun An and Xiaogang Wang

Technical Report CS-2006-01

York University

January 2006

Abstract

Density-based clustering algorithms often have a solid mathematical basis. A challenge in applying density-based clustering to categorical data sets is that the 'cube' of attribute values has no ordering defined. In this paper we propose the CEED framework for clustering categorical data based on its empirical probability distribution. CEED offers a basis for designing categorical clustering algorithms that balance accuracy against speed. The advantages of CEED are: (i) it offers a probabilistic basis for clustering categorical data, (ii) it minimizes the number of user-specified input parameters, (iii) it is insensitive to the order of the input objects, and (iv) it can discover clusters of arbitrary shapes and sizes. We present a faster approximation of CEED called the MULIC algorithm, which is designed for categorical data sets with a multi-layered structure. We evaluate CEED and MULIC on various data sets, including protein interaction data. CEED produces more accurate results than other algorithms on low-dimensional data sets. MULIC finds the multi-layered structure of special data sets such as protein interaction data better than other algorithms, with comparable runtimes.
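
As a rough illustration of what an empirical probability distribution over categorical data looks like, the following Python sketch counts how often each combination of attribute values occurs and treats the relative frequency as the density of that cell of the 'cube'. It is not the CEED or MULIC algorithm, whose details are in the report itself. The function names, the toy data, and the use of Hamming distance to define a neighbourhood are all assumptions made for this example.

```python
from collections import Counter


def empirical_distribution(objects):
    """Map each distinct attribute-value tuple (a cell) to its relative frequency."""
    counts = Counter(tuple(obj) for obj in objects)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def hamming(a, b):
    """Number of attributes on which two objects disagree (assumed distance measure)."""
    return sum(x != y for x, y in zip(a, b))


def neighborhood_mass(dist, center, radius):
    """Total empirical probability of the cells within `radius` mismatches of `center`."""
    return sum(p for cell, p in dist.items() if hamming(cell, center) <= radius)


# Hypothetical toy data set: three categorical attributes per object.
data = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]

dist = empirical_distribution(data)
dense_cell = max(dist, key=dist.get)
mass = neighborhood_mass(dist, dense_cell, radius=1)
```

Because the distribution depends only on counts, reordering `data` yields the same `dist`, which is the sense in which a purely distribution-based approach can be insensitive to input order. Cells with high neighbourhood mass are natural candidates for cluster centres in a density-based scheme.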

Download paper in PDF format.



The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.