2005 Technical Reports

Clustering Mixed Numerical and Uncertain Categorical Data with M-BILCOM: Significance Metrics on a Yeast Example

Bill Andreopoulos, Aijun An and Xiaogang Wang

Technical Report CS-2005-03

York University

March 2005


We have designed M-BILCOM, a clustering tool for mixed numerical and categorical data sets in which the categorical attribute values (CAs) are not certain to be correct and carry associated confidence values (CVs), from 0.0 to 1.0, that represent their certainty of correctness. M-BILCOM performs bi-level clustering of mixed data sets in a manner that resembles a Bayesian process. We have applied M-BILCOM to yeast data sets in which the CAs were perturbed randomly and CVs were assigned to indicate the confidence of correctness of the CAs. On such mixed data sets, M-BILCOM outperforms other clustering algorithms, such as AutoClass. We have also applied M-BILCOM to real numerical data sets from gene expression studies on yeast, incorporating CAs that represent Gene Ontology annotations on the genes and CVs that represent Gene Ontology Evidence Codes on the CAs. We have applied novel significance metrics to the CAs in the resulting clusters to extract the most significant CAs based on their frequencies and their CVs in the cluster. For genomic data sets, we have used the most significant CAs in a cluster to predict gene function.
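
The idea of ranking a cluster's CAs by both frequency and confidence can be illustrated with a small Python sketch. This is not the significance metric defined in the report, which the abstract does not spell out; it is a simple stand-in that assumes each CA's score is the sum of its CVs in the cluster divided by the cluster size. The function names, the data layout, and the example annotations are all hypothetical.

```python
from collections import defaultdict


def weighted_ca_scores(cluster):
    """Score each categorical value (CA) in one cluster.

    `cluster` is a list of objects (e.g. genes); each object is a list of
    (ca, cv) pairs, where cv is a confidence value in [0.0, 1.0].
    ASSUMPTION: score = sum of CVs for that CA / number of objects.
    This is an illustrative stand-in, not the report's actual metric.
    """
    totals = defaultdict(float)
    for annotations in cluster:
        for ca, cv in annotations:
            if not 0.0 <= cv <= 1.0:
                raise ValueError(f"confidence value out of range: {cv}")
            totals[ca] += cv
    size = len(cluster)
    if size == 0:
        return {}
    return {ca: total / size for ca, total in totals.items()}


def most_significant(cluster, top=1):
    """Return the `top` highest-scoring (ca, score) pairs, ties broken by name."""
    scores = weighted_ca_scores(cluster)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical cluster of three genes with made-up annotations.
example_cluster = [
    [("ribosome biogenesis", 0.9)],
    [("ribosome biogenesis", 0.8), ("transport", 0.3)],
    [("transport", 1.0)],
]
best_ca, best_score = most_significant(example_cluster)[0]
```

Under this assumed scoring, a CA that appears often but with low confidence can rank below a rarer CA with high confidence, which is the kind of trade-off a frequency-and-CV metric has to resolve.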

Notice: The work presented in the paper above is covered by pending patents and copyright. Publication of this paper does not grant rights to any intellectual property. All rights reserved.

Download paper in PDF format.

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.