Interactive data analysis and clustering of genomic data

Ciaramella, Angelo; Cocozza, S.; Iorio, F.; Miele, G.; Napolitano, F.; Pinelli, M.; Raiconi, G.; Tagliaferri, R.

doi:10.1016/j.neunet.2007.12.026

In this work, a new clustering approach is used to explore a well-known dataset of time-dependent gene expression profiles in the human cell cycle [Whitfield, M. L., Sherlock, G., Saldanha, A. J., Murray, J. I., Ball, C. A., Alexander, K. E., et al. (2002). Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell, 13, 1977–2000]. The approach is a multi-step procedure: after preprocessing, parameters are chosen using data subsampling and stability measures; for each model, several clustering solutions are obtained by random initialization and are selected on the basis of a similarity measure and a figure of merit; finally, the selected solutions are tuned by evaluating a reliability measure. Three clustering models are compared: K-means, Self-Organizing Maps and Probabilistic Principal Surfaces. The comparative analysis considers the similarity between the best solutions obtained with the three methods, the absolute distortion value, and validation through Gene Ontology (GO) annotations. The GO annotations are used to assign significance to the obtained clusters and to compare the results with those obtained in the work cited above.
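
The parameter-selection step (data subsampling plus a stability measure) can be illustrated with a minimal sketch. The abstract does not say which stability measure or subsampling scheme is used, so everything here is an assumption made for illustration: the Rand index as the agreement score, 80% subsamples, a plain Lloyd's K-means, synthetic two-dimensional data in place of expression profiles, and the function names `kmeans`, `rand_index` and `stability`.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centers = rng.sample(points, k)  # random initialization from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, rng, n_pairs=10, frac=0.8):
    """Mean Rand index between clusterings of two random subsamples,
    evaluated on the points the subsamples share (assumed scheme).
    With frac > 0.5 the two subsamples always overlap."""
    n = len(points)
    size = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        s1 = rng.sample(range(n), size)
        s2 = rng.sample(range(n), size)
        l1 = dict(zip(s1, kmeans([points[i] for i in s1], k, rng)))
        l2 = dict(zip(s2, kmeans([points[i] for i in s2], k, rng)))
        shared = sorted(set(s1) & set(s2))
        scores.append(rand_index([l1[i] for i in shared],
                                 [l2[i] for i in shared]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for expression profiles: three 2-D Gaussian blobs.
    data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (6, 0), (3, 6)] for _ in range(20)]
    for k in (2, 3, 4):
        print(k, round(stability(data, k, rng), 3))
```

Under this scheme, a candidate number of clusters whose stability score is consistently high across subsamples would be preferred. Because each K-means run starts from a random initialization, individual scores vary from run to run, which is one reason to average over several subsample pairs.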