K. Arun Prabha, A. Amutha
Abstract: Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets. Data clustering is a common technique for statistical data analysis and is used in many fields, including machine learning, data mining, pattern recognition and bioinformatics. Gene expression data are high dimensional, which motivates the development of clustering algorithms for them. Existing systems rely on popular algorithms such as k-means and CAST, but applying these algorithms to a large genome-scale gene expression data set is difficult in practice. A novel method for clustering large gene data sets is introduced. Existing work uses the TCLUST algorithm, in which a Correlation Coefficient Graph (CCG) is constructed to hold the gene expression data values and a Tanimoto Coefficient Graph (TCG) is used to measure the similarity between gene expression data. The proposed work uses an enhanced TCLUST algorithm, called E-TCLUST. The enhanced Tanimoto clustering method exploits co-connectedness to cluster large, sparse expression data efficiently. A dynamic error threshold estimation model sets threshold values and filters out data below the given threshold. In the proposed work, a tree structure is constructed to represent the input samples, and the variations are identified using graphs. A graph re-arrangement mechanism effectively reduces the number of iterations, and the processing time is reduced as well. Extensive evaluation of the method shows improved performance, which is presented graphically. The algorithm is applied to a genome-scale gene expression data set, and gene set enrichment analysis is used to obtain highly significant biological clusters. Both the TCLUST and E-TCLUST algorithms have been implemented, and their performance has been tested using three different data sets.
The data sets are real gene expression data from yeast samples, generated using micro-array technology.
Keywords: Clustering, Gene Expression, Micro-array, Bio-informatics, Data mining
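As a rough illustration of the similarity step summarized in the abstract, the following minimal sketch computes a Tanimoto coefficient between expression profiles and keeps only the gene pairs whose similarity reaches a threshold. It assumes the common continuous form of the coefficient, a.b / (|a|^2 + |b|^2 - a.b); the function names, the toy profiles and the fixed threshold of 0.8 are hypothetical and are not taken from TCLUST or E-TCLUST, whose exact graph construction and dynamic threshold estimation are not specified here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Continuous Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b).

    Assumed definition for illustration; returns 0.0 when both
    profiles are all zeros (a convention chosen for this sketch).
    """
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def similarity_graph(profiles, threshold):
    """Return edges (gene_i, gene_j, similarity) with similarity >= threshold.

    `profiles` maps a gene name to its list of expression values.
    Pairs below the threshold are filtered out, which keeps the graph sparse.
    """
    edges = []
    for (gi, a), (gj, b) in combinations(profiles.items(), 2):
        s = tanimoto(a, b)
        if s >= threshold:
            edges.append((gi, gj, s))
    return edges


# Toy expression profiles (hypothetical values, three samples per gene).
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.0, 2.0, 3.0],
    "g3": [3.0, 0.0, -1.0],
}

# Fixed threshold used only for this example.
edges = similarity_graph(profiles, threshold=0.8)
print(edges)
```

In this toy input, g1 and g2 have identical profiles and are linked, while g3 is orthogonal to both and is filtered out. A clustering step would then operate on the surviving edges.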