Ankita G. Joshi, R. R. Shelke
Abstract: The main objective of data mining is to extract information from a large data set and transform it into an understandable structure for further use. Clustering is a central task in exploratory data analysis and data mining applications: it groups a set of objects so that objects in the same group (called a cluster) are more similar to each other, in some sense, than to objects in other clusters. Because the objective of clustering is typically exploratory, we desire clustering algorithms that make as few assumptions about the data as possible. Distributed clustering explores the hidden structure of data collected or stored at geographically distributed nodes. Information theoretic measures take the whole distribution of the cluster data into account for better clustering results. We therefore incorporate an information theoretic measure into the cost function of distributed clustering, and we explain the motivation for choosing the MMI (Maximum Mutual Information) criterion to develop distributed clustering algorithms. The proposed Linear and Kernel DMMI algorithms can achieve clustering results almost as good as those of the corresponding centralized information theoretic clustering algorithms on both synthetic and real data.
Keywords: Data mining, Information theoretic clustering, Information theory, Mutual information