Kavita Rathod, Pramod Patil
Abstract: Learning from imbalanced data is an important and common problem. Several decision tree algorithms can be applied to it, including CART, C4.4, and HDDT. CART performs poorly on imbalanced datasets compared with the other algorithms, so we omit it here. Sampling techniques can be combined with decision trees to address class imbalance, but they require parameter selection and increase complexity. To overcome this drawback, a new decision tree technique, the Hellinger distance decision tree (HDDT), is proposed, which uses Hellinger distance as the splitting criterion. We also examine the skew insensitivity of Hellinger distance and its advantage over popular alternatives such as entropy (gain ratio). Moreover, the resulting binary trees can be easily understood by a non-expert user. We arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use bagged Hellinger trees (bagged HDDT, BG) instead of sampling methods. The existing system, C4.4, builds the decision tree with entropy (gain ratio) as the splitting criterion and relies on sampling to handle imbalanced datasets; the proposed system, HDDT, overcomes this drawback by using Hellinger distance as the splitting criterion. The results present the decision trees, the gain ratio computed by C4.4 and the Hellinger distance computed by HDDT, and a comparison between HDDT and C4.4 on imbalanced datasets.
Keywords: Imbalanced datasets, C4.4 Algorithm, HDDT, Gain Ratio, Hellinger Distance, Decision Tree
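
The contrast the abstract draws between Hellinger distance and gain ratio as splitting criteria can be illustrated with a minimal sketch. It assumes a two-class problem and a binary split described only by four counts (positives and negatives sent to the left and right branches). The function names `hellinger_split`, `entropy`, and `gain_ratio` are hypothetical and are not taken from the paper, and the formula is the usual two-class Hellinger distance between the per-class branch distributions, which may differ in detail from the authors' implementation.

```python
import math


def hellinger_split(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the positive-class and negative-class
    branch distributions of a binary split (hypothetical helper)."""
    pos_total = pos_left + pos_right
    neg_total = neg_left + neg_right
    if pos_total == 0 or neg_total == 0:
        # Assumption: a node holding a single class has nothing to separate.
        return 0.0
    squared = 0.0
    for pos, neg in ((pos_left, neg_left), (pos_right, neg_right)):
        # Each class is normalised by its own size, so class priors cancel out.
        squared += (math.sqrt(pos / pos_total) - math.sqrt(neg / neg_total)) ** 2
    return math.sqrt(squared)


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(pos_left, pos_right, neg_left, neg_right):
    """Information gain divided by split information for the same split."""
    left = pos_left + neg_left
    right = pos_right + neg_right
    total = left + right
    if total == 0:
        return 0.0
    parent = entropy([pos_left + pos_right, neg_left + neg_right])
    children = (left / total) * entropy([pos_left, neg_left]) \
        + (right / total) * entropy([pos_right, neg_right])
    split_info = entropy([left, right])
    if split_info == 0:
        return 0.0  # all examples fall in one branch
    return (parent - children) / split_info


if __name__ == "__main__":
    # Same split shape; the second case has 100 times more negative examples.
    mild = (8, 2, 10, 40)
    skewed = (8, 2, 1000, 4000)
    print("Hellinger:", hellinger_split(*mild), hellinger_split(*skewed))
    print("Gain ratio:", gain_ratio(*mild), gain_ratio(*skewed))
```

In this sketch, scaling the negative class by a constant factor leaves the Hellinger value unchanged, because each class is normalised by its own total, whereas the gain ratio depends on the class priors and therefore shifts. This is one way to read the skew insensitivity mentioned in the abstract; the counts used here are illustrative only and are not results from the paper.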