Anju G S, Jyolsna Mary P
Abstract: In bioinformatics, a vast volume of DNA and protein sequence data is available, but it is far from being fully utilized. Computational models can facilitate the analysis of large-scale data, but they require a numeric representation as input. Feature engineering aims to represent non-numeric data with numeric features, and well-designed features can cast the raw symbolic data effectively. Automated feature engineering, i.e., an encoding scheme that preprograms the construction of features, avoids redesigning features for each task and allows researchers to try different representations with minimal effort. Here, an encoding scheme for protein sequences is presented, which encodes a representative sequence dataset into a numeric matrix that can be fed into a downstream learning model. The method, the Context-Free Encoding Scheme, was chosen for a dataset consisting of a group of protein sequences. Compared with traditional methods that use task-specific designed features, this method improves prediction accuracy and serves as an automated feature engineering method for protein sequences.
Keywords: Encoding scheme, Data characterization, Feature engineering, Machine learning
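
To make the idea of turning symbolic protein sequences into a numeric matrix concrete, the following minimal Python sketch may help. It uses plain one-hot encoding as a stand-in and is not the Context-Free Encoding Scheme itself, whose details are not given in the abstract. The names `AMINO_ACIDS`, `one_hot_encode`, and `encode_dataset`, the zero-padding to the longest sequence, and the toy sequences are all assumptions made for illustration.

```python
# Illustrative sketch only: plain one-hot encoding of protein sequences.
# This is NOT the Context-Free Encoding Scheme; it merely shows how a set of
# symbolic sequences can become a fixed-width numeric matrix.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode one sequence as a flat list of max_len * 20 numbers."""
    width = len(AMINO_ACIDS)
    row = [0.0] * (max_len * width)
    for pos, residue in enumerate(sequence[:max_len]):
        idx = AA_INDEX.get(residue.upper())
        if idx is not None:  # unknown residues (e.g. 'X') stay all-zero
            row[pos * width + idx] = 1.0
    return row


def encode_dataset(sequences):
    """Encode a list of sequences into a matrix (list of equal-length rows).

    Shorter sequences are zero-padded to the length of the longest one
    (an assumption of this sketch).
    """
    max_len = max(len(s) for s in sequences)
    return [one_hot_encode(s, max_len) for s in sequences]


if __name__ == "__main__":
    toy_sequences = ["MKT", "ACDX"]  # hypothetical toy data
    matrix = encode_dataset(toy_sequences)
    print(len(matrix), len(matrix[0]))  # rows, columns
```

Each row of the resulting matrix corresponds to one sequence, so it can be passed directly to a downstream learning model that expects fixed-length numeric input.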