Genetic Algorithms as a Feature Selection Tool in Heart Failure Disease

A great wealth of information is hidden in clinical datasets, which can be analyzed to support decision-making processes or to better diagnose patients. Feature selection is a data pre-processing step that selects a subset of input features by removing unneeded or irrelevant ones. Various algorithms have been used in healthcare to solve such problems involving complex medical data. This paper demonstrates how Genetic Algorithms offer a natural way to solve feature selection in datasets, where the fittest choice of variables is preserved over successive generations. A Genetic Algorithm is introduced as a feature selection method and shown to be effective in aiding understanding of such data.


Introduction
The performance of pattern modeling and classification is greatly affected if the dataset has a very high dimensionality. At the same time, the computational complexity, both numerically and in terms of space, increases ([1], [2], [3] and [4]). The rapid development of technology, and the corresponding ability to gather data, has led to an explosion in the size of datasets. This does not imply that all of the features/attributes in a dataset are necessary and sufficient, in terms of the information required to determine patterns accurately and provide predictions. Feature selection methods can be used to identify and remove redundant or irrelevant features from a given dataset without loss of accuracy in predictions. At the same time, feature selection can provide insight into the features in terms of their importance [1,3].
Feature selection can be defined as the process of choosing a minimum subset of features from the original dataset such that [3]:
• The classification accuracy does not significantly decrease.
• The resulting class distribution, given only the values of the selected features, is as close as possible to the original class distribution given all the features.
Feature selection algorithms consist of four key steps: subset generation, subset evaluation, stopping criteria and result validation ([4], [5]). Subset generation is a heuristic search that generates a subset of features for the evaluation procedure. Each generated subset is evaluated by certain evaluation criteria to determine its 'goodness'. The generated subset is validated by carrying out different tests and compared with the previous best subset; if the new subset is found to be better, it replaces the previous best subset. This process is repeated until a stopping criterion is reached, as shown in Fig. (1).
Fig. 1 Four steps of the feature selection process [3]
There are three approaches to feature selection: the filter, wrapper or embedded approach [1], [6], [7], [8]. Filter methods apply a statistical measure to assign a weight to each feature according to its degree of relevance. Filters measure the relevance of features to classifier outcomes independently of any learning algorithm, evaluating each feature with a measure such as distance to the outcome classes, correlation or Euclidean distance. All the features in the dataset are then ranked according to these measures. The advantages of filter methods are that they are fast, scalable and independent of the learning algorithm. The most distinguishing characteristic of filters is that the relevance index is calculated for a single feature alone, without considering the values of other features [9].
Such an implementation implies that the filter assumes orthogonality of features, which is often not true in practice. Filters therefore omit any conditional dependences (or independences) that might exist, which is known to be one of their weaknesses. Wrapper methods use the predictor as a black box and the predictor's performance as the objective function to evaluate feature subsets [1]. The expression 'wrapper approach' covers the category of variable subset selection algorithms that apply a learning algorithm in order to conduct the search for an optimal or near-optimal subset [10]. Since the number of possible subsets is 2^n, exhaustive search becomes an NP-hard problem, so a suboptimal subset is selected by applying a search algorithm that finds a subset heuristically. The embedded approach is associated with specific learning algorithms that perform feature selection as part of the training process. An important aspect of using feature selection algorithms is that they can improve inductive learning, either in terms of generalization capability, learning speed, or reducing the complexity of the induced model while maintaining classification accuracy [2]. Often a compromise is reached between these various objectives in a feature selection approach.
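The wrapper idea described above can be sketched in a few lines of code. This is a minimal illustration only: `score_subset` stands in for the black-box learner evaluation, and the toy scorer is invented for the example (it is not part of the original work). Real wrappers use heuristic search rather than the exhaustive enumeration shown here, precisely because of the 2^n explosion.

```python
from itertools import combinations

def wrapper_select(features, score_subset, max_size=3):
    """Exhaustive wrapper search over small subsets.

    score_subset is the 'black box': any callable mapping a feature
    subset to a performance estimate (e.g. cross-validated accuracy).
    """
    best, best_score = (), float("-inf")
    for k in range(1, max_size + 1):
        for subset in combinations(features, k):
            s = score_subset(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score

# Toy scorer: pretend features 'a' and 'c' are the informative ones,
# with a small penalty for subset size (mimicking a preference for
# fewer features).
toy_score = lambda subset: len(set(subset) & {"a", "c"}) - 0.1 * len(subset)
print(wrapper_select(["a", "b", "c", "d"], toy_score))
```

Because the learner itself scores each candidate subset, the wrapper captures feature interactions that filters miss, at the cost of many more model evaluations.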
This work focuses on applying Genetic Algorithms (GAs) as a feature selection technique to a heart failure dataset in order to improve classification accuracy and reduce the number of features. The GA was tested as a 'wrapper' feature selection method. GAs are one of the global optimization methods for searching complex, large and multidimensional datasets ([1], [9], [11], [12], [13], [14], [15]). First, the GA was run with different populations, generations, and neighbourhood sizes (k). Secondly, the features selected by the best-performing GA were tested again using different populations and k values. Finally, the GA investigation was carried out with populations of up to 800. For classification accuracy, two different classifiers were used, namely Bayes Nets (BN) and Random Forest (RF).

Genetic Algorithms (GAs) as a Feature Selection Tool
GAs are an optimization and search technique based on the theory of natural biological evolution (survival of the fittest) ([1], [6], [7]). Over successive generations, the population "evolves" toward an optimal solution. The advantage of GAs over other methods is that they allow the best solution to emerge from the best of the prior solutions. The idea of GAs is to combine different solutions generation after generation, extracting the best genes from each one. GAs can manage datasets with a large number of features and do not need any extra knowledge about the problem under study. The subsets of features selected by genetic algorithms are generally more efficient than those obtained by classical methods of feature selection, since they can produce a better result using a lower number of features [16].
The individuals in the genetic space are called chromosomes. A chromosome is a collection of genes, where genes are generally represented by real values or a binary encoding. The number of genes is the total number of features in the dataset. With binary encoding, each gene in a chromosome has a value of 1 or 0: a value of 1 means that the corresponding feature is included in the specified subset, while a value of 0 indicates that the corresponding feature is excluded. Each solution in a genetic algorithm is represented by a chromosome, and the collection of all chromosomes is called the 'population', as shown in Fig. (2). As a first step, an initial population of individuals is generated at random or heuristically. In each generation, the population is evaluated using a fitness function.
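The binary encoding above can be illustrated with a short sketch. This is our own Python illustration (the paper's experiments use the Matlab GAs toolbox); the parameter names and seed are assumptions made for the example.

```python
import random

def init_population(pop_size, n_features, seed=0):
    """Random initial population: each chromosome is a list of bits,
    where 1 = feature included in the subset and 0 = excluded."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size)]

pop = init_population(pop_size=4, n_features=10)
# Decode one chromosome back into the indices of the selected features.
subset = [i for i, bit in enumerate(pop[0]) if bit]
```

For the heart failure dataset each chromosome would have 60 genes, one per feature, and the decoded subset is what gets passed to the classifier.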

Fig. 2 Genetic Algorithms
The next step is the selection process, wherein high-fitness chromosomes are used to eliminate low-fitness chromosomes. Better feature subsets have a greater chance of being selected to form a new subset through crossover or mutation; in this manner, good subsets are "evolved" over time [17]. Commonly used methods for reproduction or selection are Roulette-wheel selection, Boltzmann selection, Tournament selection, Rank selection, and Steady-state selection. The selected subsets are then ready for reproduction using crossover and mutation. Crossover combines different features from a pair of subsets into a new subset, as shown in Fig. (2), and tends to create better strings. Mutation randomly changes some of the values in a subset (thus adding or deleting features), as shown in Fig. (2). The newly generated population undergoes further selection, crossover, and mutation until the termination criterion is satisfied or the maximum number of generations is reached, as shown in Fig. (3).
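The crossover and mutation operators just described can be sketched as follows. This is a generic illustration (single-point crossover and bit-flip mutation); the mutation rate of 0.05 and the seed are illustrative choices, not values from the paper.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

def crossover(p1, p2):
    """Single-point crossover: take the head of one parent and the
    tail of the other, combining features from both subsets."""
    point = rng.randrange(1, len(p1))
    return p1[:point] + p2[point:]

def mutate(chromosome, rate=0.05):
    """Flip each bit with small probability, thereby randomly adding
    features to or deleting features from the subset."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

parent1 = [1, 1, 1, 1, 0, 0, 0, 0]
parent2 = [0, 0, 0, 0, 1, 1, 1, 1]
child = mutate(crossover(parent1, parent2))
```

One generation of the GA applies selection to pick parents, then these two operators to produce the children that form the next population.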

Genetic Algorithms (GAs) Experiments
In these experiments, the Matlab GAs toolbox is used. The GA starts by creating a random population, which is then evaluated using a fitness function. Elite children are passed automatically to the next generation, while the remaining individuals in the current population pass through the crossover and mutation functions to form a new generation [13]. The dataset is a real-life heart failure dataset.
In this dataset, there are 60 features for 1944 patient records. The class is "dead" or "alive". The datasets were imputed by different methods such as Concept Most Common Imputation (CMCI) and Support Vector Machine (SVM). Different classification methods were applied to these datasets to select which dataset would be used for training [18]. The performance of these datasets was measured using accuracy, sensitivity, and specificity; the SVM-imputed dataset was chosen since its accuracy, sensitivity and specificity were the best. The experiments were designed using Weka (version 3.8.1-199-2016). Accuracy was best using Bayes Net, Random Forest, decision tree, REP tree and J48; in this work, BN and RF were selected as classifiers since their accuracy was the highest, as shown in Table 1. The feature names are listed in Appendix A, and the GAs parameters are shown in Table 2. As discussed above, the number of chromosomes used in a particular implementation is of particular interest in evolutionary computation ([19], [20], [21]). Various results about the appropriate population size can be found in the literature [22], [23]. Researchers usually argue that a "small" population size could guide the algorithm to poor solutions ([24], [25], [26]) and that a "large" population size could make the algorithm expend more computation time in finding a solution ([24], [26], [27]).
For a GA to select a feature subset, a fitness function must be defined to evaluate the fitness of each subset. In this work, the fitness function was based on Oluleye's fitness function [14], which is based on error minimization and reducing the number of features. The fitness of each chromosome in the population is evaluated using a kNN-based fitness function as defined in FSP1. The kNN algorithm computes the Euclidean distance between the test data and the training set, then finds the nearest points in the training set to each test point. The individuals are evaluated and their fitness is ranked based on the kNN classification error. Individuals with minimum error have a better chance of surviving into the next generation. This ensures that the GA reduces the error rate and picks the individual with the best fitness, which reduces the number of features as well.
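A minimal sketch of such a kNN-based fitness function is given below. This is our own simplified version, not Oluleye's exact FSP1 formulation: the penalty weight `alpha`, the value of k, and the tiny dataset are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_error(X_tr, y_tr, X_te, y_te, mask, k):
    """Classification error of kNN using only the features where mask == 1."""
    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2
                             for i in range(len(mask)) if mask[i]))
    errors = 0
    for x, y in zip(X_te, y_te):
        nearest = sorted(range(len(X_tr)), key=lambda j: dist(x, X_tr[j]))[:k]
        pred = Counter(y_tr[j] for j in nearest).most_common(1)[0][0]
        errors += (pred != y)
    return errors / len(X_te)

def fitness(chromosome, data, k=3, alpha=0.01):
    """Lower is better: kNN error plus a small penalty per selected
    feature, so the GA also pushes toward fewer features."""
    X_tr, y_tr, X_te, y_te = data
    return (knn_error(X_tr, y_tr, X_te, y_te, chromosome, k)
            + alpha * sum(chromosome))

# Tiny illustrative dataset: feature 0 separates the classes, feature 1 is noise.
data = ([(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 4.0), (1.1, 0.0), (0.9, 7.0)],
        [0, 0, 0, 1, 1, 1],
        [(0.05, 3.0), (1.05, 8.0)],
        [0, 1])
print(fitness([1, 0], data))  # feature 0 alone classifies perfectly
```

Because both the error term and the penalty term shrink when irrelevant features are dropped, minimizing this fitness drives the GA toward small, accurate subsets.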
The model representation for kNN is the entire training dataset. Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbours) and summarizing the output variable for those K instances. For classification problems, this might be the mode (the most common) class value.
Roulette wheel selection was used as the selection method for these experiments, as discussed in the earlier section. With roulette wheel selection, each individual is assigned a 'slice' of the wheel in proportion to its fitness value. Therefore, the fitter an individual is, the larger its slice of the wheel. The wheel is simulated by normalizing the fitness values of the population of individuals.
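The roulette-wheel mechanism can be sketched as follows. This is a generic minimal version, not the toolbox's implementation; the fitness values are assumed to have already been converted to "larger is better" (e.g. accuracies rather than kNN error rates), and the seed is illustrative.

```python
import random

rng = random.Random(42)  # fixed seed for reproducibility

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)               # normalizing constant
    pick = rng.random() * total          # spin the wheel
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit                   # this individual's slice
        if running >= pick:
            return individual
    return population[-1]                # guard against float rounding

# The fitter individual "a" should be drawn far more often than "b".
draws = [roulette_select(["a", "b"], [0.9, 0.1]) for _ in range(1000)]
```

Here "a" owns 90% of the wheel, so over many spins it is selected roughly nine times as often as "b", while "b" still occasionally survives, which preserves diversity.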

Results and Discussions
In this work, different population sizes were tested to find the optimal size. The optimal accuracy was achieved using a GA with a population of 100 and k = 5, as shown in Table 3, and the number of features dropped from 60 to 27. As k is increased, the accuracy changes as well, as shown in Table 3, so the researcher should try different values of k to reach the optimal solution. BN accuracy was 87.8%, which can be interpreted as 12.2% of cases being misclassified. With 60 features, the difficulty for kNN lies in how to determine the similarity between data instances. The simplest technique, if the attributes are all on the same scale (all in inches, for example), is to use the Euclidean distance, which can be calculated directly from the differences between each input variable. In this case that is not possible, because the features are recorded on different scales.
The idea of distance or closeness can break down in very high dimensions (many input variables), which can negatively affect the performance of the algorithm on this problem. This is called the curse of dimensionality.
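One standard remedy for the scale problem noted above is to rescale each feature before computing distances. The sketch below shows min-max normalization; the clinical values are made-up examples, not data from the heart failure dataset.

```python
def min_max_scale(rows):
    """Rescale each column to [0, 1] so no single feature dominates the
    Euclidean distance purely because of its measurement units."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - lo) or 1 for c, lo in zip(cols, lows)]  # avoid /0
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)]
            for row in rows]

# Age in years vs. serum creatinine in umol/L: on raw values the distance
# is dominated by creatinine; after scaling both features contribute equally.
rows = [[40, 60.0], [80, 600.0], [60, 330.0]]
scaled = min_max_scale(rows)
```

Scaling fixes the units problem but not the curse of dimensionality itself, which is why reducing the number of features, as the GA does here, remains necessary.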
In order to improve the GA's performance, it is suggested to use only those input variables that are most relevant to predicting the output variable [28], [29]. In the next experiments, the features selected by the GA where accuracy was highest (population 100, generation 130, k = 5) were tested, and the results are shown in Table 4. BN accuracy was 86.77%, which can be interpreted as 13.23% of cases being misclassified. Sensitivity of 91.09% can be interpreted as the algorithm predicting 8.91% of patients as dead when they should have been predicted as alive. Specificity was 74.02%, which can be interpreted as the algorithm predicting 25.98% false positives (alive). The performance of the GA did not improve significantly in terms of accuracy; however, the number of selected features was reduced from 27 to 14, as shown in Table 4.
The population size was increased to 400, 600, and 800 in order to investigate whether there would be any improvement in the GA's performance. Table 5 shows the accuracy for different generations; the optimal accuracy is 86.3%, which is less than the 87.8% achieved with a population of 100. The results showed that the larger populations took a long time and selected almost the same number of features. Al Khaldy [29] investigated several feature selection methods, including wrapper and filter methods, and used a representative set of classification methods to evaluate the features selected. These methods enabled the identification of a core set of features from the same dataset. As shown in Table 11, there are many common features between his findings and this work.

Conclusions
The experiments in this paper demonstrate the feasibility of using GAs as a feature selection tool for large datasets. While the number of features was reduced from 60 to 27 using the GA, the accuracy (87.8%) remained almost the same. In order to improve the GA's performance, only the input variables most relevant to predicting the output variable (the 27 selected features) were then used. Whilst the performance of the GA did not improve significantly in terms of accuracy, the number of selected features was reduced from 27 to 14, thus identifying the most important features. The GA picked up three variables that are used by clinicians in diagnosing heart failure [30], namely urea, uric acid and creatinine. In order to validate the performance of the GA, different feature selection experiments were carried out using the WEKA tool, showing this to be a viable technique for such problems.