Each feature set is evaluated with an SVM using 5-fold cross-validation. The feature subset of choice is the one with the highest classification accuracy and the lowest number of features. Classification accuracy is calculated as in the following equation:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

where TP, TN, FP, and FN represent the true positives, true negatives, false positives, and false negatives, respectively.

... followed in all the conducted experiments, with the number of feature subsets equal to 100. Moreover, the parameter values of the proposed algorithm are listed in Table 2.

In order to evaluate the proposed Nested-GA approach, its accuracy has been compared to the accuracies of multiple other feature selection algorithms. In all the conducted experiments, the input features are the features selected by the t-test filtering method. Moreover, an independent colon cancer dataset from GEO was used to further validate the strength of the classification model.
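The accuracy in Eq. (1) and the subset selection rule can be illustrated with a minimal Python sketch. The confusion counts and feature names below are made up for illustration, and reading "highest accuracy and lowest number of features" as "accuracy first, fewer features as tie-breaker" is an assumption rather than something the text states explicitly.

```python
from typing import NamedTuple, Sequence


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Classification accuracy as in Eq. (1): correct predictions over all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion counts must not all be zero")
    return (tp + tn) / total


class Candidate(NamedTuple):
    features: tuple   # hypothetical feature (gene) identifiers
    accuracy: float   # e.g. mean accuracy over the 5 cross-validation folds


def pick_subset(candidates: Sequence[Candidate]) -> Candidate:
    """Assumed rule: highest accuracy wins; ties go to the subset with fewer features."""
    return max(candidates, key=lambda c: (c.accuracy, -len(c.features)))


# Illustrative candidates only; the counts are not taken from the paper.
candidates = [
    Candidate(("g1", "g2", "g3"), accuracy(tp=45, tn=45, fp=5, fn=5)),
    Candidate(("g1", "g2"), accuracy(tp=45, tn=45, fp=5, fn=5)),
    Candidate(("g4",), accuracy(tp=40, tn=40, fp=10, fn=10)),
]
best = pick_subset(candidates)
print(best.features, best.accuracy)
```

In this toy run the two-feature subset is chosen, because it ties the three-feature subset on accuracy while using fewer features.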
[Table 3: SVM accuracy on the testing dataset and on the independent dataset. Columns: Num. of genes; Testing data and Independent testing data for each of the three experiments.]

[Table 4: Accuracy of OGA-SVM, IGA-NNW, KNN, RF and Nested-GA on the DNA Methylation dataset. Columns: Num. of CpG sites; GA-SVM; GA-NNW; KNN; RF.]
At first, Nested-GA has been compared to a non-nested genetic algorithm with SVM as its fitness function (GA-SVM) and to a non-nested genetic algorithm with a deep-learning neural network fitness function (GA-NNW), both of which run over the colon cancer gene expression (CC-GE) dataset. Table 3 lists the SVM performance measurement (accuracy) for the three experiments on the testing dataset and on the independent dataset, respectively.
After that, Nested-GA has been compared to GA-SVM, GA-NNW, KNN, and RF, all of which run over the colon cancer DNA Methylation (CC-DM) dataset. It is worth noting that Nested-GA uses both the CC-GE and CC-DM datasets. Table 4 shows the accuracies of the five algorithms over the testing dataset. Although GA-SVM had better accuracy than Nested-GA when using two or three features, the accuracy of Nested-GA was noticeably better than that of GA-SVM when using more features.
The results listed in Tables 3 and 4 indicate that Nested-GA improves the classification accuracy while using fewer genes (six genes).
Based on the proposed algorithm, colon cancer biomarkers can be discovered accurately by selecting the smallest satisfactory feature set to represent the microarray gene markers. Using this criterion, a feature set of six microarray genes is selected: “DAB2IP”, “KLRB1”, “NUP155”, “NPC1L”, “CDKN2A” and “SEC61A2”. As a first step towards validating the resultant biomarkers, Fig. 5 depicts the heatmap generated for the six biomarker genes (rows) with respect to the experimental samples (columns). The heatmap shows that these six genes jointly discriminate well between the normal and cancerous samples. Gene DAB2IP has the highest discrimination ability, whereas gene CDKN2A has the lowest.
As a second step towards validating the resultant biomarkers, they were replaced by six less important genes (“MGC10701”, “CCDC85B”, “BOC”, “METTL7B”, “KIAA2013” and “LAMB3”), resulting in an accuracy of 0.553 on the testing dataset and 0.501 on the independent dataset. Replacing the six resultant genes with less important ones thus led to a deterioration of the classification accuracy.
3.1. Enrichment Analysis
Furthermore, a Copy Number Variation (CNV) dataset of colon cancer has been used to explore any possible association between its tumor CNV segments and the six genes resulting from Nested-GA. The CNV dataset was downloaded from http://firebrowse.org/?cohort=COAD. It consisted of 918 samples: 453 tumor samples and 465 normal samples. The following steps have been implemented as described in (Guttery et al., 2018) in order to identify the genes that intersect with CNV segments in tumor samples. First, the probes meta file from ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/ was used to differentiate between the normal and tumor CNV segments. After that, a dataset of all human genes (hsapiens_gene_ensembl) from the host grch37.ensembl.org (hg19) was used as a reference dataset, together with the colon CNV dataset, to obtain the genes overlapping CNV segments. This step was done using the two R packages biomaRt and GenomicRanges. Finally, the resultant 554 genes have been intersected with the six genes resulting from the Nested-GA approach. The NUP155 gene was common to the two gene sets. This suggests that the CNV segments falling inside the NUP155 gene might play an important role in altering its normal expression level.
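The overlap and intersection steps can be sketched in a few lines of Python. This is not the biomaRt/GenomicRanges pipeline used in the study, only an illustration of the underlying interval logic. The gene names, coordinates, segments, and the `genes_hit_by_segments` helper are all hypothetical, and closed (inclusive) coordinates are assumed.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def genes_hit_by_segments(genes: Dict[str, Interval],
                          segments: List[Interval]) -> Set[str]:
    """Names of genes whose span overlaps at least one CNV segment."""
    return {name for name, span in genes.items()
            if any(overlaps(span, seg) for seg in segments)}


# Toy reference genes and tumor CNV segments (made-up coordinates).
genes = {
    "GENE_A": ("chr1", 100, 200),
    "GENE_B": ("chr1", 500, 800),
    "GENE_C": ("chr2", 100, 200),
}
tumor_segments = [("chr1", 150, 300), ("chr2", 300, 400)]

# Step 1: genes overlapping tumor CNV segments.
cnv_genes = genes_hit_by_segments(genes, tumor_segments)

# Step 2: intersect with a (hypothetical) set of selected biomarker genes.
selected = {"GENE_A", "GENE_C"}
common = cnv_genes & selected
print(sorted(cnv_genes), sorted(common))
```

With real data, the gene spans would come from the Ensembl reference and the segments from the tumor CNV file. The nested loop is quadratic in the worst case, so an interval index would be preferable at genome scale.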