Summary
Background: Microarray-based gene expression profiles have yielded valuable results for a
variety of problems and have contributed to advances in clinical medicine. Microarray
data characteristically have a high dimension and a small sample size, which makes it
difficult for a general classification method to classify samples accurately. Moreover,
not every gene is relevant for distinguishing the sample class. Thus, in order to analyze
gene expression profiles correctly, feature
(gene) selection is crucial for the classification process, and an effective gene
extraction method is necessary for eliminating irrelevant genes and decreasing the
classification error rate.
Objective: The purpose of gene expression analysis is to discriminate between classes of samples,
and to predict the relative importance of each gene for sample classification.
Method: In this paper, correlation-based feature selection (CFS) and Taguchi-binary particle
swarm optimization (TBPSO) were combined into a hybrid method, and the K-nearest neighbor
(K-NN) method with leave-one-out cross-validation (LOOCV) served as the classifier for
ten gene expression profile data sets.
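The evaluation step named above, K-NN scored by LOOCV on a selected subset of genes, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name loocv_knn_error, the Euclidean distance, the handling of an empty gene subset, and the toy data are all assumptions made for illustration, and the CFS and TBPSO stages that would produce the binary gene mask are not shown.

```python
import math
from collections import Counter


def loocv_knn_error(samples, labels, mask, k=1):
    """Leave-one-out error rate of K-NN using only the features where mask is 1.

    Hypothetical helper for illustration; Euclidean distance is an assumption.
    """
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 1.0  # assumption: an empty gene subset counts as the worst case
    errors = 0
    for held_out, (x, true_label) in enumerate(zip(samples, labels)):
        # Distance from the held-out sample to every remaining sample.
        neighbours = []
        for j, (z, label) in enumerate(zip(samples, labels)):
            if j == held_out:
                continue
            dist = math.sqrt(sum((x[i] - z[i]) ** 2 for i in selected))
            neighbours.append((dist, label))
        neighbours.sort(key=lambda pair: pair[0])
        # Majority vote among the k nearest neighbours.
        votes = Counter(label for _, label in neighbours[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted != true_label:
            errors += 1
    return errors / len(samples)


# Made-up toy data: gene 0 separates the classes, genes 1 and 2 are noise.
samples = [
    [0.0, 5.0, 1.0],
    [0.1, -5.0, 9.0],
    [1.0, 5.1, 9.1],
    [1.1, -5.1, 1.1],
]
labels = ["A", "A", "B", "B"]

print(loocv_knn_error(samples, labels, mask=[1, 0, 0]))  # informative gene only
print(loocv_knn_error(samples, labels, mask=[1, 1, 1]))  # all genes
```

In a wrapper-style search, an error rate of this kind would serve as the fitness of each candidate gene mask, so that masks with fewer irrelevant genes score better.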
Results: Experimental results show that this hybrid method effectively simplifies feature
selection by reducing the number of features needed. The proposed method achieved the
lowest classification error rate on all ten gene expression data sets tested. For six
of the gene expression profile data sets, a classification error rate of zero was reached.
Conclusion: The proposed method outperformed five other methods from the literature in terms
of classification error rate. It could thus constitute a valuable tool for gene expression
analysis in future studies.
Keywords
Microarray data - correlation-based feature selection - Taguchi-binary particle swarm
optimization - K-nearest neighbor