Feature Selection with Harmony Search
and its Applications
Ren Diao
Supervisors: Prof. Qiang Shen
Dr. Neil S. Mac Parthaláin
Ph.D. Thesis
Department of Computer Science
Institute of Mathematics, Physics and Computer Science
Aberystwyth University
February 6, 2014
Declaration and Statement
DECLARATION
This work has not previously been accepted in substance for any degree and is not
being concurrently submitted in candidature for any degree.
Signed ............................................................ (candidate)
Date ............................................................
STATEMENT 1
This thesis is the result of my own investigations, except where otherwise stated.
Where correction services1 have been used, the extent and nature of the correction
is clearly marked in a footnote(s).
Other sources are acknowledged by footnotes giving explicit references. A bibliogra-
phy is appended.
Signed ............................................................ (candidate)
Date ............................................................
STATEMENT 2
I hereby give consent for my thesis, if accepted, to be available for photocopying and
for inter-library loan, and for the title and summary to be made available to outside
organisations.
Signed ............................................................ (candidate)
Date ............................................................
1 This refers to the extent to which the text has been corrected by others.
Abstract
Feature selection is a term given to the problem of selecting important domain
attributes which are most predictive of a given outcome. Unlike other dimensionality
reduction methods, feature selection approaches seek to preserve the semantics of
the original data following reduction. Many strategies have been exploited for this
task in an effort to identify more compact and better quality feature subsets. A
number of group-based feature subset evaluation measures have been developed,
which have the ability to judge the quality of a given feature subset as a whole, rather
than assessing the qualities of individual features. Stochastic techniques have
also emerged, inspired by natural phenomena or social behaviour, allowing good
solutions to be discovered without resorting to exhaustive search.
In this thesis, a novel feature subset search algorithm termed “feature selection
with harmony search” is presented. The proposed approach utilises a recently
developed meta-heuristic, harmony search, which is inspired by the improvisation
process of music players. The proposed approach is general, and can be employed
in conjunction with many feature subset evaluation measures. The simplicity of
harmony search is exploited to reduce the overall complexity of the search process.
The stochastic nature of the resultant technique also allows the search process to
escape from local optima, while identifying multiple, distinctive candidate solutions.
Additional parameter control schemes are introduced to reduce the effort and impact
of static parameter configuration of harmony search; these are further combined
with iterative refinement, in order to enforce the discovery of more compact feature
subsets.
The flexibility of the proposed approach, and its powerful performance in selecting
multiple, good quality feature subsets lead to a number of further theoretical de-
velopments. These include the generation and reduction of feature subset-based
classifier ensembles; feature selection and adaptive classifier ensemble for dynamic
data; hybrid rule induction on the basis of fuzzy-rough set theory; and antecedent
selection for fuzzy rule interpolation. The resultant techniques are experimentally
evaluated using data sets drawn from real-world problem domains, and systemati-
cally compared with leading methodologies in their respective areas, demonstrating
the efficacy and competitive performance of the present work.
Acknowledgements
I would like to express my utmost gratitude to my supervisors: Prof. Qiang Shen
and Dr. Neil S. Mac Parthaláin, for their motivation, enthusiasm, and guidance,
which have been essential at all stages of my research.
I am grateful to Dr. Richard Jensen for his constant inspiration and for contributing
to the original impetus for this research.
I am also very thankful to Prof. Christopher John Price for his patience and support.
My sincere gratitude goes to my entire family: my parents Chunli Diao and Liming
Zhai, my dear wife Zhuoke Li, and her parents Hongzhu Li and Yulan Sun. The
completion of this Ph.D. would not have been possible without their kind support
and encouragement.
I would like to thank all my fellow researchers in the Advanced Reasoning Group,
both past and present, for the stimulating discussions, insight, and helpful advice. I
am especially grateful to Shangzhu Jin, Pan Su, Nitin Kumar Naik, and Ling Zheng
for their collaborative efforts.
I would like to express my deepest appreciation to the Department of Computer Sci-
ence and the Faculty of Science at Aberystwyth University, to the IEEE Computational
Intelligence Society, to the British Machine Vision Association, and to Plurabelle
Books, Cambridge, for their generous financial support.
My sincere gratitude goes to the anonymous reviewers, journal editors, conference
organisers and attendees involved (either directly or indirectly) with my submitted
works, for their encouragement and valuable input in refining my ideas.
I am extremely grateful to all of the academic, administrative, technical, and support
staff at the Department of Computer Science, Aberystwyth University, for their kind
assistance throughout my entire study.
I would also like to thank all my friends, especially the Chinese student community
in Aberystwyth for their continuous support.
Contents
Contents i
List of Figures v
List of Tables vii
List of Algorithms ix
1 Introduction 1
1.1 Feature Selection (FS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 FS with Harmony Search (HSFS) . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Structure of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 12
2.1 FS Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Filter-Based FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Wrapper-Based, Hybrid, and Embedded FS . . . . . . . . . . . 23
2.2 FS Search Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Deterministic Algorithms . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Stochastic and Nature-Inspired Approaches . . . . . . . . . . . 25
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Framework for HSFS and its Improvements 47
3.1 Principles of HS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.1 Key Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1.2 Parameters of HS . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.3 Iterative Process of HS . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Initial Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.1 Binary-Valued Representation . . . . . . . . . . . . . . . . . . . . 53
3.2.2 Iteration Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.3 Tunable Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Algorithm for HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Mapping of Key Notions . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Work Flow of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Additional Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4.1 Parameter Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4.2 Iterative Refinement . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 66
3.5.1 Evaluation of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5.2 Evaluation of Additional Improvements . . . . . . . . . . . . . . 75
3.5.3 Iterative Refinement of Fuzzy-Rough Reducts . . . . . . . . . . 80
3.5.4 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4 HSFS for Feature Subset Ensemble 84
4.1 Occurrence Coefficient-Based Ensemble . . . . . . . . . . . . . . . . . . 85
4.1.1 Ensemble Construction Methods . . . . . . . . . . . . . . . . . . 87
4.1.2 Decision Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.1.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 91
4.2.1 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2.2 Comparison of Ensemble Generation Methods . . . . . . . . . . 97
4.2.3 Scalability Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 HSFS for Classifier Ensemble Reduction 103
5.1 Framework for Classifier Ensemble Reduction . . . . . . . . . . . . . . 105
5.1.1 Base Classifier Pool Generation . . . . . . . . . . . . . . . . . . . 105
5.1.2 Classifier Decision Transformation . . . . . . . . . . . . . . . . . 107
5.1.3 FS on Transformed Data set . . . . . . . . . . . . . . . . . . . . . 108
5.1.4 Ensemble Decision Aggregation . . . . . . . . . . . . . . . . . . . 108
5.1.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.1 Reduction Performance for Decision Tree-Based Ensembles . . 111
5.2.2 Alternative Ensemble Construction Approaches . . . . . . . . . 113
5.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6 HSFS for Dynamic Data 117
6.1 Dynamic FS Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.1 Feature Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.2 Feature Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.1.3 Instance Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.1.4 Instance Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2 Dynamic HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2.1 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2.2 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.3 Adaptive Feature Subset Ensemble . . . . . . . . . . . . . . . . . . . . . 128
6.4 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 130
6.4.1 Results for Basic Dynamic FS Scenarios . . . . . . . . . . . . . . 131
6.4.2 Results for Combined Dynamic FS Scenarios . . . . . . . . . . . 135
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7 HSFS for Hybrid Rule Induction 140
7.1 Background of Rule Induction . . . . . . . . . . . . . . . . . . . . . . . . 141
7.1.1 Crisp Rough Rule Induction . . . . . . . . . . . . . . . . . . . . . 141
7.1.2 Hybrid Fuzzy-Rough Rule Induction . . . . . . . . . . . . . . . . 144
7.2 HSFS for Hybrid Rule Induction . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.1 Mapping of Key Notions . . . . . . . . . . . . . . . . . . . . . . . 149
7.2.2 HarmonyRules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2.3 Rule Adjustment Mechanisms . . . . . . . . . . . . . . . . . . . . 153
7.3 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 154
7.3.1 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.3.2 Comparison of Rule Cardinalities . . . . . . . . . . . . . . . . . . 156
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8 HSFS for Fuzzy Rule Interpolation 159
8.1 Background of Fuzzy Rule Interpolation (FRI) . . . . . . . . . . . . . . 160
8.1.1 Transformation-Based FRI . . . . . . . . . . . . . . . . . . . . . . 161
8.1.2 Backward FRI (B-FRI) . . . . . . . . . . . . . . . . . . . . . . . . 165
8.2 Antecedent Significance-Based FRI . . . . . . . . . . . . . . . . . . . . . 166
8.2.1 From FS to Antecedent Selection . . . . . . . . . . . . . . . . . . 167
8.2.2 Weighted Aggregation of Antecedent Significance . . . . . . . 168
8.2.3 Use of Antecedent Significance in B-FRI . . . . . . . . . . . . . 171
8.3 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 171
8.3.1 Application Example . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.3.2 Systematic Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 175
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
9 Conclusion 178
9.1 Summary of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.2.1 Short Term Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.2.2 Long Term Developments . . . . . . . . . . . . . . . . . . . . . . 183
Appendix A Publications Arising from the Thesis 186
A.1 Journal Articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
A.2 Book Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.3 Conference Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Appendix B Data Sets Employed in the Thesis 189
Appendix C List of Acronyms 195
Bibliography 197
List of Figures
1.1 Process of knowledge discovery . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Selection of real-world applications of FS . . . . . . . . . . . . . . . . . . . 3
1.3 Distribution of HS applications by discipline areas . . . . . . . . . . . . . . 5
1.4 Relationships between thesis chapters . . . . . . . . . . . . . . . . . . . . . 7
2.1 Components of FS process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Types of FS evaluation measure . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Basic concepts of rough set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Taxonomy of stochastic and nature-inspired approaches . . . . . . . . . . 26
3.1 Key notions of HS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Iteration steps of HS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Key notions of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4 Work flow of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Improvisation process of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Iterative fuzzy-rough reduct refinement for the arrhy data set . . . . . . 80
3.7 Iterative fuzzy-rough reduct refinement for the web data set . . . . . . . . 81
4.1 Flow chart for single subset quality evaluator with stochastic search . . . 87
4.2 Flow chart for single subset quality evaluator with partitioned training data 88
4.3 Flow chart for mixture of subset quality evaluators . . . . . . . . . . . . . 89
4.4 Averaged classification accuracies of the FSE implementations . . . . . . 99
4.5 Averaged OC-FSE classification accuracies and subset sizes . . . . . . . . 101
5.1 Overview of CER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Mixed classifiers using Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 Mixed classifiers using Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1 Procedures of D-HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.2 Generic framework for A-FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.3 Results of dynamic FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.1 Feature subset cardinality distribution of HarmonyRules and QuickRules . 157
8.1 Procedures of T-FRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.2 Antecedent selection procedures . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3 Alternative rule selection using weighted distance calculation . . . . . . . 170
List of Tables
2.1 Notions used in pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Binary encoded feature subsets . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Concept mapping from HS to FS . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Feature subsets encoded using integer-valued scheme . . . . . . . . . . . . 57
3.4 Parameter settings in different search stages . . . . . . . . . . . . . . . . . . 64
3.5 Data set information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6 FS results using CFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.7 FS results using PCFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.8 FS results using FRFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.9 C4.5 and NB classification accuracies using the feature subsets found
with the respective search algorithms via CFS . . . . . . . . . . . . . . . . . 73
3.10 C4.5 and NB classification accuracies using the feature subsets found
with the respective search algorithms via PCFS . . . . . . . . . . . . . . . . 74
3.11 C4.5 and NB classification accuracies using the feature subsets found
with the respective search algorithms via FRFS . . . . . . . . . . . . . . . . 75
3.12 Comparison of multiple HS-IR reducts versus single HC reduct . . . . . . 76
3.13 Comparison of parameter control rules using CFS . . . . . . . . . . . . . . 77
3.14 Comparison of parameter control rules using FRFS . . . . . . . . . . . . . 77
3.15 Parameter settings for demonstration of parameter control and iterative
refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.16 Comparison of proposed HS improvements using feature subsets selected
by CFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1 Ordinary FSE of 5 feature subsets with 8 features . . . . . . . . . . . . . . 86
4.2 An example of OC threshold-based aggregation with 3 possible classes . 90
4.3 Data sets used for OC-FSE experimentation . . . . . . . . . . . . . . . . . . 92
4.4 Classification accuracy result of stochastic search implementation . . . . 95
4.5 Classification accuracy result of data partition-based implementation . . 96
4.6 Classification accuracy result of mixture of algorithms . . . . . . . . . . . 97
4.7 Summary of results of the three FSE implementations . . . . . . . . . . . . 98
5.1 Classifier ensemble decision matrix . . . . . . . . . . . . . . . . . . . . . . . 108
5.2 HS parameter settings and data set information . . . . . . . . . . . . . . . 110
5.3 Comparison on C4.5 classification accuracy . . . . . . . . . . . . . . . . . . 112
6.1 Summary of the data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2 Feature addition results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3 Feature removal results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.4 Instance addition results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.5 Instance removal results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.6 A-FSE accuracy comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.1 Example data set for rough set rule induction . . . . . . . . . . . . . . . . . 142
7.2 Example data set for rough set rule induction . . . . . . . . . . . . . . . . . 143
7.3 Example data set for QuickRules . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.4 Mapping of key notions from HS to rule induction . . . . . . . . . . . . . . 149
7.5 Rule base improvisation example . . . . . . . . . . . . . . . . . . . . . . . . 152
7.6 Data set information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.7 Parameters settings where * denotes the dynamically adjusted values . . 154
7.8 Classification accuracy of HarmonyRules and QuickRules . . . . . . . . . . 155
7.9 Classification accuracy of other classifiers tested using 10-FCV . . . . . . 156
7.10 HarmonyRules vs. QuickRules in terms of rule cardinalities . . . . . . . . . 157
8.1 Example linguistic rules for terrorist bombing prediction . . . . . . . . . . 172
8.2 Antecedent significance values determined by CFS and FRFS . . . . . . . 172
8.3 Example observation and the closest rules selected by standard and
weighted T-FRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.4 Example observation and the closest rules selected by standard and
weighted B-FRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.5 Evaluation of proposed approaches for standard FRI . . . . . . . . . . . . . 176
8.6 Evaluation of proposed approaches for B-FRI . . . . . . . . . . . . . . . . . 177
B.1 Information of data sets used in the thesis . . . . . . . . . . . . . . . . . . . 189
List of Algorithms
2.1.1 Fuzzy-rough QuickReduct (A,Z) . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Move Bpi towards Bpj by a distance v . . . . . . . . . . . . . . . 28
2.2.2 Update current best solution B . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3 Local search (B) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.5 Memetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.6 Clonal Selection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.7 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.8 Tabu Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.9 Artificial Bee Colony Optimisation . . . . . . . . . . . . . . . . . . . . 40
2.2.10 Ant Colony Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2.11 Firefly Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2.12 Particle Swarm Optimisation . . . . . . . . . . . . . . . . . . . . . . . 45
3.1.1 Improvisation process of original HS . . . . . . . . . . . . . . . . . . 52
3.4.1 Iterative refinement procedure . . . . . . . . . . . . . . . . . . . . . . 65
3.4.2 Musician size adjustment via binary search . . . . . . . . . . . . . . 65
5.1.1 Bagging algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1.2 Random Subspace algorithm . . . . . . . . . . . . . . . . . . . . . . . 107
6.1.1 Dynamic FRFS for Feature Addition . . . . . . . . . . . . . . . . . . . 120
6.1.2 Dynamic FRFS for Feature Removal . . . . . . . . . . . . . . . . . . . 121
6.1.3 Dynamic FRFS for Instance Addition . . . . . . . . . . . . . . . . . . 122
6.1.4 Dynamic FRFS for Instance Removal . . . . . . . . . . . . . . . . . . 123
6.2.1 Pseudocode of D-HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.2.2 Sub-routine adapt (Ak+1, Xk+1) . . . . . . . . . . . . . . . . . . . . . . 126
6.3.1 A-FSE implemented using D-HSFS . . . . . . . . . . . . . . . . . . . 129
7.1.1 Work flow of QuickRules . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.1.2 Subroutine check (B, RB x , Rz x) . . . . . . . . . . . . . . . . . . . . . 146
7.2.1 HarmonyRules initialisation . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2.2 HarmonyRules iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Chapter 1
Introduction
Nowadays, data is being collected at a staggering pace in almost every field
imaginable. Despite the countless advances in computer technology, it is still
very challenging to store, maintain, and process data at the same speed as it is being
gathered. As a result, only a small fraction of information may be analysed to any
advantage, and therefore, there is an increasing demand for automated, efficient,
and scalable means to assist humans in extracting useful knowledge from these
dramatically expanding mountains of data.
Knowledge discovery from data, as illustrated in Fig. 1.1, is a broad subject with
many alternative names and sub-areas, including data mining [95], information
harvesting [183], knowledge extraction [258], and pattern discovery [44]. Fun-
damentally speaking, knowledge discovery concerns methods or techniques that
attempt to make sense of (raw or low-level) data, which may be difficult for humans
to directly interpret [168]. The goal of knowledge discovery is to build compact,
descriptive, or predictive models that are humanly comprehensible.
As the volume of data grows rapidly, the traditional, manual knowledge discovery
process becomes increasingly time-consuming and expensive [75]. This is especially
the case, for example, for problem domains such as medical image analysis [171]
(seeking to reveal, diagnose, or examine disease, or to study the human anatomy
and physiology), and forensic investigation [125] (analysing ballistics, fingerprints,
toxicology, and body identification). The classic approach to data analysis also relies
heavily on the opinions of domain experts, who must have a detailed and intricate
understanding of the problem at hand. Such opinions are often subjective and/or
inconsistent between different individuals. More importantly, data in its present
form may contain a large number of objects and descriptive tags (features), which
are impractical (if not impossible) in most cases for human beings to analyse.

Figure 1.1: Process of knowledge discovery
High dimensional data sets create problems even for automated systems. The
computational complexity and search space are often increased exponentially due to
high problem dimensionality (i.e., a large number of domain features). Moreover,
the naïve assumption of “more features = more knowledge” during data collection
generally leads to a problem known as the curse of dimensionality [16]. This issue
occurs when training data is not collected at a rate proportional to the increasing
number of features. This is a frustrating issue for many machine
learning methods for knowledge discovery. The abundance of features may also
cause an induction algorithm to identify patterns that are in fact spurious, because
of noise [124].
Dimensionality reduction techniques [154] present a type of approach that at-
tempts to reduce the overall dimensionality of the data. Several of these work by
transforming the underlying meanings of the features, whilst semantics-preserving
mechanisms maintain the original features. Feature selection approaches, being
the main focus of this thesis, fall into the latter category [53, 164]. These methods
search for and identify a subset of features using a dedicated evaluation measure,
and are particularly beneficial for knowledge discovery tasks, as they preserve the
human interpretability of the original data and the resultant, discovered knowledge.
1.1 Feature Selection (FS)
The main aim of feature selection (FS) is to discover a minimal feature subset from
a problem domain while retaining a suitably high accuracy (or information content)
in representing the original data [53]. When analysing data that has a very large
number of features [270], it is difficult to identify and extract patterns or rules due
to the high inter-dependency amongst individual features, or the complex behaviour
of combined features. Techniques to perform tasks such as text processing, data
classification and systems control [171, 184, 222, 228] can benefit greatly from
FS, which directly addresses the problem of high dimensionality, since the noisy,
irrelevant, redundant or misleading features may now be removed [124]. FS is
pervasive, in the sense that it is not restricted to being purely a type of data mining
technique. This characteristic is reflected by its use in a wide range of real-world
applications [164, 168]; a few example areas are illustrated in Fig. 1.2.
Figure 1.2: Selection of real-world applications of FS
In the context of FS, an information system generally consists of a fixed number of
objects, and each object is described by a set of features. Features can be either
qualitative (discrete-valued) or quantitative (real-valued). For a given data set with
n features, the task of FS can be seen as a search for the “optimal” subset of features
through the competing 2^n candidates. In general, optimality is subjective depending
on the problem at hand. A subset that is selected as optimal using one particular
evaluation function may not be equivalent to that selected by another. Various
techniques [147] have been developed in the literature to judge the quality of the
discovered feature subsets, several of which rank the features based on a certain
importance measure, e.g., information gain [162], chi-square [291], rough set and
fuzzy-rough set-based dependency [116, 123], and symmetrical uncertainty [219].
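As a minimal illustration of the individual feature-ranking idea, the sketch below scores discrete features by information gain, one of the importance measures listed above. The toy data and function names are purely illustrative, not taken from the thesis.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after partitioning by a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for fv, l in zip(feature_values, labels) if fv == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: one informative feature, one irrelevant one
labels      = ['yes', 'yes', 'no', 'no']
informative = ['a', 'a', 'b', 'b']   # perfectly predicts the label
irrelevant  = ['a', 'b', 'a', 'b']   # independent of the label

print(information_gain(informative, labels))  # 1.0
print(information_gain(irrelevant, labels))   # 0.0
```

Ranking every feature by such a score and keeping the top few is exactly the individual-feature strategy that the group-based measures discussed next seek to improve upon.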
Recent trends in developing FS methods focus on evaluating a feature subset as
a whole, forming an alternative type of approach to the aforementioned. Popular
methods include the group-based fuzzy-rough FS (FRFS) [126, 172], probabilistic
consistency-based FS (PCFS) [52], and correlation-based FS (CFS) [93]. These
techniques (together with the individual feature-based methods) are often collectively
classified as the filter-based techniques. They are typically used as a preprocessing
step, and are independent of any learning algorithm that may be subsequently
employed. In contrast, wrapper-based [107, 144] and also hybrid algorithms [297]
are used in conjunction with a learning or data mining algorithm, which is employed
in place of an evaluation metric as used in the filter-based approach.
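To make the filter/wrapper contrast concrete, the following sketch scores a feature subset in wrapper style: the score is the leave-one-out accuracy of a learner trained on just those features. The 1-nearest-neighbour learner and the toy data are stand-in assumptions for illustration, not methods used in the thesis.

```python
def knn1_accuracy(rows, labels, subset):
    """Wrapper-style subset score: leave-one-out accuracy of a 1-nearest-
    neighbour classifier restricted to the chosen feature indices."""
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in subset)
    correct = 0
    for i, row in enumerate(rows):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: dist(row, rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

# Toy data: feature 0 separates the two classes, feature 1 is noise
rows = [(0.0, 3.1), (0.1, 0.2), (1.0, 3.0), (0.9, 0.1)]
labels = ['a', 'a', 'b', 'b']
print(knn1_accuracy(rows, labels, subset=[0]))  # 1.0
print(knn1_accuracy(rows, labels, subset=[1]))  # 0.0
```

A filter measure would instead score the subset from the data alone, without ever invoking a classifier, which is why filters are cheaper but wrappers are tailored to the learner that will eventually be used.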
1.2 FS with Harmony Search (HSFS)
Independent of the learning mechanism, a common issue that all FS methods need to
address is how to search for the “optimal” feature subsets. To this end, an exhaustive
method may be used; however, it is often impractical for most data sets. Alternatively,
hill-climbing-based approaches are exploited where features are added or removed
one at a time, often in a greedy fashion, until there is no further improvement in the
current candidate solution. Although generally fast to converge, these methods may
lead to the discovery of sub-optimal subsets (both in terms of the evaluation score
and the size of the selected feature subset) [62, 164]. To avoid such shortcomings,
nature-inspired heuristic strategies such as genetic algorithms [153, 266], genetic
programming [187], simulated annealing [69], and particle swarm optimisation
[261] are utilised with varying degrees of success. This thesis proposes a new FS
search strategy based on a recently developed search algorithm: Harmony Search
(HS) [84, 155].
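The hill-climbing strategy described above can be sketched as a greedy forward search; the merit function here is a hypothetical stand-in for any group-based subset evaluation measure, and illustrates how such a search can terminate at a local optimum.

```python
def forward_selection(features, evaluate):
    """Greedy hill-climbing FS: repeatedly add the single feature that most
    improves the subset score, stopping when no addition helps."""
    selected, best_score = [], float('-inf')
    remaining = list(features)
    while remaining:
        score, f = max((evaluate(selected + [x]), x) for x in remaining)
        if score <= best_score:      # no further improvement: terminate
            break
        best_score = score
        selected.append(f)
        remaining.remove(f)
    return selected

# Hypothetical merit: rewards features 'a' and 'c', penalises subset size
toy_merit = lambda s: len({'a', 'c'} & set(s)) - 0.1 * len(s)
print(forward_selection(['a', 'b', 'c'], toy_merit))  # selects 'a' and 'c'
```

Because each step commits to the best single addition, the search never revisits earlier choices; stochastic strategies such as HS avoid this limitation by sampling many candidate subsets in parallel.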
HS is a meta-heuristic algorithm inspired by the social behaviour of music
players. It mimics the improvisation process of musicians, during which each
musician plays a note in search of the best overall harmony. HS has been very
successful in a wide variety of engineering optimisation problems [77, 156, 239,
255, 287] and machine learning tasks [50, 174, 175, 179, 212]. Fig. 1.3 provided in
[175] gives a breakdown by discipline areas. It has demonstrated several advantages
over traditional optimisation techniques: it requires only limited mathematical
sophistication, and it is not sensitive to the initial value settings.
Figure 1.3: Distribution of HS applications by discipline areas
The original HS technique has been improved by methods that modify its pitch
adjustment rate and bandwidth with regard to the underlying iterative computational
process [173]. Also, the initial, statically valued bandwidth [84] may be replaced with a
fret-based version, making the algorithm more adaptive to the variance in variable range,
and more suitable for real-valued problems. Note that fret is a musical term that
refers to a raised element on the neck of a stringed instrument, such as the metal
strips inserted into the fingerboard of a guitar. Work has also been carried out in
the literature to analyse the evolution of the population-variance over successive
generations in HS [49]. The HS algorithm therefore has a novel stochastic derivative
(for discrete variables) based on the musicians' experience, rather than a gradient
(for continuous variables) as used in differential calculus.
The proposed FS with HS (HSFS) technique in this thesis aims to tackle the
challenges of finding better quality feature subsets. It addresses the weakness of
conventional deterministic algorithms, which may return locally optimal solutions.
Also, it employs a more expressive, integer-valued feature subset representation,
as opposed to the binary encoding scheme adopted by most other nature-inspired
stochastic methods. The flexibility provided by the integer-valued representation
allows stochastic mechanisms of HS to better explore the solution space, and to
identify more compact feature subsets. The resultant approach inherits the simplicity
of HS, and is capable of identifying multiple (distinctive) solutions of good quality.
A number of additional improvements to HSFS have also been developed in this
work, in order to further extend the capabilities of the proposed method, thus further
improving the performance of HSFS for high dimensional data sets:
• The original HS method employs a static parameter scheme, which requires
a substantial amount of prior effort in order to determine a suitable setting
for a new problem. The parameters themselves are also socially inspired and
therefore cannot be straightforwardly devised on the basis of the properties of
the problem (data set) itself.
• The HSFS framework enables the size of a given candidate feature subset to
be controlled by configuring the total number of musicians. A mechanism that
iteratively refines this parameter is also derived, which allows the intelligent
discovery of compact and good quality solutions.
The flexibility of the improved HSFS algorithm, allied to its powerful performance
in identifying multiple, quality feature subsets, has inspired a number of theoretical
applications. These include ensemble learning, FS for dynamic data, fuzzy-rough
rule induction, and fuzzy rule interpolation; and are summarised in Section 1.3
below. Although these methods are mostly evaluated using benchmark problems,
the data sets employed are drawn from real-world, practical applications, including
handwritten character recognition [32], mammography image analysis [170], water
quality prediction [15], and many others. Detailed descriptions of these data sets
are given in Appendix B.
1.3 Structure of Thesis
This section outlines the structure of the remainder of this thesis. Fig. 1.4 illustrates
the relationships between the individual chapters (other than the introduction and
the conclusion). The direct dependencies between the chapters are denoted using
solid arrows, where conceptual linkages, such as that between Chapters 2 and 3, and
that between Chapters 7 and 8, are symbolised using dashed lines. A comprehensive
list of publications arising from the work of the thesis is provided in Appendix A.
Figure 1.4: Relationships between thesis chapters
Chapter 2: Background
This chapter provides a background introduction to FS, which is organised into
two core parts: evaluation measures and search strategies. A selection of popular
approaches developed for FS is discussed, including several group-based, filter eval-
uation metrics [52, 93, 126] that have been developed recently. Such group-based
methods judge the quality of a given feature subset as a whole, rather than assessing
the qualities of features individually. This chapter also provides a comprehensive
review of the most recent methods for FS that originated from nature-inspired meta-
heuristics, where the more classic approaches such as genetic algorithms and ant
colony optimisation are also included for comparison. These techniques allow good
quality solutions to be discovered without resorting to exhaustive search.
A good number of the reviewed methodologies have been significantly modified in
the present work, in order to systematically support generic subset-based evaluators
and higher dimensional problems. Such modifications are carried out because the
original studies are either exclusively tailored to certain subset evaluators (e.g., rough
set-based methods), or limited to specific problem domains. A total of ten different
algorithms are examined, and their mechanisms and work flows are summarised in
a unified manner. The survey of nature-inspired FS search methods presented in the
chapter is under review for journal publication.
Chapter 3: Framework for HSFS and its Improvements
This chapter explains the key contribution of the thesis: the HSFS algorithm, which
is a novel FS approach based on HS. For completeness, an outline of HS and its
key notions are first provided. The HSFS algorithm is a general approach that can
be used in conjunction with many subset evaluation techniques. The simplicity
of HS is exploited in order to reduce the overall complexity of the search process.
The proposed approach is able to escape from local optimal solutions, and identify
multiple solutions due to the stochastic nature of HS. The initial development of
HSFS using binary-valued feature subset representation is also described in this
chapter. Additional parameter control schemes are introduced to reduce the effort
and impact of parameter configuration. These can be further combined with the
iterative refinement strategy, tailored to ensuring the discovery of quality subsets,
and to improving search efficiency.
This chapter also presents initial experimental results that demonstrate the FS
performance of HSFS. The nature-inspired search approaches reviewed in Chapter 2
are systematically tested, and compared to HSFS, using high dimensional, real-valued
benchmark data sets. The selected feature subsets are used to build classification
models, in an effort to further validate their efficacy. The proposed modifications to
the base HSFS method are also individually validated, with results (accompanied by
in-depth studies) reported in dedicated sections.
The base HSFS algorithm and parts thereof have been published initially in [60], with a further and more in-depth version in [62]. The proposed improvements have
been published in [59, 290], whilst an extended paper on the topic of self-adjusting
HSFS is under review for journal publication. Note that the study in this chapter also
includes results of experiments conducted for the survey paper under review.
Chapter 4: HSFS for Feature Subset Ensemble
Classifier ensembles constitute one of the main research directions in machine learn-
ing and data mining. The use of multiple classifiers generally allows better predictive
performance than that achievable with a single model. Feature subset ensembles in
particular, aim to combine the decisions of different base FS components, thereby
producing more robust results for the subsequent learning tasks. This chapter details
a new feature subset ensemble approach that is based on the analysis of feature
occurrences. Three base component construction methods are discussed, generalis-
ing the ensemble concept so that it can be used in conjunction with various subset
evaluation techniques and search algorithms.
A novel occurrence coefficient threshold-based classifier decision aggregation
method is also introduced, which works closely and efficiently with the proposed
ensemble approach. HSFS is employed as the main stochastic search algorithm, in
order to supply the essential base feature subsets for the proposed method to work
with. Results of experimental comparative studies carried out on real-world data sets
are also reported, in order to highlight the benefits of the work. The developments
presented in this chapter have been published in [227]. A paper proposing a refined
technique is currently under review for journal publication.
Chapter 5: HSFS for Classifier Ensemble Reduction
Several approaches exist in the literature that provide means to effectively construct
and aggregate diverse classifier ensembles, including the occurrence-coefficient based
feature subset ensemble as introduced in Chapter 4. However, these ensemble systems
potentially contain redundant members that, if removed, may further increase group
diversity and produce better feature subsets. Smaller ensembles also relax the
memory and storage requirements, reducing the run-time overheads otherwise
required, while improving the overall efficiency.
This chapter extends the existing ideas developed for FS problems in order to
support classifier ensemble reduction, by transforming group predictions into training
samples, and treating classifiers as features. Also, HSFS is used to select a reduced
subset of such artificial features, while attempting to maximise the feature subset
evaluation. The resulting technique is systematically evaluated using high dimen-
sional and large sized benchmark data sets, demonstrating superior classification
performance against both original, unreduced ensembles and randomly formed sub-
sets. The work in this chapter has been published initially in [61], and a generalised
approach is to appear in [57].
Chapter 6: HSFS for Dynamic Data
Most of the approaches developed for FS and classifier ensembles in the literature,
including those proposed in Chapters 4 to 5, focus on the analysis of data from a static
pool of training instances with a fixed set of features. In practice, however, knowledge
may be gradually refined, and information regarding the problem domain may be
actively added and/or removed whilst training is taking place. In this chapter,
a dynamic FS technique is proposed that makes use of the existing subset-based
evaluation methods to further extend the HSFS algorithm. The concept of adaptive
feature subset ensembles is examined, improving upon the idea developed in Chapter
4, with the resulting technique capable of dynamically refining the candidate feature
subsets, and their associated classifier ensembles. The efficacy of the presented work
is verified through systematic simulated experimentation using real-world benchmark
data sets. A preliminary investigation of the basic dynamic FS scenarios has been
published in [58]. A paper concerning the dynamic HSFS algorithm and the
adaptive ensemble approach is currently under review for journal publication.
Chapter 7: HSFS for Hybrid Fuzzy-Rough Rule Induction
Automated generation of feature pattern-based production (or if-then) rules is essential to the success of many intelligent pattern classifiers, especially when their
inference results are expected to be directly human-comprehensible. Fuzzy and
rough set theory [88, 166] have been applied with much success to this area as
well as to FS [123, 242]. Since both FS and if-then rule learning using rough set
theory involve the processing of equivalence classes for their successful operation,
it is natural to combine these into a single integrated mechanism that generates
concise, meaningful and accurate rules. In particular, this chapter explains how HSFS
may be used together with fuzzy-rough rule induction techniques [42, 262] (the
latter being one of the most popular and best-tested existing methods built on the
initial notion of rough sets). Here, HS is adopted to simultaneously optimise multiple
objectives, so that the resulting rule base, whilst fully covering the knowledge being
described, remains compact and concise. The efficacy of the proposed algorithm is
experimentally evaluated against leading classifiers, including fuzzy and rough rule
induction techniques. The work in this chapter has been published in [63].
Chapter 8: HSFS for Fuzzy Rule Interpolation
Fuzzy Rule Interpolation [142, 143] is of particular significance for reasoning in
the presence of insufficient knowledge or sparse rule bases. This chapter utilises
HSFS to perform dimensionality reduction in such reasoning systems that involve
high dimensional rules. The techniques derived for FS are applied almost directly to
a converted set of rules with crisp antecedent values (the data set), such that the
significance of individual rule antecedents may be identified, and an informative
subset of antecedents may be discovered. The additional information obtained via
FS is equally beneficial for conventional fuzzy reasoning, and for a newly identified
research area concerning backward fuzzy rule interpolation [128, 131]. A paper
written on the basis of this chapter is currently under review for conference publica-
tion, whilst the outcomes regarding backward fuzzy rule interpolation itself have
been published in [128, 130, 131] (with [131] receiving one of the two best paper
awards at the 2012 IEEE International Conference on Fuzzy Systems). A substantially
extended study of this work has formed a journal publication [129], but is largely
beyond the scope of this thesis.
Chapter 9: Conclusion
This chapter summarises the key contributions made by the thesis, together with
a discussion of topics which form the basis for future research. Both immediately
achievable tasks and long-term projects are considered.
Appendices
Appendix A lists the publications arising from the work presented in this thesis,
containing both published papers, and those currently under review for journal
publication. Appendix B provides information regarding the benchmark data sets em-
ployed in the thesis, which are mostly drawn from real problem scenarios. Appendix
C summarises the acronyms employed throughout this thesis.
Chapter 2
Background
As the amount of available data increases, so too does the need for effective
dimensionality reduction. FS methods aim to find minimal or close-to-minimal
feature subsets, whilst preserving the semantics of the underlying data, thus making
the reduced data transparent to human scrutiny. Generally speaking, FS involves
two computational processes, as shown in Fig. 2.1: 1) a feature subset evaluation
process, and 2) a feature subset search process. A number of studies in the literature
[164] further decompose FS into smaller components, such as feature subset gener-
ation (which may be treated as a part of the search mechanism), and termination
criterion (which may be triggered by the evaluation process itself, or controlled by
the search strategy). Note that exceptional cases [264] exist where the two parts are
indistinguishable or inseparable; a number of early techniques [52, 147] in the area
also treat FS as one integrated process. For ease of explanation and organisation,
the two-part decomposition (evaluate-and-search) is adopted in this thesis unless
otherwise stated.
The remainder of this chapter is structured as follows. The two core parts of
FS are introduced in Sections 2.1 and 2.2, respectively. Filter-based FS evaluation
techniques are of significant importance to the development of the present work, and
they are covered in Section 2.1.1 in detail. The main focus of this research lies with the
use of stochastic and nature-inspired FS search strategies. A comprehensive survey
of the relevant methods is given in Section 2.2.2. Finally, Section 2.3 summarises
the chapter.
Figure 2.1: Components of FS process
2.1 FS Evaluation Measures
An information system in the context of FS is a tuple ⟨X, Y⟩, where X is a non-empty
set of finite objects, also referred to as the universe of discourse; and Y is a non-empty,
finite set of features. For decision systems, Y = A ∪ Z, where A = {a1, . . . , a|A|} is
the set of input features (which may be either discrete- or real-valued), |A| denotes
the cardinality of A, and Z is the set of decision features. For a given data set
with |A| features, the task of FS can be seen as a search for one or more feature
subsets B ⊆ A that are "optimal" amongst the competing 2^|A| candidates.
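To give a concrete sense of this search space, the following minimal Python sketch (illustrative only; the function names and the toy evaluator are assumptions of this example, not part of the thesis) enumerates all 2^|A| − 1 non-empty candidate subsets exhaustively. This is infeasible beyond small |A|, which is precisely why heuristic search strategies are needed:

```python
from itertools import chain, combinations

def all_subsets(features):
    """Enumerate every non-empty subset of the feature set (2^|A| - 1 candidates)."""
    return chain.from_iterable(
        combinations(features, r) for r in range(1, len(features) + 1))

def exhaustive_fs(features, evaluate):
    """Return the best-scoring subset, preferring smaller subsets on ties."""
    best, best_score = None, float("-inf")
    for subset in all_subsets(features):
        score = evaluate(subset)
        if score > best_score or (score == best_score and len(subset) < len(best)):
            best, best_score = subset, score
    return set(best), best_score

# Toy evaluator: rewards presence of a2 and a4, penalises subset size.
features = ["a1", "a2", "a3", "a4"]
evaluate = lambda s: len({"a2", "a4"} & set(s)) - 0.1 * len(s)
best, score = exhaustive_fs(features, evaluate)
print(best, score)  # best subset is {'a2', 'a4'}, with score about 1.8
```

Even for this toy evaluator, 15 candidate subsets must be scored; for |A| = 50, over 10^15 candidates would exist.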
The concept of optimality for any given feature subset is twofold: 1) the quality,
in terms of how well it encapsulates the information contained within the original
data (with full set of features); and 2) the size, where more compact solutions are
often preferred due to the advantage of reducing the overall dimensionality. Note
that the term “quality” of a given feature subset and also, the term “information”
contained within data have been concretely defined in the literature, especially in the
area of information theory [225] in statistics. Example definitions include Schwarz
criterion [218], Akaike information criterion [6], mutual information [206], and
Pearson product-moment correlation coefficient [93]. Typically, they provide means
to measure the relative quality of a statistical model [181] for a given set of data,
so as to perform model selection amongst a finite set of models. However, in this
thesis, they are employed in a more general sense. Feature subset quality is indeed
often subjective, depending on the problem at hand and the metric employed to
perform the analysis. As such, a feature subset that is identified as “optimal” using
one particular evaluation function may not be equivalent to that selected by another.
Various methods have been developed in the literature in order to judge the
quality of discovered feature subsets, which are hereafter referred to as “feature
subset evaluation measures” or “feature subset evaluators” interchangeably. Such
measures generally focus on producing a numerical score f (B) for a given feature
subset B. Here f : B → R represents a subset evaluation function, which maps
a set of feature subsets onto the set of real numbers (feature subset evaluation
scores). In this thesis, normalised scores f (B) ∈ [0, 1], f (;) = 0, are assumed, where
higher scores indicate better quality feature subsets. Note that for any given data set,
multiple feature subsets may exist that are equally (or almost equally) optimal, when
judged by a subset evaluator, i.e., f (Bp) = f (Bq), or f (Bp)' f (Bq), Bp 6= Bp, where
Bp and Bp denote two arbitrary feature subsets. Based on the style of interaction with
the subsequent learning mechanism that make use of the selected feature subsets,
feature subset evaluation measures may be categorised into four different types:
filter-based, wrapper-based, hybrid, and embedded approaches, as illustrated in Fig.
2.2.
Figure 2.2: Types of FS evaluation measure
2.1.1 Filter-Based FS
Methods that perform FS in isolation of any learning algorithm are termed filter-based
approaches [164], where essentially irrelevant features are filtered out prior to using
a resultant feature subset for learning. Filter-based methods are general purpose,
pre-processing algorithms that are applicable to most problems, as they attempt to
find quality features (that may yield good learning outcomes) regardless of the choice
of the subsequent learning mechanism. The separation of FS methods from any such
mechanism also makes them more efficient, since it is no longer required to train and
test a classifier for the sole purpose of evaluating the quality of a given feature subset.
Filter-based approaches can be further divided into two sub-categories, according
to the ways in which they perform feature evaluation: 1) those based on individual
feature-based measures, and 2) those based upon group/subset-based measures.
2.1.1.1 Individual Feature-Based Measures
This type of technique calculates feature relevance individually. The final feature
subset is formed following a predefined rule, such as returning all features above
a given relevance threshold. Individual feature-based measures offer higher time
efficiency, since the relevance scores need only be computed once per feature, i.e.,
as many times as the total number of features. Several approaches belonging to
this category are also referred to
as feature ranking methods, as they essentially score and rank the importance of
individual features.
Two of the most commonly used techniques are symmetrical uncertainty [219] and Relief [147]. Both of these methods are good examples of this type of algorithm,
and are closely relevant to the methods to be described subsequently in this chapter,
including the correlation-based FS [93], and the consistency-based FS [52].
Symmetrical Uncertainty For a given nominal-valued feature au, a probabilistic
model may be formed by estimating the individual probabilities of its observed values
ω_{au,i} ∈ Ω_{au}, i = 1, . . . , |Ω_{au}|, on the basis of the training data, using entropy [1]:

$$H(a_u) = -\sum_{i=1}^{|\Omega_{a_u}|} p(\omega_{a_u,i}) \log_2 p(\omega_{a_u,i}) \tag{2.1}$$
If the observed values of au are in fact partitioned in relation to another feature
av, and the entropy of au with respect to the partitions induced by av is less than
the entropy of au prior to partitioning, then there is a relationship between the two
features av and au. The entropy of au after observing av is:

$$H(a_u \mid a_v) = -\sum_{j=1}^{|\Omega_{a_v}|} p(\omega_{a_v,j}) \sum_{i=1}^{|\Omega_{a_u}|} p(\omega_{a_u,i} \mid \omega_{a_v,j}) \log_2 p(\omega_{a_u,i} \mid \omega_{a_v,j}) \tag{2.2}$$
Information gain [162] or mutual information [206] may be defined on the basis of
the above equations, reflecting how much additional information about au is
provided by av:

$$\begin{aligned}\text{information gain}(a_u, a_v) &= H(a_u) - H(a_u \mid a_v)\\ &= H(a_v) - H(a_v \mid a_u)\\ &= H(a_u) + H(a_v) - H(a_u, a_v)\end{aligned} \tag{2.3}$$
Information gain is symmetrical in nature, making it suitable for measuring
inter-feature correlation. However, it is biased in favour of features with more
observed values. The symmetrical uncertainty measure [219] is introduced to
compensate for such bias. It also produces a normalised output in the range [0, 1]:

$$\text{symmetrical uncertainty}(a_u, a_v) = 2.0 \times \left[\frac{\text{information gain}(a_u, a_v)}{H(a_u) + H(a_v)}\right] \tag{2.4}$$
Relief Relief [147] and its later development ReliefF [146] are individual feature
weighting algorithms that, in principle, may be sensitive to feature interaction. Relief
attempts to approximate the following difference of probabilities, in order to obtain
the weight of a feature au:

$$\begin{aligned}\text{weight}_{a_u} = {} & p(\text{different value of } a_u \mid \text{nearest instance of different class})\\ & - p(\text{different value of } a_u \mid \text{nearest instance of same class})\end{aligned} \tag{2.5}$$

By removing the context sensitivity imposed by the "nearest instance" condition,
features may be treated as being independent of one another:

$$\begin{aligned}\text{Relief}(a_u) = {} & p(\text{different value of } a_u \mid \text{different class})\\ & - p(\text{different value of } a_u \mid \text{same class})\end{aligned} \tag{2.6}$$
More formally, the measure may be defined as:

$$\text{Relief}(a_u, z) = \frac{GI' \times \sum_{i=1}^{|\Omega_{a_u}|} p(\omega_{a_u,i})^2}{\left(1 - \sum_{j=1}^{|\Omega_z|} p(\omega_{z,j})^2\right) \sum_{j=1}^{|\Omega_z|} p(\omega_{z,j})^2}$$

where GI', as given in Eqn. 2.7, is the modified Gini-index [28] which, much like the
information gain of Eqn. 2.3, is also biased toward attributes with more observed
values:

$$GI' = \left[\sum_{j=1}^{|\Omega_z|} p(\omega_{z,j})\left(1 - p(\omega_{z,j})\right)\right] - \sum_{i=1}^{|\Omega_{a_u}|} \left(\frac{p(\omega_{a_u,i})^2}{\sum_{i=1}^{|\Omega_{a_u}|} p(\omega_{a_u,i})^2} \sum_{j=1}^{|\Omega_z|} \left(p(\omega_{z,j} \mid \omega_{a_u,i}) - p(\omega_{z,j} \mid \omega_{a_u,i})^2\right)\right) \tag{2.7}$$
To use Relief as a symmetrical measure for any given two features au and av, the
above measure is computed twice, where each feature is treated as the class attribute
in turn, and the average of the two measurements is taken as the final output, in
order to ensure symmetry:

$$\text{Relief}'(a_u, a_v) = \frac{\text{Relief}(a_u, a_v) + \text{Relief}(a_v, a_u)}{2} \tag{2.8}$$
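The Gini-based Relief measure of Eqn. 2.7 and its symmetrised form of Eqn. 2.8 can be estimated from empirical probabilities. The sketch below is an illustrative Python rendering (the names are introduced for illustration; it assumes the second attribute takes at least two distinct values, otherwise the denominator vanishes):

```python
from collections import Counter

def probs(values):
    """Empirical probability of each observed value."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def relief(a_u, z):
    """Gini-based Relief of feature a_u with respect to attribute z (Eqn. 2.7).
    Assumes z takes at least two distinct values."""
    p_a, p_z = probs(a_u), probs(z)
    sum_pa2 = sum(p * p for p in p_a.values())
    sum_pz2 = sum(p * p for p in p_z.values())
    # Modified Gini index GI'
    gini = sum(p * (1 - p) for p in p_z.values())
    for v, pv in p_a.items():
        cond = probs([c for a, c in zip(a_u, z) if a == v])  # p(z | a_u = v)
        gini -= (pv * pv / sum_pa2) * sum(q - q * q for q in cond.values())
    return gini * sum_pa2 / ((1 - sum_pz2) * sum_pz2)

def relief_sym(a_u, a_v):
    """Symmetrised Relief: average of the two directed measures (Eqn. 2.8)."""
    return (relief(a_u, a_v) + relief(a_v, a_u)) / 2.0

print(relief([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: perfectly predictive feature
print(relief([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0: independent feature
```

The two printed cases bracket the measure: a feature that fully determines the class scores 1, while an independent feature scores 0.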
2.1.1.2 Group-Based Measures
A strong assumption behind the use of evaluation measures on an individual feature
basis is that all features are entirely independent of each other. This may not be the
case for many practical problems, where the features are not measured or extracted
independently. Also, the interaction between features may need to be preserved for
the subsequent learning mechanism. For instance, consider two binary features, ap
and aq, which may appear irrelevant in determining the class labels in a learning
classifier when being evaluated individually. However, the combination of these
two features, e.g., ap ⊕ aq, may determine the class label, where ⊕ denotes binary
addition.
Group-based FS methods [52, 93, 126] do not rely on the evaluation of individual
features. Instead, the candidate feature subsets are evaluated as a whole. This
property is particularly beneficial in capturing the inter-feature-dependencies that
are common in real-world data. Group-based FS methods are the main focus of this
thesis, due to their desirable properties and easy integration with stochastic feature
subset search strategies. The principles behind three particular group-based filter
techniques are outline below, and are utilised extensively in the work proposed later.
Correlation-Based FS (CFS) The goal of correlation-based FS (CFS) [93] is that,
when utilising FS to remove irrelevant features, redundant information should be
eliminated as well. A feature is deemed to be redundant if there exists one or more
other features that it is highly correlated with. Note that the term “correlation” was
used in its general sense in [93]. Instead of referring specifically to classical linear
correlation, it was employed to refer to a broad class of statistical relationships
involving dependence, or a degree of predictability of one feature with respect to
another. The original work states that "a high quality feature subset is one that
contains features highly correlated with the class, yet uncorrelated with each other."
The correlation-based measure is defined as follows:

$$\text{correlation}(B, z) = \frac{\sum_{i=1}^{|B|} \text{correlation}(z, a_i)}{\sqrt{|B| + \sum_{i=1}^{|B|} \sum_{j=1, j \neq i}^{|B|} \text{correlation}(a_i, a_j)}} \tag{2.9}$$

where correlation(B, z) is the correlation between the feature subset B and the
decision variable (class) z, and correlation(z, ai) and correlation(ai, aj) are the
correlation between a given feature ai and the class, and the so-called
inter-correlation between any two features ai, aj ∈ B, respectively.
correlation(ai, aj) may be calculated via
individual feature-based measures, such as the ones introduced previously (symmet-
rical uncertainty [219] and Relief [147]). The minimum description length principle
[31, 97] is exploited to implement this.
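Once the pairwise correlations are available, Eqn. 2.9 is straightforward to compute. The Python sketch below (the function name and the toy correlation values are assumptions of this example, not from the thesis) illustrates how inter-feature redundancy lowers the merit score:

```python
from math import sqrt

def cfs_merit(subset, class_corr, feat_corr):
    """CFS merit of a subset (Eqn. 2.9): high feature-class correlation in the
    numerator, inter-feature correlation penalised in the denominator.
    class_corr maps feature -> correlation(z, a);
    feat_corr maps frozenset({a_i, a_j}) -> correlation(a_i, a_j)."""
    k = len(subset)
    num = sum(class_corr[a] for a in subset)
    # The double sum over j != i counts each unordered pair twice, as in Eqn. 2.9.
    inter = sum(feat_corr[frozenset((ai, aj))]
                for ai in subset for aj in subset if ai != aj)
    return num / sqrt(k + inter)

class_corr = {"a1": 0.8, "a2": 0.7, "a3": 0.75}
feat_corr = {frozenset(("a1", "a2")): 0.1,
             frozenset(("a1", "a3")): 0.9,   # a3 largely redundant given a1
             frozenset(("a2", "a3")): 0.2}
print(cfs_merit({"a1", "a2"}, class_corr, feat_corr))        # about 1.011
print(cfs_merit({"a1", "a2", "a3"}, class_corr, feat_corr))  # lower: redundancy penalised
```

Adding the redundant feature a3 raises the numerator slightly but inflates the inter-correlation term, so the overall merit drops, which is precisely the behaviour CFS is designed to exhibit.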
CFS is notable for three important characteristics:
• The higher the correlations between the individual features and the decision
variable, the higher the correlation between the feature subset and the class.
• The lower the inter-correlations amongst the selected features themselves, the
higher the correlation between the feature subset and the class.
• As the number of selected features increases, the correlation between the
feature subset and the class increases.
Note that the last point above assumes that each of the newly added features
offers a similar improvement in terms of evaluation score to that of the existing ones
already included in the feature subset, when measured by Eqn. 2.9. CFS has the
advantage of being able to offer a view of closely relevant (substitutable) features,
in addition to the identification of a good feature subset. This may be beneficial for
certain data mining applications where comprehensible results are of paramount
importance. However, it is not always clear whether redundancy should be fully
eliminated. For instance, to encourage direct human comprehension of a given rule,
a specific feature may be replaced by another (equally informative one) with which
it is highly correlated.
Consistency-Based FS (PCFS) One important notion exploited by probabilistic
consistency-based FS (PCFS) is the inconsistency criterion, which essentially speci-
fies to what extent a feature subset (and the reduced data which it infers) can be
accepted [52]. The inconsistency rate of a given data set with respect to a selected
subset of features is checked against a predefined threshold, where lower values of
inconsistency are deemed acceptable, with a default threshold value of 0.0.
Two objects xu and xv are considered inconsistent if they match in terms of all
their feature values but not their class labels: ∀ai ∈ A, ai(xu) = ai(xv) and zu ≠ zv.
For a group of such objects that match (without considering their class labels) on
a given set of features, the inconsistency count is the number of objects minus the largest
number of instances with the same class label. For example, given the entire data
set of all objects X, suppose that

$$X' = X'_{z_u} \cup X'_{z_v} \cup X'_{z_w}, \quad X' \subseteq X \tag{2.10}$$

is a set of objects that match in terms of feature values, but belong to three different
class labels zu, zv, zw ∈ Ωz, where Ωz signifies the set of available class labels. The
inconsistency count for X' is:

$$\text{inconsistency count}(X') = |X'| - \max\left(|X'_{z_u}|, |X'_{z_v}|, |X'_{z_w}|\right) \tag{2.11}$$
The overall inconsistency rate of X is computed by summing all such inconsistency
counts (over all matching sets of objects), and dividing by the total number of
data instances.
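The inconsistency rate of Eqns. 2.10 and 2.11 can be computed by grouping objects on their projected feature values. The following minimal Python sketch is illustrative (the data and names are assumptions of this example):

```python
from collections import Counter, defaultdict

def inconsistency_rate(instances, labels, subset):
    """Inconsistency rate of a data set w.r.t. a feature subset (Eqns. 2.10-2.11):
    for each group of objects matching on the selected features, count the group
    size minus its largest same-class block, then divide the total by |X|."""
    groups = defaultdict(list)
    for row, z in zip(instances, labels):
        key = tuple(row[a] for a in subset)   # projection onto the subset
        groups[key].append(z)
    count = sum(len(g) - max(Counter(g).values()) for g in groups.values())
    return count / len(instances)

# Three objects with features a1, a2; projecting onto {a1} makes rows 1 and 3 clash.
X = [{"a1": 0, "a2": 0}, {"a1": 1, "a2": 0}, {"a1": 0, "a2": 1}]
z = ["yes", "no", "no"]
print(inconsistency_rate(X, z, ["a1"]))        # 1/3: one inconsistent object
print(inconsistency_rate(X, z, ["a1", "a2"]))  # 0.0: fully consistent
```

With the default threshold of 0.0 mentioned above, the subset {a1} would be rejected while {a1, a2} would be accepted.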
Rough Set FS Rough sets are an extension of conventional sets that allow approx-
imations in decision making [204]. In particular, rough set theory (RST) can be
used to find relationships in data, as a tool to discover data dependencies. The
most notable application of RST is the reduction of features contained in a data set
based on the information in the data set alone [168]. However, RST should not
be confused with or seen as an alternative for fuzzy set theory, nor does fuzzy set
theory compete with RST [204]. They are two individual methods for dealing with
imperfect data. Although this thesis does not utilise any RST-based method directly,
focusing instead on its fuzzy extensions (described in the following section),
the definitions of its key notions are briefly summarised below for completeness.
At the heart of RST is the concept of indiscernibility [204]. For a given subset
of features P ⊆ A, there exists an associated equivalence relation IND(P):

$$IND(P) = \left\{(x_i, x_j) \in X^2 \mid \forall a \in P,\; a(x_i) = a(x_j)\right\} \tag{2.12}$$

where a(xi) signifies the value of a feature a ∈ P for an object xi ∈ X. If
(xi, xj) ∈ IND(P), then the two objects are considered indiscernible using the features
contained in P. This leads to the definition of the equivalence classes of the
P-indiscernibility relation, which are denoted [x]P.
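The equivalence classes of IND(P) (Eqn. 2.12) can be computed by grouping objects on their values over P. The Python sketch below is purely illustrative (the universe and feature names are assumptions of this example):

```python
from collections import defaultdict

def equivalence_classes(objects, P):
    """Partition the universe into equivalence classes [x]_P of IND(P):
    objects indiscernible on every feature in P fall into the same class."""
    classes = defaultdict(set)
    for x, row in objects.items():
        classes[tuple(row[a] for a in P)].add(x)
    return list(classes.values())

objects = {"x1": {"a": 0, "b": 1}, "x2": {"a": 0, "b": 1}, "x3": {"a": 1, "b": 1}}
print(equivalence_classes(objects, ["a"]))  # two classes: {x1, x2} and {x3}
print(equivalence_classes(objects, ["b"]))  # one class: all objects indiscernible
```

Note how the coarser feature b cannot discern any of the objects, whereas a splits the universe into two classes; this granularity is what the approximations below are built upon.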
RST allows the partition of a vague set W ⊆ X by using two well-defined limits,
with respect to a set of features P ⊆ A, which are known as upper and lower approxi-
mations, as illustrated in Fig. 2.3. Both approximations are discrete sets that allow
the partitioning of the domain (sample space) into two distinct sub-domains. The
lower approximation describes the objects in the domain that are known with
certainty to belong to the vague set of interest:

$$\underline{P}W = \{x : [x]_P \subseteq W\} \tag{2.13}$$

The upper approximation, which subsumes the lower approximation, describes
the objects in the domain that belong to those equivalence classes whose elements
at least partly belong to the concept of interest:

$$\overline{P}W = \{x : [x]_P \cap W \neq \emptyset\} \tag{2.14}$$
Figure 2.3: Basic concepts of rough set
The rough set-based approach to FS allows for the reduction in the number of
features in a data set, whilst not requiring any external information for thresholds.
It can find a subset (termed a reduct) of the original features that are the most
informative; all other features can be removed from the data set with minimal
information loss. Given these important advantages over many other FS methods, it
is not surprising that further development based on this theory for FS has been the
focus of much research [145].
Fuzzy-Rough FS (FRFS) This is one of the most significant further developments of
the aforementioned rough set-based FS technique. RST only works on discrete, crisp-
valued domains. However, in practice, the values of features are usually real-valued.
It is not possible in this theory to say whether two different feature values are similar,
and to what extent they are the same. For example, two close values may only differ
as a result of noise, but in RST they are considered to be as different as two values
of different orders of magnitude. Data set discretisation must therefore take place
before reduction methods based on crisp rough sets can be applied. This is often still
inadequate, however, as the degrees of membership of values to discretised values
are not considered and thus may result in information loss. In order to overcome
this, extensions of RST based on fuzzy-rough sets [67] have been developed.
A fuzzy-rough set is defined by two fuzzy sets, a fuzzy lower and a fuzzy upper
approximation, obtained by extending the corresponding crisp RST notions. In the
crisp case, elements either belong to the lower approximation with absolute certainty
or not at all. In the fuzzy-rough case, elements may have a membership in the range
[0,1], allowing greater flexibility in handling uncertainty. Fuzzy-rough FS (FRFS)
[126] extends the ideas of fuzzy-rough sets to perform FS, where a vague concept
W ⊆ X is approximated by the fuzzy lower and upper approximations:
µ_{R_B↓W}(x_i) = inf_{x_j∈X} I(µ_{R_B}(x_i, x_j), µ_W(x_j)) (2.15)

µ_{R_B↑W}(x_i) = sup_{x_j∈X} T(µ_{R_B}(x_i, x_j), µ_W(x_j)) (2.16)

where I is a fuzzy implicator, T is a t-norm, R_B is the fuzzy similarity relation
induced by the subset of features B, and x_i, x_j ∈ X are two arbitrary objects.
In particular,

µ_{R_B}(x_i, x_j) = T_{a∈B} µ_{R_a}(x_i, x_j) (2.17)

where µ_{R_a}(x_i, x_j) is the degree to which objects x_i and x_j are similar for feature
a ∈ A. Many similarity relations can be constructed for this purpose, for example:

µ_{R_a}(x_i, x_j) = 1 − |a(x_i) − a(x_j)| / (a_max − a_min) (2.18)

µ_{R_a}(x_i, x_j) = exp(−(a(x_i) − a(x_j))² / (2σ_a²)) (2.19)

µ_{R_a}(x_i, x_j) = max(min((a(x_j) − (a(x_i) − σ_a)) / (a(x_i) − (a(x_i) − σ_a)), ((a(x_i) + σ_a) − a(x_j)) / ((a(x_i) + σ_a) − a(x_i))), 0) (2.20)
where σa and σ2a represent the standard deviation and the variance of the values
taken by feature a, respectively. The choices for I , T , and the fuzzy similarity relation
have great influence upon the resultant fuzzy partitions, and thus the subsequently
selected feature subsets.
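For illustration, the first two similarity relations (Eqns. 2.18 and 2.19) may be sketched in Python as follows; the function names are illustrative and not part of the original formulation:

```python
import math

def similarity_linear(ai, aj, a_min, a_max):
    """Eqn. 2.18: complement of the range-normalised distance."""
    return 1.0 - abs(ai - aj) / (a_max - a_min)

def similarity_gaussian(ai, aj, sigma):
    """Eqn. 2.19: Gaussian kernel scaled by the feature's spread."""
    return math.exp(-((ai - aj) ** 2) / (2.0 * sigma ** 2))
```

Both relations return 1 for identical values and decay towards 0 as the values move apart, which is the behaviour the fuzzy partitions rely upon.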
The fuzzy-rough lower approximation-based QuickReduct algorithm [126], which
extends the crisp version [226], is shown in Algorithm 2.1.1. It employs a quality
measure termed the fuzzy-rough dependency function γB(Q) that measures the
dependency between two sets of attributes B and Q, as defined by:
γ_B(Q) = (Σ_{x∈X} µ_{POS_{R_B}(Q)}(x)) / |X| (2.21)
In this definition, the fuzzy positive region, which contains all objects of X that can
be classified into classes of X/Q using the information in B, is defined as:
µ_{POS_{R_B}(Q)}(x) = sup_{W∈X/Q} µ_{R_B↓W}(x) (2.22)
1  A, set of all conditional features
2  Z, set of decision features
3  R = ∅, γ_best = 0, γ_prev = 0
4  repeat
5      B = R
6      γ_prev = γ_best
7      foreach x ∈ (A \ R) do
8          if γ_{R∪{x}}(Z) > γ_B(Z) then
9              B = R ∪ {x}
10             γ_best = γ_B(Z)
11     R = B
12 until γ_best == γ_prev
13 return R
Algorithm 2.1.1: Fuzzy-rough QuickReduct (A, Z)
Similar to CFS and PCFS, γ_B is viewed as a measure of quality for a given feature
subset B ⊆ A, with respect to the set of decision features Z: 0 ≤ γ_B(Z) ≤ 1, γ_∅(Z) = 0.
A fuzzy-rough reduct R can then be defined as a subset of features that preserves the
dependency degree of the entire data set, i.e., γ_R(Z) = γ_A(Z). In this thesis, where
no confusion may arise, the fuzzy-rough dependency measure of a given feature subset B
is notationally simplified to f(B), following the same conventions adopted for CFS
and PCFS.
The evaluation of f(B) enables QuickReduct to choose which features to add to
the current candidate fuzzy-rough reduct. Note that the algorithm is “greedy” and
therefore always selects the feature resulting in the greatest increase in fuzzy-rough
dependency. The algorithm terminates when the addition of any of the remaining
features does not result in an increase in dependency. As with the original crisp
algorithm, for a dimensionality of |A|, the worst case data set will result in O((|A|² + |A|)/2)
evaluations of the dependency function, while the cost of each dependency evaluation
is related to both the number of original features |A| and the number of training
objects |X|.
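For illustration, the control flow of Algorithm 2.1.1 may be sketched in Python as follows; the dependency measure is passed in as a function gamma, since its definition (Eqn. 2.21) depends on the chosen fuzzy connectives. This is a simplified sketch, not the implementation used in the thesis:

```python
def quickreduct(features, gamma):
    """Greedy forward selection: repeatedly add the feature giving the
    greatest increase in dependency, until no increase is possible."""
    reduct = set()
    best_score = gamma(reduct)
    while True:
        candidate, candidate_score = None, best_score
        for f in features - reduct:
            score = gamma(reduct | {f})
            if score > candidate_score:
                candidate, candidate_score = f, score
        if candidate is None:  # no remaining feature improves the score
            return reduct
        reduct.add(candidate)
        best_score = candidate_score
```

Each outer iteration performs up to |A| dependency evaluations, which yields the O((|A|² + |A|)/2) worst-case count noted above.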
2.1.2 Wrapper-Based, Hybrid, and Embedded FS
Wrapper-based approaches [107, 144], in contrast to filter-based approaches, are
used in conjunction with a learning or data mining algorithm (which forms a
major part of the validation process). They have the obvious advantage of identifying
the solutions most appropriate for a specific application. However, wrapper-based
approaches generally incur far greater computational overheads than the rest, owing
to the model training and validation required for the examination of each feature
subset. Although various methods exist that employ different end classification algorithms,
wrapper-based approaches generally follow the same design principle. Indeed, for
B ⊆ A, a generic wrapper-based evaluation measure may be defined as follows:
wrapper(B) = accuracy of the classifier built using B and X_train,
and tested using held-out data X_test (2.23)
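A minimal sketch of Eqn. 2.23 is given below, using an inline 1-nearest-neighbour classifier so that the example is self-contained; in practice any learning algorithm may take its place, and the function name and data layout shown here are illustrative assumptions:

```python
def wrapper_accuracy(subset, train, held_out):
    """Eqn. 2.23 sketch: train a 1-NN classifier on the selected
    features and report its accuracy on held-out data.
    Each instance is a (feature_vector, label) pair."""
    def dist(x, y):
        # squared Euclidean distance over the selected features only
        return sum((x[i] - y[i]) ** 2 for i in subset)
    correct = 0
    for x, label in held_out:
        nearest = min(train, key=lambda t: dist(t[0], x))
        if nearest[1] == label:
            correct += 1
    return correct / len(held_out)
```

Note that the classifier is retrained (here, implicitly, by re-scanning the training data) for every candidate subset, which is precisely the overhead that makes wrapper-based evaluation expensive.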
In order to combine the potential benefits of both filter-based and wrapper-based
methods, hybrid algorithms [297] have been proposed. The rationale behind these
techniques is to make use of both an evaluation measure and a learning algorithm,
in evaluating the quality of feature subsets. Such a combined measure is then used
to decide which subsets are the most suitable for a given cardinality, and the learning
algorithm is then used to select the final, overall “best” solution from a pool of
candidate feature subsets of different cardinalities.
In addition to such hybrid mechanisms, there is also the so-called embedded
approach. In such methods, an implicit or explicit FS sub-algorithm forms an integrated
part of a more general learning algorithm [264]. Decision tree learning is a
typical example of this. Of course, this may also be viewed as a specific case of the
hybrid approach.
2.2 FS Search Strategies
Having introduced the evaluation measures that seek to assess the merit or quality of
a given feature subset, the remaining problem for FS is to find the best solution from a
search space of 2|A| competing feature subsets using a specific strategy. An exhaustive
search can conceivably be performed, if the number of variables is not too large.
However, this problem is known to be NP-hard [8, 91] and the search can become
computationally intractable. Existing techniques in the literature generally fall into
two main categories: deterministic methods and stochastic techniques. This section
outlines two deterministic approaches that are commonly employed by conventional
FS algorithms. The main focus, however, is the analysis of stochastic, and especially
nature-inspired, search methods, the category to which the HSFS algorithm proposed
in this thesis belongs.
2.2.1 Deterministic Algorithms
Deterministic methods often follow greedy, step-by-step procedures, in order to
form a potential solution in a predetermined fashion. Such an approach is generally
simple to implement, and is empirically efficient for data sets with fewer (e.g., < 100)
features [164]. Two straightforward means to implement this approach are outlined
below.
2.2.1.1 Exhaustive Search
Exhaustive search is an optimal search method, with a complexity of O(2|A|), the
same as the total number of possible solutions. It is both optimal and complete, in
the sense that the best feature subset is guaranteed to be found once (and if) the
search terminates, with all potential solutions having been evaluated during the
process. Exhaustive search is computationally infeasible for most practical problems,
because of its exponential cost.
2.2.1.2 Sequential Search
Sequential search is sometimes referred to as a hill-climbing algorithm: it selects, at
each iteration, the single feature that provides the greatest improvement in terms of
the evaluation score. Its polynomial complexity is determined by the number of
subset evaluations per iteration required to identify the most informative feature.
Obviously, sequential search is not ideal, since the best solution may exist in a region
that the algorithm never visits. Furthermore, as discussed in Section 2.1.1.1,
the inter-dependencies between features make it less beneficial to explore potential
feature subsets on an individual feature basis.
2.2.2 Stochastic and Nature-Inspired Approaches
Following a taxonomy concerning the stochastic and nature-inspired approaches
in their base form [29], the existing FS methods can be classified into a number of
categories, as shown in Fig. 2.4. Considering that a few categories have not yet
attracted sufficient application in the area of FS, e.g., immune systems and physi-
cal/social algorithms, three major categories are established here in order to improve
the organisation of the reviewed methods. The biologically-inspired approaches
include the Genetic Algorithm (GA) [231, 235, 275], Genetic Programming [187],
Memetic Algorithm (MA) [274, 296], and the Clonal Selection Algorithm (CSA)
[230] from immune systems; the physical, social, and stochastic algorithms include
HSFS [62] (described in Chapter 3), Simulated Annealing (SA) [69, 182], Random
Search [241], Scatter Search [83], and Tabu Search (TS) [100]; and the swarm-based
techniques include Artificial Bee Colony (ABC) [199], Ant Colony Optimisation
(ACO) [38, 122, 134, 138], the Bat Algorithm [189], Bee Colony Optimisation, and
Particle Swarm Optimisation (PSO). Most of the above-mentioned algorithms are
described in detail in the following section.
Fundamentally speaking, putting the underlying analogies aside, nature-inspired
FS approaches are a collection of techniques of stochastic nature, for the purpose of
discovering and improving good candidate solutions. Several recent studies have combined
these algorithms, adopting one algorithm's strengths in an effort to
complement the weaknesses of another. In so doing, a number of hybrid methods have
emerged, including GA-PSO [11], ACO-GA [192], ACO-neural networks [234], PSO-
catfish [41], etc. Moreover, there also exist several approaches that have embedded
local search procedures [133, 194].
2.2.2.1 Common Notions and Mechanisms
Despite having distinctive characteristics and work flows, many stochastic search
techniques share similarities, which are summarised in Table 2.1. A population-
based NIM typically employs a group P of individuals p_i, each of which actively maintains an
Figure 2.4: Taxonomy of nature-inspired approaches
emerging feature subset B_pi. Internally, most algorithms represent a given feature
subset B_pi in a binary manner, using a string b^{B_pi} of length |A|. The jth
position of b^{B_pi} is set to 1 (b_j^{B_pi} = 1) if its corresponding feature is selected, i.e.,
a_j ∈ B_pi, and b_j^{B_pi} = 0 if a_j is not selected in the candidate feature subset. The current
best solution (amongst the entire population) is represented by B*, and a randomly
generated feature subset is denoted by B̃. To simplify the representation, random
components other than B̃ are denoted by the use of r. For example, c = r_c, 0 ≤ r_c ≤ 1,
indicates that the value of a certain parameter c is a randomly generated number
drawn from the range 0 to 1; and a_r ∈ A, r ∈ {1, ..., |A|}, denotes a feature
randomly picked out of the pool of original features A. These notations will be used
extensively in the pseudocode hereafter, in order to illustrate the work flows of the
reviewed NIMs in a unified representation that eases comparison.
• Random Initialisation
Table 2.1: Notions used in pseudocode

Notion              Meaning
p_i ∈ P             A population P of individuals p_i
B_pi ∈ B            Candidate feature subset B_pi maintained by p_i
B*                  Current best subset
B̃                   A subset of randomly selected features
b_j^{B_pi} ∈ {0,1}  Selection state (0: not selected, 1: selected) of the jth feature in B_pi
f(B)                Evaluation score of B
g                   Current generation/iteration
g_max               Maximum number of generations/iterations
r                   A random number or a stochastic component
T                   A temporary solution
One of the key advantages of nature-inspired approaches is their insensitivity to
the initial states. The population at the start of the search (often referred to as
the initial population) is generally a randomly generated pool. In stochastic
FS, a random subset B̃ can be constructed by randomly setting r bits, where r
itself may be a predetermined size, or random: r ∈ {1, ..., |A|}.

for i = 1 to |P| do B_pi = B̃ (2.24)
• Solution Adjustment
The candidate solutions are modified constantly during the search process. The
most common adjustment procedure is the random addition or removal of r_m
features, such as the mutation operator used by many evolutionary
algorithms. r_m may be predefined, or dynamically determined according to
certain states of the algorithm. If a binary representation b^B is used, this
adjustment may be achieved by randomly flipping r_m bits:

for i = 1 to r_m do b_r^B = ¬b_r^B, a_r ∈ A (2.25)
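The random initialisation of Eqn. 2.24 and the bit-flip adjustment of Eqn. 2.25 may be sketched as follows, using Python lists as binary strings (an illustrative sketch only):

```python
import random

def random_subset(n_features, r=None):
    """Eqn. 2.24: construct a random subset by setting r bits,
    where r itself is random if unspecified."""
    r = r if r is not None else random.randint(1, n_features)
    bits = [0] * n_features
    for i in random.sample(range(n_features), r):
        bits[i] = 1
    return bits

def mutate(bits, r_m):
    """Eqn. 2.25: flip r_m randomly chosen bits."""
    out = list(bits)
    for i in random.sample(range(len(bits)), r_m):
        out[i] = 1 - out[i]
    return out
```

Sampling positions without replacement guarantees that exactly r bits are set, and exactly r_m bits change under mutation.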
Several swarm-based algorithms exploit the notion of movement, from a given
candidate solution B_pi towards another, possibly better quality feature subset,
say B_pj, aiming to eventually reach the true global best solution. For conventional
numerical optimisation problems, this process derives new values for the
function variables according to predefined formulae, and new solution vectors
may be constructed which are interpolated between the source and target
vectors. However, for FS problems this is less applicable, since binary values
are generally employed, and the variables represent independent features. In
the literature, movement is implemented by first determining the distance
between the two subsets:

d(B_pi, B_pj) = |B_pi ⊕ B_pj| (2.26)

which is equal to the number of bit differences. The amount of movement
v, v ∈ [0, v_max], or the number of bits that B_pi should copy from B_pj, is then
calculated with regard to the absolute distance, as demonstrated in Algorithm
2.2.1. Note that for algorithms such as PSO and FA, the feature subset being
improved generally moves towards the current best solution B*.
1 if v ≤ d(B_pi, B_pj) then
2     for j = 1 to v do
3         b_r^{B_pi} = ¬b_r^{B_pi}, a_r ∈ B_pi ⊕ B_pj
4 else
5     B_pi = B_pj
6     for j = 1 to (v − d(B_pi, B_pj)) do
7         b_r^{B_pi} = ¬b_r^{B_pi}, a_r ∉ B_pj
Algorithm 2.2.1: Move B_pi towards B_pj by a distance v
• Subset Quality Comparison
FS is essentially a dual-objective optimisation task. A good quality feature
subset should both achieve a high score in terms of evaluation, and maintain
a low cardinality. Having an ordered solution space enables higher quality
solutions to be discovered. Algorithms such as CSA and HS use a scheme where
two given candidate solutions are compared first based on evaluation scores,
and the cardinalities of the subsets are then used as a tie breaker:
B_pi > B_pj ⇔ f(B_pi) > f(B_pj) ∨ (f(B_pi) == f(B_pj) ∧ |B_pi| < |B_pj|) (2.27)
Algorithms including FF, PSO, and SA require a single numerical difference
between candidate solutions, so that the internal parameters can be calcu-
lated. In this case, evaluation score and subset size are integrated together via
weighted aggregation, in order to reflect the influence of both the evaluation
score f(B) and the subset size (normalised as |B|/|A|) of a given feature subset B.
The weighting parameters α and β may be equal or biased:

B_pi > B_pj ⇔ α f(B_pi) + β |B_pi|/|A| > α f(B_pj) + β |B_pj|/|A| (2.28)
Alternative aggregation methods may also be employed of course. Note that
multi-objective evolutionary algorithms [71, 79] have also been exploited to
facilitate simultaneous optimisation of both criteria, but they are outside the
scope of this chapter.
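The two comparison schemes (Eqns. 2.27 and 2.28) may be sketched as predicates over (score, subset) pairs; note that a negative β is used here so that larger subsets are penalised, which is an assumption about how the weights would be chosen in practice:

```python
def better_lexicographic(f_i, b_i, f_j, b_j):
    """Eqn. 2.27: higher score wins; ties broken by smaller size."""
    return f_i > f_j or (f_i == f_j and len(b_i) < len(b_j))

def better_weighted(f_i, b_i, f_j, b_j, n_features, alpha=1.0, beta=-0.5):
    """Eqn. 2.28: weighted aggregation of score and normalised size.
    A negative beta penalises larger subsets (illustrative choice)."""
    agg = lambda f, b: alpha * f + beta * len(b) / n_features
    return agg(f_i, b_i) > agg(f_j, b_j)
```

The lexicographic scheme only orders solutions, whereas the weighted scheme additionally provides the numerical difference that algorithms such as PSO and SA require internally.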
• Current Best Solution Tracking
Due to the stochastic behaviour of the search algorithms concerned, it is often
necessary to keep a record of the best quality feature subset B* discovered thus
far, as the algorithm may explore other (possibly sub-optimal) solution regions
later on. The procedure for updating B* invokes the previously mentioned
comparison process (Eqn. 2.27). At each iteration, the quality of the current
best solution f(B*) is compared with those of all of the emerging subsets f(B_pi)
that are currently maintained by the individuals pi ∈ P:
1 for i = 1 to |P| do
2     if B_pi > B* then B* = B_pi
Algorithm 2.2.2: Update current best solution B*
• Local Search
Algorithm 2.2.3 details one of the local search procedures [3], commonly used
by techniques such as MA [274, 296] and hybrid search methods [133, 194]. It
is a greedy mechanism that evaluates all unselected features, and adds the most
informative candidate (the feature that provides the greatest improvement
in evaluation score) to the current feature subset. The Hill-Climbing (HC)
algorithm works in a similar fashion, continuing to select features until
the score cannot be improved further. This mechanism can also be used in
reverse, in order to eliminate the least important feature from a subset.
1  repeat
2      f' = f(B)
3      t = −1
4      for i = 1 to |A|, a_i ∉ B do
5          b_i^B = 1
6          if f(B) > f' then
7              f' = f(B)
8              t = i
9          b_i^B = 0
10     if t ≠ −1 then b_t^B = 1
11 until t == −1
Algorithm 2.2.3: Local search (B)
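The reverse (backward elimination) form of this mechanism may be sketched as follows, where f is the subset evaluation measure (an illustrative sketch):

```python
def backward_elimination_step(subset, f):
    """Reverse form of Algorithm 2.2.3: remove the feature whose
    removal best preserves (or improves) the evaluation score."""
    best_subset, best_score = set(subset), f(set(subset))
    for a in subset:
        candidate = set(subset) - {a}
        score = f(candidate)
        if score >= best_score:
            best_subset, best_score = candidate, score
    return best_subset
```

Using a non-strict comparison means that a feature whose removal leaves the score unchanged is still eliminated, favouring smaller subsets in line with Eqn. 2.27.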
2.2.2.2 Genetic Algorithm
Genetic Algorithms (GAs) mimic the process of natural evolution by simulating events
such as inheritance, mutation, selection, and crossover. A considerable amount of
investigation [231, 235] has been carried out in order to explore the feasibility
of applying GAs to FS, much of this has been summarised and compared in the
literature [79]. In GAs, a feature subset is generally represented by a binary string
called a chromosome. A population P of such chromosomes is randomly initialised
and maintained, and those with higher fitness values are propagated into the later
generations.
The reproduction process is typically achieved by the use of two operators:
crossover and mutation. As shown in Algorithm 2.2.4, the standard one-point
crossover operator exchanges and recombines a pair of parent chromosomes, Bp
and Bq. It first locates a certain crossover point rc along the length of the binary
string, and then generates two children with all features beyond rc swapped between
the two parents. The mutation operator produces a modified subset by randomly
adding or removing features from the original subset. By allowing the survival and
reproduction of the fittest chromosomes, the algorithm effectively optimises the
quality of the selected feature subset. The parameters c and m, which control the
rates of crossover and mutation, require careful consideration, in order to allow the
chromosomes to sufficiently explore the solution space, and to prevent premature
convergence towards a locally optimal subset.
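The one-point crossover operator described above may be sketched on binary lists as follows (illustrative):

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Swap all bits beyond a randomly chosen crossover point r_c."""
    r_c = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:r_c] + parent_b[r_c:]
    child_b = parent_b[:r_c] + parent_a[r_c:]
    return child_a, child_b
```

Because the children are complementary recombinations of the parents, the total number of selected features across the pair is preserved.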
A GA-based FS algorithm is simple in concept, and may be implemented efficiently
to obtain good quality feature subsets. Being a randomised algorithm,
1  p_i ∈ P, i = 1 to |P|, group of chromosomes
2  B_pi ∈ B, i = 1 to |P|, subsets associated with p_i
3  c, crossover rate
4  m, mutation rate
5  Randomly initialise P
6  for g = 1 to g_max do
7      for i = 1 to |P| do
8          T_i = B_pi with a probability of f(B_pi) / Σ_{B_pi∈B} f(B_pi)
9          T_{i+1} = B_pj with a probability of f(B_pj) / Σ_{B_pj∈B} f(B_pj)
10         if T_i == T_{i+1} then
11             b_r^{T_i} = ¬b_r^{T_i}, a_r ∈ A
12         else
               // Crossover
13             if r < c then
14                 r_c ∈ {1, ..., |A|/2}
15                 for k = 1 to r_c do
16                     b_k^{T_{i+1}} = b_k^{B_pi}, b_k^{T_i} = b_k^{B_pj}
               // Mutation
17             for j = 1 to |A| do
18                 0 ≤ r_m ≤ 1
19                 if r_m < m then b_j^{T_i} = ¬b_j^{T_i}
20                 if r_m < m then b_j^{T_{i+1}} = ¬b_j^{T_{i+1}}
21         i = i + 2
22     for i = 1 to |P| do
23         B_pi = T_i
24     Update B*
Algorithm 2.2.4: Genetic Algorithm
however, there is no guarantee that a top quality feature subset (if not the global
best solution) can be found in a reasonable or fixed amount of time; its optimisation
response time and solution quality are not constant. These drawbacks may limit
the potential of GAs for more demanding scenarios, such as on-line streaming FS [268].
It is also a challenging task to identify a suitable set of required parameter values,
since the problem domain generally has very little in common with the evolutionary
concepts underlying GAs.
2.2.2.3 Memetic Algorithm
The Memetic Algorithm (MA) [30, 196] signifies several recent advances in evolutionary
computation. The term is commonly used to refer to any population-based evolutionary
approach with a separate individual learning process (e.g., a local improvement
procedure). It is also referred to as hybrid GAs, parallel GAs, or genetic local
search in the literature [37]. When applied to the FS domain, the key research
question is how the local search should be implemented. Such an approach typically
follows a similar improvement process to Algorithm 2.2.3, except that the features
being considered for addition are drawn from a randomly selected subset, rather than from
the complete set of original features [274].
An alternative local improvement process [296] suggests that a ranking of fea-
tures should be computed first, and the solutions may then be improved by adding
or removing features based on the ranking information. However, this requires the
subset evaluator employed to be able to handle feature ranking, unless such information
can be obtained elsewhere. It has also been proposed that local search may be
performed on an elite subset of the population [280], as shown in Algorithm 2.2.5, and
that the worst solutions may be substituted by the locally improved child solutions. The
local search mechanism alters (by adding or removing) the single feature that provides
the greatest increase in terms of evaluation score.
Although proven to be beneficial in a majority of scenarios, the presence of
greedy mechanisms may have a negative impact on the quality of the feature subset
returned, since the natural stochastic evolution of the chromosomes may be disrupted
by excessive execution of local adjustments. This is because the variable values (i.e.,
features) are discrete rather than continuous. It is also difficult to identify the most
suitable local search mechanism, in addition to the configuration of parameter values.
2.2.2.4 Clonal Selection Algorithm
The Clonal Selection Algorithm (CSA) [54] is inspired by the adaptive immune response
to an antigenic stimulus. It exploits the fact that only those antibodies that recognise
the antigen are selected to proliferate. The original algorithm involves a maturation
process for the selected biological cells, which improves their affinity to the selective
antigens. A simplified implementation of CSA-based FS [230] has been proposed. It enables
both the selection of important features, and the optimisation of parameters for the end
classifiers (which are implemented using support vector machines). Although
1  p_i, i = 1 to |P|, group of chromosomes
2  B_pi ∈ B, i = 1 to |P|, subsets associated with p_i
3  s, number of best individuals kept for reproduction
4  c, crossover rate
5  m, mutation rate
6  for i = 1 to |P| do
7      B_pi = Local search(B̃)
8  for g = 1 to g_max do
9      for i = 1 to s do
10         T_i = B_pi with a probability of f(B_pi) / Σ_{B_pi∈B} f(B_pi)
11         T_{i+1} = B_pj with a probability of f(B_pj) / Σ_{B_pj∈B} f(B_pj)
12         if T_i == T_{i+1} then
13             b_r^{T_i} = ¬b_r^{T_i}, a_r ∈ A
14         else
15             if r < c then
16                 r_c ∈ {1, ..., |A|/2}
17                 for k = 1 to r_c do
18                     b_k^{T_{i+1}} = b_k^{B_pi}, b_k^{T_i} = b_k^{B_pj}
19             for j = 1 to |A| do
20                 0 ≤ r_m ≤ 1
21                 if r_m < m then b_j^{T_i} = ¬b_j^{T_i}
22                 if r_m < m then b_j^{T_{i+1}} = ¬b_j^{T_{i+1}}
23         i = i + 2
24     Sort B
25     for i = 1 to s do
26         B_p{|P|−i} = Local search(T_i)
27     Update B*
Algorithm 2.2.5: Memetic Algorithm
the original method integrates these two tasks, it is easily modifiable to support
generic FS, as shown in Algorithm 2.2.6. An adaptive CSA has also been adopted for
network fault FS [284].
The initial population is filled with randomly generated antibodies at the start.
At each iteration, clones are created for each individual. The better evaluation score
an antibody achieves, the more clones are constructed. The maximum number of
1  p_i ∈ P, i = 1 to |P|, group of antibodies
2  0 ≤ f(B) ≤ 1, normalised subset evaluation score
3  c, maximum number of clones per antibody
4  m, maximum number of bits per mutation
5  s, maximum number of random cells
6  T ∈ T, a set of temporary feature subsets
7  Randomly initialise P
8  for g = 1 to g_max do
9      T = ∅
10     for i = 1 to |P| do
11         c_i = c · e^{f(B_pi) − f(B*)}
12         for j = 1 to c_i do
13             T = B_pi
14             Flip m(1 − c_i/c) random bits of T
15             T = T ∪ {T}
16     for i = 1 to s do
17         T = T ∪ {B̃}
18     Sort T
19     while |T| > |P| do
20         T = T \ {T_|T|}
21     B = T
22     Update B*
Algorithm 2.2.6: Clonal Selection Algorithm
allowed clones c_i for each of the antibodies p_i, i = 1, ..., |P|, can be configured by a
parameter c, and an exponential function:

c_i = c · e^{f(B_pi) − f(B*)} (2.29)
is used to calculate the amount of copies required. The clones are then mutated
by flipping bits randomly, and subsequently added to the population. Again, the
better the original antibody is, the fewer bits will be altered. The population is
further joined by a set of antibodies which are randomly selected from the existing
population. The trim process then removes the worst antibodies in order to maintain
the size of the group |P|. The current best solution is updated at each iteration, and
this process repeats until the maximum number of iterations has been reached.
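The clone-allocation rule of Eqn. 2.29, together with the complementary mutation intensity described above, may be sketched as follows (the function names are illustrative):

```python
import math

def clone_counts(scores, best_score, c):
    """Eqn. 2.29 sketch: better antibodies receive more clones."""
    return [int(c * math.exp(f - best_score)) for f in scores]

def mutation_bits(c_i, c, m):
    """Better antibodies (larger c_i) have fewer bits flipped."""
    return int(m * (1 - c_i / c))
```

An antibody matching the current best score receives the full c clones with zero forced mutation, while weaker antibodies receive exponentially fewer clones, each mutated more aggressively.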
Despite being a much simplified version of CSA, the CSA-based FS technique still
needs to produce clones of the good candidate feature subsets (antibodies), at every
iteration. The exponential cost of generating, mutating, and evaluating such clones
may become a significant overhead, especially for higher dimensional data sets.
2.2.2.5 Simulated Annealing
Simulated Annealing (SA) [55] is a generic probabilistic meta-heuristic for locating
an approximation to the global optimum of a given complex function. It is inspired by
the annealing process in metallurgy, a technique that involves repeatedly heating and
cooling a certain material in a controlled environment, in order to increase the size
of the crystals, and reduce the defects, both of which depend on the thermodynamic
free energy of the material.
The adaptation of SA for FS [69, 182] requires an adjustment to the underlying
computational algorithm, which ensures that SA keeps a record of the current best
solution. Unlike most other population-based NIMs, SA-based FS maintains and
improves only a single feature subset throughout the search process. As shown in
Algorithm 2.2.7, the algorithm checks whether it has reached “thermal equilibrium”
at a given energy state, by maintaining a count e that is incremented each time SA finds
a better quality feature subset. Once sufficient improvements have been encountered, SA makes
a transition in the energy state, where both the temperature g and the perturbation
(mutation) percentage ρ are adjusted by a so-called cooling rate c. The equilibrium
count e is also reset. Such a transition generally means that a smaller number of
features are adjusted during mutation, allowing fine tuning to be achieved towards
termination.
One of the major criticisms levelled at SA is that it is not always able to return
the best solution found throughout the search process. This issue is addressed in the
above-mentioned application to FS. Only having to improve and maintain a single
candidate solution has the obvious advantage of high efficiency. However, this also
makes SA-based FS more prone to discovering locally best feature subsets, unless
careful consideration is given to the settings of the starting temperature and the cooling rate.
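The Metropolis-style acceptance test used in Algorithm 2.2.7 (accept any improvement; accept a worsening move with probability e^(−d/g)) may be sketched as:

```python
import math
import random

def accept(f_current, f_trial, temperature):
    """SA acceptance test: d <= 0 means the trial is no worse and is
    always accepted; otherwise accept with probability exp(-d / g)."""
    d = f_current - f_trial  # positive when the trial solution is worse
    return d <= 0 or random.random() < math.exp(-d / temperature)
```

At high temperatures almost any move is accepted, encouraging exploration; as the temperature cools, worsening moves become increasingly unlikely, so the search settles into fine tuning.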
2.2.2.6 Tabu Search
Tabu Search (TS) [179] is a local search-based meta-heuristic designed to avoid the
pitfalls of typical greedy search procedures. It aims to investigate the solution space
that would otherwise be left unexplored. Once a local optimum is reached, upward
1  g ∈ [g_min, g_max], range of temperatures
2  ρ, perturbation percentage
3  c, cooling rate
4  e, equilibrium count
5  e_max, maximum number of successful perturbations
6  while g > g_min do
7      T = B
8      r_t = ⌈ρ|A|⌉
9      Flip r_t random bits of T
10     d = f(B) − f(T)
11     if d ≤ 0 ∨ r < e^{−d/g} then
12         B = T
13         e = e + 1
14     if e == e_max then
15         e = 0, g = g·c, ρ = ρ·c
Algorithm 2.2.7: Simulated Annealing
moves (those that worsen the solutions) are allowed [13]. Simultaneously, the last
moves are marked as tabu during the following iterations to avoid cycling.
A TS-based FS method [100], as shown in Algorithm 2.2.8, has been proposed
in order to deal with reduction problems in conjunction with the use of rough set
theory. It employs a binary representation for the feature subsets. It also maintains a
tabu list τ that holds a record of the most recently evaluated solutions, so that the
algorithm can avoid being trapped in a previously explored region, and is restrained
from generating solutions of very low quality. The tabu list is usually initialised with
two feature subsets: the empty subset ∅, and the set containing all available features A.
The approach starts by ranking the individual features according to their eval-
uation scores, and invokes a procedure to generate l new trial solutions that are
neighbours to a given candidate solution, with a hamming distance of up to l fea-
tures. The algorithm continues to generate a new trial at each iteration, until no
improvement has been observed for a predefined number of iterations. It then initiates
two mechanisms to further “mutate” a given candidate solution: shaking and
diversification. Shaking is essentially a greedy backward local search: each of the
selected features is examined one by one, in order to check whether its removal
produces a higher quality solution, or a subset with the same evaluation score but a
reduced size. The diversification procedure attempts to generate a new candidate
1  C, candidate solutions ordered by quality
2  τ, tabu list
3  l, number of trials
4  T ∈ T, |T| = l, neighbouring solutions
5  Q, number of occurrences of features in T
6  k ≤ k_max, number of iterations without improvements
7  C = ∅, B = ∅, B* = ∅, τ = {∅, A}
8  for i = 1 to |A| do C = C ∪ {a_i}
9  while |B| < |A| do Local search(B)
10 for g = 1 to g_max do
11     while k < k_max do
12         for j = 1 to l do
13             T = B
14             Flip j random bits of T
15             if T ∈ τ then
16                 j = j − 1
17             else
18                 T = T ∪ {T}
19         B = argmax_{T∈T} f(T)
20         τ = τ ∪ {B}
21         if f(B) > f(B*) then
               // Shaking
22             B* = B
23             foreach C ∈ C do
24                 if B* \ C ∉ τ ∧ f(B* \ C) ≥ f(B*) then
25                     B* = B* \ C
26         else
27             k = k + 1
       // Diversification
28     for i = 1 to |A| do
29         r_Q ∈ {1, ..., |A|}
30         if r_Q > Q_i then B = B ∪ {a_i}
Algorithm 2.2.8: Tabu Search
solution, which contains features chosen with probability inversely proportional to
their number of appearances in the trial solutions. This process continues until the
maximum number of iterations has been reached.
The greedy mechanisms employed by TS are very beneficial to quickly locating
potentially better solutions, but they may still lead to locally optimal feature subsets.
TS also adopts a trial generation procedure similar to the cloning process of CSA;
although the cost is not exponential, it may impose a significant overhead for high
dimensional problems.
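The trial-generation loop of TS sketched above (flip j random bits for the j-th trial, discarding and regenerating any trial found in the tabu list) can be illustrated in Python. This is an illustrative reconstruction, not the thesis implementation; the function name `generate_trials` and the attempt guard are assumptions.

```python
import random

def generate_trials(current, tabu, l, rng=random.Random(42)):
    """Generate up to l neighbouring trials of a bit-vector solution by
    flipping j random bits for the j-th trial, skipping tabu solutions."""
    trials = []
    attempts = 0
    while len(trials) < l and attempts < 10 * l:  # guard for tiny search spaces
        attempts += 1
        t = list(current)
        # the j-th trial flips j bits, so neighbours lie at growing Hamming distances
        for pos in rng.sample(range(len(t)), k=len(trials) + 1):
            t[pos] = 1 - t[pos]  # flip the selected bit
        t = tuple(t)
        if t not in tabu:  # a tabu trial is discarded and regenerated
            trials.append(t)
    return trials
```

The best trial of the returned list would then be chosen by `argmax` over the subset evaluator, as in Algorithm 2.2.8.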
2.2.2.7 Artificial Bee Colony
The Artificial Bee Colony [136] (ABC) algorithm is inspired by the intelligent be-
haviour of a honey bee swarm when searching for promising food sources. Because
ABC is a relatively new algorithm, and much of the original concept has been modi-
fied and/or omitted in the existing FS adaptations [199, 242], a brief description of
the original method is given here initially. In this algorithm, a colony of artificial bees
is divided into three groups: employed bees, onlookers and scouts. The positions of
food sources represent possible solutions to the optimisation problem. The nectar
amount of a food source corresponds to the quality of its associated solution. The
number of employed bees determines the number of solutions to be simultaneously
explored (and maintained).
Following an initial, randomly generated distribution of food source positions, an
employed bee attempts to locate a neighbouring food source, and evaluates its nectar
amount. If the quality of the nearby source is greater, the employed bee will point to
the newer position, otherwise the previous food source is preserved. An onlooker
watches the employed bees dance at the hive, sharing information regarding the
discovered sources, and independently selects a food source to visit (following the
same neighbourhood investigation procedure). The better food sources are recorded
in place of the previously found locations. Employed bees abandon the unvisited
food sources and become scouts, who perform random search for new solutions.
This process repeats until a predefined set of requirements is met, e.g., the maximum
number of iterations.
The rough set-based ABC FS method [242] first groups the instances by the
decision attributes, and applies a greedy local search to find the reduced feature sets
(for each class). ABC is then used to choose a random number of features out of
each set, and to combine the chosen features into the final feature subset. Due to the
presence of the initial local search, this approach only requires a population size as
big as the number of classes. A more general approach [199] considers features as
food sources. It configures the population of both employed bees and onlookers to be
equal to the number of features. Each employed bee is allocated one feature in the
beginning, and may be merged with others following the decisions of an onlooker,
forming feature subsets in the process. The merge of B_pi and B_pj only happens if
r · (f(B_pi) − f(B_pj)) > 0, r ∈ [0, 1].
In order to make the approach more scalable for data sets with large numbers of
features, an alternative method is described in Algorithm 2.2.9 that fits more closely
to the original ABC algorithm. It uses a predefined population size independent of
the number of features, which is initialised with randomly formed subsets B. Both
the employed bees and onlookers employ the same neighbourhood investigation
procedure, and accept a neighbouring solution if it is better than the subset it
is currently examining. An onlooker q_i picks a particular employed bee p_j with
probability:

    f(B_pj) / Σ_{j=1}^{|P|} f(B_pj)    (2.30)

which is in proportion to the evaluation score of its current feature subset f(B_pj),
and marks p_j as visited.
At the end of the neighbourhood inspection procedure, any employed bee that is
unvisited generates and evaluates a new random subset B, as its current solution is
very likely to be of low quality. The process repeats until gmax iterations have been
completed. The current best solution B*, which has been updated at every iteration, is
returned as the final result. A solution adjustment procedure similar to the move
operation, previously described in Algorithm 2.2.1, is also employed with promising
results. In particular, it allows an onlooker to generate a neighbouring solution by
moving its current subset towards that of the inspecting employed bee.
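The proportional selection of Eqn. 2.30 amounts to roulette-wheel sampling over the employed bees' evaluation scores. A small Python sketch follows; the function name and the numerical fallback are assumptions for illustration.

```python
import random

def pick_employed_bee(scores, rng=random.Random(0)):
    """Roulette-wheel selection of an employed bee: bee j is chosen with
    probability f(B_pj) / sum of all scores (Eqn. 2.30)."""
    total = sum(scores)
    r = rng.uniform(0, total)
    acc = 0.0
    for j, s in enumerate(scores):
        acc += s
        if r <= acc:
            return j
    return len(scores) - 1  # fallback against floating-point rounding
```

An onlooker would call this once per iteration, mark the chosen bee as visited, and then apply the same neighbourhood investigation to its subset.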
2.2.2.8 Ant Colony Optimisation
The Ant Colony Optimisation (ACO) algorithm [65] is originally proposed for solving
hard combinatorial optimisation problems. It is based on the behaviour of ants
seeking an optimal path between their colony and a source of food. The approach
uses a group of simple agents called ants that communicate indirectly via pheromone
trails, and probabilistically constructs solutions to the problem under consideration.
Several adaptations of the ACO algorithm have been proposed for the FS problem
domain, a number of which focus on rough set [38, 138] and fuzzy-rough set-based
p_i ∈ P, i = 1 to |P|, the group of employed bees
q_i ∈ Q, i = 1 to |P|, the group of onlooker bees
T_i, a temporary subset for p_i

for g = 1 to gmax do
    for i = 1 to |P| do
        if p_i.visited == false then
            B_pi = a new random subset
        else
            if f(neighbour(B_pi)) > f(B_pi) then
                B_pi = neighbour(B_pi)
    for i = 1 to |P| do
        select p_j with a probability of f(B_pj) / Σ_{j=1}^{|P|} f(B_pj)
        p_j.visited = true
        if f(neighbour(B_pj)) > f(B_qi) then
            B_qi = neighbour(B_pj)
        else
            B_qi = B_pj
    Update B*

Algorithm 2.2.9: Artificial Bee Colony Optimisation
[122] subset evaluators, while a more general approach also exists in the literature
[134].
In ACO-based FS algorithms, features are represented as nodes in a fully con-
nected bi-directional graph; a candidate feature subset B is therefore a path that
connects the selected features. Two sets of hints are available to the ants: the heuris-
tic information η and the pheromone values τ. η is a pre-constructed matrix of size
|A|², where A is the set of original features. Cell η_jk = η_kj stores the evaluation score
of the feature subset {a_j, a_k}, and signifies the quality of the path between a_j and a_k.
τ is another matrix of the same size that stores the pheromone values deposited by
the ants; it is initially populated with a constant value τ0.
As shown in Algorithm 2.2.10, during every iteration, each ant begins from a
random feature; an edge connecting the previous node a_c and an unvisited feature
a_l is chosen with probability:

    prob_l = (τ_lc^α · η_lc^β) / (Σ_{a_l ∉ B} τ_lc^α · η_lc^β)    (2.31)
where α and β are predefined parameters. Following the path construction process,
an active (on-line) update of τ is performed, according to rules such as:
    τ_jk = m·τ_jk + n·(1/|B|)    (2.32)
where m and n are predefined weights [122, 138].
A passive (off-line) update [122, 138] may also be performed once the whole
path (feature subset) B is established:
    τ_jk = ρ·τ_jk + f(B),  if j ∈ B ∧ k ∈ B
    τ_jk = ρ·τ_jk,         otherwise           (2.33)
where ρ is the evaporation rate. For feature subset evaluators with a pre-determined,
maximum evaluation score, e.g., rough and fuzzy-rough set-based evaluators that
have a score range of 0 to 1, such information may be used to stop an ant from
traversing towards further nodes, once the highest possible fitness value is obtained.
For a generic evaluation technique, an ant may stop if f(B) > f(B*), the best score found so far, or if f(B ∪ {a_l}) < f(B), where a_l is the feature about to be included.
As ACO requires a pre-constructed heuristic information matrix, O(|A|²) subset
evaluations are necessary in order to calculate the pair-wise feature
dependency, which may become prohibitive for large data sets or high-complexity
subset evaluators. To combat this, a starting set of essential features may be
calculated in advance, such as the “core” for rough set and fuzzy-rough set-based
techniques [38]. This may significantly reduce the computational overhead, by
eliminating the need to consider such features while traversing the graph. In addition,
a normalisation process for τ has been proposed [138] to avoid search stagnation
caused by extreme relative differences between pheromone trails.
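The transition rule of Eqn. 2.31 can be sketched in Python as follows. The matrices are represented as nested lists, and the parameter values α = 1 and β = 2 are illustrative assumptions, not values prescribed by the thesis.

```python
def transition_probabilities(current, candidates, tau, eta, alpha=1.0, beta=2.0):
    """Probability of each unvisited feature l being chosen from node c
    (Eqn. 2.31): prob_l is proportional to tau[l][c]**alpha * eta[l][c]**beta."""
    weights = {l: (tau[l][current] ** alpha) * (eta[l][current] ** beta)
               for l in candidates}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}

# Symmetric 3-feature example: pheromone uniform, heuristic favours feature 1
tau = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
eta = [[0, 0.5, 0.25], [0.5, 0, 0.75], [0.25, 0.75, 0]]
probs = transition_probabilities(0, [1, 2], tau, eta)
```

With uniform pheromone, the choice is driven entirely by the heuristic term, so the feature with the higher pair-wise score receives the larger probability.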
2.2.2.9 Firefly Algorithm
The Firefly Algorithm (FA) [277] is a meta-heuristic inspired by the flashing behaviour
of fireflies, which acts as a signal system to attract others. This approach has several
underlying assumptions: 1) all fireflies are uni-sexual, so that an individual will
be attracted to all others; 2) the brightness of a firefly is proportional to its fitness
value but can decrease when observed over distance; and 3) a random exploration is
p_i ∈ P, i = 1 to |P|, group of ants
B_pi, current edges (feature subset) traversed by p_i
η_jk = η_kj, j, k = 1 to |A|, heuristic information
τ_jk = τ_kj, j, k = 1 to |A|, pheromone values
τ0, initial pheromone
ρ, evaporation rate

Initialise parameters
for j = 1 to |A| − 1 do
    for k = j + 1 to |A| do
        η_jk = f({a_j, a_k})
        τ_jk = τ0
for g = 1 to gmax do
    for j = 1 to |A| − 1, k = j + 1 to |A| do
        τ_jk = ρ·τ_jk
    normalise τ
    for i = 1 to |P| do
        a_c = a_r, random 1 ≤ r ≤ |A|
        B_pi = {a_c}
        while |B_pi| < |A| do
            select a_l ∉ B_pi with probability ∝ τ_lc^α · η_lc^β
            if f(B_pi ∪ {a_l}) < f(B_pi) then break
            B_pi = B_pi ∪ {a_l}
            τ_lc = (1 − f(B_pi))/2 + f(B_pi)·τ_lc
            a_c = a_l
    for i = 1 to |P| do
        for j = 1 to |A| − 1, k = j + 1 to |A| do
            τ_jk = τ_jk + f(B_pi)
    Update B*

Algorithm 2.2.10: Ant Colony Optimisation
performed if no brighter fireflies can be seen. It has been shown that FA degenerates
into the particle swarm optimisation algorithm with specific parameter settings.
FA has been successfully applied to addressing rough set-based FS problems [12],
via the use of a population size equal to the number of features. Each individual's
feature subset is initialised with one of the original features: B_pi = {a_i}. The
brightness I_i of p_i is determined by the rough set dependency score of its associated
feature only. The best mating partner p_j for a firefly p_i should satisfy: 1) I_j > I_i; 2) the
distance f* − f(B_pi ∪ B_pj) is minimal over all p_j ∈ P, j ≠ i; and 3) f(B_pi ∪ B_pj) > f(B_pj).
The two subsets then merge together and the process repeats for all fireflies until the
maximum rough set dependency score is achieved. This implementation removes all
stochastic components from the base FA algorithm, and delivers compact rough set
reducts in a manner similar to that of a greedy local search.
An alternative FA-based FS approach can be developed. Briefly, it makes better
use of the stochastic elements proposed by the original, base FA algorithm. As
illustrated in Algorithm 2.2.11, such an approach supports the use of any subset-
based evaluator, and shares similar intentions and modifications to those of the
improved ABC algorithm explained in Section 2.2.2.7. A population P of predefined
size is initialised with random subsets. The brightness of p_j when observed by p_i is
calculated using:
    I_ij = f(B_pj) · e^(−γ·d(B_pi, B_pj)²)    (2.34)
where γ is a predefined parameter termed the “absorption coefficient”. This im-
plements the original idea in which the attractiveness of a firefly decreases as the
distance between it and its mating partner increases. In FS terms, this means that
the larger |B_pi ⊕ B_pj| (the total number of mismatched features) is, the dimmer
the firefly will be perceived to be. A subset B_pi is moved towards its best mating
partner by a distance of:
    d_ij = d(B_pi, B_pj) · e^(−γ·d(B_pi, B_pj)²)    (2.35)
The resulting subset B_pi′ replaces the previous B_pi if it is a better solution;
otherwise the original subset is maintained. At every iteration, the current best
solution B* is updated, and it is returned once the process reaches the maximum
number of iterations.
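The brightness and move-distance formulae of Eqns. 2.34 and 2.35 can be made concrete with a short Python sketch, taking d(B_pi, B_pj) as the size of the symmetric difference |B_pi ⊕ B_pj| described above. Function names and the γ value are illustrative assumptions.

```python
import math

def hamming(b_i, b_j):
    """d(B_pi, B_pj): the number of mismatched features |B_pi ⊕ B_pj|."""
    return len(b_i ^ b_j)  # symmetric difference of feature-index sets

def observed_brightness(f_j, b_i, b_j, gamma=0.1):
    """Eqn. 2.34: the brightness of p_j as seen by p_i decays with distance."""
    d = hamming(b_i, b_j)
    return f_j * math.exp(-gamma * d * d)

def move_distance(b_i, b_j, gamma=0.1):
    """Eqn. 2.35: how far (how many features) B_pi moves towards B_pj."""
    d = hamming(b_i, b_j)
    return d * math.exp(-gamma * d * d)
```

The exponential decay means that, of two partners with equal evaluation scores, the one sharing more features with the observer appears brighter and is therefore preferred.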
2.2.2.10 Particle Swarm Optimisation
Particle Swarm Optimisation (PSO) [7] is a type of method that optimises a problem
by exploiting a population of particles P (also referred to as the swarm), which
move around in the solution space with simulated positions and velocities. The
movement of a given particle is not only influenced by its own current best position,
but also guided towards the currently known group-wise best position in the search
space. The individual current best solution is constantly updated when the individual
particles locate better positions. The previously introduced FA is very closely related
to PSO, having many similar underlying principles, especially with regards to the
p_i ∈ P, i = 1 to |P|, group of fireflies
q_i ∈ Q, i = 1 to |Q|, |Q| = |P|, temporary group of fireflies
γ, light absorption coefficient
I_ij, observed brightness of p_j by p_i

Initialise parameters
Random initialisation
for g = 1 to gmax do
    Q = ∅
    for i = 1 to |P| do
        select p_j, j = argmax_j I_ij, j ≠ i
        d_ij = d(B_pi, B_pj)
        p_i′ = move p_i towards p_j by d_ij·e^(−γ·d_ij²)
        if f(B_pi′) > f(B_pi) then B_qi = B_pi′
    P = Q
    Update B*

Algorithm 2.2.11: Firefly Algorithm
concept of particle movements. However, the fireflies in FA are only attracted and
move towards locally observed best mating partners.
When applied to FS (see Algorithm 2.2.12), the velocity v_i of a given candidate
feature subset B_pi, which represents the number of features to be altered, is
calculated by:

    v_i = w_g·v_i + c1·rd1·d(B_pi, B̂_pi) + c2·rd2·d(B_pi, B*)    (2.36)
where wg is a gradually decreasing inertia weight, and c1 and c2 are the acceleration
constants giving weights to the current individual best and the group-wise best
solution, respectively. The outcome of the velocity calculation is further randomised
via the use of random numbers 0 ≤ rd1, rd2 ≤ 1. It has been suggested in the literature
[261] that the velocity should be regulated by a predefined value vmax, since the
number of features being modified can potentially become very large. Finally, once
the number has been determined, the new candidate subset is calculated following
Algorithm 2.2.1.
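The velocity update of Eqn. 2.36, with the v_max clamping suggested in [261], can be sketched as follows. Subsets are modelled as sets of feature indices, with distance again taken as the number of mismatched features; the rounding to an integer feature count is an assumption of this sketch.

```python
import random

def velocity(v_prev, b_cur, b_local, b_global, w_g, c1=2.0, c2=2.0,
             v_max=10, rng=random.Random(1)):
    """Eqn. 2.36 with clamping: the velocity (number of features to alter)
    is attracted towards the local best and group-wise best subsets."""
    rd1, rd2 = rng.random(), rng.random()
    d_local = len(b_cur ^ b_local)    # Hamming distance to individual best
    d_global = len(b_cur ^ b_global)  # Hamming distance to group best
    v = w_g * v_prev + c1 * rd1 * d_local + c2 * rd2 * d_global
    return max(1, min(int(round(v)), v_max))  # regulate by v_max, keep >= 1
```

The resulting integer would then drive the move operation of Algorithm 2.2.1, which alters that many features of the particle's current subset.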
There exists significant debate surrounding the velocity calculation [41, 167, 261].
This reflects the discrepancy between the intended usage of “movement” proposed
in the base PSO algorithm, and its actual implementation in PSO-based FS. For
continuous-valued optimisation, notions such as velocity and movement are intuitive.
p_i ∈ P, i = 1 to |P|, group of particles
B̂_pi, local best subset found by p_i
c1, c2, acceleration constants towards B̂_pi and B*
w_g ∈ [wmin, wmax], gradually decreasing inertia weight
v_i ∈ [1, vmax], current and maximum velocity

Initialise parameters
Random initialisation
for g = 1 to gmax do
    Update B*, B̂_pi
    for i = 1 to |P| do
        random rd1, rd2, 0 ≤ rd1, rd2 ≤ 1
        v_i = w_g·v_i + c1·rd1·d(B_pi, B̂_pi) + c2·rd2·d(B_pi, B*)
        Move B_pi towards B̂_pi by v_i
    w_g = wmin + (1 − g/gmax)·(wmax − wmin)

Algorithm 2.2.12: Particle Swarm Optimisation
They are used to locate a possible intermediate, interpolated solution between two
solution vectors, i.e., points in a continuous space. Yet in FS, the features are
discrete-valued, and the current binary representation does not allow a straightforward,
meaningful interpolation between two feature subsets. Therefore, PSO-based FS
may potentially benefit from the integer-valued representation as used in HS-based
FS.
2.3 Summary
This chapter has introduced a selection of FS techniques for the purpose of evaluating
the quality of feature subsets. The concept of individual feature-based measures
(i.e., feature ranking methods) [147, 162, 219] has been described. They offer
distinctive differences to the group-based approaches [52, 93, 126] that consider
a given feature subset as a whole. Various FS models including filter-based [53],
wrapper-based [164], hybrid [119, 194, 297] and embedded methods [264] have
also been introduced, forming alternative means with which to assess the goodness
of feature subsets. Three group-based filter evaluators: CFS [93], PCFS [52], and
FRFS [126] have been explained in detail, as they are the main methods adopted to
demonstrate the efficacy of the approaches proposed in this thesis.
More importantly, this chapter has presented a comparative review of nine dif-
ferent stochastic FS search strategies. Their underlying respective inspirations span
a wide range of areas, including evolution, biology, physics, social behaviour, and
swarm activities, and have been applied to the problem domain of FS. The work
flows of the reviewed algorithms have been illustrated in a uniform manner, and
the common notions and shared mechanisms have also been identified. Existing
methods that are based on the classic heuristics, such as ACO [38, 122, 134, 138], GAs [231, 235], and PSO [41, 167, 261], are summarised. Several more recent
developments, including CSA [230], ABC [199, 242], and FA [12], which are pro-
posed to solve more specific scenarios (or work with fixed types of feature subset
evaluator) are introduced and modified considerably. These modifications enable the
approaches to work with generic, feature subset-based evaluators and thus, facilitate
direct comparison with the proposed HSFS method. A systematic experimental
evaluation of these reviewed methods has been carried out, in order to demonstrate
their efficacy, with results presented later in Section 3.5.
Chapter 3
Framework for HSFS and its
Improvements
I N this chapter, a new FS search algorithm named HSFS is presented. This approach
is based on HS, a recently proposed, simple yet powerful optimisation technique
inspired by the improvisation process of music players. HSFS is a general approach
that can be used in conjunction with a wide range of feature subset evaluation
techniques. It is particularly beneficial for group-based measures such as CFS, PCFS
and FRFS, which assess the quality of a given feature subset as a whole, rather than
a combination of individual-feature-based scores. Owing to the stochastic nature
of HS, the proposed approach is able to escape from local best solutions, and can
identify multiple quality feature subsets.
Despite being a population-based approach, HSFS works by generating a new
harmony that encodes a candidate feature subset, after considering a selection of
existing quality feature subsets stored in the harmony memory. This forms a contrast
with conventional evolutionary approaches such as GA, which consider only two
(parent) vectors in order to produce a new (child) vector. This characteristic, along
with the simplicity of HS, is exploited in order to improve the robustness and
flexibility of the underlying search mechanism and hence, to help obtain better
quality feature subsets.
The reliance on predefined, constant parameters limits the exploitation of the origi-
nal HS algorithm. It is difficult to determine a good set-up without an ample number
of test runs. Employing the same parameter setting for both initial exploration and
final fine-tuning may also limit the search performance. Furthermore, the original
algorithm is designed to work with single objective optimisation problems, whilst
the problem domain of FS is at least two dimensional (subset size reduction and
evaluation score maximisation).
In order to overcome these drawbacks, a number of modifications to HSFS are
also proposed. Methods are introduced to tune the parameters dynamically: an initial
set-up is used to encourage exploration, and the parameter values then gradually change
during the course of the algorithm. At the end of the search, a different set-up is
prepared for fine-tuning of the final result. In contrast to the fixed-parameter version,
the effort spent in determining good parameter settings is reduced significantly, while
the overall search performance is simultaneously improved. An iterative refinement
strategy is also exploited, which recursively searches for smaller feature subsets while
preserving the evaluation quality of the discovered candidate solutions.
The remainder of this chapter is structured as follows. Section 3.1 introduces
the key notions of HS and its search procedures. Section 3.2 summarises the initial
development made towards HSFS, which is centred on the use of binary string-based
feature subset representation. Section 3.3 describes the proposed HSFS algorithm
that utilises a flexible, integer-valued encoding scheme, allowing the stochastic in-
ternal mechanisms of HS to be better exploited. Section 3.4 details the additional
improvements developed to further enhance the performance of the proposed ap-
proach. Finally, the results of experimental evaluation are reported in Section 3.5,
followed by a summary given in Section 3.6.
3.1 Principles of HS
The original HS algorithm is designed to solve numerical optimisation problems,
and most of its early applications [157] involve discrete-valued variables. When
applied to such problems, musicians typically represent the decision variables of a
given cost function, and HS acts as a meta-heuristic algorithm that attempts to find a
solution vector that optimises this function. In such a search process, each decision
variable (musician) generates a value (musical note) for finding a global optimum
(best harmony). The aim here is to provide a thorough explanation of the algorithm,
including its key notions and iteration steps, so that the proposed HSFS algorithm
may be better introduced thereafter.
3.1.1 Key Notions
The key notions of HS, as illustrated in Fig. 3.1, are musicians, notes, harmonies,
fitness, and harmony memory. In most optimisation problems solvable using HS, the
musicians P = {p_i | i = 1, ..., |P|} represent the variables of the cost function being
optimised, and the values of the variables are referred to as musical notes. A harmony
H, |H| = |P|, is a candidate solution vector containing the values for each variable,
and a collection of good quality solutions is stored in the harmony memory
ℋ = {H^j | j = 1, ..., |ℋ|}. Note that all of the above mentioned collections, P, H and ℋ,
are fixed-sized, ordered lists, rather than sets. In particular, H^j_i, i = 1, ..., |P|,
j = 1, ..., |ℋ|, denotes the value selected by the ith musician in the jth harmony stored
within the harmony memory.
Figure 3.1: Key notions of HS
For a newly constructed (empty) harmony, all of the internal values are initialised
as −, indicating that no musical notes have been assigned. A harmony memory ℋ can
be concretely represented as a two-dimensional matrix. Without losing generality, the
number of rows (harmonies) |ℋ| is a predefined parameter that limits the maximum
number of harmonies to be stored. Each column of the matrix is dedicated to one
musician, which provides a pool of playable notes for future improvisations. In this
thesis, such a pool is referred to as the note domain ℵ_i of a musician p_i:

    ℵ_i = ⋃_{j=1}^{|ℋ|} H^j_i,  H^j ∈ ℋ,  i = 1, ..., |P|    (3.1)
3.1.2 Parameters of HS
The original HS algorithm [155] employs five parameters, including three core
parameters: 1) the size of harmony memory |H|; 2) the harmony memory considering
rate δ; and 3) the maximum number of iterations gmax. There are two optional ones:
1) the pitch adjustment rate ρ; and 2) the adjusting bandwidth that is later developed
into the fret width τ [84]. For numerical optimisation, the number of musicians |P| is
generally implied by the problem itself, and is equal to the number of variables in the
optimisation function. The two factors that influence the actions of a musician, δ
and ρ, are described below, and their effects are explained further in Section 3.1.3.
• Harmony Memory Considering Rate
The harmony memory considering rate δ, 0 ≤ δ ≤ 1, is the rate of choosing one
value from the historical notes stored in the harmony memory, while (1 − δ) is
the rate of randomly selecting one value from the range of all possible values.
If δ is set to a low value, the musicians will focus on exploring other areas of
the solution space, and a high δ will restrict the musicians to historical choices.
• Pitch Adjustment Rate
The pitch adjustment rate ρ, 0 ≤ ρ ≤ 1, causes a musician to select
a value neighbouring its current choice. For example, for a given value v, its
new value will be calculated based on the formula v + (random(−1, 1) × τ).
For discrete variables, this simply means choosing the immediate left or right
neighbouring value. For continuous problems, τ is an arbitrary bandwidth that
constrains the maximum distance allowed when shifting the current value.
(1 − ρ) is the probability of using the chosen value without further alteration.
This pitch adjustment procedure only occurs if the note was chosen from the
harmony memory, i.e., following δ activation.
3.1.3 Iterative Process of HS
HS can be divided into two core phases: initialisation and iteration, as shown in
Fig. 3.2. A simple discrete numerical problem [84] given in Eqn. 3.2 is used here to
illustrate the process of HS.
Minimise (a− 2)2 + (b− 3)4 + (c − 1)2 + 3 (3.2)
Figure 3.2: Iteration steps of HS
where a, b, c ∈ {1, 2, 3, 4, 5}.
1. Initialise Problem Domain
In the beginning, the parameters used in the search need to be established.
This includes |H|, δ, gmax, ρ, τ, and |P|.
According to the problem at hand, the group of musicians {p1, p2, p3} is ini-
tialised with a size equal to the number of variables (|P| = 3), each corresponding
to one of the function variables a, b and c. The harmony memory is filled with
randomly generated solution vectors. In the example problem, three randomly
generated solution vectors may be {2, 2, 1}, {1, 3, 4} and {5, 3, 3}.
2. Improvise New Harmony
A new value is chosen randomly by each musician out of their note domain,
and together form a new harmony. During the improvisation process, the
stochastic events controlled by δ and ρ will also occur, causing the value of
the selected notes to change.
In the example, musician p1 may randomly choose 1 out of ℵ1 = {2, 1, 5}, p2
chooses 2 out of ℵ2 = {2, 3, 3} and p3 chooses 3 out of ℵ3 = {1, 4, 3}, forming
a new harmony {1, 2, 3}. Given the above example, with δ = 0.9 and ρ = 0.1,
musician p1 will choose from within his note domain ℵ1 = {2, 1, 5} with a
probability of 0.9. After making a choice, say 5, the musician will choose the
left or right neighbour with a probability of 0.05 each, and the left neighbouring
value, 4, may then be chosen in the end. Alternatively, the musician may choose
from the range of all possible values, i.e., {1, 2, 3, 4, 5}, with a probability of
0.1, and the note 4 may again be chosen but without further pitch adjustment.
To further ease the understanding of HS, Algorithm 3.1.1 presents an outline
of the improvisation procedure in pseudocode.
p_i ∈ P, i = 1, ..., |P|, group of musicians
H^j ∈ ℋ, j = 1, ..., |ℋ|, harmony memory
H^j_i, the value of the ith variable in H^j
H^new, emerging harmony
ℵ_i = ⋃_{j=1}^{|ℋ|} H^j_i, note domain of p_i
δ, harmony memory considering rate
ρ, pitch adjustment rate
τ, fret width
min_i, max_i, the value range of the ith variable

for i = 1 to |P| do
    random rδ, 0 ≤ rδ ≤ 1
    if rδ < δ then
        random r_i, r_i ∈ ℵ_i
        random rρ, 0 ≤ rρ ≤ 1
        if rρ < ρ then
            random rτ, −1 ≤ rτ ≤ 1
            r_i = r_i + rτ·τ
    else
        random r_i, min_i ≤ r_i ≤ max_i
    H^new_i = r_i
return H^new

Algorithm 3.1.1: Improvisation process of original HS
3. Update Harmony Memory
If the new harmony is better than the worst harmony in the harmony memory
(judged by the objective function), the new harmony is then included in the
resulting harmony memory and the existing worst harmony is removed.
For example, assume the newly improvised harmony {1, 2, 3} has an evaluation
score of 9, making it better than the worst harmony in the harmony memory,
{5, 3, 3}, which has a score of 16. Therefore the harmony {5, 3, 3} is removed
from the harmony memory, and replaced by {1, 2, 3}. If {1, 2, 3} had instead
scored higher than 16, it would be the one being discarded.
4. Iteration
The algorithm continues to iterate until the maximum number of iterations
gmax is reached. In the end, the highest quality solution present in the harmony
memory is returned as the final output.
In the example, if the musicians later improvise a new harmony with values
{2, 3, 1}, which is very likely as these numbers are already in their respective
note domains, the problem will be solved (with a minimal fitness score of 3).
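The complete initialise-improvise-update cycle can be condensed into a short Python sketch, applied here to the example problem of Eqn. 3.2. This is a minimal illustration under stated assumptions (function names, harmony memory size and parameter values are choices of this sketch, not of the thesis).

```python
import random

def harmony_search(cost, domains, hm_size=6, delta=0.9, rho=0.1, gmax=2000,
                   rng=random.Random(7)):
    """A minimal HS sketch for discrete minimisation problems such as Eqn. 3.2.
    domains[i] lists the playable values for musician i."""
    # Initialisation: fill the harmony memory with random solution vectors
    hm = [[rng.choice(d) for d in domains] for _ in range(hm_size)]
    for _ in range(gmax):
        new = []
        for i, d in enumerate(domains):
            if rng.random() < delta:                  # memory consideration
                v = rng.choice([h[i] for h in hm])    # note domain of musician i
                if rng.random() < rho:                # pitch adjustment
                    k = d.index(v) + rng.choice((-1, 1))
                    v = d[max(0, min(k, len(d) - 1))]
            else:                                     # random selection
                v = rng.choice(d)
            new.append(v)
        worst = max(hm, key=cost)
        if cost(new) < cost(worst):                   # minimisation: lower is better
            hm[hm.index(worst)] = new
    return min(hm, key=cost)

# Eqn. 3.2: minimise (a-2)^2 + (b-3)^4 + (c-1)^2 + 3, with a, b, c in {1,...,5}
cost = lambda h: (h[0] - 2) ** 2 + (h[1] - 3) ** 4 + (h[2] - 1) ** 2 + 3
best = harmony_search(cost, [[1, 2, 3, 4, 5]] * 3)
```

Note that the pitch adjustment is only attempted inside the memory-consideration branch, matching Algorithm 3.1.1.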
3.2 Initial Development
This section describes the preliminary investigations [60] carried out, which explore
the feasibility of applying HS to the problem domain of FS. This initial HS-based FS
approach, being a stand-alone and functional application of HS, helped substantially
in obtaining a better understanding of the internal mechanisms of HS, and its
application to the problem domain of FS. It also revealed a number of drawbacks that
inspired further development, from which the current HSFS algorithm is derived.
3.2.1 Binary-Valued Representation
A binary-valued feature subset representation has been adopted in the initial ap-
proach, which is also the most commonly used representation in the literature. Recall
from Section 3.1.1 that the key notions of HS are musicians, notes, harmonies and
harmony memory. The binary-valued approach maps musicians directly onto the
available features to be selected, i.e., |P| = |A|. The note domain ℵ_i of a given
musician p_i contains only the binary values ℵ_i = {0, 1}, which indicate whether the
corresponding feature is included (1) or not (0) in the emerging feature subset.
A harmony is represented as a series of bits that encodes the selected fea-
tures. For example, as shown in Table 3.1, for a given data set with 6 features
A = {a1, a2, a3, a4, a5, a6}, harmony H1 = {0, 1, 1, 0, 0, 0} translates into feature sub-
set B_H1 = {a2, a3}. The binary encoding of feature subsets is a straightforward
mapping. It allows the procedures of HS, initialisation and iteration, as illustrated in
Fig. 3.2, to be executed in the same fashion as that of standard numerical optimisation
tasks.

Table 3.1: Binary encoded feature subsets

         p1  p2  p3  p4  p5    p6    Represented subset B
    H1   0   1   1   0   0     0     {a2, a3}
    H2   1   0   0   0   0→1   1     {a1, a5, a6}
3.2.2 Iteration Steps
The initialisation step involves filling the harmony memory with randomly generated
feature subsets, i.e., random-valued strings of bits. In order to improvise a new
harmony, each musician randomly selects a value from their respective note domain.
Together, such selected values form a new bit set. This set is then translated back
into a feature subset and evaluated. If the evaluation score is higher than that of the
worst feature subset in the harmony memory, it replaces that worst candidate feature subset;
otherwise, the new bit set is discarded. The process repeats until the maximum
number of iterations gmax has been reached.
In this approach, the harmony memory considering rate δ has little practical
impact, because the number of available notes (0 and 1) for each musician is
very limited. Its most significant use is to flip the bit value, thereby including
a previously unselected feature, or vice versa. Hence, in this initial
development, the parameter δ is simply implemented as the bit flipping rate, and
its effect is demonstrated by the second harmony H2 in Table 3.1: 0→1 signifies a
forced value change due to δ activation, which causes the affected musician p5 to
change its decision to the opposite value.
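The binary improvisation step and the translation back to a feature subset can be sketched as follows. This is an illustrative reconstruction of the initial binary-valued approach; the function names and the flip-rate parameter name are assumptions.

```python
import random

def improvise_binary(hm, flip_rate=0.1, rng=random.Random(3)):
    """One improvisation step of the initial binary-valued HSFS: each musician
    draws a bit from its note domain, then may flip it (the bit-flipping
    reinterpretation of delta, cf. harmony H2 in Table 3.1)."""
    n = len(hm[0])
    bits = []
    for i in range(n):
        b = rng.choice([h[i] for h in hm])  # note domain: column i of the memory
        if rng.random() < flip_rate:
            b = 1 - b                       # forced value change, e.g. 0 -> 1
        bits.append(b)
    return bits

def decode(bits, features):
    """Translate a bit string back into a feature subset,
    e.g. [0,1,1,0,0,0] -> {a2, a3}."""
    return {a for a, bit in zip(features, bits) if bit == 1}
```

The decoded subset would then be evaluated, replacing the worst harmony in the memory if its score is higher.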
3.2.3 Tunable Parameters
The three tunable parameters of this initial approach are: 1) the harmony memory
size |H|, 2) bit flipping rate δ, and 3) the maximum number of iterations gmax. The
harmony memory size is a sensitive parameter; in most cases it is set to between half
of the total number of features and the total number of features, leaving less than
half of the features outside the harmony memory. A large harmony memory will give
each musician more musical notes to choose from when improvising a new harmony.
However, it will require a longer initialisation period in order to fill up the harmony
memory and hence, may lead to slower updates and convergence.
The initial development of HSFS using binary-valued feature subset represen-
tation is an intuitive adoption of the HS algorithm in the FS domain. It is simple
to implement and shares a number of commonalities with other nature-inspired
approaches. However, it has a number of obvious shortcomings:
• Having as many musicians as there are features suggests potential scaling
problems for data sets with large numbers of features.
• The binary note domain gives each musician very limited choices when com-
posing new harmonies, and the pitch adjustment opportunities are also wasted
for binary choices.
• The approach requires a substantial number of iterations in order to reach
convergence. Together, these issues prevent HS from reaching its full potential.
3.3 Algorithm for HSFS
Although simple in concept, the use of a binary-valued note domain limits the efficiency
and explorative potential of HS. To better address these problems, an integer-valued
HSFS algorithm has been developed [62], providing more freedom in the choice
of playable notes, and allowing the stochastic mechanisms of HS to be exploited
more thoroughly. In this section, a description of HS-based FS is given, explaining
how FS problems can be translated into optimisation problems that are then solved by HS.
This section includes illustrative examples of the encoding scheme used to convert
feature subsets into harmony representations. A flow diagram of the search process
is also presented in Fig. 3.4 along with step by step descriptions using FRFS as an
example subset evaluator.
3.3.1 Mapping of Key Notions
For conventional optimisation problems, the number of variables is pre-determined
by the function to be optimised. However, for FS, there is no fixed number of elements
in any potential candidate feature subset. In fact, the size of the emerging subset
itself should be reduced in parallel with the optimisation of the subset evaluation score.
Therefore, when converting concepts, such as those shown in Table 3.2, a musician
is best described as an independent expert or “feature selector”, where the available
features for the feature selectors translate to musical notes for musicians. Each
musician may vote for one feature to be included in the feature subset when such an
emerging subset is being improvised. The harmony is then the combined vote from
all musicians, indicating which features are being nominated.
Table 3.2: Concept mapping from HS to FS
HS                  Optimisation        FS
Musician            Variable            Feature Selector
Musical Note        Variable Value      Feature
Harmony             Solution Vector     Subset
Harmony Memory      Solution Storage    Subset Storage
Harmony Evaluation  Fitness Function    Subset Evaluation
Optimal Harmony     Optimal Solution    Optimal Subset
The entire pool of the original features, A, forms the range of musical notes
available to each of the musicians. Multiple musicians are allowed to choose the
same feature, and they may opt to choose none at all. The fitness function employed
becomes a feature subset evaluation method [52, 93, 126], such as those described
in Section 2.1.1.2, which analyses and scores each of the new subsets found during
the search process. Fig. 3.3 illustrates the important concepts in the same style as
that of Fig. 3.1.
Table 3.3 depicts three example harmonies. H1 denotes a subset
of 6 distinct features: B_H1 = {a1, a2, a3, a4, a7, a10}. H2 shows a duplication of
choices among the first three musicians, and a discarded note (represented by −) from
p6, representing a reduced subset B_H2 = {a2, a3, a13}. H3 signifies the feature subset
B_H3 = {a2, a4, a6, a13}, where a3→a6 indicates that p4 originally voted for a3, but
was forced to change its choice to a6 due to δ activation. For simplicity, the explicit
encoding/decoding process between a given harmony H_j and its associated feature
subset B_Hj is omitted in the following explanation.
For conventional optimisation problems, the range of possible note choices for
each musician is in general different from those for the other musicians. However,
Figure 3.3: Key notions of HSFS
Table 3.3: Feature subsets encoded using integer-valued scheme
      p1   p2   p3   p4      p5    p6    Represented subset B
H1    a2   a1   a3   a4      a7    a10   {a1, a2, a3, a4, a7, a10}
H2    a2   a2   a2   a3      a13   −     {a2, a3, a13}
H3    a2   −    a2   a3→a6   a13   a4    {a2, a4, a6, a13}
when applied to FS, all musicians jointly share one single value range, which is the
set of all features.
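The decoding of an integer-valued harmony into its represented subset can be sketched as follows (an illustrative fragment, using feature indices in place of the labels a1, a2, . . . of Table 3.3):

```python
def decode_harmony(harmony):
    """Decode an integer-valued harmony into the feature subset it
    represents: duplicate votes collapse into one feature, and discarded
    notes (None, shown as "-" in Table 3.3) are dropped."""
    return sorted({note for note in harmony if note is not None})

# H2 of Table 3.3: the votes (a2, a2, a2, a3, a13, -) represent {a2, a3, a13}
assert decode_harmony([2, 2, 2, 3, 13, None]) == [2, 3, 13]
```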
3.3.2 Work Flow of HSFS
The iteration steps of HSFS are demonstrated as follows, where the fuzzy-rough
dependency function of FRFS [126] is employed as the subset evaluator. The
accompanying flow diagrams are given in Figs. 3.4 and 3.5, which are, in principle,
straightforward adaptations of the original HS concepts (Figs. 3.1 and 3.2). As
previously explained in Section 2.1.1.2, FRFS is concerned with the reduction of
information or decision systems through the use of fuzzy-rough sets. It is used here
to provide a concrete example of the work flow of HSFS. Recall from Section 2.1.1.2
that the original FRFS method employs a greedy hill-climbing (HC) based algorithm
termed fuzzy-rough QuickReduct [116], which extends the original crisp version
[39]. The fuzzy-rough dependency function is utilised in order to identify a minimal
fuzzy-rough reduct (a feature subset achieving full dependency evaluation).
In contrast to stochastic methods such as HSFS, greedy search mechanisms such
as fuzzy-rough QuickReduct add a feature to the current candidate subset at each
Figure 3.4: Work flow of HSFS
iteration. Although generally quick to converge, fuzzy-rough QuickReduct considers
each feature only in terms of the increase in quality that results from adding it to
the current candidate subset. It therefore disregards the potential contribution of
pair-wise or group-wise feature interactions. As a result of this greedy behaviour,
the possibility of identifying groups of features that collectively form a more
informative feature subset is significantly reduced. QuickReduct and other
deterministic HC algorithms are therefore prone to returning sub-optimal feature
subsets.
1. Initialise Problem Domain
The parameters are assigned according to the problem domain, including: |H|, the number of feature selectors |P|, the maximum number of iterations gmax, and δ.
The subset storage containing |H| randomly generated subsets is then initialised.
This provides each feature selector with a note domain of |H| features, which
may include identical choices, and nulls.
2. Improvise New Subset
Figure 3.5: Improvisation process of HSFS
A new feature is chosen randomly by each feature selector out of its working
feature domain, and together these choices form a new feature subset. In the event
of δ activation, a random feature is chosen from all available features to
substitute the feature selector’s own choice.
For the FS problems dealt with in this thesis, the pitch adjustment rate ρ
is not used. The underlying motivation for the use of ρ is that minor adjustments
to neighbouring values may help discover better solutions, which is generally true
for real-valued optimisation functions. However, as the values are now feature
indices, a feature and its “neighbours” bear no such general relation, and pitch
adjustment would merely result in a change to a possibly unrelated nearby feature.
Note that measures such as correlation [93] and fuzzy-rough dependency [126] may
be utilised to facilitate effective identification of actual neighbouring features and
therefore allow the pitch adjustment mechanism to be exploited.
3. Update Subset Storage
If the newly obtained subset achieves a better fuzzy-rough dependency score
than the worst subset in the subset storage, the new subset is included in the
subset storage and the existing worst subset is removed. The comparison of
subsets takes into consideration both the dependency score and the subset size, in
order to discover the minimal fuzzy-rough reduct at termination.
4. Iteration
The improvisation and comparative update procedure continues until a prede-
fined maximum number of iterations gmax is reached. The final output is the
feature subset with the highest quality, out of those stored within the harmony
memory at termination.
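The four steps above may be sketched in Python as follows. This is a minimal, hypothetical rendering, not the thesis implementation: `evaluate` stands in for any subset evaluator (e.g. the fuzzy-rough dependency function), all parameter values are placeholders, and δ is interpreted as the harmony memory considering rate, so a stored note is reused with probability δ and a random feature is substituted otherwise:

```python
import random

def hsfs(features, evaluate, mem_size=10, n_selectors=5, g_max=200,
         delta=0.8, rng=random):
    """Minimal sketch of the HSFS loop: improvise one subset per
    iteration and replace the worst stored subset whenever the newly
    improvised one is better."""
    domain = list(features) + [None]          # None = discarded note
    memory = [[rng.choice(domain) for _ in range(n_selectors)]
              for _ in range(mem_size)]       # random initialisation

    def decode(harmony):
        return frozenset(v for v in harmony if v is not None)

    def fitness(harmony):
        return evaluate(decode(harmony))

    for _ in range(g_max):
        # reuse a stored note with probability delta, else pick randomly
        new = [rng.choice(memory)[i] if rng.random() < delta
               else rng.choice(domain) for i in range(n_selectors)]
        worst = min(memory, key=fitness)
        if fitness(new) > fitness(worst):     # comparative update
            memory[memory.index(worst)] = new
    return decode(max(memory, key=fitness))
```

With a trivial evaluator such as `len`, the loop drives the stored harmonies towards larger subsets; a real evaluator would balance quality against size as described in step 3.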
HSFS offers a clear advantage in that a group of features is evaluated as a whole.
A newly improvised subset is not necessarily included in the subset storage simply
because one of its features has a locally strong fuzzy-rough dependency score (or a
very high individual importance, regardless of the evaluator employed). This is the
key distinction from the HC-based approaches, and it also allows a good
synergy between HSFS and group-based feature subset evaluators [52, 93, 126].
3.3.3 Complexity Analysis
Following the study in [84], consider a HS process with the following parameters:
the size of the harmony memory (the number of harmonies stored) |H|, the number of
musicians |P|, the number of possible notes (the total number of features) of a
musician |A|, the number of optimal notes (features) of musician pi present in the
harmony memory |ℵi| (|ℵi| ≤ |H|), and the harmony memory considering rate δ. The
probability of finding the optimal harmony, Prob(H), is defined as follows:

Prob(H) = ∏_{i=1}^{|P|} ( δ · |ℵi| / |H| + (1 − δ) · 1 / |A| )    (3.3)

where ρ is not considered because it is an optional operator [84].
Initially, since the harmony memory is populated with random harmonies, there
may not be any optimal note (for any musician) in the harmony memory:

|ℵ1| = |ℵ2| = · · · = |ℵ|P|| = 0    (3.4)
and
Prob(H) = ( (1 − δ) · 1 / |A| )^|P|    (3.5)
which means that the probability Prob(H) is very low. However, as the improvisation
process continues, new feature subsets with better evaluation scores than those
generated randomly may be identified, and thus added to the harmony memory.
The number of optimal notes of musician pi in the harmony memory, |ℵi|, can thus
increase on an iteration-by-iteration basis. Consequently, the probability of finding
the optimal harmony, Prob(H), increases over the course of time.
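The behaviour of Eqns. 3.3 and 3.5 can be illustrated numerically (the parameter values below are hypothetical, chosen purely for illustration):

```python
# Illustrative (hypothetical) parameter values: |P| = 5 musicians,
# |A| = 20 features, |H| = 10 stored harmonies, delta = 0.9.
P, A, H, delta = 5, 20, 10, 0.9

def prob_optimal(optimal_counts):
    """Eqn. 3.3: the product over all musicians of
    delta * |aleph_i| / |H| + (1 - delta) * 1 / |A|."""
    prob = 1.0
    for n_i in optimal_counts:
        prob *= delta * n_i / H + (1 - delta) / A
    return prob

p_start = prob_optimal([0] * P)   # Eqn. 3.5: no optimal notes stored yet
p_later = prob_optimal([4] * P)   # after some optimal notes accumulate
assert p_later > p_start          # Prob(H) grows as the memory improves
```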
Using fuzzy-rough QuickReduct (Section 2.1.1.2) as a comparative example, for
a given data set of |A| features, the worst case will result in (|A|² + |A|)/2
evaluations of the dependency function. When implemented using HSFS, the number
of subset evaluations is the same as the maximum number of iterations gmax (which
is no longer purely dependent on the number of features in the original data). This
characteristic makes HS more favourable when solving complex problems with large
numbers of features.
As for the complexity of the HS algorithm itself: the initialisation requires
O(|P| × |H|) operations to randomly populate the subset storage, and the improvisation
process is of the order O(|P| × gmax), because every feature selector needs to produce
a new feature at every iteration. Here |H| is the subset storage size, |P| is the number
of feature selectors, and gmax is the maximum number of iterations. When comparing
the storage requirements, HSFS clearly requires more storage, as it needs to keep
O(|P| × |H|) features in the subset storage, while HC only works on the current
candidate solution, therefore requiring only O(|A|) storage space.
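The contrast in evaluation counts can be made concrete with a small calculation (the HSFS iteration budget below is illustrative only):

```python
# Worst-case number of subset evaluations: greedy QuickReduct performs
# (|A|^2 + |A|)/2 dependency-function evaluations, whereas HSFS performs
# exactly one subset evaluation per iteration, i.e. g_max in total.
def quickreduct_evals(n_features):
    return (n_features ** 2 + n_features) // 2

# For |A| = 279 (as in the arrhy data set), the greedy worst case is
# 39,060 evaluations, while an HSFS run with an illustrative budget of
# g_max = 2,000 iterations costs 2,000 evaluations regardless of |A|.
```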
Although the two types of approach are analysed here for the sake of comparison,
in reality, FS is used for dimensionality reduction prior to the involvement of a given
application, which will exploit the features belonging to the resultant feature subset.
Thus, this operation has no negative impact upon the run-time efficiency of any
subsequent process that utilises the selected features.
3.4 Additional Improvements
Traditional HS uses fixed, predefined parameters throughout the entire search process,
making it hard to determine a “good” setting without extensive trial runs. The
parameters are also not independent of each other; therefore, finding a good
setting often becomes an optimisation problem in itself. The search results usually
provide no hint as to how the parameters should be adjusted in order to obtain an
increase in performance. This section introduces the proposed improvements to
the HSFS algorithm, making it a more flexible approach better suited to solving FS
problems of high dimensionality.
3.4.1 Parameter Control
To eliminate the drawbacks associated with the use of fixed parameter values, a
dynamic parameter adjustment scheme is proposed [59], in order to guide the
modification of parameter values at run-time. By using tailored sets of parameter
values for the initialisation, intermediate and termination stages, the search process
can benefit greatly from this dynamic parameter environment.
At the beginning of a search, as the musicians are just starting to explore the
solution space, the note domains contain only randomly initialised, low quality notes.
Therefore, a large harmony memory is not essential. In fact, having to maintain a
large pool of sub-optimal harmonies may only confuse the musicians, preventing
them from choosing good values during improvisation. A lower δ at this stage may
also encourage the musicians to seek values outside of the current harmony memory.
As the search approaches convergence, the musicians will usually have found
many sub-optimal harmonies. For such cases, given a high δ, they will almost exclu-
sively choose values from the harmony memory when improvising new harmonies.
Thus, a large pool of good results may contribute to a better solution. Of course,
situations can also occur where the algorithm has not converged by the end of the
search, which could be caused by the complexity of the problem itself, or by a
smaller-than-desired number of iterations. From the above observations, a good
dynamic |H| can be defined as:

|H|_g = |H|_min + (g / gmax) · (|H|_max − |H|_min)    (3.6)
When improvising new harmonies, HS with a low δ focuses less on the consideration
of historical values, and instead more on the entire value range. HS with a
high δ attempts to produce a new harmony out of the existing values stored within the
harmony memory. A dynamic δ that increases its value as the search progresses can
be formulated such that:

δ_g = δ_min + (g / gmax) · (δ_max − δ_min)    (3.7)
Because one of the main advantages of HS is its simplistic structure, the specification
of these rules is designed with computational complexity in mind: the calculation
involved is kept as simple as possible. Alternatively, smooth exponentially increasing
functions may be considered:

|H|_g = |H|_min + (2^(g/gmax) − 1) · (|H|_max − |H|_min)    (3.8)

δ_g = δ_min + (2^(g/gmax) − 1) · (δ_max − δ_min)    (3.9)
Although an exponentially decreasing function has been proposed to control the fret
width [173], for the general scenario of the FS problem discussed in this chapter, it
is counter-intuitive to suggest that the |H| and δ parameters should be adjusted more
aggressively in any one search stage than in the others.
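The two families of schedules can be sketched as follows (the exponential form follows the reconstruction of Eqns. 3.8 and 3.9 given above, where 2^(g/gmax) − 1, like g/gmax, runs from 0 at the start of the search to 1 at termination):

```python
def linear_schedule(g, g_max, v_min, v_max):
    """Linear growth from v_min to v_max over g_max iterations,
    as in Eqns. 3.6 and 3.7."""
    return v_min + (g / g_max) * (v_max - v_min)

def exponential_schedule(g, g_max, v_min, v_max):
    """Smooth exponential alternative (Eqns. 3.8 and 3.9 as
    reconstructed here): 2**(g/g_max) - 1 also runs from 0 to 1."""
    return v_min + (2 ** (g / g_max) - 1) * (v_max - v_min)
```

Both schedules agree at the endpoints; the exponential variant simply grows more slowly in the early iterations and faster near termination.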
All of the aforementioned individual parameter adjustment strategies can be combined
for greater performance gain, allowing different sets of parameter settings
for different search stages, as summarised in Table 3.4. Following initialisation,
the algorithm employs a small harmony memory, with a large chance of randomly
selecting new values. Towards the intermediate stage, the algorithm uses a medium-sized
harmony memory, with a balanced possibility between choosing values from
the harmony memory and from the range of all possible values. Finally, towards the
termination of the process, the algorithm utilises a large harmony memory, with
the values chosen almost purely from the stored good solutions. Note that these stages
are listed here for conceptual explanation purposes; there are no clear boundaries
between them in terms of implementation. Parameter settings gradually shift from
one stage to another as the search progresses.
To further justify these intuitive rules, in Section 3.5.2.1, results are gathered
and compared against the original algorithm with no parameter adjustments, as well
as against the algorithm using the opposite rules, whereby |H| and δ decrease from a
maximum to a minimum value over the search iterations.
Table 3.4: Parameter settings in different search stages
        Initialisation      Intermediate            Termination
δ       Small               Medium                  Large
|H|     Small               Medium                  Large
Effect  High diversity,     Steady improvement      Fine tuning,
        deep exploration    in harmonies            fast convergence
3.4.2 Iterative Refinement
The parameter control technique previously introduced in Section 3.4.1 offers ways to
dynamically change HS parameters, in order to avoid some of the difficulties in finding
a good set of parameters. However, there is an additional parameter introduced in the
HSFS approach: the number of feature selectors |P|. For conventional optimisation
problems, |P| is equal to the number of variables in the optimisation function, which is
predefined and restricts the number of columns in the subset storage. Due to the
concept mapping in HSFS, the number of function variables is transformed into a
virtual concept, and |P| now serves as a hard upper bound for the resulting subset
size, which needs to be defined by the user.
Intuitively, |P| should be equal to the actual limit: the total number of features. Yet
such a configuration often leads to less satisfactory results. This is because the current
structure of HS only supports single-objective optimisation; additional measures are
required in order to enforce size reduction after HS converges in terms of the subset
evaluation score. An alternative method would be to manually initialise |P| to a
smaller value, in order to force HSFS to find solutions within the restricted boundary.
However, such an approach introduces subjective assumptions prior to the search,
and it is often difficult to estimate the amount of redundancy present in any given
data set.
In order to combat this issue, an iterative refinement approach is proposed here,
such that the search process becomes more data-driven and the need for manual
parameter configuration is further reduced. As shown in Algorithm 3.4.1, the
refinement process essentially performs HSFS iteratively, each time with a reduced
feature selector count |P|. If a better or smaller subset is discovered in the previous
iteration, the number of feature selectors is set to be equal to this subset’s size. Since
the best evaluation score achieved so far is recorded, HSFS can safely explore
alternative solution regions. This refinement procedure continues until the latest
feature subset B no longer provides any improvement in either subset quality or size,
i.e., when the condition in Eqn. 3.10 holds, where B* denotes the best feature subset
discovered during the search. In the end, B* is returned.

f(B) < f(B*) ∨ ( f(B) == f(B*) ∧ |B| == |B*| )    (3.10)

1   A, set of all conditional features
2   B, set of selected features
3   B*, current best feature subset
4   B* = A
5   |P| = |A|; B = HSFS(A, |P|)
6   while f(B) ≥ f(B*) do
7       if f(B) == f(B*) ∧ |B| == |B*| then
8           break
9       else
10          B* = B
11          |P| = |B*|; B = HSFS(A, |P|)
12  return B*
Algorithm 3.4.1: Iterative refinement procedure
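The refinement loop may be sketched as follows (an illustrative rendering: `hsfs_search(A, P)` is a hypothetical stand-in for one complete HSFS run with |P| feature selectors, and `evaluate` for the subset evaluator):

```python
def iterative_refinement(features, hsfs_search, evaluate):
    """Sketch of Algorithm 3.4.1: run HSFS repeatedly, shrinking the
    feature selector count |P| to the size of the best subset found,
    until neither quality nor size improves."""
    best = set(features)                    # B* is initialised to A
    current = hsfs_search(features, len(features))
    while evaluate(current) >= evaluate(best):
        if (evaluate(current) == evaluate(best)
                and len(current) == len(best)):
            break                           # no further improvement
        best = current                      # B* = B
        current = hsfs_search(features, len(best))
    return best
```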
Note that for very high dimensional data sets, the musician count may instead be
configured via binary search, as shown in Algorithm 3.4.2. The two parameters |P|min
and |P|max define the search interval, which is narrowed iteratively in a divide-and-conquer
fashion. The aim is to determine the most suitable value of |P|, which is
used to obtain a good quality and compact feature subset.

1   |P|max = |A|
2   |P|min = 1
3   B* = A
4   while |P|min < |P|max do
5       |P| = ⌊(|P|min + |P|max) / 2⌋
6       B = HSFS(A, |P|)
7       if f(B) > f(B*) ∨ ( f(B) == f(B*) ∧ |B| < |B*| ) then
8           B* = B; |P|max = |B|
9       else
10          |P|min = |B|
11  return B*
Algorithm 3.4.2: Musician size adjustment via binary search
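A sketch of the binary search over the musician count is given below. Two details are made explicit that the pseudocode leaves implicit (both are assumptions of this sketch, not taken from the thesis): the incumbent update `best = subset`, and a forced advance of the lower bound so that the loop cannot stall when HSFS keeps returning subsets of the same size:

```python
def refine_by_binary_search(features, hsfs_search, evaluate):
    """Sketch of Algorithm 3.4.2: binary-search the musician count |P|
    between 1 and |A|, keeping the best subset seen so far.
    `hsfs_search(A, P)` stands in for one complete HSFS run."""
    p_min, p_max = 1, len(features)
    best = set(features)                           # B* starts as A
    while p_min < p_max:
        p = (p_min + p_max) // 2
        subset = hsfs_search(features, p)
        if (evaluate(subset) > evaluate(best)
                or (evaluate(subset) == evaluate(best)
                    and len(subset) < len(best))):
            best = subset                          # record the incumbent
            p_max = len(subset)                    # search smaller |P|
        else:
            p_min = max(len(subset), p_min + 1)    # force progress
    return best
```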
3.5 Experimentation and Discussion
In this section, the results of a series of experimental evaluations are reported, in
order to demonstrate the capabilities of the proposed HSFS approach. The focus
of the study lies with parameter-controlled HSFS with iterative refinement, which
embeds two of the more mature modifications previously discussed in Section 3.4.
A systematic comparison against a selection of nature-inspired global optimisation
heuristics (as reviewed in Section 2.2.2) is provided in Sections 3.5.1.1 and 3.5.1.2,
which reveals the competitive performance of HSFS. In addition to the comparative
studies against the aforementioned approaches, further experiments are carried
out in order to demonstrate the characteristics of HS. Section 3.5.2.1 reveals the
differences in results when different parameter control rules are employed. This
empirically demonstrates that the recommended set of rules as presented in Section
3.4.1 is both intuitively sound and practically effective. A comparison against the
original HS algorithm is provided in Section 3.5.2.2, in order to show the effect of
the proposed enhancements.
The main part of the experimentation is carried out using three filter evaluators:
CFS [93], PCFS [52], and FRFS [126], which differ in terms of computational
complexity and characteristics. For instance, CFS is the most lightweight method. It
addresses the problem of FS through a correlation-based approach, and identifies
features that are highly correlated with the class, yet uncorrelated with each other
[93]. PCFS is an FS approach that attempts to identify a group of features that
are inconsistent, and removes irrelevant features in the process [52]. FRFS [126],
similar to most rough set-based methods, exploits fuzzy-rough set notions such as
the lower and upper approximations of a given concept, and is able to identify
very compact subsets of features that can fully discern the training objects into
their respective classes. Note that FRFS is relatively high in terms of computational
complexity, and finding the minimal-sized solution of full discernibility (a minimal
fuzzy-rough reduct) remains a significant research challenge. Further detail regarding the
employed evaluators may be found in Section 2.1.1.2.
In total, 12 real-valued UCI benchmark data sets [78] are used, in order to demonstrate
the capabilities of HSFS and of the nature-inspired FS approaches reviewed
in Section 2.2.2. Several data sets are of high dimensionality and hence present
reasonable challenges for FS. Lower dimensional problems (e.g., cleve and heart)
are also included, to examine whether the tested algorithms can identify the best feature
subsets. Table 3.5 provides a summary of these data sets. In order to ensure
convergence for the more complex data sets, a large number of iterations is uniformly
chosen. The remaining parameters, such as the population size, are also configured
to be comparable with each other.
The classification algorithms adopted in the experiments include two commonly
employed techniques: 1) the tree-based C4.5 algorithm [264] which uses entropy
to identify the most informative feature at each level, in order to split the training
samples according to their respective classes; and 2) the probabilistic Bayesian
classifier with naïve independence assumptions (NB) [132]. C4.5 is optimistic by
nature; it is an unstable classifier, i.e., it may over- or under-perform for specific
training folds, but it is unbiased towards any class or object, whilst NB is pessimistic
(stable but biased). Obtaining the contrasting views of two different classifiers helps
to provide a more comprehensive understanding of the quality of the selected
feature subsets.
Table 3.5: Data set information
Data set   Features   Instances   Classes   C4.5    NB

arrhy      279        452         16        65.97   61.40
cleve      14         297         5         51.89   55.36
handw      256        1593        10        75.74   86.21
heart      14         270         2         77.56   84.00
ionos      35         230         2         86.22   83.57
libra      91         360         15        68.24   63.63
multi      650        2000        10        94.54   95.30
ozone      73         2534        2         92.70   67.66
secom      591        1567        2         89.56   30.04
sonar      60         208         2         73.59   67.85
water      39         390         3         81.08   85.40
wavef      41         699         2         75.49   79.99
Stratified 10-fold cross-validation (10-FCV) is employed for data validation, where
a given data set is partitioned into 10 subsets. Of these 10 subsets, nine are used to
form one training fold. The FS methods are employed to identify quality subsets,
which are then used to build classification models. A single subset is retained as the
testing data, so that the built classifiers can be compared using the same unseen
data. This process is then repeated 10 times (the number of folds). The advantage
of 10-FCV over random sub-sampling is that all objects are used for both training
and testing, and each object is tested only once per fold.
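The stratified partitioning underlying this scheme can be sketched as follows (a minimal illustration of the idea, not the thesis’s implementation):

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign object indices to k folds so that every class label is
    spread as evenly as possible across the folds (stratified k-FCV)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    counter = 0
    for indices in by_class.values():
        for idx in indices:            # deal each class out round-robin
            folds[counter % k].append(idx)
            counter += 1
    return folds
```

Each of the k folds then serves once as the test set, with the remaining k − 1 folds forming the training data.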
The stratification of the data prior to its division into different folds ensures that
each class label has equal representation in all folds, thereby helping to alleviate
bias/variance problems [18]. In the experiments, unless stated otherwise, 10-FCV is
executed 10 times (10×10-FCV) in order to reveal the impact of the stochastic nature
of the approaches employed. The differences in performance of the various methods
are statistically compared using paired t-test with two-tailed P = 0.01. Note that
since 10×10-FCV is imposed, each of the figures displayed in the following tables is
an averaged result of 100 search outputs (per data set, per algorithm). The searches
are carried out using the same folds of the data each time, so that their results (and
the final averaged figures) are directly comparable.
3.5.1 Evaluation of HSFS
Following the evaluation procedures described previously, the evaluation scores and
sizes of the feature subsets identified by HSFS and the 9 other nature-inspired FS
algorithms are detailed in Section 3.5.1.1. These feature subsets are further used
to train classifier learners, in order to further reveal the characteristics of the tested
algorithms. Since the three evaluators employed to judge subset quality are
all filter-based approaches, it should be noted that the predictive accuracies of the
subsequent classifiers are not part of the goals to be optimised by the respective
search methods.
3.5.1.1 FS Results
Tables 3.6 to 3.8 detail the results collected with three different subset evaluators and
all of the reviewed search algorithms, including HSFS. As the time complexity of a
subset evaluation using CFS is very low, the maximum number of search iterations/generations
for this set of experiments is set to a very large value (gmax = 50,000), in order to
allow all algorithms to fully converge. Based on the figures shown in Table 3.6,
GA, HS, and MA deliver very similar results, and work well for lower dimensional
data sets such as cleve, heart, ionos, and water. Algorithms such as SA and TS
demonstrate very good performance for the most complex data sets: arrhy, multi,
and secom. ABC, ACO, and FF are not particularly competitive in identifying feature
subsets with the highest evaluation scores. However, they are relatively good at
producing very compact feature subsets, with acceptable evaluation quality.
For results obtained with PCFS given in Table 3.7, TS fails to find subsets with
the best evaluation scores for most of the data sets, which forms a sharp contrast to
Table 3.6: FS results using CFS, showing both the evaluation scores (left) and the sizes (right) of the selected feature subsets. Bold figures indicate the highest evaluation scores; shaded cells signify the overall most compact best solutions

         ABC           ACO           CSA           FF            GA
arrhy    0.349  10.5   0.369  15.5   0.467  24.6   0.394  27.7   0.263  61.4
cleve    0.271   5.4   0.239   6.4   0.271   5.4   0.269   5.0   0.274   6.7
handw    0.435  92.0   0.444  31.5   0.527  78.0   0.484  58     0.513  97.9
heart    0.337   5.3   0.313   6.3   0.337   5.3   0.334   5.0   0.338   6.4
ionos    0.523   9.6   0.511   9.5   0.538   9.4   0.528   9.2   0.539  10.4
libra    0.577  23.5   0.570  15.6   0.610  23.0   0.589  21.7   0.611  31.4
multi    0.824  235    0.802  26.9   0.923  106    0.836  88.4   0.879  258
ozone    0.100  14.0   0.102  10.7   0.112  14.0   0.106  13.0   0.114  21.9
secom    0.024   2.9   0.085  14.2   0.101  14.5   0.045  15.7   0.008  97.0
sonar    0.330  11.4   0.316  17.7   0.360  16.7   0.339  12.5   0.360  17.7
water    0.417   8.2   0.387  10.3   0.426  10.4   0.416   7.5   0.426  10.5
wavef    0.366  11.3   0.361  14.5   0.383  13.1   0.364  10.5   0.384  14.9

         HS            MA            PSO           SA            TS
arrhy    0.466  26.8   0.275  58.4   0.280  16.4   0.466  23.5   0.467  25.1
cleve    0.274   6.7   0.274   6.7   0.274   6.7   0.266   4.4   0.274   6.6
handw    0.525  93.7   0.467  120    0.473  122    0.526  68.2   0.527  82.3
heart    0.338   6.4   0.338   6.4   0.338   6.4   0.333   4.7   0.335   5.3
ionos    0.539  10.3   0.539  10.4   0.535  10.9   0.536   8.4   0.537   9.3
libra    0.611  28.7   0.607  33.7   0.595  26.6   0.607  19.1   0.611  25.0
multi    0.919  141    0.836  318    0.849  357    0.926  76.0   0.926  80.0
ozone    0.114  21.4   0.113  23.3   0.110  25.3   0.107   9.5   0.114  19.8
secom    0.064  22.5   0.025  95.6   0.018  20.8   0.101  14.1   0.100  14.3
sonar    0.360  17.7   0.360  17.6   0.329  11.9   0.358  15.4   0.359  16.6
water    0.426  10.5   0.426  10.5   0.417   8.9   0.423   8.4   0.424   9.2
wavef    0.384  14.9   0.384  14.9   0.372  12.4   0.382  12.6   0.384  14.9
its strong performance in the previous set of experiments. However, it still identifies
the best solutions for multi and wavef, which are two of the higher dimensional
problems. CSA, GA, and HS demonstrate their capabilities in finding good quality
and compact feature subsets for seven of the 12 data sets. GA is the only algorithm
that identifies the overall best solutions for the ozone and secom data sets. Note
that the search outputs from these algorithms can differ significantly. Taking the
secom data set as an example, the overall best evaluation score is 0.990 (achieved by
CSA, GA, and PSO) with around 200 features, while ACO, HS, and TS yield subsets
with average sizes of only 2.5, 24, and 97, respectively. The evaluation scores of
these solutions are also significantly lower in comparison, indicating that
they are likely only locally optimal solutions. Similar observations are also
reflected in the results for the arrhy data set.
Table 3.7: FS results using PCFS, showing both the evaluation scores (left) and the sizes (right) of the selected feature subsets. Bold figures indicate the highest evaluation scores; shaded cells signify the overall most compact best solutions

         ABC           ACO           CSA           FF            GA
arrhy    0.987  121    0.802   8.8   0.988  107    0.977  38.7   0.989  45.6
cleve    0.781   8.0   0.775   8.8   0.781   7.9   0.715   6.4   0.781   7.9
handw    1.000  27.1   1.000  26.9   1.000  22.0   1.000  23.1   1.000  41.2
heart    0.961   9.5   0.955  10.2   0.961   9.5   0.915   7.2   0.961   9.5
ionos    0.996   9.8   0.993  10.0   0.996   7.0   0.996   8.7   0.996  10.0
libra    0.971  35.2   0.935  19.0   0.972  17.2   0.968  24.5   0.972  18.2
multi    1.000  14.6   1.000  10.6   1.000  13.1   1.000  19.5   1.000  44.9
ozone    0.997  23.0   0.969  14.2   0.999  16.6   0.996  20.1   1.000  21.0
secom    0.986  211    0.936   2.5   0.989  213    0.974  150    0.990  198
sonar    0.993  24.8   0.946  11.6   0.993  11.7   0.989  15.8   0.993  12.4
water    0.994  15.8   0.975  10.8   0.995   9.8   0.990  10.5   0.995  10.0
wavef    0.999  12.4   0.999  11.4   0.999   9.7   0.999  10.9   1.000  11.5

         HS            MA            PSO           SA            TS
arrhy    0.989  29.1   0.989  114    0.989  111    0.989  29.1   0.983  21.6
cleve    0.781   7.9   0.781   7.9   0.781   7.9   0.781   8.0   0.738   6.7
handw    1.000  70.2   1.000  40.5   1.000  24.0   1.000  83.4   0.999  18.1
heart    0.961   9.5   0.961   9.5   0.961   9.5   0.922   8.7   0.947   8.3
ionos    0.996   6.8   0.996  10.1   0.996   8.1   0.989  15.6   0.991   6.4
libra    0.972  16.4   0.972  33.3   0.972  28.4   0.961  41.6   0.967  16.0
multi    1.000   9.1   1.000  43.8   1.000  13.4   1.000  325    1.000   6.1
ozone    0.999  18.4   1.000  31.8   1.000  26.0   0.994  34.5   0.999  19.1
secom    0.988  23.7   0.988  314    0.990  256    0.971  294    0.979  97.3
sonar    0.991  11.8   0.993  23.7   0.993  17.6   0.924  30.3   0.985  11.6
water    0.995  10.2   0.995  13.8   0.995  12.2   0.927  18.9   0.990   9.3
wavef    1.000  11.1   1.000  14.9   1.000  12.9   0.979  20.1   1.000  10.6
FRFS is a computationally intensive evaluator that requires a considerable amount
of time when the number of training objects is very large. Because of the underlying
properties of fuzzy discernibility matrices [126, 172], it is easy to find feature subsets
with almost full dependency scores (for the commonly adopted fuzzy t-norms and
fuzzy implicators). However, the search for the most compact solutions (fuzzy-rough
reducts) is very challenging. These characteristics of FRFS help to compare the size
reduction capabilities of the reviewed methods. Note that the evaluation scores are
not compared statistically, since subsets with full dependency score can be readily
identified. According to the results shown in Table 3.8, HS performs very well in this
set of experiments, mainly due to the fact that it is tailored to solving FRFS problems
in the first place [60], and that it also embeds mechanisms to actively refine the sizes
of the feature subsets during the search. GA, MA, and PSO also deliver competitive
performance. Although TS obtains the best results for six of the 12 data sets, it fails to
optimise the FRFS dependency scores for multi and secom, producing sub-optimal
solutions.
Table 3.8: FS results using FRFS, showing both the evaluation scores (left) and the sizes (right) of the selected feature subsets. Shaded cells signify the overall most compact best solutions
ABC ACO CSA FF GA
arrhy    1.000  40.4   1.000  23.6   1.000  29.0   1.000  29.2   1.000  63.4
cleve    0.929  13.0   0.929  13.0   0.929  13.0   0.854  10.9   0.999  12.9
handw    1.000  26.6   1.000  27.5   1.000  22.0   1.000  22.9   1.000  40.4
heart    0.959  12.8   0.960  13.0   0.959  12.8   0.909  10.7   1.000  10.1
ionos    0.993  15.3   0.994  16.6   0.993  14.0   0.992  15.1   1.000  25.7
libra    0.997  19.7   0.997  21.3   0.998  18.7   0.997  19.5   1.000  29.4
multi    1.000  20.3   1.000  17.6   1.000  18.9   1.000  27.3   1.000  49.8
ozone    0.975  36.5   0.962  35.7   0.973  32.9   0.976  34.5   0.982  48.5
secom    1.000  35.2   1.000  20.5   1.000  37.0   1.000  26.6   1.000  67.5
sonar    1.000  13.3   1.000  14.5   1.000  12.9   1.000  13.5   1.000  17.3
water    0.998  18.0   0.998  19.0   0.998  15.9   0.997  17.8   1.000  20.8
wavef    1.000  17.0   1.000  18.0   0.999  16.5   1.000  17.0   1.000  19.2
HS MA PSO SA TS
arrhy    1.000  25.1   1.000  63.1   1.000  34.5   1.000 108     1.000  24.9
cleve    0.999  12.9   0.999  12.9   0.999  12.9   0.929  11.2   0.989  11.9
handw    0.999  22.5   1.000  40.0   1.000  23.8   1.000 129     0.999  22.0
heart    1.000  10.0   1.000  10.1   1.000  10.3   0.989  10.6   0.959   9.1
ionos    1.000  25.7   1.000  26.1   1.000  26.5   0.991  14.9   1.000  25.9
libra    1.000  20.7   1.000  29.9   1.000  23.2   0.999  26.2   0.999  20.4
multi    1.000  15.3   1.000  41.8   1.000  21.0   1.000 323     0.562   6.0
ozone    0.924  38.5   0.982  48.8   0.982  51.9   0.919  36.1   0.979  33.7
secom    1.000  15.8   1.000  67.1   1.000  31.7   1.000 296     0.803   7.0
sonar    1.000  13.0   1.000  16.9   1.000  14.1   1.000  17.8   0.998  11.7
water    1.000  19.8   1.000  22.0   1.000  22.1   0.981  19.2   1.000  19.9
wavef    1.000  18.4   1.000  19.4   1.000  19.2   0.996  20.5   1.000  17.4
3.5.1.2 Classification Performance
Tables 3.9 to 3.11 show the accuracies of the classification models, trained using the
same cross-validation folds as those used to perform FS. The quality of the underlying
subsets has already been discussed in the previous subsection; the accuracies achieved
on the full (unreduced) data sets are given in Table 3.5.
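To make this reuse concrete, a fold-index generator of the following kind can be shared between the FS stage and the classifier training stage (an illustrative stdlib sketch; the helper name and seed are assumptions, not part of the thesis experiments):

```python
import random

def cv_folds(n_samples, k=10, seed=1):
    """Partition sample indices into k disjoint folds, created once so that
    feature selection and the subsequent classifier training can reuse the
    exact same cross-validation splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = cv_folds(100, k=10)
held_out = folds[0]                        # test fold for one CV round
train = [i for f in folds[1:] for i in f]  # the remaining nine folds
```

Fixing the folds up front, rather than re-splitting per stage, is what makes the later per-fold paired comparisons between algorithms meaningful.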
For features selected by CFS, as shown in Table 3.9, the worst solutions found by
ABC (with an averaged score of 0.024 and size of 2.9) actually result in the best
classification performance for both tested classifiers on the secom data set. This
shows that, for filter-based evaluators, a solution that achieves the highest evaluation
score does not necessarily guarantee the best classification model subsequently
learnt using such features, since these subsets are selected independently of the end
classifier learners. However, in general, there is a reasonable correlation between
subset quality (as judged by the CFS evaluator) and classification accuracy. Feature
subsets selected according to CFS also build slightly more accurate models than those
constructed based on PCFS and FRFS, and a greater number of algorithms are able
to find better performing solutions.
Table 3.10 reports the results collected using the set of feature subsets selected
via PCFS, where CSA and TS seem to lead to the best classification performance overall.
Each of the remaining algorithms also finds the best results for one or more data
sets. For the secom data set, all of the reviewed algorithms, apart from ACO, select
features that do not contribute to good NB classifier models. Several models result
in an averaged classification accuracy lower than 40%. A closer investigation reveals
that, in fact, local best solutions have been selected by these algorithms for a number
of cross-validation folds, which has a large, negative impact on the final 10-FCV
results.
For classifiers built using feature subsets selected by FRFS as demonstrated in
Table 3.11, algorithms such as ACO, MA, SA, and TS all perform reasonably well.
Both tested classifiers also tend to agree more often (than the two previous sets of
experiments) in terms of predictive accuracy. The performance of a given classifier,
and the evaluation score of its underlying feature subset are also well correlated
for FRFS. Note that since FRFS generally helps to find more compact subsets, the
resultant classification accuracies are also slightly lower when compared to those
obtained by CFS or PCFS.
Table 3.9: C4.5 (left) and NB (right) classification accuracies using the feature subsets found with the respective search algorithms via CFS. Bold figures indicate best classification accuracy (per classifier); shaded cells signify higher accuracies are achieved for both examined classifiers
ABC ACO CSA FF GA
arrhy    63.0  63.2   66.8  67.0   67.1  68.5   66.8  67.0   66.8  66.8
cleve    55.8  56.3   55.3  56.5   55.8  56.3   55.8  56.6   56.4  56.9
handw    75.0  83.7   70.2  72.9   75.9  84.7   74.3  82.0   75.5  85.2
heart    81.7  83.0   81.2  84.1   81.7  83.0   82.2  83.0   80.6  85.0
ionos    88.2  85.6   87.6  86.3   87.8  86.8   88.0  86.1   88.4  86.9
libra    65.6  60.7   62.1  57.3   67.0  61.6   65.5  61.0   67.6  62.1
multi    94.3  95.7   93.4  95.2   94.9  97.1   92.8  95.7   94.6  96.3
ozone    93.4  75.8   93.4  77.2   93.1  74.8   93.3  76.1   93.3  73.9
secom    93.4  88.7   92.5  82.6   92.5  84.2   92.7  75.1   90.7  84.1
sonar    72.3  66.6   73.2  66.3   73.0  66.6   73.4  65.9   73.1  66.6
water    81.9  84.9   82.7  85.8   82.8  85.9   82.3  85.1   83.4  85.9
wavef    76.9  79.8   77.4  80.5   77.6  80.7   76.8  79.5   77.5  80.2
HS MA PSO SA TS
arrhy    66.9  68.9   66.9  67.4   63.5  63.3   67.4  69.0   67.2  69.0
cleve    56.4  56.9   56.4  56.9   56.4  56.9   55.0  55.7   56.3  57.0
handw    75.9  85.3   75.4  85.5   75.3  85.3   76.0  83.8   76.1  84.9
heart    80.6  85.0   80.6  85.0   80.7  85.0   82.1  82.3   81.7  82.9
ionos    88.3  86.9   88.3  86.9   88.0  86.4   87.4  87.0   87.3  87.0
libra    67.3  61.6   68.2  61.8   66.8  61.4   65.9  61.4   66.9  61.3
multi    94.9  96.8   94.6  95.8   94.7  95.9   95.1  97.2   94.9  97.2
ozone    93.2  73.9   93.4  73.7   93.3  73.5   93.4  78.4   93.1  74.0
secom    92.1  71.2   91.2  88.7   92.7  74.2   92.5  84.7   92.4  83.7
sonar    73.2  66.6   73.3  66.5   72.8  66.3   72.3  66.6   74.1  66.9
water    83.3  85.9   83.3  85.9   82.1  85.2   83.0  85.6   82.9  85.6
wavef    77.5  80.2   77.5  80.2   76.9  79.7   77.6  80.9   77.5  80.5
3.5.1.3 Discovery of Multiple Quality Feature Subsets
The stochastic nature of global optimisation techniques such as HS allows the discov-
ery of multiple different solutions for the same set of training samples. Table 3.12
details the differences in the discovered reducts between HS-IR and HC using the
data set arrhy as an example. These subsets are recorded during the experiment,
where the same cross-validation fold is used by both methods. Note that similar
properties have also been observed for other data sets, although the use of
arrhy allows such properties to be revealed more clearly. In general, different runs may
Table 3.10: C4.5 (left) and NB (right) classification accuracies using the feature subsets found with the respective search algorithms via PCFS. Bold figures indicate best classification accuracy (per classifier); shaded cells signify higher accuracies are achieved for both examined classifiers
ABC ACO CSA FF GA
arrhy    66.4  63.3   62.6  61.5   66.3  63.5   66.7  66.2   66.2  65.9
cleve    54.5  55.9   54.0  56.0   54.5  55.9   56.4  55.8   54.5  55.9
handw    65.2  68.4   65.6  69.1   63.2  66.2   63.8  66.7   68.1  73.3
heart    78.5  84.4   78.6  84.2   78.5  84.4   80.2  83.0   78.5  84.4
ionos    85.8  79.6   86.0  80.5   85.7  78.8   86.0  79.9   85.4  79.0
libra    66.0  61.9   64.5  58.7   64.4  60.2   65.9  61.8   64.1  60.4
multi    79.5  82.1   83.8  86.5   78.7  81.0   80.2  82.7   86.0  89.5
ozone    93.0  72.5   93.1  75.7   93.3  76.3   93.1  73.7   93.1  73.4
secom    90.4  36.2   93.3  90.9   90.7  37.8   91.4  49.6   90.5  37.0
sonar    73.5  67.1   73.2  66.5   74.1  66.3   73.9  66.7   72.4  66.6
water    81.5  85.7   81.4  84.9   81.1  86.3   81.5  85.9   81.7  85.9
wavef    74.6  78.5   75.3  80.0   73.7  78.9   75.0  78.6   75.0  79.1
HS MA PSO SA TS
arrhy    66.2  66.8   66.3  64.0   66.1  63.9   66.2  67.5   66.7  68.2
cleve    54.5  55.9   54.5  55.8   54.4  55.9   54.5  55.8   55.2  56.0
handw    69.9  75.9   67.9  72.8   63.7  66.6   70.2  76.8   71.1  73.5
heart    78.5  84.4   78.5  84.4   78.5  84.4   79.6  83.1   79.5  84.5
ionos    85.7  80.1   85.3  80.1   85.5  78.6   85.5  81.0   87.4  80.3
libra    65.4  61.5   65.1  61.6   65.3  61.6   66.1  61.5   63.7  59.9
multi    81.7  84.3   85.5  89.1   78.5  81.0   93.4  94.9   89.8  91.4
ozone    93.2  75.6   93.1  70.6   93    71.8   92.9  69.4   93.2  75.2
secom    92.5  70.9   90.4  32.3   90.2  35.0   89.9  31.9   92.1  67.6
sonar    73.5  66.6   73.0  66.3   73.7  66.6   72.2  66.9   73.8  67.3
water    81.4  86.2   81.7  86.1   81.7  86.3   80.2  83.2   81.9  86.1
wavef    74.7  79.5   74.3  78.8   74.8  79.1   71.6  75.8   75.8  80.2
converge to the same solution, possibly due to the limited number of best solutions
that can be inferred from data sets of lower dimensionality.
For 10 runs of HS-IR, 10 different reducts of an average size 7 are selected (again,
all reaching the full dependency measure of 1), while HC results in a single subset
of size 7. The ability to produce multiple quality subsets from the same training
data may greatly benefit multi-view learning techniques such as classifier ensemble
[61, 266], where the subsets may be used to generate partitions of the training data
in order to build diverse classification models.
Table 3.11: C4.5 (left) and NB (right) classification accuracies using the feature subsets found with the respective search algorithms via FRFS. Bold figures indicate best classification accuracy (per classifier); shaded cells signify higher accuracies are achieved for both examined classifiers
ABC ACO CSA FF GA
arrhy    57.8  58.5   57.1  58.5   58.7  61.8   52.3  57.3   58.0  57.2
cleve    50.4  55.1   50.4  55.1   50.4  55.1   49.9  54.1   49.7  55.1
handw    65.6  68.2   66.7  69.0   62.0  65.1   63.5  66.1   68.3  71.2
heart    77.8  82.6   77.8  83.0   77.8  82.6   78.9  83.0   78.0  82.4
ionos    88.5  83.3   89.8  85.0   89.1  83.9   88.3  83.7   88.7  83.9
libra    63.6  56.1   60.8  58.5   59.9  54.6   63.1  58.6   62.9  58.8
multi    80.6  84.4   83.1  88.3   82.6  86.4   81.8  86.6   87.0  91.1
ozone    93.3  69.9   93.0  67.6   92.7  70.3   93.0  69.4   93.2  67.0
secom    92.7  71.6   92.9  91.6   92.7  77.4   93.2  69.4   91.6  50.4
sonar    75.3  69.2   76.0  73.9   76.2  71.1   71.6  71.5   71.7  67.9
water    80.8  83.9   80.0  83.9   78.7  85.0   79.7  83.9   79.7  83.9
wavef    71.1  74.8   73.8  78.5   71.0  74.9   70.4  75.8   70.4  75.7
HS MA PSO SA TS
arrhy    60.6  60.6   63.0  57.2   56.6  58.2   62.1  57.0   59.5  61.5
cleve    49.7  55.1   49.7  55.1   49.7  55.1   55.8  56.0   51.8  53.8
handw    64.3  67.0   67.4  72.0   62.2  64.1   73.9  83.3   65.7  67.9
heart    78.9  82.4   78.5  82.4   78.3  81.7   79.3  82.2   78.2  80.7
ionos    88.5  83.9   88.5  83.7   88.5  83.9   85.7  82.6   86.5  83.9
libra    63.1  53.6   66.0  60.3   64.4  57.2   62.1  58.8   63.3  58.9
multi    84.1  88.6   85.8  90.2   83.2  86.7   93.8  95.2   84.8  86.0
ozone    93.1  69.7   93.2  68.3   93.1  67.3   92.8  68.5   93.5  69.6
secom    93.1  93.1   91.4  43.9   92.7  74.6   90.2  31.3   93.4  93.4
sonar    76.7  71.2   70.0  67.4   70.7  65.4   70.8  68.4   76.4  76.4
water    80.6  84.4   81.4  85.0   81.3  83.9   79.6  84.2   79.7  84.6
wavef    72.9  75.7   71.6  75.6   73.0  76.5   71.8  75.5   72.6  75.9
3.5.2 Evaluation of Additional Improvements
Additional experimentation has been conducted in this section in order to demonstrate
the effectiveness of the proposed improvements described in Section 3.4. Note
that the effect of the parameter control rules has also been evaluated under
conventional settings, for improving the solution quality of numerical optimisation
problems. Those results are omitted here owing to their lack of relevance to the topic
at hand; refer to [59] for more detailed results and discussion.
Table 3.12: Comparison of multiple HS-IR reducts versus the single HC reduct for the arrhy data set; all subsets are of size 7 and evaluation score 1
         Feature indices
HSFS      3    5    9   80  242  246  275
          3   14  104  113  169  181  271
         14   20   64  209  238  242  251
          5    6    7  113  188  231  267
          3    6   76  161  246  252  265
          0    5    6   10  209  257  262
         14  191  208  218  241  259  275
         40  161  186  209  225  228  255
          6   80  128  161  208  239  271
          8   14  208  216  225  241  246
HC        0    3    6  168  169  217  251
3.5.2.1 Comparison of Parameter Control Rules
The parameter control rules discussed in Section 3.4.1 are examined here, with results
compared against other possible approaches. The previously employed data sets are
once again adopted, with either CFS or FRFS acting as the subset evaluator. For the
less complex data sets, all algorithms lead to identical results and therefore those
particular results are omitted. The following discussions focus on the remainder of the
results. Here, the arrows illustrate how the parameters are adjusted over iterations,
e.g. |H| ↓ means that |H| decreases from its maximum to its minimum value as the search
progresses; δ → means that δ is static throughout; and |H| ↑ δ ↑ indicates the
cases where both parameters increase over time, and hence the recommended rules.
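Such a linear adjustment rule can be sketched as follows (an assumed illustration only; the function name is hypothetical, and the 10-20 and 0.5-1 ranges simply mirror the settings of Table 3.15):

```python
def linear_schedule(g, g_max, lo, hi, increasing=True):
    """Linearly adjust a HS parameter over iterations g = 0..g_max.
    increasing=True moves the value from lo up to hi as the search
    progresses; increasing=False gives the decreasing variant."""
    frac = g / g_max
    return lo + (hi - lo) * (frac if increasing else 1.0 - frac)

# recommended rules: both |H| and delta increase during the search
h_size = round(linear_schedule(500, 1000, 10, 20))   # |H| halfway: 15
delta = linear_schedule(1000, 1000, 0.5, 1.0)        # delta at the end: 1.0
```

A static parameter corresponds simply to calling the search with a constant value instead of this schedule.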
Table 3.13 shows the subset size and evaluation score obtained using the CFS
evaluator. The first three rows show different |H| adjustment functions with a
static δ. The effect of an increasing |H| can be identified by the overall superior
evaluation scores; the subset sizes are not differentiated by a large amount, but a
decreasing |H| generally leads to larger feature subsets. Rows 3 to 5 show
the comparison of different δ functions with a static |H|. Here, the main difference in
results is feature subset size, where an increasing δ helps to discover smaller subsets,
and the evaluation score is also generally higher. Comparison of combined rule sets
shows that the final results are better when both parameters are increasing during
the search.
Table 3.13: Comparison of parameter control rules using CFS, averaged subset size rounded to the nearest integer and evaluation score by 10× 10-FCV; the shaded row indicates the suggested rules
ionos olito sonar water2 water
Mode Size Score Size Score Size Score Size Score Size Score
|H|↓ δ→    17  0.5020   15  0.5667   29  0.0907   16  0.1648   17  0.2545
|H|↑ δ→    15  0.5101   16  0.5676   28  0.1111   15  0.2638   16  0.3690
|H|→ δ→    16  0.5107   15  0.5675   27  0.1083   15  0.2441   16  0.3258
|H|→ δ↑    12  0.5155   16  0.5676   21  0.2978   11  0.3405   13  0.3940
|H|→ δ↓    15  0.5087   16  0.5676   26  0.1300   14  0.1944   16  0.2758
|H|↑ δ↑    12  0.5173   16  0.5677   20  0.2989   12  0.3409   13  0.4079   (suggested)
|H|↑ δ↓    15  0.5103   15  0.5665   27  0.1227   15  0.1819   15  0.2848
|H|↓ δ↓    17  0.5032   15  0.5673   30  0.0884   16  0.1735   17  0.2668
|H|↓ δ↑    16  0.5075   16  0.5676   28  0.1020   16  0.1887   16  0.2841
The same conclusion can be reached by studying the results obtained using FRFS
as the subset evaluator, as shown in Table 3.14. All HS variations have achieved the
full fuzzy-rough dependency measure for the discovered subsets; the difference in
performance is therefore reflected purely by the size reduction. HS with increasing |H|
and δ finds the most compact subsets overall, while HS with a static |H| and increasing
δ achieves a close second place with a minor increase in subset size, once again
demonstrating that δ adjustment plays a key role in the size reduction of feature
subsets.
Table 3.14: Comparison of parameter control rules using FRFS, averaged subset size rounded to the nearest integer across 10× 10-FCV; the shaded row indicates the suggested rules
Mode ionos olito sonar water2 water
|H|↓ δ→    14    9   29   16   16
|H|↑ δ→    12    7   25   14   14
|H|→ δ→    13    7   26   14   14
|H|→ δ↑    10    6   20    9   10
|H|→ δ↓    11    7   22   12   12
|H|↑ δ↑    10    6   18    9    9   (suggested)
|H|↑ δ↓    12    7   22   13   13
|H|↓ δ↓    14   10   28   17   16
|H|↓ δ↑    13    8   27   14   14
3.5.2.2 Effect of Parameter Control and Iterative Refinement
The effects of the proposed improvements are demonstrated here via a comparison with
the original HS algorithm. The parameter settings employed by the HS-based methods in
this experiment are given in Table 3.15. Thanks to the performance increase brought
about by parameter control and iterative refinement, the improved HS algorithm
no longer requires as many iterations as the original to achieve good results. The
maximum number of iterations used by HS-IR is therefore reduced to half of the
original amount, resulting in significant savings in run-time.
Table 3.15: Parameter settings for the demonstration of parameter control and iterative refinement
Algorithm Parameter Value
HS Original (HS-O)                                            |H| 20      gmax 2000   δ 0.8     |P| 10
HS with Parameter Control (HS-PC)                             |H| 10-20   gmax 1500   δ 0.5-1   |P| 10
HS with Parameter Control and Iterative Refinement (HS-IR)    |H| 10-20   gmax 1000   δ 0.5-1
Table 3.16 details the results obtained, showing both the subset size and the
evaluation score. The columns labelled HS-O contain the subset sizes discovered by
the original algorithm, and those labelled HS-PC show the results of using parameter-controlled
HS; HS-IR, which iteratively refines HS, is also included for comparison. For
the purpose of maintaining consistency of evaluation, the selection process employs
the same cross-validation folds as used in the previous subsections. The paired t-test
is again employed to compare the differences between HS-PC and HS-O, and HS-IR
against HS-PC. In all cases except one, the enhancements offer statistically significant
improvements in terms of subset size reduction and evaluation optimisation. For the
data set secom, whilst HS-PC did not increase the evaluation score when compared
to HS-O, it did reduce the average subset size.
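For reference, the paired t statistic used in such comparisons can be computed over matched per-fold results as below (a generic sketch; judging significance would additionally require the t distribution with n − 1 degrees of freedom):

```python
from math import sqrt

def paired_t(xs, ys):
    """Paired t statistic for matched result lists, e.g. per-fold accuracies
    of two FS variants evaluated on the same cross-validation folds; a
    positive value favours xs. Assumes the differences are not all equal
    (otherwise the sample variance is zero)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / sqrt(var / n)

t = paired_t([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])  # 2 * sqrt(3), about 3.46
```

Pairing on identical folds removes the fold-to-fold variance that would otherwise dominate an unpaired comparison.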
It can be seen from these results that the proposed improvements have greater
effect under more complex situations, such as those involving subsets with larger
Table 3.16: Comparison of proposed HS improvements using feature subsets selected by CFS, regarding averaged subset size, evaluation score, and C4.5 classification accuracy, by 10× 10-FCV; v, −, ∗ indicate statistically better, same, or worse results
Full HS-O HS-PC HS-IR
Data set   |A|   Acc.%     |B|     f(B)   Acc.%      |B|     f(B)   Acc.%  t     |B|     f(B)   Acc.%  t
ionos       35   85.62    14.04   0.533   85.30     11.46   0.539   85.57  v    10.06   0.542   85.30  *
water       39   79.74    15.24   0.386   83.13     12.3    0.419   82.82  *    10.1    0.427   82.46  -
wavef       41   76.62    17.32   0.362   77.02     15.46   0.382   77.21  v    14.9    0.384   77.23  -
sonar       61   72.62    25.34   0.132   72.62     20.54   0.317   73.52  v    17.22   0.359   72.95  *
ozone       73   92.62    38.58   0.106   93.35     29.36   0.113   93.41  -    19.9    0.114   93.28  -
libra       91   70.28    51.12   0.582   70.56     43.4    0.603   70.83  v    24.26   0.607   69.33  *
arrhy      280   65.06   161.82   0.052   67.84    117.74   0.088   67.49  -    27.36   0.441   67.27  -
secom      591   88.96   348.78   0.002   90.06    279.64   0.002   90.74  v    15.34   0.087   92.78  v
isole      618   83.42   383      0.692   83.37    356.98   0.716   83.59  -   205.71   0.723   83.02  *
multi      650   94.30   401.16   0.836   94.22    365.98   0.867   94.42  v   124.11   0.91    94.63  v
numbers of features, larger numbers of instances, or those that contain many com-
peting potential solutions. The effect of parameter control is revealed largely in
terms of better evaluation scores, while subset sizes are also reduced in the process.
However, iterative refinement greatly improves the overall solution quality, and
shows exceptional capability for reducing the size of subsets. For the secom data
set, HS-IR succeeded in reducing the solution size by over 95% without sacrificing
the evaluation score, making HS-IR a competitive algorithm in dealing with higher
dimensional FS problems.
From the differences in classification accuracy, it can be seen that of the 10
data sets, HS-PC manages improvements over HS-O for 6 cases, ties for 3 cases,
and under-performs for just one case. This demonstrates the effectiveness of using
parameter control. HS-IR obtains better feature subset evaluation scores (when
judged using CFS) and smaller subset sizes. However, classification accuracy results
indicate that more compact feature subsets (with equal or better evaluation score)
may not necessarily lead to equal or better classifiers. For example, regarding data
set arrhy, although HS-IR raised the average evaluation score from 0.088 to 0.441, and
reduced the average subset size from 117.74 to 27.36, the classifier accuracy remained
the same. Yet, the experimental results also show that, for this data set, HS-IR
equipped with the CFS evaluator removed a fair amount of redundancy, while not
affecting the end classifier performance.
3.5.3 Iterative Refinement of Fuzzy-Rough Reducts
Section 3.5.1.1 has shown that the iterative refinement technique works very well
for finding smaller fuzzy-rough reducts. The following experimental results show
graphically how an initial solution is improved upon over several refinements. Two
data sets with a relatively large number of features are used: arrhy (280 features)
and web (2557 features). The search objective is to find a fuzzy-rough reduct (with
fuzzy-rough dependency measure of 1.0) of the smallest possible size. For the web
data set, only ten different runs are performed due to its very high dimensionality.
For each data set the reduct sizes at each iteration are recorded, averaged, and
summarised in Figs. 3.6 and 3.7.
Figure 3.6: Iterative fuzzy-rough reduct refinement for the arrhy data set
The refinement procedure is completed within five iterations in six out of 10 runs
for the arrhy data set, with an averaged final reduct size of 7.17. As for the web data
set, 40% of the runs terminate within 30 refinements with the rest taking more than
33 iterations. This is one of the scenarios where an exponential adjustment to the
musician size |P| may become beneficial, such as that suggested in Algorithm 3.4.2.
Note that for smaller problems with <100 features, compact fuzzy-rough reducts are
usually found within 3 iterations. In reality, if the search is to be performed multiple
times, the efficiency can be further increased by initialising the number of feature
selectors |P| to a smaller value, which may be discovered in the first few executions
of HSFS.
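The shrink-and-restart structure of the refinement can be sketched as follows, with a toy evaluator standing in for the fuzzy-rough dependency measure and a plain random search standing in for HSFS; everything except the overall loop structure is an illustrative assumption:

```python
import random

RELEVANT = {0, 1, 2}  # assumed toy ground truth: full score needs all three

def evaluate(subset):
    """Toy stand-in for the fuzzy-rough dependency measure."""
    return len(RELEVANT & set(subset)) / len(RELEVANT)

def stochastic_search(n_features, max_size, iterations=2000, rng=None):
    """Random stand-in for one HSFS run, preferring full-score subsets of
    the smallest size; max_size caps the musician group |P|."""
    rng = rng or random.Random(0)
    best = tuple(range(n_features))  # fall-back: all features, score 1.0
    best_score = evaluate(best)
    for _ in range(iterations):
        size = rng.randint(1, max_size)
        cand = tuple(sorted(rng.sample(range(n_features), size)))
        if (evaluate(cand), -len(cand)) > (best_score, -len(best)):
            best, best_score = cand, evaluate(cand)
    return best, best_score

def iterative_refinement(n_features, rng=None):
    """Re-run the search with |P| capped below the best reduct size found,
    until no smaller full-score subset emerges."""
    rng = rng or random.Random(42)
    best, score = stochastic_search(n_features, n_features, rng=rng)
    while score == 1.0 and len(best) > 1:
        cand, cscore = stochastic_search(n_features, len(best) - 1, rng=rng)
        if cscore < 1.0 or len(cand) >= len(best):
            break  # refinement has converged
        best, score = cand, cscore
    return best

reduct = iterative_refinement(10)
```

Each refinement round only accepts a strictly smaller full-score subset, which mirrors the monotonically decreasing reduct sizes plotted in Figs. 3.6 and 3.7.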
Figure 3.7: Iterative fuzzy-rough reduct refinement for the web data set
3.5.4 Discussion of Results
HS-based approaches are themselves inexpensive in terms of computational overhead,
and are robust. This is because the algorithm comprises a very simple concept, and
its implementation is also straightforward. The run-time of the entire FS process is
mainly determined by the following two factors: the maximum number of iterations
gmax, and the efficiency of the subset evaluation method. gmax can be manually
configured according to the complexity of the data set; in the experimental evaluation,
HS converges very quickly, with a run-time similar to that of GA- and PSO-based
searches. The experiments also revealed the downside of FRFS, which, empirically,
does not scale well to larger data sets: the greater the number of instances in the
data set, the longer the time required for computing the fuzzy-rough dependency
measures.
The use of subset storage in HS offers a major advantage over that of other
techniques such as GA, as it maintains a record of the candidate feature subsets
found in previously executed iterations. All elements of the memory together
contribute to the new subset, whereas changes in genetic populations tend to result
in the destruction of previous knowledge of the problem. The harmony memory
considering rate δ also helps the search mechanism to escape from local best
solutions.
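One simplified reading of this memory-considering mechanism, with subsets encoded as bit vectors, can be sketched as follows (an illustrative assumption, not the exact HSFS improvisation step):

```python
import random

def improvise(memory, n_features, delta, rng):
    """Build one new candidate subset (as a bit vector): with probability
    delta each bit is inherited from a randomly chosen stored harmony,
    otherwise it is (re)drawn at random -- so the whole memory, rather
    than a single pair of parents, contributes to the new solution."""
    new = []
    for q in range(n_features):
        if rng.random() < delta:              # harmony memory consideration
            new.append(rng.choice(memory)[q])
        else:
            new.append(rng.randint(0, 1))     # random escape move
    return new

rng = random.Random(7)
memory = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1]]
candidate = improvise(memory, 4, delta=0.9, rng=rng)
```

With delta below 1, the occasional random bit is what lets the search leave a local best solution that dominates the memory.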
In all of the experiments, apart from relaxing the total number of generations
gmax for those conducted using CFS and PCFS, there has been no attempt to optimise
the parameters for each of the employed data sets. The same parameter settings are
used throughout for ease of comparison, regardless of the differences in complexity of
the data sets. It can be expected that the results obtained from the proposed work
with such per-data-set optimisation would be even better than those already observed.
The proposed approach offers an improved search heuristic. In general, no stochastic
search mechanism can guarantee an exhaustive search; otherwise it would not be a
heuristic in the first place. Therefore, no subset found can be theoretically proven
to be globally optimal (except those rough or fuzzy-rough reducts identified via
propositional satisfiability [127]). However, in practice, it is important to investigate
the relative strength of a given search heuristic. The systematic experimental studies
presented in this work confirm that, empirically, the quality of the subsets found by
the proposed technique generally exceeds that of those returned by the others.
3.6 Summary
In this chapter, a new FS search strategy named HSFS has been presented. It is based
on a recently developed, music-inspired, simple yet powerful meta-heuristic - HS.
Pseudocode and illustrative diagrams have been given in order to aid the explanation.
Additional improvements to HSFS have also been proposed in an attempt to address
the potential weaknesses of the original HS algorithm, and to adapt the approach for
FS problems. The resulting method offers a number of advantages over conventional
approaches, such as fast convergence, simplicity, insensitivity to initial value settings,
and efficiency in finding quality feature subsets. The suggested parameter control
rules have been designed to work with traditional optimisation problems also [59],
and are readily generalised to support a wider range of problems. The iterative
refinement mechanism works more closely with the size of the musician group |P|,
which is an additional degree of freedom introduced by HSFS.
Results for the experimental evaluation show that all algorithms reviewed in
Section 2.2 are capable of finding good quality solutions. SA and TS are particularly
powerful in optimising the evaluation scores of the CFS evaluator, and work well
with a few of the high dimensional problems. Algorithms such as CSA, GA, and HSFS offer
more balanced results for all tested subset evaluators, in terms of both evaluation
score and subset size. HSFS has demonstrated competitive FS performance for both
sets of experiments carried out using CFS and PCFS, and it particularly excels in
size reduction, producing compact fuzzy-rough reducts for most of the tested data
sets. This is largely due to the proposed parameter control rules and the iterative
refinement procedure. The selected feature subsets are verified via the use of two
classification algorithms: C4.5 and NB. The performance of the resultant models
generally supports the quality measurements of the filter-based evaluators, although
there exist cases where feature subsets with very low evaluation scores lead to the
most accurate classifiers.
In-depth analysis of these experimental findings, and how HSFS and the reviewed
algorithms may be further improved remain as topics of active research. The relevant
future directions are discussed in great detail in Section 9.2.1.1. It is without doubt
that stochastic feature subset search algorithms such as HSFS are particularly strong
in identifying distinctive, good quality feature subsets. Such feature subsets, while
substantially reducing the problem dimensionality, are also beneficial for improving
the performance of any classifiers subsequently employed. It is therefore natural
to utilise an ensemble-based learning mechanism, in order to better exploit the
advantages offered by these feature subsets.
Chapter 4
HSFS for Feature Subset Ensemble
The strengths of stochastic FS search methods such as HSFS, apart from being
able to escape from local best solutions, lie with their ability to identify
multiple feature subsets of similar quality. “Feature subset ensemble” (FSE) is an
ensemble-based approach that aims to extract information from a collection of base
FS components, producing an aggregated result from the collection. In so doing,
the performance variance of obtaining a single result from a single method can be
reduced. It is also intuitively appealing that the combination of multiple subsets may
remove (or reduce the impact of) less important features, resulting in an informative,
robust, and efficient solution.
A majority of the existing techniques that follow this idea focus on combining
feature ranking techniques (also termed criteria ensembles [238]), e.g., for the
purpose of text classification [195] and software defect prediction [259]. They work
by merging the ranking scores or exploring the rank ordering of the features returned
by individual FS methods. An implementation of FSE similar to wrapper-based FS
algorithms has also been studied [10]. Additionally, feature redundancy elimination
has been achieved using tree-based classifier ensembles [252]. Several terms similar
to FSE exist in the literature but represent a variety of different meanings, most
of which refer to classifier ensembles built upon feature subsets (e.g., [197]). One
notable example of this type of approach is the widely used Random Subspaces
technique [102].
In this chapter, a new representation termed “occurrence coefficient-based FSE”
(OC-FSE) is proposed. It works by analysing the feature occurrences within a group
of base FS algorithms, and subsequently producing a collection of feature occurrence
coefficients (OC). It is a concise notion that merges the views of the individual
components involved. Three possible implementations of the FSE concept are intro-
duced and discussed. These include: 1) building ensembles using stochastic search
techniques such as HS, 2) generating diversity by partitioning the training data, and
3) constructing ensembles by mixing various different FS algorithms.
To make better use of the information embedded within an OC-FSE, a novel
OC threshold-based classifier aggregation method is also presented. It improves
upon the existing ideas that imitate the popular majority vote scheme [249], which
is often adopted by conventional ensemble approaches to classifier learning [227].The proposed methods are flexible, allowing feature subset evaluators to be used in
conjunction with feature ranking; and more importantly, to be scalable for large-sized
ensembles and time-critical applications. The use of stochastic search-based and
data partition-based methods in an OC-FSE implementation is due to the observation
that they themselves are able to induce quality FS components from just a single FS
algorithm, thereby reducing the cost of the initial FSE configuration.
The remainder of this chapter is structured as follows. The proposed OC-FSE
approach and the accompanying aggregation technique are explained in Section
4.1, together with three alternative implementations of the FSE concept detailed
in Section 4.1.1. Illustrative flow charts are provided to aid understanding. This
section also provides a complexity analysis of the proposed approach. Section 4.2
presents the experimentation carried out on real-world problem cases [78], dedicated
to empirically identifying important characteristics of the present techniques. It
includes: an analysis of classification accuracy following the proposed approach,
tested using the three implementations (Section 4.2.1); a cross comparison between
the different implementations (Section 4.2.2); and a demonstration of how this
work may deal with a large number of base FS components (Section 4.2.3). Finally,
Section 4.3 summarises the chapter.
4.1 Occurrence Coefficient-Based Ensemble
This section presents the key notions of OC-FSE, and discusses the possible imple-
mentations that can systematically construct such FSEs, with the aid of illustrative
flow charts. The proposed OC threshold-based aggregation method is then specified,
which extracts the information embedded within an OC-FSE. It provides an efficient
alternative to the evaluation of a traditional, complete FSE built using subsets found
by the base FS components (referred to as an ordinary FSE hereafter). A brief com-
plexity analysis is also provided to point out the computational costs of the proposed
methods.
For a given ordinary FSE, assuming its underlying feature subsets are B = {Bi | i = 1, · · · , |B|},
it can be represented by a set of binary strings bB1, · · · , bB|B|, as shown in
Table 4.1. Here |B| denotes the size of the ensemble. Existing methods in the literature
generally build the subsequent classifier system using the individual subsets [102,
197], or attempt to merge them into a single subset [227, 238], which is denoted by B∗
below. OC-FSE is developed by exploiting an alternative approach to the combination
of the feature subsets. In particular, the decisions of the ensemble components
are organised in a |B| × |A| boolean decision matrix D. In this representation, a
horizontal row denotes a feature subset Bp, p = 1, · · · , |B|, and the binary cell value
dpq, q = 1, · · · , |A|, indicates whether aq ∈ Bp. The OC parameter σq of feature aq is
then defined as:
σq = ( ∑_{p=1..|B|} dpq ) / |B|        (4.1)
It first counts the number of occurrences of the features present in the ensemble,
then normalises the occurrences by the ensemble size |B|.
Table 4.1: Ordinary FSE of five feature subsets {Bi | i = 1, · · · , 5} with eight features a1, · · · , a8

       a1  a2  a3  a4  a5  a6  a7  a8
bB1     0   1   1   0   0   1   0   1
bB2     1   0   1   0   1   0   0   0
bB3     0   0   1   0   1   0   0   1
bB4     0   1   1   0   0   0   1   1
bB5     1   1   1   0   0   0   0   1
Obviously, 0 ≤ σq ≤ 1. The resultant OC indicates how frequently a particular feature aq is selected in an ordinary FSE; e.g., σ3 = 1 indicates that a3 is present in all subsets of Table 4.1. Irrelevant features naturally have an OC value of σ = 0. An FSE may now be constructed as a set of such OCs: E = {σ1, · · · , σ|A|}. The example FSE given in the table can therefore be denoted as {0.4, 0.6, 1.0, 0.0, 0.4, 0.2, 0.2, 0.8}. The size of such an FSE is defined as the sum of the OCs: |E| = ∑_{q=1..|A|} σq, with 0 ≤ |E| ≤ |A|.
4.1.1 Ensemble Construction Methods
This section introduces three methods that generate the required base pool of classi-
fiers, where stochastic search algorithms such as HSFS play an important role in the
production of distinctive feature subsets.
4.1.1.1 Single Subset Quality Evaluation Algorithm with Stochastic Search
Many of the existing nature-inspired heuristics, e.g., GA, PSO, and HS, share certain
properties, most notably the ability to generate multiple, quality solutions [62]. This
characteristic has been demonstrated previously in Section 3.5.1.3, and can be exploited
to efficiently construct FSEs. As illustrated in Fig. 4.1, an employed stochastic
algorithm searches for subsets until the targeted number of subsets |B| is satisfied.
This simple implementation requires only one evaluator and one search technique,
therefore the effort spent in configuring and training the necessary components is
minimal. However, this approach may be less desirable for data sets with fewer
features, where the number of diverse feature subsets is limited, and thus, the
resultant FSE diversity may be low.
Figure 4.1: Flow chart for single subset quality evaluation algorithm with stochastic search
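The construction of Fig. 4.1 may be sketched as follows. This is a hypothetical illustration: `random_restart_search` stands in for a stochastic search technique such as HSFS, and `toy_evaluator` for a real subset quality evaluator such as CFS; both names and the merit function are placeholders, not part of the thesis.

```python
import random

# Sketch of Fig. 4.1: a single stochastic search is re-run until the targeted
# number of subsets |B| is collected; only one evaluator and one search
# technique need to be configured.

def toy_evaluator(subset, relevant=frozenset({0, 2, 7})):
    # Hypothetical merit: reward covering "relevant" features, penalise size.
    return len(subset & relevant) - 0.1 * len(subset)

def random_restart_search(evaluator, num_features, iterations=200, rng=None):
    rng = rng or random.Random()
    best, best_score = frozenset(), float("-inf")
    for _ in range(iterations):
        candidate = frozenset(q for q in range(num_features) if rng.random() < 0.5)
        score = evaluator(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

def build_stochastic_ensemble(evaluator, num_features, ensemble_size, seed=0):
    rng = random.Random(seed)
    # Each run continues from a different random state, so the collected
    # pool B may contain distinct, quality subsets.
    return [random_restart_search(evaluator, num_features, rng=rng)
            for _ in range(ensemble_size)]

B = build_stochastic_ensemble(toy_evaluator, num_features=8, ensemble_size=5)
```

A real implementation would plug in HSFS together with an evaluator such as CFS or FRFS, mirroring the low configuration cost noted above.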
4.1.1.2 Single Subset Quality Evaluation Algorithm with Partitioned Data
An alternative approach for creating a diverse FSE is to use data partitioning, where
the training data is divided into a number of different chunks, and FS is then carried
out on each individual data partition. This is illustrated in Fig. 4.2. Here the ensemble
diversity is ensured because of the differences between the data partitions. Strategies
similar to stratified cross-validation [185] may be employed, in order to maintain
class balance, and to ensure that minority classes are sufficiently represented in each
data partition. This approach may be less effective for data sets with limited training
objects, since most FS evaluators require a sufficient number of training objects in
order to choose the most meaningful features. For such data sets, this may also
impose a constraint on the ensemble size |B| (number of data partitions).
Figure 4.2: Flow chart for single subset quality evaluation algorithm with partitioned training data
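The stratified partitioning behind Fig. 4.2 may be sketched as follows: within each class, instances are dealt to the partitions in turn, so every partition keeps roughly the original class balance. The function name `stratified_partitions` is illustrative, not taken from the thesis.

```python
from collections import defaultdict

# Sketch of stratified data partitioning (cf. Fig. 4.2): deal the instances
# of each class round-robin across the partitions, so minority classes stay
# represented in every chunk.

def stratified_partitions(labels, num_partitions):
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    parts = [[] for _ in range(num_partitions)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            parts[position % num_partitions].append(index)
    return parts

labels = ["a"] * 6 + ["b"] * 3          # "b" is the minority class
parts = stratified_partitions(labels, 3)
# Each of the 3 partitions receives two "a" instances and one "b" instance.
```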
4.1.1.3 Mixture of Subset Quality Evaluation Algorithms
The most intuitive FSE construction is perhaps to employ a number of different
subset quality evaluation algorithms. Diversity can be naturally obtained from the
differences in opinions reached by the evaluators themselves. The construction
process may be further randomised by the use of a pseudo random generator, as
illustrated in Fig. 4.3, where subset evaluators 1 to Y are the available feature subset
evaluators. This may become beneficial when the available evaluators are fewer
than the desired number of ensemble components, where certain evaluators are
expected to be used multiple times. Although many practical problems may favour
such a scheme, the overhead of integrating several methods, and the complexity of
the employed algorithms themselves may affect the overall run-time efficiency. Also,
as multiple evaluators are used simultaneously, finding optimal parameter settings
for the ensemble may become computationally challenging.
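The pseudo-random assignment of evaluators to ensemble slots may be sketched as follows. The evaluator list and seed are illustrative; with Y = 3 available evaluators and |B| = 20 desired components, some evaluators are necessarily reused.

```python
import random

# Sketch of the pseudo-random generator used in the mixture construction
# (cf. Fig. 4.3): each of the |B| ensemble slots is assigned one of the Y
# available evaluators at random.

def build_mixture_assignments(evaluators, ensemble_size, seed=42):
    rng = random.Random(seed)
    return [rng.choice(evaluators) for _ in range(ensemble_size)]

evaluators = ["CFS", "PCFS", "FRFS"]
assignments = build_mixture_assignments(evaluators, ensemble_size=20)
# With |B| > Y, at least one evaluator must appear more than once.
```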
4.1.2 Decision Aggregation
One of the most commonly used ensemble aggregation approaches is majority vote
[249], where the decision with the highest ensemble agreement is selected as the
final prediction. This method is beneficial for the situations where a single aggregated
feature subset is preferable. Following the proposed OC-based approach, a given
Figure 4.3: Flow chart for mixture of subset quality evaluation algorithms
OC threshold: α, 0 < α ≤ 1, may be adopted to control the number of features
considered for inclusion in the aggregated outcome B∗, such that: aq ∈ B∗ if σq ≥ α.
The common majority (more than half) voting method can be intuitively assimilated
by setting α = 0.5. The value α may be adjusted according to the problem at hand. In
particular, if the level of agreement is very high (which may indicate poor ensemble
diversity), a higher α value should be used in order to control the size of the resultant
subset. Alternatively, if a highly diverse FSE is expected to be obtained, there may
exist very few features where σq > 0.5; to combat this, it may be necessary to employ
a lowered α value.
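The threshold rule can be sketched directly; the function name `threshold_subset` is illustrative, and the OC values are those of Table 4.1, with features indexed from zero.

```python
# Sketch of the OC threshold rule: feature a_q enters the aggregated subset
# B* exactly when sigma_q >= alpha. With alpha = 0.5 this assimilates
# majority voting.

def threshold_subset(sigma, alpha):
    return {q for q, oc in enumerate(sigma) if oc >= alpha}

sigma = [0.4, 0.6, 1.0, 0.0, 0.4, 0.2, 0.2, 0.8]   # OCs for a1..a8 (0-indexed)
print(threshold_subset(sigma, 0.5))   # {1, 2, 7}, i.e. features a2, a3, a8
```

Raising α shrinks the aggregated subset (α = 0.9 keeps only a3 here), while lowering it admits every feature with non-zero support.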
Finding the right configuration of α can be difficult without in-depth knowledge
of the application problem at hand, while a poorly aggregated subset will bring
negative impact upon the end classifier performance. To address this issue, a multi-
layered OC threshold-based aggregation scheme can be adopted. It first produces
multiple feature subsets using different degrees of α, where the number of subsets
is set to [1/∆α], i.e., the nearest integer to 1/∆α, if the entire possible value range of α is partitioned into a number of intervals of length ∆α. It subsequently builds
classifiers that can generate class probability distributions d1, . . . di, . . . dC for the C
possible class labels. The classification outcome of OC-FSE is therefore a weighted
combination of such distributions:
( ∑_{j=1..[1/∆α]} wj d1j , . . . , ∑_{j=1..[1/∆α]} wj dij , . . . , ∑_{j=1..[1/∆α]} wj dCj )        (4.2)
The most probable class label may then be subsequently taken as the final output.
Using the previous example FSE as shown in Table 4.1, five feature subsets may be
generated using intervals∆α = 0.2, as given in Table 4.2. Suppose the classifiers built
using these feature subsets lead to the distributions shown (with 3 possible classes)
for a given test object. The final aggregation outcome is (1.9, 2.5, 2.8), assuming equal weighting (wj = 1, j = 1, . . . , 5), with class 3 being the most probable class
label. Although this example is fictional, it illustrates the possibility of alternative
class labels being derived if majority vote or simple merging is used instead. For
instance, by majority vote (α= 0.5), class 2 would be returned.
Table 4.2: An example of OC threshold-based aggregation with 3 possible classes

B∗α      Feature subset                              Distribution
B∗0.1    {a3} ∪ {a8} ∪ {a2} ∪ {a1, a5} ∪ {a6, a7}    (0.7, 0.5, 0.3)
B∗0.3    {a3} ∪ {a8} ∪ {a2} ∪ {a1, a5}               (0.6, 0.5, 0.3)
B∗0.5    {a3} ∪ {a8} ∪ {a2}                          (0.3, 0.6, 0.5)
B∗0.7    {a3} ∪ {a8}                                 (0.2, 0.4, 0.8)
B∗0.9    {a3}                                        (0.1, 0.5, 0.9)
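Under the stated assumption of equal weights, the aggregation of Eqn. (4.2) over the distributions of Table 4.2 can be reproduced as follows; `aggregate` is an illustrative name, and the distribution values come from the (fictional) table.

```python
# Sketch of the multi-layered aggregation of Eqn. (4.2), reproducing the
# worked example of Table 4.2 with equal weights.

def aggregate(distributions, weights):
    num_classes = len(distributions[0])
    return [sum(w * d[i] for w, d in zip(weights, distributions))
            for i in range(num_classes)]

# Class probability distributions for B*_0.1 .. B*_0.9 (Table 4.2).
distributions = [
    [0.7, 0.5, 0.3],
    [0.6, 0.5, 0.3],
    [0.3, 0.6, 0.5],
    [0.2, 0.4, 0.8],
    [0.1, 0.5, 0.9],
]
combined = aggregate(distributions, weights=[1] * 5)   # equal weighting
# combined ≈ [1.9, 2.5, 2.8]; class 3 (index 2) is the most probable label,
# whereas majority vote alone (the B*_0.5 row) would return class 2.
best_class = combined.index(max(combined))
```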
4.1.3 Complexity Analysis
As the ensemble procedure depends largely on the training (Ot), search (Os), and
evaluation (Oe) complexity of the employed FS components, the overall complexity
of an FSE is also relative to Ot , Os, and Oe. For a given feature evaluator, using HS as
an example, the complexity of the subset search process is Os = Oe · gmax, depending on Oe and the maximum number of iterations gmax. The total complexity of training and
obtaining the solution for a single feature selector is therefore Ot +Os in this case.
For ensembles constructed using a stochastic search method, the training com-
plexity is Ot , as only a single algorithm is involved which needs to be trained once.
The ensemble search complexity is Os · |B|, where |B| is the ensemble size. The total
complexity is therefore:
Ostochastic = Ot +Os · |B| (4.3)
For data partition-based ensembles, the evaluators need to be re-trained for every
data partition, resulting in a training complexity of Ot · |B| for these components,
whilst having the same search complexity Os · |B| as stochastic ensembles. The total
complexity is:
Odata partition = (Ot +Os) · |B| (4.4)
For ensembles generated from a mixture of algorithms, the training complexity
is based on the number of available evaluators, ∑_{i=1..Y} Oti, where Y is the number of evaluators. The search complexity is:
∑_{i=1..|B|} Osi,  where Osi = Oei · gmax for subset evaluators, and Osi = O(|A|) for feature rankers        (4.5)
and |A| is the number of features. The feature ranking approaches simply pick the
|A| best features at O(|A|) complexity, while subset-based evaluators need to perform
a search on the solution space. The final complexity of the mixture approach is
therefore:
Omixture = ∑_{i=1..Y} Oti + ∑_{i=1..|B|} Osi        (4.6)
Furthermore, O(|A| · |B|) straightforward calculations are required to compute
the ensemble output, and to convert the base FS components into the OC-FSE. The
OC threshold-based decision aggregation imposes a constant cost regardless of the
number of base components involved, as the number of trained classifiers is always
[1/∆α]. This makes the proposed method potentially favourable for large-sized FSE
systems or time critical ensemble applications. Note that Os can be further reduced
by integrating the process of ensemble construction and the search procedure itself.
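As a back-of-the-envelope illustration of Eqns. (4.3) and (4.4), assume hypothetical unit costs for training and search: the stochastic construction trains its single evaluator once, whereas data partitioning retrains it for every partition. The numbers below are illustrative, not measurements from the thesis.

```python
# Comparing the total construction costs of Eqns. (4.3) and (4.4) under
# hypothetical unit costs O_t (training) and O_s (one subset search).

def cost_stochastic(o_t, o_s, ensemble_size):
    return o_t + o_s * ensemble_size          # Eqn. (4.3): Ot + Os·|B|

def cost_data_partition(o_t, o_s, ensemble_size):
    return (o_t + o_s) * ensemble_size        # Eqn. (4.4): (Ot + Os)·|B|

o_t, o_s, B_size = 50, 10, 20
print(cost_stochastic(o_t, o_s, B_size))      # 250
print(cost_data_partition(o_t, o_s, B_size))  # 1200
```

The gap grows with the training cost Ot, which is why retraining per partition dominates when the evaluator is expensive to train.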
4.2 Experimentation and Discussion
The algorithms adopted in the experiments cover several rather different underlying
techniques, including the well-known C4.5 algorithm [264] and the naïve Bayes-
based classifier (NB) [132]. The vaguely quantified fuzzy-rough nearest neighbour
(VQNN) [118] is also employed, which is a very recent and powerful classification
technique. With such use of the various classifiers, a more comprehensive under-
standing of the resulting FSEs and the OC threshold-based aggregation method can be
reached.
A number of subset evaluators are used in the experiments, including CFS [93], PCFS [52], and FRFS [126]. Feature ranking methods are also employed in the
mixture of algorithms implementation, which will be introduced in detail in its
dedicated section (4.2.1.3). In total, 13 real-valued UCI benchmark data sets [78] are used to demonstrate the efficacy of the proposed approaches, several of which
are of reasonably high dimensionality and hence, present sufficient challenges for
FS. A summary of the characteristics of these data sets is given in Table 4.3, and the
Table 4.3: Data sets used for OC-FSE experimentation

Data set  Features  Instances  Classes   C4.5    NB    VQNN
arrhy        279       452       16     65.97  61.40  61.55
cleve         14       297        5     51.89  55.36  52.99
ecoli          8       336        8     82.53  85.21  84.21
glass         10       214        6     66.96  48.09  65.61
handw        257      1593       10     75.74  86.21  77.68
ionos         35       230        2     86.22  83.57  83.05
libra         91       360       15     68.24  63.63  67.25
multi        650      2000       10     94.54  95.30  98.03
ozone         73      2534        2     92.70  67.66  93.69
secom        591      1567        2     89.56  30.04  93.36
sonar         61       208        2     73.59  67.85  75.39
water         39       390        3     81.08  85.40  81.58
wavef         41       699        2     75.49  79.99  79.49
HS parameter settings empirically employed in the experiments are: |H| = 10–20, gmax = 1000–2000, δ = 0.5–1, while |P| is iteratively refined.
Stratified 10-FCV is employed, where a given data set is partitioned into 10
subsets. Of these 10 subsets, nine are used to form a training fold and a single subset
is retained as the testing data. The construction of the base classifier ensemble, and
the ensemble reduction process are both performed using the same training fold, so
that the reduced subset of classifiers can be compared using the same unseen testing
data. This process is then repeated 10 times (the number of folds). The advantage
of 10-FCV over random sub-sampling is that all objects are used for both training
and testing, and each object is used for testing exactly once per cross-validation run. The stratification
of the data prior to its division into different folds ensures that each class label has
equal representation in all folds, thereby helping to alleviate bias/variance problems
[17]. In the experiment, unless stated otherwise, 10-FCV is executed 10 times in
order to reduce the impact of the stochastic methods employed. The differences in
performance of various methods are statistically evaluated using paired t-test with
two-tailed p = 0.01.
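The fold-assignment step of this evaluation protocol may be sketched as follows; this is an illustrative re-implementation, not the code used in the experiments.

```python
import random

# Sketch of stratified 10-FCV repeated 10 times: within each class the
# instances are shuffled and dealt to the 10 folds in turn, so class
# proportions are preserved and every instance is tested exactly once
# per repetition.

def repeated_stratified_cv(labels, num_folds=10, repeats=10, seed=0):
    rng = random.Random(seed)
    runs = []
    for _ in range(repeats):
        by_class = {}
        for index, label in enumerate(labels):
            by_class.setdefault(label, []).append(index)
        fold_of = [None] * len(labels)
        for indices in by_class.values():
            rng.shuffle(indices)
            for position, index in enumerate(indices):
                fold_of[index] = position % num_folds
        runs.append(fold_of)
    return runs

labels = [0] * 40 + [1] * 20
runs = repeated_stratified_cv(labels)
# In every run, each fold holds 4 class-0 and 2 class-1 instances; fold f
# serves once as the test set while the other nine folds form the training data.
```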
The classification outcomes of the proposed OC threshold-based FSE aggregation
method (with ∆α = 0.1) are reported in Section 4.2.1. The base FS components
B, |B| = 20, are produced by the three different ensemble construction methods as
described in Section 4.1.1. With OC threshold α= 0.5 (which assimilates majority
vote), the OC-FSE-discovered feature subsets can be “flattened” into standard (single)
feature subsets. That is, the union of these subsets is regarded as a selected subset
itself. The accuracies of the classifiers trained using such flattened subsets are also
presented. The outputs from the base FS components are collected during the
ensemble construction process, which are then used to build traditional, feature
subset-based classifier ensembles.
Note that these ordinary FSEs (with 20 feature subsets) are certainly larger in size,
when compared to OC-FSEs with 10 flattened feature subsets. The purpose of the
comparison is to determine whether the proposed approach is indeed competitive, in
terms of classification accuracy, and subset size. The averaged accuracies of the single
FS algorithms are included as well, in order to indicate the performance baseline.
Comparative studies between the three FSE implementations are further made in
Section 4.2.2, where the performances of the ensembles are averaged across different
classifiers, thereby providing a high level reflection of the characteristics of these
approaches. Finally, Section 4.2.3 investigates the relationship between the
aggregation accuracy and the size of OC-FSE-selected feature subsets, as well as the
number of initial FS components involved in the ensemble construction.
4.2.1 Classification Results
The classification results are presented in Tables 4.4 to 4.6. The number of base
FS components |B| is 20 throughout. This facilitates comparison between different
approaches, especially between the three FSE implementations. However, note that
this may not be the most suitable for several data sets (e.g., those with fewer instances).
The figures highlighted in bold indicate statistically superior results in comparison to
the rest. As explained previously in Section 4.1.1.1, the evaluators that assess the quality of a feature subset as a whole (such as FRFS, CFS, and PCFS) are employed in
the stochastic search and data partition-based implementations. Because the source
of diversity arises from the randomised search, a feature ranking based evaluator
will typically (with minor variations from different cross-validation folds) result in
the same feature subset over different runs.
4.2.1.1 Ensemble Constructed via Stochastic Search
As shown in Table 4.4, the proposed method (OC-FSE) is able to deliver very competi-
tive classification performance and generally produces better results than the subsets
selected by a single FS algorithm. For the cleve, ecoli, and glass data sets, it
results in equal or better accuracies across almost all classifiers and FS methods,
while the ordinary FSE performs consistently well for the handw data set. Better
overall results are obtained for the CFS and FRFS based ensembles, though the
proposed method is outperformed by the ordinary FSE constructed using PCFS, for
the data sets handw, libra and multi. OC-FSE works well in conjunction with
the NB classifier (76.33% vs. 44.22% for the secom data set with PCFS), and the
ordinary FSE is able to achieve very competitive performance when paired with C4.5.
For low dimensional data sets such as cleve, ecoli, and glass, the FRFS evaluator
consistently selects the same features; therefore, no diversity is present in the resultant
FSEs. However, OC-FSE manages to improve several classifiers for cleve.
Although ordinary FSE performs better in a number of cases, it maintains much
larger ensembles than OC-FSE (20 vs. 10 in this experiment), and has higher space
and time complexity when compared with OC-FSE. Also, PCFS identifies features that are most consistent with the class. For higher dimensional data sets, the number
of common features selected within a given FSE may be significantly reduced. This
affects the performance of the feature subsets aggregated under higher α thresholds.
Finally, considering the flattened B∗0.5 to be in the form of a standard subset, its
performance is also promising and presents a compromise between subset size
(which determines classifier complexity) and classification accuracy.
4.2.1.2 Ensemble Constructed via Data Partitioning
Table 4.5 details the results collected using the data partition-based implementation.
Several characteristics revealed earlier also hold for this set of experiments: (1) OC-FSE
works exceptionally well with NB while the ordinary FSE performs better with C4.5.
(2) The ordinary FSE still produces better results for the data sets handw and libra.
(3) Less performance variation is observed for low dimensional data sets for the
FRFS evaluator. The proposed method achieves competitive scores for the high
dimensional and large data sets, such as multi, ozone, and secom, with reasonable
sized FSEs. This demonstrates the strength of the data partition-based ensemble
construction technique.
4.2.1.3 Mixture of Algorithms
For this set of experiments, a number of individual feature evaluators are considered,
including several feature ranking approaches including information gain, data relia-
Table 4.4: Classification accuracy % results of the stochastic search implementation; shaded cells indicate statistically significant improvements for each of the tested classification algorithms

Data set |       OC-FSE        | B∗0.5 (OC α = 0.5)  |    Ordinary FSE     |    Single Subset
         | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size

CFS
arrhy | 71.4 69.3 62.9 32.2 | 67.7 68.3 65.4 27.3 | 73.3 68.8 63.5 33.8 | 67.3 67.4 63.8 32.1
cleve | 55.8 56.6 53.2  6.7 | 55.3 56.4 52.2  6.7 | 55.6 56.6 53.2  6.7 | 55.3 56.4 52.2  6.7
ecoli | 79.8 82.7 82.2  4.7 | 79.2 81.3 81.9  4.7 | 79.8 82.6 82.2  4.9 | 79.2 81.3 81.9  4.7
glass | 70.9 47.6 67.0  6.6 | 70.4 47.68 68.0 6.6 | 68.8 47.4 67.0  6.6 | 70.4 47.7 68.0  6.6
handw | 80.5 83.5 57.4 81.6 | 74.1 81.0 55.0 67.1 | 86.5 86.1 67.5 82.0 | 75.4 83.6 62.6 81.7
ionos | 86.8 86.9 81.7 10.3 | 86.6 87.0 81.7 10.4 | 86.7 86.4 81.2 10.6 | 86.4 86.8 81.7 10.3
libra | 71.2 62.2 66.6 30.3 | 65.7 59.6 62.3 28.9 | 74.7 63.6 66.1 31.8 | 66.6 61.7 63.8 30.0
multi | 96.6 97.6 99.0 162  | 94.5 96.8 98.4 82.9 | 96.8 96.7 98.6 171  | 94.6 96.5 98.2 162
ozone | 93.6 74.3 93.7 20.9 | 93.1 73.9 93.7 20.7 | 94.0 73.8 93.7 23.2 | 93.1 74.0 93.7 20.5
secom | 92.9 83.1 93.4 18.6 | 92.3 72.6 93.4 18.3 | 93.2 73.7 93.4 25.2 | 92.4 71.6 93.4 19.4
sonar | 75.6 66.9 75.0 17.8 | 75.0 66.9 76.0 18.1 | 75.5 66.6 76.0 18.2 | 75.3 66.3 76.1 17.7
water | 81.8 86.0 86.3 10.4 | 81.0 86.0 85.9 10.5 | 82.1 86.0 86.1 10.7 | 81.1 85.9 85.9 10.4
wavef | 77.2 80.2 84.5 14.8 | 77.1 80.1 83.8 14.9 | 77.0 80.1 83.9 14.8 | 77.2 80.2 83.6 14.7

PCFS
arrhy | 70.1 69.9 61.9 44.7 | 64.7 65.8 62.7 21.7 | 73.9 69.6 61.8 49.4 | 66.1 65.4 62.2 44.0
cleve | 54.6 55.9 52.5  8.0 | 53.9 55.3 52.5  8.0 | 54.1 55.7 52.5  7.9 | 54.6 55.7 52.5  8.0
ecoli | 83.0 85.6 84.5  6.0 | 83.0 85.6 84.4  6.0 | 81.4 85.2 84.5  6.0 | 83.0 85.6 84.4  6.0
glass | 69.4 49.1 68.8  6.7 | 69.2 48.9 68.6  6.7 | 68.6 48.2 67.4  6.8 | 69.3 49.1 68.8  6.7
handw | 78.5 81.8 65.5 60.2 | 66.6 70.8 50.0 38.0 | 91.4 84.3 67.6 24.6 | 63.8 66.8 51.1 20.9
ionos | 87.3 81.5 81.0  7.6 | 85.6 77.4 79.4  6.5 | 89.5 81.5 81.9  7.5 | 85.5 77.4 79.1  7.5
libra | 71.7 59.7 64.3 20.5 | 60.2 54.0 61.3 13.6 | 76.5 61.4 66.6 18.7 | 64.1 60.4 63.7 21.2
multi | 91.7 94.6 94.6 42.0 | 85.4 87.4 89.2 19.2 | 97.4 95.4 97.0 14.0 | 81.5 84.2 83.7  9.8
ozone | 93.8 79.7 93.7 18.8 | 93.3 77.6 93.7 15.4 | 94.1 74.3 93.7 20.4 | 93.1 74.6 93.7 19.5
secom | 93.2 76.3 93.4 68.4 | 92.3 72.6 93.4 16.8 | 93.1 44.2 93.4 86.5 | 91.3 48.4 93.4 89.0
sonar | 77.1 66.8 76.5 11.5 | 73.7 67.1 76.3 11.7 | 78.8 66.4 77.9 12.1 | 74.2 66.8 77.8 12.9
water | 81.9 86.4 86.3  9.8 | 81.6 86.2 86.2  9.9 | 82.0 86.2 86.6 10.2 | 81.4 86.0 86.0 10.0
wavef | 78.8 81.7 80.0 11.1 | 74.6 80.3 77.5 11.9 | 82.8 81.9 82.7 11.4 | 74.4 79.2 77.0 10.9

FRFS
arrhy | 66.6 65.5 56.2 27.9 | 59.7 61.5 55.1 16.1 | 70.6 67.9 55.1 20   | 57.3 61.5 57.5 20.0
cleve | 53.3 56.3 55.9 10.0 | 52.2 55.6 54.6 10.1 | 52.6 55.3 54.9 10.0 | 53.2 55.6 55.6 10.0
ecoli | 83.7 85.7 85.1  6.9 | 83.7 85.7 85.1  6.9 | 83.7 85.7 85.1  6.9 | 83.7 85.7 85.1  6.9
glass | 67.8 47.7 65.2  9.0 | 67.8 47.7 65.2  9.0 | 67.5 47.7 65.2  9.0 | 67.8 47.7 65.2  9.0
handw | 82.4 84.4 67.6 73.2 | 69.2 73.7 55.3 50.3 | 91.7 84.5 66.0 24.5 | 62.8 66.0 54.3 25.0
ionos | 89.9 89.8 85.0 10.2 | 88.0 83.0 83.5  8.8 | 89.8 85.0 85.8 10.0 | 87.0 83.9 81.8 10.0
libra | 79.2 61.4 68.1 17.5 | 59.7 47.8 58.1  9.4 | 71.9 61.1 60.6 10   | 63.6 59.4 65.0 10.0
multi | 98.0 95.4 97.3 89.6 | 87.6 93.1 93.9 62.5 | 93.9 95.8 96.9 19.4 | 83.6 87.1 85.8 19.6
ozone | 94.0 80.6 93.7 25.0 | 93.1 77.6 93.7 19.5 | 94.1 72.3 93.7 25.0 | 93.1 71.8 93.7 25.0
secom | 93.4 92.1 93.4 76.9 | 92.0 90.6 93.4 47.9 | 93.4 92.9 93.4 24.1 | 93.0 89.9 93.4 24.4
sonar | 74.6 74.1 77.5 13.6 | 69.7 73.1 77.5 10.7 | 78.9 73.1 79.4 10.0 | 70.2 70.7 75.0 10.0
water | 83.9 85.9 83.6 10.0 | 81.5 85.4 83.1  9.9 | 85.4 85.9 83.6 10.0 | 81.5 84.6 83.3 10.0
wavef | 77.2 81.9 82.0 25.2 | 72.4 79.3 71.8 30.2 | 83.8 81.7 78.9 19.0 | 71.7 76.6 70.4 19.0
Table 4.5: Classification accuracy % results of the data partition-based implementation; shaded cells indicate statistically significant improvements for each of the tested classifiers

Data set |       OC-FSE        | B∗0.5 (OC α = 0.5)  |    Ordinary FSE     |    Single Subset
         | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size

CFS
arrhy | 71.7 69.7 63.1 38.8 | 68.6 68.3 65.9 25.7 | 72.9 68.4 63.2 33.7 | 67.8 67.5 63.6 31.6
cleve | 55.7 56.8 53.4  6.6 | 55.5 56.9 53.1  6.7 | 54.4 56.5 53.3  6.6 | 55.7 56.7 52.7  6.8
ecoli | 78.0 83.8 82.8  4.9 | 79.2 82.3 82.2  4.9 | 79.8 82.5 82.3  4.9 | 79.6 82.2 81.9  4.8
glass | 71.0 48.3 67.2  6.4 | 70.3 48.0 67.6  6.5 | 68.6 47.9 68.7  6.4 | 69.0 48.1 67.5  6.6
handw | 81.3 86.1 59.3 81.4 | 73.7 79.0 50.0 60.1 | 86.6 84.2 67.2 82   | 75.5 83.7 63.1 79.7
ionos | 87.7 86.7 82.1 10.6 | 87.1 86.4 81.5 10.5 | 87.8 86.6 82.2 10.8 | 87.2 86.5 81.6 10.3
libra | 73.2 59.7 63.1 29.7 | 64.9 57.8 61.4 26.8 | 74.2 61.9 64.5 31.6 | 67.0 61.7 63.8 30.4
multi | 96.5 97.6 98.8 161  | 94.7 96.9 98.6 81.4 | 96.5 96.9 98.7 170  | 94.4 96.3 98.4 172
ozone | 93.8 74.6 93.7 21.4 | 93.2 73.9 93.7 21.1 | 94.2 73.8 93.7 23.1 | 93.2 73.9 93.7 21.4
secom | 93.1 82.3 93.4 20.0 | 92.4 75.0 93.4 16.9 | 93.3 75.3 93.4 24.9 | 92.2 70.2 93.4 18.4
sonar | 76.3 66.8 74.7 17.0 | 76.0 66.4 75.8 17.7 | 77.1 66.4 75.4 17.9 | 75.8 66.9 77.2 17.6
water | 82.5 86.5 86.8 10.2 | 81.2 85.9 86.1 10.1 | 82.5 86.4 86.5 10.5 | 81.8 86.4 86.5 10.5
wavef | 77.3 80.2 84.5 14.6 | 77.1 80.1 83.9 14.9 | 77.0 80.1 84.0 14.8 | 77.2 80.2 83.7 14.7

PCFS
arrhy | 70.5 70.3 60.5 45.3 | 64.9 66.6 62.2 21.9 | 74.3 69.8 61.8 49.2 | 66.3 65.9 61.9 44.6
cleve | 56.0 56.6 52.8  7.8 | 54.3 55.4 53.1  7.7 | 55.2 55.6 51.5  7.8 | 54.8 55.4 52.4  7.9
ecoli | 82.5 85.4 84.6  5.9 | 82.5 85.0 83.5  6.0 | 82.2 85.2 84.5  6.0 | 82.4 85.3 84.0  6.0
glass | 70.3 50.9 67.8  6.6 | 69.6 49.3 68.3  6.7 | 69.4 48.6 68.6  6.5 | 69.3 48.7 68.3  6.7
handw | 82.4 84.0 67.2 59.5 | 65.1 69.2 49.7 36.8 | 91.1 84.0 65.6 24.3 | 63.8 66.7 51.0 21.0
ionos | 89.6 81.7 80.8  8.7 | 86.7 78.9 79.7  7.1 | 90.0 81.2 82.2  7.4 | 86.0 78.0 80.3  7.4
libra | 70.7 60.4 62.9 22.8 | 59.2 53.3 59.5 14.4 | 76.4 62.0 66.6 18.8 | 64.0 60.3 64.1 20.2
multi | 97.6 95.0 97.1 43.1 | 85.6 87.4 89.2 19.4 | 97.6 95.4 96.3 14.0 | 81.8 84.2 83.6  9.7
ozone | 93.8 82.1 93.7 19.3 | 93.3 79.2 93.7 14.0 | 94.1 74.2 93.7 20.5 | 93.3 74.9 93.7 18.9
secom | 93.3 84.5 93.4 67.3 | 92.2 80.3 93.4 16.0 | 93.3 47.2 93.4 86.0 | 91.7 53.8 93.4 67.6
sonar | 77.0 67.2 75.7 11.6 | 73.4 66.7 76.1 11.8 | 76.7 67.6 77.4 11.9 | 73.4 67.0 76.3 11.3
water | 82.1 86.9 86.8  9.9 | 80.9 86.9 86.4 10.5 | 82.4 86.8 86.8 10.3 | 80.4 86.4 85.9  9.9
wavef | 79.7 81.9 82.9 11.4 | 74.8 80.6 77.8 12.6 | 83.0 81.9 81.9 11.5 | 74.7 79.4 77.1 10.9

FRFS
arrhy | 64.2 65.3 55.6 25.5 | 53.3 58.7 55.1 11.3 | 70.6 66.6 55.3 20.0 | 59.6 65.3 57.3 20.0
cleve | 52.5 55.9 56.0 10.0 | 50.5 53.5 56.0 10.0 | 51.5 53.9 55.3 10.0 | 52.2 54.6 52.2 10.0
ecoli | 82.5 85.5 85.8  6.9 | 82.5 85.5 85.8  6.9 | 82.8 85.5 85.8  6.9 | 82.5 85.5 85.8  6.9
glass | 66.8 49.7 63.0  9.0 | 66.8 49.7 63.0  9.0 | 66.8 49.7 63.0  9.0 | 65.4 49.2 62.5  8.9
handw | 82.7 84.2 65.0 71.5 | 68.8 73.9 55.9 50.5 | 90.9 84.1 67.5 24.3 | 65.2 68.3 55.4 23.6
ionos | 90.4 84.4 87.4 10.3 | 88.3 81.3 83.5  8.1 | 90.0 83.0 86.5 10.0 | 84.8 83.0 80.4 10.0
libra | 70.3 61.4 58.3 18.1 | 54.2 46.7 56.1  9.0 | 75.6 61.4 65.6 10.0 | 64.7 58.1 61.7 10.0
multi | 97.9 95.9 97.5 85.3 | 84.9 89.9 89.8 46.8 | 93.2 95.6 96.0 19.3 | 82.6 87.3 86.7 19.5
ozone | 93.9 79.9 93.7 25.0 | 93.1 77.8 93.7 20.1 | 94.0 72.1 93.7 25.0 | 93.3 72.8 93.7 25.0
secom | 93.4 92.6 93.4 73.8 | 91.6 89.3 93.4 39.3 | 93.4 92.6 93.4 24.1 | 92.7 77.3 93.4 24.4
sonar | 79.0 73.7 76.1 13.0 | 76.1 69.8 76.6  8.8 | 80.4 75.6 78.0 10.0 | 71.3 69.8 77.5 10.0
water | 86.4 85.6 84.1 10.1 | 84.6 83.9 83.1 10.1 | 86.2 85.9 83.3 10.0 | 83.9 84.4 82.1 10.0
wavef | 78.0 81.8 78.9 25.4 | 74.4 80.7 73.5 30.9 | 83.0 81.6 78.9 19.0 | 70.3 73.2 64.2 19.0
bility [22], chi-square [291], RELIEF [147], and symmetrical uncertainty [219], in
conjunction with various feature subset evaluators (FRFS, CFS, and PCFS). Together,
eight different evaluation methods are employed. A pseudo-random generator as
described in Section 4.1.1.3 is used to produce the required 20 ensemble components.
For FS rankers, the final feature subset size is adjusted according to the size of subsets
obtained by the subset evaluators.
The classification performances of the classifiers that utilise the constructed FSEs
are compared in Table 4.6. The most interesting results are achieved for the data sets
ecoli, ionos, sonar, water, and wavef, where all three tested classifiers have
an improved performance. By employing the proposed work, the overall accuracies
of both C4.5 and NB are improved for 10 out of 13 data sets, as compared to that
of VQNN (6/13 data sets). This indicates that the OC threshold-based aggregation
technique is favourable for such generically built ensembles.
Table 4.6: Classification accuracy % results of the mixture of algorithms; shaded cells indicate statistically significant improvements for each of the classification algorithms

Data set |         OC-FSE          |   B∗0.5 (OC α = 0.5)    |      Ordinary FSE
         | C4.5   NB   VQNN  Size  | C4.5   NB   VQNN  Size  | C4.5   NB   VQNN  Size

arrhy | 69.16 65.73 64.68 108.0 | 66.92 65.70 64.95 116.2 | 68.11 65.44 64.75 108.0
cleve | 55.85 56.69 52.79   7.1 | 55.71 56.72 52.96   7.0 | 55.91 56.48 52.79   8.1
ecoli | 81.42 84.30 82.17   4.7 | 79.04 80.30 81.16   4.6 | 79.75 80.51 81.49   5.7
glass | 69.02 50.14 64.17   5.5 | 67.12 49.47 63.31   5.4 | 67.77 49.53 64.40   6.5
handw | 78.54 84.51 64.04 114.1 | 76.48 84.05 62.47 125.3 | 78.97 84.66 63.24 105.1
ionos | 89.43 84.13 84.91  14.5 | 87.22 83.48 83.22  14.3 | 89.04 84.13 84.74  15.4
libra | 68.28 57.50 61.75  38.7 | 64.11 56.64 60.94  44.5 | 67.61 58.17 62.28  39.0
multi | 95.22 95.49 98.07 286.7 | 94.64 95.04 97.69 314.1 | 95.29 95.37 97.85 259.2
ozone | 93.78 70.96 93.69  32.1 | 92.67 69.46 93.69  35.5 | 93.68 69.78 93.69  32.1
secom | 91.45 44.22 93.36 223.7 | 90.04 36.94 93.36 255.2 | 91.19 34.97 93.36 189.9
sonar | 78.80 79.82 82.94  25.3 | 75.05 66.51 77.38  26.4 | 75.91 66.80 77.19  26.1
water | 82.77 86.56 85.85  16.4 | 81.95 86.26 84.77  17.3 | 82.26 86.46 84.87  17.4
wavef | 76.63 80.34 83.76  17.7 | 76.38 79.98 82.45  19.4 | 76.58 79.98 82.72  18.7
4.2.2 Comparison of Ensemble Generation Methods
A graphical view of the classification results achieved is shown in Fig. 4.4, detailing
the average performance and spread of the three FSE implementations, against the
classification models built using the base FS components over each of the 12 data
sets. The PCFS subset evaluator is used to select the features, and the FSEs are tested
using the C4.5 algorithm. The 10-FCV process is repeated 50 times per data set,
producing 500 different results in the expectation that the variations in training and
testing are sufficiently captured. Table 4.7 presents an overall statistical summary
which combines the results over all examined data sets. The mean accuracy is also
presented in order to give a full comparison of the performance differences between
the implementations.
Table 4.7: Statistical comparison of the three FSE implementations, using mean accuracy aggregated over all data sets; shaded cells indicate best results

            Mean   Median  Max    Min    SD
Stochastic  81.53  81.82   93.45  64.18  5.02
Partition   81.23  81.36   93.33  64.47  5.06
Mixture     80.21  80.29   93.23  63.80  5.11
Base        76.00  76.33   91.12  57.13  5.76
Paired t-test (p = 0.1) indicates that all FSE implementations have lower standard
deviations than the corresponding base FS components. This experimentally demon-
strates that the use of FSE reduces performance variance of the end classification
models. However, no significant differentiation can be determined between the
ensemble approaches themselves. From these figures, it can be concluded that no
implementation works universally better than the others, although the stochastic
search-based method leads in terms of overall classification accuracy, achieving best
scores in eight of the 12 cases. Note that the mixture-of-algorithm-based method
presents outstanding performance for the multi data set (highest accuracy of 95.18%
and lowest spread of 1.53%). No obvious improvement is noticed for the data sets
cleve and ecoli, other than a minor accuracy increase. This is expected as these
are low dimensional data sets. For all the data sets tested, the use of OC-FSE results
in an improvement in classification accuracy, while the number of features required
to perform the classification is also much reduced. This reflects that as a novel
filter-based approach, OC-FSE offers a beneficial pre-processing step for the purpose
of classification.
4.2.3 Scalability Tests
The aim of this set of experiments is to verify whether using a different number
of base FS components may affect the accuracy (and FSE size) of the proposed
approach. Obviously, this is a computationally expensive experimentation. Thus,
only the stochastic search-based ensemble generation method is employed along
with the PCFS evaluator. Fig. 4.5 details the variations in performance for several
Figure 4.4: Comparison of average classification accuracies (crosses) and spreads of the three FSE implementations and base components for each data set
data sets with reasonable complexity (low dimensional data sets are omitted). The
number of the base FS components ranges from 1 (where the method collapses into
a single standard FS algorithm), up to 128. The classification accuracies of the three
earlier adopted classifiers are displayed. The resultant FSE sizes are also included.
Most of the tested data sets reveal an increase in accuracy when the number
of base components rises. This is intuitively appealing as more ensemble members
generally improve the group diversity. If computational resources allow, it may be beneficial to employ more base components, where the proposed approach excels
for its constant evaluation complexity (defined by ∆α) as discussed in Section 4.1.3.
For data sets such as arrhy, sonar, and libra, the size of the FSE shows very
minor variation (less than one feature on average), while an almost exponential increase
is observed for the data sets handw (257 features) and multi (650 features). It is
interesting to point out that increasing the number of base components does not
necessarily guarantee better performance, as demonstrated by the libra data set.
The accuracies of all classifiers peak at over 72% for this data set when 8
components are used, after which performance gradually declines.
4.3 Summary
This chapter has introduced an occurrence coefficient-based FSE (OC-FSE) approach
and detailed three distinctive techniques in an effort to implement this approach.
In OC-FSE, the outcomes of multiple, different FS runs are integrated,
for the purpose of producing a high level view that helps to perform the subsequent
classification tasks. The key advantage of OC-FSE (and FSE in general) is that the
end classifier performance is no longer dependent upon merely one selected subset,
making it a potentially more flexible and robust technique, especially in dealing with
high dimensional and large data sets. For such data sets, multiple feature subsets
attaining equally high scores may be discovered when judged by a single
feature evaluator, but not all may perform equally well in terms of classification.
Two of the proposed implementations, the stochastic search-based and the data
partition-based, use just a single subset evaluation algorithm; whilst the mixture
of algorithms approach aims to produce the ensemble by combining distinctive FS
evaluation measures.
Comparative experimental studies have demonstrated that OC-FSE significantly
improves over single FS results, when combining subsets discovered using the three
Figure 4.5: Comparison of averaged OC-FSE classification accuracies and subset sizes, plotted against different numbers of base FS components
FSE construction techniques, respectively. The results have shown the strength
of OC-FSE in dealing with almost all data sets tested, having at least comparable
classification accuracies to those of ordinary FSEs built using the same subsets, whilst
reducing the overall ensemble complexity. In particular, the stochastic search-based
approach appears to perform better than the rest, which may have benefited from
the high quality search results ensured by HSFS.
The limitations of the current OC-FSE implementations are discussed in Section 9.2.1.2,
which also includes planned future extensions that may further improve the efficacy of the
present work. It is worth pointing out here that ensemble-based classification models,
especially those that rely on the underlying feature subsets to create different views
of data, are also prone to irrelevance and redundancy. This is because the selected
feature subsets, unless carefully controlled, may not generate sufficient diversity.
Experimental results also confirm that employing an arbitrarily large BCP does
not guarantee good ensemble performance. The subsequent chapter explores the
potential of FS mechanisms in identifying the less informative ensemble members
(the so-called base classifiers), so that the efficiency and classification performance
of the resultant ensemble may be further enhanced.
Chapter 5
HSFS for Classifier Ensemble
Reduction
THIS chapter is a continuation of the investigation carried out by its predecessor,
where HSFS has been utilised to generate distinctive underlying feature subsets,
thus enabling the construction of diverse groups of base classifiers, i.e., FSEs. The
goal of classifier ensemble reduction (CER) [250] (or classifier ensemble pruning)
studied in this chapter, is to reduce the level of redundancy in such pre-constructed
pools of base classifiers, in order to identify a much reduced subset of classifiers that
can still deliver comparable classification results. Alternative approaches to building
classifier ensembles (other than FSE) also involve diversifying the training data [27], or random partitioning of the input space [102], before finally aggregating their
decisions together to produce the ensemble prediction.
CER is an intermediate step between ensemble construction and decision aggre-
gation. Efficiency is one of the obvious gains from CER. Having a reduced number
of classifiers can eliminate a portion of run-time overhead, making the ensemble
processing quicker; having fewer classifiers also means relaxed memory and storage
requirements. Removing redundant ensemble members may also lead to improved
diversity within the group, further increasing the prediction accuracy of the ensemble.
Existing approaches in the literature include techniques that employ clustering [85] to discover groups of models that share similar predictions, and subsequently prune
each cluster separately. Others use reinforcement learning [200] and multi-label
learning [178] to achieve redundancy removal. A number of similar approaches
[150, 251] focus on selecting a potentially optimal subset of classifiers, in order to
maximise a certain predefined diversity measure.
In this chapter, a new framework for CER is presented which builds upon the
ideas from existing FS techniques. Inspired by the analogies between CER and FS,
this approach attempts to discover a subset of classifiers by eliminating redundant
group members, while maintaining (or increasing) the level of diversity within the
original ensemble. As a result, the CER problem is tackled from a different
angle: each ensemble member is now transformed into an artificial feature in a
newly constructed data set, and the “feature” values are generated by collecting
the respective classifier predictions. FS algorithms can then be used to remove
redundant features (now representing classifiers) in the present context, in order
to select a minimal classifier subset while maintaining original ensemble diversity,
and preserving ensemble prediction accuracy. The current CER framework extends
the original idea [61] that works exclusively with the fuzzy-rough subset evaluator
[126], thus allowing many different FS evaluators and subset search methods to be
used. It is also made scalable for reducing very large classifier ensembles.
The fusion of CER and FS techniques is of particular significance for problems
that place high demands on both accuracy and speed, including intelligent robotics
and systems control [161]. For instance, simultaneous mapping and localisation
has been identified as a very important task for building robots [176]. To perform
such tasks, apart from the direct use of raw data or simple features as geometric
representations, different approaches that capture more contextual information have
been utilised recently [149]. It has been recognised that ensemble-based methods
may better utilise these additional cognitive and reasoning mappings in order to
boost the performance. In effect, CER may be adopted to prune down the redundant,
unessential models, so that the complexity of the resultant system is restricted to a
manageable level. Also, FS has already been successfully applied to challenging real-
world problems like Martian terrain image classification [224], and to reducing the
computational cost in vision-based robot positioning [263] and activity recognition
[253]. It is therefore of natural appeal to be able to integrate classifier ensembles
and CER, in order to further enhance the potential of both types of approach.
The remainder of this chapter is laid out as follows. Section 5.1 introduces the
key concepts of the proposed CER framework that builds upon the HSFS algorithm,
illustrating how CER can be modelled as an FS problem, and details the approach
developed to address this task. Section 5.2 presents the experimentation results
along with discussions. Section 5.3 summarises the chapter.
5.1 Framework for Classifier Ensemble Reduction
For most practical scenarios, the classifier ensemble is generated and trained using
a set of given training data. For new samples, each ensemble member individually
predicts a class label; these predictions are then aggregated to provide the ensemble decision.
It is inevitable that such ensembles contain redundant classifiers that share very
similar if not identical models. This may be caused by the shortage of training data,
or the performance limitations of the model diversification process. Such ensemble
members, while occupying valuable system resources, are likely to draw the same
class prediction for new samples, and therefore provide very limited new information
to the group.
The ensemble reduction process, if carried out in between ensemble generation
and aggregation, may reduce the amount of redundancy in the system. The benefit
of having a group of classifiers is to maintain and improve the ensemble diversity.
The fundamental concept and goals of CER are therefore the same as those of FS. Having
already introduced the HSFS technique (Chapter 3), the following section focuses
on explaining how a CER problem can be converted into an FS scenario, and details
the framework proposed to efficiently perform the reduction. The overall approach
developed in this work is illustrated in Fig. 5.1, which contains four key steps.
5.1.1 Base Classifier Pool Generation
Forming a diverse base classifier pool (BCP) is the first step in producing a good
classifier ensemble. Any preferred methods can be used to build the base classifiers,
such as Bagging [27] or Random Subspace [102]. A BCP can either be created using
a single classification algorithm, or through a mixture of classifiers.
Bagging can generate a number of base learners each from a different bootstrap
data set by calling the same base learning algorithm. A bootstrap data set is obtained
by sub-sampling the original training data set with replacement. The size of the
data set is the same as that of the training data set. Thus, for a (conventional)
bootstrap sample [27], some training instances may appear but some may not,
Figure 5.1: Overview of CER
where the probability that an example appears at least once is about 0.632 [295]. After obtaining the base learners, Bagging combines their outputs using aggregation
methods such as majority vote [249], and the most-voted class is predicted. The
pseudocode for Bagging is shown in Algorithm 5.1.1. Since Bagging randomly selects
different subsets of training samples in order to build diverse classifiers, differences
in the training data present extra or missing information for different classifiers,
resulting in different classification models.
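The 0.632 figure quoted above can be recovered directly: the chance that a particular instance is never drawn in |X| samples with replacement is (1 − 1/|X|)^|X|, which tends to e⁻¹ ≈ 0.368, so the chance of appearing at least once tends to ≈ 0.632. A quick check in plain Python (illustrative only, no thesis code assumed):

```python
import math

def inclusion_probability(n):
    """Probability that a given instance appears at least once in a
    bootstrap sample of size n drawn with replacement from n items."""
    return 1.0 - (1.0 - 1.0 / n) ** n

# The limit 1 - 1/e ≈ 0.632 is approached quickly as n grows.
print(round(inclusion_probability(1000), 3))  # ≈ 0.632
print(round(1.0 - 1.0 / math.e, 3))           # 0.632
```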
The Random Subspace method randomly generates different subsets of domain
attributes and builds various classifiers on top of each of such subsets. Algorithm
5.1.2 outlines the basic procedures, assuming a predefined subspace size of s, and
the features are chosen randomly without replacement. The differences between the
subsets create different viewpoints of the same problem [40], typically resulting in
different boundaries for classification.
For a single base classification algorithm, these two methods both provide a
good level of diversity. In addition, a mixed classifier scheme is implemented in
the presented work. By selecting classifiers from different schools of classification
1  t, number of training rounds
2  C_i, i = 1, ..., t, base learners
3  X_i, i = 1, ..., t, bootstrap data sets
4  x_j, j = 1, ..., |X|, original training objects
5  Y = A ∪ Z, conditional and decision features
6  for i = 1 to t do
7      X_i = ∅
8      while |X_i| < |X| do
9          random r, 1 ≤ r ≤ |X|
10         X_i = X_i ∪ {x_r}
11     Train C_i using (X_i, Y)
Algorithm 5.1.1: Bagging algorithm
1  t, number of training rounds
2  C_i, i = 1, ..., t, base learners
3  A_i, i = 1, ..., t, random feature subsets
4  a_j, j = 1, ..., |A|, original set of features
5  Y = A ∪ Z, conditional and decision features
6  for i = 1 to t do
7      A′ = A
8      A_i = ∅
9      while |A_i| < s do
10         random a_r, a_r ∈ A′
11         A_i = A_i ∪ {a_r}
12         A′ = A′ \ {a_r}
13     Train C_i using (X, A_i ∪ Z)
Algorithm 5.1.2: Random Subspace algorithm
algorithms, the diversity is naturally achieved through the various foundations of
the algorithms themselves.
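The Random Subspace step of Algorithm 5.1.2 amounts to drawing s features without replacement for each learner; again, `build_classifier` is a hypothetical stand-in for the chosen base algorithm:

```python
import random

def random_subspace_pool(features, build_classifier, t, s):
    """Train t base learners, each on a random subset of s features
    chosen without replacement from the full feature set."""
    pool = []
    for _ in range(t):
        subset = random.sample(features, s)  # s distinct features
        pool.append((subset, build_classifier(subset)))
    return pool
```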
5.1.2 Classifier Decision Transformation
Once the base classifiers have been built, their decisions on the training instances are
also gathered. For base classifiers Ci, i = 1, 2, . . . , |C|, Ci ∈ C, and training instances
x j, j = 1,2, . . . , |X |, where |C| is the total number of base classifiers, and |X | is the
total number of training instances, a decision matrix as shown in Table 5.1 can be
constructed. The value di j represents the ith classifier’s decision on the jth instance.
For supervised FS, a class label is required for each training sample; the same class
attribute is taken from the original data set and assigned to each of the instances.
Note that both the total number of instances and the relations between instances
and their class labels remain unchanged. Although all attributes and values are
completely replaced by transformed classifier predictions, the original class labels
remain the same. A new data set is therefore constructed: each column represents
an artificially generated feature, each row corresponds to a training instance, and
each cell stores the transformed feature value.
Table 5.1: Classifier ensemble decision matrix

X       C1       C2       ···   Ci       ···   C|C|
x1      d11      d21      ···   di1      ···   d|C|1
x2      d12      d22      ···   di2      ···   d|C|2
...     ...      ...            ...            ...
xj      d1j      d2j      ···   dij      ···   d|C|j
...     ...      ...            ...            ...
x|X|    d1|X|    d2|X|    ···   di|X|    ···   d|C||X|
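The transformation described above can be sketched as follows; classifiers are represented as hypothetical prediction callables, and each returned row follows the layout of Table 5.1 with the original class label appended as the decision attribute:

```python
def build_decision_dataset(pool, instances, labels):
    """Row j holds the predictions d_ij of every classifier C_i on
    instance x_j, plus the original class label of x_j."""
    return [[clf(x) for clf in pool] + [y]
            for x, y in zip(instances, labels)]

# Two toy "classifiers" over integer instances:
pool = [lambda x: x % 2, lambda x: int(x > 1)]
rows = build_decision_dataset(pool, [1, 2, 3], ['a', 'b', 'a'])
# rows == [[1, 0, 'a'], [0, 1, 'b'], [1, 1, 'a']]
```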
5.1.3 FS on Transformed Data set
HSFS is then performed on the artificial data set, evaluating the emerging feature
subset using the predefined subset evaluator (such as the fuzzy-rough dependency
measure [126]). HSFS optimises the quality of discovered subsets, while trying to
reduce subset sizes. When HS terminates, its best harmony is translated into a feature
subset and returned as the FS result. The features then indicate their corresponding
classifiers that should be included in the learnt classifier ensemble. For example,
if the best harmony found by HS is {C−, C9, C3, C23, C3, C5, C17, C−}, the translated
artificial feature subset is then {C3, C5, C9, C17, C23}. Thus, the 3rd, 5th, 9th, 17th
and 23rd classifiers will be chosen from the BCP to construct the classifier ensemble.
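Under the assumption that a harmony is represented as a list with one entry per musician (a classifier index, or None for the discard note C−), the translation step reduces to deduplication:

```python
def harmony_to_subset(harmony):
    """Translate an HS harmony into the selected classifier indices:
    drop None (C-) entries and duplicate picks, then sort."""
    return sorted({c for c in harmony if c is not None})

# The example from the text: C-, C9, C3, C23, C3, C5, C17, C-
print(harmony_to_subset([None, 9, 3, 23, 3, 5, 17, None]))  # [3, 5, 9, 17, 23]
```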
5.1.4 Ensemble Decision Aggregation
Once the classifier ensemble is constructed, new objects are classified by the ensemble
members, and their results are aggregated to form the final ensemble decision output.
Such an aggregation process allows the evidence from different sources, i.e., the
class labels predicted by the individual base classifiers, to be combined. This is in
order to derive a degree of belief (represented as a certain belief function [220]) that
takes into account all the available evidence. In particular, the Average of Probability
[114] method is used in this chapter. Given ensemble members C_i, i = 1, 2, ..., |C|,
and decision classes d_j, j = 1, 2, ..., |Ω_z|, where |C| is the ensemble size and |Ω_z| is
the number of decision classes, the classifier decisions can be viewed as a matrix of
probability distributions p_ij, i = 1, 2, ..., |C|, j = 1, 2, ..., |Ω_z|. Here, p_ij indicates the
prediction from classifier C_i for decision class d_j. The final aggregated decision is
the class with the highest prediction averaged across all ensemble members,
as shown in Eq. 5.1.

\[
\left( \frac{\sum_{i=1}^{|C|} p_{i1}}{|C|}, \frac{\sum_{i=1}^{|C|} p_{i2}}{|C|}, \ldots, \frac{\sum_{i=1}^{|C|} p_{i|\Omega_z|}}{|C|} \right) \tag{5.1}
\]
Note that this is effective because redundant classifiers have now been removed. As such,
the usual alternative aggregation method, majority vote [249], is no longer favourable,
since the “majority” has now been significantly reduced.
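Eq. 5.1 reduces to an arithmetic mean per class followed by an argmax. A minimal sketch, with each ensemble member's class-probability distribution given as a list:

```python
def average_of_probability(distributions):
    """Return the index of the class with the highest class probability
    averaged over all ensemble members (Eq. 5.1)."""
    n = len(distributions)       # |C|, ensemble size
    k = len(distributions[0])    # number of decision classes
    averaged = [sum(dist[j] for dist in distributions) / n for j in range(k)]
    return max(range(k), key=lambda j: averaged[j])

# Two members, three classes; averaged probabilities are (0.45, 0.40, 0.15)
print(average_of_probability([[0.6, 0.3, 0.1], [0.3, 0.5, 0.2]]))  # 0
```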
5.1.5 Complexity Analysis
Various factors affect the overall complexity of the proposed CER framework, namely
the performance of the base classification algorithm and that of the subset evaluator. Since
the proposed CER framework is generic and not limited to a specific collection of
methods, in the following analysis, O_C(train), O_C(test), and O_eval are used to represent
the complexity of training and testing the employed base classifier, and that of
the subset evaluator, respectively. The amount of time required to construct the
base ensemble, (O_Bagging + O_C(train)) × |C|, can be rather substantial if the size of the
ensemble |C| is very large. The process of generating the artificial training data
set is straightforward, requiring only O_C(test) × |C| × |X|, where |X| is the number of
instances.
Recall from Section 3.3.3 that HSFS requires O_eval × g_max to perform the subset
search, as the total number of evaluations is controlled by the maximum number of
iterations g_max. Note that the subset evaluation itself can be time consuming for high
dimensional data (large sized ensembles). As for the complexity of the HS algorithm itself:
the initialisation requires O(|P| × |H|) operations to randomly fill the subset storage,
where |P| is the number of musicians. The improvisation process is of the order
O(|P| × g_max), because every feature selector needs to produce a new feature at every
iteration. Finally, the complexity of predicting the class label for any new sample is
O_C(test) × |C|, where |C| is now the size of the reduced ensemble.
5.2 Experimentation and Discussion
To demonstrate the capability of the proposed CER framework, a number of exper-
iments have been carried out. The implementation works closely with the WEKA
[264] data mining software which provides software realisation of the algorithms
employed, and an efficient platform for comparative evaluation. To implement the
ideas proposed in this chapter, the main ensemble construction method adopted
is the Bagging approach [27], and the base classifier learner used is the decision
tree-based C4.5 algorithm [264]. The Correlation Based FS [93, 94] (CFS), the
Probabilistic Consistency Based FS [52] (PCFS), and the FS technique developed
using fuzzy-rough set theory [126] (FRFS) are employed as the feature subset eval-
uators. The HSFS algorithm then works together with the various evaluators to
identify quality feature (classifier) subsets. In order to demonstrate the scalability
of the framework, the base ensembles are created in three different sizes: 50, 100,
and 200. A collection of real-valued UCI [78] benchmark data sets are used in the
experiments, a number of which are very large in size and high in dimensionality, and
hence present significant challenges for the construction and reduction of ensembles.
The parameters used in the experiments and the information of the data sets are
summarised in Table 5.2.
Table 5.2: HS parameter settings and data set information
|H| |P| δ gmax
10-20 |A| 0.5-1 2000
Data set   Features   Instances   Decisions
arrhy        280        452          16
cleve         14        297           5
ecoli          8        336           8
glass          9        214           6
heart         13        270           2
ionos         35        230           2
libra         91        360          15
ozone         73       2534           2
secom        591       1567           2
sonar         61        208           2
water         39        390           3
wavef         41       5000           3
wine          14        178           3
Stratified 10-FCV is employed for data validation, where a given data set is
partitioned into 10 subsets. Of these 10 subsets, nine are used to form a training
fold, and a single subset is retained as the testing data. The construction of the base
classifier ensemble, and the ensemble reduction process are both performed using
the same training fold, so that the reduced subset of classifiers can be compared
using the same unseen testing data. This process is then repeated 10 times (the
number of folds). The advantage of 10-FCV over random sub-sampling is that all
objects are used for both training and testing, and each object is used for testing only
once per fold. The stratification of the data prior to its division into different folds
ensures that each class label has equal representation in all folds, thereby helping
to alleviate bias/variance problems [18]. The experimental outcomes presented
are averaged values over 10 different 10-FCV runs, in order to lessen the
impact of random factors within the heuristic algorithms.
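The stratified partitioning described above can be sketched without any library support; this is an illustrative round-robin scheme, not the exact implementation used in the experiments:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign instance indices to k folds so that each class is
    spread as evenly as possible across all folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):  # round-robin per class
            folds[pos % k].append(idx)
    return folds
```

Each of the k folds then serves once as the test set, with the remaining k − 1 forming the training fold.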
5.2.1 Reduction Performance for Decision Tree-Based
Ensembles
In this set of experiments, the BCP is built using a decision tree-based approach, and
C4.5 [264] is selected as the base algorithm. Table 5.3 summarises the three
sets of results obtained for CFS, PCFS, and FRFS respectively, after applying CER, as compared
against the results of using: (1) the base algorithm itself, (2) the full base classifier
pool, and (3) randomly formed ensembles. Entries in bold indicate that the selected
ensemble performance is either statistically equivalent to, or significantly improved over,
that of the original ensemble, according to a paired t-test with two-tailed threshold
p = 0.01.
Two general observations can be drawn across all three set-ups: (1) The prediction
accuracies of the constructed classifier ensembles are universally superior to those
achievable using a single C4.5 classifier. Most of the data sets that reveal the
greatest performance increase are either large in size or high in dimensionality.
This reinforces the benefit of employing classifier ensembles. (2) All FS techniques
tested demonstrate substantial ensemble size reduction, showing clear evidence of
dimensionality reduction.
For the original ensembles of size 50, the CFS evaluator performs very well. In
seven out of 11 tested data sets, CFS achieves comparable or better classification
accuracy when compared with the original ensemble. The FRFS evaluator also
Table 5.3: Comparison on C4.5 classification accuracy

           CFS          PCFS         FRFS         Random       Full Base    C4.5
Data set   Acc.%  Size  Acc.%  Size  Acc.%  Size  Acc.%  Size  Acc.%  Size  Acc.%

Base Ensembles of Size 50
arrhy      74.59  21.6  71.93   5.3  74.81  26.3  73.71  10    74.47  50    66.39
cleve      55.54  25.8  56.57   5.7  56.60  13.6  54.16  10    54.90  50    50.21
ecoli      84.55  11.6  83.95   6.8  83.96  23.8  83.94  10    84.24  50    81.88
glass      74.46  15.0  66.45   4.6  76.71  11.9  72.94  10    70.24  50    70.15
ionos      91.30  10.8  90.00   3.2  90.43   3.1  90.00  10    90.87  50    87.39
libra      79.44  23.2  74.72   3.5  78.89  15.4  77.78  10    81.67  50    71.39
ozone      93.88  26.2  94.12  12.3  93.96  43    93.40  10    94.00  50    92.94
secom      93.30  35.9  92.79   6.3  92.92   6.3  93.11  10    93.24  50    89.28
sonar      75.31  24.5  71.93   3.3  71.05   3.2  72.45  10    75.88  50    70.05
water      87.69  20.9  83.33   4    84.61   6.1  84.87  10    86.67  50    80.00
wavef      82.92  42.2  81.50   8.7  82.47  11    81.00  10    82.98  50    75.50

Base Ensembles of Size 100
arrhy      73.91  28.3  73.04   5.2  74.37  22.3  73.26  20    74.47  100   66.39
cleve      54.56  30.4  58.26   6.7  54.46  11.8  55.56  20    56.56  100   50.21
ecoli      84.85  13.7  85.76   6.4  85.16  24.2  84.84  20    84.25  100   81.88
glass      71.60  16.5  70.58   4.5  74.31  11.7  72.53  20    74.42  100   70.15
ionos      89.13  14.3  90.43   3.1  84.35   3.2  90.87  20    91.74  100   87.39
libra      80.83  33.0  74.17   3.5  77.78  15.3  77.22  20    80.28  100   71.39
ozone      94.24  31.8  93.84  13.5  94.16  74.2  94.16  20    94.16  100   92.94
secom      93.43  59.4  93.04   6.2  92.51   6.1  93.00  20    93.30  100   89.28
sonar      75.36  30.4  72.88   3.8  75.36   3.5  72.93  20    75.36  100   70.05
water      87.18  25.7  85.64   4.7  86.15   6.2  87.18  20    86.92  100   80.00
wavef      83.20  71    80.88   9    83.33  11    82.90  20    83.42  100   75.50

Base Ensembles of Size 200
arrhy      75.47  39.9  72.80   5.7  73.04  21.3  74.37  40    75.25  200   66.39
cleve      57.93  45    52.56   5.8  55.24  11.9  55.54  40    54.90  200   50.21
ecoli      83.96  24.5  83.94   6.6  84.29  24.3  84.86  40    84.54  200   81.88
glass      72.53  25.9  72.97   4.8  72.49  11.6  72.08  40    73.94  200   70.15
ionos      90.87  20    86.09   3.2  90.87   3.6  89.57  40    91.74  200   87.39
libra      81.67  41    74.17   4    81.11  15.1  79.17  40    79.44  200   71.39
ozone      94.55  45    93.65  28.9  94.49  143   94.40  40    94.24  200   92.94
secom      93.36  95.7  92.92   6.2  93.38   6    93.30  40    93.36  200   89.28
sonar      78.69  45.8  73.36   4.1  74.31   4.5  74.88  40    75.83  200   70.05
water      87.95  38.6  83.33   4.3  85.87   6.8  86.67  40    87.95  200   80.00
wavef      83.12  107.2 81.06   9.3  82.40  12    82.76  40    83.48  200   75.50
delivers good accuracies in four data sets while having fairly small reduced ensembles.
The PCFS evaluator only produces equally good solutions for the cleve and ozone data sets;
however, it has the most noticeable ensemble size reduction ability. The reduced
ensembles demonstrate increased classification performance for the cleve, glass,
and water data sets.
For the medium (100) sized ensembles, both CFS and FRFS produce good results
in five data sets, however, none of these further improves the ensemble classification
accuracy. Although PCFS only achieves the best performance for the cleve data set,
it attains an average accuracy of 58.26% across all 10 × 10 reduced
ensembles, with an averaged size of only 6.7. Note that for the ozone and sonar
data sets, the reduced ensembles discovered by CFS and FRFS both show very similar
averaged accuracy, which is almost identical to that of the original full ensembles.
This may indicate that the key members of the ensembles are indeed present in
the reduced subsets, with FRFS eliminating the most redundancy (an average reduced
ensemble size of 3.5) for the sonar data set.
For the large sized ensembles, CFS shows a clear lead in terms of the overall quality
of the reduced ensembles, scoring equal classification accuracy for five data sets, and
delivering an improvement in ensemble accuracy for the cleve, libra, and sonar
data sets. This experimentally demonstrates the capability and benefit of employing
the proposed CER framework in dealing with large sized ensembles and large,
complex data sets. FRFS also produces good quality ensembles of much reduced
size, showing its strength in redundancy removal. PCFS is not competitive in this set
of experiments; this may be due to its (perhaps overly) aggressive reduction behaviour,
which possibly results in certain quality ensemble members being ignored.
5.2.2 Alternative Ensemble Construction Approaches
The following set of experiments compares the supervised FS approach (FRFS) with
its unsupervised counterpart [169] (U-FRFS). A total of 10 different base classification algorithms
are selected, containing one to two distinctive classifiers from each representative
classifier group. The selected methods include fuzzy-based fuzzy nearest neighbours
[140], fuzzy-rough nearest neighbours [117], vaguely quantified fuzzy-rough nearest
neighbours [117], lazy-based k-nearest neighbours [4], tree-based C4.5 [264], reduced
error pruning tree [70], rule-based methods with repeated incremental pruning to
produce error reduction [264] and projective adaptive resonance theory [264], naïve
Bayes [132], and multilayer perceptron [98].
Bagging [27] and Random Subspace [102] are subsequently used to create
differentiation between classifiers to fill the total BCP of 50. Figures 5.2 and 5.3
show the experimental results, using these two methods respectively. Due to the
considerable system resources required to construct and maintain the base ensembles,
this set of experiments is carried out using ensembles of size 50 with lower
dimensional benchmark data sets.
For mixed classifiers created using Bagging (Fig. 5.2), the FRFS method finds
ensembles with much greater size variation. For the ecoli data set in particular, the
averaged ensemble size is 15.98. The results indicate that many distinctive features
(i.e., classifiers contributing good diversity) are present. This particular ensemble also results in
the highest accuracy for ecoli compared against the other approaches, with 87.67%
BCP accuracy and 86.66% ensemble accuracy. A large performance decrease is also
noticed for the sonar data set. Interestingly, the unsupervised FRFS achieves better
overall performance than its supervised counterpart, with smaller selected ensemble
sizes.
Figure 5.2: Mixed classifiers using Bagging
The Random Subspace based mixed classifier scheme (Fig. 5.3) produces better
base pools in 7 out of 9 cases. Both FRFS and U-FRFS find smaller ensembles on
average than the case where Bagging is used. Neither method suffers from an extreme
performance decrease following reduction, unlike the results obtained when a single
base algorithm is employed. Despite having a BCP that underperforms for the ecoli
data set, both methods manage to achieve an increase of 5% in accuracy. The quality
of the mixed classifier group is lower than that of the C4.5 based single algorithm
approach for several data sets. This is largely due to the use of non-optimised base
classifiers. It can be expected that the results achievable after optimisation would be
even better.
Figure 5.3: Mixed classifiers using Random Subspace
5.2.3 Discussion
Although the execution times of the examined approaches have not been precisely
recorded and presented, it was observed during the study that data sets with a large
number of instances, such as ozone, secom, and wavef, all require a
substantial amount of time for the reduction process. This observation seems to be
consistent with the findings of the complexity analysis in Section 5.1.5: the reduction
process relies on the efficacy of the evaluators (which may not scale linearly with
the number of training instances), and thus, for huge data sets, it may be beneficial
to choose lighter-weight evaluators (such as CFS). However, since the reduction
process itself can be performed independently and separately from the main ensemble
process, CER is generally treated as a pre-processing step (similar to FS) for the
ensemble classification, or a post-processing refinement procedure for the generated
raw ensembles. The time complexity for such processes is less crucial and has less
impact.
The experimental evaluation also reveals that different evaluators show distinctive
characteristics when producing the reduced ensemble. For example, PCFS consistently
delivers very compact ensembles (with fewer than 10 members for most data
sets). CFS excels in terms of ensemble classification accuracy but with much larger
sized subsets. FRFS is balanced between ensemble accuracy and dimensionality re-
duction, with very occasional large solutions (the ozone data set). The unsupervised
method also produces comparable results to its supervised counterparts.
Note that for a number of experimental data sets, performing CER does not always
yield subsets with equal or better performance. This may be due to the employed filter-based
FS approaches (which do not cross-examine against the original data in terms
of classification accuracy). How the concepts developed by existing wrapper-based and
hybrid FS techniques may be applied to further improve the framework remains an active
research question. The information lost through reduction (even from the redundant classifiers)
may also be the cause of such decreases in performance. Similar behaviour has also
been observed in the FS problem domain. The quality (such as size and variance) of
the training data also plays a very important role in CER: the classifiers that were
deemed redundant by the subset evaluators may in fact carry important internal
models, which are simply not sufficiently reflected by the available training samples.
5.3 Summary
This chapter has presented a new approach to CER. It works by applying FS techniques
to minimise redundancy in an artificial data set, generated by transforming a
given classifier ensemble's decision matrix. The aim is to further reduce the size of
an ensemble, while maintaining and improving classification accuracy and efficiency.
Experimental comparative studies show that several existing FS approaches can
produce good solutions by employing the proposed approach. Reduced ensembles
are found with comparable classification accuracies to those of the original ensembles,
and in most cases also provide good improvement over the performance achievable
by the base algorithm alone. The characteristics of the results also vary depending
on the employed FS evaluator. As a novel application of the FS concept in the area
of classifier ensemble learning, the present work has identified a promising direction
for future theoretical research, and has laid the necessary foundation upon which
further extensions and refinements may be built. More in-depth discussions
regarding these ideas are given in Section 9.2.1.3.
Chapter 6
HSFS for Dynamic Data
Most of the FS techniques discussed so far focus on selecting from a static pool of
training instances with a fixed number of original features. However, for most
real-world domains, data may be gradually refined, and information regarding the
problem domain may be actively added and/or removed. Dynamic FS [58, 99, 289],
also referred to as on-line FS [268], has attracted significant attention recently.
Unlike conventional, off-line FS, which is performed with all features and instances
present a priori, dynamic FS considers situations where the information
regarding a certain problem domain is not fully available. The extraction of features,
or the procedure of collecting new instances may be difficult or time consuming.
New sets of features or instances may only be presented in an incremental fashion,
and the FS technique needs to adapt to the new information quickly and accurately.
Existing studies in the literature typically work with a classifier learner [99, 260,
289], but also involve alternative applications such as prediction [76]. Little work
has been carried out for studying situations where features or instances are removed.
However, such scenarios may be common for applications where data have a limited
validity [25, 34], and outdated information needs to be removed to ensure data
consistency or simply to save storage space. As previously explained in Chapters 4
and 5, a classifier ensemble [295] exploits the uncorrelated errors within a group
of classifiers caused by their diverse internal models [217], in order to increase
the classification accuracy over single classifier systems. FSE in particular, is an
effective type of classifier ensemble that generates a group of classifiers with diverse
underlying feature subsets, thereby creating different views of the original data
[195, 197]. Nature-inspired FS search techniques [192, 261] such as HSFS [62] can
help to construct such ensembles by producing multiple, compact, and high quality
feature subsets.
In this chapter, theoretical discussions are presented with respect to four basic dynamic
FS scenarios: feature addition, feature removal, instance addition, and instance
removal. It provides an insight of how a nature-inspired meta-heuristic such as HSFS
may be beneficial in more complex situations (arbitrary combination of the possible
events). A dynamic FS technique termed “dynamic HSFS” (D-HSFS) is proposed,
which is capable of actively maintaining the quality of an emerging feature subset for a
given changing data set. Its stochastic mechanisms also allow multiple good feature
subsets to be identified simultaneously. The subsequent part of the chapter further
investigates the feasibility of implementing an adaptive FSE framework (A-FSE)
using these actively refined feature subsets.
The remainder of the chapter is organised as follows. Section 6.1 introduces the
concept of dynamic FS with a discussion of four basic dynamic scenarios. Section 6.2
explains the proposed D-HSFS algorithm which handles arbitrary combinations of
the basic dynamic FS events. A generic A-FSE technique is presented in Section 6.3,
aiming to better handle changing data. An implementation of this technique using
D-HSFS is also detailed. Section 6.4 reports the results of an experimental investigation,
in order to demonstrate the efficacy of the proposed approach. Finally, Section 6.5
provides a brief summary of the work.
6.1 Dynamic FS Scenarios
The aim of a conventional (static), subset-based FS algorithm, as previously intro-
duced in Section 1.1, is to determine an optimal feature subset B ⊆ A with the best
evaluation score f (B) and minimum size |B|. Such a feature subset may encapsulate
the original concept to the maximum extent, and be able to distinguish the training
instances into their respective classes. Here f : 2^A → [0, 1] is a subset evaluation
function that maps feature subsets onto real-valued scores.
The nature of dynamic data sets requires any FS algorithm to depart from being a
one-off pre-processing step and become a recursive procedure. A previous feature
subset Bk obtained on the basis of the data set at an earlier state (Xk, Ak) needs to be
re-evaluated, against any newly added information as well as any removed instances
and features. This will produce a modified feature subset Bk+1 that has adapted to
the changed data (Xk+1, Ak+1). In this chapter, for simplicity, it is assumed that the
possible class labels are predefined and unaltered throughout the process: Zk = Zk+1.
Four common scenarios that may occur in a given dynamic FS environment are
introduced below, considering events where features or instances may be added or
removed. Insights are also provided regarding how a previously selected subset of
features Bk may be improved, in order to discover a higher quality feature subset
Bk+1. For ease of discussion, let Bk denote the feature subset that is of the highest
achievable evaluation score for the previous state of the data. The fuzzy-rough
dependency measure exploited by FRFS [126] is adopted in this section, in order to
provide concrete examples of the dynamic procedures. The evaluation function is
written as f^{Xk}_{Ak}(Bk), which signifies the quality of a given feature subset Bk,
evaluated on the basis of the data set at its current state (Xk, Ak). For dynamic
FRFS, the aim is (still) to find a fuzzy-rough reduct Rk ⊆ Ak, which is defined as a
subset of features that preserves the dependency degree of the unreduced data, i.e.,
f^{Xk}_{Ak}(Rk) = f^{Xk}_{Ak}(Ak).
6.1.1 Feature Addition
This scenario considers the situation where new features are incrementally added
during the FS process, i.e., |Ak+1|> |Ak|, whilst the set of training instances X remains
static. In particular, if the currently available set of features Ak is already capable of
fully distinguishing all objects x ∈ X into their respective classes, any subsequent
feature addition will bring no improvement to the discernibility of the data set.
Therefore, no further selection is necessary for the purpose of improving f (Bk).
However, if Ak itself was not informative enough, then it is crucial to examine the
new features Ak+1 \ Ak, in order to improve the discernibility of the subset. Ideally,
every feature a ∈ Bk should also be checked, and amended (in the sense of being
removed or replaced), as the new features may be more informative and hence may
help to further reduce |Bk|. This step may be skipped for time-critical applications,
at the risk of producing feature subsets that are not globally optimal.
For dynamic FRFS, the properties of a fuzzy-rough reduct Rk can be exploited to
significantly simplify such a dynamic process. If the existing set of features Ak can
already fully discern all of the instances in Xk with respect to their associated classes,
i.e., f^{Xk}_{Ak}(Ak) = 1, and a reduct has been identified, then:

∀k′ > k : f^{Xk}_{Ak}(Rk) = f^{Xk}_{Ak}(Ak) = f^{Xk′}_{Ak′}(Ak′) = 1, Ak ⊆ Ak′    (6.1)

and no further modification to Rk is necessary. However, if full fuzzy-rough depen-
dency for the data set was not achieved in the previous step, i.e., f^{Xk}_{Ak}(Ak) < 1,
it is then crucial to examine the new features. The procedure that handles dynamic
feature addition for FRFS is detailed in Algorithm 6.1.1.
1 if f^{Xk}_{Ak}(Rk) = 1 then
2   return Rk
3 f* ← f^{Xk}_{Ak}(Rk)
4 while f* ≠ f^{Xk+1}_{Ak+1}(Ak+1) do
5   foreach a ∈ Ak+1 \ Ak do
6     if f^{Xk+1}_{Ak+1}(Rk ∪ {a}) > f* then
7       Rk+1 ← Rk ∪ {a}; f* ← f^{Xk+1}_{Ak+1}(Rk ∪ {a})
8   Rk ← Rk+1
9 return Rk+1
Algorithm 6.1.1: Dynamic FRFS for Feature Addition: Ak ⊆ Ak+1, Xk = Xk+1
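The greedy extension step of Algorithm 6.1.1 can be sketched in Python. The `discernibility` function below is a crisp, hypothetical stand-in for the fuzzy-rough dependency degree f(B); the function names and toy data are illustrative only, not part of FRFS itself:

```python
from itertools import combinations

def discernibility(data, labels, subset):
    """Fraction of differently-labelled instance pairs that some feature
    in `subset` can tell apart (crisp stand-in for the dependency degree)."""
    pairs = [(i, j) for i, j in combinations(range(len(data)), 2)
             if labels[i] != labels[j]]
    if not pairs:
        return 1.0
    ok = sum(1 for i, j in pairs
             if any(data[i][a] != data[j][a] for a in subset))
    return ok / len(pairs)

def adapt_to_new_features(reduct, new_features, data, labels, all_features):
    """Algorithm 6.1.1 (sketch): greedily admit newly added features while
    the current subset falls short of the full data set's score."""
    reduct = set(reduct)
    best = discernibility(data, labels, reduct)
    target = discernibility(data, labels, set(all_features))
    if best >= target:          # already a reduct: nothing to do
        return reduct
    for a in new_features:      # single greedy pass over the additions
        if best >= target:
            break
        score = discernibility(data, labels, reduct | {a})
        if score > best:
            reduct, best = reduct | {a}, score
    return reduct
```

A single greedy pass is used here for simplicity; the thesis's while-loop corresponds to repeating this pass until the target score is met.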
6.1.2 Feature Removal
In contrast to the previous scenario, a particular application may be initialised
with an abundance of features, which are subsequently removed throughout the FS
process. In this case, the overall discernibility of a data set itself may deteriorate,
due to informative features being removed. Particularly, if a feature a belonging
to the current candidate feature subset Bk is deleted: a ∈ Ak \ Ak+1, substitution
of feature(s) may become necessary in order to restore the discernibility of this
subset. If the deletion does not affect Bk, i.e., (Ak \ Ak+1)∩ Bk = ;, it means that the
features being removed are also the previously unselected features (for being less
informative or redundant). In such an event, no further adjustment is necessary,
and the candidate feature subset from the previous state Bk may continue to be
used, since no informative features are lost. The procedure for handling the feature
removal scenario for FRFS is given in Algorithm 6.1.2. Here lines 1 and 2 perform
the necessary check which determines whether the current reduct has been affected
by the removal of features, and the recovery process is initiated only if features are
removed from Rk.
1 Rk+1 ← Rk \ (Ak \ Ak+1)
2 if Rk+1 = Rk then
3   return Rk+1
4 f* ← f^{Xk+1}_{Ak+1}(Rk+1)
5 while f* < f^{Xk+1}_{Ak+1}(Ak+1) do
6   foreach a ∈ Ak+1 \ Rk+1 do
7     if f^{Xk+1}_{Ak+1}(Rk+1 ∪ {a}) > f* then
8       Rk+1 ← Rk+1 ∪ {a}; f* ← f^{Xk+1}_{Ak+1}(Rk+1 ∪ {a})
9 return Rk+1
Algorithm 6.1.2: Dynamic FRFS for Feature Removal: Ak+1 ⊆ Ak, Xk = Xk+1
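The feature-removal case can be sketched similarly: the reduct is first purged of the deleted features, and discernibility is then restored greedily from the surviving pool. As before, `score` is a crisp, hypothetical stand-in for f(B):

```python
from itertools import combinations

def score(data, labels, subset):
    """Crisp discernibility, a stand-in for f(B) in [0, 1]."""
    pairs = [(i, j) for i, j in combinations(range(len(data)), 2)
             if labels[i] != labels[j]]
    return (sum(any(data[i][a] != data[j][a] for a in subset)
                for i, j in pairs) / len(pairs)) if pairs else 1.0

def adapt_to_removed_features(reduct, removed, kept, data, labels):
    """Algorithm 6.1.2 (sketch): purge deleted features, then greedily
    restore discernibility from the surviving feature pool."""
    pruned = set(reduct) - set(removed)
    if pruned == set(reduct):        # removal did not touch the reduct
        return pruned
    best = score(data, labels, pruned)
    target = score(data, labels, set(kept))
    for a in sorted(set(kept) - pruned):
        if best >= target:
            break
        s = score(data, labels, pruned | {a})
        if s > best:
            pruned, best = pruned | {a}, s
    return pruned
```

Note the early return mirrors lines 1-3 of the algorithm: when none of the removed features belonged to the reduct, no recovery is needed.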
6.1.3 Instance Addition
Addition of instances (while the features A remain unaltered) is perhaps the most
commonly encountered situation. Monitoring-based applications [228], or proce-
dures that involve streaming data [268] are typical examples of such a case. When
a new batch of instances is added, subset evaluators such as CFS and PCFS may
initiate the necessary correlation or consistency checking only for the unseen objects.
However, techniques similar to FRFS will require a full re-evaluation against the
entire Xk+1, because the addition of objects will inevitably change the partitioning of
the universe Xk/Z . There are exceptional cases where the data set has accumulated a
sufficiently large number of samples with almost full coverage of the underlying con-
cepts to be learned. Any “new” instances are either the same as, or almost equivalent
to the objects already analysed (judged by a certain similarity relation [123]).
Algorithm 6.1.3 details the dynamic FRFS process for the case of instance addi-
tion. In practical applications, the number of new objects may be very small when
compared to the existing pool of instances, |Xk+1 \ Xk| ≪ |Xk|, and the new objects
may be very similar (or identical) to those already collected, when judged by a
certain measure such as one of the fuzzy similarity functions given in Eqns. 2.18 to
2.20. In such scenarios, the number of features required to be further selected (or
replaced) may be minimal, since the existing feature subset can already sufficiently
discern the new objects. Of course, if the new objects are totally unseen objects, then
a large number of modifications is still necessary.
To further improve the efficiency of the algorithm, the newly added objects may
be checked against the current fuzzy-rough lower and upper approximations of
the existing classes, in order to determine whether they can be subsumed by the
already established partitions. If a given new object (or a group of objects) do not
belong to the existing partitions to a satisfactory degree, it is then an indication that
modifications of the lower and upper approximations are necessary. The effect of
instance addition may be more apparent for FRFS. This is because the addition of
objects may change the fuzzy positive regions µPOSR(x), as the universe (Xk, Ak) may
now be different.
1 if f^{Xk+1}_{Ak+1}(Rk) ≥ f^{Xk}_{Ak}(Rk) then
2   return Rk
3 f* ← f^{Xk+1}_{Ak+1}(Rk)
4 while f* < f^{Xk+1}_{Ak+1}(Ak+1) do
5   foreach a ∈ Ak+1 \ Rk do
6     if f^{Xk+1}_{Ak+1}(Rk ∪ {a}) > f* then
7       Rk+1 ← Rk ∪ {a}; f* ← f^{Xk+1}_{Ak+1}(Rk ∪ {a})
8   Rk ← Rk+1
9 return Rk+1
Algorithm 6.1.3: Dynamic FRFS for Instance Addition: Ak = Ak+1, Xk ⊆ Xk+1
6.1.4 Instance Removal
The last dynamic FS scenario considered in this chapter is instance removal. Many
training objects may be available at the beginning of the FS process, but may have to
be removed later, either because the information has become outdated, or simply
because a space limitation has been reached. Since the removal of instances does
not increase the amount of inconsistency within a given data set, this is the simplest
case of dynamic FS: a previously obtained optimal feature subset will maintain its
discernibility. Note that exceptional situations exist where an instance has feature
values at the boundaries of the variable range; its removal may affect the result of
those techniques that rely on fuzzy similarity relations [123], causing the overall
evaluation score to change. It may be possible to further reduce |Bk|, since any
removed instances may relax the constraints (i.e., the amount of inconsistency or the
number of uncorrelated objects present in the data set), and fewer features may be
required to maintain full discernibility. Algorithm 6.1.4 describes an example backward
elimination-based procedure to prune the now-redundant features. This pruning
process may also be applied periodically in the other scenarios, since incrementally
refined feature subsets are susceptible to becoming sub-optimal. For example,
Algorithm 6.1.1 avoids further evaluation and adjustment (ignoring potentially
more informative features) so long as the current subset Rk qualifies as a reduct.
1 Rk+1 ← Rk
2 foreach a ∈ Rk do
3   if f^{Xk+1}_{Ak+1}(Rk+1 \ {a}) = f^{Xk+1}_{Ak+1}(Rk) then
4     Rk+1 ← Rk+1 \ {a}
5 return Rk+1
Algorithm 6.1.4: Dynamic FRFS for Instance Removal: Ak = Ak+1, Xk+1 ⊆ Xk
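The backward-elimination pruning of Algorithm 6.1.4 may be sketched as follows, again with a crisp, hypothetical evaluator standing in for f(B); each feature whose removal leaves the evaluation score unchanged is dropped:

```python
from itertools import combinations

def score(data, labels, subset):
    """Crisp discernibility, a stand-in for f(B) in [0, 1]."""
    pairs = [(i, j) for i, j in combinations(range(len(data)), 2)
             if labels[i] != labels[j]]
    return (sum(any(data[i][a] != data[j][a] for a in subset)
                for i, j in pairs) / len(pairs)) if pairs else 1.0

def prune_redundant(reduct, data, labels):
    """Algorithm 6.1.4 (sketch): drop each feature whose removal leaves
    the evaluation score unchanged on the shrunken instance pool."""
    kept = set(reduct)
    base = score(data, labels, kept)
    for a in sorted(reduct):
        if len(kept) > 1 and score(data, labels, kept - {a}) >= base:
            kept = kept - {a}
    return kept
```

For instance, when two features carry duplicated values, the pass removes one of them while the score is preserved.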
6.2 Dynamic HSFS
For many application problems, it is impractical to assume that only one of the
aforementioned scenarios would occur; rather, a combination of them is to be
expected. Although an approach could be derived by combining strategies tailored to
such individual cases, direct combinations may produce sub-optimal solutions, especially
in terms of subset size. Nature-inspired meta-heuristics, as an alternative, may be
extended to handle dynamic FS problems. Integer-valued HSFS [62] in particular is
structurally simple, delivers excellent FS search performance, and is also
able to iteratively reduce the size of an emerging feature subset. This section extends
the existing HSFS method, as detailed in Chapter 3, to develop a modified, dynamic
HSFS algorithm (D-HSFS) that better addresses the challenges of a changing FS
environment.
6.2.1 Algorithm Description
The D-HSFS algorithm uses three parameters: the harmony memory size |H|, the
number of feature selectors |P|, and a harmony memory considering rate δ, which
encourages a feature selector pi to randomly choose from all available features A
(instead of from within its own note domain ℵi). The maximum number of iterations
conventionally employed in HS is not required in this implementation, as the process
is expected to continue operating throughout the whole dynamic process. The
representation of a dynamic feature subset (harmony) is the same as that employed
by standard HSFS, given in Table 3.3. For simplicity, the explicit encoding/decoding
process between a given harmony Hj and its associated feature subset BHj is omitted
in the following explanation.
The overall operation of the proposed D-HSFS algorithm is illustrated in Fig. 6.1
and outlined in Algorithms 6.2.1 and 6.2.2. A generic feature subset evaluator with
a score range of f (B) ∈ [0, 1] is used herein to ensure generality of the explanation.
Figure 6.1: Procedures of D-HSFS
1. Initialise Harmony Memory
Set the initial values for the parameters |H|, |P|, and δ as with the application
of conventional HS. A harmony memory containing |H| randomly generated
subsets is then initialised. This also provides each feature selector a note
domain ℵ of |H| features, which may include identical choices, or nulls (−).
2. Adapt to Change
By default, the internal stochastic mechanisms of HS (especially the δ acti-
vation) are potentially capable of exploring a dynamically changing solution
domain, discovering better solutions over time, without excessive human inter-
vention. To support this, the pool of the originally available features is now
kept up-to-date at all times. Since the pool is the variable domain shared by the
feature selectors, any updates made are automatically propagated to all the se-
lectors. Also, the subset evaluator is updated instantaneously with (Ak+1, Xk+1).
1  pi ∈ P, i = 1 to |P|: group of musicians
2  Hj ∈ H, j = 1 to |H|: harmony memory
3  ℵi = ∪_{j=1..|H|} H^j_i: note domain of pi
4  δ: harmony memory considering rate
5  Hnew: emerging harmony
6  BH: translated feature subset from H
7  f(BH): feature subset evaluator for BH
// Initialise harmony memory
8  for j = 1 to |H| do
9    Hnew = ∅
10   for i = 1 to |P| do
11     random ar, ar ∈ A ∪ {−}
12     Hnew = Hnew ∪ {ar}
13   H = H ∪ {Hnew}
// Iterate
14 while changing do
   // Subroutine for adapting to change
15   adapt(Ak+1, Xk+1)
   // Improvise new harmony
16   Hnew = ∅
17   for i = 1 to |P| do
18     random rδ, 0 ≤ rδ ≤ 1
19     if rδ < δ then
20       random ar, ar ∈ A ∪ {−}
21       Hnew = Hnew ∪ {ar}
22     else
23       random ar, ar ∈ ℵi
24       Hnew = Hnew ∪ {ar}
   // Update harmony memory
25   if f(BHnew) ≥ min(f(BH) | H ∈ H) then
26     H = H ∪ {Hnew}
27     H = H \ argmin_{H∈H} f(BH)
Algorithm 6.2.1: Pseudocode of D-HSFS
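A minimal Python sketch of this loop is given below. The evaluator f and the parameter values are placeholders; the null note "−" is represented by None, and the note domains are derived column-wise from the memory rather than stored separately:

```python
import random

NULL = None  # the "-" (discard) note

class DHSFS:
    """Sketch of the D-HSFS loop: |P| feature selectors, a harmony
    memory of |H| candidate subsets, and the considering rate delta."""

    def __init__(self, features, evaluate, mem_size=10, delta=0.3, seed=0):
        self.rng = random.Random(seed)
        self.features = list(features)   # shared variable domain A
        self.evaluate = evaluate         # f(B), a score in [0, 1]
        self.n_selectors = len(self.features)
        self.delta = delta
        # initialise the memory with random harmonies
        self.memory = [[self.rng.choice(self.features + [NULL])
                        for _ in range(self.n_selectors)]
                       for _ in range(mem_size)]

    def subset(self, harmony):
        """Decode a harmony into the feature subset it represents."""
        return frozenset(a for a in harmony if a is not None)

    def improvise(self):
        new = []
        for i in range(self.n_selectors):
            if self.rng.random() < self.delta:   # delta activation
                new.append(self.rng.choice(self.features + [NULL]))
            else:                                # pick from the note domain
                domain = [h[i] for h in self.memory]
                new.append(self.rng.choice(domain))
        return new

    def step(self):
        """One improvisation-update iteration."""
        new = self.improvise()
        worst = min(self.memory, key=lambda h: self.evaluate(self.subset(h)))
        if self.evaluate(self.subset(new)) >= self.evaluate(self.subset(worst)):
            self.memory.remove(worst)
            self.memory.append(new)

    def best(self):
        return self.subset(max(self.memory,
                               key=lambda h: self.evaluate(self.subset(h))))
```

The `adapt` sub-routine (feature-pool updates and re-evaluation) is deliberately omitted here, mirroring the separation into Algorithm 6.2.2.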
To achieve this, it may be necessary to re-train the internal components of
certain evaluators (e.g., when FRFS is used).
Note that for the event of feature addition, the new features will be explored
in due time via δ activation and introduced into the harmony memory, if they
provide an improvement to the quality of the feature subset. It is intended not
to force any feature selectors to try the new features immediately, as they may
in fact be irrelevant, or less important than the existing features. However,
such a mechanism may be implemented for more time critical applications,
where a given dynamic feature subset must be refined within a limited amount
of time.
Following the update to the variable domains, the harmony memory is then
re-evaluated using the updated evaluator. This is the most crucial step, ensuring
that all the stored fitness values reflect the new changes, and that the harmonies are
appropriately ranked for possible future updates. Fortunately, |H| is typically
a small number and hence, this process is generally not expensive. The total
number of feature selectors |P| may also be expanded or shrunk according
to the current size of the feature pool. After this, HS may resume its normal,
iterative operation and continue to improvise new solutions.
The only exception to the above procedure is regarding the scenario of feature
removal. Before initiating the re-evaluation, the deleted features must be
purged from all of the subsets stored in the harmony memory, and from the
note domains of all affected feature selectors. Algorithm 6.2.2 summarises this
adaptation sub-routine.
1 Update the evaluator f(B) with (Ak+1, Xk+1)
// Invalidate any outdated features
2 if Ak \ Ak+1 ≠ ∅ then
3   foreach Hj ∈ H do
4     Hj = Hj \ (Ak \ Ak+1)
5   for i = 1 to |P| do
6     ℵi = ℵi \ (Ak \ Ak+1)
// Re-adjust musician group size
7 |P|k+1 = max( |P|k · |Ak+1| / |Ak| , max_{Hj∈H} |Hj| )
// Re-evaluate all feature subsets
8 foreach Hj ∈ H do
9   f(BHj)
Algorithm 6.2.2: Sub-routine adapt(Ak+1, Xk+1)
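The invalidation and re-sizing steps of this sub-routine can be sketched as a standalone function. The memory layout (a list of note lists) and the rounding rule for |P| are illustrative assumptions; note domains need no explicit purge here, since they are derived column-wise from the memory:

```python
def adapt_memory(memory, old_features, new_features):
    """Sketch of Algorithm 6.2.2's invalidation step: notes referring to
    deleted features become null (None), and the selector count |P| is
    re-scaled with the feature pool, bounded below by the largest
    surviving subset."""
    removed = set(old_features) - set(new_features)
    purged = [[None if a in removed else a for a in h] for h in memory]
    n_selectors = max(
        round(len(purged[0]) * len(new_features) / len(old_features)),
        max(sum(a is not None for a in h) for h in purged))
    return purged, n_selectors
```

Re-evaluating every stored subset with the updated evaluator (lines 8-9) would then follow, which is cheap given that |H| is small.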
3. Improvise New Subset
Each pi nominates a feature a ∈ ℵi, and all such nominated features form a
new harmony Hnew. The corresponding new feature subset BHnew, decoded by
following the scheme already illustrated in Table 3.3, then has its evaluation score
computed by f (BHnew).
4. Update Subset Storage
If the newly obtained subset achieves a higher evaluation score than that of
the worst subset in the harmony memory, or it has an equal evaluation but
is of a smaller size, then this new subset replaces the existing worst subset.
Otherwise, it is discarded.
5. Iterate
This adaptation-improvisation-update process continues to run so long as the
data set is still in a dynamic state. The best harmony in the harmony memory
for a given state of the changing data set (Ak, Xk): H = argmaxH∈H f (BH) and
its associated feature subset BH is therefore dynamically, and continuously
refined.
Example 6.2.1.
Suppose that the emerging harmony is Hnew = ⟨a1, a4, a3, a3, a7, a−⟩, and that the
subset it represents, BHnew = {a1, a3, a4, a7}, has an evaluation score of f (BHnew) = 0.6,
while the existing worst subset Hworst ∈ H is ⟨a1, a2, a2, a3, a6, a−⟩ with f (BHworst) =
0.5, where BHworst = {a1, a2, a3, a6}. Then, the updated harmony memory is H =
H ∪ {Hnew} \ {Hworst}. If, for instance, a new feature a7 is introduced to the harmony
memory via this update for future combinations, then its associated feature selector
p5 also adds this new feature to its note domain: ℵ5 = ℵ5 ∪ {a7}. At the beginning
of the next iteration, assuming that features a1 and a3 are removed, the new harmony
obtained in the last iteration will need to be modified from ⟨a1, a4, a3, a3, a7, a−⟩ to
⟨a−, a4, a−, a−, a7, a−⟩, and the same invalidation process is applied to all H ∈ H.
The evaluation scores of the respective feature subsets are also computed again
before improvising a new solution.
6.2.2 Complexity Analysis
Following the style of analysis adopted for the original HSFS [62], the proposed D-HSFS
method requires O(|P| · |H|) operations to randomly fill the subset storage, where
|P| is the number of feature selectors, and |H| is the size of the harmony memory.
The continuous improvisation process, between two dynamic states (Xk, Ak) and
(Xk+1, Ak+1), is of the order O(|P| · (gk+1 − gk)) · Oe, where gk and gk+1 denote the
numbers of iterations at the respective states, and Oe signifies the complexity of a single
feature subset evaluation for the employed feature subset evaluator.
The adaptation process is of the order:

Ot + O(|P| · |H| · |Ak \ Ak+1|) + |H| · Oe ,  if Ak \ Ak+1 ≠ ∅
Ot + |H| · Oe ,                               otherwise          (6.2)
where O(|P| · |H| · |Ak \Ak+1|) reflects the cost of invalidating existing feature subsets
and note domains, in the event of a feature removal. Ot denotes the cost of re-training
the feature subset evaluator using (Xk+1, Ak+1). In typical cases, |H| is a small value
5≤ |H| ≤ 20 [84], and |P| is bounded by the total number of features. Thus, both
the improvisation and adaptation costs are reasonably low.
6.3 Adaptive Feature Subset Ensemble
For a given data set of significant complexity, a family B of high-quality (though not
always equally optimal) feature subsets may be discovered by the use of a stochastic
search algorithm. Any such feature subset B ∈ B may be used to train a subsequent classifier
learner, and a diverse FSE may be constructed, which generally has a better prediction
accuracy than that of a single classifier. In a dynamically changing environment, a
collection of such feature subsets Bk may be adaptively refined in response to the
current state of the data set (Ak, Xk). Similarly, an FSE built upon Bk also needs to be
updated accordingly. The resulting process leads to the establishment of an adaptive
FSE (A-FSE).
A generic framework for such an A-FSE is illustrated in Fig. 6.2, where each
column of components forms a dynamic FS subsystem l, l ∈ {1, . . . , |Bk|}, containing
an adaptive classifier C^l_k that is built using a dynamic feature subset B^l_k ∈ Bk. Jointly,
the |Bk| subsystems construct an adaptive ensemble of classifiers Ak = {C^l_k | l =
1, · · · , |Bk|}, in which different components, including subset evaluators, subset search
algorithms, and base classifier learners, may be employed independently. Generally
speaking, a system implementing such a framework will naturally possess the diversity
inherent in its various components. However, each type of component may be
implemented using the same algorithm in order to achieve a higher efficiency and a
lower complexity for the overall system. Any changes to the data are propagated, in
an iterative fashion, throughout the subsystems, down to the end ensemble.
Figure 6.2: Generic framework for A-FSE
The following presents an implementation of the A-FSE framework using the
proposed D-HSFS algorithm, which supports the use of any feature subset evaluation
method, such as CFS, PCFS, or FRFS. Despite being adaptive, A-FSE is, in principle,
similar to a standard FSE [195, 197, 295], and any ensemble aggregation method,
such as majority voting [249], may be employed. The steps of the implementation are
outlined in Algorithm 6.3.1.
1 while changing do
2   for l = 1 to |B| do
    // Subsystem l
3     B^l_{k+1} = D-HSFS_l(Ak+1, Xk+1)
4     if B^l_{k+1} ≠ B^l_k ∨ Xk+1 ≠ Xk then
5       Re-train C^l_k with (B^l_{k+1}, Xk+1)
  // Aggregate ensemble predictions
6 majority vote(B, xnew)
Algorithm 6.3.1: A-FSE implemented using D-HSFS
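A sketch of this loop in Python is shown below. Here `selectors` and `train` are assumed user-supplied callables (one dynamic FS instance per subsystem, and a classifier learner, respectively); re-training is triggered only when a member's subset or the training data changes:

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate one prediction per ensemble member into a single label."""
    return Counter(predictions).most_common(1)[0][0]

class AFSE:
    """Sketch of Algorithm 6.3.1: one dynamic FS subsystem per
    ensemble member, with lazily re-trained base classifiers."""

    def __init__(self, selectors, train):
        self.selectors = selectors        # l-th call yields subset B_l
        self.train = train                # train(subset, data) -> classifier
        self.subsets = [None] * len(selectors)
        self.models = [None] * len(selectors)

    def adapt(self, data):
        """Propagate a data change through every subsystem."""
        for l, select in enumerate(self.selectors):
            subset = select(data)
            if subset != self.subsets[l] or self.models[l] is None:
                self.subsets[l] = subset
                self.models[l] = self.train(subset, data)

    def classify(self, x):
        return majority_vote(m(x) for m in self.models)
```

The on-demand re-training variant discussed next would simply defer `adapt` until a test object arrives.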
It is important to note that an instance of D-HSFS is necessary for each dynamic
FS subsystem, and the feature subset refinement is a continuous process independent
of the subsequent classifiers. Although the learners should be re-trained each time
the associated subsets are modified, or the training objects are changed, the
efficiency of the ensemble may be further improved by on-demand re-training, i.e.,
only proceeding when a new test object is presented for classification.
The complexity of an A-FSE implemented using D-HSFS is therefore O_{D-HSFS} · |B| +
K · |B| · O_C, where O_{D-HSFS} is the complexity of a D-HSFS component, as analysed
in Section 6.2.2, K is the total number of potential state changes in a given dynamic
system, and O_C is the cost related to a single classifier employed by the ensemble. Of
course, if multiple, different classification algorithms are involved in the construction
of the A-FSE, then the classification-related costs are the sum of those of the individual
base classifiers: K · Σ_{l=1}^{|B|} O_{C_l}.
6.4 Experimentation and Discussion
The present investigation employs five real-valued UCI [78] benchmark data sets,
for the purpose of simulating a dynamically changing FS environment, which is
suitable for the demonstration of the efficacy of the proposed approach. Table 6.1
provides a summary of these data sets, all of which are of high dimensionality and
contain a large number of objects, thereby presenting significant challenges to FS.
Two commonly used subset-based feature evaluators, CFS and PCFS, are used in
the experiments. CFS is a lightweight method, which addresses the problem
of FS through a correlation-based analysis, identifying features that are highly
correlated with the class, yet uncorrelated with each other [93]. PCFS is an FS
approach that attempts to identify a group of features yielding the least inconsistency
[52], thereby removing irrelevant features in the process.
Table 6.1: Summary of the data sets
Data set  Features  Instances  Classes  C4.5   PART   NB
arrhy     280       452        16       65.97  66.36  61.40
handw     257       1593       10       75.74  78.09  86.21
multi     650       2000       10       94.54  94.95  95.30
ozone     73        2534       2        92.70  92.50  67.66
secom     591       1567       2        89.56  91.57  30.04
6.4.1 Results for Basic Dynamic FS Scenarios
The four basic dynamic FS events are individually tested here, in order to validate
the efficacy of the proposed A-FSE method and of the D-HSFS algorithm. In the
experimentation, features or objects are added or removed (according to the scenario)
randomly in batches. After a change has been made, the D-HSFS algorithm adapts
to the new data, and improves the previously selected feature subset Bk, in order
to produce a new candidate subset Bk+1. A collection of 20 candidate subsets is
simultaneously improved (each using a separate instance of D-HSFS) with respect
to the same training data, and the prediction accuracy of the resultant A-FSE (of size
20) is examined using dedicated test data (10% held out from the original data set).
The classification algorithm employed here is the decision tree-based C4.5 algorithm
[264], which first constructs a full decision tree using all available features, and then
performs heuristic pruning based on the statistical importance of the features.
6.4.1.1 Feature Addition
Table 6.2 details the results of the proposed A-FSE approach for the event of feature
addition, simulated using multi and secom, the two available data sets with the
largest total numbers of features. It is evident that the evaluation scores steadily
improve as new features are being added to the dynamic data sets, which agrees with
the analysis made in Section 6.1.1. The addition of features generally refines the
knowledge of the underlying problem, and the dynamic FS component supporting
the classifiers: i.e., the proposed D-HSFS algorithm, also successfully improves the
qualities of the candidate subsets.
Note that for the multi data set, a feature subset of full evaluation score (for
PCFS) is identified at the beginning (having just 172 features). Since no further
improvement to the score can be made, D-HSFS optimises the candidate solutions
via size reduction, by substituting in possibly more informative features that are
introduced during the feature addition events. The feature subsets found using
PCFS are much smaller than those identified with CFS for multi (with an averaged
size of 35.1 vs. 91.3), while CFS achieves more significant size reduction for
secom (averaged size 30.7 vs. 49.9). The A-FSE built using feature subsets suggested by
PCFS achieves better accuracies for the multi data set (averaged accuracy 97.23%
vs. 93.91%). It is also marginally more stable for secom (for which the CFS-based
A-FSE delivers the same level of accuracy).
Table 6.2: Feature addition results, showing the sizes and evaluation scores of
the dynamic feature subsets maintained by D-HSFS, and the corresponding A-FSE
accuracies, using the feature subset evaluators CFS and PCFS. Shaded cells indicate
statistically better results (paired t-test)

                multi                                           secom
        CFS                PCFS                        CFS                PCFS
|Ak|  Score Size  C4.5%  Score Size C4.5%    |Ak|  Score Size C4.5%  Score Size C4.5%
172   0.811 76.8  94.00  1.000 40.2 97.00    155   0.004 34.0 93.63  0.948 45.3 92.99
210   0.816 81.2  92.50  1.000 38.3 98.00    198   0.006 28.8 93.63  0.948 38.7 93.63
250   0.820 85.5  94.50  1.000 36.8 94.50    240   0.007 27.9 92.99  0.951 45.2 93.63
289   0.834 92.4  93.50  1.000 34.7 98.50    283   0.009 30.5 94.27  0.961 51.0 93.63
342   0.842 94.9  94.00  1.000 33.7 97.00    327   0.011 30.6 94.27  0.964 51.7 94.27
394   0.850 97.5  93.50  1.000 32.6 97.00    356   0.013 29.9 93.63  0.968 53.7 93.63
453   0.862 98.4  95.00  1.000 32.1 96.50    395   0.014 29.2 94.27  0.970 52.3 92.99
504   0.874 99.8  94.00  1.000 31.5 97.50    444   0.015 30.6 93.63  0.974 51.6 93.63
588   0.881 101.0 94.50  1.000 30.7 97.50    536   0.016 30.2 94.27  0.975 53.7 93.63
637   0.892 103.4 96.50  1.000 30.3 97.50    579   0.016 29.7 93.63  0.979 63.9 93.63
Mean  0.843 91.3  93.91  1.000 35.1 97.23    Mean  0.010 30.7 93.80  0.961 49.9 93.57
S.D.  0.032 10.4  1.39   0.000 4.8  1.10     S.D.  0.005 2.4  0.41   0.014 6.8  0.34
6.4.1.2 Feature Removal
Following the discussion made in Section 6.1.2 regarding feature removal, the
effectiveness of the proposed approach for this particular scenario is reported in
Table 6.3. The events are simulated by randomly removing batches of features from
the data sets, which are initialised with full sets of features. The feature subset
evaluation scores decrease, as expected, as features are removed from the
system, except for PCFS, which again maintains a constant evaluation score of 1.000
throughout the whole experimentation (for multi). This may indicate that more
robust features have been identified by PCFS for this particular data set, leading to
candidate solutions that are more resilient to the changes.
The overall A-FSE accuracies are reasonably well preserved throughout the series
of feature removals, and have shown improvements for the difficult data set secom.
It is worth noting that, although having very similar features, the terminating states
of the previous set of experiments (with almost all features added) yield better
quality feature subsets, than those obtained here during the initial stages (before
any feature has been removed). A possible explanation is that D-HSFS previously
had a far longer time (search iterations) to perform optimisation, and the gradual
discovery or exploration of new features may also help to form compact and good
Table 6.3: Feature removal results, showing the sizes and evaluation scores of the dynamic feature subsets maintained by D-HSFS, and the corresponding A-FSE accuracies, using feature subset evaluators CFS and PCFS. Shaded cells indicate statistically better results (using paired t-test).
multi secom
CFS PCFS CFS PCFS
|Ak| Score Size C4.5% Score Size C4.5% |Ak| Score Size C4.5% Score Size C4.5%
604 0.854 340.0 96.50 1.000 297.7 98.00 543 0.002 269.6 92.99 0.979 275.4 92.99
542 0.853 303.4 97.00 1.000 264.3 98.50 500 0.002 244.2 93.63 0.979 250.4 92.99
490 0.852 275.7 97.00 1.000 236.0 97.50 456 0.002 220.8 92.99 0.979 229.0 92.99
434 0.851 245.6 97.00 1.000 206.9 98.00 410 0.002 195.8 92.99 0.969 207.5 92.99
379 0.849 215.1 97.00 1.000 179.7 98.50 356 0.002 167.9 92.99 0.967 183.1 92.99
312 0.836 178.6 98.50 1.000 144.7 98.00 287 0.002 133.2 92.36 0.963 157.3 91.72
283 0.829 163.9 97.50 1.000 130.3 98.00 251 0.001 112.9 93.63 0.944 136.4 92.99
252 0.827 142.0 97.50 1.000 115.4 98.00 221 0.001 97.6 93.63 0.943 121.1 93.63
224 0.824 128.4 97.50 1.000 101.4 97.50 186 0.001 80.9 94.27 0.940 102.7 93.63
183 0.816 107.8 96.00 1.000 81.5 97.50 151 0.001 64.6 94.27 0.940 85.9 93.63
130 0.784 94.1 93.00 1.000 56.0 99.00 118 0.002 49.6 94.27 0.938 66.9 94.27
Mean 0.835 213.5 96.71 1.000 178.5 98.08 Mean 0.002 161.6 93.37 0.960 176.9 93.21
S.D. 0.021 92.6 1.36 0.000 88.5 0.47 S.D. 0.000 84.7 0.69 0.018 78.0 0.63
quality feature subsets. This may inspire further adjustments to the original HSFS
algorithm [62], allowing it to better strategise its exploration of the solution space.
6.4.1.3 Instance Addition
Table 6.4 lists the results of the proposed A-FSE approach for the events involving
instance addition, where the corresponding theoretical analysis can be found in
Section 6.1.3. This set of experiments is simulated using multi and ozone, which
have the two largest total numbers of instances. Similar to the scenario of feature
addition, initially a small collection of instances is available for training, and new
samples are introduced to the system over time. Since the total number of features
remains unchanged in this scenario for both tested data sets, there is far less
variation in the sizes of the selected feature subsets (when compared to the previous
feature-based dynamic events). The addition of instances gradually expands the underlying
concept embedded in the data sets, and features selected at early stages need to
be altered in order to continue to fully capture the constantly refined information.
Although considerable numbers (±100) of new objects are added per change,
the accuracies of the A-FSEs are well maintained. This shows the effectiveness of the
D-HSFS components optimising the underlying dynamic feature subsets. For both of
the tested data sets, better ensemble performance is achieved by PCFS-based A-FSE,
with higher averaged accuracy and lower mean feature subset size.
Table 6.4: Instance addition results, showing the sizes and evaluation scores of the dynamic feature subsets maintained by D-HSFS, and the corresponding A-FSE accuracies, using feature subset evaluators CFS and PCFS. Shaded cells indicate statistically better results (using paired t-test).
multi ozone
CFS PCFS CFS PCFS
|Xk| Score Size C4.5% Score Size C4.5% |Xk| Score Size C4.5% Score Size C4.5%
360 0.635 371.0 93.50 1.000 331.1 95.00 456 0.185 31.7 90.16 0.993 22.8 91.73
446 0.771 371.2 94.00 1.000 320.7 96.50 626 0.182 28.2 92.13 0.990 21.5 92.52
554 0.826 377.0 93.50 1.000 314.9 94.00 791 0.169 28.7 92.13 0.999 23.2 91.34
654 0.833 373.5 94.00 1.000 312.4 96.00 950 0.146 28.7 91.73 0.999 21.6 92.13
790 0.846 367.3 94.50 1.000 308.6 96.50 1107 0.143 27.5 91.73 1.000 23.7 92.13
851 0.848 369.3 93.00 1.000 307.6 96.00 1168 0.138 27.3 90.55 0.999 21.6 92.91
991 0.850 364.6 94.50 1.000 306.0 96.50 1328 0.130 27.5 91.73 0.998 24.2 92.91
1171 0.857 361.8 96.50 1.000 305.3 95.50 1470 0.126 26.4 91.73 0.999 23.1 91.73
1273 0.856 361.9 95.50 1.000 303.4 96.00 1542 0.117 25.7 92.52 0.998 23.6 92.91
1416 0.852 357.8 95.50 1.000 302.6 97.00 1762 0.114 25.7 90.94 1.000 27.8 92.91
1606 0.857 354.8 94.50 1.000 301.8 98.00 2105 0.113 25.6 91.34 1.000 26.7 92.52
1737 0.860 354.8 95.50 1.000 301.2 97.00 2194 0.108 25.7 91.73 1.000 24.8 92.91
Mean 0.824 365.4 94.54 1.000 309.6 96.17 Mean 0.139 27.4 91.54 0.998 23.7 92.39
S.D. 0.064 7.3 1.03 0.000 8.9 1.03 S.D. 0.027 1.8 0.68 0.003 2.0 0.57
6.4.1.4 Instance Removal
The results of the proposed A-FSE approach for the situation involving instance
removal are presented in Table 6.5, for the multi and ozone data sets. Most of the
observations are similar to those in the earlier scenarios, such as the slightly better
overall performance achieved by PCFS-based ensembles, especially for multi (96.36
vs. 93.68). Note that the relaxed constraints due to instance removals (see Section 6.1.4)
do not necessarily correlate to better evaluation scores, especially when judged by
CFS. This is reflected by the completely opposite trends for the two data sets, where
the evaluation scores are decreasing as instances are being removed for multi, and
the scores instead improve over time for ozone. This may be explained by the fact
that the correlation analysis done by CFS is merely with respect to the current state of
the data, and two evaluation scores are not directly comparable when their underlying
training instances are different. It may be beneficial, especially for algorithms such
as D-HSFS that optimise solutions continuously, to devise a (dynamic) feature subset
evaluation method that provides an ordered quality metric, so that solution qualities
at different dynamic states may be directly compared.
Table 6.5: Instance removal results, showing the sizes and evaluation scores of the dynamic feature subsets maintained by D-HSFS, and the corresponding A-FSE accuracies, using feature subset evaluators CFS and PCFS. Shaded cells indicate statistically better results (using paired t-test).
multi ozone
CFS PCFS CFS PCFS
|Xk| Score Size C4.5% Score Size C4.5% |Xk| Score Size C4.5% Score Size C4.5%
1658 0.855 361.8 96.00 1.000 320.4 98.00 2121 0.110 27.6 91.34 1.000 24.0 93.31
1483 0.855 360.0 95.00 1.000 315.6 95.50 1956 0.115 26.7 92.13 0.999 23.8 90.94
1314 0.857 353.7 94.50 1.000 312.5 98.00 1710 0.123 26.4 92.13 0.998 23.7 92.13
1135 0.854 357.5 95.50 1.000 309.9 97.50 1538 0.132 27.3 93.70 0.999 20.7 92.13
979 0.842 367.0 96.00 1.000 307.8 96.50 1350 0.124 28.0 91.34 0.999 21.9 90.55
795 0.834 367.8 94.00 1.000 306.3 95.00 1145 0.124 29.5 90.55 0.998 25.5 90.55
650 0.832 375.4 92.00 1.000 304.2 95.50 947 0.129 27.6 90.94 0.998 24.3 90.94
562 0.825 381.3 92.50 1.000 303.9 95.50 787 0.144 28.2 90.55 0.999 22.5 91.34
481 0.783 373.4 91.50 1.000 303.6 97.50 685 0.136 28.1 90.16 0.997 23.6 90.94
359 0.768 373.7 87.50 1.000 301.8 93.50 455 0.159 26.3 88.98 1.000 18.7 90.94
Mean 0.832 367.4 93.68 1.000 310.6 96.36 Mean 0.128 27.8 91.23 0.999 23.1 91.37
S.D. 0.030 8.4 2.63 0.000 8.7 1.47 S.D. 0.015 1.2 1.23 0.001 2.1 0.83
6.4.2 Results for Combined Dynamic FS Scenarios
In this set of experiments, a set of randomly chosen features or instances may be
added or removed in any combination. After a change has been made, the previously
selected feature subset Bk is first evaluated against the modified data with its quality
recorded. The D-HSFS algorithm then adapts to the new data, and continues to
refine the solution, so that a new candidate subset Bk+1 is produced. A collection of
20 candidate subsets is simultaneously improved (each using a separate instance
of D-HSFS) with respect to the same training data, and the resultant A-FSE of size 20 is
examined for the subsequent prediction accuracy analysis.
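The per-change cycle described above can be sketched as follows. This is a hypothetical outline, not the thesis implementation: `evaluate` stands in for the subset evaluator (CFS or PCFS) and `refine` for a D-HSFS search instance; both names and the toy stand-ins below are assumptions for illustration.

```python
# One adaptation step: record each B_k's quality on the changed data,
# then let its own D-HSFS instance refine it into B_k+1.
def dynamic_step(members, changed_data, evaluate, refine):
    adapted = []
    for B_k in members:
        quality_before = evaluate(B_k, changed_data)   # B_k judged on new data
        B_k1 = refine(B_k, changed_data)               # adapted candidate subset
        adapted.append((B_k1, quality_before, evaluate(B_k1, changed_data)))
    return adapted

# Toy stand-ins: score = fraction of currently relevant features retained;
# refinement greedily adds one missing feature.
evaluate = lambda B, data: len(B & data) / len(data)
refine   = lambda B, data: B | {min(data - B)} if data - B else B

members = [{"a1"}, {"a2", "a3"}]
changed = {"a1", "a2", "a3", "a4"}
print(sorted(dynamic_step(members, changed, evaluate, refine)[0][0]))  # ['a1', 'a2']
```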
The base classification algorithms employed for constructing the ensembles in
the experiments include: 1) the previously used, decision tree-based C4.5 algorithm
[264]; 2) PART [33], a partial decision tree algorithm, which does not need to
perform global pruning like C4.5 in order to produce appropriate rules; and 3)
the probabilistic Bayesian classifier with strong (naïve) independence assumptions
(NB) [132]. FS is particularly beneficial for C4.5 and PART since the reduced feature
subsets remove much of their initial training overheads.
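Once the 20 member classifiers are trained, the A-FSE combines their predictions. The aggregation rule below (simple majority voting) is an assumption for illustration; the thesis's exact combination scheme may differ.

```python
from collections import Counter

# A minimal sketch of A-FSE label aggregation by majority vote. Each member
# classifier is trained on one of the dynamically refined feature subsets.
def afse_predict(member_predictions):
    """Return the most common label among the ensemble members' predictions."""
    return Counter(member_predictions).most_common(1)[0][0]

print(afse_predict(["yes", "no", "yes", "yes", "no"]))  # yes
```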
6.4.2.1 Quality of Selected Subsets
Fig. 6.3 illustrates the dynamic FS results, detailing the evaluation scores (dotted
lines) and the sizes (solid lines), both averaged over 20 adaptively refined subsets
throughout a simulated dynamic process. The shaded columns indicate the per-
formance of Bk with regards to (Ak+1, Xk+1), while the non-highlighted columns
show the quality of the adaptively refined subsets Bk+1. Note that the selection
processes are carried out independently for each of the two subset evaluators, while
the underlying dynamic training data sets are the same to facilitate comparison.
It can be observed from the figure that the averaged f (Bk), Bk ∈ Bk, tends to
decrease by a large margin when features are deleted and/or a large number of
objects are added. The two most obvious reductions are at |Ak|= 101 (arrhy) and
|Ak| = 97 (handw), where 43 and 41 features are removed, respectively. This confirms
the theoretical assumptions made in Section 6.1: feature deletion and object addition
are the two major causes of rapid changes to the underlying concept, which are also
more challenging for a given dynamic FS algorithm to recover from. The results also
demonstrate that in response to the changes, the proposed D-HSFS algorithm can
successfully locate alternative informative features, thereby dynamically restoring or
improving the quality of the candidate subsets.
According to this systematic experimentation, a larger size variation can be
observed for the subsets found via CFS. The most noticeable performance difference
between the use of the two different feature evaluators is for the multi data set.
Over 20 additional features are consistently selected by CFS when attempting to adapt
to the changed data, whilst considerably fewer adjustments (< 10) are made by PCFS.
This indicates that the PCFS evaluator can identify more resilient and robust features.
For this particular data set, CFS also selects larger subsets in general, when compared
to those obtained by PCFS. Given that a significant amount of information is updated,
the sizes of the resultant subsets are maintained at a reasonable level throughout
the whole simulation. This is largely attributable to the beneficial characteristics of the
HS algorithm itself, especially its ability to escape from locally optimal solutions.
6.4.2.2 Ensemble Accuracy
Before running the dynamic simulation, 10% of the original objects are again retained
for testing. The classification results of the ensembles trained using the adaptively
refined feature subsets are summarised in Table 6.6. The averaged accuracies of the
Figure 6.3: Results of dynamic FS, showing the averaged sizes and evaluation scores over 20 dynamically selected feature subsets for each data set, plotted against the number of features and that of objects. Each of the shaded columns and its immediate right neighbour (in white) correspond to one dynamic FS event, indicating the quality of a previously selected subset and that of a dynamically refined feature subset, respectively.
individual ensemble members are also given, indicating the quality of the underlying
dynamic subsets. Additionally, the performance of classifiers trained using the full
set of available features is also provided, signifying a base-line accuracy of these
dynamic data sets.
Table 6.6: A-FSE accuracy comparison; bold figures indicate the best result per classifier, shaded cells signify overall best results.
A-FSE by CFS Single by CFS A-FSE by PCFS Single by PCFS Base Accuracy
C4.5 PART NB C4.5 PART NB C4.5 PART NB C4.5 PART NB C4.5 PART NB
arrhy 57.85 60.52 57.22 57.47 55.84 50.27 59.54 61.71 60.73 55.71 55.30 50.54 56.66 55.43 52.31
handw 67.08 74.98 77.07 62.93 65.74 75.87 78.19 77.94 72.18 58.05 59.49 64.26 62.85 65.66 77.30
multi 94.90 96.05 96.23 92.91 93.34 94.81 97.66 97.52 95.21 87.28 86.22 90.44 93.31 93.38 94.00
ozone 91.47 91.30 76.87 91.44 90.87 67.35 91.59 91.62 73.98 91.19 91.22 66.44 91.24 91.04 66.24
secom 93.55 93.63 71.44 89.97 92.12 61.58 93.43 93.63 54.47 89.89 91.76 60.63 89.45 92.99 59.00
The results show that the classification accuracies of the resultant A-FSEs are
improved for almost all of the data sets. The differences are most noticeable for the
handw data set, where the A-FSE built upon subsets selected by the PCFS evaluator
outperforms the base-line by over 15% (for C4.5). However, the individual
ensemble members are a lot less accurate, achieving an averaged accuracy of 58.05%
(for C4.5). This confirms the benefit of employing an ensemble-based approach,
which for this instance, has substantially improved (by over 20%) over individually
deployed classifiers. Note that for the secom data set, although both C4.5 and
PART-based ensembles deliver very similar classification accuracies, the NB-based
ensemble performs almost 17% better in accuracy with the CFS evaluator. This
reflects that the underlying feature subsets selected by the two evaluators are rather
different. Generally speaking, the PART classification algorithm works best with the
subsets selected by PCFS, delivering the highest overall accuracy for 3 of the 5 data sets.
The C4.5 classifier also performs very well together with PCFS. In contrast, the NB-based
ensembles are improved the most by subsets found using CFS.
6.5 Summary
This chapter has presented a dynamic extension to the HSFS algorithm, in an attempt
to address the challenges posed by dynamically changing data. The stochastic
properties of HSFS enable multiple high quality dynamic feature subsets to be
maintained. Having a collection of such subsets allows an adaptive feature subset-
based classifier ensemble to be constructed. According to the experimental results,
the proposed D-HSFS algorithm successfully adapts to the dynamically changing
data. Importantly, the A-FSE constructed on the basis of dynamically refined subsets
also demonstrates improved classification performance, when compared to that of
single algorithm-based methods, as well as the base-line accuracies achieved using
the full (unreduced) sets of available features.
Dynamic FS has attracted much attention lately due to its strong links to various
real-world problems [58, 99, 268, 289], and its principle of adapting to a changing
underlying knowledge is also intuitively appealing. Therefore, dynamic FS and the
methods described in this chapter have much potential worthy of further research.
Sections 9.2.2.2 and 9.2.2.4 will discuss the possible future directions in theory and
application, respectively.
Chapter 7
HSFS for Hybrid Rule Induction
THE most common approach to developing expressive and human readable representations
of knowledge is the use of production (if-then) rules [113]. A typical
technique for addressing the inefficiency of fuzzy rule induction [42, 88, 262] due
to high data dimensionality is to employ a pre-processing mechanism such as FS.
However, this additional step, regardless of the method employed, adds overhead
to the overall learning process. In addition, much like the techniques presented so
far in this thesis, the FS step is often carried out in isolation of rule induction, i.e.,
filter-based [164]. This separation may prove costly for the subsequent rule induction
phase, since the selected subset of features may not be those that are the most useful
to derive rules from. This has been the motivation behind wrapper-based approaches
to FS, with additional complexity introduced by performing rule induction repeatedly
in the search for the optimal set of features. Clearly, a closer integration of FS and
rule induction is desirable.
QuickRules [119] is a recently proposed hybrid fuzzy-rough rule induction algo-
rithm. It uses a greedy HC strategy as originally employed in QuickReduct [126] (see
Section 2.1.1.2) to search for a feature subset. This helps maintain the discernibility
of the full set of features. Meanwhile, fuzzy rules are generated on the fly in an
attempt to construct a rule base that provides complete coverage of the training data.
At the end of the process, each individual rule within the rule base will contain a more
compact feature subset. However, as discussed throughout this thesis, deterministic
HC techniques such as QuickReduct may lead to the discovery of sub-optimal feature
subsets [164], both in terms of the evaluation score and the subset size. The quality
of the resultant fuzzy-rough rule base, derived using such a potentially sub-optimal
feature subset, may also be sub-optimal.
This chapter describes a hybrid fuzzy-rough rule induction approach via the use
of HSFS. Similar to that for FS problems, the primary motivation for using stochastic
algorithms in the discovery of high level prediction rules is that a global search may
be performed. The resultant approach may also cope better with feature interaction
than greedy rule induction algorithms. The HS-based method proposed herein,
termed HarmonyRules, begins with an initial set of rules with randomly generated
underlying feature subsets, which are iteratively improved during the search process.
The ultimate aim is to identify an optimised set of rules based on a number of essential
and preferable performance criteria. The essential requirements include complete
coverage of the training data, and full preservation of the semantics of
the original features. Additional evaluations in terms of minimising the size of the
rule base, and cardinality of the selected feature subsets further guide the search
process to converge to a concise, meaningful, and accurate set of rules.
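These criteria can be combined into a single fitness value for HS to optimise. The function below is a hypothetical illustration, not taken from the thesis: complete coverage is treated as a hard requirement, while rule-base size and feature subset cardinality act as soft penalties; the weights and functional forms are assumptions.

```python
# Hypothetical multi-criterion fitness for a candidate rule base.
def rule_base_fitness(coverage, n_rules, n_features, w_rules=0.5, w_feats=0.5):
    if coverage < 1.0:
        return coverage  # essential criterion unmet: reward coverage progress only
    # preferable criteria: prefer fewer rules and smaller antecedent subsets
    return 1.0 + w_rules / (1 + n_rules) + w_feats / (1 + n_features)

# A compact rule base beats a bloated one once both fully cover the data:
print(rule_base_fitness(1.0, 5, 3) > rule_base_fitness(1.0, 10, 6))  # True
```

Structuring the fitness this way keeps any fully covering rule base strictly better than any partial one, so the search first satisfies the essential criteria and only then trades off conciseness.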
The remaining sections of this chapter are structured as follows. The theoretical
background of rule induction is introduced in Section 7.1. Section 7.2 describes the
proposed HSFS-based hybrid fuzzy-rough rule induction method, where detailed
pseudocode is provided in order to explain the learning procedures. Experimental
studies that demonstrate the potential of the approach are presented in Section
7.3, including an in-depth comparison against QuickRules in terms of feature subset
cardinality. Finally, an appraisal of the present approach is given in Section 7.4.
7.1 Background of Rule Induction
This section briefly describes the crisp rough rule induction method, and covers the
essential theoretical concepts exploited by both rough set based [106], and fuzzy-
rough set based [226] rule induction approaches. The QuickRules algorithm, which
the proposed work is aiming to improve upon, is also explained for completeness.
7.1.1 Crisp Rough Rule Induction
In crisp RST, rules can be generated through the use of so-called minimal complexes
[88]. Let D be a concept, t an attribute-value pair (a, v), and T a set of such attribute-
value pairs. A block of t, denoted by [t], is a set of objects for which attribute a has
value v. A concept D depends on a set of attribute-value pairs T, if and only if:

∅ ≠ [T] = ⋂_{t∈T} [t] ⊆ D    (7.1)

T is a minimal complex of D if and only if D depends on T, and no proper subset T′ ⊂ T
exists such that D depends on T′.
Consider a simple example data set shown in Table 7.1, which consists of 14
objects {x1, · · · , x14}, four conditional features {a1, a2, a3, a4}, and a decision feature
z. A list of all possible blocks [t1], · · · , [t|T∗|] that may be derived from the available
attribute-value pairs T∗ = {t1, · · · , t|T∗|} is given in Table 7.2.
Table 7.1: Example data set for rough set rule induction
Outlook (a1) Temperature (a2) Humidity (a3) Wind (a4) Golf (z)
x1 sunny hot high weak no
x2 sunny hot high strong no
x3 overcast hot high weak yes
x4 rain mild high weak yes
x5 rain cool normal weak yes
x6 rain cool normal strong no
x7 overcast cool normal strong yes
x8 sunny mild high weak no
x9 sunny cool normal weak yes
x10 rain mild normal weak yes
x11 sunny mild normal strong yes
x12 overcast mild high strong yes
x13 overcast hot normal weak yes
x14 rain mild high strong no
For a given concept, say Dz=no = {x1, x2, x6, x8, x14}, there exist several sets of
attribute-value pairs T that Dz=no depends on. For example:

T = {t1, t4, t7} = {(outlook, sunny), (temperature, hot), (humidity, high)}    (7.2)
which involves three features. The intersection of their associated blocks can be
computed using Eqn. 7.1:

∅ ≠ [T] = [t1] ∩ [t4] ∩ [t7] = {x1, x2} ⊂ Dz=no    (7.3)

In this case, T is not a minimal complex of Dz=no since there exists a subset

T′ = {t1, t7} = {(outlook, sunny), (humidity, high)} ⊂ T    (7.4)
Table 7.2: Blocks derived from the attribute-value pairs of the example data set
Block (a, v) Pair Objects
[t1] (outlook, sunny) {x1, x2, x8, x9, x11}
[t2] (outlook, overcast) {x3, x7, x12, x13}
[t3] (outlook, rain) {x4, x5, x6, x10, x14}
[t4] (temperature, hot) {x1, x2, x3, x13}
[t5] (temperature, mild) {x4, x8, x10, x11, x12, x14}
[t6] (temperature, cool) {x5, x6, x7, x9}
[t7] (humidity, high) {x1, x2, x3, x4, x8, x12, x14}
[t8] (humidity, normal) {x5, x6, x7, x9, x10, x11, x13}
[t9] (wind, weak) {x1, x3, x4, x5, x8, x9, x10, x13}
[t10] (wind, strong) {x2, x6, x7, x11, x12, x14}
with

[T′] = [t1] ∩ [t7] = {x1, x2, x8}    (7.5)
such that Dz=no depends on T ′, while T ′ itself is a minimal complex of Dz=no as it
cannot be further reduced without violating the properties defined in Eqn. 7.1.
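The dependency and minimality checks of this worked example can be carried out mechanically over the blocks of Table 7.2. The sketch below encodes objects by index and lists only the three blocks used above; the function names are illustrative, not the thesis's.

```python
from functools import reduce
from itertools import combinations

# Blocks from Table 7.2 (objects by index) and the concept D(z=no).
blocks = {
    ("outlook", "sunny"):   {1, 2, 8, 9, 11},
    ("temperature", "hot"): {1, 2, 3, 13},
    ("humidity", "high"):   {1, 2, 3, 4, 8, 12, 14},
}
D_no = {1, 2, 6, 8, 14}

def depends(T, D):
    """Eqn 7.1: D depends on T iff the block intersection is non-empty and in D."""
    inter = reduce(set.intersection, (blocks[t] for t in T))
    return bool(inter) and inter <= D

def is_minimal_complex(T, D):
    """Minimal complex: D depends on T but on no proper subset of T."""
    return depends(T, D) and not any(
        depends(S, D) for r in range(1, len(T)) for S in combinations(T, r))

T  = (("outlook", "sunny"), ("temperature", "hot"), ("humidity", "high"))
Tp = (("outlook", "sunny"), ("humidity", "high"))
print(is_minimal_complex(T, D_no), is_minimal_complex(Tp, D_no))  # False True
```

Running this reproduces the example: T fails because its proper subset T′ also satisfies Eqn 7.1, while T′ itself cannot be reduced further.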
It is often the case that a minimal complex only describes a given concept partially,
and hence more than one minimal complex is required to cover a concept. A local
covering T of a concept D is a collection of minimal complexes, such that the union
of all minimal complexes is exactly D and T is minimal (i.e. containing no spurious
attribute-value pairs). The discovery of such local coverings (referred to hereafter
as the rule base) forms the basis for several approaches to rough set rule induction
[210]. A partitioning of the universe of discourse (that is consistent) by a reduct will
always produce equivalence classes that are subsets of the decision concepts, and
will cover each concept fully. Once a reduct has been found, rules may be extracted
from the underlying equivalence classes. Note that in the literature, reducts for the
purpose of rule induction are termed global coverings.
A popular approach to rule induction in the relevant area is the so-called learning
from examples module, version 2 (LEM2) algorithm [88], which follows a heuristic
strategy for creating an initial rule by choosing sequentially the “best” elementary
conditions according to certain heuristic criteria. Learning examples that match this
rule are then removed from consideration. The process is repeated iteratively when
learning examples remain uncovered. The resulting set of rules covers all learning
examples.
Additional factors characterising rules may also be taken into account [89], including the strength of matched or partly-matched rules (the total number of
cases correctly classified by an emerging rule during training), the number of non-
matched conditions, the rule specificity (i.e., length of condition parts). All factors
are combined and the strongest decision wins. If no rule is matched, the partially
matched rules are considered and the most probable decision is chosen.
7.1.2 Hybrid Fuzzy-Rough Rule Induction
Following the theoretical background laid out previously in Section 2.1.1.2, a common
rule induction strategy in fuzzy-rough set theory [42, 226] is to induce fuzzy rules
by overlaying decision reducts on the original training decisions, and then reading
off the (qualitative) values of the selected features. In other words, by partitioning
the universe via the features present in a decision reduct, each resulting fuzzy-rough
equivalence class forms a single rule. As the partitioning is produced by a reduct, it
is guaranteed that each fuzzy-rough equivalence class is subsumed by, or equal to, a
decision concept. This means that the attribute values that produced this equivalence
class are good predictors of the decision concept. The use of a reduct also ensures
that each object is covered by the set of rules. A disadvantage of this approach is that
the generated rules are often too specific, as each rule antecedent always includes
every feature appearing in the final reduct.
For the purposes of combining rule induction and feature selection, rules are
constructed from so-called tolerance classes (antecedents) and corresponding
decision concepts (consequents). A fuzzy rule rx with respect to an object x ∈ X is
represented as a triple:
rx = (B, RB x , Rz x), x ∈ X , B ⊆ A (7.6)
where B ⊆ A is the set of conditional attributes that appear in the rule’s antecedent,
RB x is the fuzzy tolerance class of the object that generated the rule, and Rz x refers
to a decision class z, i.e. the consequent of the rule.
Recall from Section 2.1.1.2 that RB is a fuzzy indiscernibility relation [236] for a
given feature subset B ⊆ A:

µRB(xi, xj) = T_{a∈B} µRa(xi, xj)    (7.7)
where µRa(xi, xj) is a fuzzy similarity relation (see Eqn. 2.18 or 2.19 for example),
and T is a t-norm. For a given object xi ∈ X, its tolerance class RB xi is then defined
as:

RB xi(xj) = µRB(xi, xj), ∀ xj ∈ X    (7.8)
This formulation is used as it provides a fast way of determining rule coverage (the
cardinality of the fuzzy set RB x), and rule specificity (the cardinality of B or the
number of rule antecedents).
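Eqns 7.7 and 7.8 can be sketched directly in code. The similarity measure below is an assumption (one common choice of the Eqn 2.18/2.19 family: µRa(xi, xj) = max(0, 1 − |a(xi) − a(xj)|/σa), with σa the sample standard deviation), and the t-norm is taken to be min; the data is the feature a1 column of Table 7.3.

```python
import statistics

def similarity(column, i, j, sd):
    # Assumed similarity: 1 minus the sigma-scaled distance, clipped at zero.
    return max(0.0, 1.0 - abs(column[i] - column[j]) / sd)

def tolerance_class(data, B, i):
    """R_B x_i(x_j) = min over a in B of mu_Ra(x_i, x_j), for every x_j (Eqn 7.8)."""
    n = len(next(iter(data.values())))
    sds = {a: statistics.stdev(data[a]) for a in B}
    return [min(similarity(data[a], i, j, sds[a]) for a in B) for j in range(n)]

data = {"a1": [1, 8, 7, 7, 0, 0, 1]}  # feature a1 from Table 7.3
print([round(v, 2) for v in tolerance_class(data, ["a1"], 0)])
# [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]
```

Under these assumptions the output reproduces the tolerance class Ra1 x1 that appears in the worked example of Section 7.1.2.2, which suggests the sigma-scaled similarity is close to the one used there.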
7.1.2.1 Outline of QuickRules
As an example of an existing approach to hybrid fuzzy-rough rule
induction, QuickRules [119] is illustrated in Algorithm 7.1.1. Its subroutine
for checking the coverage of a newly generated rule is detailed in Algorithm 7.1.2.
It makes use of the QuickReduct method previously described in Section 2.1.1.2,
where features are being examined individually. A given feature a ∈ A\ B is added
to the candidate feature subset B, if it provides the greatest increase in fuzzy-rough
dependency evaluation. Fuzzy rules are constructed on the fly whilst the feature
subset is being improved, for those objects not yet covered by existing rules. A
given candidate rule rx is checked via the check(B, RB x, Rz x) subroutine, in order to
determine whether it is subsumed by any rule already in the rule base.
The process terminates when a fuzzy-rough reduct is found, and all training objects
covered. Several important mechanisms exploited by QuickRules will be explained in
a greater detail in the following sections.
7.1.2.2 Worked Example
In order to demonstrate the operations of the QuickRules algorithm, an example data
set [237] adopted in the original paper [119] is employed. This data set, as shown
in Table 7.3, consists of seven objects X = x1, · · · , x7, eight conditional features
A= a1, · · · , a8 which are all quantitative, and a decision feature z.
Using hill climbing, QuickRules is initiated when the first object x1 is examined
with respect to the first feature a1, as no rules exist at the beginning. Using Eqn.
2.22, the membership of object x1 in the fuzzy set POSa1 is computed, which in
this case, is the same as that calculated using the full set of features:
µPOSa1(x1) = µPOSA(x1)    (7.9)
1   T, temporary feature subset
2   B = ∅, rules = ∅, cov = ∅
3   repeat
4       T = B
5       foreach a ∈ A \ B do
6           foreach x ∈ X \ covered(cov) do
7               if POS_{B∪{a}}(x) = POS_A(x) then
8                   check(B ∪ {a}, R_{B∪{a}} x, Rz x)
9           if f(B ∪ {a}) > f(T) then
10              T = B ∪ {a}
11      B = T
12  until f(B) = f(A)
13  return B, rules

Algorithm 7.1.1: Work flow of QuickRules
Algorithm 7.1.1: Work flow of QuickRules
1   R^r_B x, tolerance class of rule r ∈ rules
2   add = true
3   foreach r ∈ rules do
4       if RB x ⊆ R^r_B x then
5           add = false
6           break
7       else
8           if R^r_B x ⊂ RB x then rules = rules \ {r}
9
10  if add = true then
11      rules = rules ∪ {(B, RB x, Rz x)}
12      cov = cov ∪ RB x

Algorithm 7.1.2: Subroutine check(B, RB x, Rz x)
This satisfies the condition for new rule generation, as outlined in lines 7 to 8 of
Algorithm 7.1.1. Therefore, check(a1, Ra1x1, Rz x1) is invoked, with
Ra1x1 = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]
Rz x1 = [1, 0, 1, 0, 1, 1, 1]    (7.10)
This new rule fully covers x1 (being the object that the rule is constructed for), and
also x7 which has the same feature value as x1 for a1. It also partially covers objects
x5 and x6 to a degree of 0.73. This rule is then added to the (currently empty) rule
Table 7.3: Example data set for QuickRules
a1 a2 a3 a4 a5 a6 a7 a8 z
x1 1 101 50 15 36 24.2 0.526 26 0
x2 8 176 90 34 300 33.7 0.467 58 1
x3 7 150 66 42 342 34.7 0.718 42 0
x4 7 187 68 39 304 37.7 0.254 41 1
x5 0 100 88 60 110 46.8 0.962 31 0
x6 0 105 64 41 142 41.5 0.173 22 0
x7 1 95 66 13 38 19.6 0.334 25 0
set, and the coverage of the rule base is updated:
cov = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]    (7.11)
The algorithm continues to examine the remaining objects for the current feature
subset a1. It identifies another rule for x5, as µPOSa1(x5) = µPOSA(x5), with:
Ra1x5 = [0.73, 0.0, 0.0, 0.0, 1.0, 1.0, 0.73]
Rz x5 = [1, 0, 1, 0, 1, 1, 1]    (7.12)
This newly constructed rule is then compared to the existing rule in the rule base
Ra1x1. As neither of the two rules subsumes the other, i.e., Ra1x1 6⊂ Ra1x5 and
Ra1x5 6⊆ Ra1x1, this new rule is added to the rule base, and the coverage cov is
again updated:
cov = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0] ∪ [0.73, 0.0, 0.0, 0.0, 1.0, 1.0, 0.73]
    = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]    (7.13)
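The subsumption test and coverage update used here reduce to pointwise comparisons on fuzzy membership vectors. The sketch below replays them with the two tolerance classes from Eqns 7.10 and 7.12; it is a minimal illustration of the set operations, not the full check() subroutine.

```python
Ra1_x1 = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]   # Eqn 7.10
Ra1_x5 = [0.73, 0.0, 0.0, 0.0, 1.0, 1.0, 0.73]   # Eqn 7.12

def subsumes(r, s):
    """True iff fuzzy set s is contained in r, i.e. s(x) <= r(x) everywhere."""
    return all(sv <= rv for rv, sv in zip(r, s))

# Neither rule subsumes the other, so both stay in the rule base:
print(subsumes(Ra1_x1, Ra1_x5), subsumes(Ra1_x5, Ra1_x1))  # False False

# Fuzzy union (pointwise max) gives the updated coverage of Eqn 7.13:
cov = [max(a, b) for a, b in zip(Ra1_x1, Ra1_x5)]
print(cov)  # [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```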
For the feature a1 that is currently being considered, the last two objects x6 and
x7 have already been covered by the existing rules. The algorithm then calculates
the dependency degree of z upon the feature subset a1, producing f (a1) (refer
to Section 2.1.1.2 for more details regarding the dependency calculation using
fuzzy-rough sets).
The remaining single-feature subsets are also checked in a similar manner as
above, during which it is determined that µPOSa8(x2) = µPOSA(x2); a new rule
(a8, Ra8x2, Rz x2) is then added and the coverage of the rule base updated. The
dependency calculation results of the respective features are summarised as follows:
f (a1) = 0.61 f (a2) = 0.89
f (a3) = 0.28 f (a4) = 0.55
f (a5) = 0.70 f (a6) = 0.56
f (a7) = 0.46 f (a8) = 0.71
In this example, the best feature is a2, which results in the greatest increase in
dependency score; it is therefore added to the feature subset B.
The QuickRules algorithm is intended to iterate until all training objects are covered
fully by the discovered rules. While examining the remaining combinations of
feature subsets, two more rules, ({a2, a3}, Ra2,a3 x3, Rz x3) and ({a2, a3}, Ra2,a3 x4, Rz x4),
are identified. The coverage of the rule base, at this stage, becomes:

cov = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]    (7.14)

and the simultaneously selected feature subset B = {a2, a3} also reaches full dependency
evaluation f(B) = 1. The termination condition of QuickRules is satisfied, and the
final set of rules is:
(a1, Ra1 x1, Rz x1)
(a1, Ra1 x5, Rz x5)
(a8, Ra8 x2, Rz x2)
(a2, a3, Ra2,a3 x3, Rz x3)
(a2, a3, Ra2,a3 x4, Rz x4)
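The subsumption check and fuzzy coverage update in this worked example can be sketched in Python as follows (an illustrative rendering with hypothetical helper names; the fuzzy union is taken as the element-wise maximum):

```python
def fuzzy_union(u, v):
    """Element-wise maximum: the union of two fuzzy membership vectors."""
    return [max(a, b) for a, b in zip(u, v)]

def subsumes(u, v):
    """True if rule coverage u subsumes v, i.e. v is pointwise included in u."""
    return all(a >= b for a, b in zip(u, v))

r_a1_x1 = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]  # coverage after the first rule (Eqn. 7.11)
r_a1_x5 = [0.73, 0.0, 0.0, 0.0, 1.0, 1.0, 0.73]  # the new rule for x5 (Eqn. 7.12)

# Neither rule subsumes the other, so the new rule is kept and the
# coverage becomes their union (Eqn. 7.13).
assert not subsumes(r_a1_x1, r_a1_x5) and not subsumes(r_a1_x5, r_a1_x1)
cov = fuzzy_union(r_a1_x1, r_a1_x5)
print(cov)  # [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```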
7.2 HSFS for Hybrid Rule Induction
In this section, HSFS and its underlying HS algorithm are employed to aid rule base
optimisation. The rule induction process is integrated directly into the FS process,
generating rules on the fly. During its iterative process, the algorithm optimises the
emerging rule base with regard to given criteria (see later). The final result is a set
of fuzzy rules that cover the training objects to the maximum extent, whilst utilising
the minimum number of features.

7.2.1 Mapping of Key Notions
For rule induction, as summarised in Table 7.4, each musician represents a training
object x ∈ X. The collection of available notes for each musician is the possible set of
rules with respect to x, taking the form previously introduced in Eqn. 7.6. The rules
are differentiated by the various feature subsets involved. Each musician may vote
for one rule to be included in the emerging rule base when a new harmony is being
improvised. Here, a musician may choose to nominate r−, denoting an "empty" or
"blank" rule, if the object it represents is already covered by other existing rules. A
harmony H is then the combined rule base from all musicians, taking the form:

H = ( (B1, RB1 x1, Rd x1), · · · , r−, · · · , (B|X|, RB|X| x|X|, Rd x|X|) )   (7.15)

for xn ∈ X, Bn ⊆ A.
Table 7.4: Mapping of key notions from HS to rule induction
HS                  Optimisation        Rule Induction
------------------  ------------------  --------------------
Musician            Variable            Object
Musical Note        Variable Value      Rule
Harmony             Solution Vector     Rule Base
Harmony Memory      Solution Storage    Rule Base Storage
Harmony Evaluation  Fitness Function    Rule Base Evaluation
Optimal Harmony     Optimal Solution    Optimal Rule Base
The harmony memory H stores a predefined number of "good" candidate rule
bases, which are constantly updated with better quality rule bases over the course
of the search. The fitness function analyses and scores each harmony H (i.e., a
candidate rule base) found during the search process, using criteria including: data
coverage of the training objects, dependency with respect to the full set of features,
the size of the entire rule base, and the cardinality of the feature subsets involved in
the rules:
the rules:
evaluate(H):
    coverage    = Σ_{x∈X} ( ∪_{r∈H} R_Br x )(x)
    dependency  = |POS_H| / |POS_A| = ( Σ_{x∈X, r∈H} POS_Tr(x) ) / ( Σ_{x∈X} POS_A(x) )
    size        = 1 − |H| / |X|
    cardinality = 1 − ( Σ_{r∈H} |Tr| ) / ( |H| · |B| )   (7.16)
where coverage and dependency are prioritised.

The quality of a potential solution is first judged with respect to the criteria of
coverage and dependency. The size and cardinality are examined only if the solution
achieves coverage and dependency scores equal to or better than those currently stored
within H. This prioritisation reduces the computational cost of evaluating the weak
solutions that typically occur during the randomised search process. The same strategy
is employed by HSFS [62] for FRFS [126], where the goal of obtaining a full
fuzzy-rough dependency score is prioritised over size reduction.
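A minimal sketch of such a prioritised comparison follows, assuming the four criteria of Eqn. 7.16 are available as normalised scores in [0, 1] (the helper name and dictionary layout are hypothetical, not from the thesis):

```python
def better(new, old):
    """Compare two candidate rule bases by the criteria of Eqn. 7.16:
    coverage and dependency are judged first; size and cardinality
    act only as tie-breakers (hypothetical helper)."""
    primary_new = (new["coverage"], new["dependency"])
    primary_old = (old["coverage"], old["dependency"])
    if primary_new != primary_old:
        return primary_new > primary_old
    # Equal on the prioritised criteria: compare the compactness measures.
    return (new["size"], new["cardinality"]) > (old["size"], old["cardinality"])

h1 = {"coverage": 1.0, "dependency": 1.0, "size": 0.6, "cardinality": 0.8}
h2 = {"coverage": 1.0, "dependency": 1.0, "size": 0.5, "cardinality": 0.9}
print(better(h1, h2))  # True: tied on the prioritised criteria, h1 is more compact
```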
7.2.2 HarmonyRules
The two stages of HarmonyRules are shown in Algorithms 7.2.1 and 7.2.2. The
integration of these two procedures proceeds in a similar way to QuickRules [119].
As it is a hybrid approach, FS is performed alongside rule generation, and is embedded
within the random rule generation via the use of random feature subspaces.
7.2.2.1 Initialisation
 1  H = {H^j | j = 1, · · · , |H|}
 2  H^j_i ∈ H^j, i = 1, · · · , |X|
 3  B = FS(A, X)   (optional)
 4  for j = 1 to |H| do
 5      Hnew, cov = ∅
 6      for i = 1 to |X| do
 7          if x_i ∈ (X \ covered(cov)) then
 8              T = RandomSubspace(B)
 9              if POS_T(x_i) = POS_A(x_i) then
10                  cov = cov ∪ RT x_i
11                  Hnew_i = (T, RT x_i, Rz x_i)
12              else
13                  Hnew_i = r−
14      evaluate(Hnew)
15      H^j = Hnew

Algorithm 7.2.1: HarmonyRules initialisation
If any pre-processing is performed prior to rule induction (line 3), it is beneficial
to make use of fuzzy-rough set-based feature subset evaluators [126], so that the
reduced subset B satisfies:

POS_B(x) = POS_A(x), ∀x ∈ X   (7.17)
The candidate rule base (harmony) is maintained in H, and subsequently stored in
the harmony memory H. H contains the randomly generated rules for all objects,
and its stochastic states are reflected by the use of the random feature subspace (line 8).
The fuzzy set cov in X records the current degree of coverage of each object in the
training data by the current set of rules, while the function covered(cov) returns
the set of objects that are maximally (to a degree of 1.0) covered in cov:

covered(cov) = {x ∈ X | cov(x) = POS_A(x)}   (7.18)

This means that an object x is considered to be covered by the set of rules (H), if its
membership to cov is equal to that of the positive region of the full set of features A.
A rule (note) is constructed for x and subset T only when x has not yet been covered
maximally (line 7).
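Eqn. 7.18 can be sketched directly, assuming cov and POS_A are stored as mappings from objects to membership degrees (an illustrative helper, not the thesis implementation):

```python
def covered(cov, pos_a):
    """Eqn. 7.18: the objects maximally covered by the current rules,
    i.e. those whose coverage equals the full positive-region membership."""
    return {x for x, mu in cov.items() if mu == pos_a[x]}

cov   = {"x1": 1.0, "x2": 0.4, "x3": 0.9}
pos_a = {"x1": 1.0, "x2": 0.9, "x3": 0.9}
print(sorted(covered(cov, pos_a)))  # ['x1', 'x3']
```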
The coverage is updated if a rule is discovered when x ∉ covered(cov), and x
belongs to POS_T to the maximum extent (line 9). That is, the tolerance class RT(x)
of x (see Eqn. 7.8) is fully included in a decision concept, and the feature values of
T that generated this tolerance class are good indicators of the concept. The new
coverage is determined by taking the union of the rule's tolerance class with the
current coverage (line 10). When all objects are fully covered, no further rules are
created. The current rule base H is then added to the harmony memory, having been
evaluated on the basis of four criteria: coverage, dependency, size and cardinality, as
defined in Eqn. 7.16.
Table 7.5 depicts an illustrative example of the harmony (rule base) generation
process. It starts with a rule r4 being created by musician p4 for object x4, which
covers x4 maximally. Musician p8 then identifies a slightly more general rule r8 that
covers both x2 and x8. Intuitively, no further inspection of x2 is necessary, and this is
marked as r− by p2. The process continues until all objects are covered by the rules
induced; in this case, only 4 rules {r1, r3, r4, r8} are required to describe the 8
training objects X = {x1, · · · , x8}. This is in contrast to QuickRules, where a rule is
added to the rule base only if no rule exists with the same or greater coverage, and
an existing rule that has a strictly smaller coverage than the new rule is deleted.
HarmonyRules instead relies on the optimisation capability of HS to converge to a
more compact rule base whilst maintaining coverage of all the training objects.
Table 7.5: Rule base improvisation example, showing an emerging harmony (left)
and its associated coverage status of objects (right)

p1  p2  p3  p4  p5  p6  p7  p8 | x1 x2 x3 x4 x5 x6 x7 x8
 ·   ·   ·  r4   ·   ·   ·   · |  ·  ·  ·  Ø  ·  ·  ·  ·
 ·  r−   ·  r4   ·   ·   ·  r8 |  ·  Ø  ·  Ø  ·  ·  ·  Ø
r1  r−   ·  r4  r−  r−   ·  r8 |  Ø  Ø  ·  Ø  Ø  Ø  ·  Ø
r1  r−  r3  r4  r−  r−  r−  r8 |  Ø  Ø  Ø  Ø  Ø  Ø  Ø  Ø
7.2.2.2 Iteration
Once the harmony memory has been initialised with |H| rule bases, the
iterative improvisation process starts, as shown in Algorithm 7.2.2. At the beginning
of every iteration, the rule base Hnew and the fuzzy set of data coverage cov
are both empty. Musicians, each representing an object x ∈ X, x ∉ covered(cov),
then follow the Pick(x) procedure, each nominating a rule for inclusion in the
emerging rule base Hnew. Similar to the initialisation process, the tolerance classes
are examined, and the measure of coverage cov is updated when necessary (lines
6 and 7). The newly improvised rule base is evaluated, and it replaces the current
worst rule base stored in the harmony memory if a higher score is achieved.
 1  while g < g_max do
 2      Hnew, cov = ∅
 3      for i = 1 to |X| do
 4          if x_i ∈ (X \ covered(cov)) then
 5              T = Pick(x_i)
 6              if POS_T(x_i) = POS_A(x_i) then
 7                  cov = cov ∪ RT x_i
 8                  Hnew_i = (T, RT x_i, Rz x_i)
 9              else
10                  Hnew_i = r−
11      if evaluate(Hnew) > min_{H∈H} evaluate(H) then
12          Update H with Hnew
13      g = g + 1

Algorithm 7.2.2: HarmonyRules iteration
7.2.3 Rule Adjustment Mechanisms
Recall that the key parameters of HS (as described in Section 3.1.1) are the harmony
memory considering rate δ, the pitch adjustment rate ρ, and the fret-width τ. They
encourage exploration and help with the fine-tuning of a given candidate solution.
Both δ and ρ, which affect rule choices, are incorporated in this approach, allowing
potentially good rules and rule combinations to be discovered. Refer to Fig. 3.1.1,
which illustrates the adjustment procedure for the original HS. For δ activation, a
randomly formed feature subset is assigned to the current object x. This is conceptually
similar to the original δ activation, which causes a musician to randomly pick a value
from the entire value range [min_x, max_x] of a given variable x.
For ρ, the feature subset involved in the rule in question will be modified by
HarmonyRules, by adding/removing k features (a Hamming distance of k), where
k = τ × |A|. Here, τ ∈ [0, 1] is the predefined pitch adjusting fret-width, which is
scaled by the total number of features |A|. For example, assume a total of |A| = 20
features and τ = 0.1, giving k = 2. The modifications made to the feature subset Bn
for the rule generated for a given object xn may be:

(Bn, RBn xn, Rd xn) −→ (B′n, RB′n xn, Rd xn)
Bn = {a2, a3, a4, a18, a20} −→ B′n = {a2, a4, a11, a18, a20}   (7.19)

Here, a total of k = 2 alterations have been carried out (a3 is removed and a11
added), so that a new feature subset B′n can be obtained. The following adjustment
scenario is equally valid, resulting in an empty feature subset. This in turn specifies
an "empty" rule r− to be assigned to object xn:

(Bn, RBn xn, Rd xn) −→ (r−)
{a3, a20} −→ ∅   (7.20)
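The ρ adjustment described above can be sketched as follows (one possible rendering; the 50/50 add-or-remove policy is an assumption, as the text only fixes the number of alterations k):

```python
import random

def adjust_subset(subset, all_features, tau, rng=random):
    """Alter a rule's feature subset by a Hamming distance of
    k = round(tau * |A|), adding or removing one feature per step.
    (Hypothetical helper; the choice policy is an assumption.)"""
    k = max(1, round(tau * len(all_features)))
    current = set(subset)
    for _ in range(k):
        if current and rng.random() < 0.5:
            current.remove(rng.choice(sorted(current)))   # drop a feature
        else:
            outside = sorted(set(all_features) - current)
            if outside:
                current.add(rng.choice(outside))          # add a feature
    return current  # may become empty, denoting the blank rule r-

features = ["a%d" % i for i in range(1, 21)]              # |A| = 20, so k = 2
new_subset = adjust_subset({"a2", "a3", "a4", "a18", "a20"}, features, tau=0.1)
print(sorted(new_subset))
```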
Note that the current implementation of HarmonyRules is of complexity O(g_max × |X|³).
This is because the fuzzy-rough rule evaluation itself has a cost of O(|X|²), which
needs to be performed at least once for every rule and every object x ∈ X.
Further optimisations and modifications based on the experimental findings remain
an active research topic, to be further discussed in Section 9.2.1.
7.3 Experimentation and Discussion
This section presents an experimental evaluation of the proposed approach, for the
task of classification, over 9 real-valued benchmark data sets drawn from [78], with a
selection of classifiers. A summary of the data sets used is given in Table 7.6, and
detailed descriptions of their underlying problem domains may be found in Appendix
B. The number of conditional features ranges from 8 to 2556, and the number of
objects ranges from 120 to 390. The HS parameters empirically employed are shown
in Table 7.7.
Table 7.6: Data set information
Data set  Objects  Features  Classes
cleve       297      13        5
glass       270      13        7
heart       214       9        2
ionos       230      34        2
olito       120      25        4
water2      390      38        2
water       390      38        3
web         149    2556        5
wine        178      13        3
Table 7.7: Parameters settings where * denotes the dynamically adjusted values
Parameter  |P|   |H|   g_max   δ     ρ     τ
Value      |X|   20    2000    0.8*  0.5*  0.1*
7.3.1 Classification Results
Table 7.8 reports the classification performance of HarmonyRules as compared to
the HC-based QuickRules algorithm [119], and to those obtained using the following
methods: a) the nearest neighbour classifier based on fuzzy sets (FNN) [140], b)
the recently developed weighted fuzzy subsethood-based algorithm (WSBA) [213],
and c) two leading rough set-based rule induction methods (learning from examples
module, version 2 (LEM2) [88], and the modified LEM algorithm (ModLEM) [210]).
For each method, 10-FCV is performed to validate the generated models, with the
results averaged. A statistical paired t-test (per fold) is carried out to judge the
significance of the differences between HarmonyRules and QuickRules, with threshold
p = 0.01, where v, −, ∗ indicate that a result is statistically better, the same, or worse,
respectively.
It can be seen that HarmonyRules obtained better results than QuickRules in
terms of accuracy in 4 out of 9 cases, and statistically comparable performance
is achieved for the data sets cleve, heart, and wine. Note that HarmonyRules
typically generates rules with more compact underlying feature subsets, mainly due
to the excellent FS performance of the HSFS algorithm. Therefore, the overall quality
of the discovered rule bases is superior. A more in-depth investigation regarding
rule cardinality is given later in Section 7.3.2. Larger standard deviations are also
observed for the HarmonyRules results, likely caused by the stochastic nature
of the HS mechanism employed.
The results presented in Table 7.8 demonstrate the power of fuzzy-rough set
theory in handling the vagueness and uncertainty often present in data. Although
HarmonyRules outperforms QuickRules by 9% for the data set glass, FNN clearly
claims the best result with 68.57% correct predictions. LEM2 performs relatively poorly,
particularly for the glass, ionos and olito data sets. WSBA fails to classify the
web data set, demonstrating the shortcomings of that approach when handling high
dimensional data sets. Note that, unlike HarmonyRules, LEM2 and ModLEM perform
a degree of rule pruning during induction, and are therefore expected to produce
more general rules and better resulting accuracies.
Table 7.8: Classification accuracy comparison between HarmonyRules and QuickRules,
using 10-FCV, where v, −, ∗ indicate statistically better, same, or worse than QuickRules

Data set  HarmonyRules     QuickRules   FNN          WSBA         LEM2         ModLEM
cleve     56.53±10.35 -    56.21±7.54   49.75±5.67   51.91±8.50   53.17±5.47   50.15±8.25
glass     51.88±14.28 v    42.83±12.34  68.57±9.62   35.51±8.70   14.98±6.33   62.62±7.96
heart     80.00±6.11  -    80.67±6.83   66.11±7.89   83.00±6.07   77.04±10.18  76.3±7.8
ionos     92.17±8.68  v    91.57±5.62   78.00±7.46   82.96±8.71   58.26±12.17  87.39±5.65
olito     75.83±10.88 v    72.33±11.48  63.25±12.48  78.17±11.77  44.17±9.17   64.17±9.9
water2    86.89±3.97  v    86.13±4.42   77.97±2.66   84.74±4.84   80.26±2.82   86.41±3.98
water     81.28±7.69  *    82.41±4.81   74.64±3.77   81.97±5.57   70.51±3.85   84.36±5.67
web       56.33±10.00 *    63.1±11.89   45.55±8.04   1.14±2.70    41.67±11.67  61.05±12.28
wine      97.78±4.05  -    97.75±3.92   96.40±4.06   96.17±4.14   73.66±9.25   92.03±8.7
Additional comparative studies have been reported in [119] that include results
obtained using leading non-fuzzy rough methods [132, 264], which are given in
Table 7.9. In addition to the NB and C4.5 algorithms that have already been described
in Section 3.5, the sequential minimal optimisation (SMO) method that is widely
used for training support vector machines [208], a neural network-based algorithm
termed projective adaptive resonance theory (PART) [33], and finally a propositional
rule learning algorithm, RIPPER (repeated incremental pruning to produce error
reduction) [43], are also used for comparison. Based on the results, HarmonyRules
is comparable to the leading non-rough set-based techniques, with competitive
classification models obtained for the data sets cleve, ionos, and wine. Indeed, the
overall performance of HarmonyRules is slightly worse than that of SMO. However,
note that black-box classifiers such as SMO do not produce humanly interpretable
rules, as they typically work in higher dimensional transformed feature spaces. Since
one of the main reasons for performing rule induction is to make data transparent to
human users, this is counter-productive, becoming an obstacle to knowledge discovery
and also to the understanding of the underlying processes which generated the data.
Table 7.9: Classification accuracy of other classifiers tested using 10-FCV

Data set  SMO          NB           C4.5         RIPPER       PART
cleve     58.31±6.15   56.06±6.78   53.39±7.31   54.16±3.64   52.44±7.20
glass     57.77±9.10   47.70±9.21   68.08±9.28   67.05±10.69  69.12±8.50
heart     83.89±6.24   83.59±5.98   78.15±7.42   79.19±6.38   77.33±7.81
ionos     82.96±6.93   83.78±7.62   86.13±6.20   87.09±6.92   87.39±6.61
olito     87.92±8.81   78.50±11.31  65.75±12.13  68.83±13.06  67.00±12.86
water2    83.67±4.15   70.28±7.56   83.08±5.45   82.64±5.46   83.79±5.17
water     86.87±4.36   85.46±4.98   81.59±6.51   82.44±6.63   82.54±5.87
web       64.78±10.47  63.41±12.93  57.63±11.31  55.09±12.99  51.50±12.86
wine      98.70±2.76   97.46±3.86   93.37±5.85   93.18±6.49   92.24±6.22
7.3.2 Comparison of Rule Cardinalities
Table 7.10 presents a side-by-side comparison between HarmonyRules and QuickRules
in terms of the cardinalities of the rules returned by the different induction processes.
It can be seen that, apart from the data set glass, where HarmonyRules uses almost
double the number of features but achieves much better classification accuracy, the
averaged cardinalities of the rule bases obtained by HarmonyRules are generally more
compact than those of QuickRules. However, although HarmonyRules selects
substantially smaller subsets for the data set web, the resulting classification accuracy
is also reduced. One possible explanation of the above observation is that, during HS
optimisation, the random feature subsets are judged purely by their capability to approximate
Table 7.10: HarmonyRules vs. QuickRules in terms of rule cardinalities, where v, −,
∗ indicate statistically better, same, or worse results

          HarmonyRules                  QuickRules
Data set  Accuracy %     Cardinality    Accuracy %    Cardinality
cleve     56.53±10.35 -   8.17 v        56.21±7.54     9.35
glass     51.88±14.28 v  10.83 *        42.83±12.34    5.66
heart     80.00±6.11  -   5.27 v        80.67±6.83     5.68
ionos     92.17±8.68  v   8.99 v        91.57±5.62     9.95
olito     75.83±10.88 v   8.19 v        72.33±11.48    9.39
water2    86.89±3.97  v   5.99 -        86.13±4.42     5.97
water     81.28±7.69  *   5.05 v        82.41±4.81    11.06
web       56.33±10.00 *  20.57 v        63.1±11.89    64.04
wine      97.78±4.05  -   7.05 *        97.75±3.92     4.65
Figure 7.1: HarmonyRules (left) vs. QuickRules (right) in terms of rules' feature
subset cardinality distribution (for data set web of 2556 features)
the underlying concept, determined by the fuzzy-rough lower approximation POS_B(x),
coverage, etc. Two rule bases may be identical in terms of evaluation scores; however,
the models, and their stability and complexity, may be distinctly different. The
classification task to be performed may favour one with more specific rules (which
may be obtained by QuickRules), for example, for the web data set as shown in Fig. 7.1.
Because one rule base evaluation criterion is the compactness of the feature subsets,
HarmonyRules resulted in more rules containing fewer than 50 features, while a
large number of the rules returned by QuickRules are much higher in terms of attribute
cardinality.
7.4 Summary
This chapter has described an improved hybrid fuzzy-rough rule induction technique,
HarmonyRules, based on fuzzy-rough rule induction and HSFS. HS has demonstrated
many competitive features over conventional approaches: fast convergence, sim-
plicity, insensitivity to initial states, and efficiency in finding good quality solutions.
Experimental comparative studies have shown that the resultant rule base completely
covers the training data, while the feature subsets involved also achieve a full fuzzy-rough
dependency evaluation, indicating good approximations of the original data. The
cardinality of the feature subsets involved reflects good subset size reduction. Classi-
fication accuracy is comparable to that achievable by state-of-the-art approaches.
In almost all aspects, the proposed approach is able to improve upon the greedy
HC-based QuickRules method.
The technique presented in this chapter may be further improved by various
means, and closer examination of the theories involved may reveal tighter forms
of integration, resulting in more optimised (in terms of efficiency and robustness)
methods. A more detailed discussion is given in Section 9.2.1.3. Furthermore, it
is worth noting that in practical scenarios [148], there often exist substantial gaps
in a given (fuzzy) rule base, i.e., the rule base is sparse, regardless of how the rules
are originally obtained (learned from data, or supplied by domain experts). This is
because the amount of knowledge or data available may be limited, and therefore
the rules may not fully support the needs of inference-based fuzzy reasoning [177, 282].
For such types of application, methods developed for fuzzy rule interpolation [142, 143],
to be explored in the following chapter, become valuable.
Chapter 8
HSFS for Fuzzy Rule Interpolation
Fuzzy Rule Interpolation [142, 143] (FRI) is of particular significance for rea-
soning in the presence of insufficient knowledge or sparse rule bases. When a
given observation has no overlap with antecedent values, no rule can be invoked in
classical (fuzzy) rule-based inference, and therefore no consequence can be derived.
The techniques of FRI not only support inference in such situations, but also help to
reduce the complexity of fuzzy models. Despite these advantages, FRI techniques are
relatively rarely applied in practice [148]. One of the main reasons is that real-world
applications generally involve rules with a large number of antecedents, and the
errors accumulated throughout the interpolation process may affect the accuracy
of the final estimation. More importantly, a rule base may involve less relevant,
redundant or even misleading variables, which could further skew the outcome
of an interpolation. Such characteristics of data have been studied extensively
in the area of FS, with techniques developed to rank the importance of features
[116, 123, 147, 219, 291], or to discover a minimal feature subset from a problem
domain while retaining a suitably high accuracy in representing the original data
[52, 93, 126, 164].
This chapter presents a new approach that uses FS techniques to evaluate the
importance of antecedent variables in a fuzzy rule base. Such importance degrees
are hereafter referred to as the set of "antecedent significance values". This allows
subsets of informative antecedent variables to be identified via the use of feature
subset search methods, e.g., HSFS. It helps to reduce the dimensionality of a rule
base by removing irrelevant antecedent variables. An antecedent significance-based
FRI technique based on scale and move transformation-based FRI (T-FRI) is also
proposed, which exploits the information analysed by FS, in order to facilitate more
effective interpolation using weighted aggregation [273]. The benefits of this work
are demonstrated using the scenario of backward FRI [128, 131] (B-FRI), which is a
newly identified research focus within FRI.
The remainder of this chapter is structured as follows. Section 8.1 introduces the
general ideas behind FRI, and explains the key notions and interpolation steps of
T-FRI, which is the main method used to carry out the present investigation. This
section also gives an outline of the B-FRI method for completeness. Section 8.2
details the developed approach which applies the existing ideas in FS to FRI, explains
the antecedent significance-based aggregation procedure that is implemented using T-
FRI, and discusses its potential benefits to B-FRI. In Section 8.3, an example scenario
concerning the prediction of terrorist bombing attacks is employed to showcase the
procedures of the proposed work. Further, a series of experiments has been carried
out in order to verify the general performance of the present approach. Section 8.4
summarises the chapter.
8.1 Background of Fuzzy Rule Interpolation (FRI)
This section first introduces the general principles of FRI, and provides a brief
introduction of the procedures involved in T-FRI, including the definitions of its
underlying key notions, and an outline of its interpolation steps. Note that the
basic form of T-FRI is herein employed for the ease of presentation, which utilises
two neighbouring rules of a given observation to perform interpolation. Triangular
membership functions are also adopted for simplicity, which are the most commonly
used fuzzy set representation in fuzzy systems. More detailed descriptions and
discussions on the theoretical underpinnings behind T-FRI can be found in the
original work [109, 110].
FRI approaches in the literature can be categorised into two classes with several
exceptions (e.g. type II fuzzy interpolation [36, 158]). The first category of ap-
proaches directly interpolates rules whose antecedents match the given observation.
The consequence of the interpolated rule is thus the logical outcome. Typical methods
in this group [142, 143, 248] are based on the use of α-cuts (α ∈ (0, 1]). The α-cut
of the interpolated consequent fuzzy set is calculated from the α-cuts of the observed
antecedent fuzzy sets, and those of all the fuzzy sets involved in the rules used for
interpolation. Having found the consequent α-cuts for all α ∈ (0, 1], the consequent
fuzzy set is then assembled by applying the Resolution Principle [229].
The second category is based on analogical reasoning [24]. Such approaches
first interpolate an artificially created intermediate rule so that the antecedents of
the intermediate rule are similar to the given observation [14]. Then, a conclusion
can be deduced by firing this intermediate rule subject to similarity constraints. The
shape distinguishability between the resulting fuzzy set and the consequence of the
intermediate rule, is set to be analogous to the shape distinguishability between
the observation and the antecedent of the created intermediate rule. In particular,
the scale and move transformation-based approach [109, 110] (T-FRI) offers a
flexible means to handle both interpolation and extrapolation involving multiple
multi-antecedent rules.
Following a similar convention to that adopted for FS, described in Section 1.1, an
FRI system investigated in this chapter is defined as a tuple (R, Y), where
R = {r_1, · · · , r_|R|} is a non-empty finite set of fuzzy rules (the rule base), and Y is a
non-empty finite set of variables, Y = A ∪ {z}, where A = {a_j | j = 1, · · · , |A|} is the
set of antecedent variables, and z is the consequent variable appearing in the rules.
Without losing generality, a given rule r_i ∈ R and an observation o∗ are expressed in
the following format:

r_i : if a_1 is a_i1, and · · · , and a_j is a_ij, and · · · , and a_|A| is a_i|A|, then z is z_i
o∗ : a_1 is a∗_1, and · · · , and a_j is a∗_j, and · · · , and a_|A| is a∗_|A|   (8.1)

where a_ij represents the value (the fuzzy set) of the antecedent variable a_j in rule r_i,
and z_i denotes the value of the consequent variable z for r_i. The asterisk (∗)
denotes that a value has been directly observed.
8.1.1 Transformation-Based FRI
A key concept used in T-FRI is the representative value rep(a_j) of a fuzzy set a_j,
which captures important information such as the overall location of the fuzzy set. For
triangular membership functions of the form a_j = (a_j1, a_j2, a_j3), where a_j1 and a_j3
represent the left and right extremities (with membership values of 0), and a_j2 denotes
the normal point (with a membership value of 1), rep(a_j) is defined as the centre of
gravity of these three points:

rep(a_j) = (a_j1 + a_j2 + a_j3) / 3   (8.2)
More generalised forms of representative values for more complex membership
functions have also been defined in [109, 110].
The following is an outline of the T-FRI algorithm. An illustration of its key
procedures is also provided in Fig. 8.1 in order to aid the explanation.
Figure 8.1: Procedures of T-FRI for a given antecedent dimension a_j, where the
dashed lines indicate the representative values of the fuzzy sets
1. Identification of the Closest Rules

The distance between any two rules r_p, r_q ∈ R is determined by computing
the aggregated distance between all the antecedent variable values:

d(r_p, r_q) = sqrt( Σ_{j=1}^{|A|} d(a_pj, a_qj)² ),
where d(a_pj, a_qj) = |rep(a_pj) − rep(a_qj)| / (max_{a_j} − min_{a_j})   (8.3)

Here, d(a_pj, a_qj) is the normalised result of the otherwise absolute distance
measure, so that distances are compatible with each other over different vari-
able domains. The distance between a given rule r_p and the observation o∗,
d(r_p, o∗), may be calculated in the same manner, and the two closest rules, say
r_u and r_v, are identified and used for the later interpolation process.
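Step 1 above (Eqns. 8.2 and 8.3) can be sketched as follows for triangular fuzzy sets (illustrative helpers; rule consequents are omitted for brevity):

```python
import math

def rep(fs):
    """Centre of gravity of a triangular fuzzy set (Eqn. 8.2)."""
    return sum(fs) / 3.0

def rule_distance(rp, rq, ranges):
    """Normalised aggregated distance between two rules (Eqn. 8.3).
    ranges[j] is the (min, max) domain of the j-th antecedent variable."""
    total = 0.0
    for ap, aq, (lo, hi) in zip(rp, rq, ranges):
        total += (abs(rep(ap) - rep(aq)) / (hi - lo)) ** 2
    return math.sqrt(total)

# one-antecedent rules and an observation, all on the domain [0, 10]
obs    = [(4.0, 5.0, 6.0)]
rules  = {"r1": [(0.0, 1.0, 2.0)], "r2": [(3.0, 4.0, 5.0)], "r3": [(6.0, 7.0, 8.0)]}
ranges = [(0.0, 10.0)]
closest = sorted(rules, key=lambda r: rule_distance(rules[r], obs, ranges))[:2]
print(closest)  # ['r2', 'r3']
```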
2. Construction of the Intermediate Fuzzy Rule

The intermediate fuzzy rule r′ is the starting point of the transformation process
in T-FRI. It consists of a series of intermediate antecedent fuzzy sets a′_j, and an
intermediate consequent fuzzy set z′:

r′ : if a_1 is a′_1, and · · · , and a_j is a′_j, and · · · , and a_|A| is a′_|A|, then z is z′   (8.4)

which is a weighted aggregation of the two selected rules r_u and r_v. For each
of the antecedent dimensions a_j, a ratio λ_{a_j}, 0 ≤ λ_{a_j} ≤ 1, is introduced, which
represents the contribution of a_vj towards the formation of a′_j with respect to
a_uj:

λ_{a_j} = d(a_uj, a∗_j) / d(a_uj, a_vj)   (8.5)

The intermediate antecedent fuzzy set a′_j is then computed using:

a′_j = (1 − λ_{a_j}) a_uj + λ_{a_j} a_vj   (8.6)

The position and shape of the intermediate consequent fuzzy set z′ are then
calculated in the same manner from the consequent fuzzy sets of the two rules,
z_u and z_v, where the ratio λ_z is obtained by averaging the ratios of the
antecedent variables:

λ_z = (1/|A|) Σ_{j=1}^{|A|} λ_{a_j}   (8.7)
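The construction of an intermediate antecedent fuzzy set (Eqns. 8.5 and 8.6) can be sketched as follows. Within a single variable domain, the ratio of the normalised distances in Eqn. 8.5 equals the ratio of the unnormalised ones, so plain absolute differences suffice here (helper names are hypothetical):

```python
def rep(fs):
    """Centre of gravity of a triangular fuzzy set (Eqn. 8.2)."""
    return sum(fs) / 3.0

def intermediate(a_u, a_v, a_obs):
    """Blend the antecedent values of the two closest rules towards the
    observation: Eqn. 8.5 gives the ratio, Eqn. 8.6 the blended set."""
    lam = abs(rep(a_u) - rep(a_obs)) / abs(rep(a_u) - rep(a_v))
    blended = tuple((1 - lam) * u + lam * v for u, v in zip(a_u, a_v))
    return lam, blended

# the observation sits midway between the two rule values, so lambda = 0.5
lam, a_prime = intermediate((0.0, 1.0, 2.0), (4.0, 5.0, 6.0), (2.0, 3.0, 4.0))
print(lam, a_prime)  # 0.5 (2.0, 3.0, 4.0)
```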
3. Computation of the Scale and Move Parameters

The goal of a transformation process T is to scale and move (or skew) an inter-
mediate fuzzy set a′_j, so that the transformed shape coincides with that of the
observed value a∗_j. In T-FRI, such a process is performed in two stages: 1) the
scale operation from a′_j to a″_j (denoting the scaled intermediate fuzzy set), in
an effort to determine the scale ratio s_{a_j}; and 2) the move operation from a″_j
to a∗_j, to obtain a move ratio m_{a_j}. Once these are computed for each of the
antecedent variables, the necessary parameters s_z and m_z for the consequent
variable can be approximated, in order to compute the final interpolation result z∗.
For a triangular fuzzy set a′_j = (a′_j1, a′_j2, a′_j3), the scale ratio s_{a_j} is calculated
using:

s_{a_j} = (a∗_j3 − a∗_j1) / (a′_j3 − a′_j1)   (8.8)

which essentially expands or contracts the support length of a′_j, a′_j3 − a′_j1, so that
it becomes the same as that of a∗_j. The scaled intermediate fuzzy set a″_j, which
has the same representative value as a′_j, is then acquired using the formulae
below:

a″_j1 = ( (1 + 2s_{a_j}) a′_j1 + (1 − s_{a_j}) a′_j2 + (1 − s_{a_j}) a′_j3 ) / 3
a″_j2 = ( (1 − s_{a_j}) a′_j1 + (1 + 2s_{a_j}) a′_j2 + (1 − s_{a_j}) a′_j3 ) / 3
a″_j3 = ( (1 − s_{a_j}) a′_j1 + (1 − s_{a_j}) a′_j2 + (1 + 2s_{a_j}) a′_j3 ) / 3   (8.9)

The move operation shifts the position of a″_j to be the same as that of a∗_j, while
maintaining its representative value rep(a″_j). This is made possible by using a
tailored move ratio m_{a_j}:

m_{a_j} = 3(a∗_j1 − a″_j1) / (a″_j2 − a″_j1),   if a∗_j1 ≥ a″_j1
m_{a_j} = 3(a∗_j1 − a″_j1) / (a″_j3 − a″_j2),   otherwise   (8.10)
The final positions of the triangle's three points are calculated as follows:

a∗_j1 = a″_j1 + m_{a_j} (a″_j2 − a″_j1)/3
a∗_j2 = a″_j2 − 2 m_{a_j} (a″_j2 − a″_j1)/3
a∗_j3 = a″_j3 + m_{a_j} (a″_j2 − a″_j1)/3,   if m_{a_j} ≥ 0   (8.11a)

a∗_j1 = a″_j1 + m_{a_j} (a″_j3 − a″_j2)/3
a∗_j2 = a″_j2 − 2 m_{a_j} (a″_j3 − a″_j2)/3
a∗_j3 = a″_j3 + m_{a_j} (a″_j3 − a″_j2)/3,   otherwise   (8.11b)

Note that this operation also guarantees that the resultant shape is convex and
normal.
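The scale and move operations (Eqns. 8.8–8.11) can be rendered for triangular sets as below (an illustrative sketch, not the thesis implementation; both operations preserve the representative value):

```python
def scale_ratio(a_obs, a_pr):
    """Eqn. 8.8: ratio of the observed to the intermediate support length."""
    return (a_obs[2] - a_obs[0]) / (a_pr[2] - a_pr[0])

def scale(a, s):
    """Eqn. 8.9: expand/contract the support while preserving rep(a)."""
    a1, a2, a3 = a
    return (((1 + 2 * s) * a1 + (1 - s) * a2 + (1 - s) * a3) / 3,
            ((1 - s) * a1 + (1 + 2 * s) * a2 + (1 - s) * a3) / 3,
            ((1 - s) * a1 + (1 - s) * a2 + (1 + 2 * s) * a3) / 3)

def move_ratio(a_obs, a_dp):
    """Eqn. 8.10, with a_dp the scaled intermediate set a''."""
    if a_obs[0] >= a_dp[0]:
        return 3 * (a_obs[0] - a_dp[0]) / (a_dp[1] - a_dp[0])
    return 3 * (a_obs[0] - a_dp[0]) / (a_dp[2] - a_dp[1])

def move(a, m):
    """Eqn. 8.11: shift the three points while preserving rep(a)."""
    a1, a2, a3 = a
    d = (a2 - a1) if m >= 0 else (a3 - a2)
    return (a1 + m * d / 3, a2 - 2 * m * d / 3, a3 + m * d / 3)

a_prime = (0.0, 1.0, 2.0)          # intermediate antecedent fuzzy set
a_obs   = (0.5, 1.0, 1.5)          # observed value
s = scale_ratio(a_obs, a_prime)    # 0.5
m = move_ratio(a_obs, scale(a_prime, s))
print(move(scale(a_prime, s), m))  # (0.5, 1.0, 1.5): coincides with a_obs
```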
4. Scale and Move Transformation on Intermediate Consequent Fuzzy Set
After computing the necessary parameters based on all of the observed an-
tecedent variable values, the required parameters for z′ are then determined
by averaging the corresponding parameter values:
sz =1|A|
|A|∑
j=1
sa j(8.12)
mz =1|A|
|A|∑
j=1
ma j(8.13)
A complete scale and move transformation from the initial intermediate con-
sequent fuzzy set z′ to the final interpolative output z∗, may be represented
concisely by: z∗ = T(z′, sz, mz), highlighting the importance of the two key
transformations required.
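To make the two-stage transformation concrete, the following is a minimal Python sketch of the scale and move operations for triangular fuzzy sets (Eqns. 8.8–8.11). The function names and the three-tuple representation are illustrative assumptions, not code from the thesis.

```python
def scale(a, s):
    """Scale a triangular set a = (a1, a2, a3) by ratio s (Eqn. 8.9);
    the representative value (a1 + a2 + a3) / 3 is preserved."""
    a1, a2, a3 = a
    return (((1 + 2 * s) * a1 + (1 - s) * a2 + (1 - s) * a3) / 3,
            ((1 - s) * a1 + (1 + 2 * s) * a2 + (1 - s) * a3) / 3,
            ((1 - s) * a1 + (1 - s) * a2 + (1 + 2 * s) * a3) / 3)

def move(a, m):
    """Shift the scaled set by move ratio m (Eqn. 8.11), again keeping
    the representative value unchanged."""
    b1, b2, b3 = a
    d = (b2 - b1) if m >= 0 else (b3 - b2)
    return (b1 + m * d / 3, b2 - 2 * m * d / 3, b3 + m * d / 3)

def transform(a, s, m):
    """The complete transformation, as in z* = T(z', s_z, m_z)."""
    return move(scale(a, s), m)
```

Scaling a = (1, 2, 4) with s = 2, for instance, doubles the support length from 3 to 6, while both operations leave the representative value at 7/3.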
8.1.2 Backward FRI (B-FRI)
B-FRI [128, 131] is a recently proposed extension to standard (forward) FRI. It
allows crucial missing values that directly relate to the conclusion to be inferred, or
interpolated, from the known antecedent values and the conclusion. This technique
supplements a conventional FRI process, and is particularly beneficial in the presence
of hierarchically arranged rule bases, since a normal inference or interpolation system
will be unable to proceed if certain key antecedent values (those that connect the
sub-rule bases) are missing.
An implementation of the B-FRI concept has been developed, based on the
mechanisms of T-FRI. It works by reversely approximating the scale and move trans-
formation parameters for the variables with missing values. In this chapter, the
scenario with a single missing antecedent value is considered, as efficient ways of
solving the more complex case of B-FRI with multiple missing values remain an
active research topic.
Although both forward and backward T-FRI share the same underlying analogy-
based idea, backward T-FRI has several subtle differences, such as the procedure
used to select the closest rules, and that used to compute the transformation parameters.
For instance, assume that the value of the antecedent variable a_l is missing from
the observation, whilst the conclusion z∗ can be directly observed. The distance
measurement d←(r_p, r_q) between any two rules is handled with a bias towards the
consequent variable:

d←(r_p, r_q) = √( |A| · d(z_p, z_q)² + Σ_{j=1, j≠l}^{|A|} d(a^p_j, a^q_j)² )   (8.14)
This is because the observed value for the consequent variable embeds more infor-
mation, and the weight assigned to it is equal to the number of individual antecedents,
|A|. Having identified the closest rules, the remaining steps are the same as in forward
T-FRI, except that the parameters for the missing antecedent: λ_{a_l}, s_{a_l}, and m_{a_l} are
calculated using a set of similar alternative formulae. For instance, the formula to
calculate λ_{a_l} is:

λ_{a_l} = |A| λ_z − Σ_{j=1, j≠l}^{|A|} λ_{a_j}   (8.15)

Here, the required parameter is obtained by subtracting the sum of the parameter
values of the known antecedents from that of the consequent (multiplied by a
weight of |A|). Finally, the backward interpolation result a∗_l can be obtained using
T←(a′_l, s_{a_l}, m_{a_l}).
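As a hedged illustration (not the thesis’s own code), the consequent-biased distance of Eqn. 8.14 and the backward recovery of the shift parameter in Eqn. 8.15 might be sketched as follows, assuming rules have been reduced to per-dimension scalar distances and parameters:

```python
import math

def backward_distance(ants_p, z_p, ants_q, z_q, missing):
    """Consequent-biased rule distance (Eqn. 8.14): the consequent term is
    weighted by |A|, and the missing antecedent a_l is skipped."""
    n = len(ants_p)  # |A|
    total = n * (z_p - z_q) ** 2
    total += sum((ap - aq) ** 2
                 for j, (ap, aq) in enumerate(zip(ants_p, ants_q))
                 if j != missing)
    return math.sqrt(total)

def backward_shift(lambda_z, lambdas, missing):
    """Recover the shift parameter of the missing antecedent (Eqn. 8.15):
    |A| * lambda_z minus the sum over the known antecedents."""
    n = len(lambdas)  # |A|, including the (unused) missing slot
    return n * lambda_z - sum(v for j, v in enumerate(lambdas) if j != missing)
```

With λ_z = 0.792 and known shift parameters summing to 2.171, `backward_shift` reproduces the value 1.789 computed later in Eqn. 8.23.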
8.2 Antecedent Significance-Based FRI
This section discusses the similarities and differences between the problem domain
of FS and that of FRI, and describes the approach developed that evaluates the
importance of rule antecedents using FS techniques. A weighted aggregation-based
approach is also introduced, which makes use of the antecedent significance values to
better approximate the interpolation results. The potential benefits of the proposed
technique in B-FRI are also explained.
8.2.1 From FS to Antecedent Selection
The key distinction between a standard FS problem (defined in Section 1.1) and FRI
is the presence of the continuously-valued consequent variable z, and the fact that
there are no well-defined class labels (hence the need for interpolation). From this
point of view, FRI is modelled more closely on regression than classification, and
therefore, only a select few non-class-label-dependent feature evaluators [202] can
be readily adapted for FRI, including CFS [46] and FRFS [126], which support
regression tasks by default. FRFS in particular relies on fuzzy similarity to
differentiate between two training objects. It employs a strict equivalence relation
for class labels or categorical data, but the underlying concepts (i.e., the upper and
lower approximations) may also be constructed using a real-valued “decision”
variable. CFS exploits the correlations between features and may be used for
regression-type problems.
Fig. 8.2 illustrates the general procedure of the proposed antecedent selection
approach for FRI. To achieve antecedent selection, a given feature evaluator such as
CFS or FRFS may be employed as is, once the rule base to be processed is converted
into a standard, crisp-valued data set. For this, any defuzzification mechanism may
be adopted; in this chapter, the representative value of a fuzzy set (Eqn. 8.2) is
used for this purpose. The newly created, crisp-valued data set (of antecedent values)
is then employed to train a feature (antecedent) evaluator, in order to obtain a
set of “feature evaluation scores”, or antecedent importance measurements ω′_{a_j}, j =
1, · · · , |A|, which are subsequently normalised to obtain the required significance
values:

(ω_{a_1}, · · · , ω_{a_|A|}) = ( ω′_{a_1} / Σ_{j=1}^{|A|} ω′_{a_j}, · · · , ω′_{a_|A|} / Σ_{j=1}^{|A|} ω′_{a_j} )   (8.16)
These values indicate the relevance of the underlying antecedent variables A, with
respect to the values of the consequent variable z based on the information embedded
in the rule base.
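A small sketch of this conversion-and-normalisation step, under the assumption that each fuzzy value is a triangular set whose representative value is the average of its three defining points (Eqn. 8.2); all names here are illustrative:

```python
def rep(a):
    """Representative value of a triangular fuzzy set (a1, a2, a3)."""
    return sum(a) / 3.0

def defuzzify_rule_base(rules):
    """Turn fuzzy rules (a list of (antecedents, consequent) pairs) into the
    crisp rows fed to a standard feature evaluator such as CFS or FRFS."""
    return [[rep(a) for a in ants] + [rep(z)] for ants, z in rules]

def normalise_scores(raw):
    """Normalise raw evaluator scores into significance values (Eqn. 8.16)."""
    total = sum(raw)
    return [w / total for w in raw]
```

Note that the CFS row of Table 8.2 below already sums to 1 (0.2765 + 0.2461 + 0.3312 + 0.1163 + 0.0299 = 1.0), i.e. it is in this normalised form.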
A feature subset search algorithm such as HSFS may be employed to identify a
quality antecedent subset B ⊆ A, which captures the information within the original
rule base R to a reasonable (if not the maximum) extent. R may then be pruned
to retain only the highest-quality antecedent variables, thereby producing a
reduced rule base (much like a reduced data set with irrelevant features removed).

Figure 8.2: Antecedent selection procedures
Subsequent tasks such as rule selection, fuzzy inference, or FRI may benefit greatly in
terms of accuracy and efficiency, once such redundant and noisy antecedent variables
have been eliminated.
8.2.2 Weighted Aggregation of Antecedent Significance
For a given rule base R, a set of antecedent significance values: ωa1, · · · ,ωa|A|, may
be computed, or supplied by subject experts. A weighted rule ranking strategy may
then be derived for the purpose of identifying the most suitable rules to perform
interpolation. Recall that the standard (unbiased) formula (Eqn. 8.3) adopted by T-FRI
for calculating the distance between any two given rules r_p, r_q ∈ R effectively
assumes equal significance for all involved antecedent variables. A general form of
weighted distance d may be defined by:

d(r_p, r_q) = √( Σ_{j=1}^{|A|} ω_{a_j} d(a^p_j, a^q_j)² )   (8.17)

which takes into consideration the significance value ω_{a_j} of each antecedent variable
a_j, j = 1, · · · , |A|.
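A minimal sketch of Eqn. 8.17, assuming the per-dimension distances d(a^p_j, a^q_j) have already been computed as scalars (the function name is an assumption, not thesis code):

```python
import math

def weighted_distance(dists, weights):
    """Weighted rule distance (Eqn. 8.17): each squared per-antecedent
    distance is scaled by its significance value before summing."""
    return math.sqrt(sum(w * d * d for w, d in zip(weights, dists)))
```

An antecedent whose significance is near zero then contributes almost nothing to the total, which is what allows an otherwise “distant” rule (such as r_3 in the Fig. 8.3 scenario below) to be selected.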
The use of d may allow a more flexible selection of rules. For instance, consider
the case illustrated in Fig. 8.3, with the assumption that a_1 and a_3 are antecedents
of high significance and a_2 is irrelevant (or noisy). For a given new observation
o∗, the two closest rules determined by standard T-FRI (using the unbiased distance
measure) may be r_1 and r_2. There may also exist another rule r_3 (involving the dashed
fuzzy sets) with values much closer to those of a_1 and a_3, but it has not been selected because
its overall distance to the observation is greater than that of r_2, due to its value
for a_2 being further away. Since a_2 is in fact of little importance, a weighted distance
measurement may select r_1, r_3 to perform interpolation, and the end result z∗_(1,3) may
provide a better estimate for this scenario than the result obtained using r_1 and
r_2.
As alternative rules may be selected via the use of weighted distance calculation,
the FRI mechanisms should therefore be modified in order to ensure consistency
amongst the results interpolated using different rules. In this chapter, the investiga-
tion is focused on the T-FRI method introduced in Section 8.1.1. However, the use of
the antecedent variable significance appears to be equally applicable to other types
of FRI technique, such as α-cut-based methods [142, 143, 248].
Recall that in the first step of T-FRI, the construction of the intermediate fuzzy rule
r ′ requires the set of intermediate antecedent fuzzy sets a′j, and the intermediate
consequent fuzzy set z′. A set of shift parameters λa1, · · · ,λa|A| ,λz are required, in
order to maintain the position (representative value) of r ′ on each of its antecedent
dimensions. The value of λz plays an important role in determining the initial position
of the intermediate consequent fuzzy set, which will affect the final interpolative
output. For the present problem, the calculation of λz is modified to reflect the
variations in antecedent variable significance, thereby producing a weighted shift
parameter λz:
λ_z = (1/|A|) Σ_{j=1}^{|A|} ω_{a_j} λ_{a_j}   (8.18)

which is then used to obtain the weighted intermediate consequent fuzzy set z′.

Figure 8.3: Alternative rule selection using weighted distance calculation

It is then necessary to apply the two-stage transformations to the intermediate consequent
fuzzy set z′, and the parameter values for the weighted transformations, the weighted
scale ratio s_z and move ratio m_z, are computed using:
s_z = (1/|A|) Σ_{j=1}^{|A|} ω_{a_j} s_{a_j}   (8.19)

m_z = (1/|A|) Σ_{j=1}^{|A|} ω_{a_j} m_{a_j}   (8.20)

These are modified versions of Eqns. 8.12 and 8.13, following the same principle as
that applied to the calculation of λ_z.
Finally, a complete, weighted T-FRI procedure from z′ to z∗ can be readily created
by following the transformation z∗ = T(z′, s_z, m_z). This weighted aggregation
procedure makes minimal alterations to the original T-FRI algorithm. Symbolically,
it appears identical to the conventional T-FRI method, and is therefore omitted here.
As such, the procedure maintains its structural simplicity and intuitive appeal, while
extending the capability of T-FRI.
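Eqns. 8.18–8.20 all share a single form: a significance-weighted sum of per-antecedent parameters, scaled by 1/|A|. A one-function sketch (illustrative name, not thesis code):

```python
def weighted_parameter(values, weights):
    """Weighted aggregation shared by lambda_z (Eqn. 8.18), s_z (Eqn. 8.19)
    and m_z (Eqn. 8.20): (1/|A|) * sum_j w_j * v_j."""
    return sum(w * v for w, v in zip(weights, values)) / len(values)
```

With all weights set to 1, this reduces to the unweighted averages of the standard T-FRI formulae (Eqns. 8.12 and 8.13), which is why the weighted procedure is a minimal alteration of the original algorithm.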
8.2.3 Use of Antecedent Significance in B-FRI
One of the common problems faced by a B-FRI system is the event where more
than one antecedent value is missing from an observation. It is difficult to fully
reconstruct, or even closely approximate, multiple missing values, since there may
exist a number of possible combinations of values that lead to the same conclusion.
It is also computationally complex to perform reverse reasoning with a large number
of unknowns. Antecedent selection, being a dimensionality reduction technique,
may be potentially beneficial in such situations. By identifying the more important
antecedent variables, or by removing irrelevant antecedents altogether, a priority-
based backward reasoning system may be established, which greatly simplifies the
problem. However, much of the relevant research concerning this issue is beyond the
scope of this thesis.
8.3 Experimentation and Discussion
This section provides a real-world scenario concerning the prediction of terrorist
activities, which is used to demonstrate the procedures of the proposed antecedent
significance-based approach, for both conventional T-FRI and B-FRI problems. The
accuracy and efficiency of the work is further validated via systematic evaluation
using synthetic random data.
8.3.1 Application Example
Consider a practical scenario that involves the prediction of terrorist bombing risk.
The likelihood of an explosion can be directly affected by the number of people in
the area; crowded places (of high popularity and high travel convenience) are usually
more likely to attract terrorist attention. Safety precautions such as police patrols
may also be very important factors: the more alert and prepared a place is, the fewer
opportunities there are for terrorists to attack. Other aspects such as temperature
and humidity may be of relevance, but their impact on the potential outcome is
limited. Table 8.1 lists a few example linguistic rules that may be derived for such a
scenario.
Table 8.1: Example linguistic rules for terrorist bombing prediction (M. for Moderate, V. for Very)

     popularity | convenience | patrol  | temperature | humidity | risk
     a1         | a2          | a3      | a4          | a5       | z
r1   V. Low     | V. Low      | V. High | M.          | High     | V. Low
r2   V. Low     | V. High     | V. Low  | High        | Low      | V. Low
r3   M. High    | M.          | Low     | M.          | High     | M.
r4   M.         | M.          | M.      | Low         | Low      | M. Low
r5   M. High    | Low         | M. Low  | M.          | High     | High
r6   High       | V. Low      | High    | V. Low      | Low      | V. Low
r7   High       | High        | M.      | M.          | High     | M. Low
r8   High       | High        | V. Low  | Low         | Low      | V. High
The correlation-based FS (CFS) [93] and the fuzzy-rough set-based FS (FRFS)
techniques are employed in the experiment, and the antecedent significance values
obtained using the two respective methods are presented in Table 8.2. Both feature
evaluators agree that temperature and humidity are relatively less important than
the other three antecedent variables. CFS in particular assigns a weight of ω_{a_5} = 0.0299
to humidity, signifying its relative lack of relevance in this rule base. The ranking
of importance for the major antecedent variables is a_3 > a_1 > a_2 when CFS is used.
The resultant ranking determined by FRFS is similar, though it gives convenience
(a_2) a higher significance score.
Table 8.2: Antecedent significance values determined by CFS and FRFS

       ω_a1     ω_a2     ω_a3     ω_a4     ω_a5
CFS    0.2765   0.2461   0.3312   0.1163   0.0299
FRFS   0.2220   0.3228   0.2904   0.0833   0.0814
8.3.1.1 FRI Example
Suppose that a new observation o∗ is presented for interpolation; its linguistic values,
and the underlying semantics in terms of triangular fuzzy sets, are given in Table 8.3.
The rules selected using the standard T-FRI process, the antecedent significance-based
weighted distance metric, and the reduced rule base are also provided. The two
closest rules selected following the standard T-FRI process are close to the observed
values on all antecedent dimensions. However, if antecedent significance values
are taken into consideration, alternative rules will be selected. For the two rules
selected according to CFS, large differences in values can be observed for the variable
humidity (a5), which is likely caused by its very low significance value, as shown
previously in Table 8.2.
Table 8.3: Example observation (linguistic terms and fuzzy set representations), and the closest rules selected by standard T-FRI, and by weighted T-FRI with values determined using CFS and FRFS

              popularity (a1)   convenience (a2)  patrol (a3)      temperature (a4)  humidity (a5)    risk (z)
o∗            High              High              M. High          Low               M. High          ?
o∗            (8.0, 8.5, 9)     (5.8, 7.5, 8)     (5.0, 5.5, 6.0)  (1.5, 2.0, 3)     (5.5, 6.0, 6.5)  ?
Standard r1   (8, 8.3, 8.4)     (8.4, 8.6, 9.1)   (5.4, 5.9, 6.2)  (3.5, 3.7, 4)     (6.3, 6.9, 7.2)  (2.9, 3, 3.3)
Standard r2   (9.7, 9.8, 10.4)  (4.9, 5.4, 5.5)   (1.7, 2.1, 2.2)  (1.2, 1.2, 1.8)   (3.7, 4.3, 4.4)  (4.3, 5, 5.4)
CFS r1        (8.5, 9.1, 9.8)   (8.1, 8.6, 9)     (5, 5.4, 6.1)    (1.8, 2.3, 2.4)   (2.4, 3, 3.1)    (2.7, 3, 3.2)
CFS r2        (6, 6.6, 6.8)     (5.3, 5.9, 6.1)   (7.4, 7.6, 7.8)  (1.7, 2.1, 2.8)   (9.2, 9.6, 10.2) (1.4, 2, 2.5)
FRFS r1       (8.9, 9.3, 9.8)   (7.2, 7.6, 8.3)   (6.3, 6.3, 6.4)  (0.6, 0.7, 1.3)   (4.1, 4.5, 4.7)  (2.6, 3, 3.7)
FRFS r2       (6.9, 6.9, 7)     (3.9, 4.2, 4.7)   (3.6, 3.9, 4.6)  (2.8, 3.5, 4.3)   (7.9, 7.9, 8.1)  (2.8, 3, 3.4)
The detailed calculations of the T-FRI transformations are omitted here to save
space, as they are easily conceived. The final interpolative result of standard T-FRI
is z∗ = (3.1, 3.6, 4.4), following a transformation of T(z′ = (3.4, 3.7, 4), s_z = 2.163,
m_z = 0.003). Using the weights determined by CFS, the result is z∗ = (2.2, 2.8, 3.2);
the weighted transformation is:

T(z′ = (2.4, 2.8, 3), s_z = 1.479, m_z = −0.101)   (8.21)

The result obtained based on FRFS is z∗ = (2.2, 3.2, 4), with the corresponding weighted
transformation shown below:

T(z′ = (2.7, 3, 3.6), s_z = 2.084, m_z = −0.337)   (8.22)
Conceptually speaking, although the area in question may be crowded, due to
the place being popular and convenient to reach, the risk of an attack should be quite
low. This is because the level of alert is moderately high, and the two weather-related
factors (despite being less significant), low temperature and high humidity, may
further discourage any potential activities. One of the example rules listed in Table
8.1, r7, describes a fairly similar event, where the consequent value is given as M.
Low. Based on these, the results obtained via weighted aggregation: (2.2, 2.8, 3.2)
(Low) via CFS, and (2.2, 3.2, 4) (Low) via FRFS, are more intuitively agreeable than
that produced by standard T-FRI: (3.1, 3.6, 4.4) (M. Low).
8.3.1.2 B-FRI Example
For the B-FRI scenario, suppose that a given observation o∗ has a missing value for the
antecedent variable patrol (a_3), whilst the consequent variable risk (z) is directly
observed. Table 8.4 lists the observation o∗, and the different rules selected by the re-
spective approaches. Note that both the CFS- and FRFS-based weighted distance metrics
select the same two closest rules, both of which differ from those selected by standard
B-FRI. For the standard B-FRI method, the values of the required parameters for the
missing antecedent variable are computed based on those of the known antecedents
and the consequent variable. For example, following Eqn. 8.15, λ_{a_3} may be calculated:

λ_{a_3} = 5λ_z − Σ_{j=1, j≠3}^{5} λ_{a_j} = 5 × 0.792 − 2.171 = 1.789   (8.23)

which then constructs an intermediate fuzzy term a′_3 = (0.2, 1.2, 2.2). Both s_{a_3} and
m_{a_3} are computed similarly to λ_{a_3}, and finally, the backward transformation T← given
in Eqn. 8.24 is derived, which provides the final B-FRI output of V. Low:

a∗_3 = T←((0.2, 1.2, 2.2), 0.400, −0.172) = (0.7, 1.2, 1.5)   (8.24)
To avoid unnecessary repetition, the detailed procedures to compute the weighted
B-FRI outputs are omitted. The CFS-based antecedent significance values yield a
weighted B-FRI transformation as shown below:

a∗_3 = T←((1.5, 2.5, 3.5), 0.1389, 0.0185) = (2.4, 2.5, 2.7) (Low)   (8.25)

while the FRFS-based method calculates slightly differently, resulting in the following
backward interpolative outcome:

a∗_3 = T←((1.7, 2.7, 3.7), 0.0807, 0.0714) = (2.6, 2.7, 2.8)   (8.26)

which may also be interpreted with a linguistic meaning of Low.
Table 8.4: Example observation (both linguistic terms and fuzzy set representations), and the closest rules selected by standard B-FRI, and by weighted B-FRI with values determined using CFS and FRFS

              popularity (a1)   convenience (a2)  patrol (a3)      temperature (a4)  humidity (a5)    risk (z)
o∗            High              High              ?                Low               M. High          M. High
o∗            (8.0, 8.5, 9)     (5.8, 7.5, 8)     ?                (1.5, 2.0, 3)     (5.5, 6.0, 6.5)  (5.1, 5.8, 6.4)
Standard r1   (8.7, 9.7, 10.7)  (5.4, 6.4, 7.4)   (0.9, 1.9, 2.9)  (0.2, 1.2, 2.2)   (6.7, 7.7, 8.7)  (4.6, 5.6, 6.6)
Standard r2   (7.5, 8.5, 9.5)   (6.9, 7.9, 8.9)   (0.5, 1.5, 2.5)  (7.2, 8.2, 9.2)   (3.9, 4.9, 5.9)  (4.8, 5.8, 6.8)
CFS r1        (7.7, 8.7, 9.7)   (5.9, 6.9, 7.9)   (4.1, 5.1, 6.1)  (1.0, 2.0, 3.0)   (3.8, 4.8, 5.8)  (2.3, 3.3, 4.3)
CFS r2        (6.7, 7.7, 8.7)   (7.5, 8.5, 9.5)   (0.0, 0.8, 1.8)  (3.0, 4.0, 5.0)   (5.2, 6.2, 7.2)  (5.7, 6.7, 7.7)
FRFS r1       (7.7, 8.7, 9.7)   (5.9, 6.9, 7.9)   (4.1, 5.1, 6.1)  (1.0, 2.0, 3.0)   (3.8, 4.8, 5.8)  (2.3, 3.3, 4.3)
FRFS r2       (6.7, 7.7, 8.7)   (7.5, 8.5, 9.5)   (0.0, 0.8, 1.8)  (3.0, 4.0, 5.0)   (5.2, 6.2, 7.2)  (5.7, 6.7, 7.7)
Note that all of the observed values, except for patrol and risk, are the same as in the previous
observation used to demonstrate forward T-FRI. This narrows down the reason
why risk has jumped from Low to M. High: the level of patrol in the area.
Intuitively, for a highly crowded area, if very little patrol is present (as suggested by
the result of standard B-FRI: V. Low), the resultant value of risk should become V. High.
Therefore, having a Low level of patrol may be a more appealing approximation.
8.3.2 Systematic Evaluation
To evaluate the proposed antecedent selection approach and its effectiveness in
antecedent significance aggregation, a numerical test function with 15 variables
(|A| = 15) is used. Such a systematic test is important to validate the consistency,
accuracy, and robustness of the developed approach. This is because random samples
may be generated from a controlled environment, where the ground truths are also
available to verify the correctness of the interpolation results. These tests share a
similar underlying principle to that of cross-validation and statistical evaluation
[21, 151].
8.3.2.1 FRI Results
The results shown in Table 8.5 are the averaged outcomes of 200 randomised runs. By
employing the weighted aggregation scheme based on the antecedent significance
values, both the mean error and the standard deviation are considerably improved. The
results obtained according to FRFS appear to have a slightly higher mean error
and a wider spread; however, a t-test (p = 0.01) shows that the difference is not
statistically significant. The improvement is more evident when the original rule
base is simplified by removing the redundant antecedent variables.
Table 8.5: Evaluation of proposed approaches for standard FRI

                   Mean error %   S.D. %
Standard T-FRI     7.32           6.15
Weighted by CFS    5.33           4.69
Weighted by FRFS   5.68           5.16
Reduced by CFS     3.38           3.01
Reduced by FRFS    3.33           2.63
The antecedent subset selected by CFS is {a_0, a_4, a_7, a_13}, a reduction of 73%
in the number of variables, which achieves a mean error of 3.38%; the subset
selected by FRFS is {a_1, a_4, a_7, a_9, a_13}, which, with a reduction of 67%, helps to obtain a
mean error of 3.33%. Both evaluators yield reasonable reduction results, and the
interpolation error (compared to the numerical function’s true output) is also much
lower than that of standard and weighted T-FRI.
8.3.2.2 B-FRI Results
The same numerical test function adopted in Section 8.3.2 is used again to verify
the performance gain for B-FRI problems. A randomly selected antecedent variable
is set to be missing per test iteration. This “missing” variable index is
drawn from {a_0, a_4, a_7, a_13} ∩ {a_1, a_4, a_7, a_9, a_13} = {a_4, a_7, a_13}, the
intersection of the two antecedent subsets identified by CFS and FRFS, respectively.
This allows direct comparison between the different techniques.
The proposed weighted aggregation scheme, and the antecedent-selected rule
base, are then used to reconstruct the original values. In this set of experiments,
the error is calculated with respect to the actual antecedent variable value that has
been intentionally removed to simulate the B-FRI environment. The mean error and
standard deviation of the 200 simulated tests are given in Table 8.6.
The number of antecedent variables involved is quite large and presents a con-
siderable challenge for precise backward reasoning. The original B-FRI approach
achieves an 18.20% mean error, while the accuracy is slightly improved when weighted
aggregation is used. Based on the simplified rule bases reduced by CFS and FRFS, the
Table 8.6: Evaluation of proposed approaches for B-FRI

                   Mean error %   S.D. %
Standard B-FRI     18.20          19.40
Weighted by CFS    16.94          19.15
Weighted by FRFS   17.59          18.76
Reduced by CFS     8.45           13.70
Reduced by FRFS    6.93           13.70
mean interpolation error is notably improved, with mean errors of 8.45% and 6.93%,
respectively. Furthermore, the quality of the output is also more stable, with the
standard deviation dropping from the original 19.40% to 13.70% in both cases,
demonstrating the benefits of the reduced rule base for B-FRI.
8.4 Summary
This chapter has presented a new FRI approach that exploits FS techniques in order
to evaluate the importance of antecedent variables. A weighted aggregation-based
interpolation method is proposed that makes use of the identified antecedent sig-
nificance values. The original rule base may also be simplified by removing the
irrelevant or noisy antecedents using an FS search algorithm such as HSFS,
retaining an antecedent subset of a much lower dimensionality. Example scenarios and
systematic tests are employed to demonstrate the potential benefits of the work, for
both conventional and B-FRI problems. The resultant antecedent significance-based
FRI technique is both technically sound and conceptually appealing, as humans often
(automatically) screen out seemingly irrelevant antecedents, and focus on more
important factors in order to perform reasoning. A discussion regarding possible
future improvements of the present work is given in Section 9.2.1.4.
Chapter 9
Conclusion
THIS chapter presents a high-level summary of the research detailed in the
preceding chapters. Having reviewed and compared the work against the relevant
approaches in the literature, the thesis has demonstrated that the developed HSFS
algorithm has utilised HS effectively for the task of FS. The proposed modifications to
HSFS further enhance the efficacy of the algorithm, improving both the compactness
and evaluation quality of the discovered feature subsets. A number of theoretical
areas have also been identified that exploit the stochastic behaviour of HSFS. The
capabilities and potential of the developed applications have been experimentally
validated, and compared with either the original approaches, or relevant techniques
in the literature. The chapter also presents a number of initial thoughts about the
directions for future research.
9.1 Summary of Thesis
A survey of ten different NIMs has been given in Chapter 2, which covers FS ap-
proaches derived from both classic stochastic algorithms and other cutting-edge
techniques. The key common notions and mechanisms of the reviewed algorithms
have been extracted, and a unified style of notation has been adopted, with
pseudocode included. While conducting the review, several techniques including ABC
and FA have also been modified considerably, in order to utilise a wide range of
feature subset evaluation measures, and to improve their search performance.
HSFS (as described in Chapter 3) is a successful application of HS for the problem
of FS. The initial development of the algorithm has been introduced that facilitates
binary-valued feature subset representation, which is also the common choice for
most nature-inspired approaches. Evolving from its initial forms, the HSFS method
makes use of an integer-valued encoding scheme. It maps the notion of musicians
in HS to arbitrary “FS experts” or “individual feature selectors”, and offers a more
flexible platform for the underlying stochastic approach of HS.
To overcome the drawbacks of static parameters employed in the original HS
method, deterministic parameter control rules have been introduced. Procedures to
iteratively refine the size of the emerging feature subsets have also been presented.
Both of these modifications contribute towards a more flexible, data-oriented ap-
proach, in order to encourage the search process to identify good solutions efficiently.
The proposed HSFS algorithm is a generic technique that can be used in conjunc-
tion with alternative filter-based [164], wrapper-based [107], and hybrid feature
subset evaluation techniques [22, 23]. Owing to the underlying randomised and
yet simple nature, the entire solution space of a given problem may be examined by
running the HSFS algorithm in parallel. This will help to reveal a number of quality
solutions much more quickly than random search or exhaustive search methods. The
ability to identify multiple good solutions is of particular importance for ensemble
learning, as the alternative subsets of features may create distinctive views of the
problem at hand, thereby enabling a diverse feature subset-based classifier ensemble
to be built.
An intuitive usage of the stochastic characteristics of HSFS is the OC-FSE method
(as described in Chapter 4). Different diversification methods have been investi-
gated, in an effort to efficiently construct the base pool of classifiers, including
stochastic search, data partitioning, and mixtures of FS algorithms. The resultant system
outperforms single classifiers, and is more concise and efficient than ordinary (non-
OC-based) ensembles. The HSFS algorithm (and FS technique in general) has been
utilised further in Chapter 5 for the purpose of pruning redundant base classifiers.
FS is performed on artificially generated data sets, which are transformed from
ensemble training outputs, in order to identify and remove irrelevant or redundant
classifiers. Unsupervised FS has also been utilised in the study as a means to discover
redundancy without resorting to the examination of class labels.
To deal with scenarios where data may be dynamically changing, an extension
to the (static) HSFS algorithm has been devised in Chapter 6. D-HSFS provides
the additional functionality which adapts to the changes that occur during train-
ing, including events of feature addition, feature removal, instance addition, and
instance removal. D-HSFS is particularly powerful for resolving situations where
arbitrary combinations of the above mentioned scenarios happen simultaneously. A
modified, adaptive FSE has also been proposed that improves the predictive accuracy
of the concurrently trained classifier learners. The resultant approach is an adaptive
framework that is able to evolve along with the dynamic data set.
The use of FS in rule-based systems has been studied in Chapters 7 and 8. Both
theoretical areas benefit from the use of HSFS. For fuzzy-rough rule induction in
particular, the rule base processed by HS is both compact (low number of rules),
and concise (small cardinality of individual rule antecedents), whilst maintaining a
full coverage of the training objects. It has been shown that both conventional FRI
and B-FRI methods may become more efficient, when the less relevant antecedents
have been correctly identified and assigned with lower significant values. A higher
interpolative accuracy can also be achieved when additional information obtained
from FS is utilised.
The FS performance of HSFS, its improvements and applications, both theoretical
and practical, have been experimentally evaluated, and systematically compared
to the relevant techniques in the literature. The results of experimentation have
demonstrated that HSFS is particularly effective at reducing the size of feature subsets,
while its ability to optimise the evaluation score is on par with the other methods. In
addition, it has been shown that the proposed modifications to HSFS are beneficial
in improving the quality of the selected feature subsets.
9.2 Future Work
Although promising, much can be done to further improve the work presented so
far in this thesis. The following addresses a number of interesting issues whose
successful resolution will help strengthen the current research.
9.2.1 Short Term Tasks
This section discusses extensions and tasks that could be readily implemented if
additional time were available.
9.2.1.1 HSFS
Although a preliminary convergence detection mechanism for HS has been suggested
in [290], allowing the algorithm to detect convergence (based on the frequency of
updates to the best solution within the harmony memory) and to self-terminate,
it would be useful to develop a more advanced stopping criterion, utilising the
overall quality of the entire harmony memory and additional states of the search
process. This way, the run-time efficiency will become self-adaptive to the problem
at hand, and a further performance improvement can therefore be expected. More
intelligent iterative refinement procedures, alternative to that described in Section 3.4.2, may
be developed. The purpose is not just to encourage the discovery of more compact
feature subsets, but to find them in much shorter time. The harmony memory
consolidation procedure suggested in [290] is a step in this particular direction,
which can achieve feature subset size reduction with a smaller number of iterations.
The subset evaluators employed in the experimental evaluation indicate that
certain methods are more biased than the rest towards maintaining end classification
accuracy, or towards minimising the resultant feature subset size. Further investiga-
tion is thus necessary such that better group-based approaches (such as FSEs) may
be developed by combining these evaluation measures. As pointed out in Section
3.3.2, feature relevancy measurements such as correlation [93] and fuzzy-rough
dependency [126] may be utilised to identify more relevant neighbours, so that
the stochastic mechanisms controlled by the pitch adjustment rate and fret-width
parameters may be better exploited.
9.2.1.2 FSE
Currently, despite the effort vested in developing adaptive techniques, the number
of base FS components (i.e., the size of the FSE) needs to be predefined. The
construction process should, however, be able to automatically “recruit” or “fire” base
FS components according to the complexity of the problem data, in a similar fashion
as that used in the harmony consolidation process of HSFS. An extreme case for this
would be the situation where the data set contains only one optimal feature subset,
which may be handled by a single component, thereby eliminating the necessity
of employing a group-based approach (equivalently, shrinking the ensemble size
to one). To achieve this, enhancing the methods developed for CER is of strong
relevance.
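The recruit/fire behaviour might be sketched as below. The agreement measure, the thresholds, and the `spawn` callback are all illustrative assumptions rather than the thesis's construction process:

```python
def resize_ensemble(components, spawn, agreement, low=0.6, high=0.95,
                    min_size=1, max_size=25):
    """Illustrative recruit/fire rule: high agreement between the base FS
    components (e.g. mean pairwise subset overlap in [0, 1]) suggests
    redundancy, so one component is "fired"; low agreement suggests the
    data still rewards diversity, so a fresh one is "recruited"."""
    if agreement > high and len(components) > min_size:
        components.pop()            # fire: the ensemble is redundant
    elif agreement < low and len(components) < max_size:
        components.append(spawn())  # recruit: diversity is still needed
    return components
```

In the extreme case mentioned above, where the data contains a single optimal subset, repeated firing would shrink the ensemble towards a single component.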
A thorough interpretation of the underlying problem domain, given a very large
data set, may become infeasible for many real-world applications and hence, the
amount of labelled training samples is often limited. This makes unsupervised FS
algorithms [22, 23, 169] and semi-supervised learning techniques [170] potentially
beneficial and desirable. FSEs could help in better identifying correlated (similar)
groups of features [120], rather than individually important features under such
circumstances. Feature grouping techniques are potentially beneficial to computa-
tionally complex FS methods such as FRFS, where FS may be performed on the basis
of predefined groups. This will not only improve the time taken to process large
data, but also potentially generate better subsets with less internal redundancy.
9.2.1.3 Rule Induction
Existing improvements to HSFS may be validated to reveal their effectiveness for the
problem of rule induction. Investigations into how the parameters of HarmonyRules
can be better tuned [173, 257, 290] are of particular interest. It may also be beneficial
to perform an in-depth analysis of the underlying theoretical characteristics of the
learning mechanism, such as scalability. As the current approach treats training
objects as musicians, an alternative structure may be necessary in order to cope with
huge data sets, where the large number of objects will affect the search performance.
Although the scalability of HS itself has been studied in the literature [49, 257], a
divide-and-conquer approach, or hierarchically structured HS components, may
further improve the performance. Additional examination will be helpful in utilising
the pool of discovered rule sets/feature subsets. A fuzzy-rough rule-based ensemble
similar to those constructed for FSE [61] may be formed, where the subsets may be
used to generate partitions of the training data in order to build diverse classification
models.
9.2.1.4 FRI
The present antecedent selection approach for FRI can be improved further by
considering unsupervised or semi-supervised FS methods [170, 169, 288], which
have emerged recently for analysing the inter-dependencies between features without
the aid of class information. Current work in B-FRI also requires an exhaustive search
for suitable parameter values to perform reverse reasoning; a heuristic-based method
employing algorithms such as HS may greatly speed up the process. Although
generic in concept, the current implementation of the antecedent significance-based
aggregation approach is strongly coupled with the T-FRI method. It is worth further
extending the principles behind weighted aggregation to alternative FRI methods,
thereby providing a potentially more flexible framework for efficient interpolation.
Fuzzy aggregation functions [23, 191] may be of particular assistance in realising such a
task.
9.2.2 Long Term Developments
This section proposes several future directions that could each form the basis of a
much more significant piece of research.
9.2.2.1 Hierarchical HS for FS and General Purpose Optimisation
Research into hierarchically structured HS for high dimensional and large FS problems
may be beneficial to the further development of the work presented in this thesis.
Such a theoretical extension is potentially applicable to a wide range of applications,
not limited to FS. The idea behind multi-layered HS originates from the concept of
orchestras, where the current “flat” HS can be seen as a band. An orchestra consists
of multiple sub-sections, including string, wind, brass, and percussion instruments,
and is typically led by a conductor. Depending on the type of music being performed,
additional players, solo performers, or alternative arrangements of sections may be
introduced.
Hierarchical HS may utilise locally organised search operations which help to
detect similar or related features. Effective feature grouping may lead to substantial
reduction of the complexity of any subsequent search process (imagine it being a
pre-processing step for FS). The restricted feature domain may take advantage of
such groupings to provide stronger informative hints to the feature selectors. Lower
tiered search processes can be focused on different evaluation criteria, or artificially
injected preferences, while meta-level procedures oversee the progress of the overall
search, so that both macro- and micro-level control are achieved.
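A minimal sketch of the lower-tier grouping step could look like the following, assuming only a pairwise feature similarity measure (e.g. correlation) is available. Everything here is illustrative, not a specification of the proposed hierarchy:

```python
def group_features(similarity, n_features, threshold=0.8):
    """Greedy lower-tier grouping: place each feature into the first group
    whose seed feature it resembles; otherwise start a new group."""
    groups = []
    for f in range(n_features):
        for g in groups:
            if similarity(f, g[0]) >= threshold:
                g.append(f)
                break
        else:
            groups.append([f])  # f seeds a new group
    return groups

def representatives(groups, relevance):
    """Restricted feature domain for the upper-tier search: the most
    relevant member of each group."""
    return [max(g, key=relevance) for g in groups]

# Toy similarity: features of the same parity are treated as "correlated".
parity_sim = lambda a, b: 1.0 if a % 2 == b % 2 else 0.0
groups = group_features(parity_sim, 5)
```

The upper-tier search would then operate only over the representatives, realising the reduction in search complexity anticipated above.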
9.2.2.2 Theoretical Developments of Dynamic FS and A-FSE
A key concern of many real-world applications is the responsiveness of the FS mecha-
nisms, where an in-depth investigation of the relevant application-specific techniques
[5] may reveal even more reactive methods. The internal mechanisms of the HSFS
algorithm may also be further exploited to improve its efficiency. In particular, the
possibility of building an A-FSE out of the candidate solutions stored within the
harmony memory, rather than employing multiple, simultaneous searches is worth
exploring. Multiple FS criteria may also be utilised simultaneously in an effort to
identify better dynamic feature subsets. For this, ideas developed for multi-objective
optimisation [286] may be exploited. The current work has not yet considered the
scenario where additional class labels may be revealed (or different labels may be
assigned to the existing objects) during the dynamic learning process. How the
proposed approach may be further extended to handle dynamic rule learning [188]
remains an active research question, as does the development of hybrid or embedded
models for a closer integration between dynamic FS and classification.
9.2.2.3 Descriptive CER and its Applications
The formulation of alternative transformation procedures for producing the decision
matrix is of particular interest for the development of descriptive CER. Many state-
of-the-art classifiers are capable of producing a likelihood distribution governing
the chance that a particular instance may belong to a certain class, where the class
with the highest probability is usually taken as the final prediction. This probability
distribution may contain more information, and is potentially more suitable to be
utilised as the artificial feature values (rather than the final prediction alone, as within
the current approach). Other statistical information regarding the classifiers, such as bias and
variance, may also be used to construct additional artificially generated features, in
order to create a more comprehensive artificial data set for FS-based CER.
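As a sketch, such a soft decision matrix could be assembled as below, assuming each base classifier exposes a `predict_proba`-style method returning a class distribution. The interface and the stub classifier are illustrative assumptions, not the thesis implementation:

```python
class StubClassifier:
    """Illustrative stand-in for a trained probabilistic classifier."""
    def __init__(self, p):
        self.p = p

    def predict_proba(self, x):
        # Two-class distribution, ignoring the instance for simplicity.
        return [self.p, 1.0 - self.p]

def decision_matrix(classifiers, instances):
    """Each instance becomes a row of artificial feature values: the full
    class-probability distribution of every base classifier, concatenated,
    rather than a single hard class prediction per classifier."""
    rows = []
    for x in instances:
        row = []
        for clf in classifiers:
            row.extend(clf.predict_proba(x))  # distribution, not argmax
        rows.append(row)
    return rows

matrix = decision_matrix([StubClassifier(0.2), StubClassifier(0.9)],
                         [[1.0], [2.0], [3.0]])
```

FS-based CER would then select columns (and hence classifiers) from this artificial data set.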
The approaches developed following this principal direction may be applicable
to problems involving substantially larger data sets (when compared to those
investigated in Chapter 5), such as Martian rock classification [222, 224], weather
forecasting [216], and intelligent robotics [161, 176, 263]. These areas present
significant challenges to the existing FS and classification algorithms; addressing
them will help to better understand and validate the characteristics of the employed
methods. Investigations into the underlying reasons why different FS techniques
deliver distinctive characteristics in CER will also be beneficial, either to simplify
the complexity of the learnt ensembles, or to improve the overall classifier ensemble
accuracy.
9.2.2.4 A-FSE for Weather Forecasting
One of the most challenging application problems that require the assistance of FS is
weather forecasting [111, 214]. Traditional weather forecasting has been built on
a foundation of deterministic modelling. The forecast typically starts with certain
initial conditions, puts them into a sophisticated computational model, and ends
with a prediction about the forthcoming weather. Ensemble-based forecasting [86]
was first introduced in the early 1990s. In this method, results of (up to hundreds of)
different computer runs, each with slight variations in starting conditions or model
assumptions, are combined to derive the final forecast. As with statistical techniques,
ensembles may provide more accurate statements about the uncertainty in daily and
seasonal forecasting.
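The ensemble principle described above can be illustrated with a toy combination of perturbed runs into a mean forecast plus an uncertainty spread; the function name and the chosen statistics are for illustration only:

```python
def ensemble_forecast(runs):
    """Combine the outputs of many perturbed model runs (e.g. predicted
    temperatures) into a mean forecast and a spread that quantifies the
    uncertainty of the prediction."""
    n = len(runs)
    mean = sum(runs) / n
    # Population standard deviation of the runs: a simple uncertainty proxy.
    spread = (sum((r - mean) ** 2 for r in runs) / n) ** 0.5
    return mean, spread
```

A large spread signals disagreement among the perturbed runs, which is exactly the kind of uncertainty statement ensembles add over a single deterministic forecast.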
In particular, weather forecasting deals with data sources that are constantly
changing. The data volume may grow both in terms of attributes and objects, whilst
historical information may also become invalid or irrelevant over time. The A-FSE
approach developed in this thesis can actively form and refine ensembles in a
dynamic environment, in an effort to maintain the precision and effectiveness of the
extracted knowledge. Such a technique may be further generalised to the prediction
of natural disasters, and unusual, severe, or unseasonal weather (commonly referred
to as extreme weather) that lies at the extremes of historical distributions. A
considerable amount of effort will be needed to establish an adaptive system that
can handle real forecasting problems of extremely high complexity. However, the
work developed in this thesis may offer useful insight into such further development.
Appendix A
Publications Arising from the Thesis
A number of publications have been generated from the research carried out within
the PhD project. Listed below are the publications of close relevance to the thesis,
including both papers already published and articles submitted for review.
A.1 Journal Articles
1. R. Diao, F. Cao, Peng, N. Snooke, and Q. Shen, Feature Selection Inspired
Classifier Ensemble Reduction [57], IEEE Transactions on Cybernetics, 10 pp.,
in press.
2. S. Jin, R. Diao, and Q. Shen, Backward Fuzzy Rule Interpolation [129], IEEE
Transactions on Fuzzy Systems, 14 pp., in press.
3. R. Diao and Q. Shen, Feature Selection with Harmony Search [62], IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42,
no. 6, pp. 1509–1523, 2012.
4. R. Diao and Q. Shen, Adaptive Feature Selection Ensemble for Dynamic Data:
A Harmony Search-Based Approach, 12 pp., submitted.
5. R. Diao and Q. Shen, Occurrence Coefficient Threshold-Based Feature Subset
Ensemble, 12 pp., submitted.
6. R. Diao and Q. Shen, Nature Inspired Feature Selection Meta-Heuristics, 27
pp., submitted.
7. L. Zheng, R. Diao, and Q. Shen, Self-Adjusting Harmony Search-based Feature
Selection, 11 pp., submitted.
A.2 Book Chapter
8. Q. Shen, R. Diao, and P. Su, Feature Selection Ensemble [227], Turing Centenary,
pp. 289–306, 2012.
A.3 Conference Papers
9. R. Diao, S. Jin, and Q. Shen, Antecedent Selection in Fuzzy Rule Interpolation
using Feature Selection Techniques, submitted.
10. L. Zheng, R. Diao, and Q. Shen, Efficient Feature Selection using a Self-Adjusting
Harmony Search Algorithm [290], Proceedings of the 13th UK Workshop on
Computational Intelligence, 2013.
11. R. Diao, N. Mac Parthaláin, and Q. Shen, Dynamic Feature Selection with
Fuzzy-Rough Sets [58], Proceedings of the 22nd IEEE International Conference
on Fuzzy Systems, 2013.
12. S. Jin, R. Diao, C. Quek, and Q. Shen, Backward Fuzzy Rule Interpolation
with Multiple Missing Values [128], Proceedings of the 22nd IEEE International
Conference on Fuzzy Systems, 2013.
13. R. Diao and Q. Shen, A Harmony Search Based Approach to Hybrid Fuzzy-rough
Rule Induction [63], Proceedings of the 21st IEEE International Conference on
Fuzzy Systems, pp. 1–8, 2012.
14. S. Jin, R. Diao, and Q. Shen, Backward Fuzzy Interpolation and Extrapolation
with Multiple Multi-antecedent Rules [131], Proceedings of the 21st IEEE
International Conference on Fuzzy Systems, pp. 1–8, 2012.
15. R. Diao and Q. Shen, Fuzzy-rough classifier ensemble selection [61], Pro-
ceedings of the 20th IEEE International Conference on Fuzzy Systems, pp.
1516–1522, 2011.
16. S. Jin, R. Diao, and Q. Shen, Towards Backward Fuzzy Rule Interpolation
[130], Proceedings of the 11th UK Workshop on Computational Intelligence,
2011.
17. R. Diao and Q. Shen, Two New Approaches to Feature Selection with Harmony
Search [60], Proceedings of the 19th IEEE International Conference on Fuzzy
Systems, pp. 3161–3167, 2010.
18. R. Diao and Q. Shen, Deterministic Parameter Control in Harmony Search [59],
Proceedings of the 10th UK Workshop on Computational Intelligence, 2010.
Appendix B
Data Sets Employed in the Thesis
The data sets employed in the thesis are mostly publicly available benchmark data,
obtained through the UCI machine learning repository [78], which have been drawn
from real-world problem scenarios. Table B.1 provides a summary of the properties
of these data sets. Their underlying problem domains are described in detail below,
where the URLs of the respective data sets are also given in order to facilitate easy
access.
Table B.1: Information of data sets used in the thesis

Data set   Features   Instances   Classes
arrhy      279        452         16
cleve      14         297         5
ecoli      8          336         8
glass      10         214         6
handw      256        1593        10
heart      14         270         2
ionos      35         230         2
isole      617        7797        26
libra      91         360         15
multi      650        2000        10
olito      25         120         4
ozone      73         2534        2
secom      591        1567        2
sonar      60         208         2
water      39         390         3
water2     39         390         2
wavef      40         699         2
web        2556       149         5
wine       13         178         3
• Arrhythmia (arrhy)
http://archive.ics.uci.edu/ml/datasets/Arrhythmia
This database contains 279 attributes, 206 of which are linear valued and the
rest are nominal [78]. “The aim is to distinguish between the presence and
absence of cardiac arrhythmia and to classify it in one of the 16 groups. Class 01
refers to ’normal’ ECG classes 02 to 15 refers to different classes of arrhythmia
and class 16 refers to the rest of unclassified ones. For the time being, there
exists a computer program that makes such a classification. However there are
differences between the cardiolog’s and the programs classification. Taking the
cardiolog’s as a gold standard we aim to minimise this difference by means of
machine learning tools.” [90]
• Cleveland Heart Disease Data Set (cleve)
http://archive.ics.uci.edu/ml/datasets/Heart+Disease
“This database contains 76 attributes altogether, but all published experiments
refer to using a subset of 14 of them. In particular, the Cleveland database is
the only one that has been used by ML researchers to this date. The decision
attribute refers to the presence of heart disease in the patient. It is integer
valued from 0 (no presence) to 4. Experiments with the Cleveland database
have concentrated on simply attempting to distinguish presence (values 1,2,3,4)
from absence (value 0). The names and social security numbers of the patients
were recently removed from the database, replaced with dummy values.” [2]
• Ecoli (ecoli)
http://archive.ics.uci.edu/ml/datasets/Ecoli
“The localization site of a protein within a cell is primarily determined by its
amino acid sequence. Rule-based expert system for classifying proteins into
their various cellular localization sites, using their amino acid sequences, in
gram-negative bacteria and in eukaryotic cells.” [105]
• Glass Identification (glass)
http://archive.ics.uci.edu/ml/datasets/Glass+Identification
This data set contains 10 attributes which describe the chemical contents of
glass. “The study of classification of types of glass (in determining whether
the glass was a type of “float” glass or not) was motivated by criminological
investigation. At the scene of the crime, the glass left can be used as evidence
if it is correctly identified.” [73]
• Semeion Handwritten Digit (handw)
http://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit
1593 handwritten digits from around 80 persons were scanned, stretched in a
rectangular box 16x16 in a gray scale of 256 values. Then each pixel of each
image was scaled into a boolean (1/0) value using a fixed threshold. Each
person wrote on a paper all the digits from 0 to 9, twice. The commitment was
to write the digit the first time in the normal way (trying to write each digit
accurately) and the second time in a fast way (with no accuracy). [32]
• Statlog (Heart) (heart)
http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29
“This data set is a heart disease database, with 6 real-valued attributes: 1, 4, 5,
8, 10, 12; 1 ordered attribute:11; 3 binary attributes: 2, 6, 9; and 3 nominal
features:7, 3, 13. The class label to be predicted: absence (1) or presence (2)
of heart disease.”
• Ionosphere (ionos)
http://archive.ics.uci.edu/ml/datasets/Ionosphere
“This radar data was collected by a system in Goose Bay, Labrador. This system
consists of a phased array of 16 high-frequency antennas with a total trans-
mitted power on the order of 6.4 kilowatts. The targets were free electrons
in the ionosphere. "Good" radar returns are those showing evidence of some
type of structure in the ionosphere. "Bad" returns are those that do not; their
signals pass through the ionosphere. Received signals were processed using
an autocorrelation function whose arguments are the time of a pulse and the
pulse number.” [232]
• Isolet (isole)
http://archive.ics.uci.edu/ml/datasets/ISOLET
“This data set was generated as follows. 150 subjects spoke the name of each
letter of the alphabet twice. Hence, we have 52 training examples from each
speaker. All attributes are continuous, real-valued attributes scaled into the
range -1.0 to 1.0. The data set is a good domain for a noisy, perceptual task. It
is also a very good domain for testing the scaling abilities of algorithms.” [74]
• Libras (libra)
http://archive.ics.uci.edu/ml/datasets/Libras+Movement
“The data set contains 15 classes of 24 instances each, where each class refer-
ences to a hand movement type in LIBRAS (Portuguese name ’LÍngua BRAsileira
de Sinais’, the official Brazilian signal language). In the video pre-processing,
a time normalisation is carried out selecting 45 frames from each video, in
according to an uniform distribution. In each frame, the centroid pixels of the
segmented objects (the hand) are found, which compose the discrete version
of the curve F with 45 points. All curves are normalised in the unitary space.”
[64]
• Multiple Features (multi)
http://archive.ics.uci.edu/ml/datasets/Multiple+Features
“This dataset consists of features of handwritten numerals (‘0’–‘9’) extracted
from a collection of Dutch utility maps. 200 patterns per class (for a total of
2,000 patterns) have been digitized in binary images. These digits are repre-
sented in terms of the following six feature sets: 1) 76 Fourier coefficients
of the character shapes; 2) 216 profile correlations; 3) 64 Karhunen-Love
coefficients; 4) 240 pixel averages in 2 x 3 windows; 5) 47 Zernike moments;
6) 6 morphological features. The first 200 patterns are of class ‘0’, followed by
sets of 200 patterns for each of the classes ‘1’ - ‘9’.” [198]
• Olitos (olito)
http://michem.disat.unimib.it/chm/download/datasets.htm#olit
This data set concerns the chemometric analysis of olive oils [9]. The chemical
information such as fatty acids, sterols, and triterpenic alcohols are analysed
from 120 olive oil samples from Tuscany, Italy, collected in 88 different areas
of production. The class variable determines the cultivars of the oil samples.
• Ozone Level Detection (ozone)
http://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection
“Ground ozone level data included in this collection were collected from 1998
to 2004 at the Houston, Galveston and Brazoria area. The data contains impor-
tant attributes that are highly valued by Texas Commission on Environmental
Quality: local ozone peak prediction; upwind ozone background level; pre-
cursor emissions related factor; maximum temperature in degrees F; base
temperature where net ozone production begins; solar radiation total for the
day; wind speed near sunrise; wind speed mid-day.” [283].
• Secom (secom)
http://archive.ics.uci.edu/ml/datasets/SECOM
“A complex modern semi-conductor manufacturing process is normally under
consistent surveillance via the monitoring of signals/variables collected from
sensors and or process measurement points. The measured signals contain
a combination of useful information, irrelevant information as well as noise.
When performing system diagnosis, engineers typically have a much larger
number of signals than are actually required. The Process Engineers may use
certain selected signals to determine key factors contributing to yield excursions
downstream in the process.” [180]
• Sonar (sonar)
http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
“The data set contains 111 patterns obtained by bouncing sonar signals off a
metal cylinder at various angles and under various conditions, and 97 patterns
obtained from rocks under similar conditions. The transmitted sonar signal is
a frequency-modulated chirp, rising in frequency. The data set contains signals
obtained from a variety of different aspect angles, spanning 90 degrees for the
cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in
the range 0.0 to 1.0. Each number represents the energy within a particular
frequency band, integrated over a certain period of time.” [87]
• Water (water)
http://archive.ics.uci.edu/ml/datasets/Water+Treatment+Plant
“This dataset comes from the daily measures of sensors in a urban waste water
treatment plant. The objective is to classify the operational state of the plant
in order to predict faults through the state variables of the plant at each of
the stages of the treatment process. This domain has been stated as an ill-
structured domain.” A variant of this data set: water2 with 2 different classes
(as opposed to the original of 3) has also been utilised in this thesis. [15]
• Waveform Database Generator (wavef)
http://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+(Version+2)
“There are 3 classes of waves, each class is generated from a combination of 2
of 3 base waves. There are 40 attributes, all of which include noise, where the
latter 19 attributes are all noise attributes with mean 0 and variance 1.” [26]
• Wine (wine)
http://archive.ics.uci.edu/ml/datasets/Wine
“These data are the results of a chemical analysis of wines grown in the same
region in Italy but derived from three different cultivars. The analysis deter-
mined the quantities of 13 constituents found in each of the three types of
wines.” [254]
Appendix C
List of Acronyms
10-FCV 10-fold cross-validation
A-FSE Adaptive feature subset ensemble
ABC Artificial Bee Colony
ACO Ant Colony Optimisation
B-FRI Backward fuzzy rule interpolation
BCP Base classifier pool
CER Classifier ensemble reduction
CFS Correlation-based feature selection
CSA Clonal Selection Algorithm
D-HSFS Dynamic feature selection with Harmony Search
FF Firefly Search
FNN Fuzzy nearest neighbour
FRFS Fuzzy-rough set-based feature selection
FRI Fuzzy rule interpolation
FS Feature selection
FSE Feature subset-based classifier ensemble
GA Genetic Algorithm
HC Hill-Climbing
HS Harmony Search
HS-O Original Harmony Search
HS-PC Harmony Search with parameter control
HS-IR Harmony Search with parameter control and iterative refinement
HSFS Feature selection with Harmony Search
LEM2 Learning from examples module, version 2
MA Memetic Algorithm
ModLEM Modified algorithm for learning from examples module
NB Naïve Bayes-based classifier
NIM Nature-inspired meta-heuristic
OC Occurrence coefficient
OC-FSE Occurrence coefficient-based feature subset classifier ensemble
PART Projective adaptive resonance theory
PCFS Probabilistic consistency-based feature selection
PSO Particle swarm optimisation
RIPPER Repeated incremental pruning to produce error reduction
RST Rough set theory
SA Simulated Annealing
SMO Sequential minimal optimisation
T-FRI Transformation-based fuzzy rule interpolation
TS Tabu Search
U-FRFS Unsupervised fuzzy-rough set-based feature selection
WSBA Weighted fuzzy subset-hood-based rule induction algorithm
Bibliography
[1] N. Abe and M. Kudo, “Entropy criterion for classifier-independent feature selection,” in Knowledge-Based Intelligent Information and Engineering Systems, ser. Lecture Notes in Computer Science, R. Khosla, R. Howlett, and L. Jain, Eds. Springer Berlin Heidelberg, 2005, vol. 3684, pp. 689–695.
[2] D. Aha and D. Kibler, “Instance-based prediction of heart-disease presence with the Cleveland database,” University of California, Tech. Rep., Mar. 1988.
[3] D. W. Aha and R. L. Bankert, “A comparative evaluation of sequential feature selection algorithms,” in Learning from Data: Artificial Intelligence and Statistics V, ser. Lecture Notes in Statistics, D. H. Fisher and H.-J. Lenz, Eds. New York, USA: Springer-Verlag, 1996, pp. 199–206.
[4] D. Aha, D. Kibler, and M. Albert, “Instance-based learning algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[5] M. Ahmadi, M. Taylor, and P. Stone, “IFSA: Incremental feature-set augmentation for reinforcement learning tasks,” in The 6th International Joint Conference on Autonomous Agents and Multiagent Systems. Springer, 2007.
[6] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Autom. Control, vol. 19, no. 6, pp. 716–723, 1974.
[7] M. R. AlRashidi and M. El-Hawary, “A survey of particle swarm optimization applications in electric power systems,” IEEE Trans. Evol. Comput., vol. 13, no. 4, pp. 913–918, 2009.
[8] E. Amaldi and V. Kann, “On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems,” Theoretical Computer Science, vol. 209, no. 1–2, pp. 237–260, 1998.
[9] C. Armanino, R. Leardi, S. Lanteri, and G. Modi, “Chemometric analysis of Tuscan olive oils,” Chemometrics and Intelligent Laboratory Systems, vol. 5, no. 4, pp. 343–354, 1989.
[10] M. Attik, “Using ensemble feature selection approach in selecting subset with relevant features,” in Advances in Neural Networks, ser. Lecture Notes in Computer Science, J. Wang, Z. Yi, J. Zurada, B.-L. Lu, and H. Yin, Eds. Springer Berlin Heidelberg, 2006, vol. 3971, pp. 1359–1366.
[11] A. Atyabi, M. Luerssen, S. Fitzgibbon, and D. Powers, “Evolutionary feature selection and electrode reduction for EEG classification,” in 2012 IEEE Congress on Evolutionary Computation, Jun. 2012, pp. 1–8.
[12] H. Banati and M. Bajaj, “Fire fly based feature selection approach,” International Journal of Computer Science Issues, vol. 8, no. 2, pp. 473–479, 2011.
[13] Y. Bar-Cohen, Biomimetics: Biologically Inspired Technologies. Taylor & Francis, 2005.
[14] P. Baranyi, L. T. Kóczy, and T. D. Gedeon, “A generalized concept for fuzzy rule interpolation,” IEEE Trans. Fuzzy Syst., vol. 12, no. 6, pp. 820–837, 2004.
[15] J. Béjar, “Linneo: a classification methodology for ill-structured domains,” Facultat d’Informàtica de Barcelona, Tech. Rep., 1993.
[16] R. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton University Press, 1957.
[17] Y. Bengio and Y. Grandvalet, “No unbiased estimator of the variance of K-fold cross-validation,” Journal of Machine Learning Research, vol. 5, pp. 1089–1105, Sep. 2004.
[18] ——, “Bias in estimating the variance of K-fold cross-validation,” in Statistical Modeling and Analysis for Complex Data Problems, P. Duchesne and B. Rémillard, Eds. Springer US, 2005, pp. 75–95.
[19] R. B. Bhatt and M. Gopal, “On fuzzy-rough sets approach to feature selection,” Pattern Recognition Letters, vol. 26, pp. 965–975, 2005.
[20] C. M. Bishop, Neural Networks for Pattern Recognition, 1st ed. Oxford University Press, USA, Jan. 1996.
[21] G. Bontempi, H. Bersini, and M. Birattari, “The local paradigm for modeling and control: from neuro-fuzzy to lazy learning,” Fuzzy Sets and Systems, vol. 121, no. 1, pp. 59–72, 2001.
[22] T. Boongoen, C. Shang, N. Iam-on, and Q. Shen, “Extending data reliability measure to a filter approach for soft subspace clustering,” IEEE Trans. Syst., Man, Cybern. B, vol. 41, no. 6, pp. 1705–1714, 2011.
[23] T. Boongoen and Q. Shen, “Nearest-neighbor guided evaluation of data reliability and its applications,” IEEE Trans. Syst., Man, Cybern. B, vol. 40, no. 6, pp. 1622–1633, Dec. 2010.
[24] B. Bouchon-Meunier, R. Mesiar, C. Marsala, and M. Rifqi, “Compositional rule of inference as an analogical scheme,” Fuzzy Sets and Systems, vol. 138, no. 1, pp. 53–65, 2003.
[25] V. Braverman, R. Ostrovsky, and C. Zaniolo, “Optimal sampling from sliding windows,” Journal of Computer and System Sciences, vol. 78, no. 1, pp. 260–272, 2012.
[26] L. Breiman, Classification and Regression Trees, ser. The Wadsworth and Brooks-Cole Statistics-Probability Series. Chapman & Hall, 1984.
[27] ——, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–140, 1996.
[28] ——, “Technical note: Some properties of splitting criteria,” Machine Learning, vol. 24, no. 1, pp. 41–47, 1996.
[29] J. Brownlee, Clever Algorithms: Nature-Inspired Programming Recipes. Lulu.com, 2011.
[30] E. Burke and J. Landa Silva, “The design of memetic algorithms for scheduling and timetabling problems,” in Recent Advances in Memetic Algorithms, Studies in Fuzziness and Soft Computing. Springer, 2004, pp. 289–312.
[31] K. Burnham and D. Anderson, Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach. Springer, 2002.
[32] M. Buscema, “Metanet: The theory of independent judges,” Substance Use & Misuse, vol. 33, no. 2, pp. 439–461, 1998.
[33] Y. Cao and J. Wu, “Projective ART for clustering data sets in high dimensional spaces,” Neural Networks, vol. 15, no. 1, pp. 105–120, Jan. 2002.
[34] J. Chang and W. Lee, “Finding recently frequent itemsets adaptively over online transactional data streams,” Information Systems, vol. 31, no. 8, pp. 849–869, 2006.
[35] N. Chawla and D. Davis, “Bringing big data to personalized healthcare: A patient-centered framework,” Journal of General Internal Medicine, vol. 28, no. 3, pp. 660–665, 2013.
[36] S. Chen and Y. Chang, “Fuzzy rule interpolation based on the ratio of fuzziness of interval type-2 fuzzy sets,” Expert Systems with Applications, vol. 38, no. 10, pp. 12202–12213, 2011.
[37] X. Chen, Y.-S. Ong, M.-H. Lim, and K. C. Tan, “A multi-facet survey on memetic computation,” IEEE Trans. Evol. Comput., vol. 15, no. 5, pp. 591–607, 2011.
[38] Y. Chen, D. Miao, and R. Wang, “A rough set approach to feature selection based on ant colony optimization,” Pattern Recognition Letters, vol. 31, no. 3, pp. 226–233, 2010.
[39] A. Chouchoulas and Q. Shen, “Rough set-aided keyword reduction for text categorisation,” Applied Artificial Intelligence, vol. 15, pp. 843–873, 2001.
[40] C. M. Christoudias, R. Urtasun, and T. Darrell, “Multi-view learning in the presence of view disagreement,” in 24th Conference on Uncertainty in Artificial Intelligence, 2008.
[41] L.-Y. Chuang, S.-W. Tsai, and C.-H. Yang, “Improved binary particle swarm optimization using catfish effect for feature selection,” Expert Systems with Applications, vol. 38, no. 10, pp. 12699–12707, 2011.
[42] I. Cloete and J. van Zyl, “Fuzzy rule induction in a set covering framework,” IEEE Trans. Fuzzy Syst., vol. 14, no. 1, pp. 93–110, 2006.
[43] W. W. Cohen, “Fast effective rule induction,” in Twelfth International Conference on Machine Learning. Morgan Kaufmann, 1995, pp. 115–123.
[44] R. Cooley, B. Mobasher, and J. Srivastava, “Web mining: information and pattern discovery on the world wide web,” in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, 1997, pp. 558–567.
[45] C. Cornelis, G. H. Martín, R. Jensen, and D. Slezak, “Feature selection with fuzzy decision reducts,” in Rough Sets and Knowledge Technology, ser. Lecture Notes in Computer Science, G. Wang, T. Li, J. Grzymala-Busse, D. Miao, A. Skowron, and Y. Yao, Eds. Springer Berlin Heidelberg, 2008, vol. 5009, pp. 284–291.
[46] Y. Cui, J. Jin, S. Zhang, S. Luo, and Q. Tian, “Correlation-based feature selection and regression,” in Advances in Multimedia Information Processing, ser. Lecture Notes in Computer Science, G. Qiu, K. Lam, H. Kiya, X.-Y. Xue, C.-C. Kuo, and M. Lew, Eds. Springer Berlin Heidelberg, 2010, vol. 6297, pp. 25–35.
[47] P. Cunningham and J. Carney, “Diversity versus quality in classification ensembles based on feature selection,” in 11th European Conference on Machine Learning. Springer, 2000, pp. 109–116.
[48] A. Darwiche, Modeling and Reasoning with Bayesian Networks, 1st ed. New York, NY, USA: Cambridge University Press, 2009.
[49] S. Das, A. Mukhopadhyay, A. Roy, A. Abraham, and B. Panigrahi, “Exploratory power of the harmony search algorithm: Analysis and improvements for global numerical optimization,” IEEE Trans. Syst., Man, Cybern. B, vol. 41, no. 1, pp. 89–106, 2011.
[50] K. Das Sharma, A. Chatterjee, and A. Rakshit, “Design of a hybrid stable adaptive fuzzy controller employing Lyapunov theory and harmony search algorithm,” IEEE Trans. Control Syst. Technol., vol. 18, no. 6, pp. 1440–1447, 2010.
[51] M. Dash, K. Choi, P. Scheuermann, and H. Liu, “Feature selection for clustering – a filter solution,” in Proceedings of the 2002 IEEE International Conference on Data Mining, 2002, pp. 115–122.
[52] M. Dash and H. Liu, “Consistency-based search in feature selection,” Artificial Intelligence, vol. 151, no. 1-2, pp. 155–176, Dec. 2003.
[53] ——, “Feature selection for classification,” Intelligent Data Analysis, vol. 1, pp. 131–156, 1997.
[54] L. de Castro and F. Von Zuben, “Learning and optimization using the clonal selection principle,” IEEE Trans. Evol. Comput., vol. 6, no. 3, pp. 239–251, Jun. 2002.
[55] J. Debuse and V. Rayward-Smith, “Feature subset selection within a simulated annealing data mining algorithm,” Journal of Intelligent Information Systems, vol. 9, pp. 57–81, 1997.
[56] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
[57] R. Diao, F. Chao, T. Peng, N. Snooke, and Q. Shen, “Feature selection inspired classifier ensemble reduction,” IEEE Trans. Cybern., in press.
[58] R. Diao, N. Mac Parthaláin, and Q. Shen, “Dynamic feature selection with fuzzy-rough sets,” in IEEE International Conference on Fuzzy Systems, Jun. 2013, pp. 1–7.
[59] R. Diao and Q. Shen, “Deterministic parameter control in harmony search,” in Proceedings of the 10th UK Workshop on Computational Intelligence, 2010.
[60] ——, “Two new approaches to feature selection with harmony search,” in IEEE International Conference on Fuzzy Systems, Jul. 2010, pp. 1–7.
[61] ——, “Fuzzy-rough classifier ensemble selection,” in IEEE International Conference on Fuzzy Systems, Jun. 2011, pp. 1516–1522.
[62] ——, “Feature selection with harmony search,” IEEE Trans. Syst., Man, Cybern. B, vol. 42, no. 6, pp. 1509–1523, 2012.
[63] ——, “A harmony search based approach to hybrid fuzzy-rough rule induction,” in IEEE International Conference on Fuzzy Systems, 2012, pp. 1–8.
[64] D. Dias, R. Madeo, T. Rocha, H. Biscaro, and S. Peres, “Hand movement recognition for Brazilian sign language: A study using distance-based neural networks,” in International Joint Conference on Neural Networks (IJCNN 2009), 2009, pp. 697–704.
[65] M. Dorigo and T. Stützle, “Ant colony optimization: Overview and recent advances,” in Handbook of Metaheuristics, ser. International Series in Operations Research & Management Science, M. Gendreau and J.-Y. Potvin, Eds. Springer US, 2010, vol. 146, pp. 227–263.
[66] M. Drobics, U. Bodenhofer, and E. P. Klement, “FS-FOIL: an inductive learning method for extracting interpretable fuzzy descriptions,” International Journal of Approximate Reasoning, vol. 32, no. 2–3, pp. 131–152, 2003.
[67] D. Dubois and H. Prade, Putting Rough Sets and Fuzzy Sets Together. Intelligent Decision Support, Kluwer Academic Publishers, Dordrecht, 1992.
[68] S. Džeroski and B. Ženko, “Is combining classifiers better than selecting the best one?” Machine Learning, vol. 54, no. 3, pp. 255–273, Mar. 2004.
[69] A. Ekbal, S. Saha, O. Uryupina, and M. Poesio, “Multiobjective simulated annealing based approach for feature selection in anaphora resolution,” in Proceedings of the 8th International Conference on Anaphora Processing and Applications, ser. DAARC’11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 47–58.
[70] T. Elomaa and M. Kääriäinen, “An analysis of reduced error pruning,” Journal of Artificial Intelligence Research, vol. 15, no. 1, pp. 163–187, Sep. 2001.
[71] C. Emmanouilidis, A. Hunter, and J. MacIntyre, “A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator,” in Proceedings of the 2000 Congress on Evolutionary Computation, vol. 1, 2000, pp. 309–316.
[72] R. Esposito and L. Saitta, “A Monte Carlo analysis of ensemble classification,” in Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 265–272.
[73] I. W. Evett and E. J. Spiehler, “Rule induction in forensic science,” Central Research Establishment, Home Office Forensic Science Service, Tech. Rep., 1987.
[74] M. A. Fanty and R. Cole, “Spoken letter recognition,” in NIPS, 1990, p. 220.
[75] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery in databases,” AI Magazine, vol. 17, pp. 37–54, 1996.
[76] A. Fern, R. Givan, B. Falsafi, and T. Vijaykumar, “Dynamic feature selection for hardware prediction,” Purdue University, Tech. Rep., 2000.
[77] M. Fesanghary, M. Mahdavi, M. Minary-Jolandan, and Y. Alizadeh, “Hybridizing harmony search algorithm with sequential quadratic programming for engineering optimization problems,” Computer Methods in Applied Mechanics and Engineering, vol. 197, no. 33-40, pp. 3080–3091, 2008.
[78] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010.
[79] A. A. Freitas, “A review of evolutionary algorithms for data mining,” in Soft Computing for Knowledge Discovery and Data Mining, 2007, pp. 61–93.
[80] X. Fu and Q. Shen, “Fuzzy compositional modeling,” IEEE Trans. Fuzzy Syst., vol. 18, no. 4, pp. 823–840, Aug. 2010.
[81] N. Fu-zhong and L. Ming, “Attribute value reduction in variable precision rough set,” in 6th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2005, pp. 904–906.
[82] A. Ganguly, J. Gama, O. Omitaomu, M. Gaber, and R. Vatsavai, Knowledge Discovery from Sensor Data, ser. Industrial Innovation Series. Taylor & Francis, 2008.
[83] F. García López, M. García Torres, B. Melián Batista, J. A. Moreno Pérez, and J. M. Moreno-Vega, “Solving feature subset selection problem by a parallel scatter search,” European Journal of Operational Research, vol. 169, no. 2, pp. 477–489, 2006.
[84] Z. W. Geem, Ed., Recent Advances in Harmony Search Algorithm, ser. Studies in Computational Intelligence. Springer, 2010, vol. 270.
[85] G. Giacinto and F. Roli, “An approach to the automatic design of multiple classifier systems,” Pattern Recognition Letters, vol. 22, pp. 25–33, 2001.
[86] T. Gneiting and A. E. Raftery, “Weather forecasting with ensemble methods,” Science, vol. 310, no. 5746, pp. 248–249, 2005.
[87] R. P. Gorman and T. J. Sejnowski, “Analysis of hidden units in a layered network trained to classify sonar targets,” Neural Networks, vol. 1, p. 75, 1988.
[88] J. W. Grzymala-Busse, “Three strategies to rule induction from data with numerical attributes,” in Transactions on Rough Sets II, ser. Lecture Notes in Computer Science, J. Peters, A. Skowron, D. Dubois, J. W. Grzymala-Busse, M. Inuiguchi, and L. Polkowski, Eds. Springer Berlin Heidelberg, 2005, vol. 3135, pp. 54–62.
[89] P. Grzymala-Busse, J. Grzymala-Busse, and Z. Hippe, “Melanoma prediction using data mining system LERS,” in 25th Annual International Computer Software and Applications Conference, 2001, pp. 615–620.
[90] H. Guvenir, S. Acar, G. Demiroz, and A. Cekin, “A supervised machine learning algorithm for arrhythmia analysis,” in Computers in Cardiology 1997, 1997, pp. 433–436.
[91] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, Mar. 2003.
[92] B. Haktanirlar Ulutas and S. Kulturel-Konak, “A review of clonal selection algorithm and its applications,” Artificial Intelligence Review, vol. 36, no. 2, pp. 117–138, 2011.
[93] M. Hall, “Correlation-based feature subset selection for machine learning,” Ph.D. dissertation, University of Waikato, Hamilton, New Zealand, 1998.
[94] M. A. Hall, “Correlation-based feature selection for discrete and numeric class machine learning,” in Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann, 2000, pp. 359–366.
[95] D. Hand, “Principles of data mining,” Drug Safety, vol. 30, no. 7, pp. 621–622, 2007.
[96] J. Handl and J. Knowles, “Feature subset selection in unsupervised learning via multi-objective optimization,” International Journal of Computational Intelligence Research, vol. 2, no. 3, pp. 217–238, 2006.
[97] M. H. Hansen and B. Yu, “Model selection and the principle of minimum description length,” Journal of the American Statistical Association, vol. 96, no. 454, pp. 746–774, 2001.
[98] S. Haykin, Neural Networks: A Comprehensive Foundation, ser. International Edition. Prentice Hall International, 1999.
[99] H. He, H. Daumé III, and J. Eisner, “Cost-sensitive dynamic feature selection,” in ICML Workshop on Inferning: Interactions between Inference and Learning, Edinburgh, Jun. 2012.
[100] A.-R. Hedar, J. Wang, and M. Fukushima, “Tabu search for attribute reduction in rough set theory,” Soft Computing, vol. 12, no. 9, pp. 909–918, Apr. 2008.
[101] C. Hinde, A. Bani-Hani, T. Jackson, and Y. Cheung, “Evolving polynomials of the inputs for decision tree building,” Journal of Emerging Technologies in Web Intelligence, vol. 4, no. 2, 2012.
[102] T. Ho, “The random subspace method for constructing decision forests,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832–844, Aug. 1998.
[103] S. Hoi, J. Wang, P. Zhao, and R. Jin, “Online feature selection for mining big data,” in Proceedings of the 1st International Workshop on BigMine, 2012, pp. 93–100.
[104] Y. Hong, S. Kwong, Y. Chang, and Q. Ren, “Consensus unsupervised feature ranking from multiple views,” Pattern Recognition Letters, vol. 29, no. 5, pp. 595–602, 2008.
[105] P. Horton and K. Nakai, “A probabilistic classification system for predicting the cellular localization sites of proteins,” in Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1996, pp. 109–115.
[106] N.-C. Hsieh, “Rule extraction with rough-fuzzy hybridization method,” in Advances in Knowledge Discovery and Data Mining, ser. Lecture Notes in Computer Science, T. Washio, E. Suzuki, K. Ting, and A. Inokuchi, Eds. Springer Berlin Heidelberg, 2008, vol. 5012, pp. 890–895.
[107] C.-N. Hsu, H.-J. Huang, and D. Schuschel, “The ANNIGMA-wrapper approach to fast feature selection for neural nets,” IEEE Trans. Syst., Man, Cybern. B, vol. 32, no. 2, pp. 207–212, 2002.
[108] P. Hsu, R. Lai, and C. Chiu, “The hybrid of association rule algorithms and genetic algorithms for tree induction: an example of predicting the student course performance,” Expert Systems with Applications, vol. 25, no. 1, pp. 51–62, 2003.
[109] Z. Huang and Q. Shen, “Fuzzy interpolative reasoning via scale and move transformations,” IEEE Trans. Fuzzy Syst., vol. 14, no. 2, pp. 340–359, 2006.
[110] ——, “Fuzzy interpolation and extrapolation: A practical approach,” IEEE Trans. Fuzzy Syst., vol. 16, no. 1, pp. 13–28, 2008.
[111] N. Q. Hung, M. S. Babel, S. Weesakul, and N. K. Tripathi, “An artificial neural network model for rainfall forecasting in Bangkok, Thailand,” Hydrology and Earth System Sciences, vol. 13, no. 8, pp. 1413–1425, 2009.
[112] S. Hunt, Q. Meng, and C. J. Hinde, “An extension of the consensus-based bundle algorithm for group dependant tasks with equipment dependencies,” in Neural Information Processing, ser. Lecture Notes in Computer Science, T. Huang, Z. Zeng, C. Li, and C. Leung, Eds. Springer Berlin Heidelberg, 2012, vol. 7666, pp. 518–527.
[113] H. Ishibuchi, K. Nozaki, N. Yamamoto, and H. Tanaka, “Selecting fuzzy if-then rules for classification problems using genetic algorithms,” IEEE Trans. Fuzzy Syst., vol. 3, no. 3, pp. 260–270, 1995.
[114] R. A. Jacobs, “Methods for combining experts’ probability assessments,” Neural Computation, vol. 7, no. 5, pp. 867–888, Sep. 1995.
[115] R. Jensen and C. Cornelis, “Fuzzy-rough instance selection,” in IEEE International Conference on Fuzzy Systems, 2010, pp. 1–7.
[116] R. Jensen and Q. Shen, “Fuzzy-rough sets assisted attribute selection,” IEEE Trans. Fuzzy Syst., vol. 15, no. 1, pp. 73–89, 2007.
[117] R. Jensen and C. Cornelis, “A new approach to fuzzy-rough nearest neighbour classification,” in Rough Sets and Current Trends in Computing, ser. Lecture Notes in Computer Science, C.-C. Chan, J. Grzymala-Busse, and W. Ziarko, Eds. Springer Berlin Heidelberg, 2008, vol. 5306, pp. 310–319.
[118] ——, “Fuzzy-rough nearest neighbour classification,” in Transactions on Rough Sets XIII, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011, vol. 6499, pp. 56–72.
[119] R. Jensen, C. Cornelis, and Q. Shen, “Hybrid fuzzy-rough rule induction and feature selection,” in IEEE International Conference on Fuzzy Systems, 2009, pp. 1151–1156.
[120] R. Jensen and Q. Shen, “Using fuzzy dependency-guided attribute grouping in feature selection,” in Proceedings of the 9th International Conference on Rough Sets. Springer, 2003, pp. 250–254.
[121] ——, “Fuzzy-rough attribute reduction with application to web categorization,” Fuzzy Sets and Systems, vol. 141, no. 3, pp. 469–485, 2004.
[122] ——, “Fuzzy-rough data reduction with ant colony optimization,” Fuzzy Sets and Systems, vol. 149, pp. 5–20, 2005.
[123] ——, Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches. Wiley-IEEE Press, 2008.
[124] ——, “Are more features better? a response to attributes reduction using fuzzy rough sets,” IEEE Trans. Fuzzy Syst., vol. 17, no. 6, pp. 1456–1458, 2009.
[125] ——, “Feature selection for aiding glass forensic evidence analysis,” Intell. Data Anal., vol. 13, no. 5, pp. 703–723, Oct. 2009.
[126] ——, “New approaches to fuzzy-rough feature selection,” IEEE Trans. Fuzzy Syst., vol. 17, no. 4, pp. 824–838, Aug. 2009.
[127] R. Jensen, A. Tuson, and Q. Shen, “Finding rough and fuzzy-rough set reducts with SAT,” Information Sciences, vol. 255, pp. 100–120, 2014.
[128] S. Jin, R. Diao, C. Quek, and Q. Shen, “Backward fuzzy rule interpolation with multiple missing values,” in IEEE International Conference on Fuzzy Systems, 2013.
[129] ——, “Backward fuzzy rule interpolation,” IEEE Trans. Fuzzy Syst., 2014, in press.
[130] S. Jin, R. Diao, and Q. Shen, “Towards backward fuzzy rule interpolation,” in Proceedings of the 11th UK Workshop on Computational Intelligence, 2011.
[131] ——, “Backward fuzzy interpolation and extrapolation with multiple multi-antecedent rules,” in IEEE International Conference on Fuzzy Systems, 2012, pp. 1–8.
[132] G. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 1995, pp. 338–345.
[133] M. M. Kabir, M. Shahjahan, and K. Murase, “A new local search based hybrid genetic algorithm for feature selection,” Neurocomputing, vol. 74, no. 17, pp. 2914–2928, 2011.
[134] M. M. Kabir, M. Shahjahan, and K. Murase, “A new hybrid ant colony optimization algorithm for feature selection,” Expert Systems with Applications, vol. 39, no. 3, pp. 3747–3763, 2012.
[135] D. Karaboga and B. Akay, “A survey: algorithms simulating bee swarm intelligence,” Artificial Intelligence Review, vol. 31, no. 1-4, pp. 61–85, 2009.
[136] D. Karaboga and B. Basturk, “A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm,” Journal of Global Optimization, vol. 39, no. 3, pp. 459–471, Nov. 2007.
[137] M. Karzynski, L. Mateos, J. Herrero, and J. Dopazo, “Using a genetic algorithm and a perceptron for feature selection and supervised class learning in DNA microarray data,” Artificial Intelligence Review, vol. 20, no. 1-2, pp. 39–51, 2003.
[138] L. Ke, Z. Feng, and Z. Ren, “An efficient ant colony optimization approach to attribute reduction in rough set theory,” Pattern Recognition Letters, vol. 29, no. 9, pp. 1351–1357, 2008.
[139] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvements to Platt’s SMO algorithm for SVM classifier design,” Neural Computation, vol. 13, no. 3, pp. 637–649, Mar. 2001.
[140] J. Keller, M. Gray, and J. Givens, “A fuzzy K-nearest neighbor algorithm,” IEEE Trans. Syst., Man, Cybern., vol. 15, no. 4, pp. 580–585, 1985.
[141] T. Kietzmann, S. Lange, and M. Riedmiller, “Incremental GRLVQ: Learning relevant features for 3D object recognition,” Neurocomputing, vol. 71, no. 13-15, pp. 2868–2879, 2008.
[142] L. Koczy and K. Hirota, “Approximate reasoning by linear rule interpolation and general approximation,” International Journal of Approximate Reasoning, vol. 9, no. 3, pp. 197–225, 1993.
[143] ——, “Interpolative reasoning with insufficient evidence in sparse fuzzy rule bases,” Information Sciences, vol. 71, no. 1-2, pp. 169–201, 1993.
[144] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, no. 1, pp. 273–324, 1997.
[145] J. Komorowski, Z. Pawlak, L. Polkowski, and A. Skowron, “Rough sets: A tutorial,” 1998.
[146] I. Kononenko, “Estimating attributes: Analysis and extensions of RELIEF,” in Machine Learning, ser. Lecture Notes in Computer Science, F. Bergadano and L. Raedt, Eds. Springer Berlin Heidelberg, 1994, vol. 784, pp. 171–182.
[147] I. Kononenko, E. Simec, and M. Robnik-Sikonja, “Overcoming the myopia of inductive learning algorithms with RELIEFF,” Applied Intelligence, vol. 7, pp. 39–55, 1997.
[148] S. Kovács, “Special issue on fuzzy rule interpolation,” Journal of Advanced Computational Intelligence and Intelligent Informatics, p. 253, 2011.
[149] B. Kröse, N. Vlassis, R. Bunschoten, and Y. Motomura, “A probabilistic model for appearance-based robot localization,” in First European Symposium on Ambient Intelligence. Springer, 2000, pp. 264–274.
[150] L. Kuncheva, “Switching between selection and fusion in combining classifiers: an experiment,” IEEE Trans. Syst., Man, Cybern. B, vol. 32, no. 2, pp. 146–156, 2002.
[151] ——, “Fuzzy versus nonfuzzy in combining classifiers designed by boosting,” IEEE Trans. Fuzzy Syst., vol. 11, no. 6, pp. 729–741, 2003.
[152] L. I. Kuncheva and C. J. Whitaker, “Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy,” Machine Learning, vol. 51, no. 2, pp. 181–207, May 2003.
[153] R. Leardi, R. Boggia, and M. Terrile, “Genetic algorithms as a strategy for feature selection,” Journal of Chemometrics, vol. 6, no. 5, pp. 267–281, 1992.
[154] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction, ser. Information Science and Statistics. Springer, 2007.
[155] K. S. Lee and Z. W. Geem, “A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice,” Computer Methods in Applied Mechanics and Engineering, vol. 194, no. 36-38, pp. 3902–3933, Sep. 2005.
[156] ——, “A new structural optimization method based on the harmony search algorithm,” Computers & Structures, vol. 82, no. 9–10, pp. 781–798, 2004.
[157] K. S. Lee, Z. W. Geem, S.-H. Lee, and K.-W. Bae, “The harmony search heuristic algorithm for discrete structural optimization,” Engineering Optimization, vol. 37, no. 7, pp. 663–684, 2005.
[158] L. Lee and S. Chen, “Fuzzy interpolative reasoning using interval type-2 fuzzy sets,” New Frontiers in Applied Artificial Intelligence, vol. 5027, pp. 92–101, 2008.
[159] W. Lee, S. J. Stolfo, and K. W. Mok, “Adaptive intrusion detection: A data mining approach,” Artificial Intelligence Review, vol. 14, no. 6, pp. 533–567, Dec. 2000.
[160] N. Li, I. Tsang, and Z.-H. Zhou, “Efficient optimization of performance measures by classifier adaptation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PP, no. 99, p. 1, 2012.
[161] X. Li and L. Parker, “Design and performance improvements for fault detection in tightly-coupled multi-robot team tasks,” in Proceedings of IEEE International Conference on Robotics and Automation, 2009.
[162] M. Lippi, M. Jaeger, P. Frasconi, and A. Passerini, “Relational information gain,” Machine Learning, vol. 83, pp. 219–239, 2011.
[163] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Norwell, MA, USA: Kluwer Academic Publishers, 1998.
[164] ——, Computational Methods of Feature Selection (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series). Chapman & Hall/CRC, 2007.
[165] H. Liu and L. Yu, “Toward integrating feature selection algorithms for classification and clustering,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491–502, 2005.
[166] Y. Liu, Q. Zhou, E. Rakus-Andersson, and G. Bai, “A fuzzy-rough sets based compact rule induction method for classifying hybrid data,” in Rough Sets and Knowledge Technology, ser. Lecture Notes in Computer Science, T. Li, H. Nguyen, G. Wang, J. Grzymala-Busse, R. Janicki, A. Hassanien, and H. Yu, Eds. Springer Berlin Heidelberg, 2012, vol. 7414, pp. 63–70.
[167] Y. Liu, G. Wang, H. Chen, H. Dong, X. Zhu, and S. Wang, “An improved particle swarm optimization for feature selection,” Journal of Bionic Engineering, vol. 8, no. 2, pp. 191–200, 2011.
[168] N. Mac Parthaláin, “Guiding rough and fuzzy-rough feature selection using alternative evaluation functions and search strategies,” Ph.D. dissertation, University of Wales Aberystwyth, 2006.
[169] N. Mac Parthaláin and R. Jensen, “Measures for unsupervised fuzzy-rough feature selection,” International Journal of Hybrid Intelligent Systems, vol. 7, no. 4, pp. 249–259, Dec. 2010.
[170] ——, “Fuzzy-rough set based semi-supervised learning,” in IEEE International Conference on Fuzzy Systems, Jun. 2011, pp. 2465–2472.
[171] N. Mac Parthaláin, R. Jensen, Q. Shen, and R. Zwiggelaar, “Fuzzy-rough approaches for mammographic risk analysis,” Intelligent Data Analysis, vol. 14, no. 2, pp. 225–244, Apr. 2010.
[172] N. Mac Parthaláin, Q. Shen, and R. Jensen, “A distance measure approach to exploring the rough set boundary region for attribute reduction,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 3, pp. 305–317, Mar. 2010.
[173] M. Mahdavi, M. Fesanghary, and E. Damangir, “An improved harmony search algorithm for solving optimization problems,” Applied Mathematics and Computation, vol. 188, no. 2, pp. 1567–1579, 2007.
[174] M. Mahdavi, M. H. Chehreghani, H. Abolhassani, and R. Forsati, “Novel meta-heuristic algorithms for clustering web documents,” Applied Mathematics and Computation, vol. 201, no. 1-2, pp. 441–451, 2008.
[175] D. Manjarres, I. Landa-Torres, S. Gil-Lopez, J. Del Ser, M. Bilbao, S. Salcedo-Sanz, and Z. Geem, “A survey on applications of the harmony search algorithm,” Engineering Applications of Artificial Intelligence, vol. 26, no. 8, pp. 1818–1831, 2013.
[176] A. Marín-Hernández, R. Méndez-Rodríguez, and F. Montes-González, “Significant feature selection in range scan data for geometrical mobile robot mapping,” in Proceedings of the 5th International Symposium on Robotics and Automation, 2006.
[177] J. Marin-Blazquez and Q. Shen, “From approximative to descriptive fuzzy classifiers,” IEEE Trans. Fuzzy Syst., vol. 10, no. 4, pp. 484–497, 2002.
[178] F. Markatopoulou, G. Tsoumakas, and I. Vlahavas, “Instance-based ensemble pruning via multi-label classification,” in 22nd IEEE International Conference on Tools with Artificial Intelligence, vol. 1, 2010, pp. 401–408.
[179] M. H. Mashinchi, M. A. Orgun, M. Mashinchi, and W. Pedrycz, “A tabu-harmony search-based approach to fuzzy linear regression,” IEEE Trans. Fuzzy Syst., vol. 19, no. 3, pp. 432–448, 2011.
[180] M. McCann, Y. Li, L. P. Maguire, and A. Johnston, “Causality challenge: Benchmarking relevant signal components for effective monitoring and process control,” Journal of Machine Learning Research – Proceedings Track, vol. 6, pp. 277–288, 2010.
[181] P. McCullagh, “What is a statistical model?” The Annals of Statistics, vol. 30, no. 5, pp. 1225–1310, Oct. 2002.
[182] R. Meiri and J. Zahavi, “Using simulated annealing to optimize the feature selection problem in marketing applications,” European Journal of Operational Research, vol. 171, no. 3, pp. 842–858, 2006.
[183] N. Memon, D. Hicks, and H. Larsen, “Notice of violation of IEEE publication principles: Harvesting terrorists information from web,” in 11th International Conference on Information Visualization, 2007, pp. 664–671.
[184] H.-M. Lee, C.-M. Chen, J.-M. Chen, and Y.-L. Jou, “An efficient fuzzy classifier with feature selection based on fuzzy entropy,” IEEE Trans. Syst., Man, Cybern. B, vol. 31, pp. 426–432, 2001.
[185] T. Mitchell, Machine Learning, 1st ed. McGraw-Hill Education (ISE Editions), Oct. 1997.
[186] L. Molina, L. Belanche, and A. Nebot, “Feature selection algorithms: a survey and experimental evaluation,” in Proceedings of 2002 IEEE International Conference on Data Mining, 2002, pp. 306–313.
[187] D. P. Muni, N. R. Pal, and J. Das, “Genetic programming for simultaneous feature selection and classifier design,” IEEE Trans. Syst., Man, Cybern. B, vol. 36, no. 1, pp. 106–117, 2006.
[188] N. Naik, R. Diao, C. Quek, and Q. Shen, “Towards dynamic fuzzy rule interpolation,” in IEEE International Conference on Fuzzy Systems, 2013.
[189] R. Y. M. Nakamura, L. A. M. Pereira, K. A. Costa, D. Rodrigues, J. P. Papa, and X.-S. Yang, “BBA: A binary bat algorithm for feature selection,” in 25th SIBGRAPI Conference on Graphics, Patterns and Images, Aug. 2012, pp. 291–297.
[190] L. Nanni and A. Lumini, “Ensemblator: An ensemble of classifiers for reliable classification of biological data,” Pattern Recognition Letters, vol. 28, no. 5, pp. 622–630, 2007.
[191] Y. Narukawa, Modeling Decisions: Information Fusion and Aggregation Operators, ser. Cognitive Technologies. Springer, 2010.
[192] S. Nemati, M. E. Basiri, N. Ghasem-Aghaee, and M. H. Aghdam, “A novel ACO-GA hybrid algorithm for feature selection in protein function prediction,” Expert Systems with Applications, vol. 36, no. 10, pp. 12 086–12 094, 2009.
[193] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, “Text classification from labeled and unlabeled documents using EM,” in Machine Learning, 1999, pp. 103–134.
[194] I.-S. Oh, J.-S. Lee, and B.-R. Moon, “Hybrid genetic algorithms for feature selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, 2004.
[195] J. S. Olsson, “Combining feature selectors for text classification,” in Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006, pp. 798–799.
[196] Y.-S. Ong, N. Krasnogor, and H. Ishibuchi, “Special issue on memetic algorithms,” IEEE Trans. Syst., Man, Cybern. B, vol. 37, no. 1, pp. 2–5, Feb. 2007.
[197] D. W. Opitz, “Feature selection for ensembles,” in Proceedings of 16th National Conference on Artificial Intelligence. AAAI Press, 1999, pp. 379–384.
[198] P. Paclik, W. Duin, G. M. P. van Kempen, and R. Kohlus, “On feature selection with measurement cost and grouped features,” Pattern Recognition Group, Delft University of Technology.
[199] S. Palanisamy and K. S., “Artificial bee colony approach for optimizing feature selection,” International Journal of Computer Science Issues, vol. 9, no. 3, pp. 432–438, 2012.
[200] I. Partalas, G. Tsoumakas, and I. Vlahavas, “Pruning an ensemble of classifiers via reinforcement learning,” Neurocomputing, vol. 72, no. 7–9, pp. 1900–1909, 2009.
[201] N. Mac Parthaláin and Q. Shen, “Exploring the boundary region of tolerance rough sets for feature selection,” Pattern Recognition, vol. 42, no. 5, pp. 655–667, 2009.
[202] D. Paul, E. Bair, T. Hastie, and R. Tibshirani, “Preconditioning for feature selection and regression in high-dimensional problems,” The Annals of Statistics, vol. 36, no. 4, pp. 1595–1618, 2008.
[203] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data. Norwell, MA, USA: Kluwer Academic Publishers, 1992.
[204] Z. Pawlak, J. Grzymala-Busse, R. Slowinski, and W. Ziarko, “Rough sets,” Communications of the ACM, vol. 38, no. 11, pp. 88–95, Nov. 1995.
[205] J. M. Peña, J. A. Lozano, P. Larrañaga, and I. Inza, “Dimensionality reduction in unsupervised learning of conditional Gaussian networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 590–603, Jun. 2001.
[206] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, 2005.
[207] S. Piramuthu, “Evaluating feature selection methods for learning in data mining applications,” European Journal of Operational Research, vol. 156, no. 2, pp. 483–494, 2004.
[208] J. C. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” Advances in Kernel Methods – Support Vector Learning, Tech. Rep., 1998.
[209] H. Prade, G. Richard, and M. Serrurier, “Enriching relational learning with fuzzy predicates,” in Knowledge Discovery in Databases: PKDD 2003, ser. Lecture Notes in Computer Science, N. Lavrac, D. Gamberger, L. Todorovski, and H. Blockeel, Eds. Springer Berlin Heidelberg, 2003, vol. 2838, pp. 399–410.
[210] B. Predki and S. Wilk, “Rough set based data exploration using ROSE system,” in Foundations of Intelligent Systems, ser. Lecture Notes in Computer Science, Z. W. Ras and A. Skowron, Eds. Springer Berlin Heidelberg, 1999, vol. 1609, pp. 172–180.
[211] Z. Qin and J. Lawry, “LFOIL: Linguistic rule induction in the label semantics framework,” Fuzzy Sets and Systems, vol. 159, no. 4, pp. 435–448, Feb. 2008.
[212] C. C. Ramos, A. N. Souza, G. Chiachia, A. X. Falcão, and J. P. Papa, “A novel algorithm for feature selection using harmony search and its application for non-technical losses detection,” Computers & Electrical Engineering, vol. 37, no. 6, pp. 886–894, 2011.
[213] K. Rasmani and Q. Shen, “Data-driven fuzzy rule generation and its application for student academic performance evaluation,” Applied Intelligence, vol. 25, no. 3, pp. 305–319, 2006.
[214] V. Rathnayake, L. Premaratne, and D. Sonnadara, “Development of feature based artificial neural network model for weather nowcasting,” in National Symposium on Disaster Risk Reduction & Climate Change Adaptation, 2010.
[215] M. J. Reddy and D. K. Mohanta, “A comparative study of artificial neural network (ANN) and fuzzy inference system (FIS) approach for digital relaying of transmission line faults,” International Journal on Artificial Intelligence and Machine Learning, vol. 6, pp. 1–7, 2006.
[216] S. Royston, J. Lawry, and K. Horsburgh, “A linguistic decision tree approach to predicting storm surge,” Fuzzy Sets and Systems, vol. 215, pp. 90–111, 2013.
[217] L. Saitta, “Hypothesis diversity in ensemble classification,” in Foundations of Intelligent Systems, ser. Lecture Notes in Computer Science, F. Esposito, Z. Ras, D. Malerba, and G. Semeraro, Eds. Springer Berlin Heidelberg, 2006, vol. 4203, pp. 662–670.
[218] G. Schwarz, “Estimating the dimension of a model,” Annals of Statistics, vol. 6, no. 2, pp. 461–464, Mar. 1978.
[219] S. Senthamarai Kannan and N. Ramaraj, “A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm,” Knowledge-Based Systems, vol. 23, no. 6, pp. 580–585, Aug. 2010.
[220] G. Shafer, A Mathematical Theory of Evidence. Princeton University Press, 1976.
[221] M. Shah, M. Marchand, and J. Corbeil, “Feature selection with conjunctions of decision stumps and learning from microarray data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 174–186, 2012.
[222] C. Shang and D. Barnes, “Fuzzy-rough feature selection aided support vector machines for Mars image classification,” Computer Vision and Image Understanding, vol. 117, no. 3, pp. 202–213, 2013.
[223] ——, “Support vector machine-based classification of rock texture images aided by efficient feature selection,” in The International Joint Conference on Neural Networks, Jun. 2012, pp. 1–8.
[224] C. Shang, D. Barnes, and Q. Shen, “Facilitating efficient Mars terrain image classification with fuzzy-rough feature selection,” International Journal of Hybrid Intelligent Systems, vol. 8, no. 1, pp. 3–13, Jan. 2011.
[225] C. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, Jul./Oct. 1948.
[226] Q. Shen and A. Chouchoulas, “A rough-fuzzy approach for generating classification rules,” Pattern Recognition, vol. 35, no. 11, pp. 2425–2438, 2002.
[227] Q. Shen, R. Diao, and P. Su, “Feature selection ensemble,” in Turing Centenary, ser. EPiC Series, A. Voronkov, Ed., vol. 10. EasyChair, 2012, pp. 289–306.
[228] Q. Shen and R. Jensen, “Selecting informative features with fuzzy-rough sets and its application for complex systems monitoring,” Pattern Recognition, vol. 37, no. 7, pp. 1351–1363, 2004.
[229] Z. Shen, L. Ding, and M. Mukaidono, “Fuzzy resolution principle,” in Proceedings of the 18th International Symposium on Multiple-Valued Logic, 1988, pp. 210–215.
[230] S. Shojaie and M. Moradi, “An evolutionary artificial immune system for feature selection and parameters optimization of support vector machines for ERP assessment in a P300-based GKT,” in International Biomedical Engineering Conference, Dec. 2008, pp. 1–5.
[231] W. Siedlecki and J. Sklansky, “A note on genetic algorithms for large-scale feature selection,” Pattern Recognition Letters, vol. 10, no. 5, pp. 335–347, 1989.
[232] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, “Classification of radar returns from the ionosphere using neural networks,” Johns Hopkins APL Tech. Dig., vol. 10, pp. 262–266, 1989.
[233] S. Singh, J. Kubica, S. Larsen, and D. Sorokina, “Parallel large scale feature selection for logistic regression,” in SDM, 2009, pp. 1171–1182.
[234] R. K. Sivagaminathan and S. Ramakrishnan, “A hybrid approach for feature subset selection using neural networks and ant colony optimization,” Expert Systems with Applications, vol. 33, no. 1, pp. 49–60, 2007.
[235] J. Sklansky and M. Vriesenga, “Genetic selection and neural modeling of piecewise-linear classifiers,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 10, no. 5, pp. 587–612, 1996.
[236] D. Slezak, Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing: 10th International Conference, RSFDGrC 2005, Regina, Canada, August 31 - September 3, 2005, Proceedings, ser. Lecture Notes in Computer Science / Lecture Notes in Artificial Intelligence. Springer, 2005.
[237] J. W. Smith, J. Everhart, W. Dickson, W. Knowler, and R. Johannes, “Using the ADAP learning algorithm to forecast the onset of diabetes mellitus,” in Annual Symposium on Computer Application in Medical Care, 1988, pp. 261–265.
[238] P. Somol, J. Grim, and P. Pudil, “Criteria ensembles in feature selection,” in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, J. Benediktsson, J. Kittler, and F. Roli, Eds. Springer Berlin Heidelberg, 2009, vol. 5519, pp. 304–313.
[239] R. Srinivasa Rao, S. Narasimham, M. Ramalinga Raju, and A. Srinivasa Rao, “Optimal network reconfiguration of large-scale distribution system using harmony search algorithm,” IEEE Trans. Power Syst., vol. 26, no. 3, pp. 1080–1088, 2011.
[240] S. Srinivasan and S. Ramakrishnan, “Evolutionary multi objective optimization for rule mining: a review,” Artificial Intelligence Review, vol. 36, no. 3, pp. 205–248, 2011.
[241] D. J. Stracuzzi and P. E. Utgoff, “Randomized variable elimination,” Journal of Machine Learning Research, vol. 5, pp. 1331–1364, 2004.
[242] N. Suguna and K. G. Thanushkodi, “An independent rough set approach hybrid with artificial bee colony algorithm for dimensionality reduction,” American Journal of Applied Sciences, vol. 8, no. 3, pp. 261–266, 2011.
[243] A. Sung, A. Merke, and M. Riedmiller, “Reinforcement learning using a grid based function approximator,” in Biomimetic Neural Learning for Intelligent Robots, ser. Lecture Notes in Computer Science, S. Wermter, G. Palm, and M. Elshaw, Eds. Springer Berlin Heidelberg, 2005, vol. 3575, pp. 235–244.
[244] D. Swets and J. Weng, “Efficient content-based image retrieval using automatic feature selection,” in Proceedings of International Symposium on Computer Vision, 1995, pp. 85–90.
[245] R. W. Swiniarski and A. Skowron, “Rough set methods in feature selection and recognition,” Pattern Recognition Letters, vol. 24, no. 6, pp. 833–849, 2003.
[246] M. A. Tahir, J. Kittler, and A. Bouridane, “Multilabel classification using heterogeneous ensemble of multi-label classifiers,” Pattern Recognition Letters, vol. 33, no. 5, pp. 513–523, 2012.
[247] A. Tajbakhsh, M. Rahmati, and A. Mirzaei, “Intrusion detection using fuzzy association rules,” Applied Soft Computing, vol. 9, no. 2, pp. 462–469, Mar. 2009.
[248] D. Tikk, I. Joó, L. Kóczy, P. Várlaki, B. Moser, and T. Gedeon, “Stability of interpolative fuzzy KH controllers,” Fuzzy Sets and Systems, vol. 125, no. 1, pp. 105–119, 2002.
[249] V. Torra and Y. Narukawa, Modeling Decisions: Information Fusion and Aggregation Operators. Springer, 2007.
[250] G. Tsoumakas, I. Partalas, and I. Vlahavas, “A taxonomy and short review of ensemble selection,” in Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications, 2008.
[251] A. Tsymbal, M. Pechenizkiy, and P. Cunningham, “Diversity in search strategies for ensemble feature selection,” Information Fusion, vol. 6, no. 1, pp. 83–98, 2005.
[252] E. Tuv, A. Borisov, G. Runger, and K. Torkkola, “Feature selection with ensembles, artificial variables, and redundancy elimination,” Journal of Machine Learning Research, vol. 10, pp. 1341–1366, Dec. 2009.
[253] D. L. Vail and M. M. Veloso, “Feature selection for activity recognition in multi-robot domains,” in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008, pp. 1415–1420.
[254] B. Vandeginste, “Parvus: An extendable package of programs for data exploration, classification and correlation,” Journal of Chemometrics, vol. 4, no. 2, pp. 191–193, 1990.
[255] A. Vasebi, M. Fesanghary, and S. M. T. Bathaee, “Combined heat and power economic dispatch by harmony search algorithm,” International Journal of Electrical Power & Energy Systems, vol. 29, no. 10, pp. 713–719, Dec. 2007.
[256] R. Vilalta and Y. Drissi, “A perspective view and survey of meta-learning,” Artificial Intelligence Review, vol. 18, no. 2, pp. 77–95, 2002.
[257] C.-M. Wang and Y.-F. Huang, “Self-adaptive harmony search algorithm for optimization,” Expert Systems with Applications, vol. 37, no. 4, pp. 2826–2837, 2010.
[258] H. Wang, S. Kwong, Y. Jin, W. Wei, and K. Man, “Multi-objective hierarchical genetic algorithm for interpretable fuzzy rule-based knowledge extraction,” Fuzzy Sets and Systems, vol. 149, no. 1, pp. 149–186, 2005.
[259] H. Wang, T. M. Khoshgoftaar, and A. Napolitano, “A comparative study of ensemble feature selection techniques for software defect prediction,” in Proceedings of the 2010 9th International Conference on Machine Learning and Applications, 2010, pp. 135–140.
[260] J. Wang, P. Zhao, S. C. Hoi, and R. Jin, “Online feature selection and its applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 99, no. PrePrints, p. 1, 2013.
[261] X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen, “Feature selection based on rough sets and particle swarm optimization,” Pattern Recognition Letters, vol. 28, no. 4, pp. 459–471, 2007.
[262] X. Wang, E. C. Tsang, S. Zhao, D. Chen, and D. S. Yeung, “Learning fuzzy rules from fuzzy samples based on rough set technique,” Information Sciences, vol. 177, no. 20, pp. 4493–4514, 2007.
[263] G. Wells and C. Torras, “Assessing image features for vision-based robot positioning,” Journal of Intelligent and Robotic Systems, vol. 30, no. 1, pp. 95–118, 2001.
[264] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., ser. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, Jun. 2005.
[265] J. Wroblewski, “Finding minimal reducts using genetic algorithms,” in Proceedings of the 2nd International Joint Conference on Information Science, 1995, pp. 186–189.
[266] J. Wróblewski, “Ensembles of classifiers based on approximate reducts,” Fundamenta Informaticae, vol. 47, no. 3-4, pp. 351–360, Oct. 2001.
[267] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” Knowledge and Information Systems, vol. 14, pp. 1–37, 2008.
[268] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, “Online feature selection with streaming features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1178–1192, 2013.
[269] D. Xie, “Fuzzy association rules discovered on effective reduced database algorithm,” in IEEE International Conference on Fuzzy Systems, 2005, pp. 779–784.
[270] E. P. Xing, M. I. Jordan, and R. M. Karp, “Feature selection for high-dimensional genomic microarray data,” in Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann, 2001, pp. 601–608.
[271] Z. Xu, “An overview of methods for determining OWA weights,” International Journal of Intelligent Systems, vol. 20, no. 8, pp. 843–865, Aug. 2005.
[272] ——, “Dependent OWA operators,” in Proceedings of the Third International Conference on Modeling Decisions for Artificial Intelligence, ser. MDAI’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 172–178.
[273] R. Yager, “On ordered weighted averaging aggregation operators in multicriteria decision making,” IEEE Trans. Syst., Man, Cybern., vol. 18, no. 1, pp. 183–190, 1988.
[274] C.-S. Yang, L.-Y. Chuang, Y.-J. Chen, and C.-H. Yang, “Feature selection using memetic algorithms,” in Third International Conference on Convergence and Hybrid Information Technology, vol. 1, Nov. 2008, pp. 416–423.
[275] J. Yang and V. Honavar, “Feature subset selection using a genetic algorithm,” IEEE Intell. Syst., vol. 13, no. 2, pp. 44–49, Mar. 1998.
[276] L. Yang and Q. Shen, “Adaptive fuzzy interpolation,” IEEE Trans. Fuzzy Syst., vol. 19, no. 6, pp. 1107–1126, 2011.
[277] X.-S. Yang, Nature-Inspired Metaheuristic Algorithms. Luniver Press, 2008.
[278] Y. Yang and J. O. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings of the 14th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997, pp. 412–420.
[279] L. Yu and H. Liu, “Efficient feature selection via analysis of relevance and redundancy,” Journal of Machine Learning Research, vol. 5, pp. 1205–1224, Dec. 2004.
[280] S. C. Yusta, “Different metaheuristic strategies to solve the feature selection problem,” Pattern Recognition Letters, vol. 30, no. 5, pp. 525–534, Apr. 2009.
[281] X.-J. Zeng, J. Y. Goulermas, P. Liatsis, D. Wang, and J. A. Keane, “Hierarchical fuzzy systems for function approximation on discrete input spaces with application,” IEEE Trans. Fuzzy Syst., vol. 16, no. 5, pp. 1197–1215, 2008.
[282] X.-J. Zeng and M. G. Singh, “Approximation accuracy analysis of fuzzy systems as function approximators,” IEEE Trans. Fuzzy Syst., vol. 4, no. 1, pp. 44–63, 1996.
[283] K. Zhang, W. Fan, X. Yuan, I. Davidson, and X. Li, “Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond,” Knowledge and Information Systems, 2008.
[284] L. Zhang, X. Meng, W. Wu, and H. Zhou, “Network fault feature selection based on adaptive immune clonal selection algorithm,” in International Joint Conference on Computational Sciences and Optimization, vol. 2, 2009, pp. 969–973.
[285] M.-L. Zhang and Z.-H. Zhou, “Improve multi-instance neural networks through feature selection,” Neural Processing Letters, pp. 1–10, 2004.
[286] Q. Zhang, W. Liu, E. Tsang, and B. Virginas, “Expensive multiobjective optimization by MOEA/D with Gaussian process model,” IEEE Trans. Evol. Comput., vol. 14, no. 3, pp. 456–474, 2010.
[287] R. Zhang and L. Hanzo, “Iterative multiuser detection and channel decoding for DS-CDMA using harmony search,” IEEE Trans. Signal Process., vol. 16, no. 10, pp. 917–920, 2009.
[288] J. Zhao, K. Lu, and X. He, “Locality sensitive semi-supervised feature selection,” Neurocomputing, vol. 71, no. 10–12, pp. 1842–1849, 2008.
[289] W. Zhao, Y. Wang, and D. Li, “A dynamic feature selection method based on combination of GA with K-means,” in 2nd International Conference on Industrial Mechatronics and Automation, vol. 2, 2010, pp. 271–274.
[290] L. Zheng, R. Diao, and Q. Shen, “Efficient feature selection using a self-adjusting harmony search algorithm,” in Proceedings of the 13th UK Workshop on Computational Intelligence, 2013.
[291] Z. Zheng, “Feature selection for text categorization on imbalanced data,” ACM SIGKDD Explorations Newsletter, vol. 6, 2004.
[292] S.-M. Zhou and J. Q. Gan, “Constructing accurate and parsimonious fuzzy models with distinguishable fuzzy sets based on an entropy measure,” Fuzzy Sets and Systems, vol. 157, no. 8, pp. 1057–1074, 2006.
[293] S.-M. Zhou and J. Gan, “Constructing L2-SVM-based fuzzy classifiers in high-dimensional space with automatic model selection and fuzzy rule ranking,” IEEE Trans. Fuzzy Syst., vol. 15, no. 3, pp. 398–409, 2007.
[294] S.-M. Zhou, J. Garibaldi, R. John, and F. Chiclana, “On constructing parsimonious type-2 fuzzy logic systems via influential rule selection,” IEEE Trans. Fuzzy Syst., vol. 17, no. 3, pp. 654–667, 2009.
[295] Z. Zhou, Ensemble Methods: Foundations and Algorithms, ser. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. Taylor & Francis, 2012.
[296] Z. Zhu and Y.-S. Ong, “Memetic algorithms for feature selection on microarray data,” in Advances in Neural Networks, ser. Lecture Notes in Computer Science, D. Liu, S. Fei, Z.-G. Hou, H. Zhang, and C. Sun, Eds. Springer Berlin Heidelberg, 2007, vol. 4491, pp. 1327–1335.
[297] Z. Zhu, Y.-S. Ong, and M. Dash, “Wrapper-filter feature selection algorithm using a memetic framework,” IEEE Trans. Syst., Man, Cybern. B, vol. 37, no. 1, pp. 70–76, 2007.