Feature Selection with Harmony Search
and its Applications
Ren Diao
Supervisors: Prof. Qiang Shen
Dr. Neil S. Mac Parthaláin
Ph.D. Thesis
Department of Computer Science
Institute of Mathematics, Physics and Computer Science
Aberystwyth University
February 6, 2014
Declaration and Statement
DECLARATION
This work has not previously been accepted in substance for any degree and is not
being concurrently submitted in candidature for any degree.
Signed ............................................................ (candidate)
Date ............................................................
STATEMENT 1
This thesis is the result of my own investigations, except where otherwise stated.
Where correction services1 have been used, the extent and nature of the correction
is clearly marked in a footnote(s).
Other sources are acknowledged by footnotes giving explicit references. A bibliogra-
phy is appended.
Signed ............................................................ (candidate)
Date ............................................................
STATEMENT 2
I hereby give consent for my thesis, if accepted, to be available for photocopying and
for inter-library loan, and for the title and summary to be made available to outside
organisations.
Signed ............................................................ (candidate)
Date ............................................................
1 This refers to the extent to which the text has been corrected by others.
Abstract
Feature selection is a term given to the problem of selecting important domain
attributes which are most predictive of a given outcome. Unlike other dimensionality
reduction methods, feature selection approaches seek to preserve the semantics of
the original data following reduction. Many strategies have been exploited for this
task in an effort to identify more compact and better quality feature subsets. A
number of group-based feature subset evaluation measures have been developed,
which have the ability to judge the quality of a given feature subset as a whole, rather
than assessing the qualities of individual features. Stochastic techniques have
also emerged, inspired by natural phenomena or social behaviour, allowing good
solutions to be discovered without resorting to exhaustive search.
In this thesis, a novel feature subset search algorithm termed “feature selection
with harmony search” is presented. The proposed approach utilises a recently
developed meta-heuristic, harmony search, which is inspired by the improvisation
process of music players. The proposed approach is general, and can be employed
in conjunction with many feature subset evaluation measures. The simplicity of
harmony search is exploited to reduce the overall complexity of the search process.
The stochastic nature of the resultant technique also allows the search process to
escape from local optima, while identifying multiple, distinctive candidate solutions.
Additional parameter control schemes are introduced to reduce the effort and impact
of static parameter configuration of harmony search; these are further combined
with iterative refinement, in order to enforce the discovery of more compact feature
subsets.
The flexibility of the proposed approach, and its powerful performance in selecting
multiple, good quality feature subsets lead to a number of further theoretical de-
velopments. These include the generation and reduction of feature subset-based
classifier ensembles; feature selection and adaptive classifier ensemble for dynamic
data; hybrid rule induction on the basis of fuzzy-rough set theory; and antecedent
selection for fuzzy rule interpolation. The resultant techniques are experimentally
evaluated using data sets drawn from real-world problem domains, and systemati-
cally compared with leading methodologies in their respective areas, demonstrating
the efficacy and competitive performance of the present work.
Acknowledgements
I would like to express my utmost gratitude to my supervisors: Prof. Qiang Shen
and Dr. Neil S. Mac Parthaláin, for their motivation, enthusiasm, and guidance,
which have been essential at all stages of my research.
I am grateful to Dr. Richard Jensen for his constant inspiration and for contributing
to the original impetus for this research.
I am also very thankful to Prof. Christopher John Price for his patience and support.
My sincere gratitude goes to my entire family: my parents Chunli Diao and Liming
Zhai, my dear wife Zhuoke Li, and her parents Hongzhu Li and Yulan Sun. The
completion of this Ph.D. would not have been possible without their kind support
and encouragement.
I would like to thank all my fellow researchers in the Advanced Reasoning Group,
both past and present, for the stimulating discussions, insight, and helpful advice. I
am especially grateful to Shangzhu Jin, Pan Su, Nitin Kumar Naik, and Ling Zheng
for their collaborative efforts.
I would like to express my deepest appreciation to the Department of Computer Sci-
ence and the Faculty of Science at Aberystwyth University, to the IEEE Computational
Intelligence Society, to the British Machine Vision Association, and to Plurabelle
Books, Cambridge, for their generous financial support.
My sincere gratitude goes to the anonymous reviewers, journal editors, conference
organisers and attendees involved (either directly or indirectly) with my submitted
works, for their encouragement and valuable input in refining my ideas.
I am extremely grateful to all of the academic, administrative, technical, and support
staff at the Department of Computer Science, Aberystwyth University, for their kind
assistance throughout my entire study.
I would also like to thank all my friends, especially the Chinese student community
in Aberystwyth for their continuous support.
Contents
Contents i
List of Figures v
List of Tables vii
List of Algorithms ix
1 Introduction 1
1.1 Feature Selection (FS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 FS with Harmony Search (HSFS) . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Structure of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 12
2.1 FS Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Filter-Based FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Wrapper-Based, Hybrid, and Embedded FS . . . . . . . . . . . 23
2.2 FS Search Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Deterministic Algorithms . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Stochastic and Nature-Inspired Approaches . . . . . . . . . . . 25
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Framework for HSFS and its Improvements 47
3.1 Principles of HS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.1 Key Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1.2 Parameters of HS . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.3 Iterative Process of HS . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Initial Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.1 Binary-Valued Representation . . . . . . . . . . . . . . . . . . . . 53
3.2.2 Iteration Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.3 Tunable Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Algorithm for HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Mapping of Key Notions . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Work Flow of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Additional Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4.1 Parameter Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4.2 Iterative Refinement . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 66
3.5.1 Evaluation of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5.2 Evaluation of Additional Improvements . . . . . . . . . . . . . . 75
3.5.3 Iterative Refinement of Fuzzy-Rough Reducts . . . . . . . . . . 80
3.5.4 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4 HSFS for Feature Subset Ensemble 84
4.1 Occurrence Coefficient-Based Ensemble . . . . . . . . . . . . . . . . . . 85
4.1.1 Ensemble Construction Methods . . . . . . . . . . . . . . . . . . 87
4.1.2 Decision Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.1.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 91
4.2.1 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2.2 Comparison of Ensemble Generation Methods . . . . . . . . . . 97
4.2.3 Scalability Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 HSFS for Classifier Ensemble Reduction 103
5.1 Framework for Classifier Ensemble Reduction . . . . . . . . . . . . . . 105
5.1.1 Base Classifier Pool Generation . . . . . . . . . . . . . . . . . . . 105
5.1.2 Classifier Decision Transformation . . . . . . . . . . . . . . . . . 107
5.1.3 FS on Transformed Data set . . . . . . . . . . . . . . . . . . . . . 108
5.1.4 Ensemble Decision Aggregation . . . . . . . . . . . . . . . . . . . 108
5.1.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.1 Reduction Performance for Decision Tree-Based Ensembles . . 111
5.2.2 Alternative Ensemble Construction Approaches . . . . . . . . . 113
5.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6 HSFS for Dynamic Data 117
6.1 Dynamic FS Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.1 Feature Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.2 Feature Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.1.3 Instance Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.1.4 Instance Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2 Dynamic HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2.1 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2.2 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.3 Adaptive Feature Subset Ensemble . . . . . . . . . . . . . . . . . . . . . 128
6.4 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 130
6.4.1 Results for Basic Dynamic FS Scenarios . . . . . . . . . . . . . . 131
6.4.2 Results for Combined Dynamic FS Scenarios . . . . . . . . . . . 135
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7 HSFS for Hybrid Rule Induction 140
7.1 Background of Rule Induction . . . . . . . . . . . . . . . . . . . . . . . . 141
7.1.1 Crisp Rough Rule Induction . . . . . . . . . . . . . . . . . . . . . 141
7.1.2 Hybrid Fuzzy-Rough Rule Induction . . . . . . . . . . . . . . . . 144
7.2 HSFS for Hybrid Rule Induction . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.1 Mapping of Key Notions . . . . . . . . . . . . . . . . . . . . . . . 149
7.2.2 HarmonyRules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2.3 Rule Adjustment Mechanisms . . . . . . . . . . . . . . . . . . . . 153
7.3 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 154
7.3.1 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.3.2 Comparison of Rule Cardinalities . . . . . . . . . . . . . . . . . . 156
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8 HSFS for Fuzzy Rule Interpolation 159
8.1 Background of Fuzzy Rule Interpolation (FRI) . . . . . . . . . . . . . . 160
8.1.1 Transformation-Based FRI . . . . . . . . . . . . . . . . . . . . . . 161
8.1.2 Backward FRI (B-FRI) . . . . . . . . . . . . . . . . . . . . . . . . 165
8.2 Antecedent Significance-Based FRI . . . . . . . . . . . . . . . . . . . . . 166
8.2.1 From FS to Antecedent Selection . . . . . . . . . . . . . . . . . . 167
8.2.2 Weighted Aggregation of Antecedent Significance . . . . . . . 168
8.2.3 Use of Antecedent Significance in B-FRI . . . . . . . . . . . . . 171
8.3 Experimentation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 171
8.3.1 Application Example . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.3.2 Systematic Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 175
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
9 Conclusion 178
9.1 Summary of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.2.1 Short Term Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.2.2 Long Term Developments . . . . . . . . . . . . . . . . . . . . . . 183
Appendix A Publications Arising from the Thesis 186
A.1 Journal Articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
A.2 Book Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.3 Conference Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Appendix B Data Sets Employed in the Thesis 189
Appendix C List of Acronyms 195
Bibliography 197
List of Figures
1.1 Process of knowledge discovery . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Selection of real-world applications of FS . . . . . . . . . . . . . . . . . . . 3
1.3 Distribution of HS applications by discipline areas . . . . . . . . . . . . . . 5
1.4 Relationships between thesis chapters . . . . . . . . . . . . . . . . . . . . . 7
2.1 Components of FS process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Types of FS evaluation measure . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Basic concepts of rough set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Taxonomy of stochastic and nature-inspired approaches . . . . . . . . . . 26
3.1 Key notions of HS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Iteration steps of HS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Key notions of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4 Work flow of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Improvisation process of HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Iterative fuzzy-rough reduct refinement for the arrhy data set . . . . . . 80
3.7 Iterative fuzzy-rough reduct refinement for the web data set . . . . . . . . 81
4.1 Flow chart for single subset quality evaluator with stochastic search . . . 87
4.2 Flow chart for single subset quality evaluator with partitioned training data 88
4.3 Flow chart for mixture of subset quality evaluators . . . . . . . . . . . . . 89
4.4 Averaged classification accuracies of the FSE implementations . . . . . . 99
4.5 Averaged OC-FSE classification accuracies and subset sizes . . . . . . . . 101
5.1 Overview of CER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Mixed classifiers using Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 Mixed classifiers using Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1 Procedures of D-HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.2 Generic framework for A-FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.3 Results of dynamic FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.1 Feature subset cardinality distribution of HarmonyRules and QuickRules . 157
8.1 Procedures of T-FRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.2 Antecedent selection procedures . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3 Alternative rule selection using weighted distance calculation . . . . . . . 170
List of Tables
2.1 Notions used in pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Binary encoded feature subsets . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Concept mapping from HS to FS . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Feature subsets encoded using integer-valued scheme . . . . . . . . . . . . 57
3.4 Parameter settings in different search stages . . . . . . . . . . . . . . . . . . 64
3.5 Data set information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6 FS results using CFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.7 FS results using PCFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.8 FS results using FRFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.9 C4.5 and NB classification accuracies using the feature subsets found
with the respective search algorithms via CFS . . . . . . . . . . . . . . . . . 73
3.10 C4.5 and NB classification accuracies using the feature subsets found
with the respective search algorithms via PCFS . . . . . . . . . . . . . . . . 74
3.11 C4.5 and NB classification accuracies using the feature subsets found
with the respective search algorithms via FRFS . . . . . . . . . . . . . . . . 75
3.12 Comparison of multiple HS-IR reducts versus single HC reduct . . . . . . 76
3.13 Comparison of parameter control rules using CFS . . . . . . . . . . . . . . 77
3.14 Comparison of parameter control rules using FRFS . . . . . . . . . . . . . 77
3.15 Parameter settings for demonstration of parameter control and iterative
refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.16 Comparison of proposed HS improvements using feature subsets selected
by CFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1 Ordinary FSE of 5 feature subsets with 8 features . . . . . . . . . . . . . . 86
4.2 An example of OC threshold-based aggregation with 3 possible classes . 90
4.3 Data sets used for OC-FSE experimentation . . . . . . . . . . . . . . . . . . 92
4.4 Classification accuracy result of stochastic search implementation . . . . 95
4.5 Classification accuracy result of data partition-based implementation . . 96
4.6 Classification accuracy result of mixture of algorithms . . . . . . . . . . . 97
4.7 Summary of results of the three FSE implementations . . . . . . . . . . . . 98
5.1 Classifier ensemble decision matrix . . . . . . . . . . . . . . . . . . . . . . . 108
5.2 HS parameter settings and data set information . . . . . . . . . . . . . . . 110
5.3 Comparison on C4.5 classification accuracy . . . . . . . . . . . . . . . . . . 112
6.1 Summary of the data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2 Feature addition results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3 Feature removal results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.4 Instance addition results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.5 Instance removal results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.6 A-FSE accuracy comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.1 Example data set for rough set rule induction . . . . . . . . . . . . . . . . . 142
7.2 Example data set for rough set rule induction . . . . . . . . . . . . . . . . . 143
7.3 Example data set for QuickRules . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.4 Mapping of key notions from HS to rule induction . . . . . . . . . . . . . . 149
7.5 Rule base improvisation example . . . . . . . . . . . . . . . . . . . . . . . . 152
7.6 Data set information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.7 Parameters settings where * denotes the dynamically adjusted values . . 154
7.8 Classification accuracy of HarmonyRules and QuickRules . . . . . . . . . . 155
7.9 Classification accuracy of other classifiers tested using 10-FCV . . . . . . 156
7.10 HarmonyRules vs. QuickRules in terms of rule cardinalities . . . . . . . . . 157
8.1 Example linguistic rules for terrorist bombing prediction . . . . . . . . . . 172
8.2 Antecedent significance values determined by CFS and FRFS . . . . . . . 172
8.3 Example observation and the closest rules selected by standard and
weighted T-FRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.4 Example observation and the closest rules selected by standard and
weighted B-FRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.5 Evaluation of proposed approaches for standard FRI . . . . . . . . . . . . . 176
8.6 Evaluation of proposed approaches for B-FRI . . . . . . . . . . . . . . . . . 177
B.1 Information of data sets used in the thesis . . . . . . . . . . . . . . . . . . . 189
List of Algorithms
2.1.1 Fuzzy-rough QuickReduct (A,Z) . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Move Bpi towards Bpj by a distance v . . . . . . . . . . . . . . . 28
2.2.2 Update current best solution B . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3 Local search (B) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.5 Memetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.6 Clonal Selection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.7 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.8 Tabu Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.9 Artificial Bee Colony Optimisation . . . . . . . . . . . . . . . . . . . . 40
2.2.10 Ant Colony Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2.11 Firefly Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2.12 Particle Swarm Optimisation . . . . . . . . . . . . . . . . . . . . . . . 45
3.1.1 Improvisation process of original HS . . . . . . . . . . . . . . . . . . 52
3.4.1 Iterative refinement procedure . . . . . . . . . . . . . . . . . . . . . . 65
3.4.2 Musician size adjustment via binary search . . . . . . . . . . . . . . 65
5.1.1 Bagging algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1.2 Random Subspace algorithm . . . . . . . . . . . . . . . . . . . . . . . 107
6.1.1 Dynamic FRFS for Feature Addition . . . . . . . . . . . . . . . . . . . 120
6.1.2 Dynamic FRFS for Feature Removal . . . . . . . . . . . . . . . . . . . 121
6.1.3 Dynamic FRFS for Instance Addition . . . . . . . . . . . . . . . . . . 122
6.1.4 Dynamic FRFS for Instance Removal . . . . . . . . . . . . . . . . . . 123
6.2.1 Pseudocode of D-HSFS . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.2.2 Sub-routine adapt (Ak+1, Xk+1) . . . . . . . . . . . . . . . . . . . . . . 126
6.3.1 A-FSE implemented using D-HSFS . . . . . . . . . . . . . . . . . . . 129
7.1.1 Work flow of QuickRules . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.1.2 Subroutine check (B, RB x , Rz x) . . . . . . . . . . . . . . . . . . . . . 146
7.2.1 HarmonyRules initialisation . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2.2 HarmonyRules iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Chapter 1
Introduction
Nowadays, data is being collected at a staggering pace in almost every field
imaginable. Despite the countless advances in computer technology, it is still
very challenging to store, maintain, and process data at the same speed as it is being
gathered. As a result, only a small fraction of information may be analysed to any
advantage, and therefore, there is an increasing demand for automated, efficient,
and scalable means to assist humans in extracting useful knowledge from these
dramatically expanding mountains of data.
Knowledge discovery from data, as illustrated in Fig. 1.1, is a broad subject with
many alternative names and sub-areas, including data mining [95], information
harvesting [183], knowledge extraction [258], and pattern discovery [44]. Fun-
damentally speaking, knowledge discovery concerns methods or techniques that
attempt to make sense of (raw or low-level) data, which may be difficult for humans
to directly interpret [168]. The goal of knowledge discovery is to build compact,
descriptive, or predictive models that are humanly comprehensible.
As the volume of data grows rapidly, the traditional, manual knowledge discovery
process becomes increasingly time-consuming and expensive [75]. This is especially
the case, for example, for problem domains such as medical image analysis [171]
(seeking to reveal, diagnose, or examine disease, or to study the human anatomy
and physiology), and forensic investigation [125] (analysing ballistics, fingerprints,
toxicology, and body identification). The classic approach to data analysis also relies
heavily on the opinions of domain experts, who must have a detailed and intricate
understanding of the problem at hand. Such opinions are often subjective and/or
inconsistent between different individuals. More importantly, data in its present
form may contain a large number of objects and descriptive tags (features), which
are impractical (if not impossible) in most cases for human beings to analyse.

Figure 1.1: Process of knowledge discovery
High dimensional data sets create problems even for automated systems. The
computational complexity and search space are often increased exponentially due to
high problem dimensionality (i.e., a large number of domain features). Moreover,
the naïve assumption of “more features = more knowledge” during data collection
generally leads to a problem known as the curse of dimensionality [16]. This issue
occurs when training data is not collected at a rate proportional to the increasing
number of features. This is a frustrating issue for many machine
learning methods for knowledge discovery. The abundance of features may also
cause an induction algorithm to identify patterns that are in fact spurious, because
of noise [124].
Dimensionality reduction techniques [154] present a type of approach that at-
tempts to reduce the overall dimensionality of the data. Several of these work by
transforming the underlying meanings of the features, whilst semantics-preserving
mechanisms maintain the original features. Feature selection approaches, being
the main focus of this thesis, fall into the latter category [53, 164]. These methods
search for and identify a subset of features using a dedicated evaluation measure,
and are particularly beneficial for knowledge discovery tasks, as they preserve the
human interpretability of the original data and the resultant, discovered knowledge.
1.1 Feature Selection (FS)
The main aim of feature selection (FS) is to discover a minimal feature subset from
a problem domain while retaining a suitably high accuracy (or information content)
in representing the original data [53]. When analysing data that has a very large
number of features [270], it is difficult to identify and extract patterns or rules due
to the high inter-dependency amongst individual features, or the complex behaviour
of combined features. Techniques to perform tasks such as text processing, data
classification and systems control [171, 184, 222, 228] can benefit greatly from
FS, which directly addresses the problem of high dimensionality, since the noisy,
irrelevant, redundant or misleading features may now be removed [124]. FS is
pervasive, in the sense that it is not restricted to being purely a type of data mining
technique. This characteristic is reflected by its use in a wide range of real-world
applications [164, 168]; a few example areas are illustrated in Fig. 1.2.
Figure 1.2: Selection of real-world applications of FS
In the context of FS, an information system generally consists of a fixed number of
objects, and each object is described by a set of features. Features can be either
qualitative (discrete-valued) or quantitative (real-valued). For a given data set with
n features, the task of FS can be seen as a search for the “optimal” subset of features
through the competing 2^n candidates. In general, optimality is subjective depending
on the problem at hand. A subset that is selected as optimal using one particular
evaluation function may not be equivalent to that selected by another. Various
techniques [147] have been developed in the literature to judge the quality of the
discovered feature subsets, several of which rank the features based on a certain
importance measure, e.g., information gain [162], chi-square [291], rough set and
fuzzy-rough set-based dependency [116, 123], and symmetrical uncertainty [219].
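As a minimal illustration of the individual feature-ranking idea, the sketch below scores discrete features by information gain, one of the importance measures listed above. The toy data and function names are purely illustrative, not taken from the thesis.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after partitioning by a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for fv, l in zip(feature_values, labels) if fv == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: one informative feature, one irrelevant one
labels      = ['yes', 'yes', 'no', 'no']
informative = ['a', 'a', 'b', 'b']   # perfectly predicts the label
irrelevant  = ['a', 'b', 'a', 'b']   # independent of the label

print(information_gain(informative, labels))  # 1.0
print(information_gain(irrelevant, labels))   # 0.0
```

Ranking every feature by such a score and keeping the top few is exactly the individual-feature strategy that the group-based measures discussed next seek to improve upon.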
Recent trends in developing FS methods focus on evaluating a feature subset as
a whole, forming an alternative type of approach to the aforementioned. Popular
methods include the group-based fuzzy-rough FS (FRFS) [126, 172], probabilistic
consistency-based FS (PCFS) [52], and correlation-based FS (CFS) [93]. These
techniques (together with the individual feature-based methods) are often collectively
classified as the filter-based techniques. They are typically used as a preprocessing
step, and are independent of any learning algorithm that may be subsequently
employed. In contrast, wrapper-based [107, 144] and also hybrid algorithms [297]
are used in conjunction with a learning or data mining algorithm, which is employed
in place of an evaluation metric as used in the filter-based approach.
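To make the filter/wrapper contrast concrete, the following sketch scores a feature subset in wrapper style: the score is the leave-one-out accuracy of a learner trained on just those features. The 1-nearest-neighbour learner and the toy data are stand-in assumptions for illustration, not methods used in the thesis.

```python
def knn1_accuracy(rows, labels, subset):
    """Wrapper-style subset score: leave-one-out accuracy of a 1-nearest-
    neighbour classifier restricted to the chosen feature indices."""
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in subset)
    correct = 0
    for i, row in enumerate(rows):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: dist(row, rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

# Toy data: feature 0 separates the two classes, feature 1 is noise
rows = [(0.0, 3.1), (0.1, 0.2), (1.0, 3.0), (0.9, 0.1)]
labels = ['a', 'a', 'b', 'b']
print(knn1_accuracy(rows, labels, subset=[0]))  # 1.0
print(knn1_accuracy(rows, labels, subset=[1]))  # 0.0
```

A filter measure would instead score the subset from the data alone, without ever invoking a classifier, which is why filters are cheaper but wrappers are tailored to the learner that will eventually be used.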
1.2 FS with Harmony Search (HSFS)
Independent of the learning mechanism, a common issue that all FS methods need to
address is how to search for the “optimal” feature subsets. To this end, an exhaustive
method may be used; however, it is often impractical for most data sets. Alternatively,
hill-climbing-based approaches are exploited where features are added or removed
one at a time, often in a greedy fashion, until there is no further improvement in the
current candidate solution. Although generally fast to converge, these methods may
lead to the discovery of sub-optimal subsets (both in terms of the evaluation score
and the size of the selected feature subset) [62, 164]. To avoid such shortcomings,
nature-inspired heuristic strategies such as genetic algorithms [153, 266], genetic
programming [187], simulated annealing [69], and particle swarm optimisation
[261] are utilised with varying degrees of success. This thesis proposes a new FS
search strategy based on a recently developed search algorithm: Harmony Search
(HS) [84, 155].
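The hill-climbing strategy described above can be sketched as a greedy forward search; the merit function here is a hypothetical stand-in for any group-based subset evaluation measure, and illustrates how such a search can terminate at a local optimum.

```python
def forward_selection(features, evaluate):
    """Greedy hill-climbing FS: repeatedly add the single feature that most
    improves the subset score, stopping when no addition helps."""
    selected, best_score = [], float('-inf')
    remaining = list(features)
    while remaining:
        score, f = max((evaluate(selected + [x]), x) for x in remaining)
        if score <= best_score:      # no further improvement: terminate
            break
        best_score = score
        selected.append(f)
        remaining.remove(f)
    return selected

# Hypothetical merit: rewards features 'a' and 'c', penalises subset size
toy_merit = lambda s: len({'a', 'c'} & set(s)) - 0.1 * len(s)
print(forward_selection(['a', 'b', 'c'], toy_merit))  # selects 'a' and 'c'
```

Because each step commits to the best single addition, the search never revisits earlier choices; stochastic strategies such as HS avoid this limitation by sampling many candidate subsets in parallel.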
HS is a meta-heuristic algorithm inspired by the social behaviour of music
players. It mimics the improvisation process of musicians, during which each
musician plays a note in search of the best overall harmony. HS has been very
successful in a wide variety of engineering optimisation problems [77, 156, 239,
255, 287] and machine learning tasks [50, 174, 175, 179, 212]. Fig. 1.3 provided in
[175] gives a breakdown by discipline areas. It has demonstrated several advantages
over traditional optimisation techniques: it requires only limited mathematical
sophistication, and it is not sensitive to the initial value settings.
Figure 1.3: Distribution of HS applications by discipline areas
The original HS technique has been improved by methods that modify its pitch
adjustment rate and bandwidth with regard to the underlying iterative computational
process [173]. Also, the initial, statically valued bandwidth [84] may be replaced with a
fret-based version, making the algorithm more adaptive to the variance in variable range,
and more suitable for real-valued problems. Note that fret is a musical term that
refers to a raised element on the neck of a stringed instrument, such as the metal
strips inserted into the fingerboard of a guitar. Work has also been carried out in
the literature to analyse the evolution of the population-variance over successive
generations in HS [49]. The HS algorithm therefore has a novel stochastic derivative
(for discrete variables) based on the musicians' experience, rather than a gradient
(for continuous variables) as used in differential calculus.
The proposed FS with HS (HSFS) technique in this thesis aims to tackle the
challenges of finding better quality feature subsets. It addresses the weakness of
conventional deterministic algorithms, which may return locally optimal solutions.
Also, it employs a more expressive, integer-valued feature subset representation,
as opposed to the binary encoding scheme adopted by most other nature-inspired
stochastic methods. The flexibility provided by the integer-valued representation
allows stochastic mechanisms of HS to better explore the solution space, and to
identify more compact feature subsets. The resultant approach inherits the simplicity
of HS, and is capable of identifying multiple (distinctive) solutions of good quality.
A number of additional improvements to HSFS have also been developed in this
work, in order to further extend the capabilities of the proposed method, thus further
improving the performance of HSFS for high dimensional data sets:
• The original HS method employs a static parameter scheme, which requires
a substantial amount of prior effort in order to determine a suitable setting
for a new problem. The parameters themselves are also socially inspired and
therefore cannot be straightforwardly devised on the basis of the properties of
the problem (data set) itself.
• The HSFS framework enables the size of a given candidate feature subset to
be controlled by configuring the total number of musicians. A mechanism that
iteratively refines this parameter is also derived, which allows the intelligent
discovery of compact and good quality solutions.
The flexibility of the improved HSFS algorithm, allied to its powerful performance
in identifying multiple, quality feature subsets, has inspired a number of theoretical
applications. These include ensemble learning, FS for dynamic data, fuzzy-rough
rule induction, and fuzzy rule interpolation; and are summarised in Section 1.3
below. Although these methods are mostly evaluated using benchmark problems,
the data sets employed are drawn from real-world, practical applications, including
handwritten character recognition [32], mammography image analysis [170], water
quality prediction [15], and many others. Detailed descriptions of these data sets
are given in Appendix B.
1.3 Structure of Thesis
This section outlines the structure of the remainder of this thesis. Fig. 1.4 illustrates
the relationships between the individual chapters (other than the introduction and
the conclusion). The direct dependencies between the chapters are denoted using
solid arrows, where conceptual linkages, such as that between Chapters 2 and 3, and
that between Chapters 7 and 8, are symbolised using dashed lines. A comprehensive
list of publications arising from the work of the thesis is provided in Appendix A.
Figure 1.4: Relationships between thesis chapters
Chapter 2: Background
This chapter provides a background introduction to FS, which is organised into
two core parts: evaluation measures and search strategies. A selection of popular
approaches developed for FS is discussed, including several group-based, filter eval-
uation metrics [52, 93, 126] that have been developed recently. Such group-based
methods judge the quality of a given feature subset as a whole, rather than assessing
the qualities of features individually. This chapter also provides a comprehensive
review of the most recent methods for FS that originated from nature-inspired meta-
heuristics, where the more classic approaches such as genetic algorithms and ant
colony optimisation are also included for comparison. These techniques allow good
quality solutions to be discovered without resorting to exhaustive search.
A good number of the reviewed methodologies have been significantly modified in
the present work, in order to systematically support generic subset-based evaluators
and higher dimensional problems. Such modifications are carried out because the
original studies are either exclusively tailored to certain subset evaluators (e.g., rough
set-based methods), or limited to specific problem domains. A total of ten different
algorithms are examined, and their mechanisms and work flows are summarised in
a unified manner. The survey of nature-inspired FS search methods presented in the
chapter is under review for journal publication.
Chapter 3: Framework for HSFS and its Improvements
This chapter explains the key contribution of the thesis: the HSFS algorithm, which
is a novel FS approach based on HS. For completeness, an outline of HS and its
key notions are first provided. The HSFS algorithm is a general approach that can
be used in conjunction with many subset evaluation techniques. The simplicity
of HS is exploited in order to reduce the overall complexity of the search process.
The proposed approach is able to escape from local optimal solutions, and identify
multiple solutions due to the stochastic nature of HS. The initial development of
HSFS using binary-valued feature subset representation is also described in this
chapter. Additional parameter control schemes are introduced to reduce the effort
and impact of parameter configuration. These can be further combined with the
iterative refinement strategy, tailored to ensuring the discovery of quality subsets,
and to improving search efficiency.
This chapter also presents initial experimental results that demonstrate the FS
performance of HSFS. The nature-inspired search approaches reviewed in Chapter 2
are systematically tested, and compared to HSFS, using high dimensional, real-valued
benchmark data sets. The selected feature subsets are used to build classification
models, in an effort to further validate their efficacy. The proposed modifications to
the base HSFS method are also individually validated, with results (accompanied by
in-depth studies) reported in dedicated sections.
The base HSFS algorithm and parts thereof have been published initially in [60], with a further and more in-depth version in [62]. The proposed improvements have
been published in [59, 290], whilst an extended paper on the topic of self-adjusting
HSFS is under review for journal publication. Note that the study in this chapter also
includes results of experiments conducted for the survey paper under review.
Chapter 4: HSFS for Feature Subset Ensemble
Classifier ensembles constitute one of the main research directions in machine learn-
ing and data mining. The use of multiple classifiers generally allows better predictive
performance than that achievable with a single model. Feature subset ensembles in
particular, aim to combine the decisions of different base FS components, thereby
producing more robust results for the subsequent learning tasks. This chapter details
a new feature subset ensemble approach that is based on the analysis of feature
occurrences. Three base component construction methods are discussed, generalis-
ing the ensemble concept so that it can be used in conjunction with various subset
evaluation techniques and search algorithms.
A novel occurrence coefficient threshold-based classifier decision aggregation
method is also introduced, which works closely and efficiently with the proposed
ensemble approach. HSFS is employed as the main stochastic search algorithm, in
order to supply the essential base feature subsets for the proposed method to work
with. Results of experimental comparative studies carried out on real-world data sets
are also reported, in order to highlight the benefits of the work. The developments
presented in this chapter have been published in [227]. A paper proposing a refined
technique is currently under review for journal publication.
Chapter 5: HSFS for Classifier Ensemble Reduction
Several approaches exist in the literature that provide means to effectively construct
and aggregate diverse classifier ensembles, including the occurrence-coefficient based
feature subset ensemble as introduced in Chapter 4. However, these ensemble systems
potentially contain redundant members that, if removed, may further increase group
diversity and produce better feature subsets. Smaller ensembles also relax the
memory and storage requirements, reducing the run-time overheads otherwise
required, while improving the overall efficiency.
This chapter extends the existing ideas developed for FS problems in order to
support classifier ensemble reduction, by transforming group predictions into training
samples, and treating classifiers as features. Also, HSFS is used to select a reduced
subset of such artificial features, while attempting to maximise the feature subset
evaluation. The resulting technique is systematically evaluated using high dimen-
sional and large sized benchmark data sets, demonstrating superior classification
performance against both original, unreduced ensembles and randomly formed sub-
sets. The work in this chapter has been published initially in [61], and a generalised
approach is to appear in [57].
Chapter 6: HSFS for Dynamic Data
Most of the approaches developed for FS and classifier ensembles in the literature,
including those proposed in Chapters 4 to 5, focus on the analysis of data from a static
pool of training instances with a fixed set of features. In practice, however, knowledge
may be gradually refined, and information regarding the problem domain may be
actively added and/or removed whilst training is taking place. In this chapter,
a dynamic FS technique is proposed that makes use of the existing subset-based
evaluation methods to further extend the HSFS algorithm. The concept of adaptive
feature subset ensembles is examined, improving upon the idea developed in Chapter
4, with the resulting technique capable of dynamically refining the candidate feature
subsets, and their associated classifier ensembles. The efficacy of the presented work
is verified through systematic simulated experimentation using real-world benchmark
data sets. A preliminary investigation of the basic dynamic FS scenarios has been
published in [58]. A paper concerning the dynamic HSFS algorithm and the
adaptive ensemble approach is currently under review for journal publication.
Chapter 7: HSFS for Hybrid Fuzzy-Rough Rule Induction
Automated generation of feature pattern-based production (or if-then) rules is essential to the success of many intelligent pattern classifiers, especially when their
inference results are expected to be directly human-comprehensible. Fuzzy and
rough set theory [88, 166] have been applied with much success to this area as
well as to FS [123, 242]. Since both FS and if-then rule learning using rough set
theory involve the processing of equivalence classes for their successful operation,
it is natural to combine these into a single integrated mechanism that generates
concise, meaningful and accurate rules. In particular, this chapter explains how HSFS
may be used together with fuzzy-rough rule induction techniques [42, 262] (the
latter being one of the most popular and best-tested existing methods built on the
initial notion of rough sets). Here, HS is adopted to simultaneously optimise multiple
objectives, so that the resulting rule base, whilst fully covering the knowledge being
described, remains compact and concise. The efficacy of the proposed algorithm is
experimentally evaluated against leading classifiers, including fuzzy and rough rule
induction techniques. The work in this chapter has been published in [63].
Chapter 8: HSFS for Fuzzy Rule Interpolation
Fuzzy Rule Interpolation [142, 143] is of particular significance for reasoning in
the presence of insufficient knowledge or sparse rule bases. This chapter utilises
HSFS to perform dimensionality reduction in such reasoning systems that involve
high dimensional rules. The techniques derived for FS are applied almost directly to
a converted set of rules with crisp antecedent values (the data set), such that the
significance of individual rule antecedents may be identified, and an informative
subset of antecedents may be discovered. The additional information obtained via
FS is equally beneficial for conventional fuzzy reasoning, and for a newly identified
research area concerning backward fuzzy rule interpolation [128, 131]. A paper
written on the basis of this chapter is currently under review for conference publica-
tion, whilst the outcomes regarding backward fuzzy rule interpolation itself have
been published in [128, 130, 131] (with [131] receiving one of the two best paper
awards at the 2012 IEEE International Conference on Fuzzy Systems). A substantially
extended study of this work has formed a journal publication [129], but is largely
beyond the scope of this thesis.
Chapter 9: Conclusion
This chapter summarises the key contributions made by the thesis, together with
a discussion of topics which form the basis for future research. Both immediately
achievable tasks and long-term projects are considered.
Appendices
Appendix A lists the publications arising from the work presented in this thesis,
containing both published papers, and those currently under review for journal
publication. Appendix B provides information regarding the benchmark data sets em-
ployed in the thesis, which are mostly drawn from real problem scenarios. Appendix
C summarises the acronyms employed throughout this thesis.
Chapter 2
Background
As the amount of available data increases, so too does the need for effective
dimensionality reduction. FS methods aim to find minimal or close-to-minimal
feature subsets, whilst preserving the semantics of the underlying data, thus making
the reduced data transparent to human scrutiny. Generally speaking, FS involves
two computational processes, as shown in Fig. 2.1: 1) a feature subset evaluation
process, and 2) a feature subset search process. A number of studies in the literature
[164] further decompose FS into smaller components, such as feature subset gener-
ation (which may be treated as a part of the search mechanism), and termination
criterion (which may be triggered by the evaluation process itself, or controlled by
the search strategy). Note that exceptional cases [264] exist where the two parts are
indistinguishable or inseparable; a number of early techniques [52, 147] in the area
also treat FS as one integrated process. For ease of explanation and organisation,
the two-part decomposition (evaluate-and-search) is adopted in this thesis unless
otherwise stated.
The remainder of this chapter is structured as follows. The two core parts of
FS are introduced in Sections 2.1 and 2.2, respectively. Filter-based FS evaluation
techniques are of significant importance to the development of the present work, and
they are covered in Section 2.1.1 in detail. The main focus of this research lies with the
use of stochastic and nature-inspired FS search strategies. A comprehensive survey
of the relevant methods is given in Section 2.2.2. Finally, Section 2.3 summarises
the chapter.
Figure 2.1: Components of FS process
2.1 FS Evaluation Measures
An information system in the context of FS is a tuple ⟨X, Y⟩, where X is a non-empty
set of finite objects, also referred to as the universe of discourse; and Y is a non-empty,
finite set of features. For decision systems, Y = A ∪ Z, where A = {a1, . . . , a|A|} is
the set of input features (which may be either discrete- or real-valued), |A| denotes
the cardinality of A, and Z is the set of decision features. For a given data set
with |A| features, the task of FS can be seen as a search for one or more feature
subsets B ⊆ A that are "optimal" amongst the competing 2^|A| candidates.
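To give a concrete sense of this search space, the following minimal Python sketch (illustrative only; the function names and the toy evaluator are assumptions of this example, not part of the thesis) enumerates all 2^|A| − 1 non-empty candidate subsets exhaustively. This is infeasible beyond small |A|, which is precisely why heuristic search strategies are needed:

```python
from itertools import chain, combinations

def all_subsets(features):
    """Enumerate every non-empty subset of the feature set (2^|A| - 1 candidates)."""
    return chain.from_iterable(
        combinations(features, r) for r in range(1, len(features) + 1))

def exhaustive_fs(features, evaluate):
    """Return the best-scoring subset, preferring smaller subsets on ties."""
    best, best_score = None, float("-inf")
    for subset in all_subsets(features):
        score = evaluate(subset)
        if score > best_score or (score == best_score and len(subset) < len(best)):
            best, best_score = subset, score
    return set(best), best_score

# Toy evaluator: rewards presence of a2 and a4, penalises subset size.
features = ["a1", "a2", "a3", "a4"]
evaluate = lambda s: len({"a2", "a4"} & set(s)) - 0.1 * len(s)
best, score = exhaustive_fs(features, evaluate)
print(best, score)  # best subset is {'a2', 'a4'}, with score about 1.8
```

Even for this toy evaluator, 15 candidate subsets must be scored; for |A| = 50, over 10^15 candidates would exist.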
The concept of optimality for any given feature subset is twofold: 1) the quality,
in terms of how well it encapsulates the information contained within the original
data (with full set of features); and 2) the size, where more compact solutions are
often preferred due to the advantage of reducing the overall dimensionality. Note
that the term “quality” of a given feature subset and also, the term “information”
contained within data have been concretely defined in the literature, especially in the
area of information theory [225] in statistics. Example definitions include Schwarz
criterion [218], Akaike information criterion [6], mutual information [206], and
Pearson product-moment correlation coefficient [93]. Typically, they provide means
to measure the relative quality of a statistical model [181] for a given set of data,
so as to perform model selection amongst a finite set of models. However, in this
thesis, they are employed in a more general sense. Feature subset quality is indeed
often subjective, depending on the problem at hand and the metric employed to
perform the analysis. As such, a feature subset that is identified as “optimal” using
one particular evaluation function may not be equivalent to that selected by another.
Various methods have been developed in the literature in order to judge the
quality of discovered feature subsets, which are hereafter referred to as “feature
subset evaluation measures” or “feature subset evaluators” interchangeably. Such
measures generally focus on producing a numerical score f (B) for a given feature
subset B. Here f : B → R represents a subset evaluation function, which maps
a set of feature subsets onto the set of real numbers (feature subset evaluation
scores). In this thesis, normalised scores f (B) ∈ [0, 1], f (;) = 0, are assumed, where
higher scores indicate better quality feature subsets. Note that for any given data set,
multiple feature subsets may exist that are equally (or almost equally) optimal, when
judged by a subset evaluator, i.e., f (Bp) = f (Bq), or f (Bp)' f (Bq), Bp 6= Bp, where
Bp and Bp denote two arbitrary feature subsets. Based on the style of interaction with
the subsequent learning mechanism that make use of the selected feature subsets,
feature subset evaluation measures may be categorised into four different types:
filter-based, wrapper-based, hybrid, and embedded approaches, as illustrated in Fig.
2.2.
Figure 2.2: Types of FS evaluation measure
2.1.1 Filter-Based FS
Methods that perform FS in isolation of any learning algorithm are termed filter-based
approaches [164], where essentially irrelevant features are filtered out prior to using
a resultant feature subset for learning. Filter-based methods are general purpose,
pre-processing algorithms that are applicable to most problems, as they attempt to
find quality features (that may yield good learning outcomes) regardless of the choice
of the subsequent learning mechanism. The separation of FS methods from any such
mechanism also makes them more efficient, since it is no longer required to train and
test a classifier for the sole purpose of evaluating the quality of a given feature subset.
Filter-based approaches can be further divided into two sub-categories, according
to the ways in which they perform feature evaluation: 1) those based on individual
feature-based measures, and 2) those based upon group/subset-based measures.
2.1.1.1 Individual Feature-Based Measures
This type of technique calculates feature relevance individually. The final feature
subset is formed following a predefined rule, such as returning all features above
a given relevance threshold. Individual feature-based measures offer higher time
efficiency, since the relevance scores need only be computed once per feature, i.e.,
as many times as the total number of features. Several approaches belonging to
this category are also referred to
as feature ranking methods, as they essentially score and rank the importance of
individual features.
Two of the most commonly used techniques are symmetrical uncertainty [219] and Relief [147]. Both of these methods are good examples of this type of algorithm,
and are closely relevant to the methods to be described subsequently in this chapter,
including the correlation-based FS [93], and the consistency-based FS [52].
Symmetrical Uncertainty For a given nominal-valued feature au, a probabilistic
model may be formed by estimating the individual probabilities of its observed values
ω_{au,i} ∈ Ω_{au}, i = 1, . . . , |Ω_{au}|, on the basis of the training data, using entropy [1]:

$$H(a_u) = -\sum_{i=1}^{|\Omega_{a_u}|} p(\omega_{a_u,i}) \log_2 p(\omega_{a_u,i}) \tag{2.1}$$
If the observed values of au are in fact partitioned in relation to another feature
av, and the entropy of au with respect to the partitions induced by av is less than
the entropy of au prior to partitioning, then there is a relationship between the two
features av and au. The entropy of au after observing av is:

$$H(a_u \mid a_v) = -\sum_{j=1}^{|\Omega_{a_v}|} p(\omega_{a_v,j}) \sum_{i=1}^{|\Omega_{a_u}|} p(\omega_{a_u,i} \mid \omega_{a_v,j}) \log_2 p(\omega_{a_u,i} \mid \omega_{a_v,j}) \tag{2.2}$$
Information gain [162] or mutual information [206] may be defined on the basis of
the above equations, reflecting how much additional information about au is
provided by av:

$$\begin{aligned}\text{information gain}(a_u, a_v) &= H(a_u) - H(a_u \mid a_v)\\ &= H(a_v) - H(a_v \mid a_u)\\ &= H(a_u) + H(a_v) - H(a_u, a_v)\end{aligned} \tag{2.3}$$
Information gain is symmetrical in nature, making it suitable for measuring
inter-feature correlation. However, it is biased in favour of features with more
observed values. The symmetrical uncertainty measure [219] is introduced to
compensate for such bias. It also produces a normalised output in the range [0, 1]:

$$\text{symmetrical uncertainty}(a_u, a_v) = 2.0 \times \left[\frac{\text{information gain}(a_u, a_v)}{H(a_u) + H(a_v)}\right] \tag{2.4}$$
Relief Relief [147] and its later development ReliefF [146] are individual feature
weighting algorithms that, in principle, may be sensitive to feature interaction. Relief
attempts to approximate the following difference of probabilities, in order to obtain
the weight of a feature au:

$$\begin{aligned}\text{weight}_{a_u} = {} & p(\text{different value of } a_u \mid \text{nearest instance of different class})\\ & - p(\text{different value of } a_u \mid \text{nearest instance of same class})\end{aligned} \tag{2.5}$$

By removing the context sensitivity imposed by the "nearest instance" condition,
features may be treated as being independent of one another:

$$\begin{aligned}\text{Relief}(a_u) = {} & p(\text{different value of } a_u \mid \text{different class})\\ & - p(\text{different value of } a_u \mid \text{same class})\end{aligned} \tag{2.6}$$
More formally, the measure may be defined as:

$$\text{Relief}(a_u, z) = \frac{GI' \times \sum_{i=1}^{|\Omega_{a_u}|} p(\omega_{a_u,i})^2}{\left(1 - \sum_{j=1}^{|\Omega_z|} p(\omega_{z,j})^2\right) \sum_{j=1}^{|\Omega_z|} p(\omega_{z,j})^2}$$

where GI', as given in Eqn. 2.7, is the modified Gini-index [28] which, much like the
information gain of Eqn. 2.3, is also biased toward attributes with more observed
values:

$$GI' = \left[\sum_{j=1}^{|\Omega_z|} p(\omega_{z,j})\left(1 - p(\omega_{z,j})\right)\right] - \sum_{i=1}^{|\Omega_{a_u}|} \left(\frac{p(\omega_{a_u,i})^2}{\sum_{i=1}^{|\Omega_{a_u}|} p(\omega_{a_u,i})^2} \sum_{j=1}^{|\Omega_z|} \left(p(\omega_{z,j} \mid \omega_{a_u,i}) - p(\omega_{z,j} \mid \omega_{a_u,i})^2\right)\right) \tag{2.7}$$
To use Relief as a symmetrical measure for any given two features au and av, the
above measure is computed twice, where each feature is treated as the class attribute
in turn, and the average of the two measurements is taken as the final output, in
order to ensure symmetry:

$$\text{Relief}'(a_u, a_v) = \frac{\text{Relief}(a_u, a_v) + \text{Relief}(a_v, a_u)}{2} \tag{2.8}$$
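The Gini-based Relief measure of Eqn. 2.7 and its symmetrised form of Eqn. 2.8 can be estimated from empirical probabilities. The sketch below is an illustrative Python rendering (the names are introduced for illustration; it assumes the second attribute takes at least two distinct values, otherwise the denominator vanishes):

```python
from collections import Counter

def probs(values):
    """Empirical probability of each observed value."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def relief(a_u, z):
    """Gini-based Relief of feature a_u with respect to attribute z (Eqn. 2.7).
    Assumes z takes at least two distinct values."""
    p_a, p_z = probs(a_u), probs(z)
    sum_pa2 = sum(p * p for p in p_a.values())
    sum_pz2 = sum(p * p for p in p_z.values())
    # Modified Gini index GI'
    gini = sum(p * (1 - p) for p in p_z.values())
    for v, pv in p_a.items():
        cond = probs([c for a, c in zip(a_u, z) if a == v])  # p(z | a_u = v)
        gini -= (pv * pv / sum_pa2) * sum(q - q * q for q in cond.values())
    return gini * sum_pa2 / ((1 - sum_pz2) * sum_pz2)

def relief_sym(a_u, a_v):
    """Symmetrised Relief: average of the two directed measures (Eqn. 2.8)."""
    return (relief(a_u, a_v) + relief(a_v, a_u)) / 2.0

print(relief([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: perfectly predictive feature
print(relief([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0: independent feature
```

The two printed cases bracket the measure: a feature that fully determines the class scores 1, while an independent feature scores 0.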
2.1.1.2 Group-Based Measures
A strong assumption behind the use of evaluation measures on an individual feature
basis is that all features are entirely independent of each other. This may not be the
case for many practical problems, where the features are not measured or extracted
independently. Also, the interaction between features may need to be preserved for
the subsequent learning mechanism. For instance, consider two binary features, ap
and aq, which may appear irrelevant in determining the class labels in a learning
classifier when being evaluated individually. However, the combination of these
two features, e.g., ap ⊕ aq, may determine the class label, where ⊕ denotes binary
addition.
Group-based FS methods [52, 93, 126] do not rely on the evaluation of individual
features. Instead, the candidate feature subsets are evaluated as a whole. This
property is particularly beneficial in capturing the inter-feature-dependencies that
are common in real-world data. Group-based FS methods are the main focus of this
thesis, due to their desirable properties and easy integration with stochastic feature
subset search strategies. The principles behind three particular group-based filter
techniques are outline below, and are utilised extensively in the work proposed later.
Correlation-Based FS (CFS) The goal of correlation-based FS (CFS) [93] is that,
when utilising FS to remove irrelevant features, redundant information should be
eliminated as well. A feature is deemed to be redundant if there exists one or more
other features that it is highly correlated with. Note that the term “correlation” was
used in its general sense in [93]. Instead of referring specifically to classical linear
correlation, it was employed to refer to a broad class of statistical relationships
involving dependence, or a degree of predictability of one feature with respect to
another. The original work states that "a high quality feature subset is one that
contains features highly correlated with the class, yet uncorrelated with each other."
The correlation-based measure is defined as follows:

$$\text{correlation}(B, z) = \frac{\sum_{i=1}^{|B|} \text{correlation}(z, a_i)}{\sqrt{|B| + \sum_{i=1}^{|B|} \sum_{j=1, j \neq i}^{|B|} \text{correlation}(a_i, a_j)}} \tag{2.9}$$

where correlation(B, z) is the correlation between the feature subset B and the
decision variable (class) z, and correlation(z, ai) and correlation(ai, aj) are the
correlation between a given feature ai and the class, and the so-called
inter-correlation between any two features ai, aj ∈ B, respectively.
correlation(ai, aj) may be calculated via
individual feature-based measures, such as the ones introduced previously (symmet-
rical uncertainty [219] and Relief [147]). The minimum description length principle
[31, 97] is exploited to implement this.
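Once the pairwise correlations are available, Eqn. 2.9 is straightforward to compute. The Python sketch below (the function name and the toy correlation values are assumptions of this example, not from the thesis) illustrates how inter-feature redundancy lowers the merit score:

```python
from math import sqrt

def cfs_merit(subset, class_corr, feat_corr):
    """CFS merit of a subset (Eqn. 2.9): high feature-class correlation in the
    numerator, inter-feature correlation penalised in the denominator.
    class_corr maps feature -> correlation(z, a);
    feat_corr maps frozenset({a_i, a_j}) -> correlation(a_i, a_j)."""
    k = len(subset)
    num = sum(class_corr[a] for a in subset)
    # The double sum over j != i counts each unordered pair twice, as in Eqn. 2.9.
    inter = sum(feat_corr[frozenset((ai, aj))]
                for ai in subset for aj in subset if ai != aj)
    return num / sqrt(k + inter)

class_corr = {"a1": 0.8, "a2": 0.7, "a3": 0.75}
feat_corr = {frozenset(("a1", "a2")): 0.1,
             frozenset(("a1", "a3")): 0.9,   # a3 largely redundant given a1
             frozenset(("a2", "a3")): 0.2}
print(cfs_merit({"a1", "a2"}, class_corr, feat_corr))        # about 1.011
print(cfs_merit({"a1", "a2", "a3"}, class_corr, feat_corr))  # lower: redundancy penalised
```

Adding the redundant feature a3 raises the numerator slightly but inflates the inter-correlation term, so the overall merit drops, which is precisely the behaviour CFS is designed to exhibit.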
CFS is notable for three important characteristics:
• The higher the correlations between the individual features and the decision
variable, the higher the correlation between the feature subset and the class.
• The lower the inter-correlations amongst the selected features themselves, the
higher the correlation between the feature subset and the class.
• As the number of selected features increases, the correlation between the
feature subset and the class increases.
Note that the last point above assumes that each of the newly added features
offers a similar improvement in terms of evaluation score to that of the existing ones
already included in the feature subset, when measured by Eqn. 2.9. CFS has the
advantage of being able to offer a view of closely relevant (substitutable) features,
in addition to the identification of a good feature subset. This may be beneficial for
certain data mining applications where comprehensible results are of paramount
importance. However, it is not always clear whether redundancy should be fully
eliminated. For instance, to encourage direct human comprehension of a given rule,
a specific feature may be replaced by another (equally informative one) with which
it is highly correlated.
Consistency-Based FS (PCFS) One important notion exploited by probabilistic
consistency-based FS (PCFS) is the inconsistency criterion, which essentially speci-
fies to what extent a feature subset (and the reduced data which it infers) can be
accepted [52]. The inconsistency rate of a given data set with respect to a selected
subset of features is checked against a predefined threshold, where lower values of
inconsistency are deemed acceptable, with a default threshold value of 0.0.
Two objects xu and xv are considered inconsistent if they match in terms of all
their feature values but not their class labels: ∀ai ∈ A, ai(xu) = ai(xv) and zu ≠ zv.
For a group of such objects that match (without considering their class labels) on
a given set of features, the inconsistency count is the number of objects minus the largest
number of instances with the same class label. For example, given the entire data
set of all objects X, suppose that

$$X' = X'_{z_u} \cup X'_{z_v} \cup X'_{z_w}, \quad X' \subseteq X \tag{2.10}$$

is a set of objects that match in terms of feature values, but belong to three different
class labels zu, zv, zw ∈ Ωz, where Ωz signifies the set of available class labels. The
inconsistency count for X' is:

$$\text{inconsistency count}(X') = |X'| - \max\left(|X'_{z_u}|, |X'_{z_v}|, |X'_{z_w}|\right) \tag{2.11}$$
The overall inconsistency rate of X is computed by summing all such inconsistency
counts (over all matching sets of objects), and dividing by the total number of
data instances.
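The inconsistency rate of Eqns. 2.10 and 2.11 can be computed by grouping objects on their projected feature values. The following minimal Python sketch is illustrative (the data and names are assumptions of this example):

```python
from collections import Counter, defaultdict

def inconsistency_rate(instances, labels, subset):
    """Inconsistency rate of a data set w.r.t. a feature subset (Eqns. 2.10-2.11):
    for each group of objects matching on the selected features, count the group
    size minus its largest same-class block, then divide the total by |X|."""
    groups = defaultdict(list)
    for row, z in zip(instances, labels):
        key = tuple(row[a] for a in subset)   # projection onto the subset
        groups[key].append(z)
    count = sum(len(g) - max(Counter(g).values()) for g in groups.values())
    return count / len(instances)

# Three objects with features a1, a2; projecting onto {a1} makes rows 1 and 3 clash.
X = [{"a1": 0, "a2": 0}, {"a1": 1, "a2": 0}, {"a1": 0, "a2": 1}]
z = ["yes", "no", "no"]
print(inconsistency_rate(X, z, ["a1"]))        # 1/3: one inconsistent object
print(inconsistency_rate(X, z, ["a1", "a2"]))  # 0.0: fully consistent
```

With the default threshold of 0.0 mentioned above, the subset {a1} would be rejected while {a1, a2} would be accepted.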
Rough Set FS Rough sets are an extension of conventional sets that allow approx-
imations in decision making [204]. In particular, rough set theory (RST) can be
used to find relationships in data, as a tool to discover data dependencies. The
most notable application of RST is the reduction of features contained in a data set
based on the information in the data set alone [168]. However, RST should not
be confused with or seen as an alternative for fuzzy set theory, nor does fuzzy set
theory compete with RST [204]. They are two individual methods for dealing with
imperfect data. Although this thesis does not utilise any RST-based method directly,
focusing instead on its fuzzy extensions (described in the following section),
the definitions of its key notions are briefly summarised below for completeness.
At the heart of RST is the concept of indiscernibility [204]. For a given subset
of features P ⊆ A, there exists an associated equivalence relation IND(P):

$$IND(P) = \left\{(x_i, x_j) \in X^2 \mid \forall a \in P,\; a(x_i) = a(x_j)\right\} \tag{2.12}$$

where a(xi) signifies the value of a feature a ∈ P for an object xi ∈ X. If
(xi, xj) ∈ IND(P), then the two objects are considered indiscernible using the features
contained in P. This leads to the definition of the equivalence classes of the
P-indiscernibility relation, which are denoted [x]P.
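The equivalence classes of IND(P) (Eqn. 2.12) can be computed by grouping objects on their values over P. The Python sketch below is purely illustrative (the universe and feature names are assumptions of this example):

```python
from collections import defaultdict

def equivalence_classes(objects, P):
    """Partition the universe into equivalence classes [x]_P of IND(P):
    objects indiscernible on every feature in P fall into the same class."""
    classes = defaultdict(set)
    for x, row in objects.items():
        classes[tuple(row[a] for a in P)].add(x)
    return list(classes.values())

objects = {"x1": {"a": 0, "b": 1}, "x2": {"a": 0, "b": 1}, "x3": {"a": 1, "b": 1}}
print(equivalence_classes(objects, ["a"]))  # two classes: {x1, x2} and {x3}
print(equivalence_classes(objects, ["b"]))  # one class: all objects indiscernible
```

Note how the coarser feature b cannot discern any of the objects, whereas a splits the universe into two classes; this granularity is what the approximations below are built upon.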
RST allows the partition of a vague set W ⊆ X by using two well-defined limits,
with respect to a set of features P ⊆ A, which are known as upper and lower approxi-
mations, as illustrated in Fig. 2.3. Both approximations are discrete sets that allow
the partitioning of the domain (sample space) into two distinct sub-domains. The
lower approximation describes the objects in the domain that are known with
certainty to belong to the vague set of interest:

$$\underline{P}W = \{x : [x]_P \subseteq W\} \tag{2.13}$$

The upper approximation, which subsumes the lower approximation, describes
the objects in the domain that belong to those equivalence classes whose elements
at least partly belong to the concept of interest:

$$\overline{P}W = \{x : [x]_P \cap W \neq \emptyset\} \tag{2.14}$$
Figure 2.3: Basic concepts of rough set
The rough set-based approach to FS allows for the reduction in the number of
features in a data set, whilst not requiring any external information for thresholds.
It can find a subset (termed a reduct) of the original features that are the most
informative; all other features can be removed from the data set with minimal
information loss. Given these important advantages over many other FS methods, it
is not surprising that further development based on this theory for FS has been the
focus of much research [145].
Fuzzy-Rough FS (FRFS) This is one of the most significant further developments of
the aforementioned rough set-based FS technique. RST only works on discrete, crisp-
valued domains. However, in practice, the values of features are usually real-valued.
It is not possible in this theory to say whether two different feature values are similar,
and to what extent they are the same. For example, two close values may only differ
as a result of noise, but in RST they are considered to be as different as two values
of different orders of magnitude. Data set discretisation must therefore take place
before reduction methods based on crisp rough sets can be applied. This is often still
inadequate, however, as the degrees of membership of values to discretised values
are not considered and thus may result in information loss. In order to overcome
this, extensions of RST based on fuzzy-rough sets [67] have been developed.
A fuzzy-rough set is defined by two fuzzy sets, a fuzzy lower and a fuzzy upper
approximation, obtained by extending the corresponding crisp RST notions. In the
crisp case, elements either belong to the lower approximation with absolute certainty
or not at all. In the fuzzy-rough case, elements may have a membership in the range
[0,1], allowing greater flexibility in handling uncertainty. Fuzzy-rough FS (FRFS)
[126] extends the ideas of fuzzy-rough sets to perform FS, where a vague concept
W ⊆ X is approximated by the fuzzy lower and upper approximations:
µ_{R_B↓W}(x_i) = inf_{x_j∈X} I(µ_{R_B}(x_i, x_j), µ_W(x_j)) (2.15)

µ_{R_B↑W}(x_i) = sup_{x_j∈X} T(µ_{R_B}(x_i, x_j), µ_W(x_j)) (2.16)

where I is a fuzzy implicator, T is a t-norm, R_B is the fuzzy similarity relation
induced by the subset of features B, and x_i, x_j ∈ X are two arbitrary objects.
In particular,

µ_{R_B}(x_i, x_j) = T_{a∈B} µ_{R_a}(x_i, x_j) (2.17)

where µ_{R_a}(x_i, x_j) is the degree to which objects x_i and x_j are similar for feature
a ∈ A. Many similarity relations can be constructed for this purpose, for example:

µ_{R_a}(x_i, x_j) = 1 − |a(x_i) − a(x_j)| / (a_max − a_min) (2.18)

µ_{R_a}(x_i, x_j) = exp(−(a(x_i) − a(x_j))² / (2σ_a²)) (2.19)

µ_{R_a}(x_i, x_j) = max(min((a(x_j) − (a(x_i) − σ_a)) / (a(x_i) − (a(x_i) − σ_a)), ((a(x_i) + σ_a) − a(x_j)) / ((a(x_i) + σ_a) − a(x_i))), 0) (2.20)
where σa and σ2a represent the standard deviation and the variance of the values
taken by feature a, respectively. The choices for I , T , and the fuzzy similarity relation
have great influence upon the resultant fuzzy partitions, and thus the subsequently
selected feature subsets.
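For illustration, the first two similarity relations (Eqns. 2.18 and 2.19) may be sketched in Python as follows; the function names are illustrative and not part of the original formulation:

```python
import math

def similarity_linear(ai, aj, a_min, a_max):
    """Eqn. 2.18: complement of the range-normalised distance."""
    return 1.0 - abs(ai - aj) / (a_max - a_min)

def similarity_gaussian(ai, aj, sigma):
    """Eqn. 2.19: Gaussian kernel scaled by the feature's spread."""
    return math.exp(-((ai - aj) ** 2) / (2.0 * sigma ** 2))
```

Both relations return 1 for identical values and decay towards 0 as the values move apart, which is the behaviour the fuzzy partitions rely upon.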
The fuzzy-rough lower approximation-based QuickReduct algorithm [126], which
extends the crisp version [226], is shown in Algorithm 2.1.1. It employs a quality
measure termed the fuzzy-rough dependency function γB(Q) that measures the
dependency between two sets of attributes B and Q, as defined by:
γ_B(Q) = (Σ_{x∈X} µ_{POS_{R_B}(Q)}(x)) / |X| (2.21)
In this definition, the fuzzy positive region, which contains all objects of X that can
be classified into classes of X/Q using the information in B, is defined as:
µ_{POS_{R_B}(Q)}(x) = sup_{W∈X/Q} µ_{R_B↓W}(x) (2.22)
1  A, set of all conditional features
2  Z, set of decision features
3  R = ∅, γ_best = 0, γ_prev = 0
4  repeat
5      B = R
6      γ_prev = γ_best
7      foreach x ∈ (A \ R) do
8          if γ_{R∪{x}}(Z) > γ_B(Z) then
9              B = R ∪ {x}
10             γ_best = γ_B(Z)
11     R = B
12 until γ_best == γ_prev
13 return R
Algorithm 2.1.1: Fuzzy-rough QuickReduct (A, Z)
Similar to CFS and PCFS, γ_B is viewed as a measure of quality for a given feature
subset B ⊆ A, with respect to the set of decision features Z: 0 ≤ γ_B(Z) ≤ 1, γ_∅(Z) = 0.
A fuzzy-rough reduct R can then be defined as a subset of features that preserves the
dependency degree of the entire data set, i.e., γ_R(Z) = γ_A(Z). In this thesis, where
no confusion may arise, the fuzzy-rough dependency measure of a given feature subset B
is notationally simplified to f(B), following the same conventions adopted for CFS
and PCFS.
The evaluation of f(B) enables QuickReduct to choose which features to add to
the current candidate fuzzy-rough reduct. Note that the algorithm is “greedy” and
therefore always selects the feature resulting in the greatest increase in fuzzy-rough
dependency. The algorithm terminates when the addition of any of the remaining
features does not result in an increase in dependency. As with the original crisp
algorithm, for a dimensionality of |A|, the worst case data set will result in O((|A|² + |A|)/2)
evaluations of the dependency function, while the cost of each dependency evaluation
is related to both the number of original features |A| and the number of training
objects |X|.
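For illustration, the control flow of Algorithm 2.1.1 may be sketched in Python as follows; the dependency measure is passed in as a function gamma, since its definition (Eqn. 2.21) depends on the chosen fuzzy connectives. This is a simplified sketch, not the implementation used in the thesis:

```python
def quickreduct(features, gamma):
    """Greedy forward selection: repeatedly add the feature giving the
    greatest increase in dependency, until no increase is possible."""
    reduct = set()
    best_score = gamma(reduct)
    while True:
        candidate, candidate_score = None, best_score
        for f in features - reduct:
            score = gamma(reduct | {f})
            if score > candidate_score:
                candidate, candidate_score = f, score
        if candidate is None:  # no remaining feature improves the score
            return reduct
        reduct.add(candidate)
        best_score = candidate_score
```

Each outer iteration performs up to |A| dependency evaluations, which yields the O((|A|² + |A|)/2) worst-case count noted above.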
2.1.2 Wrapper-Based, Hybrid, and Embedded FS
Wrapper-based approaches [107, 144], in contrast to filter-based approaches, are
used in conjunction with a learning or data mining algorithm (which forms a
major part of the validation process). They have the obvious advantage of identifying
the solutions most appropriate for a specific application. However, wrapper-based
approaches generally incur far greater computational overheads than the rest, owing
to the model training and validation required for the examination of each feature
subset. Although various methods exist that employ different end classification algorithms,
wrapper-based approaches generally follow the same design principle. Indeed, for
B ⊆ A, a generic wrapper-based evaluation measure may be defined as follows:
wrapper(B) = accuracy of the classifier built using B and X_train,
and tested using held-out data X_test (2.23)
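A minimal sketch of Eqn. 2.23 is given below, using an inline 1-nearest-neighbour classifier so that the example is self-contained; in practice any learning algorithm may take its place, and the function name and data layout shown here are illustrative assumptions:

```python
def wrapper_accuracy(subset, train, held_out):
    """Eqn. 2.23 sketch: train a 1-NN classifier on the selected
    features and report its accuracy on held-out data.
    Each instance is a (feature_vector, label) pair."""
    def dist(x, y):
        # squared Euclidean distance over the selected features only
        return sum((x[i] - y[i]) ** 2 for i in subset)
    correct = 0
    for x, label in held_out:
        nearest = min(train, key=lambda t: dist(t[0], x))
        if nearest[1] == label:
            correct += 1
    return correct / len(held_out)
```

Note that the classifier is retrained (here, implicitly, by re-scanning the training data) for every candidate subset, which is precisely the overhead that makes wrapper-based evaluation expensive.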
In order to combine the potential benefits of both filter-based and wrapper-based
methods, hybrid algorithms [297] have been proposed. The rationale behind these
techniques is to make use of both an evaluation measure and a learning algorithm,
in evaluating the quality of feature subsets. Such a combined measure is then used
to decide which subsets are the most suitable for a given cardinality, and the learning
algorithm is then used to select the final, overall “best” solution from a pool of
candidate feature subsets of different cardinalities.
In addition to such hybrid mechanisms, there is also the so-called embedded
approach. In such methods, an implicit or explicit FS sub-algorithm forms an integrated
part of a more general learning algorithm [264]. Decision tree learning is a
typical example of this. Of course, this may also be viewed as a specific case of the
hybrid approach.
2.2 FS Search Strategies
Having introduced the evaluation measures that seek to assess the merit or quality of
a given feature subset, the remaining problem for FS is to find the best solution from a
search space of 2|A| competing feature subsets using a specific strategy. An exhaustive
search can conceivably be performed, if the number of variables is not too large.
However, this problem is known to be NP-hard [8, 91] and the search can become
computationally intractable. Existing techniques in the literature generally fall into
two main categories: deterministic methods and stochastic techniques. This section
outlines two deterministic approaches that are commonly employed by conventional
FS algorithms. The main focus, however, is the analysis of stochastic, and especially
nature-inspired, search methods, the category to which the HSFS algorithm proposed
in this thesis belongs.
2.2.1 Deterministic Algorithms
Deterministic methods often follow greedy, step-by-step procedures, in order to
form a potential solution in a predetermined fashion. Such an approach is generally
simple to implement, and is empirically efficient for data sets with fewer (e.g., < 100)
features [164]. Two straightforward means to implement this approach are outlined
below.
2.2.1.1 Exhaustive Search
Exhaustive search is an optimal search method, with a complexity of O(2|A|), the
same as the total number of possible solutions. It is both optimal and complete, in
the sense that the best feature subset is guaranteed to be found once (and if) the
search terminates, with all potential solutions having been evaluated during the
process. Exhaustive search is computationally infeasible for most practical problems,
because of its exponential cost.
2.2.1.2 Sequential Search
Sequential search is sometimes referred to as a hill-climbing algorithm: it selects, at
each iteration, the single feature that provides the greatest improvement in terms of
the evaluation score. Its polynomial complexity is determined by the number of
subset evaluations per iteration required to identify the most informative feature.
Obviously, sequential search is not ideal, since the best solution may exist in a region
that the algorithm never visits. Furthermore, as discussed in Section 2.1.1.1,
the inter-dependencies between features make it less beneficial to explore potential
feature subsets on an individual feature basis.
2.2.2 Stochastic and Nature-Inspired Approaches
Following a taxonomy concerning the stochastic and nature-inspired approaches
in their base form [29], the existing FS methods can be classified into a number of
categories, as shown in Fig. 2.4. Considering that a few categories have not yet
attracted sufficient application in the area of FS, e.g., immune systems and physi-
cal/social algorithms, three major categories are established here in order to improve
the organisation of the reviewed methods. The biologically-inspired approaches
include the Genetic Algorithm (GA) [231, 235, 275], Genetic Programming [187],
Memetic Algorithm (MA) [274, 296], and the Clonal Selection Algorithm (CSA)
[230] from immune systems; the physical, social, and stochastic algorithms include
HSFS [62] (described in Chapter 3), Simulated Annealing (SA) [69, 182], Random
Search [241], Scatter Search [83], and Tabu Search (TS) [100]; and the swarm-based
techniques include Artificial Bee Colony (ABC) [199], Ant Colony Optimisation
(ACO) [38, 122, 134, 138], the Bat Algorithm [189], Bee Colony Optimisation, and
Particle Swarm Optimisation (PSO). Most of the above-mentioned algorithms are
described in detail in the following section.
Fundamentally speaking, putting the underlying analogies aside, nature-inspired
FS approaches are a collection of techniques of stochastic nature, for the purpose of
discovering and improving good candidate solutions. Several recent studies have combined
these algorithms, adopting one algorithm's strengths in an effort to
complement the weaknesses of another. In so doing, a number of hybrid methods have
emerged, including GA-PSO [11], ACO-GA [192], ACO-neural networks [234], PSO-
catfish [41], etc. Moreover, there also exist several approaches that have embedded
local search procedures [133, 194].
2.2.2.1 Common Notions and Mechanisms
Despite having distinctive characteristics and work flows, many stochastic search
techniques share similarities, which are summarised in Table 2.1. A population-
based NIM typically employs a group P of individuals p_i, each of which actively maintains an
Figure 2.4: Taxonomy of nature-inspired approaches
emerging feature subset B_pi. Internally, most algorithms represent a given feature
subset B_pi in a binary manner, using a string b^{B_pi} of length |A|. The jth
position of b^{B_pi} is set to 1 (b_j^{B_pi} = 1) if its corresponding feature is selected, i.e.,
a_j ∈ B_pi, and b_j^{B_pi} = 0 if a_j is not selected in the candidate feature subset. The current
best solution (amongst the entire population) is represented by B*, and a randomly
generated feature subset is denoted by B̃. To simplify the representation, random
components other than B̃ are denoted by the use of r. For example, c = r_c, 0 ≤ r_c ≤ 1,
indicates that the value of a certain parameter c is a randomly generated number
drawn from the range 0 to 1; and a_r ∈ A, r ∈ {1, ..., |A|}, denotes a feature
randomly picked out of the pool of original features A. These notations will be used
extensively in the pseudocode hereafter, in order to illustrate the work flows of the
reviewed NIMs in a unified representation that eases comparison.
• Random Initialisation
Table 2.1: Notions used in pseudocode

Notion              Meaning
p_i ∈ P             A population P of individuals p_i
B_pi ∈ B            Candidate feature subset B_pi maintained by p_i
B*                  Current best subset
B̃                   A subset of randomly selected features
b_j^{B_pi} ∈ {0,1}  Selection state (0: not selected, 1: selected) of the jth feature in B_pi
f(B)                Evaluation score of B
g                   Current generation/iteration
g_max               Maximum number of generations/iterations
r                   A random number or a stochastic component
T                   A temporary solution
One of the key advantages of nature-inspired approaches is their insensitivity to
the initial states. The population at the start of the search (often referred to as
the initial population) is generally a randomly generated pool. In stochastic
FS, a random subset B̃ can be constructed by randomly setting r bits, where r
itself may be a predetermined size, or random: r ∈ {1, ..., |A|}.

for i = 1 to |P| do B_pi = B̃ (2.24)
• Solution Adjustment
The candidate solutions are modified constantly during the search process. The
most common adjustment procedure is the random addition or removal of r_m
features, such as the mutation operator used by many evolutionary
algorithms. r_m may be predefined, or dynamically determined according to
certain states of the algorithm. If a binary representation b^B is used, this
adjustment may be achieved by randomly flipping r_m bits:

for i = 1 to r_m do b_r^B = ¬b_r^B, a_r ∈ A (2.25)
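The random initialisation of Eqn. 2.24 and the bit-flip adjustment of Eqn. 2.25 may be sketched as follows, using Python lists as binary strings (an illustrative sketch only):

```python
import random

def random_subset(n_features, r=None):
    """Eqn. 2.24: construct a random subset by setting r bits,
    where r itself is random if unspecified."""
    r = r if r is not None else random.randint(1, n_features)
    bits = [0] * n_features
    for i in random.sample(range(n_features), r):
        bits[i] = 1
    return bits

def mutate(bits, r_m):
    """Eqn. 2.25: flip r_m randomly chosen bits."""
    out = list(bits)
    for i in random.sample(range(len(bits)), r_m):
        out[i] = 1 - out[i]
    return out
```

Sampling positions without replacement guarantees that exactly r bits are set, and exactly r_m bits change under mutation.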
Several swarm-based algorithms exploit the notion of movement, from a given
candidate solution B_pi towards another, possibly better quality feature subset,
say B_pj, aiming to eventually reach the true global best solution. For conventional
numerical optimisation problems, this process derives new values for the
function variables according to predefined formulae, and new solution vectors
may be constructed which are interpolated between the source and target
vectors. However, for FS problems this is less applicable, since binary values
are generally employed, and the variables represent independent features. In
the literature, movement is implemented by first determining the distance
between the two subsets:

d(B_pi, B_pj) = |B_pi ⊕ B_pj| (2.26)

which is equal to the number of bit differences. The amount of movement
v, v ∈ [0, v_max], or the number of bits that B_pi should copy from B_pj, is then
calculated with regard to the absolute distance, as demonstrated in Algorithm
2.2.1. Note that for algorithms such as PSO and FA, the feature subset being
improved generally moves towards the current best solution B*.
1 if v ≤ d(B_pi, B_pj) then
2     for j = 1 to v do
3         b_r^{B_pi} = ¬b_r^{B_pi}, a_r ∈ B_pi ⊕ B_pj
4 else
5     B_pi = B_pj
6     for j = 1 to (v − d(B_pi, B_pj)) do
7         b_r^{B_pi} = ¬b_r^{B_pi}, a_r ∉ B_pj
Algorithm 2.2.1: Move B_pi towards B_pj by a distance v
• Subset Quality Comparison
FS is essentially a dual-objective optimisation task. A good quality feature
subset should both achieve a high score in terms of evaluation, and maintain
a low cardinality. Having an ordered solution space enables higher quality
solutions to be discovered. Algorithms such as CSA and HS use a scheme where
two given candidate solutions are compared first based on evaluation scores,
and the cardinalities of the subsets are then used as a tie breaker:
B_pi > B_pj ⇔ f(B_pi) > f(B_pj) ∨ (f(B_pi) == f(B_pj) ∧ |B_pi| < |B_pj|) (2.27)
Algorithms including FF, PSO, and SA require a single numerical difference
between candidate solutions, so that the internal parameters can be calcu-
lated. In this case, evaluation score and subset size are integrated together via
weighted aggregation, in order to reflect the influence of both the evaluation
score f(B) and the subset size (normalised as |B|/|A|) of a given feature subset B.
The weighting parameters α and β may be equal or biased:

B_pi > B_pj ⇔ α f(B_pi) + β |B_pi|/|A| > α f(B_pj) + β |B_pj|/|A| (2.28)
Alternative aggregation methods may also be employed of course. Note that
multi-objective evolutionary algorithms [71, 79] have also been exploited to
facilitate simultaneous optimisation of both criteria, but they are outside the
scope of this chapter.
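The two comparison schemes (Eqns. 2.27 and 2.28) may be sketched as predicates over (score, subset) pairs; note that a negative β is used here so that larger subsets are penalised, which is an assumption about how the weights would be chosen in practice:

```python
def better_lexicographic(f_i, b_i, f_j, b_j):
    """Eqn. 2.27: higher score wins; ties broken by smaller size."""
    return f_i > f_j or (f_i == f_j and len(b_i) < len(b_j))

def better_weighted(f_i, b_i, f_j, b_j, n_features, alpha=1.0, beta=-0.5):
    """Eqn. 2.28: weighted aggregation of score and normalised size.
    A negative beta penalises larger subsets (illustrative choice)."""
    agg = lambda f, b: alpha * f + beta * len(b) / n_features
    return agg(f_i, b_i) > agg(f_j, b_j)
```

The lexicographic scheme only orders solutions, whereas the weighted scheme additionally provides the numerical difference that algorithms such as PSO and SA require internally.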
• Current Best Solution Tracking
Due to the stochastic behaviour of the search algorithms concerned, it is often
necessary to keep a record of the best quality feature subset B* discovered thus
far, as the algorithm may explore other (possibly sub-optimal) solution regions
later on. The procedure for updating B* invokes the previously mentioned
comparison process (Eqn. 2.27). At each iteration, the quality of the current
best solution f(B*) is compared with those of all of the emerging subsets f(B_pi)
that are currently maintained by the individuals pi ∈ P:
1 for i = 1 to |P| do
2     if B_pi > B* then B* = B_pi
Algorithm 2.2.2: Update current best solution B*
• Local Search
Algorithm 2.2.3 details one of the local search procedures [3], commonly used
by techniques such as MA [274, 296] and hybrid search methods [133, 194]. It
is a greedy mechanism that evaluates all unselected features, and adds the most
informative candidate (the feature that provides the greatest improvement
in evaluation score) to the current feature subset. The Hill-Climbing (HC)
algorithm works in a similar fashion, continuing to select features until
the score cannot be improved further. This mechanism can also be used in
reverse, in order to eliminate the least important feature from a subset.
1  repeat
2      f' = f(B)
3      t = −1
4      for i = 1 to |A|, a_i ∉ B do
5          b_i^B = 1
6          if f(B) > f' then
7              f' = f(B)
8              t = i
9          b_i^B = 0
10     if t ≠ −1 then b_t^B = 1
11 until t == −1
Algorithm 2.2.3: Local search (B)
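The reverse (backward elimination) form of this mechanism may be sketched as follows, where f is the subset evaluation measure (an illustrative sketch):

```python
def backward_elimination_step(subset, f):
    """Reverse form of Algorithm 2.2.3: remove the feature whose
    removal best preserves (or improves) the evaluation score."""
    best_subset, best_score = set(subset), f(set(subset))
    for a in subset:
        candidate = set(subset) - {a}
        score = f(candidate)
        if score >= best_score:
            best_subset, best_score = candidate, score
    return best_subset
```

Using a non-strict comparison means that a feature whose removal leaves the score unchanged is still eliminated, favouring smaller subsets in line with Eqn. 2.27.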
2.2.2.2 Genetic Algorithm
Genetic Algorithms (GAs) mimic the process of natural evolution by simulating events
such as inheritance, mutation, selection, and crossover. A considerable amount of
investigation [231, 235] has been carried out in order to explore the feasibility
of applying GAs to FS, much of this has been summarised and compared in the
literature [79]. In GAs, a feature subset is generally represented by a binary string
called a chromosome. A population P of such chromosomes is randomly initialised
and maintained, and those with higher fitness values are propagated into the later
generations.
The reproduction process is typically achieved by the use of two operators:
crossover and mutation. As shown in Algorithm 2.2.4, the standard one-point
crossover operator exchanges and recombines a pair of parent chromosomes, Bp
and Bq. It first locates a certain crossover point rc along the length of the binary
string, and then generates two children with all features beyond rc swapped between
the two parents. The mutation operator produces a modified subset by randomly
adding or removing features from the original subset. By allowing the survival and
reproduction of the fittest chromosomes, the algorithm effectively optimises the
quality of the selected feature subset. The parameters c and m, which control the
rates of crossover and mutation, require careful consideration, in order to allow the
chromosomes to sufficiently explore the solution space, and to prevent premature
convergence towards a locally optimal subset.
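The one-point crossover operator described above may be sketched on binary lists as follows (illustrative):

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Swap all bits beyond a randomly chosen crossover point r_c."""
    r_c = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:r_c] + parent_b[r_c:]
    child_b = parent_b[:r_c] + parent_a[r_c:]
    return child_a, child_b
```

Because the children are complementary recombinations of the parents, the total number of selected features across the pair is preserved.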
A GA-based FS algorithm is simple in concept, and may be implemented efficiently
to obtain good quality feature subsets. Being a randomised algorithm,
1  p_i ∈ P, i = 1 to |P|, group of chromosomes
2  B_pi ∈ B, i = 1 to |P|, subsets associated with p_i
3  c, crossover rate
4  m, mutation rate
5  Randomly initialise P
6  for g = 1 to g_max do
7      for i = 1 to |P| do
8          T_i = B_pi with a probability of f(B_pi) / Σ_{B_pi∈B} f(B_pi)
9          T_{i+1} = B_pj with a probability of f(B_pj) / Σ_{B_pj∈B} f(B_pj)
10         if T_i == T_{i+1} then
11             b_r^{T_i} = ¬b_r^{T_i}, a_r ∈ A
12         else
               // Crossover
13             if r < c then
14                 r_c ∈ {1, ..., |A|/2}
15                 for k = 1 to r_c do
16                     b_k^{T_{i+1}} = b_k^{B_pi}, b_k^{T_i} = b_k^{B_pj}
               // Mutation
17             for j = 1 to |A| do
18                 0 ≤ r_m ≤ 1
19                 if r_m < m then b_j^{T_i} = ¬b_j^{T_i}
20                 if r_m < m then b_j^{T_{i+1}} = ¬b_j^{T_{i+1}}
21         i = i + 2
22     for i = 1 to |P| do
23         B_pi = T_i
24     Update B*
Algorithm 2.2.4: Genetic Algorithm
however, there is no guarantee that a top quality feature subset (if not the global
best solution) can be found in a reasonable or fixed amount of time; its optimisation
response time and solution quality are not constant. These drawbacks may limit
the potential of GAs for more demanding scenarios, such as on-line streaming FS [268].
It is also a challenging task to identify a suitable set of required parameter values,
since the problem domain generally has very little in common with the evolutionary
concepts underlying GAs.
2.2.2.3 Memetic Algorithm
The Memetic Algorithm (MA) [30, 196] signifies several recent advances in evolutionary
computation. The term is commonly used to refer to any population-based evolutionary
approach with a separate individual learning process (e.g., a local improvement
procedure). It is also referred to as hybrid GAs, parallel GAs, or genetic local
search in the literature [37]. When applied to the FS domain, the key research
question is how the local search should be implemented. Such an approach typically
follows a similar improvement process to Algorithm 2.2.3, except that the features
being considered for addition are drawn from a randomly selected subset, rather than from
the complete set of original features [274].
An alternative local improvement process [296] suggests that a ranking of fea-
tures should be computed first, and the solutions may then be improved by adding
or removing features based on the ranking information. However, this requires the
subset evaluator employed to be able to handle feature ranking, unless such information
can be obtained elsewhere. It has also been proposed that local search may be
performed on an elite subset of the population [280], as shown in Algorithm 2.2.5, and
that the worst solutions may be substituted by the locally improved child solutions. The
local search mechanism alters (by adding or removing) the single feature that provides
the greatest increase in terms of evaluation score.
Although proven to be beneficial in a majority of scenarios, the presence of
greedy mechanisms may have a negative impact on the quality of the feature subset
returned, since the natural stochastic evolution of the chromosomes may be disrupted
by excessive execution of local adjustments. This is because the variable values (i.e.,
features) are discrete rather than continuous. It is also difficult to identify the most
suitable local search mechanism, in addition to the configuration of parameter values.
2.2.2.4 Clonal Selection Algorithm
The Clonal Selection Algorithm (CSA) [54] is inspired by the adaptive immune response
to an antigenic stimulus. It exploits the fact that only those antibodies that recognise
the antigen are selected to proliferate. The original algorithm involves a maturation
process for the selected biological cells, which improves their affinity to the selective
antigens. A simplified implementation of CSA-based FS [230] has been proposed. It enables
both the selection of important features, and the optimisation of parameters for the end
classifiers (which are implemented using support vector machines). Although
1  p_i, i = 1 to |P|, group of chromosomes
2  B_pi ∈ B, i = 1 to |P|, subsets associated with p_i
3  s, number of best individuals kept for reproduction
4  c, crossover rate
5  m, mutation rate
6  for i = 1 to |P| do
7      B_pi = Local search(B̃)
8  for g = 1 to g_max do
9      for i = 1 to s do
10         T_i = B_pi with a probability of f(B_pi) / Σ_{B_pi∈B} f(B_pi)
11         T_{i+1} = B_pj with a probability of f(B_pj) / Σ_{B_pj∈B} f(B_pj)
12         if T_i == T_{i+1} then
13             b_r^{T_i} = ¬b_r^{T_i}, a_r ∈ A
14         else
15             if r < c then
16                 r_c ∈ {1, ..., |A|/2}
17                 for k = 1 to r_c do
18                     b_k^{T_{i+1}} = b_k^{B_pi}, b_k^{T_i} = b_k^{B_pj}
19             for j = 1 to |A| do
20                 0 ≤ r_m ≤ 1
21                 if r_m < m then b_j^{T_i} = ¬b_j^{T_i}
22                 if r_m < m then b_j^{T_{i+1}} = ¬b_j^{T_{i+1}}
23         i = i + 2
24     Sort B
25     for i = 1 to s do
26         B_p{|P|−i} = Local search(T_i)
27     Update B*
Algorithm 2.2.5: Memetic Algorithm
the original method integrates these two tasks, it is easily modifiable to support
generic FS, as shown in Algorithm 2.2.6. An adaptive CSA has also been adopted for
network fault FS [284].
The initial population is filled with randomly generated antibodies at the start.
At each iteration, clones are created for each individual. The better evaluation score
an antibody achieves, the more clones are constructed. The maximum number of
1  p_i ∈ P, i = 1 to |P|, group of antibodies
2  0 ≤ f(B) ≤ 1, normalised subset evaluation score
3  c, maximum number of clones per antibody
4  m, maximum number of bits per mutation
5  s, maximum number of random cells
6  T ∈ T, a set of temporary feature subsets
7  Randomly initialise P
8  for g = 1 to g_max do
9      T = ∅
10     for i = 1 to |P| do
11         c_i = c · e^{f(B_pi) − f(B*)}
12         for j = 1 to c_i do
13             T = B_pi
14             Flip m(1 − c_i/c) random bits of T
15             T = T ∪ {T}
16     for i = 1 to s do
17         T = T ∪ {B̃}
18     Sort T
19     while |T| > |P| do
20         T = T \ {T_|T|}
21     B = T
22     Update B*
Algorithm 2.2.6: Clonal Selection Algorithm
allowed clones c_i for each of the antibodies p_i, i = 1, ..., |P|, can be configured by a
parameter c, and an exponential function:

c_i = c · e^{f(B_pi) − f(B*)} (2.29)
is used to calculate the amount of copies required. The clones are then mutated
by flipping bits randomly, and subsequently added to the population. Again, the
better the original antibody is, the fewer bits will be altered. The population is
further joined by a set of antibodies which are randomly selected from the existing
population. The trim process then removes the worst antibodies in order to maintain
the size of the group |P|. The current best solution is updated at each iteration, and
this process repeats until the maximum number of iterations has been reached.
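The clone-allocation rule of Eqn. 2.29, together with the complementary mutation intensity described above, may be sketched as follows (the function names are illustrative):

```python
import math

def clone_counts(scores, best_score, c):
    """Eqn. 2.29 sketch: better antibodies receive more clones."""
    return [int(c * math.exp(f - best_score)) for f in scores]

def mutation_bits(c_i, c, m):
    """Better antibodies (larger c_i) have fewer bits flipped."""
    return int(m * (1 - c_i / c))
```

An antibody matching the current best score receives the full c clones with zero forced mutation, while weaker antibodies receive exponentially fewer clones, each mutated more aggressively.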
Despite being a much simplified version of CSA, the CSA-based FS technique still
needs to produce clones of the good candidate feature subsets (antibodies), at every
iteration. The exponential cost of generating, mutating, and evaluating such clones
may become a significant overhead, especially for higher dimensional data sets.
2.2.2.5 Simulated Annealing
Simulated Annealing (SA) [55] is a generic probabilistic meta-heuristic for locating
an approximation to the global optimum of a given complex function. It is inspired by
the annealing process in metallurgy, a technique that involves repeatedly heating and
cooling a certain material in a controlled environment, in order to increase the size
of the crystals, and reduce the defects, both of which depend on the thermodynamic
free energy of the material.
The adaptation of SA for FS [69, 182] requires an adjustment to the underlying
computational algorithm, which ensures that SA keeps a record of the current best
solution. Unlike most other population-based NIMs, SA-based FS maintains and
improves only a single feature subset throughout the search process. As shown in
Algorithm 2.2.7, the algorithm checks whether it has reached “thermal equilibrium”
at a given energy state, by maintaining a count e that is incremented each time SA finds
a better quality feature subset. Once sufficient improvements have been encountered, SA makes
a transition in the energy state, where both the temperature g and the perturbation
(mutation) percentage ρ are adjusted by a so-called cooling rate c. The equilibrium
count e is also reset. Such a transition generally means that a smaller number of
features are adjusted during mutation, allowing fine tuning to be achieved towards
termination.
One of the major criticisms levelled at SA is that it is not always able to return
the best solution found throughout the search process. This issue is addressed in the
above-mentioned application to FS. Only having to improve and maintain a single
candidate solution has the obvious advantage of high efficiency. However, this also
makes SA-based FS more prone to discovering locally best feature subsets, unless
careful consideration is given to the settings of the starting temperature and the cooling rate.
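The Metropolis-style acceptance test used in Algorithm 2.2.7 (accept any improvement; accept a worsening move with probability e^(−d/g)) may be sketched as:

```python
import math
import random

def accept(f_current, f_trial, temperature):
    """SA acceptance test: d <= 0 means the trial is no worse and is
    always accepted; otherwise accept with probability exp(-d / g)."""
    d = f_current - f_trial  # positive when the trial solution is worse
    return d <= 0 or random.random() < math.exp(-d / temperature)
```

At high temperatures almost any move is accepted, encouraging exploration; as the temperature cools, worsening moves become increasingly unlikely, so the search settles into fine tuning.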
2.2.2.6 Tabu Search
Tabu Search (TS) [179] is a local search-based meta-heuristic designed to avoid the
pitfalls of typical greedy search procedures. It aims to investigate the solution space
that would otherwise be left unexplored. Once a local optimum is reached, upward
1  g ∈ [g_min, g_max], range of temperatures
2  ρ, perturbation percentage
3  c, cooling rate
4  e, equilibrium count
5  e_max, maximum number of successful perturbations
6  while g > g_min do
7      T = B
8      r_t = ⌈ρ|A|⌉
9      Flip r_t random bits of T
10     d = f(B) − f(T)
11     if d ≤ 0 ∨ r < e^{−d/g} then
12         B = T
13         e = e + 1
14     if e == e_max then
15         e = 0, g = g·c, ρ = ρ·c
Algorithm 2.2.7: Simulated Annealing
moves (those that worsen the solutions) are allowed [13]. Simultaneously, the last
moves are marked as tabu during the following iterations to avoid cycling.
A TS-based FS method [100], as shown in Algorithm 2.2.8, has been proposed
in order to deal with reduction problems in conjunction with the use of rough set
theory. It employs a binary representation for the feature subsets. It also maintains a
tabu list τ that holds a record of the most recently evaluated solutions, so that the
algorithm can avoid being trapped in a previously explored region, and is restrained
from generating solutions of very low quality. The tabu list is usually initialised with
two feature subsets: the empty subset ∅, and the set containing all available features A.
The approach starts by ranking the individual features according to their eval-
uation scores, and invokes a procedure to generate l new trial solutions that are
neighbours to a given candidate solution, with a hamming distance of up to l fea-
tures. The algorithm continues to generate a new trial at each iteration, until no
improvement has been observed for a predefined number of iterations. It then initiates
two mechanisms to further “mutate” a given candidate solution: shaking and
diversification. Shaking is essentially a greedy backward local search: each of the
selected features is examined one by one, in order to check whether its removal
produces a higher quality solution, or a subset with the same evaluation score but a
reduced size. The diversification procedure attempts to generate a new candidate
1  C, candidate solutions ordered by quality
2  τ, tabu list
3  l, number of trials
4  T ∈ T, |T| = l, neighbouring solutions
5  Q, number of occurrences of features in T
6  k ≤ k_max, number of iterations without improvements
7  C = ∅, B = ∅, B* = ∅, τ = {∅, A}
8  for i = 1 to |A| do C = C ∪ {a_i}
9  while |B| < |A| do Local search(B)
10 for g = 1 to g_max do
11     while k < k_max do
12         for j = 1 to l do
13             T = B
14             Flip j random bits of T
15             if T ∈ τ then
16                 j = j − 1
17             else
18                 T = T ∪ {T}
19         B = argmax_{T∈T} f(T)
20         τ = τ ∪ {B}
21         if f(B) > f(B*) then
               // Shaking
22             B* = B
23             foreach C ∈ C do
24                 if B* \ C ∉ τ ∧ f(B* \ C) ≥ f(B*) then
25                     B* = B* \ C
26         else
27             k = k + 1
       // Diversification
28     for i = 1 to |A| do
29         r_Q ∈ {1, ..., |A|}
30         if r_Q > Q_i then B = B ∪ {a_i}
Algorithm 2.2.8: Tabu Search
solution, which contains features chosen with probability inversely proportional to
their number of appearances in the trial solutions. This process continues until the
maximum number of iterations has been reached.
The greedy mechanisms employed by TS are very beneficial to quickly locating
potentially better solutions, but they may still lead to locally optimal feature subsets.
TS also adopts a trial generation procedure similar to the cloning process of CSA;
although the cost is not exponential, it may impose a significant overhead for high
dimensional problems.
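The trial-generation loop of TS sketched above (flip j random bits for the j-th trial, discarding and regenerating any trial found in the tabu list) can be illustrated in Python. This is an illustrative reconstruction, not the thesis implementation; the function name `generate_trials` and the attempt guard are assumptions.

```python
import random

def generate_trials(current, tabu, l, rng=random.Random(42)):
    """Generate up to l neighbouring trials of a bit-vector solution by
    flipping j random bits for the j-th trial, skipping tabu solutions."""
    trials = []
    attempts = 0
    while len(trials) < l and attempts < 10 * l:  # guard for tiny search spaces
        attempts += 1
        t = list(current)
        # the j-th trial flips j bits, so neighbours lie at growing Hamming distances
        for pos in rng.sample(range(len(t)), k=len(trials) + 1):
            t[pos] = 1 - t[pos]  # flip the selected bit
        t = tuple(t)
        if t not in tabu:  # a tabu trial is discarded and regenerated
            trials.append(t)
    return trials
```

The best trial of the returned list would then be chosen by `argmax` over the subset evaluator, as in Algorithm 2.2.8.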
2.2.2.7 Artificial Bee Colony
The Artificial Bee Colony [136] (ABC) algorithm is inspired by the intelligent be-
haviour of a honey bee swarm when searching for promising food sources. Because
ABC is a relatively new algorithm, and much of the original concept has been modi-
fied and/or omitted in the existing FS adaptations [199, 242], a brief description of
the original method is given here initially. In this algorithm, a colony of artificial bees
is divided into three groups: employed bees, onlookers and scouts. The positions of
food sources represent possible solutions to the optimisation problem. The nectar
amount of a food source corresponds to the quality of its associated solution. The
number of employed bees determines the number of solutions to be simultaneously
explored (and maintained).
Following an initial, randomly generated distribution of food source positions, an
employed bee attempts to locate a neighbouring food source, and evaluates its nectar
amount. If the quality of the nearby source is greater, the employed bee will point to
the newer position, otherwise the previous food source is preserved. An onlooker
watches the employed bees dance at the hive, sharing information regarding the
discovered sources, and independently selects a food source to visit (following the
same neighbourhood investigation procedure). The better food sources are recorded
in place of the previously found locations. Employed bees abandon the unvisited
food sources and become scouts, who perform random search for new solutions.
This process repeats until a predefined set of requirements is met, e.g., the maximum
number of iterations.
The rough set-based ABC FS method [242] first groups the instances by the
decision attributes, and applies a greedy local search to find the reduced feature sets
(for each class). ABC is then used to choose a random number of features out of
each set, and to combine the chosen features into the final feature subset. Due to the
presence of the initial local search, this approach only requires a population size as
big as the number of classes. A more general approach [199] considers features as
food sources. It configures the population of both employed bees and onlookers to be
equal to the number of features. Each employed bee is allocated one feature in the
beginning, and may be merged with others following the decisions of an onlooker,
forming feature subsets in the process. The merge of B_pi and B_pj only happens if
r · (f(B_pi) − f(B_pj)) > 0, r ∈ [0, 1].
In order to make the approach more scalable for data sets with large numbers of
features, an alternative method is described in Algorithm 2.2.9 that fits more closely
to the original ABC algorithm. It uses a predefined population size independent of
the number of features, which is initialised with randomly formed subsets B. Both
the employed bees and onlookers employ the same neighbourhood investigation
procedure, and accept a neighbouring solution if it is better than the subset it
is currently examining. An onlooker q_i picks a particular employed bee p_j with
probability:

    f(B_pj) / Σ_{j=1}^{|P|} f(B_pj)    (2.30)

which is in proportion to the evaluation score of its current feature subset f(B_pj),
and marks p_j as visited.
At the end of the neighbourhood inspection procedure, any employed bee that is
unvisited generates and evaluates a new random subset B, as its current solution is
very likely to be of low quality. The process repeats until gmax iterations have been
completed. The current best solution B*, which has been updated at every iteration, is
returned as the final result. A solution adjustment procedure similar to the move
operation, previously described in Algorithm 2.2.1, is also employed with promising
results. In particular, it allows an onlooker to generate a neighbouring solution by
moving its current subset towards that of the inspecting employed bee.
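The proportional selection of Eqn. 2.30 amounts to roulette-wheel sampling over the employed bees' evaluation scores. A small Python sketch follows; the function name and the numerical fallback are assumptions for illustration.

```python
import random

def pick_employed_bee(scores, rng=random.Random(0)):
    """Roulette-wheel selection of an employed bee: bee j is chosen with
    probability f(B_pj) / sum of all scores (Eqn. 2.30)."""
    total = sum(scores)
    r = rng.uniform(0, total)
    acc = 0.0
    for j, s in enumerate(scores):
        acc += s
        if r <= acc:
            return j
    return len(scores) - 1  # fallback against floating-point rounding
```

An onlooker would call this once per iteration, mark the chosen bee as visited, and then apply the same neighbourhood investigation to its subset.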
2.2.2.8 Ant Colony Optimisation
The Ant Colony Optimisation (ACO) algorithm [65] is originally proposed for solving
hard combinatorial optimisation problems. It is based on the behaviour of ants
seeking an optimal path between their colony and a source of food. The approach
uses a group of simple agents called ants that communicate indirectly via pheromone
trails, and probabilistically constructs solutions to the problem under consideration.
Several adaptations of the ACO algorithm have been proposed for the FS problem
domain, a number of which focus on rough set [38, 138] and fuzzy-rough set-based
p_i ∈ P, i = 1 to |P|, the group of employed bees
q_i ∈ Q, i = 1 to |P|, the group of onlooker bees
T_i, a temporary subset for p_i

for g = 1 to gmax do
    for i = 1 to |P| do
        if p_i.visited == false then
            B_pi = a new random subset
        else
            if f(neighbour(B_pi)) > f(B_pi) then
                B_pi = neighbour(B_pi)
    for i = 1 to |P| do
        select p_j with a probability of f(B_pj) / Σ_{j=1}^{|P|} f(B_pj)
        p_j.visited = true
        if f(neighbour(B_pj)) > f(B_qi) then
            B_qi = neighbour(B_pj)
        else
            B_qi = B_pj
    Update B*

Algorithm 2.2.9: Artificial Bee Colony Optimisation
[122] subset evaluators, while a more general approach also exists in the literature
[134].
In ACO-based FS algorithms, features are represented as nodes in a fully con-
nected bi-directional graph; a candidate feature subset B is therefore a path that
connects the selected features. Two sets of hints are available to the ants: the heuris-
tic information η and the pheromone values τ. η is a pre-constructed matrix of size
|A|², where A is the set of original features. Cell η_jk = η_kj stores the evaluation score
of the feature subset {a_j, a_k}, and signifies the quality of the path between a_j and a_k.
τ is another matrix of the same size that stores the pheromone values deposited by
the ants; it is initially populated with a constant value τ0.
As shown in Algorithm 2.2.10, during every iteration, each ant begins from a
random feature; an edge connecting the previous node a_c and an unvisited feature
a_l is chosen with probability:

    prob_l = (τ_lc^α · η_lc^β) / (Σ_{a_l ∉ B} τ_lc^α · η_lc^β)    (2.31)
where α and β are predefined parameters. Following the path construction process,
an active (on-line) update of τ is performed, according to rules such as:
    τ_jk = m·τ_jk + n·(1/|B|)    (2.32)
where m and n are predefined weights [122, 138].
A passive (off-line) update [122, 138] may also be performed once the whole
path (feature subset) B is established:
    τ_jk = ρ·τ_jk + f(B),  if j ∈ B ∧ k ∈ B
    τ_jk = ρ·τ_jk,         otherwise           (2.33)
where ρ is the evaporation rate. For feature subset evaluators with a pre-determined,
maximum evaluation score, e.g., rough and fuzzy-rough set-based evaluators that
have a score range of 0 to 1, such information may be used to stop an ant from
traversing towards further nodes, once the highest possible fitness value is obtained.
For a generic evaluation technique, an ant may stop if f(B) > f(B*), the best score found so far, or if f(B ∪ {a_l}) < f(B), where a_l is the feature about to be included.
As ACO requires a pre-constructed heuristic information matrix, O(|A|²) subset
evaluations are necessary in order to calculate the pair-wise feature
dependency, which may become prohibitive for large data sets or high-complexity
subset evaluators. To combat this, a starting set of essential features may be
calculated in advance, such as the “core” for rough set and fuzzy-rough set-based
techniques [38]. This may significantly reduce the computational overhead, by
eliminating the need to consider such features while traversing the graph. In addition,
a normalisation process for τ has been proposed [138] to avoid search stagnation
caused by extreme relative differences between pheromone trails.
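The transition rule of Eqn. 2.31 can be sketched in Python as follows. The matrices are represented as nested lists, and the parameter values α = 1 and β = 2 are illustrative assumptions, not values prescribed by the thesis.

```python
def transition_probabilities(current, candidates, tau, eta, alpha=1.0, beta=2.0):
    """Probability of each unvisited feature l being chosen from node c
    (Eqn. 2.31): prob_l is proportional to tau[l][c]**alpha * eta[l][c]**beta."""
    weights = {l: (tau[l][current] ** alpha) * (eta[l][current] ** beta)
               for l in candidates}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}

# Symmetric 3-feature example: pheromone uniform, heuristic favours feature 1
tau = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
eta = [[0, 0.5, 0.25], [0.5, 0, 0.75], [0.25, 0.75, 0]]
probs = transition_probabilities(0, [1, 2], tau, eta)
```

With uniform pheromone, the choice is driven entirely by the heuristic term, so the feature with the higher pair-wise score receives the larger probability.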
2.2.2.9 Firefly Algorithm
The Firefly Algorithm (FA) [277] is a meta-heuristic inspired by the flashing behaviour
of fireflies, which acts as a signal system to attract others. This approach has several
underlying assumptions: 1) all fireflies are uni-sexual, so that an individual will
be attracted to all others; 2) the brightness of a firefly is proportional to its fitness
value but can decrease when observed over distance; and 3) a random exploration is
p_i ∈ P, i = 1 to |P|, group of ants
B_pi, current edges (feature subset) traversed by p_i
η_jk = η_kj, j, k = 1 to |A|, heuristic information
τ_jk = τ_kj, j, k = 1 to |A|, pheromone values
τ0, initial pheromone
ρ, evaporation rate

Initialise parameters
for j = 1 to |A| − 1 do
    for k = j + 1 to |A| do
        η_jk = f({a_j, a_k})
        τ_jk = τ0
for g = 1 to gmax do
    for j = 1 to |A| − 1, k = j + 1 to |A| do
        τ_jk = ρ·τ_jk
    normalise τ
    for i = 1 to |P| do
        a_c = a_r, random 1 ≤ r ≤ |A|
        B_pi = {a_c}
        while |B_pi| < |A| do
            select a_l ∉ B_pi with probability ∝ τ_lc^α · η_lc^β
            if f(B_pi ∪ {a_l}) < f(B_pi) then break
            B_pi = B_pi ∪ {a_l}
            τ_lc = (1 − f(B_pi))/2 + f(B_pi)·τ_lc
            a_c = a_l
    for i = 1 to |P| do
        for j = 1 to |A| − 1, k = j + 1 to |A| do
            τ_jk = τ_jk + f(B_pi)
    Update B*

Algorithm 2.2.10: Ant Colony Optimisation
performed if no brighter fireflies can be seen. It has been shown that FA degenerates
into the particle swarm optimisation algorithm with specific parameter settings.
FA has been successfully applied to addressing rough set-based FS problems [12],
via the use of a population size equal to the number of features. Each individual's
feature subset is initialised with one of the original features: B_pi = {a_i}. The
brightness I_i of p_i is determined by the rough set dependency score of its associated
feature only. The best mating partner p_j for a firefly p_i should satisfy: 1) I_j > I_i; 2) the
distance f* − f(B_pi ∪ B_pj) is minimal over all p_j ∈ P, j ≠ i; and 3) f(B_pi ∪ B_pj) > f(B_pj).
The two subsets then merge together and the process repeats for all fireflies until the
maximum rough set dependency score is achieved. This implementation removes all
stochastic components from the base FA algorithm, and delivers compact rough set
reducts in a manner similar to that of a greedy local search.
An alternative FA-based FS approach can be developed. Briefly, it makes better
use of the stochastic elements proposed by the original, base FA algorithm. As
illustrated in Algorithm 2.2.11, such an approach supports the use of any subset-
based evaluator, and shares similar intentions and modifications to those of the
improved ABC algorithm explained in Section 2.2.2.7. A population P of predefined
size is initialised with random subsets. The brightness of p_j when observed by p_i is
calculated using:
    I_ij = f(B_pj) · e^(−γ·d(B_pi, B_pj)²)    (2.34)
where γ is a predefined parameter termed the “absorption coefficient”. This im-
plements the original idea in which the attractiveness of a firefly decreases as the
distance between it and its mating partner increases. In FS terms, this means that
the larger |B_pi ⊕ B_pj| (the total number of mismatched features) is, the dimmer
the firefly will be perceived to be. A subset B_pi is moved towards its best mating
partner by a distance of:
    d_ij = d(B_pi, B_pj) · e^(−γ·d(B_pi, B_pj)²)    (2.35)
The resulting subset B_pi′ replaces the previous B_pi if it is a better solution;
otherwise the original subset is maintained. At every iteration, the current best
solution B* is updated, and it is returned once the process reaches the maximum
number of iterations.
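The brightness and move-distance formulae of Eqns. 2.34 and 2.35 can be made concrete with a short Python sketch, taking d(B_pi, B_pj) as the size of the symmetric difference |B_pi ⊕ B_pj| described above. Function names and the γ value are illustrative assumptions.

```python
import math

def hamming(b_i, b_j):
    """d(B_pi, B_pj): the number of mismatched features |B_pi ⊕ B_pj|."""
    return len(b_i ^ b_j)  # symmetric difference of feature-index sets

def observed_brightness(f_j, b_i, b_j, gamma=0.1):
    """Eqn. 2.34: the brightness of p_j as seen by p_i decays with distance."""
    d = hamming(b_i, b_j)
    return f_j * math.exp(-gamma * d * d)

def move_distance(b_i, b_j, gamma=0.1):
    """Eqn. 2.35: how far (how many features) B_pi moves towards B_pj."""
    d = hamming(b_i, b_j)
    return d * math.exp(-gamma * d * d)
```

The exponential decay means that, of two partners with equal evaluation scores, the one sharing more features with the observer appears brighter and is therefore preferred.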
2.2.2.10 Particle Swarm Optimisation
Particle Swarm Optimisation (PSO) [7] is a type of method that optimises a problem
by exploiting a population of particles P (also referred to as the swarm), which
move around in the solution space with simulated positions and velocities. The
movement of a given particle is not only influenced by its own current best position,
but also guided towards the currently known group-wise best position in the search
space. The individual current best solution is constantly updated when the individual
particles locate better positions. The previously introduced FA is very closely related
to PSO, having many similar underlying principles, especially with regards to the
p_i ∈ P, i = 1 to |P|, group of fireflies
q_i ∈ Q, i = 1 to |Q|, |Q| = |P|, temporary group of fireflies
γ, light absorption coefficient
I_ij, observed brightness of p_j by p_i

Initialise parameters
Random initialisation
for g = 1 to gmax do
    Q = ∅
    for i = 1 to |P| do
        select p_j, j = argmax_j I_ij, j ≠ i
        d_ij = d(B_pi, B_pj)
        p_i′ = move p_i towards p_j by d_ij·e^(−γ·d_ij²)
        if f(B_pi′) > f(B_pi) then B_qi = B_pi′
    P = Q
    Update B*

Algorithm 2.2.11: Firefly Algorithm
concept of particle movements. However, the fireflies in FA are only attracted and
move towards locally observed best mating partners.
When applied to FS (see Algorithm 2.2.12), the velocity v_i of a given candidate
feature subset B_pi, which represents the number of features to be altered, is
calculated by:

    v_i = w_g·v_i + c1·rd1·d(B_pi, B̂_pi) + c2·rd2·d(B_pi, B*)    (2.36)
where wg is a gradually decreasing inertia weight, and c1 and c2 are the acceleration
constants giving weights to the current individual best and the group-wise best
solution, respectively. The outcome of the velocity calculation is further randomised
via the use of random numbers 0 ≤ rd1, rd2 ≤ 1. It has been suggested in the literature
[261] that the velocity should be regulated by a predefined value vmax, since the
number of features being modified can potentially become very large. Finally, once
the number has been determined, the new candidate subset is calculated following
Algorithm 2.2.1.
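The velocity update of Eqn. 2.36, with the v_max clamping suggested in [261], can be sketched as follows. Subsets are modelled as sets of feature indices, with distance again taken as the number of mismatched features; the rounding to an integer feature count is an assumption of this sketch.

```python
import random

def velocity(v_prev, b_cur, b_local, b_global, w_g, c1=2.0, c2=2.0,
             v_max=10, rng=random.Random(1)):
    """Eqn. 2.36 with clamping: the velocity (number of features to alter)
    is attracted towards the local best and group-wise best subsets."""
    rd1, rd2 = rng.random(), rng.random()
    d_local = len(b_cur ^ b_local)    # Hamming distance to individual best
    d_global = len(b_cur ^ b_global)  # Hamming distance to group best
    v = w_g * v_prev + c1 * rd1 * d_local + c2 * rd2 * d_global
    return max(1, min(int(round(v)), v_max))  # regulate by v_max, keep >= 1
```

The resulting integer would then drive the move operation of Algorithm 2.2.1, which alters that many features of the particle's current subset.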
There exists significant debate surrounding the velocity calculation [41, 167, 261].
This reflects the discrepancy between the intended usage of “movement” proposed
in the base PSO algorithm, and its actual implementation in PSO-based FS. For
continuous-valued optimisation, notions such as velocity and movement are intuitive.
p_i ∈ P, i = 1 to |P|, group of particles
B̂_pi, local best subset found by p_i
c1, c2, acceleration constants towards B̂_pi and B*
w_g ∈ [wmin, wmax], gradually decreasing inertia weight
v_i ∈ [1, vmax], current and maximum velocity

Initialise parameters
Random initialisation
for g = 1 to gmax do
    Update B*, B̂_pi
    for i = 1 to |P| do
        random rd1, rd2, 0 ≤ rd1, rd2 ≤ 1
        v_i = w_g·v_i + c1·rd1·d(B_pi, B̂_pi) + c2·rd2·d(B_pi, B*)
        Move B_pi towards B̂_pi by v_i
    w_g = wmin + (1 − g/gmax)·(wmax − wmin)

Algorithm 2.2.12: Particle Swarm Optimisation
They are used to locate a possible intermediate, interpolated solution between two
solution vectors, i.e., points in a continuous space. Yet in FS, the features are
discrete-valued, and the current binary representation does not allow a straightforward,
meaningful interpolation between two feature subsets. Therefore, PSO-based FS
may potentially benefit from the integer-valued representation as used in HS-based
FS.
2.3 Summary
This chapter has introduced a selection of FS techniques for the purpose of evaluating
the quality of feature subsets. The concept of individual feature-based measures
(i.e., feature ranking methods) [147, 162, 219] has been described. They offer
distinctive differences to the group-based approaches [52, 93, 126] that consider
a given feature subset as a whole. Various FS models including filter-based [53],
wrapper-based [164], hybrid [119, 194, 297] and embedded methods [264] have
also been introduced, forming alternative means with which to assess the goodness
of feature subsets. Three group-based filter evaluators: CFS [93], PCFS [52], and
FRFS [126] have been explained in detail, as they are the main methods adopted to
demonstrate the efficacy of the approaches proposed in this thesis.
More importantly, this chapter has presented a comparative review of nine dif-
ferent stochastic FS search strategies. Their underlying respective inspirations span
a wide range of areas, including evolution, biology, physics, social behaviour, and
swarm activities, and have been applied to the problem domain of FS. The work
flows of the reviewed algorithms have been illustrated in a uniform manner, and
the common notions and shared mechanisms have also been identified. Existing
methods that are based on the classic heuristics, such as ACO [38, 122, 134, 138], GAs [231, 235], and PSO [41, 167, 261], are summarised. Several more recent
developments, including CSA [230], ABC [199, 242], and FA [12], which are pro-
posed to solve more specific scenarios (or work with fixed types of feature subset
evaluator) are introduced and modified considerably. These modifications enable the
approaches to work with generic, feature subset-based evaluators and thus, facilitate
direct comparison with the proposed HSFS method. A systematic experimental
evaluation of these reviewed methods has been carried out, in order to demonstrate
their efficacy, with results presented later in Section 3.5.
Chapter 3
Framework for HSFS and its
Improvements
I N this chapter, a new FS search algorithm named HSFS is presented. This approach
is based on HS, a recently proposed, simple yet powerful optimisation technique
inspired by the improvisation process of music players. HSFS is a general approach
that can be used in conjunction with a wide range of feature subset evaluation
techniques. It is particularly beneficial for group-based measures such as CFS, PCFS
and FRFS, which assess the quality of a given feature subset as a whole, rather than
a combination of individual-feature-based scores. Owing to the stochastic nature
of HS, the proposed approach is able to escape from local best solutions, and can
identify multiple quality feature subsets.
Despite being a population-based approach, HSFS works by generating a new
harmony that encodes a candidate feature subset, after considering a selection of
existing quality feature subsets stored in the harmony memory. This forms a contrast
with conventional evolutionary approaches such as GA, which consider only two
(parent) vectors in order to produce a new (child) vector. This characteristic, along
with the simplicity of HS, is exploited in order to improve the robustness and
flexibility of the underlying search mechanism and hence, to help obtain better
quality feature subsets.
The reliance on predefined, constant parameters limits the exploitation of the origi-
nal HS algorithm. It is difficult to determine a good set-up without an ample number
of test runs. Employing the same parameter setting for both initial exploration and
final fine-tuning may also limit the search performance. Furthermore, the original
algorithm is designed to work with single objective optimisation problems, whilst
the problem domain of FS is at least two dimensional (subset size reduction and
evaluation score maximisation).
In order to overcome these drawbacks, a number of modifications to HSFS are
also proposed. Methods are introduced to tune the parameters dynamically: an initial
set-up is used to encourage exploration, and the parameter values then gradually change
during the course of the algorithm. At the end of the search, a different set-up is
prepared for fine-tuning of the final result. In contrast to the fixed-parameter version,
the effort spent in determining good parameter settings is reduced significantly, while
the overall search performance is simultaneously improved. An iterative refinement
strategy is also exploited, which recursively searches for smaller feature subsets while
preserving the evaluation quality of the discovered candidate solutions.
The remainder of this chapter is structured as follows. Section 3.1 introduces
the key notions of HS and its search procedures. Section 3.2 summarises the initial
development made towards HSFS, which is centred on the use of binary string-based
feature subset representation. Section 3.3 describes the proposed HSFS algorithm
that utilises a flexible, integer-valued encoding scheme, allowing the stochastic in-
ternal mechanisms of HS to be better exploited. Section 3.4 details the additional
improvements developed to further enhance the performance of the proposed ap-
proach. Finally, the results of experimental evaluation are reported in Section 3.5,
followed by a summary given in Section 3.6.
3.1 Principles of HS
The original HS algorithm is designed to solve numerical optimisation problems,
and most of its early applications [157] involve discrete-valued variables. When
applied to such problems, musicians typically represent the decision variables of a
given cost function, and HS acts as a meta-heuristic algorithm that attempts to find a
solution vector that optimises this function. In such a search process, each decision
variable (musician) generates a value (musical note) for finding a global optimum
(best harmony). The aim here is to provide a thorough explanation of the algorithm,
including its key notions and iteration steps, so that the proposed HSFS algorithm
may be better introduced thereafter.
3.1.1 Key Notions
The key notions of HS, as illustrated in Fig. 3.1, are musicians, notes, harmonies,
fitness, and harmony memory. In most optimisation problems solvable using HS, the
musicians P = {p_i | i = 1, ..., |P|} represent the variables of the cost function being
optimised, and the values of the variables are referred to as musical notes. A harmony
H, |H| = |P|, is a candidate solution vector containing the values for each variable,
and a collection of good quality solutions is stored in the harmony memory
ℋ = {H^j | j = 1, ..., |ℋ|}. Note that all of the above mentioned collections, P, H and ℋ,
are fixed-sized, ordered lists, rather than sets. In particular, H^j_i, i = 1, ..., |P|,
j = 1, ..., |ℋ|, denotes the value selected by the ith musician in the jth harmony stored
within the harmony memory.
Figure 3.1: Key notions of HS
For a newly constructed (empty) harmony, all of the internal values are initialised
as −, indicating that no musical notes have been assigned. A harmony memory ℋ can
be concretely represented as a two-dimensional matrix. Without losing generality, the
number of rows (harmonies) |ℋ| is a predefined parameter that limits the maximum
number of harmonies to be stored. Each column of the matrix is dedicated to one
musician, which provides a pool of playable notes for future improvisations. In this
thesis, such a pool is referred to as the note domain ℵ_i of a musician p_i:

    ℵ_i = ⋃_{j=1}^{|ℋ|} H^j_i,  H^j ∈ ℋ,  i = 1, ..., |P|    (3.1)
3.1.2 Parameters of HS
The original HS algorithm [155] employs five parameters, including three core
parameters: 1) the size of harmony memory |H|; 2) the harmony memory considering
rate δ; and 3) the maximum number of iterations gmax. There are two optional ones:
1) the pitch adjustment rate ρ; and 2) the adjusting bandwidth that is later developed
into the fret width τ [84]. For numerical optimisation, the number of musicians |P| is
generally implied by the problem itself, and is equal to the number of variables in the
optimisation function. The two factors that influence the actions of a musician, δ
and ρ, are described below, and their effects are explained further in Section 3.1.3.
• Harmony Memory Considering Rate
The harmony memory considering rate δ, 0 ≤ δ ≤ 1, is the rate of choosing one
value from the historical notes stored in the harmony memory, while (1 − δ) is
the rate of randomly selecting one value from the range of all possible values.
If δ is set to a low value, the musicians will focus on exploring other areas of
the solution space, and a high δ will restrict the musicians to historical choices.
• Pitch Adjustment Rate
The pitch adjustment rate ρ, 0 ≤ ρ ≤ 1, causes a musician to select
a value neighbouring its current choice. For example, for a given value v, its
new value will be calculated based on the formula v + (random(−1, 1) × τ).
For discrete variables, this simply means choosing the immediate left or right
neighbouring value. For continuous problems, τ is an arbitrary bandwidth that
constrains the maximum distance allowed when shifting the current value.
(1 − ρ) is the probability of using the chosen value without further alteration.
This pitch adjustment procedure only occurs if the note was chosen from the
harmony memory, i.e., following δ activation.
3.1.3 Iterative Process of HS
HS can be divided into two core phases: initialisation and iteration, as shown in
Fig. 3.2. A simple discrete numerical problem [84] given in Eqn. 3.2 is used here to
illustrate the process of HS.
Minimise (a− 2)2 + (b− 3)4 + (c − 1)2 + 3 (3.2)
Figure 3.2: Iteration steps of HS
where a, b, c ∈ {1, 2, 3, 4, 5}.
1. Initialise Problem Domain
In the beginning, the parameters used in the search need to be established.
This includes |H|, δ, gmax, ρ, τ, and |P|.
According to the problem at hand, the group of musicians {p1, p2, p3} is ini-
tialised with a size equal to the number of variables (|P| = 3), each corresponding
to one of the function variables a, b and c. The harmony memory is filled with
randomly generated solution vectors. In the example problem, three randomly
generated solution vectors may be {2, 2, 1}, {1, 3, 4} and {5, 3, 3}.
2. Improvise New Harmony
A new value is chosen randomly by each musician out of their note domain,
and together form a new harmony. During the improvisation process, the
stochastic events controlled by δ and ρ will also occur, causing the value of
the selected notes to change.
In the example, musician p1 may randomly choose 1 out of ℵ1 = {2, 1, 5}, p2
chooses 2 out of ℵ2 = {2, 3, 3} and p3 chooses 3 out of ℵ3 = {1, 4, 3}, forming
a new harmony {1, 2, 3}. Given the above example, with δ = 0.9 and ρ = 0.1,
musician p1 will choose from within his note domain ℵ1 = {2, 1, 5} with a
probability of 0.9. After making a choice, say 5, the musician will choose the
left or right neighbour with a probability of 0.05 each, and the left neighbouring
value, 4, may then be chosen in the end. Alternatively, the musician may choose
from the range of all possible values, i.e., {1, 2, 3, 4, 5}, with a probability of
0.1, and the note 4 may again be chosen but without further pitch adjustment.
To further ease the understanding of HS, Algorithm 3.1.1 presents an outline
of the improvisation procedure in pseudocode.
p_i ∈ P, i = 1, ..., |P|, group of musicians
H^j ∈ ℋ, j = 1, ..., |ℋ|, harmony memory
H^j_i, the value of the ith variable in H^j
H^new, emerging harmony
ℵ_i = ⋃_{j=1}^{|ℋ|} H^j_i, note domain of p_i
δ, harmony memory considering rate
ρ, pitch adjustment rate
τ, fret width
min_i, max_i, the value range of the ith variable

for i = 1 to |P| do
    random rδ, 0 ≤ rδ ≤ 1
    if rδ < δ then
        random r_i, r_i ∈ ℵ_i
        random rρ, 0 ≤ rρ ≤ 1
        if rρ < ρ then
            random rτ, −1 ≤ rτ ≤ 1
            r_i = r_i + rτ·τ
    else
        random r_i, min_i ≤ r_i ≤ max_i
    H^new_i = r_i
return H^new

Algorithm 3.1.1: Improvisation process of original HS
3. Update Harmony Memory
If the new harmony is better than the worst harmony in the harmony memory
(judged by the objective function), the new harmony is then included in the
resulting harmony memory and the existing worst harmony is removed.
For example, assume the newly improvised harmony {1, 2, 3} has an evaluation
score of 9, making it better than the worst harmony in the harmony memory,
{5, 3, 3}, which has a score of 16. Therefore the harmony {5, 3, 3} is removed
from the harmony memory, and replaced by {1, 2, 3}. If {1, 2, 3} had instead
scored higher than 16, it would be the one being discarded.
4. Iteration
The algorithm continues to iterate until the maximum number of iterations
gmax is reached. In the end, the highest quality solution present in the harmony
memory is returned as the final output.
In the example, if the musicians later improvise a new harmony with values
{2, 3, 1}, which is very likely as these numbers are already in their respective
note domains, the problem will be solved (with a minimal fitness score of 3).
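The complete initialise-improvise-update cycle can be condensed into a short Python sketch, applied here to the example problem of Eqn. 3.2. This is a minimal illustration under stated assumptions (function names, harmony memory size and parameter values are choices of this sketch, not of the thesis).

```python
import random

def harmony_search(cost, domains, hm_size=6, delta=0.9, rho=0.1, gmax=2000,
                   rng=random.Random(7)):
    """A minimal HS sketch for discrete minimisation problems such as Eqn. 3.2.
    domains[i] lists the playable values for musician i."""
    # Initialisation: fill the harmony memory with random solution vectors
    hm = [[rng.choice(d) for d in domains] for _ in range(hm_size)]
    for _ in range(gmax):
        new = []
        for i, d in enumerate(domains):
            if rng.random() < delta:                  # memory consideration
                v = rng.choice([h[i] for h in hm])    # note domain of musician i
                if rng.random() < rho:                # pitch adjustment
                    k = d.index(v) + rng.choice((-1, 1))
                    v = d[max(0, min(k, len(d) - 1))]
            else:                                     # random selection
                v = rng.choice(d)
            new.append(v)
        worst = max(hm, key=cost)
        if cost(new) < cost(worst):                   # minimisation: lower is better
            hm[hm.index(worst)] = new
    return min(hm, key=cost)

# Eqn. 3.2: minimise (a-2)^2 + (b-3)^4 + (c-1)^2 + 3, with a, b, c in {1,...,5}
cost = lambda h: (h[0] - 2) ** 2 + (h[1] - 3) ** 4 + (h[2] - 1) ** 2 + 3
best = harmony_search(cost, [[1, 2, 3, 4, 5]] * 3)
```

Note that the pitch adjustment is only attempted inside the memory-consideration branch, matching Algorithm 3.1.1.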
3.2 Initial Development
This section describes the preliminary investigations [60] carried out, which explore
the feasibility of applying HS to the problem domain of FS. This initial HS-based FS
approach, being a stand-alone and functional application of HS, helped substantially
in obtaining a better understanding of the internal mechanisms of HS, and its
application to the problem domain of FS. It also revealed a number of drawbacks that
inspired further development, from which the current HSFS algorithm is derived.
3.2.1 Binary-Valued Representation
A binary-valued feature subset representation has been adopted in the initial ap-
proach, which is also the most commonly used representation in the literature. Recall
from Section 3.1.1 that the key notions of HS are musicians, notes, harmonies and
harmony memory. The binary-valued approach maps musicians directly onto the
available features to be selected, i.e., |P| = |A|. The note domain ℵ_i of a given
musician p_i contains only the binary values ℵ_i = {0, 1}, which indicate whether the
corresponding feature is included (1) or not (0) in the emerging feature subset.
A harmony is represented as a series of bits that encodes the selected fea-
tures. For example, as shown in Table 3.1, for a given data set with 6 features
A = {a1, a2, a3, a4, a5, a6}, harmony H1 = {0, 1, 1, 0, 0, 0} translates into feature sub-
set B_H1 = {a2, a3}. The binary encoding of feature subsets is a straightforward
mapping. It allows the procedures of HS, initialisation and iteration, as illustrated in
Fig. 3.2, to be executed in the same fashion as that of standard numerical optimisation
tasks.

Table 3.1: Binary encoded feature subsets

         p1  p2  p3  p4  p5    p6    Represented subset B
    H1   0   1   1   0   0     0     {a2, a3}
    H2   1   0   0   0   0→1   1     {a1, a5, a6}
3.2.2 Iteration Steps
The initialisation step involves filling the harmony memory with randomly generated
feature subsets, i.e., random-valued strings of bits. In order to improvise a new
harmony, each musician randomly selects a value from their respective note domain.
Together, such selected values form a new bit set. This set is then translated back
into a feature subset and evaluated. If the evaluation score is higher than that of the
worst feature subset in the harmony memory, it replaces that worst candidate feature subset;
otherwise, the new bit set is discarded. The process repeats until the maximum
number of iterations gmax has been reached.
In this approach, the harmony memory considering rate δ has little practical
impact, because the number of available notes (0 and 1) for each musician is
very limited. Its most significant use is to flip the bit value, thereby including
a previously unselected feature, or vice versa. Hence, in this initial
development, the parameter δ is simply implemented as the bit flipping rate, and
its effect is demonstrated by the second harmony H2 in Table 3.1: 0→1 signifies a
forced value change due to δ activation, which causes the affected musician p5 to
change its decision to the opposite value.
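The binary improvisation step and the translation back to a feature subset can be sketched as follows. This is an illustrative reconstruction of the initial binary-valued approach; the function names and the flip-rate parameter name are assumptions.

```python
import random

def improvise_binary(hm, flip_rate=0.1, rng=random.Random(3)):
    """One improvisation step of the initial binary-valued HSFS: each musician
    draws a bit from its note domain, then may flip it (the bit-flipping
    reinterpretation of delta, cf. harmony H2 in Table 3.1)."""
    n = len(hm[0])
    bits = []
    for i in range(n):
        b = rng.choice([h[i] for h in hm])  # note domain: column i of the memory
        if rng.random() < flip_rate:
            b = 1 - b                       # forced value change, e.g. 0 -> 1
        bits.append(b)
    return bits

def decode(bits, features):
    """Translate a bit string back into a feature subset,
    e.g. [0,1,1,0,0,0] -> {a2, a3}."""
    return {a for a, bit in zip(features, bits) if bit == 1}
```

The decoded subset would then be evaluated, replacing the worst harmony in the memory if its score is higher.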
3.2.3 Tunable Parameters
The three tunable parameters of this initial approach are: 1) the harmony memory
size |H|, 2) bit flipping rate δ, and 3) the maximum number of iterations gmax. The
harmony memory size is a sensitive parameter; in most cases it is set to between half
of the total number of features and the total number of features, leaving less than
half of the features outside the harmony memory. A large harmony memory will give
each musician more musical notes to choose from when improvising a new harmony.
However, it will require a longer initialisation period in order to fill up the harmony
memory and hence, may lead to slower updates and convergence.
The initial development of HSFS using binary-valued feature subset represen-
tation is an intuitive adoption of the HS algorithm in the FS domain. It is simple
to implement and shares a number of commonalities with other nature-inspired
approaches. However, it has a number of obvious shortcomings:
• Having as many musicians as there are features suggests potential scaling
problems for data sets with large numbers of features.
• The binary note domain gives each musician very limited choices when com-
posing new harmonies, and the pitch adjustment opportunities are also wasted
for binary choices.
• The approach requires a substantial number of iterations in order to reach
convergence. Together, these issues prevent HS from reaching its full potential.
3.3 Algorithm for HSFS
Although simple in concept, the use of a binary-valued note domain limits the efficiency
and explorative potential of HS. To better address these problems, an integer-valued
HSFS algorithm has been developed [62], providing more freedom in the choice
of playable notes, and allowing the stochastic mechanisms of HS to be exploited
more thoroughly. In this section, a description of HS-based FS is given, explaining
how FS problems can be translated into optimisation problems that are then solved by HS.
This section includes illustrative examples of the encoding scheme used to convert
feature subsets into harmony representations. A flow diagram of the search process
is also presented in Fig. 3.4 along with step by step descriptions using FRFS as an
example subset evaluator.
3.3.1 Mapping of Key Notions
For conventional optimisation problems, the number of variables is pre-determined
by the function to be optimised. However, for FS, there is no fixed number of elements
in any potential candidate feature subset. In fact, the size of the emerging subset
itself should be reduced in parallel with the optimisation of the subset evaluation score.
Therefore, when converting concepts, such as those shown in Table 3.2, a musician
is best described as an independent expert or “feature selector”, where the available
features for the feature selectors translate to musical notes for musicians. Each
musician may vote for one feature to be included in the feature subset when such an
emerging subset is being improvised. The harmony is then the combined vote from
all musicians, indicating which features are being nominated.
Table 3.2: Concept mapping from HS to FS
HS                  Optimisation        FS
Musician            Variable            Feature Selector
Musical Note        Variable Value      Feature
Harmony             Solution Vector     Subset
Harmony Memory      Solution Storage    Subset Storage
Harmony Evaluation  Fitness Function    Subset Evaluation
Optimal Harmony     Optimal Solution    Optimal Subset
The entire pool of the original features, A, forms the range of musical notes
available to each of the musicians. Multiple musicians are allowed to choose the
same feature, and they may opt to choose none at all. The fitness function employed
becomes a feature subset evaluation method [52, 93, 126], such as those described
in Section 2.1.1.2, which analyses and scores each of the new subsets found during
the search process. Fig. 3.3 illustrates the important concepts in the same style as
that of Fig. 3.1.
Table 3.3 depicts three example harmonies. H1 denotes a subset
of 6 distinct features: B_H1 = {a1, a2, a3, a4, a7, a10}. H2 shows a duplication of
choices among the first three musicians, and a discarded note (represented by −) from
p6, representing a reduced subset B_H2 = {a2, a3, a13}. H3 signifies the feature subset
B_H3 = {a2, a4, a6, a13}, where a3→a6 indicates that p4 originally voted for a3, but
was forced to change its choice to a6 due to δ activation. For simplicity, the explicit
encoding/decoding process between a given harmony H_j and its associated feature
subset B_Hj is omitted in the following explanation.
For conventional optimisation problems, the range of possible note choices for
each musician is in general different from those for the other musicians. However,
Figure 3.3: Key notions of HSFS
Table 3.3: Feature subsets encoded using integer-valued scheme
      p1   p2   p3   p4      p5    p6    Represented subset B
H1    a2   a1   a3   a4      a7    a10   {a1, a2, a3, a4, a7, a10}
H2    a2   a2   a2   a3      a13   −     {a2, a3, a13}
H3    a2   −    a2   a3→a6   a13   a4    {a2, a4, a6, a13}
when applied to FS, all musicians jointly share one single value range, which is the
set of all features.
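The decoding of an integer-valued harmony into its represented subset can be sketched as follows (an illustrative fragment, using feature indices in place of the labels a1, a2, . . . of Table 3.3):

```python
def decode_harmony(harmony):
    """Decode an integer-valued harmony into the feature subset it
    represents: duplicate votes collapse into one feature, and discarded
    notes (None, shown as "-" in Table 3.3) are dropped."""
    return sorted({note for note in harmony if note is not None})

# H2 of Table 3.3: the votes (a2, a2, a2, a3, a13, -) represent {a2, a3, a13}
assert decode_harmony([2, 2, 2, 3, 13, None]) == [2, 3, 13]
```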
3.3.2 Work Flow of HSFS
The iteration steps of HSFS are demonstrated as follows, where the fuzzy-rough
dependency function of FRFS [126] is employed as the subset evaluator. The
accompanying flow diagrams are given in Figs. 3.4 and 3.5, which are, in principle,
straightforward adaptations of the original HS concepts (Figs. 3.1 and 3.2). As
previously explained in Section 2.1.1.2, FRFS is concerned with the reduction of
information or decision systems through the use of fuzzy-rough sets. It is used here
to provide a concrete example of the work flow of HSFS. Recall from Section 2.1.1.2
that the original FRFS method employs a greedy hill-climbing (HC) based algorithm
termed fuzzy-rough QuickReduct [116], which extends the original crisp version
[39]. The fuzzy-rough dependency function is utilised in order to identify a minimal
fuzzy-rough reduct (a feature subset achieving full dependency evaluation).
In contrast to stochastic methods such as HSFS, greedy search mechanisms such
as fuzzy-rough QuickReduct add a feature to the current candidate subset at each
Figure 3.4: Work flow of HSFS
iteration. Although generally quick to converge, fuzzy-rough QuickReduct considers
each feature only in terms of the increase in quality that results from adding it to
the current candidate subset. It therefore disregards the potential contribution of
pair-wise or group-wise feature interactions. As a result of this greedy behaviour,
the possibility of identifying groups of features that collectively form a more
informative feature subset is significantly reduced. QuickReduct and other
deterministic HC algorithms are therefore prone to returning sub-optimal feature
subsets.
1. Initialise Problem Domain
The parameters are assigned according to the problem domain, including: |H|, the number of feature selectors |P|, the maximum number of iterations gmax, and δ.
The subset storage containing |H| randomly generated subsets is then initialised.
This provides each feature selector with a note domain of |H| features, which
may include identical choices, and nulls.
2. Improvise New Subset
Figure 3.5: Improvisation process of HSFS
A new feature is chosen randomly by each feature selector out of its working
feature domain, and together these choices form a new feature subset. In the event
of δ activation, a random feature is chosen from all available features to
substitute the feature selector’s own choice.
For the FS problems dealt with in this thesis, the pitch adjustment rate ρ
is not used. The underlying motivation for the use of ρ is that minor adjustments
to neighbouring values may help discover better solutions, which is generally true
for real-valued optimisation functions. However, as the values are now feature
indices, a feature and its “neighbours” bear no such general relation, and pitch
adjustment would merely result in a change to a possibly unrelated nearby feature.
Note that measures such as correlation [93] and fuzzy-rough dependency [126] may
be utilised to facilitate effective identification of actual neighbouring features and
therefore allow the pitch adjustment mechanism to be exploited.
3. Update Subset Storage
If the newly obtained subset achieves a better fuzzy-rough dependency score
than the worst subset in the subset storage, the new subset is included in the
subset storage and the existing worst subset is removed. The comparison of
subsets takes into consideration both the dependency score and the subset size, in
order to discover the minimal fuzzy-rough reduct at termination.
4. Iteration
The improvisation and comparative update procedure continues until a prede-
fined maximum number of iterations gmax is reached. The final output is the
feature subset with the highest quality, out of those stored within the harmony
memory at termination.
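The four steps above may be sketched in Python as follows. This is a minimal, hypothetical rendering, not the thesis implementation: `evaluate` stands in for any subset evaluator (e.g. the fuzzy-rough dependency function), all parameter values are placeholders, and δ is interpreted as the harmony memory considering rate, so a stored note is reused with probability δ and a random feature is substituted otherwise:

```python
import random

def hsfs(features, evaluate, mem_size=10, n_selectors=5, g_max=200,
         delta=0.8, rng=random):
    """Minimal sketch of the HSFS loop: improvise one subset per
    iteration and replace the worst stored subset whenever the newly
    improvised one is better."""
    domain = list(features) + [None]          # None = discarded note
    memory = [[rng.choice(domain) for _ in range(n_selectors)]
              for _ in range(mem_size)]       # random initialisation

    def decode(harmony):
        return frozenset(v for v in harmony if v is not None)

    def fitness(harmony):
        return evaluate(decode(harmony))

    for _ in range(g_max):
        # reuse a stored note with probability delta, else pick randomly
        new = [rng.choice(memory)[i] if rng.random() < delta
               else rng.choice(domain) for i in range(n_selectors)]
        worst = min(memory, key=fitness)
        if fitness(new) > fitness(worst):     # comparative update
            memory[memory.index(worst)] = new
    return decode(max(memory, key=fitness))
```

With a trivial evaluator such as `len`, the loop drives the stored harmonies towards larger subsets; a real evaluator would balance quality against size as described in step 3.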
HSFS offers a clear advantage in that a group of features is evaluated as a whole.
A newly improvised subset is not necessarily included in the subset storage simply
because one of its features has a locally strong fuzzy-rough dependency score (or a
very high individual importance, regardless of the evaluator employed). This is the
key distinction from the HC-based approaches, and it also allows a good
synergy between HSFS and group-based feature subset evaluators [52, 93, 126].
3.3.3 Complexity Analysis
Following the study in [84], consider a HS process with the following parameters:
the size of the harmony memory (the number of harmonies stored) |H|, the number of
musicians |P|, the number of possible notes (the total number of features) of a
musician |A|, the number of optimal notes (features) of musician pi present in the
harmony memory |ℵi| (|ℵi| ≤ |H|), and the harmony memory considering rate δ. The
probability of finding the optimal harmony, Prob(H), is defined as follows:

Prob(H) = ∏_{i=1}^{|P|} ( δ · |ℵi| / |H| + (1 − δ) · 1 / |A| )    (3.3)

where ρ is not considered because it is an optional operator [84].
Initially, since the harmony memory is populated with random harmonies, there
may not be any optimal note (for any musician) in the harmony memory:

|ℵ1| = |ℵ2| = · · · = |ℵ|P|| = 0    (3.4)
and
Prob(H) = ( (1 − δ) · 1 / |A| )^|P|    (3.5)
which means that the probability Prob(H) is very low. However, as the improvisation
process continues, new feature subsets with better evaluation scores than those
generated randomly may be identified, and thus added to the harmony memory.
The number of optimal notes of musician pi in the harmony memory, |ℵi|, can thus
increase on an iteration-by-iteration basis. Consequently, the probability of finding
the optimal harmony, Prob(H), increases over the course of time.
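The behaviour of Eqns. 3.3 and 3.5 can be illustrated numerically (the parameter values below are hypothetical, chosen purely for illustration):

```python
# Illustrative (hypothetical) parameter values: |P| = 5 musicians,
# |A| = 20 features, |H| = 10 stored harmonies, delta = 0.9.
P, A, H, delta = 5, 20, 10, 0.9

def prob_optimal(optimal_counts):
    """Eqn. 3.3: the product over all musicians of
    delta * |aleph_i| / |H| + (1 - delta) * 1 / |A|."""
    prob = 1.0
    for n_i in optimal_counts:
        prob *= delta * n_i / H + (1 - delta) / A
    return prob

p_start = prob_optimal([0] * P)   # Eqn. 3.5: no optimal notes stored yet
p_later = prob_optimal([4] * P)   # after some optimal notes accumulate
assert p_later > p_start          # Prob(H) grows as the memory improves
```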
Using fuzzy-rough QuickReduct (Section 2.1.1.2) as a comparative example, for
a given data set of |A| features, the worst case will result in (|A|² + |A|)/2
evaluations of the dependency function. When implemented using HSFS, the number
of subset evaluations is the same as the maximum number of iterations gmax (which
is no longer purely dependent on the number of features in the original data). This
characteristic makes HS more favourable when solving complex problems with large
numbers of features.
As for the complexity of the HS algorithm itself: the initialisation requires
O(|P| × |H|) operations to randomly populate the subset storage, and the improvisation
process is of the order O(|P| × gmax), because every feature selector needs to produce
a new feature at every iteration. Here |H| is the subset storage size, |P| is the number
of feature selectors, and gmax is the maximum number of iterations. When comparing
the storage requirements, HSFS clearly requires more storage, as it needs to keep
O(|P| × |H|) features in the subset storage, while HC only works on the current
candidate solution, therefore requiring only O(|A|) storage space.
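The contrast in evaluation counts can be made concrete with a small calculation (the HSFS iteration budget below is illustrative only):

```python
# Worst-case number of subset evaluations: greedy QuickReduct performs
# (|A|^2 + |A|)/2 dependency-function evaluations, whereas HSFS performs
# exactly one subset evaluation per iteration, i.e. g_max in total.
def quickreduct_evals(n_features):
    return (n_features ** 2 + n_features) // 2

# For |A| = 279 (as in the arrhy data set), the greedy worst case is
# 39,060 evaluations, while an HSFS run with an illustrative budget of
# g_max = 2,000 iterations costs 2,000 evaluations regardless of |A|.
```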
Although the two types of approach are analysed here for the sake of comparison,
in reality, FS is used for dimensionality reduction prior to the involvement of a given
application, which will exploit the features belonging to the resultant feature subset.
Thus, this operation has no negative impact upon the run-time efficiency of any
subsequent process that utilises the selected features.
3.4 Additional Improvements
Traditional HS uses fixed, predefined parameters throughout the entire search process,
making it hard to determine a “good” setting without extensive trial runs. The
parameters are also not independent of each other; therefore, finding a good
setting often becomes an optimisation problem in itself. The search results usually
provide no hint as to how the parameters should be adjusted in order to obtain an
increase in performance. This section introduces the proposed improvements to
the HSFS algorithm, making it a more flexible approach better suited to solving FS
problems of high dimensionality.
3.4.1 Parameter Control
To eliminate the drawbacks associated with the use of fixed parameter values, a
dynamic parameter adjustment scheme is proposed [59], in order to guide the
modification of parameter values at run-time. By using tailored sets of parameter
values for the initialisation, intermediate and termination stages, the search process
can benefit greatly from this dynamic parameter environment.
At the beginning of a search, as the musicians are just starting to explore the
solution space, the note domains contain only randomly initialised, low quality notes.
Therefore, a large harmony memory is not essential. In fact, having to maintain a
large pool of sub-optimal harmonies may only confuse the musicians, preventing
them from choosing good values during improvisation. A lower δ at this stage may
also encourage the musicians to seek values outside of the current harmony memory.
As the search approaches convergence, the musicians will usually have found
many sub-optimal harmonies. For such cases, given a high δ, they will almost exclu-
sively choose values from the harmony memory when improvising new harmonies.
Thus, a large pool of good results may contribute to a better solution. Of course,
situations can also occur where the algorithm has not converged by the end of the
search, which could be caused by the complexity of the problem itself, or by a
smaller-than-desired number of iterations. From the above observations, a good
dynamic |H| can be defined as:

|H|_g = |H|_min + (g / gmax) · (|H|_max − |H|_min)    (3.6)
When improvising new harmonies, HS with a low δ focuses less on the consideration
of historical values, and instead more on the entire value range. HS with a
high δ attempts to produce a new harmony out of the existing values stored within the
harmony memory. A dynamic δ that increases its value as the search progresses can
be formulated such that:

δ_g = δ_min + (g / gmax) · (δ_max − δ_min)    (3.7)
Because one of the main advantages of HS is its simplistic structure, the specification
of these rules is designed with computational complexity in mind: the calculation
involved is kept as simple as possible. Alternatively, smooth exponentially increasing
functions may be considered:

|H|_g = |H|_min + (2^(g/gmax) − 1) · (|H|_max − |H|_min)    (3.8)

δ_g = δ_min + (2^(g/gmax) − 1) · (δ_max − δ_min)    (3.9)
Although an exponentially decreasing function has been proposed to control the fret
width [173], for the general scenario of the FS problem discussed in this chapter, it
is counter-intuitive to suggest that the |H| and δ parameters should be adjusted more
aggressively in any one search stage than in the others.
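The two families of schedules can be sketched as follows (the exponential form follows the reconstruction of Eqns. 3.8 and 3.9 given above, where 2^(g/gmax) − 1, like g/gmax, runs from 0 at the start of the search to 1 at termination):

```python
def linear_schedule(g, g_max, v_min, v_max):
    """Linear growth from v_min to v_max over g_max iterations,
    as in Eqns. 3.6 and 3.7."""
    return v_min + (g / g_max) * (v_max - v_min)

def exponential_schedule(g, g_max, v_min, v_max):
    """Smooth exponential alternative (Eqns. 3.8 and 3.9 as
    reconstructed here): 2**(g/g_max) - 1 also runs from 0 to 1."""
    return v_min + (2 ** (g / g_max) - 1) * (v_max - v_min)
```

Both schedules agree at the endpoints; the exponential variant simply grows more slowly in the early iterations and faster near termination.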
All of the aforementioned individual parameter adjustment strategies can be combined
for greater performance gain, allowing different sets of parameter settings
for different search stages, as summarised in Table 3.4. Following initialisation,
the algorithm employs a small harmony memory, with a large chance of randomly
selecting new values. Towards the intermediate stage, the algorithm uses a medium-sized
harmony memory, with a balanced possibility between choosing values from
the harmony memory and from the range of all possible values. Finally, towards the
termination of the process, the algorithm utilises a large harmony memory, with
the values chosen almost purely from the stored good solutions. Note that these stages
are listed here for conceptual explanation purposes; there are no clear boundaries
between them in terms of implementation. Parameter settings gradually shift from
one stage to another as the search progresses.
To further justify these intuitive rules, in Section 3.5.2.1, results are gathered
and compared against the original algorithm with no parameter adjustments, as well
as against the algorithm using the opposite rules, whereby |H| and δ decrease from a
maximum to a minimum value over the search iterations.
Table 3.4: Parameter settings in different search stages
        Initialisation      Intermediate            Termination
δ       Small               Medium                  Large
|H|     Small               Medium                  Large
Effect  High diversity,     Steady improvement      Fine tuning,
        deep exploration    in harmonies            fast convergence
3.4.2 Iterative Refinement
The parameter control technique previously introduced in Section 3.4.1 offers ways to
dynamically change HS parameters, in order to avoid some of the difficulties in finding
a good set of parameters. However, there is an additional parameter introduced in the
HSFS approach: the number of feature selectors |P|. For conventional optimisation
problems, |P| is equal to the number of variables in the optimisation function, which is
predefined and restricts the number of columns in the subset storage. Due to the
concept mapping in HSFS, the number of function variables is transformed into a
virtual concept, and |P| now serves as a hard upper bound for the resulting subset
size, which needs to be defined by the user.
Intuitively, |P| should be equal to the actual limit: the total number of features. Yet
such a configuration often leads to less satisfactory results. This is because the current
structure of HS only supports single-objective optimisation; additional measures are
required in order to enforce size reduction after HS converges in terms of the subset
evaluation score. An alternative method would be to manually initialise |P| to a
smaller value, in order to force HSFS to find solutions within the restricted boundary.
However, such an approach introduces subjective assumptions prior to the search,
and it is often difficult to estimate the amount of redundancy present in any given
data set.
In order to combat this issue, an iterative refinement approach is proposed here,
such that the search process becomes more data-driven and the need for manual
parameter configuration is further reduced. As shown in Algorithm 3.4.1, the
refinement process essentially performs HSFS iteratively, each time with a reduced
feature selector count |P|. If a better or smaller subset is discovered in the previous
iteration, the number of feature selectors is set to be equal to this subset’s size. Since
the best evaluation score achieved so far is recorded, HSFS can safely explore
alternative solution regions. This refinement procedure continues until the latest
feature subset B no longer provides any improvement in either subset quality or size,
i.e., when the condition in Eqn. 3.10 holds, where B* denotes the best feature subset
discovered during the search. In the end, B* is returned.

f(B) < f(B*) ∨ ( f(B) == f(B*) ∧ |B| == |B*| )    (3.10)

1   A, set of all conditional features
2   B, set of selected features
3   B*, current best feature subset
4   B* = A
5   |P| = |A|; B = HSFS(A, |P|)
6   while f(B) ≥ f(B*) do
7       if f(B) == f(B*) ∧ |B| == |B*| then
8           break
9       else
10          B* = B
11          |P| = |B*|; B = HSFS(A, |P|)
12  return B*
Algorithm 3.4.1: Iterative refinement procedure
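The refinement loop may be sketched as follows (an illustrative rendering: `hsfs_search(A, P)` is a hypothetical stand-in for one complete HSFS run with |P| feature selectors, and `evaluate` for the subset evaluator):

```python
def iterative_refinement(features, hsfs_search, evaluate):
    """Sketch of Algorithm 3.4.1: run HSFS repeatedly, shrinking the
    feature selector count |P| to the size of the best subset found,
    until neither quality nor size improves."""
    best = set(features)                    # B* is initialised to A
    current = hsfs_search(features, len(features))
    while evaluate(current) >= evaluate(best):
        if (evaluate(current) == evaluate(best)
                and len(current) == len(best)):
            break                           # no further improvement
        best = current                      # B* = B
        current = hsfs_search(features, len(best))
    return best
```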
Note that for very high dimensional data sets, the musician count may instead be
configured via binary search, as shown in Algorithm 3.4.2. The two parameters |P|min
and |P|max define the search interval, which is narrowed iteratively in a divide-and-conquer
fashion. The aim is to determine the most suitable value of |P|, which is
used to obtain a good quality and compact feature subset.

1   |P|max = |A|
2   |P|min = 1
3   B* = A
4   while |P|min < |P|max do
5       |P| = ⌊(|P|min + |P|max) / 2⌋
6       B = HSFS(A, |P|)
7       if f(B) > f(B*) ∨ ( f(B) == f(B*) ∧ |B| < |B*| ) then
8           B* = B; |P|max = |B|
9       else
10          |P|min = |B|
11  return B*
Algorithm 3.4.2: Musician size adjustment via binary search
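A sketch of the binary search over the musician count is given below. Two details are made explicit that the pseudocode leaves implicit (both are assumptions of this sketch, not taken from the thesis): the incumbent update `best = subset`, and a forced advance of the lower bound so that the loop cannot stall when HSFS keeps returning subsets of the same size:

```python
def refine_by_binary_search(features, hsfs_search, evaluate):
    """Sketch of Algorithm 3.4.2: binary-search the musician count |P|
    between 1 and |A|, keeping the best subset seen so far.
    `hsfs_search(A, P)` stands in for one complete HSFS run."""
    p_min, p_max = 1, len(features)
    best = set(features)                           # B* starts as A
    while p_min < p_max:
        p = (p_min + p_max) // 2
        subset = hsfs_search(features, p)
        if (evaluate(subset) > evaluate(best)
                or (evaluate(subset) == evaluate(best)
                    and len(subset) < len(best))):
            best = subset                          # record the incumbent
            p_max = len(subset)                    # search smaller |P|
        else:
            p_min = max(len(subset), p_min + 1)    # force progress
    return best
```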
3.5 Experimentation and Discussion
In this section, the results of a series of experimental evaluations are reported, in
order to demonstrate the capabilities of the proposed HSFS approach. The focus
of the study lies with parameter-controlled HSFS with iterative refinement, which
embeds two of the more mature modifications previously discussed in Section 3.4.
A systematic comparison against a selection of nature-inspired global optimisation
heuristics (as reviewed in Section 2.2.2) is provided in Sections 3.5.1.1 and 3.5.1.2,
which reveals the competitive performance of HSFS. In addition to the comparative
studies against the aforementioned approaches, further experiments are carried
out in order to demonstrate the characteristics of HS. Section 3.5.2.1 reveals the
differences in results when different parameter control rules are employed. This
empirically demonstrates that the recommended set of rules as presented in Section
3.4.1 is both intuitively sound and practically effective. A comparison against the
original HS algorithm is provided in Section 3.5.2.2, in order to show the effect of
the proposed enhancements.
The main part of the experimentation is carried out using three filter evaluators:
CFS [93], PCFS [52], and FRFS [126], which differ in terms of computational
complexity and characteristics. For instance, CFS is the most lightweight method. It
addresses the problem of FS through a correlation-based approach, and identifies
features that are highly correlated with the class, yet uncorrelated with each other
[93]. PCFS is an FS approach that attempts to identify a group of features that
are inconsistent, and removes irrelevant features in the process [52]. FRFS [126],
similar to most rough set-based methods, exploits fuzzy-rough set notions such as
the lower and upper approximations of a given concept, and is able to identify
very compact subsets of features that can fully discern the training objects into
their respective classes. Note that FRFS is relatively high in terms of computational
complexity, and finding the minimal-sized solution of full discernibility (a minimal
fuzzy-rough reduct) remains a significant research challenge. Further detail regarding the
employed evaluators may be found in Section 2.1.1.2.
In total, 12 real-valued UCI benchmark data sets [78] are used, in order to demonstrate
the capabilities of HSFS and of the nature-inspired FS approaches reviewed
in Section 2.2.2. Several data sets are of high dimensionality and hence present
reasonable challenges for FS. Lower dimensional problems (e.g., cleve and heart)
are also included, to examine whether the tested algorithms can identify the best feature
subsets. Table 3.5 provides a summary of these data sets. In order to ensure
convergence for the more complex data sets, a large number of iterations is uniformly
chosen. The remaining parameters, such as the population size, are also configured
to be comparable with each other.
The classification algorithms adopted in the experiments include two commonly
employed techniques: 1) the tree-based C4.5 algorithm [264] which uses entropy
to identify the most informative feature at each level, in order to split the training
samples according to their respective classes; and 2) the probabilistic Bayesian
classifier with naïve independence assumptions (NB) [132]. C4.5 is optimistic by
nature; it is an unstable classifier, i.e., it may over- or under-perform for specific
training folds, but it is unbiased towards any class or object, whilst NB is pessimistic
(stable but biased). Obtaining the contrasting views of two different classifiers helps
to provide a more comprehensive understanding of the quality of the selected
feature subsets.
Table 3.5: Data set information
Data set   Features   Instances   Classes   C4.5    NB

arrhy      279        452         16        65.97   61.40
cleve      14         297         5         51.89   55.36
handw      256        1593        10        75.74   86.21
heart      14         270         2         77.56   84.00
ionos      35         230         2         86.22   83.57
libra      91         360         15        68.24   63.63
multi      650        2000        10        94.54   95.30
ozone      73         2534        2         92.70   67.66
secom      591        1567        2         89.56   30.04
sonar      60         208         2         73.59   67.85
water      39         390         3         81.08   85.40
wavef      41         699         2         75.49   79.99
Stratified 10-fold cross-validation (10-FCV) is employed for data validation, where
a given data set is partitioned into 10 subsets. Of these 10 subsets, nine are used to
form one training fold. The FS methods are employed to identify quality subsets,
which are then used to build classification models. A single subset is retained as the
testing data, so that the built classifiers can be compared using the same unseen
data. This process is then repeated 10 times (the number of folds). The advantage
of 10-FCV over random sub-sampling is that all objects are used for both training
and testing, and each object is tested only once per fold.
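The stratified partitioning underlying this scheme can be sketched as follows (a minimal illustration of the idea, not the thesis’s implementation):

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign object indices to k folds so that every class label is
    spread as evenly as possible across the folds (stratified k-FCV)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    counter = 0
    for indices in by_class.values():
        for idx in indices:            # deal each class out round-robin
            folds[counter % k].append(idx)
            counter += 1
    return folds
```

Each of the k folds then serves once as the test set, with the remaining k − 1 folds forming the training data.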
The stratification of the data prior to its division into different folds ensures that
each class label has equal representation in all folds, thereby helping to alleviate
bias/variance problems [18]. In the experiments, unless stated otherwise, 10-FCV is
executed 10 times (10×10-FCV) in order to reveal the impact of the stochastic nature
of the approaches employed. The differences in performance of the various methods
are statistically compared using paired t-test with two-tailed P = 0.01. Note that
since 10×10-FCV is imposed, each of the figures displayed in the following tables is
an averaged result of 100 search outputs (per data set, per algorithm). The searches
are carried out using the same folds of the data each time, so that their results (and
the final averaged figures) are directly comparable.
3.5.1 Evaluation of HSFS
Following the evaluation procedures described previously, the evaluation scores and
sizes of the feature subsets identified by HSFS and the 9 other nature-inspired FS
algorithms are detailed in Section 3.5.1.1. These feature subsets are further used
to train classifier learners, in order to further reveal the characteristics of the tested
algorithms. Since the three evaluators employed to judge subset quality are
all filter-based approaches, it should be noted that the predictive accuracies of the
subsequent classifiers are not part of the goals to be optimised by the respective
search methods.
3.5.1.1 FS Results
Tables 3.6 to 3.8 detail the results collected with three different subset evaluators and
all of the reviewed search algorithms, including HSFS. As the time complexity of a
subset evaluation using CFS is very low, the maximum number of search iterations/generations
for this set of experiments is set to a very large value (gmax = 50,000), in order to
allow all algorithms to fully converge. Based on the figures shown in Table 3.6,
GA, HS, and MA deliver very similar results, and work well for lower dimensional
data sets such as cleve, heart, ionos, and water. Algorithms such as SA and TS
demonstrate very good performance for the most complex data sets: arrhy, multi,
and secom. ABC, ACO, and FF are not particularly competitive in identifying feature
subsets with the highest evaluation scores. However, they are relatively good at
producing very compact feature subsets, with acceptable evaluation quality.
For results obtained with PCFS given in Table 3.7, TS fails to find subsets with
the best evaluation scores for most of the data sets, which forms a sharp contrast to
Table 3.6: FS results using CFS, showing both the evaluation scores (left) and the sizes (right) of the selected feature subsets. Bold figures indicate the highest evaluation scores; shaded cells signify the overall most compact best solutions

         ABC           ACO           CSA           FF            GA
arrhy    0.349  10.5   0.369  15.5   0.467  24.6   0.394  27.7   0.263  61.4
cleve    0.271   5.4   0.239   6.4   0.271   5.4   0.269   5.0   0.274   6.7
handw    0.435  92.0   0.444  31.5   0.527  78.0   0.484  58     0.513  97.9
heart    0.337   5.3   0.313   6.3   0.337   5.3   0.334   5.0   0.338   6.4
ionos    0.523   9.6   0.511   9.5   0.538   9.4   0.528   9.2   0.539  10.4
libra    0.577  23.5   0.570  15.6   0.610  23.0   0.589  21.7   0.611  31.4
multi    0.824  235    0.802  26.9   0.923  106    0.836  88.4   0.879  258
ozone    0.100  14.0   0.102  10.7   0.112  14.0   0.106  13.0   0.114  21.9
secom    0.024   2.9   0.085  14.2   0.101  14.5   0.045  15.7   0.008  97.0
sonar    0.330  11.4   0.316  17.7   0.360  16.7   0.339  12.5   0.360  17.7
water    0.417   8.2   0.387  10.3   0.426  10.4   0.416   7.5   0.426  10.5
wavef    0.366  11.3   0.361  14.5   0.383  13.1   0.364  10.5   0.384  14.9

         HS            MA            PSO           SA            TS
arrhy    0.466  26.8   0.275  58.4   0.280  16.4   0.466  23.5   0.467  25.1
cleve    0.274   6.7   0.274   6.7   0.274   6.7   0.266   4.4   0.274   6.6
handw    0.525  93.7   0.467  120    0.473  122    0.526  68.2   0.527  82.3
heart    0.338   6.4   0.338   6.4   0.338   6.4   0.333   4.7   0.335   5.3
ionos    0.539  10.3   0.539  10.4   0.535  10.9   0.536   8.4   0.537   9.3
libra    0.611  28.7   0.607  33.7   0.595  26.6   0.607  19.1   0.611  25.0
multi    0.919  141    0.836  318    0.849  357    0.926  76.0   0.926  80.0
ozone    0.114  21.4   0.113  23.3   0.110  25.3   0.107   9.5   0.114  19.8
secom    0.064  22.5   0.025  95.6   0.018  20.8   0.101  14.1   0.100  14.3
sonar    0.360  17.7   0.360  17.6   0.329  11.9   0.358  15.4   0.359  16.6
water    0.426  10.5   0.426  10.5   0.417   8.9   0.423   8.4   0.424   9.2
wavef    0.384  14.9   0.384  14.9   0.372  12.4   0.382  12.6   0.384  14.9
its strong performance in the previous set of experiments. However, it still identifies
the best solutions for multi and wavef, which are two of the higher dimensional
problems. CSA, GA, and HS demonstrate their capabilities in finding good quality
and compact feature subsets for seven of the 12 data sets. GA is the only algorithm
that identifies the overall best solutions for the ozone and secom data sets. Note
that the search outputs from these algorithms can differ significantly. Taking the
secom data set as an example, the overall best evaluation score is 0.990 (achieved by
CSA, GA, and PSO) with around 200 features, while ACO, HS, and TS yield subsets
with average sizes of only 2.5, 24, and 97, respectively. The evaluation scores of
these solutions are also significantly lower in comparison, indicating that
they are likely only locally optimal solutions. Similar observations are also
reflected in the results for the arrhy data set.
Table 3.7: FS results using PCFS, showing both the evaluation scores (left) and the sizes (right) of the selected feature subsets. Bold figures indicate the highest evaluation scores; shaded cells signify the overall most compact best solutions

         ABC           ACO           CSA           FF            GA
arrhy    0.987  121    0.802   8.8   0.988  107    0.977  38.7   0.989  45.6
cleve    0.781   8.0   0.775   8.8   0.781   7.9   0.715   6.4   0.781   7.9
handw    1.000  27.1   1.000  26.9   1.000  22.0   1.000  23.1   1.000  41.2
heart    0.961   9.5   0.955  10.2   0.961   9.5   0.915   7.2   0.961   9.5
ionos    0.996   9.8   0.993  10.0   0.996   7.0   0.996   8.7   0.996  10.0
libra    0.971  35.2   0.935  19.0   0.972  17.2   0.968  24.5   0.972  18.2
multi    1.000  14.6   1.000  10.6   1.000  13.1   1.000  19.5   1.000  44.9
ozone    0.997  23.0   0.969  14.2   0.999  16.6   0.996  20.1   1.000  21.0
secom    0.986  211    0.936   2.5   0.989  213    0.974  150    0.990  198
sonar    0.993  24.8   0.946  11.6   0.993  11.7   0.989  15.8   0.993  12.4
water    0.994  15.8   0.975  10.8   0.995   9.8   0.990  10.5   0.995  10.0
wavef    0.999  12.4   0.999  11.4   0.999   9.7   0.999  10.9   1.000  11.5

         HS            MA            PSO           SA            TS
arrhy    0.989  29.1   0.989  114    0.989  111    0.989  29.1   0.983  21.6
cleve    0.781   7.9   0.781   7.9   0.781   7.9   0.781   8.0   0.738   6.7
handw    1.000  70.2   1.000  40.5   1.000  24.0   1.000  83.4   0.999  18.1
heart    0.961   9.5   0.961   9.5   0.961   9.5   0.922   8.7   0.947   8.3
ionos    0.996   6.8   0.996  10.1   0.996   8.1   0.989  15.6   0.991   6.4
libra    0.972  16.4   0.972  33.3   0.972  28.4   0.961  41.6   0.967  16.0
multi    1.000   9.1   1.000  43.8   1.000  13.4   1.000  325    1.000   6.1
ozone    0.999  18.4   1.000  31.8   1.000  26.0   0.994  34.5   0.999  19.1
secom    0.988  23.7   0.988  314    0.990  256    0.971  294    0.979  97.3
sonar    0.991  11.8   0.993  23.7   0.993  17.6   0.924  30.3   0.985  11.6
water    0.995  10.2   0.995  13.8   0.995  12.2   0.927  18.9   0.990   9.3
wavef    1.000  11.1   1.000  14.9   1.000  12.9   0.979  20.1   1.000  10.6
FRFS is a computationally intensive evaluator that requires a considerable amount
of time when the number of training objects is very large. Because of the underlying
properties of fuzzy discernibility matrices [126, 172], it is easy to find feature subsets
with almost full dependency scores (for the commonly adopted fuzzy t-norms and
fuzzy implicators). However, the search for the most compact solutions (fuzzy-rough
reducts) is very challenging. These characteristics of FRFS help to compare the size
reduction capabilities of the reviewed methods. Note that the evaluation scores are
not compared statistically, since subsets with full dependency score can be readily
identified. According to the results shown in Table 3.8, HS performs very well in this
set of experiments, mainly due to the fact that it is tailored to solving FRFS problems
in the first place [60], and that it also embeds mechanisms to actively refine the sizes
of the feature subsets during the search. GA, MA, and PSO also deliver competitive
performance. Although TS obtains the best results for six of the 12 data sets, it fails to
optimise the FRFS dependency scores for multi and secom, producing sub-optimal
solutions.
Table 3.8: FS results using FRFS, showing both the evaluation scores (left) and the sizes (right) of the selected feature subsets. Shaded cells signify the overall most compact best solutions
ABC ACO CSA FF GA
arrhy    1.000  40.4   1.000  23.6   1.000  29.0   1.000  29.2   1.000  63.4
cleve    0.929  13.0   0.929  13.0   0.929  13.0   0.854  10.9   0.999  12.9
handw    1.000  26.6   1.000  27.5   1.000  22.0   1.000  22.9   1.000  40.4
heart    0.959  12.8   0.960  13.0   0.959  12.8   0.909  10.7   1.000  10.1
ionos    0.993  15.3   0.994  16.6   0.993  14.0   0.992  15.1   1.000  25.7
libra    0.997  19.7   0.997  21.3   0.998  18.7   0.997  19.5   1.000  29.4
multi    1.000  20.3   1.000  17.6   1.000  18.9   1.000  27.3   1.000  49.8
ozone    0.975  36.5   0.962  35.7   0.973  32.9   0.976  34.5   0.982  48.5
secom    1.000  35.2   1.000  20.5   1.000  37.0   1.000  26.6   1.000  67.5
sonar    1.000  13.3   1.000  14.5   1.000  12.9   1.000  13.5   1.000  17.3
water    0.998  18.0   0.998  19.0   0.998  15.9   0.997  17.8   1.000  20.8
wavef    1.000  17.0   1.000  18.0   0.999  16.5   1.000  17.0   1.000  19.2
HS MA PSO SA TS
arrhy    1.000  25.1   1.000  63.1   1.000  34.5   1.000 108     1.000  24.9
cleve    0.999  12.9   0.999  12.9   0.999  12.9   0.929  11.2   0.989  11.9
handw    0.999  22.5   1.000  40.0   1.000  23.8   1.000 129     0.999  22.0
heart    1.000  10.0   1.000  10.1   1.000  10.3   0.989  10.6   0.959   9.1
ionos    1.000  25.7   1.000  26.1   1.000  26.5   0.991  14.9   1.000  25.9
libra    1.000  20.7   1.000  29.9   1.000  23.2   0.999  26.2   0.999  20.4
multi    1.000  15.3   1.000  41.8   1.000  21.0   1.000 323     0.562   6.0
ozone    0.924  38.5   0.982  48.8   0.982  51.9   0.919  36.1   0.979  33.7
secom    1.000  15.8   1.000  67.1   1.000  31.7   1.000 296     0.803   7.0
sonar    1.000  13.0   1.000  16.9   1.000  14.1   1.000  17.8   0.998  11.7
water    1.000  19.8   1.000  22.0   1.000  22.1   0.981  19.2   1.000  19.9
wavef    1.000  18.4   1.000  19.4   1.000  19.2   0.996  20.5   1.000  17.4
3.5.1.2 Classification Performance
Tables 3.9 to 3.11 show the accuracies of the classification models, trained using the
same cross-validation folds as those used to perform FS. The quality of the underlying
subsets has already been discussed in the previous subsection; the accuracies achieved
on the full (unreduced) data sets are given in Table 3.5.
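To make this reuse concrete, a fold-index generator of the following kind can be shared between the FS stage and the classifier training stage (an illustrative stdlib sketch; the helper name and seed are assumptions, not part of the thesis experiments):

```python
import random

def cv_folds(n_samples, k=10, seed=1):
    """Partition sample indices into k disjoint folds, created once so that
    feature selection and the subsequent classifier training can reuse the
    exact same cross-validation splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = cv_folds(100, k=10)
held_out = folds[0]                        # test fold for one CV round
train = [i for f in folds[1:] for i in f]  # the remaining nine folds
```

Fixing the folds up front, rather than re-splitting per stage, is what makes the later per-fold paired comparisons between algorithms meaningful.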
For features selected by CFS, as shown in Table 3.9, the worst solutions found by
ABC (with an averaged score of 0.024 and size of 2.9) actually result in the best
classification performance for both tested classifiers on the secom data set. This
shows that, for filter-based evaluators, a solution that achieves the highest evaluation
score does not necessarily guarantee the best classification model subsequently
learnt using such features, since these subsets are selected independently of the end
classifier learners. However, in general, there is a reasonable correlation between
subset quality (as judged by the CFS evaluator) and classification accuracy. Feature
subsets selected according to CFS also build slightly more accurate models than those
constructed based on PCFS and FRFS, and a greater number of algorithms are able
to find better performing solutions.
Table 3.10 reports the results collected using the set of feature subsets selected
via PCFS, where CSA and TS seem to lead to the best classification performance overall.
Each of the remaining algorithms also finds the best results for one or more data
sets. For the secom data set, all of the reviewed algorithms, apart from ACO, select
features that do not contribute to good NB classifier models. Several models result
in an averaged classification accuracy lower than 40%. A closer investigation reveals
that, in fact, local best solutions have been selected by these algorithms for a number
of cross-validation folds, which has a large, negative impact on the final 10-FCV
results.
For classifiers built using feature subsets selected by FRFS as demonstrated in
Table 3.11, algorithms such as ACO, MA, SA, and TS all perform reasonably well.
Both tested classifiers also tend to agree more often (than the two previous sets of
experiments) in terms of predictive accuracy. The performance of a given classifier,
and the evaluation score of its underlying feature subset are also well correlated
for FRFS. Note that since FRFS generally helps to find more compact subsets, the
resultant classification accuracies are also slightly lower when compared to those
obtained by CFS or PCFS.
Table 3.9: C4.5 (left) and NB (right) classification accuracies using the feature subsets found with the respective search algorithms via CFS. Bold figures indicate best classification accuracy (per classifier); shaded cells signify higher accuracies are achieved for both examined classifiers
ABC ACO CSA FF GA
arrhy    63.0  63.2   66.8  67.0   67.1  68.5   66.8  67.0   66.8  66.8
cleve    55.8  56.3   55.3  56.5   55.8  56.3   55.8  56.6   56.4  56.9
handw    75.0  83.7   70.2  72.9   75.9  84.7   74.3  82.0   75.5  85.2
heart    81.7  83.0   81.2  84.1   81.7  83.0   82.2  83.0   80.6  85.0
ionos    88.2  85.6   87.6  86.3   87.8  86.8   88.0  86.1   88.4  86.9
libra    65.6  60.7   62.1  57.3   67.0  61.6   65.5  61.0   67.6  62.1
multi    94.3  95.7   93.4  95.2   94.9  97.1   92.8  95.7   94.6  96.3
ozone    93.4  75.8   93.4  77.2   93.1  74.8   93.3  76.1   93.3  73.9
secom    93.4  88.7   92.5  82.6   92.5  84.2   92.7  75.1   90.7  84.1
sonar    72.3  66.6   73.2  66.3   73.0  66.6   73.4  65.9   73.1  66.6
water    81.9  84.9   82.7  85.8   82.8  85.9   82.3  85.1   83.4  85.9
wavef    76.9  79.8   77.4  80.5   77.6  80.7   76.8  79.5   77.5  80.2
HS MA PSO SA TS
arrhy    66.9  68.9   66.9  67.4   63.5  63.3   67.4  69.0   67.2  69.0
cleve    56.4  56.9   56.4  56.9   56.4  56.9   55.0  55.7   56.3  57.0
handw    75.9  85.3   75.4  85.5   75.3  85.3   76.0  83.8   76.1  84.9
heart    80.6  85.0   80.6  85.0   80.7  85.0   82.1  82.3   81.7  82.9
ionos    88.3  86.9   88.3  86.9   88.0  86.4   87.4  87.0   87.3  87.0
libra    67.3  61.6   68.2  61.8   66.8  61.4   65.9  61.4   66.9  61.3
multi    94.9  96.8   94.6  95.8   94.7  95.9   95.1  97.2   94.9  97.2
ozone    93.2  73.9   93.4  73.7   93.3  73.5   93.4  78.4   93.1  74.0
secom    92.1  71.2   91.2  88.7   92.7  74.2   92.5  84.7   92.4  83.7
sonar    73.2  66.6   73.3  66.5   72.8  66.3   72.3  66.6   74.1  66.9
water    83.3  85.9   83.3  85.9   82.1  85.2   83.0  85.6   82.9  85.6
wavef    77.5  80.2   77.5  80.2   76.9  79.7   77.6  80.9   77.5  80.5
3.5.1.3 Discovery of Multiple Quality Feature Subsets
The stochastic nature of global optimisation techniques such as HS allows the discov-
ery of multiple different solutions for the same set of training samples. Table 3.12
details the differences in the discovered reducts between HS-IR and HC using the
data set arrhy as an example. These subsets are recorded during the experiment,
where the same cross-validation fold is used by both methods. Note that similar
properties have also been observed for other data sets, although the use of
arrhy allows such properties to be revealed more clearly. In general, different runs may
Table 3.10: C4.5 (left) and NB (right) classification accuracies using the feature subsets found with the respective search algorithms via PCFS. Bold figures indicate best classification accuracy (per classifier); shaded cells signify higher accuracies are achieved for both examined classifiers
ABC ACO CSA FF GA
arrhy    66.4  63.3   62.6  61.5   66.3  63.5   66.7  66.2   66.2  65.9
cleve    54.5  55.9   54.0  56.0   54.5  55.9   56.4  55.8   54.5  55.9
handw    65.2  68.4   65.6  69.1   63.2  66.2   63.8  66.7   68.1  73.3
heart    78.5  84.4   78.6  84.2   78.5  84.4   80.2  83.0   78.5  84.4
ionos    85.8  79.6   86.0  80.5   85.7  78.8   86.0  79.9   85.4  79.0
libra    66.0  61.9   64.5  58.7   64.4  60.2   65.9  61.8   64.1  60.4
multi    79.5  82.1   83.8  86.5   78.7  81.0   80.2  82.7   86.0  89.5
ozone    93.0  72.5   93.1  75.7   93.3  76.3   93.1  73.7   93.1  73.4
secom    90.4  36.2   93.3  90.9   90.7  37.8   91.4  49.6   90.5  37.0
sonar    73.5  67.1   73.2  66.5   74.1  66.3   73.9  66.7   72.4  66.6
water    81.5  85.7   81.4  84.9   81.1  86.3   81.5  85.9   81.7  85.9
wavef    74.6  78.5   75.3  80.0   73.7  78.9   75.0  78.6   75.0  79.1
HS MA PSO SA TS
arrhy    66.2  66.8   66.3  64.0   66.1  63.9   66.2  67.5   66.7  68.2
cleve    54.5  55.9   54.5  55.8   54.4  55.9   54.5  55.8   55.2  56.0
handw    69.9  75.9   67.9  72.8   63.7  66.6   70.2  76.8   71.1  73.5
heart    78.5  84.4   78.5  84.4   78.5  84.4   79.6  83.1   79.5  84.5
ionos    85.7  80.1   85.3  80.1   85.5  78.6   85.5  81.0   87.4  80.3
libra    65.4  61.5   65.1  61.6   65.3  61.6   66.1  61.5   63.7  59.9
multi    81.7  84.3   85.5  89.1   78.5  81.0   93.4  94.9   89.8  91.4
ozone    93.2  75.6   93.1  70.6   93    71.8   92.9  69.4   93.2  75.2
secom    92.5  70.9   90.4  32.3   90.2  35.0   89.9  31.9   92.1  67.6
sonar    73.5  66.6   73.0  66.3   73.7  66.6   72.2  66.9   73.8  67.3
water    81.4  86.2   81.7  86.1   81.7  86.3   80.2  83.2   81.9  86.1
wavef    74.7  79.5   74.3  78.8   74.8  79.1   71.6  75.8   75.8  80.2
converge to the same solution, possibly due to the limited number of best solutions
that can be inferred from data sets of lower dimensionality.
For 10 runs of HS-IR, 10 different reducts of an average size 7 are selected (again,
all reaching the full dependency measure of 1), while HC results in a single subset
of size 7. The ability to produce multiple quality subsets from the same training
data may greatly benefit multi-view learning techniques such as classifier ensemble
[61, 266], where the subsets may be used to generate partitions of the training data
in order to build diverse classification models.
Table 3.11: C4.5 (left) and NB (right) classification accuracies using the feature subsets found with the respective search algorithms via FRFS. Bold figures indicate best classification accuracy (per classifier); shaded cells signify higher accuracies are achieved for both examined classifiers
ABC ACO CSA FF GA
arrhy    57.8  58.5   57.1  58.5   58.7  61.8   52.3  57.3   58.0  57.2
cleve    50.4  55.1   50.4  55.1   50.4  55.1   49.9  54.1   49.7  55.1
handw    65.6  68.2   66.7  69.0   62.0  65.1   63.5  66.1   68.3  71.2
heart    77.8  82.6   77.8  83.0   77.8  82.6   78.9  83.0   78.0  82.4
ionos    88.5  83.3   89.8  85.0   89.1  83.9   88.3  83.7   88.7  83.9
libra    63.6  56.1   60.8  58.5   59.9  54.6   63.1  58.6   62.9  58.8
multi    80.6  84.4   83.1  88.3   82.6  86.4   81.8  86.6   87.0  91.1
ozone    93.3  69.9   93.0  67.6   92.7  70.3   93.0  69.4   93.2  67.0
secom    92.7  71.6   92.9  91.6   92.7  77.4   93.2  69.4   91.6  50.4
sonar    75.3  69.2   76.0  73.9   76.2  71.1   71.6  71.5   71.7  67.9
water    80.8  83.9   80.0  83.9   78.7  85.0   79.7  83.9   79.7  83.9
wavef    71.1  74.8   73.8  78.5   71.0  74.9   70.4  75.8   70.4  75.7
HS MA PSO SA TS
arrhy    60.6  60.6   63.0  57.2   56.6  58.2   62.1  57.0   59.5  61.5
cleve    49.7  55.1   49.7  55.1   49.7  55.1   55.8  56.0   51.8  53.8
handw    64.3  67.0   67.4  72.0   62.2  64.1   73.9  83.3   65.7  67.9
heart    78.9  82.4   78.5  82.4   78.3  81.7   79.3  82.2   78.2  80.7
ionos    88.5  83.9   88.5  83.7   88.5  83.9   85.7  82.6   86.5  83.9
libra    63.1  53.6   66.0  60.3   64.4  57.2   62.1  58.8   63.3  58.9
multi    84.1  88.6   85.8  90.2   83.2  86.7   93.8  95.2   84.8  86.0
ozone    93.1  69.7   93.2  68.3   93.1  67.3   92.8  68.5   93.5  69.6
secom    93.1  93.1   91.4  43.9   92.7  74.6   90.2  31.3   93.4  93.4
sonar    76.7  71.2   70.0  67.4   70.7  65.4   70.8  68.4   76.4  76.4
water    80.6  84.4   81.4  85.0   81.3  83.9   79.6  84.2   79.7  84.6
wavef    72.9  75.7   71.6  75.6   73.0  76.5   71.8  75.5   72.6  75.9
3.5.2 Evaluation of Additional Improvements
Additional experimentation has been conducted in this section in order to demonstrate
the effectiveness of the proposed improvements described in Section 3.4. Note
that the effect of the parameter control rules has also been evaluated under
conventional settings, for improving the solution quality of numerical optimisation
problems. Those results are omitted here owing to their lack of relevance to the topic
at hand; refer to [59] for more detailed results and discussion.
Table 3.12: Comparison of multiple HS-IR reducts versus the single HC reduct for the arrhy data set; all subsets are of size 7 and evaluation score 1
         Feature indices
HSFS      3    5    9   80  242  246  275
          3   14  104  113  169  181  271
         14   20   64  209  238  242  251
          5    6    7  113  188  231  267
          3    6   76  161  246  252  265
          0    5    6   10  209  257  262
         14  191  208  218  241  259  275
         40  161  186  209  225  228  255
          6   80  128  161  208  239  271
          8   14  208  216  225  241  246
HC        0    3    6  168  169  217  251
3.5.2.1 Comparison of Parameter Control Rules
The parameter control rules discussed in Section 3.4.1 are examined here, with results
compared against other possible approaches. The previously employed data sets are
once again adopted, with either CFS or FRFS acting as the subset evaluator. For the
less complex data sets, all algorithms lead to identical results and therefore those
particular results are omitted. The following discussions focus on the remainder of the
results. Here, the arrows illustrate how the parameters are adjusted over iterations,
e.g. |H| ↓ means that |H| decreases from its maximum to its minimum value as the search
progresses; δ → means that δ is static throughout; and |H| ↑ δ ↑ indicates the
cases where both parameters increase over time, and hence the recommended rules.
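Such a linear adjustment rule can be sketched as follows (an assumed illustration only; the function name is hypothetical, and the 10-20 and 0.5-1 ranges simply mirror the settings of Table 3.15):

```python
def linear_schedule(g, g_max, lo, hi, increasing=True):
    """Linearly adjust a HS parameter over iterations g = 0..g_max.
    increasing=True moves the value from lo up to hi as the search
    progresses; increasing=False gives the decreasing variant."""
    frac = g / g_max
    return lo + (hi - lo) * (frac if increasing else 1.0 - frac)

# recommended rules: both |H| and delta increase during the search
h_size = round(linear_schedule(500, 1000, 10, 20))   # |H| halfway: 15
delta = linear_schedule(1000, 1000, 0.5, 1.0)        # delta at the end: 1.0
```

A static parameter corresponds simply to calling the search with a constant value instead of this schedule.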
Table 3.13 shows the subset size and evaluation score obtained using the CFS
evaluator. The first three rows show different |H| adjustment functions with a
static δ. The effect of an increasing |H| can be identified by the overall superior
evaluation scores; the subset sizes are not differentiated by a large amount, but a
decreasing |H| generally leads to larger feature subsets. Rows 3 to 5 show
the comparison of different δ functions with a static |H|. Here, the main difference in
results is feature subset size, where an increasing δ helps to discover smaller subsets,
and the evaluation score is also generally higher. Comparison of combined rule sets
shows that the final results are better when both parameters are increasing during
the search.
Table 3.13: Comparison of parameter control rules using CFS, averaged subset size rounded to the nearest integer and evaluation score by 10× 10-FCV; the shaded row indicates the suggested rules
ionos olito sonar water2 water
Mode Size Score Size Score Size Score Size Score Size Score
|H|↓ δ→    17  0.5020   15  0.5667   29  0.0907   16  0.1648   17  0.2545
|H|↑ δ→    15  0.5101   16  0.5676   28  0.1111   15  0.2638   16  0.3690
|H|→ δ→    16  0.5107   15  0.5675   27  0.1083   15  0.2441   16  0.3258
|H|→ δ↑    12  0.5155   16  0.5676   21  0.2978   11  0.3405   13  0.3940
|H|→ δ↓    15  0.5087   16  0.5676   26  0.1300   14  0.1944   16  0.2758
|H|↑ δ↑    12  0.5173   16  0.5677   20  0.2989   12  0.3409   13  0.4079   (suggested)
|H|↑ δ↓    15  0.5103   15  0.5665   27  0.1227   15  0.1819   15  0.2848
|H|↓ δ↓    17  0.5032   15  0.5673   30  0.0884   16  0.1735   17  0.2668
|H|↓ δ↑    16  0.5075   16  0.5676   28  0.1020   16  0.1887   16  0.2841
The same conclusion can be reached by studying the results obtained using FRFS
as the subset evaluator, as shown in Table 3.14. All HS variations have achieved the
full fuzzy-rough dependency measure for the discovered subsets; the difference in
performance is therefore reflected purely by the size reduction. HS with increasing |H|
and δ finds the most compact subsets overall, while HS with a static |H| and increasing
δ achieves a close second place with a minor increase in subset size, once again
demonstrating that δ adjustment plays a key role in the size reduction of feature
subsets.
Table 3.14: Comparison of parameter control rules using FRFS, averaged subset size rounded to the nearest integer across 10× 10-FCV; the shaded row indicates the suggested rules
Mode ionos olito sonar water2 water
|H|↓ δ→    14    9   29   16   16
|H|↑ δ→    12    7   25   14   14
|H|→ δ→    13    7   26   14   14
|H|→ δ↑    10    6   20    9   10
|H|→ δ↓    11    7   22   12   12
|H|↑ δ↑    10    6   18    9    9   (suggested)
|H|↑ δ↓    12    7   22   13   13
|H|↓ δ↓    14   10   28   17   16
|H|↓ δ↑    13    8   27   14   14
3.5.2.2 Effect of Parameter Control and Iterative Refinement
The effects of the proposed improvements are demonstrated here via a comparison with
the original HS algorithm. The parameter settings employed by the HS-based methods in
this experiment are given in Table 3.15. Thanks to the performance increase brought
about by parameter control and iterative refinement, the improved HS algorithm
no longer requires as many iterations as the original to achieve good results. The
maximum number of iterations used by HS-IR is therefore reduced to half of the
original amount, resulting in significant savings in run-time.
Table 3.15: Parameter settings for the demonstration of parameter control and iterative refinement
Algorithm Parameter Value
HS Original (HS-O)                                            |H| 20      gmax 2000   δ 0.8     |P| 10
HS with Parameter Control (HS-PC)                             |H| 10-20   gmax 1500   δ 0.5-1   |P| 10
HS with Parameter Control and Iterative Refinement (HS-IR)    |H| 10-20   gmax 1000   δ 0.5-1
Table 3.16 details the results obtained, showing both the subset size and the
evaluation score. The columns labelled HS-O contain the subset sizes discovered by
the original algorithm, and those labelled HS-PC show the results of using parameter-controlled
HS; HS-IR, which iteratively refines HS, is also included for comparison. For
the purpose of maintaining consistency of evaluation, the selection process employs
the same cross-validation folds as used in the previous subsections. The paired t-test
is again employed to compare the differences between HS-PC and HS-O, and HS-IR
against HS-PC. In all cases except one, the enhancements offer statistically significant
improvements in terms of subset size reduction and evaluation optimisation. For the
data set secom, whilst HS-PC did not increase the evaluation score when compared
to HS-O, it did reduce the average subset size.
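For reference, the paired t statistic used in such comparisons can be computed over matched per-fold results as below (a generic sketch; judging significance would additionally require the t distribution with n − 1 degrees of freedom):

```python
from math import sqrt

def paired_t(xs, ys):
    """Paired t statistic for matched result lists, e.g. per-fold accuracies
    of two FS variants evaluated on the same cross-validation folds; a
    positive value favours xs. Assumes the differences are not all equal
    (otherwise the sample variance is zero)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / sqrt(var / n)

t = paired_t([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])  # 2 * sqrt(3), about 3.46
```

Pairing on identical folds removes the fold-to-fold variance that would otherwise dominate an unpaired comparison.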
It can be seen from these results that the proposed improvements have greater
effect under more complex situations, such as those involving subsets with larger
Table 3.16: Comparison of proposed HS improvements using feature subsets selected by CFS, regarding averaged subset size, evaluation score, and C4.5 classification accuracy, by 10× 10-FCV; v, −, ∗ indicate statistically better, same, or worse results
Full HS-O HS-PC HS-IR
Data set   |A|   Acc.%     |B|     f(B)   Acc.%      |B|     f(B)   Acc.%  t     |B|     f(B)   Acc.%  t
ionos       35   85.62    14.04   0.533   85.30     11.46   0.539   85.57  v    10.06   0.542   85.30  *
water       39   79.74    15.24   0.386   83.13     12.3    0.419   82.82  *    10.1    0.427   82.46  -
wavef       41   76.62    17.32   0.362   77.02     15.46   0.382   77.21  v    14.9    0.384   77.23  -
sonar       61   72.62    25.34   0.132   72.62     20.54   0.317   73.52  v    17.22   0.359   72.95  *
ozone       73   92.62    38.58   0.106   93.35     29.36   0.113   93.41  -    19.9    0.114   93.28  -
libra       91   70.28    51.12   0.582   70.56     43.4    0.603   70.83  v    24.26   0.607   69.33  *
arrhy      280   65.06   161.82   0.052   67.84    117.74   0.088   67.49  -    27.36   0.441   67.27  -
secom      591   88.96   348.78   0.002   90.06    279.64   0.002   90.74  v    15.34   0.087   92.78  v
isole      618   83.42   383      0.692   83.37    356.98   0.716   83.59  -   205.71   0.723   83.02  *
multi      650   94.30   401.16   0.836   94.22    365.98   0.867   94.42  v   124.11   0.91    94.63  v
numbers of features, larger numbers of instances, or those that contain many com-
peting potential solutions. The effect of parameter control is revealed largely in
terms of better evaluation scores, while subset sizes are also reduced in the process.
However, iterative refinement greatly improves the overall solution quality, and
shows exceptional capability for reducing the size of subsets. For the secom data
set, HS-IR succeeded in reducing the solution size by over 95% without sacrificing
the evaluation score, making HS-IR a competitive algorithm in dealing with higher
dimensional FS problems.
From the differences in classification accuracy, it can be seen that of the 10
data sets, HS-PC manages improvements over HS-O for 6 cases, ties for 3 cases,
and under-performs for just one case. This demonstrates the effectiveness of using
parameter control. HS-IR obtains better feature subset evaluation scores (when
judged using CFS) and smaller subset sizes. However, classification accuracy results
indicate that more compact feature subsets (with equal or better evaluation score)
may not necessarily lead to equal or better classifiers. For example, regarding data
set arrhy, although HS-IR raised the average evaluation score from 0.088 to 0.441, and
reduced the average subset size from 117.74 to 27.36, the classifier accuracy remained
the same. Yet, the experimental results also show that, for this data set, HS-IR
equipped with the CFS evaluator removed a fair amount of redundancy, while not
affecting the end classifier performance.
3.5.3 Iterative Refinement of Fuzzy-Rough Reducts
Section 3.5.1.1 has shown that the iterative refinement technique works very well
for finding smaller fuzzy-rough reducts. The following experimental results show
graphically how an initial solution is improved upon over several refinements. Two
data sets with a relatively large number of features are used: arrhy (280 features)
and web (2557 features). The search objective is to find a fuzzy-rough reduct (with
fuzzy-rough dependency measure of 1.0) of the smallest possible size. For the web
data set, only ten different runs are performed due to its very high dimensionality.
For each data set the reduct sizes at each iteration are recorded, averaged, and
summarised in Figs. 3.6 and 3.7.
Figure 3.6: Iterative fuzzy-rough reduct refinement for the arrhy data set
The refinement procedure is completed within five iterations in six out of 10 runs
for the arrhy data set, with an averaged final reduct size of 7.17. As for the web data
set, 40% of the runs terminate within 30 refinements with the rest taking more than
33 iterations. This is one of the scenarios where an exponential adjustment to the
musician size |P| may become beneficial, such as that suggested in Algorithm 3.4.2.
Note that for smaller problems with <100 features, compact fuzzy-rough reducts are
usually found within 3 iterations. In reality, if the search is to be performed multiple
times, the efficiency can be further increased by initialising the number of feature
selectors |P| to a smaller value, which may be discovered in the first few executions
of HSFS.
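The shrink-and-restart structure of the refinement can be sketched as follows, with a toy evaluator standing in for the fuzzy-rough dependency measure and a plain random search standing in for HSFS; everything except the overall loop structure is an illustrative assumption:

```python
import random

RELEVANT = {0, 1, 2}  # assumed toy ground truth: full score needs all three

def evaluate(subset):
    """Toy stand-in for the fuzzy-rough dependency measure."""
    return len(RELEVANT & set(subset)) / len(RELEVANT)

def stochastic_search(n_features, max_size, iterations=2000, rng=None):
    """Random stand-in for one HSFS run, preferring full-score subsets of
    the smallest size; max_size caps the musician group |P|."""
    rng = rng or random.Random(0)
    best = tuple(range(n_features))  # fall-back: all features, score 1.0
    best_score = evaluate(best)
    for _ in range(iterations):
        size = rng.randint(1, max_size)
        cand = tuple(sorted(rng.sample(range(n_features), size)))
        if (evaluate(cand), -len(cand)) > (best_score, -len(best)):
            best, best_score = cand, evaluate(cand)
    return best, best_score

def iterative_refinement(n_features, rng=None):
    """Re-run the search with |P| capped below the best reduct size found,
    until no smaller full-score subset emerges."""
    rng = rng or random.Random(42)
    best, score = stochastic_search(n_features, n_features, rng=rng)
    while score == 1.0 and len(best) > 1:
        cand, cscore = stochastic_search(n_features, len(best) - 1, rng=rng)
        if cscore < 1.0 or len(cand) >= len(best):
            break  # refinement has converged
        best, score = cand, cscore
    return best

reduct = iterative_refinement(10)
```

Each refinement round only accepts a strictly smaller full-score subset, which mirrors the monotonically decreasing reduct sizes plotted in Figs. 3.6 and 3.7.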
Figure 3.7: Iterative fuzzy-rough reduct refinement for the web data set
3.5.4 Discussion of Results
HS-based approaches are themselves inexpensive in terms of computational overhead,
and are robust. This is because the algorithm comprises a very simple concept, and
its implementation is also straightforward. The run-time of the entire FS process is
mainly determined by the following two factors: the maximum number of iterations
gmax, and the efficiency of the subset evaluation method. gmax can be manually
configured according to the complexity of the data set; in the experimental evaluation,
HS converges very quickly, with a run-time similar to that of GA- and PSO-based
searches. The experiments also revealed the downside of FRFS, which, empirically,
does not scale well to larger data sets: the greater the number of instances in the
data set, the longer the time required for computing the fuzzy-rough dependency
measures.
The use of subset storage in HS offers a major advantage over that of other
techniques such as GA, as it maintains a record of the candidate feature subsets
found in previously executed iterations. All elements of the memory together
contribute to the new subset, whereas changes in genetic populations tend to result
in the destruction of previous knowledge of the problem. The harmony memory
considering rate δ also helps the search mechanism to escape from local best
solutions.
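One simplified reading of this memory-considering mechanism, with subsets encoded as bit vectors, can be sketched as follows (an illustrative assumption, not the exact HSFS improvisation step):

```python
import random

def improvise(memory, n_features, delta, rng):
    """Build one new candidate subset (as a bit vector): with probability
    delta each bit is inherited from a randomly chosen stored harmony,
    otherwise it is (re)drawn at random -- so the whole memory, rather
    than a single pair of parents, contributes to the new solution."""
    new = []
    for q in range(n_features):
        if rng.random() < delta:              # harmony memory consideration
            new.append(rng.choice(memory)[q])
        else:
            new.append(rng.randint(0, 1))     # random escape move
    return new

rng = random.Random(7)
memory = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1]]
candidate = improvise(memory, 4, delta=0.9, rng=rng)
```

With delta below 1, the occasional random bit is what lets the search leave a local best solution that dominates the memory.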
In all of the experiments, apart from relaxing the total number of generations
gmax for those conducted using CFS and PCFS, there has been no attempt to optimise
the parameters for each of the employed data sets. The same parameter settings are
used throughout for ease of comparison, regardless of the differences in complexity of
the data sets. It can be expected that the results obtained from the proposed work
with such per-data-set optimisation would be even better than those already observed.
The proposed approach offers an improved search heuristic. In general, no stochastic
search mechanism can guarantee an exhaustive search; otherwise it would not be a
heuristic in the first place. Therefore, no subset found can be theoretically proven
to be globally optimal (except those rough or fuzzy-rough reducts identified via
propositional satisfiability [127]). However, in practice, it is important to investigate
the relative strength of a given search heuristic. The systematic experimental studies
presented in this work confirm that, empirically, the quality of the subsets found by
the proposed technique generally exceeds that of those returned by the others.
3.6 Summary
In this chapter, a new FS search strategy named HSFS has been presented. It is based
on a recently developed, music-inspired, simple yet powerful meta-heuristic - HS.
Pseudocode and illustrative diagrams have been given in order to aid the explanation.
Additional improvements to HSFS have also been proposed in an attempt to address
the potential weaknesses of the original HS algorithm, and to adapt the approach for
FS problems. The resulting method offers a number of advantages over conventional
approaches, such as fast convergence, simplicity, insensitivity to initial value settings,
and efficiency in finding quality feature subsets. The suggested parameter control
rules have been designed to work with traditional optimisation problems also [59],
and are readily generalised to support a wider range of problems. The iterative
refinement mechanism works more closely with the size of the musician group |P|,
which is an additional degree of freedom introduced by HSFS.
Results for the experimental evaluation show that all algorithms reviewed in
Section 2.2 are capable of finding good quality solutions. SA and TS are particularly
powerful in optimising the evaluation scores of the CFS evaluator, and work well
with a few of the high dimensional problems. Algorithms such as CSA, GA, and HSFS offer
more balanced results for all tested subset evaluators, in terms of both evaluation
score and subset size. HSFS has demonstrated competitive FS performance for both
sets of experiments carried out using CFS and PCFS, and it particularly excels in
size reduction, producing compact fuzzy-rough reducts for most of the tested data
sets. This is largely due to the proposed parameter control rules and the iterative
refinement procedure. The selected feature subsets are verified via the use of two
classification algorithms: C4.5 and NB. The performance of the resultant models
generally supports the quality measurements of the filter-based evaluators, although
there exist cases where feature subsets with very low evaluation scores lead to the
most accurate classifiers.
In-depth analysis of these experimental findings, and how HSFS and the reviewed
algorithms may be further improved remain as topics of active research. The relevant
future directions are discussed in great detail in Section 9.2.1.1. It is without doubt
that stochastic feature subset search algorithms such as HSFS are particularly strong
in identifying distinctive, good quality feature subsets. Such feature subsets, while
substantially reducing the problem dimensionality, are also beneficial for improving
the performance of any classifiers subsequently employed. It is therefore natural
to utilise an ensemble-based learning mechanism, in order to better exploit the
advantages offered by these feature subsets.
Chapter 4
HSFS for Feature Subset Ensemble
The strengths of stochastic FS search methods such as HSFS, apart from being
able to escape from local best solutions, lie with their ability to identify
multiple feature subsets of similar quality. “Feature subset ensemble” (FSE) is an
ensemble-based approach that aims to extract information from a collection of base
FS components, producing an aggregated result from the collection. In so doing,
the performance variance of obtaining a single result from a single method can be
reduced. It is also intuitively appealing that the combination of multiple subsets may
remove (or reduce the impact of) less important features, resulting in an informative,
robust, and efficient solution.
A majority of the existing techniques that follow this idea focus on combining
feature ranking techniques (also termed criteria ensembles [238]), e.g., for the
purpose of text classification [195] and software defect prediction [259]. They work
by merging the ranking scores or exploring the rank ordering of the features returned
by individual FS methods. An implementation of FSE similar to wrapper-based FS
algorithms has also been studied [10]. Additionally, feature redundancy elimination
has been achieved using tree-based classifier ensembles [252]. Several terms similar
to FSE exist in the literature but represent a variety of different meanings, most
of which refer to classifier ensembles built upon feature subsets (e.g., [197]). One
notable example of this type of approach is the widely used Random Subspaces
technique [102].
In this chapter, a new representation termed “occurrence coefficient-based FSE”
(OC-FSE) is proposed. It works by analysing the feature occurrences within a group
of base FS algorithms, and subsequently producing a collection of feature occurrence
coefficients (OC). It is a concise notion that merges the views of the individual
components involved. Three possible implementations of the FSE concept are intro-
duced and discussed. These include: 1) building ensembles using stochastic search
techniques such as HS, 2) generating diversity by partitioning the training data, and
3) constructing ensembles by mixing various different FS algorithms.
To make better use of the information embedded within an OC-FSE, a novel
OC threshold-based classifier aggregation method is also presented. It improves
upon the existing ideas that imitate the popular majority vote scheme [249], which
is often adopted by conventional ensemble approaches to classifier learning [227].The proposed methods are flexible, allowing feature subset evaluators to be used in
conjunction with feature ranking; and more importantly, to be scalable for large-sized
ensembles and time-critical applications. The use of stochastic search-based and
data partition-based methods in an OC-FSE implementation is due to the observation
that they themselves are able to induce quality FS components from just a single FS
algorithm, thereby reducing the cost of the initial FSE configuration.
The remainder of this chapter is structured as follows. The proposed OC-FSE
approach and the accompanying aggregation technique are explained in Section
4.1, together with three alternative implementations of the FSE concept detailed
in Section 4.1.1. Illustrative flow charts are provided to aid understanding. This
section also provides a complexity analysis of the proposed approach. Section 4.2
presents the experimentation carried out on real-world problem cases [78], dedicated
to empirically identifying important characteristics of the present techniques. It
includes: an analysis of classification accuracy following the proposed approach,
tested using the three implementations (Section 4.2.1); a cross comparison between
the different implementations (Section 4.2.2); and a demonstration of how this
work may deal with a large number of base FS components (Section 4.2.3). Finally,
Section 4.3 summarises the chapter.
4.1 Occurrence Coefficient-Based Ensemble
This section presents the key notions of OC-FSE, and discusses the possible imple-
mentations that can systematically construct such FSEs, with the aid of illustrative
flow charts. The proposed OC threshold-based aggregation method is then specified,
which extracts the information embedded within an OC-FSE. It provides an efficient
alternative to the evaluation of a traditional, complete FSE built using subsets found
by the base FS components (referred to as an ordinary FSE hereafter). A brief com-
plexity analysis is also provided to point out the computational costs of the proposed
methods.
For a given ordinary FSE, assuming its underlying feature subsets are B = {Bi | i = 1, · · · , |B|},
it can be represented by a set of binary strings bB1, · · · , bB|B|, as shown in
Table 4.1. Here |B| denotes the size of the ensemble. Existing methods in the literature
generally build the subsequent classifier system using the individual subsets [102,
197], or attempt to merge them into a single subset [227, 238], which is denoted by B∗
below. OC-FSE is developed by exploiting an alternative approach to the combination
of the feature subsets. In particular, the decisions of the ensemble components
are organised in a |B| × |A| boolean decision matrix D. In this representation, a
horizontal row denotes a feature subset Bp, p = 1, · · · , |B|, and the binary cell value
dpq, q = 1, · · · , |A|, indicates whether aq ∈ Bp. The OC parameter σq of feature aq is
then defined as:
σq = ( ∑_{p=1..|B|} dpq ) / |B|        (4.1)
It first counts the number of occurrences of the features present in the ensemble,
then normalises the occurrences by the ensemble size |B|.
Table 4.1: Ordinary FSE of five feature subsets {Bi | i = 1, · · · , 5} with eight features a1, · · · , a8

       a1  a2  a3  a4  a5  a6  a7  a8
bB1     0   1   1   0   0   1   0   1
bB2     1   0   1   0   1   0   0   0
bB3     0   0   1   0   1   0   0   1
bB4     0   1   1   0   0   0   1   1
bB5     1   1   1   0   0   0   0   1
Obviously, 0 ≤ σq ≤ 1. The resultant OC indicates how frequently a particular feature aq is selected in an ordinary FSE; e.g., σ3 = 1 indicates that a3 is present in all subsets of Table 4.1. Irrelevant features naturally have an OC value of σ = 0. An FSE may now be constructed as a set of such OCs: E = {σ1, · · · , σ|A|}. The example FSE given in the table can therefore be denoted as {0.4, 0.6, 1.0, 0.0, 0.4, 0.2, 0.2, 0.8}. The size of such an FSE is defined as the sum of the OCs: |E| = ∑_{q=1..|A|} σq, with 0 ≤ |E| ≤ |A|.
4.1.1 Ensemble Construction Methods
This section introduces three methods that generate the required base pool of classi-
fiers, where stochastic search algorithms such as HSFS play an important role in the
production of distinctive feature subsets.
4.1.1.1 Single Subset Quality Evaluation Algorithm with Stochastic Search
Many of the existing nature-inspired heuristics, e.g., GA, PSO, and HS, share certain
properties, most notably the ability to generate multiple, quality solutions [62]. This
characteristic has been demonstrated previously in Section 3.5.1.3, and can be exploited
to efficiently construct FSEs. As illustrated in Fig. 4.1, an employed stochastic
algorithm searches for subsets until the targeted number of subsets |B| is satisfied.
This simple implementation requires only one evaluator and one search technique,
therefore the effort spent in configuring and training the necessary components is
minimal. However, this approach may be less desirable for data sets with fewer
features, where the number of diverse feature subsets is limited, and thus, the
resultant FSE diversity may be low.
Figure 4.1: Flow chart for single subset quality evaluation algorithm with stochastic search
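The construction of Fig. 4.1 may be sketched as follows. This is a hypothetical illustration: `random_restart_search` stands in for a stochastic search technique such as HSFS, and `toy_evaluator` for a real subset quality evaluator such as CFS; both names and the merit function are placeholders, not part of the thesis.

```python
import random

# Sketch of Fig. 4.1: a single stochastic search is re-run until the targeted
# number of subsets |B| is collected; only one evaluator and one search
# technique need to be configured.

def toy_evaluator(subset, relevant=frozenset({0, 2, 7})):
    # Hypothetical merit: reward covering "relevant" features, penalise size.
    return len(subset & relevant) - 0.1 * len(subset)

def random_restart_search(evaluator, num_features, iterations=200, rng=None):
    rng = rng or random.Random()
    best, best_score = frozenset(), float("-inf")
    for _ in range(iterations):
        candidate = frozenset(q for q in range(num_features) if rng.random() < 0.5)
        score = evaluator(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

def build_stochastic_ensemble(evaluator, num_features, ensemble_size, seed=0):
    rng = random.Random(seed)
    # Each run continues from a different random state, so the collected
    # pool B may contain distinct, quality subsets.
    return [random_restart_search(evaluator, num_features, rng=rng)
            for _ in range(ensemble_size)]

B = build_stochastic_ensemble(toy_evaluator, num_features=8, ensemble_size=5)
```

A real implementation would plug in HSFS together with an evaluator such as CFS or FRFS, mirroring the low configuration cost noted above.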
4.1.1.2 Single Subset Quality Evaluation Algorithm with Partitioned Data
An alternative approach for creating a diverse FSE is to use data partitioning, where
the training data is divided into a number of different chunks, and FS is then carried
out on each individual data partition. This is illustrated in Fig. 4.2. Here the ensemble
diversity is ensured because of the differences between the data partitions. Strategies
similar to stratified cross-validation [185] may be employed, in order to maintain
class balance, and to ensure that minority classes are sufficiently represented in each
data partition. This approach may be less effective for data sets with limited training
objects, since most FS evaluators require a sufficient number of training objects in
order to choose the most meaningful features. For such data sets, this may also
impose a constraint on the ensemble size |B| (number of data partitions).
Figure 4.2: Flow chart for single subset quality evaluation algorithm with partitioned training data
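The stratified partitioning behind Fig. 4.2 may be sketched as follows: within each class, instances are dealt to the partitions in turn, so every partition keeps roughly the original class balance. The function name `stratified_partitions` is illustrative, not taken from the thesis.

```python
from collections import defaultdict

# Sketch of stratified data partitioning (cf. Fig. 4.2): deal the instances
# of each class round-robin across the partitions, so minority classes stay
# represented in every chunk.

def stratified_partitions(labels, num_partitions):
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    parts = [[] for _ in range(num_partitions)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            parts[position % num_partitions].append(index)
    return parts

labels = ["a"] * 6 + ["b"] * 3          # "b" is the minority class
parts = stratified_partitions(labels, 3)
# Each of the 3 partitions receives two "a" instances and one "b" instance.
```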
4.1.1.3 Mixture of Subset Quality Evaluation Algorithms
The most intuitive FSE construction is perhaps to employ a number of different
subset quality evaluation algorithms. Diversity can be naturally obtained from the
differences in opinions reached by the evaluators themselves. The construction
process may be further randomised by the use of a pseudo random generator, as
illustrated in Fig. 4.3, where subset evaluators 1 to Y are the available feature subset
evaluators. This may become beneficial when the available evaluators are fewer
than the desired number of ensemble components, where certain evaluators are
expected to be used multiple times. Although many practical problems may favour
such a scheme, the overhead of integrating several methods, and the complexity of
the employed algorithms themselves may affect the overall run-time efficiency. Also,
as multiple evaluators are used simultaneously, finding optimal parameter settings
for the ensemble may become computationally challenging.
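The pseudo-random assignment of evaluators to ensemble slots may be sketched as follows. The evaluator list and seed are illustrative; with Y = 3 available evaluators and |B| = 20 desired components, some evaluators are necessarily reused.

```python
import random

# Sketch of the pseudo-random generator used in the mixture construction
# (cf. Fig. 4.3): each of the |B| ensemble slots is assigned one of the Y
# available evaluators at random.

def build_mixture_assignments(evaluators, ensemble_size, seed=42):
    rng = random.Random(seed)
    return [rng.choice(evaluators) for _ in range(ensemble_size)]

evaluators = ["CFS", "PCFS", "FRFS"]
assignments = build_mixture_assignments(evaluators, ensemble_size=20)
# With |B| > Y, at least one evaluator must appear more than once.
```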
4.1.2 Decision Aggregation
One of the most commonly used ensemble aggregation approaches is majority vote
[249], where the decision with the highest ensemble agreement is selected as the
final prediction. This method is beneficial for the situations where a single aggregated
feature subset is preferable. Following the proposed OC-based approach, a given
Figure 4.3: Flow chart for mixture of subset quality evaluation algorithms
OC threshold: α, 0 < α ≤ 1, may be adopted to control the number of features
considered for inclusion in the aggregated outcome B∗, such that: aq ∈ B∗ if σq ≥ α.
The common majority (more than half) voting method can be intuitively assimilated
by setting α = 0.5. The value α may be adjusted according to the problem at hand. In
particular, if the level of agreement is very high (which may indicate poor ensemble
diversity), a higher α value should be used in order to control the size of the resultant
subset. Alternatively, if a highly diverse FSE is expected to be obtained, there may
exist very few features where σq > 0.5; to combat this, it may be necessary to employ
a lowered α value.
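The threshold rule can be sketched directly; the function name `threshold_subset` is illustrative, and the OC values are those of Table 4.1, with features indexed from zero.

```python
# Sketch of the OC threshold rule: feature a_q enters the aggregated subset
# B* exactly when sigma_q >= alpha. With alpha = 0.5 this assimilates
# majority voting.

def threshold_subset(sigma, alpha):
    return {q for q, oc in enumerate(sigma) if oc >= alpha}

sigma = [0.4, 0.6, 1.0, 0.0, 0.4, 0.2, 0.2, 0.8]   # OCs for a1..a8 (0-indexed)
print(threshold_subset(sigma, 0.5))   # {1, 2, 7}, i.e. features a2, a3, a8
```

Raising α shrinks the aggregated subset (α = 0.9 keeps only a3 here), while lowering it admits every feature with non-zero support.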
Finding the right configuration of α can be difficult without in-depth knowledge
of the application problem at hand, while a poorly aggregated subset will bring
negative impact upon the end classifier performance. To address this issue, a multi-
layered OC threshold-based aggregation scheme can be adopted. It first produces
multiple feature subsets using different degrees of α, where the number of subsets
is set to [1/∆α], i.e., the nearest integer to 1/∆α, if the entire possible value range of α is partitioned into a number of intervals of length ∆α. It subsequently builds
classifiers that can generate class probability distributions d1, . . . di, . . . dC for the C
possible class labels. The classification outcome of OC-FSE is therefore a weighted
combination of such distributions:
( ∑_{j=1..[1/∆α]} wj d1j , . . . , ∑_{j=1..[1/∆α]} wj dij , . . . , ∑_{j=1..[1/∆α]} wj dCj )        (4.2)
The most probable class label may then be subsequently taken as the final output.
Using the previous example FSE as shown in Table 4.1, five feature subsets may be
generated using intervals∆α = 0.2, as given in Table 4.2. Suppose the classifiers built
using these feature subsets lead to the distributions shown (with 3 possible classes)
for a given test object. The final aggregation outcome is (1.9, 2.5, 2.8), assuming equal weighting (wj = 1, j = 1, . . . , 5), with class 3 being the most probable class
label. Although this example is fictional, it illustrates the possibility of alternative
class labels being derived if majority vote or simple merging is used instead. For
instance, by majority vote (α= 0.5), class 2 would be returned.
Table 4.2: An example of OC threshold-based aggregation with 3 possible classes

B∗α      Feature subset                              Distribution
B∗0.1    {a3} ∪ {a8} ∪ {a2} ∪ {a1, a5} ∪ {a6, a7}    (0.7, 0.5, 0.3)
B∗0.3    {a3} ∪ {a8} ∪ {a2} ∪ {a1, a5}               (0.6, 0.5, 0.3)
B∗0.5    {a3} ∪ {a8} ∪ {a2}                          (0.3, 0.6, 0.5)
B∗0.7    {a3} ∪ {a8}                                 (0.2, 0.4, 0.8)
B∗0.9    {a3}                                        (0.1, 0.5, 0.9)
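Under the stated assumption of equal weights, the aggregation of Eqn. (4.2) over the distributions of Table 4.2 can be reproduced as follows; `aggregate` is an illustrative name, and the distribution values come from the (fictional) table.

```python
# Sketch of the multi-layered aggregation of Eqn. (4.2), reproducing the
# worked example of Table 4.2 with equal weights.

def aggregate(distributions, weights):
    num_classes = len(distributions[0])
    return [sum(w * d[i] for w, d in zip(weights, distributions))
            for i in range(num_classes)]

# Class probability distributions for B*_0.1 .. B*_0.9 (Table 4.2).
distributions = [
    [0.7, 0.5, 0.3],
    [0.6, 0.5, 0.3],
    [0.3, 0.6, 0.5],
    [0.2, 0.4, 0.8],
    [0.1, 0.5, 0.9],
]
combined = aggregate(distributions, weights=[1] * 5)   # equal weighting
# combined ≈ [1.9, 2.5, 2.8]; class 3 (index 2) is the most probable label,
# whereas majority vote alone (the B*_0.5 row) would return class 2.
best_class = combined.index(max(combined))
```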
4.1.3 Complexity Analysis
As the ensemble procedure depends largely on the training (Ot), search (Os), and
evaluation (Oe) complexity of the employed FS components, the overall complexity
of an FSE is also relative to Ot , Os, and Oe. For a given feature evaluator, using HS as
an example, the complexity of the subset search process is Os = Oe · gmax, depending on Oe and the maximum number of iterations gmax. The total complexity of training and
obtaining the solution for a single feature selector is therefore Ot +Os in this case.
For ensembles constructed using a stochastic search method, the training com-
plexity is Ot , as only a single algorithm is involved which needs to be trained once.
The ensemble search complexity is Os · |B|, where |B| is the ensemble size. The total
complexity is therefore:
Ostochastic = Ot +Os · |B| (4.3)
For data partition-based ensembles, the evaluators need to be re-trained for every
data partition, resulting in a training complexity of Ot · |B| for these components,
whilst having the same search complexity Os · |B| as stochastic ensembles. The total
complexity is:
Odata partition = (Ot +Os) · |B| (4.4)
For ensembles generated from a mixture of algorithms, the training complexity
is based on the number of available evaluators, ∑_{i=1..Y} Oti, where Y is the number of evaluators. The search complexity is:
∑_{i=1..|B|} Osi,  where Osi = Oei · gmax for subset evaluators, and Osi = O(|A|) for feature rankers        (4.5)
and |A| is the number of features. The feature ranking approaches simply pick the
|A| best features at O(|A|) complexity, while subset-based evaluators need to perform
a search on the solution space. The final complexity of the mixture approach is
therefore:
Omixture = ∑_{i=1..Y} Oti + ∑_{i=1..|B|} Osi        (4.6)
Furthermore, O(|A| · |B|) straightforward calculations are required to compute
the ensemble output, and to convert the base FS components into the OC-FSE. The
OC threshold-based decision aggregation imposes a constant cost regardless of the
number of base components involved, as the number of trained classifiers is always
[1/∆α]. This makes the proposed method potentially favourable for large-sized FSE
systems or time critical ensemble applications. Note that Os can be further reduced
by integrating the process of ensemble construction and the search procedure itself.
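As a back-of-the-envelope illustration of Eqns. (4.3) and (4.4), assume hypothetical unit costs for training and search: the stochastic construction trains its single evaluator once, whereas data partitioning retrains it for every partition. The numbers below are illustrative, not measurements from the thesis.

```python
# Comparing the total construction costs of Eqns. (4.3) and (4.4) under
# hypothetical unit costs O_t (training) and O_s (one subset search).

def cost_stochastic(o_t, o_s, ensemble_size):
    return o_t + o_s * ensemble_size          # Eqn. (4.3): Ot + Os·|B|

def cost_data_partition(o_t, o_s, ensemble_size):
    return (o_t + o_s) * ensemble_size        # Eqn. (4.4): (Ot + Os)·|B|

o_t, o_s, B_size = 50, 10, 20
print(cost_stochastic(o_t, o_s, B_size))      # 250
print(cost_data_partition(o_t, o_s, B_size))  # 1200
```

The gap grows with the training cost Ot, which is why retraining per partition dominates when the evaluator is expensive to train.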
4.2 Experimentation and Discussion
The algorithms adopted in the experiments cover several rather different underlying
techniques, including the well-known C4.5 algorithm [264] and the naïve Bayes-
based classifier (NB) [132]. The vaguely quantified fuzzy-rough nearest neighbour
(VQNN) [118] is also employed, which is a very recent and powerful classification
technique. With such use of the various classifiers, a more comprehensive under-
standing of the resulting FSEs and the OC threshold-based aggregation method can be
reached.
A number of subset evaluators are used in the experiments, including CFS [93], PCFS [52], and FRFS [126]. Feature ranking methods are also employed in the
mixture of algorithms implementation, which will be introduced in detail in its
dedicated section (4.2.1.3). In total, 13 real-valued UCI benchmark data sets [78] are used to demonstrate the efficacy of the proposed approaches, several of which
are of reasonably high dimensionality and hence, present sufficient challenges for
FS. A summary of the characteristics of these data sets is given in Table 4.3, and the
Table 4.3: Data sets used for OC-FSE experimentation

Data set  Features  Instances  Classes   C4.5    NB    VQNN
arrhy        279       452       16     65.97  61.40  61.55
cleve         14       297        5     51.89  55.36  52.99
ecoli          8       336        8     82.53  85.21  84.21
glass         10       214        6     66.96  48.09  65.61
handw        257      1593       10     75.74  86.21  77.68
ionos         35       230        2     86.22  83.57  83.05
libra         91       360       15     68.24  63.63  67.25
multi        650      2000       10     94.54  95.30  98.03
ozone         73      2534        2     92.70  67.66  93.69
secom        591      1567        2     89.56  30.04  93.36
sonar         61       208        2     73.59  67.85  75.39
water         39       390        3     81.08  85.40  81.58
wavef         41       699        2     75.49  79.99  79.49
HS parameter settings empirically employed in the experiments are: |H| = 10–20, gmax = 1000–2000, δ = 0.5–1, while |P| is iteratively refined.
Stratified 10-FCV is employed, where a given data set is partitioned into 10
subsets. Of these 10 subsets, nine are used to form a training fold and a single subset
is retained as the testing data. The construction of the base classifier ensemble, and
the ensemble reduction process are both performed using the same training fold, so
that the reduced subset of classifiers can be compared using the same unseen testing
data. This process is then repeated 10 times (the number of folds). The advantage
of 10-FCV over random sub-sampling is that all objects are used for both training
and testing, and each object is used for testing exactly once per cross-validation run. The stratification
of the data prior to its division into different folds ensures that each class label has
equal representation in all folds, thereby helping to alleviate bias/variance problems
[17]. In the experiment, unless stated otherwise, 10-FCV is executed 10 times in
order to reduce the impact of the stochastic methods employed. The differences in
performance of various methods are statistically evaluated using paired t-test with
two-tailed p = 0.01.
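The fold-assignment step of this evaluation protocol may be sketched as follows; this is an illustrative re-implementation, not the code used in the experiments.

```python
import random

# Sketch of stratified 10-FCV repeated 10 times: within each class the
# instances are shuffled and dealt to the 10 folds in turn, so class
# proportions are preserved and every instance is tested exactly once
# per repetition.

def repeated_stratified_cv(labels, num_folds=10, repeats=10, seed=0):
    rng = random.Random(seed)
    runs = []
    for _ in range(repeats):
        by_class = {}
        for index, label in enumerate(labels):
            by_class.setdefault(label, []).append(index)
        fold_of = [None] * len(labels)
        for indices in by_class.values():
            rng.shuffle(indices)
            for position, index in enumerate(indices):
                fold_of[index] = position % num_folds
        runs.append(fold_of)
    return runs

labels = [0] * 40 + [1] * 20
runs = repeated_stratified_cv(labels)
# In every run, each fold holds 4 class-0 and 2 class-1 instances; fold f
# serves once as the test set while the other nine folds form the training data.
```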
The classification outcomes of the proposed OC threshold-based FSE aggregation
method (with ∆α = 0.1) are reported in Section 4.2.1. The base FS components
B, |B| = 20, are produced by the three different ensemble construction methods as
described in Section 4.1.1. With OC threshold α= 0.5 (which assimilates majority
vote), the OC-FSE-discovered feature subsets can be “flattened” into standard (single)
feature subsets. That is, the union of these subsets is regarded as a selected subset
itself. The accuracies of the classifiers trained using such flattened subsets are also
presented. The outputs from the base FS components are collected during the
ensemble construction process, which are then used to build traditional, feature
subset-based classifier ensembles.
Note that these ordinary FSEs (with 20 feature subsets) are certainly larger in size,
when compared to OC-FSEs with 10 flattened feature subsets. The purpose of the
comparison is to determine whether the proposed approach is indeed competitive, in
terms of classification accuracy, and subset size. The averaged accuracies of the single
FS algorithms are included as well, in order to indicate the performance baseline.
Comparative studies between the three FSE implementations are further made in
Section 4.2.2, where the performances of the ensembles are averaged across different
classifiers, thereby providing a high level reflection of the characteristics of these
approaches. Finally, Section 4.2.3 investigates the relationship between the
aggregation accuracy and the size of OC-FSE-selected feature subsets, as well as the
number of initial FS components involved in the ensemble construction.
4.2.1 Classification Results
The classification results are presented in Tables 4.4 to 4.6. The number of base
FS components |B| is 20 throughout. This facilitates comparison between different
approaches, especially between the three FSE implementations. However, note that
this may not be the most suitable for several data sets (e.g., those with fewer instances).
The figures highlighted in bold indicate statistically superior results in comparison to
the rest. As explained previously in Section 4.1.1.1, the evaluators that assess the quality of a feature subset as a whole (such as FRFS, CFS, and PCFS) are employed in
the stochastic search and data partition-based implementations. Because the source
of diversity arises from the randomised search, a feature ranking based evaluator
will typically (with minor variations from different cross-validation folds) result in
the same feature subset over different runs.
4.2.1.1 Ensemble Constructed via Stochastic Search
As shown in Table 4.4, the proposed method (OC-FSE) is able to deliver very competi-
tive classification performance and generally produces better results than the subsets
selected by a single FS algorithm. For the cleve, ecoli, and glass data sets, it
results in equal or better accuracies across almost all classifiers and FS methods,
while the ordinary FSE performs consistently well for the handw data set. Better
overall results are obtained for the CFS and FRFS based ensembles, though the
proposed method is outperformed by the ordinary FSE constructed using PCFS, for
the data sets handw, libra and multi. OC-FSE works well in conjunction with
the NB classifier (76.33% vs. 44.22% for the secom data set with PCFS), and the
ordinary FSE is able to achieve very competitive performance when paired with C4.5.
For low dimensional data sets such as cleve, ecoli, and glass, the FRFS evaluator
consistently selects the same features; therefore, no diversity is present in the resultant
FSEs. However, OC-FSE manages to improve several classifiers for cleve.
Although ordinary FSE performs better in a number of cases, it maintains much
larger ensembles than OC-FSE (20 vs. 10 in this experiment), and has higher space
and time complexity when compared with OC-FSE. Also, PCFS identifies features that are most consistent with the class. For higher dimensional data sets, the number
of common features selected within a given FSE may be significantly reduced. This
affects the performance of the feature subsets aggregated under higher α thresholds.
Finally, considering the flattened B∗0.5 to be in the form of a standard subset, its
performance is also promising and presents a compromise between subset size
(which determines classifier complexity) and classification accuracy.
4.2.1.2 Ensemble Constructed via Data Partitioning
Table 4.5 details the results collected using the data partition-based implementation.
Several characteristics revealed earlier also hold for this set of experiments: (1) OC-FSE
works exceptionally well with NB while the ordinary FSE performs better with C4.5.
(2) The ordinary FSE still produces better results for the data sets handw and libra.
(3) Less performance variation is observed for low dimensional data sets for the
FRFS evaluator. The proposed method achieves competitive scores for the high
dimensional and large data sets, such as multi, ozone, and secom, with reasonable
sized FSEs. This demonstrates the strength of the data partition-based ensemble
construction technique.
4.2.1.3 Mixture of Algorithms
For this set of experiments, a number of individual feature evaluators are considered,
including several feature ranking approaches including information gain, data relia-
Table 4.4: Classification accuracy % results of the stochastic search implementation; shaded cells indicate statistically significant improvements for each of the tested classification algorithms

Data set |       OC-FSE        | B∗0.5 (OC α = 0.5)  |    Ordinary FSE     |    Single Subset
         | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size

CFS
arrhy | 71.4 69.3 62.9 32.2 | 67.7 68.3 65.4 27.3 | 73.3 68.8 63.5 33.8 | 67.3 67.4 63.8 32.1
cleve | 55.8 56.6 53.2  6.7 | 55.3 56.4 52.2  6.7 | 55.6 56.6 53.2  6.7 | 55.3 56.4 52.2  6.7
ecoli | 79.8 82.7 82.2  4.7 | 79.2 81.3 81.9  4.7 | 79.8 82.6 82.2  4.9 | 79.2 81.3 81.9  4.7
glass | 70.9 47.6 67.0  6.6 | 70.4 47.68 68.0 6.6 | 68.8 47.4 67.0  6.6 | 70.4 47.7 68.0  6.6
handw | 80.5 83.5 57.4 81.6 | 74.1 81.0 55.0 67.1 | 86.5 86.1 67.5 82.0 | 75.4 83.6 62.6 81.7
ionos | 86.8 86.9 81.7 10.3 | 86.6 87.0 81.7 10.4 | 86.7 86.4 81.2 10.6 | 86.4 86.8 81.7 10.3
libra | 71.2 62.2 66.6 30.3 | 65.7 59.6 62.3 28.9 | 74.7 63.6 66.1 31.8 | 66.6 61.7 63.8 30.0
multi | 96.6 97.6 99.0 162  | 94.5 96.8 98.4 82.9 | 96.8 96.7 98.6 171  | 94.6 96.5 98.2 162
ozone | 93.6 74.3 93.7 20.9 | 93.1 73.9 93.7 20.7 | 94.0 73.8 93.7 23.2 | 93.1 74.0 93.7 20.5
secom | 92.9 83.1 93.4 18.6 | 92.3 72.6 93.4 18.3 | 93.2 73.7 93.4 25.2 | 92.4 71.6 93.4 19.4
sonar | 75.6 66.9 75.0 17.8 | 75.0 66.9 76.0 18.1 | 75.5 66.6 76.0 18.2 | 75.3 66.3 76.1 17.7
water | 81.8 86.0 86.3 10.4 | 81.0 86.0 85.9 10.5 | 82.1 86.0 86.1 10.7 | 81.1 85.9 85.9 10.4
wavef | 77.2 80.2 84.5 14.8 | 77.1 80.1 83.8 14.9 | 77.0 80.1 83.9 14.8 | 77.2 80.2 83.6 14.7

PCFS
arrhy | 70.1 69.9 61.9 44.7 | 64.7 65.8 62.7 21.7 | 73.9 69.6 61.8 49.4 | 66.1 65.4 62.2 44.0
cleve | 54.6 55.9 52.5  8.0 | 53.9 55.3 52.5  8.0 | 54.1 55.7 52.5  7.9 | 54.6 55.7 52.5  8.0
ecoli | 83.0 85.6 84.5  6.0 | 83.0 85.6 84.4  6.0 | 81.4 85.2 84.5  6.0 | 83.0 85.6 84.4  6.0
glass | 69.4 49.1 68.8  6.7 | 69.2 48.9 68.6  6.7 | 68.6 48.2 67.4  6.8 | 69.3 49.1 68.8  6.7
handw | 78.5 81.8 65.5 60.2 | 66.6 70.8 50.0 38.0 | 91.4 84.3 67.6 24.6 | 63.8 66.8 51.1 20.9
ionos | 87.3 81.5 81.0  7.6 | 85.6 77.4 79.4  6.5 | 89.5 81.5 81.9  7.5 | 85.5 77.4 79.1  7.5
libra | 71.7 59.7 64.3 20.5 | 60.2 54.0 61.3 13.6 | 76.5 61.4 66.6 18.7 | 64.1 60.4 63.7 21.2
multi | 91.7 94.6 94.6 42.0 | 85.4 87.4 89.2 19.2 | 97.4 95.4 97.0 14.0 | 81.5 84.2 83.7  9.8
ozone | 93.8 79.7 93.7 18.8 | 93.3 77.6 93.7 15.4 | 94.1 74.3 93.7 20.4 | 93.1 74.6 93.7 19.5
secom | 93.2 76.3 93.4 68.4 | 92.3 72.6 93.4 16.8 | 93.1 44.2 93.4 86.5 | 91.3 48.4 93.4 89.0
sonar | 77.1 66.8 76.5 11.5 | 73.7 67.1 76.3 11.7 | 78.8 66.4 77.9 12.1 | 74.2 66.8 77.8 12.9
water | 81.9 86.4 86.3  9.8 | 81.6 86.2 86.2  9.9 | 82.0 86.2 86.6 10.2 | 81.4 86.0 86.0 10.0
wavef | 78.8 81.7 80.0 11.1 | 74.6 80.3 77.5 11.9 | 82.8 81.9 82.7 11.4 | 74.4 79.2 77.0 10.9

FRFS
arrhy | 66.6 65.5 56.2 27.9 | 59.7 61.5 55.1 16.1 | 70.6 67.9 55.1 20   | 57.3 61.5 57.5 20.0
cleve | 53.3 56.3 55.9 10.0 | 52.2 55.6 54.6 10.1 | 52.6 55.3 54.9 10.0 | 53.2 55.6 55.6 10.0
ecoli | 83.7 85.7 85.1  6.9 | 83.7 85.7 85.1  6.9 | 83.7 85.7 85.1  6.9 | 83.7 85.7 85.1  6.9
glass | 67.8 47.7 65.2  9.0 | 67.8 47.7 65.2  9.0 | 67.5 47.7 65.2  9.0 | 67.8 47.7 65.2  9.0
handw | 82.4 84.4 67.6 73.2 | 69.2 73.7 55.3 50.3 | 91.7 84.5 66.0 24.5 | 62.8 66.0 54.3 25.0
ionos | 89.9 89.8 85.0 10.2 | 88.0 83.0 83.5  8.8 | 89.8 85.0 85.8 10.0 | 87.0 83.9 81.8 10.0
libra | 79.2 61.4 68.1 17.5 | 59.7 47.8 58.1  9.4 | 71.9 61.1 60.6 10   | 63.6 59.4 65.0 10.0
multi | 98.0 95.4 97.3 89.6 | 87.6 93.1 93.9 62.5 | 93.9 95.8 96.9 19.4 | 83.6 87.1 85.8 19.6
ozone | 94.0 80.6 93.7 25.0 | 93.1 77.6 93.7 19.5 | 94.1 72.3 93.7 25.0 | 93.1 71.8 93.7 25.0
secom | 93.4 92.1 93.4 76.9 | 92.0 90.6 93.4 47.9 | 93.4 92.9 93.4 24.1 | 93.0 89.9 93.4 24.4
sonar | 74.6 74.1 77.5 13.6 | 69.7 73.1 77.5 10.7 | 78.9 73.1 79.4 10.0 | 70.2 70.7 75.0 10.0
water | 83.9 85.9 83.6 10.0 | 81.5 85.4 83.1  9.9 | 85.4 85.9 83.6 10.0 | 81.5 84.6 83.3 10.0
wavef | 77.2 81.9 82.0 25.2 | 72.4 79.3 71.8 30.2 | 83.8 81.7 78.9 19.0 | 71.7 76.6 70.4 19.0
Table 4.5: Classification accuracy % results of the data partition-based implementation; shaded cells indicate statistically significant improvements for each of the tested classifiers

Data set |       OC-FSE        | B∗0.5 (OC α = 0.5)  |    Ordinary FSE     |    Single Subset
         | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size | C4.5  NB  VQNN Size

CFS
arrhy | 71.7 69.7 63.1 38.8 | 68.6 68.3 65.9 25.7 | 72.9 68.4 63.2 33.7 | 67.8 67.5 63.6 31.6
cleve | 55.7 56.8 53.4  6.6 | 55.5 56.9 53.1  6.7 | 54.4 56.5 53.3  6.6 | 55.7 56.7 52.7  6.8
ecoli | 78.0 83.8 82.8  4.9 | 79.2 82.3 82.2  4.9 | 79.8 82.5 82.3  4.9 | 79.6 82.2 81.9  4.8
glass | 71.0 48.3 67.2  6.4 | 70.3 48.0 67.6  6.5 | 68.6 47.9 68.7  6.4 | 69.0 48.1 67.5  6.6
handw | 81.3 86.1 59.3 81.4 | 73.7 79.0 50.0 60.1 | 86.6 84.2 67.2 82   | 75.5 83.7 63.1 79.7
ionos | 87.7 86.7 82.1 10.6 | 87.1 86.4 81.5 10.5 | 87.8 86.6 82.2 10.8 | 87.2 86.5 81.6 10.3
libra | 73.2 59.7 63.1 29.7 | 64.9 57.8 61.4 26.8 | 74.2 61.9 64.5 31.6 | 67.0 61.7 63.8 30.4
multi | 96.5 97.6 98.8 161  | 94.7 96.9 98.6 81.4 | 96.5 96.9 98.7 170  | 94.4 96.3 98.4 172
ozone | 93.8 74.6 93.7 21.4 | 93.2 73.9 93.7 21.1 | 94.2 73.8 93.7 23.1 | 93.2 73.9 93.7 21.4
secom | 93.1 82.3 93.4 20.0 | 92.4 75.0 93.4 16.9 | 93.3 75.3 93.4 24.9 | 92.2 70.2 93.4 18.4
sonar | 76.3 66.8 74.7 17.0 | 76.0 66.4 75.8 17.7 | 77.1 66.4 75.4 17.9 | 75.8 66.9 77.2 17.6
water | 82.5 86.5 86.8 10.2 | 81.2 85.9 86.1 10.1 | 82.5 86.4 86.5 10.5 | 81.8 86.4 86.5 10.5
wavef | 77.3 80.2 84.5 14.6 | 77.1 80.1 83.9 14.9 | 77.0 80.1 84.0 14.8 | 77.2 80.2 83.7 14.7

PCFS
arrhy | 70.5 70.3 60.5 45.3 | 64.9 66.6 62.2 21.9 | 74.3 69.8 61.8 49.2 | 66.3 65.9 61.9 44.6
cleve | 56.0 56.6 52.8  7.8 | 54.3 55.4 53.1  7.7 | 55.2 55.6 51.5  7.8 | 54.8 55.4 52.4  7.9
ecoli | 82.5 85.4 84.6  5.9 | 82.5 85.0 83.5  6.0 | 82.2 85.2 84.5  6.0 | 82.4 85.3 84.0  6.0
glass | 70.3 50.9 67.8  6.6 | 69.6 49.3 68.3  6.7 | 69.4 48.6 68.6  6.5 | 69.3 48.7 68.3  6.7
handw | 82.4 84.0 67.2 59.5 | 65.1 69.2 49.7 36.8 | 91.1 84.0 65.6 24.3 | 63.8 66.7 51.0 21.0
ionos | 89.6 81.7 80.8  8.7 | 86.7 78.9 79.7  7.1 | 90.0 81.2 82.2  7.4 | 86.0 78.0 80.3  7.4
libra | 70.7 60.4 62.9 22.8 | 59.2 53.3 59.5 14.4 | 76.4 62.0 66.6 18.8 | 64.0 60.3 64.1 20.2
multi | 97.6 95.0 97.1 43.1 | 85.6 87.4 89.2 19.4 | 97.6 95.4 96.3 14.0 | 81.8 84.2 83.6  9.7
ozone | 93.8 82.1 93.7 19.3 | 93.3 79.2 93.7 14.0 | 94.1 74.2 93.7 20.5 | 93.3 74.9 93.7 18.9
secom | 93.3 84.5 93.4 67.3 | 92.2 80.3 93.4 16.0 | 93.3 47.2 93.4 86.0 | 91.7 53.8 93.4 67.6
sonar | 77.0 67.2 75.7 11.6 | 73.4 66.7 76.1 11.8 | 76.7 67.6 77.4 11.9 | 73.4 67.0 76.3 11.3
water | 82.1 86.9 86.8  9.9 | 80.9 86.9 86.4 10.5 | 82.4 86.8 86.8 10.3 | 80.4 86.4 85.9  9.9
wavef | 79.7 81.9 82.9 11.4 | 74.8 80.6 77.8 12.6 | 83.0 81.9 81.9 11.5 | 74.7 79.4 77.1 10.9

FRFS
arrhy | 64.2 65.3 55.6 25.5 | 53.3 58.7 55.1 11.3 | 70.6 66.6 55.3 20.0 | 59.6 65.3 57.3 20.0
cleve | 52.5 55.9 56.0 10.0 | 50.5 53.5 56.0 10.0 | 51.5 53.9 55.3 10.0 | 52.2 54.6 52.2 10.0
ecoli | 82.5 85.5 85.8  6.9 | 82.5 85.5 85.8  6.9 | 82.8 85.5 85.8  6.9 | 82.5 85.5 85.8  6.9
glass | 66.8 49.7 63.0  9.0 | 66.8 49.7 63.0  9.0 | 66.8 49.7 63.0  9.0 | 65.4 49.2 62.5  8.9
handw | 82.7 84.2 65.0 71.5 | 68.8 73.9 55.9 50.5 | 90.9 84.1 67.5 24.3 | 65.2 68.3 55.4 23.6
ionos | 90.4 84.4 87.4 10.3 | 88.3 81.3 83.5  8.1 | 90.0 83.0 86.5 10.0 | 84.8 83.0 80.4 10.0
libra | 70.3 61.4 58.3 18.1 | 54.2 46.7 56.1  9.0 | 75.6 61.4 65.6 10.0 | 64.7 58.1 61.7 10.0
multi | 97.9 95.9 97.5 85.3 | 84.9 89.9 89.8 46.8 | 93.2 95.6 96.0 19.3 | 82.6 87.3 86.7 19.5
ozone | 93.9 79.9 93.7 25.0 | 93.1 77.8 93.7 20.1 | 94.0 72.1 93.7 25.0 | 93.3 72.8 93.7 25.0
secom | 93.4 92.6 93.4 73.8 | 91.6 89.3 93.4 39.3 | 93.4 92.6 93.4 24.1 | 92.7 77.3 93.4 24.4
sonar | 79.0 73.7 76.1 13.0 | 76.1 69.8 76.6  8.8 | 80.4 75.6 78.0 10.0 | 71.3 69.8 77.5 10.0
water | 86.4 85.6 84.1 10.1 | 84.6 83.9 83.1 10.1 | 86.2 85.9 83.3 10.0 | 83.9 84.4 82.1 10.0
wavef | 78.0 81.8 78.9 25.4 | 74.4 80.7 73.5 30.9 | 83.0 81.6 78.9 19.0 | 70.3 73.2 64.2 19.0
bility [22], chi-square [291], RELIEF [147], and symmetrical uncertainty [219], in
conjunction with various feature subset evaluators (FRFS, CFS, and PCFS). Together,
eight different evaluation methods are employed. A pseudo-random generator as
described in Section 4.1.1.3 is used to produce the required 20 ensemble components.
For FS rankers, the final feature subset size is adjusted according to the size of subsets
obtained by the subset evaluators.
The classification performances of the classifiers that utilise the constructed FSEs
are compared in Table 4.6. The most interesting results are achieved for the data sets
ecoli, ionos, sonar, water, and wavef, where all three tested classifiers have
an improved performance. By employing the proposed work, the overall accuracies
of both C4.5 and NB are improved for 10 out of 13 data sets, as compared to that
of VQNN (6/13 data sets). This indicates that the OC threshold-based aggregation
technique is favourable for such generically built ensembles.
Table 4.6: Classification accuracy % results of the mixture of algorithms; shaded cells indicate statistically significant improvements for each of the classification algorithms

Data set |         OC-FSE          |   B∗0.5 (OC α = 0.5)    |      Ordinary FSE
         | C4.5   NB   VQNN  Size  | C4.5   NB   VQNN  Size  | C4.5   NB   VQNN  Size

arrhy | 69.16 65.73 64.68 108.0 | 66.92 65.70 64.95 116.2 | 68.11 65.44 64.75 108.0
cleve | 55.85 56.69 52.79   7.1 | 55.71 56.72 52.96   7.0 | 55.91 56.48 52.79   8.1
ecoli | 81.42 84.30 82.17   4.7 | 79.04 80.30 81.16   4.6 | 79.75 80.51 81.49   5.7
glass | 69.02 50.14 64.17   5.5 | 67.12 49.47 63.31   5.4 | 67.77 49.53 64.40   6.5
handw | 78.54 84.51 64.04 114.1 | 76.48 84.05 62.47 125.3 | 78.97 84.66 63.24 105.1
ionos | 89.43 84.13 84.91  14.5 | 87.22 83.48 83.22  14.3 | 89.04 84.13 84.74  15.4
libra | 68.28 57.50 61.75  38.7 | 64.11 56.64 60.94  44.5 | 67.61 58.17 62.28  39.0
multi | 95.22 95.49 98.07 286.7 | 94.64 95.04 97.69 314.1 | 95.29 95.37 97.85 259.2
ozone | 93.78 70.96 93.69  32.1 | 92.67 69.46 93.69  35.5 | 93.68 69.78 93.69  32.1
secom | 91.45 44.22 93.36 223.7 | 90.04 36.94 93.36 255.2 | 91.19 34.97 93.36 189.9
sonar | 78.80 79.82 82.94  25.3 | 75.05 66.51 77.38  26.4 | 75.91 66.80 77.19  26.1
water | 82.77 86.56 85.85  16.4 | 81.95 86.26 84.77  17.3 | 82.26 86.46 84.87  17.4
wavef | 76.63 80.34 83.76  17.7 | 76.38 79.98 82.45  19.4 | 76.58 79.98 82.72  18.7
4.2.2 Comparison of Ensemble Generation Methods
A graphical view of the classification results achieved is shown in Fig. 4.4, detailing
the average performance and spread of the three FSE implementations, against the
classification models built using the base FS components over each of the 12 data
sets. The PCFS subset evaluator is used to select the features, and the FSEs are tested
using the C4.5 algorithm. The 10-FCV process is repeated 50 times per data set,
producing 500 different results in the expectation that the variations in training and
testing are sufficiently captured. Table 4.7 presents an overall statistical summary
which combines the results over all examined data sets. The mean accuracy is also
presented in order to give a full comparison of the performance differences between
the implementations.
Table 4.7: Statistical comparison of the three FSE implementations, using mean accuracy aggregated over all data sets; shaded cells indicate best results

            Mean   Median  Max    Min    SD
Stochastic  81.53  81.82   93.45  64.18  5.02
Partition   81.23  81.36   93.33  64.47  5.06
Mixture     80.21  80.29   93.23  63.80  5.11
Base        76.00  76.33   91.12  57.13  5.76
Paired t-test (p = 0.1) indicates that all FSE implementations have lower standard
deviations than the corresponding base FS components. This experimentally demon-
strates that the use of FSE reduces performance variance of the end classification
models. However, no significant differentiation can be determined between the
ensemble approaches themselves. From these figures, it can be concluded that no
implementation works universally better than the others, although the stochastic
search-based method leads in terms of overall classification accuracy, achieving best
scores in eight of the 12 cases. Note that the mixture-of-algorithm-based method
presents outstanding performance for the multi data set (highest accuracy of 95.18%
and lowest spread of 1.53%). No obvious improvement is noticed for the data sets
cleve and ecoli, other than a minor accuracy increase. This is expected as these
are low dimensional data sets. For all the data sets tested, the use of OC-FSE results
in an improvement in classification accuracy, while the number of features required
to perform the classification is also much reduced. This reflects that as a novel
filter-based approach, OC-FSE offers a beneficial pre-processing step for the purpose
of classification.
4.2.3 Scalability Tests
The aim of this set of experiments is to verify whether using a different number
of base FS components may affect the accuracy (and FSE size) of the proposed
approach. Obviously, this is a computationally expensive experimentation. Thus,
only the stochastic search-based ensemble generation method is employed along
with the PCFS evaluator. Fig. 4.5 details the variations in performance for several
Figure 4.4: Comparison of average classification accuracies (crosses) and spreads of the three FSE implementations and base components for each data set
data sets with reasonable complexity (low dimensional data sets are omitted). The
number of the base FS components ranges from 1 (where the method collapses into
a single standard FS algorithm), up to 128. The classification accuracies of the three
earlier adopted classifiers are displayed. The resultant FSE sizes are also included.
Most of the tested data sets reveal an increase in accuracy when the number
of base components rises. This is intuitively appealing as more ensemble members
generally improve the group diversity. If computational resources allow, it may be beneficial to employ more base components, where the proposed approach excels
for its constant evaluation complexity (defined by ∆α) as discussed in Section 4.1.3.
For data sets such as arrhy, sonar, and libra, the size of the FSE shows very
minor variation (less than one feature on average), while an almost exponential increase
is observed for the data sets handw (257 features) and multi (650 features). It is
interesting to point out that increasing the number of base components does not
necessarily guarantee better performance, as demonstrated by the libra data set.
The accuracies of all classifiers peak at over 72% for this data set when 8
components are used, after which performance gradually declines.
4.3 Summary
This chapter has introduced an occurrence coefficient-based FSE (OC-FSE) approach
and detailed three distinctive techniques in an effort to implement this approach.
In OC-FSE, the outcomes of multiple, different FS runs are integrated,
for the purpose of producing a high level view that helps to perform the subsequent
classification tasks. The key advantage of OC-FSE (and FSE in general) is that the
end classifier performance is no longer dependent upon merely one selected subset,
making it a potentially more flexible and robust technique, especially in dealing with
high dimensional and large data sets. For such data sets, multiple feature subsets
attaining equally high scores may be discovered when judged by a single
feature evaluator, but not all may perform equally well in terms of classification.
Two of the proposed implementations, the stochastic search-based and the data
partition-based, use just a single subset evaluation algorithm; whilst the mixture
of algorithms approach aims to produce the ensemble by combining distinctive FS
evaluation measures.
Comparative experimental studies have demonstrated that OC-FSE significantly
improves over single FS results, when combining subsets discovered using the three
Figure 4.5: Comparison of averaged OC-FSE classification accuracies and subset sizes, plotted against different numbers of base FS components
FSE construction techniques, respectively. The results have shown the strength
of OC-FSE in dealing with almost all data sets tested, having at least comparable
classification accuracies to those of ordinary FSEs built using the same subsets, whilst
reducing the overall ensemble complexity. In particular, the stochastic search-based
approach appears to perform better than the rest, which may have benefited from
the high quality search results ensured by HSFS.
The limitations of the current OC-FSE implementations are discussed in Section 9.2.1.2,
which also includes planned future extensions that may further improve the efficacy of the
present work. It is worth pointing out here that ensemble-based classification models,
especially those that rely on the underlying feature subsets to create different views
of data, are also prone to irrelevance and redundancy. This is because the selected
feature subsets, unless carefully controlled, may not generate sufficient diversity.
Experimental results also confirm that employing an arbitrarily large BCP does
not guarantee good ensemble performance. The subsequent chapter explores the
potential of FS mechanisms in identifying the less informative ensemble members
(the so-called base classifiers), so that the efficiency and classification performance
of the resultant ensemble may be further enhanced.
Chapter 5
HSFS for Classifier Ensemble
Reduction
THIS chapter is a continuation of the investigation carried out by its predecessor,
where HSFS has been utilised to generate distinctive underlying feature subsets,
thus enabling the construction of diverse groups of base classifiers, i.e., FSEs. The
goal of classifier ensemble reduction (CER) [250] (or classifier ensemble pruning)
studied in this chapter, is to reduce the level of redundancy in such pre-constructed
pools of base classifiers, in order to identify a much reduced subset of classifiers that
can still deliver comparable classification results. Alternative approaches to building
classifier ensembles (other than FSE) also involve diversifying the training data [27], or random partitioning of the input space [102], before finally aggregating their
decisions together to produce the ensemble prediction.
CER is an intermediate step between ensemble construction and decision aggre-
gation. Efficiency is one of the obvious gains from CER. Having a reduced number
of classifiers can eliminate a portion of run-time overhead, making the ensemble
processing quicker; having fewer classifiers also means relaxed memory and storage
requirements. Removing redundant ensemble members may also lead to improved
diversity within the group, further increasing the prediction accuracy of the ensemble.
Existing approaches in the literature include techniques that employ clustering [85] to discover groups of models that share similar predictions, and subsequently prune
each cluster separately. Others use reinforcement learning [200] and multi-label
learning [178] to achieve redundancy removal. A number of similar approaches
[150, 251] focus on selecting a potentially optimal subset of classifiers, in order to
maximise a certain predefined diversity measure.
In this chapter, a new framework for CER is presented which builds upon the
ideas from existing FS techniques. Inspired by the analogies between CER and FS,
this approach attempts to discover a subset of classifiers by eliminating redundant
group members, while maintaining (or increasing) the level of diversity within the
original ensemble. As a result, the CER problem is tackled from a different
angle: each ensemble member is now transformed into an artificial feature in a
newly constructed data set, and the “feature” values are generated by collecting
the respective classifier predictions. FS algorithms can then be used to remove
redundant features (now representing classifiers) in the present context, in order
to select a minimal classifier subset while maintaining original ensemble diversity,
and preserving ensemble prediction accuracy. The current CER framework extends
the original idea [61] that works exclusively with the fuzzy-rough subset evaluator
[126], thus allowing many different FS evaluators and subset search methods to be
used. It is also made scalable for reducing very large classifier ensembles.
The fusion of CER and FS techniques is of particular significance for problems
that place high demands on both accuracy and speed, including intelligent robotics
and systems control [161]. For instance, simultaneous mapping and localisation
has been identified as a very important task for building robots [176]. To perform
such tasks, apart from the direct use of raw data or simple features as geometric
representations, different approaches that capture more contextual information have
been utilised recently [149]. It has been recognised that ensemble-based methods
may better utilise these additional cognitive and reasoning mappings in order to
boost the performance. In effect, CER may be adopted to prune down the redundant,
unessential models, so that the complexity of the resultant system is restricted to a
manageable level. Also, FS has already been successfully applied to challenging real-
world problems like Martian terrain image classification [224], and to reducing the
computational cost in vision-based robot positioning [263] and activity recognition
[253]. It is therefore of natural appeal to be able to integrate classifier ensembles
and CER, in order to further enhance the potential of both types of approach.
The remainder of this chapter is laid out as follows. Section 5.1 introduces the
key concepts of the proposed CER framework that builds upon the HSFS algorithm,
illustrating how CER can be modelled as an FS problem, and details the approach
developed to address this task. Section 5.2 presents the experimentation results
along with discussions. Section 5.3 summarises the chapter.
5.1 Framework for Classifier Ensemble Reduction
For most practical scenarios, the classifier ensemble is generated and trained using
a set of given training data. For new samples, each ensemble member individually
predicts a class label; these predictions are then aggregated to provide the ensemble decision.
It is inevitable that such ensembles contain redundant classifiers that share very
similar if not identical models. This may be caused by the shortage of training data,
or the performance limitations of the model diversification process. Such ensemble
members, while occupying valuable system resources, are likely to draw the same
class prediction for new samples, and therefore provide very limited new information
to the group.
The ensemble reduction process, if carried out in between ensemble generation
and aggregation, may reduce the amount of redundancy in the system. The benefit
of having a group of classifiers is to maintain and improve the ensemble diversity.
The fundamental concept and goals of CER are therefore the same as those of FS. Having
already introduced the HSFS technique (Chapter 3), the following section focuses
on explaining how a CER problem can be converted into an FS scenario, and details
the framework proposed to efficiently perform the reduction. The overall approach
developed in this work is illustrated in Fig. 5.1, which contains four key steps.
5.1.1 Base Classifier Pool Generation
Forming a diverse base classifier pool (BCP) is the first step in producing a good
classifier ensemble. Any preferred methods can be used to build the base classifiers,
such as Bagging [27] or Random Subspace [102]. A BCP can either be created using
a single classification algorithm, or through a mixture of classifiers.
Bagging can generate a number of base learners each from a different bootstrap
data set by calling the same base learning algorithm. A bootstrap data set is obtained
by sub-sampling the original training data set with replacement. The size of the
data set is the same as that of the training data set. Thus, for a (conventional)
bootstrap sample [27], some training instances may appear but some may not,
Figure 5.1: Overview of CER
where the probability that an example appears at least once is about 0.632 [295]. After obtaining the base learners, Bagging combines their outputs using aggregation
methods such as majority vote [249], and the most-voted class is predicted. The
pseudocode for Bagging is shown in Algorithm 5.1.1. Since Bagging randomly selects
different subsets of training samples in order to build diverse classifiers, differences
in the training data present extra or missing information for different classifiers,
resulting in different classification models.
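The 0.632 figure quoted above can be recovered directly: the chance that a particular instance is never drawn in |X| samples with replacement is (1 − 1/|X|)^|X|, which tends to e⁻¹ ≈ 0.368, so the chance of appearing at least once tends to ≈ 0.632. A quick check in plain Python (illustrative only, no thesis code assumed):

```python
import math

def inclusion_probability(n):
    """Probability that a given instance appears at least once in a
    bootstrap sample of size n drawn with replacement from n items."""
    return 1.0 - (1.0 - 1.0 / n) ** n

# The limit 1 - 1/e ≈ 0.632 is approached quickly as n grows.
print(round(inclusion_probability(1000), 3))  # ≈ 0.632
print(round(1.0 - 1.0 / math.e, 3))           # 0.632
```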
The Random Subspace method randomly generates different subsets of domain
attributes and builds various classifiers on top of each of such subsets. Algorithm
5.1.2 outlines the basic procedures, assuming a predefined subspace size of s, and
the features are chosen randomly without replacement. The differences between the
subsets create different viewpoints of the same problem [40], typically resulting in
different boundaries for classification.
For a single base classification algorithm, these two methods both provide a
good level of diversity. In addition, a mixed classifier scheme is implemented in
the presented work. By selecting classifiers from different schools of classification
1  t, number of training rounds
2  C_i, i = 1, ..., t, base learners
3  X_i, i = 1, ..., t, bootstrap data sets
4  x_j, j = 1, ..., |X|, original training objects
5  Y = A ∪ Z, conditional and decision features
6  for i = 1 to t do
7      X_i = ∅
8      while |X_i| < |X| do
9          random r, 1 ≤ r ≤ |X|
10         X_i = X_i ∪ {x_r}
11     Train C_i using (X_i, Y)
Algorithm 5.1.1: Bagging algorithm
1  t, number of training rounds
2  C_i, i = 1, ..., t, base learners
3  A_i, i = 1, ..., t, random feature subsets
4  a_j, j = 1, ..., |A|, original set of features
5  Y = A ∪ Z, conditional and decision features
6  for i = 1 to t do
7      A′ = A
8      A_i = ∅
9      while |A_i| < s do
10         random a_r, a_r ∈ A′
11         A_i = A_i ∪ {a_r}
12         A′ = A′ \ {a_r}
13     Train C_i using (X, A_i ∪ Z)
Algorithm 5.1.2: Random Subspace algorithm
algorithms, the diversity is naturally achieved through the various foundations of
the algorithms themselves.
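The Random Subspace step of Algorithm 5.1.2 amounts to drawing s features without replacement for each learner; again, `build_classifier` is a hypothetical stand-in for the chosen base algorithm:

```python
import random

def random_subspace_pool(features, build_classifier, t, s):
    """Train t base learners, each on a random subset of s features
    chosen without replacement from the full feature set."""
    pool = []
    for _ in range(t):
        subset = random.sample(features, s)  # s distinct features
        pool.append((subset, build_classifier(subset)))
    return pool
```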
5.1.2 Classifier Decision Transformation
Once the base classifiers have been built, their decisions on the training instances are
also gathered. For base classifiers Ci, i = 1, 2, . . . , |C|, Ci ∈ C, and training instances
x j, j = 1,2, . . . , |X |, where |C| is the total number of base classifiers, and |X | is the
total number of training instances, a decision matrix as shown in Table 5.1 can be
constructed. The value di j represents the ith classifier’s decision on the jth instance.
For supervised FS, a class label is required for each training sample; the same class
attribute is taken from the original data set and assigned to each of the instances.
Note that both the total number of instances and the relations between instances
and their class labels remain unchanged. Although all attributes and values are
completely replaced by transformed classifier predictions, the original class labels
remain the same. A new data set is therefore constructed: each column represents
an artificially generated feature, each row corresponds to a training instance, and
each cell stores the transformed feature value.
Table 5.1: Classifier ensemble decision matrix

X       C1       C2       ···   Ci       ···   C|C|
x1      d11      d21      ···   di1      ···   d|C|1
x2      d12      d22      ···   di2      ···   d|C|2
...     ...      ...            ...            ...
xj      d1j      d2j      ···   dij      ···   d|C|j
...     ...      ...            ...            ...
x|X|    d1|X|    d2|X|    ···   di|X|    ···   d|C||X|
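The transformation described above can be sketched as follows; classifiers are represented as hypothetical prediction callables, and each returned row follows the layout of Table 5.1 with the original class label appended as the decision attribute:

```python
def build_decision_dataset(pool, instances, labels):
    """Row j holds the predictions d_ij of every classifier C_i on
    instance x_j, plus the original class label of x_j."""
    return [[clf(x) for clf in pool] + [y]
            for x, y in zip(instances, labels)]

# Two toy "classifiers" over integer instances:
pool = [lambda x: x % 2, lambda x: int(x > 1)]
rows = build_decision_dataset(pool, [1, 2, 3], ['a', 'b', 'a'])
# rows == [[1, 0, 'a'], [0, 1, 'b'], [1, 1, 'a']]
```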
5.1.3 FS on Transformed Data set
HSFS is then performed on the artificial data set, evaluating the emerging feature
subset using the predefined subset evaluator (such as the fuzzy-rough dependency
measure [126]). HSFS optimises the quality of discovered subsets, while trying to
reduce subset sizes. When HS terminates, its best harmony is translated into a feature
subset and returned as the FS result. The features then indicate their corresponding
classifiers that should be included in the learnt classifier ensemble. For example,
if the best harmony found by HS is {C−, C9, C3, C23, C3, C5, C17, C−}, the translated
artificial feature subset is then {C3, C5, C9, C17, C23}. Thus, the 3rd, 5th, 9th, 17th
and 23rd classifiers will be chosen from the BCP to construct the classifier ensemble.
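Under the assumption that a harmony is represented as a list with one entry per musician (a classifier index, or None for the discard note C−), the translation step reduces to deduplication:

```python
def harmony_to_subset(harmony):
    """Translate an HS harmony into the selected classifier indices:
    drop None (C-) entries and duplicate picks, then sort."""
    return sorted({c for c in harmony if c is not None})

# The example from the text: C-, C9, C3, C23, C3, C5, C17, C-
print(harmony_to_subset([None, 9, 3, 23, 3, 5, 17, None]))  # [3, 5, 9, 17, 23]
```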
5.1.4 Ensemble Decision Aggregation
Once the classifier ensemble is constructed, new objects are classified by the ensemble
members, and their results are aggregated to form the final ensemble decision output.
Such an aggregation process allows the evidence from different sources, i.e., the
class labels predicted by the individual base classifiers, to be combined. This is in
order to derive a degree of belief (represented as a certain belief function [220]) that
takes into account all the available evidence. In particular, the Average of Probability
[114] method is used in this chapter. Given ensemble members C_i, i = 1, 2, ..., |C|,
and decision classes d_j, j = 1, 2, ..., |Ω_z|, where |C| is the ensemble size and |Ω_z| is
the number of decision classes, the classifier decisions can be viewed as a matrix of
probability distributions p_ij, i = 1, 2, ..., |C|, j = 1, 2, ..., |Ω_z|. Here, p_ij indicates the
prediction from classifier C_i for decision class d_j. The final aggregated decision is
the class with the highest prediction averaged across all ensemble members,
as shown in Eq. 5.1.

\[
\left( \frac{\sum_{i=1}^{|C|} p_{i1}}{|C|}, \frac{\sum_{i=1}^{|C|} p_{i2}}{|C|}, \ldots, \frac{\sum_{i=1}^{|C|} p_{i|\Omega_z|}}{|C|} \right) \tag{5.1}
\]
Note that this is effective because redundant classifiers have now been removed. As such,
the usual alternative aggregation method, majority vote [249], is no longer favourable,
since the “majority” has now been significantly reduced.
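Eq. 5.1 reduces to an arithmetic mean per class followed by an argmax. A minimal sketch, with each ensemble member's class-probability distribution given as a list:

```python
def average_of_probability(distributions):
    """Return the index of the class with the highest class probability
    averaged over all ensemble members (Eq. 5.1)."""
    n = len(distributions)       # |C|, ensemble size
    k = len(distributions[0])    # number of decision classes
    averaged = [sum(dist[j] for dist in distributions) / n for j in range(k)]
    return max(range(k), key=lambda j: averaged[j])

# Two members, three classes; averaged probabilities are (0.45, 0.40, 0.15)
print(average_of_probability([[0.6, 0.3, 0.1], [0.3, 0.5, 0.2]]))  # 0
```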
5.1.5 Complexity Analysis
Various factors affect the overall complexity of the proposed CER framework, namely
the performance of the base classification algorithm and that of the subset evaluator. Since
the proposed CER framework is generic and not limited to a specific collection of
methods, in the following analysis, O_C(train), O_C(test), and O_eval are used to represent
the complexity of training and testing the employed base classifier, and that of
the subset evaluator, respectively. The amount of time required to construct the
base ensemble, (O_Bagging + O_C(train)) × |C|, can be rather substantial if the size of the
ensemble |C| is very large. The process of generating the artificial training data
set is straightforward, requiring only O_C(test) × |C| × |X|, where |X| is the number of
instances.
Recall from Section 3.3.3 that HSFS requires O_eval × g_max to perform the subset
search, as the total number of evaluations is controlled by the maximum number of
iterations g_max. Note that the subset evaluation itself can be time consuming for high
dimensional data (large sized ensembles). As for the complexity of the HS algorithm itself:
the initialisation requires O(|P| × |H|) operations to randomly fill the subset storage,
where |P| is the number of musicians. The improvisation process is of the order
O(|P| × g_max), because every feature selector needs to produce a new feature at every
iteration. Finally, the complexity of predicting the class label for any new sample is
O_C(test) × |C|, where |C| is now the size of the reduced ensemble.
5.2 Experimentation and Discussion
To demonstrate the capability of the proposed CER framework, a number of exper-
iments have been carried out. The implementation works closely with the WEKA
[264] data mining software which provides software realisation of the algorithms
employed, and an efficient platform for comparative evaluation. To implement the
ideas proposed in this chapter, the main ensemble construction method adopted
is the Bagging approach [27], and the base classifier learner used is the decision
tree-based C4.5 algorithm [264]. The Correlation Based FS [93, 94] (CFS), the
Probabilistic Consistency Based FS [52] (PCFS), and the FS technique developed
using fuzzy-rough set theory [126] (FRFS) are employed as the feature subset eval-
uators. The HSFS algorithm then works together with the various evaluators to
identify quality feature (classifier) subsets. In order to demonstrate the scalability
of the framework, the base ensembles are created in three different sizes: 50, 100,
and 200. A collection of real-valued UCI [78] benchmark data sets are used in the
experiments, a number of which are very large in size and high in dimensionality, and
hence present significant challenges for the construction and reduction of ensembles.
The parameters used in the experiments and the information of the data sets are
summarised in Table 5.2.
Table 5.2: HS parameter settings and data set information
|H| |P| δ gmax
10-20 |A| 0.5-1 2000
Data set   Features   Instances   Decisions
arrhy        280        452          16
cleve         14        297           5
ecoli          8        336           8
glass          9        214           6
heart         13        270           2
ionos         35        230           2
libra         91        360          15
ozone         73       2534           2
secom        591       1567           2
sonar         61        208           2
water         39        390           3
wavef         41       5000           3
wine          14        178           3
Stratified 10-FCV is employed for data validation, where a given data set is
partitioned into 10 subsets. Of these 10 subsets, nine are used to form a training
fold, and a single subset is retained as the testing data. The construction of the base
classifier ensemble, and the ensemble reduction process are both performed using
the same training fold, so that the reduced subset of classifiers can be compared
using the same unseen testing data. This process is then repeated 10 times (the
number of folds). The advantage of 10-FCV over random sub-sampling is that all
objects are used for both training and testing, and each object is used for testing only
once per fold. The stratification of the data prior to its division into different folds
ensures that each class label has equal representation in all folds, thereby helping
to alleviate bias/variance problems [18]. The experimental outcomes presented
are averaged values over 10 different 10-FCV runs, in order to lessen the
impact of random factors within the heuristic algorithms.
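The stratified partitioning described above can be sketched without any library support; this is an illustrative round-robin scheme, not the exact implementation used in the experiments:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign instance indices to k folds so that each class is
    spread as evenly as possible across all folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):  # round-robin per class
            folds[pos % k].append(idx)
    return folds
```

Each of the k folds then serves once as the test set, with the remaining k − 1 forming the training fold.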
5.2.1 Reduction Performance for Decision Tree-Based
Ensembles
In this set of experiments, the BCP is built using a decision tree-based approach, and
C4.5 [264] is selected as the base algorithm. Table 5.3 summarises the three
sets of results obtained for CFS, PCFS, and FRFS respectively, after applying CER, as compared
against the results of using: (1) the base algorithm itself, (2) the full base classifier
pool, and (3) randomly formed ensembles. Entries in bold indicate that the selected
ensemble performance is either statistically equivalent to, or significantly improved over,
that of the original ensemble, according to a paired t-test with two-tailed threshold
p = 0.01.
Two general observations can be drawn across all three set-ups: (1) The prediction
accuracies of the constructed classifier ensembles are universally superior to those
achievable using a single C4.5 classifier. Most of the data sets that reveal the
greatest performance increase are either large in size or high in dimensionality.
This reinforces the benefit of employing classifier ensembles. (2) All FS techniques
tested demonstrate substantial ensemble size reduction, showing clear evidence of
dimensionality reduction.
For the original ensembles of size 50, the CFS evaluator performs very well. In
seven out of 11 tested data sets, CFS achieves comparable or better classification
accuracy when compared with the original ensemble. The FRFS evaluator also
Table 5.3: Comparison on C4.5 classification accuracy

           CFS          PCFS         FRFS         Random       Full Base    C4.5
Data set   Acc.%  Size  Acc.%  Size  Acc.%  Size  Acc.%  Size  Acc.%  Size  Acc.%

Base Ensembles of Size 50
arrhy      74.59  21.6  71.93   5.3  74.81  26.3  73.71  10    74.47  50    66.39
cleve      55.54  25.8  56.57   5.7  56.60  13.6  54.16  10    54.90  50    50.21
ecoli      84.55  11.6  83.95   6.8  83.96  23.8  83.94  10    84.24  50    81.88
glass      74.46  15.0  66.45   4.6  76.71  11.9  72.94  10    70.24  50    70.15
ionos      91.30  10.8  90.00   3.2  90.43   3.1  90.00  10    90.87  50    87.39
libra      79.44  23.2  74.72   3.5  78.89  15.4  77.78  10    81.67  50    71.39
ozone      93.88  26.2  94.12  12.3  93.96  43    93.40  10    94.00  50    92.94
secom      93.30  35.9  92.79   6.3  92.92   6.3  93.11  10    93.24  50    89.28
sonar      75.31  24.5  71.93   3.3  71.05   3.2  72.45  10    75.88  50    70.05
water      87.69  20.9  83.33   4    84.61   6.1  84.87  10    86.67  50    80.00
wavef      82.92  42.2  81.50   8.7  82.47  11    81.00  10    82.98  50    75.50

Base Ensembles of Size 100
arrhy      73.91  28.3  73.04   5.2  74.37  22.3  73.26  20    74.47  100   66.39
cleve      54.56  30.4  58.26   6.7  54.46  11.8  55.56  20    56.56  100   50.21
ecoli      84.85  13.7  85.76   6.4  85.16  24.2  84.84  20    84.25  100   81.88
glass      71.60  16.5  70.58   4.5  74.31  11.7  72.53  20    74.42  100   70.15
ionos      89.13  14.3  90.43   3.1  84.35   3.2  90.87  20    91.74  100   87.39
libra      80.83  33.0  74.17   3.5  77.78  15.3  77.22  20    80.28  100   71.39
ozone      94.24  31.8  93.84  13.5  94.16  74.2  94.16  20    94.16  100   92.94
secom      93.43  59.4  93.04   6.2  92.51   6.1  93.00  20    93.30  100   89.28
sonar      75.36  30.4  72.88   3.8  75.36   3.5  72.93  20    75.36  100   70.05
water      87.18  25.7  85.64   4.7  86.15   6.2  87.18  20    86.92  100   80.00
wavef      83.20  71    80.88   9    83.33  11    82.90  20    83.42  100   75.50

Base Ensembles of Size 200
arrhy      75.47  39.9  72.80   5.7  73.04  21.3  74.37  40    75.25  200   66.39
cleve      57.93  45    52.56   5.8  55.24  11.9  55.54  40    54.90  200   50.21
ecoli      83.96  24.5  83.94   6.6  84.29  24.3  84.86  40    84.54  200   81.88
glass      72.53  25.9  72.97   4.8  72.49  11.6  72.08  40    73.94  200   70.15
ionos      90.87  20    86.09   3.2  90.87   3.6  89.57  40    91.74  200   87.39
libra      81.67  41    74.17   4    81.11  15.1  79.17  40    79.44  200   71.39
ozone      94.55  45    93.65  28.9  94.49  143   94.40  40    94.24  200   92.94
secom      93.36  95.7  92.92   6.2  93.38   6    93.30  40    93.36  200   89.28
sonar      78.69  45.8  73.36   4.1  74.31   4.5  74.88  40    75.83  200   70.05
water      87.95  38.6  83.33   4.3  85.87   6.8  86.67  40    87.95  200   80.00
wavef      83.12  107.2 81.06   9.3  82.40  12    82.76  40    83.48  200   75.50
delivers good accuracies in four data sets while having fairly small reduced ensembles.
The PCFS evaluator only produces equally good solutions for the cleve and ozone data sets;
however, it has the most noticeable ensemble size reduction ability. The reduced
ensembles demonstrate increased classification performance for the cleve, glass,
and water data sets.
For the medium (100) sized ensembles, both CFS and FRFS produce good results
in five data sets, however, none of these further improves the ensemble classification
accuracy. Although PCFS only achieves the best performance for the cleve data set,
it attains an average accuracy of 58.26% across all 10 × 10 reduced
ensembles, with an averaged size of only 6.7. Note that for the ozone and sonar
data sets, the reduced ensembles discovered by CFS and FRFS both show very similar
averaged accuracy, which is almost identical to that of the original full ensembles.
This may indicate that the key members of the ensembles are indeed present in
the reduced subsets, with FRFS eliminating the most redundancy (an average reduced
ensemble size of 3.5) for the sonar data set.
For the large sized ensembles, CFS shows a clear lead in terms of the overall quality
of the reduced ensembles, scoring equal classification accuracy for five data sets, and
delivering an improvement in ensemble accuracy for the cleve, libra, and sonar
data sets. This experimentally demonstrates the capability and benefit of employing
the proposed CER framework in dealing with large sized ensembles and large,
complex data sets. FRFS also produces good quality ensembles of much reduced
size, showing its strength in redundancy removal. PCFS is not competitive in this set
of experiments; this may be due to its (perhaps overly) aggressive reduction behaviour,
which possibly results in certain quality ensemble members being ignored.
5.2.2 Alternative Ensemble Construction Approaches
The following set of experiments compares the supervised FS approach (FRFS) with
its unsupervised counterpart [169] (U-FRFS). A total of 10 different base classification algorithms
are selected, containing one to two distinctive classifiers from each representative
classifier group. The selected methods include fuzzy-based fuzzy nearest neighbours
[140], fuzzy-rough nearest neighbours [117], vaguely quantified fuzzy-rough nearest
neighbours [117], lazy-based k-nearest neighbours [4], tree-based C4.5 [264], reduced
error pruning tree [70], rule-based methods with repeated incremental pruning to
produce error reduction [264] and projective adaptive resonance theory [264], naïve
Bayes [132], and multilayer perceptron [98].
Bagging [27] and Random Subspace [102] are subsequently used to create
differentiation between classifiers to fill the total BCP of 50. Figures 5.2 and 5.3
show the experimental results, using these two methods respectively. Due to the
considerable system resources required to construct and maintain the base ensembles,
this set of experiments is carried out using ensembles of size 50 with lower
dimensional benchmark data sets.
For mixed classifiers created using Bagging (Fig. 5.2), the FRFS method finds
ensembles with much greater size variation. For the ecoli data set in particular, the
averaged ensemble size is 15.98. The results indicate that many distinctive features
(i.e., classifiers contributing good diversity) are present. This particular ensemble also results in
the highest accuracy for ecoli compared against the other approaches, with 87.67%
BCP accuracy and 86.66% ensemble accuracy. A large performance decrease is also
noticed for the sonar data set. Interestingly, the unsupervised FRFS achieves better
overall performance than its supervised counterpart, with smaller selected ensemble
sizes.
Figure 5.2: Mixed classifiers using Bagging
The Random Subspace based mixed classifier scheme (Fig. 5.3) produces better
base pools in 7 out of 9 cases. Both FRFS and U-FRFS find smaller ensembles on
average than the case where Bagging is used. Neither method suffers from an extreme
performance decrease following reduction, unlike the results obtained when a single
base algorithm is employed. Despite having a BCP that underperforms for the ecoli
data set, both methods manage to achieve an increase of 5% in accuracy. The quality
of the mixed classifier group is lower than that of the C4.5 based single algorithm
approach for several data sets. This is largely due to the use of non-optimised base
classifiers. It can be expected that the results achievable after optimisation would be
even better.
Figure 5.3: Mixed classifiers using Random Subspace
5.2.3 Discussion
Although the execution times of the examined approaches have not been precisely
recorded and presented, it was observed during the study that data sets with a large
number of instances, such as ozone, secom, and wavef, all require a
substantial amount of time for the reduction process. This observation seems to be
consistent with the findings of the complexity analysis in Section 5.1.5: the reduction
process relies on the efficacy of the evaluators (which may not scale linearly with
the number of training instances), and thus, for huge data sets, it may be beneficial
to choose lighter-weight evaluators (such as CFS). However, since the reduction
process itself can be performed independently and separately from the main ensemble
process, CER is generally treated as a pre-processing step (similar to FS) for the
ensemble classification, or a post-processing refinement procedure for the generated
raw ensembles. The time complexity for such processes is less crucial and has less
impact.
The experimental evaluation also reveals that different evaluators show distinctive
characteristics when producing the reduced ensemble. For example, PCFS consistently
delivers very compact ensembles (with fewer than 10 members for most data
sets). CFS excels in terms of ensemble classification accuracy but with much larger
sized subsets. FRFS is balanced between ensemble accuracy and dimensionality re-
duction, with very occasional large solutions (the ozone data set). The unsupervised
method also produces comparable results to its supervised counterparts.
Note that for a number of experimental data sets, performing CER does not always
yield subsets with equal or better performance. This may be due to the employed filter-based
FS approaches (which do not cross-examine against the original data in terms
of classification accuracy). How the concepts developed by existing wrapper-based and
hybrid FS techniques may be applied to further improve the framework remains an active
research question. The information lost through reduction (even from the redundant classifiers)
may also be the cause of such decreases in performance. Similar behaviour has also
been observed in the FS problem domain. The quality (such as size and variance) of
the training data also plays a very important role in CER: the classifiers that were
deemed redundant by the subset evaluators may in fact carry important internal
models, which are simply not sufficiently reflected by the available training samples.
5.3 Summary
This chapter has presented a new approach to CER. It works by applying FS techniques
to minimise redundancy in an artificial data set, generated by transforming a
given classifier ensemble's decision matrix. The aim is to further reduce the size of
an ensemble, while maintaining and improving classification accuracy and efficiency.
Experimental comparative studies show that several existing FS approaches can
produce good solutions by employing the proposed approach. Reduced ensembles
are found with comparable classification accuracies to those of the original ensembles,
and in most cases also provide good improvement over the performance achievable
by the base algorithm alone. The characteristics of the results also vary depending
on the employed FS evaluator. As a novel application of the FS concept in the area
of classifier ensemble learning, the present work has identified a promising direction
for future theoretical research, and has laid the necessary foundation upon which
further extensions and refinements may be built. More in-depth discussions
regarding these ideas are given in Section 9.2.1.3.
Chapter 6
HSFS for Dynamic Data
Most of the FS techniques discussed so far focus on selecting from a static pool of
training instances with a fixed number of original features. However, for most
real-world domains, data may be gradually refined, and information regarding the
problem domain may be actively added and/or removed. Dynamic FS [58, 99, 289],
also referred to as on-line FS [268], has attracted significant attention recently.
Unlike conventional, off-line FS, which is performed with all features and instances
present a priori, dynamic FS considers situations where the information
regarding a certain problem domain is not fully available. The extraction of features,
or the procedure of collecting new instances may be difficult or time consuming.
New sets of features or instances may only be presented in an incremental fashion,
and the FS technique needs to adapt to the new information quickly and accurately.
Existing studies in the literature typically work with a classifier learner [99, 260,
289], but also involve alternative applications such as prediction [76]. Little work
has been carried out for studying situations where features or instances are removed.
However, such scenarios may be common for applications where data have a limited
validity [25, 34], and outdated information needs to be removed to ensure data
consistency or simply to save storage space. As previously explained in Chapters 4
and 5, a classifier ensemble [295] exploits the uncorrelated errors within a group
of classifiers caused by their diverse internal models [217], in order to increase
the classification accuracy over single classifier systems. FSE in particular, is an
effective type of classifier ensemble that generates a group of classifiers with diverse
underlying feature subsets, thereby creating different views of the original data
[195, 197]. Nature-inspired FS search techniques [192, 261] such as HSFS [62] can
help to construct such ensembles by producing multiple, compact, and high quality
feature subsets.
In this chapter, theoretical discussions are presented with respect to four basic dynamic
FS scenarios: feature addition, feature removal, instance addition, and instance
removal. It provides an insight of how a nature-inspired meta-heuristic such as HSFS
may be beneficial in more complex situations (arbitrary combination of the possible
events). A dynamic FS technique termed “dynamic HSFS” (D-HSFS) is proposed,
which is capable of actively maintaining the quality of an emerging feature subset for a
given changing data set. Its stochastic mechanisms also allow multiple good feature
subsets to be identified simultaneously. The subsequent part of the chapter further
investigates the feasibility of implementing an adaptive FSE framework (A-FSE)
using these actively refined feature subsets.
The remainder of the chapter is organised as follows. Section 6.1 introduces the
concept of dynamic FS with a discussion of four basic dynamic scenarios. Section 6.2
explains the proposed D-HSFS algorithm which handles arbitrary combinations of
the basic dynamic FS events. A generic A-FSE technique is presented in Section 6.3,
aiming to better handle changing data. An implementation of this technique using
D-HSFS is also detailed. Section 6.4 reports the results of an experimental investigation,
in order to demonstrate the efficacy of the proposed approach. Finally, Section 6.5
provides a brief summary of the work.
6.1 Dynamic FS Scenarios
The aim of a conventional (static), subset-based FS algorithm, as previously intro-
duced in Section 1.1, is to determine an optimal feature subset B ⊆ A with the best
evaluation score f (B) and minimum size |B|. Such a feature subset may encapsulate
the original concept to the maximum extent, and be able to distinguish the training
instances into their respective classes. Here f : 2^A → [0, 1] is a subset evaluation
function that maps feature subsets onto real-valued scores.
The nature of dynamic data sets requires any FS algorithm to depart from being a
one-off pre-processing step and become a recursive procedure. A previous feature
subset Bk obtained on the basis of the data set at an earlier state (Xk, Ak) needs to be
re-evaluated, against any newly added information as well as any removed instances
and features. This will produce a modified feature subset Bk+1 that has adapted to
the changed data (Xk+1, Ak+1). In this chapter, for simplicity, it is assumed that the
possible class labels are predefined and unaltered throughout the process: Zk = Zk+1.
Four common scenarios that may occur in a given dynamic FS environment are
introduced below, considering events where features or instances may be added or
removed. Insights are also provided regarding how a previously selected subset of
features Bk may be improved, in order to discover a higher quality feature subset
Bk+1. For ease of discussion, let Bk denote the feature subset that is of the highest
achievable evaluation score for the previous state of the data. The fuzzy-rough
dependency measure exploited by FRFS [126] is adopted in this section, in order to
provide concrete examples of the dynamic procedures. The evaluation function is
written as f^{Xk}_{Ak}(Bk), which signifies the quality of a given feature subset Bk,
evaluated on the basis of the data set at its current state (Xk, Ak). For dynamic
FRFS, the aim is (still) to find a fuzzy-rough reduct Rk ⊆ Ak, which is defined as a
subset of features that preserves the dependency degree of the unreduced data, i.e.,
f^{Xk}_{Ak}(Rk) = f^{Xk}_{Ak}(Ak).
6.1.1 Feature Addition
This scenario considers the situation where new features are incrementally added
during the FS process, i.e., |Ak+1|> |Ak|, whilst the set of training instances X remains
static. In particular, if the currently available set of features Ak is already capable of
fully distinguishing all objects x ∈ X into their respective classes, any subsequent
feature addition will bring no improvement to the discernibility of the data set.
Therefore, no further selection is necessary for the purpose of improving f (Bk).
However, if Ak itself was not informative enough, then it is crucial to examine the
new features Ak+1 \ Ak, in order to improve the discernibility of the subset. Ideally,
every feature a ∈ Bk should also be checked, and amended (in the sense of being
removed or replaced), as the new features may be more informative and hence may
help to further reduce |Bk|. This step may be skipped for time-critical applications,
at the risk of producing feature subsets that are not globally optimal.
For dynamic FRFS, the properties of a fuzzy-rough reduct Rk can be exploited to
significantly simplify such a dynamic process. If the existing set of features Ak can
already fully discern all of the instances in Xk with respect to their associated classes,
i.e., f^{Xk}_{Ak}(Ak) = 1, and a reduct has been identified, then:

∀k′ > k : f^{Xk}_{Ak}(Rk) = f^{Xk}_{Ak}(Ak) = f^{Xk′}_{Ak′}(Ak′) = 1, Ak ⊆ Ak′    (6.1)

and no further modification to Rk is necessary. However, if full fuzzy-rough depen-
dency for the data set was not achieved in the previous step, i.e., f^{Xk}_{Ak}(Ak) < 1,
it is then crucial to examine the new features. The procedure that handles dynamic
feature addition for FRFS is detailed in Algorithm 6.1.1.
1 if f^{Xk}_{Ak}(Rk) = 1 then
2   return Rk
3 f* ← f^{Xk}_{Ak}(Rk)
4 while f* ≠ f^{Xk+1}_{Ak+1}(Ak+1) do
5   foreach a ∈ Ak+1 \ Ak do
6     if f^{Xk+1}_{Ak+1}(Rk ∪ {a}) > f* then
7       Rk+1 ← Rk ∪ {a}; f* ← f^{Xk+1}_{Ak+1}(Rk ∪ {a})
8   Rk ← Rk+1
9 return Rk+1
Algorithm 6.1.1: Dynamic FRFS for Feature Addition: Ak ⊆ Ak+1, Xk = Xk+1
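The greedy extension step of Algorithm 6.1.1 can be sketched in Python. The `discernibility` function below is a crisp, hypothetical stand-in for the fuzzy-rough dependency degree f(B); the function names and toy data are illustrative only, not part of FRFS itself:

```python
from itertools import combinations

def discernibility(data, labels, subset):
    """Fraction of differently-labelled instance pairs that some feature
    in `subset` can tell apart (crisp stand-in for the dependency degree)."""
    pairs = [(i, j) for i, j in combinations(range(len(data)), 2)
             if labels[i] != labels[j]]
    if not pairs:
        return 1.0
    ok = sum(1 for i, j in pairs
             if any(data[i][a] != data[j][a] for a in subset))
    return ok / len(pairs)

def adapt_to_new_features(reduct, new_features, data, labels, all_features):
    """Algorithm 6.1.1 (sketch): greedily admit newly added features while
    the current subset falls short of the full data set's score."""
    reduct = set(reduct)
    best = discernibility(data, labels, reduct)
    target = discernibility(data, labels, set(all_features))
    if best >= target:          # already a reduct: nothing to do
        return reduct
    for a in new_features:      # single greedy pass over the additions
        if best >= target:
            break
        score = discernibility(data, labels, reduct | {a})
        if score > best:
            reduct, best = reduct | {a}, score
    return reduct
```

A single greedy pass is used here for simplicity; the thesis's while-loop corresponds to repeating this pass until the target score is met.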
6.1.2 Feature Removal
In contrast to the previous scenario, a particular application may be initialised
with an abundance of features, which are subsequently removed throughout the FS
process. In this case, the overall discernibility of a data set itself may deteriorate,
due to informative features being removed. Particularly, if a feature a belonging
to the current candidate feature subset Bk is deleted: a ∈ Ak \ Ak+1, substitution
of feature(s) may become necessary in order to restore the discernibility of this
subset. If the deletion does not affect Bk, i.e., (Ak \ Ak+1)∩ Bk = ;, it means that the
features being removed are also the previously unselected features (for being less
informative or redundant). In such an event, no further adjustment is necessary,
and the candidate feature subset from the previous state Bk may continue to be
used, since no informative features are lost. The procedure for handling the feature
removal scenario for FRFS is given in Algorithm 6.1.2. Here lines 1 and 2 perform
the necessary check which determines whether the current reduct has been affected
by the removal of features, and the recovery process is initiated only if features are
removed from Rk.
1 Rk+1 ← Rk \ (Ak \ Ak+1)
2 if Rk+1 = Rk then
3   return Rk+1
4 f* ← f^{Xk+1}_{Ak+1}(Rk+1)
5 while f* < f^{Xk+1}_{Ak+1}(Ak+1) do
6   foreach a ∈ Ak+1 \ Rk+1 do
7     if f^{Xk+1}_{Ak+1}(Rk+1 ∪ {a}) > f* then
8       Rk+1 ← Rk+1 ∪ {a}; f* ← f^{Xk+1}_{Ak+1}(Rk+1 ∪ {a})
9 return Rk+1
Algorithm 6.1.2: Dynamic FRFS for Feature Removal: Ak+1 ⊆ Ak, Xk = Xk+1
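The feature-removal case can be sketched similarly: the reduct is first purged of the deleted features, and discernibility is then restored greedily from the surviving pool. As before, `score` is a crisp, hypothetical stand-in for f(B):

```python
from itertools import combinations

def score(data, labels, subset):
    """Crisp discernibility, a stand-in for f(B) in [0, 1]."""
    pairs = [(i, j) for i, j in combinations(range(len(data)), 2)
             if labels[i] != labels[j]]
    return (sum(any(data[i][a] != data[j][a] for a in subset)
                for i, j in pairs) / len(pairs)) if pairs else 1.0

def adapt_to_removed_features(reduct, removed, kept, data, labels):
    """Algorithm 6.1.2 (sketch): purge deleted features, then greedily
    restore discernibility from the surviving feature pool."""
    pruned = set(reduct) - set(removed)
    if pruned == set(reduct):        # removal did not touch the reduct
        return pruned
    best = score(data, labels, pruned)
    target = score(data, labels, set(kept))
    for a in sorted(set(kept) - pruned):
        if best >= target:
            break
        s = score(data, labels, pruned | {a})
        if s > best:
            pruned, best = pruned | {a}, s
    return pruned
```

Note the early return mirrors lines 1-3 of the algorithm: when none of the removed features belonged to the reduct, no recovery is needed.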
6.1.3 Instance Addition
Addition of instances (while the features A remain unaltered) is perhaps the most
commonly encountered situation. Monitoring-based applications [228], or proce-
dures that involve streaming data [268] are typical examples of such a case. When
a new batch of instances is added, subset evaluators such as CFS and PCFS may
initiate the necessary correlation or consistency checking only for the unseen objects.
However, techniques similar to FRFS will require a full re-evaluation against the
entire Xk+1, because the addition of objects will inevitably change the partitioning of
the universe Xk/Z . There are exceptional cases where the data set has accumulated a
sufficiently large number of samples with almost full coverage of the underlying con-
cepts to be learned. Any “new” instances are either the same as, or almost equivalent
to the objects already analysed (judged by a certain similarity relation [123]).
Algorithm 6.1.3 details the dynamic FRFS process for the case of instance addi-
tion. In practical applications, the number of new objects may be very small when
compared to the existing pool of instances, |Xk+1 \ Xk| ≪ |Xk|, and the new objects
may be very similar (or identical) to those already collected, when judged by a
certain measure such as one of the fuzzy similarity functions given in Eqns. 2.18 to
2.20. In such scenarios, the number of features required to be further selected (or
replaced) may be minimal, since the existing feature subset can already sufficiently
discern the new objects. Of course, if the new objects are totally unseen objects, then
a large number of modifications is still necessary.
To further improve the efficiency of the algorithm, the newly added objects may
be checked against the current fuzzy-rough lower and upper approximations of
the existing classes, in order to determine whether they can be subsumed by the
already established partitions. If a given new object (or a group of objects) do not
belong to the existing partitions to a satisfactory degree, it is then an indication that
modifications of the lower and upper approximations are necessary. The effect of
instance addition may be more apparent for FRFS. This is because the addition of
objects may change the fuzzy positive regions µPOSR(x), as the universe (Xk, Ak) may
now be different.
1 if f^{Xk+1}_{Ak+1}(Rk) ≥ f^{Xk}_{Ak}(Rk) then
2   return Rk
3 f* ← f^{Xk+1}_{Ak+1}(Rk)
4 while f* < f^{Xk+1}_{Ak+1}(Ak+1) do
5   foreach a ∈ Ak+1 \ Rk do
6     if f^{Xk+1}_{Ak+1}(Rk ∪ {a}) > f* then
7       Rk+1 ← Rk ∪ {a}; f* ← f^{Xk+1}_{Ak+1}(Rk ∪ {a})
8   Rk ← Rk+1
9 return Rk+1
Algorithm 6.1.3: Dynamic FRFS for Instance Addition: Ak = Ak+1, Xk ⊆ Xk+1
6.1.4 Instance Removal
The last dynamic FS scenario considered in this chapter is instance removal. Many
training objects may be available at the beginning of the FS process, but may have to
be removed later, either because the information has become outdated, or simply
because a space limitation has been reached. Since the removal of instances does
not increase the amount of inconsistency within a given data set, this is the simplest
case of dynamic FS: a previously obtained optimal feature subset will maintain its
discernibility. Note that exceptional situations exist where an instance has feature
values at the boundaries of the variable range; its removal may affect the result of
those techniques that rely on fuzzy similarity relations [123], causing the overall
evaluation score to change. It may be possible to further reduce |Bk|, since any
removed instances may relax the constraints (i.e., the amount of inconsistency or the
number of uncorrelated objects present in the data set), and fewer features may be
required to maintain full discernibility. Algorithm 6.1.4 describes an example backward
elimination-based procedure to prune the now-redundant features. This pruning
process may also be applied periodically in the other scenarios, since incrementally
refined feature subsets are susceptible to becoming sub-optimal. For example,
Algorithm 6.1.1 avoids further evaluation and adjustment (ignoring potentially
more informative features) so long as the current subset Rk qualifies as a reduct.
1 Rk+1 ← Rk
2 foreach a ∈ Rk do
3   if f^{Xk+1}_{Ak+1}(Rk+1 \ {a}) = f^{Xk+1}_{Ak+1}(Rk) then
4     Rk+1 ← Rk+1 \ {a}
5 return Rk+1
Algorithm 6.1.4: Dynamic FRFS for Instance Removal: Ak = Ak+1, Xk+1 ⊆ Xk
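The backward-elimination pruning of Algorithm 6.1.4 may be sketched as follows, again with a crisp, hypothetical evaluator standing in for f(B); each feature whose removal leaves the evaluation score unchanged is dropped:

```python
from itertools import combinations

def score(data, labels, subset):
    """Crisp discernibility, a stand-in for f(B) in [0, 1]."""
    pairs = [(i, j) for i, j in combinations(range(len(data)), 2)
             if labels[i] != labels[j]]
    return (sum(any(data[i][a] != data[j][a] for a in subset)
                for i, j in pairs) / len(pairs)) if pairs else 1.0

def prune_redundant(reduct, data, labels):
    """Algorithm 6.1.4 (sketch): drop each feature whose removal leaves
    the evaluation score unchanged on the shrunken instance pool."""
    kept = set(reduct)
    base = score(data, labels, kept)
    for a in sorted(reduct):
        if len(kept) > 1 and score(data, labels, kept - {a}) >= base:
            kept = kept - {a}
    return kept
```

For instance, when two features carry duplicated values, the pass removes one of them while the score is preserved.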
6.2 Dynamic HSFS
For many application problems, it is impractical to assume that only one of the
aforementioned scenarios would occur; rather, a combination of them is to be
expected. Although an approach could be derived by combining strategies tailored to
such individual cases, direct combinations may produce sub-optimal solutions, especially
in terms of subset size. Nature-inspired meta-heuristics, as an alternative, may be
extended to handle dynamic FS problems. Integer-valued HSFS [62] in particular is
structurally simple, delivers excellent FS search performance, and is also
able to iteratively reduce the size of an emerging feature subset. This section extends
the existing HSFS method, as detailed in Chapter 3, to develop a modified, dynamic
HSFS algorithm (D-HSFS) that better addresses the challenges of a changing FS
environment.
6.2.1 Algorithm Description
The D-HSFS algorithm uses three parameters: the harmony memory size |H|, the
number of feature selectors |P|, and a harmony memory considering rate δ, which
encourages a feature selector pi to randomly choose from all available features A
(instead of from within its own note domain ℵi). The maximum number of iterations
conventionally employed in HS is not required in this implementation, as the process
is expected to continue operating throughout the whole dynamic process. The
representation of a dynamic feature subset (harmony) is the same as that employed
by standard HSFS, given in Table 3.3. For simplicity, the explicit encoding/decoding
process between a given harmony Hj and its associated feature subset BHj is omitted
in the following explanation.
The overall operation of the proposed D-HSFS algorithm is illustrated in Fig. 6.1
and outlined in Algorithms 6.2.1 and 6.2.2. A generic feature subset evaluator with
a score range of f (B) ∈ [0, 1] is used herein to ensure generality of the explanation.
Figure 6.1: Procedures of D-HSFS
1. Initialise Harmony Memory
Set the initial values for the parameters |H|, |P|, and δ as with the application
of conventional HS. A harmony memory containing |H| randomly generated
subsets is then initialised. This also provides each feature selector a note
domain ℵ of |H| features, which may include identical choices, or nulls (−).
2. Adapt to Change
By default, the internal stochastic mechanisms of HS (especially the δ acti-
vation) are potentially capable of exploring a dynamically changing solution
domain, discovering better solutions over time, without excessive human inter-
vention. To support this, the pool of the originally available features is now
kept up-to-date at all times. Since the pool is the variable domain shared by the
feature selectors, any updates made are automatically propagated to all the se-
lectors. Also, the subset evaluator is updated instantaneously with (Ak+1, Xk+1).
1  pi ∈ P, i = 1 to |P|: group of musicians
2  Hj ∈ H, j = 1 to |H|: harmony memory
3  ℵi = ∪_{j=1..|H|} H^j_i: note domain of pi
4  δ: harmony memory considering rate
5  Hnew: emerging harmony
6  BH: translated feature subset from H
7  f(BH): feature subset evaluator for BH
// Initialise harmony memory
8  for j = 1 to |H| do
9    Hnew = ∅
10   for i = 1 to |P| do
11     random ar, ar ∈ A ∪ {−}
12     Hnew = Hnew ∪ {ar}
13   H = H ∪ {Hnew}
// Iterate
14 while changing do
   // Subroutine for adapting to change
15   adapt(Ak+1, Xk+1)
   // Improvise new harmony
16   Hnew = ∅
17   for i = 1 to |P| do
18     random rδ, 0 ≤ rδ ≤ 1
19     if rδ < δ then
20       random ar, ar ∈ A ∪ {−}
21       Hnew = Hnew ∪ {ar}
22     else
23       random ar, ar ∈ ℵi
24       Hnew = Hnew ∪ {ar}
   // Update harmony memory
25   if f(BHnew) ≥ min(f(BH) | H ∈ H) then
26     H = H ∪ {Hnew}
27     H = H \ argmin_{H∈H} f(BH)
Algorithm 6.2.1: Pseudocode of D-HSFS
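A minimal Python sketch of this loop is given below. The evaluator f and the parameter values are placeholders; the null note "−" is represented by None, and the note domains are derived column-wise from the memory rather than stored separately:

```python
import random

NULL = None  # the "-" (discard) note

class DHSFS:
    """Sketch of the D-HSFS loop: |P| feature selectors, a harmony
    memory of |H| candidate subsets, and the considering rate delta."""

    def __init__(self, features, evaluate, mem_size=10, delta=0.3, seed=0):
        self.rng = random.Random(seed)
        self.features = list(features)   # shared variable domain A
        self.evaluate = evaluate         # f(B), a score in [0, 1]
        self.n_selectors = len(self.features)
        self.delta = delta
        # initialise the memory with random harmonies
        self.memory = [[self.rng.choice(self.features + [NULL])
                        for _ in range(self.n_selectors)]
                       for _ in range(mem_size)]

    def subset(self, harmony):
        """Decode a harmony into the feature subset it represents."""
        return frozenset(a for a in harmony if a is not None)

    def improvise(self):
        new = []
        for i in range(self.n_selectors):
            if self.rng.random() < self.delta:   # delta activation
                new.append(self.rng.choice(self.features + [NULL]))
            else:                                # pick from the note domain
                domain = [h[i] for h in self.memory]
                new.append(self.rng.choice(domain))
        return new

    def step(self):
        """One improvisation-update iteration."""
        new = self.improvise()
        worst = min(self.memory, key=lambda h: self.evaluate(self.subset(h)))
        if self.evaluate(self.subset(new)) >= self.evaluate(self.subset(worst)):
            self.memory.remove(worst)
            self.memory.append(new)

    def best(self):
        return self.subset(max(self.memory,
                               key=lambda h: self.evaluate(self.subset(h))))
```

The `adapt` sub-routine (feature-pool updates and re-evaluation) is deliberately omitted here, mirroring the separation into Algorithm 6.2.2.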
To achieve this, it may be necessary to re-train the internal components of
certain evaluators (e.g., when FRFS is used).
Note that for the event of feature addition, the new features will be explored
in due time via δ activation and introduced into the harmony memory, if they
provide an improvement to the quality of the feature subset. It is intended not
to force any feature selectors to try the new features immediately, as they may
in fact be irrelevant, or less important than the existing features. However,
such a mechanism may be implemented for more time critical applications,
where a given dynamic feature subset must be refined within a limited amount
of time.
Following the update to the variable domains, the harmony memory is then
re-evaluated using the updated evaluator. This is the most crucial step, ensuring
that all the stored fitness values reflect the new changes, and that the harmonies are
appropriately ranked for possible future updates. Fortunately, |H| is typically
a small number and hence, this process is generally not expensive. The total
number of feature selectors |P| may also be expanded or shrunk according
to the current size of the feature pool. After this, HS may resume its normal,
iterative operation and continue to improvise new solutions.
The only exception to the above procedure is regarding the scenario of feature
removal. Before initiating the re-evaluation, the deleted features must be
purged from all of the subsets stored in the harmony memory, and from the
note domains of all affected feature selectors. Algorithm 6.2.2 summarises this
adaptation sub-routine.
1 Update the evaluator f(B) with (Ak+1, Xk+1)
// Invalidate any outdated features
2 if Ak \ Ak+1 ≠ ∅ then
3   foreach Hj ∈ H do
4     Hj = Hj \ (Ak \ Ak+1)
5   for i = 1 to |P| do
6     ℵi = ℵi \ (Ak \ Ak+1)
// Re-adjust musician group size
7 |P|k+1 = max( |P|k · |Ak+1| / |Ak| , max_{Hj∈H} |Hj| )
// Re-evaluate all feature subsets
8 foreach Hj ∈ H do
9   f(BHj)
Algorithm 6.2.2: Sub-routine adapt(Ak+1, Xk+1)
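The invalidation and re-sizing steps of this sub-routine can be sketched as a standalone function. The memory layout (a list of note lists) and the rounding rule for |P| are illustrative assumptions; note domains need no explicit purge here, since they are derived column-wise from the memory:

```python
def adapt_memory(memory, old_features, new_features):
    """Sketch of Algorithm 6.2.2's invalidation step: notes referring to
    deleted features become null (None), and the selector count |P| is
    re-scaled with the feature pool, bounded below by the largest
    surviving subset."""
    removed = set(old_features) - set(new_features)
    purged = [[None if a in removed else a for a in h] for h in memory]
    n_selectors = max(
        round(len(purged[0]) * len(new_features) / len(old_features)),
        max(sum(a is not None for a in h) for h in purged))
    return purged, n_selectors
```

Re-evaluating every stored subset with the updated evaluator (lines 8-9) would then follow, which is cheap given that |H| is small.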
3. Improvise New Subset
Each pi nominates a feature a ∈ ℵi, and all such nominated features form a
new harmony Hnew. The corresponding new feature subset BHnew, decoded by
following the scheme already illustrated in Table 3.3, then has its evaluation score
computed by f (BHnew).
4. Update Subset Storage
If the newly obtained subset achieves a higher evaluation score than that of
the worst subset in the harmony memory, or it has an equal evaluation but
is of a smaller size, then this new subset replaces the existing worst subset.
Otherwise, it is discarded.
5. Iterate
This adaptation-improvisation-update process continues to run so long as the
data set is still in a dynamic state. The best harmony in the harmony memory
for a given state of the changing data set (Ak, Xk): H = argmaxH∈H f (BH) and
its associated feature subset BH is therefore dynamically, and continuously
refined.
Example 6.2.1.
Suppose that the emerging harmony is Hnew = ⟨a1, a4, a3, a3, a7, a−⟩, and that the
subset it represents, BHnew = {a1, a3, a4, a7}, has an evaluation score of f (BHnew) = 0.6,
while the existing worst subset Hworst ∈ H is ⟨a1, a2, a2, a3, a6, a−⟩ with f (BHworst) =
0.5, where BHworst = {a1, a2, a3, a6}. Then, the updated harmony memory is H =
H ∪ {Hnew} \ {Hworst}. If, for instance, a new feature a7 is introduced to the harmony
memory via this update for future combinations, then its associated feature selector
p5 also adds this new feature to its note domain: ℵ5 = ℵ5 ∪ {a7}. At the beginning
of the next iteration, assuming that features a1 and a3 are removed, the new harmony
obtained in the last iteration will need to be modified from ⟨a1, a4, a3, a3, a7, a−⟩ to
⟨a−, a4, a−, a−, a7, a−⟩, and the same invalidation process is applied to all H ∈ H.
The evaluation scores of the respective feature subsets are also computed again
before improvising a new solution.
6.2.2 Complexity Analysis
Following the style of analysis adopted for the original HSFS [62], the proposed D-HSFS
method requires O(|P| · |H|) operations to randomly fill the subset storage, where
|P| is the number of feature selectors, and |H| is the size of the harmony memory.
The continuous improvisation process, between two dynamic states (Xk, Ak) and
(Xk+1, Ak+1), is of the order O(|P| · (gk+1 − gk)) · Oe, where gk and gk+1 denote the
numbers of iterations at the respective states, and Oe signifies the complexity of a single
feature subset evaluation for the employed feature subset evaluator.
The adaptation process is of the order:

Ot + O(|P| · |H| · |Ak \ Ak+1|) + |H| · Oe ,  if Ak \ Ak+1 ≠ ∅
Ot + |H| · Oe ,                               otherwise          (6.2)
where O(|P| · |H| · |Ak \Ak+1|) reflects the cost of invalidating existing feature subsets
and note domains, in the event of a feature removal. Ot denotes the cost of re-training
the feature subset evaluator using (Xk+1, Ak+1). In typical cases, |H| is a small value
5≤ |H| ≤ 20 [84], and |P| is bounded by the total number of features. Thus, both
the improvisation and adaptation costs are reasonably low.
6.3 Adaptive Feature Subset Ensemble
For a given data set of significant complexity, a family B of high-quality (though not
always equally optimal) feature subsets may be discovered by the use of a stochastic
search algorithm. Any such feature subset B ∈ B may be used to train a subsequent classifier
learner, and a diverse FSE may be constructed, which generally has a better prediction
accuracy than that of a single classifier. In a dynamically changing environment, a
collection of such feature subsets Bk may be adaptively refined in response to the
current state of the data set (Ak, Xk). Similarly, an FSE built upon Bk also needs to be
updated accordingly. The resulting process leads to the establishment of an adaptive
FSE (A-FSE).
A generic framework for such an A-FSE is illustrated in Fig. 6.2, where each
column of components forms a dynamic FS subsystem l, l ∈ {1, . . . , |Bk|}, containing
an adaptive classifier C^l_k that is built using a dynamic feature subset B^l_k ∈ Bk. Jointly,
the |Bk| subsystems construct an adaptive ensemble of classifiers Ak = {C^l_k | l =
1, · · · , |Bk|}, in which different components, including subset evaluators, subset search
algorithms, and base classifier learners, may be employed independently. Generally
speaking, a system implementing such a framework will naturally possess the diversity
inherent in its various components. However, each type of component may be
implemented using the same algorithm in order to achieve a higher efficiency and a
lower complexity for the overall system. Any changes to the data are propagated, in
an iterative fashion, throughout the subsystems, down to the end ensemble.
Figure 6.2: Generic framework for A-FSE
The following presents an implementation of the A-FSE framework using the
proposed D-HSFS algorithm, which supports the use of any feature subset evaluation
method, such as CFS, PCFS, or FRFS. Despite being adaptive, A-FSE is, in principle,
similar to a standard FSE [195, 197, 295], and any ensemble aggregation method,
such as majority voting [249], may be employed. The steps of the implementation are
outlined in Algorithm 6.3.1.
1 while changing do
2   for l = 1 to |B| do
    // Subsystem l
3     B^l_{k+1} = D-HSFS_l(Ak+1, Xk+1)
4     if B^l_{k+1} ≠ B^l_k ∨ Xk+1 ≠ Xk then
5       Re-train C^l_k with (B^l_{k+1}, Xk+1)
  // Aggregate ensemble predictions
6 majority vote(B, xnew)
Algorithm 6.3.1: A-FSE implemented using D-HSFS
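A sketch of this loop in Python is shown below. Here `selectors` and `train` are assumed user-supplied callables (one dynamic FS instance per subsystem, and a classifier learner, respectively); re-training is triggered only when a member's subset or the training data changes:

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate one prediction per ensemble member into a single label."""
    return Counter(predictions).most_common(1)[0][0]

class AFSE:
    """Sketch of Algorithm 6.3.1: one dynamic FS subsystem per
    ensemble member, with lazily re-trained base classifiers."""

    def __init__(self, selectors, train):
        self.selectors = selectors        # l-th call yields subset B_l
        self.train = train                # train(subset, data) -> classifier
        self.subsets = [None] * len(selectors)
        self.models = [None] * len(selectors)

    def adapt(self, data):
        """Propagate a data change through every subsystem."""
        for l, select in enumerate(self.selectors):
            subset = select(data)
            if subset != self.subsets[l] or self.models[l] is None:
                self.subsets[l] = subset
                self.models[l] = self.train(subset, data)

    def classify(self, x):
        return majority_vote(m(x) for m in self.models)
```

The on-demand re-training variant discussed next would simply defer `adapt` until a test object arrives.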
It is important to note that an instance of D-HSFS is necessary for each dynamic
FS subsystem, and the feature subset refinement is a continuous process independent
of the subsequent classifiers. Although the learners should be re-trained each time
the associated subsets are modified, or the training objects are changed, the
efficiency of the ensemble may be further improved by on-demand re-training, i.e.,
only proceeding when a new test object is presented for classification.
The complexity of an A-FSE implemented using D-HSFS is therefore O_{D-HSFS} · |B| +
K · |B| · O_C, where O_{D-HSFS} is the complexity of a D-HSFS component, as analysed
in Section 6.2.2, K is the total number of potential state changes in a given dynamic
system, and O_C is the cost related to a single classifier employed by the ensemble. Of
course, if multiple, different classification algorithms are involved in the construction
of the A-FSE, then the classification-related costs are the sum of those of the individual
base classifiers: K · Σ_{l=1}^{|B|} O_{C_l}.
6.4 Experimentation and Discussion
The present investigation employs five real-valued UCI [78] benchmark data sets,
for the purpose of simulating a dynamically changing FS environment, which is
suitable for the demonstration of the efficacy of the proposed approach. Table 6.1
provides a summary of these data sets, all of which are of high dimensionality and
contain a large number of objects, thereby presenting significant challenges to FS.
Two commonly used subset-based feature evaluators, CFS and PCFS, are used in
the experiments. CFS is a lightweight method, which addresses the problem
of FS through a correlation-based analysis, identifying features that are highly
correlated with the class, yet uncorrelated with each other [93]. PCFS is an FS
approach that attempts to identify a group of features yielding the least inconsistency
[52], thereby removing irrelevant features in the process.
Table 6.1: Summary of the data sets
Data set  Features  Instances  Classes  C4.5   PART   NB
arrhy     280       452        16       65.97  66.36  61.40
handw     257       1593       10       75.74  78.09  86.21
multi     650       2000       10       94.54  94.95  95.30
ozone     73        2534       2        92.70  92.50  67.66
secom     591       1567       2        89.56  91.57  30.04
6.4.1 Results for Basic Dynamic FS Scenarios
The four basic dynamic FS events are individually tested here, in order to validate
the efficacy of the proposed A-FSE method and of the D-HSFS algorithm. In the
experimentation, features or objects are added or removed (according to the scenario)
randomly in batches. After a change has been made, the D-HSFS algorithm adapts
to the new data, and improves the previously selected feature subset Bk, in order
to produce a new candidate subset Bk+1. A collection of 20 candidate subsets is
simultaneously improved (each using a separate instance of D-HSFS) with respect
to the same training data, and the prediction accuracy of the resultant A-FSE (of size
20) is examined using dedicated test data (10% held out from the original data set).
The classification algorithm employed here is the decision tree-based C4.5 algorithm
[264], which first constructs a full decision tree using all available features, and then
performs heuristic pruning based on the statistical importance of the features.
6.4.1.1 Feature Addition
Table 6.2 details the results of the proposed A-FSE approach for the event of feature
addition, simulated using multi and secom, the two available data sets with the
largest total numbers of features. It is evident that the evaluation scores steadily
improve as new features are being added to the dynamic data sets, which agrees with
the analysis made in Section 6.1.1. The addition of features generally refines the
knowledge of the underlying problem, and the dynamic FS component supporting
the classifiers: i.e., the proposed D-HSFS algorithm, also successfully improves the
qualities of the candidate subsets.
Note that for the multi data set, a feature subset of full evaluation score (for
PCFS) is identified at the beginning (having just 172 features). Since no further
improvement to the score can be made, D-HSFS optimises the candidate solutions
via size reduction, by substituting in possibly more informative features that are
introduced during the feature addition events. The feature subsets found using
PCFS are much smaller than those identified with CFS for multi (with an averaged
size of 35.1 vs. 91.3), while CFS achieves more significant size reduction for
secom (averaged size 30.7 vs. 49.9). The A-FSE built using feature subsets suggested by
PCFS achieves better accuracies for the multi data set (averaged accuracy 97.23%
vs. 93.91%). It is also marginally more stable for secom (for which the CFS-based
A-FSE delivers the same level of accuracy).
Table 6.2: Feature addition results, showing the sizes and evaluation scores of
the dynamic feature subsets maintained by D-HSFS, and the corresponding A-FSE
accuracies, using the feature subset evaluators CFS and PCFS. Shaded cells indicate
statistically better results (paired t-test)

                multi                                           secom
        CFS                PCFS                        CFS                PCFS
|Ak|  Score Size  C4.5%  Score Size C4.5%    |Ak|  Score Size C4.5%  Score Size C4.5%
172   0.811 76.8  94.00  1.000 40.2 97.00    155   0.004 34.0 93.63  0.948 45.3 92.99
210   0.816 81.2  92.50  1.000 38.3 98.00    198   0.006 28.8 93.63  0.948 38.7 93.63
250   0.820 85.5  94.50  1.000 36.8 94.50    240   0.007 27.9 92.99  0.951 45.2 93.63
289   0.834 92.4  93.50  1.000 34.7 98.50    283   0.009 30.5 94.27  0.961 51.0 93.63
342   0.842 94.9  94.00  1.000 33.7 97.00    327   0.011 30.6 94.27  0.964 51.7 94.27
394   0.850 97.5  93.50  1.000 32.6 97.00    356   0.013 29.9 93.63  0.968 53.7 93.63
453   0.862 98.4  95.00  1.000 32.1 96.50    395   0.014 29.2 94.27  0.970 52.3 92.99
504   0.874 99.8  94.00  1.000 31.5 97.50    444   0.015 30.6 93.63  0.974 51.6 93.63
588   0.881 101.0 94.50  1.000 30.7 97.50    536   0.016 30.2 94.27  0.975 53.7 93.63
637   0.892 103.4 96.50  1.000 30.3 97.50    579   0.016 29.7 93.63  0.979 63.9 93.63
Mean  0.843 91.3  93.91  1.000 35.1 97.23    Mean  0.010 30.7 93.80  0.961 49.9 93.57
S.D.  0.032 10.4  1.39   0.000 4.8  1.10     S.D.  0.005 2.4  0.41   0.014 6.8  0.34
6.4.1.2 Feature Removal
Following the discussion made in Section 6.1.2 regarding feature removal, the
effectiveness of the proposed approach for this particular scenario is reported in
Table 6.3. The events are simulated by randomly removing batches of features from
the data sets, which are initialised with full sets of features. The feature subset
evaluation scores decrease, as expected, as features are removed from the
system, except for PCFS, which again maintains a constant evaluation score of 1.000
throughout the whole experimentation (for multi). This may indicate that more
robust features have been identified by PCFS for this particular data set, leading to
candidate solutions that are more resilient to the changes.
The overall A-FSE accuracies are reasonably well preserved throughout the series
of feature removals, and have shown improvements for the difficult data set secom.
It is worth noting that, although having very similar features, the terminating states
of the previous set of experiments (with almost all features added) yield better
quality feature subsets, than those obtained here during the initial stages (before
any feature has been removed). A possible explanation is that D-HSFS previously
had a far longer time (search iterations) to perform optimisation, and the gradual
discovery or exploration of new features may also help to form compact and good
Table 6.3: Feature removal results, showing the sizes and evaluation scores of the dynamic feature subsets maintained by D-HSFS, and the corresponding A-FSE accuracies, using feature subset evaluators CFS and PCFS. Shaded cells indicate statistically better results (using paired t-test).
multi secom
CFS PCFS CFS PCFS
|Ak| Score Size C4.5% Score Size C4.5% |Ak| Score Size C4.5% Score Size C4.5%
604 0.854 340.0 96.50 1.000 297.7 98.00 543 0.002 269.6 92.99 0.979 275.4 92.99
542 0.853 303.4 97.00 1.000 264.3 98.50 500 0.002 244.2 93.63 0.979 250.4 92.99
490 0.852 275.7 97.00 1.000 236.0 97.50 456 0.002 220.8 92.99 0.979 229.0 92.99
434 0.851 245.6 97.00 1.000 206.9 98.00 410 0.002 195.8 92.99 0.969 207.5 92.99
379 0.849 215.1 97.00 1.000 179.7 98.50 356 0.002 167.9 92.99 0.967 183.1 92.99
312 0.836 178.6 98.50 1.000 144.7 98.00 287 0.002 133.2 92.36 0.963 157.3 91.72
283 0.829 163.9 97.50 1.000 130.3 98.00 251 0.001 112.9 93.63 0.944 136.4 92.99
252 0.827 142.0 97.50 1.000 115.4 98.00 221 0.001 97.6 93.63 0.943 121.1 93.63
224 0.824 128.4 97.50 1.000 101.4 97.50 186 0.001 80.9 94.27 0.940 102.7 93.63
183 0.816 107.8 96.00 1.000 81.5 97.50 151 0.001 64.6 94.27 0.940 85.9 93.63
130 0.784 94.1 93.00 1.000 56.0 99.00 118 0.002 49.6 94.27 0.938 66.9 94.27
Mean 0.835 213.5 96.71 1.000 178.5 98.08 Mean 0.002 161.6 93.37 0.960 176.9 93.21
S.D. 0.021 92.6 1.36 0.000 88.5 0.47 S.D. 0.000 84.7 0.69 0.018 78.0 0.63
quality feature subsets. This may inspire further adjustments to the original HSFS
algorithm [62], allowing it to better strategise its exploration of the solution space.
6.4.1.3 Instance Addition
Table 6.4 lists the results of the proposed A-FSE approach for the events involving
instance addition, where the corresponding theoretical analysis can be found in
Section 6.1.3. This set of experiments is simulated using multi and ozone, which
have the two largest total numbers of instances. Similar to the scenario of feature
addition, initially a small collection of instances is available for training, and new
samples are introduced to the system over time. Since the total number of features
remains unchanged in this scenario for both tested data sets, there is far less
variation in the sizes of the selected feature subsets (when compared to the previous
feature-based dynamic events). The addition of instances gradually expands the underlying
concept embedded in the data sets, and features selected at early stages need to
be altered in order to continue to fully capture the constantly refined information.
Although considerable numbers (±100) of new objects are added per change,
the accuracies of the A-FSEs are well maintained. This shows the effectiveness of the
D-HSFS components optimising the underlying dynamic feature subsets. For both of
the tested data sets, better ensemble performance is achieved by PCFS-based A-FSE,
with higher averaged accuracy and lower mean feature subset size.
Table 6.4: Instance addition results, showing the sizes and evaluation scores of the dynamic feature subsets maintained by D-HSFS, and the corresponding A-FSE accuracies, using feature subset evaluators CFS and PCFS. Shaded cells indicate statistically better results (using paired t-test).
multi ozone
CFS PCFS CFS PCFS
|Xk| Score Size C4.5% Score Size C4.5% |Xk| Score Size C4.5% Score Size C4.5%
360 0.635 371.0 93.50 1.000 331.1 95.00 456 0.185 31.7 90.16 0.993 22.8 91.73
446 0.771 371.2 94.00 1.000 320.7 96.50 626 0.182 28.2 92.13 0.990 21.5 92.52
554 0.826 377.0 93.50 1.000 314.9 94.00 791 0.169 28.7 92.13 0.999 23.2 91.34
654 0.833 373.5 94.00 1.000 312.4 96.00 950 0.146 28.7 91.73 0.999 21.6 92.13
790 0.846 367.3 94.50 1.000 308.6 96.50 1107 0.143 27.5 91.73 1.000 23.7 92.13
851 0.848 369.3 93.00 1.000 307.6 96.00 1168 0.138 27.3 90.55 0.999 21.6 92.91
991 0.850 364.6 94.50 1.000 306.0 96.50 1328 0.130 27.5 91.73 0.998 24.2 92.91
1171 0.857 361.8 96.50 1.000 305.3 95.50 1470 0.126 26.4 91.73 0.999 23.1 91.73
1273 0.856 361.9 95.50 1.000 303.4 96.00 1542 0.117 25.7 92.52 0.998 23.6 92.91
1416 0.852 357.8 95.50 1.000 302.6 97.00 1762 0.114 25.7 90.94 1.000 27.8 92.91
1606 0.857 354.8 94.50 1.000 301.8 98.00 2105 0.113 25.6 91.34 1.000 26.7 92.52
1737 0.860 354.8 95.50 1.000 301.2 97.00 2194 0.108 25.7 91.73 1.000 24.8 92.91
Mean 0.824 365.4 94.54 1.000 309.6 96.17 Mean 0.139 27.4 91.54 0.998 23.7 92.39
S.D. 0.064 7.3 1.03 0.000 8.9 1.03 S.D. 0.027 1.8 0.68 0.003 2.0 0.57
6.4.1.4 Instance Removal
The results of the proposed A-FSE approach for the situation involving instance
removal are presented in Table 6.5, for the multi and ozone data sets. Most of the
observations are similar to those in the earlier scenarios, such as the slightly better
overall performance achieved by PCFS-based ensembles, especially for multi (96.36
vs. 93.68). Note that the relaxed constraints due to instance removals (see Section 6.1.4)
do not necessarily correlate to better evaluation scores, especially when judged by
CFS. This is reflected by the completely opposite trends for the two data sets, where
the evaluation scores are decreasing as instances are being removed for multi, and
the scores instead improve over time for ozone. This may be explained by the fact
that the correlation analysis done by CFS is merely with respect to the current state of
the data, and two evaluation scores are not directly comparable when their underlying
training instances are different. It may be beneficial, especially for algorithms such
as D-HSFS that optimise solutions continuously, to devise a (dynamic) feature subset
evaluation method that provides an ordered quality metric, so that solution qualities
at different dynamic states may be directly compared.
Table 6.5: Instance removal results, showing the sizes and evaluation scores of the dynamic feature subsets maintained by D-HSFS, and the corresponding A-FSE accuracies, using feature subset evaluators CFS and PCFS. Shaded cells indicate statistically better results (using paired t-test).
multi ozone
CFS PCFS CFS PCFS
|Xk| Score Size C4.5% Score Size C4.5% |Xk| Score Size C4.5% Score Size C4.5%
1658 0.855 361.8 96.00 1.000 320.4 98.00 2121 0.110 27.6 91.34 1.000 24.0 93.31
1483 0.855 360.0 95.00 1.000 315.6 95.50 1956 0.115 26.7 92.13 0.999 23.8 90.94
1314 0.857 353.7 94.50 1.000 312.5 98.00 1710 0.123 26.4 92.13 0.998 23.7 92.13
1135 0.854 357.5 95.50 1.000 309.9 97.50 1538 0.132 27.3 93.70 0.999 20.7 92.13
979 0.842 367.0 96.00 1.000 307.8 96.50 1350 0.124 28.0 91.34 0.999 21.9 90.55
795 0.834 367.8 94.00 1.000 306.3 95.00 1145 0.124 29.5 90.55 0.998 25.5 90.55
650 0.832 375.4 92.00 1.000 304.2 95.50 947 0.129 27.6 90.94 0.998 24.3 90.94
562 0.825 381.3 92.50 1.000 303.9 95.50 787 0.144 28.2 90.55 0.999 22.5 91.34
481 0.783 373.4 91.50 1.000 303.6 97.50 685 0.136 28.1 90.16 0.997 23.6 90.94
359 0.768 373.7 87.50 1.000 301.8 93.50 455 0.159 26.3 88.98 1.000 18.7 90.94
Mean 0.832 367.4 93.68 1.000 310.6 96.36 Mean 0.128 27.8 91.23 0.999 23.1 91.37
S.D. 0.030 8.4 2.63 0.000 8.7 1.47 S.D. 0.015 1.2 1.23 0.001 2.1 0.83
6.4.2 Results for Combined Dynamic FS Scenarios
In this set of experiments, a set of randomly chosen features or instances may be
added or removed in any combination. After a change has been made, the previously
selected feature subset Bk is first evaluated against the modified data with its quality
recorded. The D-HSFS algorithm then adapts to the new data, and continues to
refine the solution, so that a new candidate subset Bk+1 is produced. A collection of
20 candidate subsets is simultaneously improved (each using a separate instance
of D-HSFS) with respect to the same training data, and the resultant A-FSE of size 20 is
examined for the subsequent prediction accuracy analysis.
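The per-change cycle described above can be sketched as follows. This is a hypothetical outline, not the thesis implementation: `evaluate` stands in for the subset evaluator (CFS or PCFS) and `refine` for a D-HSFS search instance; both names and the toy stand-ins below are assumptions for illustration.

```python
# One adaptation step: record each B_k's quality on the changed data,
# then let its own D-HSFS instance refine it into B_k+1.
def dynamic_step(members, changed_data, evaluate, refine):
    adapted = []
    for B_k in members:
        quality_before = evaluate(B_k, changed_data)   # B_k judged on new data
        B_k1 = refine(B_k, changed_data)               # adapted candidate subset
        adapted.append((B_k1, quality_before, evaluate(B_k1, changed_data)))
    return adapted

# Toy stand-ins: score = fraction of currently relevant features retained;
# refinement greedily adds one missing feature.
evaluate = lambda B, data: len(B & data) / len(data)
refine   = lambda B, data: B | {min(data - B)} if data - B else B

members = [{"a1"}, {"a2", "a3"}]
changed = {"a1", "a2", "a3", "a4"}
print(sorted(dynamic_step(members, changed, evaluate, refine)[0][0]))  # ['a1', 'a2']
```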
The base classification algorithms employed for constructing the ensembles in
the experiments include: 1) the previously used, decision tree-based C4.5 algorithm
[264]; 2) PART [33], a partial decision tree algorithm, which does not need to
perform global pruning like C4.5 in order to produce appropriate rules; and 3)
the probabilistic Bayesian classifier with strong (naïve) independence assumptions
(NB) [132]. FS is particularly beneficial for C4.5 and PART since the reduced feature
subsets remove much of their initial training overheads.
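Once the 20 member classifiers are trained, the A-FSE combines their predictions. The aggregation rule below (simple majority voting) is an assumption for illustration; the thesis's exact combination scheme may differ.

```python
from collections import Counter

# A minimal sketch of A-FSE label aggregation by majority vote. Each member
# classifier is trained on one of the dynamically refined feature subsets.
def afse_predict(member_predictions):
    """Return the most common label among the ensemble members' predictions."""
    return Counter(member_predictions).most_common(1)[0][0]

print(afse_predict(["yes", "no", "yes", "yes", "no"]))  # yes
```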
6.4.2.1 Quality of Selected Subsets
Fig. 6.3 illustrates the dynamic FS results, detailing the evaluation scores (dotted
lines) and the sizes (solid lines), both averaged over 20 adaptively refined subsets
throughout a simulated dynamic process. The shaded columns indicate the per-
formance of Bk with regards to (Ak+1, Xk+1), while the non-highlighted columns
show the quality of the adaptively refined subsets Bk+1. Note that the selection
processes are carried out independently for each of the two subset evaluators, while
the underlying dynamic training data sets are the same to facilitate comparison.
It can be observed from the figure that the averaged f (Bk), Bk ∈ Bk, tends to
decrease by a large margin when features are deleted and/or a large number of
objects are added. The two most obvious reductions are at |Ak|= 101 (arrhy) and
|Ak| = 97 (handw), where 43 and 41 features are removed, respectively. This confirms
the theoretical assumptions made in Section 6.1: feature deletion and object addition
are the two major causes of rapid changes to the underlying concept, which are also
more challenging for a given dynamic FS algorithm to recover from. The results also
demonstrate that in response to the changes, the proposed D-HSFS algorithm can
successfully locate alternative informative features, thereby dynamically restoring or
improving the quality of the candidate subsets.
According to this systematic experimentation, a larger size variation can be
observed for the subsets found via CFS. The most noticeable performance difference
between the use of the two different feature evaluators is for the multi data set.
Over 20 additional features are consistently selected by CFS when attempting to adapt
to the changed data, whilst considerably fewer adjustments (< 10) are made by PCFS.
This indicates that the PCFS evaluator can identify more resilient and robust features.
For this particular data set, CFS also selects larger subsets in general, when compared
to those obtained by PCFS. Given that a significant amount of information is updated,
the sizes of the resultant subsets are maintained at a reasonable level throughout
the whole simulation. This is largely attributable to the beneficial characteristics of the
HS algorithm itself, especially its ability to escape from locally optimal solutions.
6.4.2.2 Ensemble Accuracy
Before running the dynamic simulation, 10% of the original objects are again retained
for testing. The classification results of the ensembles trained using the adaptively
refined feature subsets are summarised in Table 6.6. The averaged accuracies of the
Figure 6.3: Results of dynamic FS, showing the averaged sizes and evaluation scores over 20 dynamically selected feature subsets for each data set, plotted against the number of features and that of objects. Each of the shaded columns and its immediate right neighbour (in white) correspond to one dynamic FS event, indicating the quality of a previously selected subset and that of a dynamically refined feature subset, respectively.
individual ensemble members are also given, indicating the quality of the underlying
dynamic subsets. Additionally, the performance of classifiers trained using the full
set of available features is also provided, signifying a base-line accuracy of these
dynamic data sets.
Table 6.6: A-FSE accuracy comparison; bold figures indicate the best result per classifier, shaded cells signify overall best results.
A-FSE by CFS Single by CFS A-FSE by PCFS Single by PCFS Base Accuracy
C4.5 PART NB C4.5 PART NB C4.5 PART NB C4.5 PART NB C4.5 PART NB
arrhy 57.85 60.52 57.22 57.47 55.84 50.27 59.54 61.71 60.73 55.71 55.30 50.54 56.66 55.43 52.31
handw 67.08 74.98 77.07 62.93 65.74 75.87 78.19 77.94 72.18 58.05 59.49 64.26 62.85 65.66 77.30
multi 94.90 96.05 96.23 92.91 93.34 94.81 97.66 97.52 95.21 87.28 86.22 90.44 93.31 93.38 94.00
ozone 91.47 91.30 76.87 91.44 90.87 67.35 91.59 91.62 73.98 91.19 91.22 66.44 91.24 91.04 66.24
secom 93.55 93.63 71.44 89.97 92.12 61.58 93.43 93.63 54.47 89.89 91.76 60.63 89.45 92.99 59.00
The results show that the classification accuracies of the resultant A-FSEs are
improved for almost all of the data sets. The differences are most noticeable for the
handw data set, where the A-FSE built upon subsets selected by the PCFS evaluator
outperforms the base-line by over 15% (for C4.5). However, the individual
ensemble members are a lot less accurate, achieving an averaged accuracy of 58.05%
(for C4.5). This confirms the benefit of employing an ensemble-based approach,
which for this instance, has substantially improved (by over 20%) over individually
deployed classifiers. Note that for the secom data set, although both C4.5 and
PART-based ensembles deliver very similar classification accuracies, the NB-based
ensemble performs almost 17% better in accuracy with the CFS evaluator. This
reflects that the underlying feature subsets selected by the two evaluators are rather
different. Generally speaking, the PART classification algorithm works best with the
subsets selected by PCFS, delivering the highest overall accuracy for 3 of the 5 data sets.
The C4.5 classifier also performs very well together with PCFS. In contrast, the NB-based
ensembles are improved the most by subsets found using CFS.
6.5 Summary
This chapter has presented a dynamic extension to the HSFS algorithm, in an attempt
to address the challenges posed by dynamically changing data. The stochastic
properties of HSFS enable multiple high quality dynamic feature subsets to be
maintained. Having a collection of such subsets allows an adaptive feature subset-
based classifier ensemble to be constructed. According to the experimental results,
the proposed D-HSFS algorithm successfully adapts to the dynamically changing
data. Importantly, the A-FSE constructed on the basis of dynamically refined subsets
also demonstrates improved classification performance, when compared to that of
single algorithm-based methods, as well as the base-line accuracies achieved using
the full (unreduced) sets of available features.
Dynamic FS has attracted much attention lately due to its strong links to various
real-world problems [58, 99, 268, 289], and its principle of adapting to a changing
underlying knowledge is also intuitively appealing. Therefore, dynamic FS and the
methods described in this chapter have much potential worthy of further research.
Sections 9.2.2.2 and 9.2.2.4 will discuss the possible future directions in theory and
application, respectively.
Chapter 7
HSFS for Hybrid Rule Induction
THE most common approach to developing expressive and human readable representations
of knowledge is the use of production (if-then) rules [113]. A typical
technique for addressing the inefficiency of fuzzy rule induction [42, 88, 262] due
to high data dimensionality is to employ a pre-processing mechanism such as FS.
However, this additional step, regardless of the method employed, adds overhead
to the overall learning process. In addition, much like the techniques presented so
far in this thesis, the FS step is often carried out in isolation of rule induction, i.e.,
filter-based [164]. This separation may prove costly for the subsequent rule induction
phase, since the selected subset of features may not be those that are the most useful
to derive rules from. This has been the motivation behind wrapper-based approaches
to FS, with additional complexity introduced by performing rule induction repeatedly
in the search for the optimal set of features. Clearly, a closer integration of FS and
rule induction is desirable.
QuickRules [119] is a recently proposed hybrid fuzzy-rough rule induction algo-
rithm. It uses a greedy HC strategy as originally employed in QuickReduct [126] (see
Section 2.1.1.2) to search for a feature subset. This helps maintain the discernibility
of the full set of features. Meanwhile, fuzzy rules are generated on the fly in an
attempt to construct a rule base that provides complete coverage of the training data.
At the end of the process, each individual rule within the rule base will contain a more
compact feature subset. However, as discussed throughout this thesis, deterministic
HC techniques such as QuickReduct may lead to the discovery of sub-optimal feature
subsets [164], both in terms of the evaluation score and the subset size. The quality
of the resultant fuzzy-rough rule base, derived using such a potentially sub-optimal
feature subset, may also be sub-optimal.
This chapter describes a hybrid fuzzy-rough rule induction approach via the use
of HSFS. Similar to that for FS problems, the primary motivation for using stochastic
algorithms in the discovery of high level prediction rules is that a global search may
be performed. The resultant approach may also cope better with feature interaction
than greedy rule induction algorithms. The HS-based method proposed herein,
termed HarmonyRules, begins with an initial set of rules with randomly generated
underlying feature subsets, which are iteratively improved during the search process.
The ultimate aim is to identify an optimised set of rules based on a number of essential
and preferable performance criteria. The essential requirements include complete
coverage of the training data, and full preservation of the semantics of
the original features. Additional evaluations in terms of minimising the size of the
rule base, and cardinality of the selected feature subsets further guide the search
process to converge to a concise, meaningful, and accurate set of rules.
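These criteria can be combined into a single fitness value for HS to optimise. The function below is a hypothetical illustration, not taken from the thesis: complete coverage is treated as a hard requirement, while rule-base size and feature subset cardinality act as soft penalties; the weights and functional forms are assumptions.

```python
# Hypothetical multi-criterion fitness for a candidate rule base.
def rule_base_fitness(coverage, n_rules, n_features, w_rules=0.5, w_feats=0.5):
    if coverage < 1.0:
        return coverage  # essential criterion unmet: reward coverage progress only
    # preferable criteria: prefer fewer rules and smaller antecedent subsets
    return 1.0 + w_rules / (1 + n_rules) + w_feats / (1 + n_features)

# A compact rule base beats a bloated one once both fully cover the data:
print(rule_base_fitness(1.0, 5, 3) > rule_base_fitness(1.0, 10, 6))  # True
```

Structuring the fitness this way keeps any fully covering rule base strictly better than any partial one, so the search first satisfies the essential criteria and only then trades off conciseness.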
The remaining sections of this chapter are structured as follows. The theoretical
background of rule induction is introduced in Section 7.1. Section 7.2 describes the
proposed HSFS-based hybrid fuzzy-rough rule induction method, where detailed
pseudocode is provided in order to explain the learning procedures. Experimental
studies that demonstrate the potential of the approach are presented in Section
7.3, including an in-depth comparison against QuickRules in terms of feature subset
cardinality. Finally, an appraisal of the present approach is given in Section 7.4.
7.1 Background of Rule Induction
This section briefly describes the crisp rough rule induction method, and covers the
essential theoretical concepts exploited by both rough set based [106], and fuzzy-
rough set based [226] rule induction approaches. The QuickRules algorithm, which
the proposed work is aiming to improve upon, is also explained for completeness.
7.1.1 Crisp Rough Rule Induction
In crisp RST, rules can be generated through the use of so-called minimal complexes
[88]. Let D be a concept, t an attribute-value pair (a, v), and T a set of such attribute-
value pairs. A block of t, denoted by [t], is a set of objects for which attribute a has
value v. A concept D depends on a set of attribute-value pairs T, if and only if:

∅ ≠ [T] = ⋂_{t∈T} [t] ⊆ D    (7.1)

T is a minimal complex of D if and only if D depends on T, and no proper subset T′ ⊂ T
exists such that D depends on T′.
Consider a simple example data set shown in Table 7.1, which consists of 14
objects {x1, · · · , x14}, four conditional features {a1, a2, a3, a4}, and a decision feature
z. A list of all possible blocks [t1], · · · , [t|T∗|] that may be derived from the available
attribute-value pairs T∗ = {t1, · · · , t|T∗|} is given in Table 7.2.
Table 7.1: Example data set for rough set rule induction
Outlook (a1) Temperature (a2) Humidity (a3) Wind (a4) Golf (z)
x1 sunny hot high weak no
x2 sunny hot high strong no
x3 overcast hot high weak yes
x4 rain mild high weak yes
x5 rain cool normal weak yes
x6 rain cool normal strong no
x7 overcast cool normal strong yes
x8 sunny mild high weak no
x9 sunny cool normal weak yes
x10 rain mild normal weak yes
x11 sunny mild normal strong yes
x12 overcast mild high strong yes
x13 overcast hot normal weak yes
x14 rain mild high strong no
For a given concept, say Dz=no = {x1, x2, x6, x8, x14}, there exist several sets of
attribute-value pairs T that Dz=no depends on. For example:

T = {t1, t4, t7} = {(outlook, sunny), (temperature, hot), (humidity, high)}    (7.2)
which involves three features. The intersection of their associated blocks can be
computed using Eqn. 7.1:

∅ ≠ [T] = [t1] ∩ [t4] ∩ [t7] = {x1, x2} ⊂ Dz=no    (7.3)

In this case, T is not a minimal complex of Dz=no since there exists a subset

T′ = {t1, t7} = {(outlook, sunny), (humidity, high)} ⊂ T    (7.4)
Table 7.2: Blocks derived from the attribute-value pairs of the example data set
Block (a, v) Pair Objects
[t1] (outlook, sunny) {x1, x2, x8, x9, x11}
[t2] (outlook, overcast) {x3, x7, x12, x13}
[t3] (outlook, rain) {x4, x5, x6, x10, x14}
[t4] (temperature, hot) {x1, x2, x3, x13}
[t5] (temperature, mild) {x4, x8, x10, x11, x12, x14}
[t6] (temperature, cool) {x5, x6, x7, x9}
[t7] (humidity, high) {x1, x2, x3, x4, x8, x12, x14}
[t8] (humidity, normal) {x5, x6, x7, x9, x10, x11, x13}
[t9] (wind, weak) {x1, x3, x4, x5, x8, x9, x10, x13}
[t10] (wind, strong) {x2, x6, x7, x11, x12, x14}
with

[T′] = [t1] ∩ [t7] = {x1, x2, x8}    (7.5)
such that Dz=no depends on T ′, while T ′ itself is a minimal complex of Dz=no as it
cannot be further reduced without violating the properties defined in Eqn. 7.1.
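The dependency and minimality checks of this worked example can be carried out mechanically over the blocks of Table 7.2. The sketch below encodes objects by index and lists only the three blocks used above; the function names are illustrative, not the thesis's.

```python
from functools import reduce
from itertools import combinations

# Blocks from Table 7.2 (objects by index) and the concept D(z=no).
blocks = {
    ("outlook", "sunny"):   {1, 2, 8, 9, 11},
    ("temperature", "hot"): {1, 2, 3, 13},
    ("humidity", "high"):   {1, 2, 3, 4, 8, 12, 14},
}
D_no = {1, 2, 6, 8, 14}

def depends(T, D):
    """Eqn 7.1: D depends on T iff the block intersection is non-empty and in D."""
    inter = reduce(set.intersection, (blocks[t] for t in T))
    return bool(inter) and inter <= D

def is_minimal_complex(T, D):
    """Minimal complex: D depends on T but on no proper subset of T."""
    return depends(T, D) and not any(
        depends(S, D) for r in range(1, len(T)) for S in combinations(T, r))

T  = (("outlook", "sunny"), ("temperature", "hot"), ("humidity", "high"))
Tp = (("outlook", "sunny"), ("humidity", "high"))
print(is_minimal_complex(T, D_no), is_minimal_complex(Tp, D_no))  # False True
```

Running this reproduces the example: T fails because its proper subset T′ also satisfies Eqn 7.1, while T′ itself cannot be reduced further.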
It is often the case that a minimal complex only describes a given concept partially,
and hence more than one minimal complex is required to cover a concept. A local
covering T of a concept D is a collection of minimal complexes, such that the union
of all minimal complexes is exactly D and T is minimal (i.e. containing no spurious
attribute-value pairs). The discovery of such local coverings (referred to hereafter
as the rule base) forms the basis for several approaches to rough set rule induction
[210]. A partitioning of the universe of discourse (that is consistent) by a reduct will
always produce equivalence classes that are subsets of the decision concepts, and
will cover each concept fully. Once a reduct has been found, rules may be extracted
from the underlying equivalence classes. Note that in the literature, reducts for the
purpose of rule induction are termed global coverings.
A popular approach to rule induction in the relevant area is the so-called learning
from examples module, version 2 (LEM2) algorithm [88], which follows a heuristic
strategy for creating an initial rule by choosing sequentially the “best” elementary
conditions according to certain heuristic criteria. Learning examples that match this
rule are then removed from consideration. The process is repeated iteratively when
learning examples remain uncovered. The resulting set of rules covers all learning
examples.
Additional factors characterising rules may also be taken into account [89], including the strength of matched or partly-matched rules (the total number of
cases correctly classified by an emerging rule during training), the number of non-
matched conditions, the rule specificity (i.e., length of condition parts). All factors
are combined and the strongest decision wins. If no rule is matched, the partially
matched rules are considered and the most probable decision is chosen.
7.1.2 Hybrid Fuzzy-Rough Rule Induction
Following the theoretical background laid out previously in Section 2.1.1.2, a common
rule induction strategy in fuzzy-rough set theory [42, 226] is to induce fuzzy rules
by overlaying decision reducts on the original training decisions, and then reading
off the (qualitative) values of the selected features. In other words, by partitioning
the universe via the features present in a decision reduct, each resulting fuzzy-rough
equivalence class forms a single rule. As the partitioning is produced by a reduct, it
is guaranteed that each fuzzy-rough equivalence class is subsumed by, or equal to, a
decision concept. This means that the attribute values that produced this equivalence
class are good predictors of the decision concept. The use of a reduct also ensures
that each object is covered by the set of rules. A disadvantage of this approach is that
the generated rules are often too specific, as each rule antecedent always includes
every feature appearing in the final reduct.
For the purposes of combining rule induction and feature selection, rules are
constructed from so-called tolerance classes (antecedents) and corresponding
decision concepts (consequents). A fuzzy rule rx with respect to an object x ∈ X is
represented as a triple:
rx = (B, RB x , Rz x), x ∈ X , B ⊆ A (7.6)
where B ⊆ A is the set of conditional attributes that appear in the rule’s antecedent,
RB x is the fuzzy tolerance class of the object that generated the rule, and Rz x refers
to a decision class z, i.e. the consequent of the rule.
Recall from Section 2.1.1.2 that RB is a fuzzy indiscernibility relation [236] for a
given feature subset B ⊆ A:

µRB(xi, xj) = T_{a∈B} µRa(xi, xj)    (7.7)
where µRa(xi, xj) is a fuzzy similarity relation (see Eqn. 2.18 or 2.19 for example),
and T is a t-norm. For a given object xi ∈ X, its tolerance class RB xi is then defined
as:

RB xi(xj) = µRB(xi, xj), ∀ xj ∈ X    (7.8)
This formulation is used as it provides a fast way of determining rule coverage (the
cardinality of the fuzzy set RB x), and rule specificity (the cardinality of B or the
number of rule antecedents).
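Eqns 7.7 and 7.8 can be sketched directly in code. The similarity measure below is an assumption (one common choice of the Eqn 2.18/2.19 family: µRa(xi, xj) = max(0, 1 − |a(xi) − a(xj)|/σa), with σa the sample standard deviation), and the t-norm is taken to be min; the data is the feature a1 column of Table 7.3.

```python
import statistics

def similarity(column, i, j, sd):
    # Assumed similarity: 1 minus the sigma-scaled distance, clipped at zero.
    return max(0.0, 1.0 - abs(column[i] - column[j]) / sd)

def tolerance_class(data, B, i):
    """R_B x_i(x_j) = min over a in B of mu_Ra(x_i, x_j), for every x_j (Eqn 7.8)."""
    n = len(next(iter(data.values())))
    sds = {a: statistics.stdev(data[a]) for a in B}
    return [min(similarity(data[a], i, j, sds[a]) for a in B) for j in range(n)]

data = {"a1": [1, 8, 7, 7, 0, 0, 1]}  # feature a1 from Table 7.3
print([round(v, 2) for v in tolerance_class(data, ["a1"], 0)])
# [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]
```

Under these assumptions the output reproduces the tolerance class Ra1 x1 that appears in the worked example of Section 7.1.2.2, which suggests the sigma-scaled similarity is close to the one used there.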
7.1.2.1 Outline of QuickRules
As an example of an existing approach to hybrid fuzzy-rough rule
induction, QuickRules [119] is illustrated in Algorithm 7.1.1. Its subroutine
for checking the coverage of a newly generated rule is detailed in Algorithm 7.1.2.
It makes use of the QuickReduct method previously described in Section 2.1.1.2,
where features are being examined individually. A given feature a ∈ A\ B is added
to the candidate feature subset B, if it provides the greatest increase in fuzzy-rough
dependency evaluation. Fuzzy rules are constructed on the fly whilst the feature
subset is being improved, for those objects not yet covered by existing rules. A
given candidate rule rx is checked via the check(B, RB x, Rz x) subroutine, in order to
determine whether it is subsumed by any rule already in the rule base.
The process terminates when a fuzzy-rough reduct is found, and all training objects
covered. Several important mechanisms exploited by QuickRules will be explained in
a greater detail in the following sections.
7.1.2.2 Worked Example
In order to demonstrate the operations of the QuickRules algorithm, an example data
set [237] adopted in the original paper [119] is employed. This data set, as shown
in Table 7.3, consists of seven objects X = x1, · · · , x7, eight conditional features
A= a1, · · · , a8 which are all quantitative, and a decision feature z.
Using hill climbing, QuickRules is initiated when the first object x1 is examined
with respect to the first feature a1, as no rules exist at the beginning. Using Eqn.
2.22, the membership of object x1 in the fuzzy set POSa1 is computed, which in
this case, is the same as that calculated using the full set of features:
µPOSa1(x1) = µPOSA(x1)    (7.9)
1   T, temporary feature subset
2   B = ∅, rules = ∅, cov = ∅
3   repeat
4       T = B
5       foreach a ∈ A \ B do
6           foreach x ∈ X \ covered(cov) do
7               if POS_{B∪{a}}(x) = POS_A(x) then
8                   check(B ∪ {a}, R_{B∪{a}} x, Rz x)
9           if f(B ∪ {a}) > f(T) then
10              T = B ∪ {a}
11      B = T
12  until f(B) = f(A)
13  return B, rules

Algorithm 7.1.1: Work flow of QuickRules
Algorithm 7.1.1: Work flow of QuickRules
1   R^r_B x, tolerance class of rule r ∈ rules
2   add = true
3   foreach r ∈ rules do
4       if RB x ⊆ R^r_B x then
5           add = false
6           break
7       else
8           if R^r_B x ⊂ RB x then rules = rules \ {r}
9
10  if add = true then
11      rules = rules ∪ {(B, RB x, Rz x)}
12      cov = cov ∪ RB x

Algorithm 7.1.2: Subroutine check(B, RB x, Rz x)
This satisfies the condition for new rule generation, as outlined in lines 7 to 8 of
Algorithm 7.1.1. Therefore, check(a1, Ra1x1, Rz x1) is invoked, with
Ra1x1 = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]
Rz x1 = [1, 0, 1, 0, 1, 1, 1]    (7.10)
This new rule fully covers x1 (being the object that the rule is constructed for), and
also x7 which has the same feature value as x1 for a1. It also partially covers objects
x5 and x6 to a degree of 0.73. This rule is then added to the (currently empty) rule
Table 7.3: Example data set for QuickRules
a1 a2 a3 a4 a5 a6 a7 a8 z
x1 1 101 50 15 36 24.2 0.526 26 0
x2 8 176 90 34 300 33.7 0.467 58 1
x3 7 150 66 42 342 34.7 0.718 42 0
x4 7 187 68 39 304 37.7 0.254 41 1
x5 0 100 88 60 110 46.8 0.962 31 0
x6 0 105 64 41 142 41.5 0.173 22 0
x7 1 95 66 13 38 19.6 0.334 25 0
set, and the coverage of the rule base is updated:
cov = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]    (7.11)
The algorithm continues to examine the remaining objects for the current feature
subset a1. It identifies another rule for x5, as µPOSa1(x5) = µPOSA(x5), with:
Ra1x5 = [0.73, 0.0, 0.0, 0.0, 1.0, 1.0, 0.73]
Rz x5 = [1, 0, 1, 0, 1, 1, 1]    (7.12)
This newly constructed rule is then compared to the existing rule in the rule base
Ra1x1. As neither of the two rules subsumes the other, i.e., Ra1x1 6⊂ Ra1x5 and
Ra1x5 6⊆ Ra1x1, this new rule is added to the rule base, and the coverage cov is
again updated:
cov = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0] ∪ [0.73, 0.0, 0.0, 0.0, 1.0, 1.0, 0.73]
    = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]    (7.13)
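The subsumption test and coverage update used here reduce to pointwise comparisons on fuzzy membership vectors. The sketch below replays them with the two tolerance classes from Eqns 7.10 and 7.12; it is a minimal illustration of the set operations, not the full check() subroutine.

```python
Ra1_x1 = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]   # Eqn 7.10
Ra1_x5 = [0.73, 0.0, 0.0, 0.0, 1.0, 1.0, 0.73]   # Eqn 7.12

def subsumes(r, s):
    """True iff fuzzy set s is contained in r, i.e. s(x) <= r(x) everywhere."""
    return all(sv <= rv for rv, sv in zip(r, s))

# Neither rule subsumes the other, so both stay in the rule base:
print(subsumes(Ra1_x1, Ra1_x5), subsumes(Ra1_x5, Ra1_x1))  # False False

# Fuzzy union (pointwise max) gives the updated coverage of Eqn 7.13:
cov = [max(a, b) for a, b in zip(Ra1_x1, Ra1_x5)]
print(cov)  # [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```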
For the feature a1 that is currently being considered, the last two objects x6 and
x7 have already been covered by the existing rules. The algorithm then calculates
the dependency degree of z upon the feature subset a1, producing f (a1) (refer
to Section 2.1.1.2 for more details regarding the dependency calculation using
fuzzy-rough sets).
The remaining single-feature subsets are also checked in a similar manner as
above, during which it is determined that µPOSa8(x2) = µPOSA(x2); a new rule
(a8, Ra8x2, Rz x2) is then added and the coverage of the rule base updated. The
dependency calculation results of the respective features are summarised as follows:
f (a1) = 0.61 f (a2) = 0.89
f (a3) = 0.28 f (a4) = 0.55
f (a5) = 0.70 f (a6) = 0.56
f (a7) = 0.46 f (a8) = 0.71
In this example, the best feature is a2, which results in the greatest increase in
dependency score; it is therefore added to the feature subset B.
The QuickRules algorithm is intended to iterate until all training objects are covered
fully by the discovered rules. While examining the remaining combinations of
feature subsets, two more rules, ({a2, a3}, Ra2,a3 x3, Rz x3) and ({a2, a3}, Ra2,a3 x4, Rz x4),
are identified. The coverage of the rule base, at this stage, becomes:

cov = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]    (7.14)

and the simultaneously selected feature subset B = {a2, a3} also reaches full dependency
evaluation f(B) = 1. The termination condition of QuickRules is satisfied, and the
final set of rules is:
(a1, Ra1 x1, Rz x1)
(a1, Ra1 x5, Rz x5)
(a8, Ra8 x2, Rz x2)
(a2, a3, Ra2,a3 x3, Rz x3)
(a2, a3, Ra2,a3 x4, Rz x4)
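The subsumption check and fuzzy coverage update in this worked example can be sketched in Python as follows (an illustrative rendering with hypothetical helper names; the fuzzy union is taken as the element-wise maximum):

```python
def fuzzy_union(u, v):
    """Element-wise maximum: the union of two fuzzy membership vectors."""
    return [max(a, b) for a, b in zip(u, v)]

def subsumes(u, v):
    """True if rule coverage u subsumes v, i.e. v is pointwise included in u."""
    return all(a >= b for a, b in zip(u, v))

r_a1_x1 = [1.0, 0.0, 0.0, 0.0, 0.73, 0.73, 1.0]  # coverage after the first rule (Eqn. 7.11)
r_a1_x5 = [0.73, 0.0, 0.0, 0.0, 1.0, 1.0, 0.73]  # the new rule for x5 (Eqn. 7.12)

# Neither rule subsumes the other, so the new rule is kept and the
# coverage becomes their union (Eqn. 7.13).
assert not subsumes(r_a1_x1, r_a1_x5) and not subsumes(r_a1_x5, r_a1_x1)
cov = fuzzy_union(r_a1_x1, r_a1_x5)
print(cov)  # [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```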
7.2 HSFS for Hybrid Rule Induction
In this section, HSFS and its underlying HS algorithm are employed to aid rule base
optimisation. The rule induction process is integrated directly into the FS process,
generating rules on the fly. During its iterative process, the algorithm optimises the
emerging rule base with regard to given criteria (see later). The final result is a set
of fuzzy rules that cover the training objects to the maximum extent, whilst utilising
the minimum number of features.

7.2.1 Mapping of Key Notions
For rule induction, as summarised in Table 7.4, each musician represents a training
object x ∈ X. The collection of available notes for each musician is the possible set of
rules with respect to x, taking the form previously introduced in Eqn. 7.6. The rules
are differentiated by the various feature subsets involved. Each musician may vote
for one rule to be included in the emerging rule base when a new harmony is being
improvised. Here, a musician may choose to nominate r−, denoting an "empty" or
"blank" rule, if the object it represents is already covered by other existing rules. A
harmony H is then the combined rule base from all musicians, taking the form:

H = ( (B1, RB1 x1, Rd x1), · · · , r−, · · · , (B|X|, RB|X| x|X|, Rd x|X|) )   (7.15)

for xn ∈ X, Bn ⊆ A.
Table 7.4: Mapping of key notions from HS to rule induction
HS                  Optimisation        Rule Induction
------------------  ------------------  --------------------
Musician            Variable            Object
Musical Note        Variable Value      Rule
Harmony             Solution Vector     Rule Base
Harmony Memory      Solution Storage    Rule Base Storage
Harmony Evaluation  Fitness Function    Rule Base Evaluation
Optimal Harmony     Optimal Solution    Optimal Rule Base
The harmony memory H stores a predefined number of "good" candidate rule
bases, which are constantly updated with better quality rule bases over the course
of the search. The fitness function analyses and scores each harmony H (i.e., a
candidate rule base) found during the search process, using criteria including: data
coverage of the training objects, dependency with respect to the full set of features,
the size of the entire rule base, and the cardinality of the feature subsets involved in
the rules:
the rules:
evaluate(H):
    coverage    = Σ_{x∈X} ( ∪_{r∈H} R_Br x )(x)
    dependency  = |POS_H| / |POS_A| = ( Σ_{x∈X, r∈H} POS_Tr(x) ) / ( Σ_{x∈X} POS_A(x) )
    size        = 1 − |H| / |X|
    cardinality = 1 − ( Σ_{r∈H} |Tr| ) / ( |H| · |B| )   (7.16)
where coverage and dependency are prioritised.

The quality of a potential solution is first judged with respect to the criteria of
coverage and dependency. The size and cardinality are examined only if the solution
achieves coverage and dependency scores equal to or better than those currently stored
within H. This prioritisation reduces the computational cost of evaluating the weak
solutions that typically occur during the randomised search process. The same strategy
is employed by HSFS [62] for FRFS [126], where the goal of obtaining a full
fuzzy-rough dependency score is prioritised over size reduction.
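A minimal sketch of such a prioritised comparison follows, assuming the four criteria of Eqn. 7.16 are available as normalised scores in [0, 1] (the helper name and dictionary layout are hypothetical, not from the thesis):

```python
def better(new, old):
    """Compare two candidate rule bases by the criteria of Eqn. 7.16:
    coverage and dependency are judged first; size and cardinality
    act only as tie-breakers (hypothetical helper)."""
    primary_new = (new["coverage"], new["dependency"])
    primary_old = (old["coverage"], old["dependency"])
    if primary_new != primary_old:
        return primary_new > primary_old
    # Equal on the prioritised criteria: compare the compactness measures.
    return (new["size"], new["cardinality"]) > (old["size"], old["cardinality"])

h1 = {"coverage": 1.0, "dependency": 1.0, "size": 0.6, "cardinality": 0.8}
h2 = {"coverage": 1.0, "dependency": 1.0, "size": 0.5, "cardinality": 0.9}
print(better(h1, h2))  # True: tied on the prioritised criteria, h1 is more compact
```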
7.2.2 HarmonyRules
The two stages of HarmonyRules are shown in Algorithms 7.2.1 and 7.2.2. The
integration of these two procedures proceeds in a similar way to QuickRules [119].
As it is a hybrid approach, FS is performed alongside rule generation, and is embedded
within the random rule generation via the use of random feature subspaces.
7.2.2.1 Initialisation
 1  H = {H^j | j = 1, · · · , |H|}
 2  H^j_i ∈ H^j, i = 1, · · · , |X|
 3  B = FS(A, X)   (optional)
 4  for j = 1 to |H| do
 5      Hnew, cov = ∅
 6      for i = 1 to |X| do
 7          if x_i ∈ (X \ covered(cov)) then
 8              T = RandomSubspace(B)
 9              if POS_T(x_i) = POS_A(x_i) then
10                  cov = cov ∪ RT x_i
11                  Hnew_i = (T, RT x_i, Rz x_i)
12              else
13                  Hnew_i = r−
14      evaluate(Hnew)
15      H^j = Hnew

Algorithm 7.2.1: HarmonyRules initialisation
If any pre-processing is performed prior to rule induction (line 3), it is beneficial
to make use of fuzzy-rough set-based feature subset evaluators [126], so that the
reduced subset B satisfies:

POS_B(x) = POS_A(x), ∀x ∈ X   (7.17)
The candidate rule base (harmony) is maintained in H, and subsequently stored in
the harmony memory H. H contains the randomly generated rules for all objects,
and its stochastic states are reflected by the use of the random feature subspace (line 8).
The fuzzy set cov in X records the current degree of coverage of each object in the
training data by the current set of rules, while the function covered(cov) returns
the set of objects that are maximally (to a degree of 1.0) covered in cov:

covered(cov) = {x ∈ X | cov(x) = POS_A(x)}   (7.18)

This means that an object x is considered to be covered by the set of rules (H), if its
membership to cov is equal to that of the positive region of the full set of features A.
A rule (note) is constructed for x and subset T only when x has not yet been covered
maximally (line 7).
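Eqn. 7.18 can be sketched directly, assuming cov and POS_A are stored as mappings from objects to membership degrees (an illustrative helper, not the thesis implementation):

```python
def covered(cov, pos_a):
    """Eqn. 7.18: the objects maximally covered by the current rules,
    i.e. those whose coverage equals the full positive-region membership."""
    return {x for x, mu in cov.items() if mu == pos_a[x]}

cov   = {"x1": 1.0, "x2": 0.4, "x3": 0.9}
pos_a = {"x1": 1.0, "x2": 0.9, "x3": 0.9}
print(sorted(covered(cov, pos_a)))  # ['x1', 'x3']
```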
The coverage is updated if a rule is discovered when x ∉ covered(cov), and x
belongs to POS_T to the maximum extent (line 9). That is, the tolerance class RT(x)
of x (see Eqn. 7.8) is fully included in a decision concept, and the feature values of
T that generated this tolerance class are good indicators of the concept. The new
coverage is determined by taking the union of the rule's tolerance class with the
current coverage (line 10). When all objects are fully covered, no further rules are
created. The current rule base H is then added to the harmony memory, having been
evaluated on the basis of four criteria: coverage, dependency, size and cardinality, as
defined in Eqn. 7.16.
Table 7.5 depicts an illustrative example of the harmony (rule base) generation
process. It starts with a rule r4 being created by musician p4 for object x4, which
covers x4 maximally. Musician p8 then identifies a slightly more general rule r8 that
covers both x2 and x8. Intuitively, no further inspection of x2 is necessary, and this is
marked as r− by p2. The process continues until all objects are covered by the rules
induced; in this case, only 4 rules {r1, r3, r4, r8} are required to describe the 8
training objects X = {x1, · · · , x8}. This is in contrast to QuickRules, where a rule is
added to the rule base only if no rule exists with the same or greater coverage, and
an existing rule that has a strictly smaller coverage than the new rule is deleted.
HarmonyRules instead relies on the optimisation capability of HS to converge to a
more compact rule base whilst maintaining coverage of all the training objects.
Table 7.5: Rule base improvisation example, showing an emerging harmony (left)
and its associated coverage status of objects (right)

p1  p2  p3  p4  p5  p6  p7  p8 | x1 x2 x3 x4 x5 x6 x7 x8
 ·   ·   ·  r4   ·   ·   ·   · |  ·  ·  ·  Ø  ·  ·  ·  ·
 ·  r−   ·  r4   ·   ·   ·  r8 |  ·  Ø  ·  Ø  ·  ·  ·  Ø
r1  r−   ·  r4  r−  r−   ·  r8 |  Ø  Ø  ·  Ø  Ø  Ø  ·  Ø
r1  r−  r3  r4  r−  r−  r−  r8 |  Ø  Ø  Ø  Ø  Ø  Ø  Ø  Ø
7.2.2.2 Iteration
Once the harmony memory has been initialised with |H| rule bases, the
iterative improvisation process starts, as shown in Algorithm 7.2.2. At the beginning
of every iteration, the rule base Hnew and the fuzzy set of data coverage cov
are both empty. Musicians, each representing an object x ∈ X, x ∉ covered(cov),
then follow the Pick(x) procedure, each nominating a rule for inclusion in the
emerging rule base Hnew. Similar to the initialisation process, the tolerance classes
are examined, and the measure of coverage cov is updated when necessary (lines
6 and 7). The newly improvised rule base is evaluated, and it replaces the current
worst rule base stored in the harmony memory if a higher score is achieved.
 1  while g < g_max do
 2      Hnew, cov = ∅
 3      for i = 1 to |X| do
 4          if x_i ∈ (X \ covered(cov)) then
 5              T = Pick(x_i)
 6              if POS_T(x_i) = POS_A(x_i) then
 7                  cov = cov ∪ RT x_i
 8                  Hnew_i = (T, RT x_i, Rz x_i)
 9              else
10                  Hnew_i = r−
11      if evaluate(Hnew) > min_{H∈H} evaluate(H) then
12          Update H with Hnew
13      g = g + 1

Algorithm 7.2.2: HarmonyRules iteration
7.2.3 Rule Adjustment Mechanisms
Recall that the key parameters of HS (as described in Section 3.1.1) are the harmony
memory considering rate δ, the pitch adjustment rate ρ, and the fret-width τ. They
encourage exploration and help with the fine-tuning of a given candidate solution.
Both δ and ρ, which affect rule choices, are incorporated in this approach, allowing
potentially good rules and rule combinations to be discovered. Refer to Fig. 3.1.1,
which illustrates the adjustment procedure for the original HS. For δ activation, a
randomly formed feature subset is assigned to the current object x. This is conceptually
similar to the original δ activation, which causes a musician to randomly pick a value
from the entire value range [min_x, max_x] of a given variable x.
For ρ, the feature subset involved in the rule in question will be modified by
HarmonyRules, by adding/removing k features (a Hamming distance of k), where
k = τ × |A|. Here, τ ∈ [0, 1] is the predefined pitch adjusting fret-width, which is
scaled by the total number of features |A|. For example, assume a total of |A| = 20
features and τ = 0.1, giving k = 2. The modifications made to the feature subset Bn
for the rule generated for a given object xn may be:

(Bn, RBn xn, Rd xn) −→ (B′n, RB′n xn, Rd xn)
Bn = {a2, a3, a4, a18, a20} −→ B′n = {a2, a4, a11, a18, a20}   (7.19)

Here, a total of k = 2 alterations have been carried out (a3 is removed and a11
added), so that a new feature subset B′n can be obtained. The following adjustment
scenario is equally valid, resulting in an empty feature subset. This in turn specifies
an "empty" rule r− to be assigned to object xn:

(Bn, RBn xn, Rd xn) −→ (r−)
{a3, a20} −→ ∅   (7.20)
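The ρ adjustment described above can be sketched as follows (one possible rendering; the 50/50 add-or-remove policy is an assumption, as the text only fixes the number of alterations k):

```python
import random

def adjust_subset(subset, all_features, tau, rng=random):
    """Alter a rule's feature subset by a Hamming distance of
    k = round(tau * |A|), adding or removing one feature per step.
    (Hypothetical helper; the choice policy is an assumption.)"""
    k = max(1, round(tau * len(all_features)))
    current = set(subset)
    for _ in range(k):
        if current and rng.random() < 0.5:
            current.remove(rng.choice(sorted(current)))   # drop a feature
        else:
            outside = sorted(set(all_features) - current)
            if outside:
                current.add(rng.choice(outside))          # add a feature
    return current  # may become empty, denoting the blank rule r-

features = ["a%d" % i for i in range(1, 21)]              # |A| = 20, so k = 2
new_subset = adjust_subset({"a2", "a3", "a4", "a18", "a20"}, features, tau=0.1)
print(sorted(new_subset))
```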
Note that the current implementation of HarmonyRules is of complexity O(g_max × |X|³).
This is because the fuzzy-rough rule evaluation itself has a cost of O(|X|²), which
needs to be performed at least once for every rule and every object x ∈ X.
Further optimisations and modifications based on the experimental findings remain
an active research topic, to be further discussed in Section 9.2.1.
7.3 Experimentation and Discussion
This section presents an experimental evaluation of the proposed approach, for the
task of classification, over 9 real-valued benchmark data sets drawn from [78], with a
selection of classifiers. A summary of the data sets used is given in Table 7.6, and
detailed descriptions of their underlying problem domains may be found in Appendix
B. The number of conditional features ranges from 8 to 2556, and the number of
objects ranges from 120 to 390. The HS parameters empirically employed are shown
in Table 7.7.
Table 7.6: Data set information
Data set  Objects  Features  Classes
cleve       297      13        5
glass       270      13        7
heart       214       9        2
ionos       230      34        2
olito       120      25        4
water2      390      38        2
water       390      38        3
web         149    2556        5
wine        178      13        3
Table 7.7: Parameters settings where * denotes the dynamically adjusted values
Parameter  |P|   |H|   g_max   δ     ρ     τ
Value      |X|   20    2000    0.8*  0.5*  0.1*
7.3.1 Classification Results
Table 7.8 reports the classification performance of HarmonyRules as compared to
the HC-based QuickRules algorithm [119], and to those obtained using the following
methods: a) the nearest neighbour classifier based on fuzzy sets (FNN) [140], b)
the recently developed weighted fuzzy subsethood-based algorithm (WSBA) [213],
and c) two leading rough set-based rule induction methods (learning from examples
module, version 2 (LEM2) [88], and the modified LEM algorithm (ModLEM) [210]).
For each method, 10-FCV is performed to validate the generated models, with the
results averaged. A statistical paired t-test (per fold) is carried out to judge the
significance of the differences between HarmonyRules and QuickRules, with threshold
p = 0.01, where v, −, ∗ indicate that a result is statistically better, the same, or worse,
respectively.
It can be seen that HarmonyRules obtained better results than QuickRules in
terms of accuracy in 4 out of 9 cases, and statistically comparable performance
is achieved for the data sets cleve, heart, and wine. Note that HarmonyRules
typically generates rules with more compact underlying feature subsets, mainly due
to the excellent FS performance of the HSFS algorithm. Therefore, the overall quality
of the discovered rule bases is superior. A more in-depth investigation regarding
rule cardinality is given later in Section 7.3.2. Larger standard deviations are also
observed for the HarmonyRules results, likely caused by the stochastic nature
of the HS mechanism employed.
The results presented in Table 7.8 demonstrate the power of fuzzy-rough set
theory in handling the vagueness and uncertainty often present in data. Although
HarmonyRules outperforms QuickRules by 9% for the data set glass, FNN clearly
claims the best result with 68.57% correct predictions. LEM2 performs relatively poorly,
particularly for the glass, ionos and olito data sets. WSBA fails to classify the
web data set, demonstrating the shortcomings of that approach when handling high
dimensional data sets. Note that, unlike HarmonyRules, LEM2 and ModLEM perform
a degree of rule pruning during induction, and are therefore expected to produce
more general rules and better resulting accuracies.
Table 7.8: Classification accuracy comparison between HarmonyRules and QuickRules,
using 10-FCV, where v, −, ∗ indicate statistically better, same, or worse than QuickRules

Data set  HarmonyRules     QuickRules   FNN          WSBA         LEM2         ModLEM
cleve     56.53±10.35 -    56.21±7.54   49.75±5.67   51.91±8.50   53.17±5.47   50.15±8.25
glass     51.88±14.28 v    42.83±12.34  68.57±9.62   35.51±8.70   14.98±6.33   62.62±7.96
heart     80.00±6.11  -    80.67±6.83   66.11±7.89   83.00±6.07   77.04±10.18  76.3±7.8
ionos     92.17±8.68  v    91.57±5.62   78.00±7.46   82.96±8.71   58.26±12.17  87.39±5.65
olito     75.83±10.88 v    72.33±11.48  63.25±12.48  78.17±11.77  44.17±9.17   64.17±9.9
water2    86.89±3.97  v    86.13±4.42   77.97±2.66   84.74±4.84   80.26±2.82   86.41±3.98
water     81.28±7.69  *    82.41±4.81   74.64±3.77   81.97±5.57   70.51±3.85   84.36±5.67
web       56.33±10.00 *    63.1±11.89   45.55±8.04   1.14±2.70    41.67±11.67  61.05±12.28
wine      97.78±4.05  -    97.75±3.92   96.40±4.06   96.17±4.14   73.66±9.25   92.03±8.7
Additional comparative studies have been reported in [119] that include results
obtained using leading non-fuzzy rough methods [132, 264], which are given in
Table 7.9. In addition to the NB and C4.5 algorithms that have already been described
in Section 3.5, the sequential minimal optimisation (SMO) method that is widely
used for training support vector machines [208], a neural network-based algorithm
termed projective adaptive resonance theory (PART) [33], and finally a propositional
rule learning algorithm, RIPPER (repeated incremental pruning to produce error
reduction) [43], are also used for comparison. Based on the results, HarmonyRules
is comparable to the leading non-rough set-based techniques, with competitive
classification models obtained for the data sets cleve, ionos, and wine. Indeed, the
overall performance of HarmonyRules is slightly worse than that of SMO. However,
note that black-box classifiers such as SMO do not produce humanly interpretable
rules, as they typically work in higher dimensional transformed feature spaces. Since
one of the main reasons for performing rule induction is to make data transparent to
human users, this is counter-productive, becoming an obstacle to knowledge discovery
and also to the understanding of the underlying processes which generated the data.
Table 7.9: Classification accuracy of other classifiers tested using 10-FCV

Data set  SMO          NB           C4.5         RIPPER       PART
cleve     58.31±6.15   56.06±6.78   53.39±7.31   54.16±3.64   52.44±7.20
glass     57.77±9.10   47.70±9.21   68.08±9.28   67.05±10.69  69.12±8.50
heart     83.89±6.24   83.59±5.98   78.15±7.42   79.19±6.38   77.33±7.81
ionos     82.96±6.93   83.78±7.62   86.13±6.20   87.09±6.92   87.39±6.61
olito     87.92±8.81   78.50±11.31  65.75±12.13  68.83±13.06  67.00±12.86
water2    83.67±4.15   70.28±7.56   83.08±5.45   82.64±5.46   83.79±5.17
water     86.87±4.36   85.46±4.98   81.59±6.51   82.44±6.63   82.54±5.87
web       64.78±10.47  63.41±12.93  57.63±11.31  55.09±12.99  51.50±12.86
wine      98.70±2.76   97.46±3.86   93.37±5.85   93.18±6.49   92.24±6.22
7.3.2 Comparison of Rule Cardinalities
Table 7.10 presents a side-by-side comparison between HarmonyRules and QuickRules
in terms of the cardinalities of the rules returned by the different induction processes.
It can be seen that, apart from the data set glass, where HarmonyRules uses almost
double the number of features but achieves much better classification accuracy, the
averaged cardinalities of the rule bases obtained by HarmonyRules are generally more
compact than those of QuickRules. However, although HarmonyRules selects
substantially smaller subsets for the data set web, the resulting classification accuracy
is also reduced. One possible explanation of the above observation is that, during HS
optimisation, the random feature subsets are judged purely by their capability to approximate
Table 7.10: HarmonyRules vs. QuickRules in terms of rule cardinalities, where v, −,
∗ indicate statistically better, same, or worse results

          HarmonyRules                  QuickRules
Data set  Accuracy %     Cardinality    Accuracy %    Cardinality
cleve     56.53±10.35 -   8.17 v        56.21±7.54     9.35
glass     51.88±14.28 v  10.83 *        42.83±12.34    5.66
heart     80.00±6.11  -   5.27 v        80.67±6.83     5.68
ionos     92.17±8.68  v   8.99 v        91.57±5.62     9.95
olito     75.83±10.88 v   8.19 v        72.33±11.48    9.39
water2    86.89±3.97  v   5.99 -        86.13±4.42     5.97
water     81.28±7.69  *   5.05 v        82.41±4.81    11.06
web       56.33±10.00 *  20.57 v        63.1±11.89    64.04
wine      97.78±4.05  -   7.05 *        97.75±3.92     4.65
Figure 7.1: HarmonyRules (left) vs. QuickRules (right) in terms of rules' feature
subset cardinality distribution (for data set web of 2556 features)
the underlying concept, determined by the fuzzy-rough lower approximation POS_B(x),
coverage, etc. Two rule bases may be identical in terms of evaluation scores; however,
the models, and their stability and complexity, may be distinctly different. The
classification task to be performed may favour one with more specific rules (which
may be obtained by QuickRules), for example, for the web data set as shown in Fig. 7.1.
Because one rule base evaluation criterion is the compactness of the feature subsets,
HarmonyRules resulted in more rules containing fewer than 50 features, while a
large number of the rules returned by QuickRules are much higher in terms of attribute
cardinality.
7.4 Summary
This chapter has described an improved hybrid fuzzy-rough rule induction technique,
HarmonyRules, based on fuzzy-rough rule induction and HSFS. HS has demonstrated
many competitive features over conventional approaches: fast convergence, sim-
plicity, insensitivity to initial states, and efficiency in finding good quality solutions.
Experimental comparative studies have shown that the resultant rule base completely
covers the training data, while the feature subsets involved also achieve a full fuzzy-rough
dependency evaluation, indicating good approximations of the original data. The
cardinality of the feature subsets involved reflects good subset size reduction. Classi-
fication accuracy is comparable to that achievable by state-of-the-art approaches.
In almost all aspects, the proposed approach is able to improve upon the greedy
HC-based QuickRules method.
The technique presented in this chapter may be further improved by various
means, and closer examination of the theories involved may reveal tighter forms
of integration, resulting in more optimised (in terms of efficiency and robustness)
methods. A more detailed discussion is given in Section 9.2.1.3. Furthermore, it
is worth noting that in practical scenarios [148], there often exist substantial gaps
in a given (fuzzy) rule base, i.e., the rule base is sparse, regardless of how the rules
are originally obtained (learned from data, or supplied by domain experts). This is
because the amount of knowledge or data available may be limited, and therefore
the rules may not fully support the needs of inference-based fuzzy reasoning [177, 282].
For such types of application, methods developed for fuzzy rule interpolation [142, 143],
to be explored in the following chapter, become valuable.
Chapter 8
HSFS for Fuzzy Rule Interpolation
Fuzzy Rule Interpolation [142, 143] (FRI) is of particular significance for rea-
soning in the presence of insufficient knowledge or sparse rule bases. When a
given observation has no overlap with antecedent values, no rule can be invoked in
classical (fuzzy) rule-based inference, and therefore no consequence can be derived.
The techniques of FRI not only support inference in such situations, but also help to
reduce the complexity of fuzzy models. Despite these advantages, FRI techniques are
relatively rarely applied in practice [148]. One of the main reasons is that real-world
applications generally involve rules with a large number of antecedents, and the
errors accumulated throughout the interpolation process may affect the accuracy
of the final estimation. More importantly, a rule base may involve less relevant,
redundant or even misleading variables, which could further skew the outcome
of an interpolation. Such characteristics of data have been studied extensively
in the area of FS, with techniques developed to rank the importance of features
[116, 123, 147, 219, 291], or to discover a minimal feature subset from a problem
domain while retaining a suitably high accuracy in representing the original data
[52, 93, 126, 164].
This chapter presents a new approach that uses FS techniques to evaluate the
importance of antecedent variables in a fuzzy rule base. Such importance degrees
are hereafter referred to as the set of "antecedent significance values". This allows
subsets of informative antecedent variables to be identified via the use of feature
subset search methods, e.g., HSFS. It helps to reduce the dimensionality of a rule
base by removing irrelevant antecedent variables. An antecedent significance-based
FRI technique based on scale and move transformation-based FRI (T-FRI) is also
proposed, which exploits the information analysed by FS, in order to facilitate more
effective interpolation using weighted aggregation [273]. The benefits of this work
are demonstrated using the scenario of backward FRI [128, 131] (B-FRI), which is a
newly identified research focus within FRI.
The remainder of this chapter is structured as follows. Section 8.1 introduces the
general ideas behind FRI, and explains the key notions and interpolation steps of
T-FRI, which is the main method used to carry out the present investigation. This
section also gives an outline of the B-FRI method for completeness. Section 8.2
details the developed approach which applies the existing ideas in FS to FRI, explains
the antecedent significance-based aggregation procedure that is implemented using T-
FRI, and discusses its potential benefits to B-FRI. In Section 8.3, an example scenario
concerning the prediction of terrorist bombing attacks is employed to showcase the
procedures of the proposed work. Further, a series of experiments has been carried
out in order to verify the general performance of the present approach. Section 8.4
summarises the chapter.
8.1 Background of Fuzzy Rule Interpolation (FRI)
This section first introduces the general principles of FRI, and provides a brief
introduction of the procedures involved in T-FRI, including the definitions of its
underlying key notions, and an outline of its interpolation steps. Note that the
basic form of T-FRI is herein employed for the ease of presentation, which utilises
two neighbouring rules of a given observation to perform interpolation. Triangular
membership functions are also adopted for simplicity, which are the most commonly
used fuzzy set representation in fuzzy systems. More detailed descriptions and
discussions on the theoretical underpinnings behind T-FRI can be found in the
original work [109, 110].
FRI approaches in the literature can be categorised into two classes with several
exceptions (e.g. type II fuzzy interpolation [36, 158]). The first category of ap-
proaches directly interpolates rules whose antecedents match the given observation.
The consequence of the interpolated rule is thus the logical outcome. Typical methods
in this group [142, 143, 248] are based on the use of α-cuts (α ∈ (0, 1]). The α-cut
of the interpolated consequent fuzzy set is calculated from the α-cuts of the observed
antecedent fuzzy sets, and those of all the fuzzy sets involved in the rules used for
interpolation. Having found the consequent α-cuts for all α ∈ (0, 1], the consequent
fuzzy set is then assembled by applying the Resolution Principle [229].
The second category is based on analogical reasoning [24]. Such approaches
first interpolate an artificially created intermediate rule so that the antecedents of
the intermediate rule are similar to the given observation [14]. Then, a conclusion
can be deduced by firing this intermediate rule subject to similarity constraints. The
shape distinguishability between the resulting fuzzy set and the consequence of the
intermediate rule, is set to be analogous to the shape distinguishability between
the observation and the antecedent of the created intermediate rule. In particular,
the scale and move transformation-based approach [109, 110] (T-FRI) offers a
flexible means to handle both interpolation and extrapolation involving multiple
multi-antecedent rules.
Following a similar convention to that adopted for FS, described in Section 1.1, an
FRI system investigated in this chapter is defined as a tuple (R, Y), where
R = {r_1, · · · , r_|R|} is a non-empty finite set of fuzzy rules (the rule base), and Y is a
non-empty finite set of variables, Y = A ∪ {z}, where A = {a_j | j = 1, · · · , |A|} is the
set of antecedent variables, and z is the consequent variable appearing in the rules.
Without losing generality, a given rule r_i ∈ R and an observation o∗ are expressed in
the following format:

r_i : if a_1 is a_i1, and · · · , and a_j is a_ij, and · · · , and a_|A| is a_i|A|, then z is z_i
o∗ : a_1 is a∗_1, and · · · , and a_j is a∗_j, and · · · , and a_|A| is a∗_|A|   (8.1)

where a_ij represents the value (the fuzzy set) of the antecedent variable a_j in rule r_i,
and z_i denotes the value of the consequent variable z for r_i. The asterisk (∗)
denotes that a value has been directly observed.
8.1.1 Transformation-Based FRI
A key concept used in T-FRI is the representative value rep(a_j) of a fuzzy set a_j,
which captures important information such as the overall location of the fuzzy set. For
triangular membership functions of the form a_j = (a_j1, a_j2, a_j3), where a_j1 and a_j3
represent the left and right extremities (with membership values of 0), and a_j2 denotes
the normal point (with a membership value of 1), rep(a_j) is defined as the centre of
gravity of these three points:

rep(a_j) = (a_j1 + a_j2 + a_j3) / 3   (8.2)
More generalised forms of representative values for more complex membership
functions have also been defined in [109, 110].
The following is an outline of the T-FRI algorithm. An illustration of its key
procedures is also provided in Fig. 8.1 in order to aid the explanation.
Figure 8.1: Procedures of T-FRI for a given antecedent dimension a_j, where the
dashed lines indicate the representative values of the fuzzy sets
1. Identification of the Closest Rules

The distance between any two rules r_p, r_q ∈ R is determined by computing
the aggregated distance between all the antecedent variable values:

d(r_p, r_q) = sqrt( Σ_{j=1}^{|A|} d(a_pj, a_qj)² ),
where d(a_pj, a_qj) = |rep(a_pj) − rep(a_qj)| / (max_{a_j} − min_{a_j})   (8.3)

Here, d(a_pj, a_qj) is the normalised result of the otherwise absolute distance
measure, so that distances are compatible with each other over different vari-
able domains. The distance between a given rule r_p and the observation o∗,
d(r_p, o∗), may be calculated in the same manner, and the two closest rules, say
r_u and r_v, are identified and used for the later interpolation process.
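Step 1 above (Eqns. 8.2 and 8.3) can be sketched as follows for triangular fuzzy sets (illustrative helpers; rule consequents are omitted for brevity):

```python
import math

def rep(fs):
    """Centre of gravity of a triangular fuzzy set (Eqn. 8.2)."""
    return sum(fs) / 3.0

def rule_distance(rp, rq, ranges):
    """Normalised aggregated distance between two rules (Eqn. 8.3).
    ranges[j] is the (min, max) domain of the j-th antecedent variable."""
    total = 0.0
    for ap, aq, (lo, hi) in zip(rp, rq, ranges):
        total += (abs(rep(ap) - rep(aq)) / (hi - lo)) ** 2
    return math.sqrt(total)

# one-antecedent rules and an observation, all on the domain [0, 10]
obs    = [(4.0, 5.0, 6.0)]
rules  = {"r1": [(0.0, 1.0, 2.0)], "r2": [(3.0, 4.0, 5.0)], "r3": [(6.0, 7.0, 8.0)]}
ranges = [(0.0, 10.0)]
closest = sorted(rules, key=lambda r: rule_distance(rules[r], obs, ranges))[:2]
print(closest)  # ['r2', 'r3']
```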
2. Construction of the Intermediate Fuzzy Rule

The intermediate fuzzy rule r′ is the starting point of the transformation process
in T-FRI. It consists of a series of intermediate antecedent fuzzy sets a′_j, and an
intermediate consequent fuzzy set z′:

r′ : if a_1 is a′_1, and · · · , and a_j is a′_j, and · · · , and a_|A| is a′_|A|, then z is z′   (8.4)

which is a weighted aggregation of the two selected rules r_u and r_v. For each
of the antecedent dimensions a_j, a ratio λ_{a_j}, 0 ≤ λ_{a_j} ≤ 1, is introduced, which
represents the contribution of a_vj towards the formation of a′_j with respect to
a_uj:

λ_{a_j} = d(a_uj, a∗_j) / d(a_uj, a_vj)   (8.5)

The intermediate antecedent fuzzy set a′_j is then computed using:

a′_j = (1 − λ_{a_j}) a_uj + λ_{a_j} a_vj   (8.6)

The position and shape of the intermediate consequent fuzzy set z′ are then
calculated in the same manner from the consequent fuzzy sets of the two rules,
z_u and z_v, where the ratio λ_z is obtained by averaging the ratios of the
antecedent variables:

λ_z = (1/|A|) Σ_{j=1}^{|A|} λ_{a_j}   (8.7)
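The construction of an intermediate antecedent fuzzy set (Eqns. 8.5 and 8.6) can be sketched as follows. Within a single variable domain, the ratio of the normalised distances in Eqn. 8.5 equals the ratio of the unnormalised ones, so plain absolute differences suffice here (helper names are hypothetical):

```python
def rep(fs):
    """Centre of gravity of a triangular fuzzy set (Eqn. 8.2)."""
    return sum(fs) / 3.0

def intermediate(a_u, a_v, a_obs):
    """Blend the antecedent values of the two closest rules towards the
    observation: Eqn. 8.5 gives the ratio, Eqn. 8.6 the blended set."""
    lam = abs(rep(a_u) - rep(a_obs)) / abs(rep(a_u) - rep(a_v))
    blended = tuple((1 - lam) * u + lam * v for u, v in zip(a_u, a_v))
    return lam, blended

# the observation sits midway between the two rule values, so lambda = 0.5
lam, a_prime = intermediate((0.0, 1.0, 2.0), (4.0, 5.0, 6.0), (2.0, 3.0, 4.0))
print(lam, a_prime)  # 0.5 (2.0, 3.0, 4.0)
```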
3. Computation of the Scale and Move Parameters

The goal of a transformation process T is to scale and move (or skew) an inter-
mediate fuzzy set a′_j, so that the transformed shape coincides with that of the
observed value a∗_j. In T-FRI, such a process is performed in two stages: 1) the
scale operation from a′_j to a″_j (denoting the scaled intermediate fuzzy set), in
an effort to determine the scale ratio s_{a_j}; and 2) the move operation from a″_j
to a∗_j, to obtain a move ratio m_{a_j}. Once these are computed for each of the
antecedent variables, the necessary parameters s_z and m_z for the consequent
variable can be approximated, in order to compute the final interpolation result z∗.
For a triangular fuzzy set a′_j = (a′_j1, a′_j2, a′_j3), the scale ratio s_{a_j} is calculated
using:

s_{a_j} = (a∗_j3 − a∗_j1) / (a′_j3 − a′_j1)   (8.8)

which essentially expands or contracts the support length of a′_j, a′_j3 − a′_j1, so that
it becomes the same as that of a∗_j. The scaled intermediate fuzzy set a″_j, which
has the same representative value as a′_j, is then acquired using the formulae
below:

a″_j1 = ( (1 + 2s_{a_j}) a′_j1 + (1 − s_{a_j}) a′_j2 + (1 − s_{a_j}) a′_j3 ) / 3
a″_j2 = ( (1 − s_{a_j}) a′_j1 + (1 + 2s_{a_j}) a′_j2 + (1 − s_{a_j}) a′_j3 ) / 3
a″_j3 = ( (1 − s_{a_j}) a′_j1 + (1 − s_{a_j}) a′_j2 + (1 + 2s_{a_j}) a′_j3 ) / 3   (8.9)

The move operation shifts the position of a″_j to be the same as that of a∗_j, while
maintaining its representative value rep(a″_j). This is made possible by using a
tailored move ratio m_{a_j}:

m_{a_j} = 3(a∗_j1 − a″_j1) / (a″_j2 − a″_j1),   if a∗_j1 ≥ a″_j1
m_{a_j} = 3(a∗_j1 − a″_j1) / (a″_j3 − a″_j2),   otherwise   (8.10)
The final positions of the triangle's three points are calculated as follows:

a∗_j1 = a″_j1 + m_{a_j} (a″_j2 − a″_j1)/3
a∗_j2 = a″_j2 − 2 m_{a_j} (a″_j2 − a″_j1)/3
a∗_j3 = a″_j3 + m_{a_j} (a″_j2 − a″_j1)/3,   if m_{a_j} ≥ 0   (8.11a)

a∗_j1 = a″_j1 + m_{a_j} (a″_j3 − a″_j2)/3
a∗_j2 = a″_j2 − 2 m_{a_j} (a″_j3 − a″_j2)/3
a∗_j3 = a″_j3 + m_{a_j} (a″_j3 − a″_j2)/3,   otherwise   (8.11b)

Note that this operation also guarantees that the resultant shape is convex and
normal.
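The scale and move operations (Eqns. 8.8–8.11) can be rendered for triangular sets as below (an illustrative sketch, not the thesis implementation; both operations preserve the representative value):

```python
def scale_ratio(a_obs, a_pr):
    """Eqn. 8.8: ratio of the observed to the intermediate support length."""
    return (a_obs[2] - a_obs[0]) / (a_pr[2] - a_pr[0])

def scale(a, s):
    """Eqn. 8.9: expand/contract the support while preserving rep(a)."""
    a1, a2, a3 = a
    return (((1 + 2 * s) * a1 + (1 - s) * a2 + (1 - s) * a3) / 3,
            ((1 - s) * a1 + (1 + 2 * s) * a2 + (1 - s) * a3) / 3,
            ((1 - s) * a1 + (1 - s) * a2 + (1 + 2 * s) * a3) / 3)

def move_ratio(a_obs, a_dp):
    """Eqn. 8.10, with a_dp the scaled intermediate set a''."""
    if a_obs[0] >= a_dp[0]:
        return 3 * (a_obs[0] - a_dp[0]) / (a_dp[1] - a_dp[0])
    return 3 * (a_obs[0] - a_dp[0]) / (a_dp[2] - a_dp[1])

def move(a, m):
    """Eqn. 8.11: shift the three points while preserving rep(a)."""
    a1, a2, a3 = a
    d = (a2 - a1) if m >= 0 else (a3 - a2)
    return (a1 + m * d / 3, a2 - 2 * m * d / 3, a3 + m * d / 3)

a_prime = (0.0, 1.0, 2.0)          # intermediate antecedent fuzzy set
a_obs   = (0.5, 1.0, 1.5)          # observed value
s = scale_ratio(a_obs, a_prime)    # 0.5
m = move_ratio(a_obs, scale(a_prime, s))
print(move(scale(a_prime, s), m))  # (0.5, 1.0, 1.5): coincides with a_obs
```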
4. Scale and Move Transformation on Intermediate Consequent Fuzzy Set
After computing the necessary parameters based on all of the observed an-
tecedent variable values, the required parameters for z′ are then determined
by averaging the corresponding parameter values:
sz =1|A|
|A|∑
j=1
sa j(8.12)
mz =1|A|
|A|∑
j=1
ma j(8.13)
A complete scale and move transformation from the initial intermediate con-
sequent fuzzy set z′ to the final interpolative output z∗, may be represented
concisely by: z∗ = T(z′, sz, mz), highlighting the importance of the two key
transformations required.
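To make the two-stage transformation concrete, the following is a minimal Python sketch of the scale and move operations for triangular fuzzy sets (Eqns. 8.8–8.11). The function names and the three-tuple representation are illustrative assumptions, not code from the thesis.

```python
def scale(a, s):
    """Scale a triangular set a = (a1, a2, a3) by ratio s (Eqn. 8.9);
    the representative value (a1 + a2 + a3) / 3 is preserved."""
    a1, a2, a3 = a
    return (((1 + 2 * s) * a1 + (1 - s) * a2 + (1 - s) * a3) / 3,
            ((1 - s) * a1 + (1 + 2 * s) * a2 + (1 - s) * a3) / 3,
            ((1 - s) * a1 + (1 - s) * a2 + (1 + 2 * s) * a3) / 3)

def move(a, m):
    """Shift the scaled set by move ratio m (Eqn. 8.11), again keeping
    the representative value unchanged."""
    b1, b2, b3 = a
    d = (b2 - b1) if m >= 0 else (b3 - b2)
    return (b1 + m * d / 3, b2 - 2 * m * d / 3, b3 + m * d / 3)

def transform(a, s, m):
    """The complete transformation, as in z* = T(z', s_z, m_z)."""
    return move(scale(a, s), m)
```

Scaling a = (1, 2, 4) with s = 2, for instance, doubles the support length from 3 to 6, while both operations leave the representative value at 7/3.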
8.1.2 Backward FRI (B-FRI)
B-FRI [128, 131] is a recently proposed extension to standard (forward) FRI. It
allows crucial missing values that directly relate to the conclusion to be inferred, or
interpolated, from the known antecedent values and the conclusion. This technique
supplements a conventional FRI process, and is particularly beneficial in the presence
of hierarchically arranged rule bases, since a normal inference or interpolation system
will be unable to proceed if certain key antecedent values (those that connect the
sub-rule bases) are missing.
An implementation of the B-FRI concept has been developed, based on the
mechanisms of T-FRI. It works by reversely approximating the scale and move trans-
formation parameters for the variables with missing values. In this chapter, the
scenario with a single missing antecedent value is considered, as efficient ways of
solving the more complex case of B-FRI with multiple missing values remain an
active research topic.
Although both forward and backward T-FRI share the same underlying analogy-
based idea, backward T-FRI has several subtle differences, such as the procedure
used to select the closest rules, and that used to compute the transformation parameters.
For instance, assume that the value of the antecedent variable a_l is missing from
the observation, whilst the conclusion z∗ can be directly observed. The distance
measurement d←(r_p, r_q) between any two rules is handled with a bias towards the
consequent variable:

d←(r_p, r_q) = √( |A| · d(z_p, z_q)² + Σ_{j=1, j≠l}^{|A|} d(a^p_j, a^q_j)² )   (8.14)
This is because the observed value for the consequent variable embeds more infor-
mation, and the weight assigned to it is equal to the number of individual antecedents,
|A|. Having identified the closest rules, the remaining steps are the same as in forward
T-FRI, except that the parameters for the missing antecedent: λ_{a_l}, s_{a_l}, and m_{a_l} are
calculated using a set of similar alternative formulae. For instance, the formula to
calculate λ_{a_l} is:

λ_{a_l} = |A| λ_z − Σ_{j=1, j≠l}^{|A|} λ_{a_j}   (8.15)

Here, the required parameter is obtained by subtracting the sum of the parameter
values of the known antecedents from that of the consequent (multiplied by a
weight of |A|). Finally, the backward interpolation result a∗_l can be obtained using
T←(a′_l, s_{a_l}, m_{a_l}).
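As a hedged illustration (not the thesis’s own code), the consequent-biased distance of Eqn. 8.14 and the backward recovery of the shift parameter in Eqn. 8.15 might be sketched as follows, assuming rules have been reduced to per-dimension scalar distances and parameters:

```python
import math

def backward_distance(ants_p, z_p, ants_q, z_q, missing):
    """Consequent-biased rule distance (Eqn. 8.14): the consequent term is
    weighted by |A|, and the missing antecedent a_l is skipped."""
    n = len(ants_p)  # |A|
    total = n * (z_p - z_q) ** 2
    total += sum((ap - aq) ** 2
                 for j, (ap, aq) in enumerate(zip(ants_p, ants_q))
                 if j != missing)
    return math.sqrt(total)

def backward_shift(lambda_z, lambdas, missing):
    """Recover the shift parameter of the missing antecedent (Eqn. 8.15):
    |A| * lambda_z minus the sum over the known antecedents."""
    n = len(lambdas)  # |A|, including the (unused) missing slot
    return n * lambda_z - sum(v for j, v in enumerate(lambdas) if j != missing)
```

With λ_z = 0.792 and known shift parameters summing to 2.171, `backward_shift` reproduces the value 1.789 computed later in Eqn. 8.23.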
8.2 Antecedent Significance-Based FRI
This section discusses the similarities and differences between the problem domain
of FS and that of FRI, and describes the approach developed that evaluates the
importance of rule antecedents using FS techniques. A weighted aggregation-based
approach is also introduced, which makes use of the antecedent significance values to
better approximate the interpolation results. The potential benefits of the proposed
technique in B-FRI are also explained.
8.2.1 From FS to Antecedent Selection
The key distinction between a standard FS problem (defined in Section 1.1) and FRI
is the presence of the continuously-valued consequent variable z, and the fact that
there are no well-defined class labels (hence the need for interpolation). From this
point of view, FRI is modelled more closely on regression than classification, and
therefore, only a select few non-class-label-dependent feature evaluators [202] can
be readily adapted for FRI, including CFS [46] and FRFS [126], which support
regression tasks by default. FRFS in particular relies on fuzzy similarity to
differentiate between two training objects. It employs a strict equivalence relation
for class labels or categorical data, but the underlying concepts (i.e., the upper and
lower approximations) may also be constructed using a real-valued “decision”
variable. CFS exploits the correlations between features and may be used for
regression-type problems.
Fig. 8.2 illustrates the general procedure of the proposed antecedent selection
approach for FRI. To achieve antecedent selection, a given feature evaluator such as
CFS or FRFS may be employed as is, once the rule base to be processed is converted
into a standard, crisp-valued data set. For this, any defuzzification mechanism may
be adopted; in this chapter, the representative value of a fuzzy set (Eqn. 8.2) is
used for this purpose. The newly created, crisp-valued data set (of antecedent values)
is then employed to train a feature (antecedent) evaluator, in order to obtain a
set of “feature evaluation scores”, or antecedent importance measurements ω′_{a_j}, j =
1, · · · , |A|, which are subsequently normalised to obtain the required significance
values:

(ω_{a_1}, · · · , ω_{a_|A|}) = ( ω′_{a_1} / Σ_{j=1}^{|A|} ω′_{a_j}, · · · , ω′_{a_|A|} / Σ_{j=1}^{|A|} ω′_{a_j} )   (8.16)
These values indicate the relevance of the underlying antecedent variables A, with
respect to the values of the consequent variable z based on the information embedded
in the rule base.
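A small sketch of this conversion-and-normalisation step, under the assumption that each fuzzy value is a triangular set whose representative value is the average of its three defining points (Eqn. 8.2); all names here are illustrative:

```python
def rep(a):
    """Representative value of a triangular fuzzy set (a1, a2, a3)."""
    return sum(a) / 3.0

def defuzzify_rule_base(rules):
    """Turn fuzzy rules (a list of (antecedents, consequent) pairs) into the
    crisp rows fed to a standard feature evaluator such as CFS or FRFS."""
    return [[rep(a) for a in ants] + [rep(z)] for ants, z in rules]

def normalise_scores(raw):
    """Normalise raw evaluator scores into significance values (Eqn. 8.16)."""
    total = sum(raw)
    return [w / total for w in raw]
```

Note that the CFS row of Table 8.2 below already sums to 1 (0.2765 + 0.2461 + 0.3312 + 0.1163 + 0.0299 = 1.0), i.e. it is in this normalised form.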
A feature subset search algorithm such as HSFS may be employed to identify a
quality antecedent subset B ⊆ A, which captures the information within the original
rule base R to a reasonable (if not the maximum) extent. R may then be pruned
to retain only the highest-quality antecedent variables, thereby producing a
reduced rule base (much like a reduced data set with irrelevant features removed).

Figure 8.2: Antecedent selection procedures
Subsequent tasks such as rule selection, fuzzy inference, or FRI may benefit greatly in
terms of accuracy and efficiency, once such redundant and noisy antecedent variables
have been eliminated.
8.2.2 Weighted Aggregation of Antecedent Significance
For a given rule base R, a set of antecedent significance values: ωa1, · · · ,ωa|A|, may
be computed, or supplied by subject experts. A weighted rule ranking strategy may
then be derived for the purpose of identifying the most suitable rules to perform
interpolation. Recall that the standard (unbiased) formula (Eqn. 8.3) adopted by T-FRI
for calculating the distance between any two given rules r_p, r_q ∈ R effectively
assumes equal significance for all involved antecedent variables. A general form of
weighted distance d may be defined by:

d(r_p, r_q) = √( Σ_{j=1}^{|A|} ω_{a_j} d(a^p_j, a^q_j)² )   (8.17)

which takes into consideration the significance value ω_{a_j} of each antecedent variable
a_j, j = 1, · · · , |A|.
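A minimal sketch of Eqn. 8.17, assuming the per-dimension distances d(a^p_j, a^q_j) have already been computed as scalars (the function name is an assumption, not thesis code):

```python
import math

def weighted_distance(dists, weights):
    """Weighted rule distance (Eqn. 8.17): each squared per-antecedent
    distance is scaled by its significance value before summing."""
    return math.sqrt(sum(w * d * d for w, d in zip(weights, dists)))
```

An antecedent whose significance is near zero then contributes almost nothing to the total, which is what allows an otherwise “distant” rule (such as r_3 in the Fig. 8.3 scenario below) to be selected.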
The use of d may allow a more flexible selection of rules. For instance, consider
the case illustrated in Fig. 8.3, with the assumption that a_1 and a_3 are antecedents
of high significance and a_2 is irrelevant (or noisy). For a given new observation
o∗, the two closest rules determined by standard T-FRI (using the unbiased distance
measure) may be r_1 and r_2. There may also exist another rule r_3 (involving the dashed
fuzzy sets) with values much closer to those of a_1 and a_3, but it has not been selected because
its overall distance to the observation is greater than that of r_2, due to its value
for a_2 being further away. Since a_2 is in fact of little importance, a weighted distance
measurement may select r_1, r_3 to perform interpolation, and the end result z∗_(1,3) may
provide a better estimate for this scenario than the result obtained using r_1 and
r_2.
As alternative rules may be selected via the use of weighted distance calculation,
the FRI mechanisms should therefore be modified in order to ensure consistency
amongst the results interpolated using different rules. In this chapter, the investiga-
tion is focused on the T-FRI method introduced in Section 8.1.1. However, the use of
the antecedent variable significance appears to be equally applicable to other types
of FRI technique, such as α-cut-based methods [142, 143, 248].
Recall that in the first step of T-FRI, the construction of the intermediate fuzzy rule
r ′ requires the set of intermediate antecedent fuzzy sets a′j, and the intermediate
consequent fuzzy set z′. A set of shift parameters λa1, · · · ,λa|A| ,λz are required, in
order to maintain the position (representative value) of r ′ on each of its antecedent
dimensions. The value of λz plays an important role in determining the initial position
of the intermediate consequent fuzzy set, which will affect the final interpolative
output. For the present problem, the calculation of λz is modified to reflect the
variations in antecedent variable significance, thereby producing a weighted shift
parameter λz:
λ_z = (1/|A|) Σ_{j=1}^{|A|} ω_{a_j} λ_{a_j}   (8.18)

which is then used to obtain the weighted intermediate consequent fuzzy set z′.

Figure 8.3: Alternative rule selection using weighted distance calculation

It is then necessary to apply the two-stage transformations to the intermediate consequent
fuzzy set z′, and the parameter values for the weighted transformations, the weighted
scale ratio s_z and move ratio m_z, are computed using:
s_z = (1/|A|) Σ_{j=1}^{|A|} ω_{a_j} s_{a_j}   (8.19)

m_z = (1/|A|) Σ_{j=1}^{|A|} ω_{a_j} m_{a_j}   (8.20)

These are modified versions of Eqns. 8.12 and 8.13, following the same principle as
that applied to the calculation of λ_z.
Finally, a complete, weighted T-FRI procedure from z′ to z∗ can be readily created
by following the transformation z∗ = T(z′, s_z, m_z). This weighted aggregation
procedure makes minimal alterations to the original T-FRI algorithm. Symbolically,
it appears identical to the conventional T-FRI method, and is therefore omitted here.
As such, the procedure maintains its structural simplicity and intuitive appeal, while
extending the capability of T-FRI.
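Eqns. 8.18–8.20 all share a single form: a significance-weighted sum of per-antecedent parameters, scaled by 1/|A|. A one-function sketch (illustrative name, not thesis code):

```python
def weighted_parameter(values, weights):
    """Weighted aggregation shared by lambda_z (Eqn. 8.18), s_z (Eqn. 8.19)
    and m_z (Eqn. 8.20): (1/|A|) * sum_j w_j * v_j."""
    return sum(w * v for w, v in zip(weights, values)) / len(values)
```

With all weights set to 1, this reduces to the unweighted averages of the standard T-FRI formulae (Eqns. 8.12 and 8.13), which is why the weighted procedure is a minimal alteration of the original algorithm.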
8.2.3 Use of Antecedent Significance in B-FRI
One of the common problems faced by a B-FRI system is the event where more
than one antecedent value is missing from an observation. It is difficult to fully
reconstruct, or even closely approximate, multiple missing values, since there may
exist a number of possible combinations of values that lead to the same conclusion.
It is also computationally complex to perform reverse reasoning with a large number
of unknowns. Antecedent selection, being a dimensionality reduction technique,
may be potentially beneficial in such situations. By identifying the more important
antecedent variables, or by removing irrelevant antecedents altogether, a priority-
based backward reasoning system may be established, which greatly simplifies the
problem. However, much of the relevant research concerning this issue is beyond the
scope of this thesis.
8.3 Experimentation and Discussion
This section provides a real-world scenario concerning the prediction of terrorist
activities, which is used to demonstrate the procedures of the proposed antecedent
significance-based approach, for both conventional T-FRI and B-FRI problems. The
accuracy and efficiency of the work is further validated via systematic evaluation
using synthetic random data.
8.3.1 Application Example
Consider a practical scenario that involves the prediction of terrorist bombing risk.
The likelihood of an explosion can be directly affected by the number of people in
the area; crowded places (of high popularity and high travel convenience) are usually
more likely to attract terrorist attention. Safety precautions such as police patrols
may also be very important factors: the more alert and prepared a place is, the fewer
opportunities there are for terrorists to attack. Other aspects such as temperature
and humidity may be of relevance, but their impact on the potential outcome is
limited. Table 8.1 lists a few example linguistic rules that may be derived for such a
scenario.
Table 8.1: Example linguistic rules for terrorist bombing prediction (M. for Moderate, V. for Very)

     popularity | convenience | patrol  | temperature | humidity | risk
     a1         | a2          | a3      | a4          | a5       | z
r1   V. Low     | V. Low      | V. High | M.          | High     | V. Low
r2   V. Low     | V. High     | V. Low  | High        | Low      | V. Low
r3   M. High    | M.          | Low     | M.          | High     | M.
r4   M.         | M.          | M.      | Low         | Low      | M. Low
r5   M. High    | Low         | M. Low  | M.          | High     | High
r6   High       | V. Low      | High    | V. Low      | Low      | V. Low
r7   High       | High        | M.      | M.          | High     | M. Low
r8   High       | High        | V. Low  | Low         | Low      | V. High
The correlation-based FS (CFS) [93] and the fuzzy-rough set-based FS (FRFS)
techniques are employed in the experiment, and the antecedent significance values
obtained using the two respective methods are presented in Table 8.2. Both feature
evaluators agree that temperature and humidity are relatively less important than
the other three antecedent variables. CFS in particular assigns a weight of ω_{a_5} = 0.0299
to humidity, signifying its relative lack of relevance in this rule base. The ranking
of importance for the major antecedent variables is a_3 > a_1 > a_2 when CFS is used.
The resultant ranking determined by FRFS is similar, though it gives convenience
(a_2) a higher significance score.
Table 8.2: Antecedent significance values determined by CFS and FRFS

       ω_a1     ω_a2     ω_a3     ω_a4     ω_a5
CFS    0.2765   0.2461   0.3312   0.1163   0.0299
FRFS   0.2220   0.3228   0.2904   0.0833   0.0814
8.3.1.1 FRI Example
Suppose that a new observation o∗ is presented for interpolation; its linguistic values,
and the underlying semantics in terms of triangular fuzzy sets, are given in Table 8.3.
The rules selected using the standard T-FRI process, the antecedent significance-based
weighted distance metric, and the reduced rule base are also provided. The two
closest rules selected following the standard T-FRI process are close to the observed
values on all antecedent dimensions. However, if antecedent significance values
are taken into consideration, alternative rules will be selected. For the two rules
selected according to CFS, large differences in values can be observed for the variable
humidity (a5), which is likely caused by its very low significance value, as shown
previously in Table 8.2.
Table 8.3: Example observation (linguistic terms and fuzzy set representations), and the closest rules selected by standard T-FRI, and by weighted T-FRI with values determined using CFS and FRFS

              popularity (a1)   convenience (a2)  patrol (a3)      temperature (a4)  humidity (a5)    risk (z)
o∗            High              High              M. High          Low               M. High          ?
o∗            (8.0, 8.5, 9)     (5.8, 7.5, 8)     (5.0, 5.5, 6.0)  (1.5, 2.0, 3)     (5.5, 6.0, 6.5)  ?
Standard r1   (8, 8.3, 8.4)     (8.4, 8.6, 9.1)   (5.4, 5.9, 6.2)  (3.5, 3.7, 4)     (6.3, 6.9, 7.2)  (2.9, 3, 3.3)
Standard r2   (9.7, 9.8, 10.4)  (4.9, 5.4, 5.5)   (1.7, 2.1, 2.2)  (1.2, 1.2, 1.8)   (3.7, 4.3, 4.4)  (4.3, 5, 5.4)
CFS r1        (8.5, 9.1, 9.8)   (8.1, 8.6, 9)     (5, 5.4, 6.1)    (1.8, 2.3, 2.4)   (2.4, 3, 3.1)    (2.7, 3, 3.2)
CFS r2        (6, 6.6, 6.8)     (5.3, 5.9, 6.1)   (7.4, 7.6, 7.8)  (1.7, 2.1, 2.8)   (9.2, 9.6, 10.2) (1.4, 2, 2.5)
FRFS r1       (8.9, 9.3, 9.8)   (7.2, 7.6, 8.3)   (6.3, 6.3, 6.4)  (0.6, 0.7, 1.3)   (4.1, 4.5, 4.7)  (2.6, 3, 3.7)
FRFS r2       (6.9, 6.9, 7)     (3.9, 4.2, 4.7)   (3.6, 3.9, 4.6)  (2.8, 3.5, 4.3)   (7.9, 7.9, 8.1)  (2.8, 3, 3.4)
The detailed calculations of the T-FRI transformations are omitted here to save
space, as they are easily conceived. The final interpolative result of standard T-FRI
is z∗ = (3.1, 3.6, 4.4), following a transformation of T(z′ = (3.4, 3.7, 4), s_z = 2.163,
m_z = 0.003). Using the weights determined by CFS, the result is z∗ = (2.2, 2.8, 3.2);
the weighted transformation is:

T(z′ = (2.4, 2.8, 3), s_z = 1.479, m_z = −0.101)   (8.21)

The result obtained based on FRFS is z∗ = (2.2, 3.2, 4), with the corresponding weighted
transformation shown below:

T(z′ = (2.7, 3, 3.6), s_z = 2.084, m_z = −0.337)   (8.22)
Conceptually speaking, although the area in question may be crowded, due to
the place being popular and convenient to reach, the risk of an attack should be quite
low. This is because the level of alert is moderately high, and the two weather-related
factors (despite being less significant), low temperature and high humidity, may
further discourage any potential activities. One of the example rules listed in Table
8.1, r7, describes a fairly similar event, where the consequent value is given as M.
Low. Based on these, the results obtained via weighted aggregation: (2.2, 2.8, 3.2)
(Low) via CFS, and (2.2, 3.2, 4) (Low) via FRFS, are more intuitively agreeable than
that produced by standard T-FRI: (3.1, 3.6, 4.4) (M. Low).
8.3.1.2 B-FRI Example
For the B-FRI scenario, suppose that a given observation o∗ has a missing value for the
antecedent variable patrol (a_3), whilst the consequent variable risk (z) is directly
observed. Table 8.4 lists the observation o∗, and the different rules selected by the re-
spective approaches. Note that both the CFS- and FRFS-based weighted distance metrics
select the same two closest rules, both of which differ from those selected by standard
B-FRI. For the standard B-FRI method, the values of the required parameters for the
missing antecedent variable are computed based on those of the known antecedents
and the consequent variable. For example, following Eqn. 8.15, λ_{a_3} may be calculated:

λ_{a_3} = 5λ_z − Σ_{j=1, j≠3}^{5} λ_{a_j} = 5 × 0.792 − 2.171 = 1.789   (8.23)

which then constructs an intermediate fuzzy term a′_3 = (0.2, 1.2, 2.2). Both s_{a_3} and
m_{a_3} are computed similarly to λ_{a_3}, and finally, the backward transformation T← given
in Eqn. 8.24 is derived, which provides the final B-FRI output of V. Low:

a∗_3 = T←((0.2, 1.2, 2.2), 0.400, −0.172) = (0.7, 1.2, 1.5)   (8.24)
To avoid unnecessary repetition, the detailed procedures to compute the weighted
B-FRI outputs are omitted. The CFS-based antecedent significance values yield a
weighted B-FRI transformation as shown below:

a∗_3 = T←((1.5, 2.5, 3.5), 0.1389, 0.0185) = (2.4, 2.5, 2.7) (Low)   (8.25)

while the FRFS-based method calculates slightly differently, resulting in the following
backward interpolative outcome:

a∗_3 = T←((1.7, 2.7, 3.7), 0.0807, 0.0714) = (2.6, 2.7, 2.8)   (8.26)

which may also be interpreted with a linguistic meaning of Low.
Table 8.4: Example observation (both linguistic terms and fuzzy set representations), and the closest rules selected by standard B-FRI, and by weighted B-FRI with values determined using CFS and FRFS

              popularity (a1)   convenience (a2)  patrol (a3)      temperature (a4)  humidity (a5)    risk (z)
o∗            High              High              ?                Low               M. High          M. High
o∗            (8.0, 8.5, 9)     (5.8, 7.5, 8)     ?                (1.5, 2.0, 3)     (5.5, 6.0, 6.5)  (5.1, 5.8, 6.4)
Standard r1   (8.7, 9.7, 10.7)  (5.4, 6.4, 7.4)   (0.9, 1.9, 2.9)  (0.2, 1.2, 2.2)   (6.7, 7.7, 8.7)  (4.6, 5.6, 6.6)
Standard r2   (7.5, 8.5, 9.5)   (6.9, 7.9, 8.9)   (0.5, 1.5, 2.5)  (7.2, 8.2, 9.2)   (3.9, 4.9, 5.9)  (4.8, 5.8, 6.8)
CFS r1        (7.7, 8.7, 9.7)   (5.9, 6.9, 7.9)   (4.1, 5.1, 6.1)  (1.0, 2.0, 3.0)   (3.8, 4.8, 5.8)  (2.3, 3.3, 4.3)
CFS r2        (6.7, 7.7, 8.7)   (7.5, 8.5, 9.5)   (0.0, 0.8, 1.8)  (3.0, 4.0, 5.0)   (5.2, 6.2, 7.2)  (5.7, 6.7, 7.7)
FRFS r1       (7.7, 8.7, 9.7)   (5.9, 6.9, 7.9)   (4.1, 5.1, 6.1)  (1.0, 2.0, 3.0)   (3.8, 4.8, 5.8)  (2.3, 3.3, 4.3)
FRFS r2       (6.7, 7.7, 8.7)   (7.5, 8.5, 9.5)   (0.0, 0.8, 1.8)  (3.0, 4.0, 5.0)   (5.2, 6.2, 7.2)  (5.7, 6.7, 7.7)
Note that all of the observed values, except for patrol and risk, are the same as in the previous
observation used to demonstrate forward T-FRI. This narrows down the reason
why risk has jumped from Low to M. High: the level of patrol in the area.
Intuitively, for a highly crowded area, if very little patrol is present (as suggested by
the result of standard B-FRI: V. Low), the resultant value of risk should become V. High.
Therefore, having a Low level of patrol may be a more appealing approximation.
8.3.2 Systematic Evaluation
To evaluate the proposed antecedent selection approach and its effectiveness in
antecedent significance aggregation, a numerical test function with 15 variables
(|A| = 15) is used. Such a systematic test is important to validate the consistency,
accuracy, and robustness of the developed approach. This is because random samples
may be generated from a controlled environment, where the ground truths are also
available to verify the correctness of the interpolation results. These tests share a
similar underlying principle to that of cross-validation and statistical evaluation
[21, 151].
8.3.2.1 FRI Results
The results shown in Table 8.5 are the averaged outcomes of 200 randomised runs. By
employing the weighted aggregation scheme based on the antecedent significance
values, both the mean error and the standard deviation are considerably improved. The
results obtained according to FRFS appear to have a slightly higher mean error
and a wider spread; however, a t-test (p = 0.01) shows that the difference is not
statistically significant. The improvement is more evident when the original rule
base is simplified by removing the redundant antecedent variables.
Table 8.5: Evaluation of proposed approaches for standard FRI

                   Mean error %   S.D. %
Standard T-FRI     7.32           6.15
Weighted by CFS    5.33           4.69
Weighted by FRFS   5.68           5.16
Reduced by CFS     3.38           3.01
Reduced by FRFS    3.33           2.63
The antecedent subset selected by CFS is {a_0, a_4, a_7, a_13}, a reduction of 73%
in the number of variables, which achieves a mean error of 3.38%; the subset
selected by FRFS is {a_1, a_4, a_7, a_9, a_13}, which, with a reduction of 67%, helps to obtain a
mean error of 3.33%. Both evaluators yield reasonable reduction results, and the
interpolation error (compared to the numerical function’s true output) is also much
lower than that of standard and weighted T-FRI.
8.3.2.2 B-FRI Results
The same numerical test function adopted in Section 8.3.2 is used again to verify
the performance gain for B-FRI problems. A randomly selected antecedent variable
is set to be missing per test iteration. This “missing” variable index is
drawn from {a_0, a_4, a_7, a_13} ∩ {a_1, a_4, a_7, a_9, a_13} = {a_4, a_7, a_13}, the
intersection of the two antecedent subsets identified by CFS and FRFS, respectively.
This allows direct comparison between the different techniques.
The proposed weighted aggregation scheme, and the antecedent-selected rule
base, are then used to reconstruct the original values. In this set of experiments,
the error is calculated with respect to the actual antecedent variable value that has
been intentionally removed to simulate the B-FRI environment. The mean error and
standard deviation of the 200 simulated tests are given in Table 8.6.
The number of antecedent variables involved is quite large and presents a con-
siderable challenge for precise backward reasoning. The original B-FRI approach
achieves an 18.20% mean error, while the accuracy is slightly improved when weighted
aggregation is used. Based on the simplified rule bases reduced by CFS and FRFS, the
Table 8.6: Evaluation of proposed approaches for B-FRI

                   Mean error %   S.D. %
Standard B-FRI     18.20          19.40
Weighted by CFS    16.94          19.15
Weighted by FRFS   17.59          18.76
Reduced by CFS     8.45           13.70
Reduced by FRFS    6.93           13.70
mean interpolation error is notably improved, with mean errors of 8.45% and 6.93%,
respectively. Furthermore, the quality of the output is also more stable, with the
standard deviation dropping from the original 19.40% to 13.70% in both cases,
demonstrating the benefits of the reduced rule base for B-FRI.
8.4 Summary
This chapter has presented a new FRI approach that exploits FS techniques in order
to evaluate the importance of antecedent variables. A weighted aggregation-based
interpolation method is proposed that makes use of the identified antecedent sig-
nificance values. The original rule base may also be simplified by removing the
irrelevant or noisy antecedents using an FS search algorithm such as HSFS,
retaining an antecedent subset of a much lower dimensionality. Example scenarios and
systematic tests are employed to demonstrate the potential benefits of the work, for
both conventional and B-FRI problems. The resultant antecedent significance-based
FRI technique is both technically sound and conceptually appealing, as humans often
(automatically) screen out seemingly irrelevant antecedents, and focus on more
important factors in order to perform reasoning. A discussion regarding possible
future improvements of the present work is given in Section 9.2.1.4.
Chapter 9
Conclusion
THIS chapter presents a high-level summary of the research detailed in the
preceding chapters. Having reviewed and compared the work against the relevant
approaches in the literature, the thesis has demonstrated that the developed HSFS
algorithm has utilised HS effectively for the task of FS. The proposed modifications to
HSFS further enhance the efficacy of the algorithm, improving both the compactness
and evaluation quality of the discovered feature subsets. A number of theoretical
areas have also been identified that exploit the stochastic behaviour of HSFS. The
capabilities and potential of the developed applications have been experimentally
validated, and compared with either the original approaches, or relevant techniques
in the literature. The chapter also presents a number of initial thoughts about the
directions for future research.
9.1 Summary of Thesis
A survey of ten different NIMs has been given in Chapter 2, which covers FS ap-
proaches derived from both classic stochastic algorithms and other cutting-edge
techniques. The key common notions and mechanisms of the reviewed algorithms
have been extracted, and a unified style of notation has been adopted, with
pseudocode included. While conducting the review, several techniques including ABC
and FA have also been modified considerably, in order to utilise a wide range of
feature subset evaluation measures, and to improve their search performance.
HSFS (as described in Chapter 3) is a successful application of HS for the problem
of FS. The initial development of the algorithm has been introduced that facilitates
binary-valued feature subset representation, which is also the common choice for
most nature-inspired approaches. Evolving from its initial forms, the HSFS method
makes use of an integer-valued encoding scheme. It maps the notion of musicians
in HS to arbitrary “FS experts” or “individual feature selectors”, and offers a more
flexible platform for the underlying stochastic approach of HS.
To overcome the drawbacks of static parameters employed in the original HS
method, deterministic parameter control rules have been introduced. Procedures to
iteratively refine the size of the emerging feature subsets have also been presented.
Both of these modifications contribute towards a more flexible, data-oriented ap-
proach, in order to encourage the search process to identify good solutions efficiently.
The proposed HSFS algorithm is a generic technique that can be used in conjunc-
tion with alternative filter-based [164], wrapper-based [107], and hybrid feature
subset evaluation techniques [22, 23]. Owing to the underlying randomised and
yet simple nature, the entire solution space of a given problem may be examined by
running the HSFS algorithm in parallel. This will help to reveal a number of quality
solutions much more quickly than random search or exhaustive search methods. The
ability to identify multiple good solutions is of particular importance for ensemble
learning, as the alternative subsets of features may create distinctive views of the
problem at hand, thereby enabling a diverse feature subset-based classifier ensemble
to be built.
An intuitive usage of the stochastic characteristics of HSFS is the OC-FSE method
(as described in Chapter 4). Different diversification methods have been investi-
gated, in an effort to efficiently construct the base pool of classifiers, including
stochastic search, data partitioning, and mixtures of FS algorithms. The resultant system
outperforms single classifiers, and is more concise and efficient than ordinary (non-
OC-based) ensembles. The HSFS algorithm (and FS technique in general) has been
utilised further in Chapter 5 for the purpose of pruning redundant base classifiers.
FS is performed on artificially generated data sets, which are transformed from
ensemble training outputs, in order to identify and remove irrelevant or redundant
classifiers. Unsupervised FS has also been utilised in the study as a means to discover
redundancy without resorting to the examination of class labels.
To deal with scenarios where data may be dynamically changing, an extension
to the (static) HSFS algorithm has been devised in Chapter 6. D-HSFS provides
the additional functionality which adapts to the changes that occur during train-
ing, including events of feature addition, feature removal, instance addition, and
instance removal. D-HSFS is particularly powerful for resolving situations where
arbitrary combinations of the above mentioned scenarios happen simultaneously. A
modified, adaptive FSE has also been proposed that improves the predictive accuracy
of the concurrently trained classifier learners. The resultant approach is an adaptive
framework that is able to evolve along with the dynamic data set.
The use of FS in rule-based systems has been studied in Chapters 7 and 8. Both
theoretical areas benefit from the use of HSFS. For fuzzy-rough rule induction in
particular, the rule base processed by HS is both compact (low number of rules),
and concise (small cardinality of individual rule antecedents), whilst maintaining a
full coverage of the training objects. It has been shown that both conventional FRI
and B-FRI methods may become more efficient, when the less relevant antecedents
have been correctly identified and assigned with lower significant values. A higher
interpolative accuracy can also be achieved when additional information obtained
from FS is utilised.
The FS performance of HSFS, its improvements and applications, both theoretical
and practical, have been experimentally evaluated, and systematically compared
to the relevant techniques in the literature. The results of experimentation have
demonstrated that HSFS is particularly effective at reducing the size of feature subsets,
while its ability to optimise the evaluation score is on par with the other methods. In
addition, it has been shown that the proposed modifications to HSFS are beneficial
in improving the quality of the selected feature subsets.
9.2 Future Work
Although promising, much can be done to further improve the work presented so
far in this thesis. The following addresses a number of interesting issues whose
successful resolution will help strengthen the current research.
9.2.1 Short Term Tasks
This section discusses extensions and tasks that could be readily implemented if
additional time were available.
9.2.1.1 HSFS
Although a preliminary convergence detection mechanism for HS has been suggested
in [290], allowing the algorithm to detect convergence (based on the frequency of
updates to the best solution within the harmony memory) and to self-terminate,
it would be useful to develop a more advanced stopping criterion, utilising the
overall quality of the entire harmony memory and additional states of the search
process. This way, the run-time efficiency will become self-adaptive to the problem
at hand, and a further performance improvement can therefore be expected. More
intelligent iterative refinement procedures, alternative to that described in Section 3.4.2, may
be developed. The purpose is not just to encourage the discovery of more compact
feature subsets, but to find them in much shorter time. The harmony memory
consolidation procedure suggested in [290] is a step in this particular direction,
which can achieve feature subset size reduction with a smaller number of iterations.
The subset evaluators employed in the experimental evaluation indicate that
certain methods are more biased than the rest towards maintaining end classification
accuracy, or towards minimising the resultant feature subset size. Further investiga-
tion is thus necessary such that better group-based approaches (such as FSEs) may
be developed by combining these evaluation measures. As pointed out in Section
3.3.2, feature relevancy measurements such as correlation [93] and fuzzy-rough
dependency [126] may be utilised to identify more relevant neighbours, so that
the stochastic mechanisms controlled by the pitch adjustment rate and fret-width
parameters may be better exploited.
9.2.1.2 FSE
Currently, despite the effort vested in developing adaptive techniques, the number
of base FS components (i.e., the size of the FSE) needs to be predefined. The
construction process should, however, be able to automatically “recruit” or “fire” base
FS components according to the complexity of the problem data, in a similar fashion
as that used in the harmony consolidation process of HSFS. An extreme case for this
would be the situation where the data set contains only one optimal feature subset,
which may be handled by a single component, thereby eliminating the necessity
of employing a group-based approach (equivalently, shrinking the ensemble size
to one). To achieve this, enhancing the methods developed for CER is of strong
relevance.
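The recruit/fire behaviour might be sketched as below. The agreement measure, the thresholds, and the `spawn` callback are all illustrative assumptions rather than the thesis's construction process:

```python
def resize_ensemble(components, spawn, agreement, low=0.6, high=0.95,
                    min_size=1, max_size=25):
    """Illustrative recruit/fire rule: high agreement between the base FS
    components (e.g. mean pairwise subset overlap in [0, 1]) suggests
    redundancy, so one component is "fired"; low agreement suggests the
    data still rewards diversity, so a fresh one is "recruited"."""
    if agreement > high and len(components) > min_size:
        components.pop()            # fire: the ensemble is redundant
    elif agreement < low and len(components) < max_size:
        components.append(spawn())  # recruit: diversity is still needed
    return components
```

In the extreme case mentioned above, where the data contains a single optimal subset, repeated firing would shrink the ensemble towards a single component.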
A thorough interpretation of the underlying problem domain, given a very large
data set, may become infeasible for many real-world applications and hence, the
amount of labelled training samples is often limited. This makes unsupervised FS
algorithms [22, 23, 169] and semi-supervised learning techniques [170] potentially
beneficial and desirable. FSEs could help in better identifying correlated (similar)
groups of features [120], rather than individually important features under such
circumstances. Feature grouping techniques are potentially beneficial to computa-
tionally complex FS methods such as FRFS, where FS may be performed on the basis
of predefined groups. This will not only improve the time taken to process large
data, but also potentially generate better subsets with less internal redundancy.
9.2.1.3 Rule Induction
Existing improvements to HSFS may be validated to reveal their effectiveness for the
problem of rule induction. Investigations into how the parameters of HarmonyRules
can be better tuned [173, 257, 290] are of particular interest. It may also be beneficial
to perform an in-depth analysis of the underlying theoretical characteristics of the
learning mechanism, such as scalability. As the current approach treats training
objects as musicians, an alternative structure may be necessary in order to cope with
huge data sets, where the large number of objects will affect the search performance.
Although the scalability of HS itself has been studied in the literature [49, 257], a
divide-and-conquer approach, or hierarchically structured HS components, may
further improve the performance. Additional examination will be helpful in utilising
the pool of discovered rule sets/feature subsets. A fuzzy-rough rule-based ensemble
similar to those constructed for FSE [61] may be formed, where the subsets may be
used to generate partitions of the training data in order to build diverse classification
models.
9.2.1.4 FRI
The present antecedent selection approach for FRI can be improved further by
considering unsupervised or semi-supervised FS methods [170, 169, 288], which
have emerged recently for analysing the inter-dependencies between features without
the aid of class information. Current work in B-FRI also requires an exhaustive search
for suitable parameter values to perform reverse reasoning; a heuristic-based method
employing algorithms such as HS may greatly speed up the process. Although
generic in concept, the current implementation of the antecedent significance-based
aggregation approach is strongly coupled with the T-FRI method. It is worth further
extending the principles behind weighted aggregation to alternative FRI methods,
thereby providing a potentially more flexible framework for efficient interpolation.
Fuzzy aggregation functions [23, 191] may be of particular assistance in realising such a
task.
9.2.2 Long Term Developments
This section proposes several future directions that could each form the basis of a
much more significant piece of research.
9.2.2.1 Hierarchical HS for FS and General Purpose Optimisation
Research into hierarchically structured HS for high dimensional and large FS problems
may be beneficial to the further development of the work presented in this thesis.
Such a theoretical extension is potentially applicable to a wide range of applications,
not limited to FS. The idea behind multi-layered HS originates from the concept of
orchestras, where the current “flat” HS can be seen as a band. An orchestra consists
of multiple sub-sections, including string, wind, brass, and percussion instruments,
and is typically led by a conductor. Depending on the type of music being performed,
additional players, solo performers, or alternative arrangements of sections may be
introduced.
Hierarchical HS may utilise locally organised search operations which help to
detect similar or related features. Effective feature grouping may lead to substantial
reduction of the complexity of any subsequent search process (imagine it being a
pre-processing step for FS). The restricted feature domain may take advantage of
such groupings to provide stronger informative hints to the feature selectors. Lower
tiered search processes can be focused on different evaluation criteria, or artificially
injected preferences, while meta-level procedures oversee the progress of the overall
search, so that both macro- and micro-level control are achieved.
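A minimal sketch of the lower-tier grouping step could look like the following, assuming only a pairwise feature similarity measure (e.g. correlation) is available. Everything here is illustrative, not a specification of the proposed hierarchy:

```python
def group_features(similarity, n_features, threshold=0.8):
    """Greedy lower-tier grouping: place each feature into the first group
    whose seed feature it resembles; otherwise start a new group."""
    groups = []
    for f in range(n_features):
        for g in groups:
            if similarity(f, g[0]) >= threshold:
                g.append(f)
                break
        else:
            groups.append([f])  # f seeds a new group
    return groups

def representatives(groups, relevance):
    """Restricted feature domain for the upper-tier search: the most
    relevant member of each group."""
    return [max(g, key=relevance) for g in groups]

# Toy similarity: features of the same parity are treated as "correlated".
parity_sim = lambda a, b: 1.0 if a % 2 == b % 2 else 0.0
groups = group_features(parity_sim, 5)
```

The upper-tier search would then operate only over the representatives, realising the reduction in search complexity anticipated above.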
9.2.2.2 Theoretical Developments of Dynamic FS and A-FSE
A key concern of many real-world applications is the responsiveness of the FS mecha-
nisms, where an in-depth investigation of the relevant application-specific techniques
[5] may reveal even more reactive methods. The internal mechanisms of the HSFS
algorithm may also be further exploited to improve its efficiency. In particular, the
possibility of building an A-FSE out of the candidate solutions stored within the
harmony memory, rather than employing multiple, simultaneous searches is worth
exploring. Multiple FS criteria may also be utilised simultaneously in an effort to
identify better dynamic feature subsets. For this, ideas developed for multi-objective
optimisation [286] may be exploited. The current work has not yet considered the
scenario where additional class labels may be revealed (or different labels may be
assigned to the existing objects) during the dynamic learning process. How the
proposed approach may be further extended to handle dynamic rule learning [188]
remains an active research question, as does the development of hybrid or embedded
models for a closer integration between dynamic FS and classification.
9.2.2.3 Descriptive CER and its Applications
The formulation of alternative transformation procedures for producing the decision
matrix is of particular interest for the development of descriptive CER. Many state-
of-the-art classifiers are capable of producing a likelihood distribution governing
the chance that a particular instance may belong to a certain class, where the class
with the highest probability is usually taken as the final prediction. This probability
distribution may contain more information, and is potentially more suitable to be
utilised as the artificial feature values (rather than the final prediction alone, as within
the current approach). Other statistical information regarding the classifiers, such as bias and
variance, may also be used to construct additional artificially generated features, in
order to create a more comprehensive artificial data set for FS-based CER.
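As a sketch, such a soft decision matrix could be assembled as below, assuming each base classifier exposes a `predict_proba`-style method returning a class distribution. The interface and the stub classifier are illustrative assumptions, not the thesis implementation:

```python
class StubClassifier:
    """Illustrative stand-in for a trained probabilistic classifier."""
    def __init__(self, p):
        self.p = p

    def predict_proba(self, x):
        # Two-class distribution, ignoring the instance for simplicity.
        return [self.p, 1.0 - self.p]

def decision_matrix(classifiers, instances):
    """Each instance becomes a row of artificial feature values: the full
    class-probability distribution of every base classifier, concatenated,
    rather than a single hard class prediction per classifier."""
    rows = []
    for x in instances:
        row = []
        for clf in classifiers:
            row.extend(clf.predict_proba(x))  # distribution, not argmax
        rows.append(row)
    return rows

matrix = decision_matrix([StubClassifier(0.2), StubClassifier(0.9)],
                         [[1.0], [2.0], [3.0]])
```

FS-based CER would then select columns (and hence classifiers) from this artificial data set.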
The approaches developed following this principal direction may be applicable
to problems involving substantially larger data sets (when compared to those
investigated in Chapter 5), such as Martian rock classification [222, 224], weather
forecasting [216], and intelligent robotics [161, 176, 263]. These areas present
significant challenges to the existing FS and classification algorithms; addressing
them will help to better understand and validate the characteristics of the employed
methods. Investigations into the underlying reasons why different FS techniques
deliver distinctive characteristics in CER will also be beneficial, either to simplify
the complexity of the learnt ensembles, or to improve the overall classifier ensemble
accuracy.
9.2.2.4 A-FSE for Weather Forecasting
One of the most challenging application problems that require the assistance of FS is
weather forecasting [111, 214]. Traditional weather forecasting has been built on
a foundation of deterministic modelling. The forecast typically starts with certain
initial conditions, puts them into a sophisticated computational model, and ends
with a prediction about the forthcoming weather. Ensemble-based forecasting [86]
was first introduced in the early 1990s. In this method, results of (up to hundreds of)
different computer runs, each with slight variations in starting conditions or model
assumptions, are combined to derive the final forecast. As with statistical techniques,
ensembles may provide more accurate statements about the uncertainty in daily and
seasonal forecasting.
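The ensemble principle described above can be illustrated with a toy combination of perturbed runs into a mean forecast plus an uncertainty spread; the function name and the chosen statistics are for illustration only:

```python
def ensemble_forecast(runs):
    """Combine the outputs of many perturbed model runs (e.g. predicted
    temperatures) into a mean forecast and a spread that quantifies the
    uncertainty of the prediction."""
    n = len(runs)
    mean = sum(runs) / n
    # Population standard deviation of the runs: a simple uncertainty proxy.
    spread = (sum((r - mean) ** 2 for r in runs) / n) ** 0.5
    return mean, spread
```

A large spread signals disagreement among the perturbed runs, which is exactly the kind of uncertainty statement ensembles add over a single deterministic forecast.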
In particular, weather forecasting deals with data sources that are constantly
changing. The data volume may grow both in terms of attributes and objects, whilst
historical information may also become invalid or irrelevant over time. The A-FSE
approach developed in this thesis can actively form and refine ensembles in a
dynamic environment, in an effort to maintain the precision and effectiveness of the
extracted knowledge. Such a technique may be further generalised to the prediction
of natural disasters, and unusual, severe, or unseasonal weather (commonly referred
to as extreme weather) that lies at the extremes of historical distributions. A
considerable amount of effort will be needed to establish an adaptive system that
can handle real forecasting problems of extremely high complexity. However, the
work developed in this thesis may offer useful insight into such further development.
Appendix A
Publications Arising from the Thesis
A number of publications have been generated from the research carried out within
the PhD project. Listed below are the publications of close relevance to the thesis,
including both papers already published and articles submitted for review.
A.1 Journal Articles
1. R. Diao, F. Cao, Peng, N. Snooke, and Q. Shen, Feature Selection Inspired
Classifier Ensemble Reduction [57], IEEE Transactions on Cybernetics, 10 pp.,
in press.
2. S. Jin, R. Diao, and Q. Shen, Backward Fuzzy Rule Interpolation [129], IEEE
Transactions on Fuzzy Systems, 14 pp., in press.
3. R. Diao and Q. Shen, Feature Selection with Harmony Search [62], IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42,
no. 6, pp. 1509–1523, 2012.
4. R. Diao and Q. Shen, Adaptive Feature Selection Ensemble for Dynamic Data:
A Harmony Search-Based Approach, 12 pp., submitted.
5. R. Diao and Q. Shen, Occurrence Coefficient Threshold-Based Feature Subset
Ensemble, 12 pp., submitted.
6. R. Diao and Q. Shen, Nature Inspired Feature Selection Meta-Heuristics, 27
pp., submitted.
7. L. Zheng, R. Diao, and Q. Shen, Self-Adjusting Harmony Search-based Feature
Selection, 11 pp., submitted.
A.2 Book Chapter
8. Q. Shen, R. Diao, and P. Su, Feature Selection Ensemble [227], Turing Centenary,
pp. 289–306, 2012.
A.3 Conference Papers
9. R. Diao, S. Jin, and Q. Shen, Antecedent Selection in Fuzzy Rule Interpolation
using Feature Selection Techniques, submitted.
10. L. Zheng, R. Diao, and Q. Shen, Efficient Feature Selection using a Self-Adjusting
Harmony Search Algorithm [290], Proceedings of the 13th UK Workshop on
Computational Intelligence, 2013.
11. R. Diao, N. Mac Parthaláin, and Q. Shen, Dynamic Feature Selection with
Fuzzy-Rough Sets [58], Proceedings of the 22nd IEEE International Conference
on Fuzzy Systems, 2013.
12. S. Jin, R. Diao, C. Quek, and Q. Shen, Backward Fuzzy Rule Interpolation
with Multiple Missing Values [128], Proceedings of the 22nd IEEE International
Conference on Fuzzy Systems, 2013.
13. R. Diao and Q. Shen, A Harmony Search Based Approach to Hybrid Fuzzy-rough
Rule Induction [63], Proceedings of the 21st IEEE International Conference on
Fuzzy Systems, pp. 1–8, 2012.
14. S. Jin, R. Diao, and Q. Shen, Backward Fuzzy Interpolation and Extrapolation
with Multiple Multi-antecedent Rules [131], Proceedings of the 21st IEEE
International Conference on Fuzzy Systems, pp. 1–8, 2012.
15. R. Diao and Q. Shen, Fuzzy-rough classifier ensemble selection [61], Pro-
ceedings of the 20th IEEE International Conference on Fuzzy Systems, pp.
1516–1522, 2011.
16. S. Jin, R. Diao, and Q. Shen, Towards Backward Fuzzy Rule Interpolation
[130], Proceedings of the 11th UK Workshop on Computational Intelligence,
2011.
17. R. Diao and Q. Shen, Two New Approaches to Feature Selection with Harmony
Search [60], Proceedings of the 19th IEEE International Conference on Fuzzy
Systems, pp. 3161–3167, 2010.
18. R. Diao and Q. Shen, Deterministic Parameter Control in Harmony Search [59],
Proceedings of the 10th UK Workshop on Computational Intelligence, 2010.
Appendix B
Data Sets Employed in the Thesis
The data sets employed in the thesis are mostly publicly available benchmark data,
obtained through the UCI machine learning repository [78], which have been drawn
from real-world problem scenarios. Table B.1 provides a summary of the properties
of these data sets. Their underlying problem domains are described in detail below,
where the URLs of the respective data sets are also given in order to facilitate easy
access.
Table B.1: Information of data sets used in the thesis

Data set   Features   Instances   Classes
arrhy      279        452         16
cleve      14         297         5
ecoli      8          336         8
glass      10         214         6
handw      256        1593        10
heart      14         270         2
ionos      35         230         2
isole      617        7797        26
libra      91         360         15
multi      650        2000        10
olito      25         120         4
ozone      73         2534        2
secom      591        1567        2
sonar      60         208         2
water      39         390         3
water2     39         390         2
wavef      40         699         2
web        2556       149         5
wine       13         178         3
• Arrhythmia (arrhy)
http://archive.ics.uci.edu/ml/datasets/Arrhythmia
This database contains 279 attributes, 206 of which are linear valued and the
rest are nominal [78]. “The aim is to distinguish between the presence and
absence of cardiac arrhythmia and to classify it in one of the 16 groups. Class 01
refers to ’normal’ ECG classes 02 to 15 refers to different classes of arrhythmia
and class 16 refers to the rest of unclassified ones. For the time being, there
exists a computer program that makes such a classification. However there are
differences between the cardiolog’s and the programs classification. Taking the
cardiolog’s as a gold standard we aim to minimise this difference by means of
machine learning tools.” [90]
• Cleveland Heart Disease Data Set (cleve)
http://archive.ics.uci.edu/ml/datasets/Heart+Disease
“This database contains 76 attributes altogether, but all published experiments
refer to using a subset of 14 of them. In particular, the Cleveland database is
the only one that has been used by ML researchers to this date. The decision
attribute refers to the presence of heart disease in the patient. It is integer
valued from 0 (no presence) to 4. Experiments with the Cleveland database
have concentrated on simply attempting to distinguish presence (values 1,2,3,4)
from absence (value 0). The names and social security numbers of the patients
were recently removed from the database, replaced with dummy values.” [2]
• Ecoli (ecoli)
http://archive.ics.uci.edu/ml/datasets/Ecoli
“The localization site of a protein within a cell is primarily determined by its
amino acid sequence. Rule-based expert system for classifying proteins into
their various cellular localization sites, using their amino acid sequences, in
gram-negative bacteria and in eukaryotic cells.” [105]
• Glass Identification (glass)
http://archive.ics.uci.edu/ml/datasets/Glass+Identification
This data set contains 10 attributes which describe the chemical contents of
glass. “The study of classification of types of glass (in determining whether
the glass was a type of “float” glass or not) was motivated by criminological
investigation. At the scene of the crime, the glass left can be used as evidence
if it is correctly identified.” [73]
• Semeion Handwritten Digit (handw)
http://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit
1593 handwritten digits from around 80 persons were scanned, stretched in a
rectangular box 16x16 in a gray scale of 256 values. Then each pixel of each
image was scaled into a boolean (1/0) value using a fixed threshold. Each
person wrote on a paper all the digits from 0 to 9, twice. The commitment was
to write the digit the first time in the normal way (trying to write each digit
accurately) and the second time in a fast way (with no accuracy). [32]
• Statlog (Heart) (heart)
http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29
“This data set is a heart disease database, with 6 real-valued attributes: 1, 4, 5,
8, 10, 12; 1 ordered attribute:11; 3 binary attributes: 2, 6, 9; and 3 nominal
features:7, 3, 13. The class label to be predicted: absence (1) or presence (2)
of heart disease.”
• Ionosphere (ionos)
http://archive.ics.uci.edu/ml/datasets/Ionosphere
“This radar data was collected by a system in Goose Bay, Labrador. This system
consists of a phased array of 16 high-frequency antennas with a total trans-
mitted power on the order of 6.4 kilowatts. The targets were free electrons
in the ionosphere. "Good" radar returns are those showing evidence of some
type of structure in the ionosphere. "Bad" returns are those that do not; their
signals pass through the ionosphere. Received signals were processed using
an autocorrelation function whose arguments are the time of a pulse and the
pulse number.” [232]
• Isolet (isole)
http://archive.ics.uci.edu/ml/datasets/ISOLET
“This data set was generated as follows. 150 subjects spoke the name of each
letter of the alphabet twice. Hence, we have 52 training examples from each
speaker. All attributes are continuous, real-valued attributes scaled into the
range -1.0 to 1.0. The data set is a good domain for a noisy, perceptual task. It
is also a very good domain for testing the scaling abilities of algorithms.” [74]
• Libras (libra)
http://archive.ics.uci.edu/ml/datasets/Libras+Movement
“The data set contains 15 classes of 24 instances each, where each class refer-
ences to a hand movement type in LIBRAS (Portuguese name ’LÍngua BRAsileira
de Sinais’, the official Brazilian signal language). In the video pre-processing,
a time normalisation is carried out selecting 45 frames from each video, in
according to an uniform distribution. In each frame, the centroid pixels of the
segmented objects (the hand) are found, which compose the discrete version
of the curve F with 45 points. All curves are normalised in the unitary space.”
[64]
• Multiple Features (multi)
http://archive.ics.uci.edu/ml/datasets/Multiple+Features
“This dataset consists of features of handwritten numerals (‘0’–‘9’) extracted
from a collection of Dutch utility maps. 200 patterns per class (for a total of
2,000 patterns) have been digitized in binary images. These digits are repre-
sented in terms of the following six feature sets: 1) 76 Fourier coefficients
of the character shapes; 2) 216 profile correlations; 3) 64 Karhunen-Love
coefficients; 4) 240 pixel averages in 2 x 3 windows; 5) 47 Zernike moments;
6) 6 morphological features. The first 200 patterns are of class ‘0’, followed by
sets of 200 patterns for each of the classes ‘1’ - ‘9’.” [198]
• Olitos (olito)
http://michem.disat.unimib.it/chm/download/datasets.htm#olit
This data set concerns the chemometric analysis of olive oils [9]. The chemical
information such as fatty acids, sterols, and triterpenic alcohols are analysed
from 120 olive oil samples from Tuscany, Italy, collected in 88 different areas
of production. The class variable determines the cultivars of the oil samples.
• Ozone Level Detection (ozone)
http://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection
“Ground ozone level data included in this collection were collected from 1998
to 2004 at the Houston, Galveston and Brazoria area. The data contains impor-
tant attributes that are highly valued by Texas Commission on Environmental
Quality: local ozone peak prediction; upwind ozone background level; pre-
cursor emissions related factor; maximum temperature in degrees F; base
temperature where net ozone production begins; solar radiation total for the
day; wind speed near sunrise; wind speed mid-day.” [283].
• Secom (secom)
http://archive.ics.uci.edu/ml/datasets/SECOM
“A complex modern semi-conductor manufacturing process is normally under
consistent surveillance via the monitoring of signals/variables collected from
sensors and or process measurement points. The measured signals contain
a combination of useful information, irrelevant information as well as noise.
When performing system diagnosis, engineers typically have a much larger
number of signals than are actually required. The Process Engineers may use
certain selected signals to determine key factors contributing to yield excursions
downstream in the process.” [180]
• Sonar (sonar)
http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
“The data set contains 111 patterns obtained by bouncing sonar signals off a
metal cylinder at various angles and under various conditions, and 97 patterns
obtained from rocks under similar conditions. The transmitted sonar signal is
a frequency-modulated chirp, rising in frequency. The data set contains signals
obtained from a variety of different aspect angles, spanning 90 degrees for the
cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in
the range 0.0 to 1.0. Each number represents the energy within a particular
frequency band, integrated over a certain period of time.” [87]
• Water (water)
http://archive.ics.uci.edu/ml/datasets/Water+Treatment+Plant
“This dataset comes from the daily measures of sensors in a urban waste water
treatment plant. The objective is to classify the operational state of the plant
in order to predict faults through the state variables of the plant at each of
the stages of the treatment process. This domain has been stated as an ill-
structured domain.” A variant of this data set: water2 with 2 different classes
(as opposed to the original of 3) has also been utilised in this thesis. [15]
• Waveform Database Generator (wavef)
http://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+(Version+2)
“There are 3 classes of waves, each class is generated from a combination of 2
of 3 base waves. There are 40 attributes, all of which include noise, where the
latter 19 attributes are all noise attributes with mean 0 and variance 1.” [26]
• Wine (wine)
http://archive.ics.uci.edu/ml/datasets/Wine
“These data are the results of a chemical analysis of wines grown in the same
region in Italy but derived from three different cultivars. The analysis deter-
mined the quantities of 13 constituents found in each of the three types of
wines.” [254]
Appendix C
List of Acronyms
10-FCV 10-fold cross-validation
A-FSE Adaptive feature subset ensemble
ABC Artificial Bee Colony
ACO Ant Colony Optimisation
B-FRI Backward fuzzy rule interpolation
BCP Base classifier pool
CER Classifier ensemble reduction
CFS Correlation-based feature selection
CSA Clonal Selection Algorithm
D-HSFS Dynamic feature selection with Harmony Search
FF Firefly Search
FNN Fuzzy nearest neighbour
FRFS Fuzzy-rough set-based feature selection
FRI Fuzzy rule interpolation
FS Feature selection
FSE Feature subset-based classifier ensemble
GA Genetic Algorithm
HC Hill-Climbing
HS Harmony Search
HS-O Original Harmony Search
HS-PC Harmony Search with parameter control
HS-IR Harmony Search with parameter control and iterative refinement
HSFS Feature selection with Harmony Search
LEM2 Learning from examples module, version 2
MA Memetic Algorithm
ModLEM Modified algorithm for learning from examples module
NB Naïve Bayes-based classifier
NIM Nature-inspired meta-heuristic
OC Occurrence coefficient
OC-FSE Occurrence coefficient-based feature subset classifier ensemble
PART Projective adaptive resonance theory
PCFS Probabilistic consistency-based feature selection
PSO Particle swarm optimisation
RIPPER Repeated incremental pruning to produce error reduction
RST Rough set theory
SA Simulated Annealing
SMO Sequential minimal optimisation
T-FRI Transformation-based fuzzy rule interpolation
TS Tabu Search
U-FRFS Unsupervised fuzzy-rough set-based feature selection
WSBA Weighted fuzzy subset-hood-based rule induction algorithm
Bibliography
[1] N. Abe and M. Kudo, “Entropy criterion for classifier-independent feature selection,” in Knowledge-Based Intelligent Information and Engineering Systems, ser. Lecture Notes in Computer Science, R. Khosla, R. Howlett, and L. Jain, Eds. Springer Berlin Heidelberg, 2005, vol. 3684, pp. 689–695.
[2] D. Aha and D. Kibler, “Instance-based prediction of heart-disease presence with the Cleveland database,” University of California, Tech. Rep., Mar. 1988.
[3] D. W. Aha and R. L. Bankert, “A comparative evaluation of sequential feature selection algorithms,” in Learning from Data: Artificial Intelligence and Statistics V, ser. Lecture Notes in Statistics, D. H. Fisher and H.-J. Lenz, Eds. New York, USA: Springer-Verlag, 1996, pp. 199–206.
[4] D. Aha, D. Kibler, and M. Albert, “Instance-based learning algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[5] M. Ahmadi, M. Taylor, and P. Stone, “IFSA: Incremental feature-set augmentation for reinforcement learning tasks,” in The 6th International Joint Conference on Autonomous Agents and Multiagent Systems. Springer, 2007.
[6] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Autom. Control, vol. 19, no. 6, pp. 716–723, 1974.
[7] M. R. AlRashidi and M. El-Hawary, “A survey of particle swarm optimization applications in electric power systems,” IEEE Trans. Evol. Comput., vol. 13, no. 4, pp. 913–918, 2009.
[8] E. Amaldi and V. Kann, “On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems,” Theoretical Computer Science, vol. 209, no. 1–2, pp. 237–260, 1998.
[9] C. Armanino, R. Leardi, S. Lanteri, and G. Modi, “Chemometric analysis of Tuscan olive oils,” Chemometrics and Intelligent Laboratory Systems, vol. 5, no. 4, pp. 343–354, 1989.
[10] M. Attik, “Using ensemble feature selection approach in selecting subset with relevant features,” in Advances in Neural Networks, ser. Lecture Notes in Computer Science, J. Wang, Z. Yi, J. Zurada, B.-L. Lu, and H. Yin, Eds. Springer Berlin Heidelberg, 2006, vol. 3971, pp. 1359–1366.
[11] A. Atyabi, M. Luerssen, S. Fitzgibbon, and D. Powers, “Evolutionary feature selection and electrode reduction for EEG classification,” in 2012 IEEE Congress on Evolutionary Computation, Jun. 2012, pp. 1–8.
[12] H. Banati and M. Bajaj, “Fire fly based feature selection approach,” International Journal of Computer Science Issues, vol. 8, no. 2, pp. 473–479, 2011.
[13] Y. Bar-Cohen, Biomimetics: Biologically Inspired Technologies. Taylor & Francis, 2005.
[14] P. Baranyi, L. T. Kóczy, and T. D. Gedeon, “A generalized concept for fuzzy rule interpolation,” IEEE Trans. Fuzzy Syst., vol. 12, no. 6, pp. 820–837, 2004.
[15] J. Béjar, “Linneo: a classification methodology for ill-structured domains,” Facultat d’Informàtica de Barcelona, Tech. Rep., 1993.
[16] R. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton University Press, 1957.
[17] Y. Bengio and Y. Grandvalet, “No unbiased estimator of the variance of K-fold cross-validation,” Journal of Machine Learning Research, vol. 5, pp. 1089–1105, Sep. 2004.
[18] ——, “Bias in estimating the variance of K-fold cross-validation,” in Statistical Modeling and Analysis for Complex Data Problems, P. Duchesne and B. Rémillard, Eds. Springer US, 2005, pp. 75–95.
[19] R. B. Bhatt and M. Gopal, “On fuzzy-rough sets approach to feature selection,” Pattern Recognition Letters, vol. 26, pp. 965–975, 2005.
[20] C. M. Bishop, Neural Networks for Pattern Recognition, 1st ed. Oxford University Press, USA, Jan. 1996.
[21] G. Bontempi, H. Bersini, and M. Birattari, “The local paradigm for modeling and control: from neuro-fuzzy to lazy learning,” Fuzzy Sets and Systems, vol. 121, no. 1, pp. 59–72, 2001.
[22] T. Boongoen, C. Shang, N. Iam-on, and Q. Shen, “Extending data reliability measure to a filter approach for soft subspace clustering,” IEEE Trans. Syst., Man, Cybern. B, vol. 41, no. 6, pp. 1705–1714, 2011.
[23] T. Boongoen and Q. Shen, “Nearest-neighbor guided evaluation of data reliability and its applications,” IEEE Trans. Syst., Man, Cybern. B, vol. 40, no. 6, pp. 1622–1633, Dec. 2010.
[24] B. Bouchon-Meunier, R. Mesiar, C. Marsala, and M. Rifqi, “Compositional rule of inference as an analogical scheme,” Fuzzy Sets and Systems, vol. 138, no. 1, pp. 53–65, 2003.
[25] V. Braverman, R. Ostrovsky, and C. Zaniolo, “Optimal sampling from sliding windows,” Journal of Computer and System Sciences, vol. 78, no. 1, pp. 260–272, 2012.
[26] L. Breiman, Classification and Regression Trees, ser. The Wadsworth and Brooks-Cole Statistics-Probability Series. Chapman & Hall, 1984.
[27] ——, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–140, 1996.
[28] ——, “Technical note: Some properties of splitting criteria,” Machine Learning, vol. 24, no. 1, pp. 41–47, 1996.
[29] J. Brownlee, Clever Algorithms: Nature-Inspired Programming Recipes. Lulu.com, 2011.
[30] E. Burke and J. Landa Silva, “The design of memetic algorithms for scheduling and timetabling problems,” in Recent Advances in Memetic Algorithms, Studies in Fuzziness and Soft Computing. Springer, 2004, pp. 289–312.
[31] K. Burnham and D. Anderson, Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach. Springer, 2002.
[32] M. Buscema, “Metanet: The theory of independent judges,” Substance Use & Misuse, vol. 33, no. 2, pp. 439–461, 1998.
[33] Y. Cao and J. Wu, “Projective ART for clustering data sets in high dimensional spaces,” Neural Networks, vol. 15, no. 1, pp. 105–120, Jan. 2002.
[34] J. Chang and W. Lee, “Finding recently frequent itemsets adaptively over online transactional data streams,” Information Systems, vol. 31, no. 8, pp. 849–869, 2006.
[35] N. Chawla and D. Davis, “Bringing big data to personalized healthcare: A patient-centered framework,” Journal of General Internal Medicine, vol. 28, no. 3, pp. 660–665, 2013.
[36] S. Chen and Y. Chang, “Fuzzy rule interpolation based on the ratio of fuzziness of interval type-2 fuzzy sets,” Expert Systems with Applications, vol. 38, no. 10, pp. 12202–12213, 2011.
[37] X. Chen, Y.-S. Ong, M.-H. Lim, and K. C. Tan, “A multi-facet survey on memetic computation,” IEEE Trans. Evol. Comput., vol. 15, no. 5, pp. 591–607, 2011.
[38] Y. Chen, D. Miao, and R. Wang, “A rough set approach to feature selection based on ant colony optimization,” Pattern Recognition Letters, vol. 31, no. 3, pp. 226–233, 2010.
[39] A. Chouchoulas and Q. Shen, “Rough set-aided keyword reduction for text categorisation,” Applied Artificial Intelligence, vol. 15, pp. 843–873, 2001.
[40] C. M. Christoudias, R. Urtasun, and T. Darrell, “Multi-view learning in the presence of view disagreement,” in 24th Conference on Uncertainty in Artificial Intelligence, 2008.
[41] L.-Y. Chuang, S.-W. Tsai, and C.-H. Yang, “Improved binary particle swarm optimization using catfish effect for feature selection,” Expert Systems with Applications, vol. 38, no. 10, pp. 12699–12707, 2011.
[42] I. Cloete and J. van Zyl, “Fuzzy rule induction in a set covering framework,” IEEE Trans. Fuzzy Syst., vol. 14, no. 1, pp. 93–110, 2006.
[43] W. W. Cohen, “Fast effective rule induction,” in Twelfth International Conference on Machine Learning. Morgan Kaufmann, 1995, pp. 115–123.
[44] R. Cooley, B. Mobasher, and J. Srivastava, “Web mining: information and pattern discovery on the world wide web,” in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, 1997, pp. 558–567.
[45] C. Cornelis, G. H. Martín, R. Jensen, and D. Slezak, “Feature selection with fuzzy decision reducts,” in Rough Sets and Knowledge Technology, ser. Lecture Notes in Computer Science, G. Wang, T. Li, J. Grzymala-Busse, D. Miao, A. Skowron, and Y. Yao, Eds. Springer Berlin Heidelberg, 2008, vol. 5009, pp. 284–291.
[46] Y. Cui, J. Jin, S. Zhang, S. Luo, and Q. Tian, “Correlation-based feature selection and regression,” in Advances in Multimedia Information Processing, ser. Lecture Notes in Computer Science, G. Qiu, K. Lam, H. Kiya, X.-Y. Xue, C.-C. Kuo, and M. Lew, Eds. Springer Berlin Heidelberg, 2010, vol. 6297, pp. 25–35.
[47] P. Cunningham and J. Carney, “Diversity versus quality in classification ensembles based on feature selection,” in 11th European Conference on Machine Learning. Springer, 2000, pp. 109–116.
[48] A. Darwiche, Modeling and Reasoning with Bayesian Networks, 1st ed. New York, NY, USA: Cambridge University Press, 2009.
[49] S. Das, A. Mukhopadhyay, A. Roy, A. Abraham, and B. Panigrahi, “Exploratory power of the harmony search algorithm: Analysis and improvements for global numerical optimization,” IEEE Trans. Syst., Man, Cybern. B, vol. 41, no. 1, pp. 89–106, 2011.
[50] K. Das Sharma, A. Chatterjee, and A. Rakshit, “Design of a hybrid stable adaptive fuzzy controller employing Lyapunov theory and harmony search algorithm,” IEEE Trans. Control Syst. Technol., vol. 18, no. 6, pp. 1440–1447, 2010.
[51] M. Dash, K. Choi, P. Scheuermann, and H. Liu, “Feature selection for clustering – a filter solution,” in Proceedings of the 2002 IEEE International Conference on Data Mining, 2002, pp. 115–122.
[52] M. Dash and H. Liu, “Consistency-based search in feature selection,” Artificial Intelligence, vol. 151, no. 1-2, pp. 155–176, Dec. 2003.
[53] ——, “Feature selection for classification,” Intelligent Data Analysis, vol. 1, pp. 131–156, 1997.
[54] L. de Castro and F. Von Zuben, “Learning and optimization using the clonal selection principle,” IEEE Trans. Evol. Comput., vol. 6, no. 3, pp. 239–251, Jun. 2002.
[55] J. Debuse and V. Rayward-Smith, “Feature subset selection within a simulated annealing data mining algorithm,” Journal of Intelligent Information Systems, vol. 9, pp. 57–81, 1997.
[56] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
[57] R. Diao, F. Chao, T. Peng, N. Snooke, and Q. Shen, “Feature selection inspired classifier ensemble reduction,” IEEE Trans. Cybern., in press.
[58] R. Diao, N. Mac Parthaláin, and Q. Shen, “Dynamic feature selection with fuzzy-rough sets,” in IEEE International Conference on Fuzzy Systems, Jun. 2013, pp. 1–7.
[59] R. Diao and Q. Shen, “Deterministic parameter control in harmony search,” in Proceedings of the 10th UK Workshop on Computational Intelligence, 2010.
[60] ——, “Two new approaches to feature selection with harmony search,” in IEEE International Conference on Fuzzy Systems, Jul. 2010, pp. 1–7.
[61] ——, “Fuzzy-rough classifier ensemble selection,” in IEEE International Conference on Fuzzy Systems, Jun. 2011, pp. 1516–1522.
[62] ——, “Feature selection with harmony search,” IEEE Trans. Syst., Man, Cybern. B, vol. 42, no. 6, pp. 1509–1523, 2012.
[63] ——, “A harmony search based approach to hybrid fuzzy-rough rule induction,” in IEEE International Conference on Fuzzy Systems, 2012, pp. 1–8.
[64] D. Dias, R. Madeo, T. Rocha, H. Biscaro, and S. Peres, “Hand movement recognition for Brazilian sign language: A study using distance-based neural networks,” in International Joint Conference on Neural Networks (IJCNN 2009), 2009, pp. 697–704.
[65] M. Dorigo and T. Stützle, “Ant colony optimization: Overview and recent advances,” in Handbook of Metaheuristics, ser. International Series in Operations Research & Management Science, M. Gendreau and J.-Y. Potvin, Eds. Springer US, 2010, vol. 146, pp. 227–263.
[66] M. Drobics, U. Bodenhofer, and E. P. Klement, “FS-FOIL: an inductive learning method for extracting interpretable fuzzy descriptions,” International Journal of Approximate Reasoning, vol. 32, no. 2–3, pp. 131–152, 2003.
[67] D. Dubois and H. Prade, Putting Rough Sets and Fuzzy Sets Together. Intelligent Decision Support, Kluwer Academic Publishers, Dordrecht, 1992.
[68] S. Džeroski and B. Ženko, “Is combining classifiers better than selecting the best one?” Machine Learning, vol. 54, no. 3, pp. 255–273, Mar. 2004.
[69] A. Ekbal, S. Saha, O. Uryupina, and M. Poesio, “Multiobjective simulated annealing based approach for feature selection in anaphora resolution,” in Proceedings of the 8th International Conference on Anaphora Processing and Applications, ser. DAARC’11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 47–58.
[70] T. Elomaa and M. Kääriäinen, “An analysis of reduced error pruning,” Journal of Artificial Intelligence Research, vol. 15, no. 1, pp. 163–187, Sep. 2001.
[71] C. Emmanouilidis, A. Hunter, and J. MacIntyre, “A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator,” in Proceedings of the 2000 Congress on Evolutionary Computation, vol. 1, 2000, pp. 309–316.
[72] R. Esposito and L. Saitta, “A Monte Carlo analysis of ensemble classification,” in Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 265–272.
[73] I. W. Evett and E. J. Spiehler, “Rule induction in forensic science,” Central Research Establishment, Home Office Forensic Science Service, Tech. Rep., 1987.
[74] M. A. Fanty and R. Cole, “Spoken letter recognition,” in NIPS, 1990, p. 220.
[75] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery in databases,” AI Magazine, vol. 17, pp. 37–54, 1996.
[76] A. Fern, R. Givan, B. Falsafi, and T. Vijaykumar, “Dynamic feature selection for hardware prediction,” Purdue University, Tech. Rep., 2000.
[77] M. Fesanghary, M. Mahdavi, M. Minary-Jolandan, and Y. Alizadeh, “Hybridizing harmony search algorithm with sequential quadratic programming for engineering optimization problems,” Computer Methods in Applied Mechanics and Engineering, vol. 197, no. 33-40, pp. 3080–3091, 2008.
[78] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010.
[79] A. A. Freitas, “A review of evolutionary algorithms for data mining,” in Soft Computing for Knowledge Discovery and Data Mining, 2007, pp. 61–93.
[80] X. Fu and Q. Shen, “Fuzzy compositional modeling,” IEEE Trans. Fuzzy Syst., vol. 18, no. 4, pp. 823–840, Aug. 2010.
[81] N. Fu-zhong and L. Ming, “Attribute value reduction in variable precision rough set,” in 6th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2005, pp. 904–906.
[82] A. Ganguly, J. Gama, O. Omitaomu, M. Gaber, and R. Vatsavai, Knowledge Discovery from Sensor Data, ser. Industrial Innovation Series. Taylor & Francis, 2008.
[83] F. García López, M. García Torres, B. Melián Batista, J. A. Moreno Pérez, and J. M. Moreno-Vega, “Solving feature subset selection problem by a parallel scatter search,” European Journal of Operational Research, vol. 169, no. 2, pp. 477–489, 2006.
[84] Z. W. Geem, Ed., Recent Advances in Harmony Search Algorithm, ser. Studies in Computational Intelligence. Springer, 2010, vol. 270.
[85] G. Giacinto and F. Roli, “An approach to the automatic design of multiple classifier systems,” Pattern Recognition Letters, vol. 22, pp. 25–33, 2001.
[86] T. Gneiting and A. E. Raftery, “Weather forecasting with ensemble methods,” Science, vol. 310, no. 5746, pp. 248–249, 2005.
[87] R. P. Gorman and T. J. Sejnowski, “Analysis of hidden units in a layered network trained to classify sonar targets,” Neural Networks, vol. 1, p. 75, 1988.
[88] J. W. Grzymala-Busse, “Three strategies to rule induction from data with numerical attributes,” in Transactions on Rough Sets II, ser. Lecture Notes in Computer Science, J. Peters, A. Skowron, D. Dubois, J. W. Grzymala-Busse, M. Inuiguchi, and L. Polkowski, Eds. Springer Berlin Heidelberg, 2005, vol. 3135, pp. 54–62.
[89] P. Grzymala-Busse, J. Grzymala-Busse, and Z. Hippe, “Melanoma prediction using data mining system LERS,” in 25th Annual International Computer Software and Applications Conference, 2001, pp. 615–620.
[90] H. Guvenir, S. Acar, G. Demiroz, and A. Cekin, “A supervised machine learning algorithm for arrhythmia analysis,” in Computers in Cardiology 1997, 1997, pp. 433–436.
[91] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, Mar. 2003.
[92] B. Haktanirlar Ulutas and S. Kulturel-Konak, “A review of clonal selection algorithm and its applications,” Artificial Intelligence Review, vol. 36, no. 2, pp. 117–138, 2011.
[93] M. Hall, “Correlation-based feature subset selection for machine learning,” Ph.D. dissertation, University of Waikato, Hamilton, New Zealand, 1998.
[94] M. A. Hall, “Correlation-based feature selection for discrete and numeric class machine learning,” in Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann, 2000, pp. 359–366.
[95] D. Hand, “Principles of data mining,” Drug Safety, vol. 30, no. 7, pp. 621–622, 2007.
[96] J. Handl and J. Knowles, “Feature subset selection in unsupervised learning via multi-objective optimization,” International Journal of Computational Intelligence Research, vol. 2, no. 3, pp. 217–238, 2006.
[97] M. H. Hansen and B. Yu, “Model selection and the principle of minimum description length,” Journal of the American Statistical Association, vol. 96, no. 454, pp. 746–774, 2001.
[98] S. Haykin, Neural Networks: A Comprehensive Foundation, ser. International Edition. Prentice Hall International, 1999.
[99] H. He, H. Daumé III, and J. Eisner, “Cost-sensitive dynamic feature selection,” in ICML Workshop on Inferning: Interactions between Inference and Learning, Edinburgh, Jun. 2012.
[100] A.-R. Hedar, J. Wang, and M. Fukushima, “Tabu search for attribute reduction in rough set theory,” Soft Computing, vol. 12, no. 9, pp. 909–918, Apr. 2008.
[101] C. Hinde, A. Bani-Hani, T. Jackson, and Y. Cheung, “Evolving polynomials of the inputs for decision tree building,” Journal of Emerging Technologies in Web Intelligence, vol. 4, no. 2, 2012.
[102] T. Ho, “The random subspace method for constructing decision forests,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832–844, Aug. 1998.
[103] S. Hoi, J. Wang, P. Zhao, and R. Jin, “Online feature selection for mining big data,” in Proceedings of the 1st International Workshop on BigMine, 2012, pp. 93–100.
[104] Y. Hong, S. Kwong, Y. Chang, and Q. Ren, “Consensus unsupervised feature ranking from multiple views,” Pattern Recognition Letters, vol. 29, no. 5, pp. 595–602, 2008.
[105] P. Horton and K. Nakai, “A probabilistic classification system for predicting the cellular localization sites of proteins,” in Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1996, pp. 109–115.
[106] N.-C. Hsieh, “Rule extraction with rough-fuzzy hybridization method,” in Advances in Knowledge Discovery and Data Mining, ser. Lecture Notes in Computer Science, T. Washio, E. Suzuki, K. Ting, and A. Inokuchi, Eds. Springer Berlin Heidelberg, 2008, vol. 5012, pp. 890–895.
[107] C.-N. Hsu, H.-J. Huang, and D. Schuschel, “The ANNIGMA-wrapper approach to fast feature selection for neural nets,” IEEE Trans. Syst., Man, Cybern. B, vol. 32, no. 2, pp. 207–212, 2002.
[108] P. Hsu, R. Lai, and C. Chiu, “The hybrid of association rule algorithms and genetic algorithms for tree induction: an example of predicting the student course performance,” Expert Systems with Applications, vol. 25, no. 1, pp. 51–62, 2003.
[109] Z. Huang and Q. Shen, “Fuzzy interpolative reasoning via scale and move transformations,” IEEE Trans. Fuzzy Syst., vol. 14, no. 2, pp. 340–359, 2006.
[110] ——, “Fuzzy interpolation and extrapolation: A practical approach,” IEEE Trans. Fuzzy Syst., vol. 16, no. 1, pp. 13–28, 2008.
[111] N. Q. Hung, M. S. Babel, S. Weesakul, and N. K. Tripathi, “An artificial neural network model for rainfall forecasting in Bangkok, Thailand,” Hydrology and Earth System Sciences, vol. 13, no. 8, pp. 1413–1425, 2009.
[112] S. Hunt, Q. Meng, and C. J. Hinde, “An extension of the consensus-based bundle algorithm for group dependant tasks with equipment dependencies,” in Neural Information Processing, ser. Lecture Notes in Computer Science, T. Huang, Z. Zeng, C. Li, and C. Leung, Eds. Springer Berlin Heidelberg, 2012, vol. 7666, pp. 518–527.
[113] H. Ishibuchi, K. Nozaki, N. Yamamoto, and H. Tanaka, “Selecting fuzzy if-then rules for classification problems using genetic algorithms,” IEEE Trans. Fuzzy Syst., vol. 3, no. 3, pp. 260–270, 1995.
[114] R. A. Jacobs, “Methods for combining experts’ probability assessments,” Neural Computation, vol. 7, no. 5, pp. 867–888, Sep. 1995.
[115] R. Jensen and C. Cornelis, “Fuzzy-rough instance selection,” in IEEE International Conference on Fuzzy Systems, 2010, pp. 1–7.
[116] R. Jensen and Q. Shen, “Fuzzy-rough sets assisted attribute selection,” IEEE Trans. Fuzzy Syst., vol. 15, no. 1, pp. 73–89, 2007.
[117] R. Jensen and C. Cornelis, “A new approach to fuzzy-rough nearest neighbour classification,” in Rough Sets and Current Trends in Computing, ser. Lecture Notes in Computer Science, C.-C. Chan, J. Grzymala-Busse, and W. Ziarko, Eds. Springer Berlin Heidelberg, 2008, vol. 5306, pp. 310–319.
[118] ——, “Fuzzy-rough nearest neighbour classification,” in Transactions on Rough Sets XIII, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011, vol. 6499, pp. 56–72.
[119] R. Jensen, C. Cornelis, and Q. Shen, “Hybrid fuzzy-rough rule induction and feature selection,” in IEEE International Conference on Fuzzy Systems, 2009, pp. 1151–1156.
[120] R. Jensen and Q. Shen, “Using fuzzy dependency-guided attribute grouping in feature selection,” in Proceedings of the 9th International Conference on Rough Sets. Springer, 2003, pp. 250–254.
[121] ——, “Fuzzy-rough attribute reduction with application to web categorization,” Fuzzy Sets and Systems, vol. 141, no. 3, pp. 469–485, 2004.
[122] ——, “Fuzzy-rough data reduction with ant colony optimization,” Fuzzy Sets and Systems, vol. 149, pp. 5–20, 2005.
[123] ——, Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches. Wiley-IEEE Press, 2008.
[124] ——, “Are more features better? a response to attributes reduction using fuzzy rough sets,” IEEE Trans. Fuzzy Syst., vol. 17, no. 6, pp. 1456–1458, 2009.
[125] ——, “Feature selection for aiding glass forensic evidence analysis,” Intell. Data Anal., vol. 13, no. 5, pp. 703–723, Oct. 2009.
[126] ——, “New approaches to fuzzy-rough feature selection,” IEEE Trans. Fuzzy Syst., vol. 17, no. 4, pp. 824–838, Aug. 2009.
[127] R. Jensen, A. Tuson, and Q. Shen, “Finding rough and fuzzy-rough set reducts with SAT,” Information Sciences, vol. 255, pp. 100–120, 2014.
[128] S. Jin, R. Diao, C. Quek, and Q. Shen, “Backward fuzzy rule interpolation with multiple missing values,” in IEEE International Conference on Fuzzy Systems, 2013.
[129] ——, “Backward fuzzy rule interpolation,” IEEE Trans. Fuzzy Syst., 2014, in press.
[130] S. Jin, R. Diao, and Q. Shen, “Towards backward fuzzy rule interpolation,” in Proceedings of the 11th UK Workshop on Computational Intelligence, 2011.
[131] ——, “Backward fuzzy interpolation and extrapolation with multiple multi-antecedent rules,” in IEEE International Conference on Fuzzy Systems, 2012, pp. 1–8.
[132] G. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 1995, pp. 338–345.
[133] M. M. Kabir, M. Shahjahan, and K. Murase, “A new local search based hybrid genetic algorithm for feature selection,” Neurocomputing, vol. 74, no. 17, pp. 2914–2928, 2011.
[134] M. M. Kabir, M. Shahjahan, and K. Murase, “A new hybrid ant colony optimization algorithm for feature selection,” Expert Systems with Applications, vol. 39, no. 3, pp. 3747–3763, 2012.
[135] D. Karaboga and B. Akay, “A survey: algorithms simulating bee swarm intelligence,” Artificial Intelligence Review, vol. 31, no. 1-4, pp. 61–85, 2009.
[136] D. Karaboga and B. Basturk, “A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm,” Journal of Global Optimization, vol. 39, no. 3, pp. 459–471, Nov. 2007.
[137] M. Karzynski, L. Mateos, J. Herrero, and J. Dopazo, “Using a genetic algorithm and a perceptron for feature selection and supervised class learning in DNA microarray data,” Artificial Intelligence Review, vol. 20, no. 1-2, pp. 39–51, 2003.
[138] L. Ke, Z. Feng, and Z. Ren, “An efficient ant colony optimization approach to attribute reduction in rough set theory,” Pattern Recognition Letters, vol. 29, no. 9, pp. 1351–1357, 2008.
[139] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvements to Platt’s SMO algorithm for SVM classifier design,” Neural Computation, vol. 13, no. 3, pp. 637–649, Mar. 2001.
[140] J. Keller, M. Gray, and J. Givens, “A fuzzy K-nearest neighbor algorithm,” IEEE Trans. Syst., Man, Cybern., vol. 15, no. 4, pp. 580–585, 1985.
[141] T. Kietzmann, S. Lange, and M. Riedmiller, “Incremental GRLVQ: Learning relevant features for 3D object recognition,” Neurocomputing, vol. 71, no. 13-15, pp. 2868–2879, 2008.
[142] L. Koczy and K. Hirota, “Approximate reasoning by linear rule interpolation and general approximation,” International Journal of Approximate Reasoning, vol. 9, no. 3, pp. 197–225, 1993.
[143] ——, “Interpolative reasoning with insufficient evidence in sparse fuzzy rule bases,” Information Sciences, vol. 71, no. 1-2, pp. 169–201, 1993.
[144] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, no. 1, pp. 273–324, 1997.
[145] J. Komorowski, Z. Pawlak, L. Polkowski, and A. Skowron, “Rough sets: A tutorial,” 1998.
[146] I. Kononenko, “Estimating attributes: Analysis and extensions of RELIEF,” in Machine Learning, ser. Lecture Notes in Computer Science, F. Bergadano and L. Raedt, Eds. Springer Berlin Heidelberg, 1994, vol. 784, pp. 171–182.
[147] I. Kononenko, E. Simec, and M. Robnik-Sikonja, “Overcoming the myopia of inductive learning algorithms with RELIEFF,” Applied Intelligence, vol. 7, pp. 39–55, 1997.
[148] S. Kovács, “Special issue on fuzzy rule interpolation,” Journal of Advanced Computational Intelligence and Intelligent Informatics, p. 253, 2011.
[149] B. Kröse, N. Vlassis, R. Bunschoten, and Y. Motomura, “A probabilistic model for appearance-based robot localization,” in First European Symposium on Ambient Intelligence. Springer, 2000, pp. 264–274.
[150] L. Kuncheva, “Switching between selection and fusion in combining classifiers: an experiment,” IEEE Trans. Syst., Man, Cybern. B, vol. 32, no. 2, pp. 146–156, 2002.
[151] ——, “Fuzzy versus nonfuzzy in combining classifiers designed by boosting,” IEEE Trans. Fuzzy Syst., vol. 11, no. 6, pp. 729–741, 2003.
[152] L. I. Kuncheva and C. J. Whitaker, “Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy,” Machine Learning, vol. 51, no. 2, pp. 181–207, May 2003.
[153] R. Leardi, R. Boggia, and M. Terrile, “Genetic algorithms as a strategy for feature selection,” Journal of Chemometrics, vol. 6, no. 5, pp. 267–281, 1992.
[154] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction, ser. Information Science and Statistics. Springer, 2007.
[155] K. S. Lee and Z. W. Geem, “A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice,” Computer Methods in Applied Mechanics and Engineering, vol. 194, no. 36-38, pp. 3902–3933, Sep. 2005.
[156] ——, “A new structural optimization method based on the harmony search algorithm,” Computers & Structures, vol. 82, no. 9–10, pp. 781–798, 2004.
[157] K. S. Lee, Z. W. Geem, S.-H. Lee, and K.-W. Bae, “The harmony search heuristic algorithm for discrete structural optimization,” Engineering Optimization, vol. 37, no. 7, pp. 663–684, 2005.
[158] L. Lee and S. Chen, “Fuzzy interpolative reasoning using interval type-2 fuzzy sets,” New Frontiers in Applied Artificial Intelligence, vol. 5027, pp. 92–101, 2008.
[159] W. Lee, S. J. Stolfo, and K. W. Mok, “Adaptive intrusion detection: A data mining approach,” Artificial Intelligence Review, vol. 14, no. 6, pp. 533–567, Dec. 2000.
[160] N. Li, I. Tsang, and Z.-H. Zhou, “Efficient optimization of performance measures by classifier adaptation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PP, no. 99, p. 1, 2012.
[161] X. Li and L. Parker, “Design and performance improvements for fault detection in tightly-coupled multi-robot team tasks,” in Proceedings of IEEE International Conference on Robotics and Automation, 2009.
[162] M. Lippi, M. Jaeger, P. Frasconi, and A. Passerini, “Relational information gain,” Machine Learning, vol. 83, pp. 219–239, 2011.
[163] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Norwell, MA, USA: Kluwer Academic Publishers, 1998.
[164] ——, Computational Methods of Feature Selection (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series). Chapman & Hall/CRC, 2007.
[165] H. Liu and L. Yu, “Toward integrating feature selection algorithms for classification and clustering,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491–502, 2005.
[166] Y. Liu, Q. Zhou, E. Rakus-Andersson, and G. Bai, “A fuzzy-rough sets based compact rule induction method for classifying hybrid data,” in Rough Sets and Knowledge Technology, ser. Lecture Notes in Computer Science, T. Li, H. Nguyen, G. Wang, J. Grzymala-Busse, R. Janicki, A. Hassanien, and H. Yu, Eds. Springer Berlin Heidelberg, 2012, vol. 7414, pp. 63–70.
[167] Y. Liu, G. Wang, H. Chen, H. Dong, X. Zhu, and S. Wang, “An improved particle swarm optimization for feature selection,” Journal of Bionic Engineering, vol. 8, no. 2, pp. 191–200, 2011.
[168] N. Mac Parthaláin, “Guiding rough and fuzzy-rough feature selection using alternative evaluation functions and search strategies,” Ph.D. dissertation, University of Wales Aberystwyth, 2006.
[169] N. Mac Parthaláin and R. Jensen, “Measures for unsupervised fuzzy-rough feature selection,” International Journal of Hybrid Intelligent Systems, vol. 7, no. 4, pp. 249–259, Dec. 2010.
[170] ——, “Fuzzy-rough set based semi-supervised learning,” in IEEE International Conference on Fuzzy Systems, Jun. 2011, pp. 2465–2472.
[171] N. Mac Parthaláin, R. Jensen, Q. Shen, and R. Zwiggelaar, “Fuzzy-rough approaches for mammographic risk analysis,” Intelligent Data Analysis, vol. 14, no. 2, pp. 225–244, Apr. 2010.
[172] N. Mac Parthaláin, Q. Shen, and R. Jensen, “A distance measure approach to exploring the rough set boundary region for attribute reduction,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 3, pp. 305–317, Mar. 2010.
[173] M. Mahdavi, M. Fesanghary, and E. Damangir, “An improved harmony search algorithm for solving optimization problems,” Applied Mathematics and Computation, vol. 188, no. 2, pp. 1567–1579, 2007.
[174] M. Mahdavi, M. H. Chehreghani, H. Abolhassani, and R. Forsati, “Novel meta-heuristic algorithms for clustering web documents,” Applied Mathematics and Computation, vol. 201, no. 1-2, pp. 441–451, 2008.
[175] D. Manjarres, I. Landa-Torres, S. Gil-Lopez, J. Del Ser, M. Bilbao, S. Salcedo-Sanz, and Z. Geem, “A survey on applications of the harmony search algorithm,” Engineering Applications of Artificial Intelligence, vol. 26, no. 8, pp. 1818–1831, 2013.
[176] A. Marín-Hernández, R. Méndez-Rodríguez, and F. Montes-González, “Significant feature selection in range scan data for geometrical mobile robot mapping,” in Proceedings of the 5th International Symposium on Robotics and Automation, 2006.
[177] J. Marin-Blazquez and Q. Shen, “From approximative to descriptive fuzzy classifiers,” IEEE Trans. Fuzzy Syst., vol. 10, no. 4, pp. 484–497, 2002.
[178] F. Markatopoulou, G. Tsoumakas, and I. Vlahavas, “Instance-based ensemble pruning via multi-label classification,” in 22nd IEEE International Conference on Tools with Artificial Intelligence, vol. 1, 2010, pp. 401–408.
[179] M. H. Mashinchi, M. A. Orgun, M. Mashinchi, and W. Pedrycz, “A tabu-harmony search-based approach to fuzzy linear regression,” IEEE Trans. Fuzzy Syst., vol. 19, no. 3, pp. 432–448, 2011.
[180] M. McCann, Y. Li, L. P. Maguire, and A. Johnston, “Causality challenge: Benchmarking relevant signal components for effective monitoring and process control,” Journal of Machine Learning Research – Proceedings Track, vol. 6, pp. 277–288, 2010.
[181] P. McCullagh, “What is a statistical model?” The Annals of Statistics, vol. 30, no. 5, pp. 1225–1310, Oct. 2002.
[182] R. Meiri and J. Zahavi, “Using simulated annealing to optimize the feature selection problem in marketing applications,” European Journal of Operational Research, vol. 171, no. 3, pp. 842–858, 2006.
[183] N. Memon, D. Hicks, and H. Larsen, “Notice of violation of IEEE publication principles: Harvesting terrorists information from web,” in 11th International Conference on Information Visualization, 2007, pp. 664–671.
[184] H.-M. Lee, C.-M. Chen, J.-M. Chen, and Y.-L. Jou, “An efficient fuzzy classifier with feature selection based on fuzzy entropy,” IEEE Trans. Syst., Man, Cybern. B, vol. 31, pp. 426–432, 2001.
[185] T. Mitchell, Machine Learning, 1st ed. McGraw-Hill Education (ISE Editions), Oct. 1997.
[186] L. Molina, L. Belanche, and A. Nebot, “Feature selection algorithms: a survey and experimental evaluation,” in Proceedings of 2002 IEEE International Conference on Data Mining, 2002, pp. 306–313.
[187] D. P. Muni, N. R. Pal, and J. Das, “Genetic programming for simultaneous feature selection and classifier design,” IEEE Trans. Syst., Man, Cybern. B, vol. 36, no. 1, pp. 106–117, 2006.
[188] N. Naik, R. Diao, C. Quek, and Q. Shen, “Towards dynamic fuzzy rule interpolation,” in IEEE International Conference on Fuzzy Systems, 2013.
[189] R. Y. M. Nakamura, L. A. M. Pereira, K. A. Costa, D. Rodrigues, J. P. Papa, and X.-S. Yang, “BBA: A binary bat algorithm for feature selection,” in 25th SIBGRAPI Conference on Graphics, Patterns and Images, Aug. 2012, pp. 291–297.
[190] L. Nanni and A. Lumini, “Ensemblator: An ensemble of classifiers for reliable classification of biological data,” Pattern Recognition Letters, vol. 28, no. 5, pp. 622–630, 2007.
[191] Y. Narukawa, Modeling Decisions: Information Fusion and Aggregation Operators, ser. Cognitive Technologies. Springer, 2010.
[192] S. Nemati, M. E. Basiri, N. Ghasem-Aghaee, and M. H. Aghdam, “A novel ACO-GA hybrid algorithm for feature selection in protein function prediction,” Expert Systems with Applications, vol. 36, no. 10, pp. 12 086–12 094, 2009.
[193] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, “Text classification from labeled and unlabeled documents using EM,” in Machine Learning, 1999, pp. 103–134.
[194] I.-S. Oh, J.-S. Lee, and B.-R. Moon, “Hybrid genetic algorithms for feature selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, 2004.
[195] J. S. Olsson, “Combining feature selectors for text classification,” in Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006, pp. 798–799.
[196] Y.-S. Ong, N. Krasnogor, and H. Ishibuchi, “Special issue on memetic algorithms,” IEEE Trans. Syst., Man, Cybern. B, vol. 37, no. 1, pp. 2–5, Feb. 2007.
[197] D. W. Opitz, “Feature selection for ensembles,” in Proceedings of 16th National Conference on Artificial Intelligence. AAAI Press, 1999, pp. 379–384.
[198] P. Paclik, W. Duin, G. M. P. van Kempen, and R. Kohlus, “On feature selection with measurement cost and grouped features,” Pattern Recognition Group, Delft University of Technology.
[199] S. Palanisamy and K. S., “Artificial bee colony approach for optimizing feature selection,” International Journal of Computer Science Issues, vol. 9, no. 3, pp. 432–438, 2012.
[200] I. Partalas, G. Tsoumakas, and I. Vlahavas, “Pruning an ensemble of classifiers via reinforcement learning,” Neurocomputing, vol. 72, no. 7–9, pp. 1900–1909, 2009.
[201] N. Mac Parthaláin and Q. Shen, “Exploring the boundary region of tolerance rough sets for feature selection,” Pattern Recognition, vol. 42, no. 5, pp. 655–667, 2009.
[202] D. Paul, E. Bair, T. Hastie, and R. Tibshirani, “Preconditioning for feature selection and regression in high-dimensional problems,” The Annals of Statistics, vol. 36, no. 4, pp. 1595–1618, 2008.
[203] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data. Norwell, MA, USA: Kluwer Academic Publishers, 1992.
[204] Z. Pawlak, J. Grzymala-Busse, R. Slowinski, and W. Ziarko, “Rough sets,” Communications of the ACM, vol. 38, no. 11, pp. 88–95, Nov. 1995.
[205] J. M. Peña, J. A. Lozano, P. Larrañaga, and I. Inza, “Dimensionality reduction in unsupervised learning of conditional Gaussian networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 590–603, Jun. 2001.
[206] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, 2005.
[207] S. Piramuthu, “Evaluating feature selection methods for learning in data mining applications,” European Journal of Operational Research, vol. 156, no. 2, pp. 483–494, 2004.
[208] J. C. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” Advances in Kernel Methods – Support Vector Learning, Tech. Rep., 1998.
[209] H. Prade, G. Richard, and M. Serrurier, “Enriching relational learning with fuzzy predicates,” in Knowledge Discovery in Databases: PKDD 2003, ser. Lecture Notes in Computer Science, N. Lavrac, D. Gamberger, L. Todorovski, and H. Blockeel, Eds. Springer Berlin Heidelberg, 2003, vol. 2838, pp. 399–410.
[210] B. Predki and S. Wilk, “Rough set based data exploration using ROSE system,” in Foundations of Intelligent Systems, ser. Lecture Notes in Computer Science, Z. W. Ras and A. Skowron, Eds. Springer Berlin Heidelberg, 1999, vol. 1609, pp. 172–180.
[211] Z. Qin and J. Lawry, “LFOIL: Linguistic rule induction in the label semantics framework,” Fuzzy Sets and Systems, vol. 159, no. 4, pp. 435–448, Feb. 2008.
[212] C. C. Ramos, A. N. Souza, G. Chiachia, A. X. Falcão, and J. P. Papa, “A novel algorithm for feature selection using harmony search and its application for non-technical losses detection,” Computers & Electrical Engineering, vol. 37, no. 6, pp. 886–894, 2011.
[213] K. Rasmani and Q. Shen, “Data-driven fuzzy rule generation and its application for student academic performance evaluation,” Applied Intelligence, vol. 25, no. 3, pp. 305–319, 2006.
[214] V. Rathnayake, L. Premaratne, and D. Sonnadara, “Development of feature based artificial neural network model for weather nowcasting,” in National Symposium on Disaster Risk Reduction & Climate Change Adaptation, 2010.
[215] M. J. Reddy and D. K. Mohanta, “A comparative study of artificial neural network (ANN) and fuzzy inference system (FIS) approach for digital relaying of transmission line faults,” International Journal on Artificial Intelligence and Machine Learning, vol. 6, pp. 1–7, 2006.
[216] S. Royston, J. Lawry, and K. Horsburgh, “A linguistic decision tree approach to predicting storm surge,” Fuzzy Sets and Systems, vol. 215, pp. 90–111, 2013.
[217] L. Saitta, “Hypothesis diversity in ensemble classification,” in Foundations of Intelligent Systems, ser. Lecture Notes in Computer Science, F. Esposito, Z. Ras, D. Malerba, and G. Semeraro, Eds. Springer Berlin Heidelberg, 2006, vol. 4203, pp. 662–670.
[218] G. Schwarz, “Estimating the dimension of a model,” Annals of Statistics, vol. 6, no. 2, pp. 461–464, Mar. 1978.
[219] S. Senthamarai Kannan and N. Ramaraj, “A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm,” Knowledge-Based Systems, vol. 23, no. 6, pp. 580–585, Aug. 2010.
[220] G. Shafer, A Mathematical Theory of Evidence. Princeton University Press, 1976.
[221] M. Shah, M. Marchand, and J. Corbeil, “Feature selection with conjunctions of decision stumps and learning from microarray data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 174–186, 2012.
[222] C. Shang and D. Barnes, “Fuzzy-rough feature selection aided support vector machines for Mars image classification,” Computer Vision and Image Understanding, vol. 117, no. 3, pp. 202–213, 2013.
[223] ——, “Support vector machine-based classification of rock texture images aided by efficient feature selection,” in The International Joint Conference on Neural Networks, Jun. 2012, pp. 1–8.
[224] C. Shang, D. Barnes, and Q. Shen, “Facilitating efficient Mars terrain image classification with fuzzy-rough feature selection,” International Journal of Hybrid Intelligent Systems, vol. 8, no. 1, pp. 3–13, Jan. 2011.
[225] C. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, Jul./Oct. 1948.
[226] Q. Shen and A. Chouchoulas, “A rough-fuzzy approach for generating classification rules,” Pattern Recognition, vol. 35, no. 11, pp. 2425–2438, 2002.
[227] Q. Shen, R. Diao, and P. Su, “Feature selection ensemble,” in Turing Centenary, ser. EPiC Series, A. Voronkov, Ed., vol. 10. EasyChair, 2012, pp. 289–306.
[228] Q. Shen and R. Jensen, “Selecting informative features with fuzzy-rough sets and its application for complex systems monitoring,” Pattern Recognition, vol. 37, no. 7, pp. 1351–1363, 2004.
[229] Z. Shen, L. Ding, and M. Mukaidono, “Fuzzy resolution principle,” in Proceedings of the 18th International Symposium on Multiple-Valued Logic, 1988, pp. 210–215.
[230] S. Shojaie and M. Moradi, “An evolutionary artificial immune system for feature selection and parameters optimization of support vector machines for ERP assessment in a P300-based GKT,” in International Biomedical Engineering Conference, Dec. 2008, pp. 1–5.
[231] W. Siedlecki and J. Sklansky, “A note on genetic algorithms for large-scale feature selection,” Pattern Recognition Letters, vol. 10, no. 5, pp. 335–347, 1989.
[232] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, “Classification of radar returns from the ionosphere using neural networks,” Johns Hopkins APL Tech. Dig., vol. 10, pp. 262–266, 1989.
[233] S. Singh, J. Kubica, S. Larsen, and D. Sorokina, “Parallel large scale feature selection for logistic regression,” in SDM, 2009, pp. 1171–1182.
[234] R. K. Sivagaminathan and S. Ramakrishnan, “A hybrid approach for feature subset selection using neural networks and ant colony optimization,” Expert Systems with Applications, vol. 33, no. 1, pp. 49–60, 2007.
[235] J. Sklansky and M. Vriesenga, “Genetic selection and neural modeling of piecewise-linear classifiers,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 10, no. 5, pp. 587–612, 1996.
[236] D. Slezak, Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing: 10th International Conference, RSFDGrC 2005, Regina, Canada, August 31 - September 3, 2005, Proceedings, ser. Lecture Notes in Computer Science / Lecture Notes in Artificial Intelligence. Springer, 2005.
[237] J. W. Smith, J. Everhart, W. Dickson, W. Knowler, and R. Johannes, “Using the ADAP learning algorithm to forecast the onset of diabetes mellitus,” in Annual Symposium on Computer Application in Medical Care, 1988, pp. 261–265.
[238] P. Somol, J. Grim, and P. Pudil, “Criteria ensembles in feature selection,” in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, J. Benediktsson, J. Kittler, and F. Roli, Eds. Springer Berlin Heidelberg, 2009, vol. 5519, pp. 304–313.
[239] R. Srinivasa Rao, S. Narasimham, M. Ramalinga Raju, and A. Srinivasa Rao, “Optimal network reconfiguration of large-scale distribution system using harmony search algorithm,” IEEE Trans. Power Syst., vol. 26, no. 3, pp. 1080–1088, 2011.
[240] S. Srinivasan and S. Ramakrishnan, “Evolutionary multi objective optimization for rule mining: a review,” Artificial Intelligence Review, vol. 36, no. 3, pp. 205–248, 2011.
[241] D. J. Stracuzzi and P. E. Utgoff, “Randomized variable elimination,” Journal of Machine Learning Research, vol. 5, pp. 1331–1364, 2004.
[242] N. Suguna and K. G. Thanushkodi, “An independent rough set approach hybrid with artificial bee colony algorithm for dimensionality reduction,” American Journal of Applied Sciences, vol. 8, no. 3, pp. 261–266, 2011.
[243] A. Sung, A. Merke, and M. Riedmiller, “Reinforcement learning using a grid based function approximator,” in Biomimetic Neural Learning for Intelligent Robots, ser. Lecture Notes in Computer Science, S. Wermter, G. Palm, and M. Elshaw, Eds. Springer Berlin Heidelberg, 2005, vol. 3575, pp. 235–244.
[244] D. Swets and J. Weng, “Efficient content-based image retrieval using automatic feature selection,” in Proceedings of International Symposium on Computer Vision, 1995, pp. 85–90.
[245] R. W. Swiniarski and A. Skowron, “Rough set methods in feature selection and recognition,” Pattern Recognition Letters, vol. 24, no. 6, pp. 833–849, 2003.
[246] M. A. Tahir, J. Kittler, and A. Bouridane, “Multilabel classification using heterogeneous ensemble of multi-label classifiers,” Pattern Recognition Letters, vol. 33, no. 5, pp. 513–523, 2012.
[247] A. Tajbakhsh, M. Rahmati, and A. Mirzaei, “Intrusion detection using fuzzy association rules,” Applied Soft Computing, vol. 9, no. 2, pp. 462–469, Mar. 2009.
[248] D. Tikk, I. Joó, L. Kóczy, P. Várlaki, B. Moser, and T. Gedeon, “Stability of interpolative fuzzy KH controllers,” Fuzzy Sets and Systems, vol. 125, no. 1, pp. 105–119, 2002.
[249] V. Torra and Y. Narukawa, Modeling Decisions: Information Fusion and Aggregation Operators. Springer, 2007.
[250] G. Tsoumakas, I. Partalas, and I. Vlahavas, “A taxonomy and short review of ensemble selection,” in Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications, 2008.
[251] A. Tsymbal, M. Pechenizkiy, and P. Cunningham, “Diversity in search strategies for ensemble feature selection,” Information Fusion, vol. 6, no. 1, pp. 83–98, 2005.
[252] E. Tuv, A. Borisov, G. Runger, and K. Torkkola, “Feature selection with ensembles, artificial variables, and redundancy elimination,” Journal of Machine Learning Research, vol. 10, pp. 1341–1366, Dec. 2009.
[253] D. L. Vail and M. M. Veloso, “Feature selection for activity recognition in multi-robot domains,” in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008, pp. 1415–1420.
[254] B. Vandeginste, “Parvus: An extendable package of programs for data exploration, classification and correlation,” Journal of Chemometrics, vol. 4, no. 2, pp. 191–193, 1990.
[255] A. Vasebi, M. Fesanghary, and S. M. T. Bathaee, “Combined heat and power economic dispatch by harmony search algorithm,” International Journal of Electrical Power & Energy Systems, vol. 29, no. 10, pp. 713–719, Dec. 2007.
[256] R. Vilalta and Y. Drissi, “A perspective view and survey of meta-learning,” Artificial Intelligence Review, vol. 18, no. 2, pp. 77–95, 2002.
[257] C.-M. Wang and Y.-F. Huang, “Self-adaptive harmony search algorithm for optimization,” Expert Systems with Applications, vol. 37, no. 4, pp. 2826–2837, 2010.
[258] H. Wang, S. Kwong, Y. Jin, W. Wei, and K. Man, “Multi-objective hierarchical genetic algorithm for interpretable fuzzy rule-based knowledge extraction,” Fuzzy Sets and Systems, vol. 149, no. 1, pp. 149–186, 2005.
[259] H. Wang, T. M. Khoshgoftaar, and A. Napolitano, “A comparative study of ensemble feature selection techniques for software defect prediction,” in Proceedings of the 2010 9th International Conference on Machine Learning and Applications, 2010, pp. 135–140.
[260] J. Wang, P. Zhao, S. C. Hoi, and R. Jin, “Online feature selection and its applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 99, no. PrePrints, p. 1, 2013.
[261] X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen, “Feature selection based on rough sets and particle swarm optimization,” Pattern Recognition Letters, vol. 28, no. 4, pp. 459–471, 2007.
[262] X. Wang, E. C. Tsang, S. Zhao, D. Chen, and D. S. Yeung, “Learning fuzzy rules from fuzzy samples based on rough set technique,” Information Sciences, vol. 177, no. 20, pp. 4493–4514, 2007.
[263] G. Wells and C. Torras, “Assessing image features for vision-based robot positioning,” Journal of Intelligent and Robotic Systems, vol. 30, no. 1, pp. 95–118, 2001.
[264] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., ser. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, Jun. 2005.
[265] J. Wroblewski, “Finding minimal reducts using genetic algorithms,” in Proceedings of the 2nd International Joint Conference on Information Science, 1995, pp. 186–189.
[266] J. Wróblewski, “Ensembles of classifiers based on approximate reducts,” Fundamenta Informaticae, vol. 47, no. 3-4, pp. 351–360, Oct. 2001.
[267] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” Knowledge and Information Systems, vol. 14, pp. 1–37, 2008.
[268] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, “Online feature selection with streaming features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1178–1192, 2013.
[269] D. Xie, “Fuzzy association rules discovered on effective reduced database algorithm,” in IEEE International Conference on Fuzzy Systems, 2005, pp. 779–784.
[270] E. P. Xing, M. I. Jordan, and R. M. Karp, “Feature selection for high-dimensional genomic microarray data,” in Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann, 2001, pp. 601–608.
[271] Z. Xu, “An overview of methods for determining OWA weights,” International Journal of Intelligent Systems, vol. 20, no. 8, pp. 843–865, Aug. 2005.
[272] ——, “Dependent OWA operators,” in Proceedings of the Third International Conference on Modeling Decisions for Artificial Intelligence, ser. MDAI’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 172–178.
[273] R. Yager, “On ordered weighted averaging aggregation operators in multicriteria decision making,” IEEE Trans. Syst., Man, Cybern., vol. 18, no. 1, pp. 183–190, 1988.
[274] C.-S. Yang, L.-Y. Chuang, Y.-J. Chen, and C.-H. Yang, “Feature selection using memetic algorithms,” in Third International Conference on Convergence and Hybrid Information Technology, vol. 1, Nov. 2008, pp. 416–423.
[275] J. Yang and V. Honavar, “Feature subset selection using a genetic algorithm,” IEEE Intell. Syst., vol. 13, no. 2, pp. 44–49, Mar. 1998.
[276] L. Yang and Q. Shen, “Adaptive fuzzy interpolation,” IEEE Trans. Fuzzy Syst., vol. 19, no. 6, pp. 1107–1126, 2011.
[277] X.-S. Yang, Nature-Inspired Metaheuristic Algorithms. Luniver Press, 2008.
[278] Y. Yang and J. O. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings of the 14th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997, pp. 412–420.
[279] L. Yu and H. Liu, “Efficient feature selection via analysis of relevance and redundancy,” Journal of Machine Learning Research, vol. 5, pp. 1205–1224, Dec. 2004.
[280] S. C. Yusta, “Different metaheuristic strategies to solve the feature selection problem,” Pattern Recognition Letters, vol. 30, no. 5, pp. 525–534, Apr. 2009.
[281] X.-J. Zeng, J. Y. Goulermas, P. Liatsis, D. Wang, and J. A. Keane, “Hierarchical fuzzy systems for function approximation on discrete input spaces with application,” IEEE Trans. Fuzzy Syst., vol. 16, no. 5, pp. 1197–1215, 2008.
[282] X.-J. Zeng and M. G. Singh, “Approximation accuracy analysis of fuzzy systems as function approximators,” IEEE Trans. Fuzzy Syst., vol. 4, no. 1, pp. 44–63, 1996.
[283] K. Zhang, W. Fan, X. Yuan, I. Davidson, and X. Li, “Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond,” Knowledge and Information Systems, 2008.
[284] L. Zhang, X. Meng, W. Wu, and H. Zhou, “Network fault feature selection based on adaptive immune clonal selection algorithm,” in International Joint Conference on Computational Sciences and Optimization, vol. 2, 2009, pp. 969–973.
[285] M.-L. Zhang and Z.-H. Zhou, “Improve multi-instance neural networks through feature selection,” Neural Processing Letters, pp. 1–10, 2004.
[286] Q. Zhang, W. Liu, E. Tsang, and B. Virginas, “Expensive multiobjective optimization by MOEA/D with Gaussian process model,” IEEE Trans. Evol. Comput., vol. 14, no. 3, pp. 456–474, 2010.
[287] R. Zhang and L. Hanzo, “Iterative multiuser detection and channel decoding for DS-CDMA using harmony search,” IEEE Trans. Signal Process., vol. 16, no. 10, pp. 917–920, 2009.
[288] J. Zhao, K. Lu, and X. He, “Locality sensitive semi-supervised feature selection,” Neurocomputing, vol. 71, no. 10–12, pp. 1842–1849, 2008.
[289] W. Zhao, Y. Wang, and D. Li, “A dynamic feature selection method based on combination of GA with K-means,” in 2nd International Conference on Industrial Mechatronics and Automation, vol. 2, 2010, pp. 271–274.
[290] L. Zheng, R. Diao, and Q. Shen, “Efficient feature selection using a self-adjusting harmony search algorithm,” in Proceedings of the 13th UK Workshop on Computational Intelligence, 2013.
[291] Z. Zheng, “Feature selection for text categorization on imbalanced data,” ACM SIGKDD Explorations Newsletter, vol. 6, 2004.
[292] S.-M. Zhou and J. Q. Gan, “Constructing accurate and parsimonious fuzzy models with distinguishable fuzzy sets based on an entropy measure,” Fuzzy Sets and Systems, vol. 157, no. 8, pp. 1057–1074, 2006.
[293] S.-M. Zhou and J. Gan, “Constructing L2-SVM-based fuzzy classifiers in high-dimensional space with automatic model selection and fuzzy rule ranking,” IEEE Trans. Fuzzy Syst., vol. 15, no. 3, pp. 398–409, 2007.
[294] S.-M. Zhou, J. Garibaldi, R. John, and F. Chiclana, “On constructing parsimonious type-2 fuzzy logic systems via influential rule selection,” IEEE Trans. Fuzzy Syst., vol. 17, no. 3, pp. 654–667, 2009.
[295] Z. Zhou, Ensemble Methods: Foundations and Algorithms, ser. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. Taylor & Francis, 2012.
[296] Z. Zhu and Y.-S. Ong, “Memetic algorithms for feature selection on microarray data,” in Advances in Neural Networks, ser. Lecture Notes in Computer Science, D. Liu, S. Fei, Z.-G. Hou, H. Zhang, and C. Sun, Eds. Springer Berlin Heidelberg, 2007, vol. 4491, pp. 1327–1335.
[297] Z. Zhu, Y.-S. Ong, and M. Dash, “Wrapper-filter feature selection algorithm using a memetic framework,” IEEE Trans. Syst., Man, Cybern. B, vol. 37, no. 1, pp. 70–76, 2007.