DS2014: Feature selection in hierarchical feature spaces
Motivation: Linked Open Data as Background Knowledge
• Linked Open Data is a method for publishing interlinked datasets with machine-interpretable semantics
• Started in 2007
• A collection of ~1,000 datasets
– Various domains, e.g. general knowledge, government data, …
– Using semantic web standards (HTTP, RDF, SPARQL)
• Free of charge
• Machine processable
• Sophisticated tool stacks
Example: the Auto MPG Dataset
• A well-known UCI dataset
– Goal: predict fuel consumption of cars
• Hypothesis: background knowledge → more accurate predictions
• Used background knowledge:
– Entity types and categories from DBpedia (a structured version of Wikipedia)
• Result: M5Rules cuts the prediction error almost in half
– i.e. on average, we are wrong by 1.6 instead of 2.9 MPG
Attribute set                          Linear Regression    M5Rules
                                       RMSE     RE          RMSE     RE
original                               3.359    0.118       2.859    0.088
original + direct types                3.334    0.117       2.835    0.091
original + categories                  4.474    0.144       2.926    0.090
original + direct types + categories   2.551    0.088       1.574    0.042
Drawbacks
• The generated feature sets are rather large
– e.g. for a dataset of 300 instances, up to 5,000 features may be generated from a single source
• Increased complexity and runtime
• Overfitting on overly specific features
Linked Open Data is Backed by Ontologies
[Figure: LOD graph excerpt (left) and the corresponding ontology excerpt (right)]
Problem Statement
• Each instance is an n-dimensional binary feature vector (v1, v2, …, vn), where vi ∈ {0,1} for all 1 ≤ i ≤ n
• Feature space: V={v1,v2,…, vn}
• A hierarchical relation between two features vi and vj is denoted vi < vj, where vi is more specific than vj
• For all hierarchical features, the following implication holds:
vi < vj → (vi = 1 → vj = 1)
• Transitivity holds between hierarchical features:
vi < vj ∧ vj < vk → vi < vk
• The problem of feature selection can be defined as finding a projection of V to V′, where V′ ⊆ V and p(V′) ≥ p(V), for a performance function p: P(V) → [0,1]
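To make these definitions concrete, here is a minimal sketch (not from the paper) of binary features over a type hierarchy, where upward closure enforces the implication above; the hierarchy and all names are illustrative.

```python
from itertools import chain

# hypothetical hierarchy: child -> parent (v_child < v_parent, i.e. more specific)
parent = {
    "Baseball_Player": "Athlete",
    "Basketball_Player": "Athlete",
    "Athlete": "Agent",
}

def ancestors(feature):
    """Yield all features more general than `feature` (transitive closure of <)."""
    while feature in parent:
        feature = parent[feature]
        yield feature

def close_upward(active):
    """Enforce the implication vi < vj -> (vi = 1 -> vj = 1)."""
    return set(chain(active, *(ancestors(f) for f in active)))

# annotating an instance with one specific type activates all of its ancestors
print(close_upward({"Baseball_Player"}))
# -> {'Baseball_Player', 'Athlete', 'Agent'} (in some order)
```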
Hierarchical Feature Space: Example
Josh Donaldson is the best 3rd baseman in the American League.
LeBron James NOT ranked #1 after newly released list of Top NBA players
“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”―Albert Einstein
In his weekly address, President Barack Obama discusses expanding
opportunity for hard-working Americans: http://ofa.bo/ccH
Nineteen-year-old figure skater Yuzuru Hanyu, who won a gold medal in the Sochi Olympics, is among the 684 peo... http://bit.ly/1kb6W5y
Barack Obama cracks jokes at Vladimir Putin's expense http://dlvr.it/5Z7JCR
I spotted the Lance Armstrong case in 2006 when everyone thought he was
God, and now this case catches my attention.
Hierarchical Feature Space: Example (annotated)
Josh Donaldson is the best 3rd baseman in the American League.
LeBron James NOT ranked #1 after newly released list of Top NBA players
[Figure: the two tweets annotated with dbpedia:Josh_Donaldson and dbpedia:LeBron_James; their direct types dbpedia-owl:Baseball_Player and dbpedia-owl:Basketball_Player share the common superclass dbpedia-owl:Athlete]
Hierarchical Feature Space
• Linked Open Data
– DBpedia, YAGO, Biperpedia, Google Knowledge Graph
• Lexical Databases
– WordNet, DANTE
• Domain-specific ontologies, taxonomies, and vocabularies
– Bioinformatics: Gene Ontology (GO), Entrez
– Drugs: the Drug Ontology
– E-commerce: GoodRelations
Standard Feature Selection
• Wrapper methods
– Computationally expensive
• Filter methods
– Several techniques for scoring the relevance of the features
• Information Gain
• χ²
• Information Gain Ratio
• Gini Index
– Often similar results
TSEL Feature Selection
• Tree-based feature selection (Jeong et al.)
– Select the most representative and most effective feature from each branch of the hierarchy
– lift = P(f|C) / P(C)
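A small sketch estimating the lift score as written on the slide, lift = P(f|C) / P(C), from binary feature values; the function and all names are illustrative, not Jeong et al.'s implementation.

```python
def lift(feature_values, class_labels, target_class):
    """Estimate lift = P(f|C) / P(C) for a binary feature f and class C."""
    n = len(class_labels)
    in_class = [i for i in range(n) if class_labels[i] == target_class]
    p_c = len(in_class) / n                                                 # P(C)
    p_f_given_c = sum(feature_values[i] for i in in_class) / len(in_class)  # P(f|C)
    return p_f_given_c / p_c

print(lift([1, 1, 0, 0], ["pos", "pos", "neg", "neg"], "pos"))  # 1.0 / 0.5 = 2.0
```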
Bottom-Up Hill-Climbing Feature Selection
• Bottom-up hill-climbing search algorithm to find an optimal subset of concepts for document representation (Wang et al.)
– Fitness function: f = (1 + α − n) / (α ∙ β ∙ Σi∈D |Dci|), where Dci ⊆ DKNNi and β > 0
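Since the slide only abbreviates the fitness function, the following is a schematic sketch of a bottom-up hill-climbing search with the fitness left as a plug-in; all names are illustrative assumptions, not Wang et al.'s implementation.

```python
def hill_climb(leaves, parent, fitness):
    """Greedily replace a selected concept by its parent while fitness improves."""
    selected = set(leaves)          # start from the most specific concepts
    improved = True
    while improved:
        improved = False
        for concept in list(selected):
            if concept not in selected or concept not in parent:
                continue            # already replaced, or a root concept
            candidate = (selected - {concept}) | {parent[concept]}
            if fitness(candidate) > fitness(selected):
                selected = candidate
                improved = True
    return selected
```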
Greedy Top-Down Feature Selection
• Greedy top-down search strategy for feature selection (Lu et al.)
– Select the most effective nodes from different levels of the hierarchy
Hierarchical Feature Selection Approach (SHSEL)
• Exploit the hierarchical structure of the feature space
• Hierarchical relation: vi < vj → (vi = 1 → vj = 1)
• Relevance similarity:
– Relevance (Blum et al.) : A feature vi is relevant to a target class C if
there exists a pair of examples A and B in the instance space such that
A and B differ only in their assignment to vi and C(A) ≠ C(B)
• Two features vi and vj have similar relevance if:
1 − |R(vi) − R(vj)| ≥ t, where t ∈ [0,1]
• Goal: Identify features with similar relevance, and select the most
valuable abstract features, without losing predictive power
Hierarchical Feature Selection Approach (SHSEL)
• Initial Selection
– Identify and filter out ranges of nodes with similar relevance in each
branch of the hierarchy
• Pruning
– Select only the most relevant features from the previously reduced set
Initial SHSEL Feature Selection
1. Identify ranges of nodes with similar relevance in each branch:
– Information gain: s(vi, vj) = 1 − |IG(vi) − IG(vj)|
– Correlation: s(vi, vj) = Correlation(vi, vj)
2. If the similarity is greater than a user-specified threshold, remove the more specific feature, based on the hierarchical relation
Example: s(vi, vj) = 1 − |0.45 − 0.5| = 0.95; with t = 0.9 we have s > t, so the more specific feature is removed
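A minimal sketch of the initial selection step, assuming the hierarchy is given as direct (more specific, more general) pairs and relevance is e.g. information gain; all names are illustrative.

```python
def shsel_initial(edges, relevance, t):
    """Remove the more specific feature of each pair whose relevance similarity >= t."""
    removed = set()
    for specific, general in edges:
        similarity = 1 - abs(relevance[specific] - relevance[general])
        if similarity >= t:
            removed.add(specific)           # keep only the more general feature
    return {f for f in relevance if f not in removed}

# the worked example above: s = 1 - |0.45 - 0.5| = 0.95 >= t = 0.9
print(shsel_initial([("Baseball_Player", "Athlete")],
                    {"Baseball_Player": 0.45, "Athlete": 0.5}, t=0.9))
# -> {'Athlete'}
```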
Post SHSEL Feature Selection
• Select the features with the highest relevance on each path
– using a user-specified threshold, or
– selecting features with relevance above the path-average relevance
Example: IG(vi) = 0.2 is below the path average AVG(Sp) = 0.25, so vi is pruned
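A minimal sketch of the path-average variant of the pruning step, assuming `paths` enumerates root-to-leaf paths over the features that survived the initial selection; names are illustrative.

```python
def shsel_prune(paths, relevance):
    """On each path, keep only features whose relevance reaches the path average."""
    kept = set()
    for path in paths:
        avg = sum(relevance[f] for f in path) / len(path)
        kept.update(f for f in path if relevance[f] >= avg)
    return kept

# the example above: IG(vi) = 0.2 < 0.25 = path average, so vi is pruned
print(shsel_prune([["vi", "vj"]], {"vi": 0.2, "vj": 0.3}))  # -> {'vj'}
```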
Evaluation
• We use 5 real-world datasets and 6 synthetically generated datasets
• Classification methods:
– Naïve Bayes
– k-Nearest Neighbors (k=3)
– Support Vector Machine (polynomial kernel function)
• No parameter optimization for the classification methods
Evaluation: Real World Datasets
Name               Features              #Instances  Class Labels                      #Features
Sports Tweets T    DBpedia Direct Types  1,179       positive(523); negative(656)      4,082
Sports Tweets C    DBpedia Categories    1,179       positive(523); negative(656)      10,883
Cities             DBpedia Direct Types  212         high(67); medium(106); low(39)    727
NY Daily Headings  DBpedia Direct Types  1,016       positive(580); negative(436)      5,145
StumbleUpon        DMOZ Categories       3,020       positive(1,370); negative(1,650)  3,976
• Hierarchical features are generated from DBpedia (a structured version of Wikipedia)
– The text is annotated with concepts using DBpedia Spotlight
• The feature generation is independent of the class labels, and it is unbiased towards any of the feature selection approaches
Evaluation: Synthetic Datasets
• Generate the middle layer using a polynomial function
• Generate the hierarchy upwards and downwards following the hierarchical feature implication and the transitivity rule
• The depth and branching factor are controlled by parameters D and B (a rough sketch of the generation follows the table below)
Name      #Instances  Class Labels                  #Features
S-D2-B2   1,000       positive(500); negative(500)  1,201
S-D2-B5   1,000       positive(500); negative(500)  1,021
S-D2-B10  1,000       positive(500); negative(500)  961
S-D4-B2   1,000       positive(500); negative(500)  2,101
S-D4-B4   1,000       positive(500); negative(500)  1,741
S-D4-B10  1,000       positive(500); negative(500)  1,621
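The slides do not spell out the polynomial labeling function or the exact propagation procedure, so the following generator is only a rough sketch under those assumptions; the label rule in particular is a placeholder.

```python
import random

def generate_instance(n_mid, depth, branching):
    """Generate one instance of a synthetic hierarchical binary feature space."""
    mid = [random.randint(0, 1) for _ in range(n_mid)]
    # upwards: each parent aggregates `branching` children, so the implication
    # v_child = 1 -> v_parent = 1 holds by construction
    layers_up, layer = [], mid
    for _ in range(depth):
        layer = [int(any(layer[i:i + branching]))
                 for i in range(0, len(layer), branching)]
        layers_up.append(layer)
    # downwards: a child may only be 1 if its parent is 1, so the implication
    # also holds below the middle layer
    layers_down, layer = [], mid
    for _ in range(depth):
        layer = [v if random.random() < 0.5 else 0
                 for v in layer for _ in range(branching)]
        layers_down.append(layer)
    # placeholder for the slide's (unspecified) polynomial labeling function
    label = int(sum(mid) > n_mid / 2)
    return mid, layers_up, layers_down, label

print(generate_instance(n_mid=8, depth=2, branching=2))
```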
Evaluation: Synthetic Datasets
• Depth = 1 & Branching = 2
[Figure: example hierarchy with depth 1 and branching factor 2, showing binary feature values propagated from the more specific to the more general features]
Evaluation: Approach
• Testing all approaches using three classification methods
– Naïve Bayes, k-NN, and SVM
• Metrics for performance evaluation (a small sketch of these metrics follows this list):
– Accuracy: Acc(V′) = correctly classified instances (V′) / total number of instances
– Feature Space Compression: c(V′) = 1 − |V′| / |V|
– Harmonic Mean: H = 2 ∙ Acc(V′) ∙ c(V′) / (Acc(V′) + c(V′))
• Results calculated using stratified 10-fold cross validation
– Feature selection is performed inside each fold
• Parameter optimization for each feature selection strategy
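A small sketch of the three metrics defined above; the numeric inputs are illustrative.

```python
def accuracy(correct, total):
    return correct / total                     # Acc(V')

def compression(selected_size, original_size):
    return 1 - selected_size / original_size   # c(V')

def harmonic_mean(acc, comp):
    return 2 * acc * comp / (acc + comp)       # H

acc, comp = accuracy(850, 1000), compression(400, 4000)
print(acc, comp, harmonic_mean(acc, comp))     # 0.85 0.9 0.874...
```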
Evaluation: SHSEL IG
• Classification accuracy when using different relevance similarity thresholds on the Cities dataset
[Figure: accuracy, feature space compression, and harmonic mean (0–100%) plotted against the relevance similarity threshold]
Evaluation: Classification Accuracy (NB)
[Figure: classification accuracy with Naïve Bayes (0–100%) on the real-world datasets (Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and the synthetic datasets (S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10), comparing original, initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, and GreedyTopDown]
Evaluation: Feature Space Compression (NB)
[Figure: feature space compression (0–100%) on the same real-world and synthetic datasets for initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, and GreedyTopDown]
Evaluation: Harmonic Mean (NB)
[Figure: harmonic mean of accuracy and compression (0–100%) on the same real-world and synthetic datasets for the same feature selection approaches]
Conclusion & Outlook
• Contribution
– An approach that exploits hierarchies for feature selection in
combination with standard metrics
– The evaluation shows that the approach outperforms standard feature selection techniques as well as other approaches that use hierarchies
• Future Work
– Conduct further experiments
• E.g. text mining, bioinformatics
– Feature Selection in unsupervised learning
• E.g. clustering, outlier detection
• Laplacian Score