Ranking Interesting Subgroups
Stefan Rüping, Fraunhofer
Fraunhofer Web Project, kick-off on 17 July 2008
Motivation
1. name_score >= 1 & geoscore >= 1 & housing >= 5, p = 41.6%
2. Income_score >= 5 & name_score >= 5 & housing >= 5, p = 36.0%
3. Active_housholds >= 3 & queries_per_household >= 1 & housing >= 5, p = 43.8%
4. Families == 0 & name_score >= 1 & housing == 0, p = 28.9%
5. Financial_status == 0 & name_score >= 3 & housing <= 5, p = 66.1%
Applying ranking to complex data: subgroup models
Optimization of data mining models for non-expert users
Fraunhofer IAIS
Overview
Introduction to Subgroup Discovery
Interesting Patterns
Ranking Subgroups
• Representation
• Ranking SVMs
• Iterative algorithm
Experiments
Conclusions
Subgroup Discovery
Input
• X defined by nominal attributes A1,…,Ad
• Data $(x_1, y_1), \dots, (x_n, y_n) \in X \times \{0,1\}$
Subgroup language
• Propositional formula $A_{i_1} = v_{j_1} \wedge A_{i_2} = v_{j_2} \wedge \dots$
For a subgroup S let
• $g(S) = \#\{x_i \in S\}/n$, $p(S) = \#\{x_i \in S \mid y_i = 1\} / \#\{x_i \in S\}$, $p_0 = \#\{y_i = 1\}/n$
• $q(S) = g(S)^a \, (p(S) - p_0)$
Task
• Find k subgroups with highest significance (maximal quality q)
Annotations
• a = 0.5: t-test
• Subgroup quality = significance of pattern
• q combines subgroup size and class probability
Subgroup Discovery: Example
Weather  Advertised  Ice Cream Sales
good     yes         high
good     no          high
good     no          high
good     no          high
bad      no          low
bad      yes         high
bad      no          low
bad      no          low
S1: Weather = good → sales = high
g(S1) = 4/8, p(S1) = 4/4
q(S1) = (4/8)^0.5 (4/4 - 5/8) = 0.265
S2: Advertised = yes → sales = high
g(S2) = 2/8, p(S2) = 2/2
q(S2) = (2/8)^0.5 (2/2 - 5/8) = 0.1875
Significance ≠ Interestingness
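The numbers in the ice cream example can be reproduced with a few lines of Python. This is only a sketch: the function name `quality`, the `COLUMNS` lookup, and the tuple layout of the data are illustrative choices, not part of the talk, and the denominator of p(S) is taken to be the number of covered examples, as in the worked values 4/4 and 2/2.

```python
# Ice cream data from the example: (weather, advertised, sales)
DATA = [
    ("good", "yes", "high"),
    ("good", "no", "high"),
    ("good", "no", "high"),
    ("good", "no", "high"),
    ("bad", "no", "low"),
    ("bad", "yes", "high"),
    ("bad", "no", "low"),
    ("bad", "no", "low"),
]
COLUMNS = {"weather": 0, "advertised": 1}  # illustrative attribute lookup


def quality(rows, conditions, a=0.5):
    """Return (g, p, q) for the subgroup described by `conditions`.

    `conditions` maps an attribute name to its required value; the
    target class is sales == "high". Assumes the subgroup covers at
    least one row.
    """
    n = len(rows)
    covered = [r for r in rows
               if all(r[COLUMNS[k]] == v for k, v in conditions.items())]
    g = len(covered) / n                                      # relative size
    p = sum(r[2] == "high" for r in covered) / len(covered)   # class probability in S
    p0 = sum(r[2] == "high" for r in rows) / n                # overall class probability
    return g, p, g ** a * (p - p0)


g1, p1, q1 = quality(DATA, {"weather": "good"})
g2, p2, q2 = quality(DATA, {"advertised": "yes"})
print(f"S1: g={g1:.3f} p={p1:.3f} q={q1:.4f}")
print(f"S2: g={g2:.3f} p={p2:.3f} q={q2:.4f}")
```

Both subgroups have p(S) = 1, so S1 ranks higher purely because it covers more examples, which is the point behind "Significance ≠ Interestingness".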
Interesting Patterns
What makes a pattern interesting to the user?
Depends on prior knowledge, but heuristics exist
Attributes
• Actionability
• Acquaintedness
Sub-space
• Novelty
Complexity
• Not too complex
• Not too simple
Overview: Ranking Interesting Subgroups
[Diagram with components: Data, Subgroup Discovery, Subgroup Representation, Ranking SVM, Task Modification, and user feedback "S1 > S2"]
Subgroup Representation (1/3)
Subgroups become examples of ranking learner!
Notation
• Ai = original attribute
• r(S) = representation of subgroup S
Remember: important properties of subgroups
• Attributes
• Examples
• Complexity
Representing complexity
• r(S) includes g(S) and p(S)-p0
Subgroup Representation (2/3)
Representing attributes
For each attribute Ai of the original examples, include in the subgroup representation the attribute
$r_i(S) = \begin{cases} 1 & \text{iff } S \text{ contains } A_i \\ 0 & \text{else} \end{cases}$
Observation: TF/IDF-like representation performs even better
$r_i^{\mathrm{TFIDF}}(S) = \dfrac{r_i(S)}{1 + \sum_j r_i(S_j)}$
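The binary attribute representation can be illustrated with a minimal sketch. It assumes a subgroup is stored as a dict from attribute name to required value; that encoding, the function name `attribute_indicators`, and the attribute names (borrowed from the ice cream example) are illustrative assumptions, not taken from the talk.

```python
# Original attributes A_1..A_d; names borrowed from the ice cream example.
ATTRIBUTES = ["weather", "advertised"]


def attribute_indicators(subgroup, attributes=ATTRIBUTES):
    """r_i(S) = 1 if the subgroup's formula uses attribute A_i, else 0."""
    return [1 if name in subgroup else 0 for name in attributes]


s1 = {"weather": "good"}
s2 = {"advertised": "yes"}
s3 = {"weather": "good", "advertised": "no"}
for s in (s1, s2, s3):
    print(s, attribute_indicators(s))
```

Only the presence of an attribute in the formula is encoded, not the value it is compared against, so subgroups that use the same attributes share the same indicator vector.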
Subgroup Representation (3/3)
Representing examples
User may be more interested in subset of examples
Construct list of known relevant and irrelevant subgroups from user feedback
For each subgroup S and each known relevant/irrelevant subgroup T define
$r_T(S) = \dfrac{|S \cap T|}{|S| \, |T|}$ (relatedness of S to known subgroup T)
Ranking Optimization Problem
Rationale
• Subgroup discovery gives quality q(S) = g(S)^a (p(S)-p0)
• User defines ranking by pairs "S1 > S2" (S1 is better than S2)
• Find true ranking q* such that S1 > S2 <=> q*(S1) > q*(S2)
Assumption
$q^*(S) = g(S)^a \, (p(S) - p_0) \prod_{i=3}^{d} e^{w_i r_i(S)}$
(justified by assuming hidden labels of interestingness of examples)
Define linear ranking function log q*(S) = (a,1,w) r(S)
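The step from the product form to the linear ranking function can be spelled out by taking logarithms. The derivation below is a sketch that assumes p(S) > p0 (so the logarithm is defined) and that the first two components of r(S) hold the logarithms of g(S) and p(S) - p0; the slides only state that r(S) "includes" these two quantities.

```latex
\begin{align*}
q^*(S)      &= g(S)^a \,\bigl(p(S) - p_0\bigr) \prod_{i=3}^{d} e^{w_i r_i(S)} \\
\log q^*(S) &= a \log g(S) + 1 \cdot \log\bigl(p(S) - p_0\bigr) + \sum_{i=3}^{d} w_i\, r_i(S) \\
            &= (a, 1, w) \cdot r(S),
\qquad r(S) = \bigl(\log g(S),\ \log(p(S) - p_0),\ r_3(S), \dots, r_d(S)\bigr)
\end{align*}
```

Because the logarithm is monotone, ranking subgroups by log q* gives the same order as ranking them by q*.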
Ranking Optimization Problem (2/2)
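Solution similar to ranking SVM
Optimization problem:
$\min_{a, w, \xi} \ \tfrac{1}{2} \, \|(a, 1, w) - (a_0, 1, 0)\|^2 + C \sum_i \xi_i$
$\text{s.t.} \ \log q^*(S_{i,1}) - \log q^*(S_{i,2}) \ge 1 - \xi_i, \quad \xi_i \ge 0$
Equivalent problem:
$\min_{a, w, \xi} \ \tfrac{1}{2} (a_0 - a)^2 + \tfrac{1}{2} \|w\|^2 + C \sum_i \xi_i$
$\text{s.t.} \ (a, 1, w) \cdot z_i \ge 1 - \xi_i, \quad \xi_i \ge 0$
where $z_i = r(S_{i,1}) - r(S_{i,2})$. Remember log q*(S) = (a,1,w) r(S)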
The objective penalizes deviation from the parameter a0 used in subgroup discovery
The constant weight 1 in (a,1,w) defines the margin
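To make the pairwise formulation concrete, the sketch below evaluates the objective ½(a0 - a)² + ½‖w‖² + C Σ ξ_i for a given (a, w), with each slack ξ_i computed as the hinge loss of its pairwise constraint. Several things here are assumptions rather than content from the talk: the function names, the margin of 1 (the usual ranking-SVM convention), the defaults a0 = 0.5 and C = 1, the vector layout (log g, log(p - p0), attribute indicators), and the user preference "S2 > S1", which is invented for illustration. It evaluates the objective only and does not solve the optimization problem.

```python
from math import log


def dot(u, v):
    """Plain dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(u, v))


def objective(a, w, pairs, a0=0.5, C=1.0, margin=1.0):
    """Evaluate 1/2 (a0 - a)^2 + 1/2 ||w||^2 + C * sum(xi).

    `pairs` is a list of (r_better, r_worse) representation vectors laid
    out as (log g, log(p - p0), r_3, ..., r_d). Each slack xi_i is the
    hinge loss of the constraint (a, 1, w) . z_i >= margin - xi_i.
    The margin value is an assumption of this sketch.
    """
    weights = [a, 1.0] + list(w)
    slacks = []
    for better, worse in pairs:
        z = [b - c for b, c in zip(better, worse)]
        slacks.append(max(0.0, margin - dot(weights, z)))
    value = 0.5 * (a0 - a) ** 2 + 0.5 * dot(w, w) + C * sum(slacks)
    return value, slacks


# Hypothetical feedback on the ice cream example: the user prefers
# S2 (Advertised = yes) over S1 (Weather = good).
r_s1 = [log(4 / 8), log(4 / 4 - 5 / 8), 1, 0]
r_s2 = [log(2 / 8), log(2 / 2 - 5 / 8), 0, 1]
pairs = [(r_s2, r_s1)]

base_value, base_slacks = objective(0.5, [0.0, 0.0], pairs)
new_value, new_slacks = objective(0.5, [-0.7, 0.7], pairs)
print(base_value, base_slacks)
print(new_value, new_slacks)
```

With w = 0 the original quality function contradicts the stated preference and pays a slack penalty; shifting weight from the "weather" indicator to the "advertised" indicator satisfies the constraint at a lower objective value.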
Iterative Procedure
Why?
• Google: ~10^12 web pages
• Same number of possible subgroups on a 12-dimensional data set with 9 distinct values per attribute
• Cannot compute all subgroups for single-step ranking
Approach
• Optimization problem gives new estimate of a
• Transform weights of subgroup features into weights for original examples
• Idea: replace binary y with numeric value. Appropriate offset guarantees that the subgroup quality q approximates the optimized q*
[Diagram: loop between subgroup search and ranking]
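The size of the search space quoted above follows from a simple count, sketched here. The assumption is that each attribute is either left unconstrained or fixed to exactly one of its values, which matches the conjunctive subgroup language; the function name is illustrative, and the count includes the empty formula that covers all examples.

```python
def count_conjunctions(num_attributes, values_per_attribute):
    """Number of conjunctive formulas over nominal attributes.

    Each attribute is either absent from the formula or fixed to one
    of its values, giving (values + 1) choices per attribute.
    """
    return (values_per_attribute + 1) ** num_attributes


print(count_conjunctions(12, 9))  # 1000000000000
```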
Experiments
Simulation on UCI data
• Replace true label with most correlated attribute
• Use true label to simulate user
• Measure correspondence of algorithm's ranking with subgroups found on true label
• Tests ability of approach to flexibly adapt to correlated patterns
Performance measure
• Area under the curve (AUC): retrieval of true top 100 subgroups
• Kendall's τ: internal consistency of returned ranking
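As a reminder of what Kendall's τ measures, here is a minimal pure-Python sketch of the tau-a variant, which counts concordant minus discordant pairs and applies no tie correction. The talk does not say which variant was used, so the choice of tau-a and the function name are assumptions of this example.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1      # pair ordered the same way in both lists
        elif s < 0:
            discordant += 1      # pair ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))   # identical order
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))   # reversed order
print(kendall_tau([1, 2, 3], [1, 3, 2]))         # one swapped pair
```

A value of 1 means the two rankings agree on every pair, -1 means they disagree on every pair, and values near 0 indicate no consistent relationship.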
Results
Wilcoxon signed rank test confirms significance
The 3 data sets with minimal AUC are exactly the ones with minimal correlation between true and proxy label!
Data set       AUC     Kendall's τ
Diabetes       0.256   0.008
Breast-w       0.759   0.120
Vote           0.664   0.051
Segment        0.596   0.601
Vehicle        0.053   0.500
Heart-c        0.180   0.036
Primary-tumor  0.739   0.532
Hypothyroid    0.729   0.307
Ionosphere     0.227   0.708
Credit-a       0.050   0.241
Credit-g       0.019   0.285
Colic          1.9E-4  0.213
Anneal         0.030   0.329
Soybean        1.9E-4  0.040
Mushroom       0.542   0.320
mean           0.323   0.286
Conclusions
Example of ranking on complex, knowledge-rich data
Interestingness of subgroup patterns can be significantly increased with interactive ranking-based method
Step toward automating machine learning for end-users
Future work:
• Validation with true users
• Active learning approach