Learning Markov Network Structure with Decision Trees

Daniel Lowd, University of Oregon
Joint work with: Jesse Davis, Katholieke Universiteit Leuven
Problem Definition

Given training data, learn a Markov network structure representing P(F, W, A, S, C).

Training Data:

F W A S C
T T T F F
F F T T F
F T T F F
T F F T T
F F F T T

[Figure: Markov network over Smoke, Wheeze, Asthma, Cancer, Flu.]

Standard approach: a SLOW SEARCH over candidate structures.

Applications: diagnosis, prediction, recommendations, and much more!
Key Idea

Learn decision trees to predict each variable, then convert the trees into a Markov network structure representing P(F, W, A, S, C).

Training Data:

F W A S C
T T T F F
F F T T F
F T T F F
T F F T T
F F F T T

Decision tree for P(C | F, S):

F=?
├─ true:  0.2
└─ false: S=?
    ├─ true:  0.7
    └─ false: 0.5

[Figure: resulting Markov network over S, W, A, C, F.]

Result: Similar accuracy, orders of magnitude faster!
Outline

• Background
  – Markov networks
  – Weight learning
  – Structure learning
• DTSL: Decision Tree Structure Learning
• Experiments
Markov Networks: Representation

(aka Markov random fields, Gibbs distributions, log-linear models, exponential models, maximum entropy models)

[Figure: undirected graph over the variables Smoke, Wheeze, Asthma, Cancer, Flu.]

$P(x) = \frac{1}{Z} \exp\left( \sum_i w_i f_i(x) \right)$

where $f_i(x)$ is feature i and $w_i$ is the weight of feature i. Example: feature Smoke ∧ Cancer with weight 1.5.
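To make the representation concrete, here is a minimal illustrative sketch (not from the slides): a tiny log-linear model with the example feature Smoke ∧ Cancer at weight 1.5, normalized by brute-force enumeration. The variable and feature choices are assumptions for illustration.

```python
import itertools
import math

VARS = ["F", "W", "A", "S", "C"]          # Flu, Wheeze, Asthma, Smoke, Cancer
features = [lambda x: x["S"] and x["C"]]  # f_1: Smoke ∧ Cancer
weights = [1.5]                           # w_1

def score(x):
    """sum_i w_i f_i(x) for an assignment x: dict var -> bool."""
    return sum(w * f(x) for w, f in zip(weights, features))

# Z sums exp(score) over all 2^n assignments -- exact but exponential,
# which is why inference in Markov networks is expensive in general.
Z = sum(math.exp(score(dict(zip(VARS, bits))))
        for bits in itertools.product([False, True], repeat=len(VARS)))

def prob(x):
    return math.exp(score(x)) / Z

print(prob({"F": False, "W": False, "A": False, "S": True, "C": True}))
```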
Markov Networks: Learning

Two learning tasks, for the same model $P(x) = \frac{1}{Z} \exp\left( \sum_i w_i f_i(x) \right)$ (e.g. feature Smoke ∧ Cancer with weight 1.5):

• Weight learning. Given: features, data. Learn: weights.
• Structure learning. Given: data. Learn: features.
Markov Networks: Weight Learning

• Maximum likelihood weights:

$\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$

where $n_i(x)$ is the number of times feature i is true in the data, and $E_w[n_i(x)]$ is the expected number of times feature i is true according to the model. Slow: requires inference at each step.

• Pseudo-likelihood:

$PL(x) = \prod_i P(x_i \mid \mathit{neighbors}(x_i))$

No inference: more tractable to compute.
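A hedged sketch of pseudo-likelihood for the same toy model as above (Smoke ∧ Cancer with weight 1.5): each conditional P(x_i | rest) depends only on the score difference between x_i = True and x_i = False, so no partition function Z, and hence no global inference, is needed.

```python
import math

VARS = ["F", "W", "A", "S", "C"]

def score(x):                              # sum_i w_i f_i(x), one toy feature
    return 1.5 if (x["S"] and x["C"]) else 0.0

def pseudo_log_likelihood(x):
    pll = 0.0
    for v in VARS:
        # P(x_v = True | rest) = sigmoid(score with v=True - score with v=False)
        delta = score(dict(x, **{v: True})) - score(dict(x, **{v: False}))
        log_p_true = -math.log1p(math.exp(-delta))    # log sigmoid(delta)
        log_p_false = -math.log1p(math.exp(delta))    # log sigmoid(-delta)
        pll += log_p_true if x[v] else log_p_false
    return pll

print(pseudo_log_likelihood({"F": False, "W": False, "A": False, "S": True, "C": True}))
```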
Markov Networks: Structure Learning [Della Pietra et al., 1997]

• Given: set of variables = {F, W, A, S, C}
• At each step:
  – Form candidate features by conjoining variables to features in the model. E.g., for current model = {F, W, A, S, C, S ∧ C}, candidates = {F ∧ W, F ∧ A, …, A ∧ C, F ∧ S ∧ C, …, A ∧ S ∧ C}.
  – Select the best candidate. E.g., new model = {F, W, A, S, C, S ∧ C, F ∧ W}.
• Iterate until no feature improves the score.

Downside: weight learning at each step – very slow!
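A schematic sketch (not the authors' code) of this greedy loop; `learn_weights` and `model_score` are hypothetical stand-ins for weight optimization and (pseudo-)likelihood scoring, and features are modeled as sets of variable names so that conjoining is set union.

```python
def greedy_structure_search(variables, data, learn_weights, model_score):
    model = [frozenset([v]) for v in variables]      # start with atomic features
    best = model_score(learn_weights(model, data), data)
    while True:
        # Candidates: every existing feature conjoined with one new variable.
        candidates = {f | {v} for f in model for v in variables if v not in f}
        scored = []
        for cand in candidates:
            w = learn_weights(model + [cand], data)  # weight learning per
            scored.append((model_score(w, data), cand))  # candidate: slow!
        top_score, top_cand = max(scored, key=lambda t: t[0])
        if top_score <= best:
            return model               # no candidate improves the score
        best, model = top_score, model + [top_cand]
```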
Bottom-up Learning of Markov Networks (BLM) [Davis and Domingos, 2010]

1. Initialize with one feature per example.
2. Greedily generalize features to cover more examples.
3. Continue until no improvement in score.

Training Data:

F W A S C
T T T F F
F F T T F
F T T F F
T F F T T
F F F T T

Initial model: F1: F ∧ W ∧ A, F2: A ∧ S, F3: W ∧ A, F4: F ∧ S ∧ C, F5: S ∧ C
Revised model: F1: W ∧ A, F2: A ∧ S, F3: W ∧ A, F4: F ∧ S ∧ C, F5: S ∧ C

Downside: weight learning at each step – very slow!
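A rough sketch of BLM's bottom-up move (illustrative only, not the published algorithm): seed one conjunctive feature per example from its true variables, then greedily drop variables so features cover more examples; `score` is a hypothetical stand-in that relearns weights internally.

```python
def initial_features(data, variables):
    """data: list of dicts mapping variable name -> bool."""
    return [frozenset(v for v in variables if x[v]) for x in data]

def generalize(features, data, score):
    best = score(features, data)
    improved = True
    while improved:
        improved = False
        for i in range(len(features)):
            for v in list(features[i]):
                # Generalize feature i by dropping one variable from it.
                trial = features[:i] + [features[i] - {v}] + features[i + 1:]
                s = score(trial, data)   # hides weight learning: the slow part
                if s > best:
                    features, best, improved = trial, s, True
    return features
```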
L1 Structure Learning [Ravikumar et al., 2009]

Given: set of variables = {F, W, A, S, C}
Do: L1 logistic regression to predict each variable; construct pairwise features between the target and each variable with a non-zero weight.

[Figure: per-target L1 weight vectors over F, W, A, S, C; most weights are 0.0, with a few non-zero entries (e.g. 1.0, 1.5) indicating edges.]

Model = {S ∧ C, …, W ∧ F}

Downside: the algorithm is restricted to pairwise features.
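A hedged sketch of this recipe using scikit-learn (the authors' implementation surely differs): fit an L1-regularized logistic regression for each variable against all the others, and keep a pairwise feature for every non-zero coefficient. `C`, the inverse regularization strength, would be tuned on held-out data.

```python
from sklearn.linear_model import LogisticRegression

def l1_structure(X, names, C=0.1):
    """X: binary (n_examples, n_vars) numpy array; names: variable names."""
    edges = set()
    for j in range(X.shape[1]):
        others = [k for k in range(X.shape[1]) if k != j]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], X[:, j])           # predict variable j from the rest
        for k, coef in zip(others, clf.coef_[0]):
            if abs(coef) > 1e-6:                  # non-zero weight -> pairwise edge
                edges.add(frozenset((names[j], names[k])))
    return edges
```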
General Strategy: Local Models

Learn a "local model" to predict each variable given the others, then combine the local models into a single Markov network (cf. Ravikumar et al., 2009).

[Figure: local models for each of A, B, C, D, combined into one network over A, B, C, D.]
DTSL: Decision Tree Structure Learning [Lowd and Davis, ICDM 2010]

Given: set of variables = {F, W, A, S, C}
Do: learn a decision tree to predict each variable. E.g., for P(C | F, S):

F=?
├─ true:  0.2
└─ false: S=?
    ├─ true:  0.7
    └─ false: 0.5

(and similarly a tree for P(F | ·), and so on for each variable)

Construct a feature for each leaf in each tree:

F ∧ C,  ¬F ∧ S ∧ C,  ¬F ∧ ¬S ∧ C
F ∧ ¬C, ¬F ∧ S ∧ ¬C, ¬F ∧ ¬S ∧ ¬C
…
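A hedged sketch of the tree-to-feature step using scikit-learn (DTSL's own tree learner and splitting criterion may differ): fit a tree predicting `target` from the remaining binary variables, then turn every root-to-leaf path into a conjunction, paired with both the target literal and its negation.

```python
from sklearn.tree import DecisionTreeClassifier

def tree_features(X, y, names, target, max_depth=3):
    """X: binary (n, d) array of non-target variables; y: target column."""
    t = DecisionTreeClassifier(max_depth=max_depth).fit(X, y).tree_
    feats = []

    def walk(node, path):
        if t.children_left[node] == -1:            # leaf: emit path ∧ target
            feats.append(path + [target])          # e.g. ¬F ∧ S ∧ C
            feats.append(path + ["¬" + target])    # e.g. ¬F ∧ S ∧ ¬C
            return
        v = names[t.feature[node]]
        walk(t.children_left[node], path + ["¬" + v])   # left branch: test false
        walk(t.children_right[node], path + [v])        # right branch: test true

    walk(0, [])
    return [" ∧ ".join(f) for f in feats]
```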
DTSL Feature Pruning [Lowd and Davis, ICDM 2010]

For the P(C | F, S) tree above:

Original: F ∧ C, ¬F ∧ S ∧ C, ¬F ∧ ¬S ∧ C, F ∧ ¬C, ¬F ∧ S ∧ ¬C, ¬F ∧ ¬S ∧ ¬C
Pruned: all of the above, plus F, ¬F ∧ S, ¬F ∧ ¬S, ¬F
Nonzero: F, S, C, F ∧ C, S ∧ C
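A small sketch of the pruned feature set as I read the slide: besides each full leaf conjunction, also emit every prefix of the path, with and without the target literal, so that L1 weight learning can zero out the over-specific features, leaving the "Nonzero" set.

```python
def with_prefixes(path_literals, target):
    feats = []
    for n in range(1, len(path_literals) + 1):
        prefix = list(path_literals[:n])
        feats.append(prefix)                     # e.g. ¬F ∧ S
        feats.append(prefix + [target])          # e.g. ¬F ∧ S ∧ C
        feats.append(prefix + ["¬" + target])    # e.g. ¬F ∧ S ∧ ¬C
    return feats

print([" ∧ ".join(f) for f in with_prefixes(["¬F", "S"], "C")])
```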
Empirical Evaluation

• Algorithms
  – DTSL [Lowd and Davis, ICDM 2010]
  – DP [Della Pietra et al., 1997]
  – BLM [Davis and Domingos, 2010]
  – L1 [Ravikumar et al., 2009]
  – All parameters were tuned on held-out data
• Metrics
  – Running time (structure learning only)
  – Per-variable conditional marginal log-likelihood (CMLL)
Conditional Marginal Log-Likelihood [Lee et al., 2007]

• Measures the ability to predict each variable separately, given evidence.
• Split the variables into 4 sets: use 3 as evidence (E) and 1 as query (Q), rotating so that every variable appears in a query set.
• Probabilities are estimated using MCMC (specifically MC-SAT [Poon and Domingos, 2006]).

$\mathrm{CMLL}(x, e) = \sum_i \log P(X_i = x_i \mid E = e)$
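An illustrative sketch of the CMLL computation (not the authors' evaluation harness): given estimated marginals P(X_i = True | E = e), e.g. from MC-SAT samples, sum the log-probability assigned to each query variable's true test value.

```python
import math

def cmll(true_values, marginals):
    """true_values: dict var -> bool; marginals: dict var -> P(var=True | e)."""
    total = 0.0
    for v, x in true_values.items():
        p = marginals[v] if x else 1.0 - marginals[v]
        total += math.log(max(p, 1e-12))     # clamp to avoid log(0)
    return total

print(cmll({"F": True, "W": False}, {"F": 0.8, "W": 0.3}))
```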
Domains

• Compared across 13 different domains
  – 2,800 to 290,000 training examples
  – 600 to 38,000 tuning examples
  – 600 to 58,000 test examples
  – 16 to 900 variables
Run Time (minutes)

[Bar chart: structure learning run time in minutes (linear scale, 0 to 1,600) for DTSL, L1, BLM, and DP across the 13 domains: NLTCS, MSNBC, KDDCup2000, Plants, Audio, Jester, Netflix, MSWeb, Book, EachMovie, WebKB, 20Newsgroups, Reuters-52.]
Run Time (minutes, log scale)

[The same bar chart on a log scale (0.1 to 10,000 minutes) across the 13 domains for DTSL, L1, BLM, and DP.]
Accuracy: DTSL vs. BLM

[Scatter plot of per-domain accuracy; points in the "DTSL Better" region favor DTSL, points in the "BLM Better" region favor BLM.]
Accuracy: DTSL vs. L1

[Scatter plot of per-domain accuracy; points in the "DTSL Better" region favor DTSL, points in the "L1 Better" region favor L1.]
Do long features matter?

[Bar chart: average DTSL feature length (0 to 12) per domain — MSNBC, Plants, Book, EachMovie, NLTCS, KDDCup, 20Newsgroups, WebKB, MSWeb, Reuters, Audio, Jester, Netflix — with the domains grouped by whether DTSL or L1 is more accurate.]
Results Summary

• Accuracy
  – DTSL wins on 5 domains
  – L1 wins on 6 domains
  – BLM wins on 2 domains
• Speed
  – DTSL is 16 times faster than L1
  – DTSL is 100 to 10,000 times faster than BLM and DP
  – For DTSL, weight learning becomes the bottleneck
Conclusion

• DTSL uses decision trees to learn Markov network structures much faster, with similar accuracy.
• Using local models to build a global model is an effective strategy.
• L1 and DTSL have different strengths:
  – L1 can combine many independent influences
  – DTSL can handle complex interactions
  – Can we get the best of both worlds? (Ongoing work…)