Estimating the Accuracy of Spectral Learning for HMMs
Farhana Ferdousi Liza and Marek Grzes
School of Computing, University of Kent, UK
Roadmap
• Motivation
• Background
• Proposed method
• Experiments and results
• Conclusion
• Q&A
Motivation (Why Estimating the Accuracy?)
• When we see unexpected results: is the model incorrect, or is the training data insufficient?
• Supports unsupervised learning and model selection.
Motivation (Why Spectral Learning for HMM?)
[Figure: likelihood as a function of a parameter θ ∈ [0, 1]; several local maxima (LM) are marked alongside the solutions returned by spectral learning (1_SL, 2_SL, 3_SL).]
Background
• Spectral learning
  • Moment-based parameter estimation.
  • Uses the information contained in the eigenvectors of an (item-item similarity) matrix to detect structure.
  • Provides certain (PAC-style) performance guarantees (not purely heuristic).
• Hidden Markov Model
  • Described by three matrices: T, O and 𝜋.
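As a concrete illustration of the three matrices, here is a minimal HMM sketch in NumPy; the sizes and probabilities are our own illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 2, 3                      # hidden states, observation symbols
T = np.array([[0.7, 0.3],        # T[i, j] = P(next state j | current state i)
              [0.4, 0.6]])       # (row-stochastic convention, assumed here)
O = np.array([[0.5, 0.4, 0.1],   # O[i, x] = P(observation x | state i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

def sample_sequence(length):
    """Sample one observation sequence of the given length from the HMM."""
    state = rng.choice(m, p=pi)
    obs = []
    for _ in range(length):
        obs.append(int(rng.choice(n, p=O[state])))
        state = rng.choice(m, p=T[state])
    return obs

seq = sample_sequence(5)
print(seq)  # five symbols from {0, 1, 2}
```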
Spectral learning for HMM
• OOM operator for HMM
• Empirical low-order moment calculation
• Transformed operators for HMM, built from the singular value decomposition UΣV* = P̂₂,₁ of the empirical bigram moment matrix
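The pipeline above can be sketched numerically. This is a minimal illustration in the style of Hsu, Kakade and Zhang's spectral algorithm: build the low-order moments P₁, P₂,₁ and P₃,ₓ,₁, take the SVD of P₂,₁, and form the transformed (observable) operators. To keep it short, the moments are computed exactly from a small, well-conditioned HMM rather than estimated from data; all matrices are illustrative assumptions.

```python
import numpy as np

T = np.array([[0.7, 0.4],    # column-stochastic here: T[i, j] = P(s'=i | s=j)
              [0.3, 0.6]])
O = np.array([[0.5, 0.1],    # O[x, j] = P(obs=x | state j)
              [0.4, 0.3],
              [0.1, 0.6]])
pi = np.array([0.6, 0.4])
m, n = T.shape[0], O.shape[0]

# Exact low-order moments (in place of empirical estimates).
P1 = O @ pi                                   # P1[x]       = P(x1 = x)
P21 = O @ T @ np.diag(pi) @ O.T               # P21[x2, x1] = P(x2, x1)
A = [np.diag(O[x]) for x in range(n)]         # observation-conditioned maps
P3x1 = [O @ T @ A[x] @ T @ np.diag(pi) @ O.T for x in range(n)]

# SVD of P_{2,1}; keep the top-m left singular vectors as the basis U.
U, s, Vt = np.linalg.svd(P21)
U = U[:, :m]

# Transformed operators.
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

def spectral_prob(seq):
    """P(x_1..x_t) via the operator chain b_inf^T B_{x_t} ... B_{x_1} b_1."""
    v = b1
    for x in seq:
        v = B[x] @ v
    return float(binf @ v)

def forward_prob(seq):
    """Reference probability from the standard forward recursion."""
    v = pi.copy()
    for x in seq:
        v = T @ (O[x] * v)
    return float(v.sum())
```

For a well-conditioned HMM the two routines agree, e.g. `spectral_prob([0, 2, 1])` matches `forward_prob([0, 2, 1])` to numerical precision.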
Observation 1
[Figure: the leading basis vectors of the estimated subspace (FirstVec, SecondVec, and in the 3-D panel also ThirdVec), plotted for different training subsets.]
As expected, the basis vector (representing the subspace) rotates.
Observation 2
[Figure: basis vector angle change difference against training data size; synthetic dataset panels for m = 2, 3, 4 and 8, and real dataset panels for thresholds 0.1, 0.01, 0.001 and 0.0001.]
The difference between the rotating bases gets smaller on larger training subsets.
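This behaviour can be reproduced in a toy simulation (not the paper's exact experiment): perturb a fixed moment matrix with noise that decays like 1/√N, as sampling noise does, and the angle between the estimated and true top-m singular subspaces shrinks. The matrix and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def top_basis(M, m):
    """Top-m left singular vectors of M (the estimated basis)."""
    U, _, _ = np.linalg.svd(M)
    return U[:, :m]

def subspace_angle_deg(U1, U2):
    """Largest principal angle (degrees) between the column spaces of U1, U2."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.degrees(np.arccos(np.clip(s.min(), -1.0, 1.0))))

m = 2
P_true = np.array([[0.30, 0.12, 0.05],   # a fixed "true" moment matrix
                   [0.12, 0.20, 0.08],
                   [0.05, 0.08, 0.05]])
U_true = top_basis(P_true, m)

angles = []
for N in [100, 10_000, 1_000_000]:
    # Empirical estimate with O(1/sqrt(N)) sampling-style noise.
    P_hat = P_true + rng.standard_normal(P_true.shape) / np.sqrt(N)
    angles.append(subspace_angle_deg(top_basis(P_hat, m), U_true))

print(angles)  # the angle shrinks as N grows
```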
Ill-conditioned HMM
• A matrix is ill-conditioned if its condition number is too large, i.e., its inverse condition number (ICN) is too small.
• An HMM is ambiguous if the ICN of the characteristic matrices of the HMM is too small (close to zero); in such a case, parameter estimation is difficult for any estimation technique.
• The ICN was calculated as the ratio between the smallest and largest singular values of the row-augmented matrix of T and O.
• Example: an ill-conditioned HMM.
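A sketch of the ICN computation as described above. The two example HMMs are our own illustrative assumptions: in the ill-conditioned one the two hidden states behave identically, so the row-augmented matrix loses rank and the ICN collapses to zero.

```python
import numpy as np

def inverse_condition_number(T, O):
    """ICN: smallest over largest singular value of the stacked matrix [T; O]."""
    s = np.linalg.svd(np.vstack([T, O]), compute_uv=False)
    return float(s.min() / s.max())

# A well-conditioned example: states are clearly distinguishable.
T_good = np.array([[0.9, 0.1], [0.2, 0.8]])
O_good = np.array([[0.7, 0.2], [0.3, 0.8]])

# An ill-conditioned example: both states have identical dynamics and emissions.
T_bad = np.array([[0.5, 0.5], [0.5, 0.5]])
O_bad = np.array([[0.5, 0.5], [0.5, 0.5]])

icn_good = inverse_condition_number(T_good, O_good)
icn_bad = inverse_condition_number(T_bad, O_bad)
print(icn_good)  # well away from zero
print(icn_bad)   # (numerically) zero
```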
Proposed Criteria
• Based on our observations, we propose a convergence criterion built on the basis vector angle change difference.
• Our claim rests on the second observation: as the training data increases, the angle change difference shrinks, i.e., the subspace stabilises and convergence can be determined.
• [Hsieh and Olsen] showed that the active subspace never changes in a neighborhood of the global minimum.
• The subset size and the threshold are application specific and can be determined using cross-validation.
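Under our reading of the criterion, the stopping rule can be sketched as follows; the function name, history values and threshold are illustrative assumptions, with the threshold to be chosen per application (e.g., by cross-validation).

```python
import numpy as np

def converged(angle_changes, threshold):
    """angle_changes[k]: basis angle change (degrees) between training
    subsets k and k+1. Convergence is declared when the most recent
    *difference* of consecutive angle changes falls below the threshold."""
    diffs = np.abs(np.diff(angle_changes))
    return diffs.size > 0 and bool(diffs[-1] < threshold)

# Angle changes shrinking as training subsets grow (illustrative numbers):
history = [41.0, 18.0, 6.0, 5.95]
print(converged(history, threshold=0.1))      # True: last difference is 0.05
print(converged(history[:3], threshold=0.1))  # False: last difference is 12.0
```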
Experimental setting
Real Dataset: web-navigation data from msnbc.com
Synthetic Dataset: configuration for the synthetic dataset
Evaluation 1: Error measure for the synthetic dataset (true model is known)
[Figure: normalized L1 error against the threshold (log scaled) for Examples 1, 2, 15 and 4.]
Evaluation 1: Error does not correspond with ill-conditioned HMMs
[Figure: normalized L1 error (on the order of 10⁴ to 10⁵) against the threshold (log scaled) for ill-conditioned Examples 8, 9, 18 and 17.]
Evaluation 2: Recovered parameters and the proposed criterion (well-conditioned HMM)
Evaluation 2: Recovered parameters and the proposed criterion (ill-conditioned HMM)
Threshold = 0.00001
Conclusion
• The angle change difference can be a useful criterion for checking convergence.
• Without a convergence criterion, it would be difficult to know whether the model is incorrect, or the model is correct but more training examples are required.
Future Work
• Problems with spectral learning
  • Cannot incorporate long-term dependencies.
  • For large domains, the SVD can be time-consuming.
  • Simplifying the domain space can be tricky and, in some cases, intractable.
• Q & A
• Thanks
THE UK’S EUROPEAN UNIVERSITY
www.kent.ac.uk