Improvement of the reliability score for autocoding...
Transcript of Improvement of the reliability score for autocoding...
![Page 1: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/1.jpg)
Improvement of Reliability Score for Autocoding
and its Implementation in R
uRos2019, 21 May 2019, Bucharest
Yukako Toko1, Shinya Iijima1, Mika Sato-Ilic1,2
1National Statistics Center, Japan
2 University of Tsukuba, Japan
1
![Page 2: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/2.jpg)
Overview - Background
Uncertainty of Training Data
* Semantic problem
* Interpretation problem
* Insufficiently detailed input
information
Development of Overlapping ClassifierOne Feature into Multiple Classes
(Y. Toko, S. Iijima, M. Sato-Ilic, 2018)
Utilized the idea of Fuzzy Partition Entropy
Reliability Score Considering Uncertainties
from Both Measures
Utilize Difference of Measures
for Uncertainties
Uncertainty from data
Probability Measure
Uncertainty from latent
classification structure in data
Fuzzy Measure
Conventional Classifier
One Feature
into One Class
Problem
2
![Page 3: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/3.jpg)
Purpose of This Study
Overlapping Classifier
based on Reliability Score(Y. Toko, S. Iijima, M. Sato-Ilic, 2018)
Consideration of Frequency of Each Feature
in training dataset
Inclusion of the frequency of each feature
to the Reliability Score
Consideration of Generalized Reliability Score
Apply T-norm in Statistical metric space
Improvement of the reliability score
3
![Page 4: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/4.jpg)
Overview - System Structure
Training
dataset
Feature
frequency table
Training process
Input OutputFeature
extraction
Candidates
retrieval
Reliability
Score calc.
Classification process
Feature
extraction
Feature A Class X Feature AClass X
Class Y
-> proposed an algorithm that allows the assignment of one feature
is classified to multiple classes
(overlapping classification)
To address the unrealistic restriction : one feature is classified to a single class
Generalized
Reliability Score
Inclusion of Frequency
of Each Feature
to Reliability Score
4
![Page 5: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/5.jpg)
Method โ Overlapping classifier
Step 1 : Calculate the probability of j-th feature ( j=1,โฆ,J ) to a class k ( k=1,โฆ,K ) as
๐๐ ๐ =๐๐ ๐
๐๐, ๐๐ =
๐=1
๐พ
๐๐ ๐
๐๐๐ : Number of text descriptions in a class ๐ with j-th feature in the training dataset
5
![Page 6: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/6.jpg)
Method โ Overlapping classifier
Step 2 : Determine at most เทฉ๐พ (เทฉ๐พ < ๐พ) promising candidate classes for each feature based
on ๐๐ ๐
2. Create เทจ๐๐1,โฏ , เทจ๐๐ เทฉ๐พ๐, เทจ๐พ๐ โค เทจ๐พ โค ๐พ
Note : When there are same values in ๐๐1,โฏ , ๐๐๐พ ,
then we select as many as possible different เทจ๐พ๐ classes for each feature j
1. Arrange ๐๐1,โฏ , ๐๐๐พ in descending order and create ๐๐1,โฏ , ๐๐๐พ ,
such as ๐๐1 โฅ โฏ โฅ ๐๐๐พ , ๐ = 1,โฏ , ๐ฝ
6
![Page 7: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/7.jpg)
Method โ Overlapping classifier
Step 3 : Calculate the Reliability Score าง๐๐๐
าง๐๐๐ = เทจ๐๐๐ 1 +๐=1
เทฉ๐พ๐เทจ๐๐๐ log๐พ เทจ๐๐๐ , ๐ = 1,โฏ , ๐ฝ, ๐ = 1,โฏ , เทจ๐พ๐
Step 4 : Determine top L (๐ฟ โ {1, โฆ , ฯ๐๐=1โ๐ เทฉ๐พ๐๐}) candidate classes
When the number of target text descriptions is T, and each text description includes
โ๐ ( ๐ = 1,โฏ , ๐ ) features, corresponding าง๐๐๐ for l-th text description can be represented as
าง๐๐๐๐, ๐๐ = 1,โฏ , โ๐ , ๐ = 1,โฏ , เทฉ๐พ๐๐ , ๐ = 1,โฏ , ๐
Reliability score of j-th feature included in l-th text description to a class k
7
![Page 8: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/8.jpg)
Method โ Overlapping classifier
าง๐๐๐ : Reliability Score of j -th feature to a class k
Probability of feature j to class k Classification status of feature j over the เทฉ๐พ๐ classes
Transformation from เทจ๐๐ ๐ to classification
status of feature j
Degree of Reliability
Explanation of the uncertainty of the training
data
Utilization of the deference of measurements
of uncertainty
Probability Fuzzy
าง๐๐๐ = เทจ๐๐๐ 1 +๐=1
เทฉ๐พ๐เทจ๐๐๐ log๐พ เทจ๐๐๐
8
![Page 9: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/9.jpg)
Method โ different fuzzy measurement
าง๐๐๐ = เทจ๐๐๐
๐=1
๐พ
๐๐๐2
Apply another fuzzy measurement for reliability score
Partition coefficient for each feature j
๐๐ถ๐ =
๐=1
๐พ
๐๐๐2 , ๐ = 1, โฆ , ๐ฝ
(Y. Toko, K. Wada, S. Iijima, M. Sato-Ilic, 2018)
Another degree of Reliability
Classification status of feature j over the K classes
าง๐๐๐ = เทจ๐๐๐ 1 +๐=1
เทฉ๐พ๐เทจ๐๐๐ log๐พ เทจ๐๐๐
Partition coefficient Partition entropy
9
![Page 10: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/10.jpg)
Method โ Generalized Reliability Score
าง๐๐๐ = เทจ๐๐๐
๐=1
๐พ
๐๐๐2 าง๐๐๐ = เทจ๐๐๐ 1 +
๐=1
เทฉ๐พ๐เทจ๐๐๐ log๐พ เทจ๐๐๐
Partition coefficient Partition entropy
าง๐๐๐ = T เทจ๐๐๐ ,
๐=1
๐พ
๐๐๐2 าง๐๐๐ = T เทจ๐๐๐ , 1 +
๐=1
เทฉ๐พ๐เทจ๐๐๐ log๐พ เทจ๐๐๐
Generalization
T(๐ , ๐): T-norm between a and b
10
![Page 11: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/11.jpg)
Method โ T-norm (Triangular norms)
๐ โถ 0,1 ร 0,1 โ [0,1]
โ๐, ๐, ๐, ๐ โ [0,1]
(1) 0 โค ๐ ๐, ๐ โค 1,
๐ ๐, 0 = ๐ 0, ๐ = 0
๐ ๐, 1 = ๐ 1, ๐ = ๐( Boundary conditions )
(2) a โค ๐, ๐ โค ๐ => ๐ ๐, ๐ โค ๐(๐, ๐) ( Monotonicity )
(3) ๐ ๐, ๐ = ๐(๐, ๐) ( Symmetry )
(4) ๐(๐ ๐, ๐ , ๐) = ๐(๐, ๐ ๐, ๐ ) ( Associativity )
(K. Menger, 1942)
11
![Page 12: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/12.jpg)
Method โ Statistical metric space
๐น๐๐ ๐ฅ โก Pr{๐๐๐ < ๐ฅ}
โ๐, ๐, ๐ โ ๐
๐๐๐ = 0
๐๐๐ > 0 (๐ โ ๐)
๐๐๐ = ๐๐๐
๐๐๐ โค ๐๐๐ + ๐๐๐
๐น๐๐ ๐ฅ = 1, for all ๐ฅ > 0
๐น๐๐ ๐ฅ < 1, ๐ โ ๐ for some ๐ฅ > 0
๐น๐๐ = ๐น๐๐
๐น๐๐(๐ฅ + ๐ฆ) โฅ ๐(๐น๐๐ ๐ฅ , ๐น๐๐ ๐ฆ )
โ
โ
โ
โ
12
![Page 13: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/13.jpg)
Method โ Examples of T-norm
๐ฅ๐ฆ
t-norm ๐ก(๐ฅ, ๐ฆ)
Algebraic Prod.
Hamacher Prod.๐ฅ๐ฆ
๐ + (1 โ ๐)(๐ฅ + ๐ฆ โ ๐ฅ๐ฆ)
Sin based t-norm2
๐๐ ๐๐โ1 sin
๐
2๐ฅ + sin
๐
2๐ฆ โ 1 โจ0
Dombi Prod.
1
1 +๐ 1 โ ๐ฅ
๐ฅ
๐
+1 โ ๐ฆ๐ฆ
๐
๐ โฅ 0
๐ > 0
13
![Page 14: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/14.jpg)
Method โ Utilization of T-norm for Reliability Score
Algebraic Prod.
Hamacher Prod. าง๐๐๐ =เทจ๐๐๐ ฯ๐=1
๐พ ๐๐๐2
๐ + (1 โ ๐)( เทจ๐๐๐ + ฯ๐=1๐พ ๐๐๐
2 โ เทจ๐๐๐ ฯ๐=1๐พ ๐๐๐
2 )
Sin based t-norm าง๐๐๐ =2
๐๐ ๐๐โ1 sin
๐
2เทจ๐๐๐ + sin
๐
2
๐=1
๐พ
๐๐๐2 โ 1 โจ0
าง๐๐๐ = เทจ๐๐๐ โ
๐=1
๐พ
๐๐๐2
๐ โฅ 0
Dombi Prod.าง๐๐๐ =
1
1 +๐ 1 โ เทจ๐๐๐
เทจ๐๐๐
๐
+1 โ ฯ๐=1
๐พ ๐๐๐2
ฯ๐=1๐พ ๐๐๐
2
๐
๐ > 014
![Page 15: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/15.jpg)
Utilize T-norm and Sigmoid function
ำ๐๐๐ โก tanh ๐๐ โ าง๐๐๐
ำ๐๐๐ โก๐๐
1 + ๐๐2
โ าง๐๐๐
Method โ Improved Reliability Score
าง๐๐๐ = T เทจ๐๐๐ ,
๐=1
๐พ
๐๐๐2
าง๐๐๐ = T เทจ๐๐๐ , 1 +๐=1
เทฉ๐พ๐เทจ๐๐๐ log๐พ เทจ๐๐๐
15
Sigmoid functionT-norm
![Page 16: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/16.jpg)
Partition
Entropy (PE)
+Algebraic Prod.
Partition
Coefficient (PC)
+Algebraic Prod.
PE
+ Hamacher Prod.
+ Sigmoid func. (a)
PE
+ Hamacher Prod.
+ Sigmoid func. (b)
PC
+ Hamacher Prod.
+ Sigmoid func. (a)
PC
+ Hamacher Prod.
+ Sigmoid func. (b)
1st candidate 35,044 35,051 35,064 35,100 35,119 35,134
2nd candidate 1,649 1,682 1,618 1,589 1,614 1,595
3rd candidate 536 540 551 541 539 539
4th candidate 283 293 277 283 291 293
5th candidate 189 179 189 187 185 188
Total 37,701 37,745 37,699 37,700 37,748 37,749
40,000
Number of
total
instances
Number of matched instances
Results
Data: Family Income and Expenditure survey, Japan
Data size : approx.400,000 instances approx. 350,000 for Training
40,000 for EvaluationResults
16
We used data answered via online survey system
Sigmoid func.
(a): เต๐๐ 1 + ๐๐2 , (b): tanh ๐๐
๐ = 0.99 ๐๐ธ , 0.7(๐๐ถ)
Hamacher Prod.๐ฅ๐ฆ
๐ + (1 โ ๐)(๐ฅ + ๐ฆ โ ๐ฅ๐ฆ)
![Page 17: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/17.jpg)
Algebraic
Prod.
Sin-based
T-norm
Hamacher
Prod.Dombi Prod.
1st candidate 854 854 854 854
2nd candidate 58 55 58 56
3rd candidate 20 26 20 23
Total 932 935 932 933
1,000
T-normNumber of
total
instances
Number of matched instances
Results
Data: Family Income and Expenditure survey, Japan
Data size : 11,000 instances 10,000 for Training
1,000 for Evaluation
Only foodstuff and dining-out data were used
We assigned 11 classification codes for this experiment
Results
17
![Page 18: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/18.jpg)
Implementation in R
We implemented this technique in R and the R package is under development.
18
![Page 19: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/19.jpg)
Summary
Propose Generalized Reliability Score
to Improve Handling Ability of the Reliability
of Each Data to Each Code
Inclusion of Frequency of Each Feature to the Reliability Score
โ Improved Classification Accuracy
Utilize T-norm in Statistical Metric Space to the Reliability Score
โ Generalize the Reliability Score
Numerical examples show better performance
19
We implemented this technique in R
and the R package is under development
![Page 20: Improvement of the reliability score for autocoding โฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation](https://reader035.fdocuments.in/reader035/viewer/2022070802/5f02dad67e708231d40657c9/html5/thumbnails/20.jpg)
Reference[1] B. Schweizer, A. Sklar, Probabilistic Metric Spaces, Dover Publications, New York (2005).
[2] J. C. Bezdek, J. Keller, R. Krisnapuram, N.R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and
Image Processing, Kluwer Academic Publishers, New York (1999).
[3] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York (1981).
[4] K. Menger, Statistical Metrics, Proc. Nat. Acad. of Sci,. U.S.A., 28 (1942), 535-537.
[5] M. Mizumoto, Pictorial Representaition of Fuzzy connectives, Part โ : Cases of t-Norms, t-Conorms and
Averaging Operators, Fuzzy Sets and Systems, 31 (1989), 217-242.
[6] Y. Toko, S. Iijima, M. Sato-Ilic, Overlapping classification for autocoding system, Journal of Romanian
Statistical Review, Vol. 4, (2018), 58-73.
[7] Y. Toko, K. Wada, S. Iijima, M. Sato-Ilic, Supervised multiclass classifier for autocoding based on partition
coefficient, Intelligent Decision Technologies 2018, Smart Innovation, Systems and Technologies, Springer,
Switzerland, Vol. 97, (2018), 54-64.
Thank you !
20