Improvement of the reliability score for autocoding...

20
Improvement of Reliability Score for Autocoding and its Implementation in R uRos2019, 21 May 2019, Bucharest Yukako Toko 1 , Shinya Iijima 1 , Mika Sato-Ilic 1,2 1 National Statistics Center, Japan 2 University of Tsukuba, Japan 1

Transcript of Improvement of the reliability score for autocoding...

Page 1: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Improvement of Reliability Score for Autocoding

and its Implementation in R

uRos2019, 21 May 2019, Bucharest

Yukako Toko1, Shinya Iijima1, Mika Sato-Ilic1,2

1National Statistics Center, Japan

2 University of Tsukuba, Japan

1

Page 2: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Overview - Background

Uncertainty of Training Data

* Semantic problem

* Interpretation problem

* Insufficiently detailed input

information

Development of Overlapping ClassifierOne Feature into Multiple Classes

(Y. Toko, S. Iijima, M. Sato-Ilic, 2018)

Utilized the idea of Fuzzy Partition Entropy

Reliability Score Considering Uncertainties

from Both Measures

Utilize Difference of Measures

for Uncertainties

Uncertainty from data

Probability Measure

Uncertainty from latent

classification structure in data

Fuzzy Measure

Conventional Classifier

One Feature

into One Class

Problem

2

Page 3: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Purpose of This Study

Overlapping Classifier

based on Reliability Score(Y. Toko, S. Iijima, M. Sato-Ilic, 2018)

Consideration of Frequency of Each Feature

in training dataset

Inclusion of the frequency of each feature

to the Reliability Score

Consideration of Generalized Reliability Score

Apply T-norm in Statistical metric space

Improvement of the reliability score

3

Page 4: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Overview - System Structure

Training

dataset

Feature

frequency table

Training process

Input OutputFeature

extraction

Candidates

retrieval

Reliability

Score calc.

Classification process

Feature

extraction

Feature A Class X Feature AClass X

Class Y

-> proposed an algorithm that allows the assignment of one feature

is classified to multiple classes

(overlapping classification)

To address the unrealistic restriction : one feature is classified to a single class

Generalized

Reliability Score

Inclusion of Frequency

of Each Feature

to Reliability Score

4

Page 5: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ Overlapping classifier

Step 1 : Calculate the probability of j-th feature ( j=1,โ€ฆ,J ) to a class k ( k=1,โ€ฆ,K ) as

๐‘๐‘— ๐‘˜ =๐‘›๐‘— ๐‘˜

๐‘›๐‘—, ๐‘›๐‘— =

๐‘˜=1

๐พ

๐‘›๐‘— ๐‘˜

๐‘›๐‘—๐‘˜ : Number of text descriptions in a class ๐‘˜ with j-th feature in the training dataset

5

Page 6: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ Overlapping classifier

Step 2 : Determine at most เทฉ๐พ (เทฉ๐พ < ๐พ) promising candidate classes for each feature based

on ๐‘๐‘— ๐‘˜

2. Create เทจ๐‘๐‘—1,โ‹ฏ , เทจ๐‘๐‘— เทฉ๐พ๐‘—, เทจ๐พ๐‘— โ‰ค เทจ๐พ โ‰ค ๐พ

Note : When there are same values in ๐‘๐‘—1,โ‹ฏ , ๐‘๐‘—๐พ ,

then we select as many as possible different เทจ๐พ๐‘— classes for each feature j

1. Arrange ๐‘๐‘—1,โ‹ฏ , ๐‘๐‘—๐พ in descending order and create ๐‘๐‘—1,โ‹ฏ , ๐‘๐‘—๐พ ,

such as ๐‘๐‘—1 โ‰ฅ โ‹ฏ โ‰ฅ ๐‘๐‘—๐พ , ๐‘— = 1,โ‹ฏ , ๐ฝ

6

Page 7: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ Overlapping classifier

Step 3 : Calculate the Reliability Score าง๐‘๐‘—๐‘˜

าง๐‘๐‘—๐‘˜ = เทจ๐‘๐‘—๐‘˜ 1 +๐‘š=1

เทฉ๐พ๐‘—เทจ๐‘๐‘—๐‘š log๐พ เทจ๐‘๐‘—๐‘š , ๐‘— = 1,โ‹ฏ , ๐ฝ, ๐‘˜ = 1,โ‹ฏ , เทจ๐พ๐‘—

Step 4 : Determine top L (๐ฟ โˆˆ {1, โ€ฆ , ฯƒ๐‘—๐‘™=1โ„Ž๐‘™ เทฉ๐พ๐‘—๐‘™}) candidate classes

When the number of target text descriptions is T, and each text description includes

โ„Ž๐‘™ ( ๐‘™ = 1,โ‹ฏ , ๐‘‡ ) features, corresponding าง๐‘๐‘—๐‘˜ for l-th text description can be represented as

าง๐‘๐‘—๐‘™๐‘˜, ๐‘—๐‘™ = 1,โ‹ฏ , โ„Ž๐‘™ , ๐‘˜ = 1,โ‹ฏ , เทฉ๐พ๐‘—๐‘™ , ๐‘™ = 1,โ‹ฏ , ๐‘‡

Reliability score of j-th feature included in l-th text description to a class k

7

Page 8: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ Overlapping classifier

าง๐‘๐‘—๐‘˜ : Reliability Score of j -th feature to a class k

Probability of feature j to class k Classification status of feature j over the เทฉ๐พ๐‘— classes

Transformation from เทจ๐‘๐‘— ๐‘˜ to classification

status of feature j

Degree of Reliability

Explanation of the uncertainty of the training

data

Utilization of the deference of measurements

of uncertainty

Probability Fuzzy

าง๐‘๐‘—๐‘˜ = เทจ๐‘๐‘—๐‘˜ 1 +๐‘š=1

เทฉ๐พ๐‘—เทจ๐‘๐‘—๐‘š log๐พ เทจ๐‘๐‘—๐‘š

8

Page 9: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ different fuzzy measurement

าง๐‘๐‘—๐‘˜ = เทจ๐‘๐‘—๐‘˜

๐‘˜=1

๐พ

๐‘๐‘—๐‘˜2

Apply another fuzzy measurement for reliability score

Partition coefficient for each feature j

๐‘ƒ๐ถ๐‘— =

๐‘˜=1

๐พ

๐‘๐‘—๐‘˜2 , ๐‘— = 1, โ€ฆ , ๐ฝ

(Y. Toko, K. Wada, S. Iijima, M. Sato-Ilic, 2018)

Another degree of Reliability

Classification status of feature j over the K classes

าง๐‘๐‘—๐‘˜ = เทจ๐‘๐‘—๐‘˜ 1 +๐‘š=1

เทฉ๐พ๐‘—เทจ๐‘๐‘—๐‘š log๐พ เทจ๐‘๐‘—๐‘š

Partition coefficient Partition entropy

9

Page 10: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ Generalized Reliability Score

าง๐‘๐‘—๐‘˜ = เทจ๐‘๐‘—๐‘˜

๐‘˜=1

๐พ

๐‘๐‘—๐‘˜2 าง๐‘๐‘—๐‘˜ = เทจ๐‘๐‘—๐‘˜ 1 +

๐‘š=1

เทฉ๐พ๐‘—เทจ๐‘๐‘—๐‘š log๐พ เทจ๐‘๐‘—๐‘š

Partition coefficient Partition entropy

าง๐‘๐‘—๐‘˜ = T เทจ๐‘๐‘—๐‘˜ ,

๐‘˜=1

๐พ

๐‘๐‘—๐‘˜2 าง๐‘๐‘—๐‘˜ = T เทจ๐‘๐‘—๐‘˜ , 1 +

๐‘š=1

เทฉ๐พ๐‘—เทจ๐‘๐‘—๐‘š log๐พ เทจ๐‘๐‘—๐‘š

Generalization

T(๐‘Ž , ๐‘): T-norm between a and b

10

Page 11: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ T-norm (Triangular norms)

๐‘‡ โˆถ 0,1 ร— 0,1 โ†’ [0,1]

โˆ€๐‘Ž, ๐‘, ๐‘, ๐‘‘ โˆˆ [0,1]

(1) 0 โ‰ค ๐‘‡ ๐‘Ž, ๐‘ โ‰ค 1,

๐‘‡ ๐‘Ž, 0 = ๐‘‡ 0, ๐‘ = 0

๐‘‡ ๐‘Ž, 1 = ๐‘‡ 1, ๐‘Ž = ๐‘Ž( Boundary conditions )

(2) a โ‰ค ๐‘, ๐‘ โ‰ค ๐‘‘ => ๐‘‡ ๐‘Ž, ๐‘ โ‰ค ๐‘‡(๐‘, ๐‘‘) ( Monotonicity )

(3) ๐‘‡ ๐‘Ž, ๐‘ = ๐‘‡(๐‘, ๐‘Ž) ( Symmetry )

(4) ๐‘‡(๐‘‡ ๐‘Ž, ๐‘ , ๐‘) = ๐‘‡(๐‘Ž, ๐‘‡ ๐‘, ๐‘ ) ( Associativity )

(K. Menger, 1942)

11

Page 12: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ Statistical metric space

๐น๐‘๐‘ž ๐‘ฅ โ‰ก Pr{๐‘‘๐‘๐‘ž < ๐‘ฅ}

โˆ€๐‘, ๐‘ž, ๐‘Ÿ โˆˆ ๐‘†

๐‘‘๐‘๐‘ = 0

๐‘‘๐‘๐‘ž > 0 (๐‘ โ‰  ๐‘ž)

๐‘‘๐‘๐‘ž = ๐‘‘๐‘ž๐‘

๐‘‘๐‘๐‘Ÿ โ‰ค ๐‘‘๐‘๐‘ž + ๐‘‘๐‘ž๐‘Ÿ

๐น๐‘๐‘ ๐‘ฅ = 1, for all ๐‘ฅ > 0

๐น๐‘๐‘ž ๐‘ฅ < 1, ๐‘ โ‰  ๐‘ž for some ๐‘ฅ > 0

๐น๐‘๐‘ž = ๐น๐‘ž๐‘

๐น๐‘๐‘Ÿ(๐‘ฅ + ๐‘ฆ) โ‰ฅ ๐‘‡(๐น๐‘ž๐‘ ๐‘ฅ , ๐น๐‘ž๐‘Ÿ ๐‘ฆ )

โ†”

โ†”

โ†”

โ†”

12

Page 13: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ Examples of T-norm

๐‘ฅ๐‘ฆ

t-norm ๐‘ก(๐‘ฅ, ๐‘ฆ)

Algebraic Prod.

Hamacher Prod.๐‘ฅ๐‘ฆ

๐‘ + (1 โˆ’ ๐‘)(๐‘ฅ + ๐‘ฆ โˆ’ ๐‘ฅ๐‘ฆ)

Sin based t-norm2

๐œ‹๐‘ ๐‘–๐‘›โˆ’1 sin

๐œ‹

2๐‘ฅ + sin

๐œ‹

2๐‘ฆ โˆ’ 1 โˆจ0

Dombi Prod.

1

1 +๐‘ 1 โˆ’ ๐‘ฅ

๐‘ฅ

๐‘

+1 โˆ’ ๐‘ฆ๐‘ฆ

๐‘

๐‘ โ‰ฅ 0

๐‘ > 0

13

Page 14: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Method โ€“ Utilization of T-norm for Reliability Score

Algebraic Prod.

Hamacher Prod. าง๐‘๐‘—๐‘˜ =เทจ๐‘๐‘—๐‘˜ ฯƒ๐‘˜=1

๐พ ๐‘๐‘—๐‘˜2

๐‘ + (1 โˆ’ ๐‘)( เทจ๐‘๐‘—๐‘˜ + ฯƒ๐‘˜=1๐พ ๐‘๐‘—๐‘˜

2 โˆ’ เทจ๐‘๐‘—๐‘˜ ฯƒ๐‘˜=1๐พ ๐‘๐‘—๐‘˜

2 )

Sin based t-norm าง๐‘๐‘—๐‘˜ =2

๐œ‹๐‘ ๐‘–๐‘›โˆ’1 sin

๐œ‹

2เทจ๐‘๐‘—๐‘˜ + sin

๐œ‹

2

๐‘˜=1

๐พ

๐‘๐‘—๐‘˜2 โˆ’ 1 โˆจ0

าง๐‘๐‘—๐‘˜ = เทจ๐‘๐‘—๐‘˜ โˆ—

๐‘˜=1

๐พ

๐‘๐‘—๐‘˜2

๐‘ โ‰ฅ 0

Dombi Prod.าง๐‘๐‘—๐‘˜ =

1

1 +๐‘ 1 โˆ’ เทจ๐‘๐‘—๐‘˜

เทจ๐‘๐‘—๐‘˜

๐‘

+1 โˆ’ ฯƒ๐‘˜=1

๐พ ๐‘๐‘—๐‘˜2

ฯƒ๐‘˜=1๐พ ๐‘๐‘—๐‘˜

2

๐‘

๐‘ > 014

Page 15: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Utilize T-norm and Sigmoid function

ำ–๐‘๐‘—๐‘˜ โ‰ก tanh ๐‘›๐‘— โˆ— าง๐‘๐‘—๐‘˜

ำ–๐‘๐‘—๐‘˜ โ‰ก๐‘›๐‘—

1 + ๐‘›๐‘—2

โˆ— าง๐‘๐‘—๐‘˜

Method โ€“ Improved Reliability Score

าง๐‘๐‘—๐‘˜ = T เทจ๐‘๐‘—๐‘˜ ,

๐‘˜=1

๐พ

๐‘๐‘—๐‘˜2

าง๐‘๐‘—๐‘˜ = T เทจ๐‘๐‘—๐‘˜ , 1 +๐‘š=1

เทฉ๐พ๐‘—เทจ๐‘๐‘—๐‘š log๐พ เทจ๐‘๐‘—๐‘š

15

Sigmoid functionT-norm

Page 16: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Partition

Entropy (PE)

+Algebraic Prod.

Partition

Coefficient (PC)

+Algebraic Prod.

PE

+ Hamacher Prod.

+ Sigmoid func. (a)

PE

+ Hamacher Prod.

+ Sigmoid func. (b)

PC

+ Hamacher Prod.

+ Sigmoid func. (a)

PC

+ Hamacher Prod.

+ Sigmoid func. (b)

1st candidate 35,044 35,051 35,064 35,100 35,119 35,134

2nd candidate 1,649 1,682 1,618 1,589 1,614 1,595

3rd candidate 536 540 551 541 539 539

4th candidate 283 293 277 283 291 293

5th candidate 189 179 189 187 185 188

Total 37,701 37,745 37,699 37,700 37,748 37,749

40,000

Number of

total

instances

Number of matched instances

Results

Data: Family Income and Expenditure survey, Japan

Data size : approx.400,000 instances approx. 350,000 for Training

40,000 for EvaluationResults

16

We used data answered via online survey system

Sigmoid func.

(a): เต—๐‘›๐‘— 1 + ๐‘›๐‘—2 , (b): tanh ๐‘›๐‘—

๐‘ = 0.99 ๐‘ƒ๐ธ , 0.7(๐‘ƒ๐ถ)

Hamacher Prod.๐‘ฅ๐‘ฆ

๐‘ + (1 โˆ’ ๐‘)(๐‘ฅ + ๐‘ฆ โˆ’ ๐‘ฅ๐‘ฆ)

Page 17: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Algebraic

Prod.

Sin-based

T-norm

Hamacher

Prod.Dombi Prod.

1st candidate 854 854 854 854

2nd candidate 58 55 58 56

3rd candidate 20 26 20 23

Total 932 935 932 933

1,000

T-normNumber of

total

instances

Number of matched instances

Results

Data: Family Income and Expenditure survey, Japan

Data size : 11,000 instances 10,000 for Training

1,000 for Evaluation

Only foodstuff and dining-out data were used

We assigned 11 classification codes for this experiment

Results

17

Page 18: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Implementation in R

We implemented this technique in R and the R package is under development.

18

Page 19: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Summary

Propose Generalized Reliability Score

to Improve Handling Ability of the Reliability

of Each Data to Each Code

Inclusion of Frequency of Each Feature to the Reliability Score

โ†’ Improved Classification Accuracy

Utilize T-norm in Statistical Metric Space to the Reliability Score

โ†’ Generalize the Reliability Score

Numerical examples show better performance

19

We implemented this technique in R

and the R package is under development

Page 20: Improvement of the reliability score for autocoding โ€ฆr-project.ro/conference2019/presentations/uRos2019_ytoko.pdfImprovement of Reliability Score for Autocoding and its Implementation

Reference[1] B. Schweizer, A. Sklar, Probabilistic Metric Spaces, Dover Publications, New York (2005).

[2] J. C. Bezdek, J. Keller, R. Krisnapuram, N.R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and

Image Processing, Kluwer Academic Publishers, New York (1999).

[3] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York (1981).

[4] K. Menger, Statistical Metrics, Proc. Nat. Acad. of Sci,. U.S.A., 28 (1942), 535-537.

[5] M. Mizumoto, Pictorial Representaition of Fuzzy connectives, Part โ… : Cases of t-Norms, t-Conorms and

Averaging Operators, Fuzzy Sets and Systems, 31 (1989), 217-242.

[6] Y. Toko, S. Iijima, M. Sato-Ilic, Overlapping classification for autocoding system, Journal of Romanian

Statistical Review, Vol. 4, (2018), 58-73.

[7] Y. Toko, K. Wada, S. Iijima, M. Sato-Ilic, Supervised multiclass classifier for autocoding based on partition

coefficient, Intelligent Decision Technologies 2018, Smart Innovation, Systems and Technologies, Springer,

Switzerland, Vol. 97, (2018), 54-64.

Thank you !

[email protected]

20