Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are...

33
IBM Research Informative Prediction based on Ordinal Questionnaire Data T. Idé (Ide-san) and Amit Dhurandhar IBM T. J. Watson Research Center 2015 International Conference on Data Mining (ICDM 2015)

Transcript of Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are...

Page 1: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

IBM Research

Informative Prediction based on Ordinal Questionnaire Data

T. Idé (Ide-san) and Amit Dhurandhar IBM T. J. Watson Research Center

2015 International Conference on Data Mining (ICDM 2015)

Page 2: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

2

IBM Research

Contents

Problem setting

Item response theory

Maximum-a-posteriori framework for supervised IRT

Metric learning from supervised IRT

Experiments

Page 3: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

3

IBM Research

General problem setting: Binary classification on

questionnaire data

Input: Questionnaire answer for M questions,

Output: class label,

Data set:

Q1 No

Q2 Yes

Q3 No

Q4 No

Q5 Yes

Q6 No

… …

+1 Training

data

Prediction

model

Q1 Yes

Q2 No

Q3 No

Q4 No

Q5 No

Q6 Yes

… …

?

M questions

• 1: at-risk

• 0: no risk

Page 4: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

4

IBM Research

Questionnaire data = ordinal data

We need special considerations

To define

proper metric

space

To ensure full

interpretability

Page 5: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

5

IBM Research

Questionnaire data = ordinal data

We need special considerations

To define

proper metric

space

Q1 Yes

Q2 No

Q1

Q2

1

1 0

No guarantee that the naïve notion of distance

(e.g. Euclidean) holds for ordinal data

Page 6: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

6

IBM Research

Questionnaire data = ordinal data

We need special considerations

To ensure full

interpretability

Q1

Q2

1

1 0

How do different answer

choices affect the outcome?

How do different

questions affect the

outcome differently?

Page 7: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

7

IBM Research

Motivating real problem:

Predict how much likely a project is going to fail

Input data, x : questionnaire answers by reviewer

Outcome value, y : failure or success (after contract signing)

Pre-bid

consulting

Technical/business

Assessment

Risk & Impact

Analysis

Contact

signing

Project

Management

Review

Risk mitigation

actions

x y: questionnaire

answers : project health

indicator (+1/-1)

At-risk?

Q1 yes

Q2 no

Q3 yes

Typical solution design process

of IT system development

Page 8: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

8

IBM Research

(For ref.) What the questionnaire looks like

Major topics covered

Communication issues with the client

Well-definedness of the project scope

Issues related to subcontractors and internal teams

Project management issues

etc.

x : questionnaire

answers IT development project

At-risk?

Q1 yes

Q2 no

Q3 yes

Page 9: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

9

IBM Research

At-risk?

Q1 yes

Q2 no

Q3 yes

Problem setting summary

x : questionnaire

answers IT system development project

y :

failure or

success?

Question-question difference

Sample-sample difference

Yes-no difference

Wish to develop prediction model

that is informative in terms of:

Page 10: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

10

IBM Research

Contents

Problem setting

Item response theory

Maximum-a-posteriori framework for supervised IRT

Metric learning from supervised IRT

Experiments

Page 11: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

11

IBM Research

For natural interpretability, we employ the item response

theory (IRT) in psychometrics

Prob. of answering as “yes” for

the i-th question

-4 -2 0 2 4 6 8

0.0

0.2

0.4

0.6

0.8

1.0

failure tendency

pro

ba

bility

“difficulty”

“discrimination”

“guessing”

latent tendency of answering “yes”

pro

b. o

f ch

oo

sin

g “y

es”

• Overly optimistic for smaller risks

• Overly pessimistic for larger risks

• Sometimes use a guess

Page 12: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

12

IBM Research

ICC naturally corrects cognitive biases of humans to

uncover the true trait

latent failure

tendency questionnaire

answer

Should be

more faithful

to the truth

May be

biased

Page 13: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

13

IBM Research

Generative model for a questionnaire answer x

Prob. of answering as “yes” for

the i-th question

Page 14: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

14

IBM Research

Contents

Problem setting

Item response theory

Maximum-a-posteriori framework for supervised IRT

Metric learning from supervised IRT

Experiments

Page 15: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

15

IBM Research

We extend the IRT to the supervised setting

IRT is the standard method to analyze academic tests (e.g. SAT)

IRT is unsupervised. We are developing a supervised version of IRT

by including the outcome variable, y

Use the same ICC to take

account of cognitive bias

Extend the original IRT in

the supervised learning

setting

Page 16: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

16

IBM Research

We extend the IRT to the supervised setting by introducing a

prior distribution conditional on y

Capture natural assumption that

troubled projects should have higher

failure tendency

Page 17: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

17

IBM Research

Model parameters a,b,c are determined by maximizing log

marginalized likelihood

Page 18: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

18

IBM Research

Use numerical integration technique (Gauss-Hermite

quadrature) to maximize marginalized likelihood

See paper for details

Page 19: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

19

IBM Research

Contents

Problem setting

Item response theory

Maximum-a-posteriori framework for supervised IRT

Metric learning from supervised IRT

Experiments

Page 20: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

20

IBM Research

Questionnaire data = ordinal data

We need special considerations

To define

proper metric

space

Q1 Yes

Q2 No

Q1

Q2

1

1 0

No guarantee that the naïve notion of distance

(e.g. Euclidean) holds for ordinal data

recap

Page 21: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

21

IBM Research

Once a metric space is properly defined, we can simply use

k-NN for prediction

estimated

binary label prediction phase

New question

answer

Simply use k-NN classification

based on the distance metric

How can we find A from

the supervised IRT model?

Page 22: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

22

IBM Research

Proposing entropy equation for learning the Riemannian

metric A

Intuition: “The deviation of A from the isotropic case is driven by the

difference between the two classes (y=+1 or -1)” o Use the KL distance as a measure of difference

o Use the p.d.f. of the neighborhood component analysis (NCA, [Goldberger 05])

The entropy equation to determine A (at x’ )

Page 23: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

23

IBM Research

Approximated solution of the entropy equation

See paper for details

Diagonal element can be used as the informativeness of each question

Page 24: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

24

IBM Research

training phase

Summary of the model

Prior for supervised extension

Generative model of x : IRT Historical record MAP

estimation

Metric learning to

give the distance

metric, A, for k-NN

Page 25: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

25

IBM Research

Contents

Problem setting

Item response theory

Maximum-a-posteriori framework for supervised IRT

Metric learning from supervised IRT

Experiments

Page 26: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

26

IBM Research

Toy example: binary classification

for bi-variate binary inputs

Compared with regularized logistic

regression

Took the diagonal metric as

informativeness of each variable

Proposed method gives much richer

and more informative results

Page 27: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

27

IBM Research

Experiment: Using service provider’s real project

assessment data

Two types of project assessment data o CRA (contract risk assessment)

o PBA (project baseline assessment)

Data size o CRA: M = 22, N=262

o PBA: M = 56, N=1056 x : questionnaire

answers IT development project

At-risk?

Q1 yes

Q2 no

Q3 yes

Page 28: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

28

IBM Research

Result (1): Estimated IRT parameters providing practical

information on the usefulness of each question

Page 29: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

29

IBM Research

Result (2): Achieved comparable or even better accuracy

Performance metric: F-value o harmonic mean between troubled project

accuracy and non-troubled project

accuracy

Outperformed baseline o Max margin nearest neighbors

o Logistic regression

o Simple k-NN

o SVM

o Decision tree (C5.0)

o Neural network

our approach

Page 30: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

30

IBM Research

Conclusion

Proposed a shallow but fully-interpretable prediction method for

questionnaire data

Extended IRT to the supervised setting

Proposed a new metric learning criteria of entropy equation to define a

proper metric space

Applied the method to real project review data to show practical utility

This deck of slides will be uploaded on my Web site ( ide-research.net )

Page 31: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

31

IBM Research

Thank you!

Page 32: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

32

IBM Research

Another application: employee evaluation

Input data, x : questionnaire answers on employee’s performance o Questions are like “Has he/she made good enough contributions to teamwork?”

o Managers put evaluation on individual questions

Outcome value, y : termination or not

y :

stay or leave?

Good?

Q1 yes

Q2 no

Q3 yes

Page 33: Informative Prediction based on Ordinal Questionnaire DataIBM Research Model parameters a,b,c are determined by maximizing log marginalized likelihood . 18 IBM Research Use numerical

33

IBM Research

Q1

Q2

1

1

yes

0

no

How do different answer

choices affect the outcome?

For natural interpretability, we employ the item response

theory (IRT) in psychometrics

-4 -2 0 2 4 6 8

0.0

0.2

0.4

0.6

0.8

1.0

failure tendency

pro

ba

bility

“difficulty”

“discrimination”

“guessing”

latent tendency for answering “yes”

pro

b. o

f ch

oo

sin

g “y

es”

Item characteristic curve (ICC)