
Local outlier detection in data forensics: data mining approach to flag unusual schools

Mayuko Simon

Data Recognition Corporation
May 2012

1

Statistical methods for data forensics

• Univariate distributional techniques: e.g., average wrong-to-right erasures.

• Multivariate techniques
– Simple regression, e.g., 2011 Reading is predicted by the 2010 Reading score.
– A school is flagged if the observed dependent variable differs significantly from the model's prediction (see the sketch below).

• Schools are flagged when they are outliers compared to ALL other schools.
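To make the regression-based global flag concrete, here is a minimal sketch, not the study's actual implementation: fit a simple regression of the 2011 score on the 2010 score and flag schools whose standardized residual exceeds a cutoff. The function name, variable names, and the 3.0 cutoff are illustrative assumptions.

```python
import numpy as np

def flag_global_outliers(score_2010, score_2011, z_cut=3.0):
    """Flag schools whose 2011 score deviates strongly from a simple
    regression on their 2010 score (a global-outlier criterion)."""
    x = np.asarray(score_2010, dtype=float)
    y = np.asarray(score_2011, dtype=float)
    # Fit y = b0 + b1 * x by least squares.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Standardize the residuals and flag schools beyond the cutoff.
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return np.where(np.abs(z) > z_cut)[0]
```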

2

Global outlier

What if a school is suspicious but not extreme?

• Schools with suspicious behavior may not display sufficient extremity to make them outliers in comparison to all schools.

• Nevertheless, it is reasonable to assume that their scores will be higher than those of their peers—schools that are very similar in many relevant aspects.

3

Local outlier

Local vs. global outlier

• Traditional statistical data forensic techniques lack the ability to detect local outliers.
– Regression will miss the blue rectangle.
– A univariate approach (e.g., using only variable a) will miss both the blue rectangle and the red triangle.
– Cluster analysis is not for outlier detection.

4

The goal of RegLOD

• A regression-based local outlier detection algorithm is introduced: RegLOD.

• We wish to find schools that are very similar to their peers in most respects (in terms of most independent variables) but differ significantly in the current year's score (the dependent variable).

5

Assumptions of RegLOD

• When most independent variables are very similar, we expect the dependent variable to be similar, as well.

• This assumption is very reasonable and is frequently exploited: this is the principle on which regression trees or nearest neighbor regression are built (Hastie, 2009).

6

Data and variables

• Data:
– A large-scale standardized state assessment test.

• Variables:
– School-level Math and Reading scale scores in 2010 and 2011.
– School-level Math and Reading cohort scale scores in 2010 (for grade 4, the scale score when the same students were in grade 3).
– School-level average wrong-to-right erasures in 2010 and 2011.

7

Data and variables

• Scale scores were transformed into logits for a better sense of school level during the analysis (see the sketch below).

• Dependent variable
– 2011 Reading or 2011 Math.

• Five independent variables
– 2011 Math or Reading, 2010 Math and Reading, and 2010 cohort Math and Reading.
– Erasure counts were not used in the algorithm.
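The deck does not specify the exact logit transformation. A common choice is to rescale each scale score to a proportion of a score range and take the log-odds; the sketch below assumes that approach, with hypothetical range endpoints.

```python
import numpy as np

def scale_to_logit(scale_scores, lo, hi, eps=1e-3):
    """Map scale scores onto (0, 1) relative to a hypothetical score range
    [lo, hi], then apply the logit. The exact transformation used in the
    study is not specified, so this is only an illustrative stand-in."""
    p = (np.asarray(scale_scores, dtype=float) - lo) / (hi - lo)
    p = np.clip(p, eps, 1 - eps)          # keep the logit finite
    return np.log(p / (1 - p))
```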

8

RegLOD algorithm overview

1. Select a set of independent variables.
2. Find local weights.
3. Make a peer group for each school.
4. Obtain empirical p-values and flag schools when the criteria are met.

9

10

RegLOD Example: Grade 4 Reading
Step 1. Select a set of independent variables

• Dependent variable (DV): 2011 Reading (G4)
• Independent variables (IV): 2011 Math (G4), 2010 Reading (G4), 2010 Math (G4), 2010 Cohort Reading (G3), 2010 Cohort Math (G3)
• R² = 0.99

11

RegLOD Example: Grade 4 Reading
Step 2. Determine the local weights

2011 logit score: P1 = Math. 2010 logit scores: P2 = Reading, P3 = Math. 2010 cohort logit scores: P4 = Reading, P5 = Math.

Subject  Grade  P1        P2        P3         P4        P5
R        4      0.35818   0.23228   -0.12216   0.23214   -0.05164
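The deck shows the resulting weights but not the estimation procedure. One plausible reading, given the Step 1 regression with R² = 0.99, is that the local weights P1–P5 are the fitted coefficients of the dependent variable regressed on the five independent variables on the logit scale; the sketch below is an assumption along those lines, not a confirmed description of the method.

```python
import numpy as np

def local_weights(X, y):
    """Hypothetical Step 2 helper: regress the dependent variable (e.g.,
    2011 Reading logit) on the five independent variables and use the
    fitted slopes as local weights P1..P5. The actual weighting scheme
    in the study may differ."""
    X = np.asarray(X, dtype=float)        # shape (n_schools, 5)
    y = np.asarray(y, dtype=float)        # shape (n_schools,)
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]                       # drop the intercept, keep P1..P5
```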

12

RegLOD Example: Grade 4 Reading
Step 3. Select peer schools

• Compute pair-wise distances using the weights.
• Select peer schools within +/- 0.03 (Dist value) of a school (see the sketch after the table).

School i  School j  Distance/Similarity
1         2         0.051504
1         3         0.144772
1         4         -0.03006
1         5         0.21838
1         6         -0.03232
1         7         0.084222
1         8         0.097555
...       ...       ...
1600      1603      0.293038
1601      1602      2.342625
1601      1603      0.050075
1602      1603      1.848821
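A minimal sketch of Step 3, under the assumption (suggested by the negative Dist values in the table) that the distance between two schools is the signed difference of their weighted composites of the independent variables, with peers taken within +/- 0.03. Function and argument names are hypothetical.

```python
import numpy as np

def peer_schools(X, weights, idx, tol=0.03):
    """Hypothetical Step 3 helper: form a weighted composite of the
    independent variables for every school and treat schools whose
    composite is within +/- tol of school `idx` as its peers. The signed
    'distance' mirrors the Dist column on the slide, but the exact
    formula used in the study is an assumption here."""
    X = np.asarray(X, dtype=float)
    composite = X @ np.asarray(weights, dtype=float)   # one score per school
    dist = composite - composite[idx]                  # signed difference
    return np.where(np.abs(dist) <= tol)[0]            # includes idx itself
```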

13

RegLOD Example: Grade 4 Reading
Step 4. Obtain empirical p-values

• Bootstrap the 2011 Reading grade 4 scores of the peer schools.
• Obtain an empirical p-value for each bootstrap replication and average them (see the sketch below).
• Flag a school if the empirical p-value is 0.05 or less.
• Flag a school if the number of peer schools is 10 or fewer.

[Histogram for one school: x-axis "Bootstrapped school mean" (approx. 1250-1400), y-axis "Frequency" (0-15).]
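A rough sketch of Step 4 under one reading of the bullets above: in each bootstrap replication, resample the peers' 2011 scores with replacement, record the fraction of the resample at or above the school's score, and average these fractions across replications to get the empirical p-value. The exact resampling scheme and the seed are assumptions.

```python
import numpy as np

def empirical_p_value(school_score, peer_scores, n_boot=1000, seed=0):
    """Hypothetical Step 4 helper: repeatedly resample the peers' 2011
    scores with replacement and, in each replication, record the fraction
    of resampled peers scoring at or above the school in question; the
    average of these fractions serves as the empirical p-value. The exact
    bootstrap scheme in the study may differ."""
    rng = np.random.default_rng(seed)
    peers = np.asarray(peer_scores, dtype=float)
    p_values = []
    for _ in range(n_boot):
        sample = rng.choice(peers, size=peers.size, replace=True)
        p_values.append(np.mean(sample >= school_score))
    return float(np.mean(p_values))

# A school would then be flagged when the returned p-value is <= 0.05,
# or when it has 10 or fewer peer schools.
```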

14

How many schools were flagged?
Local weights (Dist = 0.03)

Subject  Grade  Total School N  Flagged School N  Proportion Flagged (%)
R        4      1603            49                3.06
R        5      1493            37                2.48
R        6      1056            38                3.60
R        7      836             20                2.39
R        8      836             22                2.63
M        4      1603            33                2.06
M        5      1493            63                4.22
M        6      1056            36                3.41
M        7      836             36                4.31
M        8      836             37                4.43

15

Comparison to the results of other statistical methods for data forensics

• SS: scale score analysis, e.g., the 2011 scale score is predicted by the 2010 scale score.

• PL: performance level analysis, e.g., the proportion proficient or above in 2011 is predicted by the 2010 proportion proficient or above.

• Reg: regression analysis using two subjects, e.g., 2011 Reading is predicted by 2011 Mathematics.

• Rasch: use of Rasch residuals.
• WR: wrong-to-right erasure counts.
• SSco: scale score analysis using cohort students, e.g., the 2011 scale score is predicted by the 2010 scale score for the same cohort of students.

• StdRes: standardized residuals of a multiple regression using the same variables as the RegLOD analysis (see the sketch below).
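For the StdRes comparison, a minimal sketch, assuming ordinary least squares and simple residual standardization rather than studentization, might look like this:

```python
import numpy as np

def standardized_residuals(X, y):
    """Rough illustration of the StdRes comparison method: fit a multiple
    regression of the dependent variable on the same independent variables
    RegLOD uses and return each school's residual in standard-deviation
    units. (Internally studentized residuals would be a refinement; this
    simpler standardization is an assumption.)"""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid / resid.std(ddof=design.shape[1])
```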

16

Grade 4 Reading: Comparison of local and global outlier detection

RegLOD columns: Num, Peer N, P-value. Statistical methods (global outlier detection): SS, PL, Reg, Rasch, WR, SSCo, StdRes.

Num  Peer N  P-value  SS   PL   Reg   Rasch  WR    SSCo  StdRes
1    3       -        3.3  1.1  9.9   0.0    4.9   11.4  3.12
2    9       -        9.3  4.6  1.4   1.0    0.2   13.8  1.09
3    93      0.000    6.4  3.5  1.1   0.0    0.0   6.7   3.02
4    57      0.000    4.6  4.0  8.1   0.0    2.8   17.4  2.41
5    6       -        0.9  2.6  8.8   0.0    0.0   7.1   0.64
6    3       -        3.9  2.2  0.0   0.0    20.8  16.8  1.83
7    424     0.005    5.2  1.0  11.0  1.6    0.0   8.4   3.91
8    312     0.006    1.5  1.1  9.7   0.3    2.4   7.2   3.61
9    380     0.008    9.1  4.6  11.8  0.1    4.4   5.0   4.19
10   119     0.008    2.4  2.2  8.7   0.6    0.0   1.1   3.49
11   435     0.009    2.8  1.1  8.0   0.0    0.0   5.0   3.34
12   147     0.014    6.4  4.5  4.9   0.0    4.0   3.2   4.14
13   112     0.018    0.1  1.1  16.9  0.0    2.2   6.8   1.68
14   418     0.019    4.2  3.1  14.3  0.0    0.0   4.7   2.23
15   52      0.019    1.1  0.4  8.6   0.0    2.5   3.0   2.77

17

A school with 10 or fewer peers

• E.g., school number 1 in the table:
– Its 2011 wrong-to-right erasure count was at the 96th percentile, which is rather high.
– RegLOD found only three peer schools including the school itself, indicating that this school is an outlier.
– There was a large increase in percentile (26 to 95), indicating a suspicious increase in score.
– There is reasonable evidence that this school needs further scrutiny.

Percentiles

Variable                  Percentile
2011 Reading (DV)         95
2011 Math (P1)            52
2010 Reading (P2)         100
2010 Math (P3)            93
2010 Cohort Reading (P4)  26
2010 Cohort Math (P5)     20
2010 WR                   60
2011 WR                   96

18

A school with many peer schools

• School number 3 in the table:
– RegLOD found 93 peer schools including the school itself.
– There was a large increase in percentile (23rd to 76th percentile).
– Since there are many peers, we can plot the variables along with the peer schools.

Percentiles

Variable                  Percentile
2011 Reading (DV)         76
2011 Math (P1)            75
2010 Reading (P2)         22
2010 Math (P3)            67
2010 Cohort Reading (P4)  23
2010 Cohort Math (P5)     41
2010 WR                   3
2011 WR                   52

19

Comparison to peers for the IVs

2010 Reading and 2010 cohort Reading are around the 20th percentile.

The school is within the peer distribution.

20

Comparison to peers with 2011 Reading score (DV)

This school was at the 23rd percentile on 2010 cohort Reading but at the 76th percentile on 2011 Reading.

The school is an outlier among its peers on 2011 Reading.

21

A lower-achieving school with many peer schools

• School number 12 in the table:
– RegLOD found 147 peer schools including the school itself.
– There was a moderate increase in percentile (13th to 42nd percentile).
– The 2010 cohort Math percentile of 48 seems a little strange given that 2011 Math is at the 14th percentile.
– The 2011 wrong-to-right erasure count was at the 96th percentile, which is rather high.
– We can take a look at the histograms.

Percentiles

Variable                  Percentile
2011 Reading (DV)         42
2011 Math (P1)            14
2010 Reading (P2)         12
2010 Math (P3)            27
2010 Cohort Reading (P4)  13
2010 Cohort Math (P5)     48
2010 WR                   35
2011 WR                   96

22

Comparison to peers for the IVs

2010 Math is in the right tail because of its oddly high percentile. Other than that, the school is well within the peer distribution.

The school is within the peer distribution.

23

Comparison to peers with 2011 Reading score (DV)

The school is an outlier among its peers on 2011 Reading.

This school had only a moderate increase in percentile, but since it is an outlier compared to its peers, it is a local outlier.

24

Did all flagged schools exhibit suspicious behavior?

• 12 schools were potentially flagged incorrectly (extremely high/low achievement).

• The majority of the flagged schools exhibited suspicious behavior.

• Some schools flagged by RegLOD were also flagged by other statistical methods: these were local as well as global outliers.

• Other schools were flagged by RegLOD but not by the other statistical methods: these schools were local outliers but not global outliers.

25

Conclusion

• RegLOD has shown great promise in data forensics, and it is a valuable addition to our data forensics tools.

• Its applicability is not limited to cheating detection in educational testing.

• RegLOD's robust design, specifically its model-based design (the concept of dependent and independent variables in data mining) and its ability to adapt, makes it applicable to a wide range of outlier detection problems.

• We continue to study its capabilities and to extend and apply it to other contexts and tasks.

26

Thank you!