
EINDHOVEN UNIVERSITY OF TECHNOLOGY

Table 1.1: Goals of the data mining on educational data.

Table 3.1: Distribution of students with their course count.

Table 3.2: Validation and cleaning with validation window (2* standard deviation).
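The table captions in this chapter refer to validation windows of 2 and 3 standard deviations. A minimal sketch of such a filter follows, assuming the window is centred on the mean study duration and that values outside it are discarded. The function name `sigma_window` and the sample durations are hypothetical and not taken from the thesis.

```python
from statistics import mean, pstdev

def sigma_window(values, k):
    """Keep the values that lie within mean +/- k * standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    low, high = mu - k * sigma, mu + k * sigma
    return [v for v in values if low <= v <= high]

# Hypothetical study durations in years, with one extreme value.
durations = [4, 5, 5, 5, 6, 6, 6, 7, 7, 20]
kept_2std = sigma_window(durations, 2)  # narrower window
kept_3std = sigma_window(durations, 3)  # wider window
```

On this toy input the 2* window drops the extreme value while the 3* window keeps it, which is the kind of difference the paired 2std and 3std tables would expose.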

Figure 3.5: Distribution of students over study time. (Axes: Years, 3 to 8; Students, 0 to 140.)

Table 3.3: Validation and cleaning with 3* standard deviation window.

Figure 3.6: Distribution of students over study time. (Axes: Years, 1 to 11; Students, 0 to 180.)

Table 3.4: Validation and cleaning with 3* standard deviation window on bootstrapped data.

Figure 3.7: Distribution of students over study time. (Axes: Years, 2 to 9; Students, 0 to 160.)

Table 3.5: Validation and cleaning with 2* standard deviation window on bootstrapped data.

Figure 3.8: Distribution of students over study time. (Axes: Years, 4 to 7; Students, 0 to 140.)

Table 3.6: Validation and cleaning with 2* standard deviation window.

Figure 3.9: Distribution of students over study time. (Axes: Years, 4 to 8; Students, 0 to 70.)

Table 3.7: Validation and cleaning with 3* standard deviation window.

Figure 3.10: Distribution of students over study time. (Axes: Years, 2 to 10; Students, 0 to 90.)

Table 3.8: Validation and cleaning with 2* standard deviation window.

Figure 3.11: Distribution of students over study time. (Axes: Years, 4 to 8; Students, 0 to 70.)

Table 3.9: Validation and cleaning with 3* standard deviation window.

Figure 3.12: Distribution of students over study time. (Axes: Years, 3 to 10; Students, 0 to 90.)

Figure 3.13: Determination of short, normal and long classes (boundary values 4.767096 and 5.832469).

Figure 3.14: Determination of short, normal and long classes (boundary values 5.107548 and 6.34147).

Figure 3.15: Determination of short, normal and long classes (boundary values 4.554752 and 6.374105).

Figure 3.16: Determination of short, normal and long classes (boundary values 5.109692 and 7.164413).
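The figures above each give two values that separate the short, normal and long study-time classes. The sketch below shows how such boundaries could be applied to a study duration. It assumes both boundaries are exclusive, which the source does not state, and the function name `study_length_class` is hypothetical.

```python
def study_length_class(years, short_below, long_above):
    """Label a study duration as short, normal or long.

    Treating both boundaries as exclusive is an assumption.
    """
    if years < short_below:
        return "short"
    if years > long_above:
        return "long"
    return "normal"

# Boundary values as printed with Figure 3.16; the durations are made up.
labels = [study_length_class(y, 5.109692, 7.164413) for y in (4.5, 6.0, 8.0)]
```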

Figure 3.19: Courses with number of students

Figure 3.20: Students with number of courses

Figure 3.21: Courses with number of students

Figure 3.22: Students with number of courses

Figure 3.23: Courses with number of students

Figure 3.24: Students with number of courses

Figure 3.25: Distribution of results over the years

Figure 3.26: Number of new students over the years.

Figure 3.27: Distribution of results over the years

Table 3.10: Attributes with possible values

Table 3.11: Data with categorized attributes

Table 3.12: Data with ordinal attributes

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2}{\frac{1}{p} + \frac{1}{r}}$$

Definitions of accuracy metrics
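The accuracy metrics defined here can be computed directly from the counts of true positives (TP), false positives (FP) and false negatives (FN). The sketch below takes p and r in the F1 formula to be precision and recall; the counts are hypothetical and serve only as an illustration.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 / (1 / p + 1 / r)

# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 80, 20, 10
p = precision(tp, fp)
r = recall(tp, fn)
score = f1(p, r)
```

The harmonic-mean form is algebraically the same as 2·TP / (2·TP + FP + FN), which gives a quick consistency check.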

Table 4.1: JRIP results on 2std, categorical attributes

Table 4.2: JRIP results on 3std, categorical attributes

Table 4.3: JRIP results on 2std, ordinal attributes

Table 4.4: JRIP results on 3std, ordinal attributes

Table 4.5: Class distribution over instances of 3std, ordinal attributes

Table 4.6: Rules with cost sensitive learning.

Table 4.7: Rules with cost sensitive learning.

Figure 4.1: ROC curve of long class, table A15

Figure 4.2: ROC curve of long class, table A16

Table 4.8: JRIP results on binary class with SMOTE.

Figure 4.3: ROC curve of class long from JRIP results on binary class with SMOTE.
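SMOTE balances a data set by adding synthetic minority-class instances, each placed on the line segment between a minority instance and one of its nearest minority neighbours. The relation name in the Weka run information later in this document (`SMOTE-C0-K5-P400.0-S1`) suggests 5 nearest neighbours and 400 % oversampling. The sketch below is a simplified, numeric-only illustration under those assumptions, not Weka's implementation, which may treat nominal attributes differently. The sample points are hypothetical.

```python
import math
import random

def smote(minority, per_sample, k=5, seed=1):
    """Create per_sample synthetic points for every minority point by
    interpolating towards one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, base in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        for _ in range(per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position on the segment base -> neighbour
            synthetic.append(
                tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
            )
    return synthetic

# Hypothetical two-dimensional minority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]
new_points = smote(minority, per_sample=4)  # 400 % oversampling
```

Because every synthetic point is an interpolation, it stays inside the region spanned by the existing minority samples.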

Table 4.9: Ridor results on binary class with SMOTE.

Antecedent | Consequent | Confidence | Support | Lift
Wiskunde 2 | Wiskunde 1 | 0.699 | 232 | 5.08
Wiskunde 1 | Wiskunde 2 | 0.718 | 232 | 5.08
Inleiding functioneel programmeren | Systeemmodelleren 1 | 0.682 | 227 | 3.565
Wiskunde 2 | Databases 1 | 0.81 | 269 | 3.542
Wiskunde 1 | Databases 1 | 0.724 | 234 | 3.168
Systeemmodelleren 1 | Databases 1 | 0.639 | 287 | 2.795
Operating systems | Compilers | 0.609 | 255 | 2.775
Compilers /\ Programmeren 1 | Programmeren 2 | 0.769 | 250 | 2.334
Wiskunde 2 | Programmeren 1 | 0.798 | 265 | 2.172
Automatentheorie en formele talen /\ Programmeren 2 | Programmeren 1 | 0.763 | 267 | 2.076
Inleiding functioneel programmeren | Programmeren 2 | 0.679 | 226 | 2.059
Systeemmodelleren 1 | Programmeren 2 | 0.675 | 303 | 2.047
Wiskunde 1 | Programmeren 1 | 0.749 | 242 | 2.038
Compilers | Programmeren 2 | 0.67 | 345 | 2.032
Basiswiskunde 3 | Programmeren 2 | 0.654 | 231 | 1.985
Compilers /\ Programmeren 2 | Programmeren 1 | 0.725 | 250 | 1.972
Automatentheorie en formele talen | Programmeren 1 | 0.717 | 420 | 1.95
Automatentheorie en formele talen /\ Programmeren 1 | Programmeren 2 | 0.629 | 266 | 1.908
Implementatie | Programmeren 2 | 0.628 | 245 | 1.906
Operating systems | Programmeren 2 | 0.628 | 263 | 1.904
Databases 1 | Programmeren 1 | 0.657 | 353 | 1.788
Programmeren 2 | Programmeren 1 | 0.65 | 502 | 1.768
Compilers | Programmeren 1 | 0.627 | 323 | 1.706
Systeemmodelleren 1 | Programmeren 1 | 0.619 | 278 | 1.685

Table 4.10: Association rules found in the results of students who were insufficient the first time they tried to pass the course.
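The three numbers listed with each rule read as confidence, support count and lift. This is an inference from the values rather than something the transcript states: rules with swapped sides share the last two numbers, and the first number divided by the last is the same for every rule with the same right-hand side. A minimal sketch of how these measures are computed follows, using hypothetical transactions in which each set holds the courses one student failed at the first attempt.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (confidence, support count, lift) of antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    confidence = n_both / n_a          # P(consequent | antecedent)
    lift = confidence / (n_c / n)      # confidence relative to P(consequent)
    return confidence, n_both, lift

# Hypothetical data, not taken from the thesis.
failed_first_attempt = [
    {"Wiskunde 1", "Wiskunde 2"},
    {"Wiskunde 1", "Wiskunde 2", "Databases 1"},
    {"Wiskunde 1"},
    {"Databases 1"},
    {"Programmeren 1"},
    {"Programmeren 1", "Programmeren 2"},
    set(),
    set(),
]
conf, support, lift = rule_metrics(
    failed_first_attempt, {"Wiskunde 2"}, {"Wiskunde 1"}
)
```

A lift above 1 means the right-hand course is failed more often among students who failed the left-hand courses than among students overall.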

Figure 4.4: Clustergram with study length and courses.

Table 4.11: Courses for which almost all students have a good result.

Table 4.12: Course of the blue rectangle in figure 4.4.

Figure 4.5: JRIP classification rule used for emerging patterns.

Table 4.13: Support per year of the rule of figure 4.5.

Figure 5.1: Process extracted from the first year of all students.

Figure 5.2: Process extracted from the second year of all students.

Figure 5.3: Process extracted from the third year of all students.

Figure 5.4: Process extracted from the fourth year of all students.

Figure 5.5: Process extracted from the fifth year of all students.

Figure 5.6: Process extracted from the sixth year of all students.

Algebra 2 (1.3 and 2.1)

Basiswiskunde 3 (1.3)

Basiswiskunde 2 (1.2)

Algebra 1 (1.2)

Implementatie (1.3 and 2.1)

Basiswiskunde 1 (1.1)

Programmeren 3 (2.1)

Operating systems (2.3)

Compilers (2.2)

Automatentheorie en formele talen (1.3)

Programmeren 2 (1.3)

Programmeren 1 (1.2)

Table 5.1: Result of Frequent Itemset Mining on all courses
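Frequent itemset mining looks for sets of items, here courses, that occur together in at least a given fraction of the records. The sketch below is a brute-force illustration for small inputs, not the algorithm or tool used in the thesis. The student records and the 0.75 threshold are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: relative support} for every itemset of at most
    max_size items whose support is at least min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[candidate] = count / n
    return result

# Hypothetical course sets, one per student.
students = [
    {"Programmeren 1", "Programmeren 2", "Compilers"},
    {"Programmeren 1", "Programmeren 2"},
    {"Programmeren 1", "Algebra 1"},
    {"Programmeren 1", "Programmeren 2", "Algebra 1"},
]
frequent = frequent_itemsets(students, min_support=0.75)
```

Practical miners such as Apriori avoid enumerating every combination by extending only itemsets that are already frequent.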

Node filter:

Significance cutoff: 0.430

Edge filter:

Cutoff: 0.042

Utility rt: 0.582

Node filter:

Significance cutoff: 0.413

Edge filter:

Cutoff: 0.042

Utility rt: 0.583

Node filter:

Significance cutoff: 0.405

Edge filter:

Cutoff: 0.032

Utility rt: 0.217583

Figure 5.7: Fuzzy models of the 3 different study times on the most frequent courses.

Heuristic model of students with a short study time.

Relative to best threshold: 0.05

Positive observations: 10

Dependency threshold: 0.9

Connected: yes

Heuristic model of students with a short study time.

Relative to best threshold: 0.05

Positive observations: 10

Dependency threshold: 0.8

Connected: no

Figure 5.8: Heuristic model of students with a short study time.

Heuristic model of students with a normal study time.

Relative to best threshold: 0.05

Positive observations: 10

Dependency threshold: 0.93

Heuristic model of students with a normal study time.

Relative to best threshold: 0.05

Positive observations: 10

Dependency threshold: 0.93

Figure 5.9: Heuristic model of students with a normal study time.

Heuristic model of students with a long study time.

Relative to best threshold: 0.05

Positive observations: 10

Dependency threshold: 0.9

Connected: yes

Heuristic model of students with a long study time.

Relative to best threshold: 0.05

Positive observations: 10

Dependency threshold: 0.9

Connected: no

Figure 5.10: Heuristic model of students with a long study time
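The dependency threshold listed with the heuristic models applies to the dependency measure of the Heuristics Miner, which for two activities a and b is (|a>b| - |b>a|) / (|a>b| + |b>a| + 1), where |a>b| counts how often b directly follows a. The sketch below computes this measure from a list of traces. The traces are hypothetical, and the other listed parameters (positive observations, relative to best) are not modelled.

```python
from collections import Counter

def dependency_measures(traces):
    """Dependency measure for every ordered pair (a, b), a != b, that
    occurs as a direct succession in the traces."""
    follows = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            follows[(a, b)] += 1
    measures = {}
    for (a, b), ab in follows.items():
        if a == b:
            continue  # length-one loops use a different formula
        ba = follows[(b, a)]  # a Counter returns 0 for unseen pairs
        measures[(a, b)] = (ab - ba) / (ab + ba + 1)
    return measures

# Hypothetical course orders: 19 students follow one order, 1 deviates.
traces = [["Programmeren 1", "Programmeren 2", "Compilers"]] * 19
traces += [["Programmeren 2", "Programmeren 1", "Compilers"]]
measures = dependency_measures(traces)
strong = {pair for pair, m in measures.items() if m >= 0.9}
```

Raising the threshold keeps only the relations that are observed almost exclusively in one direction, which is why a higher setting yields a sparser model.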

Figure 5.11: Petri net of the courses given to students with the start year 2004.

Figure 5.12: Result of conformance checking the students that start in 2004.

Figure 5.13: Sequence mining results for short study time students on the courses of table “Result of Frequent Itemset Mining on all courses”.

Figure 5.14: Sequence mining results for normal study time students on the courses of table “Result of Frequent Itemset Mining on all courses”.

Figure 5.15: Sequence mining results for long study time students on the courses of table “Result of Frequent Itemset Mining on all courses”.

Table A1: Table names of the data with English translation.

Table A2: Fields of table Address with English translation.

Table A3: Fields of table exams with English translation.

Table A4: Fields of table personal details with English translation.

Table A5: Fields of table results with English translation.

Table A6: Fields of table study packages with English translation.

Table A7: Fields of table study package participants with English translation.

Table A8: Fields of table preparatory educations with English translation.

Table A9: Fields of the table preparatory education courses with English translation.

Figure A2: The number of exams per year. (Axes: Year, 1982 to 2008; Amount of exams, 0 to 300.)

Table A10: Different exam assessments with English translation.

Table A11: JRIP statistics after mining on 2std, categorical attributes

Table A12: JRIP statistics after mining on 3std, categorical attributes

Table A13: JRIP statistics after mining on 2std, ordinal attributes

Table A14: JRIP statistics after mining on 3std, ordinal attributes

Table A15: JRIP statistics after cost sensitive mining on 3std, ordinal attributes, v1

Table A16: JRIP statistics after cost sensitive mining on 3std, ordinal attributes, v2

Table A17: Statistics of JRIP results on binary class with SMOTE.

Table A18: Statistics of Ridor results on binary class with SMOTE.

=== Run information ===

Scheme: weka.classifiers.trees.Id3

Relation: test-weka.filters.supervised.instance.SMOTE-C0-K5-P400.0-S1

Instances: 974

Attributes: 5745

[list of attributes omitted]

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Id3

startYear = 1980: null

startYear = 1981: null

startYear = 1982: null

startYear = 1983: short-normal

startYear = 1984: long

startYear = 1985: long

startYear = 1986

| 2L500_firstResult<4 = -: long

| 2L500_firstResult<4 = 0

| | 1B053_firstResult<7 = -: null

| | 1B053_firstResult<7 = 0: long

| | 1B053_firstResult<7 = 1: short-normal

| 2L500_firstResult<4 = 1

| | 0Z060_trialCount>1 = -: long

| | 0Z060_trialCount>1 = 0: short-normal

| | 0Z060_trialCount>1 = 1: long

startYear = 1987

| 5F040_firstResult<7 = -

| | 0K060_trialCount>0 = -: long

| | 0K060_trialCount>0 = 0: null

| | 0K060_trialCount>0 = 1: short-normal

| 5F040_firstResult<7 = 0

| | 2K700_highestResult<8 = -: null

| | 2K700_highestResult<8 = 0: short-normal

| | 2K700_highestResult<8 = 1

| | | 2N010_highestResult<8 = -: null

| | | 2N010_highestResult<8 = 0: short-normal

| | | 2N010_highestResult<8 = 1

| | | | 1A060_trialCount>0 = -: short-normal

| | | | 1A060_trialCount>0 = 0: null

| | | | 1A060_trialCount>0 = 1: long

| 5F040_firstResult<7 = 1

| | 2F550_firstResult<3 = -: null

| | 2F550_firstResult<3 = 0: short-normal

| | 2F550_firstResult<3 = 1: long

startYear = 1988

| 2L711_firstResult<4 = -: short-normal

| 2L711_firstResult<4 = 0

| | 2M240_firstResult<8 = -: null

| | 2M240_firstResult<8 = 0

| | | 2M227_trialCount>1 = -: long

| | | 2M227_trialCount>1 = 0: short-normal

| | | 2M227_trialCount>1 = 1: long

| | 2M240_firstResult<8 = 1: short-normal

| 2L711_firstResult<4 = 1

| | 1B040_trialCount_np = -: long

| | 1B040_trialCount_np = 0: null

| | 1B040_trialCount_np = 1: short-normal

startYear = 1989

| 2L670_firstResult<7 = -: short-normal

| 2L670_firstResult<7 = 0

| | 2L140_trialCount>0 = -: long

| | 2L140_trialCount>0 = 0: null

| | 2L140_trialCount>0 = 1

| | | 0K060_trialCount>2 = -: null

| | | 0K060_trialCount>2 = 0: short-normal

| | | 0K060_trialCount>2 = 1: long

| 2L670_firstResult<7 = 1: short-normal

startYear = 1990

| 2L530_trialCount>0 = -

| | 2R707_firstResult<7 = -: short-normal

| | 2R707_firstResult<7 = 0

| | | 0L800_trialCount>0 = -: short-normal

| | | 0L800_trialCount>0 = 0: null

| | | 0L800_trialCount>0 = 1: long

| | 2R707_firstResult<7 = 1

| | | 2WS13_trialCount>1 = -: long

| | | 2WS13_trialCount>1 = 0: short-normal

| | | 2WS13_trialCount>1 = 1: long

| 2L530_trialCount>0 = 0: null

| 2L530_trialCount>0 = 1: short-normal

startYear = 1991

| 2Y420_trialCount>1 = -: short-normal

| 2Y420_trialCount>1 = 0

| | 2L340_firstResult<8 = -: short-normal

| | 2L340_firstResult<8 = 0: short-normal

| | 2L340_firstResult<8 = 1

| | | 2L500_trialCount>1 = -: null

| | | 2L500_trialCount>1 = 0: long

| | | 2L500_trialCount>1 = 1: short-normal

| 2Y420_trialCount>1 = 1

| | 2L085_firstResult<8 = -: null

| | 2L085_firstResult<8 = 0: short-normal

| | 2L085_firstResult<8 = 1

| | | 2L060_trialCount>2 = -: null

| | | 2L060_trialCount>2 = 0: long

| | | 2L060_trialCount>2 = 1: short-normal

startYear = 1992

| 2L060_trialCount>3 = -: null

| 2L060_trialCount>3 = 0: short-normal

| 2L060_trialCount>3 = 1: long

startYear = 1993

| 1B170_trialCount>6 = -: null

| 1B170_trialCount>6 = 0

| | 1B050_trialCount_np = -: short-normal

| | 1B050_trialCount_np = 0: null

| | 1B050_trialCount_np = 1: long

| 1B170_trialCount>6 = 1: long

startYear = 1994

| 2L711_firstResult<3 = -: short-normal

| 2L711_firstResult<3 = 0

| | 0K060_trialCount>1 = -: null

| | 0K060_trialCount>1 = 0: short-normal

| | 0K060_trialCount>1 = 1: long

| 2L711_firstResult<3 = 1: long

startYear = 1995

| 1J210_firstResult<6 = -: short-normal

| 1J210_firstResult<6 = 0

| | 1Z340_trialCount>2 = -: short-normal

| | 1Z340_trialCount>2 = 0: short-normal

| | 1Z340_trialCount>2 = 1: long

| 1J210_firstResult<6 = 1: long

startYear = 1996

| 2R237_highestResult<8 = -: null

| 2R237_highestResult<8 = 0

| | 2M004_trialCount>0 = -: long

| | 2M004_trialCount>0 = 0: null

| | 2M004_trialCount>0 = 1: short-normal

| 2R237_highestResult<8 = 1

| | 2Y380_highestResult<8 = -: short-normal

| | 2Y380_highestResult<8 = 0

| | | 2IN40_trialCount_np = -: short-normal

| | | 2IN40_trialCount_np = 0: null

| | | 2IN40_trialCount_np = 1: long

| | 2Y380_highestResult<8 = 1

| | | 2R077_firstResult<7 = -

| | | | 1A350_trialCount>1 = -: null

| | | | 1A350_trialCount>1 = 0: long

| | | | 1A350_trialCount>1 = 1: short-normal

| | | 2R077_firstResult<7 = 0

| | | | 1C200_trialCount_np = -: short-normal

| | | | 1C200_trialCount_np = 0: null

| | | | 1C200_trialCount_np = 1: long

| | | 2R077_firstResult<7 = 1: long

startYear = 1997

| 2io60_trialCount>0 = -

| | 1B170_firstResult<10 = -: null

| | 1B170_firstResult<10 = 0: long

| | 1B170_firstResult<10 = 1: short-normal

| 2io60_trialCount>0 = 0: null

| 2io60_trialCount>0 = 1: long

startYear = 1998

| 2IH20_trialCount>2 = -: null

| 2IH20_trialCount>2 = 0

| | 2M204_trialCount>0 = -

| | | 2M090_trialCount>1 = -: short-normal

| | | 2M090_trialCount>1 = 0: long

| | | 2M090_trialCount>1 = 1: short-normal

| | 2M204_trialCount>0 = 0: null

| | 2M204_trialCount>0 = 1: short-normal

| 2IH20_trialCount>2 = 1

| | 2M980_firstResult<7 = -: long

| | 2M980_firstResult<7 = 0

| | | 2M927_firstResult<6 = -: short-normal

| | | 2M927_firstResult<6 = 0: short-normal

| | | 2M927_firstResult<6 = 1: long

| | 2M980_firstResult<7 = 1: long

startYear = 1999

| 1C200_firstResult<5 = -: short-normal

| 1C200_firstResult<5 = 0: short-normal

| 1C200_firstResult<5 = 1

| | 2Y345_firstResult<8 = -: long

| | 2Y345_firstResult<8 = 0: short-normal

| | 2Y345_firstResult<8 = 1

| | | 2F540_trialCount>2 = -: long

| | | 2F540_trialCount>2 = 0: long

| | | 2F540_trialCount>2 = 1: short-normal

startYear = 2000

| 2M227_highestResult<7 = -: null

| 2M227_highestResult<7 = 0: short-normal

| 2M227_highestResult<7 = 1

| | 0L800_trialCount>0 = -: long

| | 0L800_trialCount>0 = 0: null

| | 0L800_trialCount>0 = 1: short-normal

startYear = 2001: short-normal

startYear = 2002: null

startYear = 2003: null

startYear = 2004: null

startYear = 2005: null

startYear = 2006: null

startYear = 2007: null

startYear = 2008: null

startYear = 2009: null

Time taken to build model: 11.38 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         858               88.0903 %
Incorrectly Classified Instances       112               11.499  %
Kappa statistic                          0.769
Mean absolute error                      0.1155
Root mean squared error                  0.3398
Relative absolute error                 23.2354 %
Root relative squared error             68.1697 %
UnClassified Instances                   4                0.4107 %
Total Number of Instances              974

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.87      0.099      0.906     0.87       0.887      0.884     short-normal
                 0.901     0.13       0.863     0.901      0.882      0.884     long
Weighted Avg.    0.885     0.114      0.885     0.885      0.885      0.884

=== Confusion Matrix ===

   a   b   <-- classified as
 441  66 |   a = short-normal
  46 417 |   b = long
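The kappa statistic in the summary can be recomputed from the confusion matrix. The sketch below uses the standard definition of Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement). The function name `accuracy_and_kappa` is hypothetical; the matrix values are those printed above.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of instances of class i classified as class j."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected if predictions were independent of the true class.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Rows: actual class; columns: predicted class (short-normal, long).
matrix = [[441, 66], [46, 417]]
accuracy, kappa = accuracy_and_kappa(matrix)
```

The matrix covers the 970 classified instances, so the accuracy computed here is 858/970, slightly higher than the reported 88.0903 %, which equals 858/974 and therefore includes the 4 unclassified instances in the denominator. The kappa value rounds to the reported 0.769.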

Table A19: Statistics of ID3 results on binary class with SMOTE.
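ID3 builds its tree by splitting, at each node, on the attribute with the highest information gain. The sketch below shows that criterion only, not Weka's `Id3` implementation. The rows are hypothetical, with attribute and class names modelled on those in the tree above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical instances.
rows = [
    {"trialCount>1": "1", "startYear": "1992", "class": "long"},
    {"trialCount>1": "1", "startYear": "1993", "class": "long"},
    {"trialCount>1": "0", "startYear": "1992", "class": "short-normal"},
    {"trialCount>1": "0", "startYear": "1993", "class": "short-normal"},
]
gain_trials = information_gain(rows, "trialCount>1", "class")
gain_year = information_gain(rows, "startYear", "class")
```

In this toy data `trialCount>1` separates the classes perfectly while `startYear` carries no information, so ID3 would split on the former.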

Table A20: Statistics of NaïveBayes results on binary class with SMOTE.