CHAPTER 5: WRITING SAMPLE CORRELATION MINING

Writing Sample Metrics and Page Views Aggregate Analysis

Once the writing sample metrics and categorical percentage values for each student in each course were pre-processed into quartiles and outliers, all of the metrics and categorical ranking data for one writing sample (the one with the highest overall average word count for that course), for all students in all courses, were aggregated into one file.

A correlation test crossing all writing sample values with all page views values

was performed on the 352 records that remained after matching individual writing

samples to page views, and eliminating those records that did not have a corresponding

match. The results of the test are shown below in Table 5.1.
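The matching and correlation steps described above can be sketched as follows. This is only an illustration in Python (the chapter reports using R later on); the `(course_id, student_id)` keys, the dictionary layout, and the toy values are assumptions and are not the study's data.

```python
from math import sqrt


def match_records(writing, page_views):
    """Keep only the keys present in both sources (an inner join)."""
    shared = sorted(set(writing) & set(page_views))
    return [(writing[key], page_views[key]) for key in shared]


def pearson_r(pairs):
    """Pearson's r for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy rankings keyed by (course_id, student_id).
writing_levels = {("c1", "s1"): 1, ("c1", "s2"): 2, ("c1", "s3"): 3, ("c1", "s4"): 4}
session_levels = {("c1", "s1"): 4, ("c1", "s2"): 3, ("c1", "s3"): 2, ("c1", "s5"): 1}

# s4 and s5 have no counterpart in the other source, so they are dropped.
pairs = match_records(writing_levels, session_levels)
r = pearson_r(pairs)
```

Records without a match in both sources are excluded before any coefficient is computed, which is why the aggregate analysis ran on fewer records than were collected.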

With 350 degrees of freedom, a Pearson’s r coefficient of .120 or higher in magnitude was needed to be classified as significant in this study, which corresponds to a p-value under 0.025 (0.050, two-tailed). The strongest correlation was between the percentage of preposition use in students’ writing samples and those same students’ total number of repeat daily sessions over the semester, which was negative and significant, with a Pearson’s r coefficient of -0.173 (p=0.001). Students who used more prepositions thus tended to have fewer repeat sessions in a given day.
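The relationship between a sample size, a significance level, and the smallest r that counts as significant can be approximated with a short Python sketch. The chapter does not state how its thresholds were derived, so the Fisher z-transformation used here is a swapped-in approximation chosen because it needs only the standard library; an exact test would use the t distribution with n - 2 degrees of freedom. The function names are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def approx_p_value(r, n):
    """Approximate two-tailed p-value for Pearson's r (Fisher z approximation)."""
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def approx_critical_r(n, alpha):
    """Approximate smallest |r| that is significant at two-tailed level alpha."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return tanh(z_crit / sqrt(n - 3))


# Values taken from the text: n = 352 matched records, r = -0.173.
p = approx_p_value(-0.173, 352)
r_crit = approx_critical_r(352, 0.025)
```

Because the threshold depends on n, each smaller subset analyzed later in the chapter needs a larger r to reach the same significance level.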

A student’s ranking in word count (WCTlev) among peers submitting the same writing assignment was positively and significantly correlated with that student’s Total Accumulated Minutes ranking among class peers (r=0.170, p=0.001), showing that students who spent more time in the LMS also tended to submit longer writing assignments. A student’s Total Accumulated Minutes ranking also correlated positively and significantly with the student’s ranking in the percentage of usage of words of Six Letters or more in their writing (r=0.139, p=0.009).

The usage of prepositions by students in their writing showed significant negative correlations with the Repeated Sessions per Day calculated value (r=-0.173, p=0.001) and with its closely aligned High Times Out student classification (r=-0.143, p=0.007), which is generated from the RSDLevs value. This suggests that students who log in to the LMS more than once a day, in comparison with their peers, tend to use fewer prepositions in their submitted writing assignments. Similar correlation values, with the same polarity, were produced by comparing the same page views session metrics, Repeated Sessions per Day (r=-0.141, p=0.008) and High Times Out students (r=-0.142, p=0.008), with the relativity word category (for example, “area,” “bend,” “stop,” and “exit”).

Cognitive-Mechanical word usage (for example, words like “cause,” “know,” and “ought”), which generally accounted for the highest percentage of words used over all of the assignments by most students in this study, was positively and significantly correlated with the Average Clicks per Session metric in the page views data (r=0.158, p=0.003), and was the only category that had any level of significance with this metric.

The final writing sample category of interest in this correlational test, the Verbs usage level ranking, was significantly and negatively correlated with the Files page views category (r=-0.170, p=0.001), and significantly and positively correlated with the Grade page views category (r=0.144, p=0.007). As mentioned many times above, the Files category (and other course content areas) often has a polarity opposite that of Grades in its relationship with other metrics.


Table 5.1

Table 5.1: Correlations Over All Data, PV Metrics Against WS Metrics. Yellow

highlights denote significant positive correlations, blue highlights denote significant

negative ones.

The discussion above featured only the most significant findings among these

correlation tests, which would even be significant under a more rigorous threshold

(0.010). Since correlation mining is vulnerable to spurious results due to the repetitive

nature of the calculations, it is important to look for the most significant results, yet to

also consider how those results relate to other results that are statistically significant but

marginally so. In this way, these other statistically significant results can be considered

but only in how they relate to the trends.
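The vulnerability to spurious results noted above can be made concrete with a small Python sketch. The number of tests used here is hypothetical (the correlation table itself is not reproduced in this text), and the second function assumes the tests are independent, which correlated metrics would violate.

```python
def expected_false_positives(n_tests, alpha):
    """Expected count of spurious 'significant' results if no true relationship exists."""
    return n_tests * alpha


def chance_of_any_false_positive(n_tests, alpha):
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Hypothetical grid: 10 writing sample metrics crossed with 20 page views metrics.
n_tests = 10 * 20
spurious = expected_false_positives(n_tests, 0.025)
any_spurious = chance_of_any_false_positive(n_tests, 0.025)
```

Under these assumptions a handful of marginal results would be expected by chance alone, which is the reason for weighting the strongest correlations and reading the marginal ones only in relation to broader trends.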


Some of these trends include:

Word count in writing samples seems to correlate positively with most Page View

sessions metrics, except for Average Clicks per Session, where the correlation is of

lowest significance. Word count appears to have no significant relationship whatsoever

with URL token categories that students visit within the LMS, such as Assignments,

Grades, and Files.

Words per sentence (WPS), which is the average number of words in all of the

sentences in the writing samples, did not approach significance in any of the page views

session or categorical metrics.

The Average Minutes per Click, which is the page views session metric that is

closely aligned with students classified as Fast Clickers, had no writing sample metrics or

categories approaching significance. Neither did the Assignments, Conversations,

Modules, Topic, or Wiki page views categories.

In general, the Repeat Sessions per Day (and corresponding High Times Out)

metrics had a number of weaker though significant relationships with writing sample

metrics and categories. It appears that the High Times Out group reveal their usage

patterns in their writing samples much more than the Fast Clickers group, which only had

a near-significant relationship with Verb usage.

Writing Samples Analysis -- Sample Type Aggregates

An analysis similar to the one immediately above was done with aggregate data by writing sample type, to determine whether there were major differences in the relationships between writing sample metrics and page views metrics between students submitting one type of writing assignment (e.g., a term paper) and another type of assignment (e.g., a short reflection paper). The same correlational analysis was conducted over four different sets of students, as determined by this table:

Table 5.2

Writing Sample Categories

CourseID  Type      WC    Timing  Cats  Class
316813    Paper      780  MY      CE    A
316814    Paper      802  MY      CE    A
314538    Essay      837  EY      oc    A
314092    Short      862  MY      cvre  A
316841    Paper     1705  MY      ECV   B
316747    Position  1732  EY      C     B
316804    Term      1748  EY      EC    B
316805    Term      1774  EY      EC    B
314041    Term      2902  EY      EC    C
316837    Research  3080  EY      EC    C
316838    Research  3120  EY      EC    C
316819    Mult      3715  Mult    C     D
315028    Final     3808  EY      CE    D
314083    Final     4940  EY      CE    D

Table 5.2: Writing Sample Characteristics and Classifications

The “Class” of each writing sample type for each course was primarily determined by the average word count (WC) for the sample, breaking it down into Class A (average WC less than 1000), Class B (average WC between 1000 and 2000), Class C (average WC between 2000 and 3500), and Class D (average WC of 3500 or larger).
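The classification rule above can be written as a short Python function. The function name is hypothetical, and the handling of exact boundary values (1000 and 2000 fall into the higher class) is an assumption, since the text only specifies the 3500 boundary explicitly.

```python
def writing_sample_class(avg_word_count):
    """Map a course's average writing sample word count to Class A-D."""
    if avg_word_count < 1000:
        return "A"
    if avg_word_count < 2000:  # assumed: exactly 1000 counts as Class B
        return "B"
    if avg_word_count < 3500:  # assumed: exactly 2000 counts as Class C
        return "C"
    return "D"  # 3500 or larger, as stated in the text


# Average word counts taken from Table 5.2.
classes = [writing_sample_class(wc) for wc in (780, 862, 1705, 2902, 3715, 4940)]
```

Applied to the word counts in Table 5.2, this rule reproduces the Class column of that table.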

Class A Papers (Average Word Count less than 1000 Words)

The Class A Papers (average Word Count less than 1000 words) aggregated

analysis included 4 of the 14 courses with a total of 106 subjects. A correlational

overview analysis was conducted using the R statistical software package.

With 104 degrees of freedom for this sample, a Pearson’s r correlation of 0.220 or greater in magnitude, whether positive or negative, was considered significant with a p-value under 0.025 (p<.050, two-tailed), which was the threshold chosen for the overall aggregate correlational analysis previously described. Pearson’s r values of greater magnitude in either direction were therefore of higher significance in the findings.

The highest Pearson’s r value in magnitude was yielded through a comparison of student rankings in their use of prepositions in their writing samples (prepsLev) and their rankings in Repeated Sessions per Day (r=-0.322, p=0.001). The prepsLev category, which is the most prominent writing samples category in this group of papers (i.e., assignments with an average word count of less than 1000), was also negatively and significantly correlated with rankings for the number of Course Page Views (r=-0.273, p=0.005), the number of Total Sessions the student accumulated over the semester while engaged in the course (r=-0.273, p=0.005), and High Times Out (r=-0.239, p=0.014).

Though prepsLev had a number of significant relationships (all negative) with the page views session factors, the category had no significant relationships among the page views URL categories. Two other categories did, however. Students’ use of verbs in their submitted writing (verbLev) related negatively and significantly to students’ visits to the course Modules (r=-0.252, p=0.009), and students’ use of relativity language (e.g., “area,” “bend,” “exit,” and “stop”) in their submitted writing (relLev) related negatively and significantly to students’ viewing (View) of course content (r=-0.234, p=0.016). The relLev category was also negatively and significantly correlated with the RSDLevs metric (r=-0.240, p=0.013).

The students’ use of cognitive-mechanical words (e.g., “cause,” “know,” “ought,” and “think”), represented within the cogLev category, was positively (one of the few correlations that were not negative) and significantly correlated with the Total Page Views session metric (r=0.234, p=0.016).

This class of paper (A) has the only significant interactions with the Writing Sample metric for the percentage of words of six letters or more (SIXlev), with six such significant correlations (and 2 additional nearly significant ones). This metric correlates significantly and positively with page views session metrics (Total Accumulated Minutes and Average Minutes per Click), and it shows a significant positive correlation with students classified as High Times Out (r=0.220, p=0.023) and a significant negative correlation with students classified as Fast Clickers (r=-0.320, p=0.001). Thus SIXlev, together with prepsLev, could possibly be used as a chief predictor for High Times Out students, and as the only predictor in this classification of papers for Fast Clickers students.


Table 5.3

Table 5.3: Correlations Over All Metrics for Class A Writing Samples. Yellow

highlights denote significant positive correlations, blue highlights denote significant

negative ones.

Class B Papers (Average Word Count between 1000 and 2000 Words)

The Class B Papers (average Word Count between 1000 and 2000 words)

aggregated analysis also included 4 of the 14 courses with a total of 128 subjects. A

correlational overview analysis was conducted using the R statistical software package.

With 126 degrees of freedom for this sample, a Pearson’s r correlation of 0.199 or greater in magnitude, whether positive or negative, was considered significant with a p-value under 0.025 (p<.050, two-tailed), which was the threshold chosen for the overall aggregate correlational analysis previously described. Pearson’s r values of greater magnitude in either direction were therefore of higher significance in the findings.

The highest Pearson’s r value was yielded through a comparison of student

rankings in their use of pronouns in their writing samples (pronLev) and their rankings in

the Total Accumulated Minutes (r=-0.267, p=0.002). The pronLev category also joined

the relLev category (which represents students’ use of relativity words, with examples

provided in the classification above) in a significant and negative correlation with the

total number of Course Page Views over the semester.

The relativity word usage, which is the most prominent writing samples category

in this group of papers (i.e., assignments with an average word count of greater than 1000

and less than 2000) was also negatively and significantly correlated with rankings for six

other page views session factors, including the High Times Out students (r=-0.223,

p=0.011). As such, it could therefore be used as a chief predictor of High Times Out

students in this classification of writing assignments.

Students who visit the Files area in those courses requiring this classification of

writing assignments also tend to write the longer papers in terms of word count, tend to

use more words per sentence on average than their peers in the same course, and tend to

use a higher percentage of articles in their writing. They also tend to use a lower

percentage of verbs in their writing.

The word count metric is also significantly and positively correlated with the

Average Minutes per Click page views session metric (r=0.199, p=0.024), indicating that


students who spend more time between clicks in the LMS also tend to write papers with

higher word counts for this classification of assignment.

Table 5.4

Table 5.4: Correlations Over All Metrics for Class B Writing Samples. Yellow

highlights denote significant positive correlations, blue highlights denote significant

negative ones.

Class C Papers (Average Word Count between 2000 and 3500 Words)

The Class C Papers (average Word Count between 2000 and 3500 words)

aggregated analysis included 3 of the 14 courses with a total of 59 subjects. A

correlational overview analysis was conducted using the R statistical software package.

With 57 degrees of freedom for this sample, a Pearson’s r correlation of 0.300 or greater in magnitude, whether positive or negative, was considered significant with a p-value under 0.025 (p<.050, two-tailed), which was the threshold chosen for the overall aggregate correlational analysis previously described. Pearson’s r values of greater magnitude in either direction were therefore of higher significance in the findings.

A comparison of student rankings in their use of prepositions in their writing

samples (prepsLev) and their rankings in the Repeated Sessions Days (r-value of -0.339,

p=0.009) gave the highest Pearson’s r value. The prepsLev category also significantly

and negatively correlated with the High Times Out student classification (r=-0.315,

p=0.015).

No other significant correlations were discovered in Class C writing assignments (greater than 2000 words and less than 3500 words).


Table 5.5

Table 5.5: Correlations Over All Metrics for Class C Writing Samples. Blue highlights

denote significant negative correlations.

Class D Papers (Average Word Count greater than 3500 Words)

The Class D Papers (average Word Count greater than 3500 words) aggregated

analysis also included 3 of the 14 courses with a total of 59 subjects. A correlational

overview analysis was conducted using the R statistical software package.

With 57 degrees of freedom for this sample, a Pearson’s r correlation of 0.297 or greater in magnitude, whether positive or negative, was considered significant with a p-value under 0.025 (p<.050, two-tailed), which was the threshold chosen for the overall aggregate correlational analysis previously described. Pearson’s r values of greater magnitude in either direction were therefore of higher significance in the findings.

The highest Pearson’s r value was yielded through a comparison of students’

rankings in their word count of their writing samples (WCTlev) and their rankings in the

Total Sessions (r-value of 0.416, p=0.001). The WCTlev category also correlated

significantly and positively with the Total Accumulated Minutes session metric

(r=0.297,p=0.023).

The WPSlev metric correlated significantly and positively with the Total Sessions metric (r=0.303, p=0.020) and with the Both classification of students (i.e., those students who are classified as both Fast Clickers and High Times Out) (r=0.343, p=0.008), and significantly and negatively with the Files page views URL category (r=-0.361, p=0.005).

The other significant correlation among Class D writing assignments is the

positive correlation between the relLev writing sample category and the Grades page

views category (r=0.325, p=0.012).

The correlations in this classification of writing samples were almost entirely positive (with one exception). The other classifications of writing samples gave mostly negative correlations (Classes B and C in particular) or a mix of positive and negative correlations (as in the case of Class A). This pattern perhaps illustrates that papers with longer word count requirements also tend to reveal positive relationships with page view metrics within an LMS, as if the longevity of an assignment (whether or not it can be substantially completed in one sitting) has a direct relationship with the course itself (which cannot be substantially completed in one sitting).


Table 5.6

Table 5.6: Correlations Over All Metrics for Class D Writing Samples. Yellow

highlights denote significant positive correlations, blue highlights denote significant

negative ones.

Binning for Fast Clicker and High Times Out Students

All Binned Students

An analysis similar to the one immediately above was done with aggregate data by student LMS usage patterns that were identified in the literature: Fast Clickers (students whose LMS log usage patterns reflect fast clicking through course pages and resources) and High Times Out (students whose LMS log usage patterns reflect a higher number of repeated daily sessions in the course LMS pages and resources, analogous to a “time-out and return that same day to log back in and resume” situation). This was done to determine whether there were major differences in the relationships between writing sample metrics and page views metrics among students classified in the two categories outlined above. The same correlational analysis was conducted over three different sets of students, as determined by this table:

Table 5.7

Table 5.7: Number and Percentages of Binned Students by Category

In general, a subset of students was binned as Fast Clickers (FC), High Times Out

(HTO), or Both (FC and HTO). The number of records binned as Fast Clickers was 73,

or 20.7% of all students in the sample. The number of records binned as High Times Out

was 63, or 17.9% of all students in the sample. These values are congruent with the

classification scheme, which was to classify Fast Clicker students as being a low outlier

or in the lowest quartile of the Average Minutes per Click session metric, and High

Times Out students being a high outlier or in the highest quartile of the Repeated

Sessions Days session metric. Furthermore, the number of records binned as Both (Fast

Clickers and High Times Out, not additionally represented by those categories alone) was

20, or 5.7% of all students in the sample. In sum, the students in all three classifications

total 156, or 44% of the entire sample.
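The binning rule described above can be sketched in Python as follows. The chapter does not state its outlier rule, so Tukey's 1.5 x IQR fences are an assumption here, as are the level labels, the function names, the inclusive handling of values that sit exactly on a quartile boundary, and the toy peer values.

```python
from statistics import quantiles


def level(value, peers):
    """Label a value as a quartile or an outlier relative to its peers."""
    q1, q2, q3 = quantiles(peers, n=4)
    iqr = q3 - q1
    if value < q1 - 1.5 * iqr:  # assumed outlier rule (Tukey fences)
        return "low_outlier"
    if value > q3 + 1.5 * iqr:
        return "high_outlier"
    if value <= q1:
        return "q1"
    if value <= q2:
        return "q2"
    if value <= q3:
        return "q3"
    return "q4"


def bin_student(minutes_per_click, repeat_sessions, peer_mpc, peer_rsd):
    """Return 'FC', 'HTO', 'Both', or None for an unbinned student."""
    fast = level(minutes_per_click, peer_mpc) in ("low_outlier", "q1")
    high_times_out = level(repeat_sessions, peer_rsd) in ("high_outlier", "q4")
    if fast and high_times_out:
        return "Both"
    if fast:
        return "FC"
    if high_times_out:
        return "HTO"
    return None


# Hypothetical peer values for one course.
peer_mpc = [1, 2, 3, 4, 5, 6, 7, 8]
peer_rsd = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [bin_student(m, s, peer_mpc, peer_rsd) for m, s in ((1, 8), (2, 3), (5, 7), (4, 4))]
```

Each student receives at most one label, which matches the way the Both students are counted separately from the Fast Clickers and High Times Out students in the totals above.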


A correlation test crossing all writing sample values with all page view values

was performed on the 156 binned records that were classified and identified as described

above. The results of the test are shown below in Table 5.8.

With 154 degrees of freedom, a Pearson’s r coefficient of 0.180 or higher in magnitude was needed to be classified as significant in this study, which corresponds to a p-value under 0.0250 (p<0.050 two-tailed). The strongest correlation was between the percentage of relativity word usage in students’ writing samples and those same students’ total number of page views visits to View content areas (r=-0.252, p=0.002). In general, the percentage of URL logs in the View category was also significantly and negatively correlated with students’ percentage of pronoun use (r=-0.194, p=0.015).

For this classification of students, cognitive word usage was also significantly and negatively correlated with the Total Sessions (r=-0.188, p=0.019), Total Accumulated Minutes (r=-0.189, p=0.018), and Average Minutes per Click (r=-0.226, p=0.005) session metrics. The percentage of usage of pronouns in writing assignments by this classification of students also correlated significantly and negatively with the Total Page Views (r=-0.182, p=0.023), Course Page Views (r=-0.226, p=0.005), Total Accumulated Minutes (r=-0.208, p=0.009), and Average Minutes per Session (r=-0.202, p=0.012) session metrics, and with the View category (r=-0.194, p=0.015).

The only significant positive correlation among the word categories for this classification of students was between the usage of prepositions and visits to Assignment pages (r=0.183, p=0.022). There were also 5 significant positive correlations between the percentage use of words of six letters or more and the page views session metrics of Course Page Views (r=0.180, p=0.024), Total Sessions (r=0.181, p=0.024), Total Accumulated Minutes (r=0.222, p=0.005), and Average Minutes per Click (r=0.183, p=0.022), and the View category (r=0.204, p=0.011). Only one significant positive correlation was produced in the word count metric, with the Total Accumulated Minutes page views session metric (r=0.217, p=0.006).

The significant correlation values produced within the percent usage of dictionary

(DIClev) and functional words (FUNlev) are enigmas because they are highly sensitive to

misspellings and typographical errors (e.g., a space mistakenly inserted in the middle of a

word that would be recognized as a dictionary and functional word, making it into two

non-recognizable words), yet FUNlev was one of the most dynamic metrics among this

set of correlations. Therefore, they are ignored in this discussion, but retained in the

generation of the decision trees, since they both played prominent roles in some of the

trees, even the pruned ones. The significance of these two categories would be an

excellent vector for further study in this area.

From this test a number of possibilities arise which may ultimately lead to deeper

exploration into LMS usage patterns under Fast Clicker and High Times Out and

corresponding writing sample characteristics. These possibilities will be explored in

detail below.

However, before that data modeling took place, the correlations were examined

within each category, among the Fast Clicker students, the High Times Out students, and

those few students who were classified as being both.


Table 5.8

Table 5.8: Correlation Table for All Binned Students. Yellow highlights denote

significant positive correlations, blue highlights denote significant negative ones.

Binned Students by Classification

Students binned as Fast Clickers, High Times Out, and Both, were separated

according to these classifications and tested more specifically according to their inclusion

in these special groups. Though the students were counted (according to Table 5.7)

uniquely as being in only one group (the Both students being in neither of the other

groups), it was beneficial in this analytic phase to allow the Both students to join the Fast

Clicker group for Fast Clicker analysis, and the High Times Out group for that analysis as

well, since they were classified as such. Therefore, the degrees of freedom will be

necessarily higher than one might ascertain from viewing Table 5.7.
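The effect of letting the Both students join each single-pattern group can be checked with a few lines of Python. The counts are those reported for the mutually exclusive bins; the function names are hypothetical, and the degrees of freedom follow the usual n - 2 rule for a Pearson correlation.

```python
# Counts reported for the mutually exclusive bins.
exclusive_counts = {"FC": 73, "HTO": 63, "Both": 20}


def analysis_group_size(group, counts):
    """Size of an analysis group once the Both students are added to it."""
    if group == "Both":
        return counts["Both"]
    return counts[group] + counts["Both"]


def degrees_of_freedom(n):
    """Degrees of freedom for a Pearson correlation on n paired records."""
    return n - 2


sizes = {group: analysis_group_size(group, exclusive_counts) for group in exclusive_counts}
dfs = {group: degrees_of_freedom(n) for group, n in sizes.items()}
```

The Fast Clickers and High Times Out analyses therefore each include the 20 Both students, so their group sizes sum to more than the 156 binned records.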

Fast Clickers

A correlation test crossing all writing sample values with all page view values

was performed on the 93 binned student records that were classified as either Fast Clicker

or Both. The results of the test are shown below in Table 5.9.


With 91 degrees of freedom, a Pearson’s r coefficient of 0.232 or higher in magnitude was needed to be classified as significant in this study, which corresponds to a p-value under 0.0250 (p<0.050 two-tailed). The strongest correlations were between the percentage of functional words (FUNlev) in students’ writing samples and those same students’ Average Clicks per Session (r=-0.324, p=0.002) and Average Minutes per Session (r=-0.324, p=0.002) page views metrics. The FUNlev metric was also significantly and negatively correlated with the total Course Page Views (r=-0.314, p=0.002), Total Accumulated Minutes (r=-0.292, p=0.005), and the View category (r=-0.284, p=0.006).

Among the Fast Clickers group, the only significant positive correlations were in

the word count (WCTLev) against the Total Accumulated Minutes (r=0.266, p=0.010)

and the percentage of six-letter (or greater) words against the students’ page views visits

to the View category (r=0.301, p=0.003).

Course Page Views metrics correlated significantly and negatively with the Fast Clicker students’ use of pronouns (r=-0.293, p=0.004) and prepositions (r=-0.254, p=0.014). Pronoun levels also correlated significantly and negatively with the Average Clicks per Session (r=-0.260, p=0.012), while use of prepositions and use of relativity words were similarly significantly and negatively correlated with the Total Sessions and Total Accumulated Minutes values. Relativity word use (relLev) was also significantly and negatively correlated with a student’s page views View visits in the LMS.


Table 5.9

Table 5.9: Correlation Table for Fast Clicker (FC) Binned Students. Yellow highlights

denote significant positive correlations, blue highlights denote significant negative ones.

High Times Out

A correlation test crossing all writing sample values with all page view values

was performed on the 83 binned student records that were classified and identified as

having High Times Out (HTO), as described above. The results of the test are shown

below in Table 5.10.

With 81 degrees of freedom, a Pearson’s r coefficient of 0.245 or higher in magnitude was needed to be classified as significant in this study, which corresponds to a p-value under 0.0250 (p<0.050 two-tailed). When broken down in this way, only two significant correlations were produced from the test: a significant and negative correlation between the HTO students’ use of verbs in writing class assignments and visiting Files in page views (r=-0.272, p=0.013), and a significant positive correlation between the HTO students’ use of cognitive words in writing samples and their total Course Page Views for the semester (r=0.253, p=0.021).

Unlike with the Fast Clickers group, only a small number of significant correlations were produced when isolating the High Times Out group from the Fast Clickers. Great care must be taken when building predictive models for this group, as its members may be difficult to tease out from other students.

Table 5.10

Table 5.10: Correlation Table for High Times Out (HTO) Binned Students. Yellow

highlights denote significant positive correlations, blue highlights denote significant

negative ones.

Both (Fast Clickers and High Times Out)

A correlation test crossing all writing sample values with all page view values

was performed on the 20 binned student records that were classified and identified as

both Fast Clicker (FC) and High Times Out (HTO), as described above. The results of

the test are shown below in Table 5.11.

With 18 degrees of freedom, a Pearson’s r coefficient of 0.500 or higher in magnitude was needed to be classified as significant in this study, which corresponds to a p-value under 0.0250 (p<0.050 two-tailed). The only significant correlation that was produced was a negative correlation between the percentage of dictionary words used by the FC and HTO students and the Average Minutes per Session (r=-0.510, p=0.022). The scarcity of significant results is not surprising, given the low n used for this particular test.

Table 5.11

Table 5.11: Correlation Table for Both (FC and HTO) Binned Students. Blue highlights

denote significant negative correlations.