LDA for Educational Data
7/27/2019 LDA for Educational Data
1/18
Classification of Schools by Academic Achievement Measures
Kyle N. Payne
Group 3
Stat 448 Final Project
Kyle N. Payne
INTRODUCTION
In many applications, it makes logical and practical sense to dichotomize continuous variables. In educational policy, we can practically describe academic performance in terms of high academic achievement and low academic achievement. While it is reasonable to assume that dichotomizing continuous variables causes a considerable loss of information (Cohen, 1983), we can also reflect on the considerable ease of interpreting a dichotomy, and on how this could help lawmakers, policy specialists, etc. in the development of suitable educational policy. From an applied perspective, it is also logical to investigate the extent to which demographic variables predict the classification of schools in terms of academic achievement, and that is the subject of the following analysis.

The dataset under study consists of math and reading scores from standardized tests administered annually to 3rd and 5th graders in the state of Illinois, as well as several demographic and economic variables. The standardized test in question, the Illinois Standard Achievement Test (ISAT), is intended to assess individual student achievement relative to the Illinois Learning Standards. The dataset contains data for cohorts of students measured at both 3rd and 5th grade from 1999 to 2011. Measurements are at the school level, with averages taken across students. The entire dataset consists of 69466 observations across 109 variables, of which 10 were created over the course of the analysis. These created variables consist of coding variables and averages of other variables across similar groups (such as 3rd and 5th grade). The cohort 1 data (training set) consists of 1783 observations across 109 variables, as does the cohort 2 data (test set). The data were compiled by faculty and staff at the University of Illinois department of Labor and Employment Relations. Note that some analyses are placed in the appendix for ease of reading.
METHODS
For my analysis, I chose to use a quadratic discriminant function analysis to model the class membership of elementary schools in Illinois in two classes: schools that obtain High Academic Achievement (HAA) and those that obtain Low Academic Achievement (LAA). The criterion for each class is decided in advance: for cohort 1, a school is coded 0 for LAA or 1 for HAA according to whether the proportion of students that exceeded expectations in ISAT scores (averaged across math and reading and across grades for each school) is below or above 15%, respectively. The scales for each grade and test subject were equal, which allowed for easy averaging across grades 3, 4, and 5 for each school, as well as across the two test types. The test scores are standardized, meaning that all schools are assessed in the same manner, so that the test scores are relative to an Illinois state standard. The discriminant analysis was performed using the SAS 9.2 and SAS 9.3 platforms with the stepdisc and discrim procedures.

I considered cohort 1 as the training set, and used a stepwise model selection procedure to select the appropriate model out of a space of possible models. The predictors considered are general demographic variables of interest, including the average number of low-income students per school, the student-teacher ratio, etc. For fitting the discriminant function, the variable on which the classification depends is academ_achieve, the proportion of students that exceed expectations on the ISAT averaged across math and reading and across grades 3 and 5. The coding variable AA is of the form

$$AA = \begin{cases} 0, & \text{academ\_achieve} < 0.15 \\ 1, & \text{academ\_achieve} \geq 0.15 \end{cases}$$

This is a measure of the average school-wise score on the ISAT. While each class is not multivariate normally distributed, the quadratic discriminant function is relatively robust to non-normality. However, to address the performance of the discriminant analysis relative to other methods, I have also used a logistic regression to model the probability of schools being assigned to the two classifications. This secondary analysis was done using the SAS 9.2 platform with the logistic procedure.
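The coding rule for AA described above can be sketched in a few lines of Python. This is only an illustration, not the SAS code used in the analysis: the function name `code_aa` and the example proportions are hypothetical, and assigning a school that sits exactly at 0.15 to the HAA class is an assumption.

```python
def code_aa(academ_achieve, cutoff=0.15):
    """Return 1 (HAA) when the proportion reaches the cutoff, else 0 (LAA).

    Treating a value exactly equal to the cutoff as HAA is an assumption.
    """
    if not 0.0 <= academ_achieve <= 1.0:
        raise ValueError("academ_achieve must be a proportion in [0, 1]")
    return 1 if academ_achieve >= cutoff else 0


# Hypothetical school-level proportions, for illustration only.
example_proportions = [0.08, 0.15, 0.31]
example_codes = [code_aa(p) for p in example_proportions]
print(example_codes)  # [0, 1, 1]
```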
RESULTS
Section 1
The stepdisc procedure was initially utilized for the following predictors:
avg_stud_lowincome - The average number of low-income students per school
chronic_truant_rate - The average proportion of chronic truancy per school
avg_dist_tch_salary - The average teacher salary per district
avg_perc_dist_tch_badegree - The average percent of teachers with bachelor's degrees per district
avg_perc_dist_tch_madegree - The average percent of teachers with master's degrees per district
bamaxpay_sched - The bachelor's degree maximum pay schedule per school
mamaxpay_shed - The master's degree maximum pay schedule per school
The procedure was carried out with a .05 selection level and a .05 significance level. Table 1.1 below demonstrates the first part of the analysis, in which the predictors are entered into the model based upon their significance.
Statistics for Entry, DF = 1, 1708
Variable             R-Square   F Value   Pr > F   Tolerance
avg_stud_lowincome   0.5345     1961.05
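The R-Square and F Value reported for entry are two views of the same quantity. As a hedged sketch (the function names are hypothetical, and the usual ANOVA identity between a partial R-Square and its F statistic is assumed to be what the procedure reports), the reported pair can be checked against each other:

```python
def partial_r_square(f_value, den_df, num_df=1):
    """Partial R-Square implied by an F statistic with (num_df, den_df) df."""
    return (num_df * f_value) / (num_df * f_value + den_df)


def f_from_r_square(r_square, den_df, num_df=1):
    """F statistic implied by a partial R-Square with (num_df, den_df) df."""
    return (r_square / num_df) / ((1.0 - r_square) / den_df)


# Values reported for avg_stud_lowincome at entry, DF = 1, 1708.
r2_entry = partial_r_square(1961.05, 1708)
f_entry = f_from_r_square(0.5345, 1708)
print(round(r2_entry, 4))  # agrees with the reported 0.5345
print(round(f_entry, 1))   # close to the reported 1961.05; R-Square is rounded
```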
Statistics for Entry, DF = 1, 1707
Variable              Partial R-Square   F Value   Pr > F   Tolerance
chronic_truant_rate   0.0020             3.41      0.0652   0.7956
avg_dist_tch_salary   0.0142             24.60
F   Wilks' Lambda   Pr
Class Level Information
AA   Variable Name   Frequency   Weight     Proportion   Prior Probability
0    _0              842         842.0000   0.472238     0.500000
1    _1              941         941.0000   0.527762     0.500000
Table 1.5
The two classes split the data nearly 50/50, with roughly 47% of the schools in the LAA category and 53% in the HAA category (Table 1.5). As seen in Table 1.7, the overall classification error rate is 0.1611 (16.11%), which consists of a 0.2138 misclassification rate for the LAA class and a 0.1084 misclassification rate for the HAA class.
Number of Observations and Percent Classified into AA
From AA    LAA      HAA      Total
LAA        662      180      842
           78.62    21.38    100.00
HAA        102      839      941
           10.84    89.16    100.00
Total      764      1019     1783
           42.85    57.15    100.00
Priors     0.5      0.5
Table 1.6
Error Count Estimates for AA
          LAA      HAA      Total
Rate      0.2138   0.1084   0.1611
Priors    0.5000   0.5000
Table 1.7
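The error rates in Table 1.7 follow directly from the counts in Table 1.6. A minimal sketch, assuming the total rate is the prior-weighted average of the per-class rates (the dictionary layout and function names are illustrative, not part of the SAS output):

```python
# Rows are the observed class, columns the class assigned by the model (Table 1.6).
counts = {
    "LAA": {"LAA": 662, "HAA": 180},
    "HAA": {"LAA": 102, "HAA": 839},
}
priors = {"LAA": 0.5, "HAA": 0.5}


def class_error_rates(table):
    """Share of each observed class that was assigned to the other class."""
    rates = {}
    for true_class, row in table.items():
        total = sum(row.values())
        rates[true_class] = (total - row[true_class]) / total
    return rates


def overall_error_rate(rates, priors):
    """Prior-weighted average of the per-class error rates."""
    return sum(priors[c] * rates[c] for c in rates)


rates = class_error_rates(counts)
total_rate = overall_error_rate(rates, priors)
# After rounding these agree with the 0.2138, 0.1084, and 0.1611 of Table 1.7.
print({c: round(r, 4) for c, r in rates.items()}, round(total_rate, 4))
```

With proportional priors the same weighting explains why the total rate moves only slightly: the weights change from 0.5 and 0.5 to 0.4722 and 0.5278.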
Refitting the model with proportional priors, I again found non-homogeneous variance between the two groups, and therefore the quadratic discriminant function analysis was used, as seen in Table 1.8. The MANOVA results are similar to those of the analysis with non-proportional priors (Table 1.9).
Chi-Square   DF   Pr > ChiSq
177.132299   1

Statistic       Value        F Value   Num DF   Den DF
Wilks' Lambda   0.47971133   1931.65   1        1781
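With only two groups, Wilks' Lambda converts exactly to an F statistic, which gives a quick consistency check on the reported values. The sketch below assumes that standard two-group relationship; the function name is hypothetical.

```python
def f_from_wilks_two_groups(wilks_lambda, num_df, den_df):
    """Exact F for Wilks' Lambda when there are two groups.

    num_df is the number of predictors; den_df is the error degrees of freedom.
    """
    return ((1.0 - wilks_lambda) / wilks_lambda) * (den_df / num_df)


# One-predictor model reported above: Lambda = 0.47971133 with 1 and 1781 df.
f_one_predictor = f_from_wilks_two_groups(0.47971133, 1, 1781)
# Three-predictor model in the appendix: Lambda = 0.456281 with 3 and 1706 df.
f_three_predictors = f_from_wilks_two_groups(0.456281, 3, 1706)
print(round(f_one_predictor, 2), round(f_three_predictors, 2))
```

Both results land on the reported F values (1931.65 and 677.64) up to rounding.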
Number of Observations and Percent Classified into AA
From AA    LAA       HAA       Total
LAA        652       190       842
           77.43     22.57     100.00
HAA        99        842       941
           10.52     89.48     100.00
Total      751       1032      1783
           42.12     57.88     100.00
Priors     0.47224   0.52776
Table 1.10
Error Count Estimates for AA
          LAA      HAA      Total
Rate      0.2257   0.1052   0.1621
Priors    0.4722   0.5278
Table 1.11
The cross-validated error rate estimates (Table 1.12) are slightly higher than the resubstitution rates, which are typically less accurate because they are optimistically biased.
Cross Validated Error Count Estimates for AA
          LAA      HAA      Total
Rate      0.2257   0.1063   0.1626
Priors    0.4722   0.5278
Table 1.12
Because the purpose of the discriminant analysis is to be able to use the training set data to classify future data, I viewed the cohort 1 data as a training set and used the cohort 2 data as a test set. While neither dataset is completely randomly sampled, we can view cohort 2 as a test set for classification under the assumption that there is no systematic (non-stochastic) difference between the cohorts in the number of low-income students or in ISAT test scores. Therefore, using the cohort 1 data as the training set with proportional priors, the result of the classification of cohort 2 is shown in Table 1.13 below. We can see that a larger proportion of cohort 2 is classified into the HAA class (56.88%) than the proportion of cohort 1 schools observed in the HAA class (52.78%).
Number of Observations and Percent Classified into AA
           LAA       HAA       Total
Total      762       1005      1767
           43.12     56.88     100.00
Priors     0.47224   0.52776
Table 1.13
Because the discriminant analysis is univariate, we can also view the classification visually. Figure 1.1 shows the predicted probability of being classified into the HAA group as a function of the average number of low-income students per school. Blue represents the HAA class, and red represents the LAA class.
Figure 1.1
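To make the shape of such a curve concrete, a univariate quadratic discriminant rule can be sketched from class means, standard deviations, and priors. This is a sketch under stated assumptions rather than a reproduction of the SAS fit: it assumes normal class-conditional densities (which the data violate), takes the moments from Appendix A1 and the proportional priors from Table 1.10, and uses hypothetical function names and input values.

```python
import math

# Class moments from Appendix A1 and proportional priors from Table 1.10.
CLASSES = {
    "LAA": {"mean": 205.1981, "sd": 84.2103863, "prior": 0.47224},
    "HAA": {"mean": 59.7202976, "sd": 53.6670837, "prior": 0.52776},
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_haa(avg_stud_lowincome):
    """Posterior probability of HAA under class-specific normal densities."""
    weighted = {
        name: p["prior"] * normal_pdf(avg_stud_lowincome, p["mean"], p["sd"])
        for name, p in CLASSES.items()
    }
    return weighted["HAA"] / sum(weighted.values())


def classify(avg_stud_lowincome):
    """Assign the class with the larger posterior probability."""
    return "HAA" if posterior_haa(avg_stud_lowincome) >= 0.5 else "LAA"


# Hypothetical values of avg_stud_lowincome, for illustration only.
for x in (25.0, 100.0, 250.0):
    print(x, round(posterior_haa(x), 3), classify(x))
```

Because each class keeps its own variance, the log posterior odds are quadratic in the predictor, which is what distinguishes this rule from a linear discriminant.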
Reviewing the assumptions for quadratic discriminant analysis, it is clear that there are several violations in this particular analysis. The distributions of the average number of low-income students for the LAA and HAA classes are both highly non-normal (Figure 1.2), which is a consequence of splitting the data into the two classes. However, I proceeded in the face of this because not all violations of assumptions are equally detrimental: while some make an analysis completely invalid, others only affect the precision and accuracy of the analysis to a degree. The robustness of LDA and QDA to violations of normality has been investigated in (Sever, Lajovic & Rajer, 2005). Their results indicate that the largest effect of non-normality on the discriminant analysis is the increased bias of error count estimates. Skewness in the distribution appears to have little to no effect on the discriminant analysis using LDA or QDA.
Figure 1.2
Section 2
Because the classification scheme under study involves classifying data into dichotomous classes, I also used logistic regression to model the log odds of a school being classified into either of the AA classes as a function of the average number of low-income students per school. Logistic regression is competitive with discriminant analysis for classification because of its relatively small set of assumptions, and thus the non-normality of the classes is not a violation. The generalized logit link function was utilized, as suggested in (Der & Everitt, 2002), due to the ordinal nature of the scale of the response. The test of the global null hypothesis (Table 2.1) and the MLE estimates (Table 2.2) are all significant. The asymptotic Wald Chi-Square value should be precise due to the large sample size.
Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   1134.8846    1
Figure 2.2
Because the analysis is univariate, we can also view the logistic regression in terms of the effect of the average number of low-income students on the probability of a school being classified as an HAA school. Figure 2.3 shows the predicted probability of a school being classified into the HAA class by the average number of low-income students per school.
Figure 2.3
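A curve of this kind is the inverse logit of a linear predictor. The sketch below shows only the mechanics: the intercept and slope are HYPOTHETICAL placeholders, not the fitted estimates from the logistic procedure, and the function name is an assumption for illustration.

```python
import math


def predicted_prob_haa(avg_stud_lowincome, intercept, slope):
    """Inverse logit of a one-predictor linear model for the log odds of HAA."""
    log_odds = intercept + slope * avg_stud_lowincome
    return 1.0 / (1.0 + math.exp(-log_odds))


# HYPOTHETICAL coefficients chosen only to illustrate a decreasing curve;
# they are not the maximum likelihood estimates from the analysis.
HYP_INTERCEPT = 3.0
HYP_SLOPE = -0.025

for x in (0.0, 120.0, 300.0):
    print(x, round(predicted_prob_haa(x, HYP_INTERCEPT, HYP_SLOPE), 3))
```

A negative slope means the predicted probability of HAA falls as the number of low-income students rises, and the probability crosses 0.5 where the log odds equal zero, at minus the intercept divided by the slope.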
We can also view measures of the association of predicted probabilities and the observed response. The percent concordant is the percentage of pairs of schools with different observed classes in which the school observed in the HAA class has the higher predicted probability of being an HAA school. The c-c measure is an adjustment on the ROC c measure. It ranges from 0.5 to 1, where 0.5 reflects a model randomly predicting the response, and 1 reflects a model perfectly classifying the response (Table 2.4). It appears that the classification is relatively accurate.
Association of Predicted Probabilities and Observed Responses
Percent Concordant   90.8     Somers' D   0.818
Percent Discordant   9.1      Gamma       0.819
Percent Tied         0.1      Tau-a       0.408
Pairs                792322   c-c         0.909
Table 2.4
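The rank measures in Table 2.4 can be approximately recovered from the concordant, discordant, and tied percentages. The sketch below uses the usual definitions of these statistics as an assumption about how they were computed; because the percentages are rounded to one decimal, the results match the table only to about two decimal places.

```python
# Class sizes from the training set; every LAA school is paired with every HAA school.
n_laa, n_haa = 842, 941
pairs = n_laa * n_haa
n_total = n_laa + n_haa

# Rounded shares from Table 2.4, converted to proportions of all pairs.
concordant, discordant, tied = 0.908, 0.091, 0.001

somers_d = concordant - discordant
gamma = (concordant - discordant) / (concordant + discordant)
c_stat = concordant + 0.5 * tied
tau_a = (concordant - discordant) * pairs / (0.5 * n_total * (n_total - 1))

print(pairs)  # 792322, the Pairs entry of Table 2.4
print(round(somers_d, 3), round(gamma, 3), round(c_stat, 3), round(tau_a, 3))
```

Tau-a is much smaller than the other measures because its denominator counts all pairs of schools, including pairs from the same class, which can never be concordant or discordant.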
Section 3
In comparing the two models, it appears that the discriminant analysis may give relatively biased predictions when compared to the logistic regression. This reflects the possible bias of the model due to the violations of normality. While the two models do deviate from each other in their predictions of the probability of being classified into the HAA class, they are roughly similar (Figure 3.1).
Figure 3.1
Conclusion
From the two analyses, we can paint a very convincing picture: the average number of low-income students per school is associated with decreases in the probability of a school being classified into the High Academic Achievement class. Both models predict that schools with a high number of low-income students have a high probability of being classified as LAA, and therefore the models predict that those schools have a lower proportion of students that exceed expectations on ISAT scores. Not only did the average number of low-income students per school classify schools well, it did so better than any other demographic predictor. The model selection process described in Section 1 of the results is evidence for this point, as avg_stud_lowincome had a partial R-Square of 0.5345. This could provide a useful perspective for budgetary decisions, as the average number of low-income students explained much more variance than the average teacher salary per district (although this is a messy comparison, as there is variance in average teacher salary within a district). While this effect size may seem relatively small, it is actually quite high with regard to effect sizes commonly expected in social science. This also speaks to the general noisiness of the data. Further analysis could look at the relative performance of the discriminant model across each of the cohorts, or use a more sophisticated multivariate regression model in which ISAT scores for math and reading are multiple responses. Other types of classification schemes could also be applied to the data, such as K-Means clustering, non-parametric discriminant analyses, etc.
Reference
Cohen, J. (1983). Cost of dichotomization. Applied Psychological Measurement, 7(3), 249-250.
Der, G. & Everitt, B. S. (2002). A handbook of statistical analyses using SAS (2nd ed., p. 292). Boca Raton, FL: Chapman & Hall/CRC.
Sever, M., Lajovic, J., & Rajer, B. (2005). Robustness of the Fisher's discriminant. Metodološki zvezki, 2(2), 239-242.
Appendix:
A1. Some univariate results for avg_stud_lowincome:
LAA:
Moments
N                 842          Sum Weights        842
Mean              205.1981     Sum Observations   172776.8
Std Deviation     84.2103863   Variance           7091.38915
Skewness          -0.6552029   Kurtosis           -0.8303315
Uncorrected SS    41417329.4   Corrected SS       5963858.28
Coeff Variation   41.0385799   Std Error Mean     2.90208156
Basic Statistical Measures
Location              Variability
Mean     205.1981     Std Deviation         84.21039
Median   231.7000     Variance              7091
Mode     279.2000     Range                 300.00000
                      Interquartile Range   140.90000
Goodness-of-Fit Tests for Normal Distribution
Test                 Statistic          p Value
Kolmogorov-Smirnov   D      0.1670009   Pr > D
                     W-Sq
                     A-Sq
HAA:
Moments
N                 941          Sum Weights        941
Mean              59.7202976   Sum Observations   56196.8
Std Deviation     53.6670837   Variance           2880.15587
Skewness          1.18972537   Kurtosis           1.4619666
Uncorrected SS    6063436.14   Corrected SS       2707346.52
Coeff Variation   89.8640595   Std Error Mean     1.74949693
Basic Statistical Measures
Location              Variability
Mean     59.72030     Std Deviation         53.66708
Median   46.40000     Variance              2880
Mode     0.00000      Range                 282.30000
                      Interquartile Range   74.10000
Goodness-of-Fit Tests for Normal Distribution
Test                 Statistic          p Value
Kolmogorov-Smirnov   D      0.1328989   Pr > D
                     W-Sq
                     A-Sq
Statistics for Removal, DF = 1, 1707
Variable             Partial R-Square   F Value   Pr > F
avg_stud_lowincome   0.5411             2012.52

Statistic       Value      F Value   Num DF   Den DF
Wilks' Lambda   0.456281   677.64    3        1706