Building Predictive Models for NYC High Schools (Alec Hubel)

  • 8/13/2019 Building Predictive Models for NYC High Schools (Alec Hubel)

    Building Predictive Models for NYC Public High Schools
    Alec Hubel | Introduction to Data Science - Fall 2013

    Abstract

    The New York City public school system (responsible for the education of over 1 million students) is the largest in the country. Unfortunately, its size only makes it more susceptible to impeding issues. The fact that school budgets are consistently tightening is only worsened by the fact that American students are falling behind their international competition. As a way to monitor the success of a school, the Department of Education monitors two key statistics: high school graduation rates and aspirational performance measures. This study looks to uncover the key drivers of those measures in an attempt to isolate the factors that are most responsible for a successful education in New York City public schools.

    Introduction

    New York City public schools employ 75,000 teachers across over 1,700 schools. These teachers are responsible for the education of 1.1 million students and represent an overwhelming portion of the $24 billion annual budget. A system of this magnitude requires consistent monitoring in order to determine its efficacy. Unfortunately, an evaluation of each and every school, teacher, and student would be a huge draw on already limited resources. Because of this, the Department of Education must rely on certain performance metrics to decide if a school's performance is up to snuff. For high schools, the primary metrics that are used for this purpose are a school's graduation rate (what percentage of a senior class will successfully graduate in a given year) and its aspirational performance measure (APM). The New York State Department of Education uses the below definition for aspirational performance measures:

    The percent of students in the cohort who earned a Regents diploma with Advanced Designation (i.e., earned 22 units of course credit; passed 7-9 Regents exams at a score of 65 or above; and took advanced course sequences in Career and Technical Education, the arts, or a language other than English); and

    The percent of students in the cohort who graduated with a local, Regents, or Regents with Advanced Designation diploma and earned a score of 75 or greater on their English Regents examination and an 80 or better on a mathematics Regents exam (note: this aspirational measure is referred to as the "Math APM")

    This data point is meant to measure what percent of a graduating class is prepared for college or a post-high school career. For this analysis, I attempted to build 2 predictive models: one for a school's graduation rate, and one for a school's aspirational performance measure.

    Data

    The New York City Department of Education makes an enormous amount of data available for public use and review. Thanks to this fact, collecting all the data required for my analysis was substantially easier than anticipated. To start, I decided to focus on the 2011-2012 school year. I wanted to keep the data as recent as possible, in order to have my results be as reflective of the current status of the school system as possible. The data that I required was held across primarily 3 separate data-sets. The first data-set contained demographic data, presenting values for the racial composition of schools, what percentage of the student body qualified for free or subsidized school lunches (a common proxy for the income levels of a student population), student-teacher ratios, and the graduation rates and APMs of individual schools.

    The second data-set contained budgetary information for each of the schools. From this, I was able to extrapolate the dollars allocated per student. This would be a more useful measure for the funding of a school than the absolute budget, because a larger school would naturally have a larger budget, but may not necessarily have enough resources for all of the students that it is responsible for. I was also able to collect the average salary for teachers in a given school. My intention was to use this measure as a proxy for the quality of teachers in a given school. A teacher's salary in New York is determined by how much training they have received and how many years of experience they have. I decided to operate under the assumption that a school with a higher average teacher salary has higher quality teachers.
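
    To illustrate why a per-student figure can rank schools differently than the absolute budget, here is a minimal sketch. The school names, budgets, and enrollments are hypothetical values chosen for the example, and the function name is an assumption; none of them come from the study's data.

```python
def dollars_per_student(total_budget: float, enrollment: int) -> float:
    """Return the budget allocated per enrolled student."""
    if enrollment <= 0:
        raise ValueError("enrollment must be positive")
    return total_budget / enrollment


# Hypothetical schools: the larger school has the larger absolute budget,
# but less money per student.
schools = [
    {"name": "Hypothetical Large HS", "budget": 30_000_000, "enrollment": 3_000},
    {"name": "Hypothetical Small HS", "budget": 6_000_000, "enrollment": 400},
]

for school in schools:
    school["per_student"] = dollars_per_student(
        school["budget"], school["enrollment"]
    )
    print(f'{school["name"]}: ${school["per_student"]:,.0f} per student')
```

    In this made-up case the large school receives five times the budget of the small one, yet only two thirds as much per student, which is the distinction the per-student measure is meant to capture.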

    Drawing 1: NYC School Districts

    Drawing 2: Heatmap of Demographic Data by District

    Lastly, I collected a data-set from the yearly school survey that is administered to parents, teachers, and students. From this survey, the NYC Department of Education is able to extract scores for safety and respect, communication, engagement, and academic expectations. Additionally, it contained information on the extracurricular offerings of a school.

    Each school in the above data sets was given a unique identifier called a "DBN". This was extremely useful for two reasons. Firstly, it allowed me to use the Pandas "join" function to easily combine all of my data into a single data frame, without too much extraneous data cleaning. Secondly, the DBN allowed me to extract the district and borough for a given school. For example, Bronx Leadership Academy High School's DBN is 09X525. The first two digits (09) signify that this school is located in district 9 (there are 32 school districts within New York City). The third character (X) corresponds to the Bronx (the other letter/borough pairs are M/Manhattan, Q/Queens, R/Staten Island, and K/Brooklyn).
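
    The DBN decoding described above can be sketched in a few lines of plain Python. The function name `parse_dbn` and the returned field names are assumptions made for this illustration, as is treating everything after the third character as a school code; the letter-to-borough mapping is the one listed in the text.

```python
# Borough letters as listed in the text.
BOROUGH_BY_LETTER = {
    "M": "Manhattan",
    "X": "Bronx",
    "Q": "Queens",
    "R": "Staten Island",
    "K": "Brooklyn",
}


def parse_dbn(dbn: str) -> dict:
    """Split a DBN into district number, borough name, and remaining code."""
    dbn = dbn.strip().upper()
    if len(dbn) < 3 or not dbn[:2].isdigit():
        raise ValueError(f"unexpected DBN format: {dbn!r}")
    if dbn[2] not in BOROUGH_BY_LETTER:
        raise ValueError(f"unknown borough letter in DBN: {dbn!r}")
    return {
        "district": int(dbn[:2]),               # first two digits
        "borough": BOROUGH_BY_LETTER[dbn[2]],   # third character
        "school_code": dbn[3:],                 # assumed: rest of the identifier
    }


print(parse_dbn("09X525"))
```

    Applied to every row of the combined data frame, a helper like this would yield district and borough columns without needing a separate lookup table.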

    Methodology

    First, I had to narrow down my data-set from the 1,700+ schools to the 402 high schools in the NYC school system. [?]0 of those schools did not report graduation rates and APM measures. This is a result of a regulatory requirement that prevents a school from releasing this information when there are 20 or fewer graduates (generally the smaller schools in the system). After removing those schools with missing data, I employed a randomized 70/30 split to create a training set and a testing set. For both of my models, I was attempting to predict a continuous value (graduation rate and APM). As such, I decided to use scikit-learn's ridge regression algorithm. I began with a "kitchen sink" approach and threw all of my variables into the model. I then removed variables one-by-one until I could isolate the factors that most influenced graduation rate and APM. Variables were selected for removal when their p-values indicated a lack of statistical significance and their absence from my model did not substantially detract from my model's accuracy. The accuracy of my model was determined using both the R-squared and mean squared error (MSE).
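
    The split, fit, and scoring steps described above might look roughly like the sketch below. It runs on synthetic data, not the study's data: the 200 rows, the three predictors, the coefficients, the random seeds, and `alpha=1.0` (scikit-learn's default) are all assumptions for illustration. scikit-learn's `Ridge` estimator does not itself report p-values, so the variable-elimination step is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the school-level table: 200 hypothetical schools
# with 3 hypothetical predictors. None of this is the study's data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coefs = np.array([0.08, -0.05, 0.0])  # third predictor is pure noise
y = 0.70 + X @ true_coefs + rng.normal(scale=0.02, size=200)

# Randomized 70/30 split into a training set and a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Ridge regression; alpha=1.0 is the library default, not a value from the study.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

# Score on the held-out 30% with the two metrics named in the text.
predictions = model.predict(X_test)
r_squared = r2_score(y_test, predictions)
mse = mean_squared_error(y_test, predictions)

print(f"R-squared: {r_squared:.2f}, MSE: {mse:.4f}")
print("coefficients:", model.coef_)
```

    Scoring on the held-out portion rather than the training rows is what makes the reported R-squared and MSE an estimate of how the model would perform on schools it has not seen.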

    Results - Graduation Rate

    The accuracy of my final model for graduation rate came in with an R-squared of 0.61 and an MSE of 0.0132. Most of the results were relatively unsurprising. Having a high portion of your students receiving subsidized or free school lunches (i.e. a poorer student body) resulted in a lower graduation rate, while having a better funded school on a per-student basis had the opposite effect. Having a larger portion of your student body represent non-white ethnicities, come from households where English is not the native language, or qualify for special education also put downward pressure on a school's graduation rate.

    The results of the survey data were a bit more interesting. Higher academic expectations and safety/respect scores resulted in strong and significant improvements in a school's graduation rate, while the other 2 scores derived from the survey data (engagement and communication) did not have significant impacts. However, a stronger student response rate, regardless of what kind of answers the students gave, seemed to correlate strongly with a school's graduation rate. This may suggest that students who are motivated to perform well in school will most likely complete a school sponsored survey, in addition to putting forth effort in their classes. The prevalence of extracurricular activities also impacted graduation rate positively, though not all extracurriculars were created equal. Only sports, academic/tutoring clubs, and theater clubs seemed to have a significant and positive impact on graduation rate.

    One unexpected result did emerge from the analysis. The student-teacher ratio of a school had a statistically significant impact with a positive coefficient (i.e. more students per teacher resulted in a higher graduation rate). While it may be a bit of a stretch to suggest a policy where we have fewer teachers in our schools, it may be worthwhile to reconsider the conventional wisdom that hiring more teachers can be a cure-all for struggling schools. Perhaps better teachers would have a greater impact than more teachers.

    Results - APM

    The final iteration of my model for APM yielded an R-squared of 0.68 and an MSE of 0.0119, incrementally more accurate than my model for graduation rates. Generally speaking, the same drivers that impacted graduation rates impacted a school's aspirational performance measure. Wealthy and safe schools with a high proportion of white students tended to outperform. Extracurricular activities had a strongly positive impact, with music and technology clubs joining the sports and academic clubs as the extracurriculars that had an outsized positive impact.

    There was one takeaway from this model that stood out from the crowd: the most predictive variable in this model was the percentage of the student body that was of Asian descent. In fact, when I built this model with that data point as the sole variable, it generated an R-squared of 0.48. While this lends credence to the notion that Asian students tend to outperform on standardized tests, it does not solve the question of why Asian students tend to outperform. A common hypothesis is that Asian students tend to have a stronger work ethic and spend more time studying, naturally yielding better test scores, but I do not have access to data that could confirm or refute that idea.

    Caveats and Future Research

    There are a few caveats to consider in order to put the results of this analysis in the proper context. Firstly, the magnet school system in New York City throws a bit of a wrench into the data. At the end of middle school, every New York City public school student takes a standardized test. Perform well enough on that test, and they will be admitted into one of the higher performing or specialty schools in the city (e.g. Stuyvesant, Bronx Science, etc.). This system creates two issues. Firstly, there will be self-selection bias. Students with higher natural ability will go to better schools, reinforcing their high standing (particularly in terms of graduation rate and aspirational performance measures). Secondly, many students do not attend schools in their home boroughs or districts. This may be one of the reasons that a school's location was not particularly significant in my models.

    More data would have been useful as well. Direct metrics for the natural ability of a student body, non-school activities that take up a student's time outside of the classroom, and the quality of teaching staff were unavailable. Some of this may be alleviated soon: in the 2013-2014 school year, teachers will be evaluated on a continuous scale. These data points may prove useful in future research.

    Lastly, the results of this study would have been much more intriguing if the data could be collected on a student level, as opposed to a school level. After all, policy recommendations directed at schools are meant to improve the quality of education for individual students. If the school level could be bypassed, it may be easier to identify ways to more directly help students.