Beyond Early Warning Indicators: High School Dropout and Machine Learning

Published date: 01 April 2019
DOI: http://doi.org/10.1111/obes.12277
Author: Dario Sansone
©2018 The Department of Economics, University of Oxford and John Wiley & Sons Ltd.
OXFORD BULLETIN OF ECONOMICS AND STATISTICS, 81, 2 (2019) 0305–9049
doi: 10.1111/obes.12277
Beyond Early Warning Indicators: High School
Dropout and Machine Learning*
Dario Sansone
Department of Economics, Georgetown University, ICC 580, 37th and O Streets, N.W.,
Washington, DC 20057-1036, USA (e-mail: ds1289@georgetown.edu)
Abstract
This paper combines machine learning with economic theory in order to analyse high school
dropout. It provides an algorithm to predict which students are going to drop out of high
school by relying only on information from 9th grade. This analysis emphasizes that using
a parsimonious early warning system – as implemented in many schools – leads to poor
results. It shows that schools can obtain more precise predictions by exploiting the available
high-dimensional data jointly with machine learning tools such as Support Vector Machine,
Boosted Regression and Post-LASSO. Goodness-of-fit criteria are selected based on the
context and the underlying theoretical framework: model parameters are calibrated by taking
into account the policy goal – minimizing the expected dropout rate – and the school budget
constraint. Finally, this study verifies the existence of heterogeneity through unsupervised
machine learning by dividing students at risk of dropping out into different clusters.
I. Introduction
High school dropout is a key issue in the US educational system: only 83.2% of students
graduated with a regular high school diploma within 4 years of starting 9th grade in 2015.
According to the OECD (2016), the US upper-secondary graduation rate of 82% is below
average among advanced economies (85%), and far from the graduation rates in Germany
(91%), Japan (97%) and Finland (97%). Furthermore, there are substantial gender, racial
and geographical gaps within the United States (IES, 2016).1
This issue has been extensively analysed by researchers in economics and public policy
(De Witte et al., 2013; Murnane, 2013). The U.S. Department of Education provided
almost $1.5 billion in grants to schools investing in innovative practices aimed at
increasing graduation rates between 2010 and 2016 (Office of Innovation & Improvement, 2016).
Failing to graduate from high school has high costs, as only 12% of all jobs in the
economy will require less than a high school diploma by 2020 (Carnevale, Smith and Strohl,
2013). Schooling also has several non-pecuniary benefits ranging from health to happiness,
marriage, trust, and work enjoyment (Oreopoulos, 2007; Oreopoulos and Salvanes, 2011).
JEL Classification numbers: C53; C55; I20.
*I am grateful to the Editor Brian Bell, one anonymous referee, Garance Genicot, Francis Vella, Laurent Bouton,
Daniel Ackerberg, Pooya Almasi, Mary Ann Bronson, Nick Buchholz, Benjamin Connault, Francis DiTraglia, Luca
Flabbi, Myrto Kalouptsidi, Madhulika Khanna, Ivana Komunjer, Gizem Kosar, Arik Levinson, Whitney Newey, Hiren
Nisar, Elena Arias Ortiz, Franco Peracchi, Mariacristina Rossi, John Rust, Bernard Salanié, Shuyang Sheng, Arthur
van Soest, Allison Stashko, Basit Zafar and participants to the 2018 SOLE Conference, the 2017 Stata Conference,
the 2017 GCER Alumni Conference, the George Washington University SAGE Conference, the 2017 APPAM DC
Regional Student Conference, and the Georgetown University EGSO seminar for their helpful comments. I am also
grateful to John Rust and Judith House for their technical support. The usual caveats apply.
1 It should be mentioned that graduation rates, racial differences and time trends are extremely sensitive to the
sample used, as well as to whether GED recipients are counted as high school graduates (Heckman and LaFontaine, 2010).
This paper shows how machine learning (ML) and economic theory can be jointly
applied in education. In particular, this paper creates a model that identifies students who
are at risk of dropping out using information from their first year of high school. In doing
so, it also illustrates how ML can be used to identify top predictors and heterogeneity
among students. In addition, the first part of this paper demonstrates that trying to predict
vulnerable students using a limited number of educational variables can detect only a
small fraction of those students who actually end up dropping out of high school. This
result is especially relevant since schools often rely on these few early warning indicators
to identify students who are struggling academically (O’Cummings and Therriault, 2015).
Indeed, educators are advised to focus only on attendance, school behaviour and course
grades to find students at-risk, even when there is minimal empirical evidence to support this
recommendation (Rumberger et al., 2017). In contrast to these practices, this paper shows
how schools can exploit available big data, jointly with ML techniques, to substantially
improve these predictions. These more advanced algorithms have the potential to correctly
identify thousands of additional students who are at risk of dropping out every year.
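The claim that a few indicators detect only a small fraction of eventual dropouts can be made concrete with a small sketch. The rule below is a hypothetical early warning indicator based on attendance, behaviour and course grades. Its thresholds, the field names and the eight toy records are illustrative assumptions and are not taken from the paper or its data. The quantity of interest is the sensitivity: the share of students who actually dropped out that the rule had flagged.

```python
from dataclasses import dataclass


@dataclass
class Student:
    attendance_rate: float  # share of school days attended in 9th grade
    suspensions: int        # behaviour incidents in 9th grade
    gpa: float              # 9th grade course grades on a 0-4 scale
    dropped_out: bool       # outcome observed later


def early_warning_flag(s: Student) -> bool:
    """Hypothetical rule: flag on low attendance, any suspension, or low GPA."""
    return s.attendance_rate < 0.90 or s.suspensions >= 1 or s.gpa < 2.0


def sensitivity(students, flag) -> float:
    """Share of actual dropouts that the rule flags (true positive rate)."""
    dropouts = [s for s in students if s.dropped_out]
    if not dropouts:
        return float("nan")
    return sum(flag(s) for s in dropouts) / len(dropouts)


def precision(students, flag) -> float:
    """Share of flagged students who actually drop out."""
    flagged = [s for s in students if flag(s)]
    if not flagged:
        return float("nan")
    return sum(s.dropped_out for s in flagged) / len(flagged)


# Toy records (made up for illustration only).
students = [
    Student(0.85, 0, 2.5, True),   # flagged, drops out
    Student(0.95, 0, 3.0, True),   # missed by the rule
    Student(0.97, 0, 2.8, True),   # missed by the rule
    Student(0.92, 2, 1.8, True),   # flagged, drops out
    Student(0.88, 0, 3.5, False),  # flagged, graduates
    Student(0.98, 0, 3.9, False),
    Student(0.96, 0, 3.2, False),
    Student(0.99, 0, 2.1, False),
]

rule_sensitivity = sensitivity(students, early_warning_flag)
rule_precision = precision(students, early_warning_flag)
print(f"sensitivity={rule_sensitivity:.2f}, precision={rule_precision:.2f}")
```

In this toy sample the rule misses two of the four dropouts because neither shows low attendance, a suspension or a low GPA. A model with access to more variables could in principle pick up such students, which is the comparison the paper goes on to make.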
After having identified vulnerable students, this paper illustrates the application of
unsupervised ML to cluster such individuals into different groups based on their observable
characteristics. Clustering students has two advantages. First, it emphasizes that these
students are not a homogeneous group: the ML algorithm may classify some students as
at-risk because they are academically weak, while others may be predicted as dropouts
because they live in unsafe neighbourhoods or they come from very poor households. The
latter group would likely require different programmes than the first one. Tutoring might be
more appropriate for students struggling in certain subjects, while combining tutoring with
counselling might be more effective for students with disadvantaged backgrounds. ML can
therefore be used to identify students at-risk, and to help design treatments appropriate for
each sub-population. Second, it is possible to evaluate how a policy has different impacts
among students in various clusters. Indeed, any dropout prevention programme can have
different effects depending on a student’s gender, race, ability and income, as well as across
sub-populations. In this way, it is possible to estimate heterogeneous effects not only on different
demographic groups, but also on multidimensional groups.
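The clustering step can be sketched as follows. This excerpt does not say which unsupervised algorithm is used, so the example assumes plain k-means (Lloyd's algorithm) with two clusters. The two features, a standardized test score and a standardized household income, and the six data points are hypothetical and chosen only to mirror the two profiles described above: academically weak students and students from very poor households.

```python
import math


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm. Returns (labels, centroids).

    Initial centroids are the first k points, so the result is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Hypothetical at-risk students: (test score z-score, household income z-score).
at_risk = [
    (-1.5, 0.2),   # weak test score, average income
    (0.1, -1.6),   # average test score, very low income
    (-1.2, -0.1),
    (-0.2, -1.9),
    (-1.8, 0.1),
    (0.2, -1.4),
]

labels, centroids = kmeans(at_risk, k=2)
for j, c in enumerate(centroids):
    print(f"cluster {j}: mean score={c[0]:.2f}, mean income={c[1]:.2f}, "
          f"size={labels.count(j)}")
```

Cluster 0 collects the students with low test scores and cluster 1 those with low household income. A programme evaluation could then report outcomes separately by these labels to look for heterogeneous effects across multidimensional groups.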
This paper is related to the emerging literature in ML. The main focus of econometric
techniques is causal inference, i.e. to provide unbiased or consistent estimates of the impact
of a variable x on an outcome y. On the other hand, ML is more appropriate for prediction
since its goal is to maximize out-of-sample prediction. Algorithms can identify patterns too
subtle to be detected by human observations (Luca, Kleinberg and Mullainathan, 2016),
thus outperforming econometric models built using heuristic or theory-based approaches.
Although there are several policy-relevant issues that do not require causal inference,
but rather accurate predictions (Kleinberg et al., 2015), ML applications have been quite
limited in economics so far. However, ML is gaining momentum (Belloni, Chernozhukov