Journal of Peace Research, 2017, Vol. 54(2), 193–214
DOI: 10.1177/0022343316682065
Published 1 March 2017
Research Article
Do the robot: Lessons from machine learning to improve conflict forecasting
Michael Colaresi & Zuhaib Mahmood
Department of Political Science, Michigan State University
Corresponding author: colaresi@msu.edu
Abstract
Increasingly, scholars interested in understanding conflict processes have turned to evaluating out-of-sample forecasts to judge and compare the usefulness of their models. Research in this vein has made significant progress in identifying and avoiding the problem of overfitting sample data. Yet there has been less research providing strategies and tools to practically improve the out-of-sample performance of existing models and connect forecasting improvement to the goal of theory development in conflict studies. In this article, we fill this void by building on lessons from machine learning research. We highlight a set of iterative tasks, which David Blei terms ‘Box’s loop’, that can be summarized as build, compute, critique, and think. While the initial steps of Box’s loop will be familiar to researchers, the underutilized process of model criticism allows researchers to iteratively learn more useful representations of the data generation process from the discrepancies between the trained model and held-out data. To benefit from iterative model criticism, we advise researchers not only to split their available data into separate training and test sets, but also to sample from their training data to allow for iterative model development, as is common in machine learning applications. Since practical tools for model criticism in particular are underdeveloped, we also provide software for new visualizations that build upon already existing tools. We use models of civil war onset to provide an illustration of how our machine learning-inspired research design can simultaneously improve out-of-sample forecasting performance and identify useful theoretical contributions. We believe these research strategies can complement existing designs to accelerate innovations across conflict processes.
Keywords
civil war, forecasting, machine learning, methodology, visualization
International relations and the social sciences in general have been increasingly focused on building models that provide useful out-of-sample predictions (Witmer et al., 2017; Ward & Beger, 2017). While forecasting is not new, the recent momentum comes at a time when the weaknesses of traditional applications of in-sample null hypothesis significance testing (NHST) are increasingly apparent across disciplines (Gelman & Loken, 2014; Gill, 1999; Simmons, Nelson & Simonsohn, 2013). The weight of failed replications and exaggerated effect sizes, in addition to well-known worries of overfitting the sample data, has led many researchers to search for new research design strategies that might accelerate breakthroughs in social research. Indeed, there are several exciting examples in Geography, Climatology, Natural Language Processing, and other disciplines, where researchers have been able to harness the availability of dense sources of digital information without relying on NHST (Chen & Manning, 2014; Blei, 2014; Raftery et al., 2005; McCormick et al., 2012).
In an important article, Ward, Greenhill & Bakke (2010: 365) argue that the ‘search for statistical significance’ can be a misleading metric both for how well a model represents the underlying patterns in the data and for how the model will generalize to unseen data. These are two distinct research pitfalls, which we term underperformance and overfitting. A model, by design, is a stylized representation of the underlying process of interest.
An extremely simple model, such as a linear-additive representation of conflict that only includes a few variables, might capture a small number of patterns in the underlying process. While these patterns might generalize out-of-sample, meaning they are representing observable features of the data generation process, they might also exclude many other signals. We refer to this situation as underperformance on unseen data. Conversely, a model might represent a myriad of patterns in the data used to fit the model, but these patterns may not generalize to new data. This is conventionally known as overfitting the sample data. Using two highly cited models of civil war, Ward, Greenhill & Bakke (2010) provide examples of both underperformance and overfitting. Even when models have many statistically significant coefficients, the resulting predictions on new data will underperform, faring no better than very simple models that take into account only one or two features of the process.¹ These models have learned only a small number of patterns, but no more. Additionally, this research team provides examples where adding statistically significant variables – sometimes with coefficients many times the size of their standard errors – actually ‘degrade[s] the predictive accuracy of a model’ (Ward, Greenhill & Bakke, 2010: 373). In these cases, the models have overfit by learning patterns in the sample of data used to compute estimates that do not generalize out-of-sample. This research leaves us with the question: if gazing at stars is a poor guide to the future, what research design strategy can take its place and reduce underperformance and overfitting?
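As a concrete, if stylized, illustration of these two pitfalls, the sketch below fits a sparse logistic regression (GDP and population only) and a richer specification on a training split and compares them on held-out data. It is a minimal sketch, not the replication code for Ward, Greenhill & Bakke (2010): the file name, column names, and feature lists are hypothetical placeholders, and a real forecasting exercise would hold out later years rather than a random subset.

```python
# Minimal sketch: compare a sparse and a richer logistic model on held-out data.
# The CSV path, the 'onset' outcome, and the feature names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

sparse_features = ["log_gdp_pc", "log_population"]
rich_features = sparse_features + ["ethnic_frac", "polity2", "mountainous", "oil_exporter"]

df = pd.read_csv("civil_war_country_years.csv")  # hypothetical country-year data
train, test = train_test_split(df, test_size=0.3, random_state=42)

for name, feats in [("sparse", sparse_features), ("rich", rich_features)]:
    model = LogisticRegression(max_iter=1000).fit(train[feats], train["onset"])
    auc = roc_auc_score(test["onset"], model.predict_proba(test[feats])[:, 1])
    print(f"{name} model, out-of-sample AUC: {auc:.3f}")
```

If the richer model's out-of-sample score fails to beat the sparse one, the additional statistically significant variables have likely been overfit; if both scores are low, the models underperform.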
Machine learning-inspired research design as an alternative to NHST
In this article, we detail an alternative workflow to NHST that builds on established approaches in machine learning. We also provide an application of this workflow to civil war forecasting, emphasizing the role of model criticism and predictive performance in the model-building process. In contrast to NHST, machine learning-inspired research designs have at their core a set of distinct iterative steps that an applied researcher cycles through to learn generalizable patterns from the available data. Our proposed steps, inspired by David Blei’s summary of crucial insights from George Box’s loop (Blei, 2014; Box, 1980), include building a mathematical representation using domain knowledge; computing the unknown parameters and weights from the mathematical representation with training data; critiquing the fitted model by identifying theoretically relevant discrepancies between the model and new data; and then using the new knowledge of these discrepancies to think of what patterns may have generated them.² These research subtasks are then repeated until a researcher is satisfied with the model performance on a specified task.
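A rough sense of how these subtasks fit together in code is given below. The data and the candidate specifications are synthetic placeholders, and the 'think' step, which in practice relies on human domain knowledge, is crudely automated here as picking the specification with the best development-set Brier score; none of this is the authors' software, only a sketch of the loop's structure.

```python
# Sketch of Box's loop (build, compute, critique, think) with a train/dev/test
# split. Data and candidate specifications are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = (rng.random(2000) < 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))).astype(int)

# Hold out a final test set, then carve a development set out of the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
X_fit, X_dev, y_fit, y_dev = train_test_split(X_train, y_train, test_size=0.25, random_state=2)

candidate_specs = [[0], [0, 1], [0, 1, 2, 3], list(range(6))]  # build: feature sets
best_spec, best_score = None, np.inf
for spec in candidate_specs:
    model = LogisticRegression(max_iter=1000).fit(X_fit[:, spec], y_fit)        # compute
    score = brier_score_loss(y_dev, model.predict_proba(X_dev[:, spec])[:, 1])  # critique
    if score < best_score:                                                      # think
        best_spec, best_score = spec, score

final = LogisticRegression(max_iter=1000).fit(X_train[:, best_spec], y_train)
print("final test Brier score:",
      brier_score_loss(y_test, final.predict_proba(X_test[:, best_spec])[:, 1]))
```

The test set is touched only once, at the very end; all iteration happens against the development sample, mirroring the split-the-training-data practice described in the abstract.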
In this cyclical setup, out-of-sample predictions must be ruthlessly critiqued across multiple scales so that new features and specifications can be devised to improve performance in the next build of the model. Model criticism is uniquely beneficial because it identifies discrepancies in the current model. These discrepancies, once identified, can help researchers construct more useful models of conflict processes. Moreover, in thinking about these discrepancies, researchers are by definition updating their domain knowledge, generating new ideas about the relevant data generation mechanisms.
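One simple way to operationalize this kind of criticism, sketched below with made-up values, is to rank held-out observations by the gap between the observed outcome and the model's predicted probability; the column names and the `preds` table are hypothetical, standing in for a model's development-set predictions.

```python
# Sketch of model criticism: rank held-out cases by the discrepancy between
# outcome and predicted probability. All values here are made up.
import pandas as pd

preds = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year":    [1994, 2002, 1997, 2005],
    "onset":   [1, 0, 1, 0],
    "p_hat":   [0.03, 0.68, 0.41, 0.02],  # model's predicted onset probabilities
})

preds["discrepancy"] = (preds["onset"] - preds["p_hat"]).abs()
# Onsets the model nearly missed (and non-onsets it flagged strongly) are the
# cases most worth thinking about when revising the specification.
print(preds.sort_values("discrepancy", ascending=False))
```

Studying the rows at the top of such a list is where critique turns into thinking: each large discrepancy invites a question about what feature of the data generation process the current build fails to represent.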
Machine learning for humans
Machine learning is a relatively recent field that has its roots in artificial intelligence research. Successful applications of machine learning over the last several decades, in tasks ranging from spam filtering to playing Jeopardy!, have validated the usefulness of this approach (Siegel, 2013). The goal of machine learning is deceptively simple. According to Mitchell (1998), a program is said to learn from experience E with respect to some task T and related performance measure P if its performance on T, measured by P, increases with E. The canonical example is writing a computer program that can learn to play a game such as checkers. A set of practice games, with known outcomes, is provided as the experience E. The researcher clearly defines a task T, such as winning future games against humans, and measures the performance on that task with a function P, such as the proportion of games won (Mitchell, 1998).
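To make the definition concrete in a forecasting setting, the sketch below treats out-of-sample binary prediction as the task T, the area under the ROC curve as the performance measure P, and the number of training observations as the experience E; the data and model are synthetic stand-ins rather than anything from the article.

```python
# Sketch of Mitchell's (E, T, P) framing: performance P on task T should
# improve as experience E grows. Data here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 4))
y = (rng.random(5000) < 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))).astype(int)
X_test, y_test = X[4000:], y[4000:]          # fixed evaluation set for task T

for n in (50, 200, 1000, 4000):              # increasing experience E
    model = LogisticRegression(max_iter=1000).fit(X[:n], y[:n])
    p = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # measure P
    print(f"E = {n:4d} training observations -> P (test AUC) = {p:.3f}")
```

In the article's application, E corresponds roughly to the historical country-year observations, T to forecasting civil war onset out-of-sample, and P to a forecast-quality measure computed on held-out data.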
Since the goal of machine learning is increasing performance as new experiences become available, avoiding both underperformance and overfitting is a central concern (Flach, 2012). Learning a model that overfits the
¹ For example, Ward, Greenhill & Bakke (2010) find that a model measuring only GDP and population performs nearly as well as a model that accounts for 11 features. The models explored in this example are linear and additive on the log-odds scale.
² Blei (2014) includes the first three of our subtasks, but since our emphasis is on the contrast between NHST and machine learning-inspired research designs, we highlight the distinction between critiquing a model to identify discrepancies and subsequently thinking about how the discrepancies relate to domain knowledge.