OXFORD BULLETIN OF ECONOMICS AND STATISTICS, 81, 4 (2019) 0305–9049
doi: 10.1111/obes.12289
Specification Searching and Significance Inflation
Across Time, Methods and Disciplines*
Eva Vivalt
Research School of Economics, Australian National University, Acton, ACT 2601, Australia
(e-mail: eva.vivalt@anu.edu.au)
Abstract
This paper examines how significance inflation has varied across time, methods and disciplines. Leveraging a unique data set of impact evaluations on 20 kinds of development programmes, I find that results from randomized controlled trials exhibit less significance inflation than results from studies using other methods. Further, randomized controlled trials have exhibited less significance inflation over time, but quasi-experimental studies have not. There is no robust difference between results from researchers affiliated with economics departments and those from researchers affiliated with other predominantly health-related departments. Overall, the biases found appear much smaller than those previously observed in other social sciences.
I. Introduction
Specification searching is a concern for all quantitative disciplines. However, it is not clear when it is likely to happen. The term 'specification searching' could be used to refer to several phenomena; here, I narrowly consider what I will call 'significance inflation', i.e. any process that leads the statistical significance of reported results to be inflated, such as running multiple regressions and disproportionately reporting those that are significant, or collecting more data until results are significant. This is also known as 'p-hacking'.
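To see why such behaviour inflates significance, consider a minimal simulation sketch (mine, not the paper's) of one of the processes just described, collecting more data until results are significant. Everything here, from the function name to the sample sizes, is an illustrative assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def optional_stopping_p(n_start=20, n_max=200, step=10):
    """Draw from a null with zero mean, re-testing after each batch of
    new observations and stopping as soon as p < 0.05 (or at n_max).
    Any 'significant' result is spurious by construction."""
    x = list(rng.normal(size=n_start))
    while True:
        p = stats.ttest_1samp(x, popmean=0.0).pvalue
        if p < 0.05 or len(x) >= n_max:
            return p
        x.extend(rng.normal(size=step))

reps = 1000
false_positive_rate = np.mean([optional_stopping_p() < 0.05 for _ in range(reps)])
# The nominal rate is 0.05; repeated peeking typically pushes it far higher.
print(f"false-positive rate under optional stopping: {false_positive_rate:.2f}")
```

The loop is the whole story: each interim test gives chance another opportunity to produce p < 0.05, so the realized false-positive rate climbs well above the nominal level.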
This paper exploits a new database of published articles and unpublished working papers relating to international development to explore the issue. The data, collected in the process of conducting 20 meta-analyses of development programmes, allow me to test for differences in significance inflation across time, methods and disciplines.
*I am very grateful to the editor, Jonathan Temple, and three anonymous referees for useful comments and suggestions. I also thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal Bó, Hunt Allcott, Elizabeth Tipton, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel, Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia University, New York University, the World Bank, Princeton University, the University of Toronto, the London School of Economics, Cornell University, the University of Ottawa, the Stockholm School of Economics, and participants at the 2015 ASSA meeting and 2013 Association for Public Policy Analysis and Management Fall Research Conference for feedback on an earlier draft of this paper. I am also grateful for the work put in by many at AidGrade to create the data set used in this paper, including Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha Jalil, Risa Santoso and Catherine Razeto.
I find that randomized controlled trials (RCTs) exhibit less significance inflation than papers using a quasi-experimental approach. I also find that randomized controlled trials previously exhibited significance inflation, but that this decreased over time; in contrast, biases in quasi-experimental studies have, if anything, grown over time. I compare results from studies by economists with results from studies by non-economists, but the differences are insignificant in most cases. In the data set considered, these 'non-economists' are mostly health researchers. Both the economics and non-economics results appear to suffer much less significance inflation than has been found in other social sciences (Gerber and Malhotra, 2008a,b). I consider both published and unpublished papers and find more bias in published papers.
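How is such bias detected? Gerber and Malhotra's caliper test compares the number of test statistics falling just above the conventional critical value with the number falling just below it; under the null of no inflation the split should be roughly even. The sketch below is a generic rendering of that idea, not the paper's own specification; the caliper width and function name are my assumptions.

```python
import numpy as np
from scipy import stats

def caliper_test(z_stats, critical=1.96, caliper=0.10):
    """Count |z|-statistics falling just above vs. just below the critical
    value. Absent inflation, a result near the threshold is (roughly)
    equally likely to land on either side, so the count just above should
    look like Binomial(n, 0.5); an excess above is evidence of inflation."""
    z = np.abs(np.asarray(z_stats, dtype=float))
    over = int(np.sum((z >= critical) & (z < critical + caliper)))
    under = int(np.sum((z >= critical - caliper) & (z < critical)))
    # One-sided binomial test for an excess of just-significant results
    p = stats.binomtest(over, over + under, 0.5, alternative="greater").pvalue
    return over, under, p

# Example: z-statistics drawn from a standard normal null show no excess.
rng = np.random.default_rng(0)
print(caliper_test(rng.normal(size=5000)))
```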
Specification searching has long been seen to be a problem in medicine (e.g. Simes, 1986; Begg and Berlin, 1988) and psychology (Bastardi, Uhlmann and Ross, 2011; Simmons and Simonsohn, 2011). Leamer recognized the problem in economics quite early (Leamer, 1978), and recently there has been new interest in the social sciences (Franco, Malhotra and Simonovits, 2014), including political science (Gerber and Malhotra, 2008a), sociology (Gerber and Malhotra, 2008b), and economics (Brodeur et al., 2016; Bruns, 2017; Ioannidis, Stanley and Doucouliagos, 2017). However, the possibility that there may be differences in significance inflation by method and discipline has not yet been fully explored.
Significance inflation could vary by method and discipline for many reasons. If the journals of different disciplines were to have different selection functions, and authors engage in specification searching to try to meet a journal's requirements for publication, this would be sufficient to generate differences in significance inflation. The journals of some fields, for example, may simply be more competitive, raising the bar such that only papers with significant results are published, assuming that significant results are always preferred to insignificant results. Alternatively, it could be the case that some disciplines place more weight on methods and are more likely to accept any well-done RCT; then we would expect RCTs to exhibit less significance inflation in that discipline, as it would be easier for such papers to reach the threshold for publication without significant results. We may also think that the level of significance inflation needed to be competitive at a journal may depend on whether others submitting to the same journals are engaging in it, so different journals or disciplines could be at different equilibria. There are also reasons to believe RCTs suffer less significance inflation independent of journals. Authors who conduct RCTs may be more likely to register a pre-analysis plan, which would serve to diminish the opportunity for running many regressions and selectively reporting results, or for increasing one's sample size until results are significant. Further, randomization should, in principle, lead to covariate balance across the treatment and control groups; given that one way in which p-hacking may occur is by authors including different combinations of control variables until finding a significant result, randomization could decrease the risk of significance inflation if researchers find it harder to justify including control variables in an RCT.
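That last channel is straightforward to simulate. In the sketch below (again mine, not the paper's; all parameters are illustrative), a 'researcher' regresses an outcome on a randomized treatment with zero true effect and reports the specification with the largest t-statistic across every subset of six irrelevant controls.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def tstat_on_treatment(y, X):
    """OLS t-statistic on the first column of X (the treatment dummy)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    cov = (resid @ resid / (n - k)) * np.linalg.inv(X.T @ X)
    return beta[0] / np.sqrt(cov[0, 0])

def best_t_over_control_sets(n=200, n_controls=6):
    """Treatment is randomized and has zero true effect; the 'researcher'
    reports the specification with the largest |t| across all 2^6 = 64
    subsets of irrelevant candidate controls."""
    d = rng.integers(0, 2, size=n).astype(float)   # randomized treatment
    Z = rng.normal(size=(n, n_controls))           # candidate controls
    y = rng.normal(size=n)                         # outcome, unrelated to d
    best = 0.0
    for r in range(n_controls + 1):
        for cols in combinations(range(n_controls), r):
            X = np.column_stack([d, np.ones(n), Z[:, list(cols)]])
            best = max(best, abs(tstat_on_treatment(y, X)))
    return best

reps = 200
rate = np.mean([best_t_over_control_sets() > 1.96 for _ in range(reps)])
# Typically exceeds the nominal 0.05, though randomization keeps the
# inflation modest: with balanced covariates, the candidate controls are
# uncorrelated with treatment, so the specifications are highly correlated.
print(f"share 'significant' after specification search: {rate:.2f}")
```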
All these considerations provide reason to suspect that significance inflation could differ systematically by method or field, though I will not be able to determine precisely which of these or other factors are responsible for the patterns of specification searching I observe. I also detect more signs of bias in the published literature compared to the unpublished literature.