Hindsight bias in expert surveys: How democratic crises influence retrospective evaluations

Politics, 2020, Vol. 40(4), 494–509
© The Author(s) 2020
DOI: 10.1177/0263395720914571
journals.sagepub.com/home/pol
Published: 01 November 2020

Laura Levick
St. Thomas University, Canada
Mauricio Olavarria-Gambi
Universidad de Santiago de Chile, Chile
Abstract
Expert surveys provide a standardized way to access and synthesize specialized knowledge, thereby enabling the analysis of a diverse range of concepts and contexts that might otherwise be difficult to approach systematically. However, while studies of public opinion have long argued that cognitive biases represent potential problems when it comes to the general population, less attention has been paid to similar issues among expert respondents. This study examines one form of cognitive bias: hindsight bias, the tendency to retrospectively exaggerate one’s foresight of a particular event. We argue that hindsight bias is a potential problem for retrospective evaluation due to the difficulty involved in separating our assessments of the pre-crisis period from the knowledge that a crisis occurred. Using disaggregated data from the Varieties of Democracy Project, we look for evidence of hindsight bias in coders’ evaluations of the periods that preceded major crises of democracy. We find that coder disagreement is significantly higher in pre-crisis scenarios than in our control group. Concerningly, despite this disagreement, coders remain similarly confident in their assessments. This represents a potential problem for those who seek to use these data to study democratic breakdowns and transitions.
Keywords
cognitive bias, expert surveys, hindsight bias, survey methodology
Received: 17th September 2019; Revised version received: 7th February 2020; Accepted: 28th February 2020
Introduction
Expert surveys are an increasingly popular method of studying a range of political phenomena. Yet, despite an appearance of authority, expert surveys remain vulnerable to many of the problems associated with lay questionnaires. While studies of public opinion have long explored cognitive biases among the general population (see Kuklinski and Quirk, 2000), less attention has been paid to bias among experts. Presumably, this is because experts are better positioned to offer more objective (or, at least, less overtly biased) judgements. The question remains: To what degree does expertise shield respondents against errors in reasoning?
This study examines one form of cognitive bias: hindsight bias, the tendency to retrospectively exaggerate one’s foresight of an event – a problem that is well documented in other areas of political science (see Lebow, 2009). We argue that hindsight bias is particularly insidious when it comes to retrospective evaluations of crises due to the difficulty involved in separating our evaluations of the pre-crisis period from the knowledge that a crisis occurred. Complicating matters further, political memory is often ideologically polarized, as interpretations of past events may be embedded in contemporary debates that (re)frame past conflicts.
Until recently, identifying hindsight bias in expert surveys has been difficult, as many indices that incorporate expert judgements only provide aggregate results. While there are valid reasons for withholding disaggregated responses, without these data, we cannot fully discern the extent to which experts (dis)agree, especially in the absence of a corresponding error term.
We look for evidence of hindsight bias in data from the Varieties of Democracy (V-Dem) Project, which relies on over 3000 anonymous experts (Coppedge et al., 2018). We examine coders’ evaluations of two groups of democracies: one in which a crisis occurred and one where no evidence of a crisis was observed.
Our findings suggest two areas of concern. First, we find that disagreement is significantly higher in the pre-crisis cases than in the corresponding control group for almost every indicator. Second, despite this disagreement, coders remain confident in their assessments. This poses a problem for those studying democratic breakdowns, as our findings not only suggest that coders’ knowledge of a crisis colours their evaluations, but also that coders may not be fully aware of this.
This article proceeds as follows. The next section introduces the cognitive phenomenon of hindsight bias and then expands on the relationship between memory and the measurement of democracy. This is followed by a discussion of the hypothesis, research design, and data. We then present our results and offer some interpretive discussion and recommendations, followed by a conclusion that considers the applicability of our findings to a wider research agenda interested in questions of democratization and democratic decline.
Democracy, measurement and memory
Democracy is a contested concept. There is no consensus among those who study democratic attainment as to which factors should be included. This fact is implicitly acknowledged in the V-Dem data, which permit researchers to choose among five democratic indices or construct their own measures from a diverse menu of indicators. While this study does not endorse a particular definition of democracy, it is necessary to acknowledge this disagreement in order to understand the different ways in which expert surveys have operationalized the concept of democratic attainment, as well as the different ways in which coders are likely to understand this concept.
Expert surveys
While we might assume that experts are more likely to produce reliable judgements than lay people, ‘expert’ respondents might have imperfect information; they might understand concepts and metrics differently; and their judgements may be affected by ideological convictions (Maestas et al., 2014; Martinez i Coma and Van Ham, 2015). Expert surveys are therefore not immune to fundamental problems of measurement error. In particular, the literature emphasizes two areas of concern: inter-coder reliability and data aggregation.
First, because experts may have different understandings of relevant concepts, they may assign different values to the same case. This problem, known as differential item functioning (DIF), refers to variation in how experts apply conceptual tools such as ordinal scales to evaluate cases. While some discrepancy is anticipated, respondents’ judgements should be similarly calibrated. Numerous measures have been developed to assess this (see Hayes and Krippendorff, 2007; Steenbergen, 2000; Steenbergen and Marks, 2007). Yet, even though reporting at least one measure of inter-coder reliability is now considered ‘best practice’ (see Maestas, 2016), many indices do not report this or other measures of uncertainty (see Coppedge et al., 2011: 251). This remains true of widely used tools based on expert questionnaires, such as the Freedom House (2018) global rankings of freedom and democracy.
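To make the reporting requirement concrete, the minimal sketch below shows how one such reliability statistic might be computed for a small matrix of expert ratings. It assumes the third-party Python package krippendorff (not something used or cited in this article), and the coders, cases, and scores are invented for illustration.

    # Illustrative only: hypothetical ordinal ratings (0-4) from five experts
    # for four country-years; np.nan marks a case an expert did not code.
    import numpy as np
    import krippendorff  # third-party package: pip install krippendorff

    ratings = np.array([
        [3, 2, 4, 1],
        [3, 3, 4, 1],
        [2, 3, 4, 0],
        [3, 2, np.nan, 1],
        [3, 3, 4, 1],
    ], dtype=float)  # rows = coders, columns = country-years

    # Krippendorff's alpha for ordinal data; values near 1 indicate that
    # coders apply the scale in a similarly calibrated way.
    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="ordinal")
    print(f"Krippendorff's alpha: {alpha:.2f}")

    # A cruder check in the same spirit: the spread of ratings per case.
    per_case_sd = np.nanstd(ratings, axis=0)
    print("Per-case standard deviation:", np.round(per_case_sd, 2))

Reporting even a simple statistic of this kind alongside aggregate scores would let users see where expert judgements diverge rather than treating the published index values as settled.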
Second, the process of aggregating expert assessments adds another layer of complexity. While this issue has been discussed extensively in the literature (see Jones and Norrander, 1996; O’Brien, 1990), of particular relevance to the discussion that follows is the way in which indices incorporate experts’ confidence in their judgements into the scoring system.
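As a purely illustrative example of what such confidence weighting could look like in its simplest form, each expert’s score might be weighted by that expert’s self-reported confidence. This is a toy sketch with invented numbers, not V-Dem’s measurement model, which is described in the next section.

    # Toy illustration of confidence-weighted aggregation of expert scores.
    # This is NOT V-Dem's Bayesian item response model; it only shows how
    # self-reported confidence can enter a point estimate and its
    # accompanying uncertainty. All numbers are invented.
    import numpy as np

    scores = np.array([3.0, 2.0, 3.0, 4.0])      # ordinal ratings from four experts
    confidence = np.array([0.9, 0.4, 0.8, 0.6])  # self-reported confidence in [0, 1]

    weights = confidence / confidence.sum()
    point_estimate = np.sum(weights * scores)

    # A simple dispersion measure: confidence-weighted standard deviation,
    # which grows when confident coders disagree with one another.
    weighted_var = np.sum(weights * (scores - point_estimate) ** 2)
    print(f"estimate = {point_estimate:.2f}, spread = {np.sqrt(weighted_var):.2f}")

The design point is that confident coders pull the estimate toward their scores, while disagreement among confident coders should widen, not narrow, the reported uncertainty.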
V-Dem: Measurement and aggregation
There are several ways in which V-Dem represents a major advance in addressing problems of subjectivity and inter-coder reliability.
First, its organizers have paid considerable attention to conceptual clarity and standardization. Consider, for example, the coder instructions for the indicator measuring freedom of discussion among men, reproduced in Supplemental Appendix A. The detailed question, two-paragraph clarification, and specific responses all address the DIF problem (i.e. that experts’ idiosyncratic perceptions affect their application of the scale described in the V-Dem codebook). While some subjectivity remains, the use of detailed, question-specific responses – as opposed to a Likert-type scale with vague categories ranging from ‘strongly disagree’ to ‘strongly agree’ – greatly reduces this subjectivity.
Second, the use of ‘bridge’ coders enhances cross-country and intertemporal comparability. About 15% of V-Dem coders are bridge coders, who code at least two different countries according to the same criterion for the same period. This allows V-Dem’s measurement model to incorporate information about coders’ assessments of countries with different historical trajectories.1
Third, when it comes to data aggregation, V-Dem’s Bayesian item response model is sensitive to the reality that coders will have different understandings of terms like ‘somewhat’ and ‘mostly’ when inputting responses. Importantly, the aggregate data (and error terms) incorporate information about coders’ confidence in their assessments, as well as patterns of cross-coder (dis)agreement, to aid in estimating variations in reliability and possible bias (Coppedge et al., 2018:...
