Detecting and treating errors in tests and surveys

DOI: https://doi.org/10.1108/QAE-07-2017-0036
Journal: Quality Assurance in Education, Vol. 26 No. 2, 2018, pp. 243-262, © Emerald Publishing Limited, ISSN 0968-4883
Published date: 3 April 2018
Received: 10 July 2017; Revised: 31 October 2017; Accepted: 9 January 2018
Author: Matthias von Davier
Subject matter: Education, Curriculum, instruction & assessment, Educational evaluation/assessment
Matthias von Davier
Center for Advanced Assessment,
National Board of Medical Examiners, Philadelphia, Pennsylvania, USA
Abstract
Purpose: Surveys that include skill measures may suffer from additional sources of error compared to those containing questionnaires alone. Examples are distractions such as noise or interruptions of testing sessions, as well as fatigue or lack of motivation to succeed. This paper aims to provide a review of statistical tools based on latent variable modeling approaches, extended by explanatory variables, that allow detection of survey errors in skill surveys.
Design/methodology/approach: This paper reviews psychometric methods for detecting sources of error in cognitive assessments and questionnaires. Aside from traditional item responses, new sources of data are available in computer-based assessment, such as timing data from the Programme for the International Assessment of Adult Competencies (PIAAC) and data from questionnaires, to help detect survey errors.
Findings: Some unexpected results are reported. Respondents who tend to use response sets have lower expected values on PIAAC literacy scales, even after controlling for scores on the skill-use scale that was used to derive the response tendency.
Originality/value: The use of new sources of data, such as timing and log-file or process data information, provides new avenues to detect response errors. It demonstrates that large data collections need to better utilize available information and that the integration of assessment, modeling and substantive theory needs to be taken more seriously.
Keywords: Skills, Literacy, Large-scale assessment, Cognitive assessment, Background questionnaires, Survey error
Paper type: Research paper
Introduction
Surveys including measures of skills may suffer from additional sources of error compared to questionnaires. Skill surveys collecting data on reading, mathematics or other skills are administered under the assumption that participants respond to the best of their abilities. Self-reports in survey questionnaires are collected to gather data based on respondents' best effort to answer these questions. But respondents' best effort is not a given. Survey response errors can be understood as sources of systematic variability that are not based on skills or knowledge in cognitive assessments and not based on underlying attitudes, opinions or interests in questionnaires.
The measurement of skills assumes interindividual performance differences on the assessments that are associated systematically with underlying proficiency. Higher scores on the assessment should be (statistically) associated with higher levels of proficiency. The reasoning is that proficiency is the underlying cause. But it does not equal test performance, because nuisance variables may impact performance. Distractions such as noise or interrupted testing sessions might reduce performance, as might fatigue (Thorndike, 1914), a lack of motivation to succeed (Eccles et al., 1998) or other unsystematic nuisance variables (Nunnally, 1967) that are located within or outside the respondent. Moreover, reasons for unexpectedly good test performance exist, when comparing respondents to others with
similar training and expertise, or when comparing current performance with past results.
Reasons for unexpectedly high performance can be external to respondents, for example,
support received from bystanders or interviewers, or internal, if answers to questions were
obtained online, or by interviewing others about the questions they were asked during the
test session.
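The paper states this monotonic score-proficiency assumption only verbally. As one common formalization, which is not given in the excerpt and is shown here purely as an illustration, a simple item response model such as the Rasch model writes the probability of a correct response to item i as an increasing function of the latent proficiency:

```latex
% Rasch model (illustration only; not taken from the paper)
% \theta : latent proficiency of the respondent
% b_i    : difficulty of item i
P(X_i = 1 \mid \theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}
```

Under a model of this kind, distractions, fatigue or low motivation act on the observed responses without reflecting any change in the latent proficiency, which is exactly why they are treated as sources of survey error in skill measures.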
Artifacts in background questionnaires
Surveys such as the Programme for the International Assessment of Adult Competencies (PIAAC) collect data on both (cognitive) items and questionnaires. Roughly, tests tap into proficiencies and use objective behaviors, while questionnaires address not only a wider range of variables, including self-reported skills, but also more general traits, behavioral tendencies, motives and attitudes. Results on both tests and questionnaires are potentially affected by unintended influences exerted by the testing situation or brought to bear by the respondent. Nunnally (1967) talks about sources of construct-irrelevant differences in response behavior. Examples are "faking good" (Eid and Zickar, 2007) and cheating, for example, by copying responses. For questions that require a respondent to report on behaviors or attitudes, research has shown that a range of variables can distort responses and thus reduce correlations between behaviors and self-reports on behaviors.
This is true if a question does not seem to be addressing a topic or a domain the respondent can talk about with confidence. As most people do not own a Ferrari, questions such as "How do you like the steering wheel in the 2013 Ferrari California HS?", with the response format "(a) not at all, (b) a little, (c) somewhat, (d) a lot", may elicit an estimated response from some respondents confronted with this question, while others may not respond, and still others may give a random response or say what they think interviewers expect. Very few people can give a reasonable response, so responses may be affected by psychological processes other than the intended one (recalling past experiences). Respondents may feel compelled to answer even if the question is not applicable or they lack experience with the topic (refer to Rosenthal and Rosnow, 2009, for an overview of artifacts).
Consequently, we can expect questionnaires involving self-reports in diverse populations to produce answers more as a function of respondents feeling obliged to give one, or to fulfill interviewers' expectations. This is a well-established area of research based on the theoretical conceptualizations around interviewer and experimenter effects, and it has spurred both model-based methodological developments (Rost et al., 1996; Rosmalen et al., 2010; Khorramdel and von Davier, 2014) and developments of innovative item and questionnaire formats (King et al., 2004). Note that changing response formats or asking additional questions to "anchor" and correct response styles does not always work as expected but may instead produce statistical artifacts (Stankov et al., 2017; von Davier et al., 2017). In summary, some respondents, if confronted with questions not applicable to them, can be expected to produce erratic responses, either by choosing randomly or by frequently choosing the same response category.
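Erratic response patterns of this kind can be screened for with very simple descriptive indices before any model-based analysis is applied. The following sketch is not taken from the paper; it is a minimal, hypothetical illustration of a straight-lining index for one block of Likert-type questionnaire items.

```python
from collections import Counter

def straight_lining_index(responses):
    """
    Proportion of items on which a respondent chose their single most
    frequent category within one questionnaire block.
    A value near 1.0 flags a possible response set (straight-lining);
    it is only a screening heuristic, not evidence of careless responding.

    responses: list of category codes for one respondent, e.g. [1, 2, 2, 2, 2];
               missing answers should be excluded before calling.
    """
    if not responses:
        return 0.0
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count / len(responses)

# Hypothetical 5-item block with categories coded 1-4
print(straight_lining_index([3, 3, 3, 3, 3]))  # 1.0 -> flagged for review
print(straight_lining_index([1, 3, 2, 4, 2]))  # 0.4 -> unremarkable
```

A flag like this would only be a starting point; the model-based approaches cited above are needed to separate genuine trait-consistent responding from response sets.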
Artifacts in skill tests
Assumptions of how skill variables and responses should relate are like expectations of what makes a good measure in science. For example, methods used to estimate the age of fossils and artifacts assume monotonic relationships between proportions of carbon isotopes and the age of specimens. Here, age is the latent variable (much as mathematics skill is a latent variable that cannot be directly observed and must be inferred from the proportion of correct responses on a test). Physical measures also assume that, once we control for actual age, the proportions of carbon in a specific specimen vary randomly. Like psychological and