Data quality assurance in research data repositories: a theory-guided exploration and model

Date: 25 January 2024
Pages: 793-812
DOI: https://doi.org/10.1108/JD-09-2023-0177
Authors: Besiki Stvilia, Dong Joon Lee
Besiki Stvilia
School of Information, Florida State University, Tallahassee, Florida, USA, and
Dong Joon Lee
Mays Business School, Texas A&M University, College Station, Texas, USA
Abstract
Purpose – This study addresses the need for a theory-guided, rich, descriptive account of research data repositories' (RDRs) understanding of data quality and of the structures of their data quality assurance (DQA) activities. Its findings can help develop operational DQA models and best practice guides and identify opportunities for innovation in DQA activities.
Design/methodology/approach – The study analyzed 122 data repositories' applications for the Core Trustworthy Data Repositories, interview transcripts of 32 curators and repository managers, and data curation-related webpages of their repository websites. The combined dataset represented 146 unique RDRs. The study was guided by a theoretical framework comprising activity theory and an information quality evaluation framework.
Findings – The study provided a theory-based examination of the DQA practices of RDRs, summarized as a conceptual model. The authors identified three DQA activities (evaluation, intervention and communication) and their structures, including activity motivations, roles played, and mediating tools, rules and standards. When defining data quality, study participants went beyond the traditional definition of data quality and referenced seven facets of ethical and effective information systems in addition to data quality. Furthermore, the participants and RDRs referenced 13 dimensions in their DQA models. The study revealed that DQA activities were prioritized by data value, level of quality, available expertise, cost and funding incentives.
Practical implications – The study's findings can inform the design and construction of digital research data curation infrastructure components on university campuses that aim to provide access not just to big data but to trustworthy data. Communities of practice focused on repositories and archives could consider adding FAIR operationalizations, extensions and metrics focused on data quality. The availability of such metrics and associated measurements can help reusers determine whether they can trust and reuse a particular dataset. The findings of this study can help to develop such data quality assessment metrics and intervention strategies in a sound and systematic way.
Originality/value – To the best of the authors' knowledge, this paper is the first data quality theory-guided examination of DQA practices in RDRs.
Keywords Data quality, Data quality assurance, Research data repositories, Research data curation, Model
Paper type Research paper
1. Introduction
The ethical implications of data quality are undeniable (Mason, 1986). The quality of data and
information directly influences the effectiveness of our decisions, the results of our activities,
and, ultimately, our lives, personal respect and reputation. Therefore, data quality assurance
(DQA) is an essential element of all data management processes. DQA tasks can vary widely,
including quality assessments and enhancement efforts undertaken by data providers and
personnel, data cleansing by students for class projects or during DQA hackathons, evaluating the quality of datasets used in training AI models, or policy and business decision making (Gururangan et al., 2022; Scheuerman et al., 2021). Several general quality assurance standards and strategies, like ISO 8000, ISO 9000 and ISO 19157, are commonly used in industry. Similarly, the literature offers numerous studies and models on data curation (e.g. Ball, 2012; Burton and Treloar, 2009; Higgins, 2008; Lee and Stvilia, 2017; Lord and Macdonald, 2003). There is a revived focus on data quality, and on the goal of making datasets FAIR (findable, accessible, interoperable and reusable), within data curation practitioner communities. These groups create and disseminate important methods and scripts for data cleaning, normalization, linking and disambiguation (Wilkinson et al., 2016; The DataONE Webinar Series, 2020a, b). However, efforts to operationalize the FAIR framework have largely lacked firm roots in the metadata and information quality literature, restricting their broad applicability to DQA process design. There is a dearth of studies that examine and interpret DQA practices in research data repositories (RDRs) through the lens of the data quality literature.

We would like to express our gratitude to the participants of our study. We extend our appreciation to Leila Gibradze for her invaluable assistance with the analysis of data. This research is supported by a National Leadership Grant from the Institute of Museum and Library Services (IMLS) of the U.S. Government (grant number LG-252346-OLS-22). This article reflects the findings and conclusions of the authors and does not necessarily reflect the views of IMLS.

The current issue and full text archive of this journal is available on Emerald Insight at: https://www.emerald.com/insight/0022-0418.htm
Received 15 September 2023; Revised 3 January 2024; Accepted 6 January 2024
Journal of Documentation, Vol. 80 No. 4, 2024, pp. 793-812
© Emerald Publishing Limited, ISSN 0022-0418, DOI 10.1108/JD-09-2023-0177
2. Research questions
The understanding of what defines high-quality, useful data, or when such data becomes
useful and usable, can differ even within the same process and field, and across different fields
(Higgins, 2008; Stvilia et al., 2015). Before creating data and metadata quality evaluation
standards, measures and interventions, RDRs and their stakeholders must establish and
agree upon their definition of data quality (DQ), what "fitness for use" or "fitness for reuse"
(Juran, 1992) means to them, and what the best practices are for ensuring it. Similarly, users of
these datasets need to clearly comprehend the repository's DQ model and understand the DQ
virtues for which the repository evaluates and assures its datasets, to determine whether these
virtues and DQA actions align with their own DQ needs and preferences. For example, some
users might prefer raw, "dirty" data to hone their data cleaning and organization skills (Stvilia
and Gibradze, 2022). Although previous conceptual models of research DQ and studies on
researcher priorities and perceptions of DQ exist (e.g. Huang et al., 2012; Stvilia et al., 2015),
there is a notable absence of recent analysis of RDRs' DQA practices that is rooted in the
information and DQ literature. Providing a detailed, descriptive explanation of how
RDRs perceive data quality, along with a conceptual model summarizing the structure of
their DQA work, can aid in the development of context-specific operational DQA models and
guides for RDRs. Additionally, it can highlight areas for potential innovation in RDRs' DQA
practices. This paper presents part of a larger exploratory research study that seeks to fill this
gap. Specifically, the paper discusses the following research questions:
RQ1. How do RDRs define data quality?
RQ2. How do RDRs ensure data quality?
3. Related work
Juran (1992) defines quality as "fitness for use." There have been multiple conceptual models of
research data quality and studies of researcher perceptions of and preferences for data quality
(e.g. Gutmann et al., 2004; Huang et al., 2012; Stvilia et al., 2015). What is considered quality and
useful data, and when such data becomes useful, can vary even within the same procedure and
field, and across various procedures within those fields (Higgins, 2008). A DQA process
includes activities pertaining to the conceptualization, measurement and intervention of data
quality (Stvilia et al., 2007). Data quality, alongside privacy and access, holds significant ethical
implications in data use. In the era of big data, generative AI, and an overwhelming quantity of
research data and publications, the saying "garbage in, garbage out" remains as relevant as ever.
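As an illustration of the "measurement" step in such a DQA process, a repository might compute a score for a single quality dimension and flag datasets that fall below a threshold for curator intervention. The sketch below is illustrative only: the completeness dimension, required fields, and threshold are the author of this example's assumptions, not measures taken from the paper or any repository's actual DQA model.

```python
# Minimal sketch: measuring one data quality dimension (completeness)
# for a tabular dataset represented as a list of record dicts.
# REQUIRED_FIELDS and the 0.9 threshold are hypothetical choices.

REQUIRED_FIELDS = ["title", "creator", "date", "license"]

def completeness(records, required=REQUIRED_FIELDS):
    """Fraction of required field values that are present and non-empty."""
    if not records:
        return 0.0
    total = len(records) * len(required)
    filled = sum(
        1
        for rec in records
        for field in required
        if str(rec.get(field, "") or "").strip()
    )
    return filled / total

def needs_intervention(records, threshold=0.9):
    """Flag a dataset for curator review when completeness is below threshold."""
    return completeness(records) < threshold

records = [
    {"title": "Survey data", "creator": "Doe", "date": "2023", "license": "CC-BY"},
    {"title": "Sensor logs", "creator": "", "date": "2023", "license": None},
]
print(completeness(records))        # 0.75: 6 of 8 required values filled
print(needs_intervention(records))  # True: below the 0.9 threshold
```

In this framing, `completeness` corresponds to measurement and `needs_intervention` to the decision that triggers an intervention activity; a real repository would measure many more dimensions and weight them by context.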