Measuring the time spent on data curation

DOI: https://doi.org/10.1108/JD-08-2021-0167
Published date: 09 February 2022
Pages: 282-304
Subject Matter: Library & information science, Records management & preservation, Document management, Classification & cataloguing, Information behaviour & retrieval, Collection building & management, Scholarly communications/publishing, Information & knowledge management, Information management & governance, Information management, Information & communications technology, Internet
Author: Anja Perry, Sebastian Netscher
Anja Perry and Sebastian Netscher
GESIS Leibniz Institute for the Social Sciences, Cologne, Germany
Abstract
Purpose: Budgeting data curation tasks in research projects is difficult. In this paper, we investigate the time
spent on data curation, more specifically on cleaning and documenting quantitative data for data sharing. We
develop recommendations on cost factors in research data management.
Design/methodology/approach: We make use of a pilot study conducted at the GESIS Data Archive for
the Social Sciences in Germany between December 2016 and September 2017. During this period, data curators
at GESIS - Leibniz Institute for the Social Sciences documented their working hours while cleaning and
documenting data from ten quantitative survey studies. We analyse the recorded times and discuss them with the data
curators involved in this work to identify and examine important cost factors in data curation, that is, aspects
that increase hours spent and factors that reduce their work.
Findings: We identify two major drivers of time spent on data curation: the size of the data and personal
information contained in the data. Learning effects can occur when data are similar, that is, when they contain
the same variables. Important interdependencies exist between individual tasks in data curation and in connection
with certain data characteristics.
Originality/value: The different tasks of data curation, time spent on them and interdependencies between
individual steps in curation have so far not been analysed.
Keywords: Data curation, Digital curation, Curation tasks, Research data management, Data sharing
Paper type: Case study
Introduction
Barend Mons states in his 2020 Nature article that around 5% of the overall research budget
should go towards data stewardship, that is research data management (RDM) tasks (Mons,
2020). Indeed, FAIR data (Wilkinson et al., 2016), data sharing and, thus, RDM are
increasingly important. Research funders, like the European Research Council, now demand
data sharing and carefully planned RDM in terms of specified data management plans
(European Commission, 2019; European Research Council, 2019). But even if data sharing
appears to be free, for example if data is shared via a data repository free of charge, there
are costs involved. These are, at least, costs for data cleaning and documentation to prepare
data for reuse. Such costs are often not anticipated when planning the research project
(National Academies of Sciences, Engineering, and Medicine, 2020).
However, such lump sums as suggested by Mons ignore differences across research
projects. For instance, an estimate for biological databases suggests that only 0.088% of the
overall research project costs go towards data curation (Karp, 2016). In contrast, whenever
the requested RDM budgets are higher than 5%, because the projects in planning are more
complex, this may cause suspicion on the funders' side and the necessary funds may not be
granted. Therefore, a more careful evaluation of such costs is needed.

© Anja Perry and Sebastian Netscher. Published by Emerald Publishing Limited. This article is
published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce,
distribute, translate and create derivative works of this article (for both commercial and non-commercial
purposes), subject to full attribution to the original publication and authors. The full terms of this licence
may be seen at http://creativecommons.org/licences/by/4.0/legalcode
The authors thank the three curators for their participation in the focus group discussion and their
very valuable input. This work was funded by the German Federal Ministry of Education and Research
as part of the "DDP Bildung - Domain Data Protocols for Education Research" project
(www.ddp-bildung.org). Grant number: 16QK01A.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/0022-0418.htm
Received 27 August 2021. Revised 2 December 2021. Accepted 12 December 2021.
Journal of Documentation, Vol. 78 No. 7, 2022, pp. 282-304. Emerald Publishing Limited. ISSN 0022-0418.

Research on RDM costs is still in its infancy and precise statements about which RDM
tasks are eligible for funding are rare (Donaldson and Ensberg, 2018). For example, the
guidelines by UK Research and Innovation state that "any element of research data
management may be included as a directly incurred cost" (UKRI, 2015). Clear and more
precise recommendations on how to budget RDM tasks would help to account for these
tasks when applying for research funding and to devote appropriate time and effort towards
RDM tasks during the project.
With this paper, we investigate the time spent on specific RDM tasks. These are data
curation tasks, more specifically data cleaning and documentation of quantitative social
science studies. We make use of a pilot study conducted between December 2016 and
September 2017 at the GESIS - Leibniz Institute for the Social Sciences (from here on GESIS).
For this pilot, our data curators documented their working times and tasks while cleaning
and documenting data from ten quantitative datasets belonging to three different multi-wave
survey studies. We analyse the data curators' records of their working times, make use of
their working reports and interview them to better understand their workflows.
We contribute to existing research by analysing limited but very rare data on data
curation tasks. Thereby, we identify factors that increase hours spent on data curation and
factors that can save time. In addition, we obtain a list of individual curation tasks and
feedback from curators on important interdependencies between these tasks and with
important data characteristics. Various work on curation tasks already exists, for example,
by Lee and Stvilia (2017). We go beyond this work and provide more detailed information
about data cleaning and data documentation tasks and how they depend upon one another.
We provide insights for researchers and data managers initiating a survey project to better
plan data curation tasks and organize data collection in a way that saves them time during
data curation. Likewise, data infrastructures may profit from our findings when providing
data services for researchers (German Research Foundation, 2021).
The paper is structured as follows: In section 2 we review existing literature and projects
devoted to investigating and analysing RDM costs. We then describe the pilot project on
determining curation times at GESIS and the ten datasets curated as well as our approach to
examine curators' efforts in section 3. In section 4 we present the results of the analysis and in
section 5 the outcome of a group discussion with our curators. We conclude with section 6.
Approaches to measuring RDM costs
Budgeting RDM tasks is difficult, as RDM covers a wide range of processes, strategies and
measures to manage data within a project as well as beyond. Not all tasks purely serve
the purpose of data sharing. Even when researchers decide not to share their data, they need
at least some basic documentation of the data to ensure its understandability and
interpretability for their own purposes. In this article we solely focus on data curation aimed
at preparing data for data sharing. In this theory section, after highlighting the importance of
data curation, we give an overview of the literature on RDM and curation costs to identify the
cost factors researched in this literature.
The role of data curation in data sharing
When analysing data sharing, we must distinguish tasks that are relevant to the research
itself and ensure good scientific practice from those tasks devoted to sharing data beyond
the research project. In this context, Klar and Enke (2013) as well as Treloar and co-authors
(Treloar and Harboe-Ree, 2017; Treloar and Klump, 2019) suggest a domain model, grouping