An analysis of file format control in institutional repositories

Publication Date: 15 June 2015
Author: Miquel Termens, Mireia Ribera, Anita Locher
Subject: Library & information science, Librarianship/library management, Library technology
Miquel Termens, Mireia Ribera and Anita Locher
Library & Information Science Department,
Universitat de Barcelona, Barcelona, Spain
Purpose: The purpose of this paper is to analyze the file formats of the digital objects stored in two
of the largest open-access repositories in Spain, DDUB and TDX, and to determine the implications
of these formats for long-term preservation, focussing in particular on the different versions of PDF.
Design/methodology/approach: To study the two repositories, the authors harvested all
the files corresponding to every digital object and some of their associated metadata using the Open
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Open Archives Initiative Object
Reuse and Exchange (OAI-ORE) protocols. The file formats were analyzed with DROID software and
some additional tools.
Findings: The results show that there is no alignment between the preservation policies declared by
institutions, the technical tools available, and the actual stored files.
Originality/value: The results show that file controls currently applied to institutional repositories
do not suffice to guarantee their stated mission of long-term preservation of scientific literature.
Keywords: Digital preservation, Institutional repositories, File format, PDF
Paper type: Research paper
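
The harvesting step described in the abstract can be illustrated with a minimal Python sketch of paging through an OAI-PMH `ListRecords` response. This is not the authors' harvester: the function names, the `fetch` callable, and the base URL used in the usage note are assumptions made for illustration, and the OAI-ORE step that locates the actual files is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 response elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (helper name is hypothetical)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent with the verb only, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    # An absent or empty resumptionToken marks the last page.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


def harvest(base_url, fetch):
    """Yield every record identifier, following resumption tokens.

    `fetch` is any callable that takes a URL and returns the response
    body as text; it is injected so the sketch makes no network calls.
    """
    token = None
    while True:
        url = list_records_url(base_url, resumption_token=token)
        identifiers, token = parse_page(fetch(url))
        yield from identifiers
        if token is None:
            break
```

In practice `fetch` could wrap `urllib.request.urlopen`, and `base_url` would be the repository's OAI-PMH endpoint; a call such as `list(harvest("https://example.org/oai", fetch))` would then collect the identifiers of all exposed records.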
1. Introduction
The risks involved in long-term preservation of digital objects are complex to
categorize, and there is no consensus on the best solutions for each specific case
(Vermaaten et al., 2012; Graf and Gordea, 2013). Although some experts state that, since
internet adoption and particularly since mainstream use of the Web began, no
format has been deprecated severely enough to prevent its use (Rusbridge, 2006;
Rosenthal, 2010; Rosenthal, 2013), format obsolescence is the most commonly cited
technical problem challenging content preservation (Lawrence et al., 2000; Pearson and
Webb, 2008). This complexity justifies the focus of the paper, oriented toward
analyzing current management practices for two technical characteristics of the files
uploaded to repositories: their format and their encryption. The paper does not go into
detail on the implications of these characteristics for long-term preservation policies.
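
As a rough illustration of the two characteristics named above, the following Python sketch reads the version declared in a PDF header and applies a simple heuristic for encryption. It is not the method used in the study, which relied on DROID and additional tools; the function names are hypothetical. The check is deliberately naive: a PDF's document catalog may declare a later version than the header, and a plain byte search for `/Encrypt` can give false positives.

```python
import re

# A PDF file starts with a header such as "%PDF-1.4".
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def pdf_header_version(data: bytes):
    """Return the version declared in the header (e.g. '1.4'), or None."""
    # Search only the first kilobyte, since some files carry
    # leading bytes before the header.
    match = HEADER_RE.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def looks_encrypted(data: bytes) -> bool:
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary."""
    return b"/Encrypt" in data
```

A file's bytes can be passed in directly, for example `pdf_header_version(open(path, "rb").read())`, where `path` is a placeholder for a harvested file.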
All repositories store digital objects with a dual aim: first, to promote their
dissemination; and second, to guarantee their preservation (Ware, 2004; van
Westrienen and Lynch, 2005; Kennan and Wilson, 2006). The first aim is the most
evident and was often the initial reason for creating the repositories. The second aim is
often not explicitly mentioned and repository holders do not guarantee its fulfillment
through either established policies or resources. It is a common practice to focus
technical and economic efforts on attracting and disseminating new content, and to
Library Hi Tech
Vol. 33 No. 2, 2015
pp. 162-174
©Emerald Group Publishing Limited
DOI 10.1108/LHT-10-2014-0098
Received 5 October 2014
Revised 7 March 2015
Accepted 18 March 2015
The current issue and full text archive of this journal is available on Emerald Insight at:
This study received a grant from the project El acceso abierto (open access) a la ciencia en España.
2012-2014. Plan Nacional I+D+i, código CSO2011-29503-C02-01. The authors thank Yvonne
Friese of the Deutsche Zentralbibliothek für Wirtschaftswissenschaften for the use of her PDF
scripts. The authors also thank the CBUC and the UB's CRAI for their help with the data
