Publication Date: 15 June 2015
Author: Miquel Termens, Mireia Ribera, Anita Locher
Subject: Library & information science, Librarianship/library management, Library technology
Library & Information Science Department,
Universitat de Barcelona, Barcelona, Spain
Purpose The purpose of this paper is to analyze the file formats of the digital objects stored in two
of the largest open-access repositories in Spain, DDUB and TDX, and to determine the implications
of these formats for long-term preservation, focussing in particular on the different versions of PDF.
Design/methodology/approach To be able to study the two repositories, the authors harvested all
the files corresponding to every digital object and some of their associated metadata using the Open
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Open Archives Initiative Object
Reuse and Exchange (OAI-ORE) protocols. The file formats were analyzed with DROID software and
some additional tools.
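As an illustration of the harvesting step, the following minimal Python sketch builds an OAI-PMH ListRecords request and extracts record identifiers and the resumption token from a response. It is not the authors' harvesting code: the base URL, the function names, and the sample response are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    When a resumption token is supplied it is sent alone with the verb,
    since OAI-PMH treats resumptionToken as an exclusive argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text.strip())
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token


# Invented sample response, used only to exercise the parser.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

# Hypothetical endpoint; a real harvest would fetch this URL and repeat
# the request with each returned token until none is returned.
first_url = list_records_url("https://repository.example.org/oai")
ids, next_token = parse_list_records(SAMPLE_RESPONSE)
```

In a full harvest, the loop over resumption tokens would be followed by downloading the files attached to each record and passing them to a format identification tool such as DROID.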
Findings The results show that there is no alignment between the preservation policies declared by
institutions, the technical tools available, and the actual stored files.
Originality/value The results show that file controls currently applied to institutional repositories
do not suffice to fulfill their stated mission of long-term preservation of scientific literature.
Keywords Digital preservation, Institutional repositories, File format, PDF
Paper type Research paper
1. Introduction
The risks involved in long-term preservation of digital objects are complex to
categorize, and there is no consensus on the best solutions for each specific case
(Vermaaten et al., 2012; Graf and Gordea, 2013). Although some experts state that, since
the adoption of the internet and particularly since mainstream use of the Web began, no
format has been deprecated severely enough to prevent its use (Rusbridge, 2006;
Rosenthal, 2010; Rosenthal, 2013), format obsolescence is the most commonly cited
technical problem challenging content preservation (Lawrence et al., 2000; Pearson and
Webb, 2008). This complexity justifies the focus of the paper, which analyzes
current management practices for two technical characteristics of the files
uploaded to repositories: their format and their encryption. The paper will not enter into
details of their implications for long-term preservation policies.
All repositories store digital objects with a dual aim: first, to promote their
dissemination; and second, to guarantee their preservation (Ware, 2004; van
Westrienen and Lynch, 2005; Kennan and Wilson, 2006). The first aim is the most
evident and was often the initial reason for creating the repositories. The second aim is
often not explicitly mentioned, and repository holders do not guarantee its fulfillment
through either established policies or resources. It is a common practice to focus
technical and economic efforts on attracting and disseminating new content, and to
