Toward sustainable publishing and querying of distributed Linked Data archives

Published date: 08 January 2018
DOI: https://doi.org/10.1108/JD-03-2017-0040
Pages: 195-222
Authors: Miel Vander Sande, Ruben Verborgh, Patrick Hochstenbach, Herbert Van de Sompel
Subject matter: Library & information science, Records management & preservation, Document management, Classification & cataloguing, Information behaviour & retrieval, Collection building & management, Scholarly communications/publishing, Information & knowledge management, Information management & governance, Information management, Information & communications technology, Internet
Miel Vander Sande and Ruben Verborgh
Department of Electronics and Information Systems,
Ghent University IMEC, Ghent, Belgium
Patrick Hochstenbach
Ghent University Library, Ghent, Belgium, and
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, New Mexico, USA
Abstract
Purpose: The purpose of this paper is to detail a low-cost, low-maintenance publishing strategy aimed at
unlocking the value of Linked Data collections held by libraries, archives and museums (LAMs).
Design/methodology/approach: The shortcomings of commonly used Linked Data publishing approaches
are identified, and the current lack of substantial collections of Linked Data exposed by LAMs is considered. To
improve on the discussed status quo, a novel approach for publishing Linked Data is proposed and demonstrated
by means of an archive of DBpedia versions, which is queried in combination with other Linked Data sources.
Findings: The authors show that the approach makes publishing Linked Data archives easy and
affordable, and supports distributed querying without causing untenable load on the Linked Data sources.
Research limitations/implications: The proposed approach significantly lowers the barrier for publishing,
maintaining, and making Linked Data collections queryable. As such, it offers the potential to substantially grow
the distributed network of queryable Linked Data sources. Because the approach supports querying without
causing unacceptable load on the sources, the queryable interfaces are expected to be more reliable, allowing them
to become integral building blocks of robust applications that leverage distributed Linked Data sources.
Originality/value: The novel publishing strategy significantly lowers the technical and financial barriers
that LAMs face when attempting to publish Linked Data collections. The proposed approach yields Linked
Data sources that can reliably be queried, paving the way for applications that leverage distributed
Linked Data sources through federated querying.
Keywords: Linked Data, Digital preservation, Data integration, Data publishing, History reconstruction,
Reproducibility
Paper type: Technical paper
1. Introduction
1.1 Demolishing metadata silos
Libraries, archives and museums (LAMs) are long-term custodians of substantial structured
metadata collections and active information curators. Over the past decades, digitizing
records and making them available online has become irresistible and unavoidable.
A digital agenda has allowed LAMs to further democratize knowledge by making
collections more broadly available to audiences and applications alike (Clough, 2013).
This knowledge only reaches its highest potential when we are able to query across different
data sources, but unfortunately, too much data remains confined to the basements of
individual institutions. In order to break through these Silos of the LAMs (Zorich et al., 2008),
a strong engagement in metadata sharing was put at the top of the digital agenda. On the bright
side, this has already resulted in the establishment of standards and best practices for
expressing and sharing metadata aimed at achieving cross-institution interactions (Waibel and
Erway, 2009). Efforts toward the eventual goal of offering an integrated, seamless level of
service that tech-savvy users are increasingly coming to expect (Leddy, 2012) resulted in
two types of investments made by LAMs: semantic integration, aligning metadata descriptions
across institutional boundaries, and web publishing, leveraging the web to share metadata in a
manner that supports effective reuse beyond institutional boundaries. On a more critical note,
however, these efforts have not fully paid off yet, since actual integration of the
cross-institutional data for end-user queries has so far remained a technical challenge. The resulting
decisions are laid out in Figure 1 and discussed next.
Journal of Documentation, Vol. 74 No. 1, 2018, pp. 195-222. © Emerald Publishing Limited, 0022-0418. DOI 10.1108/JD-03-2017-0040.
Received 24 March 2017. Revised 2 June 2017. Accepted 4 June 2017.
The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/0022-0418.htm
1.1.1 Semantic integration with Linked Data. To integrate collections semantically,
institutions have started to adopt a Linked Data approach, which Bizer et al. (2009) defined as
a set of best practices for publishing and connecting structured data on the Web. Linked Data
is most commonly materialized using the Resource Description Framework (RDF), which entails
the use of basic machine-actionable relationship statements called triples, composed of three
components: a subject, a predicate, and an object. RDF leverages the global HTTP Uniform
Resource Identifier (URI) scheme to identify resources. Thus, semantic integration is achieved by
reusing the same URI to refer to the same resource, or by expressing equivalence between
different URIs that identify the same resource (Hausenblas, 2009). As institutions use this
approach, their respective RDF descriptions pertaining to a given resource are complemented by
those of other institutions. Typically, when descriptions are merged, the provenance of each
RDF statement is maintained. When these descriptions share URIs, they become automatically
interconnected, resulting in a distributed Web of Data (Heath and Bizer, 2011).
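The merging behaviour described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes triples are plain Python tuples of strings, that equivalence is stated with `owl:sameAs`, and that provenance is tracked as the set of institutions asserting each triple. The `example.org` URIs, the source names, and the `merge` and `canonical_map` helpers are hypothetical.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
BIRTH_YEAR = "http://example.org/vocab/birthYear"  # hypothetical predicate


def canonical_map(triples):
    """Map every URI linked by owl:sameAs to one representative URI (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Arbitrary but deterministic choice: the smaller URI represents the group.
                parent[max(root_s, root_o)] = min(root_s, root_o)
    return {uri: find(uri) for uri in parent}


def merge(sources):
    """Merge per-institution triples, recording which sources assert each triple."""
    all_triples = [t for triples in sources.values() for t in triples]
    canon = canonical_map(all_triples)
    merged = defaultdict(set)
    for source, triples in sources.items():
        for s, p, o in triples:
            if p == OWL_SAME_AS:
                continue  # equivalence links are folded into the representative URIs
            merged[(canon.get(s, s), p, canon.get(o, o))].add(source)
    return dict(merged)


# Hypothetical URIs minted by two institutions for the same person.
LIB_PERSON = "http://example.org/library/person/42"
MUS_PERSON = "http://example.org/museum/agent/7"

sources = {
    "library": [
        (LIB_PERSON, FOAF_NAME, "Jane Doe"),
        (LIB_PERSON, OWL_SAME_AS, MUS_PERSON),
    ],
    "museum": [
        (MUS_PERSON, BIRTH_YEAR, "1868"),
    ],
}

merged = merge(sources)
for triple, provenance in sorted(merged.items()):
    print(triple, "asserted by", sorted(provenance))
```

In this sketch the museum's statement ends up attached to the library's URI because the two identifiers were declared equivalent, while the provenance set still records that the museum asserted it. Literals and URIs are both plain strings here, which a real RDF library would keep distinct.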
1.1.2 Web publishing through physical integration. Currently, the most common
approach by which institutions expose collections of RDF statements for reuse is to make
them available as Linked Data sets for batch download. In this approach, one or more
aggregators step in and collect the distributed data sets and publish a merged data set
either for batch download again or through a machine-queryable endpoint. This physical
integration approach is cost-effective for institutions that expose Linked Data sets, and
aggregators often add value, for example, by performing data cleansing and mapping
equivalent URIs. But the approach also has some important drawbacks.
First, data in different institutions evolve at different paces. Keeping an aggregated data
set continuously synchronized with the evolving distributed data sets is a non-trivial
technical challenge (Klein et al., 2014); tackling it in a realistic manner would necessarily
involve additional infrastructure (and hence investment) on the side of the institutions that
expose them. Lacking this, at any moment in time, it is uncertain whether or not an
aggregated data set is in sync with the state of the data sets it merges.
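The synchronization problem can be made concrete with a small sketch. It is only an illustration under stated assumptions: each source is modelled as an in-memory list of triples standing in for a batch download, and the aggregator remembers a content fingerprint per source at harvest time. The `fingerprint`, `harvest` and `stale_sources` helpers, the source names, and the `example.org` URIs are hypothetical.

```python
import hashlib


def fingerprint(triples):
    """Order-independent SHA-256 digest of a collection of string triples."""
    digest = hashlib.sha256()
    for triple in sorted(set(triples)):
        digest.update("\x1f".join(triple).encode("utf-8"))  # unit separator between terms
        digest.update(b"\x1e")  # record separator between triples
    return digest.hexdigest()


def harvest(sources):
    """Aggregator step: copy every source dump and remember its fingerprint."""
    merged = set()
    seen = {}
    for name, triples in sources.items():
        merged.update(triples)
        seen[name] = fingerprint(triples)
    return merged, seen


def stale_sources(sources, seen):
    """Names of sources whose current dump no longer matches the harvested one."""
    return sorted(
        name for name, triples in sources.items()
        if fingerprint(triples) != seen.get(name)
    )


# Hypothetical dumps exposed by two institutions.
sources = {
    "library": [("http://example.org/book/1", "http://purl.org/dc/terms/title", "Title A")],
    "archive": [("http://example.org/record/9", "http://purl.org/dc/terms/title", "Title B")],
}

merged, seen = harvest(sources)
stale_before = stale_sources(sources, seen)

# The library later changes its data; the aggregated copy is now out of date.
sources["library"].append(
    ("http://example.org/book/2", "http://purl.org/dc/terms/title", "Title C")
)
stale_after = stale_sources(sources, seen)

print(stale_before, stale_after)
```

Note that this check has to re-read every full dump to notice a change. Avoiding that cost would require the sources themselves to expose change information, which is the additional infrastructure referred to above.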
Figure 1. LAM institutions choose between physical data integration and virtual data integration strategies to publish their metadata on the web. (Diagram labels: Metadata sharing; Semantic integration; Linked Data; RDF; Web publishing; Physical integration; Virtual integration; Publisher; Aggregators; Client; Synchronization; Ownership; Control; Maintenance cost; Interface availability; Source selection problem.)