Publishing legacy data as linked data: a state of the art survey

Pages: 520-535
DOI: https://doi.org/10.1108/LHT-09-2012-0075
Published: 02 September 2013
Authors: Ujjal Marjit, Kumar Sharma, Arup Sarkar, Madaiah Krishnamurthy
Subject: Library & information science, Librarianship/library management, Library technology
Ujjal Marjit and Kumar Sharma
CIRM, University of Kalyani, Kalyani, India
Arup Sarkar
Department of Computer Science & Engineering, University of Kalyani,
Kalyani, India, and
Madaiah Krishnamurthy
DRTC, Indian Statistical Institute, Bangalore, India
Abstract
Purpose – This article aims to discuss how the emergence of advanced semantic web technology has transformed the conventional web into a machine-processable and machine-understandable form.
Design/methodology/approach – The authors survey current research, tools and applications for publishing legacy data as linked data, with the aim of giving the reader a better understanding of the linked data landscape.
Findings – Today, a vast amount of data is stored in file formats other than RDF; such data are called legacy data. To publish them as linked data, they must be extracted and converted into RDF without altering the original data schema or losing information.
Originality/value – Several key issues remain to be addressed. Linked data offers a more sophisticated approach, making the transformation of the web of documents into the web of connected data possible.
Keywords Linked data, Web of documents, Legacy data, Semantic web, Web of data, Semantics, Data management
Paper type Research paper
1. Introduction
The World Wide Web (WWW) plays a pivotal role as a global knowledge base and as an international network for communication, information and business. Today, almost everyone in academia, business or education is familiar with the concept of the World Wide Web. The web is an interconnection of documents available from all around the world. Each of these documents matters to us in a different context, because it is a place where data or information of any kind is stored. Whenever we search for information, the search engine returns a list of web links based on the terms or keywords we have entered; we then have to follow these links to judge whether or not they lead to useful information. Today, every kind of data is available on the web: academic data, business data, and personal or social data. Without the web it would be difficult to reach high-quality information about events happening around the world. The web's main weakness, however, lies in the sharing of data between different data sources and the reuse of those data. Basically, the web is an interconnection of HTML documents, where the HTML dictates how the textual data in a document are structured and presented within a browser.
Received 4 September 2012
Revised 23 April 2013, 13 May 2013
Accepted 26 May 2013
Library Hi Tech, Vol. 31 No. 3, 2013, pp. 520-535
© Emerald Group Publishing Limited, ISSN 0737-8831
DOI 10.1108/LHT-09-2012-0075
HTML documents also contain outgoing links, or hyperlinks, which are defined by the href attribute. On the traditional web, documents are connected by means of these hyperlinks; a common phrase used to describe this web is the "Web of Documents". The Web of Documents is enough for us to search and analyze data ourselves, but for a computer it is almost impossible to identify what data are hidden in each document. A computer cannot identify the actual information in a document or its intended meaning, because that information is intended for humans, not for machines. A machine only knows how to render and display the textual data on the web of documents, as directed by HTML. It is thus impossible for machines to identify the relationships between data. The current generation of the web has therefore moved to a newer breed of web technology called the Semantic Web. In the Semantic Web, semantics are added to ordinary data to make them machine processable and comprehensible. It is a web (also known as web 3.0) that allows data to be self-described in a more structured way, so that machines can easily process and analyze them. But the semantic web alone is not enough to fulfill the current need: it makes data machine processable and understandable without creating any links among them. This creates separate data islands without any meaningful connections depicting their relations to one another, so identifying the relatedness of data held by two semantic web applications working in the same domain becomes cumbersome. In 2006, Tim Berners-Lee coined a new concept, a further extension of the existing semantic web called Linked Data, to resolve this problem.
The use of Linked Data applications within organizations is growing day by day. Moreover, for those who decide to publish all their organizational data online as Linked Data, a plethora of legacy data awaits attention. In the early days of Linked Data adoption it was quite a challenge to publish such huge amounts of legacy data as Linked Data. Today the scenario is somewhat different: many applications and tools are available, at both the academic and the commercial level, for publishing legacy data online as Linked Data. This survey presents a selection of these tools, applications, frameworks and projects, to give readers a glimpse of how Linked Data applications and tools are developing, one after another, with different kinds of approaches and possibilities. A comparative study of these tools and applications is given to develop a vision of the current state of the field and of its future possibilities.
2. Linked Data for all
The Linked Data initiative has taken further steps to overcome the shortcomings of the current web; its ultimate goal is to unite machine-readable datasets into a Web of Data. Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web of Data. Using the web architecture and the HTTP protocol as a universal access mechanism, Linked Data enables users to publish structured data and interlink it on the Web of Data, thereby enabling data sharing and reuse. A set of principles for publishing structured data on the Web of Data was first proposed by Berners-Lee (2006/2009) as follows:
• use URIs (Uniform Resource Identifiers) as names for things;
• use HTTP URIs so that people can look up those names;
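The principles above can be sketched in code. The following minimal Python example converts one legacy tabular record into RDF triples in N-Triples syntax, using HTTP URIs to name things and linking one resource to another; the example.org namespace, the vocabulary terms and the sample record are all illustrative assumptions, not part of the survey, and a real deployment would use an established vocabulary and an RDF library.

```python
# Sketch: publish a legacy tabular record as RDF (N-Triples syntax),
# following the Linked Data principles: HTTP URIs name things, and
# each statement links a subject URI to a literal or to another URI.
# The http://example.org/ namespace and property names are hypothetical.

def record_to_ntriples(record):
    """Convert a dict holding one legacy row into a list of N-Triples lines."""
    base = "http://example.org/resource/"   # assumed dereferenceable HTTP URIs
    vocab = "http://example.org/vocab/"     # hypothetical vocabulary namespace
    subject = f"<{base}{record['id']}>"
    triples = []
    for key, value in record.items():
        if key == "id":
            continue                        # the id only names the subject
        predicate = f"<{vocab}{key}>"
        if isinstance(value, str) and value.startswith("http://"):
            obj = f"<{value}>"              # a link to another resource
        else:
            obj = f'"{value}"'              # a plain literal value
        triples.append(f"{subject} {predicate} {obj} .")
    return triples

# A legacy row as it might come from a CSV export or relational table:
legacy_row = {
    "id": "book42",
    "title": "Linked Data Basics",
    "creator": "http://example.org/resource/author7",  # interlinking
}
for line in record_to_ntriples(legacy_row):
    print(line)
```

Note that the conversion preserves the original schema: each column name becomes a predicate, so no information from the legacy record is lost, which is exactly the requirement stated in the Findings above.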