Quality measures for skos:exactMatch linksets: an application to the thesaurus framework LusTRE

Date 02 July 2018
Author Riccardo Albertoni, Monica De Martino, Paola Podestà
Subject Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Istituto di Matematica Applicata e Tecnologie Informatiche, Sezione di Genova,
Consiglio Nazionale delle Ricerche, Genova, Italy
Purpose The purpose of this paper is to focus on the quality of the connections (linkset) among thesauri
published as Linked Data on the Web. It extends the cross-walking measures with two new measures able to
evaluate the enrichment brought by the information reached through the linkset (lexical enrichment,
browsing space enrichment). It fosters the adoption of cross-walking linkset quality measures besides the
well-known and deployed cardinality-based measures (linkset cardinality and linkset coverage).
Design/methodology/approach The paper applies the linkset measures to the Linked Thesaurus
fRamework for Environment (LusTRE). LusTRE is selected as testbed as it is encoded using a Simple
Knowledge Organisation System (SKOS) published as Linked Data, and it explicitly exploits the cross-walking
measures on its validated linksets.
Findings The application to LusTRE offers insight into the complementarities among the considered
linkset measures. In particular, it shows that the cross-walking measures deepen the cardinality-based
measures by analysing quality facets that were not previously considered. The actual value of LusTRE's linksets
regarding the improvement of multilingualism and concept spaces is assessed.
Research limitations/implications The paper considers skos:exactMatch linksets, which belong
to a rather specific but quite common kind of linkset. The cross-walking measures explicitly assume
correctness and completeness of linksets. Third-party approaches and tools can help to meet the
above assumptions.
Originality/value This paper fulfils an identified need to study the quality of linksets. Several approaches
formalise and evaluate Linked Data quality focusing on data set quality but disregarding the other essential
component: the connection among data.
Keywords Quality, SKOS, Linked data, Cross-walking, Environmental thesauri, Linkset
Paper type Research paper
1. Introduction
In the paper "Linked Data - The Story So Far", Bizer et al. (2009) were among the first to take a
picture of the enormous transformation of the Web of Documents into the Web of Data. Since
then, the popularity of Linked Data has never ceased to grow. Linked Data aims at disclosing the
potential of independently served data by dealing with access and integration issues. It not only
publishes documents encoded using the Resource Description Framework (RDF), but also
uses RDF to make typed statements that link arbitrary things in the world. The result, which
we will refer to as the Web of Data, may more accurately be described as a web of things in the
world, described by data on the Web (Bizer et al., 2009). Linked Data allows RDF data to be
published, shared, retrieved, reused and analysed, unlocking the existing data silos to a
broader community of consumers. RDF provides a graph-based data model based on triples in
the form of subject, predicate and object (Schreiber and Raimond, 2014). Both data and links
among data are expressed with triples. Linked Data relies on two fundamental web
technologies: the Internationalised Resource Identifiers (IRIs[1]) and the HyperText
Transfer Protocol (HTTP), which are, respectively, deployed as the global identifiers
Following Linked Data principles[2], several billions of facts encoded in RDF triples have been
published in the Linked Open Data (LOD) cloud[3].
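As a minimal illustration of the triple model described above, the following Python sketch stores RDF statements as plain (subject, predicate, object) tuples and shows that data and links share the same form. The concept IRIs under example.org and the helper name `links` are hypothetical placeholders; only the SKOS namespace IRI is the real one.

```python
# Sketch: RDF statements as (subject, predicate, object) tuples.
# The example.org IRIs are hypothetical; the SKOS namespace IRI is real.
SKOS = "http://www.w3.org/2004/02/skos/core#"

triples = [
    # A data statement: a concept with an English preferred label.
    ("http://example.org/thesaurusA/c1", SKOS + "prefLabel", ("soil", "en")),
    # A link statement: the same triple form connects concepts across data sets.
    ("http://example.org/thesaurusA/c1", SKOS + "exactMatch",
     "http://example.org/thesaurusB/c9"),
]


def links(triples, predicate):
    """Return the (subject, object) pairs stated with the given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


exact_matches = links(triples, SKOS + "exactMatch")
```

Here `exact_matches` holds the single cross-thesaurus link, which is the kind of statement a skos:exactMatch linkset collects.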
This vast quantity of newly available and connected data sets is transforming the web
into a global data space enabling new types of analysis and applications in diverse
domains including life science, government, environment and cultural heritage. At the
same time, the evaluation of the quality of these newly served data becomes critical.
Data quality can affect the potential of the applications that use the data. As a
consequence, its inclusion in the data publishing and consumption pipelines is of primary
importance (Calegari et al., 2017). The challenge is two-fold: to evaluate the quality of the
data on the Web and to make quality-related information explicit, understandable and
consumable to both humans and machines.
Several existing initiatives aim to define new metrics and to evaluate the
quality of Linked Data. The W3C Data Quality Vocabulary (DQV) (Albertoni and Isaac,
2016) introduces a common way to document the quality of a data set, making it easier to
publish, exchange and consume quality metadata. Recent works such as Zaveri et al.
(2016), Debattista et al. (2016b), Radulovic et al. (2018) and Kontokostas et al. (2014)
consider different aspects of Linked Data quality, called dimensions, e.g., accessibility,
interlinking, performance, syntactic validity or completeness. They define and deploy
several concrete metrics (or measures) to precisely and objectively evaluate each
dimension. However, they focus on Linked Data data sets, reserving very limited attention
to their connections, the linksets. A linkset is a set of homogeneous links, all of the same
type and connecting the same subject data set to the same object data set (Alexander
et al., 2011). The quality of linksets is studied as part of the interlinking dimension defined
in the recent state of the art (Zaveri et al., 2016). Few metrics are defined to evaluate
interlinking; they mainly focus on correctness (e.g. broken links, open owl:sameAs chains,
crowdsourcing method), or on the number of links (linkset cardinality), or on the extent to
which a linkset covers the elements of a data set (linkset coverage) (Guéret et al., 2012;
Zaveri et al., 2016; Albertoni and Gómez Pérez, 2013).
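The two cardinality-based measures mentioned above can be sketched in a few lines of Python. This is only an illustrative reading, not the formal definitions of the cited works: cardinality is taken as the number of distinct links, and coverage is assumed to be the fraction of concepts in the subject data set that appear as the subject of at least one link. The function names and the prefixed identifiers (`a:c1`, `b:c9`) are hypothetical.

```python
def linkset_cardinality(linkset):
    """Number of distinct (subject, object) links in the linkset."""
    return len(set(linkset))


def linkset_coverage(linkset, subject_concepts):
    """Assumed reading of coverage: share of subject-data-set concepts
    that occur as the subject of at least one link."""
    if not subject_concepts:
        return 0.0
    linked = {s for s, _ in linkset} & set(subject_concepts)
    return len(linked) / len(subject_concepts)


# Hypothetical subject data set with four concepts, two of which are linked.
subject_concepts = {"a:c1", "a:c2", "a:c3", "a:c4"}
linkset = [("a:c1", "b:c9"), ("a:c2", "b:c7"), ("a:c2", "b:c8")]

cardinality = linkset_cardinality(linkset)
coverage = linkset_coverage(linkset, subject_concepts)
```

In this toy case the linkset has three links but covers only half of the subject concepts. Both numbers describe how many links exist and where they attach, and say nothing about what information becomes reachable by following them.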
The experience gained creating the Linked Thesaurus fRamework for Environment (LusTRE)[4],
the multilingual linked thesaurus framework for the environment, has taught us to pay attention
to the quality of connections between data sets. LusTRE was designed during the EU
project eENVplus[5], extending and redesigning the Common Thesaurus Framework for the
Environment (De Martino and Albertoni, 2011). LusTRE faces cross-lingual and
cross-sectoral issues in environmental data sharing: it provides a wide multilingual
terminology obtained by linking available thesauri for the different disciplines in the
environment and a set of web services to exploit them (Albertoni et al., 2018). The eENVplus
project has spent considerable effort reviewing the available environmental thesauri,
checking those not yet available as Linked Data (Albertoni, De Martino and Podestà, 2014). Then,
it has published ThiST[6] and (Albertoni, De Martino, Di Franco, De Santis and
Plini, 2014) Linked Data using the Simple Knowledge Organisation System (SKOS) (Miles and
Bechhofer, 2009), and connected them to GEMET[7], AGROVOC (Caracciolo et al., 2013)
and EUROVOC[8].
In LusTRE, the linksets among the thesauri are particularly important as they are
exploited to satisfy user requests. LusTRE enriches user navigations and service results
with translations and concepts which are reachable through the linkset. Thus, the linkset
quality becomes a critical issue. Given a linkset between two SKOS thesauri, LusTRE
should evaluate the multilingual enrichment obtained in terms of newly translated labels
reachable through a linkset. This information helps to address the incomplete language
coverage issue, which affects many popular SKOS thesauri (Suominen and Mader, 2014).
It also needs to evaluate the number of new concepts reached by crossing a linkset,
as this helps to assess the enrichment of the space of concepts that can be browsed

