Modular framework for similarity-based dataset discovery using external knowledge

DOI: https://doi.org/10.1108/DTA-09-2021-0261
Published date: 15 February 2022
Pages: 506-535
Subject Matter: Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Authors: Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek, Tomáš Skopal
Martin Nečaský and Petr Škoda
Department of Software Engineering, Faculty of Mathematics and Physics,
Charles University, Prague, Czechia
David Bernhauer
Department of Software Engineering, Faculty of Mathematics and Physics,
Charles University, Prague, Czechia and
Department of Software Engineering, Faculty of Information Technology,
Czech Technical University in Prague, Prague, Czechia, and
Jakub Klímek and Tomáš Skopal
Department of Software Engineering, Faculty of Mathematics and Physics,
Charles University, Prague, Czechia
Abstract
Purpose: Semantic retrieval and discovery of datasets published as open data remains a challenging task. The
datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database
administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset
catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual
metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways
of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party
(external) knowledge bases, countless feature extraction methods and description models and so forth.
Design/methodology/approach: In this paper, the authors propose a modular framework for rapid
experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible
catalog of components prepared to form custom pipelines for dataset representation and discovery.
Findings: The study proposes several proof-of-concept pipelines including experimental evaluation, which
showcase the usage of the framework.
Originality/value: To the best of the authors' knowledge, there is no similar formal framework for
experimentation with various similarity methods in the context of dataset discovery. The framework has the
ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The
prototype implementation of the framework is available on GitHub.
Keywords: Dataset, Discovery, Search, Framework, Similarity, Knowledge graph
Paper type: Research paper
1. Introduction
The number of datasets available on the web increases tremendously. For example, the number
of datasets published by public authorities in European countries increased from 880k datasets
in August 2019 [1] to 1140k datasets in November 2021 [2]. Also, Google observed an explosive
growth in the number of available datasets in recent years according to Benjelloun et al. (2020).
© Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek and Tomáš Skopal. Published by
Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY
4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for
both commercial and non-commercial purposes), subject to full attribution to the original publication
and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/
legalcode.
This work was supported by the Czech Science Foundation (GAČR), Grant Number 19-01641S.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/2514-9288.htm
Received 27 September 2021
Revised 3 December 2021
Accepted 17 December 2021
Data Technologies and
Applications
Vol. 56 No. 4, 2022
pp. 506-535
Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-09-2021-0261
Although there exist dataset catalogs providing search for datasets, their retrieval features are
restricted to simple keyword search based on textual metadata recorded in the catalog. These
simple search methods presume that their users, the data consumers, know exactly what they
are searching for and which search query leads to the expected results. However, this
assumption is usually not valid, and, in principle, it neglects the very purpose of the catalogs.
When users know which datasets they are searching for, they usually also know who publishes
a dataset and how the publisher titles the dataset. With this knowledge, it is quite
straightforward to locate a dataset on the publisher's website using a generic search engine.
The genuine purpose of data catalogs emerges when users do not exactly know which datasets
they are searching for and how to find them. This is a usual situation where the users know only
a few keywords and topics that roughly characterize the needed data. The problem of missing
information about data is inherently related to the big data phenomenon and is generally
discussed as the problem of data findability by Zezula (2015). In their studies, Gregory et al.
(2020a), Koesten (2018) and Degbelo (2020) show that users typically need to search for more
than a single, isolated dataset. Typically, the users wish to find multiple datasets similar to each
other in some way, and this is where the pure metadata-based search methods come up short.
The studies also show that dataset discovery depends on the context of the users' needs and
discovery tasks. Various works such as Fernandez et al. (2018), Zhang and Balog (2018) and
Mountantonakis and Tzitzikas (2018) also show that dataset content can be important for
building dataset discovery services. Therefore, it is not easy to construct a dataset discovery
service on top of a single similarity discovery method. It is necessary to be able to experiment
with various combinations of different methods and compare them. This leads us to the
following research questions we try to solve in this paper:
RQ1. How can we support such experiments with different similarity dataset discovery
methods?
RQ2. How can we support combining the methods to more complex pipelines for
computing dataset similarities?
RQ3. How can we evaluate and compare different pipelines?
In this paper, we introduce a modular framework for rapid experimentation with methods for
similarity-based dataset discovery, using the perspective of software engineering. We are
aware that the development of an ultimate and universal method for dataset discovery would
be an infeasible effort. This is based on our previous work, Škoda et al. (2019) and Skopal et al.
(2021), where we already experimented with various similarity discovery methods. We
measured them on various real search scenarios and showed that none of the evaluated
methods performs best on all the scenarios. In this paper, we do not propose yet another
method. Instead, we focus on answering the research questions above by proposing a
framework for experiments with various dataset similarity methods.
Therefore, the framework is not proposed as a complete solution to particular dataset
discovery problems, but it should rather act as an extensible modular toolbox for
experimentation with various dataset discovery pipelines, including future ones. It supports
experimentation by providing a predefined and extensible set of compatible components
which can be combined to more complex pipelines which can then be measured, evaluated
and compared. Although the framework is designed as generic and extensible, its retrieval
model is based on the similarity search paradigm that proved to be an effective general
mechanism for retrieval of complex data. Another feature of the framework is its
presumption of external knowledge in the process of dataset discovery, which is essential
to retrieval using different dataset contexts. In Škoda et al. (2020), we proposed a framework
for evaluation of dataset discovery methods. This paper focuses not only on the evaluation
but also on the experimentation with the methods and their combinations.
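
To make the idea of combining compatible components into measurable pipelines more concrete, the following is a minimal illustrative sketch in Python. It is not the framework's actual API or the prototype published on GitHub. All names (`EXTERNAL_KNOWLEDGE`, `tokenize`, `map_to_concepts`, `build_pipeline`, `rank`) are hypothetical, the external knowledge is reduced to a tiny hand-written term-to-concept map, and Jaccard similarity over title tokens is only one assumed choice of similarity model.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in for an external knowledge base: it maps a term to a
# broader concept. A real pipeline might query a knowledge graph instead.
EXTERNAL_KNOWLEDGE: Dict[str, str] = {
    "cars": "vehicle",
    "automobiles": "vehicle",
    "schools": "education",
    "universities": "education",
}

Descriptor = Set[str]
Component = Callable[[Descriptor], Descriptor]


def tokenize(title: str) -> Descriptor:
    """Extract a simple set-of-words descriptor from a dataset title."""
    return set(title.lower().split())


def map_to_concepts(tokens: Descriptor) -> Descriptor:
    """Replace tokens by concepts wherever the external knowledge covers them."""
    return {EXTERNAL_KNOWLEDGE.get(token, token) for token in tokens}


def jaccard(a: Descriptor, b: Descriptor) -> float:
    """Jaccard similarity of two descriptors (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_pipeline(*components: Component) -> Component:
    """Chain descriptor-transforming components into a single pipeline."""
    def run(descriptor: Descriptor) -> Descriptor:
        for component in components:
            descriptor = component(descriptor)
        return descriptor
    return run


def rank(query: str, titles: List[str], pipeline: Component) -> List[Tuple[str, float]]:
    """Rank dataset titles by similarity to the query under a given pipeline."""
    query_descriptor = pipeline(tokenize(query))
    scored = [(t, jaccard(query_descriptor, pipeline(tokenize(t)))) for t in titles]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


titles = ["Registered automobiles 2020", "List of universities", "Air quality"]
baseline = rank("cars 2020", titles, build_pipeline())
with_knowledge = rank("cars 2020", titles, build_pipeline(map_to_concepts))
```

Under these assumptions, the two pipelines can be compared on the same query: the baseline matches "Registered automobiles 2020" only through the shared token "2020", whereas the knowledge-aware pipeline also matches it through the shared concept "vehicle" and therefore scores it higher.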