Modular framework for similarity-based dataset discovery using external knowledge

DOI	https://doi.org/10.1108/DTA-09-2021-0261
Published date	15 February 2022
Date	15 February 2022
Pages	506-535
Subject Matter	Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Author	Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek, Tomáš Skopal

Martin Nečaský and Petr Škoda

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czechia

David Bernhauer

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czechia and Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague, Prague, Czechia, and

Jakub Klímek and Tomáš Skopal

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czechia

Abstract

Purpose – Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.

Design/methodology/approach – In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.

Findings – The study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework.

Originality/value – To the best of authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.

Keywords Dataset, Discovery, Search, Framework, Similarity, Knowledge graph

Paper type Research paper

1. Introduction

The number of datasets available on the web increases tremendously. For example, the number of datasets published by public authorities in European countries increased from 880k datasets in August 2019 [1] to 1140k datasets in November 2021 [2]. Also, Google observed an explosive growth in the number of available datasets in recent years according to Benjelloun et al. (2020).

Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek and Tomáš Skopal. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode.

This work was supported by the Czech Science Foundation (GAČR), Grant Number 19-01641S.

The current issue and full text archive of this journal is available on Emerald Insight at:

https://www.emerald.com/insight/2514-9288.htm

Received 27 September 2021

Revised 3 December 2021

Accepted 17 December 2021

Data Technologies and Applications

Vol. 56 No. 4, 2022

pp. 506-535

Emerald Publishing Limited

2514-9288

DOI 10.1108/DTA-09-2021-0261

Although there exist dataset catalogs providing search for datasets, their retrieval features are restricted to simple keyword search based on textual metadata recorded in the catalog. These simple search methods presume that their users, the data consumers, know exactly what they are searching for and which search query leads to the expected results. However, this assumption is usually not valid, and, in principle, it neglects the very purpose of the catalogs. When users know which datasets they are searching for, they usually also know who publishes a dataset and how the publisher titles the dataset. With this knowledge, it is quite straightforward to locate a dataset on the publisher’s website using a generic search engine. The genuine purpose of data catalogs emerges when users do not exactly know which datasets they are searching for and how to find them. This is a usual situation where the users know only a few keywords and topics that roughly characterize the needed data. The problem of missing information about data is inherently related to the big data phenomenon and is generally discussed as the problem of data findability by Zezula (2015). In their studies, Gregory et al. (2020a), Koesten (2018) and Degbelo (2020) show that users typically need to search for more than a single, isolated dataset. Typically, the users wish to find multiple datasets similar to each other in some way, and this is where the pure metadata-based search methods come up short. The studies also show that dataset discovery depends on the context of the user’s needs and discovery tasks. Various works such as Fernandez et al. (2018), Zhang and Balog (2018) and Mountantonakis and Tzitzikas (2018) also show that dataset content can be important for building dataset discovery services. Therefore, it is not easy to construct a dataset discovery service on top of a single similarity discovery method. It is necessary to be able to experiment with various combinations of different methods and compare them. This leads us to the following research questions we try to solve in this paper:

RQ1. How can we support such experiments with different similarity dataset discovery methods?

RQ2. How can we support combining the methods to more complex pipelines for computing dataset similarities?

RQ3. How can we evaluate and compare different pipelines?

In this paper, we introduce a modular framework for rapid experimentation with methods for similarity-based dataset discovery, using the perspective of software engineering. We are aware that the development of an ultimate and universal method for dataset discovery would be an infeasible effort. This is based on our previous work, Škoda et al. (2019) and Skopal et al. (2021), where we already experimented with various similarity discovery methods. We measured them on various real search scenarios and showed that none of the evaluated methods performs best in all the scenarios. In this paper, we do not propose yet another method. Instead, we focus on answering the research questions above by proposing a framework for experiments with various dataset similarity methods.

Therefore, the framework is not proposed as a complete solution to particular dataset discovery problems; rather, it should act as an extensible modular toolbox for experimentation with various dataset discovery pipelines, including future ones. It supports experimentation by providing a predefined and extensible set of compatible components that can be combined into more complex pipelines, which can then be measured, evaluated and compared. Although the framework is designed as generic and extensible, its retrieval model is based on the similarity search paradigm, which has proved to be an effective general mechanism for retrieval of complex data. Another feature of the framework is its presumption of external knowledge in the process of dataset discovery, which is essential to retrieval using different dataset contexts. In Škoda et al. (2020), we proposed a framework for evaluation of dataset discovery methods. This paper focuses not only on the evaluation but also on the experimentation with the methods and their combinations.
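
The idea of combining compatible components into a pipeline and then ranking datasets by similarity can be illustrated with a minimal sketch. This is not the API of the prototype implementation; every name below (extract_title, tokenize, bag_of_words, make_pipeline, rank_similar) and the three-record catalog are hypothetical, and a plain bag-of-words cosine similarity over titles stands in for whatever representation and similarity a real pipeline would use.

```python
from collections import Counter
from math import sqrt

# Hypothetical components: each maps one dataset representation to another.
def extract_title(dataset):
    """Take the title from a metadata record (a plain dict in this sketch)."""
    return dataset["title"]

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def bag_of_words(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)

def make_pipeline(*components):
    """Chain components so that the output of one feeds the next."""
    def run(dataset):
        value = dataset
        for component in components:
            value = component(value)
        return value
    return run

def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_similar(query, catalog, pipeline, similarity):
    """Rank all other datasets in the catalog by similarity to the query."""
    q = pipeline(query)
    scored = [(similarity(q, pipeline(d)), d["title"]) for d in catalog if d is not query]
    return sorted(scored, reverse=True)

# Made-up catalog records, for illustration only.
catalog = [
    {"title": "Air quality measurements Prague"},
    {"title": "Air quality stations"},
    {"title": "Public transport timetables"},
]
pipeline = make_pipeline(extract_title, tokenize, bag_of_words)
ranking = rank_similar(catalog[0], catalog, pipeline, cosine_similarity)
```

Because each component only consumes the output of the previous one, swapping bag_of_words for a different representation, or cosine_similarity for a different measure, yields a new pipeline that can be compared against the first on the same catalog.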
