Data repair of density-based data cleaning approach using conditional functional dependencies

Samir Al-Janabi and Ryszard Janicki
McMaster University, Hamilton, Canada

Data Technologies and Applications, Vol. 56 No. 3, 2022, pp. 429-446
DOI: https://doi.org/10.1108/DTA-05-2021-0108
Published: 19 November 2021
Abstract
Purpose: Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violations of business rules. However, because of the huge amount of data, manual cleaning alone is infeasible: methods are required to automatically detect and repair dirty data. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.
Design/methodology/approach: A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.
Findings: This new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.
Originality/value: Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.
Keywords: Data management, Information systems, Data repair, Integrity constraints
Paper type: Research paper
1. Introduction
Companies are becoming more and more dependent on data to support various activities,
such as decision-making and predictive analysis. Poor data quality has significant negative
consequences and dramatic costs (Haug et al., 2011; De Veaux and Hand, 2005). A report
(Schutz, 2013) stated that more than 80% of businesses consider poor data quality
damaging to their business objectives, and 66% mentioned that poor data quality had negatively
impacted their businesses in the previous 12 months. US companies believe that 25% of their data
are inaccurate. When data are corrupted, it is not possible to obtain correct answers to queries
even if efficient and scalable query evaluation algorithms are used (Fan, 2015). A recent
report (Chien and Jain, 2019) stated that digital revenue is expected to grow by 10.3%
between 2017 and 2020, and that data quality must be maintained to achieve this growth. In
2017, the market for data quality software tools reached $1.61 billion, an increase of more
than 11% over 2016, and the compound annual growth rate of revenue
for the period 2017-2022 is forecast to be 8.1%.
Data quality is defined by some as "fitness for use", that is, given the data and a user with a
purpose, to what extent can the data serve that purpose (Tayi and Ballou, 1998; Lederman
et al., 2003; Watts et al., 2009). Another way to define the concept of data quality is to divide
it into quality dimensions, such as consistency, accuracy and completeness. Consistency
refers to the validity and integrity of information representing real-world entities, free of
contradictions, and is typically identified with respect to integrity constraints.
Accuracy refers to the closeness of data values in a database to the true values the database
aims to represent. Completeness is described in terms of the presence or absence of values in a
database (Batini and Scannapieca, 2016). Duplicate records are another issue that may affect
the quality of data. Data deduplication involves identifying which tuples refer to the same
real-world entity (Ilyas and Chu, 2019). Data deduplication is also known by other names such
as record linkage, record matching, entity resolution and object identification (Fan, 2015).
Integrity constraints are widely used to incorporate semantics into relational data. They
are typically used to prevent invalid updates, optimize queries and normalize databases.
There are several types of prominent integrity constraints, such as functional dependencies
(FDs) (Abiteboul et al., 1995) and conditional functional dependencies (CFDs) (Bohannon et al.,
2007; Fan and Geerts, 2012). FDs are important integrity constraints and have several
applications, such as data cleaning (Beskales et al., 2010) and data integration (Wang et al.,
2009). CFDs can be used to capture schema semantics (Fan et al., 2008) and maintain data
quality (Fan and Geerts, 2012). However, these constraints may be violated for a variety of
reasons, such as integration with other data sources or typographical errors. These
violations make the data unclean, as they do not represent the correct values. For example,
consider Table 1, the Persons dataset, which displays personal data. Consider the FD
F1: PostalCode → City. Either tuples t1, t4, t7 and t8 violate the FD, or tuples
t2, t3, t5, t6, t9 and t10 violate it. That is, we either change the values of the city to
"Hamilton" in the first set of tuples, or to "Ottawa" in the second set. We need evidence
to indicate which value to change. As another example, consider the FD F2: Model → Car.
There are two cars for the "Camry" model: "Golf" and "Toyota". Either tuples t1, t3, t6
and t10 violate the FD, or tuples t4, t7 and t9 do. That is, we either change the values of
the car to "Golf" in the first set of tuples, or to "Toyota" in the second set. We again
need evidence to indicate which value to change.
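To make this detection step concrete, the following Python sketch (our illustration of FD violation detection in general, not the authors' implementation; the fd_violations helper is hypothetical) groups tuples by an FD's left-hand side and reports groups that map to more than one right-hand-side value:

```python
from collections import defaultdict

# Simplified projection of Table 1: (id, postal_code, city, model, car).
persons = [
    ("t1",  "L8S", "Ottawa",   "Camry",  "Toyota"),
    ("t2",  "L8S", "Hamilton", "Xterra", "Nissan"),
    ("t3",  "L8S", "Hamilton", "Camry",  "Toyota"),
    ("t4",  "L8S", "Ottawa",   "Camry",  "Golf"),
    ("t5",  "L8S", "Hamilton", "Xterra", "Nissan"),
    ("t6",  "L8S", "Hamilton", "Camry",  "Toyota"),
    ("t7",  "L8S", "Ottawa",   "Camry",  "Golf"),
    ("t8",  "L8S", "Ottawa",   "Xterra", "Nissan"),
    ("t9",  "L8S", "Hamilton", "Camry",  "Golf"),
    ("t10", "L8S", "Hamilton", "Camry",  "Toyota"),
]

def fd_violations(rows, lhs, rhs):
    """Group tuple ids by the FD's LHS value, then by RHS value; an LHS
    group violates the FD when it maps to more than one RHS value."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        groups[row[lhs]][row[rhs]].append(row[0])
    return {k: dict(v) for k, v in groups.items() if len(v) > 1}

# F1: PostalCode -> City (column 1 determines column 2).
print(fd_violations(persons, 1, 2))
# {'L8S': {'Ottawa': ['t1', 't4', 't7', 't8'],
#          'Hamilton': ['t2', 't3', 't5', 't6', 't9', 't10']}}

# F2: Model -> Car (column 3 determines column 4).
print(fd_violations(persons, 3, 4))
# {'Camry': {'Toyota': ['t1', 't3', 't6', 't10'], 'Golf': ['t4', 't7', 't9']}}
```

Each violating group exposes exactly the repair choice discussed above: the detector tells us which sets of tuples conflict, but not which right-hand-side value is the correct one.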
We summarize our contributions in this work as follows:
(1) We extend the model of Al-Janabi and Janicki (2016) by introducing constant and variable CFDs to the data repair model (the sketch after this list illustrates the two kinds) and discuss various scenarios that could enhance data cleaning.
(2) We evaluate the quality and performance of repair with CFDs, performing a comparative study with the original model, which repairs data using FDs, to assess the quality and scalability of data repair. In addition, we perform a comparative study with another model that repairs data using CFDs to evaluate the quality of data repair.
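For intuition on the two kinds of constraints named in contribution (1): a constant CFD binds attributes to constants (e.g. a rule stating that postal code "L8S" must imply city "Hamilton"), while a variable CFD requires an embedded FD to hold only on the tuples matching the pattern's constants. The Python sketch below is our simplified encoding of these checks (the cfd_violations helper and the example rules phi1 and phi2 are assumptions for illustration, not constraints given in the paper):

```python
WILDCARD = "_"  # stands for an unconstrained value in a pattern

# Projections of three Table 1 tuples.
persons = [
    {"id": "t1", "company": "Mi LLP", "postal": "L8S", "city": "Ottawa",
     "email": "yuli@millp.com", "lname": "Carl"},
    {"id": "t2", "company": "Mi LLP", "postal": "L8S", "city": "Hamilton",
     "email": "yuli@millp.com", "lname": "Acton"},
    {"id": "t8", "company": "Mi LLP", "postal": "L8S", "city": "Ottawa",
     "email": "yuli@millp.com", "lname": "Acton"},
]

def cfd_violations(rows, lhs, rhs, pattern):
    """Report violations of the CFD (lhs -> rhs, pattern). Only tuples
    whose lhs values match the pattern's constants are considered. If the
    pattern binds rhs to a constant (constant CFD), every matching tuple
    must carry that constant; otherwise (variable CFD) the embedded FD
    lhs -> rhs must hold on the matching tuples."""
    matching = [r for r in rows
                if all(pattern[a] in (WILDCARD, r[a]) for a in lhs)]
    if pattern[rhs] != WILDCARD:
        return [(r["id"],) for r in matching if r[rhs] != pattern[rhs]]
    return [(r["id"], s["id"])
            for i, r in enumerate(matching) for s in matching[i + 1:]
            if all(r[a] == s[a] for a in lhs) and r[rhs] != s[rhs]]

# phi1 (constant CFD): PostalCode = 'L8S' implies City = 'Hamilton'.
print(cfd_violations(persons, ["postal"], "city",
                     {"postal": "L8S", "city": "Hamilton"}))
# [('t1',), ('t8',)]

# phi2 (variable CFD): within Company = 'Mi LLP', Email determines LName.
print(cfd_violations(persons, ["company", "email"], "lname",
                     {"company": "Mi LLP", "email": WILDCARD,
                      "lname": WILDCARD}))
# [('t1', 't2'), ('t1', 't8')]
```

Unlike a plain FD, phi2 scopes the dependency to a subset of the data, which is what lets CFDs capture semantic errors that hold only under specific conditions.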
Table 1. Instance of Persons relation

Id   FName     LName    Company          Postal code  City      Email                       Rank  Salary  Model   Car
t1   Saly      Carl     Mi LLP           L8S          Ottawa    yuli@millp.com              a     1,000   Camry   Toyota
t2   Yuli      Acton    Mi LLP           L8S          Hamilton  yuli@millp.com              d     5,000   Xterra  Nissan
t3   Clifford  Quentin  Praesent Eu Ltd  L8S          Hamilton  clifford@praesenteultd.net  d     5,000   Camry   Toyota
t4   C         Quentin  P. EU            L8S          Ottawa    clifford@praesenteultd.nte  d     4,000   Camry   Golf
t5   Yuli      Acton    Mi LLP           L8S          Hamilton  yuli@millp.com              c     3,000   Xterra  Nissan
t6   Clifford  Quentin  Praesent Eu Ltd  L8S          Hamilton  clifford@praesenteultd.net  d     5,000   Camry   Toyota
t7   Clif      Q        Praes. Eu        L8S          Ottawa    clifford@praesenteultd.net  d     5,000   Camry   Golf
t8   Yuli      Acton    Mi LLP           L8S          Ottawa    yuli@millp.com              d     4,000   Xterra  Nissan
t9   Rami      White    Mi LLP           L8S          Hamilton  salemr@millp.com            d     4,000   Camry   Golf
t10  Rami      Salem    Mi LLP           L8S          Hamilton  salemr@millp.com            d     4,000   Camry   Toyota