Data repair of density-based data cleaning approach using conditional functional dependencies

Samir Al-Janabi and Ryszard Janicki
McMaster University, Hamilton, Canada

Data Technologies and Applications, Vol. 56 No. 3, 2022, pp. 429-446
DOI: https://doi.org/10.1108/DTA-05-2021-0108
Published: 19 November 2021
Abstract
Purpose: Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violations of business rules. However, because of the huge amount of data, manual cleaning alone is infeasible: methods are required to automatically detect and repair dirty data. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.
Design/methodology/approach: A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.
Findings: This new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.
Originality/value: Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.
Keywords: Data management, Information systems, Data repair, Integrity constraints
Paper type: Research paper
1. Introduction
Companies are becoming more and more dependent on data to support various activities,
such as decision-making and predictive analysis. Poor data quality has significant negative
consequences and dramatic costs (Haug et al., 2011; De Veaux and Hand, 2005). A report
(Schutz, 2013) stated that more than 80% of businesses consider poor data quality
damaging to their business objectives, and 66% mentioned that poor data quality had negatively
impacted their businesses in the previous 12 months. US companies believe that 25% of their data
are inaccurate. When data are corrupted, it is not possible to obtain correct answers to queries
even if efficient and scalable query evaluation algorithms are used (Fan, 2015). A recent
report (Chien and Jain, 2019) stated that digital revenue is expected to grow by 10.3%
between 2017 and 2020, and that data quality must be maintained to achieve this growth. In
2017, the market for data quality software tools reached $1.61 billion, an increase of more
than 11% over 2016, and the compound annual growth rate of revenue
for the period 2017-2022 is forecast to be 8.1%.
Data quality is defined by some as "fitness for use", that is, given the data and a user with a
purpose, to what extent can the data serve that purpose (Tayi and Ballou, 1998; Lederman
et al., 2003; Watts et al., 2009). Another way to define the concept of data quality is to divide
it into quality dimensions, such as consistency, accuracy and completeness. Consistency
refers to the validity and integrity of information representing real-world entities, free of
contradictions, and is typically identified with respect to integrity constraints.
Accuracy refers to the closeness of data values in a database to the true values the database
aims to represent. Completeness is described in terms of the presence or absence of values in a
database (Batini and Scannapieca, 2016). Duplicate records are another issue that may affect
the quality of data. Data deduplication involves identifying which tuples refer to the same
real-world entity (Ilyas and Chu, 2019). Data deduplication is also known by other names such
as record linkage, record matching, entity resolution and object identification (Fan, 2015).
Integrity constraints are widely used to incorporate semantics into relational data. They
are typically used to prevent invalid updates, optimize queries and normalize databases.
There are several types of prominent integrity constraints, such as functional dependencies
(FDs) (Abiteboul et al., 1995) and conditional functional dependencies (CFDs) (Bohannon et al.,
2007; Fan and Geerts, 2012). FDs are important integrity constraints and have several
applications, such as data cleaning (Beskales et al., 2010) and data integration (Wang et al.,
2009). CFDs can be used to capture schema semantics (Fan et al., 2008) and maintain data
quality (Fan and Geerts, 2012). However, these constraints may be violated for a variety of
reasons, such as integration with other data sources or typographical errors. These
violations make the data unclean, as they do not represent the correct values. For example,
consider Table 1, the Persons dataset, which displays personal data. Consider the FD
F1: PostalCode → City. Either tuples t1, t4, t7 and t8 violate the FD, or tuples
t2, t3, t5, t6, t9 and t10 violate it. That is, we either change the values of the city to
"Hamilton" in the first set of tuples, or to "Ottawa" in the second set. We need evidence
to indicate which value to change. As another example, consider the FD F2: Model → Car.
There are two cars for the "Camry" model: "Golf" and "Toyota". Either tuples t1, t3, t6
and t10 violate the FD, or tuples t4, t7 and t9 do. That is, we either change the values of
the car to "Golf" in the first set of tuples, or to "Toyota" in the second set. We again
need evidence to indicate which value to change.
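To make this detection step concrete, the following Python sketch (our illustration of FD violation detection in general, not the authors' implementation; the fd_violations helper is hypothetical) groups tuples by an FD's left-hand side and reports groups that map to more than one right-hand-side value:

```python
from collections import defaultdict

# Simplified projection of Table 1: (id, postal_code, city, model, car).
persons = [
    ("t1",  "L8S", "Ottawa",   "Camry",  "Toyota"),
    ("t2",  "L8S", "Hamilton", "Xterra", "Nissan"),
    ("t3",  "L8S", "Hamilton", "Camry",  "Toyota"),
    ("t4",  "L8S", "Ottawa",   "Camry",  "Golf"),
    ("t5",  "L8S", "Hamilton", "Xterra", "Nissan"),
    ("t6",  "L8S", "Hamilton", "Camry",  "Toyota"),
    ("t7",  "L8S", "Ottawa",   "Camry",  "Golf"),
    ("t8",  "L8S", "Ottawa",   "Xterra", "Nissan"),
    ("t9",  "L8S", "Hamilton", "Camry",  "Golf"),
    ("t10", "L8S", "Hamilton", "Camry",  "Toyota"),
]

def fd_violations(rows, lhs, rhs):
    """Group tuple ids by the FD's LHS value, then by RHS value; an LHS
    group violates the FD when it maps to more than one RHS value."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        groups[row[lhs]][row[rhs]].append(row[0])
    return {k: dict(v) for k, v in groups.items() if len(v) > 1}

# F1: PostalCode -> City (column 1 determines column 2).
print(fd_violations(persons, 1, 2))
# {'L8S': {'Ottawa': ['t1', 't4', 't7', 't8'],
#          'Hamilton': ['t2', 't3', 't5', 't6', 't9', 't10']}}

# F2: Model -> Car (column 3 determines column 4).
print(fd_violations(persons, 3, 4))
# {'Camry': {'Toyota': ['t1', 't3', 't6', 't10'], 'Golf': ['t4', 't7', 't9']}}
```

Each violating group exposes exactly the repair choice discussed above: the detector tells us which sets of tuples conflict, but not which right-hand-side value is the correct one.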
We summarize our contributions in this work as follows:
(1) We extend the model of Al-Janabi and Janicki (2016) by introducing constant and variable CFDs to the data repair model (the sketch after this list illustrates the two kinds) and discuss various scenarios that could enhance data cleaning.
(2) We evaluate the quality and performance of repair with CFDs, performing a comparative study with the original model, which repairs data using FDs, to assess the quality and scalability of data repair. In addition, we perform a comparative study with another model that repairs data using CFDs to evaluate the quality of data repair.
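For intuition on the two kinds of constraints named in contribution (1): a constant CFD binds attributes to constants (e.g. a rule stating that postal code "L8S" must imply city "Hamilton"), while a variable CFD requires an embedded FD to hold only on the tuples matching the pattern's constants. The Python sketch below is our simplified encoding of these checks (the cfd_violations helper and the example rules phi1 and phi2 are assumptions for illustration, not constraints given in the paper):

```python
WILDCARD = "_"  # stands for an unconstrained value in a pattern

# Projections of three Table 1 tuples.
persons = [
    {"id": "t1", "company": "Mi LLP", "postal": "L8S", "city": "Ottawa",
     "email": "yuli@millp.com", "lname": "Carl"},
    {"id": "t2", "company": "Mi LLP", "postal": "L8S", "city": "Hamilton",
     "email": "yuli@millp.com", "lname": "Acton"},
    {"id": "t8", "company": "Mi LLP", "postal": "L8S", "city": "Ottawa",
     "email": "yuli@millp.com", "lname": "Acton"},
]

def cfd_violations(rows, lhs, rhs, pattern):
    """Report violations of the CFD (lhs -> rhs, pattern). Only tuples
    whose lhs values match the pattern's constants are considered. If the
    pattern binds rhs to a constant (constant CFD), every matching tuple
    must carry that constant; otherwise (variable CFD) the embedded FD
    lhs -> rhs must hold on the matching tuples."""
    matching = [r for r in rows
                if all(pattern[a] in (WILDCARD, r[a]) for a in lhs)]
    if pattern[rhs] != WILDCARD:
        return [(r["id"],) for r in matching if r[rhs] != pattern[rhs]]
    return [(r["id"], s["id"])
            for i, r in enumerate(matching) for s in matching[i + 1:]
            if all(r[a] == s[a] for a in lhs) and r[rhs] != s[rhs]]

# phi1 (constant CFD): PostalCode = 'L8S' implies City = 'Hamilton'.
print(cfd_violations(persons, ["postal"], "city",
                     {"postal": "L8S", "city": "Hamilton"}))
# [('t1',), ('t8',)]

# phi2 (variable CFD): within Company = 'Mi LLP', Email determines LName.
print(cfd_violations(persons, ["company", "email"], "lname",
                     {"company": "Mi LLP", "email": WILDCARD,
                      "lname": WILDCARD}))
# [('t1', 't2'), ('t1', 't8')]
```

Unlike a plain FD, phi2 scopes the dependency to a subset of the data, which is what lets CFDs capture semantic errors that hold only under specific conditions.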
Table 1. Instance of Persons relation

Id   FName     LName    Company          Postal code  City      Email                       Rank  Salary  Model   Car
t1   Saly      Carl     Mi LLP           L8S          Ottawa    yuli@millp.com              a     1,000   Camry   Toyota
t2   Yuli      Acton    Mi LLP           L8S          Hamilton  yuli@millp.com              d     5,000   Xterra  Nissan
t3   Clifford  Quentin  Praesent Eu Ltd  L8S          Hamilton  clifford@praesenteultd.net  d     5,000   Camry   Toyota
t4   C         Quentin  P. EU            L8S          Ottawa    clifford@praesenteultd.nte  d     4,000   Camry   Golf
t5   Yuli      Acton    Mi LLP           L8S          Hamilton  yuli@millp.com              c     3,000   Xterra  Nissan
t6   Clifford  Quentin  Praesent Eu Ltd  L8S          Hamilton  clifford@praesenteultd.net  d     5,000   Camry   Toyota
t7   Clif      Q        Praes. Eu        L8S          Ottawa    clifford@praesenteultd.net  d     5,000   Camry   Golf
t8   Yuli      Acton    Mi LLP           L8S          Ottawa    yuli@millp.com              d     4,000   Xterra  Nissan
t9   Rami      White    Mi LLP           L8S          Hamilton  salemr@millp.com            d     4,000   Camry   Golf
t10  Rami      Salem    Mi LLP           L8S          Hamilton  salemr@millp.com            d     4,000   Camry   Toyota