Metadata-based data quality assessment

DOI: https://doi.org/10.1108/VJIKMS-11-2015-0059
Date: 09 May 2016
Pages: 232-250
Published date: 09 May 2016
Authors: Mustafa Aljumaili, Ramin Karim, Phillip Tretten
Subject Matter: Information & knowledge management, Knowledge management, Knowledge management systems
Mustafa Aljumaili, Ramin Karim and Phillip Tretten
Division of Operation, Maintenance and Acoustics Engineering,
Luleå University of Technology, Luleå, Sweden
Abstract
Purpose – The purpose of this paper is to develop a data quality (DQ) assessment model based on
content analysis and metadata analysis.
Design/methodology/approach – A literature review of DQ assessment models has been
conducted. A study of DQ key performance indicators (KPIs) has been done. Finally, the proposed model has been
developed and applied in a case study.
Findings – The results of this study show that metadata contain important information about
DQ in a database and can be used to assess DQ to provide decision support for decision makers.
Originality/value – There are many DQ assessment models in the literature; however, metadata are not
considered in these models. The model developed in this study is based on metadata in addition to
content analysis, to find a quantitative DQ assessment.
Keywords Metadata, Data quality, Attributes, eMaintenance, Database
Paper type Research paper
1. Introduction
Knowledge management systems (KMSs) rely ultimately on the timely and accurate
retrieval of appropriate facts and information that come in many different forms. These
forms are located in the enterprise data and have different structures and attributes such
as reliability, accuracy and security (Pigott and Hobbs, 2011). High-quality data make
organizational data resources more reliable, increasing the business benefits gained by
using them. They contribute to efficient and effective business operations, improved
decision-making and increased trust in information systems (DeLone and McLean, 1992;
Redman and Blanton, 1997). Advances in information systems and technology permit
organizations to collect large amounts of data and to build and manage complex data
resources. However, the large size and complexity make data resources vulnerable to
data defects that reduce their quality (Even and Shankaranarayanan, 2009).
Although there is no consensus on the distinction between data quality (DQ) and
information quality (IQ), there is a tendency to use DQ to refer to technical issues and IQ
to refer to non-technical issues (Zhu et al., 2014). In this study, we do not make this
distinction but use DQ to refer to the full range of issues.
DQ can be dened as the data that are t to use by data consumers. The production
of high-quality statistics depends on DQ. Without a systematic assessment of DQ, there
is a risk of losing control of the various statistical processes such as data collection,
editing or weighting. Forgoing DQ assessment assumes that processes cannot be improved
and that problems will always be detected without systematic analysis. Without good DQ
assessment, however, statistical departments are working blind; they can neither claim to be
professional nor deliver quality results (Bergdahl et al., 2007).
Received 2 November 2015
Revised 10 December 2015
Accepted 22 December 2015
VINE Journal of Information and
Knowledge Management Systems
Vol. 46 No. 2, 2016
pp. 232-250
©Emerald Group Publishing Limited
2059-5891
Quantitative assessment of quality is critical in large data environments, as it can
help set up realistic quality improvement targets, track progress, assess impacts of
different solutions and prioritize improvements.
However, DQ is typically assessed along multiple quality dimensions (Even and
Shankaranarayanan, 2009), and these dimensions have to be considered in relation to
specic user objectives, goals and functions in a specic context. Because all users,
whether human or automatic processes, have different data and information
requirements, the set of attributes and the level of quality considered satisfactory vary
with the user’s perspective and with the type of models, algorithms and processes comprising
the system. Therefore, the general ontology designed to identify possible attributes and
relations between them, especially in a human–machine integrated system, will require
instantiation in every particular case (Rogova and Bosse, 2010).
The literature suggests several methods for assessing DQ; the proposed quality
measurements often use a scale between 0 (poor) and 1 (perfect) (Wang et al., 1995a,
1995b; Redman and Blanton, 1997; Pipino et al., 2002). Some methods, referred to by
Ballou and Pazer (2003) as structure-based or structural, are driven by physical
characteristics of the data (e.g. item counts, time tags or defect rates). Such methods are
impartial, as they assume an objective quality standard and disregard the context in
which the data are used (Even and Shankaranarayanan, 2009). Other measurement
methods, referred to as content-based (Ballou and Pazer, 2003), derive measurements
from data content. Such measurements typically reflect the impact of quality defects
within a specific usage context and are, therefore, also called contextual assessments
(Pipino et al., 2002).
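As a rough illustration of a structure-based measurement on the 0 to 1 scale described above, the following sketch scores a column as one minus its defect rate, where a defect is taken to be a missing value. The function name `completeness_score`, the choice of missing values as the defect and the sample records are assumptions made for illustration; they are not taken from the cited methods.

```python
from typing import Iterable, Optional


def completeness_score(values: Iterable[Optional[object]]) -> float:
    """Return 1 - defect rate, where a defect is a missing (None or empty) value."""
    items = list(values)
    if not items:
        # Nothing to assess; scoring an empty column as 1.0 is a convention
        # chosen for this sketch, not a rule from the literature.
        return 1.0
    defects = sum(1 for v in items if v is None or v == "")
    return 1.0 - defects / len(items)


# Hypothetical maintenance records: four reporting dates, one of them missing.
dates = ["2015-01-03", None, "2015-02-11", "2015-03-20"]
score = completeness_score(dates)  # 0 means poor, 1 means perfect
```

Because the score depends only on counts of items and defects, it disregards how the data are used, which is what makes it structure-based rather than contextual.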
IQ can be assessed on three levels: information content, information source and
information system quality. Major attributes of the quality of information content are
accessibility, availability, relevance, timeliness and integrity. Information sources can be
subjective or objective. Subjective sources include human observers, experts and
decision makers. Objective information sources include sensors, models and automated
processes; these are free of the biases inherent to human judgment and depend only on
how well sensors are calibrated (Rogova and Bosse, 2010).
Information systems should be well designed to ensure high-quality data. The
database design includes tables and metadata.
Metadata are crucial for information systems, and the past 30 years have witnessed a
tremendous growth in the use of metadata (Lee et al., 2006). However, metadata are not
yet used for DQ assessment. Therefore, this study proposes a methodology to assess
quality, considering both content and database metadata. By merging and comparing
the two, it seeks to improve the assessment of DQ and facilitate better decision-making.
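To make the idea of comparing content with database metadata more concrete, the sketch below reads column metadata from a table and sets it beside a simple content measurement (the share of non-null values). It is only an illustration under stated assumptions: the use of SQLite, the table `work_orders` and its columns are hypothetical, and the code is not the model proposed in this paper.

```python
import sqlite3

# Hypothetical in-memory table standing in for an enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_orders ("
    "id INTEGER PRIMARY KEY, asset TEXT, reported TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO work_orders (asset, reported) VALUES (?, ?)",
    [("pump-1", "2015-01-03"), (None, "2015-02-11"), ("fan-7", "2015-03-20")],
)

# Metadata: PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
metadata = conn.execute("PRAGMA table_info(work_orders)").fetchall()
total = conn.execute("SELECT COUNT(*) FROM work_orders").fetchone()[0]

report = {}
for _cid, name, declared_type, notnull, _default, _pk in metadata:
    # Content: count missing values in this column.
    nulls = conn.execute(
        f'SELECT COUNT(*) FROM work_orders WHERE "{name}" IS NULL'
    ).fetchone()[0]
    report[name] = {
        "declared_type": declared_type,
        "not_null": bool(notnull),
        "completeness": 1 - nulls / total if total else 1.0,
    }
conn.close()
```

In this sketch, a column whose metadata declare no NOT NULL constraint and whose content contains missing values (here `asset`) stands out as a candidate for closer inspection.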
2. Types of data
Data can be considered an asset. An asset is a useful item that is a product or byproduct
of an application development process. An asset can be tangible, such as data, designs
or software code; or intangible, such as knowledge and methodologies (Lee et al., 2006).
In general, three types of data should be considered when determining DQ: structured,
unstructured and semi-structured data.
Fully structured data follow a predefined schema, conforming to certain
specifications (Sint et al., 2009). A typical example of fully structured data is a relational
database system. Structured data are often managed using Structured Query Language