Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing

DOI: https://doi.org/10.1108/DTA-06-2021-0153
Published: 21 December 2021
Pages: 558-601
Subject matter: Library & information science; Librarianship/library management; Library technology; Information behaviour & retrieval; Metadata; Information & knowledge management; Information & communications technology; Internet
Author: Laouni Djafri
Ibn Khaldoun University, Tiaret, Algeria and
EEDIS Laboratory, Djillali Liabes University, Sidi Bel Abbes, Algeria
Abstract
Purpose: This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach: In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors' proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors raise in this work is how machine learning algorithms can be made to work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by a Map-Reduce algorithm, which in turn depends on a random sampling technique. The distributed architecture that the authors designed is specially directed at handling big data processing and operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB), the partial learning base for the first level (PLBL1) and the partial learning base for the second level (PLBL2). The experimental results show the efficiency of the proposed solution without significant loss in the classification results. Thus, in practical terms, the DDPML system is generally dedicated to big data mining processing and works effectively in distributed systems with a simple structure, such as client-server networks.
Findings: The authors got very satisfactory classification results.
Originality/value: The DDPML system is specially designed to smoothly handle big data mining classification.
Keywords Big data mining, Statistical sampling, Map-reduce, Machine learning, Distributed and parallel
processing, Big data platforms
Paper type Research paper
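Although the construction of the learning bases is detailed later in the article, the following minimal Python sketch illustrates one possible reading of the data-preparation flow described in the abstract: each data partition is sampled in a map-like step, and the partial results are merged and resampled in a reduce-like step to obtain a smaller representative learning base. The function names, the sampling fractions and the mapping of the two levels onto PLBL1, PLBL2 and RLB are illustrative assumptions, not the authors' exact implementation.

import random
from collections import defaultdict

def stratify(records, label_of):
    # Group records into strata by class label.
    strata = defaultdict(list)
    for r in records:
        strata[label_of(r)].append(r)
    return strata

def stratified_sample(strata, fraction, seed=None):
    # Proportional stratified random sampling without replacement.
    rng = random.Random(seed)
    sample = []
    for rows in strata.values():
        k = max(1, round(fraction * len(rows)))
        sample.extend(rng.sample(rows, k))
    return sample

def build_rlb(partitions, label_of, f1=0.1, f2=0.5):
    # Level 1 ("map"): each partition yields a partial learning base.
    level1 = []
    for part in partitions:
        level1.extend(stratified_sample(stratify(part, label_of), f1))
    # Level 2 ("reduce"): merge the partial bases and resample them
    # to obtain the representative learning base.
    return stratified_sample(stratify(level1, label_of), f2)

# Usage (hypothetical): partitions is a list of lists of (features, label) rows.
# rlb = build_rlb(partitions, label_of=lambda row: row[1])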
1. Introduction
Since the advent of the Internet to this day, we have seen explosive growth in the volume,
velocity and variety of data created daily (Sathyaraj et al., 2020); this amount of data is
generated by a variety of methods such as clickstream data, financial transaction data, log
files generated by web or mobile applications, sensor data from the Internet of Things (IoT),
in-game player activity and telemetry from connected devices, and many other methods
(O'Donovan et al., 2015; Hariri et al., 2019). This data is commonly referred to as Big Data
because of its volume, the velocity with which it arrives and the variety of forms it takes.
In 2001, Gartner proposed a three-dimensional or 3 Vs (volume, variety and velocity) view
of the challenges and opportunities associated with data growth (Chen et al., 2014). In 2012,
Gartner updated this report as follows: Big Data is high volume, high speed and/or wide
variety of information resources that require new forms of processing to improve decision-
making (Erl et al., 2016). Oftentimes, these Vs are supplemented by a fourth V, Veracity:
how accurate is the data (Chan, 2013; Roos et al., 2013)? We can extend this model to more
than ten Big Data dimensions, or Vs: volume, variety, velocity, veracity, value, variability,
validity, volatility, viability and viscosity (Hariri et al., 2019; Khan et al., 2018; Kayyali
et al., 2013; Katal et al., 2013; Ferguson, 2013; Ripon and Arif, 2016; IBM, 2014; Elgendy and
Elragal, 2014). Accordingly, the increasing digitization of our activities, the ever-
increasing ability to store digital data, and the accumulation of information of all kinds are
generating a new sector of activity aimed at analyzing these large amounts of data. This
leads to the emergence of new approaches, new methods, new knowledge and, ultimately,
undoubtedly, new ways of thinking and acting. Hence, this very large amount of data
must be exploited in order to better understand big data and how to extract knowledge
from it; this is known as big data mining (Cen et al., 2019; Dunren et al., 2013). Its main
purpose is to extract and retrieve desired information or patterns from a large amount of
data (Oussous et al., 2017). It is usually performed on a large amount of structured or
unstructured data using a combination of techniques that make it possible to explore
these large amounts of data, automatically or semi-automatically (Xindong et al., 2014;
Xingquan and Ian, 2007).
Every second, we see massive amounts of data growing exponentially, and this huge amount
of data carries no weight unless we extract the true value from it by extracting the
information and knowledge, or simply what we call big data mining. The real problem that we
currently face in big data mining is how to deal with this huge amount of data (volume).
How do we get the results in the shortest time (velocity)? And can we maintain
or improve the precision (veracity and validity) of the results after reducing the size? These
and other questions will be discussed in our article. But before we answer these questions, we
must understand very well the close relationship between these four characteristics (volume,
veracity, validity and velocity). We know very well that if the size is large (volume), the
precision (veracity and validity) will be high and the speed (velocity) will be low. If, on the
other hand, the size is small, the speed will be high and the precision may be low. Therefore,
our goal in this work is to reduce the size as far as possible in order to increase the speed as
much as possible. This speed will increase further if we use platforms and architectures
prepared for this purpose, provided that the precision of the results obtained is taken into
account.
First and foremost, if we want to reduce the volume of Big Data in a scientific and
correct way, then we definitely think about mathematical methods, especially
mathematical statistics (Che et al., 2013; Trovati and Bessis, 2015; Urrehman et al.,
2016). So what are the effective mathematical statistics methods that we must apply in
such cases to give very satisfactory results? On the other hand, if we want to speed up
processing time, we consider the application of parallel and distributed computing (Lu
and Zhan, 2020; Concurrency Computat: Pract. Exper., 2016; Brown et al., 2020) supported
by big data solutions (Jun et al., 2019; Zhang et al., 2017; Palanisamy and
Thirunavukarasu, 2019).
Today, big data mining relies mainly on statistical methods to tame and control the data
so that we can handle it comfortably. Thus, mathematical statistics is of prime importance
in data science and particularly in big data analytics (Bucchianico et al., 2019; Weihs and
Ickstadt, 2018; HLG-BAS, 2011). The reason may lie in its function: mathematical
statistics reveals correlations between statistical groups, and its role is pivotal in
reducing size to better understand data, and thus in extracting information with greater
precision and more quickly. As a result, statisticians and data science experts have turned
to the use of statistical sampling techniques (Rojas et al., 2017; Liu and Zhang, 2020; Mahmud
et al., 2020).
In the Big Data analytics context, we often work with small-scale sets (sub-datasets)
that are part of the original dataset. For this reason, we mainly use mathematical
statistics. In mathematical statistics, a population usually contains too many individuals
to study them all properly, and therefore a survey is often limited
to taking one or more samples. A well-chosen sample will contain most of the information
about a particular population parameter without the need to study the community as a whole;
this process is called sampling (Xindong et al., 2014; Berndt, 2020; Turner, 2020). So, the
goal is to generalize the results from the sample to the population (Singh and Masuku,
2014; Den-Broeck et al., 2013). Therefore, we must emphasize the importance of a good
choice of sample elements to make them representative of our population. A sample is said
to be representative when the original dataset is represented as faithfully as possible by
virtue of its characteristics and quantity (Singh and Masuku, 2014; Andrade, 2020; Lee
et al., 2020). There are also several sampling methods, both probabilistic and non-
probabilistic (Berndt, 2020; Etikan and Bala, 2017). In probability sampling, the first
important point is that each individual of the selected population must have
a known nonzero chance of selection, though not necessarily an equal one. We also want the
selection to be done independently; in other words, the selection of one individual will not
affect the chance that other individuals are chosen. We do this by selecting through a
process in which only chance acts, such as flipping one or more coins, usually using a set
of random numbers (Turner, 2020; Taherdoost, 2016; Robbins et al., 2020). The sample
chosen as such is called a random sample (West, 2016). The word "random" does not
describe the sample as such, but the way in which it is selected (Bréchon, 2015; Bhardwaj,
2019). If a sampling unit can be selected more than once, because it is placed back in the
population before the next unit is selected, this is called random sampling with
replacement. If a sampling unit can be selected only once, i.e. it is not replaced, this is
called random sampling without replacement (West, 2016; Antal and Tille, 2011).
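To make the distinction concrete, here is a minimal Python sketch of the two selection schemes; the population and the sample size are made-up values used only for illustration.

import random

population = list(range(100))   # 100 hypothetical sampling units
rng = random.Random(42)         # fixed seed so the draw is reproducible

# Random sampling WITH replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(population, k=10)

# Random sampling WITHOUT replacement: each unit is drawn at most once.
without_replacement = rng.sample(population, k=10)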
One of the most important methods of probability sampling is stratified sampling (Yadav and
Tailor, 2020), and we will deal with it in our work. This type of sampling divides the
population into non-overlapping subpopulations called strata (Howell et al., 2020); this
division works according to certain characteristics so that the units of a stratum are as
close as possible (K. Steven, 2012). Although one stratum may differ significantly
from another, a stratified sample with the required number of units from each population
stratum tends to be representative of the population as a whole. Stratified sampling is
unlikely to produce an absurd sample because it guarantees the relative presence of all the
different subgroups that make up the population (Etikan and Bala, 2017; Padilla et al.,
2017). Non-probability sampling, for its part, is generally based on subjective ideas. In other
words, the statistical sample is selected on the basis of personal judgment rather than random
selection, and this type of sampling does not guarantee an equal chance of selection for every
object of the target population (Iachan et al., 2019; Gravetter and Forzano, 2012; Moorley and
Shorten, 2014).
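For illustration, a widely used library realization of stratified random sampling is sketched below (a sketch only, not the implementation used in this paper): the stratify argument of scikit-learn's train_test_split keeps the per-class proportions of the population in the extracted subset. The synthetic, imbalanced dataset and the 10% sampling fraction are assumptions made for the example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class "population" (illustrative only).
X, y = make_classification(n_samples=10_000, n_classes=3, n_informative=5,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# Keep 10% of the data while preserving the per-class (per-stratum) proportions.
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.10,
                                      stratify=y, random_state=0)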
Big Data mining is a great source of information and knowledge for systems as well as
end users. However, managing such a large amount of data or knowledge requires
automation, which leads to serious thinking about the use of machine learning techniques.
Machine learning consists of many powerful algorithms for learning patterns, acquiring
knowledge and predicting future events. Specifically, these algorithms work by searching a
space of possible predictive models to capture the best relationship between the descriptive
features and the target function in the dataset. Based on this, the machine learning algorithm
makes its selection during the training process, and the clear criterion driving this choice
is the search for models that are compatible with the data (Erl et al., 2016; Bailly et al., 2018).
We can then use this model to make predictions for new cases (instances) (Klaine et al., 2017).
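The following minimal Python sketch illustrates this model-selection idea under stated assumptions: a small set of candidate classifiers is scored against the training data, the most data-compatible one is kept, and it is then used to predict new instances. The candidate models, the synthetic data and the cross-validation setup are illustrative choices, not those of the paper.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data; X_new stands in for unseen cases.
X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_new, y_train, _ = train_test_split(X, y, random_state=0)

# A small, assumed space of candidate predictive models.
candidates = [LogisticRegression(max_iter=1000),
              GaussianNB(),
              DecisionTreeClassifier(random_state=0)]

# "Search" the candidate space: keep the model most compatible with the data.
best = max(candidates,
           key=lambda m: cross_val_score(m, X_train, y_train, cv=5).mean())

# Use the selected model to predict new, unseen instances.
predictions = best.fit(X_train, y_train).predict(X_new)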
Therefore, machine learning, which is one of the subdomains of artificial intelligence,
