Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing

DOI: https://doi.org/10.1108/DTA-06-2021-0153
Published: 21 December 2021
Pages: 558-601
Subject matter: Library & information science; Librarianship/library management; Library technology; Information behaviour & retrieval; Metadata; Information & knowledge management; Information & communications technology; Internet
Author: Laouni Djafri
Ibn Khaldoun University, Tiaret, Algeria and
EEDIS Laboratory, Djillali Liabes University, Sidi Bel Abbes, Algeria
Abstract
Purpose: This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach: In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors' proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors raise in this work is how machine learning algorithms can be made to work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by a Map-Reduce algorithm, which in turn depends on a random sampling technique. The distributed architecture that the authors designed is specially directed at handling big data processing and operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB), the partial learning base for the first level (PLBL1) and the partial learning base for the second level (PLBL2). The experimental results show the efficiency of the proposed solution without significant loss in the classification results. Thus, in practical terms, the DDPML system is generally dedicated to big data mining processing and works effectively in distributed systems with a simple structure, such as client-server networks.
Findings: The authors got very satisfactory classification results.
Originality/value: The DDPML system is specially designed to smoothly handle big data mining classification.
Keywords Big data mining, Statistical sampling, Map-reduce, Machine learning, Distributed and parallel
processing, Big data platforms
Paper type Research paper
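Although the construction of the learning bases is detailed later in the article, the following minimal Python sketch illustrates one possible reading of the data-preparation flow described in the abstract: each data partition is sampled in a map-like step, and the partial results are merged and resampled in a reduce-like step to obtain a smaller representative learning base. The function names, the sampling fractions and the mapping of the two levels onto PLBL1, PLBL2 and RLB are illustrative assumptions, not the authors' exact implementation.

import random
from collections import defaultdict

def stratify(records, label_of):
    # Group records into strata by class label.
    strata = defaultdict(list)
    for r in records:
        strata[label_of(r)].append(r)
    return strata

def stratified_sample(strata, fraction, seed=None):
    # Proportional stratified random sampling without replacement.
    rng = random.Random(seed)
    sample = []
    for rows in strata.values():
        k = max(1, round(fraction * len(rows)))
        sample.extend(rng.sample(rows, k))
    return sample

def build_rlb(partitions, label_of, f1=0.1, f2=0.5):
    # Level 1 ("map"): each partition yields a partial learning base.
    level1 = []
    for part in partitions:
        level1.extend(stratified_sample(stratify(part, label_of), f1))
    # Level 2 ("reduce"): merge the partial bases and resample them
    # to obtain the representative learning base.
    return stratified_sample(stratify(level1, label_of), f2)

# Usage (hypothetical): partitions is a list of lists of (features, label) rows.
# rlb = build_rlb(partitions, label_of=lambda row: row[1])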
1. Introduction
Since the advent of the Internet to this day, we have seen explosive growth in the volume,
velocity and variety of data created daily (Sathyaraj et al., 2020); this amount of data is
generated by a variety of methods such as clickstream data, financial transaction data, log
files generated by web or mobile applications, sensor data from the Internet of Things (IoT),
in-game player activity and telemetry from connected devices, and many other methods
(O'Donovan et al., 2015; Hariri et al., 2019). This data is commonly referred to as Big Data
because of its volume, the velocity with which it arrives and the variety of forms it takes.
In 2001, Gartner proposed a three-dimensional or 3 Vs (volume, variety and velocity) view
of the challenges and opportunities associated with data growth (Chen et al., 2014). In 2012,
Gartner updated this report as follows: Big Data is high volume, high speed and/or wide
variety of information resources that require new forms of processing to improve decision-
making (Erl et al., 2016). Oftentimes, these Vs are supplemented by a fourth V, Veracity:
how accurate is the data (Chan, 2013; Roos et al., 2013)? We can extend this model to more
than ten Big Data dimensions, or Vs: volume, variety, velocity, veracity, value, variability,
validity, volatility, viability and viscosity (Hariri et al., 2019; Khan et al., 2018; Kayyali
et al., 2013; Katal et al., 2013; Ferguson, 2013; Ripon and Arif, 2016; IBM, 2014; Elgendy and
Elragal, 2014). Accordingly, the increasing digitization of our activities, the ever-
increasing ability to store digital data, and the accumulation of information of all kinds are
generating a new sector of activity aimed at analyzing these large amounts of data. This
leads to the emergence of new approaches, new methods, new knowledge and, ultimately,
undoubtedly, new ways of thinking and acting. Hence, this very large amount of data
must be exploited in order to better understand big data and how to extract knowledge
from it; this is known as big data mining (Cen et al., 2019; Dunren et al., 2013). Its main
purpose is to extract and retrieve desired information or patterns from a large amount of
data (Oussous et al., 2017). It is usually performed on a large amount of structured or
unstructured data using a combination of techniques that make it possible to explore
these large amounts of data, automatically or semi-automatically (Xindong et al., 2014;
Xingquan and Ian, 2007).
Every second, we see massive amounts of data growing exponentially, and this huge amount
of data carries no weight unless we extract the true value from it by extracting the
information and knowledge, or simply what we call big data mining. The real problem that we
currently face in big data mining is how to deal with this huge amount of data (volume).
How do we get the results in the shortest time (velocity)? And can we maintain
or improve the precision (veracity and validity) of the results after reducing the size? These
and other questions will be discussed in our article. But before we answer these questions, we
must understand very well the close relationship between these four characteristics (volume,
veracity, validity and velocity). We know very well that if the size is large (volume), the
precision (veracity and validity) will be high and the speed (velocity) will be low. If, on the
other hand, the size is small, the speed will be high and the precision may be low. Therefore,
our goal in this work is to reduce the size as far as possible in order to increase the speed as
much as possible. This speed will increase further if we use platforms and architectures
prepared for this purpose, provided that the precision of the results obtained is taken into
account.
First and foremost, if we want to reduce the volume of Big Data in a scientific and
correct way, then we definitely think about mathematical methods, especially
mathematical statistics (Che et al., 2013; Trovati and Bessis, 2015; Urrehman et al.,
2016). So what are the effective mathematical statistics methods that we must apply in
such cases to give very satisfactory results? On the other hand, if we want to speed up
processing time, we consider the application of parallel and distributed computing (Lu
and Zhan, 2020; Concurrency Computat: Pract. Exper., 2016; Brown et al., 2020) supported
by big data solutions (Jun et al., 2019; Zhang et al., 2017; Palanisamy and
Thirunavukarasu, 2019).
Today, big data mining relies mainly on statistical methods to tame and control the data
so that we can handle it comfortably. Thus, mathematical statistics is of prime importance
in data science and particularly in big data analytics (Bucchianico et al., 2019; Weihs and
Ickstadt, 2018; HLG-BAS, 2011). The reason may lie in its function: mathematical
statistics reveals correlations between statistical groups, and its role is pivotal in
reducing size to better understand data, and thus in extracting information with greater
precision and more quickly. As a result, statisticians and data science experts have turned
to the use of statistical sampling techniques (Rojas et al., 2017; Liu and Zhang, 2020; Mahmud
et al., 2020).
In the Big Data analytics context, we often work with small-scale sets (sub-datasets)
that are part of the original dataset. For this reason, we mainly use mathematical
statistics. In mathematical statistics, a population usually contains too many individuals
to study them all properly, and therefore a survey is often limited
to taking one or more samples. A well-chosen sample will contain most of the information
about a particular population parameter without the need to study the community as a whole;
this process is called sampling (Xindong et al., 2014; Berndt, 2020; Turner, 2020). So, the
goal is to generalize the results from the sample to the population (Singh and Masuku,
2014; Den-Broeck et al., 2013). Therefore, we must emphasize the importance of a good
choice of sample elements to make them representative of our population. A sample is said
to be representative when the original dataset is represented as faithfully as possible by
virtue of its characteristics and quantity (Singh and Masuku, 2014; Andrade, 2020; Lee
et al., 2020). There are also several sampling methods, both probabilistic and non-
probabilistic (Berndt, 2020; Etikan and Bala, 2017). In probability sampling, the first
important point is that each individual of the selected population must have
a known nonzero chance of selection, though not necessarily an equal one. We also want the
selection to be done independently; in other words, the selection of one individual will not
affect the chance that other individuals are chosen. We do this by selecting through a
process in which only chance acts, such as flipping one or more coins, usually using a set
of random numbers (Turner, 2020; Taherdoost, 2016; Robbins et al., 2020). The sample
chosen as such is called a random sample (West, 2016). The word "random" does not
describe the sample as such, but the way in which it is selected (Bréchon, 2015; Bhardwaj,
2019). If a sampling unit can be selected more than once, because it is placed back in the
population before the next unit is selected, this is called random sampling with
replacement. If a sampling unit can be selected only once, i.e. it is not replaced, this is
called random sampling without replacement (West, 2016; Antal and Tille, 2011).
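To make the distinction concrete, here is a minimal Python sketch of the two selection schemes; the population and the sample size are made-up values used only for illustration.

import random

population = list(range(100))   # 100 hypothetical sampling units
rng = random.Random(42)         # fixed seed so the draw is reproducible

# Random sampling WITH replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(population, k=10)

# Random sampling WITHOUT replacement: each unit is drawn at most once.
without_replacement = rng.sample(population, k=10)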
One of the most important methods of probability sampling is stratified sampling (Yadav and
Tailor, 2020), and we will deal with it in our work. This type of sampling divides the
population into non-overlapping subpopulations called strata (Howell et al., 2020); this
division works according to certain characteristics so that the units of a stratum are as
close as possible (K. Steven, 2012). Although one stratum may differ significantly
from another, a stratified sample with the required number of units from each population
stratum tends to be representative of the population as a whole. Stratified sampling is
unlikely to produce an absurd sample because it guarantees the relative presence of all the
different subgroups that make up the population (Etikan and Bala, 2017; Padilla et al.,
2017). Non-probability sampling, for its part, is generally based on subjective ideas. In other
words, the statistical sample is selected on the basis of personal judgment rather than random
selection, and this type of sampling does not guarantee an equal chance of selection for every
object of the target population (Iachan et al., 2019; Gravetter and Forzano, 2012; Moorley and
Shorten, 2014).
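For illustration, a widely used library realization of stratified random sampling is sketched below (a sketch only, not the implementation used in this paper): the stratify argument of scikit-learn's train_test_split keeps the per-class proportions of the population in the extracted subset. The synthetic, imbalanced dataset and the 10% sampling fraction are assumptions made for the example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class "population" (illustrative only).
X, y = make_classification(n_samples=10_000, n_classes=3, n_informative=5,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# Keep 10% of the data while preserving the per-class (per-stratum) proportions.
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.10,
                                      stratify=y, random_state=0)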
Big Data mining is a great source of information and knowledge for systems as well as
end users. However, managing such a large amount of data or knowledge requires
automation, which leads to serious thinking about the use of machine learning techniques.
Machine learning consists of many powerful algorithms for learning patterns, acquiring
knowledge and predicting future events. Specifically, these algorithms work by searching a
space of possible predictive models to capture the best relationship between the descriptive
features and the target function in the dataset. Based on this, the machine learning algorithm
makes its selection during the training process, and the clear criterion driving this choice
is the search for models that are compatible with the data (Erl et al., 2016; Bailly et al., 2018).
We can then use this model to make predictions for new cases (instances) (Klaine et al., 2017).
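The following minimal Python sketch illustrates this model-selection idea under stated assumptions: a small set of candidate classifiers is scored against the training data, the most data-compatible one is kept, and it is then used to predict new instances. The candidate models, the synthetic data and the cross-validation setup are illustrative choices, not those of the paper.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data; X_new stands in for unseen cases.
X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_new, y_train, _ = train_test_split(X, y, random_state=0)

# A small, assumed space of candidate predictive models.
candidates = [LogisticRegression(max_iter=1000),
              GaussianNB(),
              DecisionTreeClassifier(random_state=0)]

# "Search" the candidate space: keep the model most compatible with the data.
best = max(candidates,
           key=lambda m: cross_val_score(m, X_train, y_train, cv=5).mean())

# Use the selected model to predict new, unseen instances.
predictions = best.fit(X_train, y_train).predict(X_new)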
Therefore, machine learning, which is one of the subdomains of artificial intelligence,
