Big Data analytics for prediction: parallel processing of the big learning base with the possibility of improving the final result of the prediction

Pages: 147-160
DOI: https://doi.org/10.1108/IDD-02-2018-0002
Published date: 20 August 2018
Author: Laouni Djafri, Djamel Amar Bensaber, Reda Adjoudj
Subject Matter: Library & information science, Library & information services, Lending, Document delivery, Collection building & management, Stock revision, Consortia
Laouni Djafri
Department of Computer Science, Djillali Liabes University, EEDIS laboratory -univ-SBA-Algeria, Sidi Bel-Abbes, Algeria
Djamel Amar Bensaber
Superior School of Computer Science, LabRI laboratory -ESI-SBA-Algeria, Sidi Bel-Abbes, Algeria, and
Reda Adjoudj
Department of Computer Science, Djillali Liabes University, EEDIS laboratory -univ-SBA-Algeria, Sidi Bel-Abbes, Algeria
Abstract
Purpose: This paper aims to solve the problems of big data analytics for prediction, including volume, veracity and velocity, by improving the prediction result to an acceptable level in the shortest possible time.
Design/methodology/approach: This paper is divided into two parts. The first part aims to improve the result of the prediction. It proposes two ideas: the double pruning enhanced random forest algorithm, and the extraction of a shared learning base using the stratified random sampling method to obtain a learning base that is representative of all the original data. The second part proposes a distributed architecture supported by new technology solutions, which works coherently and efficiently with the sampling strategy under the supervision of the Map-Reduce algorithm.
Findings: The representative learning base, obtained by integrating two learning bases (the partial base and the shared base), represents the original data set very well and gives very good results for Big Data predictive analytics. Furthermore, these results were supported by the improved random forests supervised learning method, which played a key role in this context.
Originality/value: The approach concerns all companies, especially those that hold large amounts of information and want to screen it to improve their knowledge of the customer and optimize their campaigns.
Keywords: Big Data analytics, Sampling, Random forests, Apache Spark, Apache ZooKeeper, Parallel processing
Paper type: Research paper
1. Introduction
The computer science world is abuzz over an explosion of new sources of diverse data with fine granularity and low latency, a phenomenon known as Big Data. Such data exceeds the capacity of traditional databases to store, process, analyze and compute. Big Data requires advanced methods and powerful technologies that can be applied to analyze heterogeneous and complex data and extract predictive models from it. It is also characterized mainly by the three Vs: volume, variety and velocity (Furht and Villanustre, 2016; Laney, 2001).
Moreover, an important factor in the analysis of massive and complex data is data visualization. It allows managers to quickly understand relationships and results that are otherwise not easily visible. It also makes it possible to create an infographic, interactive or not, or a representation in the form of data mapping (Brail and Klosterman, 2001; Kohavi et al., 2002).
Global economic enterprises seek to exploit the Big Data available on the internet in order to explore the open data of social media, such as logs, tweets and social networks (34 per cent), as well as Web logs and click streams (31 per cent) (Russom, 2011). The amount of this data was expected to reach 35 zettabytes by the year 2020, with Twitter alone generating more than 7 terabytes a day and Facebook 10 terabytes a day (Zikopoulos and Eaton, 2011). According to a study conducted in 2012 by IBM, 2.5 billion bytes of data are produced daily via the Internet, where Facebook has more than 2.5 billion likes and 300 million downloads of photos (He et al., 2015). Big Data analysis seeks to develop products, create new services and improve business. In a study conducted by He et al. on a wide range of data, the authors used tweets about the world's two largest retail chains (Costco and Walmart), where they compared
The current issue and full text archive of this journal is available on
Emerald Insight at: www.emeraldinsight.com/2398-6247.htm
Information Discovery and Delivery
46/3 (2018) 147-160
© Emerald Publishing Limited [ISSN 2398-6247]
[DOI 10.1108/IDD-02-2018-0002]
Received 4 February 2018
Revised 6 March 2018, 3 April 2018, 10 May 2018, 2 June 2018
Accepted 3 June 2018