Exploring the effectiveness of word embedding based deep learning model for improving email classification

DOI: https://doi.org/10.1108/DTA-07-2021-0191
Published date: 02 February 2022
Pages: 483-505
Subject Matter: Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Author: Deepak Suresh Asudani, Naresh Kumar Nagwani, Pradeep Singh
Deepak Suresh Asudani, Naresh Kumar Nagwani and Pradeep Singh
Department of Computer Science and Engineering,
National Institute of Technology Raipur, Raipur, India
Abstract
Purpose: Classifying emails as ham or spam based on their content is essential. Determining the semantic
and syntactic meaning of words and representing them as high-dimensional feature vectors for processing is
the most difficult challenge in email categorization. The purpose of this paper is to examine the effectiveness of
pre-trained embedding models for the classification of emails using deep learning classifiers such as the long
short-term memory (LSTM) model and the convolutional neural network (CNN) model.
Design/methodology/approach: In this paper, global vectors (GloVe) and Bidirectional Encoder
Representations from Transformers (BERT) pre-trained word embeddings are used to identify relationships between
words, which helps to classify emails into their relevant categories using machine learning and deep learning
models. Two benchmark datasets, SpamAssassin and Enron, are used in the experimentation.
Findings: In the first set of experiments, among the machine learning classifiers, the support vector machine (SVM)
model performs better than the other machine learning methodologies. The second set of experiments compares
deep learning model performance without embedding, with GloVe embedding and with BERT embedding. The
experiments show that GloVe embedding can be helpful for faster execution with better performance on large-sized datasets.
Originality/value: The experiments reveal that the CNN model with GloVe embedding gives slightly better
accuracy than the model with BERT embedding and traditional machine learning algorithms in classifying an
email as ham or spam. It is concluded that word embedding models improve the accuracy of email classifiers.
Keywords: Email classification, Machine learning, Word embedding, GloVe, BERT, Deep learning
Paper type: Research paper
1. Introduction
Email has become a standard means of personal and professional communication as technology
has advanced. Email marketing helps expand businesses by addressing existing and new
consumers with updated offers and services. In the current pandemic climate, digital
communication demonstrates its value by facilitating communication between business
owners and customers.
According to a survey, the number of emails sent and received internationally has climbed each
year as the Internet has become more accessible. In 2020, more than 306.4 billion emails were sent
and received per day, and this figure is expected to rise to over 376.4 billion per day by 2025 (Statista, 2021).
The shift from offline to online education, sales and support services has resulted in a rise in
cyber-attacks on students and employees working remotely due to the lack of security
safeguards. According to a Kaspersky report released on 1 March 2021, 45% of online users in
India were targeted by local risks in 2020, and this number is expected to rise in 2021
(Kaspersky, 2021).
The authors thank the editor and the anonymous reviewers for their insightful and valuable comments
and suggestions. The authors gratefully acknowledge the National Institute of Technology, Raipur, for
providing the GPU server used for this research.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/2514-9288.htm
Received 24 July 2021
Revised 13 October 2021
25 December 2021
Accepted 8 January 2022
Data Technologies and
Applications
Vol. 56 No. 4, 2022
pp. 483-505
© Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-07-2021-0191
With the increase in email communication, there is a hidden risk that hackers may try to
target victims, steal vital information or redirect them to a false website, significantly
impacting the user. Thus, classifying an email as ham or spam is critical, and efficient email
classification is required to properly categorize a massive volume of email into multiple
categories depending on its content within a specific time frame (Dada et al., 2019).
The objective of the presented work is to explore whether word embedding can help improve
the email classification task compared to the traditional machine learning approach.
The simplest way to perform email classification is to create rules using specific keywords and
categorize new emails according to those rules. A rule-based system fails to capture diverse
keywords and may misclassify spam email as ham. Machine learning classifiers
can be trained to classify new emails into their respective categories
effectively. However, these classifiers are not designed to capture the semantic meaning of words,
which is of greater importance in classification (Hajek et al., 2020). Therefore, in certain situations
they fail to correctly classify emails with varied formatting styles.
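The weakness of a keyword rule can be illustrated with a minimal sketch. The keyword list, the function name `rule_based_label` and the two sample messages below are hypothetical and chosen only for illustration; they are not taken from the experiments in this paper.

```python
# Hypothetical keyword list; the paper does not specify any concrete rules.
SPAM_KEYWORDS = {"free", "winner", "lottery", "prize"}


def rule_based_label(email_text: str) -> str:
    """Label an email 'spam' if it contains any listed keyword, else 'ham'."""
    # Lowercase each whitespace-separated token and strip common punctuation.
    tokens = {token.strip(".,!?:;").lower() for token in email_text.split()}
    return "spam" if tokens & SPAM_KEYWORDS else "ham"


# A message that uses the exact keywords is caught by the rule.
caught = rule_based_label("You are a WINNER! Claim your free prize now.")

# An obfuscated variant with the same intent slips through as ham,
# because none of its tokens match the keyword list exactly.
missed = rule_based_label("Y0u w0n the l0ttery, cl4im your rew4rd today.")

print(caught, missed)
```

The second message carries the same intent as the first but shares no exact keyword with the rule set, which is the kind of misclassification described above.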
Another option is to employ a recurrent neural network (RNN) variant, such as the long short-term
memory (LSTM) model with pre-trained word embedding, to capture long-term sequence
dependency information. Pre-trained word embedding is a technique that uses a vast corpus to
represent a word as a high-dimensional feature vector based on its semantic significance,
which can help the model identify emails more accurately (Moreo et al., 2021).
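As a rough illustration of how an embedding encodes semantic relatedness, the sketch below looks up word vectors and compares them with cosine similarity. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented for this example; real pre-trained GloVe vectors are much higher dimensional and are loaded from a published file. The helper `embed_sequence` is a hypothetical name showing the kind of vector sequence a sequence model such as an LSTM or CNN would receive as input.

```python
import math

# Invented 3-dimensional vectors for illustration only; these are NOT
# real GloVe or BERT values.
TOY_EMBEDDINGS = {
    "offer": [0.9, 0.1, 0.0],
    "discount": [0.8, 0.2, 0.1],
    "meeting": [0.1, 0.9, 0.2],
    "agenda": [0.0, 0.8, 0.3],
}
EMBEDDING_DIM = 3


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embed_sequence(tokens, embeddings=TOY_EMBEDDINGS, dim=EMBEDDING_DIM):
    """Map each token to its vector; out-of-vocabulary tokens get a zero vector."""
    return [embeddings.get(token, [0.0] * dim) for token in tokens]


# Words used in similar contexts should have a higher cosine similarity.
related = cosine_similarity(TOY_EMBEDDINGS["offer"], TOY_EMBEDDINGS["discount"])
unrelated = cosine_similarity(TOY_EMBEDDINGS["offer"], TOY_EMBEDDINGS["meeting"])

# A tokenized email becomes a sequence of vectors.
sequence = embed_sequence(["discount", "offer", "today"])
print(related > unrelated, len(sequence))
```

Mapping unknown tokens to a zero vector is one simple convention assumed here; other choices, such as a dedicated unknown-word vector, are equally possible.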
This study aims to use machine learning and deep learning methodologies to create a
framework for categorizing email as ham or spam. Traditional machine learning approaches
such as support vector machine (SVM), random forest (RF), logistic regression (LR), Gaussian
Naive Bayes (GNB), multinomial Naive Bayes (MNB) and AdaBoost, as well as the LSTM and
convolutional neural network (CNN) deep learning models with and without pre-trained
embedding, are utilized to find the optimum model for classification. The following are some
of the significant contributions to enhancing email classification performance:
• Explore how applying machine learning and deep learning models to a dataset with and
without pre-trained embedding can improve classification performance.
• Use global vectors (GloVe) and Bidirectional Encoder Representations from Transformers
(BERT) pre-trained word embeddings to find word relationships, which aids in
the classification of emails into appropriate categories.
• Evaluate the performance of machine learning and deep learning models with and
without GloVe and BERT pre-trained embedding by computing the confusion matrix,
accuracy, precision, recall, F1-score and execution time with 10-fold cross-validation.
• Show that the CNN model with GloVe embedding gives slightly better
accuracy than the model with BERT embedding and traditional machine learning
algorithms in classifying email as ham or spam.
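The evaluation measures named above can be derived from the confusion matrix using their standard definitions. The following is a minimal sketch, assuming spam is treated as the positive class; the function names and the small label lists are hypothetical, and the fold generator uses plain contiguous splits without shuffling or stratification, which may differ from the procedure used in the experiments.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1-score from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical labels for six emails.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
folds = list(k_fold_indices(25, k=10))
print(counts, metrics, len(folds))
```

In a cross-validated run, the metrics would be computed on the test indices of each fold and then averaged over the ten folds.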
The remaining sections of the paper are organized as follows: Section 2 summarizes relevant
studies on email classification using machine learning and deep learning models. Section 3
focuses on the background of the experimental setup. In Section 4, the experimental results
are discussed, followed by a conclusion on model performance in Section 5.
2. Literature review
This section summarizes notable studies on email classification using machine learning and
deep learning models. In business, email is considered an official means of communication. For consumers,
it is an account for sending greetings, a mediator between social media accounts and online
shopping accounts, and a way to maintain easy access to their valuable documents. In email