Indexing Arabic texts using association rule data mining

Library Hi Tech, Vol. 37 No. 1, 2019, pp. 101-117
DOI: https://doi.org/10.1108/LHT-07-2017-0147
Received 27 July 2017; revised 13 February 2018; accepted 6 April 2018; published 18 March 2019
Ramzi A. Haraty
Department of Computer Science and Mathematics,
Lebanese American University, Beirut, Lebanon, and
Rouba Nasrallah
Lebanese American University, Beirut, Lebanon
Abstract
Purpose: The purpose of this paper is to propose a new model to enhance the auto-indexing of Arabic texts. The model extracts new relevant words by relating those chosen by classical methods to new words using data mining rules.
Design/methodology/approach: The proposed model uses an association rule algorithm to extract frequent sets of related items, capturing relationships between words in the texts to be indexed and words from texts that belong to the same category. The extracted associations are represented as sets of words that frequently appear together.
Findings: The proposed methodology shows significant enhancement in terms of accuracy, efficiency and reliability when compared to previous works.
Research limitations/implications: The stemming algorithm can be further enhanced; the Arabic language has many grammatical rules, and the more of these rules are integrated into the stemming algorithm, the better the stemming will be. The stop-list can also be improved by adding more words that should not be considered in the indexing mechanism, including numbers, and by using a thesaurus system, which links different phrases or words with the same meaning and thereby improves the indexing mechanism. The authors also invite researchers to add more prerequisite texts to obtain better results.
Originality/value: In this paper, the authors present a full text-based auto-indexing method for Arabic text documents. The method extracts new relevant words by using data mining rules, which has not been investigated before. It uses an association rule mining algorithm to extract frequent sets of related items and thereby derive relationships between words in the texts to be indexed and words from texts that belong to the same category. The benefits of the method are demonstrated using empirical work involving several Arabic texts.
Keywords Precision, Recall, Arabic text, Auto-indexing, Frequent sets, Rule-based data mining
Paper type Research paper
1. Introduction
Indexing text documents consists of analyzing the content of a text in order to determine its subject. It is an important task in information retrieval, a field central to computer science because of the need to explore different topics in our daily lives. For instance, most of us use Google, a search engine, to search for and retrieve information about different topics.
Manual indexing is difficult and requires immense effort from a human expert, who must read the entire text and analyze it holistically. The longer the text, the more time is needed.
Text indexing is needed in many domains and for different types of texts: articles in newspapers and magazines, online articles, archiving, documentation, e-mail spam detection, web page content filtering and automatic message routing. Most importantly, it is used for information retrieval.
There are two types of indexing techniques: thesaurus-based indexing and full text-based indexing (Khoja, 2001). In the thesaurus-based technique, the words chosen to represent the document need not appear in the text; only their synonyms do. The synonyms might be chosen by the documenter when searches for the text are usually carried out using the
synonym and not the exact word. Note that a synonym here is not restricted to the dictionary equivalent of a term. For example, if a text is about a football player, his or her name can serve as an index term under the thesaurus system. The drawback of thesaurus-based indexing is the difficulty of implementation, because it requires human intervention: the implementation depends on a file containing the synonyms of the terms, and this file must be updated manually and continuously. Full text-based indexing, by contrast, relies on choosing terms or phrases that already exist in the text, which is a much simpler concept.
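To make the contrast concrete, the following Python sketch illustrates the two styles. It is only a minimal illustration under assumed inputs (a set of terms found in the text, a hand-maintained synonym file and a relevance test); the names are ours, not the paper's.

def thesaurus_index(text_terms, synonym_file):
    """Thesaurus-based: map terms found in the text to controlled index
    terms that need not appear in the text themselves (the synonym file
    must be maintained manually)."""
    index = set()
    for term in text_terms:
        index.update(synonym_file.get(term, []))
    return index

def full_text_index(text_terms, is_relevant):
    """Full text-based: keep only terms that already occur in the text
    and pass a relevance test (e.g. not a stop word)."""
    return {term for term in text_terms if is_relevant(term)}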
The significance of this work is that the documenter no longer has to read the document, choose the keywords and build the subject of the document or the subject headings; instead, a computer program picks the key terms of the text automatically and returns them to the documenter. This facilitates the work and spares the documenter time and effort, so the goal is to automate one part of the process. In other words, the proposed solution extends the set of relevant words extracted from the text using relations between words extracted from prerequisite texts of the same category as the text to be indexed.
In this paper, we present a full text-based auto-indexing method for Arabic text documents. This is an important task, since many Arabic publications, such as books, newspaper articles and online blogs, require analysis and indexing. Our auto-indexing method extracts new relevant words by using data mining rules. The method uses an association rule mining algorithm to extract frequent sets of related items and thereby derive relationships between words in the texts to be indexed and words from texts that belong to the same category. The benefits of our method are demonstrated using empirical work involving several Arabic texts.
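As a rough illustration of this idea (a sketch under our own assumptions, not the authors' implementation), the Python code below mines frequent word pairs, Apriori-style, from the stem sets of previously indexed texts of one category and uses them to extend an initial set of index terms; the example stem sets, the support threshold and all function names are illustrative.

from collections import Counter
from itertools import combinations

def frequent_pairs(doc_stem_sets, min_support):
    """Count stem pairs that co-occur in documents of one category and
    keep those whose document support meets the threshold."""
    counts = Counter()
    for stems in doc_stem_sets:
        for pair in combinations(sorted(stems), 2):
            counts[pair] += 1
    n_docs = len(doc_stem_sets)
    return {pair for pair, c in counts.items() if c / n_docs >= min_support}

def extend_index(index_terms, frequent):
    """Add to the index any stem that is frequently associated with a
    stem already selected as an index term."""
    extended = set(index_terms)
    for a, b in frequent:
        if a in index_terms:
            extended.add(b)
        if b in index_terms:
            extended.add(a)
    return extended

# Illustrative usage: stem sets from two previously indexed texts of the
# same category, then extension of a one-term index.
category_docs = [{"علم", "حاسوب", "خوارزمية"}, {"علم", "خوارزمية", "بيانات"}]
pairs = frequent_pairs(category_docs, min_support=0.5)
index = extend_index({"خوارزمية"}, pairs)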
The rest of the paper is organized as follows: Section 2 provides a review of the literature and related work. Section 3 discusses the processing steps of the algorithm. Section 4 outlines stem word extraction. Section 5 delineates weight calculation and index extraction. Section 6 presents the experimental results and Section 7 concludes the paper.
2. Literature review
Over the years, a number of classification methods have been proposed for text categorization. These methods are based on classification algorithms such as Naïve Bayes, proposed by McCallum and Nigam (1998), decision trees by Sahami et al. (1998), neural networks by Joachims (1998) and Harrag and El-Qawasmeh (2009), and the Support Vector Machine (SVM) by Dumais (1998) and Turney and Pantel (2010). Yang and Liu (1999) compared some of these methods, and the results show that SVM performs best for document classification.
As for the Arabic language, numerous works have addressed text processing. Haraty and Khatib (2005) introduced a procedure that extracts temporal elements from a document. Haraty and Hamid (2002) presented a technique to segment handwritten Arabic text, and Haraty and Ghaddar (2004) put forward two neural networks to classify already segmented characters of handwritten Arabic text. Al-Harbi et al. (2008) discussed automated document classification, considering it an important text mining task; text classification aims to automatically assign a text to a predefined category. The authors proposed a solution based on linguistic features: they generated the frequencies of the lexical features they extracted and then calculated the importance of each feature locally (for each class) based on the χ² statistic. Seven data sets were used, including essays, pieces of literature, poems, web pages, forums and others, and each data set contained different classes. The assembled corpus comprised 17,658 texts with more than 11,500,000 words. A tool was implemented to extract features, and SVM and C5.0 (classification algorithms) were used to decide which class the tested texts belong to. The results showed that, in general, the C5.0 classifier is more accurate.
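For reference, the per-class χ² score mentioned above can be computed from a 2x2 contingency table of feature and class membership; the following short Python function is a generic illustration and is not taken from Al-Harbi et al.'s tool.

def chi_square(n11, n10, n01, n00):
    """Chi-square score of one feature for one class.
    n11: documents of the class containing the feature
    n10: documents of other classes containing the feature
    n01: documents of the class not containing the feature
    n00: documents of other classes not containing the feature"""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0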
In Al-Anzi and AbuZeina (2017), the authors used singular value decomposition as a feature reduction technique, as well as for producing semantically rich features and to truncate the