Indexing Arabic texts using association rule data mining

Library Hi Tech, Vol. 37 No. 1, 2019, pp. 101-117
DOI: https://doi.org/10.1108/LHT-07-2017-0147
Received 27 July 2017; revised 13 February 2018; accepted 6 April 2018; published 18 March 2019
Ramzi A. Haraty
Department of Computer Science and Mathematics,
Lebanese American University, Beirut, Lebanon, and
Rouba Nasrallah
Lebanese American University, Beirut, Lebanon
Abstract
Purpose: The purpose of this paper is to propose a new model to enhance the auto-indexing of Arabic texts. The model extracts new relevant words by relating those chosen by classical methods to new words using data mining rules.
Design/methodology/approach: The proposed model uses an association rule algorithm to extract frequent sets of related items, capturing relationships between words in the texts to be indexed and words from texts that belong to the same category. The extracted associations are represented as sets of words that frequently appear together.
Findings: The proposed methodology shows significant enhancement in terms of accuracy, efficiency and reliability when compared to previous works.
Research limitations/implications: The stemming algorithm can be further enhanced; the Arabic language has many grammatical rules, and the more of these rules are integrated into the stemming algorithm, the better the stemming will be. The stop-list can also be improved by adding more words that should not be considered in the indexing mechanism, including numbers, and by using a thesaurus system, which links different phrases or words with the same meaning and thereby improves the indexing mechanism. The authors also invite researchers to add more prerequisite texts to obtain better results.
Originality/value: In this paper, the authors present a full text-based auto-indexing method for Arabic text documents. The method extracts new relevant words by using data mining rules, which has not been investigated before. It uses an association rule mining algorithm to extract frequent sets of related items and thereby derive relationships between words in the texts to be indexed and words from texts that belong to the same category. The benefits of the method are demonstrated using empirical work involving several Arabic texts.
Keywords Precision, Recall, Arabic text, Auto-indexing, Frequent sets, Rule-based data mining
Paper type Research paper
1. Introduction
Indexing text documents consists of analyzing the content of a text in order to determine its subject. It is an important task in information retrieval, a field central to computer science because of the need to explore different topics in our daily lives. For instance, most of us use Google, a search engine, to search for and retrieve information about different topics.
Manual indexing is difficult and requires immense effort from a human expert, who must read the entire text and analyze it holistically. The longer the text, the more time is needed.
Text indexing is needed in many domains and for different types of texts: articles in newspapers and magazines, online articles, archiving, documentation, e-mail spam detection, web page content filtering and automatic message routing. Most importantly, it is used for information retrieval.
There are two types of indexing techniques: thesaurus-based indexing and full text-based indexing (Khoja, 2001). In the thesaurus-based technique, the words chosen to represent the document need not appear in the text; only their synonyms do. The synonyms might be chosen by the documenter when searches for the text are usually carried out using the
synonym and not the exact word. Note that a synonym here is not restricted to the dictionary equivalent of a term. For example, if a text is about a football player, his or her name can serve as an index term under the thesaurus system. The drawback of thesaurus-based indexing is the difficulty of implementation, because it requires human intervention: the implementation depends on a file containing the synonyms of the terms, and this file must be updated manually and continuously. Full text-based indexing, by contrast, relies on choosing terms or phrases that already exist in the text, which is a much simpler concept.
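To make the contrast concrete, the following Python sketch illustrates the two styles. It is only a minimal illustration under assumed inputs (a set of terms found in the text, a hand-maintained synonym file and a relevance test); the names are ours, not the paper's.

def thesaurus_index(text_terms, synonym_file):
    """Thesaurus-based: map terms found in the text to controlled index
    terms that need not appear in the text themselves (the synonym file
    must be maintained manually)."""
    index = set()
    for term in text_terms:
        index.update(synonym_file.get(term, []))
    return index

def full_text_index(text_terms, is_relevant):
    """Full text-based: keep only terms that already occur in the text
    and pass a relevance test (e.g. not a stop word)."""
    return {term for term in text_terms if is_relevant(term)}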
The significance of this work is that the documenter no longer has to read the document, choose the keywords and build the subject of the document or the subject headings; instead, a computer program picks the key terms of the text automatically and returns them to the documenter. This facilitates the work and spares the documenter time and effort, so the goal is to automate one part of the process. In other words, the proposed solution extends the set of relevant words extracted from the text using relations between words extracted from prerequisite texts of the same category as the text to be indexed.
In this paper, we present a full text-based auto-indexing method for Arabic text documents. This is an important task, since many Arabic publications, such as books, newspaper articles and online blogs, require analysis and indexing. Our auto-indexing method extracts new relevant words by using data mining rules. The method uses an association rule mining algorithm to extract frequent sets of related items and thereby derive relationships between words in the texts to be indexed and words from texts that belong to the same category. The benefits of our method are demonstrated using empirical work involving several Arabic texts.
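As a rough illustration of this idea (a sketch under our own assumptions, not the authors' implementation), the Python code below mines frequent word pairs, Apriori-style, from the stem sets of previously indexed texts of one category and uses them to extend an initial set of index terms; the example stem sets, the support threshold and all function names are illustrative.

from collections import Counter
from itertools import combinations

def frequent_pairs(doc_stem_sets, min_support):
    """Count stem pairs that co-occur in documents of one category and
    keep those whose document support meets the threshold."""
    counts = Counter()
    for stems in doc_stem_sets:
        for pair in combinations(sorted(stems), 2):
            counts[pair] += 1
    n_docs = len(doc_stem_sets)
    return {pair for pair, c in counts.items() if c / n_docs >= min_support}

def extend_index(index_terms, frequent):
    """Add to the index any stem that is frequently associated with a
    stem already selected as an index term."""
    extended = set(index_terms)
    for a, b in frequent:
        if a in index_terms:
            extended.add(b)
        if b in index_terms:
            extended.add(a)
    return extended

# Illustrative usage: stem sets from two previously indexed texts of the
# same category, then extension of a one-term index.
category_docs = [{"علم", "حاسوب", "خوارزمية"}, {"علم", "خوارزمية", "بيانات"}]
pairs = frequent_pairs(category_docs, min_support=0.5)
index = extend_index({"خوارزمية"}, pairs)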
The rest of the paper is organized as follows: Section 2 provides a review of the literature and related work. Section 3 discusses the processing steps of the algorithm. Section 4 outlines stem word extraction. Section 5 delineates weight calculation and index extraction. Section 6 presents the experimental results and Section 7 concludes the paper.
2. Literature review
Over the years, a number of classification methods have been proposed for text categorization. These methods are based on classification algorithms such as Naïve Bayes, proposed by McCallum and Nigam (1998), decision trees by Sahami et al. (1998), neural networks by Joachims (1998) and Harrag and El-Qawasmeh (2009), and the Support Vector Machine (SVM) by Dumais (1998) and Turney and Pantel (2010). Yang and Liu (1999) compared some of these methods, and the results show that SVM performs best for document classification.
As for the Arabic language, numerous works have addressed text processing. Haraty and Khatib (2005) introduced a procedure that extracts temporal elements from a document. Haraty and Hamid (2002) presented a technique to segment handwritten Arabic text, and Haraty and Ghaddar (2004) put forward two neural networks to classify already segmented characters of handwritten Arabic text. Al-Harbi et al. (2008) discussed automated document classification, considering it an important text mining task; text classification aims to automatically assign a text to a predefined category. The authors proposed a solution based on linguistic features: they generated the frequencies of the lexical features they extracted and then calculated the importance of each feature locally (for each class) based on the χ² statistic. Seven data sets were used, including essays, pieces of literature, poems, web pages, forums and others, and each data set contained different classes. The assembled corpus comprised 17,658 texts with more than 11,500,000 words. A tool was implemented to extract features, and SVM and C5.0 (classification algorithms) were used to decide which class the tested texts belong to. The results showed that, in general, the C5.0 classifier is more accurate.
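For reference, the per-class χ² score mentioned above can be computed from a 2x2 contingency table of feature and class membership; the following short Python function is a generic illustration and is not taken from Al-Harbi et al.'s tool.

def chi_square(n11, n10, n01, n00):
    """Chi-square score of one feature for one class.
    n11: documents of the class containing the feature
    n10: documents of other classes containing the feature
    n01: documents of the class not containing the feature
    n00: documents of other classes not containing the feature"""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0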
In Al-Anzi and AbuZeina (2017), the authors used singular value decomposition as a feature reduction technique, as well as for producing semantically rich features and to truncate the