Stylometric analysis of classical Arabic texts for genre detection

Date: 01 October 2018
DOI: https://doi.org/10.1108/EL-11-2017-0236
Pages: 842-855
Published date: 01 October 2018
Author: Maha Al-Yahya
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Maha Al-Yahya
Department of Information Technology, King Saud University, Riyadh,
Saudi Arabia
Abstract
Purpose: In the context of information retrieval, the genre of a text is as important as its content, and
knowledge of the text genre enhances search engine features by providing customized retrieval. The purpose
of this study is to explore and evaluate the use of stylometric analysis, a quantitative analysis of the
linguistic features of text, to support the task of automated text genre detection for Classical Arabic text.
Design/methodology/approach: Unsupervised clustering and supervised classification were applied
to the King Saud University Corpus of Classical Arabic texts (KSUCCA) using the most frequent words in the
corpus (MFWs) as stylometric features. Four popular distance measures established in stylometric research
are evaluated for the genre detection task.
Findings: The results of the experiments show that stylometry-based genre clustering and classification
align well with human-defined genre. The evidence suggests that genre style signals exist for Classical Arabic
and can be used to support the task of automated genre detection.
Originality/value: This work targets the task of genre detection in Classical Arabic text using
stylometric features, an approach that has previously been applied only to Arabic authorship attribution. The
study also provides a comparison of four distance measures used in stylometric analysis on the KSUCCA, a
corpus with over 50 million words of Classical Arabic, using clustering and classification.
Keywords: Stylometric analysis, Genre detection, Classical Arabic text, Distance measure
Paper type: Research paper
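The approach summarized in the abstract (most frequent words as features, a distance measure between documents, then grouping by genre) can be illustrated with a minimal sketch. This excerpt does not name the four distance measures, so the sketch assumes Burrows's Delta, a measure widely used in stylometry, purely as an example. The function names, the whitespace tokenization, the toy English documents standing in for Classical Arabic texts and the 1-nearest-neighbour rule are all hypothetical choices, not the study's implementation.

```python
from collections import Counter
from statistics import mean, pstdev


def mfw_zscores(docs, n_mfw):
    """Z-scored relative frequencies of the corpus-wide most frequent words."""
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
    corpus_counts = Counter(tok for toks in tokenized for tok in toks)
    vocab = [word for word, _ in corpus_counts.most_common(n_mfw)]

    # Relative frequency of each MFW in each document.
    rel = []
    for toks in tokenized:
        counts = Counter(toks)
        rel.append([counts[w] / len(toks) for w in vocab])

    # Standardize each word's frequency across documents.
    z = [[0.0] * len(vocab) for _ in docs]
    for j in range(len(vocab)):
        column = [row[j] for row in rel]
        mu, sd = mean(column), pstdev(column)
        for i, row in enumerate(rel):
            z[i][j] = (row[j] - mu) / sd if sd > 0 else 0.0
    return vocab, z


def burrows_delta(a, b):
    """Mean absolute difference between two z-score profiles (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def nearest_label(z, labels, query):
    """Genre label of the closest other document (1-nearest-neighbour)."""
    others = [i for i in range(len(z)) if i != query]
    best = min(others, key=lambda i: burrows_delta(z[query], z[i]))
    return labels[best]


# Toy documents and genre labels, invented for illustration only.
docs = [
    "the king said the law and the rule of the land",
    "the judge said the law of the court and the rule",
    "o my heart o my love my night o moon",
    "o moon o my night my heart my love o",
]
labels = ["law", "law", "poetry", "poetry"]

vocab, z = mfw_zscores(docs, n_mfw=12)
print(nearest_label(z, labels, 0), nearest_label(z, labels, 2))
```

The same pairwise distances could feed a hierarchical clustering routine for the unsupervised setting; the nearest-neighbour rule here is only the simplest supervised stand-in.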
1. Introduction
Stylometry is a measure of language style. It is defined as the statistical analysis of
variations in literary style between one writer or genre and another (OED, 2017). The term
was originally coined by Lutoslawski in 1896 (Lauer and Jannidis, 2014; Pawlowski and
Pacewicz, 2004), and the approach has become popular for research on authorship
attribution (Holmes and Kardos, 2003; Juola, 2006). Stylometry, however, can also be applied
to other problems in text analysis, including forensic linguistics (Afroz et al., 2012; Rocha
et al., 2017), plagiarism detection (Ramnial et al., 2016), chronology studies observing the
developing voice of an author over a period of years (Juola, 2007), stylistic inconsistencies in
collaborative writing (Glover and Hirst, 1995), literary influence (Jockers, 2013) and genre
detection (Jockers, 2013).
Genre is defined as a type of communication which is denoted by a socially accepted
purpose and a common form (Yates and Orlikowski, 1992). Genres are useful, as they make
documents easy to understand, thus reducing mental effort (Crowston and Kwasnik, 2003).
In the context of the organization of information and information retrieval, document genre
is as important as the content of the document, and knowledge of document genre enables
the enhancement of search engine capabilities by providing customized retrieval.
Genre detection is an important task for knowledge organization and retrieval
(Andersen, 2008), and it aims to group and organize texts based on defined similarities
Received 11 November 2017
Revised 4 March 2018, 4 May 2018
Accepted 7 May 2018
The Electronic Library
Vol. 36 No. 5, 2018
pp. 842-855
© Emerald Publishing Limited
0264-0473
DOI 10.1108/EL-11-2017-0236
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
