KEFST: a knowledge extraction framework using finite-state transducers

Pages: 365-384
Date: 01 April 2019
DOI: https://doi.org/10.1108/EL-10-2018-0196
Published date: 01 April 2019
Author: Ahsan Mahmood, Hikmat Ullah Khan, Zahoor Ur Rehman, Khalid Iqbal, Ch. Muhmmad Shahzad Faisal
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Ahsan Mahmood
Department of Computer Science, COMSATS University Islamabad,
Attock, Pakistan
Hikmat Ullah Khan
COMSATS University Islamabad, Attock, Pakistan, and
Zahoor Ur Rehman, Khalid Iqbal and Ch. Muhmmad Shahzad Faisal
Department of Computer Science, COMSATS University Islamabad,
Attock, Pakistan
Abstract
Purpose: The purpose of this research study is to extract and identify named entities from hadith
literature. Named entity recognition (NER) refers to identifying the named entities in
computer-readable text and annotating them with category tags for information extraction. NER is an active
research area in information management and information retrieval systems. NER serves as a baseline for
machines to understand the context of a given content and helps in knowledge extraction. Although NER is
considered a solved task in major languages such as English, in languages such as Urdu, NER is still a
challenging task. Moreover, NER depends on the language and domain of study; thus, it is gaining the
attention of researchers in different domains.
Design/methodology/approach: This paper proposes a knowledge extraction framework using
finite-state transducers (FSTs), KEFST, to extract the named entities. KEFST consists of five steps: content
extraction, tokenization, part of speech tagging, multi-word detection and NER. An extensive empirical
analysis using the data corpus of the Urdu translation of Sahih Al-Bukhari, a widely known hadith book, reveals
that the proposed method effectively recognizes the entities to obtain better results.
Findings: The significant performance in terms of f-measure, precision and recall validates that the
proposed model outperforms the existing methods for NER in the relevant literature.
Originality/value: This research is novel in that no previous work has used FSTs to extract named entities in the Urdu
language, and no previous work has addressed NER for Urdu hadith data.
Keywords: Information retrieval, Data analysis, Data processing, Data management, Data retrieval
Paper type: Research paper
Received 3 October 2018
Revised 11 December 2018, 4 January 2019, 31 January 2019
Accepted 21 February 2019
The Electronic Library, Vol. 37 No. 2, 2019, pp. 365-384
© Emerald Publishing Limited, 0264-0473
1. Introduction
In natural language analysis, information extraction (IE) is the process that takes textual
content as input and extracts unambiguous snippets as output. The extracted output data
may be displayed to users in the form of named entities mentioned in the document for
further data analysis or for improving information search and other information access
tasks. Named entity recognition (NER) is one of the key IE tasks as it locates and classifies
named entities (NE) in text into pre-defined categories, consisting of personal names,
locations, dates, events and so on. Although NER for English nearly approaches human
performance (Marsh and Perzanowski, 1998), NER in other languages, especially the Urdu
language, is still a challenging task (Daud et al., 2017; Malik and Sarwar, 2017). The main
aim of NER is to extract useful information from textual content. NER is dependent upon
sub-tasks, such as stemming, part of speech (POS) tagging and morphological analysis
(Santos and Guimarães, 2015). NER is helpful in various applications, such as information
management systems (Marrero et al., 2013), question answering systems (Yih et al., 2013),
online chatbots (Yu et al., 2016), semantic search engines (Thomas et al., 2012), automatic
speech recognition (Galibert et al., 2011), machine translation (Zou et al., 2013), social media
analysis (Ritter et al., 2011) and chemical data analysis (Rocktäschel et al., 2012). Harrag
(2014) approached NER through finite-state transducers (FSTs), and the results were better
than those of previous approaches on the topic. FSTs suit hadith content because of how
they operate and how named entities are arranged in hadith data. Moreover,
the noise present in hadith data makes it harder to extract named entities from
hadith; therefore, this approach is used.
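To make the idea of FST-based tagging concrete, the following minimal Python sketch tags person names that follow a trigger word, as happens in a chain of narrators. It is an illustration only and is not the KEFST implementation or the system of Harrag (2014): the trigger words, the name lexicon, the state names and the B-PER/I-PER output tags are all hypothetical, and transliterated English tokens stand in for Urdu text.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicons, chosen only for illustration (not taken from the paper).
TRIGGERS = {"narrated", "from"}
NAMES = {"Abu", "Hurairah", "Anas", "bin", "Malik"}


def symbol(token: str) -> str:
    """Map a raw token to a symbol of the transducer's input alphabet."""
    if token in TRIGGERS:
        return "TRIGGER"
    if token in NAMES:
        return "NAME"
    return "OTHER"


# Transition table: (state, input symbol) -> (next state, output tag).
# A name is tagged only when it follows a trigger, so context decides the output.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("START", "TRIGGER"): ("EXPECT", "O"),
    ("START", "NAME"): ("START", "O"),
    ("START", "OTHER"): ("START", "O"),
    ("EXPECT", "TRIGGER"): ("EXPECT", "O"),
    ("EXPECT", "NAME"): ("IN_NAME", "B-PER"),
    ("EXPECT", "OTHER"): ("START", "O"),
    ("IN_NAME", "TRIGGER"): ("EXPECT", "O"),
    ("IN_NAME", "NAME"): ("IN_NAME", "I-PER"),
    ("IN_NAME", "OTHER"): ("START", "O"),
}


def transduce(tokens: List[str]) -> List[Tuple[str, str]]:
    """Run the transducer over the tokens, emitting one tag per token."""
    state = "START"
    output = []
    for token in tokens:
        state, tag = TRANSITIONS[(state, symbol(token))]
        output.append((token, tag))
    return output


tagged = transduce("narrated Abu Hurairah from Anas bin Malik that".split())
for token, tag in tagged:
    print(token, tag)
```

Because each narrator name is introduced by a small set of connecting words, a handful of states is enough to separate names from the surrounding text in this toy setting; a real system would need far larger lexicons and handling for noisy input.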
This research aims to solve the problem of NER from Urdu hadith content by proposing
a knowledge extraction framework called KEFST. The framework consists of five phases:
content extraction, tokenization, POS tagging, multi-word detection and NER. For the
experimental setup, Sahih Bukhari Urdu hadith content is used to extract named entities
belonging to different categories. Standard evaluation measures in terms of precision, recall
and f1-score are used to measure the performance of KEFST.
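The three evaluation measures named above can be computed as in the short Python sketch below. It uses the standard definitions of precision, recall and f1-score; the exact-match criterion (an entity counts as correct only if its span and category both match) and the sample entities are assumptions made for illustration, not details taken from the paper.

```python
from typing import Set, Tuple

# An entity is (start token index, end token index, category); this layout is hypothetical.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Score predicted entities against gold entities using exact matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: two of the four predictions match the three gold entities.
gold = {(1, 2, "PERSON"), (4, 6, "PERSON"), (9, 9, "LOCATION")}
predicted = {(1, 2, "PERSON"), (4, 5, "PERSON"), (9, 9, "LOCATION"), (11, 11, "DATE")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this example the prediction (4, 5, "PERSON") is counted as wrong because its boundary differs from the gold span, which shows how strict matching penalizes partially recognized multi-word names.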
The remainder of this paper is divided into the following parts: Section 2 presents background
knowledge on the topic, Section 3 discusses related work, Section 4 describes the proposed
framework and Section 5 explains the experimental setup before finally concluding the paper.
2. Background
The term hadith, according to Muslims, refers to the tradition of reporting the deeds and
sayings of Muhammad, the last Messenger. After the holy book of Quran, hadith is
considered the most authentic source of knowledge for Muslims. Hadith content consists
of two main parts. The first is the chain of authorities reporting the hadith (called isnad or sanad)
and the second is the hadith text (called matn). Hadith is an important tool in understanding
the Quran and matters of jurisprudence.
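The two-part structure described above could be represented in code as in the following Python sketch. The class name, field names and placeholder values are hypothetical and are not drawn from the paper or from any hadith data set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hadith:
    """One hadith split into its two main parts (illustrative layout only)."""
    isnad: List[str]  # chain of authorities reporting the hadith, in order
    matn: str         # the text of the hadith itself


# Placeholder values; no real hadith content is reproduced here.
record = Hadith(
    isnad=["Narrator A", "Narrator B", "Narrator C"],
    matn="(text of the report)",
)

print(len(record.isnad), "narrators in the chain")
```

Keeping the chain and the text in separate fields lets a recognizer treat the isnad, which is dense with person names, differently from the matn.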
The top six hadith books are Sahih al-Bukhari, Sahih Muslim, Sunan Abu Dawood, Sunan
At Tirmidhi, Sunan An Nasai and Sunan Ibn Majah/Mawta Imam Malik. Hadiths are
organized in different structures and formats in these books. Moreover, these books were
originally written in the Arabic language, but translations of these books are now available in
almost every language of the world. Mahmood et al. (2018) proposed a multi-lingual data set
repository of hadith content. The data set repository contains the hadith contents of different
hadith books along with the content information. Such a repository can be
a useful resource for extracting information from hadith books in different languages, and its
availability to researchers can help in building better information retrieval
systems. Figure 1 shows an example of the hierarchy used to organize data in hadith books.
Figure 1. Hierarchy of a hadith book