KEFST: a knowledge extraction framework using finite-state transducers
Pages | 365-384 |
DOI | https://doi.org/10.1108/EL-10-2018-0196 |
Published date | 01 April 2019 |
Author | Ahsan Mahmood, Hikmat Ullah Khan, Zahoor Ur Rehman, Khalid Iqbal, Ch. Muhmmad Shahzad Faisal |
Subject Matter | Information & knowledge management, Information & communications technology, Internet |
Ahsan Mahmood
Department of Computer Science, COMSATS University Islamabad,
Attock, Pakistan
Hikmat Ullah Khan
COMSATS University Islamabad, Attock, Pakistan, and
Zahoor Ur Rehman, Khalid Iqbal and Ch. Muhmmad Shahzad Faisal
Department of Computer Science, COMSATS University Islamabad,
Attock, Pakistan
Abstract
Purpose – The purpose of this research study is to extract and identify named entities from Hadith
literature. Named entity recognition (NER) refers to the identification of named entities in
computer-readable text and their annotation with category tags for information extraction. NER is an active
research area in information management and information retrieval systems. NER serves as a baseline for
machines to understand the context of a given content and helps in knowledge extraction. Although NER is
considered a solved task in major languages such as English, it is still a challenging task in languages
such as Urdu. Moreover, NER depends on the language and domain of study; thus, it is gaining the
attention of researchers in different domains.
Design/methodology/approach – This paper proposes a knowledge extraction framework using
finite-state transducers (FSTs), KEFST, to extract the named entities. KEFST consists of five steps: content
extraction, tokenization, part of speech tagging, multi-word detection and NER. An extensive empirical
analysis using the data corpus of the Urdu translation of Sahih Al-Bukhari, a widely known hadith book, reveals
that the proposed method recognizes the entities effectively and obtains better results.
Findings – The significant performance in terms of f-measure, precision and recall validates that the
proposed model outperforms the existing methods for NER in the relevant literature.
Originality/value – This research is novel in that no previous work has used FSTs to extract named
entities in the Urdu language, and no previous work has addressed NER on Urdu hadith data.
Keywords Information retrieval, Data analysis, Data processing, Data management, Data retrieval
Paper type Research paper
Received 3 October 2018
Revised 11 December 2018, 4 January 2019, 31 January 2019
Accepted 21 February 2019
The Electronic Library, Vol. 37 No. 2, 2019, pp. 365-384
© Emerald Publishing Limited, 0264-0473
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
1. Introduction
In natural language analysis, information extraction (IE) is the process that takes textual
content as input and extracts unambiguous snippets as output. The extracted output
may be displayed to users in the form of named entities mentioned in the document, used for
further data analysis or used to improve information search and other information access
tasks. Named entity recognition (NER) is one of the key IE tasks, as it locates and classifies
named entities (NE) in text into pre-defined categories, consisting of personal names,
locations, dates, events and so on. Although NER for English nearly approaches human
performance (Marsh and Perzanowski, 1998), NER in other languages, especially the Urdu
language, is still a challenging task (Daud et al., 2017; Malik and Sarwar, 2017). The main
aim of NER is to extract useful information from textual content. NER is dependent upon
sub-tasks, such as stemming, part of speech (POS) tagging and morphological analysis
(Santos and Guimarães, 2015). NER is helpful in various applications, such as information
management systems (Marrero et al., 2013), question answering systems (Yih et al., 2013),
online chatbots (Yu et al., 2016), semantic search engines (Thomas et al., 2012), automatic
speech recognition (Galibert et al., 2011), machine translation (Zou et al., 2013), social media
analysis (Ritter et al., 2011) and chemical data analysis (Rocktäschel et al., 2012). Harrag
(2014) approached NER through finite-state transducers (FSTs), and their results were better
than previous approaches on the topic. FSTs work better with hadith content because of
how they operate and how named entities are arranged in hadith data. Moreover,
the noise present in hadith data makes it harder to extract named entities from
hadith; therefore, this approach is used.
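As an illustration of the idea, a token-level FST can be sketched in Python as a table of transitions that reads one input symbol per token and emits one tag per token. This is a minimal sketch and not the transducer used in KEFST: the lexicon, the state names, the tag set (B-PER, I-PER, O) and the transliterated example tokens are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon mapping a token to a coarse input symbol.
# Tokens that are not listed are treated as OTHER.
LEXICON: Dict[str, str] = {
    "narrated": "TRIGGER",  # word that often precedes a narrator's name
    "bin": "LINK",          # connector inside a multi-word name
    "abdullah": "NAME",
    "umar": "NAME",
}

# Transition table: (state, input symbol) -> (next state, output tag).
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("OUT", "TRIGGER"): ("EXPECT", "O"),
    ("OUT", "NAME"): ("IN", "B-PER"),
    ("OUT", "LINK"): ("OUT", "O"),
    ("OUT", "OTHER"): ("OUT", "O"),
    ("EXPECT", "TRIGGER"): ("EXPECT", "O"),
    ("EXPECT", "NAME"): ("IN", "B-PER"),
    ("EXPECT", "LINK"): ("OUT", "O"),
    # After a trigger word, an unknown token is assumed to start a name.
    ("EXPECT", "OTHER"): ("IN", "B-PER"),
    ("IN", "TRIGGER"): ("EXPECT", "O"),
    ("IN", "NAME"): ("IN", "I-PER"),
    ("IN", "LINK"): ("IN", "I-PER"),
    ("IN", "OTHER"): ("OUT", "O"),
}


def tag(tokens: List[str]) -> List[str]:
    """Run the transducer over the tokens and return one tag per token."""
    state = "OUT"
    tags: List[str] = []
    for token in tokens:
        symbol = LEXICON.get(token.lower(), "OTHER")
        state, output = TRANSITIONS[(state, symbol)]
        tags.append(output)
    return tags


example = ["narrated", "abdullah", "bin", "umar", "that", "he", "prayed"]
print(list(zip(example, tag(example))))
```

In this sketch the EXPECT state encodes position: a token that follows a trigger word is tagged as the start of a name even when it is missing from the lexicon, which is one way the arrangement of entities in a narration chain could be exploited.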
This research aims to solve the problem of NER from Urdu hadith content by proposing
a knowledge extraction framework called KEFST. The framework consists of five phases:
content extraction, tokenization, POS tagging, multi-word detection and NER. For
experimental setup, Sahih Bukhari Urdu hadith content is used to extract named entities
belonging to different categories. Standard evaluation measures in terms of precision, recall
and f1-score are used to measure the performance of the KEFST.
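For reference, the three measures can be computed from the sets of gold and predicted entities as follows. This is a generic sketch of the standard definitions; representing an entity as a (start, end, category) tuple and counting only exact matches are assumptions of this example, not details taken from the KEFST evaluation.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start token, end token, category).
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 over exactly matching entities."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for illustration only.
gold = {(1, 4, "PER"), (6, 7, "LOC")}
predicted = {(1, 4, "PER"), (8, 9, "PER")}
print(precision_recall_f1(gold, predicted))  # (0.5, 0.5, 0.5)
```

Exact matching is the stricter choice; an evaluation that also credits partially overlapping entities would report higher scores on the same output.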
The remainder of this paper is divided into the following parts: Section 2 presents background
knowledge on the topic, Section 3 discusses related work, Section 4 describes the proposed
framework and Section 5 explains the experimental setup before finally concluding the paper.
2. Background
The term hadith, according to Muslims, refers to the tradition of reporting the deeds and
sayings of Muhammad, the last Messenger. After the holy book of Quran, hadith is
considered the most authentic source of knowledge for Muslims. Hadith content consists
of two main parts. The first is the chain of authorities reporting the hadith (called isnad or sanad)
and the second is the hadith text (called matn). Hadith is an important tool in understanding
the Quran and matters of jurisprudence.
The top six hadith books are Sahih al-Bukhari, Sahih Muslim, Sunan Abu Dawood, Sunan
At Tirmidhi, Sunan An Nasa’i and Sunan Ibn Majah/Muwatta Imam Malik. Hadiths are
organized in different structures and formats in these books. Moreover, these books were
originally written in the Arabic language, but translations of these books are now available in
almost every language of the world. Mahmood et al. (2018) proposed a multi-lingual data set
repository of hadith content. The data set repository contains hadith contents of different
hadith books along with the content information. Such a repository can be
a useful resource for extracting information from hadith books in different languages, and its
availability to researchers can help in building better information retrieval
systems. Figure 1 represents an example of the hierarchy of managing data in hadith books.
Figure 1. Hierarchy of a hadith book