A semi-automatic indexing system based on embedded information in HTML documents

Published date: 15 June 2015
DOI: https://doi.org/10.1108/LHT-12-2014-0114
Pages: 195-210
Author: Mari Vállez, Rafael Pedraza-Jiménez, Lluís Codina, Saúl Blanco, Cristòfol Rovira
Subject Matter: Library & information science, Librarianship/library management, Library technology
Mari Vállez, Rafael Pedraza-Jiménez and Lluís Codina
Department of Communication, Universitat Pompeu Fabra, Barcelona, Spain
Saúl Blanco
Department of Signal Theory and Communications,
Universidad Carlos III de Madrid, Madrid, Spain, and
Cristòfol Rovira
Department of Communication, Universitat Pompeu Fabra, Barcelona, Spain
Abstract
Purpose: The purpose of this paper is to describe and evaluate the tool DigiDoc MetaEdit, which
allows the semi-automatic indexing of HTML documents. The tool works by identifying and
suggesting keywords from a thesaurus according to the embedded information in HTML documents.
Keyword assignment can be parameterized according to how frequently the terms appear
in the document, the relevance of their position, or a combination of both.
Design/methodology/approach: To evaluate the efficiency of the indexing tool, the
descriptors/keywords it suggests are compared with the keywords assigned manually
by human experts. For this comparison, a corpus of HTML documents
is randomly selected from a journal devoted to Library and Information Science.
Findings: The results of the evaluation show that, first, there is close to a 50 per cent match or
overlap between the two indexing systems, and that the matches can reach 73 per cent when
related terms and narrow terms are taken into consideration; and second, the first terms
identified by the tool are the most relevant.
Originality/value: The tool presented identifies the most important keywords in an HTML
document based on the information embedded in it. Nowadays, representing the
contents of documents with keywords is an essential practice in areas such as information retrieval
and e-commerce.
Keywords: Digital documents, Information retrieval, Indexing, Search engines,
Hypertext markup language
Paper type: Research paper
Introduction
Representing the content of a document with keywords is a long-standing practice.
Information retrieval systems have traditionally resorted to this method to facilitate
access to information, since it is a compact and efficient way of representing a
document. This process is known as indexing. In this paper, indexing refers to the
task of assigning to a document a limited number of keywords that
indicate concepts sufficiently representative of its content.
Library Hi Tech
Vol. 33 No. 2, 2015
pp. 195-210
© Emerald Group Publishing Limited
0737-8831
DOI 10.1108/LHT-12-2014-0114
Received 14 December 2014
Revised 14 December 2014
Accepted 11 February 2015
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0737-8831.htm
This paper is part of the projects: Audiencias activas y periodismo (Active audiences and
journalism), CSO2012-39518-C04-02, and Comunicación online de los destinos turísticos (Online
communication of tourist destinations), CSO 2011-22691. Plan Nacional de I+D+i, Ministerio de
Economía y Competitividad (Spain).
Despite the advantages of using keywords, only a minority of documents have
keywords assigned, because assigning them is expensive and time consuming. Therefore, systems are
needed to facilitate the generation of keywords. Our proposal tries to identify, from a
controlled language, the terms that are most important in an HTML document because of
their high frequency and semantic relevance.
In this paper we describe the tool DigiDoc MetaEdit that allows the semi-automatic
indexing of HTML documents. The tool assigns keywords from a thesaurus with the
objective of representing the semantic contents of the document efficiently. To do this,
it follows some of the relevance criteria used by search engines. Furthermore, it can be
customized according to how frequently the terms appear in the document,
the relevance of their position, or a combination of both. In order to evaluate the
efficiency of the indexing system, we compare the descriptors suggested by the tool with
those assigned by human experts in a portal of electronic journals.
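
The three parameterizations mentioned above (frequency, position, and a combination of both) can be illustrated with a minimal sketch. This is not the implementation of DigiDoc MetaEdit: the function name suggest_keywords, the class ZoneCollector, the zones considered and the numeric weights are all assumptions made for illustration, since the text does not specify them.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re

# Hypothetical position weights. The paper does not give the values used by
# DigiDoc MetaEdit, so these numbers are illustrative assumptions only.
POSITION_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "body": 1.0}
TRACKED_TAGS = {"title", "h1", "h2"}  # any other text counts as "body"


class ZoneCollector(HTMLParser):
    """Groups the lower-cased text of an HTML document by coarse zone."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # tracked tags that are currently open
        self.zones = defaultdict(list)  # zone name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        zone = self.stack[-1] if self.stack else "body"
        self.zones[zone].append(data.lower())


def suggest_keywords(html, thesaurus_terms, mode="combined", top_n=5):
    """Rank thesaurus terms found in `html`.

    mode="frequency": number of occurrences anywhere in the document
    mode="position":  weight of the best zone in which the term occurs
    mode="combined":  occurrences weighted by the zone they occur in
    """
    collector = ZoneCollector()
    collector.feed(html)
    collector.close()

    scores = {}
    for term in thesaurus_terms:
        pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
        frequency, best_position, weighted = 0, 0.0, 0.0
        for zone, fragments in collector.zones.items():
            hits = sum(len(pattern.findall(text)) for text in fragments)
            if hits:
                frequency += hits
                best_position = max(best_position, POSITION_WEIGHTS[zone])
                weighted += hits * POSITION_WEIGHTS[zone]
        if frequency == 0:
            continue  # terms absent from the document are never suggested
        scores[term] = {"frequency": float(frequency),
                        "position": best_position,
                        "combined": weighted}[mode]

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```

Under these assumed weights, a term that appears once in the title can outrank a term that appears several times in the body when mode="position" or mode="combined" is chosen, whereas mode="frequency" ignores where a term occurs. Because the candidates come from the supplied list of thesaurus terms, the output is always drawn from the controlled vocabulary.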
The paper is organized into the following sections: first, a brief overview of the
literature related to indexing and automatic indexing; second, the research objectives;
third, the presentation of the tool DigiDoc MetaEdit to assign keywords to HTML
documents; fourth, the methodology section with information about the experimental
data sets, the configuration of the tool and the evaluation process; fifth, the results
obtained in the evaluation and the analysis of them; and finally, the conclusions and
future lines of research.
Literature review
Indexing theory attempts to identify the most effective indexing process, so that indexing
can be executed as a science rather than as an art (Borko, 1977; Hjørland, 2011). In the
academic literature, the indexing process involves two main steps: first, identifying
the subjects of the document, and second, representing them in a controlled language
(Mai, 2001). This process is also known as subject indexing, in which the representation
of the documents is conditioned by the structure of the controlled language. Some authors,
Lancaster (2003) and Mai (1997) among them, analyze this procedure and the problems
of identifying subjects. Others, such as Willis and Losee (2013) or Anderson and
Pérez-Carballo (2001a, b), review the most important aspects of manual and automatic
subject indexing and also the differences between the two systems.
Manual indexing involves an intellectual process using a controlled language, which
makes it difficult, slow and expensive. It also entails a high number
of inconsistencies, both external, when the task is conducted by multiple indexers,
and internal, when a single indexer performs the work at different times (Olson and
Wolfram, 2008; White et al., 2013; Zunde and Dexter, 1969).
Automatic indexing, for its part, can be approached from two main perspectives. The
first one is keyword extraction, based on the appearance of keywords in the text and in
the collection as a whole (Frank et al., 1999; Zhang, 2008; Beliga, 2014). The second
technique is keyword assignment, based on the matching of terms between the text and
a thesaurus (or some other controlled vocabulary) (Moens, 2002; Yang et al., 2014).
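
The difference between the two perspectives can be made concrete with a short sketch. It is a deliberate simplification under stated assumptions: the function names, the stopword list and the example vocabulary are hypothetical, and the extraction function uses only within-document frequency, ignoring the collection-level statistics that the extraction approaches cited above also take into account.

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "and", "between", "in", "of", "the", "to", "uses"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(text, top_n=3):
    """Keyword extraction: candidates are words taken from the text itself."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def assign_keywords(text, vocabulary):
    """Keyword assignment: candidates come from a controlled vocabulary.

    `vocabulary` maps each preferred term to the lower-case variants that
    should trigger it; only preferred terms are ever returned.
    """
    normalized = " ".join(tokenize(text))
    assigned = []
    for preferred, variants in vocabulary.items():
        if any(re.search(r"\b" + re.escape(v) + r"\b", normalized)
               for v in variants):
            assigned.append(preferred)
    return sorted(assigned)
```

For a text such as "Search engines index web pages. A search engine ranks pages; ranking uses links between pages.", extract_keywords returns words found in the text, while assign_keywords with a vocabulary that lists "search engine" as a variant of "Search engines" returns the preferred term, even though that exact form is only one of the variants present.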
The different approaches to the first technique, keyword extraction, can be
grouped into three categories: systems based on machine learning, systems based on
rules for patterns, and systems supported by statistical criteria (Ercan and Cicekli, 2007;
Giarlo, 2005; Kaur and Gupta, 2010). These different approaches can also be combined.
First, machine learning systems rely heavily on probabilistic calculations from
training collections (Abulaish and Anwar, 2012). They adapt well to different
environments, but their drawbacks should also be mentioned: they require many