A semi-automatic indexing system based on embedded information in HTML documents


Published date	15 June 2015
DOI	https://doi.org/10.1108/LHT-12-2014-0114
Pages	195-210
Author	Mari Vállez, Rafael Pedraza-Jiménez, Lluís Codina, Saúl Blanco, Cristòfol Rovira
Subject Matter	Library & information science, Librarianship/library management, Library technology

A semi-automatic indexing system based on embedded information in HTML documents

Mari Vállez, Rafael Pedraza-Jiménez and Lluís Codina
Department of Communication, Universitat Pompeu Fabra, Barcelona, Spain

Saúl Blanco
Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Madrid, Spain, and

Cristòfol Rovira
Department of Communication, Universitat Pompeu Fabra, Barcelona, Spain

Abstract

Purpose – The purpose of this paper is to describe and evaluate the tool DigiDoc MetaEdit, which allows the semi-automatic indexing of HTML documents. The tool works by identifying and suggesting keywords from a thesaurus according to the embedded information in HTML documents. This enables the parameterization of keyword assignment based on how frequently the terms appear in the document, the relevance of their position, and the combination of both.

Design/methodology/approach – In order to evaluate the efficiency of the indexing tool, the descriptors/keywords suggested by the tool are compared to the keywords that have been assigned manually by human experts. To make this comparison, a corpus of HTML documents is randomly selected from a journal devoted to Library and Information Science.

Findings – The results of the evaluation show that: first, there is close to a 50 per cent match or overlap between the two indexing systems, although if related terms and narrower terms are taken into consideration the matches can reach 73 per cent; and second, the first terms identified by the tool are the most relevant.

Originality/value – The tool presented identifies the most important keywords in an HTML document based on the information embedded in it. Nowadays, representing the contents of documents with keywords is an essential practice in areas such as information retrieval and e-commerce.

Keywords Digital documents, Information retrieval, Indexing, Search engines, Hypertext markup language

Paper type Research paper

Introduction

Representing the content of a document with keywords is a long-standing practice. Information retrieval systems have traditionally resorted to this method to facilitate the access to information, since it is a compact and efficient way of representing a document. This process is known as indexing. Thus, we will refer to indexing as the task of assigning a limited number of keywords to a document, keywords which indicate concepts that are sufficiently representative of the document.

Library Hi Tech, Vol. 33 No. 2, 2015, pp. 195-210. © Emerald Group Publishing Limited, 0737-8831. DOI 10.1108/LHT-12-2014-0114

Received 14 December 2014. Revised 14 December 2014. Accepted 11 February 2015.

The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/0737-8831.htm

This paper is part of the projects: “Audiencias activas y periodismo” (Active audiences and journalism), CSO2012-39518-C04-02, and “Comunicación online de los destinos turísticos” (Online communication of tourist destinations), CSO 2011-22691. Plan Nacional de I+D+i, Ministerio de Economía y Competitividad (Spain).


Despite the advantages of using keywords, only a minority of documents have keywords assigned, because assigning them is expensive and time-consuming. Therefore, systems are needed to facilitate the generation of keywords. Our proposal tries to identify, from a controlled language, the most important terms of HTML documents: those with high frequency and semantic relevance.

In this paper we describe the tool DigiDoc MetaEdit, which allows the semi-automatic indexing of HTML documents. The tool assigns keywords from a thesaurus with the objective of representing the semantic contents of the document efficiently. To do this, it follows some of the relevance criteria used by search engines. Furthermore, it can be customized according to how frequently the terms appear in the document, the relevance of their position and the combination of both. In order to evaluate the efficiency of the indexing system, we compare the descriptors suggested by the tool to those used in a portal of electronic journals by human experts.
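
The three customization options described above (frequency, position, and their combination) can be illustrated with a minimal Python sketch. It is not the DigiDoc MetaEdit implementation: the zone weights, the choice of HTML elements, and the names `POSITION_WEIGHTS`, `ZoneCollector` and `suggest_keywords` are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative weights only (an assumption); the tool's real settings are not given here.
POSITION_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "body": 1.0}
ZONE_TAGS = {"title", "h1", "h2"}


class ZoneCollector(HTMLParser):
    """Collect lower-cased text per zone: title, h1, h2, or everything else (body)."""

    def __init__(self):
        super().__init__()
        self.open_zones = []
        self.text = {zone: [] for zone in POSITION_WEIGHTS}

    def handle_starttag(self, tag, attrs):
        if tag in ZONE_TAGS:
            self.open_zones.append(tag)

    def handle_endtag(self, tag):
        if self.open_zones and self.open_zones[-1] == tag:
            self.open_zones.pop()

    def handle_data(self, data):
        zone = self.open_zones[-1] if self.open_zones else "body"
        self.text[zone].append(data.lower())


def suggest_keywords(html, thesaurus_terms, mode="combined", top_n=5):
    """Rank thesaurus terms found in the page by frequency, position, or both."""
    if mode not in ("frequency", "position", "combined"):
        raise ValueError(f"unknown mode: {mode}")
    collector = ZoneCollector()
    collector.feed(html)
    scores = Counter()
    for zone, chunks in collector.text.items():
        zone_text = " ".join(chunks)
        for term in thesaurus_terms:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            count = len(re.findall(pattern, zone_text))
            if count == 0:
                continue
            if mode == "frequency":
                scores[term] += count  # raw occurrences, zone ignored
            elif mode == "position":
                scores[term] = max(scores[term], POSITION_WEIGHTS[zone])  # best zone only
            else:
                scores[term] += count * POSITION_WEIGHTS[zone]  # weighted occurrences
    return [term for term, _ in scores.most_common(top_n)]


# Hypothetical sample page and thesaurus terms.
sample_html = (
    "<html><head><title>Automatic indexing of web pages</title></head>"
    "<body><h1>Thesaurus</h1>"
    "<p>Metadata, metadata, metadata and more metadata support indexing.</p>"
    "</body></html>"
)
sample_terms = ["indexing", "metadata", "thesaurus", "cataloguing"]

for mode in ("frequency", "position", "combined"):
    print(mode, suggest_keywords(sample_html, sample_terms, mode=mode))
```

With this sample page, the frequency mode ranks "metadata" first, whereas the position and combined modes rank "indexing" first because it also occurs in the title. Terms absent from the page, such as "cataloguing", are never suggested.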

The paper is organized into the following sections: first, a brief overview of the literature related to indexing and automatic indexing; second, the research objectives; third, the presentation of the tool DigiDoc MetaEdit to assign keywords to HTML documents; fourth, the methodology section with information about the experimental data sets, the configuration of the tool and the evaluation process; fifth, the results obtained in the evaluation and their analysis; and finally, the conclusions and future lines of research.

Literature review

Indexing theory attempts to identify the most effective indexing process, for indexing to be executed as a science rather than as an art (Borko, 1977; Hjørland, 2011). In the academic literature, the indexing process involves two main steps: one, identifying the subjects of the document, and two, representing them in a controlled language (Mai, 2001). This process is also known as subject indexing, in which the representation of the documents is conditioned by the controlled language structure. Some authors, Lancaster (2003) and Mai (1997) among them, analyze this procedure and the problems of identifying subjects. Others, such as Willis and Losee (2013) or Anderson and Pérez-Carballo (2001a, b), review the most important aspects of manual and automatic subject indexing and also the differences between the two systems.

Manual indexing involves an intellectual process using a controlled language, which makes it difficult, slow and expensive. It also entails a high number of inconsistencies, both external, when the task is conducted by multiple indexers, and internal, when a single indexer performs the work at different times (Olson and Wolfram, 2008; White et al., 2013; Zunde and Dexter, 1969).

Automatic indexing, in turn, can be approached from two main perspectives. The first one is keyword extraction, based on the keyword’s appearance in the text and in the whole of a collection (Frank et al., 1999; Zhang, 2008; Beliga, 2014). The second technique is keyword assignment, based on the matching of terms between the text and a thesaurus (or some other controlled vocabulary) (Moens, 2002; Yang et al., 2014).
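
The difference between the two perspectives can be sketched in Python. This is an illustrative toy and does not reproduce any of the cited systems: extraction is approximated here with a simple TF-IDF ranking, assignment with exact matching against a small hypothetical thesaurus, and the names `STOP_WORDS`, `extract_keywords` and `assign_keywords` are assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "and", "are", "by", "for", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Split into lower-cased ASCII words and drop stop words (a simplification)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def extract_keywords(text, collection, top_n=3):
    """Keyword extraction: rank the document's own words by a simple TF-IDF score."""
    term_freq = Counter(tokenize(text))
    doc_vocabularies = [set(tokenize(doc)) for doc in collection]
    n_docs = len(collection)
    scores = {}
    for word, freq in term_freq.items():
        doc_freq = sum(1 for vocab in doc_vocabularies if word in vocab)
        scores[word] = freq * math.log((1 + n_docs) / (1 + doc_freq))
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


def assign_keywords(text, thesaurus):
    """Keyword assignment: return descriptors whose entry terms occur in the text."""
    lowered = text.lower()
    assigned = []
    for descriptor, entry_terms in thesaurus.items():
        candidates = {descriptor, *entry_terms}
        if any(re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered) for t in candidates):
            assigned.append(descriptor)
    return sorted(assigned)


# Hypothetical document, collection and thesaurus.
document = "Crawlers fetch pages and crawlers index pages for retrieval."
collection = [
    document,
    "Libraries lend books and journals.",
    "Search engines rank pages.",
]
thesaurus = {
    "Web crawlers": {"crawlers", "spiders"},
    "Information retrieval": {"retrieval"},
    "Cataloguing": {"catalogue", "cataloguing"},
}

print(extract_keywords(document, collection))
print(assign_keywords(document, thesaurus))
```

Extraction returns words taken from the document itself, whereas assignment returns controlled descriptors such as "Web crawlers" that need not appear literally in the text.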

The different approaches for the first technique – keyword extraction – can be grouped into three categories: systems based on machine learning; systems based on rules for patterns; and systems supported by statistical criteria (Ercan and Cicekli, 2007; Giarlo, 2005; Kaur and Gupta, 2010). These different approaches can also be combined.

First, machine learning systems rely heavily on probabilistic calculations from training collections (Abulaish and Anwar, 2012). They adapt well to different environments, but their drawbacks should also be mentioned: they require many
