Structure‐preserving and query‐biased document summarisation for web searching

Published date: 07 August 2009
Pages: 696-719
DOI: https://doi.org/10.1108/14684520910985684
Authors: F. Canan Pembe, Tunga Güngör
Subject matter: Information & knowledge management, Library & information science
F. Canan Pembe
Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey and Department of Computer Engineering, İstanbul Kültür University, Istanbul, Turkey, and
Tunga Güngör
Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
Abstract
Purpose – The purpose of this paper is to develop a new summarisation approach, namely
structure-preserving and query-biased summarisation, to improve the effectiveness of web searching.
During web searching, one aid for users is the document summaries provided in the search results.
However, the summaries provided by current search engines have limitations in directing users to
relevant documents.
Design/methodology/approach – The proposed system consists of two stages: document
structure analysis and summarisation. In the first stage, a rule-based approach is used to identify
the sectional hierarchies of web documents. In the second stage, query-biased summaries are created,
making use of document structure both in the summarisation process and in the output summaries.
Findings – In structural processing, about 70 per cent accuracy in identifying document sectional
hierarchies is obtained. The summarisation method is tested on a task-based evaluation method using
English and Turkish document collections. The results show that the proposed method is a significant
improvement over both unstructured query-biased summaries and Google snippets in terms of
f-measure.
Practical implications – The proposed summarisation system can be incorporated into search
engines. The structural processing technique also has applications in other information systems, such
as browsing, outlining and indexing documents.
Originality/value – In the literature on summarisation, the effects of query-biased techniques and
document structure are considered in only a few works and are researched separately. The research
reported here differs from traditional approaches by combining these two aspects in a coherent
framework. The work is also the first automatic summarisation study for Turkish targeting web
search.
Keywords Data structures, Document delivery, Markup languages, Search engines, Worldwide web
Paper type Research paper
Online Information Review, Vol. 33 No. 4, 2009, pp. 696-719
© Emerald Group Publishing Limited, 1468-4527
Refereed article received 19 July 2008; approved for publication 20 January 2009
The current issue and full text archive of this journal is available at www.emeraldinsight.com/1468-4527.htm
Introduction
The drastic increase in documents available on the world wide web has resulted in the
wide-spread problem of information overload (Mani and Maybury, 1999). People now
have access to vast amounts of information; however, it is becoming increasingly
difficult to locate useful information. Search engines usually return a large number of
results in response to user queries. One study of European users showed that about 50
per cent of documents viewed by users are irrelevant (Jansen and Spink, 2005). Users
need to open several links to find the desired information, especially for specific and
complex queries (e.g. best retirement countries) and for tasks such as background
searching rather than queries with commonplace answers (e.g. capital city of Sweden).
In currently available search engines, such as Google and Altavista, each link in the
results is associated with a short summary (e.g. a two-line extract) of its content.
Although such extracts show some of the document fragments containing the query
words, they fail to reveal their context within the document. As a result, the user either
misses relevant results or spends time on irrelevant ones. Figure 1 shows the first six
results of Google in response to the TREC-2004[1] query “antibiotics bacteria disease”.
In that task, the aim of the user is to find documents that discuss how and why
antibiotics become ineffective for some bacteria types. When we analyse the related
documents, we see that only half of the extracts in the figure effectively direct the
users.
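The limitation described above can be illustrated with a minimal sketch. It is not the system proposed in this paper: the `Section` class, the word-overlap score and the example document are all assumptions made only for illustration. The sketch contrasts a flat extract, which shows the sentence matching the query, with an extract that also reports the chain of section headings in which the sentence occurs.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Hypothetical document node: a heading, its sentences and its subsections."""
    heading: str
    sentences: List[str]
    children: List["Section"] = field(default_factory=list)


def tokens(text):
    """Lower-cased word set; a deliberately crude stand-in for real term matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query, sentence):
    """Number of distinct query words that also occur in the sentence."""
    return len(tokens(query) & tokens(sentence))


def iter_sentences(section, path=()):
    """Yield (heading path, sentence) pairs in document order."""
    path = path + (section.heading,)
    for sentence in section.sentences:
        yield path, sentence
    for child in section.children:
        yield from iter_sentences(child, path)


def top_matches(doc, query, limit):
    """Best-scoring sentences that share at least one word with the query."""
    ranked = sorted(iter_sentences(doc),
                    key=lambda item: overlap(query, item[1]),
                    reverse=True)
    return [item for item in ranked[:limit] if overlap(query, item[1]) > 0]


def flat_snippet(doc, query, limit=2):
    """Extract without context: only the matching sentences are shown."""
    return " ... ".join(s for _, s in top_matches(doc, query, limit))


def structured_snippet(doc, query, limit=2):
    """Extract with context: each sentence is prefixed by its heading path."""
    return [" > ".join(path) + ": " + s
            for path, s in top_matches(doc, query, limit)]


# Invented example document, loosely themed on the query discussed above.
doc = Section("Antibiotic resistance", ["Overview of the topic."], [
    Section("Causes", ["Overuse of antibiotics lets resistant bacteria survive."]),
    Section("History", ["Penicillin was discovered in 1928."]),
])

query = "antibiotics bacteria disease"
print(flat_snippet(doc, query, limit=1))
print(structured_snippet(doc, query, limit=1))
```

In this toy case both extracts select the same sentence, but only the second one tells the user that it comes from a section on the causes of resistance, which is the kind of context a two-line extract does not convey.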
At this point, automatic summarisation techniques gain importance. Although
creating summaries as successful as human summaries is still a long-term research
direction, summaries that are not perfect can be utilised to improve the effectiveness of
other tasks such as information retrieval (Sparck-Jones, 1999). Automatic
summarisation research has traditionally focused on creating general-purpose
summaries. However, in an information retrieval paradigm, it has become important
Figure 1. First few outputs of Google search engine for an example query