Multi-granularity hierarchical topic-based segmentation of structured, digital library resources

DOI: https://doi.org/10.1108/EL-06-2015-0108
Date: 06 February 2017
Pages: 99-120
Published date: 06 February 2017
Authors: Zhongyi Wang, Jin Zhang, Jing Huang
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Zhongyi Wang
School of Information Management, Central China Normal University,
Wuhan City, Hu Bei Province, China
Jin Zhang
School of Information Studies, University of Wisconsin-Milwaukee,
Milwaukee, Wisconsin, USA, and
Jing Huang
Wuhan Polytechnic, Wuhan City, Hu Bei Province, China
Abstract
Purpose: Current segmentation systems almost invariably focus on linear segmentation and can only
divide text into linear sequences of segments. This suits cohesive text such as news feeds but not coherent texts
such as documents of a digital library, which have hierarchical structures. To overcome the focus on linear
segmentation in document segmentation and to achieve hierarchical segmentation of a digital
library’s structured resources, this paper proposes a new multi-granularity hierarchical topic-based
segmentation system (MHTSS) to decide section breaks.
Design/methodology/approach: MHTSS adopts an up-down segmentation strategy to divide a
structured, digital library document into a document segmentation tree. Specifically, it works in three
stages: document parsing, coarse segmentation based on document access structures and
fine-grained segmentation based on lexical cohesion.
Findings: This paper analyzed the limitations of document segmentation methods for structured, digital
library resources. The authors found that document access structures and lexical cohesion
techniques complement each other and, in combination, allow for better segmentation of structured, digital library
resources. Based on this finding, this paper proposed the MHTSS for structured, digital library resources.
To evaluate it, MHTSS was compared to the TT and C99 algorithms on real-world digital library corpora.
The comparison showed that MHTSS achieves the top overall performance.
Practical implications: With MHTSS, digital library users can get relevant information directly in
segments instead of receiving the whole document. This will improve retrieval performance as well as
dramatically reduce information overload.
Originality/value: This paper proposed MHTSS for structured, digital library resources, which
combines document access structures and lexical cohesion techniques to decide section breaks. With this
system, end-users can access a document by sections through a document structure tree.
Keywords: Hierarchical segmentation, Access structures, AIC, Digital library resources,
Lexical cohesion, Optimum partitioning clustering, Structured segmentation
Paper type: Research paper
This study is supported by National Social Science Foundation of China: “Research on Multi-granularity
Integration Knowledge Services of Digital Library Based on Linked Data” (14CTQ003).
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
Received 27 June 2015
Revised 14 October 2015
12 December 2015
3 March 2016
Accepted 7 March 2016
The Electronic Library
Vol. 35 No. 1, 2017
pp. 99-120
© Emerald Publishing Limited
0264-0473
DOI 10.1108/EL-06-2015-0108
Introduction
There are many reasons for interest in the resources of digital libraries. Information retrieval
provides the opportunity for measuring document collections in digital libraries based on
their relevance to a query. However, document retrieval based on string searches typically
returns the whole document, which may result in information overload. End users may have
to examine the long document to find relevant information. To overcome the possibility of
information overload, the document can be broken down into topics and subtopics, and the
related segments can be used as the search results. This process has been demonstrated to
improve retrieval performance as well as reduce information overload (Callan, 1994; Salton
et al., 1996). Thus, an interest arises for deeper analysis of the documents in digital libraries.
This implies a need for text segmentation.
Text segmentation can improve the retrieval experience in digital libraries by segmenting
a document into topics and subtopics and presenting only the relevant parts of the
documents during a search operation. Although some digital library resources are text heavy
and not structured, such as news feeds, most digital resources are structured resources with
document access structures that are marked by titles, headings and subheadings. These
structured digital library resources, such as papers and books, are typically organized into a
hierarchical structure where there is a set of interrelated topics that contribute to one or more
common themes. Hearst (1994) and Ji and Zha (2003) observed that topic transitions in such
coherent text are more subtle and, therefore, more difficult to detect. Current
segmentation systems almost invariably focus on linear segmentation and can only divide
text into linear sequences of segments. This suits cohesive text, such as news feeds, where
transitions from one topic to another are relatively clear, but not coherent texts, such as
documents of a digital library, which have hierarchical structures. To overcome the focus on
linear segmentation in document segmentation and to realize the purpose of hierarchical
segmentation for a digital library’s structured resources, this paper proposed a new
multi-granularity hierarchical topic-based segmentation system (MHTSS) which combined
the document access structures and lexical cohesion techniques to decide section breaks.
With this system, end-users can access a document by sections through a document
structure tree. This means that digital library users can get their relevant information
directly in segments instead of receiving the whole document. This will improve retrieval
performance as well as dramatically reduce information overload.
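To make the idea of a document structure tree concrete, the following is a minimal sketch of how access structures (titles, headings and subheadings) could be turned into a tree of segments. It is an illustration only and not the MHTSS implementation: the `Segment` class, the `build_segment_tree` and `outline` functions, and the input format of `(level, content)` pairs are all assumptions introduced here, where a level above 0 marks a heading of that depth and level 0 marks body text.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One node of a hypothetical document segmentation tree."""
    title: str
    level: int
    text: list = field(default_factory=list)      # body lines of this section
    children: list = field(default_factory=list)  # nested subsections


def build_segment_tree(lines):
    """Build a tree from (level, content) pairs (an assumed input format).

    level > 0 marks a heading of that depth; level == 0 marks body text,
    which is attached to the most recently opened section.
    """
    root = Segment(title="DOCUMENT", level=0)
    stack = [root]  # path from the root to the currently open section
    for level, content in lines:
        if level == 0:
            stack[-1].text.append(content)
            continue
        # Close every open section at the same depth or deeper.
        while stack[-1].level >= level:
            stack.pop()
        node = Segment(title=content, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node, depth=0):
    """Return the section titles of the tree, indented by depth."""
    rows = ["  " * depth + node.title]
    for child in node.children:
        rows.extend(outline(child, depth + 1))
    return rows


if __name__ == "__main__":
    parsed = [
        (1, "Introduction"), (0, "First paragraph."),
        (1, "Method"), (2, "Document parsing"), (0, "Second paragraph."),
        (2, "Coarse segmentation"), (1, "Results"),
    ]
    print("\n".join(outline(build_segment_tree(parsed))))
```

A retrieval system built on such a tree could return a single subtree, for example one subsection, instead of the whole document. Sections without further headings would still need a content-based method such as lexical cohesion to be split more finely.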
Literature review
There are many distinct tasks labelled as text segmentation. For instance, identifying and
extracting text from multimedia is called as such (Jung et al., 2004). The task of grouping words
into morphemes or bigger linguistic units is sometimes also referred to as text segmentation (Yang
and Li, 2005). In this paper, we concentrate on topic-based text segmentation. This type of text
segmentation views a well-written text as a sequence of topics and assumes that these topics
correspond to segments. The increasing interest in topic-based text segmentation can be
observed by the number of its applications. Topic-based text segmentation has been used mainly
in passage retrieval (Cormack et al., 1999; Yu et al., 2003), text summarization (Barzilay and
Elhadad, 1999; Boguraev and Neff, 2000; Farzindar and Lapalme, 2004; Haghighi and
Vanderwende, 2009), subjectivity analysis (Stoyanov and Cardie, 2008), question answering (Oh
et al., 2007) and other applications for the past decade. Generally speaking, topic-based text
segmentation methods can be roughly separated into two main families: linear segmentation
methods which split a long text into chunks of consecutive text fragments and hierarchical
segmentation methods where documents are iteratively split into finer-grained topic segments.
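As a rough illustration of the linear family, the sketch below splits a sequence of sentences into consecutive chunks wherever the lexical similarity between neighbouring sentences drops below a threshold. It is a deliberately simplified stand-in and not a reproduction of any published algorithm: the function names, the bag-of-words cosine measure without stop-word removal, and the `threshold` value of 0.1 are all assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase alphabetic tokens (no stemming or stop-word removal)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def linear_segments(sentences, threshold=0.1):
    """Start a new segment wherever adjacent sentences share too few words.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            segments.append([cur])       # low cohesion: topic boundary
        else:
            segments[-1].append(cur)     # high cohesion: same topic
    return segments


if __name__ == "__main__":
    text = [
        "cats chase mice",
        "mice fear cats",
        "stocks fell sharply",
        "investors sold stocks",
    ]
    for segment in linear_segments(text):
        print(segment)
```

A hierarchical method, by contrast, would apply a split of this kind repeatedly inside each resulting segment, which yields nested rather than flat segments.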