Multi-granularity hierarchical topic-based segmentation of structured, digital library resources

DOI	https://doi.org/10.1108/EL-06-2015-0108
Date	06 February 2017
Pages	99-120
Published date	06 February 2017
Author	Zhongyi Wang,Jin Zhang,Jing Huang
Subject Matter	Information & knowledge management,Information & communications technology,Internet

Zhongyi Wang
School of Information Management, Central China Normal University, Wuhan City, Hu Bei Province, China

Jin Zhang
School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA, and

Jing Huang
Wuhan Polytechnic, Wuhan City, Hu Bei Province, China

Abstract

Purpose – Current segmentation systems almost invariably focus on linear segmentation and can only divide text into linear sequences of segments. This suits cohesive text such as news feeds but not coherent texts such as digital library documents, which have hierarchical structures. To move beyond linear segmentation and achieve hierarchical segmentation of a digital library’s structured resources, this paper proposes a new multi-granularity hierarchical topic-based segmentation system (MHTSS) to decide section breaks.

Design/methodology/approach – MHTSS adopts an up-down segmentation strategy to divide a structured, digital library document into a document segmentation tree. Specifically, it works in three stages: document parsing, coarse segmentation based on document access structures and fine-grained segmentation based on lexical cohesion.

Findings – This paper analyzed the limitations of document segmentation methods for structured, digital library resources. The authors found that document access structures and lexical cohesion techniques complement each other and, when combined, allow for better segmentation of structured, digital library resources. Based on this finding, this paper proposed the MHTSS for structured, digital library resources. To evaluate it, MHTSS was compared to the TT and C99 algorithms on real-world digital library corpora. The comparison showed that MHTSS achieves the top overall performance.

Practical implications – With MHTSS, digital library users can get relevant information directly in segments instead of receiving the whole document. This will improve retrieval performance as well as dramatically reduce information overload.

Originality/value – This paper proposed MHTSS for structured, digital library resources, which combines document access structures and lexical cohesion techniques to decide section breaks. With this system, end-users can access a document by sections through a document structure tree.

Keywords Hierarchical segmentation, Access structures, AIC, Digital library resources, Lexical cohesion, Optimum partitioning clustering, Structured segmentation

Paper type Research paper

This study is supported by the National Social Science Foundation of China: “Research on Multi-granularity Integration Knowledge Services of Digital Library Based on Linked Data” (14CTQ003).

The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/0264-0473.htm

Received 27 June 2015
Revised 14 October 2015, 12 December 2015, 3 March 2016
Accepted 7 March 2016

The Electronic Library, Vol. 35 No. 1, 2017, pp. 99-120
© Emerald Publishing Limited, 0264-0473
DOI 10.1108/EL-06-2015-0108

Introduction

There are many reasons for interest in the resources of digital libraries. Information retrieval provides the opportunity to assess document collections in digital libraries based on their relevance to a query. However, document retrieval based on string searches typically returns the whole document, which may result in information overload. End users may have to examine a long document to find the relevant information. To reduce the risk of information overload, the document can be broken down into topics and subtopics, and the related segments can be used as the search results. This process has been demonstrated to improve retrieval performance as well as reduce information overload (Callan, 1994; Salton et al., 1996). Interest has therefore grown in deeper analysis of the documents in digital libraries, which implies a need for text segmentation.

Text segmentation can improve the retrieval experience in digital libraries by segmenting a document into topics and subtopics and presenting only the relevant parts of the document during a search operation. Although some digital library resources are text heavy and not structured, such as news feeds, most digital resources are structured resources with document access structures that are marked by titles, headings and subheadings. These structured digital library resources, such as papers and books, are typically organized into a hierarchical structure where there is a set of interrelated topics that contribute to one or more common themes. Hearst (1994) and Ji and Zha (2003) observed that the topic transitions of these documents are more subtle and, therefore, more difficult to detect in coherent text. Current segmentation systems almost invariably focus on linear segmentation and can only divide text into linear sequences of segments. This suits cohesive text, such as news feeds, where transitions from one topic to another are relatively clear, but not coherent texts, such as digital library documents, which have hierarchical structures. To move beyond linear segmentation and achieve hierarchical segmentation of a digital library’s structured resources, this paper proposed a new multi-granularity hierarchical topic-based segmentation system (MHTSS), which combines document access structures and lexical cohesion techniques to decide section breaks. With this system, end-users can access a document by sections through a document structure tree. This means that digital library users can get relevant information directly in segments instead of receiving the whole document. This will improve retrieval performance as well as dramatically reduce information overload.
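
As a rough illustration of how access structures (titles, headings and subheadings) can yield a document structure tree, the sketch below builds a tree from numbered headings. It is not the MHTSS implementation: the input convention (headings such as "1 Introduction" or "1.1 Motivation") and the names Node, HEADING, build_tree and outline are assumptions made only for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One segment in the document structure tree (hypothetical type)."""
    title: str
    level: int
    text: list = field(default_factory=list)
    children: list = field(default_factory=list)


# Assumed heading convention: "1 Title", "1.2 Title", "1.2.3 Title".
# A body line that happens to start with a number would be misread as a
# heading; a real parser would rely on markup rather than this pattern.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)$")


def build_tree(lines):
    """Attach each heading to the nearest preceding heading of a higher level."""
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the most recently opened section
    for line in lines:
        match = HEADING.match(line)
        if match:
            level = match.group(1).count(".") + 1
            node = Node(match.group(2), level)
            # Close sections at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            # Body text belongs to the innermost open section.
            stack[-1].text.append(line.strip())
    return root


def outline(node, depth=0):
    """Return the section titles as an indented outline, one string per node."""
    rows = []
    for child in node.children:
        rows.append("  " * depth + child.title)
        rows.extend(outline(child, depth + 1))
    return rows


if __name__ == "__main__":
    doc = [
        "1 Introduction",
        "Digital libraries hold structured documents.",
        "1.1 Motivation",
        "Users face information overload.",
        "2 Method",
        "The system works in three stages.",
    ]
    print("\n".join(outline(build_tree(doc))))
```

Each node of such a tree is a coarse segment that a reader could open directly. Under the paper's description, segments would then be refined further by lexical cohesion, which this sketch does not attempt.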

Literature review

There are many distinct tasks labelled as text segmentation. For instance, identifying and extracting text from multimedia is called text segmentation (Jung et al., 2004). The task of grouping words into morphemes or bigger linguistic units is sometimes also referred to as text segmentation (Yang and Li, 2005). In this paper, we concentrate on topic-based text segmentation. This type of text segmentation views a well-written text as a sequence of topics and assumes that these topics correspond to segments. The increasing interest in topic-based text segmentation is reflected in the number of its applications. Over the past decade, topic-based text segmentation has been used mainly in passage retrieval (Cormack et al., 1999; Yu et al., 2003), text summarization (Barzilay and Elhadad, 1999; Boguraev and Neff, 2000; Farzindar and Lapalme, 2004; Haghighi and Vanderwende, 2009), subjectivity analysis (Stoyanov and Cardie, 2008), question answering (Oh et al., 2007) and other applications. Generally speaking, topic-based text segmentation methods can be roughly separated into two main families: linear segmentation methods, which split a long text into chunks of consecutive text fragments, and hierarchical segmentation methods, where documents are iteratively split into finer-grained topic segments.
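
The difference between the two families can be sketched with a toy lexical cohesion measure. The code below is a simplified illustration and not any of the cited algorithms: it scores each gap between adjacent sentences by the cosine similarity of their word counts, cuts at weak gaps for a linear segmentation, and recursively cuts at the weakest gap for a hierarchical one. The function names, the threshold of 0.1 and the min_size parameter are assumptions chosen for the example.

```python
import math
from collections import Counter

PUNCT = ".,;:()"


def bag(sentence):
    """Lower-cased word counts with surrounding punctuation stripped."""
    words = (w.strip(PUNCT).lower() for w in sentence.split())
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(sentences):
    """Cohesion score for each gap between adjacent sentences."""
    bags = [bag(s) for s in sentences]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]


def linear_segments(sentences, threshold=0.1):
    """Flat segmentation: start a new segment wherever cohesion is weak."""
    if not sentences:
        return []
    segments, current = [], [sentences[0]]
    for sentence, score in zip(sentences[1:], gap_scores(sentences)):
        if score < threshold:
            segments.append(current)
            current = []
        current.append(sentence)
    segments.append(current)
    return segments


def hierarchical_segments(sentences, min_size=2):
    """Nested segmentation: split at the weakest gap, then recurse.

    Returns a flat list of sentences for a leaf, or a two-element list
    [left, right] of sub-results for an internal split.
    """
    if len(sentences) < 2 * min_size:
        return sentences
    scores = gap_scores(sentences)
    # Only gaps that leave at least min_size sentences on each side.
    candidates = range(min_size - 1, len(sentences) - min_size)
    cut = min(candidates, key=lambda i: scores[i]) + 1
    return [
        hierarchical_segments(sentences[:cut], min_size),
        hierarchical_segments(sentences[cut:], min_size),
    ]


if __name__ == "__main__":
    sample = [
        "Cats chase mice.",
        "Mice fear cats.",
        "Stocks fell sharply.",
        "Investors sold stocks.",
    ]
    print(linear_segments(sample))
    print(hierarchical_segments(sample, min_size=1))
```

The linear variant returns one flat list of segments, while the recursive variant returns nested lists whose depth reflects granularity. A practical system would need a principled stopping criterion rather than the fixed minimum size used here.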
