THC-DAT: a document analysis tool based on topic hierarchy and context information

Published date: 21 March 2016
Pages: 64-86
DOI: https://doi.org/10.1108/LHT-07-2015-0074
Authors: Jing Chen, Tian Tian Wang, Quan Lu
Jing Chen and Tian Tian Wang
School of Information Management, Central China Normal University,
Wuhan, PR China, and
Quan Lu
Center for Studies of Information Resources, Wuhan University,
Wuhan, PR China
Abstract
Purpose: The purpose of this paper is to propose a novel within-document analysis tool (DAT), the topic
hierarchy and context-based document analysis tool (THC-DAT), which enables users to interactively
analyze any multi-topic document based on fine-grained and hierarchical topics automatically extracted
from it. THC-DAT uses the hierarchical latent Dirichlet allocation (hLDA) method and takes context
information into account so that it can reveal the relationships between latent topics and related texts
in a document.
Design/methodology/approach: The methodology is a case study. The authors first reviewed the
related literature, then applied a general build-and-test research model. After explaining the
model, interface and functions of THC-DAT, the authors present a case study in which a scholarly
paper is analyzed with the tool.
Findings: THC-DAT can organize and serve document topics and texts hierarchically and in a
context-based way, which overcomes the drawbacks of traditional DATs. The navigation, browse, search and
comparison functions of THC-DAT enable users to read, search and analyze multi-topic documents
efficiently and effectively.
Practical implications: THC-DAT can improve document organization and services in digital libraries or
e-readers by helping users to interactively read, search and analyze documents efficiently and effectively,
to explore and learn about unfamiliar topics with little cognitive burden, or to deepen their understanding of
a document.
Originality/value: This paper designs a tool, THC-DAT, to analyze documents in a topic hierarchy and
context-based (THC) way. It contributes to overcoming the coarse-analysis drawbacks of existing within-DATs.
Keywords: Digital libraries, E-readers, Document analysis, Context information, hLDA,
Multi-topic documents
Paper type: Technical paper
1. Introduction
With the growing availability of electronic documents, there are more and more
multi-topic documents. Multi-topic documents arise in various application domains
including scientific articles, news stories, patents, judgments and decisions reported in
courts and tribunals (case law documents), and speeches delivered in plenary sessions
(e.g. parliamentary debates) (Andrea and George, 2013). The common characteristic of
these documents is that they may discuss various topics that are related to the article's
main topic. For instance, scientific articles in the field of information science usually
involve principles and techniques from informatics and library science, computer
science, cognitive science and statistics.
Library Hi Tech
Vol. 34 No. 1, 2016
pp. 64-86
© Emerald Group Publishing Limited
ISSN 0737-8831
DOI 10.1108/LHT-07-2015-0074
Received 15 July 2015
Revised 23 September 2015
Accepted 1 October 2015
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0737-8831.htm
The authors gratefully acknowledge the financial support for this work provided by the National
Natural Science Foundation of China (Nos 71303089, 71273195 and 71420107026) and the National
Basic Research Program of China (973 Program, No. 904171200).
Document analysis and topic extraction tools can help researchers to organize,
extract and interpret knowledge efficiently and provide new ideas for their scientific
research domains. However, the predominant document analysis tools (DATs), including
within-DATs, have in past years mainly focused on extracting specific features at the
full-text level. These tools allow users to retrieve from titles, paragraphs or even the full
text, but only from a coarse-grained perspective. Generally speaking, tools such as
Concordance (Watt, 2015), FeatureLens (Don et al., 2007), iSee (Sun et al., 2005),
TextArc (Paley, 2002) and Jigsaw (Stasko et al., 2008) analyze the words in a document
by either counting them or indexing their places of occurrence in the full text. They
also have other flaws: they either ignore context information or treat a document as a
flat, linear structure (Du et al., 2012). Because of these flaws, existing tools provide poor
document analysis capabilities.
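To make the coarse-grained approach concrete, the following is a minimal sketch of the two operations described above: counting words and indexing their places of occurrence over the full text. It is an illustration only and does not reproduce the implementation of any of the tools cited; the function names `tokenize`, `count_words` and `index_positions` and the sample sentence are assumptions introduced here.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_words(text):
    """Count how often each word occurs in the full text."""
    return Counter(tokenize(text))


def index_positions(text):
    """Map each word to the token offsets at which it occurs in the full text."""
    positions = defaultdict(list)
    for offset, word in enumerate(tokenize(text)):
        positions[word].append(offset)
    return dict(positions)


# Hypothetical sample text, used only to exercise the two functions.
sample = "Topic models find topics. A topic hierarchy organizes each topic."
counts = count_words(sample)
positions = index_positions(sample)
```

Both structures treat the document as one flat token sequence: they record that a word occurs and where, but nothing about which topic a passage belongs to or how neighbouring passages relate.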
This paper proposes a new multi-topic DAT, the topic hierarchy and context-based
document analysis tool (THC-DAT), which enables users to
search, browse and analyze within a document more efficiently and effectively. In our
research, a document is regarded as a collection of text segments, and each paragraph is
one segment. THC-DAT visualizes a topic hierarchy tree, in which users can search and
browse by topics of interest to obtain the relevant paragraphs and analyze the
hierarchical structure of the document. Furthermore, using context information, a user can
quickly grasp the distribution of topics in the paper and so find the
relationships between paragraphs and between topics. In general, THC-DAT has
the following characteristics:
(1) Topic oriented: it extracts topics with a hierarchical structure, and each topic term
is an abstraction of the corresponding text. By viewing the topics, users can gain
a preliminary understanding of the corresponding texts and their relationships.
(2) Multi-grained: it organizes the document as a hierarchy tree, in which more
general text corresponds to the root and more specialized text corresponds
to the leaves. It can therefore help users analyze text from the abstract to the
concrete by following the hierarchical structure.
(3) Semantic aggregated: by merging adjacent paragraphs based on context
information, the tool can provide more accurate analysis results to meet users'
requirements.
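The idea of merging adjacent paragraphs can be sketched as follows. This is a hedged illustration, not the authors' method: the excerpt does not say how context information is computed, so the sketch substitutes a plain bag-of-words cosine similarity. The threshold value, the function names and the sample paragraphs are all assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts (an assumed stand-in for topic features)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors; 0.0 if either is empty."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent(paragraphs, threshold=0.3):
    """Append each paragraph to the preceding segment when the two are similar enough.

    The comparison is made against the whole preceding segment, which may
    already contain several merged paragraphs. The threshold is hypothetical.
    """
    segments = []
    for paragraph in paragraphs:
        if segments and cosine(bag_of_words(segments[-1]), bag_of_words(paragraph)) >= threshold:
            segments[-1] = segments[-1] + " " + paragraph
        else:
            segments.append(paragraph)
    return segments


# Hypothetical paragraphs: the first two share vocabulary, the third does not.
paragraphs = [
    "topic models extract latent topics",
    "latent topics form a topic hierarchy",
    "digital libraries serve readers",
]
segments = merge_adjacent(paragraphs)
```

Here the first two paragraphs are merged into one segment and the third stays separate. In a tool such as THC-DAT the similarity would presumably be derived from the extracted topics rather than from raw word overlap.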
The rest of the paper is arranged as follows. Section 2 reviews related research.
Section 3 describes the model of THC-DAT. Section 4 introduces the interface and
functions of THC-DAT. Section 5 presents a case study of THC-DAT with discussion,
and Section 6 offers some concluding remarks.
2. Related research
With the rapid increase of electronic document resources in recent years, a large
amount of work has been devoted to document retrieval and analysis. Tools that help users to
analyze within a document are becoming a research hotspot, and there is some
considerable and valuable work on within-document retrieval and analysis.
TileBars (Hearst, 1995) is an early and influential retrieval tool for searching
within documents. It takes a set of search terms and creates a matrix of tiles, where
every line represents an entire text and each column represents a block of text in the