Comparing “parallel passages” in digital archives

Pages: 271-289
Published date: 02 September 2019
DOI: https://doi.org/10.1108/JD-10-2018-0175
Authors: Martyn Harris, Mark Levene, Dell Zhang, Dan Levene
Subject matter: Library & information science, Records management & preservation, Document management, Classification & cataloguing, Information behaviour & retrieval, Collection building & management, Scholarly communications/publishing, Information & knowledge management, Information management & governance, Information management, Information & communications technology, Internet
Martyn Harris, Mark Levene and Dell Zhang
Department of Computer Science and Information Systems,
Birkbeck, University of London, London, UK, and
Dan Levene
Department of History, Southampton University, Southampton, UK
Abstract
Purpose: The purpose of this paper is to present a language-agnostic approach to facilitate the discovery of
“parallel passages” stored in historic and cultural heritage digital archives.
Design/methodology/approach: The authors explore a novel and relatively simple approach, using a
character-based statistical language model combined with a tailored version of the Basic Local Alignment
Search Tool (BLAST) to extract exact and approximate string patterns shared between groups of documents.
Findings: The approach is applicable to a wide range of languages, and compensates for variability in the
text of the documents that results from differences in dialect, authorship, language change over time, and
inaccurate transcriptions and optical character recognition errors introduced by the digitisation process.
Research limitations/implications: A number of case studies demonstrate that the approach is practical
and generalisable to a wide range of archives with documents in different languages, domains and of
varying quality.
Practical implications: The approach described can be applied to any digital archive of modern and
contemporary texts. This makes the approach applicable to digital archives recording historic texts, but also
to those composed of more recent news articles, for example.
Social implications: The analysis of “parallel passages” enables researchers to quantify the presence and
extent of text-reuse in a collection of documents, which can provide useful data on author style, text genres
and cultural contexts.
Originality/value: The approach is novel and addresses a need by humanities researchers for tools that
can identify similar documents and local similarities represented by shared text sequences in a potentially
vast archive of documents. As far as the authors are aware, no tools currently exist that
provide the same level of tolerance to the language of the documents.
Keywords: Digital libraries, Computer applications, Archives, Linguistics, Probabilistic analysis,
Language and literature
Paper type: Research paper
1. Introduction
The term “parallel passage” refers to identical or approximate text patterns of variable length,
which could be regarded as semantically equivalent. Parallel-passages represent alternative
surface representations that exhibit identical wording, such as those representing reported
speech and direct quotations, or that show some small variation in grammatical structure or
vocabulary choice as a result of paraphrasing. On the one hand, differences in vocabulary
choice may be the result of synonymy, or of hyperonymy, where a general or higher-level concept
has been selected (Madnani and Dorr, 2010). On the other hand, paraphrasing on the part of
the author may provide evidence of text-reuse, or intertextuality (Fairclough, 1992), where the
author has summarised the main concepts, or meaning, encoded by one or more texts that
preceded it.
Further differences between passages may arise due to a shift in authorship, dialect, the
natural evolution of language over time (Buchler et al., 2010), and errors introduced by
optical character recognition (OCR) during digitisation. The task of comparing
equivalent or similar shared text patterns in text corpora stored in digital archives has
become increasingly challenging and time consuming due to the current scale of digital text
data, which makes comparing shared text patterns across multiple documents
practically impossible to do manually. Identifying parallel-passages, such as those
exemplified by paraphrases, also supports a range of natural language tasks, including text
generation, information retrieval and extraction, and summarisation.

Journal of Documentation
Vol. 76 No. 1, 2020
pp. 271-289
© Emerald Publishing Limited
0022-0418
DOI 10.1108/JD-10-2018-0175
Received 29 October 2018
Revised 10 July 2019
Accepted 14 July 2019
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0022-0418.htm

This paper presents an overview of the text mining tools developed to compare parallel-
passages. The tools were deployed in a system known as the Search And Mining Tools for
Language Archives (Samtla), which was developed to support research on historic and
cultural heritage collections of documents stored in digital archives. The paper is organised
as follows. In Section 2, we review the related work. Section 3 describes the corpora used as
test cases to explore the results generated by our proposed approach. In Section 4, we provide a
description of the model used as a basis for extracting and scoring the contents of
documents according to their shared-text patterns. In Section 5, we describe the
approach used for identifying related documents according to our proposed model, where
we measure the similarity of pairs of documents based on their character-level n-gram
probability distributions. Section 6 presents an approach for visualising local similarities
between the content of related documents in the form of variable length parallel-passages
extracted from the document content. We briefly discuss the motivation behind the user
interface in Section 7. In Section 8, we discuss some of the language and corpus dependent issues that the
document comparison tool addresses, to demonstrate the flexibility of the approach across
different domains, languages, authors and time periods. We conclude the paper
in Section 9 with a summary of the work and directions for future research and development.
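To make the idea of comparing documents by their character-level n-gram probability distributions more concrete, the following is a minimal sketch and not the model described in Section 4. It assumes unsmoothed relative frequencies in place of a statistical language model, and it assumes the Jensen-Shannon divergence as the comparison measure; the function names and the choice of n = 3 are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def char_ngram_distribution(text, n=3):
    """Relative frequency of each character n-gram in text (no smoothing)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two n-gram distributions.

    For two proper distributions the value lies between 0.0 (identical)
    and 1.0 (no n-gram in common).
    """
    divergence = 0.0
    for gram in set(p) | set(q):
        p_i, q_i = p.get(gram, 0.0), q.get(gram, 0.0)
        m_i = 0.5 * (p_i + q_i)  # mixture distribution
        if p_i > 0:
            divergence += 0.5 * p_i * log2(p_i / m_i)
        if q_i > 0:
            divergence += 0.5 * q_i * log2(q_i / m_i)
    return divergence


def similarity(doc_a, doc_b, n=3):
    """Illustrative similarity score: 1 minus the divergence (assumed measure)."""
    return 1.0 - jensen_shannon_divergence(
        char_ngram_distribution(doc_a, n),
        char_ngram_distribution(doc_b, n),
    )


if __name__ == "__main__":
    # A spelling variant should score higher than an unrelated string.
    print(similarity("in the beginning", "in the begining"))
    print(similarity("in the beginning", "completely unrelated words"))
```

Because the comparison operates on characters rather than words, a sketch of this kind needs no tokeniser or language-specific resources, and a spelling variant or OCR error changes only the few n-grams that overlap it. Both texts are assumed to contain at least n characters.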
2. Related work
Books, web pages, articles and reports are all examples of unstructured text data where
relevant information exists potentially anywhere within the document. Unstructured text
data is often managed and retrieved via a search engine (Levene, 2010). Search engines
provide the means to retrieve information but not to analyse it. This is where text mining
techniques are useful, as they provide different views of the data to facilitate the discovery
and subsequent analysis of textual patterns (Aggarwal and Zhai, 2012). These patterns
can then be examined more closely through traditional research techniques such as
close-reading of the text, but this is generally only possible for small-scale digital archives.
One text analysis problem that is of great interest to researchers, particularly those
analysing the content of digital archives, is to find parallel-passages, that is, text segments
describing the same concept (an entity, an event, etc.), across large corpora. Parallel-passages are
semantically similar and could exhibit identical wording, but quite often they exhibit some
small variation in structure, or vocabulary choice. The differences are due to the normal
rephrasing of the text within the same context, but may also arise from the use of reported
speech, a change in authorship, dialect, the natural evolution of language over time and
errors introduced by OCR. In this paper, the concept of a parallel passage is defined in the
general sense. Roughly speaking, one can regard parallel-passages as variable length
structural text patterns. However, the term “parallel passage” probably originated from
Christian theology, where the comparison of parallel-passages, or hermeneutics, in the
context of the Bible is a major area of Biblical scholarship (see Strauss and Eliot, 1860).
The aim of hermeneutics is to render a translation of a text by comparing examples of the
same word, phrase, or context across several texts. The researcher's task is to identify
whether the differences present between two or more texts that are regarded as similar are
significant or relevant to the research hypothesis, whether the focus of research is on the
stylistic differences between authors or on providing evidence of the evolution of textual
sequences over time. The technique often involves comparing corresponding passages
located across more than one text, by laying out the texts side-by-side. The Bible often
describes the same event from different perspectives across different canonical books, which
can yield a more complete picture of the event than a single passage, or point of view on